Building A Distributed Search Engine With Inverted Index: Term Partitioning And Decentralized Crawling

2024-06-28 · Leonardo Benicio
Summary

A comprehensive technical exploration of building a distributed search engine with inverted index: term partitioning and decentralized crawling, covering key concepts, practical implementations, and real-world applications.

Contents

Reflections on Search: How Term Partitioning and Decentralized Crawling Tame the Web

Introduction: Why We Need a Distributed Search Engine – And How Term Partitioning Makes It Possible

The world has become unthinkable without search. Every second, billions of queries flow through systems like Google, Bing, or DuckDuckGo, seeking answers from a web that now exceeds 50 billion indexed pages (and countless more outside that index). When you type a phrase and hit Enter, the expectation is not just speed – it’s near-instantaneous relevance, zero downtime, and results that somehow understand your intent even if your keywords are imprecise. Behind that magic lies one of the most elegant data structures in computer science: the inverted index. But even a perfect inverted index, stored on a single machine, would collapse under the weight of today’s internet. The web is too vast, too dynamic, and too unpredictable for any centralized system to handle alone. This is where distributed systems come in – and where the decisions about how to split the index become as critical as the index itself.

Most engineers who have built even a small toy search engine – indexing a few thousand documents – know the basic recipe: crawl pages, tokenize text, build an inverted map from words to document IDs, and then rank results using TF-IDF or BM25. Scaling that to billions of documents, however, introduces a chorus of new problems. Network partitions, hardware failures, load imbalances, and the sheer cost of data transfer force us to rethink every assumption. Should we partition by document (sharding the document collection and having each node build a local index) or by term (splitting the vocabulary across nodes so that each node owns a subset of all terms)? This blog post dives deep into the second approach – term partitioning – and explores how it interacts with decentralized crawling to create a search engine that can scale horizontally while maintaining low latency and high freshness.

But before we get into the nitty-gritty, let’s rewind and ask: why does a centralized solution fail so profoundly? Imagine you have a single server with 1 TB of RAM and a 100 Gbps network link. Even if you could store the entire web’s index (say 10 billion documents, each with an average of 100 unique terms, leading to a trillion postings entries), the memory alone would be far beyond what a single machine can hold. Add the fact that every query requires intersecting postings lists across many terms, sorting millions of candidates, and doing it in under 200 milliseconds – and you see that a single machine is physically impossible. Worse, a single point of failure means the entire search engine goes down if that machine crashes. Distributed systems are not optional; they are the only way forward.

In this post, we will first revisit the fundamentals of the inverted index, then dissect the two major partitioning strategies: document partitioning (sharding) and term partitioning (also called global index partitioning). We will give term partitioning a thorough treatment: its architecture, how it interacts with decentralized crawling, query execution pipelines, fault tolerance, load balancing, and real-world trade-offs. Throughout, we’ll use concrete examples, small code snippets, and even a mini case study of building a toy distributed search engine to illustrate the concepts. By the end, you will understand why term partitioning was a key enabler for early web-scale search engines and how it remains relevant in modern systems that need low latency and fresh results.


1. The Inverted Index: The Heart of Search

Before we talk about distribution, let’s make sure we are all on the same page about the inverted index. In its simplest form, an inverted index is a mapping from terms (words, tokens) to the list of documents that contain those terms, along with optional positional and frequency information.

Example: Suppose we have three documents:

  • Doc1: “the quick brown fox”
  • Doc2: “the lazy dog”
  • Doc3: “the quick brown fox jumps over the lazy dog”

A simple inverted index (without positions) would look like:

"brown" -> (Doc1, 1) (Doc3, 1)
"dog"   -> (Doc2, 1) (Doc3, 1)
"fox"   -> (Doc1, 1) (Doc3, 1)
"jumps" -> (Doc3, 1)
"lazy"  -> (Doc2, 1) (Doc3, 1)
"over"  -> (Doc3, 1)
"quick" -> (Doc1, 1) (Doc3, 1)
"the"   -> (Doc1, 1) (Doc2, 1) (Doc3, 1)

Each entry is a posting list. For large-scale systems, these postings are compressed (using techniques like variable-length encoding, delta encoding, or bitmaps) and stored on disk or in memory. When a user queries for “quick fox”, the search engine looks up the posting lists for “quick” and “fox”, and intersects them (or merges them for OR queries). The result is a set of candidate documents (here, Doc1 and Doc3), which are then ranked.

Ranking functions like TF-IDF (term frequency–inverse document frequency) or BM25 compute a score for each document based on how many times the query terms appear, how rare the terms are in the whole collection, and document length normalization. For the query “quick fox”, Doc1 has one occurrence of “quick” and one of “fox”, while Doc3 also has them. But because “the” is very common, its IDF is low, so Doc3 might be favored if it contains multiple query terms in a higher proportion.

The beauty of the inverted index is that it turns a keyword search problem into a set intersection problem. But when the index is huge, even intersection becomes expensive if the posting lists are long. That’s where partitioning helps.


2. Scaling the Index: The Two Fundamental Strategies

When we want to scale a search engine beyond a single machine, we have to split the data. There are two main approaches:

2.1 Document Partitioning (Sharding)

In document partitioning, we split the document collection into N shards (for example, by hash of document ID). Each shard is stored on a separate node, and each node builds a complete inverted index locally for its subset of documents. When a query arrives, it is broadcast to all nodes. Each node runs the query against its local index, returns its top-K results, and then a central aggregator merges the results from all nodes to produce the final top-K list.

Pros:

  • Simple to implement: each node is essentially a mini search engine.
  • Adding more nodes increases capacity linearly (more documents can be stored).
  • Query latency is roughly the same as a single-node search (since all nodes run in parallel, the slowest node determines the overall latency).

Cons:

  • Query fan-out: every query touches every node, even for rare terms. That wastes resources.
  • Global statistics (like document frequency for IDF) are harder to compute correctly. IDF is normally computed over the entire collection. With document partitioning, each node uses local IDF, which can be very inaccurate for terms that are skewed. This is often mitigated by maintaining a separate global statistics server or by using approximate methods.
  • Merging top-K results from N nodes is straightforward but can be imperfect: a document that is the 11th best in its shard might be better than the 10th best in another shard, leading to missed top-N results. Techniques like distributed top-K with threshold algorithms (e.g., Fagin’s TA) are used but add complexity.

Document partitioning is the most common approach in modern search engines. Elasticsearch, Solr, and early Google (the Google File System and MapReduce phase) use document sharding. It’s robust and relatively easy to reason about.

2.2 Term Partitioning (Global Index Partitioning)

Term partitioning takes a different route: instead of splitting documents, we split the vocabulary. Each node (or set of nodes) is responsible for a subset of term postings. For example, node A stores posting lists for terms starting with ‘a’-‘m’, node B for ‘n’-‘z’. Alternatively, we can use a hash of the term string to assign terms to nodes.

In a term-partitioned system, a query for “quick fox” must be sent to the nodes that own the terms “quick” and “fox”. Those nodes return their respective posting lists, and then the system intersects (or merges) them. This intersection can be done either at a central coordinator or at the query node.

Pros:

  • Only nodes that own query terms are involved, not all nodes. This reduces network traffic and CPU load for rare-term queries.
  • Global statistics (DF) are naturally maintained per term on its owning node, so IDF is always exact without extra servers.
  • It enables more efficient query execution for multi-term queries: the coordinator can fetch posting lists from a few nodes and intersect them locally, instead of broadcasting to many nodes.

Cons:

  • Load imbalance: some terms (like “the”, “a”, “and”) are extremely common and have huge posting lists. The node responsible for these “high-frequency” terms will be heavily loaded, while nodes with rare terms are idle. This is the skew problem.
  • Adding a node is not trivial: you must redistribute term ranges (or hash ranges) and migrate posting lists.
  • Fault tolerance can be trickier because if the node responsible for a common term goes down, many queries are affected.

Term partitioning was used in early distributed search engines like Harvest and later in Google’s initial architecture (according to some accounts from the late 1990s). However, the skew problem often forced systems to adopt hybrid approaches or to use document partitioning instead.

In the rest of this post, we will focus on term partitioning, but keep in mind that real-world systems often blend both strategies.


3. Deep Dive into Term Partitioning – Architecture and Implementation

Now we will explore the architecture of a term-partitioned search engine in detail. We’ll start with the conceptual components, then move to concrete data structures and algorithms.

3.1 Overall Architecture

A term-partitioned search engine consists of a set of index nodes (also called term servers). Each index node is responsible for a subset of terms. The mapping from term to node is determined by a partition function, which could be:

  • A simple hash: node_id="i415f3c2" mod N
  • A range-based assignment: terms are grouped by first letter or by some key order.
  • A consistent hash ring, so that adding/removing nodes causes minimal remapping.

Additionally, there are crawler nodes that fetch web pages and forward their content (after tokenization) to the appropriate index nodes. There is also a query coordinator (or broker) that receives user queries, determines which index nodes are needed, sends sub-requests to them, and merges the results.

[ Crawlers ] -> tokenized docs -> [ Index Nodes (term servers) ]
                                      ^
[ Query Coordinator ] -> sends term requests -> [ Index Nodes ]
   user query  <------- returns posting lists/partial results

3.2 Index Building with Term Partitioning

When a crawler fetches a page, it must tokenize the page and then send each term’s posting information to the index node responsible for that term. For efficiency, the crawler can batch multiple terms from the same document into a single message. For example, suppose the crawler’s document ID is doc_12345, and the page contains the terms “quick”, “fox”, “brown”. The crawler hashes each term to determine the node, then sends the posting (doc ID, frequency, positions) to each node separately.

Pseudocode for crawler side (simplified in Python-like language):

def process_document(doc_id, text):
    tokens = tokenize(text)  # returns list of (term, frequency, positions)
    # Group by term server
    per_server = {}
    for term, freq, positions in tokens:
        node = hash(term) % NUM_NODES
        per_server.setdefault(node, []).append((term, doc_id, freq, positions))
    for node, postings in per_server.items():
        send_to_node(node, postings)

On the index node side, the node receives these posting entries and builds its local inverted index. Since each node only sees a subset of terms, the local index is essentially a complete inverted index for the terms it owns. The node can store postings in memory or on disk, using compressed representations.

Index node storage:

class IndexNode:
    def __init__(self):
        self.inverted_index = {}  # term -> list of (doc_id, freq, positions)
    def ingest(self, term, doc_id, freq, positions):
        self.inverted_index.setdefault(term, []).append((doc_id, freq, positions))
    def get_posting_list(self, term):
        return self.inverted_index.get(term, [])

3.3 Query Execution

When a user types a multi-word query, the query coordinator must:

  1. Parse the query into terms.
  2. For each term, determine which index node(s) own it (hash or range lookup).
  3. Send a request to each of those nodes asking for the posting list (or for a compressed version).
  4. Once all posting lists are received, the coordinator intersects (AND) or merges (OR) them to get candidate documents.
  5. For each candidate, compute a relevance score (using TF-IDF or BM25) based on the posting list data.
  6. Return the top-K documents to the user.

Steps 4 and 5 can be optimized: Instead of sending the full posting lists, the coordinator can send the top-K scoring documents from each list (if the terms are treated independently), but for AND queries, intersection requires the full lists (or at least the candidate set). For large posting lists (like for “the”), transferring the full list is prohibitive. Therefore, term-partitioned systems often avoid intersecting queries that include stop words; they treat stop words as “filters” that are applied later after the intersection of main terms.

A more efficient approach: use a two-phase algorithm. Phase 1: for each query term, the index node returns the document frequency (DF) and the top few thousand doc IDs (or a sample). The coordinator then identifies which documents appear in all (or most) term’s sample, and sends back a “candidate list” to the nodes for scoring. This is reminiscent of the TA (Threshold Algorithm) and FA (Fagin’s Algorithm).

In a pure term-partitioned system, the coordinator can also choose to forward the entire query to a selected node (e.g., the node owning the rarest term) and have that node perform the intersection locally by requesting posting lists from other nodes. This reduces data transfer: each node only sends its posting list to the rarest-term node, and that node does the heavy work. This is called query forwarding and is similar to how distributed hash tables (like Chord) route queries.

Pseudocode for query coordinator (using query forwarding to the rarest term node):

def handle_query(query_text):
    terms = parse(query_text)
    nodes = [hash(term) % NUM_NODES for term in terms]
    # Assume we have a way to get document frequency per term from the owning node
    rare_term, rare_node = min(terms, key=lambda t: get_doc_freq(t, nodes[terms.index(t)]))
    # Forward the query to rare_node, which will then fetch other terms' lists
    result = rare_node.process_and_query(terms, nodes)
    return result

At the index node (rare_node), it receives the list of terms and their owning nodes. It then requests posting lists from the other nodes, intersects, scores, and returns the top-K results to the coordinator.

3.4 Load Balancing and Skew

The biggest challenge of term partitioning is skew: the “the” node will have a posting list with billions of entries, while the “xylophone” node may have only a few. This node becomes a bottleneck for every query that contains a common term. How do we mitigate this?

  • Term splitting: Terms with extremely high document frequency (head terms) can be split across multiple machines, essentially becoming a miniature document-partitioned index within the term. For “the”, we might assign several nodes to handle it, each responsible for a subset of documents (e.g., hash of doc ID). This creates a hybrid approach: rare terms are single-node, common terms are spread across multiple nodes.
  • Replication: Instead of splitting, replicate the posting lists of common terms on multiple nodes. Queries can then send the “the” request to the least-loaded node among a set. This adds storage overhead but improves throughput.
  • Consistent hashing with virtual nodes: Use consistent hashing with many virtual nodes so that the load distribution is more uniform. However, this doesn’t fix the inherent skew of term popularity; it only makes the mapping granular.
  • Dynamic load shedding: The node serving “the” might only store a partial posting list (e.g., only the top 1 million document IDs with highest PageRank). For a query like “the quick brown fox”, where “the” is a stop word, the system can ignore “the” entirely because it adds little value. Most practical systems treat stop words specially: they are either removed during indexing or are not used for intersection but only for scoring after candidate generation.

Example: In the Google search architecture described in the early 2000s, they used document partitioning for the main index, but for the index of terms that appear in many documents, they used a special “barrel” structure. I recall reading about the “caffeine” indexing system; but in general, Google’s early approach was document partitioning (MapReduce on crawled documents). However, term partitioning was used in the INVERT system and in WAND-based query processing. Most modern industrial search engines (Elasticsearch, Solr) use document partitioning because of the ease of scaling and skew handling.

Nevertheless, term partitioning has its niche: in federated search where each search node specializes in a domain, or in peer-to-peer search engines like Yacy (which uses a DHT-based term partitioning), where each peer stores posts for a subset of terms.


4. Decentralized Crawling Integration

A term-partitioned search engine changes the way crawling works. In a document-partitioned system, a crawler can store documents locally on a node and then that node builds its index. In term partitioning, the crawler must push each term’s posting information to the correct node. This leads to a decentralized crawl and indexing process.

4.1 Crawler Architecture

The web is crawled by multiple crawler nodes, each fetching pages from a queue of URLs. The simplest approach is to have a central URL frontier that dispatches URLs to crawlers. But to avoid single points of failure, we can also make the frontier distributed (using a DHT).

When a crawler downloads a page, it:

  1. Extracts links and adds them to the URL frontier (for future crawling).
  2. Tokenizes the text of the page.
  3. For each term, determines the index node responsible.
  4. Sends the posting (doc ID, term, frequency, positions) to that index node.

This step is analogous to an incremental indexing pipeline. Since the web is constantly changing, crawlers must revisit pages to update their content. The index nodes must support updates: adding new postings, removing old ones, or updating positions. This requires the index to be mutable or use log-structured merge-trees (LSM trees) for efficient updates.

4.2 Handling Updates and Deletions

Term partitioning complicates updates because the same document’s terms are spread across many index nodes. When a page is recrawled, the crawler must send both the new postings and, optionally, a notification to delete the old ones. The index nodes can use version numbers (or timestamps) per document to handle partial updates or tombstone entries.

One common pattern: each index node stores a mapping from (term, doc_id) to a version. When a new posting arrives with a higher version, it overwrites the old. When a page is removed, the crawler sends a deletion message for each term of that page to the respective node.

4.3 Freshness and Crawl Prioritization

Because term partitioning allows each term node to be updated independently, the freshness of results for a query depends on the freshness of the index nodes for the query terms. If a new page contains a rare term, and the corresponding index node has not yet received the posting, the page will not appear in results. For common terms, updates may be delayed due to the high volume.

To improve freshness, crawlers can prioritize recrawling of pages that are frequently queried (user feedback loop) or pages that change often (based on HTTP headers). In a term-partitioned system, it is also beneficial to poll the index nodes to see which terms are “hot” and need more frequent crawling.

4.4 Example: Building a Mini Distributed Search Engine with Term Partitioning

Let’s walk through a simplified implementation to solidify the concepts. We’ll use Python and sockets (or gRPC) for communication. Our system will have:

  • A single coordinator that accepts queries.
  • Two index nodes: node0 for terms hashed to even numbers, node1 for odd numbers.
  • One crawler that simulates fetching documents.

We’ll ignore tons of details (compression, ranking, caching) but focus on the core partitioning logic.

Index Node code (pseudocode):

class IndexNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.index = defaultdict(list)  # term -> list of (doc_id, freq)

    def add_posting(self, term, doc_id, freq):
        self.index[term].append((doc_id, freq))

    def query(self, term):
        return self.index.get(term, [])

    def handle_message(self, msg):
        if msg['type'] == 'posting':
            self.add_posting(msg['term'], msg['doc_id'], msg['freq'])
        elif msg['type'] == 'query':
            return self.query(msg['term'])

Coordinator:

def get_node_for_term(term):
    return hash(term) % NUM_NODES

def process_query(query_text):
    terms = query_text.split()
    # Collect posting lists from nodes
    nodes_terms = defaultdict(list)
    for term in terms:
        node = get_node_for_term(term)
        nodes_terms[node].append(term)
    results_per_node = {}
    for node, term_list in nodes_terms.items():
        # Send request to node for each term
        for term in term_list:
            resp = send_to_node(node, {'type': 'query', 'term': term})
            results_per_node[(node, term)] = resp
    # Intersect: find documents that appear in all posting lists
    # Simplify: just union for demo
    all_docs = set()
    for (node, term), postings in results_per_node.items():
        docs = {doc_id for doc_id, freq in postings}
        all_docs.update(docs)
    # Score using term frequency sum (naive)
    doc_scores = {}
    for (node, term), postings in results_per_node.items():
        for doc_id, freq in postings:
            doc_scores[doc_id] = doc_scores.get(doc_id, 0) + freq
    # Return top 10
    sorted_docs = sorted(doc_scores.items(), key=lambda x: -x[1])[:10]
    return [doc_id for doc_id, _ in sorted_docs]

Crawler:

def crawl_and_index(doc_id, text):
    terms = tokenize(text)  # e.g., using simple split
    freq_map = Counter(terms)
    for term, freq in freq_map.items():
        node = get_node_for_term(term)
        send_to_node(node, {'type': 'posting', 'term': term, 'doc_id': doc_id, 'freq': freq})

This is the essence of a term-partitioned system. The main point is that the index is built by pushing data to the correct node based on term hash. Query processing pulls data from the relevant nodes.


5. Query Execution in Detail – Algorithms and Optimization

Now let’s look at query execution more deeply, focusing on the algorithms used to intersect posting lists in a distributed environment.

5.1 Naïve Intersection Over the Network

The simplest way: fetch the entire posting list for each query term from its owning node, then intersect client-side. For a query like “famous beagle”, where both terms are rare, the posting lists might be small (e.g., a few thousand entries each), so transferring them is cheap. But for “the quick”, “the” may have billions of entries. Transferring that is impossible. Therefore, term-partitioned systems must avoid fetching huge lists. This is why common terms are often excluded from intersection (i.e., treated as “stop words”) or handled via early pruning.

5.2 Document-Sorted Posting Lists and Skipping

Even if we fetch a large list, we need efficient intersection. Posting lists are usually sorted by document ID. Intersection of two sorted lists is O(|A|+|B|). If one list is huge (A) and the other is small (B), we can iterate over B and do a binary search in A (O(|B| log |A|)) or use “skip pointers” (bump pointers) to jump ahead in A. For a term-partitioned system, we can also do the intersection on the node that owns the smaller posting list: it requests the larger list from its owning node in a streamed fashion (chunks) and intersects incrementally. This reduces memory usage.

5.3 Top-K Retrieval with TA and FA

Instead of fetching full posting lists, we can use threshold algorithms (like Fagin’s Algorithm) that fetch only enough to guarantee the top-K. In TA, each node returns the next highest-scoring document (based on a local scoring function) along with its score. The coordinator merges these streams, keeps a threshold of the minimum score among top-K seen so far, and stops when the next best possible score from any node cannot exceed the threshold. However, in a term-partitioned system, scoring is global: the final score (e.g., BM25) depends on term frequencies and document lengths, which are stored in the posting entries. So the coordinator needs to consolidate per-document scores from multiple terms. This is doable if the coordinator holds the document length table (or can request it from a metadata server).

Distributed TA for term-partitioned systems:

  • Each index node returns, for each query term, a stream of (doc_id, score_component) sorted by decreasing score_component.
  • The coordinator merges these streams by doc_id, summing the score components. It maintains a top-K heap.
  • When the sum of the maximum possible score from unseen entries (based on the current values in the streams) falls below the K-th score in the heap, it stops.

This algorithm requires sorted access by local score. Since posting lists are sorted by doc_id, not by score, you’d need a separate index (e.g., a score-sorted index) or you could retrieve posting lists and then sort them – which might be worse.

In practice, many systems (like Elasticsearch) use a combination: first find candidate docs using intersection (or union) of doc-id-sorted lists (possibly using WAND or BM25 WAND), then compute full scores for those candidates.

5.4 Using Prefix/Suffix Caching

Since term partitioning creates natural groupings (e.g., all terms starting with ‘a’ on node 0), query coordinators can cache the mapping of term ranges to nodes. Moreover, they can cache the entire posting list of frequent queries (e.g., “the”, “of”) on the coordinator itself (since they tend to be big but rarely change? Actually, “the” is stable but still huge). Instead, they can cache the inverse: the top-K results for popular queries at the coordinator level (as Google does) – this is independent of partitioning.


6. Fault Tolerance and Consistency in a Term-Partitioned Search Engine

Distributed systems must handle failures gracefully. In a term-partitioned system, the failure of an index node that owns a rare term is minor – only queries containing that term will be affected, and they can potentially still return results for other terms (though incomplete). However, the failure of a node that owns a common term (like “the”) can break a huge fraction of queries.

6.1 Replication for High Availability

To mitigate this, index nodes can be replicated. For each shard of terms (or each term range), we have a primary and one or more replicas. When a crawler sends a posting, it can send it to both primary and replicas (synchronous or asynchronous replication). Queries can be sent to any replica, typically the least loaded. This also helps with load balancing for common terms.

Replication introduces consistency challenges: if a replica lags behind, some queries may see stale results. This is usually acceptable in search (eventual consistency). For stronger guarantees, we can use quorum-based writes (write to majority) and read from primary.

6.2 Node Addition and Removal

Adding a new node to a term-partitioned system is tricky because you need to reassign terms. If you use consistent hashing (like in a DHT), terms are mapped to keys on a ring; when a node joins, it takes over responsibility for a range of keys from its neighbors. The indexing nodes must then transfer the corresponding posting lists. This can be done in the background while the system continues to serve queries (maybe with slightly stale data during migration).

In a term-partitioned system, adding a node to relieve the load on a hot term is even more complex: you need to split the posting list of “the” across multiple nodes. This is essentially splitting a shard that was previously a single node. That requires either pre-splitting from the start (using virtual nodes) or dynamic splitting with routing table updates.

6.3 Handling Hot Terms Dynamically

One way to handle hot terms is to monitor the load per term and, for terms that exceed a threshold, automatically replicate or split them. For example, maintain a load monitor that tracks the number of queries hitting each term and the size of its posting list. When a term becomes hot, the system can add a new node and copy the posting list for that term to the new node, then update the routing table so that queries for that term go to either node (round-robin or consistent hash with replication factor >1). This is essentially autoscaling at the term level.


7. Real-World Examples and Hybrid Approaches

Let’s look at how real systems have used term partitioning, partially or fully.

7.1 Google’s Early Architecture (circa 1998–2003)

According to the original Google paper and later descriptions, the index was stored on multiple “shards” based on term ranges: e.g., words starting with ‘a’ on one machine, ‘b’ on another. This is pure term partitioning! However, they also had document-level replication for fault tolerance. The infamous “Lexicon” component held global statistics and term mapping. As the web grew, they moved to document partitioning because of the skew problem: the “the” machine was overwhelmed. By the time of the Caffeine indexing system (2010), Google had largely moved to document-level sharding, but with term-level information replicated across shards as needed.

7.2 YaCy – A P2P Search Engine

YaCy is a peer-to-peer search engine where every user runs a node. The index is shared across peers using a DHT. The DHT maps terms to peers (term partitioning). Each peer stores postings for the terms it is responsible for. Queries are routed to the peers that own the query terms. This is perhaps the purest example of term partitioning in a widely used system. YaCy handles skew by using a “word split” mechanism: for very common words, multiple peers hold the posting list (replication). The DHT also helps with load distribution.

7.3 Elasticsearch and Document Partitioning

Elasticsearch is the most popular modern search engine. It uses document partitioning (by default, a hash of the document ID). Each shard is an independent Lucene index. Queries are broadcast to all shards, and results are merged. Elasticsearch supports global statistics via a “DFS” mode that computes accurate IDF across all shards (trading latency for accuracy). It does not use term partitioning. However, some users have experimented with custom routing based on term hash to achieve term-level aggregation, but this is not standard.

7.4 Federated Search

In federated search (also called distributed information retrieval), multiple independent search engines (each covering a different domain) are queried together. This is inherently term-partitioned at the collection level: each engine holds a subset of terms (those that appear in its documents). A broker sends a query to all engines, gets results, and merges. This is document partitioning across engines, but term partitioning within each engine? Not exactly – it’s a hierarchy.

7.5 Hybrid Systems

The best of both worlds: use document partitioning to scale horizontally for storage, and within each shard, use term partitioning to accelerate internal query processing? Actually, many systems use local term indexing on each shard, which is just the normal inverted index. So it’s not really hybrid. A true hybrid might be: document shards for browsing (e.g., if you want to search within a 10% sample), combined with a separate term-partitioned index for full vocabulary queries. This is overly complex.

Another hybrid: use a two-tier index. The first tier is a small, term-partitioned index that contains only the most popular terms (head terms) to allow fast intersection for common queries. The second tier is a document-partitioned index containing the full collection. Queries that include common terms are served by the first tier; others fall back to the second. This idea appears in some research papers (e.g., “Earlybird” from Twitter used a real-time index and a separate offline index, but not exactly term-partitioned).


8. Performance Analysis: Document vs. Term Partitioning

We need to quantify the trade-offs. Let’s define some parameters:

  • D = total number of documents
  • T = total number of terms (vocabulary size)
  • Q = average number of terms per query
  • A = average posting list size for a non-stop-word term
  • N = number of nodes

Document Partitioning:

  • Each node stores D/N documents.
  • Storage per node: O( (D/N) * avg_terms_per_doc ) for the index.
  • Query cost: each node scans its local index for all Q terms. The total network traffic: O(N _ result_size) because each node returns its top-K results (K small). However, the coordinator must merge N _ K results.
  • Latency: dominated by the slowest node (load balance matters). Since each node does similar work, it’s O( average local index intersection time ).
  • IDF accuracy: approximate unless you maintain global term frequency table.

Term Partitioning:

  • Each node stores a subset of terms, roughly T/N terms.
  • Storage per node: sum over its terms of posting list lengths. If term frequencies follow a power law (Zipf), the most frequent term’s posting list is of size O(D). So a node that owns a very frequent term stores O(D) postings, while others store much less. This is the skew.
  • Query cost: only Q nodes are involved (if each term maps to a distinct node). Network traffic: each of those Q nodes sends its posting list (size depends on term popularity). For a query with a common term, that node sends a huge list.
  • Latency: dominated by the node with the largest posting list for a query term. If the query includes a very common term, that node becomes the bottleneck.
  • IDF accuracy: exact per term since each node owns the global posting list.

Which is better?

  • For rare-term queries (e.g., “xylophone”), term partitioning is vastly superior: only one node processes the query, and the posting list is tiny.
  • For common-term queries (e.g., “the quick brown”), document partitioning is better because work is spread across all nodes, and each node only searches its own small subset of documents.
  • In practice, most queries contain a mix of common and rare terms, but the rare terms (like “quick” and “brown” in “the quick brown”) drive the query; “the” can be ignored. So term partitioning might still be acceptable if we treat common terms as stop words.
  • However, the skew in term distribution means that a few term nodes (e.g., for “the”, “a”, “and”, “to”) will always be overloaded, even if we ignore those terms in queries (they still need to be stored and updated). So term partitioning requires careful handling of head terms.

Network cost comparison:

  • Document partitioning: each query sends N messages (one per node), each message contains the query terms. The responses are top-K results (small). So network cost ~ O(N * Q + N*K). For N=100 nodes, that’s 100 messages.
  • Term partitioning: each query sends Q messages (to the Q term nodes), and each response contains a posting list (which could be huge). So network cost ~ O(Q * average_posting_list_size). For Q=3, if one term has a list of 1M documents, that’s 1M doc IDs transferred. That’s huge compared to 100 top-K results. This is why term partitioning often requires compression, skip lists, and incremental processing.

Given these trade-offs, document partitioning is more common in modern systems. But term partitioning has a place in specialized scenarios.


9. Building a Toy Distributed Search Engine – A Walkthrough

Let’s solidify concepts with a more detailed toy implementation using Python and ZeroMQ (or just sockets). We’ll build:

  • 3 index nodes (node0, node1, node2)
  • A coordinator that accepts HTTP requests and returns search results.
  • A simple crawler that ingests a list of sample documents.

We’ll use consistent hashing to map terms to nodes, with a replication factor of 2 (each term stored on two nodes for fault tolerance). We’ll implement a basic AND query using streaming intersection.

Step 1: Consistent hashing

We simulate a ring with 3 physical nodes and 9 virtual nodes (3 per physical). Terms are hashed to a virtual node, and we maintain a mapping.

Step 2: Index node logic

Each index node stores postings as a sorted list of (doc_id, freq) for each term. It can perform intersection of two lists (sorted by doc_id) locally.

Step 3: Crawler pushes postings to all replicas.

For each term, it determines the set of two nodes (primary and next virtual node) and sends the posting.

Step 4: Coordinator for queries

Given query “quick fox”, it determines the set of nodes for each term (all replicas). It picks one replica per term (e.g., the primary) and sends a request: “I need posting list for term ‘quick’ and ‘fox’”. Alternatively, it can forward the query to one of the nodes (e.g., the node with smallest term list). That node then fetches the other term’s list from its own node (if it also owns the other term) or from another node.

We’ll implement the simpler approach: coordinator fetches both lists and intersects locally.

Code snippets (abbreviated but illustrative):

import zmq
import hashlib

# Mapping from term to list of node IDs (2 replicas)
def get_nodes_for_term(term):
    hash_val = int(hashlib.md5(term.encode()).hexdigest(), 16)
    # Assume a ring of 9 virtual nodes: physical nodes: 0,1,2 each have 3 virtual nodes
    vnode = hash_val % 9
    # Replica: next vnode
    vnode2 = (vnode + 1) % 9
    return [vnode // 3, vnode2 // 3]

Then for each node, we set up a subscriber socket (or REQ/REP). The coordinator sends a request to node0 (if it contains both terms?) Actually, for multi-term, we need to send separate requests.

In practice, for a production system, you would use gRPC with streaming responses. But the key idea is there.

Step 5: Sample query

Let’s test with 1000 sample documents. The system will load the documents, build the term-partitioned index, and then serve queries. We can compare latency for different query types.

This toy demonstrates the core architecture. A full implementation would include:

  • LSM-tree storage for persistent index.
  • Compression (e.g., Variable Byte Encoding for doc IDs).
  • A crawler that expands links (a simple BFS).
  • A user interface.

But for the purpose of understanding, the code above suffices.


10. Trade-offs, Pitfalls, and When to Use Term Partitioning

Let’s summarize when term partitioning is a good choice and when it’s not.

When term partitioning shines:

  • The vocabulary is small and static (e.g., a controlled vocabulary for a domain-specific search engine, like legal terms or medical ontologies).
  • Queries tend to consist of rare terms (e.g., scientific search, patent search).
  • You need exact IDF and global statistics without extra servers.
  • The system is built around a DHT (like in P2P search) where each peer is unreliable but collectively they cover the term space.
  • Low write throughput (crawling is not too frequent) because updates to common terms can be expensive.

When term partitioning fails:

  • The web-scale with highly skewed term distribution (e.g., “the”).
  • High write throughput (updating billions of documents every day).
  • Need for low latency on all queries, including those with common terms.
  • Simplicity of operations (adding/removing nodes is more complex).

Mitigation strategies (as discussed):

  • Replicate or split head terms.
  • Use stop-word handling.
  • Hybrid approach: document sharding for most of the index, term-partitioned index only for long-tail terms.
  • Use a caching layer for hot queries.

In many cases, document partitioning is the default, but understanding term partitioning gives you a deeper appreciation for the design space of distributed search engines.


11. Future Directions

As search engines evolve, new paradigms like learned indexes (using neural nets to map terms to posting blocks) and approximate nearest neighbor search for semantic retrieval might change the way we think about partitioning. However, the fundamental tension between partitioning strategies remains. With the rise of serverless search (e.g., AWS CloudSearch, Algolia), the exact partitioning is hidden from developers, but internally the providers use a mix of techniques.

Another interesting direction is adaptive partitioning: dynamically switching between document and term partitioning based on query load and term popularity. Research in this area (e.g., “Delta” from some SIGIR papers) suggests that a runtime repartitioning can improve performance.

Finally, term partitioning combined with graph databases (like for knowledge graph search) may find a new application: partitioning entities by property types (similar to terms). In a graph, you might partition by predicate (property) and have each node store all triples with that predicate. That is essentially term partitioning over predicates.


Conclusion

We started with the simple question: How can we scale a search engine beyond one machine? The inverted index, a beautiful and simple data structure, must be distributed. The two fundamental approaches—document partitioning and term partitioning—each have deep trade-offs. Document partitioning is robust and widely used, but it needs global statistics and broadcasts every query to all nodes. Term partitioning, on the other hand, directly supports exact global IDF and only involves nodes that own the query terms, but it struggles with skew and network transfer of large posting lists.

We have explored term partitioning in detail: its architecture, how it integrates with decentralized crawling, query execution algorithms, fault tolerance, and practical considerations. We saw that early Google used term partitioning, and modern P2P search engines like YaCy still rely on it. We also discussed hybrid approaches and when to choose one over the other.

The web is vast, and no single partitioning strategy is perfect. The real lesson is that in distributed systems, every choice is a trade-off. Understanding term partitioning gives you another tool in your toolbox—a solution for specific constraints where document partitioning falls short. Whether you are building a small search engine for a private corpus or studying large-scale distributed systems, the concepts of term partitioning will deepen your understanding of how to tame the web.

Next time you hit Enter on a search query, think about the millions of machines working together, whispering posting lists across the network, and intersecting them to bring you the answer in a fraction of a second. That is the magic of distributed search, and term partitioning is one of its secret ingredients.


If you enjoyed this deep dive, consider implementing your own toy distributed search engine using term partitioning. Start with a small collection (like Wikipedia articles), use consistent hashing and replication, and measure the impact of skew. The learning is in the doing.