Implementing A Scalable Web Crawler With Distributed Frontier And Politeness Policies

2025-05-28 · Leonardo Benicio
Summary

A comprehensive technical exploration of implementing a scalable web crawler with distributed frontier and politeness policies, covering key concepts, practical implementations, and real-world applications.

Contents

Here is an expanded version of your blog post, structured to reach the requested depth and length, with additional sections, code examples, and technical detail.


The Digital Cartographer: Building a Production-Grade Web Crawler from Scratch

The internet is not a library. It is not a well-organized, indexed repository of knowledge where every volume sits quietly on its designated shelf, spine-out, waiting to be checked out. It is a sprawling, chaotic, and hyperkinetic digital ecosystem: a city built overnight that never stops being demolished and rebuilt, block by block, second by second. Websites are born, die, change their content, and change their URLs. Links appear, rot, and are repurposed. Every minute, hundreds of hours of video are uploaded, thousands of articles are published, and millions of social media posts are generated. To make sense of this chaos, to power the search engines we rely on, the AI models we train, and the data analytics that drive modern business, we need cartographers of the digital frontier. We need web crawlers.

A web crawler, at its simplest definition, is a bot—a piece of software that systematically browses the World Wide Web, typically for the purpose of indexing. Starting from a seed set of URLs, it downloads a page, extracts the links contained within that page, adds those links to a queue (the “frontier”), and continues the process ad infinitum. In theory, the concept is trivial—a for loop over a list of strings. In practice, building a crawler that operates at the scale of the entire web, or even a significant portion of it, is one of the most challenging engineering problems in distributed systems.

Why does this matter to you? If you are a software engineer, a data scientist, or an architect, you have likely felt the gravitational pull of this problem. Perhaps you need to monitor prices across dozens of e-commerce sites for a competitive analysis tool. Perhaps you are building a custom search engine for an internal knowledge base that spans thousands of subdomains. Perhaps you need to scrape public datasets for research, or feed a Large Language Model (LLM) with fresh, high-quality training data. In every one of these scenarios, the naive approach—a simple loop with urllib or requests—will fail spectacularly the moment you hit a rate limit, a redirect loop, a dynamically rendered JavaScript page, or a site that has 10 million pages.

This post is your guide through that complexity. We will not just define a web crawler. We will dissect its anatomy, confront its engineering challenges, and walk through the design decisions required to build a system that is polite, efficient, resilient, and scalable. We will move from a single-threaded Python script to the architectural blueprint of a distributed crawl farm. By the end, you will understand not just how to crawl, but why modern web crawling is a discipline that sits at the intersection of network engineering, systems design, and data ethics.


Part I: The Scale of the Problem—Why This Isn’t a Simple Loop

Before we write a single line of code, we must internalize the sheer magnitude of the data we are trying to consume. According to estimates from various sources (including Internet Live Stats and Google’s own published metrics), the indexable web—that is, pages not hidden behind login walls or paywalls—contains over 50 billion pages. This excludes the “deep web”: databases, dynamically generated pages, and content behind search forms, which is estimated to be 400 to 500 times larger. Even crawling a small fraction of this—say, 1 billion pages—presents staggering resource requirements.

Bandwidth: A single HTML page averages between 50 KB and 500 KB. If we assume an average of 200 KB per page, downloading 1 billion pages represents 200 TB of raw data transfer. If we crawled at a steady state of 5 Gbps (a very fast dedicated connection, roughly $5000/month for a bare metal server), transferring 200 TB would take approximately 90 hours of continuous, non-stop download. This is just the download. We haven’t parsed, stored, or indexed a single byte.

Time: Most real-world crawlers must be polite. They cannot hammer a single server with thousands of requests per second. A typical politeness delay between requests to the same domain is 1-5 seconds. If a crawler must wait 3 seconds between requests to example.com, and example.com has 1 million pages, the crawler will be stuck on that single domain for over 34 days. A crawler that respects robots.txt and uses adaptive delays will be even slower.

Storage: 200 TB of raw HTML is just the beginning. You will need to store extracted text, metadata, link graphs, and possibly rendered page screenshots. A complete crawl archive, compressed, can easily double or triple the raw size. Storing this data reliably requires distributed file systems (like HDFS or cloud object stores like S3), redundant storage, and backup strategies.

Dynamic Content: Over 60% of the web today uses JavaScript to render content. Crawling a dynamic page means not just downloading the HTML shell, but executing the JavaScript (often requiring a headless browser like Chromium via Puppeteer or Playwright), waiting for network requests to complete, and waiting for the DOM to stabilize. A single page fetch via a headless browser can take 3-10 seconds, compared to 0.3 seconds for a static HTML download. This multiplies both bandwidth and time costs by an order of magnitude.

This is why a naive for url in url_list: requests.get(url) approach is not a web crawler. It is a toy. A production system must treat bandwidth as a precious commodity, time as a critical constraint, and politeness as a non-negotiable requirement.


Part II: The Architecture of a Production-Grade Crawler

A web crawler, when viewed as a software system, is a series of interconnected modules, each with a specific responsibility. These modules must communicate with each other, often across distributed machines, while maintaining consistency and reliability. Let’s break down the core components.

2.1 The URL Frontier (The Scheduler)

The URL Frontier is the heart of the crawler. It is not a simple FIFO queue. It is a sophisticated scheduling data structure that decides which URL to crawl next. Its primary responsibilities are:

  • Politeness: Ensure that we do not request pages from the same host (e.g., www.example.com) too frequently. This means we maintain a per-host queue with a minimum delay.
  • Prioritization: Some URLs are more important than others. Newly discovered pages on a high-authority site (e.g., a breaking news article on reuters.com) should be crawled before older, low-importance pages on a personal blog.
  • De-duplication: A URL must never be enqueued twice. This requires a fast, distributed, persistent “seen” set.

A classic implementation of a polite URL Frontier is the Mercator scheme (from the seminal paper by Heydon and Najork at Compaq/DEC). In this scheme, the Frontier maintains a set of “frontier queues,” one for each host. A single “selector” thread pops a host from a global priority queue, then pops a URL from that host’s queue, sends it to the fetcher, and waits for a configurable delay (e.g., 1 second) before that host can be selected again.

Here is a simplified Python sketch using a priority queue and per-host rate limiting:

import time
import threading
from collections import defaultdict
from queue import PriorityQueue, Queue

class PoliteFrontier:
    def __init__(self, default_delay=1.0):
        self.default_delay = default_delay
        # host -> (last_fetch_time, Queue of URLs)
        self.host_queues = defaultdict(lambda: (0.0, Queue()))
        self.ready_hosts = PriorityQueue()  # priority is last_fetch_time
        self.lock = threading.Lock()
        self._running = True

    def add_url(self, url, priority=0):
        from urllib.parse import urlparse
        host = urlparse(url).netloc
        with self.lock:
            _, q = self.host_queues[host]
            q.put((priority, url))

    def get_next_url(self):
        while self._running:
            # Block until a host is ready
            with self.lock:
                if self.ready_hosts.empty():
                    # Re-evaluate all hosts
                    now = time.time()
                    for host, (last_fetch, q) in self.host_queues.items():
                        if not q.empty():
                            wait_time = max(0, self.default_delay - (now - last_fetch))
                            self.ready_hosts.put((now + wait_time, host))
            try:
                ready_time, host = self.ready_hosts.get(timeout=1)
            except:
                continue
            if ready_time > time.time():
                time.sleep(ready_time - time.time())
            # Fetch from host queue
            with self.lock:
                last_fetch, q = self.host_queues[host]
                if not q.empty():
                    priority, url = q.get(block=False)
                    self.host_queues[host] = (time.time(), q)
                    return url
        return None

Of course, in a distributed system, this Frontier lives in memory across many machines, and the host queues are synchronized via a service like Redis, Apache Kafka, or a distributed key-value store. The key insight is that the Frontier is the traffic cop of the crawler, and it must enforce politeness globally.

2.2 The Fetcher (The Downloader)

The Fetcher is responsible for making the actual HTTP request. While this sounds like a simple call to requests.get(), a production fetcher must handle a litany of edge cases:

  • Robots Exclusion Protocol: Before fetching any URL, the crawler must check the domain’s robots.txt file (which is itself fetched and cached). The Fetcher must obey directives like Disallow: /private/ and Crawl-Delay: 10.
  • Redirections: HTTP redirects (301, 302, 307) must be followed, but a loop must be detected. A maximum redirect depth (e.g., 10 hops) is standard.
  • HTTP Headers: The Fetcher sends a User-Agent string that identifies the crawler (e.g., MyCrawler/1.0). It must also handle cookies, session management, and conditional requests (If-Modified-Since, ETag) to avoid re-downloading unchanged content.
  • Connection Pooling: Opening a new TCP connection to every host for every request is prohibitively slow and resource-intensive. The Fetcher uses connection pools (via urllib3 or custom HTTP clients) to reuse connections.
  • Timeouts and Retries: Network failures are inevitable. A fetcher must have configurable timeouts (connect, read, total) and a retry policy (e.g., exponential backoff with jitter, up to 3 retries). A 4xx or 5xx status code is not necessarily a failure; the crawler must decide whether to record the error, retry, or skip.

Here is a practical example using the aiohttp library for asynchronous HTTP, which is essential for high throughput:

import aiohttp
import asyncio
from urllib.parse import urlparse

class AsyncFetcher:
    def __init__(self, user_agent="MyCrawler/1.0", max_connections=100):
        self.user_agent = user_agent
        self.connector = aiohttp.TCPConnector(limit=max_connections)
        self.robots_cache = {}

    async def fetch_robots(self, host):
        if host in self.robots_cache:
            return self.robots_cache[host]
        url = f"https://{host}/robots.txt"
        try:
            async with aiohttp.ClientSession(connector=self.connector) as session:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                    text = await resp.text()
                    self.robots_cache[host] = text
                    return text
        except:
            self.robots_cache[host] = None
            return None

    def is_allowed(self, robots_text, path, user_agent):
        if not robots_text:
            return True  # No robots.txt = allowed (with caution)
        # Simple parser; a real one would use 'reppy' or 'robotparser'
        lines = robots_text.splitlines()
        current_agent = None
        disallowed_paths = []
        for line in lines:
            if line.startswith("User-agent:"):
                current_agent = line.split(":")[1].strip()
            if current_agent == "*" or current_agent == user_agent:
                if line.startswith("Disallow:"):
                    disallow_path = line.split(":")[1].strip() if ":" in line else "/"
                    disallowed_paths.append(disallow_path)
        for disallowed in disallowed_paths:
            if path.startswith(disallowed):
                return False
        return True

    async def fetch(self, url):
        parsed = urlparse(url)
        host = parsed.netloc
        path = parsed.path if parsed.path else "/"
        robots = await self.fetch_robots(host)
        if not self.is_allowed(robots, path, self.user_agent):
            return None, "ROBOTS_DISALLOW"
        try:
            headers = {"User-Agent": self.user_agent}
            async with aiohttp.ClientSession(connector=self.connector) as session:
                async with session.get(url, headers=headers, timeout=aiohttp.ClientTimeout(total=30), allow_redirects=True) as resp:
                    content = await resp.text()
                    status = resp.status
                    # Handle redirects manually if needed
                    final_url = str(resp.url)
                    return content, status
        except asyncio.TimeoutError:
            # Retry logic would go here
            return None, "TIMEOUT"
        except aiohttp.ClientError as e:
            return None, str(e)

2.3 The Parser and Link Extractor

Once the content is fetched, the Parser extracts useful data. This module is deceptively complex. The primary operations are:

  • Text Extraction: Strip the page of HTML tags. This requires an HTML parser (e.g., lxml, BeautifulSoup, html.parser). A good parser handles malformed HTML gracefully.
  • Link Extraction: Find all <a href="..."> tags and extract the href attribute. Links can be relative (e.g., /about) or absolute (https://...). They must be resolved to absolute URLs.
  • Metadata Extraction: Extract <meta name="description" content="...">, Open Graph tags (og:title, og:image), and structured data (JSON-LD, Microdata, RDFa). This is critical for search engines and AI training.
  • Canonicalization: The same content on a site can appear at multiple URLs (e.g., example.com, example.com/index.html, www.example.com/). The parser should look for <link rel="canonical" href="..."> and record the canonical URL.

A critical engineering concern is link normalization. Before a URL is added to the Frontier, it must be cleaned up:

  1. Convert scheme to lowercase (HTTPhttp).
  2. Remove fragments (#section).
  3. Remove default ports (http://example.com:80/http://example.com/).
  4. Sort query parameters alphabetically.
  5. Remove trailing slashes where appropriate.

Failure to normalize leads to duplicate crawling. An enormous amount of web traffic is wasted on duplicated URLs.

Here is a Python snippet showing link extraction and normalization:

from urllib.parse import urljoin, urlparse, urlunparse
from bs4 import BeautifulSoup

def normalize_url(url):
    parsed = urlparse(url)
    # Lowercase scheme and netloc
    scheme = parsed.scheme.lower()
    netloc = parsed.netloc.lower()
    # Remove fragments
    path = parsed.path.rstrip('/') if len(parsed.path) > 1 else parsed.path
    # Sort query parameters
    query = parsed.query
    if query:
        params = sorted(query.split("&"))
        query = "&".join(params)
    return urlunparse((scheme, netloc, path, parsed.params, query, ""))

def extract_links(base_url, html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    links = set()
    for anchor in soup.find_all('a', href=True):
        raw_href = anchor['href']
        # Skip mailto, javascript, etc.
        if raw_href.startswith(('mailto:', 'javascript:', 'tel:')):
            continue
        absolute_url = urljoin(base_url, raw_href)
        # Only include http/https
        if absolute_url.startswith(('http://', 'https://')):
            normalized = normalize_url(absolute_url)
            links.add(normalized)
    return links

2.4 The Deduplication Engine

The deduplication engine is the module that ensures the same content is not processed twice. There are two layers of deduplication:

  1. URL Deduplication: Before adding a URL to the Frontier, check if it has been seen before. This requires a fast, persistent, set-like data structure. The naive approach is a hash set in memory, but this cannot scale. For a distributed system, we use a Bloom Filter or a distributed hash table (e.g., Redis Set, Cassandra, or DynamoDB).

    A Bloom Filter is a probabilistic data structure. It can tell us, with a small false-positive rate, whether a URL is definitely not seen, or probably seen. A well-tuned Bloom Filter uses very little memory. For example, storing 1 billion URLs in a Bloom Filter with a 1% false-positive rate requires about 1.7 GB of memory (compared to 40+ GB for a hash set). This is a classic space-time trade-off.

    import hashlib
    import mmh3  # MurmurHash3
    
    class BloomFilter:
        def __init__(self, capacity, error_rate=0.01):
            self.capacity = capacity
            self.error_rate = error_rate
            # Calculate optimal size and number of hash functions
            self.bit_size = int(-capacity * (error_rate ** (1.0 / 2)) / (2.0 ** 2))  # Simplified
            self.bit_array = [False] * self.bit_size
            self.num_hashes = int(self.bit_size / capacity * 0.693)  # ~ln2
    
        def _hashes(self, item):
            # Use double hashing to generate multiple independent hashes
            h1 = mmh3.hash128(item, seed=0)
            h2 = mmh3.hash128(item, seed=1)
            for i in range(self.num_hashes):
                yield (h1 + i * h2) % self.bit_size
    
        def add(self, item):
            for bit in self._hashes(item):
                self.bit_array[bit] = True
    
        def check(self, item):
            for bit in self._hashes(item):
                if not self.bit_array[bit]:
                    return False
            return True
    

    In a distributed system, the Bloom Filter is partitioned across machines, or we use a service like RedisBloom.

  2. Content Deduplication (Near-Duplicate Detection): Two URLs might be different but contain identical or nearly identical content (e.g., example.com/article?print=1 and example.com/article?mobile=1). To detect this, we compute a fingerprint of the page content. The most popular technique is SimHash (developed by Moses Charikar at Google). SimHash generates a 64-bit fingerprint such that two similar documents have a small Hamming distance between their fingerprints. We can then store all fingerprints and, for each new page, check if any existing fingerprint is within a threshold (e.g., 3 bits).

    Implementing SimHash is beyond this post, but the key idea is that the deduplication engine ensures we don’t waste storage and processing on the same content viewed through different URL lenses.

2.5 The Data Store (The Archive)

The final module stores the results. The data model for a crawler must include at least:

  • URL: The canonical URL.
  • Crawl timestamp: When it was last fetched.
  • HTTP status code: 200, 404, etc.
  • Content hash: An MD5 or SHA256 of the raw HTML (for deduplication and change detection).
  • Raw HTML: The full source.
  • Extracted text: A cleaned, text-only version.
  • Metadata: Title, description, keywords, Open Graph data.
  • Outlinks: The list of URLs found on the page (important for graph analysis).

This data is typically stored in a combination of systems:

  • Raw HTML: In a distributed object store (S3, HDFS, Azure Blob). Each blob is named by the content hash or a row key.
  • Metadata and Outlinks: In a wide-column NoSQL database (Cassandra, HBase) or a relational database (PostgreSQL with TimescaleDB) for fast querying by URL or timestamp.
  • Full-text index: In an inverted index like Elasticsearch or Apache Solr, for search functionality.

A simplified schema for a document store (using a key-value model) might look like this:

Key: url_hash (MD5 of canonical URL)
Value (JSON):
{
    "url": "http://example.com/page",
    "crawl_time": 1734567890,
    "status": 200,
    "content_hash": "a1b2c3d4...",
    "raw_html_s3_key": "bucket/crawls/2025/01/18/a1b2c3d4.html.gz",
    "extracted_text": "This is the page content...",
    "title": "Example Page",
    "outlinks": ["http://example.com/other", "http://example.com/contact"],
    "metadata": {
        "og:image": "http://example.com/image.png"
    }
}

Part III: Politeness, Ethics, and the Legal Landscape

Building a crawler is not just an engineering challenge; it is a social and legal one. The web is a shared resource, and a misbehaving crawler can disrupt services, incur costs, or even trigger legal action.

3.1 The Robots Exclusion Protocol (robots.txt)

We touched on this earlier, but it deserves deeper treatment. robots.txt is a voluntary standard. There is no law requiring a crawler to obey it. However, ignoring it is considered hostile and can lead to your IP being blocked. The file sits at the root of a domain. A crawler should:

  1. Cache the file: Avoid fetching robots.txt for every request. Cache it with a TTL (Time-To-Live), typically 24 hours.
  2. Parse correctly: The format is line-based. User-agent: * matches all crawlers. Disallow: /secret blocks the entire /secret path. Allow: /secret/public can override a disallow for a subpath. Crawl-delay: 5 is a request to wait 5 seconds between requests.
  3. Handle sitemaps: The file may also contain Sitemap: https://example.com/sitemap.xml directives. A smart crawler will fetch sitemaps, which provide a direct list of URLs to crawl—a huge efficiency gain.

3.2 Rate Limiting and Backoff

Even if robots.txt says Crawl-delay: 5, you must be prepared for server-side rate limiting. A server may return HTTP 429 (Too Many Requests) or 503 (Service Unavailable) when you hit an unspoken limit. A good crawler implements exponential backoff with jitter:

import time
import random

def backoff_delay(attempt, base_delay=1.0, max_delay=60.0):
    sleep_time = min(max_delay, base_delay * (2 ** attempt))
    jitter = random.uniform(0, sleep_time * 0.1)  # 10% jitter
    return sleep_time + jitter

3.3 Legal Considerations

Several landmark cases have shaped the legality of web crawling:

  • eBay v. Bidder’s Edge (2000): A court ruled that eBay could block a crawler under trespass to chattels, even though the crawler never caused server overload. This established that even “peaceful” crawling without permission could be illegal if the site explicitly forbids it.
  • hiQ Labs v. LinkedIn (2019): A court ruled that scraping publicly accessible data (not behind a login) does not violate the Computer Fraud and Abuse Act (CFAA). LinkedIn was ordered to allow hiQ to scrape public profiles. This was a major win for crawlers, but the case is still evolving.
  • GDPR and Privacy: Crawling personal data of EU citizens without a legitimate interest may violate GDPR. If you store names, email addresses, or any PII, you need a legal basis.

The fundamental rule is: Do not be a nuisance. If a website asks you to stop via a 403, a cease-and-desist letter, or a robots.txt directive, you must stop, even if you believe you have a legal right to continue. The cost of a lawsuit far outweighs the data you might gather.


Part IV: Crawling the Modern Web—JavaScript, Single-Page Apps, and Infinite Scroll

The web of the 1990s was static HTML. The modern web is a JavaScript application. Crawling a site like Airbnb, Twitter, or a React-based docs site is fundamentally different.

4.1 The Challenge

When you request a URL from a modern web app, the server often returns an empty HTML shell with a single <div id="ibcaf63f"></div> and a bundle of JavaScript. The real content is loaded via AJAX/Fetch API calls, rendered client-side, and injected into the DOM after the JavaScript executes. A traditional fetcher that only downloads HTML will get nothing.

4.2 The Solution: Headless Browsers

To extract content from a dynamic page, you need a headless browser: a full browser engine (Chromium, Firefox) that runs without a visible window. Popular libraries include:

  • Puppeteer (Node.js, with Python bindings via pyppeteer or puppeteer-python)
  • Playwright (Python, Node.js, Java, .NET)
  • Selenium (Python, Java, etc.)

A headless browser fetches the page, executes JavaScript, waits for network requests to settle, and then allows you to query the rendered DOM.

from playwright.async_api import async_playwright
import asyncio

async def fetch_dynamic(url, timeout=30000):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        # Set a realistic user-agent
        await page.set_extra_http_headers({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        })
        await page.goto(url, wait_until="networkidle", timeout=timeout)
        # Wait for a specific selector to ensure content is loaded
        try:
            await page.wait_for_selector("main", timeout=5000)
        except:
            pass  # Content might be loaded differently
        html = await page.content()
        await browser.close()
        return html

The Cost: A headless browser fetcher is 10-100x slower and uses 100-500 MB of RAM per tab. A production crawler cannot run a headless browser for every URL. It must have a detection mechanism to decide if a page is dynamic. This can be done by:

  • Looking for <noscript> tags that indicate alternative content.
  • Checking the HTML size; very small HTML (e.g., < 10 KB) is often a JS shell.
  • Making an initial “light” fetch and analyzing the response headers (e.g., X-Powered-By: React or a Content-Type: text/html; charset=utf-8 paired with a small body).

4.3 Infinite Scroll

Infinite scroll pages (e.g., Twitter timeline, Pinterest) load more content as the user scrolls. To crawl these, the crawler must simulate scrolling and wait for new elements to appear. Using Playwright:

async def scroll_to_bottom(page, max_scrolls=10):
    prev_height = 0
    for i in range(max_scrolls):
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        await page.wait_for_timeout(2000)  # Wait for new content
        new_height = await page.evaluate("document.body.scrollHeight")
        if new_height == prev_height:
            break
        prev_height = new_height

A crawler must decide on a scroll limit (e.g., 50 items) to avoid infinite loops.


Part V: Distribution and Scaling—The Crawl Farm

A single machine can only crawl so much. To reach billions of pages, you need a crawl farm: a distributed cluster of machines working together. This introduces coordination problems.

5.1 Partitioning the URL Space

How do you ensure that two different machines don’t crawl the same URL? The simplest approach is vertical partitioning by host. Each machine in the cluster is assigned a range of hosts (e.g., Machine 1 handles a-m.example.com, Machine 2 handles n-z.example.com). The Frontier is sharded by host hash. This ensures politeness is maintained per-host (since only one machine touches a host) and eliminates the need for a central URL queue.

5.2 Coordination via a Message Broker

For dynamic load balancing, we can use a message broker like Apache Kafka or RabbitMQ. The Frontier publishes URL discovery events to a topic. Worker machines consume from the topic. Each worker is assigned a specific partition based on the host hash. Kafka guarantees that all messages for the same host end up in the same partition, thus consumed by the same worker.

A simplified pipeline:

  1. Seed URLs are inserted into a “crawl requests” topic.
  2. Fetcher workers consume from the topic, fetch the page, and publish the raw HTML + metadata to a “crawled data” topic.
  3. Parser workers consume the “crawled data” topic, extract links and text, and publish new URLs back to the “crawl requests” topic (after deduplication).
  4. Indexer workers consume the parsed data and store it in the database.

This architecture is horizontally scalable: you can add more workers for any stage by increasing the partition count.

5.3 State Management and Fault Tolerance

In a distributed system, machines crash. The crawler must be resilient.

  • Checkpointing: The Frontier should periodically persist its state (the queue of URLs per host) to a distributed store (e.g., ZooKeeper, etcd, or a database). If a worker crashes, another worker can restore the state.
  • Idempotency: Fetcher workers should be idempotent. If a crash causes a page to be fetched twice, the deduplication engine will eventually ignore the duplicate. This is acceptable.
  • Dead Letter Queue: Malformed URLs, pages that consistently time out, or pages that return 500s should be sent to a dead letter queue for manual review, rather than retried infinitely.

Part VI: A Concrete Use Case—Building a Price Monitoring Crawler

Let’s bring this all together with a concrete, smaller-scale example. Assume you want to monitor prices of a specific product (e.g., the “Anker 737 Power Bank”) across three major e-commerce sites in real-time.

Step 1: Define the Seed URLs

You know the product page URLs for three sites:

  • https://www.amazon.com/dp/B0BZ7T9R5B
  • https://www.bestbuy.com/site/anker-737-power-bank/654321
  • https://www.walmart.com/ip/Anker-737/987654

Step 2: Design a Targeted Crawler

You don’t need a general-purpose crawler for billions of pages. You need a focused one. The architecture is the same, but simpler:

  • Frontier: A PostgreSQL table with a url column and a last_crawled timestamp. Poll this table periodically.
  • Politeness: Add a 2-second delay between requests to the same domain. Since you have only three domains, this is trivial.
  • Fetcher: Use a headless browser (Playwright) because Amazon and Walmart heavily rely on JavaScript. Set a realistic User-Agent and use residential proxies (from a provider like Bright Data) to avoid IP blocks. E-commerce sites are aggressive at blocking unknown IPs, especially if you crawl every 5 minutes.
  • Parser: Instead of extracting all links, write a specific parser that finds the product price. For Amazon, the price is often inside an element with the ID corePrice_desktop or a <span class="c86966241">. You will use CSS selectors or XPath.
async def parse_amazon_price(page):
    try:
        price_element = await page.query_selector("#corePrice_desktop .a-price .a-offscreen")
        if price_element:
            price_text = await price_element.get_attribute("textContent")
            return price_text.strip()
    except:
        return None
  • Storage: Store the result in a time-series database (e.g., InfluxDB or PostgreSQL with TimescaleDB) as (timestamp, site, price).
  • Alerting: If the price drops below a threshold, send an alert via email or webhook.

This crawler is polite, dynamic-content-aware, and small enough to run on a single cheap VPS. It demonstrates that you don’t always need a distributed crawl farm; you need the right architecture for the right problem.


Conclusion: The Cartographer’s Toolkit

Building a web crawler is a rite of passage for many engineers. It forces you to confront real-world constraints: network unreliability, probabilistic data structures, distributed state management, ethical boundaries, and the messy reality of the modern web. The naive for loop is a trap. The real system is a carefully orchestrated set of modules—Frontier, Fetcher, Parser, Deduplicator, Store—each with its own engineering challenges.

We have covered the fundamental components and their implementation details:

  • The URL Frontier must be polite, prioritizing important pages while respecting per-host delays.
  • The Fetcher must handle robots.txt, redirects, connection pools, retries, and dynamic JavaScript.
  • The Parser must normalize URLs, extract structured metadata, and prepare text for storage.
  • The Deduplication Engine must use probabilistic data structures like Bloom Filters and SimHash to avoid wasting bandwidth.
  • The Distributed Architecture must partition the workload by host, coordinate via message brokers, and handle crashes gracefully.

As you venture out to build your own crawler—whether for a price monitoring tool, an internal search engine, or an AI training dataset—remember that you are not just writing code. You are building a digital cartographer. You are mapping the chaotic, ever-changing frontier of the web. Do it with respect for the resources you consume, with an eye on the legal landscape, and with the engineering rigor required to keep the system running at scale.

The internet is not a library. It is a city under constant construction. Grab your compass. Start mapping.