Rust vs. Python for Web Scraping: Why We Rewrote Everything
Spider started as a Python project. Like every other web scraping tool at the time, it was built on top of Scrapy, BeautifulSoup, and requests. That stack worked. It worked for small jobs, it worked for demos, and it worked long enough to teach us exactly where it would stop working.
This is the story of why we threw all of it away and rewrote the entire crawl engine in Rust. Not because Rust was trendy. Because we ran out of options.
The Python Wall
In early development, our scraper ran fine on a single machine. You could point it at a site, let it churn through a few hundred pages, and get clean output in minutes. The problems started when we tried to run it at the scale our users actually needed.
The GIL Problem
Python’s Global Interpreter Lock means that only one thread executes Python bytecode at a time. For I/O-bound scraping, you can work around this with asyncio or multiprocessing. But “work around” is the key phrase. Every workaround introduced its own complexity. Multiprocessing meant serializing data between processes, managing shared state through pipes or queues, and dealing with memory duplication. asyncio helped with network-bound waits, but the moment you needed to parse HTML or transform data, you were back to single-threaded execution.
At 10,000 concurrent connections, our Python scraper consumed roughly 500MB of memory just for the runtime overhead: interpreter state, object headers, reference counting metadata, and the garbage collector’s bookkeeping. The actual page data was on top of that.
Parsing Was the Bottleneck
BeautifulSoup is excellent for correctness. It handles malformed HTML gracefully and provides a clean API. But it is not fast. Parsing a moderately complex page took 5 to 15 milliseconds in Python. Multiply that by thousands of pages per second and parsing becomes the dominant cost, not network latency. We profiled our crawler extensively and found that 60% of CPU time was spent in HTML parsing and tree traversal, not in waiting for HTTP responses.
Throughput Ceiling
Our best Python configuration, running on a 16-core machine with a combination of multiprocessing and asyncio, topped out at roughly 50 pages per second of sustained throughput. We could spike higher on simple static pages, but the moment pages had any complexity (nested tables, large DOMs, lots of links to extract), throughput dropped. Garbage collection pauses added unpredictable latency spikes. The p99 response time hovered around 2 to 4 seconds, with occasional spikes above 10 seconds during major GC sweeps.
We tried every optimization in the book: PyPy, Cython extensions for hot paths, connection pooling with aiohttp, pre-forked worker pools. Each one bought us a percentage improvement. None of them broke through the fundamental ceiling.
Why Rust
The decision to rewrite was not impulsive. We evaluated Go, C++, and Rust over several weeks, building small proof-of-concept crawlers in each.
Go was compelling for its simplicity and goroutine model, but we kept hitting performance walls in HTML parsing. The Go HTML parser ecosystem at the time was less mature, and the garbage collector, while better than Python’s, still introduced latency variance under heavy allocation pressure.
C++ would have given us the raw performance we needed, but the development velocity tradeoff was brutal. Memory safety bugs in a long-running network service that handles untrusted HTML input from arbitrary websites felt like a guaranteed source of CVEs.
Rust gave us three things simultaneously: performance comparable to C++, memory safety without a garbage collector, and a genuinely excellent async runtime in tokio. The borrow checker eliminated entire categories of bugs at compile time. The type system made illegal states unrepresentable. And the zero-cost abstraction principle meant that writing high-level, readable code did not come with a runtime penalty.
We wrote a prototype crawler in Rust over two weeks. It immediately outperformed our production Python system on a single core.
The Benchmarks
After the full rewrite, we ran side-by-side benchmarks on identical hardware (c6g.xlarge, 4 vCPU, 8GB RAM) crawling the same set of 50,000 pages across 500 domains.
Memory Usage
| Metric | Python (Scrapy + aiohttp) | Rust (spider crate) |
|---|---|---|
| Baseline (idle) | 120MB | 8MB |
| 1K concurrent connections | 280MB | 22MB |
| 10K concurrent connections | 510MB | 78MB |
| Peak during 50K page crawl | 1.2GB | 140MB |
The difference comes down to object representation. A Python string carries a reference count, a type pointer, a hash cache, and length metadata before you get to the actual bytes. In Rust, a String is three machine words: pointer, length, capacity, with the bytes themselves in a single heap allocation. No per-object header, no reference counting, no garbage collector metadata.
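You can check the Rust side of that claim directly. A minimal example (64-bit target assumed):

```rust
fn main() {
    // On a 64-bit target an owned String is 3 machine words = 24 bytes
    // (pointer, length, capacity); the text itself lives in one heap allocation.
    println!("size_of::<String>() = {}", std::mem::size_of::<String>());
    // A borrowed &str slice is just pointer + length: 16 bytes.
    println!("size_of::<&str>()   = {}", std::mem::size_of::<&str>());
}
```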
Throughput
| Metric | Python | Rust |
|---|---|---|
| Static pages (simple HTML) | 85 pages/sec | 1,200 pages/sec |
| JS-rendered (headless Chrome) | 12 pages/sec | 45 pages/sec |
| Mixed workload (smart mode) | 50 pages/sec | 520 pages/sec |
| Sustained over 1 hour | 42 pages/sec | 500+ pages/sec |
The “sustained over 1 hour” row matters most. Python throughput degraded over time as memory fragmentation accumulated and GC pressure increased. Rust throughput was flat. The number at minute 60 was the same as minute 1.
Latency
| Percentile | Python | Rust |
|---|---|---|
| p50 | 180ms | 4ms |
| p95 | 1,200ms | 9ms |
| p99 | 3,800ms | 12ms |
| p99.9 | 11,000ms | 28ms |
The p99 improvement from 3.8 seconds to 12 milliseconds was the single metric that convinced us the rewrite was worth every hour we invested. Tail latency kills user experience in an API product. When your p99 is measured in seconds, some percentage of your users are always waiting. When it is measured in milliseconds, your API feels instant.
Deployment
| Metric | Python | Rust |
|---|---|---|
| Docker image size | 1.2GB | 45MB |
| Cold start time | 8 seconds | 120ms |
| Deployment dependencies | pip, virtualenv, system libs | Single static binary |
| Required runtime | Python 3.11 + system packages | None |
The Rust binary is a single statically-linked executable. No interpreter, no virtualenv, no dependency resolution at deploy time. The Docker image is a FROM scratch container with the binary and TLS certificates. That is it.
Architecture of the Spider Crate
The spider crate is the open-source core that powers spider.cloud. Here is how the key pieces fit together.
Concurrent Connection Pool
Every crawl job gets a connection pool backed by hyper and tokio. Connections are reused across requests to the same host, and the pool automatically manages keep-alive, TLS session resumption, and connection limits per domain. Backpressure is handled through tokio’s cooperative scheduling: if a downstream host is slow, the crawler naturally slows its request rate to that host without blocking work on other domains.
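The crate wires this up internally through its pool; as a rough sketch of the per-host limiting idea (illustrative only, not spider's actual internals, and assuming tokio with its full feature set), a semaphore per domain is enough to express the policy:

```rust
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::{Mutex, OwnedSemaphorePermit, Semaphore};

// Illustrative sketch: cap in-flight requests per host so one slow domain
// cannot monopolize the crawler. Not the spider crate's real implementation.
struct HostLimits {
    per_host: Mutex<HashMap<String, Arc<Semaphore>>>,
    limit: usize,
}

impl HostLimits {
    fn new(limit: usize) -> Self {
        Self { per_host: Mutex::new(HashMap::new()), limit }
    }

    // Acquire a permit for this host, creating its semaphore on first use.
    async fn acquire(&self, host: &str) -> OwnedSemaphorePermit {
        let sem = {
            let mut map = self.per_host.lock().await;
            map.entry(host.to_string())
                .or_insert_with(|| Arc::new(Semaphore::new(self.limit)))
                .clone()
        };
        sem.acquire_owned().await.expect("semaphore closed")
    }
}

#[tokio::main]
async fn main() {
    let limits = Arc::new(HostLimits::new(8)); // at most 8 in flight per host
    let mut handles = Vec::new();
    for i in 0..32 {
        let limits = limits.clone();
        handles.push(tokio::spawn(async move {
            let _permit = limits.acquire("example.com").await;
            // Stand-in for the actual HTTP fetch.
            tokio::time::sleep(std::time::Duration::from_millis(50)).await;
            println!("finished request {i}");
        }));
    }
    for h in handles {
        let _ = h.await;
    }
}
```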
Smart Mode Routing
The default crawl mode inspects each URL and makes a routing decision:
- Fetch the page with a lightweight HTTP request first.
- Check if the response contains signals that JavaScript rendering is needed (empty body, SPA framework markers, meta refresh tags).
- If JS rendering is required, re-fetch through headless Chrome.
- If the HTTP response is sufficient, skip Chrome entirely.
This means static sites are crawled at raw HTTP speed while dynamic sites still get full browser rendering. The cost savings are significant: headless Chrome is roughly 10x more expensive per page than a direct HTTP fetch.
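The exact signals are tuned inside the engine; conceptually, though, the routing check looks something like the sketch below (illustrative only; the thresholds and marker strings here are assumptions, not spider's real heuristics):

```rust
// Simplified illustration of smart-mode routing (not the actual spider code):
// decide from the plain HTTP response whether a headless browser pass is
// worth the extra cost.
fn needs_js_rendering(status: u16, body: &str) -> bool {
    // Hard signal: an empty or near-empty document.
    if body.trim().is_empty() || body.len() < 512 {
        return true;
    }
    // SPA mount points and framework markers (assumed examples).
    let spa_markers = ["id=\"root\"", "id=\"app\"", "data-reactroot", "ng-version"];
    let looks_like_spa = spa_markers.iter().any(|m| body.contains(*m));
    // A meta refresh often indicates a JS-driven shell page.
    let meta_refresh = body.contains("http-equiv=\"refresh\"");
    status == 200 && (looks_like_spa || meta_refresh)
}

fn main() {
    let shell = r#"<html><body><div id="root"></div><script src="/app.js"></script></body></html>"#;
    println!("render with Chrome? {}", needs_js_rendering(200, shell));
}
```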
Streaming Responses
Pages are processed as they arrive, not buffered into memory and batch-processed. The HTML parser operates on a streaming byte buffer, extracting links and content incrementally. This keeps memory usage proportional to the number of in-flight requests, not the total volume of data processed.
Using the Spider Crate
The crate is designed to be straightforward. A basic crawl looks like this:
```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com")
        .with_limit(500)
        .with_respect_robots_txt(true)
        .build()
        .unwrap();

    website.crawl().await;

    for page in website.get_pages().unwrap().iter() {
        println!("URL: {} | Status: {}", page.get_url(), page.status_code);
    }
}
```
For streaming processing where you want to handle pages as they arrive:
```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");
    let mut rx = website.subscribe(0).unwrap();

    let join_handle = tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            let url = page.get_url();
            let content = page.get_html();
            // Process each page as it arrives.
            // This runs concurrently with the crawl itself.
            println!("Received {} ({} bytes)", url, content.len());
        }
    });

    website.crawl().await;
    // Close the subscription so the receiver loop ends and the task can finish.
    website.unsubscribe();
    let _ = join_handle.await;
}
```
And for the hosted API at spider.cloud, a single HTTP request handles everything:
```bash
curl -X POST https://api.spider.cloud/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "limit": 100,
    "return_format": "markdown"
  }'
```
The return_format: "markdown" parameter runs every page through our content extraction pipeline, stripping navigation, ads, footers, and boilerplate. The output is clean markdown ready for LLM consumption, vector embedding, or RAG pipelines.
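If you are calling the hosted API from Rust rather than curl, a minimal client looks roughly like this (a sketch assuming the reqwest crate with its json feature plus tokio; the response body is printed raw rather than assuming a particular schema):

```rust
use reqwest::Client;
use serde_json::json;

// Minimal sketch of the same request as the curl example above.
// Replace YOUR_API_KEY with a real key.
#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let body = json!({
        "url": "https://example.com",
        "limit": 100,
        "return_format": "markdown"
    });

    let resp = Client::new()
        .post("https://api.spider.cloud/crawl")
        .bearer_auth("YOUR_API_KEY")
        .json(&body)
        .send()
        .await?;

    println!("status: {}", resp.status());
    println!("{}", resp.text().await?);
    Ok(())
}
```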
Key Rust Advantages for Scraping
After two years of running this system in production, these are the Rust features that matter most for web scraping specifically.
Zero-Cost Abstractions
We write high-level iterator chains, combinators, and generic code. The compiler optimizes it down to the same machine code you would write by hand. There is no “abstraction tax” for writing clean, maintainable code.
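As a small illustration (not code from the crate), a link-filtering pass written as an iterator chain compiles down to the same tight loop you would write by hand:

```rust
// High-level iterator chain for filtering extracted links. The optimizer
// turns this into a plain loop; the only allocation is the final Vec.
fn same_host_links<'a>(links: &[&'a str], host: &str) -> Vec<&'a str> {
    links
        .iter()
        .copied()
        .filter(|l| l.starts_with("http://") || l.starts_with("https://"))
        .filter(|l| l.contains(host))
        .collect()
}

fn main() {
    let links = ["https://example.com/a", "mailto:x@y.z", "https://other.org/b"];
    println!("{:?}", same_host_links(&links, "example.com"));
}
```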
Async with Tokio
Tokio’s work-stealing scheduler distributes tasks across all available cores automatically. A crawl job with 10,000 concurrent connections uses 10,000 lightweight tasks, not 10,000 OS threads. Context switching happens in userspace, not in the kernel. The overhead per task is roughly 300 bytes, compared to the kilobytes of minimum stack space (and megabytes of default stack) that an OS thread reserves.
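A quick sketch of what that means in practice (illustrative, assuming tokio's multi-threaded runtime): spawning ten thousand tasks is routine in a way that spawning ten thousand OS threads is not.

```rust
use tokio::task::JoinSet;

// Illustrative sketch: 10,000 concurrent tasks on tokio's work-stealing
// scheduler. Each task is a small heap allocation, not an OS thread.
#[tokio::main]
async fn main() {
    let mut set = JoinSet::new();
    for i in 0..10_000u32 {
        set.spawn(async move {
            // Stand-in for an HTTP fetch; the await point lets the scheduler
            // interleave thousands of these on a handful of worker threads.
            tokio::time::sleep(std::time::Duration::from_millis(10)).await;
            i
        });
    }
    let mut completed = 0;
    while let Some(result) = set.join_next().await {
        if result.is_ok() {
            completed += 1;
        }
    }
    println!("completed {completed} tasks");
}
```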
No GC Pauses
This is the single biggest advantage for a latency-sensitive service. There is no garbage collector. Memory is freed deterministically when values go out of scope. There are no stop-the-world pauses, no generational promotion overhead, no finalizer queues. Our p99 latency is stable because there is nothing running in the background that can unpredictably freeze the process.
Fearless Concurrency
The borrow checker prevents data races at compile time. When you have thousands of concurrent tasks all processing pages, extracting links, updating shared crawl state, and writing results, the guarantee that your code is free of data races saves an enormous amount of developer time. Bugs that would surface as rare, unreproducible race conditions in Python or Java are rejected at compile time in Rust.
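As a small illustration (not spider's internal design), any state shared across tasks has to be wrapped in something the compiler can prove is thread-safe, or the program does not build:

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

// Illustrative example: a shared "visited URLs" set across concurrent tasks.
// Remove the Arc<Mutex<..>> and the compiler rejects the program instead of
// letting a data race ship to production.
#[tokio::main]
async fn main() {
    let visited: Arc<Mutex<HashSet<String>>> = Arc::new(Mutex::new(HashSet::new()));
    let mut handles = Vec::new();

    for i in 0..8 {
        let visited = Arc::clone(&visited);
        handles.push(tokio::spawn(async move {
            let url = format!("https://example.com/page/{i}");
            // Keep the lock scope short so other tasks are not blocked.
            visited.lock().unwrap().insert(url);
        }));
    }
    for handle in handles {
        let _ = handle.await;
    }
    println!("visited {} pages", visited.lock().unwrap().len());
}
```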
Compile-Time Safety
The type system catches entire categories of bugs before the code ever runs. Null pointer dereferences and use-after-free are ruled out by Option and the borrow checker, and out-of-bounds access fails at a checked boundary instead of silently corrupting memory. For a service that processes untrusted HTML from millions of websites, this is not academic. It is a direct reduction in production incidents.
The Honest Challenges
Rust is not free. Here is what it cost us.
Longer Development Cycles (Initially)
The first three months were slow. Fighting the borrow checker, learning lifetime annotations, understanding when to use Arc<Mutex<T>> versus channels versus atomics. Every Rust developer goes through this phase. It is real, and it cannot be skipped. Our velocity was roughly 40% of what it had been in Python during those initial months.
After the learning curve flattened, our velocity recovered to about 80% of Python speed for new feature development. But the code we shipped had far fewer bugs, so the net time including debugging and incident response was actually lower.
HTML Parsing Ecosystem
When we started, the Rust HTML parsing ecosystem was smaller than Python’s. There was no equivalent to BeautifulSoup’s forgiving, batteries-included approach. We used a combination of html5ever (the same parser that powers Firefox’s HTML parsing) for spec-compliant parsing and built custom extraction logic on top. The ecosystem has improved significantly since then, with crates like scraper and select.rs filling the gap.
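A typical extraction pass with the scraper crate today looks like this (a small example of that ecosystem, not code taken from spider):

```rust
use scraper::{Html, Selector};

// Parse a page with the scraper crate (built on html5ever) and pull out link
// targets, tolerating malformed markup the way a browser would.
fn main() {
    let html = r#"
        <html><body>
            <a href="/docs">Docs<a href="https://example.com/blog">Blog</a>
        </body></html>
    "#;
    let document = Html::parse_document(html);
    let link_selector = Selector::parse("a[href]").expect("valid CSS selector");

    for element in document.select(&link_selector) {
        if let Some(href) = element.value().attr("href") {
            println!("found link: {href}");
        }
    }
}
```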
Hiring
Finding Rust developers is harder than finding Python developers. This is a real constraint. We addressed it by hiring strong systems programmers (C, C++, Go backgrounds) and investing in their Rust ramp-up. The language’s tooling (cargo, clippy, rust-analyzer) and documentation make onboarding faster than you might expect, but the initial pool of candidates is undeniably smaller.
Compile Times
A full release build of the spider crate takes several minutes. Incremental builds during development are fast (a few seconds), but CI pipelines and Docker builds feel the weight of the Rust compiler. We use cargo-chef in Docker builds to cache dependency compilation, which helps significantly.
What It Powers Today
The rewrite was worth it. Here is where things stand.
The spider crate has over 2,200 stars on GitHub and is MIT licensed. It is used by developers and companies building RAG pipelines, search engines, monitoring systems, and data platforms. The crate handles everything from single-page fetches to million-page crawls.
spider.cloud is the hosted platform built on top of the crate. It sustains 50,000 requests per minute. The engine’s own processing overhead is under 15ms at p99: virtually all of the wall-clock time in a crawl is the target site’s response, not Spider’s. The crawl engine auto-scales to handle traffic spikes without manual intervention. Smart mode routing, anti-bot bypass, proxy rotation, and content transformation are all built into the API.
Spider ships as a document loader in the major AI frameworks (LangChain, LlamaIndex, and others). If you are building a RAG pipeline or agent, there is likely already a connector for your stack.
Internal LLM for Extraction
Separately, we run small, task-specific language models in-house for data extraction. These are not general-purpose chat models. They handle narrow tasks like converting messy HTML into structured JSON, which keeps per-page costs fixed rather than scaling with token count from external API calls.
The Takeaway
If you are scraping a few hundred pages, Python is fine. Scrapy is a great framework. BeautifulSoup does its job. You do not need Rust for a weekend project.
If you are building a product that needs to crawl millions of pages reliably, with predictable latency and controlled resource usage, Rust changes the equation fundamentally. The upfront investment in learning the language pays for itself many times over in reduced infrastructure costs, fewer production incidents, and the ability to scale without rewriting again.
We rewrote everything because we had to. We would do it again without hesitation.
The spider crate is open source and MIT licensed: github.com/spider-rs/spider. The hosted API is at spider.cloud.