Rust vs. Python for Web Scraping: Why We Rewrote Everything
Spider started as a Python project. Like every other web scraping tool at the time, it was built on top of Scrapy, BeautifulSoup, and requests. That stack worked. It worked for small jobs, it worked for demos, and it worked long enough to teach us exactly where it would stop working.
This is the story of why we threw all of it away and rewrote the entire crawl engine in Rust. Not because Rust was trendy. Because we ran out of options.
The Python Wall
In early development, our scraper ran fine on a single machine. You could point it at a site, let it churn through a few hundred pages, and get clean output in minutes. The problems started when we tried to run it at the scale our users actually needed.
The GIL Problem
Python’s Global Interpreter Lock means that only one thread executes Python bytecode at a time. For I/O-bound scraping, you can work around this with asyncio or multiprocessing. But “work around” is the key phrase. Every workaround introduced its own complexity. Multiprocessing meant serializing data between processes, managing shared state through pipes or queues, and dealing with memory duplication. asyncio helped with network-bound waits, but the moment you needed to parse HTML or transform data, you were back to single-threaded execution.
At 10,000 concurrent connections, our Python scraper consumed roughly 500MB of memory just for the runtime overhead: interpreter state, object headers, reference counting metadata, and the garbage collector’s bookkeeping. The actual page data was on top of that.
Parsing Was the Bottleneck
BeautifulSoup is excellent for correctness. It handles malformed HTML gracefully and provides a clean API. But it is not fast. Parsing a moderately complex page took 5 to 15 milliseconds in Python. Multiply that by thousands of pages per second and parsing becomes the dominant cost, not network latency. We profiled our crawler extensively and found that 60% of CPU time was spent in HTML parsing and tree traversal, not in waiting for HTTP responses.
Throughput Ceiling
Our best Python configuration, running on a 16-core machine with a combination of multiprocessing and asyncio, topped out at roughly 50 pages per second of sustained throughput. We could spike higher on simple static pages, but the moment pages had any complexity (nested tables, large DOMs, lots of links to extract), throughput dropped. Garbage collection pauses added unpredictable latency spikes. The p99 response time hovered around 2 to 4 seconds, with occasional spikes above 10 seconds during major GC sweeps.
We tried every optimization in the book: PyPy, Cython extensions for hot paths, connection pooling with aiohttp, pre-forked worker pools. Each one bought us a percentage improvement. None of them broke through the fundamental ceiling.
Why Rust
The decision to rewrite was not impulsive. We evaluated Go, C++, and Rust over several weeks, building small proof-of-concept crawlers in each.
Go was compelling for its simplicity and goroutine model, but we kept hitting performance walls in HTML parsing. The Go HTML parser ecosystem at the time was less mature, and the garbage collector, while better than Python’s, still introduced latency variance under heavy allocation pressure.
C++ would have given us the raw performance we needed, but the development velocity tradeoff was brutal. Memory safety bugs in a long-running network service that handles untrusted HTML input from arbitrary websites felt like a guaranteed source of CVEs.
Rust gave us three things simultaneously: performance comparable to C++, memory safety without a garbage collector, and a genuinely excellent async runtime in tokio. The borrow checker eliminated entire categories of bugs at compile time. The type system made illegal states unrepresentable. And the zero-cost abstraction principle meant that writing high-level, readable code did not come with a runtime penalty.
We wrote a prototype crawler in Rust over two weeks. It immediately outperformed our production Python system on a single core.
The Benchmarks
After the full rewrite, we ran side-by-side benchmarks on identical hardware (c6g.xlarge, 4 vCPU, 8GB RAM) crawling the same set of 50,000 pages across 500 domains.
Memory Usage
| Metric | Python (Scrapy + aiohttp) | Rust (spider crate) |
|---|---|---|
| Baseline (idle) | 120MB | 8MB |
| 1K concurrent connections | 280MB | 22MB |
| 10K concurrent connections | 510MB | 78MB |
| Peak during 50K page crawl | 1.2GB | 140MB |
The difference comes down to object representation. A Python string carries a reference count, a type pointer, a hash cache, and length metadata before you get to the actual bytes. In Rust, a String is three machine words: pointer, length, capacity, with the bytes themselves in a single heap allocation. No per-object header, no reference counting, no garbage collector metadata.
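You can check the Rust side of that claim directly. A minimal example (64-bit target assumed):

```rust
fn main() {
    // On a 64-bit target an owned String is 3 machine words = 24 bytes
    // (pointer, length, capacity); the text itself lives in one heap allocation.
    println!("size_of::<String>() = {}", std::mem::size_of::<String>());
    // A borrowed &str slice is just pointer + length: 16 bytes.
    println!("size_of::<&str>()   = {}", std::mem::size_of::<&str>());
}
```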
Throughput
| Metric | Python | Rust |
|---|---|---|
| Static pages (simple HTML) | 85 pages/sec | 1,200 pages/sec |
| JS-rendered (headless Chrome) | 12 pages/sec | 45 pages/sec |
| Mixed workload (smart mode) | 50 pages/sec | 520 pages/sec |
| Sustained over 1 hour | 42 pages/sec | 500+ pages/sec |
The “sustained over 1 hour” row matters most. Python throughput degraded over time as memory fragmentation accumulated and GC pressure increased. Rust throughput was flat. The number at minute 60 was the same as minute 1.
Latency
| Percentile | Python | Rust |
|---|---|---|
| p50 | 180ms | 4ms |
| p95 | 1,200ms | 9ms |
| p99 | 3,800ms | 12ms |
| p99.9 | 11,000ms | 28ms |
The p99 improvement from 3.8 seconds to 12 milliseconds was the single metric that convinced us the rewrite was worth every hour we invested. Tail latency kills user experience in an API product. When your p99 is measured in seconds, some percentage of your users are always waiting. When it is measured in milliseconds, your API feels instant.
Deployment
| Metric | Python | Rust |
|---|---|---|
| Docker image size | 1.2GB | 45MB |
| Cold start time | 8 seconds | 120ms |
| Deployment dependencies | pip, virtualenv, system libs | Single static binary |
| Required runtime | Python 3.11 + system packages | None |
The Rust binary is a single statically-linked executable. No interpreter, no virtualenv, no dependency resolution at deploy time. The Docker image is a FROM scratch container with the binary and TLS certificates. That is it.
Architecture of the Spider Crate
The spider crate is the open-source core that powers spider.cloud. Here is how the key pieces fit together.
Concurrent Connection Pool
Every crawl job gets a connection pool backed by hyper and tokio. Connections are reused across requests to the same host, and the pool automatically manages keep-alive, TLS session resumption, and connection limits per domain. Backpressure is handled through tokio’s cooperative scheduling: if a downstream host is slow, the crawler naturally slows its request rate to that host without blocking work on other domains.
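The crate wires this up internally through its pool; as a rough sketch of the per-host limiting idea (illustrative only, not spider's actual internals, and assuming tokio with its full feature set), a semaphore per domain is enough to express the policy:

```rust
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::{Mutex, OwnedSemaphorePermit, Semaphore};

// Illustrative sketch: cap in-flight requests per host so one slow domain
// cannot monopolize the crawler. Not the spider crate's real implementation.
struct HostLimits {
    per_host: Mutex<HashMap<String, Arc<Semaphore>>>,
    limit: usize,
}

impl HostLimits {
    fn new(limit: usize) -> Self {
        Self { per_host: Mutex::new(HashMap::new()), limit }
    }

    // Acquire a permit for this host, creating its semaphore on first use.
    async fn acquire(&self, host: &str) -> OwnedSemaphorePermit {
        let sem = {
            let mut map = self.per_host.lock().await;
            map.entry(host.to_string())
                .or_insert_with(|| Arc::new(Semaphore::new(self.limit)))
                .clone()
        };
        sem.acquire_owned().await.expect("semaphore closed")
    }
}

#[tokio::main]
async fn main() {
    let limits = Arc::new(HostLimits::new(8)); // at most 8 in flight per host
    let mut handles = Vec::new();
    for i in 0..32 {
        let limits = limits.clone();
        handles.push(tokio::spawn(async move {
            let _permit = limits.acquire("example.com").await;
            // Stand-in for the actual HTTP fetch.
            tokio::time::sleep(std::time::Duration::from_millis(50)).await;
            println!("finished request {i}");
        }));
    }
    for h in handles {
        let _ = h.await;
    }
}
```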
Smart Mode Routing
The default crawl mode inspects each URL and makes a routing decision:
- Fetch the page with a lightweight HTTP request first.
- Check if the response contains signals that JavaScript rendering is needed (empty body, SPA framework markers, meta refresh tags).
- If JS rendering is required, re-fetch through headless Chrome.
- If the HTTP response is sufficient, skip Chrome entirely.
This means static sites are crawled at raw HTTP speed while dynamic sites still get full browser rendering. The cost savings are significant: headless Chrome is roughly 10x more expensive per page than a direct HTTP fetch.
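The exact signals are tuned inside the engine; conceptually, though, the routing check looks something like the sketch below (illustrative only; the thresholds and marker strings here are assumptions, not spider's real heuristics):

```rust
// Simplified illustration of smart-mode routing (not the actual spider code):
// decide from the plain HTTP response whether a headless browser pass is
// worth the extra cost.
fn needs_js_rendering(status: u16, body: &str) -> bool {
    // Hard signal: an empty or near-empty document.
    if body.trim().is_empty() || body.len() < 512 {
        return true;
    }
    // SPA mount points and framework markers (assumed examples).
    let spa_markers = ["id=\"root\"", "id=\"app\"", "data-reactroot", "ng-version"];
    let looks_like_spa = spa_markers.iter().any(|m| body.contains(*m));
    // A meta refresh often indicates a JS-driven shell page.
    let meta_refresh = body.contains("http-equiv=\"refresh\"");
    status == 200 && (looks_like_spa || meta_refresh)
}

fn main() {
    let shell = r#"<html><body><div id="root"></div><script src="/app.js"></script></body></html>"#;
    println!("render with Chrome? {}", needs_js_rendering(200, shell));
}
```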
Streaming Responses
Pages are processed as they arrive, not buffered into memory and batch-processed. The HTML parser operates on a streaming byte buffer, extracting links and content incrementally. This keeps memory usage proportional to the number of in-flight requests, not the total volume of data processed.
Using the Spider Crate
The crate is designed to be straightforward. A basic crawl looks like this:
```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com")
        .with_limit(500)
        .with_respect_robots_txt(true)
        .build()
        .unwrap();

    website.crawl().await;

    for page in website.get_pages().unwrap().iter() {
        println!("URL: {} | Status: {}", page.get_url(), page.status_code);
    }
}
```
For streaming processing where you want to handle pages as they arrive:
```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");
    let mut rx = website.subscribe(0).unwrap();

    let join_handle = tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            let url = page.get_url();
            let content = page.get_html();
            // Process each page as it arrives.
            // This runs concurrently with the crawl itself.
            println!("Received {} ({} bytes)", url, content.len());
        }
    });

    website.crawl().await;
    // Close the subscription so the receiver loop ends and the task can finish.
    website.unsubscribe();
    let _ = join_handle.await;
}
```
And for the hosted API at spider.cloud, a single HTTP request handles everything:
```bash
curl -X POST https://api.spider.cloud/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "limit": 100,
    "return_format": "markdown"
  }'
```
The return_format: "markdown" parameter runs every page through our content extraction pipeline, stripping navigation, ads, footers, and boilerplate. The output is clean markdown ready for LLM consumption, vector embedding, or RAG pipelines.
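If you are calling the hosted API from Rust rather than curl, a minimal client looks roughly like this (a sketch assuming the reqwest crate with its json feature plus tokio; the response body is printed raw rather than assuming a particular schema):

```rust
use reqwest::Client;
use serde_json::json;

// Minimal sketch of the same request as the curl example above.
// Replace YOUR_API_KEY with a real key.
#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let body = json!({
        "url": "https://example.com",
        "limit": 100,
        "return_format": "markdown"
    });

    let resp = Client::new()
        .post("https://api.spider.cloud/crawl")
        .bearer_auth("YOUR_API_KEY")
        .json(&body)
        .send()
        .await?;

    println!("status: {}", resp.status());
    println!("{}", resp.text().await?);
    Ok(())
}
```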
Key Rust Advantages for Scraping
After two years of running this system in production, these are the Rust features that matter most for web scraping specifically.
Zero-Cost Abstractions
We write high-level iterator chains, combinators, and generic code. The compiler optimizes it down to the same machine code you would write by hand. There is no “abstraction tax” for writing clean, maintainable code.
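As a small illustration (not code from the crate), a link-filtering pass written as an iterator chain compiles down to the same tight loop you would write by hand:

```rust
// High-level iterator chain for filtering extracted links. The optimizer
// turns this into a plain loop; the only allocation is the final Vec.
fn same_host_links<'a>(links: &[&'a str], host: &str) -> Vec<&'a str> {
    links
        .iter()
        .copied()
        .filter(|l| l.starts_with("http://") || l.starts_with("https://"))
        .filter(|l| l.contains(host))
        .collect()
}

fn main() {
    let links = ["https://example.com/a", "mailto:x@y.z", "https://other.org/b"];
    println!("{:?}", same_host_links(&links, "example.com"));
}
```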
Async with Tokio
Tokio’s work-stealing scheduler distributes tasks across all available cores automatically. A crawl job with 10,000 concurrent connections uses 10,000 lightweight tasks, not 10,000 OS threads. Context switching happens in userspace, not in the kernel. The overhead per task is roughly 300 bytes, compared to the kilobytes of minimum stack space (and megabytes of default stack) that an OS thread reserves.
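A quick sketch of what that means in practice (illustrative, assuming tokio's multi-threaded runtime): spawning ten thousand tasks is routine in a way that spawning ten thousand OS threads is not.

```rust
use tokio::task::JoinSet;

// Illustrative sketch: 10,000 concurrent tasks on tokio's work-stealing
// scheduler. Each task is a small heap allocation, not an OS thread.
#[tokio::main]
async fn main() {
    let mut set = JoinSet::new();
    for i in 0..10_000u32 {
        set.spawn(async move {
            // Stand-in for an HTTP fetch; the await point lets the scheduler
            // interleave thousands of these on a handful of worker threads.
            tokio::time::sleep(std::time::Duration::from_millis(10)).await;
            i
        });
    }
    let mut completed = 0;
    while let Some(result) = set.join_next().await {
        if result.is_ok() {
            completed += 1;
        }
    }
    println!("completed {completed} tasks");
}
```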
No GC Pauses
This is the single biggest advantage for a latency-sensitive service. There is no garbage collector. Memory is freed deterministically when values go out of scope. There are no stop-the-world pauses, no generational promotion overhead, no finalizer queues. Our p99 latency is stable because there is nothing running in the background that can unpredictably freeze the process.
Fearless Concurrency
The borrow checker prevents data races at compile time. When you have thousands of concurrent tasks all processing pages, extracting links, updating shared crawl state, and writing results, the guarantee that your code is free of data races saves an enormous amount of developer time. Bugs that would surface as rare, unreproducible race conditions in Python or Java are rejected at compile time in Rust.
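As a small illustration (not spider's internal design), any state shared across tasks has to be wrapped in something the compiler can prove is thread-safe, or the program does not build:

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

// Illustrative example: a shared "visited URLs" set across concurrent tasks.
// Remove the Arc<Mutex<..>> and the compiler rejects the program instead of
// letting a data race ship to production.
#[tokio::main]
async fn main() {
    let visited: Arc<Mutex<HashSet<String>>> = Arc::new(Mutex::new(HashSet::new()));
    let mut handles = Vec::new();

    for i in 0..8 {
        let visited = Arc::clone(&visited);
        handles.push(tokio::spawn(async move {
            let url = format!("https://example.com/page/{i}");
            // Keep the lock scope short so other tasks are not blocked.
            visited.lock().unwrap().insert(url);
        }));
    }
    for handle in handles {
        let _ = handle.await;
    }
    println!("visited {} pages", visited.lock().unwrap().len());
}
```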
Compile-Time Safety
The type system catches entire categories of bugs before the code ever runs. Null pointer dereferences and use-after-free are ruled out by Option and the borrow checker, and out-of-bounds access fails at a checked boundary instead of silently corrupting memory. For a service that processes untrusted HTML from millions of websites, this is not academic. It is a direct reduction in production incidents.
The Honest Challenges
Rust is not free. Here is what it cost us.
Longer Development Cycles (Initially)
The first three months were slow. Fighting the borrow checker, learning lifetime annotations, understanding when to use Arc<Mutex<T>> versus channels versus atomics. Every Rust developer goes through this phase. It is real, and it cannot be skipped. Our velocity was roughly 40% of what it had been in Python during those initial months.
After the learning curve flattened, our velocity recovered to about 80% of Python speed for new feature development. But the code we shipped had far fewer bugs, so the net time including debugging and incident response was actually lower.
HTML Parsing Ecosystem
When we started, the Rust HTML parsing ecosystem was smaller than Python’s. There was no equivalent to BeautifulSoup’s forgiving, batteries-included approach. We used a combination of html5ever (the same parser that powers Firefox’s HTML parsing) for spec-compliant parsing and built custom extraction logic on top. The ecosystem has improved significantly since then, with crates like scraper and select.rs filling the gap.
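A typical extraction pass with the scraper crate today looks like this (a small example of that ecosystem, not code taken from spider):

```rust
use scraper::{Html, Selector};

// Parse a page with the scraper crate (built on html5ever) and pull out link
// targets, tolerating malformed markup the way a browser would.
fn main() {
    let html = r#"
        <html><body>
            <a href="/docs">Docs<a href="https://example.com/blog">Blog</a>
        </body></html>
    "#;
    let document = Html::parse_document(html);
    let link_selector = Selector::parse("a[href]").expect("valid CSS selector");

    for element in document.select(&link_selector) {
        if let Some(href) = element.value().attr("href") {
            println!("found link: {href}");
        }
    }
}
```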
Hiring
Finding Rust developers is harder than finding Python developers. This is a real constraint. We addressed it by hiring strong systems programmers (C, C++, Go backgrounds) and investing in their Rust ramp-up. The language’s tooling (cargo, clippy, rust-analyzer) and documentation make onboarding faster than you might expect, but the initial pool of candidates is undeniably smaller.
Compile Times
A full release build of the spider crate takes several minutes. Incremental builds during development are fast (a few seconds), but CI pipelines and Docker builds feel the weight of the Rust compiler. We use cargo-chef in Docker builds to cache dependency compilation, which helps significantly.
What It Powers Today
The rewrite was worth it. Here is where things stand.
The spider crate has over 2,200 stars on GitHub and is MIT licensed. It is used by developers and companies building RAG pipelines, search engines, monitoring systems, and data platforms. The crate handles everything from single-page fetches to million-page crawls.
spider.cloud is the hosted platform built on top of the crate. It sustains 50,000 requests per minute. The engine’s own processing overhead is under 15ms at p99: virtually all of the wall-clock time in a crawl is the target site’s response, not Spider’s. The crawl engine auto-scales to handle traffic spikes without manual intervention. Smart mode routing, anti-bot bypass, proxy rotation, and content transformation are all built into the API.
Spider ships as a document loader in the major AI frameworks (LangChain, LlamaIndex, and others). If you are building a RAG pipeline or agent, there is likely already a connector for your stack.
Internal LLM for Extraction
Separately, we run small, task-specific language models in-house for data extraction. These are not general-purpose chat models. They handle narrow tasks like converting messy HTML into structured JSON, which keeps per-page costs fixed rather than scaling with token count from external API calls.
The Takeaway
If you are scraping a few hundred pages, Python is fine. Scrapy is a great framework. BeautifulSoup does its job. You do not need Rust for a weekend project.
If you are building a product that needs to crawl millions of pages reliably, with predictable latency and controlled resource usage, Rust changes the equation fundamentally. The upfront investment in learning the language pays for itself many times over in reduced infrastructure costs, fewer production incidents, and the ability to scale without rewriting again.
We rewrote everything because we had to. We would do it again without hesitation.
The spider crate is open source and MIT licensed: github.com/spider-rs/spider. The hosted API is at spider.cloud.