Scraping 1 Million Pages: What Actually Happens
Most scraping benchmarks test 100 URLs and call it a day. That tells you almost nothing about what happens at production scale, where DNS caches expire, connection pools saturate, anti-bot systems escalate, and half the internet returns something other than a clean 200.
We ran a real crawl: 1,000,000 pages across 10,247 domains, using Spider’s cloud API with smart mode and markdown output. This is the full engineering log. Every number comes from production telemetry, not a synthetic test.
The setup
The target list was assembled from three sources:
- 4,100 documentation sites (static HTML, sitemaps, minimal JS)
- 3,800 SaaS marketing and product pages (mix of SSR and client-side rendering, many behind Cloudflare)
- 2,347 news and media sites (heavy JS, ad networks, paywalls, aggressive bot detection)
Each domain was capped at 100 pages via the limit parameter. The API was called in batches of 50 concurrent crawl requests, each targeting one domain. Output format was markdown. Smart mode was enabled (the default), meaning Spider decides per-page whether to use a lightweight HTTP fetch or a full headless Chrome render.
The crawl configuration:
```python
import concurrent.futures
import os

import requests


def crawl_domain(domain):
    # One crawl job per domain: up to 100 pages, markdown output, smart routing.
    return requests.post(
        'https://api.spider.cloud/crawl',
        headers={
            'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
            'Content-Type': 'application/json',
        },
        json={
            "url": f"https://{domain}",
            "limit": 100,
            "return_format": "markdown",
            "request": "smart",
            "readability": True,
        },
        stream=True,  # stream the response instead of buffering it
    )


# domains is the full list of 10,247 target domains.
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    futures = {pool.submit(crawl_domain, d): d for d in domains}
```
We used application/jsonl with streaming so results arrived incrementally, rather than buffering 100 pages in memory per domain.
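For completeness, here is a minimal sketch of how the client drained those streams, repeating the submission loop from the snippet above. It assumes only what the JSONL format implies: each non-empty line of the response body is one page record. File layout and error handling are simplified.

```python
import concurrent.futures
import os

os.makedirs("output", exist_ok=True)

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    futures = {pool.submit(crawl_domain, d): d for d in domains}
    for future in concurrent.futures.as_completed(futures):
        domain = futures[future]
        response = future.result()
        out_path = os.path.join("output", f"{domain}.jsonl")
        with open(out_path, "a", encoding="utf-8") as out:
            # One page per non-empty line; flush to disk as it arrives instead
            # of holding the whole domain crawl in memory.
            for raw_line in response.iter_lines():
                if raw_line:
                    out.write(raw_line.decode("utf-8") + "\n")
        response.close()
```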
Timeline: 1 million pages in 3 hours 47 minutes
Total wall-clock time from the first request to the last response: 3 hours, 47 minutes, 12 seconds.
Here is how throughput evolved over the run:
| Time window | Pages/second (avg) | Cumulative pages | Notes |
|---|---|---|---|
| 0:00 - 0:15 | 42 | 37,800 | Ramp-up. DNS resolution warming, connection pools filling. |
| 0:15 - 0:45 | 89 | 197,400 | Steady state for static sites. Smart mode routing ~78% of pages to HTTP-only. |
| 0:45 - 1:30 | 104 | 478,200 | Peak sustained throughput. Chrome pool fully warm. |
| 1:30 - 2:15 | 91 | 723,900 | SaaS and media domains entering the queue. More Chrome renders, more anti-bot. |
| 2:15 - 3:00 | 72 | 918,900 | Rate limiting from target sites. Retry backoff kicking in. |
| 3:00 - 3:47 | 29 | 1,000,000 | Long tail. Slow domains, retries, sites returning intermittent 5xx. |
Sustained average across the full run: 73.4 pages/second.
The throughput curve follows a pattern we see on every large crawl. You get a fast ramp to peak, a plateau while the easy domains clear, and then a long tail where the remaining difficult domains drag down the average. The last 8% of pages took 20% of the total time.
What smart mode actually did
Smart mode inspected each page and chose the cheapest execution path. Across 1 million pages:
| Execution path | Pages | Percentage |
|---|---|---|
| HTTP fetch only | 681,204 | 68.1% |
| Headless Chrome render | 318,796 | 31.9% |
That split matters for cost. An HTTP-only fetch is significantly cheaper and faster than spinning up a Chrome tab. If we had forced chrome mode on everything, the crawl would have cost roughly 2.4x more and taken an estimated 6+ hours.
Smart mode made the right call on 96.3% of pages (verified by spot-checking a random sample of 2,000 pages where the HTTP path was chosen, confirming the markdown output matched what Chrome would have produced). The remaining 3.7% were cases where JavaScript loaded additional content that the HTTP fetch missed. For most workloads, that trade-off is worth it.
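The spot check itself is easy to reproduce. A sketch, assuming a hypothetical fetch_markdown(url, mode) helper that calls the crawl API for a single URL with the request parameter forced to "http" or "chrome"; those mode values, the 0.95 similarity cutoff, and the http_only_urls list are our assumptions, not the exact procedure used for the numbers above.

```python
import difflib
import random

def markdown_similarity(a: str, b: str) -> float:
    # Crude similarity on whitespace-normalized text; good enough for a spot check.
    return difflib.SequenceMatcher(None, " ".join(a.split()), " ".join(b.split())).ratio()

# http_only_urls: the pages where smart mode chose the HTTP-only path.
sample = random.sample(http_only_urls, 2000)
mismatches = [
    url for url in sample
    if markdown_similarity(fetch_markdown(url, "http"), fetch_markdown(url, "chrome")) < 0.95
]
print(f"{len(mismatches) / len(sample):.1%} of sampled pages needed JavaScript after all")
```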
What went wrong (and how Spider handled it)
At 1 million pages, everything that can go wrong does go wrong. Here is the full error distribution:
| Error type | Count | Percentage of total | Resolution |
|---|---|---|---|
| Success (2xx) | 963,718 | 96.37% | Clean response. |
| HTTP 403 Forbidden | 14,291 | 1.43% | Anti-bot block. Spider retried with proxy rotation and stealth escalation. 11,402 recovered on retry. |
| HTTP 429 Too Many Requests | 8,847 | 0.88% | Rate limited by target site. Exponential backoff, then resumed. 8,104 recovered. |
| HTTP 5xx Server Error | 5,218 | 0.52% | Target site errors. Retried up to 3 times. 3,011 recovered. |
| Connection timeout | 4,109 | 0.41% | Target unreachable or slow. 2,688 recovered after retry. |
| DNS resolution failure | 2,414 | 0.24% | Domain expired, misconfigured, or DNS server overloaded. 891 recovered on second attempt. |
| TLS handshake error | 847 | 0.08% | Expired or misconfigured certificates. Not retried. |
| Empty response body | 556 | 0.06% | Server accepted connection but returned nothing. 312 recovered on retry. |
Final success rate after retries: 98.98% (989,826 pages with usable content).
The 10,174 permanent failures break down as: 2,889 hard 403s (sites that block all automated traffic regardless of proxy), 743 persistent 429s where the rate-limit window exceeded our retry budget, 1,523 unrecoverable DNS failures, 2,207 persistent 5xx from flaky servers, 1,421 timeouts on extremely slow sites, 847 TLS errors, 244 empty responses, and 300 pages that returned a 2xx but produced unusable content (malformed HTML, encoding errors, redirect loops).
DNS resolution bottlenecks
At minute 8, we hit our first throughput dip. DNS resolution across 10,247 unique domains creates a burst of queries that can overwhelm upstream resolvers. Spider handles this by:
- Running its own caching DNS resolver pool
- Parallelizing resolution across multiple upstream providers
- Pre-resolving the next batch of domains while the current batch is crawling
The dip lasted about 90 seconds before the cache warmed and throughput recovered.
Connection pool exhaustion
Around minute 25, with 50 concurrent crawl jobs each maintaining their own connection pools, total open connections peaked at approximately 12,000. Spider’s connection pooling is per-domain with configurable limits, so no single target site gets hammered with hundreds of concurrent requests. The pool drains gracefully as domains complete.
Anti-bot escalation
The media and SaaS segments triggered the most anti-bot responses. Here is the escalation pattern Spider follows:
- First attempt: Standard request with rotating datacenter proxy
- 403 detected: Retry with residential proxy and browser-like headers
- Still blocked: Retry with full Chrome render, residential proxy, and fingerprint randomization
- Still blocked: Mark as permanently failed, move on
Of the 14,291 initial 403s, this escalation recovered 11,402 (79.8%). The remaining 2,889 were sites running aggressive bot detection that blocks all automated traffic (some government sites, banking portals, and sites using custom WAFs).
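Spider runs that ladder server-side, so API users never implement it. But since the do-it-yourself comparison later in this post has to replicate it, here is a rough client-side sketch of the same idea; fetch_with() and looks_blocked() are hypothetical stand-ins for whatever HTTP client, proxy pool, and block detection you would use.

```python
# Hypothetical escalation ladder: try the cheapest strategy first,
# escalate only when the response looks like a block.
STRATEGIES = [
    {"proxy": "datacenter", "render": False},
    {"proxy": "residential", "render": False, "browser_headers": True},
    {"proxy": "residential", "render": True, "randomize_fingerprint": True},
]

def fetch_escalating(url):
    for strategy in STRATEGIES:
        response = fetch_with(url, **strategy)  # hypothetical helper
        if response is not None and not looks_blocked(response):
            return response
    return None  # exhausted the ladder: mark as permanently failed, move on
```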
Rate limiting from target sites
HTTP 429 responses clustered around a handful of large sites that enforce strict rate limits. Spider respects Retry-After headers and applies exponential backoff. The 8,847 rate-limited requests were spread across only 127 domains, meaning most of the internet did not rate-limit us.
Recovery rate was 91.6%, with the remaining 743 being requests to domains where the rate-limit window exceeded our patience threshold (we capped retry wait at 120 seconds per request).
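If you ever need to reproduce this policy outside the API, the core of it is small: honor Retry-After when the server sends one, otherwise back off exponentially, and give up once the wait would exceed your budget. A minimal sketch with the same 120-second cap; the attempt count and jitter are our own choices.

```python
import random
import time

import requests

MAX_WAIT = 120  # seconds; beyond this the page is counted as failed

def get_with_backoff(url, max_attempts=5):
    delay = 1.0
    for _ in range(max_attempts):
        resp = requests.get(url)
        if resp.status_code != 429:
            return resp
        # Prefer the server's own Retry-After hint (in seconds) when present.
        retry_after = resp.headers.get("Retry-After")
        wait = float(retry_after) if retry_after and retry_after.isdigit() else delay
        if wait > MAX_WAIT:
            return None  # rate-limit window exceeds our patience threshold
        time.sleep(wait + random.uniform(0, 0.5))  # small jitter to avoid lockstep
        delay = min(delay * 2, MAX_WAIT)
    return None
```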
Memory pressure
At peak, the 50 concurrent crawl streams with Chrome rendering consumed significant memory server-side. Spider’s architecture handles this through:
- Per-request memory budgets that cap how much HTML/DOM data is buffered
- Streaming markdown output (via JSONL) so completed pages flush immediately rather than accumulating
- Chrome tab recycling: tabs are reused across pages within a domain crawl rather than spawning a new process per page
From the client side, streaming with application/jsonl meant our client process never held more than a few hundred pages in memory at once.
The numbers
Data volume
| Metric | Value |
|---|---|
| Total pages with content | 989,826 |
| Total markdown output | 47.3 GB |
| Average page size (markdown) | 49.2 KB |
| Median page size (markdown) | 31.7 KB |
| Largest single page | 2.8 MB (an auto-generated API reference) |
| Smallest useful page | 127 bytes (a redirect landing page) |
The 47.3 GB of clean markdown is what you would feed into a vector store or LLM. For comparison, the raw HTML for the same pages totaled approximately 194 GB. Spider’s readability and markdown conversion stripped roughly 75% of the noise (navigation, ads, scripts, boilerplate).
Cost breakdown
| Component | Credits used | Estimated cost |
|---|---|---|
| HTTP-only fetches (681,204 pages) | 204,361 | $204.36 |
| Chrome renders (318,796 pages) | 191,278 | $191.28 |
| Proxy usage (datacenter) | 38,420 | $38.42 |
| Proxy usage (residential, escalation) | 24,118 | $24.12 |
| Bandwidth (47.3 GB output) | 47,300 | $47.30 |
| Compute time | 22,712 | $22.71 |
| Retries (successful) | 18,440 | $18.44 |
| Total | 546,629 | $546.63 |
Effective cost: $0.55 per 1,000 pages. That includes retries, proxy escalation, Chrome rendering, and bandwidth. Pages that only needed HTTP fetches cost roughly $0.30 per 1,000. Pages that required Chrome plus residential proxies cost closer to $1.20 per 1,000.
Success rate by site complexity
| Site category | Domains | Pages attempted | Final success rate |
|---|---|---|---|
| Documentation (static) | 4,100 | 410,000 | 99.7% |
| SaaS (mixed rendering) | 3,800 | 380,000 | 99.1% |
| News/media (heavy JS, anti-bot) | 2,347 | 210,000 | 97.4% |
Documentation sites are nearly perfect. They are static HTML, they want to be crawled, and they usually have sitemaps. SaaS sites occasionally require Chrome and sometimes block aggressively during signup flows, but product pages and marketing content are generally accessible. News and media sites are the hardest: aggressive ad networks, anti-bot vendors, paywalls, and JavaScript that loads content dynamically in ways that break without a real browser.
How this compares to doing it yourself
We estimated what the same crawl would cost and how long it would take on three alternative approaches.
Scrapy on EC2
Running Scrapy on a cluster of EC2 instances to crawl 1 million pages:
| Factor | Estimate |
|---|---|
| Infrastructure | 4x c5.2xlarge instances for ~12 hours: ~$16 |
| Proxy service | 1M requests through a residential proxy (BrightData/Oxylabs): ~$300-500 |
| Development time | Writing and maintaining spiders for 10K domains: 40-80 hours |
| Chrome rendering | Splash or Playwright cluster for JS pages: additional 2x c5.2xlarge, ~$8 |
| Anti-bot handling | Manual per-site: included in dev time |
| Total infra cost | ~$330-530 |
| Total time (wall clock) | 10-16 hours (limited by proxy throughput and rate limiting) |
| Dev time cost (at $100/hr) | $4,000-8,000 |
Scrapy is free software, but the engineering time to handle 10,000 different domains, each with their own quirks, is the real cost. You will write retry logic, proxy rotation, error classification, and Chrome fallback yourself. For a one-off research project that might be fine. For a recurring pipeline, it is a maintenance burden.
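To make the maintenance burden concrete, here is a hedged sketch of just the baseline knobs in a Scrapy settings.py; the values are illustrative, and proxy rotation, block classification, and a browser fallback all still require custom downloader middlewares on top.

```python
# settings.py (sketch) -- baseline tuning before any custom middleware.
CONCURRENT_REQUESTS = 200
CONCURRENT_REQUESTS_PER_DOMAIN = 4      # avoid hammering any single site
DOWNLOAD_TIMEOUT = 30

RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [403, 429, 500, 502, 503, 504]

AUTOTHROTTLE_ENABLED = True             # adapt request rate to each site's responses
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 120
```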
Apify
Using Apify’s Web Scraper Actor or Crawlee-based custom Actor:
| Factor | Estimate |
|---|---|
| Compute units | ~2,500 CU at browser mode: ~$625-750 |
| Proxy add-on | Residential proxy: ~$300-400 additional |
| Platform fee | Scale plan at $199/mo minimum |
| Total cost | ~$1,125-1,350 |
| Total time (wall clock) | 8-14 hours |
| Dev time | Lower than Scrapy (Actors handle some complexity), but still 10-20 hours for tuning |
Apify’s managed infrastructure saves you from managing EC2, but the CU billing model means browser-heavy crawls get expensive. The cost lands around $1.10-1.35 per 1,000 pages, roughly 2-2.5x Spider’s $0.55.
ScrapingBee
Using ScrapingBee’s API with JS rendering enabled:
| Factor | Estimate |
|---|---|
| Credits needed | 1M pages x 5 credits (JS rendering) = 5M credits |
| Plan cost | Business plan: $249/mo for 3M credits, so ~$415 for 5M |
| Stealth pages (~15% of total) | 150K pages x 75 credits = 11.25M credits: ~$937 |
| Total cost | ~$1,350 |
| Total time (wall clock) | 12-20 hours (rate limited by API concurrency) |
| Dev time | Minimal (simple API), 2-5 hours |
ScrapingBee’s credit multiplier system makes the headline price misleading. JS rendering at 5x and stealth proxy at 75x compound quickly at scale. The API itself is simple to use, but you lose that simplicity advantage when the bill arrives.
Summary
| Approach | Infrastructure cost | Estimated dev time | Wall-clock time | Notes |
|---|---|---|---|---|
| Spider | $547 | Minimal (API integration) | 3h 47m | Managed proxies and anti-bot included |
| Scrapy + EC2 | $330-530 | Significant (retry logic, proxy management, Chrome fallback) | 10-16h | Dev time varies wildly by team |
| Apify | $1,125-1,350 | Moderate (Actor selection, configuration) | 8-14h | Browser-based Actors consume CUs fast |
| ScrapingBee | $1,350 | Low (simple API) | 12-20h | Credit multipliers add up at scale |
Spider had the lowest infrastructure cost and fastest wall-clock time. Scrapy’s infrastructure cost is comparable if you already have the engineering expertise, but building equivalent retry logic and proxy management from scratch is a significant investment. We are not going to pretend we know your engineering team’s hourly rate.
Lessons learned
After running this crawl and analyzing the telemetry, here is what we would adjust for next time.
1. Batch size matters more than concurrency
We ran 50 concurrent domain crawls. Increasing to 100 did not improve throughput meaningfully because the bottleneck shifted to proxy pool utilization and per-domain rate limits. Decreasing to 25 dropped throughput by about 30%. The sweet spot for this kind of mixed-domain crawl is 40-60 concurrent jobs.
Within each job, the limit of 100 pages per domain was reasonable. For domains with fewer than 100 pages, Spider discovers this quickly via sitemaps and stops. For domains with thousands of pages, the limit prevents any single domain from monopolizing resources.
2. Use lite_mode for known-static domains
We ran everything in standard smart mode, which in retrospect was wasteful for the documentation segment. Splitting the crawl into two batches (lite mode for the 4,100 known-static doc sites, standard for the rest) would have saved roughly 40% on those domains. We did not do it because we wanted a clean single-config benchmark, but for a production pipeline with a known domain list, the split is worth the scheduling complexity.
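A sketch of what that split could look like, assuming lite_mode is passed in the crawl request body alongside the other options; the parameter placement and the static_doc_domains set are our assumptions.

```python
# Hypothetical two-config split: lite mode for known-static docs, smart mode for the rest.
BASE = {"limit": 100, "return_format": "markdown", "readability": True}

def crawl_body(domain):
    if domain in static_doc_domains:  # the 4,100 known-static documentation sites
        return {**BASE, "url": f"https://{domain}", "lite_mode": True}
    return {**BASE, "url": f"https://{domain}", "request": "smart"}
```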
3. Stream your output
Using application/jsonl for streaming was critical. If we had used application/json (buffered), each domain crawl would have held up to 100 pages in memory before returning. With 50 concurrent jobs, that is 5,000 pages buffered simultaneously. Streaming lets results flow to disk or your processing pipeline as they arrive.
4. Set max_credits_per_page for unpredictable domains
A small number of pages triggered expensive proxy escalation chains. Setting max_credits_per_page would have capped the cost on any single page and skipped pages that exceed your budget threshold. For our crawl, 0.3% of pages consumed 4.1% of total credits due to repeated proxy escalation. A credit cap would have flagged these for manual review instead.
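In request terms this is a one-field change to the body used earlier; the threshold below is purely illustrative.

```python
# Same request body as before, with a per-page credit ceiling added.
crawl_body = {
    "url": f"https://{domain}",
    "limit": 100,
    "return_format": "markdown",
    "request": "smart",
    "readability": True,
    "max_credits_per_page": 5,  # illustrative cap; pages exceeding it are skipped
}
```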
5. The long tail is unavoidable
The last 10% of pages always takes disproportionately long. These are the slow servers, the intermittent failures, the domains that rate-limit aggressively. You can set timeout per request to cap how long Spider waits for a single page, but some long tail is inherent to crawling the open web. Plan for it in your timeline estimates.
6. Monitor error distribution, not just success rate
A 98.98% success rate sounds excellent, but the distribution matters. If your 1.02% failure rate is concentrated in one critical domain (say, your competitor’s documentation), you have a problem even though the aggregate number looks good. We recommend tracking per-domain success rates and flagging any domain below 95% for investigation.
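A minimal sketch of that per-domain check, assuming you log one (domain, succeeded) record per attempted page:

```python
from collections import defaultdict

def flag_weak_domains(records, threshold=0.95):
    """records: iterable of (domain, succeeded) pairs from the crawl log."""
    totals = defaultdict(lambda: [0, 0])  # domain -> [successes, attempts]
    for domain, succeeded in records:
        totals[domain][1] += 1
        if succeeded:
            totals[domain][0] += 1
    # Return only the domains that fall below the threshold, for investigation.
    return {d: ok / n for d, (ok, n) in totals.items() if ok / n < threshold}
```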
What 1 million pages of markdown looks like
The 47.3 GB of output from this crawl, after deduplication and cleanup, represents a substantial corpus. To put that in context (the arithmetic is sketched after this list):
- It is roughly 12 billion tokens when tokenized for GPT-4 class models
- Stored as embeddings in a vector database (1536-dimensional, float32), the index would be approximately 180 GB
- At typical RAG retrieval patterns, this corpus supports a knowledge base covering 10,247 distinct web properties
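The back-of-the-envelope arithmetic behind the token and index-size bullets, assuming roughly 4 bytes per token and 400-token embedding chunks (both our assumptions):

```python
corpus_bytes = 47.3e9                  # 47.3 GB of markdown
tokens = corpus_bytes / 4              # ~4 bytes per token for English prose

chunk_tokens = 400                     # assumed chunk size for embedding
vectors = tokens / chunk_tokens
index_bytes = vectors * 1536 * 4       # 1536-dim float32 embeddings
print(f"{tokens / 1e9:.1f}B tokens, ~{index_bytes / 1e9:.0f} GB of raw vectors")
# -> about 11.8B tokens and ~182 GB, consistent with the figures above
```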
For teams building RAG pipelines, training data sets, or competitive intelligence platforms, getting from “I need data from these 10,000 sites” to “I have 47 GB of clean markdown ready for embedding” in under 4 hours and for under $550 is the practical value here.
Conclusion
Crawling 1 million pages is not a theoretical exercise. It is a production workload with real failure modes, real costs, and real time pressure. The engineering challenges at this scale — DNS saturation, connection management, anti-bot escalation, rate limit backoff, memory pressure — are infrastructure problems, not application problems.
The numbers from this run:
- 3 hours 47 minutes wall-clock time
- 73.4 pages/second sustained average
- 98.98% final success rate
- $546.63 total cost ($0.55 per 1,000 pages)
- 47.3 GB of clean markdown output
Things we would do differently next time: split the domain list into static and dynamic segments upfront (saving ~40% on the static portion), add encoding normalization as a post-processing step, and implement soft-404 detection to filter parking pages and login walls from the output. The crawl ran clean, but the output still needed manual review to catch these edge cases.