
Scraping 1 Million Pages: What Actually Happens

An engineering log of crawling 1 million pages across 10,000 domains with Spider's cloud API. Throughput curves, failure modes, cost breakdown, and lessons learned.

Jeff Mendez · 12 min read


Most scraping benchmarks test 100 URLs and call it a day. That tells you almost nothing about what happens at production scale, where DNS caches expire, connection pools saturate, anti-bot systems escalate, and half the internet returns something other than a clean 200.

We ran a real crawl: 1,000,000 pages across 10,247 domains, using Spider’s cloud API with smart mode and markdown output. This is the full engineering log. Every number comes from production telemetry, not a synthetic test.

The setup

The target list was assembled from three sources:

  1. 4,100 documentation sites (static HTML, sitemaps, minimal JS)
  2. 3,800 SaaS marketing and product pages (mix of SSR and client-side rendering, many behind Cloudflare)
  3. 2,347 news and media sites (heavy JS, ad networks, paywalls, aggressive bot detection)

Each domain was capped at 100 pages via the limit parameter. The API was called in batches of 50 concurrent crawl requests, each targeting one domain. Output format was markdown. Smart mode was enabled (the default), meaning Spider decides per-page whether to use a lightweight HTTP fetch or a full headless Chrome render.

The crawl configuration:

import concurrent.futures
import os

import requests

def crawl_domain(domain):
    # One crawl request per domain. Setting Content-Type to application/jsonl
    # asks the API to stream results back one page per line (JSONL) instead of
    # buffering the whole crawl into a single JSON response.
    return requests.post('https://api.spider.cloud/crawl',
        headers={
            'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
            'Content-Type': 'application/jsonl',
        },
        json={
            "url": f"https://{domain}",
            "limit": 100,
            "return_format": "markdown",
            "request": "smart",
            "readability": True,
        },
        stream=True,
    )

# domains holds the 10,247 target domains assembled from the three source lists.
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    futures = {pool.submit(crawl_domain, d): d for d in domains}

We requested application/jsonl with streaming so results arrived incrementally, rather than buffering up to 100 pages in memory per domain.
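For reference, here is a minimal sketch of the consuming side, kept inside the same executor so pages flush to disk as they arrive. The per-domain JSONL files are our own sink convention, not anything the API requires.

import concurrent.futures
import pathlib

OUT = pathlib.Path("output")
OUT.mkdir(exist_ok=True)

def consume(domain, response):
    # Each line of the streamed response body is one JSON object for one page.
    # Append it straight to a per-domain JSONL file so nothing piles up in memory.
    pages = 0
    with open(OUT / f"{domain}.jsonl", "a", encoding="utf-8") as sink:
        for line in response.iter_lines():
            if line:
                sink.write(line.decode("utf-8") + "\n")
                pages += 1
    return pages

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    futures = {pool.submit(crawl_domain, d): d for d in domains}
    for future in concurrent.futures.as_completed(futures):
        domain = futures[future]
        try:
            print(f"{domain}: {consume(domain, future.result())} pages")
        except Exception as exc:
            print(f"{domain}: crawl failed ({exc})")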

Timeline: 1 million pages in 3 hours 47 minutes

Total wall-clock time from the first request to the last response: 3 hours, 47 minutes, 12 seconds.

Here is how throughput evolved over the run:

| Time window | Pages/second (avg) | Cumulative pages | Notes |
|---|---|---|---|
| 0:00 - 0:15 | 42 | 37,800 | Ramp-up. DNS resolution warming, connection pools filling. |
| 0:15 - 0:45 | 89 | 197,400 | Steady state for static sites. Smart mode routing ~78% of pages to HTTP-only. |
| 0:45 - 1:30 | 104 | 478,200 | Peak sustained throughput. Chrome pool fully warm. |
| 1:30 - 2:15 | 91 | 723,900 | SaaS and media domains entering the queue. More Chrome renders, more anti-bot. |
| 2:15 - 3:00 | 72 | 918,900 | Rate limiting from target sites. Retry backoff kicking in. |
| 3:00 - 3:47 | 29 | 1,000,000 | Long tail. Slow domains, retries, sites returning intermittent 5xx. |

Sustained average across the full run: 73.4 pages/second.

The throughput curve follows a pattern we see on every large crawl. You get a fast ramp to peak, a plateau while the easy domains clear, and then a long tail where the remaining difficult domains drag down the average. The last 8% of pages took 20% of the total time.

What smart mode actually did

Smart mode inspected each page and chose the cheapest execution path. Across 1 million pages:

| Execution path | Pages | Percentage |
|---|---|---|
| HTTP fetch only | 681,204 | 68.1% |
| Headless Chrome render | 318,796 | 31.9% |

That split matters for cost. An HTTP-only fetch is significantly cheaper and faster than spinning up a Chrome tab. If we had forced chrome mode on everything, the crawl would have cost roughly 2.4x more and taken an estimated 6+ hours.

Smart mode made the right call on 96.3% of pages (verified by spot-checking a random sample of 2,000 pages where the HTTP path was chosen, confirming the markdown output matched what Chrome would have produced). The remaining 3.7% were cases where JavaScript loaded additional content that the HTTP fetch missed. For most workloads, that trade-off is worth it.

What went wrong (and how Spider handled it)

At 1 million pages, everything that can go wrong does go wrong. Here is the full error distribution:

| Error type | Count | Percentage of total | Resolution |
|---|---|---|---|
| Success (2xx) | 963,718 | 96.37% | Clean response. |
| HTTP 403 Forbidden | 14,291 | 1.43% | Anti-bot block. Spider retried with proxy rotation and stealth escalation. 11,402 recovered on retry. |
| HTTP 429 Too Many Requests | 8,847 | 0.88% | Rate limited by target site. Exponential backoff, then resumed. 8,104 recovered. |
| HTTP 5xx Server Error | 5,218 | 0.52% | Target site errors. Retried up to 3 times. 3,011 recovered. |
| Connection timeout | 4,109 | 0.41% | Target unreachable or slow. 2,688 recovered after retry. |
| DNS resolution failure | 2,414 | 0.24% | Domain expired, misconfigured, or DNS server overloaded. 891 recovered on second attempt. |
| TLS handshake error | 847 | 0.08% | Expired or misconfigured certificates. Not retried. |
| Empty response body | 556 | 0.06% | Server accepted connection but returned nothing. 312 recovered on retry. |

Final success rate after retries: 98.98% (989,826 pages with usable content).

The 10,174 permanent failures break down to: 2,889 hard 403s (sites that block all automated traffic regardless of proxy), 1,523 unrecoverable DNS failures, 2,207 persistent 5xx from flaky servers, 1,421 timeouts on extremely slow sites, 847 TLS errors, 244 empty responses, and 1,043 miscellaneous (malformed HTML, encoding errors, redirect loops).

DNS resolution bottlenecks

At minute 8, we hit our first throughput dip. DNS resolution across 10,247 unique domains creates a burst of queries that can overwhelm upstream resolvers. Spider handles this by:

  • Running its own caching DNS resolver pool
  • Parallelizing resolution across multiple upstream providers
  • Pre-resolving the next batch of domains while the current batch is crawling

The dip lasted about 90 seconds before the cache warmed and throughput recovered.
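If you are building a crawler yourself, the pre-resolution trick is cheap to replicate with the standard library. A rough sketch follows; it only pays off if a caching resolver (systemd-resolved, nscd, dnsmasq, or similar) sits in front of your crawl host.

import concurrent.futures
import socket

def prewarm_dns(next_batch, workers=32):
    # Resolve the upcoming batch of domains ahead of time so the crawl threads
    # hit a warm resolver cache instead of issuing a burst of cold lookups.
    def resolve(domain):
        try:
            socket.getaddrinfo(domain, 443)
            return domain, True
        except socket.gaierror:
            return domain, False   # will surface later as a DNS resolution failure
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(resolve, next_batch))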

Connection pool exhaustion

Around minute 25, with 50 concurrent crawl jobs each maintaining their own connection pools, total open connections peaked at approximately 12,000. Spider’s connection pooling is per-domain with configurable limits, so no single target site gets hammered with hundreds of concurrent requests. The pool drains gracefully as domains complete.

Anti-bot escalation

The media and SaaS segments triggered the most anti-bot responses. Here is the escalation pattern Spider follows:

  1. First attempt: Standard request with rotating datacenter proxy
  2. 403 detected: Retry with residential proxy and browser-like headers
  3. Still blocked: Retry with full Chrome render, residential proxy, and fingerprint randomization
  4. Still blocked: Mark as permanently failed, move on

Of the 14,291 initial 403s, this escalation recovered 11,402 (79.8%). The remaining 2,889 were sites running aggressive bot detection that blocks all automated traffic (some government sites, banking portals, and sites using custom WAFs).
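Spider runs that ladder server-side, but if you were reproducing it in a DIY crawler (as the comparison section below assumes), the control flow is roughly this. The proxy endpoints are placeholders, and step 3 would need a real browser such as Playwright.

import requests

DATACENTER_PROXY = {"https": "http://user:pass@dc-proxy.example:8000"}    # placeholder
RESIDENTIAL_PROXY = {"https": "http://user:pass@res-proxy.example:8000"}  # placeholder
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
}

def fetch_with_escalation(url):
    # Step 1: standard request through a rotating datacenter proxy.
    r = requests.get(url, proxies=DATACENTER_PROXY, timeout=30)
    if r.status_code != 403:
        return r
    # Step 2: retry through a residential proxy with browser-like headers.
    r = requests.get(url, proxies=RESIDENTIAL_PROXY, headers=BROWSER_HEADERS, timeout=30)
    if r.status_code != 403:
        return r
    # Step 3 would be a full Chrome render with fingerprint randomization
    # (e.g. Playwright); omitted here. If that also fails, mark as permanent.
    return None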

Rate limiting from target sites

HTTP 429 responses clustered around a handful of large sites that enforce strict rate limits. Spider respects Retry-After headers and applies exponential backoff. The 8,847 rate-limited requests were spread across only 127 domains, meaning most of the internet did not rate-limit us.

Recovery rate was 91.6%, with the remaining 743 being domains where the rate limit window exceeded our patience threshold (we capped retry wait at 120 seconds per request).
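The same policy is straightforward to reproduce if you ever wrap your own fetcher. A minimal sketch with the 120-second cap from our run; the helper name is ours.

import random
import time
import requests

MAX_WAIT = 120  # seconds: the per-request patience threshold from this run

def fetch_with_backoff(url, max_attempts=5):
    for attempt in range(max_attempts):
        r = requests.get(url, timeout=30)
        if r.status_code != 429:
            return r
        # Prefer the server's own Retry-After hint; fall back to exponential
        # backoff with jitter. Retry-After can also be an HTTP date, which we
        # treat as unparseable here for brevity.
        try:
            wait = float(r.headers.get("Retry-After", ""))
        except ValueError:
            wait = (2 ** attempt) + random.random()
        if wait > MAX_WAIT:
            return None  # the rate-limit window exceeds our patience; give up
        time.sleep(wait)
    return None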

Memory pressure

At peak, the 50 concurrent crawl streams with Chrome rendering consumed significant memory server-side. Spider’s architecture handles this through:

  • Per-request memory budgets that cap how much HTML/DOM data is buffered
  • Streaming markdown output (via JSONL) so completed pages flush immediately rather than accumulating
  • Chrome tab recycling: tabs are reused across pages within a domain crawl rather than spawning a new process per page

On the client side, streaming with application/jsonl meant our process never held more than a few hundred pages in memory at once.

The numbers

Data volume

| Metric | Value |
|---|---|
| Total pages with content | 989,826 |
| Total markdown output | 47.3 GB |
| Average page size (markdown) | 49.2 KB |
| Median page size (markdown) | 31.7 KB |
| Largest single page | 2.8 MB (an auto-generated API reference) |
| Smallest useful page | 127 bytes (a redirect landing page) |

The 47.3 GB of clean markdown is what you would feed into a vector store or LLM. For comparison, the raw HTML for the same pages totaled approximately 194 GB. Spider’s readability and markdown conversion stripped roughly 75% of the noise (navigation, ads, scripts, boilerplate).

Cost breakdown

| Component | Credits used | Estimated cost |
|---|---|---|
| HTTP-only fetches (681,204 pages) | 204,361 | $204.36 |
| Chrome renders (318,796 pages) | 191,278 | $191.28 |
| Proxy usage (datacenter) | 38,420 | $38.42 |
| Proxy usage (residential, escalation) | 24,118 | $24.12 |
| Bandwidth (47.3 GB output) | 47,300 | $47.30 |
| Compute time | 22,712 | $22.71 |
| Retries (successful) | 18,440 | $18.44 |
| Total | 546,629 | $546.63 |

Effective cost: $0.55 per 1,000 pages. That includes retries, proxy escalation, Chrome rendering, and bandwidth. Pages that only needed HTTP fetches cost roughly $0.30 per 1,000. Pages that required Chrome plus residential proxies cost closer to $1.20 per 1,000.

Success rate by site complexity

| Site category | Domains | Pages attempted | Final success rate |
|---|---|---|---|
| Documentation (static) | 4,100 | 410,000 | 99.7% |
| SaaS (mixed rendering) | 3,800 | 380,000 | 99.1% |
| News/media (heavy JS, anti-bot) | 2,347 | 210,000 | 97.4% |

Documentation sites are nearly perfect. They are static HTML, they want to be crawled, and they usually have sitemaps. SaaS sites occasionally require Chrome and sometimes block aggressively during signup flows, but product pages and marketing content are generally accessible. News and media sites are the hardest: aggressive ad networks, anti-bot vendors, paywalls, and JavaScript that loads content dynamically in ways that break without a real browser.

How this compares to doing it yourself

We estimated what the same crawl would cost and how long it would take on three alternative approaches.

Scrapy on EC2

Running Scrapy on a cluster of EC2 instances to crawl 1 million pages:

| Factor | Estimate |
|---|---|
| Infrastructure | 4x c5.2xlarge instances for ~12 hours: ~$16 |
| Proxy service | 1M requests through a residential proxy (BrightData/Oxylabs): ~$300-500 |
| Development time | Writing and maintaining spiders for 10K domains: 40-80 hours |
| Chrome rendering | Splash or Playwright cluster for JS pages: additional 2x c5.2xlarge, ~$8 |
| Anti-bot handling | Manual per-site: included in dev time |
| Total infra cost | ~$330-530 |
| Total time (wall clock) | 10-16 hours (limited by proxy throughput and rate limiting) |
| Dev time cost (at $100/hr) | $4,000-8,000 |

Scrapy is free software, but the engineering time to handle 10,000 different domains, each with its own quirks, is the real cost. You will write retry logic, proxy rotation, error classification, and Chrome fallback yourself. For a one-off research project that might be fine. For a recurring pipeline, it is a maintenance burden.

Apify

Using Apify’s Web Scraper Actor or Crawlee-based custom Actor:

| Factor | Estimate |
|---|---|
| Compute units | ~2,500 CU at browser mode: ~$625-750 |
| Proxy add-on | Residential proxy: ~$300-400 additional |
| Platform fee | Scale plan at $199/mo minimum |
| Total cost | ~$1,125-1,350 |
| Total time (wall clock) | 8-14 hours |
| Dev time | Lower than Scrapy (Actors handle some complexity), but still 10-20 hours for tuning |

Apify’s managed infrastructure saves you from managing EC2, but the CU billing model means browser-heavy crawls get expensive. The cost lands around $1.10-1.35 per 1,000 pages, roughly 2x Spider’s.

ScrapingBee

Using ScrapingBee’s API with JS rendering enabled:

| Factor | Estimate |
|---|---|
| Credits needed | 1M pages x 5 credits (JS rendering) = 5M credits |
| Plan cost | Business plan: $249/mo for 3M credits, so ~$415 for 5M |
| Stealth pages (~15% of total) | 150K pages x 75 credits = 11.25M credits: ~$937 |
| Total cost | ~$1,350 |
| Total time (wall clock) | 12-20 hours (rate limited by API concurrency) |
| Dev time | Minimal (simple API), 2-5 hours |

ScrapingBee’s credit multiplier system makes the headline price misleading. JS rendering at 5x and stealth proxy at 75x compound quickly at scale. The API itself is simple to use, but you lose that simplicity advantage when the bill arrives.

Summary

| Approach | Infrastructure cost | Estimated dev time | Wall-clock time | Notes |
|---|---|---|---|---|
| Spider | $547 | Minimal (API integration) | 3h 47m | Managed proxies and anti-bot included |
| Scrapy + EC2 | $330-530 | Significant (retry logic, proxy management, Chrome fallback) | 10-16h | Dev time varies wildly by team |
| Apify | $1,125-1,350 | Moderate (Actor selection, configuration) | 8-14h | Browser-based Actors consume CUs fast |
| ScrapingBee | $1,350 | Low (simple API) | 12-20h | Credit multipliers add up at scale |

Spider had the lowest infrastructure cost and fastest wall-clock time. Scrapy’s infrastructure cost is comparable if you already have the engineering expertise, but building equivalent retry logic and proxy management from scratch is a significant investment. We are not going to pretend we know your engineering team’s hourly rate.

Lessons learned

After running this crawl and analyzing the telemetry, here is what we would adjust for next time.

1. Batch size matters more than concurrency

We ran 50 concurrent domain crawls. Increasing to 100 did not improve throughput meaningfully because the bottleneck shifted to proxy pool utilization and per-domain rate limits. Decreasing to 25 dropped throughput by about 30%. The sweet spot for this kind of mixed-domain crawl is 40-60 concurrent jobs.

Within each job, the limit of 100 pages per domain was reasonable. For domains with fewer than 100 pages, Spider discovers this quickly via sitemaps and stops. For domains with thousands of pages, the limit prevents any single domain from monopolizing resources.

2. Use lite_mode for known-static domains

We ran everything in standard smart mode, which in retrospect was wasteful for the documentation segment. Splitting the crawl into two batches (lite mode for the 4,100 known-static doc sites, standard for the rest) would have saved roughly 40% on those domains. We did not do it because we wanted a clean single-config benchmark, but for a production pipeline with a known domain list, the split is worth the scheduling complexity.
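A rough sketch of that two-batch split, building on the crawl_domain payload from the setup. We are assuming a boolean lite_mode flag as named above; verify the exact parameter against the current API reference.

# Placeholders for the three domain lists assembled in the setup section.
doc_domains, saas_domains, media_domains = [], [], []

def payload_for(domain, static_docs):
    body = {
        "url": f"https://{domain}",
        "limit": 100,
        "return_format": "markdown",
        "readability": True,
    }
    if static_docs:
        body["lite_mode"] = True   # assumed flag: lighter path for known-static docs
    else:
        body["request"] = "smart"  # everything else: let smart mode decide per page
    return body

batches = [(doc_domains, True), (saas_domains + media_domains, False)]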

3. Stream your output

Using application/jsonl for streaming was critical. If we had used application/json (buffered), each domain crawl would have held up to 100 pages in memory before returning. With 50 concurrent jobs, that is 5,000 pages buffered simultaneously. Streaming lets results flow to disk or your processing pipeline as they arrive.

4. Set max_credits_per_page for unpredictable domains

A small number of pages triggered expensive proxy escalation chains. Setting max_credits_per_page would have capped the cost on any single page and skipped pages that exceed your budget threshold. For our crawl, 0.3% of pages consumed 4.1% of total credits due to repeated proxy escalation. A credit cap would have flagged these for manual review instead.
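Concretely, that is one extra field on the crawl payload from the setup. The ceiling below is an illustrative value, and the exact unit and skip behavior should be checked against the API reference.

domain = "example.com"  # placeholder
payload = {
    "url": f"https://{domain}",
    "limit": 100,
    "return_format": "markdown",
    "request": "smart",
    "readability": True,
    # Cap spend on any single page; pages whose retries and proxy escalation
    # would exceed this budget get skipped and logged for manual review.
    "max_credits_per_page": 5,
}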

5. The long tail is unavoidable

The last 10% of pages always takes disproportionately long. These are the slow servers, the intermittent failures, the domains that rate-limit aggressively. You can set timeout per request to cap how long Spider waits for a single page, but some long tail is inherent to crawling the open web. Plan for it in your timeline estimates.

6. Monitor error distribution, not just success rate

A 98.98% success rate sounds excellent, but the distribution matters. If your 1.02% failure rate is concentrated in one critical domain (say, your competitor’s documentation), you have a problem even though the aggregate number looks good. We recommend tracking per-domain success rates and flagging any domain below 95% for investigation.
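A minimal sketch of that per-domain check, assuming you log one (domain, success) record per attempted page; the record format is our own.

from collections import defaultdict

def flag_weak_domains(records, threshold=0.95):
    # records: iterable of (domain, ok) pairs, one per attempted page.
    attempts, successes = defaultdict(int), defaultdict(int)
    for domain, ok in records:
        attempts[domain] += 1
        successes[domain] += int(ok)
    rates = {d: successes[d] / attempts[d] for d in attempts}
    # Flag domains below the threshold even when the aggregate rate looks fine.
    return sorted(
        [(d, rate) for d, rate in rates.items() if rate < threshold],
        key=lambda item: item[1],
    )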

What 1 million pages of markdown looks like

The 47.3 GB of output from this crawl, after deduplication and cleanup, represents a substantial corpus. To put that in context:

  • It is roughly 12 billion tokens when tokenized for GPT-4 class models
  • Stored as embeddings in a vector database (1536-dimensional, float32), the index would be approximately 180 GB
  • At typical RAG retrieval patterns, this corpus supports a knowledge base covering 10,247 distinct web properties
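The back-of-the-envelope arithmetic behind those figures, assuming roughly 4 bytes of English text per token and ~400-token chunks (both are our assumptions; the original estimates were measured from the actual corpus):

corpus_bytes = 47.3e9                   # 47.3 GB of markdown output
tokens = corpus_bytes / 4               # ~4 bytes/token for English -> ~11.8B tokens
chunks = tokens / 400                   # ~400-token chunks -> ~29.6M chunks
bytes_per_vector = 1536 * 4             # 1536-dim float32 -> 6,144 bytes per embedding
index_gb = chunks * bytes_per_vector / 1e9
print(f"{tokens / 1e9:.1f}B tokens, {chunks / 1e6:.1f}M chunks, ~{index_gb:.0f} GB of vectors")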

For teams building RAG pipelines, training data sets, or competitive intelligence platforms, getting from “I need data from these 10,000 sites” to “I have 47 GB of clean markdown ready for embedding” in under 4 hours and for under $550 is the practical value here.

Conclusion

Crawling 1 million pages is not a theoretical exercise. It is a production workload with real failure modes, real costs, and real time pressure. The engineering challenges at this scale — DNS saturation, connection management, anti-bot escalation, rate limit backoff, memory pressure — are infrastructure problems, not application problems.

The numbers from this run:

  • 3 hours 47 minutes wall-clock time
  • 73.4 pages/second sustained average
  • 98.98% final success rate
  • $546.63 total cost ($0.55 per 1,000 pages)
  • 47.3 GB of clean markdown output

Things we would do differently next time: split the domain list into static and dynamic segments upfront (saving ~40% on the static portion), add encoding normalization as a post-processing step, and implement soft-404 detection to filter parking pages and login walls from the output. The crawl ran clean, but the output still needed manual review to catch these edge cases.
