The problem
An AI research assistant needed fresh documentation from hundreds of technical sites. The product’s value depended on answering user questions with accurate, current information pulled directly from official docs, knowledge bases, and API references.
Their data pipeline was the bottleneck.
Three Python microservices handled crawling: one for URL discovery, one for rendering JavaScript-heavy pages with Playwright, and one for content extraction and markdown conversion. A proxy rotation provider sat in front of everything to avoid rate limits and bot detection. One engineer spent roughly half their time keeping this stack running.
The setup worked until it didn’t.
What broke
Anti-bot protections changed on a handful of key documentation sites. Pages that used to return clean HTML started serving CAPTCHAs or blocking requests entirely. The failure rate on those sites jumped to 30%.
The search product served stale results to paying customers for two days before the team noticed and patched the scrapers. Each fix was specific to one site’s protection mechanism. A week later, a different set of sites changed their protection, and the cycle repeated.
The proxy provider’s response was to upsell a residential proxy tier at 3x the cost. The engineering team’s response was to look for alternatives.
The before: three services and a prayer
Here’s what the architecture looked like before the migration:
| Component | Purpose | Monthly cost |
|---|---|---|
| URL Discovery Service | Sitemap parsing, link crawling, URL deduplication | $400 (EC2) |
| Rendering Service | Playwright cluster for JS-heavy pages | $1,200 (EC2 + Chrome overhead) |
| Extraction Service | HTML-to-markdown conversion, content cleaning | $300 (EC2) |
| Proxy Provider | Rotating datacenter proxies, rate limit avoidance | $2,400 |
| Engineer time | ~50% of one senior engineer’s week | ~$5,000 (allocated) |
| Total | | ~$9,300 |
The pipeline ran nightly. A full crawl of all target sites took about 6 hours. Pages that failed were retried once, then skipped until the next run. On a good day, the success rate was around 92%. On a bad day, it dropped below 80%.
The migration
The team replaced all three microservices with Spider’s crawl API over a single weekend. The entire pipeline collapsed to one function:
```python
from spider import Spider
import os

client = Spider(api_key=os.getenv("SPIDER_API_KEY"))

def crawl_docs(urls: list[str]) -> list[dict]:
    results = []
    for url in urls:
        pages = client.crawl_url(url, params={
            "return_format": "markdown",  # replaces the extraction service
            "limit": 200,                 # max pages per seed URL
            "request": "smart",           # HTTP fetch or headless render, per page
            "readability": True,          # main content only, no nav/footer
        })
        results.extend(pages)
    return results
```
No proxy configuration. No browser management. No content extraction logic. Spider handles JavaScript rendering, bot protection bypass, and markdown conversion in a single call.
The `smart` request mode is key. Spider decides per page whether to use a lightweight HTTP fetch or a full headless-browser render. Documentation sites that serve static HTML get fast HTTP requests; sites with client-side rendering get browser-based crawling. The caller doesn’t need to know or care which path runs.
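The parameter dict can be factored into a small helper when some sites need a forced browser render. This is a sketch, not the team’s code: the `"smart"` value appears in the snippet above, while `"chrome"` as a force-browser value is an assumption about Spider’s API.

```python
def crawl_params(force_browser: bool = False, page_limit: int = 200) -> dict:
    # "smart" lets Spider pick HTTP vs. headless browser per page;
    # "chrome" (assumed value) would force a full browser render.
    return {
        "return_format": "markdown",
        "limit": page_limit,
        "request": "chrome" if force_browser else "smart",
        "readability": True,
    }
```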
The after: one API call
| Component | Purpose | Monthly cost |
|---|---|---|
| Spider API | Crawling, rendering, extraction, bot bypass | $180 |
| Pipeline script | 47 lines of Python, runs on existing infra | $0 |
| Engineer time | ~2 hours/month monitoring | ~$250 (allocated) |
| Total | | ~$430 |
The proxy provider contract ($2,400/month) was cancelled the following week. Three repos were archived. The EC2 instances were terminated.
Performance comparison
| Metric | Before | After |
|---|---|---|
| Full pipeline runtime | 6 hours | 15 minutes |
| Success rate (median) | 92% | 99.4% |
| Success rate on protected sites | 68% | 98.7% |
| Pages crawled per run | ~45,000 | ~52,000 |
| Monthly infrastructure cost | $9,300 | $430 |
| Engineering maintenance | 20 hrs/week | 2 hrs/month |
The page count increased because Spider successfully crawled pages that the old stack was silently skipping after failed retries.
Why the pipeline got faster
The 6-hour-to-15-minute improvement comes from three factors:
Concurrent crawling. Spider processes pages concurrently on its own infrastructure. The old pipeline was limited by EC2 instance count and Playwright’s per-browser memory overhead.
Smart rendering decisions. The old pipeline ran every page through Playwright. Spider’s smart mode skips the browser for pages that don’t need it, which is most documentation sites.
No proxy bottleneck. Proxy rotation added latency on every request. Spider’s infrastructure handles bot protection internally without the round-trip penalty of an external proxy provider.
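Spider handles concurrency server-side per crawl, but the migration snippet still walks seed URLs serially. A minimal client-side fan-out, assuming a hypothetical `crawl_one` callable wrapping the per-URL crawl, could look like this:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

def crawl_many(crawl_one: Callable[[str], list[dict]],
               urls: Iterable[str],
               max_workers: int = 8) -> list[dict]:
    # Each crawl_one call is network-bound, so threads are sufficient.
    # pool.map preserves input order, keeping results deterministic.
    results: list[dict] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for pages in pool.map(crawl_one, urls):
            results.extend(pages)
    return results
```

Order-preserving `map` keeps downstream deduplication simple; for very large seed lists, `as_completed` would start processing results sooner at the cost of ordering.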
Downstream impact on RAG quality
Faster crawls with higher success rates meant the RAG pipeline’s vector database stayed current. Before the migration, some document chunks were up to a week stale because failed crawls meant skipped updates. After the migration, every target page refreshes daily.
The team measured a 12% improvement in answer relevance scores (evaluated with GPT-4 as judge) after the migration. The improvement came entirely from fresher, more complete source data.
What they would do differently
The engineering lead shared two things they learned:
Start with `readability: true`. Their first Spider integration skipped this parameter. The raw markdown included navigation menus, footers, and sidebar content that diluted embedding quality. Adding `readability` cleaned the output to just the main content, which improved chunk relevance.
Use streaming for large crawls. Their initial implementation waited for the full response. For sites with 200+ pages, this meant holding the entire result in memory. Switching to streaming (`application/jsonl`) reduced memory usage and let the embedding step start while pages were still being crawled.
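The streaming consumer reduces to a small generator over JSONL lines. The parsing helper below is a sketch; the commented usage shows the shape of an HTTP streaming call, where the endpoint URL and headers are assumptions rather than verified Spider API details.

```python
import json
from typing import Iterable, Iterator

def iter_pages(jsonl_lines: Iterable[bytes]) -> Iterator[dict]:
    # Each non-empty JSONL line is one crawled page. Yielding as we go
    # keeps one page in memory at a time and lets the embedding step
    # start before the crawl finishes.
    for raw in jsonl_lines:
        line = raw.strip()
        if line:
            yield json.loads(line)

# Hypothetical usage with requests (endpoint and headers are assumptions):
#
# resp = requests.post("https://api.spider.cloud/crawl", stream=True,
#                      headers={"Accept": "application/jsonl"},
#                      json={"url": seed, "return_format": "markdown"})
# for page in iter_pages(resp.iter_lines()):
#     embed_and_upsert(page)
```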
Try it
If your team is maintaining a crawling stack and spending engineering time on scraper maintenance, the math usually works out the same way. Spider’s API handles the hard parts (rendering, bot bypass, content extraction) so your team can focus on what the data is actually for.
Start with a free account and the quickstart guide. Most teams have a working pipeline the same day.