The problem
An AI research assistant needed fresh documentation from hundreds of technical sites. The product’s value depended on answering user questions with accurate, current information pulled directly from official docs, knowledge bases, and API references.
Their data pipeline was the bottleneck.
Three Python microservices handled crawling: one for URL discovery, one for rendering JavaScript-heavy pages with Playwright, and one for content extraction and markdown conversion. A proxy rotation provider sat in front of everything to avoid rate limits and bot detection. One engineer spent roughly half their time keeping this stack running.
The setup worked until it didn’t.
What broke
Anti-bot protections changed on a handful of key documentation sites. Pages that used to return clean HTML started serving CAPTCHAs or blocking requests entirely. The failure rate on those sites jumped to 30%.
The search product served stale results to paying customers for two days before the team noticed and patched the scrapers. Each fix was specific to one site’s protection mechanism. A week later, a different set of sites changed their protection, and the cycle repeated.
The proxy provider’s response was to upsell a residential proxy tier at 3x the cost. The engineering team’s response was to look for alternatives.
The before: three services and a prayer
Here’s what the architecture looked like before the migration:
| Component | Purpose | Monthly cost |
|---|---|---|
| URL Discovery Service | Sitemap parsing, link crawling, URL deduplication | $400 (EC2) |
| Rendering Service | Playwright cluster for JS-heavy pages | $1,200 (EC2 + Chrome overhead) |
| Extraction Service | HTML-to-markdown conversion, content cleaning | $300 (EC2) |
| Proxy Provider | Rotating datacenter proxies, rate limit avoidance | $2,400 |
| Engineer time | ~50% of one senior engineer’s week | ~$5,000 (allocated) |
| Total | | ~$9,300 |
The pipeline ran nightly. A full crawl of all target sites took about 6 hours. Pages that failed were retried once, then skipped until the next run. On a good day, the success rate was around 92%. On a bad day, it dropped below 80%.
The migration
The team replaced all three microservices with Spider’s crawl API over a single weekend. The entire pipeline collapsed to one function:
```python
from spider import Spider
import os

client = Spider(api_key=os.getenv("SPIDER_API_KEY"))

def crawl_docs(urls: list[str]) -> list[dict]:
    results = []
    for url in urls:
        pages = client.crawl_url(url, params={
            "return_format": "markdown",  # replaces the extraction service
            "limit": 200,                 # max pages per seed URL
            "request": "smart",           # HTTP fetch or headless render, per page
            "readability": True,          # main content only, no nav/footer
        })
        results.extend(pages)
    return results
```
No proxy configuration. No browser management. No content extraction logic. Spider handles JavaScript rendering, bot protection bypass, and markdown conversion in a single call.
The `smart` request mode is key. Spider decides per page whether to use a lightweight HTTP fetch or a full headless-browser render. Documentation sites that serve static HTML get fast HTTP requests; sites with client-side rendering get browser-based crawling. The caller doesn’t need to know or care which path runs.
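The parameter dict can be factored into a small helper when some sites need a forced browser render. This is a sketch, not the team’s code: the `"smart"` value appears in the snippet above, while `"chrome"` as a force-browser value is an assumption about Spider’s API.

```python
def crawl_params(force_browser: bool = False, page_limit: int = 200) -> dict:
    # "smart" lets Spider pick HTTP vs. headless browser per page;
    # "chrome" (assumed value) would force a full browser render.
    return {
        "return_format": "markdown",
        "limit": page_limit,
        "request": "chrome" if force_browser else "smart",
        "readability": True,
    }
```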
The after: one API call
| Component | Purpose | Monthly cost |
|---|---|---|
| Spider API | Crawling, rendering, extraction, bot bypass | $180 |
| Pipeline script | 47 lines of Python, runs on existing infra | $0 |
| Engineer time | ~2 hours/month monitoring | ~$250 (allocated) |
| Total | | ~$430 |
The proxy provider contract ($2,400/month) was cancelled the following week. Three repos were archived. The EC2 instances were terminated.
Performance comparison
| Metric | Before | After |
|---|---|---|
| Full pipeline runtime | 6 hours | 15 minutes |
| Success rate (median) | 92% | 99.4% |
| Success rate on protected sites | 68% | 98.7% |
| Pages crawled per run | ~45,000 | ~52,000 |
| Monthly infrastructure cost | $9,300 | $430 |
| Engineering maintenance | 20 hrs/week | 2 hrs/month |
The page count increased because Spider successfully crawled pages that the old stack was silently skipping after failed retries.
Why the pipeline got faster
The 6-hour-to-15-minute improvement comes from three factors:
Concurrent crawling. Spider processes pages concurrently on its own infrastructure. The old pipeline was limited by EC2 instance count and Playwright’s per-browser memory overhead.
Smart rendering decisions. The old pipeline ran every page through Playwright. Spider’s smart mode skips the browser for pages that don’t need it, which is most documentation sites.
No proxy bottleneck. Proxy rotation added latency on every request. Spider’s infrastructure handles bot protection internally without the round-trip penalty of an external proxy provider.
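Spider handles concurrency server-side per crawl, but the migration snippet still walks seed URLs serially. A minimal client-side fan-out, assuming a hypothetical `crawl_one` callable wrapping the per-URL crawl, could look like this:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

def crawl_many(crawl_one: Callable[[str], list[dict]],
               urls: Iterable[str],
               max_workers: int = 8) -> list[dict]:
    # Each crawl_one call is network-bound, so threads are sufficient.
    # pool.map preserves input order, keeping results deterministic.
    results: list[dict] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for pages in pool.map(crawl_one, urls):
            results.extend(pages)
    return results
```

Order-preserving `map` keeps downstream deduplication simple; for very large seed lists, `as_completed` would start processing results sooner at the cost of ordering.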
Downstream impact on RAG quality
Faster crawls with higher success rates meant the RAG pipeline’s vector database stayed current. Before the migration, some document chunks were up to a week stale because failed crawls meant skipped updates. After the migration, every target page refreshes daily.
The team measured a 12% improvement in answer relevance scores (evaluated with GPT-4 as judge) after the migration. The improvement came entirely from fresher, more complete source data.
What they would do differently
The engineering lead shared two things they learned:
Start with `readability: true`. Their first Spider integration skipped this parameter. The raw markdown included navigation menus, footers, and sidebar content that diluted embedding quality. Adding `readability` cleaned the output to just the main content, which improved chunk relevance.
Use streaming for large crawls. Their initial implementation waited for the full response. For sites with 200+ pages, this meant holding the entire result in memory. Switching to streaming (`application/jsonl`) reduced memory usage and let the embedding step start while pages were still being crawled.
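The streaming consumer reduces to a small generator over JSONL lines. The parsing helper below is a sketch; the commented usage shows the shape of an HTTP streaming call, where the endpoint URL and headers are assumptions rather than verified Spider API details.

```python
import json
from typing import Iterable, Iterator

def iter_pages(jsonl_lines: Iterable[bytes]) -> Iterator[dict]:
    # Each non-empty JSONL line is one crawled page. Yielding as we go
    # keeps one page in memory at a time and lets the embedding step
    # start before the crawl finishes.
    for raw in jsonl_lines:
        line = raw.strip()
        if line:
            yield json.loads(line)

# Hypothetical usage with requests (endpoint and headers are assumptions):
#
# resp = requests.post("https://api.spider.cloud/crawl", stream=True,
#                      headers={"Accept": "application/jsonl"},
#                      json={"url": seed, "return_format": "markdown"})
# for page in iter_pages(resp.iter_lines()):
#     embed_and_upsert(page)
```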
Try it
If your team is maintaining a crawling stack and spending engineering time on scraper maintenance, the math usually works out the same way. Spider’s API handles the hard parts (rendering, bot bypass, content extraction) so your team can focus on what the data is actually for.
Start with a free account and the quickstart guide. Most teams have a working pipeline the same day.