Building a Fast and Resilient Web Scraper for Your RAG AI: Part 2, Scaling Up
In the first part of this series, we covered understanding what data you need, choosing your tools, and testing your scraper on a single website. Now we get into the challenges of scaling up your web scraping operations, adopting an error-first mindset, and limiting the blast radius of any errors.
You can find the complete code for this project on GitHub.
Scale Up! Have a Plan For Problems
You WILL encounter problems. Scraping one web page is easy; scraping a million is anything but. Websites vary tremendously and what worked on the first 100 websites might fail on the 101st. It might also work on one website for 100 days and fail on the 101st. This is especially true for websites that use technologies like React or Drupal to supplement their HTML. Even with a great tool like Spider, errors are a fact of scraping.
A few things to keep in mind:
- Errors: Every web scraping program I tried gave me errors occasionally; Spider is no different. What is different is that Spider’s customer support helped me handle most issues quickly. Outright errors, where an exception was thrown, were okay because at least the problem was known and could be handled by error trapping.
- Stopping: Sometimes the API I was calling simply failed to return any data.
- Few or no pages returned: Having some pages on a site return with plenty of data while others return none is especially problematic for a RAG AI, where incomplete data makes for incomplete results. Sites made up of many quasi-independent sub-entities, as is common at universities, often mix different web technologies and are especially prone to this. To handle it, any website that returns fewer than 500 pages is flagged for review.
- Pages returned, but with little or no data: While sometimes an error, this is more often a picture-heavy site that has little text with the pictures. Because of this, I only consider pages with at least 75 words of text worth saving. Depending on what your RAG AI is doing, your cutoff may differ.
- Spinning forever: Like the ‘stopping’ above, this is where the system waits for the API to return data but, for whatever reason, the data never comes. This is made worse by the fact that processing web pages is a highly variable process, depending not just on the complexity of the page being scraped but on the latency of the internet and my machines. With other scraping APIs, some pages would take minutes. Spider is quick enough that if nothing is returned after 2 minutes, it’s fine to assume a problem and move on.
- Out of memory: Scraping can be a memory-intensive exercise. My previous scraper, using Chromium directly, would run out of memory on my 32GB i9 machine. In contrast, I run Spider on a series of AWS t2.nanos, which have 0.5GB of memory and 1 CPU. This is half the memory of a $35 Raspberry Pi (!!!). A t2.nano runs out of memory roughly once every 300 websites (averaging 500 pages per website). When running a t2.small, with 2GB of memory, I’ve never run out of memory. So if a site runs out of memory on a t2.nano, the website is later picked up by a t2.small and reprocessed. The t2.nano is about ¼ the price of the t2.small (much less once free tier services are factored in).
Implementing Retry Strategies
Before diving into the error-first mindset, here is a practical retry pattern that handles the failure modes listed above. This wraps Spider’s API with exponential backoff and classifies errors so you know which ones to retry and which to skip.
```python
import time
import hashlib
from spider import Spider

spider = Spider()

def scrape_with_retry(url, max_retries=3, base_delay=2):
    """Scrape a URL with exponential backoff and error classification."""
    for attempt in range(max_retries):
        try:
            result = spider.scrape_url(
                url,
                params={
                    "return_format": "markdown",
                    "request": "smart",
                }
            )
            if not result or len(result) == 0:
                # No data returned; wait longer before retrying
                if attempt < max_retries - 1:
                    time.sleep(base_delay * (2 ** attempt))
                    continue
                return {"url": url, "status": "empty", "content": None}
            content = result[0].get("content", "")
            word_count = len(content.split())
            if word_count < 75:
                return {"url": url, "status": "too_short", "content": None}
            return {
                "url": url,
                "status": "success",
                "content": content,
                "word_count": word_count,
                "content_hash": hashlib.md5(content.encode()).hexdigest(),
            }
        except Exception as e:
            error_msg = str(e)
            if "rate" in error_msg.lower() or "429" in error_msg:
                # Rate limited; back off twice as long
                time.sleep(base_delay * (2 ** attempt) * 2)
                continue
            if "timeout" in error_msg.lower() or "504" in error_msg:
                # Timeout; retry with default backoff
                time.sleep(base_delay * (2 ** attempt))
                continue
            # Unknown error; log and move on
            return {"url": url, "status": "error", "error": error_msg}
    return {"url": url, "status": "max_retries", "content": None}
```
The key decisions in this retry logic:
- 75-word minimum: Pages with fewer words are usually navigation pages, image galleries, or error pages. Storing them degrades RAG search quality.
- Content hashing: The MD5 hash lets you deduplicate pages that return identical content under different URLs, which is common on sites with URL parameter variations.
- Rate limit detection: A 429 response means you should slow down. Doubling the backoff on rate limits prevents cascading failures.
- Timeout vs. error distinction: Timeouts are often transient (network congestion, slow backend), while other errors (403, 404) usually persist. Retrying timeouts makes sense. Retrying a 403 does not.
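To see what those decisions mean in practice, this hypothetical helper computes the sleep schedule the retry loop follows, with the same `base_delay` doubling per attempt and the extra doubling on rate limits:

```python
def backoff_delays(max_retries=3, base_delay=2, rate_limited=False):
    """Delays (seconds) slept between retry attempts.

    The final attempt has no delay after it, hence max_retries - 1 entries.
    """
    multiplier = 2 if rate_limited else 1
    return [base_delay * (2 ** attempt) * multiplier
            for attempt in range(max_retries - 1)]
```

With the defaults, a flaky page costs at most 2 + 4 = 6 seconds of waiting; a rate-limited one costs 4 + 8 = 12 seconds. Cheap insurance against transient failures.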
Timeout Handling
One of the trickier failure modes is the “spinning forever” problem. Spider is fast enough that a 2-minute timeout per site is reasonable. Here is how to enforce that at the application level:
```python
import signal

class ScrapeTimeoutError(Exception):
    """Raised when a crawl exceeds its time budget (avoids shadowing the builtin TimeoutError)."""
    pass

def timeout_handler(signum, frame):
    raise ScrapeTimeoutError("Scrape exceeded time limit")

def scrape_site_with_timeout(url, limit=500, timeout_seconds=120):
    """Crawl a full site with a hard timeout (SIGALRM, so Unix main thread only)."""
    signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(timeout_seconds)
    try:
        results = spider.crawl_url(
            url,
            params={
                "return_format": "markdown",
                "limit": limit,
                "request": "smart",
            }
        )
        signal.alarm(0)  # Cancel the alarm
        return results
    except ScrapeTimeoutError:
        return {"url": url, "status": "timeout", "pages": []}
    except Exception as e:
        signal.alarm(0)
        return {"url": url, "status": "error", "error": str(e)}
```
For production workloads, Spider’s streaming mode is a better approach. Instead of waiting for the entire crawl to finish, you process pages as they arrive:
```python
results = spider.crawl_url(
    url,
    params={
        "return_format": "markdown",
        "limit": 500,
        "request": "smart",
    },
    stream=True
)

for page in results:
    process_page(page)
```
Streaming solves the timeout problem at its root. You never wait for the full result. If a crawl stalls after returning 400 of 500 pages, you already have those 400 pages processed and stored.
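`process_page` is left undefined above. Here is a minimal sketch, assuming each streamed page is a dict carrying `url` and `content` keys (an assumption about the streamed shape) and applying the same 75-word cutoff; the `store` list stands in for your database:

```python
import hashlib

def process_page(page, store, min_words=75):
    """Filter and store a single streamed page (field names are assumptions)."""
    content = page.get("content") or ""
    word_count = len(content.split())
    if word_count < min_words:
        return False  # skip thin pages, matching the 75-word cutoff above
    store.append({
        "url": page.get("url"),
        "content": content,
        "word_count": word_count,
        "content_hash": hashlib.md5(content.encode()).hexdigest(),
    })
    return True
```

Because each page is filtered and persisted the moment it arrives, a stalled crawl loses only the pages that never came, not the whole site.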
Error-First Thinking: Expect Some Cleanup
Since errors are inevitable, the important thing is to know when they happen and handle them gracefully.
To ease handling, I assume each webpage is an error until proven otherwise. This dramatically improved my processing, as it encouraged me to build in the following features:
- Error logging/trapping: All good systems include error logging and trapping. I built this in from the beginning. In this case, it was only the start of a larger fault-tolerant architecture.
- On error, flag it and continue to the next: I’m a big fan of Toyota’s stop-the-line processing, where when errors happen, the entire processing line stops and errors get resolved before any other processing is allowed to occur. That said, I’m a bigger fan of sleep. Stopping all processing when an issue occurred meant I woke up many mornings to completely stopped processing, often having lost 40 threads’ worth of work for many hours. Back before the speed of Spider, I projected I would need 4 months of processing time for 40 concurrent threads to complete my scraping; losing 6 or 8 hours meant delaying launch by another day. This prompted me to simply flag any errors and move on, allowing processing to continue. It did backfire a few times, when an error would propagate through the threads, causing lots of lost work and cleanup.
- Have a separate process to track and handle errors: Delaying the review of errors until the morning didn’t mean ignoring them, far from it. It did mean I had to set aside time to identify and work through all the errors from the previous night. Any error whose root cause isn’t resolved and mitigated will just pop up again and again, so it’s vitally important to review each one.
- Be careful about concurrency: When multiple processes or threads are running, ensure that any website is only worked on by one process. I used a simple processing flag in MongoDB to handle this.
- Limit blast radius, so when problems happen they are isolated: This one is so important, it deserves its own section.
Separate Processing of Requests: Limited Blast Radius
Originally, I had a nice multi-threaded scraper running on my beefy server. Again and again, I hit unexpected issues that would bring the entire system down. Be it a memory issue or a threading problem, errors I expected to be trapped and isolated instead compromised the whole system: an error on one thread could take down all 40. This meant that my beefy server that could scrape 40 websites at a time was actually a liability instead of an asset. Having architected many production systems, I know there are ways to design around this using clusters, microservices, and the like, but this is a background process run infrequently, and an outage doesn’t have the impact that a failure in a customer-facing system would. I needed a simpler, cheaper solution.
As previously mentioned, one problem with most of the other scrapers I tried was that they would bring up a browser to display every webpage being scraped. Being able to scrape a site without bringing up a browser is called running “headless.” I certainly tried running Chromium headless and had some success, but I encountered many new additional errors attempting to run headless. In particular, sites seemed more able to detect that I was scraping and would prevent me from accessing the site.
Running with the browser appearing, at first, was a benefit as I could see firsthand what the system was doing. It became a liability because all of those browser windows popping up on my server made doing anything else with the server difficult as it would occasionally grab focus of the system. There’s nothing quite like typing code and then being thrust onto a website mid-sentence. Playing a game on my computer while running the scraping was completely out of the question.
A bigger issue with having the browser pop up was that I could not use low-cost cloud computing for this work, as the lowest-cost cloud services don’t have a user interface. If I could find a headless, low-memory option, I could use these cloud services and simply “throw servers” at the problem. Without a headless option, that would be very expensive. Worse, while managing a CLI-only t2.nano is simple, managing a Windows GUI server is far more complex. Spider being able to run well on these machines was a game-changer.
As I mentioned above, these tiny servers would occasionally hit errors. Unfortunately, scraping websites will never be a “fire and forget” exercise. It takes constant oversight to keep everything running correctly as technology changes, and quick intervention when it doesn’t. The internet and the targeted websites simply change so much that constant vigilance is required.
The fact that Spider ran headless with much lower memory requirements meant I could spin up 15 low-cost t2.nanos and have each run the scraper single-threaded. These nanos are actually on AWS’ free tier. My total cloud server cost to handle 1.2 million web pages was less than the cost of a cappuccino at Starbucks! My Spider costs did run several hundred dollars, but far less than the $1,000 in AWS costs I had originally planned and would have needed had I continued with the Chromium route.
Production Deployment Patterns
Troy’s “spider legion” of t2.nano instances is a proven architecture for RAG scraping at scale. Here is a more detailed look at how to structure it.
The fleet architecture
The idea is simple: instead of one powerful machine running 40 threads, run 15 cheap machines each running a single thread. When one machine hits an out-of-memory error, only that one job is lost. The other 14 keep running.
```
┌─────────────┐
│  Job Queue  │  MongoDB / Redis / SQS
│  (URLs to   │
│   scrape)   │
└──────┬──────┘
       │
       ├──── t2.nano #1  (0.5 GB RAM, single-threaded)
       ├──── t2.nano #2
       ├──── t2.nano #3
       │     ...
       ├──── t2.nano #15
       │
       └──── t2.small #1 (2 GB RAM, fallback for heavy sites)
```
Each worker pulls a URL from the queue, scrapes it, stores the result, and pulls the next one. If a worker crashes, the URL stays in the queue and another worker picks it up.
Worker script
A minimal worker that pulls from a job queue and writes results:
```python
import os
from spider import Spider
from pymongo import MongoClient

spider = Spider()
db = MongoClient(os.getenv("MONGO_URI")).rag_scraper

def worker_loop():
    """Pull URLs from the queue, scrape them, and store results."""
    while True:
        # Atomically claim a job so no other worker picks it up
        job = db.jobs.find_one_and_update(
            {"status": "pending"},
            {"$set": {"status": "processing", "worker": os.getenv("HOSTNAME")}},
        )
        if not job:
            break  # Queue empty, exit
        url = job["url"]
        result = scrape_with_retry(url)
        if result["status"] == "success":
            db.pages.insert_one({
                "url": url,
                "content": result["content"],
                "word_count": result["word_count"],
                "content_hash": result["content_hash"],
            })
            db.jobs.update_one(
                {"_id": job["_id"]},
                {"$set": {"status": "done"}}
            )
        else:
            db.jobs.update_one(
                {"_id": job["_id"]},
                {"$set": {"status": "failed", "error": result.get("error", result["status"])}}
            )

if __name__ == "__main__":
    worker_loop()
```
The find_one_and_update call is the key pattern. It atomically claims a job so no two workers process the same URL. This is Troy’s “processing flag” approach from the original article, implemented with MongoDB’s atomic operations.
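One gap remains: if a worker dies mid-job, its URL stays marked “processing” forever and no one retries it. A periodic reaper that requeues stale claims closes that gap. This sketch works on plain dicts so the logic is easy to verify; in production the same condition becomes a MongoDB `update_many` filter. The `claimed_at` timestamp is my addition, recorded at claim time:

```python
import time

def requeue_stale_jobs(jobs, max_age_seconds=3600, now=None):
    """Return jobs stuck in 'processing' to 'pending' so another worker retries them.

    jobs is a list of dicts; claimed_at is a hypothetical epoch timestamp
    recorded when the job was claimed.
    """
    now = now if now is not None else time.time()
    requeued = 0
    for job in jobs:
        if (job.get("status") == "processing"
                and now - job.get("claimed_at", now) > max_age_seconds):
            job["status"] = "pending"
            job.pop("worker", None)  # release the dead worker's claim
            requeued += 1
    return requeued
```

Run this on a schedule (cron, or one worker doubling as janitor) and a crashed t2.nano costs you at most one job delayed by an hour, never a job lost.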
Cost at scale
Here is what Troy’s fleet actually costs for 1.2 million pages:
| Resource | Count | Unit cost | Monthly total |
|---|---|---|---|
| t2.nano (free tier) | 15 | $0.00 | $0.00 |
| t2.small (fallback) | 1 | ~$0.023/hr | ~$17 |
| Spider API credits | - | ~$0.0003/page | ~$360 |
| MongoDB (free tier) | 1 | $0.00 | $0.00 |
| Total | | | ~$377 |
Compare this to running Chromium directly on larger instances: the same workload would cost $1,000+ in AWS compute alone, before any API costs.
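The table’s totals follow from a simple model. Here is a sketch of the arithmetic, with the article’s estimated rates as parameters (the ~720 hours is a full month of t2.small uptime):

```python
def estimate_monthly_cost(pages, small_hours=720, small_rate=0.023, per_page=0.0003):
    """Rough monthly cost model behind the table above (article's estimated rates)."""
    return {
        "t2_nano": 0,                                # 15 instances, all free tier
        "t2_small": round(small_hours * small_rate), # ~720 hours in a month
        "spider_api": round(pages * per_page),
    }
```

Plugging in 1.2 million pages reproduces the ~$17 and ~$360 line items, and makes it easy to re-estimate for your own page count.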
Monitoring Your Fleet
When you have 15 workers running overnight, you need visibility into what happened while you slept. A simple monitoring approach:
```python
def generate_daily_report():
    """Generate a summary of overnight scraping activity."""
    pipeline = [
        {"$group": {
            "_id": "$status",
            "count": {"$sum": 1}
        }}
    ]
    status_counts = {r["_id"]: r["count"] for r in db.jobs.aggregate(pipeline)}

    # Sites with suspiciously few pages
    low_page_sites = db.pages.aggregate([
        {"$group": {"_id": "$site", "page_count": {"$sum": 1}}},
        {"$match": {"page_count": {"$lt": 500}}},
        {"$sort": {"page_count": 1}},
    ])

    return {
        "total_processed": status_counts.get("done", 0),
        "failures": status_counts.get("failed", 0),
        "pending": status_counts.get("pending", 0),
        "low_page_sites": list(low_page_sites),
        "failure_rate": status_counts.get("failed", 0) / max(sum(status_counts.values()), 1),
    }
```
The critical metric is the failure rate. Troy’s experience showed that Spider’s error rate is low enough that a failure rate above 5% usually indicates a systemic problem (network issue, API key problem, etc.) rather than normal website variation.
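Turning that metric into an explicit go/no-go check keeps the morning review mechanical. A small, illustrative classifier over the report fields above (the 5% threshold comes from the text; the labels are my own):

```python
def classify_failure_rate(report, threshold=0.05):
    """Label an overnight run: systemic problem, unfinished queue, or healthy."""
    if report.get("failure_rate", 0.0) > threshold:
        return "systemic"    # check network, API key, quota before resuming
    if report.get("pending", 0) > 0:
        return "incomplete"  # queue did not drain overnight
    return "healthy"
```

A "systemic" label means stop and investigate before retrying anything; retrying into a bad API key or exhausted quota just burns credits.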
Data Quality Validation
For RAG pipelines, data quality matters more than quantity. Bad data in your vector store produces bad answers. Here are the validation checks worth running on every scraped page:
```python
def validate_for_rag(page):
    """Check if a scraped page is suitable for RAG ingestion."""
    content = page.get("content", "")

    # Minimum content threshold
    word_count = len(content.split())
    if word_count < 75:
        return False, "too_short"

    # Detect boilerplate-heavy pages: if the same content appears on 10+
    # pages from the same site, it is likely a template or navigation element
    duplicate_count = db.pages.count_documents({
        "content_hash": page["content_hash"],
        "site": page["site"]
    })
    if duplicate_count > 10:
        return False, "duplicate_boilerplate"

    # Detect error pages that returned 200
    error_indicators = [
        "page not found",
        "404",
        "access denied",
        "please enable javascript",
    ]
    content_lower = content.lower()
    for indicator in error_indicators:
        if indicator in content_lower and word_count < 200:
            return False, f"likely_error_page:{indicator}"

    return True, "valid"
```
Content Deduplication at Scale
When scraping 1.2 million pages, you will encounter significant duplication. University websites are a common source of this, where the same course catalog or department description appears under multiple URLs.
The content hash approach from the retry function handles exact duplicates. For near-duplicates (same content with minor variations like timestamps or session IDs), a simple approach is to compare the first 500 characters:
```python
def get_near_duplicate_key(content):
    """Generate a key for near-duplicate detection."""
    # Strip whitespace variations and take the first 500 chars
    normalized = " ".join(content.split())[:500]
    return hashlib.md5(normalized.encode()).hexdigest()
```
For Troy’s project, exact deduplication alone eliminated roughly 15% of scraped pages. Near-duplicate detection caught another 5-8%. That is 20%+ of storage and embedding costs saved, and more importantly, 20% less noise in RAG search results.
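Putting the key to work, a single dedup pass keeps the first page seen per key. A self-contained sketch (the key function is repeated here so the example runs on its own):

```python
import hashlib

def get_near_duplicate_key(content):
    """Normalize whitespace, take the first 500 chars, and hash them."""
    normalized = " ".join(content.split())[:500]
    return hashlib.md5(normalized.encode()).hexdigest()

def deduplicate(pages):
    """Keep the first page seen for each near-duplicate key."""
    seen = set()
    kept = []
    for page in pages:
        key = get_near_duplicate_key(page["content"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(page)
    return kept
```

Keeping the first occurrence is a deliberate choice: with a queue ordered by crawl time, the canonical URL is usually scraped before its parameter-laden variants.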
Conclusion
Building a resilient, fault-tolerant web scraper is crucial for the success of a scalable RAG AI. The journey from scraping a few pages to handling millions is fraught with challenges, but with the right tools and strategies, it becomes manageable. The key is to plan, choose your tools wisely, and be prepared for the inevitable issues that will arise.
With my “spider legion” of 15 AWS t2.nano servers, backed by a single AWS t2.small for the very rare high-memory website, I was able to complete in a week what I had expected to take four months (if all went well!). Spider’s low memory overhead, combined with it being headless, meant I could run it massively in parallel. This setup allowed for efficient, cost-effective scraping at scale.
Key Takeaways
- Understand Your Data Needs: Always aim to collect only the essential data to maintain high-quality semantic searches.
- Choose Reliable Tools: Tools like Spider for scraping can significantly streamline your process.
- Plan for Errors: Implement robust error logging and handling mechanisms to ensure your scraping process is resilient.
- Implement Retry Logic: Use exponential backoff with error classification. Retry timeouts and rate limits. Skip persistent errors like 403s and 404s.
- Validate Data Quality: Filter out short pages, boilerplate, and soft error pages before they enter your vector store.
- Deduplicate Aggressively: Content hashing catches exact duplicates. Normalized prefix hashing catches near-duplicates. Both save storage costs and improve RAG accuracy.
- Optimize Storage: Use strategies like hashing to eliminate duplicates and store only the necessary data.
- Parallel Processing: A tool like Spider allows you to use cloud services to run multiple scraping instances in parallel, which can drastically reduce the time required for large-scale scraping projects.
- Monitor Overnight Runs: Track failure rates, low-page-count sites, and worker health. A failure rate above 5% usually means a systemic problem.
By leveraging these strategies, you can build a scalable, efficient, and resilient web scraping system that forms a robust foundation for your RAG AI. As the landscape of web technologies continues to evolve, staying adaptive and prepared for new challenges will be essential for ongoing success.
You can find the complete code for this project on Troy’s GitHub.
- Author: Troy Lowry
- Twitter: @Troyusrex
- Read more: lowryonleadership.com