Reliability
Crawling 10 pages over a stable connection is easy. Crawling 10,000 pages from a cron job on a flaky network is a different story. This page covers battle-tested ways to make sure every page actually reaches your infrastructure — even when things go wrong.
Why Reliability Matters
An HTTP stream can drop mid-crawl. Your server can restart. A deploy can kill the process reading results. When that happens, you don't want to re-crawl hundreds of pages you already had. Spider handles the hard parts — anti-detect browsers that get past bot protection, rotating proxies, headless rendering — but none of that matters if the results never reach you. Spider's data connectors and webhooks run server-side, so your data lands safely regardless of what happens to your client. Combine them with streaming for real-time processing with a safety net.
Stream + Data Connector
The simplest upgrade: stream JSONL so you can process pages as they arrive, and attach a data connector (S3, Supabase, etc.) so Spider also writes every page server-side. If your connection drops halfway through, you haven't lost anything — the connector already has the data.
Stream JSONL + S3 backup
```python
import requests, os, json

headers = {
    "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
    "Content-Type": "application/jsonl",
}

response = requests.post("https://api.spider.cloud/crawl", headers=headers, json={
    "url": "https://example.com",
    "limit": 100,
    "return_format": "markdown",
    "data_connectors": {
        "s3": {
            "bucket": "my-crawl-data",
            "access_key_id": os.getenv("AWS_ACCESS_KEY_ID"),
            "secret_access_key": os.getenv("AWS_SECRET_ACCESS_KEY"),
            "region": "us-west-2",
            "prefix": "crawls/"
        },
        "on_find": True
    }
}, stream=True)

# Process in real time — S3 has your backup
for line in response.iter_lines():
    if line:
        page = json.loads(line)
        print(f"Got {page['url']}")
```

Ctrl+C the client; the connector keeps writing. Spider sends data to your connector server-side, independent of your HTTP stream. Setup details for each provider are in the Data Connectors guide.

Background Crawl + Data Connector + Webhook
For large or scheduled crawls, you probably don't want to hold a connection open at all. Set run_in_background to true, point a data connector at your storage, and add an on_website_status webhook. The API returns immediately. Pages accumulate in your database or bucket while Spider works. When the crawl finishes, you get a webhook — go process from your DB at your own pace. This is the pattern you want for cron jobs, large batch crawls, and anything that runs unattended.
Background crawl + Supabase + webhook
```python
import requests, os

headers = {
    "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
    "Content-Type": "application/json",
}

response = requests.post("https://api.spider.cloud/crawl", headers=headers, json={
    "url": "https://example.com",
    "limit": 500,
    "return_format": "markdown",
    "run_in_background": True,
    "data_connectors": {
        "supabase": {
            "url": "https://your-project.supabase.co",
            "anon_key": os.getenv("SUPABASE_ANON_KEY"),
            "table": "crawled_pages"
        },
        "on_find": True
    },
    "webhook": {
        "url": "https://your-server.com/crawl-done",
        "on_website_status": True
    }
})

# Returns immediately with a crawl_id
print(response.json())
```

Flow
```
Request ──► Spider crawls in background
                │
                ├──► Pages land in Supabase as they're found
                │
                └──► Webhook fires on completion
                          │
                          └──► Your app processes from DB
```

JSONL Streaming with Checkpointing
Sometimes you want the immediacy of streaming but can't afford to lose progress on disconnect. The idea is straightforward: keep a set of URLs you've already processed. If the stream drops, re-request with a reduced limit covering only what's left. No server-side state needed — your client tracks everything.
Streaming with checkpoint recovery
```python
import requests, os, json

API = "https://api.spider.cloud/crawl"
HEADERS = {
    "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
    "Content-Type": "application/jsonl",
}

def crawl_with_checkpoint(url: str, limit: int):
    processed = set()
    while len(processed) < limit:
        try:
            resp = requests.post(API, headers=HEADERS, json={
                "url": url,
                "limit": limit - len(processed),
                "return_format": "markdown",
            }, stream=True)
            for line in resp.iter_lines():
                if line:
                    page = json.loads(line)
                    if page["url"] not in processed:
                        processed.add(page["url"])
                        yield page  # hand off to your processing pipeline
            break  # stream ended cleanly: nothing left to resume
        except requests.exceptions.ConnectionError:
            if not processed:
                raise  # first attempt failed — bail
            print(f"Disconnected after {len(processed)}/{limit} pages, resuming...")

for page in crawl_with_checkpoint("https://example.com", limit=100):
    print(page["url"])
```

Track len(processed) against your original limit to know when you're done. For extra safety, add a data connector too — then you get real-time streaming and a durable backup with zero extra code on reconnect.

Webhook-Driven Queue Pipeline
If your backend is already event-driven, lean into it. Enable the on_find webhook and have your receiver push each page straight into a queue — SQS, Redis Streams, RabbitMQ, whatever you already run. Crawling and processing become fully decoupled: Spider discovers pages, your queue buffers them, your workers consume at their own pace.
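Before wiring up the crawl itself, here's what the consuming side can look like. This is a minimal sketch of an SQS worker loop, not a prescribed implementation: the queue URL matches the receiver example below, and process_page is a hypothetical placeholder for your real pipeline.

```python
import json

# Same queue the webhook receiver pushes to (placeholder URL)
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789/spider-pages"

def process_page(page: dict) -> str:
    # Placeholder for your real processing (indexing, embedding, etc.)
    return f"processed {page.get('url')}"

def run_worker(sqs, queue_url: str, max_batches=None):
    """Drain the queue: receive a batch, process, delete on success."""
    batches = 0
    while max_batches is None or batches < max_batches:
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling avoids busy-spinning
        )
        for msg in resp.get("Messages", []):
            page = json.loads(msg["Body"])
            process_page(page)
            # Delete only after success, so failed messages get redelivered
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
        batches += 1

# import boto3
# run_worker(boto3.client("sqs"), QUEUE_URL)
```

Deleting a message only after processing succeeds means a crashed worker simply hands its in-flight pages back to the queue. Run as many workers as your throughput needs; they scale independently of the crawl.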
Crawl with on_find webhook
```python
import requests, os

headers = {
    "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
    "Content-Type": "application/json",
}

response = requests.post("https://api.spider.cloud/crawl", headers=headers, json={
    "url": "https://example.com",
    "limit": 200,
    "return_format": "markdown",
    "webhook": {
        "url": "https://your-server.com/spider-webhook",
        "on_find": True,
        "on_website_status": True
    }
})
print(response.json())
```

Webhook receiver → SQS
```python
from fastapi import FastAPI, Request
import boto3, json

app = FastAPI()
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789/spider-pages"

@app.post("/spider-webhook")
async def handle_webhook(request: Request):
    payload = await request.json()
    # Push to SQS for async processing
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(payload),
    )
    return {"ok": True}  # return 200 fast
```

Return 200 immediately after enqueuing — don't do heavy processing inline or Spider will time out waiting for your response.

Choosing a Pattern
These aren't mutually exclusive. A common setup is streaming with a connector (so you get real-time output plus a durable backup) combined with a webhook queue for async post-processing. Start with whichever matches your current stack and layer on more as needed.
| Need | Pattern |
|---|---|
| Real-time + guaranteed delivery | Stream + Connector |
| Large batch / cron jobs | Background + Connector + Webhook |
| Interactive with disconnect recovery | JSONL Checkpointing |
| Microservice / event-driven | Webhook Queue |
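As a closing sketch, the combined setup described above (streaming plus a connector plus a webhook queue) can be wired in a single request body. Every parameter name comes from the examples on this page; the bucket, region, prefix, and webhook URL are placeholders, and the S3 credentials are omitted for brevity.

```python
def combined_crawl_payload(url: str, limit: int) -> dict:
    # One request body layering all three reliability patterns:
    # stream the response, back it up server-side, and get webhook events.
    return {
        "url": url,
        "limit": limit,
        "return_format": "markdown",
        "data_connectors": {
            "s3": {
                "bucket": "my-crawl-data",   # placeholder bucket
                "region": "us-west-2",
                "prefix": "crawls/"
            },
            "on_find": True,                 # write each page as it's found
        },
        "webhook": {
            "url": "https://your-server.com/spider-webhook",  # placeholder
            "on_find": True,                 # per-page events for your queue
            "on_website_status": True,       # completion signal
        },
    }

# POST this with stream=True (as in the first example) for real-time output;
# the connector and webhooks run server-side regardless of your connection.
```

If the stream drops, S3 still has every page and the status webhook still fires, so the client becomes an optimization rather than a point of failure.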