Reliability

Crawling 10 pages over a stable connection is easy. Crawling 10,000 pages from a cron job on a flaky network is a different story. This page covers battle-tested ways to make sure every page actually reaches your infrastructure — even when things go wrong.

Why Reliability Matters

An HTTP stream can drop mid-crawl. Your server can restart. A deploy can kill the process reading results. When that happens, you don't want to re-crawl hundreds of pages you already had. Spider handles the hard parts — anti-detect browsers that get past bot protection, rotating proxies, headless rendering — but none of that matters if the results never reach you. Spider's data connectors and webhooks run server-side, so your data lands safely regardless of what happens to your client. Combine them with streaming for real-time processing with a safety net.

Stream + Data Connector

The simplest upgrade: stream JSONL so you can process pages as they arrive, and attach a data connector (S3, Supabase, etc.) so Spider also writes every page server-side. If your connection drops halfway through, you haven't lost anything — the connector already has the data.

Stream JSONL + S3 backup

```python
import requests, os, json

headers = {
    "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
    "Content-Type": "application/jsonl",
}

response = requests.post("https://api.spider.cloud/crawl", headers=headers, json={
    "url": "https://example.com",
    "limit": 100,
    "return_format": "markdown",
    "data_connectors": {
        "s3": {
            "bucket": "my-crawl-data",
            "access_key_id": os.getenv("AWS_ACCESS_KEY_ID"),
            "secret_access_key": os.getenv("AWS_SECRET_ACCESS_KEY"),
            "region": "us-west-2",
            "prefix": "crawls/"
        },
        "on_find": True
    }
}, stream=True)

# Process in real time — S3 has your backup
for line in response.iter_lines():
    if line:
        page = json.loads(line)
        print(f"Got {page['url']}")
```

Background Crawl + Data Connector + Webhook

For large or scheduled crawls, you probably don't want to hold a connection open at all. Set run_in_background to true, point a data connector at your storage, and add an on_website_status webhook. The API returns immediately. Pages accumulate in your database or bucket while Spider works. When the crawl finishes, you get a webhook — go process from your DB at your own pace. This is the pattern you want for cron jobs, large batch crawls, and anything that runs unattended.

Background crawl + Supabase + webhook

```python
import requests, os

headers = {
    "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
    "Content-Type": "application/json",
}

response = requests.post("https://api.spider.cloud/crawl", headers=headers, json={
    "url": "https://example.com",
    "limit": 500,
    "return_format": "markdown",
    "run_in_background": True,
    "data_connectors": {
        "supabase": {
            "url": "https://your-project.supabase.co",
            "anon_key": os.getenv("SUPABASE_ANON_KEY"),
            "table": "crawled_pages"
        },
        "on_find": True
    },
    "webhook": {
        "url": "https://your-server.com/crawl-done",
        "on_website_status": True
    }
})

# Returns immediately with a crawl_id
print(response.json())
```
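On the receiving end, the /crawl-done handler only needs to acknowledge quickly and kick off processing from your database. A minimal sketch (the "status" field and its values checked here are assumptions, not the documented webhook schema; verify them against the actual payload):

```python
def is_crawl_complete(payload: dict) -> bool:
    """Decide whether a webhook body signals a finished crawl.
    The 'status' field and its values are assumptions, not the documented schema."""
    return payload.get("status") in ("completed", "finished")

def build_app():
    # Lazy import so is_crawl_complete stays testable without FastAPI installed
    from fastapi import FastAPI, Request

    app = FastAPI()

    @app.post("/crawl-done")
    async def crawl_done(request: Request):
        payload = await request.json()
        if is_crawl_complete(payload):
            # Kick off your own job that reads crawled_pages from Supabase
            print(f"Crawl finished: {payload.get('domain')}")
        return {"ok": True}  # respond fast; do the heavy work elsewhere

    return app
```

The handler does nothing with page content itself; by the time it fires, every page is already sitting in the crawled_pages table.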

Flow

Request ──► Spider crawls in background
              │
              ├──► Pages land in Supabase as they're found
              │
              └──► Webhook fires on completion
                     │
                     └──► Your app processes from DB

JSONL Streaming with Checkpointing

Sometimes you want the immediacy of streaming but can't afford to lose progress on disconnect. The idea is straightforward: keep a set of URLs you've already processed. If the stream drops, re-request with a reduced limit covering only what's left. No server-side state needed — your client tracks everything.

Streaming with checkpoint recovery

```python
import requests, os, json

API = "https://api.spider.cloud/crawl"
HEADERS = {
    "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
    "Content-Type": "application/jsonl",
}

def crawl_with_checkpoint(url: str, limit: int):
    processed = set()
    while len(processed) < limit:
        try:
            resp = requests.post(API, headers=HEADERS, json={
                "url": url,
                "limit": limit - len(processed),
                "return_format": "markdown",
            }, stream=True)
            for line in resp.iter_lines():
                if line:
                    page = json.loads(line)
                    if page["url"] not in processed:
                        processed.add(page["url"])
                        yield page  # hand off to your processing pipeline
            break  # stream ended cleanly — the crawl is complete
        except requests.exceptions.ConnectionError:
            if not processed:
                raise  # first attempt failed — bail
            print(f"Disconnected after {len(processed)}/{limit} pages, resuming...")
            continue

# The function is a generator, so iterate to drive the crawl
for page in crawl_with_checkpoint("https://example.com", limit=100):
    print(page["url"])
```

Webhook-Driven Queue Pipeline

If your backend is already event-driven, lean into it. Enable the on_find webhook and have your receiver push each page straight into a queue — SQS, Redis Streams, RabbitMQ, whatever you already run. Crawling and processing become fully decoupled: Spider discovers pages, your queue buffers them, your workers consume at their own pace.

Crawl with on_find webhook

```python
import requests, os

headers = {
    "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
    "Content-Type": "application/json",
}

response = requests.post("https://api.spider.cloud/crawl", headers=headers, json={
    "url": "https://example.com",
    "limit": 200,
    "return_format": "markdown",
    "webhook": {
        "url": "https://your-server.com/spider-webhook",
        "on_find": True,
        "on_website_status": True
    }
})

print(response.json())
```

Webhook receiver → SQS

```python
from fastapi import FastAPI, Request
import boto3, json

app = FastAPI()
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789/spider-pages"

@app.post("/spider-webhook")
async def handle_webhook(request: Request):
    payload = await request.json()
    # Push to SQS for async processing
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(payload),
    )
    return {"ok": True}  # return 200 fast
```
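On the other side of the queue, a worker long-polls, processes each page, and deletes the message only after success, so a worker crash simply redelivers. A sketch assuming the queued body is the raw webhook payload (field names like "url" inside it are assumptions; match them to your real payload):

```python
import json

def parse_page(body: str) -> dict:
    """Decode one queued webhook payload. The field names inside it
    (e.g. 'url', 'content') are assumptions; match your real payload."""
    return json.loads(body)

def run_worker(queue_url: str):
    import boto3  # imported here so parse_page is testable without AWS deps

    sqs = boto3.client("sqs")
    while True:
        # Long-poll: block up to 20s instead of hammering the API
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,
        )
        for msg in resp.get("Messages", []):
            page = parse_page(msg["Body"])
            print(f"Processing {page.get('url')}")
            # Delete only after successful processing; a crash redelivers
            sqs.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=msg["ReceiptHandle"],
            )

if __name__ == "__main__":
    run_worker("https://sqs.us-east-1.amazonaws.com/123456789/spider-pages")
```

Because SQS redelivers any message that isn't deleted before its visibility timeout, processing should be idempotent (e.g. upsert by URL).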

Choosing a Pattern

These aren't mutually exclusive. A common setup is streaming with a connector (so you get real-time output plus a durable backup) combined with a webhook queue for async post-processing. Start with whichever matches your current stack and layer on more as needed.

| Need | Pattern |
| --- | --- |
| Real-time + guaranteed delivery | Stream + Connector |
| Large batch / cron jobs | Background + Connector + Webhook |
| Interactive with disconnect recovery | JSONL Checkpointing |
| Microservice / event-driven | Webhook Queue |
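Combining patterns is just a matter of merging request fields. For example, a single request can stream results, back them up to S3, and feed a webhook queue (a sketch reusing the parameters shown earlier on this page; bucket names and URLs are placeholders):

```python
import os

payload = {
    "url": "https://example.com",
    "limit": 1000,
    "return_format": "markdown",
    # Durable server-side copy of every page
    "data_connectors": {
        "s3": {
            "bucket": "my-crawl-data",
            "access_key_id": os.getenv("AWS_ACCESS_KEY_ID"),
            "secret_access_key": os.getenv("AWS_SECRET_ACCESS_KEY"),
            "region": "us-west-2",
            "prefix": "crawls/",
        },
        "on_find": True,
    },
    # Per-page and completion events for the queue pipeline
    "webhook": {
        "url": "https://your-server.com/spider-webhook",
        "on_find": True,
        "on_website_status": True,
    },
}
# POST this with stream=True to also process pages in real time
```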