Efficient Scraping
Techniques for reducing latency, minimizing credit usage, and handling high-volume workloads. These patterns are essential when you're scraping thousands of pages or building production pipelines where every second matters.
Sending Multiple URLs in One Request
Pass multiple URLs as a comma-separated string in the url parameter. Spider processes them concurrently in a single request — no need to make separate API calls for each page. This is ideal for scraping paginated lists or known URL sets where every page shares the same configuration. Combine with streaming to process results as they finish.
Multiple URLs
params = {
    "url": "https://www.example.com, https://example2.com",
    "limit": 100,
    "request": "smart",
    "return_format": "markdown"
}

Batch Mode
When each URL needs different parameters — different limit, request mode, or return_format — send an array of parameter objects instead. Each entry is processed independently with its own settings. This is useful when scraping heterogeneous sources in a single API call.
Batch Mode
params = [
    {
        "url": "https://www.example.com",
        "limit": 5,
        "request": "chrome",
        "return_format": "markdown"
    },
    {
        "url": "https://www.example2.com/",
        "limit": 10,
        "request": "smart",
        "return_format": "markdown"
    },
    {
        "url": "https://www.example3.com/",
        "limit": 1,
        "request": "chrome",
        "return_format": "raw"
    }
]

Retries
Spider automatically retries failed requests and rotates proxy types (datacenter to residential) to improve success rates. If you retry manually, limit it to 2 attempts and vary your settings — switch from http to chrome, or change the proxy type to mobile or residential. Skip retries on hard failures like 404 and 401.
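A minimal manual-retry sketch along those lines, assuming the /crawl endpoint and parameters shown elsewhere in this section; the two-attempt cap, the http-to-chrome switch, and the 404/401 short-circuit mirror the guidance above, while the helper name and timeout values are illustrative:
Manual Retry
import os
import requests

API_URL = 'https://api.spider.cloud/crawl'
HEADERS = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}
NON_RETRYABLE = {401, 404}  # hard failures a retry will not fix

def crawl_with_retry(params, max_attempts=2):
    """Try the request at most twice, varying settings before the second attempt."""
    response = None
    for attempt in range(max_attempts):
        response = requests.post(API_URL, headers=HEADERS, json=params, timeout=(15, 60))
        if response.ok:
            return response.json()
        if response.status_code in NON_RETRYABLE:
            break  # skip retries on hard failures like 404 and 401
        # Vary settings for the next attempt: switch http -> chrome here;
        # a proxy-type change (e.g. residential or mobile) would also go here.
        params = {**params, "request": "chrome"}
    response.raise_for_status()

result = crawl_with_retry({
    "url": "https://www.example.com",
    "limit": 10,
    "request": "http",
    "return_format": "markdown",
})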
Timeouts
Set request_timeout to control how long Spider waits for each page (default: 120 seconds). Increase it for pages with interactions. Use crawl_timeout to cap the total duration of an entire crawl — set it based on your page limit and expected crawl size:
Crawl Timeout
params = [{
    "url": "https://www.example.com",
    "limit": 20,
    "request": "chrome",
    "return_format": "markdown",
    "crawl_timeout": {
        "secs": 120,
        "nanos": 0
    }
}]

You can also set client-side timeouts. With Python's requests library, pass separate connection and read timeouts:
Client Timeout
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

CONNECTION_TIMEOUT = 15
READ_TIMEOUT = 30

params = {
    "url": "https://www.example.com",
    "limit": 30,
    "depth": 3,
    "request": "smart",
    "return_format": "markdown"
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params,
    stream=True,  # the read timeout applies to the gap between consecutive data chunks in streaming mode
    timeout=(CONNECTION_TIMEOUT, READ_TIMEOUT)
)
print(response.json())

Set the client timeout to around 60 secs and retry the request immediately if it times out. This keeps a scrape from running longer than expected and lets you continue processing the next request. Adjust your retry and timeout strategy based on the volume of requests and the number of pages you're scraping.
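A brief sketch of that retry-on-timeout pattern, assuming a single combined connect/read ceiling of 60 secs; the helper name and the one-retry policy are illustrative:
Retry on Timeout
import os
import requests

HEADERS = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}
CLIENT_TIMEOUT = 60  # seconds; a single value covers both connect and read

def post_with_timeout_retry(params, max_attempts=2):
    """Retry immediately if the client-side timeout fires, then give up."""
    for attempt in range(max_attempts):
        try:
            return requests.post(
                'https://api.spider.cloud/crawl',
                headers=HEADERS,
                json=params,
                timeout=CLIENT_TIMEOUT,
            )
        except requests.exceptions.Timeout:
            if attempt == max_attempts - 1:
                raise  # let the caller move on to the next request

response = post_with_timeout_retry({
    "url": "https://www.example.com",
    "limit": 30,
    "request": "smart",
    "return_format": "markdown",
})
print(response.json())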
Concurrency and Streaming
Spider's Rust-based engine processes pages concurrently, with a 50,000 requests per minute limit. Pair concurrency with streaming to process results as they arrive.
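A sketch that pairs the multi-URL parameters from the start of this section with streamed processing; it assumes the streamed body arrives as newline-delimited JSON and that each object carries url and content fields, so adjust the parsing to the actual response framing documented for streaming:
Streaming Results
import json
import os
import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}
params = {
    "url": "https://www.example.com, https://example2.com",
    "limit": 100,
    "request": "smart",
    "return_format": "markdown",
}

with requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params,
    stream=True,
    timeout=(15, 30),
) as response:
    response.raise_for_status()
    # Handle each page as soon as it arrives instead of buffering the whole crawl.
    for line in response.iter_lines():
        if not line:
            continue  # skip keep-alive blank lines
        page = json.loads(line)  # assumes newline-delimited JSON framing
        print(page.get("url"), len(page.get("content") or ""))  # field names are illustrative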