Efficient Scraping

Techniques for reducing latency, minimizing credit usage, and handling high-volume workloads. These patterns are essential when you're scraping thousands of pages or building production pipelines where every second matters.

Sending Multiple URLs in One Request

Pass multiple URLs as a comma-separated string in the url parameter. Spider processes them concurrently in a single request — no need to make separate API calls for each page. This is ideal for scraping paginated lists or known URL sets where every page shares the same configuration. Combine with streaming to process results as they finish.

Multiple URLs

params = { "url": "https://www.example.com, https://example2.com", "limit": 100, "request": "smart", "return_format": "markdown" }

Batch Mode

When each URL needs different parameters — different limit, request mode, or return_format — send an array of parameter objects instead. Each entry is processed independently with its own settings. This is useful when scraping heterogeneous sources in a single API call.

Batch Mode

params = [{ "url": "https://www.example.com", "limit": 5, "request": "chrome", "return_format": "markdown" }, { "url": "https://www.example2.com/", "limit": 10, "request": "smart", "return_format": "markdown" }, { "url": "https://www.example3.com/", "limit": 1, "request": "chrome", "return_format": "raw" }

Retries

Spider automatically retries failed requests and rotates proxy types (datacenter to residential) to improve success rates. If you retry manually, limit it to 2 attempts and vary your settings — switch from http to chrome, or change the proxy type to mobile or residential. Skip retries on hard failures like 404 and 401.
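A minimal manual-retry sketch along these lines; the helper below and its retry policy are illustrative assumptions (it varies only the request mode between attempts and treats 401 and 404 as non-retryable), not a fixed pattern from the API.

Manual Retry

import os
import requests

API_URL = "https://api.spider.cloud/crawl"
HEADERS = {
    "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
    "Content-Type": "application/json",
}

# Hard failures that are not worth retrying
NO_RETRY_STATUSES = {401, 404}

def scrape_with_retry(url, max_attempts=2):
    # Vary settings between attempts: plain http first, then chrome rendering
    request_modes = ["http", "chrome"]
    response = None
    for attempt in range(max_attempts):
        params = {
            "url": url,
            "limit": 1,
            "request": request_modes[attempt % len(request_modes)],
            "return_format": "markdown",
        }
        response = requests.post(API_URL, headers=HEADERS, json=params)
        if response.ok:
            return response.json()
        if response.status_code in NO_RETRY_STATUSES:
            response.raise_for_status()  # Skip retries on hard failures
    response.raise_for_status()

result = scrape_with_retry("https://www.example.com")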

Timeouts

Set request_timeout to control how long Spider waits for each page (default: 120 seconds). Increase it for pages with interactions. Use crawl_timeout to cap the total duration of an entire crawl — set it based on your page limit and expected crawl size:

Crawl Timeout

params = [{ "url": "https://www.example.com", "limit": 20, "request": "chrome", "return_format": "markdown", "crawl_timeout": { "secs": 120, "nanos": 0 } } }

You can also set client-side timeouts. With Python's requests library, pass separate connection and read timeouts:

Client Timeout

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

CONNECTION_TIMEOUT = 15
READ_TIMEOUT = 30

params = {
    "url": "https://www.example.com",
    "limit": 30,
    "depth": 3,
    "request": "smart",
    "return_format": "markdown"
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params,
    stream=True,
    # Applies to time between consecutive data chunks in streaming mode
    timeout=(CONNECTION_TIMEOUT, READ_TIMEOUT)
)

print(response.json())

Concurrency and Streaming

Spider's Rust-based engine processes pages concurrently, with a limit of 50,000 requests per minute. Pair this with streaming to process results as they arrive.
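A sketch of consuming a streamed crawl with Python's requests. It assumes the endpoint returns newline-delimited JSON (one page object per line) when the request sends Content-Type: application/jsonl, and that each object carries a url field; both are assumptions to verify against the API reference for your setup.

Streaming

import json
import os
import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    # Assumption: this content type requests newline-delimited JSON output
    'Content-Type': 'application/jsonl',
}

params = {
    "url": "https://www.example.com",
    "limit": 50,
    "request": "smart",
    "return_format": "markdown"
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params,
    stream=True,
)

# Handle each page as soon as its line arrives instead of waiting for the full crawl
for line in response.iter_lines():
    if not line:
        continue
    page = json.loads(line)
    print(page.get("url"))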