Efficient Scraping
Sending Multiple URLs in One Request
Concatenate multiple URLs together as a string for the url parameter to send them concurrently to the crawler in one request. This is useful for scraping paginated lists or scraping many URLs with the same parameter payload. Stream the responses as they finish to speed up processing; a streaming sketch follows the example below.
Multiple URLs
params = {
    "url": "https://www.example.com, https://example2.com",
    "limit": 100,
    "request": "smart",
    "return_format": "markdown"
}
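The snippet below is a minimal sketch of sending those parameters and consuming the stream, processing each result as soon as it arrives. It assumes the https://api.spider.cloud/crawl endpoint and SPIDER_API_KEY environment variable used in the client timeout example further down, and that streamed responses arrive as one JSON object per line; adjust the parsing to the shape your responses actually have.
Streaming the Responses
import json
import os

import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

params = {
    "url": "https://www.example.com, https://example2.com",
    "limit": 100,
    "request": "smart",
    "return_format": "markdown"
}

# Stream so each page can be handled as soon as the crawler finishes it,
# instead of waiting for the entire crawl to complete.
with requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params,
    stream=True,
) as response:
    for line in response.iter_lines():
        if not line:
            continue
        page = json.loads(line)  # assumes one JSON object per line
        print(page.get("url"))  # the "url" key is an assumption; inspect your payload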
Batch Mode
Another way to send multiple URLs at once is batch mode: send a batch of parameter objects as an array. This is the option to choose if you prefer to customize the parameters for each starting URL, and it is particularly useful for scraping multiple web pages with custom parameter payloads such as parsing selectors.
Batch Mode
params = [{
    "url": "https://www.example.com",
    "limit": 5,
    "request": "chrome",
    "return_format": "markdown"
},
{
    "url": "https://www.example2.com/",
    "limit": 10,
    "request": "smart",
    "return_format": "markdown"
},
{
    "url": "https://www.example3.com/",
    "limit": 1,
    "request": "chrome",
    "return_format": "raw"
}]
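Submitting the batch is the same POST as a single-configuration request. The sketch below assumes the array is accepted as the JSON body of the https://api.spider.cloud/crawl endpoint shown in the client timeout example further down.
Submitting a Batch
import os

import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

# Each object in the array carries its own limit, request type, and return format.
response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params,  # the array defined above
)
print(response.status_code)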
Retries
Spider automatically retries failed requests, which includes retrying and rotating different proxy types (from datacenter to residential) to increase the chances of success. However, you may decide to retry a request manually. To prevent retry stacking between the crawler and your client, we recommend retrying at most 2 times and using different parameters for the request type (e.g. switching from http to chrome) or a different proxy type (e.g. switching to proxy_mobile or proxy_lightning). You can skip retries on hard-failure error codes like 404, 401, etc.
Note: Use one proxy setting per request.
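A minimal sketch of a manual retry along those lines: at most two attempts, switching the request type from http to chrome on the second attempt, and skipping the retry on hard-failure codes. The endpoint and where the failing status surfaces are assumptions; adjust the check to wherever your responses report the page status.
Manual Retry
import os

import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

HARD_FAILURES = {401, 404}  # don't retry these

def fetch_with_retry(url):
    # First attempt with a plain HTTP request, second attempt with a headless browser.
    for request_type in ("http", "chrome"):
        params = {
            "url": url,
            "limit": 1,
            "request": request_type,
            "return_format": "markdown",
        }
        response = requests.post(
            'https://api.spider.cloud/crawl',
            headers=headers,
            json=params,
        )
        if response.ok:
            return response.json()
        if response.status_code in HARD_FAILURES:
            break  # hard failure: a retry is unlikely to help
    return None

result = fetch_with_retry("https://www.example.com")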
Timeouts
You can set a custom timeout for each page using the request_timeout parameter. The default is 60 seconds. Increase the timeout if you plan on running any interactions on the page.
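For example, a page that needs in-page interactions might get a larger budget. The snippet below assumes request_timeout accepts an integer number of seconds; the exact accepted range may differ, so check the parameter reference.
Request Timeout
params = {
    "url": "https://www.example.com",
    "limit": 1,
    "request": "chrome",
    "request_timeout": 90,  # seconds (assumed unit); raised above the 60-second default
    "return_format": "markdown"
}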
To prevent a crawl from taking longer than expected, you may want to set crawl_timeout to a maximum duration for the entire crawl; a common duration is 120 seconds. You may also want to base this on the maximum number of pages you expect to crawl. The default is no timeout. See the example below.
Crawl Timeout
params = [{
    "url": "https://www.example.com",
    "limit": 20,
    "request": "chrome",
    "return_format": "markdown",
    "crawl_timeout": {
        "secs": 120,
        "nanos": 0
    }
}]
In some cases, setting a client timeout in your code may be necessary. With Python Requests, you can set separate timeouts for connecting and for reading the response. See the example below.
Client Timeout
import requests
import os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

CONNECTION_TIMEOUT = 15
READ_TIMEOUT = 30

params = {
    "url": "https://www.example.com",
    "limit": 30,
    "depth": 3,
    "request": "smart",
    "return_format": "markdown"
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params,
    stream=True,  # the read timeout applies to the time between consecutive data chunks in streaming mode
    timeout=(CONNECTION_TIMEOUT, READ_TIMEOUT)
)

print(response.json())
Tip: A common tactic when sending single-page URL requests is to set a client read timeout of 60 seconds and retry the request immediately if it times out. This prevents scrapes from taking longer than expected and lets you continue processing the next request. Adjust your retry and timeout strategy based on the volume of requests and the number of pages you're scraping.
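A sketch of that tactic, assuming the same endpoint and auth header as the client timeout example above: catch the read timeout and retry once before moving on.
Read Timeout Retry
import os

import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

CONNECTION_TIMEOUT = 15
READ_TIMEOUT = 60  # seconds, per the tip above

def scrape_once(url, attempts=2):
    params = {
        "url": url,
        "limit": 1,
        "request": "smart",
        "return_format": "markdown",
    }
    for _ in range(attempts):
        try:
            response = requests.post(
                'https://api.spider.cloud/crawl',
                headers=headers,
                json=params,
                timeout=(CONNECTION_TIMEOUT, READ_TIMEOUT),
            )
            return response.json()
        except requests.exceptions.ReadTimeout:
            continue  # retry immediately, then move on to the next URL
    return None

result = scrape_once("https://www.example.com")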
Concurrency + Streaming = 🚀
Spider crawls pages quickly using full concurrency. By default, all users benefit from the efficiency of a Rust-based crawler. The crawler currently has a limit of 50,000 requests per minute. Check out the docs on streaming.