Efficient Scraping

Sending Multiple URLs in One Request

Concatenate multiple URLs into a single comma-separated string for the url parameter to send them to the crawler concurrently in one request. This is useful for scraping paginated lists or scraping many URLs with the same parameter payload. Stream the responses as they finish to speed up processing (see the sketch after the example below).

Multiple URLs

params = { "url": "https://www.example.com, https://example2.com", "limit": 100, "request": "smart", "return_format": "markdown" }

Batch Mode

Another way to send multiple URLs at once is batch mode: send an array of parameter objects, one per starting URL. This is a good option if you want to customize the parameters for each starting URL, and it is particularly useful for scraping multiple web pages with custom parameter payloads such as parsing selectors.

Batch Mode

params = [{ "url": "https://www.example.com", "limit": 5, "request": "chrome", "return_format": "markdown" }, { "url": "https://www.example2.com/", "limit": 10, "request": "smart", "return_format": "markdown" }, { "url": "https://www.example3.com/", "limit": 1, "request": "chrome", "return_format": "raw" }

Retries

Spider automatically retries failed requests, including rotating between proxy types (from datacenter to residential), to increase the chance of success. However, you may decide to retry a request manually. To prevent retry stacking between the crawler and your client, we recommend retrying at most 2 times and changing the request type (e.g. from http to chrome) or the proxy type (e.g. switching to proxy_mobile or proxy_lightning) between attempts. You can skip retries on hard failure status codes such as 404 or 401.

Note: Use one proxy setting per request.
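
The loop below is a sketch of that manual strategy: one initial attempt plus up to two retries, switching the request type and then the proxy type between attempts, and skipping retries on hard failures. The fallback order and the boolean proxy parameters are illustrative assumptions, and depending on how failures surface in your results you may need to inspect per-page status fields instead of the HTTP status of the API call.

Manual Retry

import os

import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

# Illustrative fallback order: change the request type first, then the proxy
# type. Use only one proxy setting per request.
attempts = [
    {"request": "http"},
    {"request": "chrome"},
    {"request": "chrome", "proxy_mobile": True},  # assumed boolean flag
]

HARD_FAILURES = {401, 404}  # skip retries on these

result = None
for overrides in attempts:  # initial attempt + up to 2 retries
    params = {
        "url": "https://www.example.com",
        "limit": 1,
        "return_format": "markdown",
        **overrides,
    }
    response = requests.post('https://api.spider.cloud/crawl', headers=headers, json=params)
    if response.ok:
        result = response.json()
        break
    if response.status_code in HARD_FAILURES:
        break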

Timeouts

You can set a custom timeout for each page using the request_timeout parameter. The default is 60 seconds. Increase the timeout if you plan on running any interactions on the page.
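
For example (assuming request_timeout is specified in seconds), a crawl that runs page interactions might raise the limit like this:

Request Timeout

params = {
    "url": "https://www.example.com",
    "limit": 10,
    "request": "chrome",
    "return_format": "markdown",
    "request_timeout": 90  # assumed to be seconds per page; the default is 60
}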

To prevent a crawl from taking longer than expected, you may want to set crawl_timeout to a maximum duration for the entire crawl; a common value is 120 seconds. You may also want to base this on the maximum number of pages you expect to crawl. The default is no timeout. See the example below.

Crawl Timeout

params = [{ "url": "https://www.example.com", "limit": 20, "request": "chrome", "return_format": "markdown", "crawl_timeout": { "secs": 120, "nanos": 0 } } }

In some cases, setting a client timeout in your code may be necessary. In Python Requests, you can set separate timeouts for connecting and for reading the response. See the example below.

Client Timeout

import os

import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

CONNECTION_TIMEOUT = 15
READ_TIMEOUT = 30

params = {
    "url": "https://www.example.com",
    "limit": 30,
    "depth": 3,
    "request": "smart",
    "return_format": "markdown"
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params,
    stream=True,
    # Applies to time between consecutive data chunks in streaming mode
    timeout=(CONNECTION_TIMEOUT, READ_TIMEOUT)
)

print(response.json())

Tip: A common tactic when sending single-page URL requests is to set a client read timeout of 60 seconds and retry the request immediately if it times out. This prevents scrapes from taking longer than expected and lets you move on to the next request. Adjust your retry and timeout strategy based on the volume of requests and the number of pages you're scraping.
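
A sketch of that tactic for a single-page request is shown below; the retry-once-on-timeout behavior is a client-side choice, not something the API requires.

Retry on Timeout

import os

import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

def scrape_once(params, connect_timeout=15, read_timeout=60):
    # Try once, retry a single time if the timeout is hit, then move on.
    for _ in range(2):
        try:
            response = requests.post(
                'https://api.spider.cloud/crawl',
                headers=headers,
                json=params,
                timeout=(connect_timeout, read_timeout)
            )
            return response.json()
        except requests.exceptions.Timeout:
            continue
    return None

result = scrape_once({"url": "https://www.example.com", "limit": 1, "return_format": "markdown"})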

Concurrency + Streaming = 🚀

Spider crawls pages quickly using full concurrency. By default, all users benefit from the efficiency of a Rust-based crawler. The crawler currently has a limit of 50,000 requests per minute. Check out the docs on streaming.
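
As an illustration (a sketch only, not an official client pattern), you can also fan out crawl requests concurrently from your client with a thread pool while keeping your total volume well under the rate limit:

Concurrent Requests

import os
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

urls = [
    "https://www.example.com",
    "https://www.example2.com",
    "https://www.example3.com",
]

def crawl(url):
    params = {"url": url, "limit": 10, "request": "smart", "return_format": "markdown"}
    response = requests.post('https://api.spider.cloud/crawl', headers=headers, json=params)
    return url, response.json()

# Fan out requests concurrently; keep well under the 50,000 requests/minute limit.
with ThreadPoolExecutor(max_workers=5) as pool:
    futures = [pool.submit(crawl, url) for url in urls]
    for future in as_completed(futures):
        url, pages = future.result()
        # Assumption: the response body is a list of page objects.
        print(url, len(pages) if isinstance(pages, list) else pages)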