Efficient Scraping
Sending Multiple URLs in One Request
Concatenate multiple URLs together as a string for the url parameter to send them concurrently to the crawler in one request. This is useful for scraping paginated lists or scraping many URLs with the same parameter payload. Stream the responses as they finish to speed up processing; a streaming sketch follows the example below.
Multiple URLs
params = {
    "url": "https://www.example.com, https://example2.com",
    "limit": 100,
    "request": "smart",
    "return_format": "markdown"
}
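The snippet below is a minimal sketch of sending those parameters and consuming the stream, processing each result as soon as it arrives. It assumes the https://api.spider.cloud/crawl endpoint and SPIDER_API_KEY environment variable used in the client timeout example further down, and that streamed responses arrive as one JSON object per line; adjust the parsing to the shape your responses actually have.
Streaming the Responses
import json
import os

import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

params = {
    "url": "https://www.example.com, https://example2.com",
    "limit": 100,
    "request": "smart",
    "return_format": "markdown"
}

# Stream so each page can be handled as soon as the crawler finishes it,
# instead of waiting for the entire crawl to complete.
with requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params,
    stream=True,
) as response:
    for line in response.iter_lines():
        if not line:
            continue
        page = json.loads(line)  # assumes one JSON object per line
        print(page.get("url"))  # the "url" key is an assumption; inspect your payload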
Batch Mode
Another way to send multiple URLs at once is batch mode: send a batch of parameter objects as an array. This is the option to choose if you prefer to customize the parameters for each starting URL, and it is particularly useful for scraping multiple web pages with custom parameter payloads such as parsing selectors.
Batch Mode
params = [{
    "url": "https://www.example.com",
    "limit": 5,
    "request": "chrome",
    "return_format": "markdown"
},
{
    "url": "https://www.example2.com/",
    "limit": 10,
    "request": "smart",
    "return_format": "markdown"
},
{
    "url": "https://www.example3.com/",
    "limit": 1,
    "request": "chrome",
    "return_format": "raw"
}]
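Submitting the batch is the same POST as a single-configuration request. The sketch below assumes the array is accepted as the JSON body of the https://api.spider.cloud/crawl endpoint shown in the client timeout example further down.
Submitting a Batch
import os

import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

# Each object in the array carries its own limit, request type, and return format.
response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params,  # the array defined above
)
print(response.status_code)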
Retries
Spider automatically retries failed requests, which includes retrying and rotating different proxy types (from datacenter to residential) to increase the chances of success. However, you may decide to retry a request manually. To prevent retry stacking between the crawler and your client, we recommend retrying at most 2 times and using different parameters for the request type (e.g. switching from http to chrome) or a different proxy type (e.g. switching to proxy_mobile or proxy_lightning). You can skip retries on hard-failure error codes like 404, 401, etc.
Note: Use one proxy setting per request.
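A minimal sketch of a manual retry along those lines: at most two attempts, switching the request type from http to chrome on the second attempt, and skipping the retry on hard-failure codes. The endpoint and where the failing status surfaces are assumptions; adjust the check to wherever your responses report the page status.
Manual Retry
import os

import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

HARD_FAILURES = {401, 404}  # don't retry these

def fetch_with_retry(url):
    # First attempt with a plain HTTP request, second attempt with a headless browser.
    for request_type in ("http", "chrome"):
        params = {
            "url": url,
            "limit": 1,
            "request": request_type,
            "return_format": "markdown",
        }
        response = requests.post(
            'https://api.spider.cloud/crawl',
            headers=headers,
            json=params,
        )
        if response.ok:
            return response.json()
        if response.status_code in HARD_FAILURES:
            break  # hard failure: a retry is unlikely to help
    return None

result = fetch_with_retry("https://www.example.com")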
Timeouts
You can set a custom timeout for each page using the request_timeout parameter. The default is 60 seconds. Increase the timeout if you plan on running any interactions on the page.
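For example, a page that needs in-page interactions might get a larger budget. The snippet below assumes request_timeout accepts an integer number of seconds; the exact accepted range may differ, so check the parameter reference.
Request Timeout
params = {
    "url": "https://www.example.com",
    "limit": 1,
    "request": "chrome",
    "request_timeout": 90,  # seconds (assumed unit); raised above the 60-second default
    "return_format": "markdown"
}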
To prevent a crawl from taking longer than expected, you may want to set crawl_timeout to a maximum duration for the entire crawl; a common duration is 120 seconds. You may also want to base this on the maximum number of pages you expect to crawl. The default is no timeout. See the example below.
Crawl Timeout
params = [{
    "url": "https://www.example.com",
    "limit": 20,
    "request": "chrome",
    "return_format": "markdown",
    "crawl_timeout": {
        "secs": 120,
        "nanos": 0
    }
}]
In some cases, setting a client timeout in your code may be necessary. With Python Requests, you can set separate timeouts for connecting and for reading the response. See the example below.
Client Timeout
import requests
import os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

CONNECTION_TIMEOUT = 15
READ_TIMEOUT = 30

params = {
    "url": "https://www.example.com",
    "limit": 30,
    "depth": 3,
    "request": "smart",
    "return_format": "markdown"
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params,
    stream=True,  # the read timeout applies to the time between consecutive data chunks in streaming mode
    timeout=(CONNECTION_TIMEOUT, READ_TIMEOUT)
)

print(response.json())
Tip: A common tactic when sending single-page URL requests is to set a client read timeout of 60 seconds and retry the request immediately if it times out. This prevents scrapes from taking longer than expected and lets you continue processing the next request. Adjust your retry and timeout strategy based on the volume of requests and the number of pages you're scraping.
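A sketch of that tactic, assuming the same endpoint and auth header as the client timeout example above: catch the read timeout and retry once before moving on.
Read Timeout Retry
import os

import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

CONNECTION_TIMEOUT = 15
READ_TIMEOUT = 60  # seconds, per the tip above

def scrape_once(url, attempts=2):
    params = {
        "url": url,
        "limit": 1,
        "request": "smart",
        "return_format": "markdown",
    }
    for _ in range(attempts):
        try:
            response = requests.post(
                'https://api.spider.cloud/crawl',
                headers=headers,
                json=params,
                timeout=(CONNECTION_TIMEOUT, READ_TIMEOUT),
            )
            return response.json()
        except requests.exceptions.ReadTimeout:
            continue  # retry immediately, then move on to the next URL
    return None

result = scrape_once("https://www.example.com")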
Concurrency + Streaming = 🚀
Spider crawls pages quickly using full concurrency. By default, all users benefit from the efficiency of a Rust-based crawler. The crawler currently has a limit of 50,000 requests per minute. Check out the docs on streaming.