Efficient Scraping
Web scraping with low-latency requests.
Sending Multiple URLs in One Request
Pass multiple URLs as a single comma-separated string in the url parameter to send them to the crawler concurrently in one request. This is useful for scraping paginated lists or many URLs that share the same parameter payload. Stream the responses as they finish to speed up processing; see the sketch after the example below.
Multiple URLs
params = {
    "url": "https://www.example.com, https://example2.com",
    "limit": 100,
    "request": "smart",
    "return_format": "markdown"
}
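As a rough sketch of sending the request above and handling each result as it arrives: the endpoint and headers mirror the Client Timeout example later in this guide, and treating the streamed body as one JSON object per line is an assumption about the response format.
Streaming Multiple URLs
import json, os, requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

params = {
    "url": "https://www.example.com, https://example2.com",
    "limit": 100,
    "request": "smart",
    "return_format": "markdown"
}

# Stream the response so pages can be processed as soon as they are returned.
response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params,
    stream=True,
)

# Assumes each streamed chunk is a JSON object on its own line.
for line in response.iter_lines():
    if line:
        page = json.loads(line)
        # 'url' here is illustrative; inspect the payload for the fields you need.
        print(page.get("url"))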
Batch Mode
Another way to send multiple URLs to be crawled at once is batch mode: send a batch of parameter objects as an array. This is a good option if you prefer to customize the parameters for each starting URL, and it is particularly useful for scraping multiple web pages with custom parameter payloads such as parsing selectors.
Batch Mode
params = [{
    "url": "https://www.example.com",
    "limit": 5,
    "request": "chrome",
    "return_format": "markdown"
},
{
    "url": "https://www.example2.com/",
    "limit": 10,
    "request": "smart",
    "return_format": "markdown"
},
{
    "url": "https://www.example3.com/",
    "limit": 1,
    "request": "chrome",
    "return_format": "raw"
}]
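A short sketch of submitting a batch, assuming the array of parameter objects is posted as the JSON body to the same crawl endpoint used elsewhere in this guide:
Sending a Batch
import os, requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

# Each entry keeps its own limit, request type, and return format.
params = [
    {"url": "https://www.example.com", "limit": 5, "request": "chrome", "return_format": "markdown"},
    {"url": "https://www.example2.com/", "limit": 10, "request": "smart", "return_format": "markdown"}
]

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params
)
print(response.json())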
Retries
Spider automatically retries failed requests, including rotating between proxy types (from datacenter to residential) to increase the chances of success. However, you may decide to retry a request manually. To prevent retry stacking between the crawler and your client, we recommend retrying at most 2 times and using different parameters on each attempt, such as changing the request type (e.g. from http to chrome) or the proxy type (e.g. switching to proxy_mobile or proxy_lightning). You can skip retries on hard failure status codes like 404 and 401.
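As a loose illustration of that approach, the sketch below retries up to two times, switches the request type from http to chrome before the next attempt, and gives up on hard failure codes. Checking the HTTP status of the API response is a simplification; depending on your setup, the page-level status in the returned data may be what you actually want to inspect.
Manual Retry
import os, requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

params = {
    "url": "https://www.example.com",
    "limit": 5,
    "request": "http",
    "return_format": "markdown"
}

MAX_RETRIES = 2
SKIP_STATUSES = {401, 404}  # hard failures that a retry will not fix

for attempt in range(MAX_RETRIES + 1):
    response = requests.post(
        'https://api.spider.cloud/crawl',
        headers=headers,
        json=params
    )
    if response.ok or response.status_code in SKIP_STATUSES:
        break
    # Change the request type (or swap the proxy type) before retrying so the
    # next attempt is not an exact repeat of the one that just failed.
    params["request"] = "chrome"

print(response.status_code)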
Timeouts
You can set a custom timeout for each page using the request_timeout parameter. By default the timeout is set to 120 seconds. Increase the timeout if you plan to run any interactions on the page. To prevent a crawl from taking longer than expected, you may want to set crawl_timeout to a maximum duration for the entire crawl; a common duration is 120 seconds, though you may want to base it on the maximum number of pages you expect to crawl. See the example below:
Crawl Timeout
params = [{
    "url": "https://www.example.com",
    "limit": 20,
    "request": "chrome",
    "return_format": "markdown",
    "crawl_timeout": {
        "secs": 120,
        "nanos": 0
    }
}]
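The per-page request_timeout can be combined with the crawl-level timeout; a minimal sketch, assuming request_timeout accepts a whole number of seconds:
Request Timeout
params = {
    "url": "https://www.example.com",
    "limit": 20,
    "request": "chrome",
    "return_format": "markdown",
    # Per-page timeout; the value is assumed to be in seconds.
    "request_timeout": 60,
    "crawl_timeout": {
        "secs": 240,
        "nanos": 0
    }
}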
Sometimes you might need to set a timeout in your code as well. If you're using Python's requests library, you can add separate timeouts for connecting and reading.
Client Timeout
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

CONNECTION_TIMEOUT = 15
READ_TIMEOUT = 30

params = {
    "url": "https://www.example.com",
    "limit": 30,
    "depth": 3,
    "request": "smart",
    "return_format": "markdown"
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params,
    stream=True,
    # With stream=True, the read timeout applies to the time between
    # consecutive data chunks rather than to the whole response.
    timeout=(CONNECTION_TIMEOUT, READ_TIMEOUT)
)
print(response.json())
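If the client-side timeout fires, one option is to retry the request once right away; a minimal sketch building on the example above (it reuses headers, params, and the timeout constants, and the single retry is an illustrative choice):
Retry on Client Timeout
def crawl_with_retry(retries=1):
    # Reuses headers, params, CONNECTION_TIMEOUT and READ_TIMEOUT from above.
    for attempt in range(retries + 1):
        try:
            return requests.post(
                'https://api.spider.cloud/crawl',
                headers=headers,
                json=params,
                timeout=(CONNECTION_TIMEOUT, READ_TIMEOUT)
            )
        except requests.exceptions.Timeout:
            # On the final attempt, surface the timeout to the caller.
            if attempt == retries:
                raise

response = crawl_with_retry()
print(response.status_code)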
A common approach is to set the client timeout to around 60 seconds and retry the request immediately if it times out, as in the sketch above. This prevents scrapes from taking longer than expected and lets you continue processing the next request. Adjust your retry and timeout strategy based on the volume of requests and the number of pages you're scraping.
Concurrency and Streaming
Spider crawls pages quickly using full concurrency. By default, all users benefit from the efficiency of a Rust-based crawler. The crawler currently has a limit of 50,000 requests per minute. Learn more about streaming on our docs.
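If you submit many independent crawl requests from your own code, a small client-side worker pool keeps several in flight at once while staying well under that limit; a minimal sketch (the worker count and per-URL parameters are illustrative):
Concurrent Requests
import os, requests
from concurrent.futures import ThreadPoolExecutor

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

urls = [
    "https://www.example.com",
    "https://www.example2.com/",
    "https://www.example3.com/"
]

def crawl(url):
    params = {"url": url, "limit": 5, "request": "smart", "return_format": "markdown"}
    return requests.post('https://api.spider.cloud/crawl', headers=headers, json=params).json()

# A small pool keeps several crawls in flight at once on the client side.
with ThreadPoolExecutor(max_workers=5) as executor:
    for result in executor.map(crawl, urls):
        print(result)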