Scraping and Crawling
Scrape or crawl the web reliably to get data from anywhere.
Scraping a Single Page
Once you have an API key, use any of our SDK libraries or call the API directly. The following are good settings to try first:
Single Page Scrape Using API in Python
import requests
import os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}
params = {
    "url": "https://www.example.com",
    "request": "smart",  # Automatically decides which mode to use
    "return_format": "markdown",  # LLM-friendly format
    "proxy_enabled": True
}
response = requests.post(
    'https://api.spider.cloud/scrape',  # scrape endpoint
    headers=headers,
    json=params
)
print(response.json())
Pro Tip:
We recommend testing with a variety of URLs and assessing the responses you get back.
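One way to follow this advice is to loop over a small set of URLs and compare the responses side by side. This is a minimal sketch; the sample URLs and the choice of fields to inspect are illustrative assumptions, and the network call only runs when an API key is configured.

```python
import os
import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

# Hypothetical sample of pages with different rendering characteristics.
test_urls = [
    "https://www.example.com",       # simple static HTML
    "https://www.wikipedia.org",     # server-rendered content
]

if os.getenv("SPIDER_API_KEY"):  # only call the API when a key is set
    for url in test_urls:
        response = requests.post(
            'https://api.spider.cloud/scrape',
            headers=headers,
            json={"url": url, "request": "smart", "return_format": "markdown"},
        )
        # Inspect the HTTP status and a preview of the payload per URL.
        print(url, response.status_code, str(response.json())[:200])
```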
Crawling
To crawl from a starting URL, use the crawl endpoint, or use the crawl_url method in the SDK library. Set a maximum limit of pages to crawl. When testing your code against a website that may have thousands of links, it's useful to start with a reasonable maximum (e.g. a limit of 30 pages) before crawling more.
Crawling Pages Using API in Python
import requests
import os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}
params = {
    "url": "https://www.example.com",
    "limit": 30,  # Maximum number of pages to crawl
    "depth": 3,   # Reasonable depth for small sites
    "request": "smart",
    "return_format": "markdown"
}
response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params
)
print(response.json())
Example Response from Crawling and Scraping
import requests
import os
# ... truncated code
response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params
)
for result in response.json():
    print(result['content'])  # Main HTML content or text
    print(result['url'])      # URL of the page
    print(result['status'])   # HTTP status code
    print(result['error'])    # Error message if available
    print(result['costs'])    # Cost breakdown in USD for the page
Request Types
The following request types are supported:

smart (default): Automatically determines whether to use http or chrome based on heuristics, or when JavaScript is needed to render the page.
http: Performs a basic HTTP request. This is the fastest and most cost-efficient option, ideal for pages with static content or simple HTML responses.
chrome: Uses a headless Chrome browser to fetch the page, for pages that need JavaScript rendering or interactions on the page. This may be slower than the http and smart modes.
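To pick a mode explicitly, set the request parameter on the scrape call. This is a sketch built from the parameters shown in the examples above; the scrape helper function is a hypothetical wrapper, and the calls only fire when an API key is configured.

```python
import os
import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

def scrape(url: str, mode: str = "smart") -> dict:
    """Hypothetical helper: scrape one URL with an explicit request type."""
    params = {"url": url, "request": mode, "return_format": "markdown"}
    response = requests.post(
        'https://api.spider.cloud/scrape',
        headers=headers,
        json=params,
    )
    return response.json()

if os.getenv("SPIDER_API_KEY"):  # only call the API when a key is set
    # Static page: plain HTTP is the fastest and cheapest option.
    print(scrape("https://www.example.com", mode="http"))
    # Page that needs JavaScript rendering: force headless Chrome.
    print(scrape("https://www.example.com", mode="chrome"))
```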
Streaming Responses
Use streaming to take full advantage of the crawling speed and process each page as soon as it finishes, instead of waiting for the entire crawl to complete.
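The sketch below shows one way to consume a streamed crawl with the requests library, assuming the endpoint can emit results as newline-delimited JSON when the response body is read incrementally; check the API reference for the exact streaming mechanism. The network call only runs when an API key is configured.

```python
import json
import os
import requests

def parse_stream_line(raw_line: bytes):
    """Decode one newline-delimited JSON result; skip keep-alive blanks."""
    if not raw_line:
        return None
    return json.loads(raw_line)

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}
params = {
    "url": "https://www.example.com",
    "limit": 30,
    "request": "smart",
    "return_format": "markdown",
}

if os.getenv("SPIDER_API_KEY"):  # only call the API when a key is set
    # stream=True makes requests yield the body incrementally instead of
    # buffering the entire crawl result before returning.
    with requests.post(
        'https://api.spider.cloud/crawl',
        headers=headers,
        json=params,
        stream=True,
    ) as response:
        for line in response.iter_lines():
            result = parse_stream_line(line)
            if result is not None:
                # Process each page as it arrives.
                print(result['url'], result['status'])
```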