Scraping and Crawling
Scraping a Single Page
Once you have an API key, use any of our SDK libraries or call the API directly. We recommend testing with a variety of URLs and assessing the responses you get back. The following settings are a good starting point:
Single Page Scrape Using API in Python
import requests
import os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

params = {
    "url": "https://www.example.com",
    "request": "smart",  # Automatically decides which mode to use
    "return_format": "markdown",  # LLM-friendly format
    "proxy_enabled": True
}

response = requests.post(
    'https://api.spider.cloud/scrape',  # scrape endpoint: fetches a single page
    headers=headers,
    json=params
)

print(response.json())
Crawling
To crawl from a starting URL, use the crawl endpoint, or the crawl_url method in the SDK library (see the SDK sketch after the API example below). Set a maximum limit on the number of pages to crawl. When testing against a website that may have thousands of links, a conservative limit (e.g. 30 pages) keeps the run fast and inexpensive before you scale up.
Crawling Pages Using API in Python
import requests
import os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

params = {
    "url": "https://www.example.com",
    "limit": 30,  # Maximum number of pages to crawl
    "depth": 3,   # Reasonable depth for small sites
    "request": "smart",
    "return_format": "markdown"
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params
)

print(response.json())
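If you prefer the SDK, the same crawl can be expressed with the crawl_url method. A minimal sketch, assuming the Python SDK is installed as the spider package and exposes a Spider client class (the package and class names here are assumptions; crawl_url is the method named above):

import os
from spider import Spider  # assumed package/class names for the Python SDK

app = Spider(api_key=os.getenv("SPIDER_API_KEY"))

# crawl_url mirrors the /crawl endpoint parameters shown above.
results = app.crawl_url(
    "https://www.example.com",
    params={
        "limit": 30,
        "depth": 3,
        "request": "smart",
        "return_format": "markdown",
    },
)

print(results)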
Example Response
Example Response from Crawling and Scraping
import requests
import os

# ... truncated code

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params
)

# The endpoint returns a JSON array with one object per crawled page.
for result in response.json():
    print(result['content'])  # Main HTML content or text
    print(result['url'])      # URL of the page
    print(result['status'])   # HTTP status code
    print(result['error'])    # Error message if available
    print(result['costs'])    # Cost breakdown in USD for the page
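The per-page status and error fields make it easy to separate successful pages from failures before further processing. A minimal sketch that continues from the response above and writes each page's content to disk; the output directory name and the filename scheme are hypothetical, for illustration only:

import pathlib

out_dir = pathlib.Path("crawl_output")  # hypothetical output directory
out_dir.mkdir(exist_ok=True)

for result in response.json():
    if result['status'] != 200 or result['error']:
        print(f"Skipping {result['url']}: {result['error']}")
        continue
    # Derive a simple filename from the last URL path segment (naive).
    name = result['url'].rstrip('/').split('/')[-1] or 'index'
    (out_dir / f"{name}.md").write_text(result['content'], encoding='utf-8')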
Request Types
smart (default): Automatically determines whether to use http or chrome based on heuristics, switching to chrome when JavaScript is needed to render the page.
http: Performs a basic HTTP request. This is the fastest and most cost-efficient option, ideal for pages with static content or simple HTML responses.
chrome: Uses a headless Chrome browser to fetch the page, for pages that need JavaScript rendering or on-page interactions. This is typically slower than the http and smart modes.
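To force a particular mode instead of relying on the default, set the request parameter explicitly. A minimal sketch, reusing the headers from the earlier examples:

# Force a plain HTTP fetch for a page known to be static.
static_params = {
    "url": "https://www.example.com",
    "request": "http",
    "return_format": "markdown"
}

# Force headless Chrome for a JavaScript-heavy page.
js_params = {
    "url": "https://www.example.com",
    "request": "chrome",
    "return_format": "markdown"
}

response = requests.post(
    'https://api.spider.cloud/scrape',
    headers=headers,
    json=static_params
)
print(response.json())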
Streaming Responses
Use streaming to take full advantage of the crawling speed and process each page's result as soon as it finishes, rather than waiting for the whole crawl to complete. Check out the docs on streaming.
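A minimal sketch of client-side streaming with requests, assuming the endpoint emits newline-delimited JSON objects as pages complete; the exact wire format and any required parameters are covered in the streaming docs:

import json
import os

import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

params = {
    "url": "https://www.example.com",
    "limit": 30,
    "request": "smart",
    "return_format": "markdown"
}

# stream=True tells requests not to buffer the full body before returning.
with requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params,
    stream=True,
) as response:
    for line in response.iter_lines():
        if not line:
            continue  # skip keep-alive blank lines
        page = json.loads(line)  # assumes one JSON object per line
        print(page['url'], page['status'])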