Scraping and Crawling
The /scrape and /crawl endpoints are the core of Spider's API. Scraping fetches a single URL and returns its content. Crawling starts from a URL, discovers linked pages, and returns content for each. Both support the same output formats, proxy settings, and request modes.
Scraping a Single Page
Use the /scrape endpoint to fetch one page. Set return_format to control the output: markdown works well for LLM pipelines, while raw returns the original HTML. Use any of our SDK libraries or call the API directly:
Single Page Scrape Using API in Python
import requests
import os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

params = {
    "url": "https://www.example.com",
    "request": "smart",           # Automatically decides which mode to use
    "return_format": "markdown",  # LLM-friendly format
    "proxy_enabled": True
}

response = requests.post(
    'https://api.spider.cloud/scrape',  # scrape endpoint
    headers=headers,
    json=params
)

print(response.json())

Crawling Multiple Pages
The /crawl endpoint starts from a URL and follows internal links to discover pages across a site. Set limit to cap the number of pages returned, and depth to control how many link-levels deep the crawler goes. Always set a reasonable limit when testing — most sites have thousands of pages.
Crawling Pages Using API in Python
import requests
import os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

params = {
    "url": "https://www.example.com",
    "limit": 30,   # Maximum number of pages to crawl
    "depth": 3,    # Reasonable depth for small sites
    "request": "smart",
    "return_format": "markdown"
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params
)

print(response.json())

Example Response from Crawling and Scraping
import requests
import os

# ... truncated code

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params
)

for result in response.json():
    print(result['content'])  # Main content in the requested format
    print(result['url'])      # URL of the page
    print(result['status'])   # HTTP status code
    print(result['error'])    # Error message, if any
    print(result['costs'])    # Cost breakdown in USD for the page

Request Types
Control how Spider fetches each page using the request parameter.
smart (default): Automatically chooses between http and chrome based on heuristics, switching to chrome when JavaScript is needed to render the page.
http: Performs a basic HTTP request. This is the fastest and most cost-efficient option, ideal for pages with static content or simple HTML responses.
chrome: Uses a headless Chrome browser to fetch the page. Choose this for pages that need JavaScript rendering or on-page interactions. It is typically slower than the http and smart modes; a sketch forcing this mode is shown below.
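For example, a JavaScript-heavy page can be forced through the browser path by setting request to chrome on a scrape call. This is a minimal sketch reusing the same headers and documented parameters from the examples above:

import requests
import os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

# Force headless Chrome for a page that needs JavaScript to render
params = {
    "url": "https://www.example.com",
    "request": "chrome",          # Skip the heuristics and always render with Chrome
    "return_format": "markdown"
}

response = requests.post('https://api.spider.cloud/scrape', headers=headers, json=params)
print(response.json())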
Streaming Responses
Use streaming to process pages as they finish crawling instead of waiting for the entire result set.
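The exact switch for enabling streaming is covered in the streaming reference. As a rough client-side sketch, assuming the crawl body arrives as newline-delimited JSON once streaming is enabled (an assumption, not a documented guarantee), requests' stream=True lets you handle each page as it arrives:

import requests
import os
import json

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

# Sketch: assumes one JSON object per line when streaming is enabled
# (see the streaming reference for the exact request parameter).
with requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json={"url": "https://www.example.com", "limit": 30, "return_format": "markdown"},
    stream=True,  # Let requests yield the body incrementally instead of buffering it
) as response:
    for line in response.iter_lines():
        if not line:
            continue
        page = json.loads(line)  # One page object per line (assumed format)
        print(page["url"], page["status"])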
Response Fields
Each page in the response array contains the following fields.
| Field | Type | Description |
|---|---|---|
| url | string | The URL that was crawled or scraped. |
| status | number | HTTP status code returned by the target page (e.g., 200, 404, 500). |
| content | string | The page content in the requested format (HTML, markdown, text, or base64 for screenshots). |
| error | string \| null | Error message if the page failed to load. Null on success. |
| costs | object | Cost breakdown for this request in USD. |
| costs.total_cost | number | Total cost of this request. |
| costs.total_cost_formatted | string | Human-readable formatted total cost. |
| costs.ai_cost | number | Cost of AI processing (extraction, labeling, etc.). |
| costs.bytes_transferred_cost | number | Cost based on the amount of data transferred. |
| costs.compute_cost | number | Cost of compute resources used (browser rendering, etc.). |
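As a small illustration, the per-page cost fields can be aggregated into a total spend for a crawl. This sketch continues from the /crawl examples above and assumes the list response shown in the example below:

# Aggregate the documented per-page cost fields into a crawl-level total
pages = response.json()  # `response` from a /crawl call, as in the examples above
total_usd = sum(page["costs"]["total_cost"] for page in pages)
print(f"Crawled {len(pages)} pages for ${total_usd:.6f} total")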
Example Response
[
  {
    "url": "https://example.com",
    "status": 200,
    "content": "...",
    "error": null,
    "costs": {
      "ai_cost": 0.0,
      "ai_cost_formatted": "0",
      "bytes_transferred_cost": 3.165e-9,
      "bytes_transferred_cost_formatted": "0.0000000032",
      "compute_cost": 0.0,
      "compute_cost_formatted": "0",
      "file_cost": 0.000029,
      "file_cost_formatted": "0.0000290000",
      "total_cost": 0.000029,
      "total_cost_formatted": "0.0000290000",
      "transform_cost": 0.0,
      "transform_cost_formatted": "0"
    }
  }
]

Handling Errors
Some pages in a crawl will fail: 404s, timeouts, bot blocks. The API call itself still returns HTTP 200, but individual pages in the array may carry a non-200 status value or a non-null error field. Always check per-page status codes. See Error Codes for the full reference.
Handle per-page errors in a crawl
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json={"url": "https://example.com", "limit": 20, "return_format": "markdown"}
)

# Check the HTTP status of the Spider API request itself
if response.status_code != 200:
    print(f"API error: {response.status_code} - {response.text}")
else:
    data = response.json()
    succeeded = [p for p in data if p.get('status') == 200]
    failed = [p for p in data if p.get('status') != 200]
    print(f"Crawled {len(data)} pages: {len(succeeded)} succeeded, {len(failed)} failed")
    for page in failed:
        print(f"  {page['url']} - status {page.get('status')} - {page.get('error') or 'no details'}")
    for page in succeeded:
        content = page.get('content', '')
        if len(content) > 50:  # Skip near-empty pages
            print(f"  {page['url']} - {len(content)} chars")