
Scraping and crawling

/scrape fetches a single URL and returns its content. /crawl starts from a URL, follows internal links, and returns content for each page. Both share the same output formats, proxy settings, and request modes.

Single page

Scrape one URL

Send a URL to /scrape and pick a return_format (markdown for LLMs, raw for the original HTML). Call the API directly or use any of our SDK libraries.

import os
import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

params = {
    "url": "https://www.example.com",
    "request": "smart",           # auto-pick HTTP vs Chrome
    "return_format": "markdown",  # LLM-friendly format
    "proxy_enabled": True,
}

response = requests.post(
    'https://api.spider.cloud/scrape',
    headers=headers,
    json=params,
)

print(response.json())

Multiple pages

Crawl from a URL

/crawl starts from a URL and follows internal links. limit caps how many pages come back; depth controls how many link-levels deep the crawler goes. Set a small limit when testing, because many sites have thousands of pages.

import os
import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

params = {
    "url": "https://www.example.com",
    "limit": 30,   # max pages
    "depth": 3,    # link-hops from the seed
    "request": "smart",
    "return_format": "markdown",
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params,
)

print(response.json())

# Iterate over each page in the response
for result in response.json():
    print(result['content'])  # body in the requested format
    print(result['url'])      # final URL
    print(result['status'])   # HTTP status
    print(result['error'])    # error message if any
    print(result['costs'])    # cost breakdown for this page

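To make the interaction between limit and depth concrete, here is a minimal sketch of a bounded breadth-first walk over a small in-memory link graph. It is an illustration only, not Spider's crawler: the SITE graph and the bounded_crawl function are hypothetical, and it assumes the seed counts as depth 0, which may not match how the API counts levels.

```python
from collections import deque

# Hypothetical link graph standing in for a real site's internal links.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
    "/a/1": ["/a/1/x"],
    "/a/2": [],
    "/b/1": [],
    "/a/1/x": [],
}


def bounded_crawl(links, seed, limit, depth):
    """Breadth-first walk that stops at `limit` pages or `depth` link-hops."""
    if limit < 1:
        return []
    visited = [seed]
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        page, hops = queue.popleft()
        if hops == depth:
            continue  # do not follow links past the depth bound
        for target in links.get(page, []):
            if target in seen:
                continue
            if len(visited) >= limit:
                return visited  # page cap reached
            seen.add(target)
            visited.append(target)
            queue.append((target, hops + 1))
    return visited
```

With this toy graph, depth alone bounds how far the walk reaches, while limit can cut it short before the depth bound is hit. Whichever bound is reached first ends the walk.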
Reference

Request types

The request parameter controls how Spider fetches each page.

smart (default)
Auto-picks between HTTP and Chrome based on what the page actually needs.
http (fast)
Static HTML only. Fastest and cheapest.
chrome (JS / SPA)
Full browser rendering. Use for SPAs or bot-protected sites.
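A small helper can catch a mistyped request mode before any call is made. The sketch below is a hypothetical convenience wrapper, not part of any Spider SDK: build_params and REQUEST_MODES are illustrative names, and the set of modes is taken only from the three listed above.

```python
# The three request modes described above.
REQUEST_MODES = {"smart", "http", "chrome"}


def build_params(url, request="smart", return_format="markdown", **extra):
    """Assemble a request body, rejecting unknown request modes early."""
    if request not in REQUEST_MODES:
        raise ValueError(
            f"unknown request mode {request!r}; "
            f"expected one of {sorted(REQUEST_MODES)}"
        )
    params = {"url": url, "request": request, "return_format": return_format}
    params.update(extra)  # e.g. limit, depth, proxy_enabled
    return params
```

The returned dictionary can be passed as the json argument of requests.post in the examples above, for instance build_params("https://www.example.com", request="http", limit=5).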

Streaming responses

Use streaming to process pages the moment they finish crawling instead of waiting for the entire result set. Set Content-Type: application/jsonl and read the response line by line.
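Reading the response line by line might look like the sketch below. It assumes each non-empty line is one JSON object carrying the same per-page fields as the regular response; that shape is an assumption, as are the helper names iter_jsonl and stream_crawl. The requests calls used (stream=True, iter_lines, raise_for_status) are standard features of that library.

```python
import json


def iter_jsonl(lines):
    """Yield one decoded object per non-empty JSONL line (bytes or str)."""
    for line in lines:
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        line = line.strip()
        if line:
            yield json.loads(line)


def stream_crawl(params, api_key):
    """Post to /crawl and yield pages as each line arrives."""
    import requests  # imported lazily so iter_jsonl works without the dependency

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/jsonl",
    }
    with requests.post(
        "https://api.spider.cloud/crawl",
        headers=headers,
        json=params,
        stream=True,  # read the body incrementally instead of buffering it
    ) as response:
        response.raise_for_status()
        yield from iter_jsonl(response.iter_lines())
```

A caller would then loop over stream_crawl(params, os.getenv("SPIDER_API_KEY")) and handle each page as it is yielded, rather than waiting for the full array.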

Response fields

Each page in the response array carries these fields.

Field                           Type             Description
url                             string           The URL that was crawled or scraped.
status                          number           HTTP status code from the target page (200, 404, 500, …).
content                         string           Page content in the requested format: HTML, markdown, text, or base64 for screenshots.
error                           string | null    Error message if the page failed to load. null on success.
costs                           object           Cost breakdown for this request in USD.
costs.total_cost                number           Total cost of this request.
costs.total_cost_formatted      string           Human-readable formatted total.
costs.ai_cost                   number           AI processing cost (extraction, labeling, etc.).
costs.bytes_transferred_cost    number           Cost based on data transferred.
costs.compute_cost              number           Cost of compute resources (browser rendering, etc.).

[
  {
    "url": "https://example.com",
    "status": 200,
    "content": "<html>...</html>",
    "error": null,
    "costs": {
      "ai_cost": 0.0,
      "ai_cost_formatted": "0",
      "bytes_transferred_cost": 3.165e-9,
      "bytes_transferred_cost_formatted": "0.0000000032",
      "compute_cost": 0.0,
      "compute_cost_formatted": "0",
      "file_cost": 0.000029,
      "file_cost_formatted": "0.0000290000",
      "total_cost": 0.000029,
      "total_cost_formatted": "0.0000290000",
      "transform_cost": 0.0,
      "transform_cost_formatted": "0"
    }
  }
]
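Because every page carries its own costs object, the spend for a whole crawl can be added up on the client. The following is a minimal sketch under the assumption that the numeric cost fields are present as shown in the sample response; summarize_costs is a hypothetical helper, and missing or null values are treated as zero.

```python
def summarize_costs(pages):
    """Sum the numeric per-page cost components across a response array."""
    keys = ("total_cost", "ai_cost", "bytes_transferred_cost", "compute_cost")
    totals = dict.fromkeys(keys, 0.0)
    for page in pages:
        costs = page.get("costs") or {}  # tolerate a missing or null costs object
        for key in keys:
            totals[key] += costs.get(key) or 0.0
    return totals
```

Calling summarize_costs(response.json()) after a crawl returns a dictionary of summed values in USD, which can be logged next to the page count.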

Handling errors

Some pages in a crawl will fail: 404s, timeouts, bot blocks. The API still returns an HTTP 200, but individual pages in the array may carry non-200 status values or an error field. Check per-page status before reading content. See Error Codes for the full reference.

import os
import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json={"url": "https://example.com", "limit": 20, "return_format": "markdown"},
)

# Check the Spider API request itself
if response.status_code != 200:
    print(f"API error: {response.status_code} - {response.text}")
else:
    data = response.json()

    # Split pages by their per-page HTTP status
    succeeded = [p for p in data if p.get('status') == 200]
    failed = [p for p in data if p.get('status') != 200]

    print(f"Crawled {len(data)} pages - {len(succeeded)} ok, {len(failed)} failed")

    for page in failed:
        # error is null on success, so fall back explicitly when it is empty
        detail = page.get('error') or 'no details'
        print(f"  {page['url']} - status {page.get('status')} - {detail}")

    for page in succeeded:
        # content may be missing or null; treat both as empty
        content = page.get('content') or ''
        if len(content) > 50:
            print(f"  {page['url']} - {len(content)} chars")
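For larger crawls it can help to see failures grouped by status code rather than listed one by one. This sketch is one possible way to do that: split_results is a hypothetical helper, and treating a page with status 200 but a non-empty error as failed is an assumption, not documented API behavior.

```python
from collections import defaultdict


def split_results(pages):
    """Separate usable pages from failures, grouping failed URLs by status."""
    ok = []
    failed = defaultdict(list)
    for page in pages:
        status = page.get("status")
        if status == 200 and not page.get("error"):
            ok.append(page)
        else:
            failed[status].append(page.get("url"))
    return ok, dict(failed)
```

The second return value maps each status code to the URLs that produced it, which makes it easy to spot, for example, a run of 404s from one section of a site.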