Scraping and Crawling

The /scrape and /crawl endpoints are the core of Spider's API. Scraping fetches a single URL and returns its content. Crawling starts from a URL, discovers linked pages, and returns content for each. Both support the same output formats, proxy settings, and request modes.

Scraping a Single Page

Use the /scrape endpoint to fetch one page. Set return_format to control the output: markdown works well for LLM pipelines, while raw gives you the original HTML. Use any of our SDK libraries or call the API directly:

Single Page Scrape Using API in Python

import requests
import os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

params = {
    "url": "https://www.example.com",
    "request": "smart",           # Automatically decides which mode to use
    "return_format": "markdown",  # LLM friendly format
    "proxy_enabled": True
}

response = requests.post(
    'https://api.spider.cloud/scrape',  # set to scrape endpoint
    headers=headers,
    json=params
)

print(response.json())

Crawling Multiple Pages

The /crawl endpoint starts from a URL and follows internal links to discover pages across a site. Set limit to cap the number of pages returned, and depth to control how many link-levels deep the crawler goes. Always set a reasonable limit when testing — most sites have thousands of pages.

Crawling Pages Using API in Python

import requests
import os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

params = {
    "url": "https://www.example.com",
    "limit": 30,   # Maximum number of pages to crawl
    "depth": 3,    # Reasonable depth for small sites
    "request": "smart",
    "return_format": "markdown"
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params
)

print(response.json())

Example Response from Crawling and Scraping

import requests
import os

# ... truncated code

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params
)

for result in response.json():
    print(result['content'])  # Main HTML content or text
    print(result['url'])      # URL of the page
    print(result['status'])   # HTTP Status code
    print(result['error'])    # Error message if available
    print(result['costs'])    # Cost breakdown in USD for the page

Request Types

Control how Spider fetches each page using the request parameter.

  • smart (default): Automatically chooses between http and chrome based on heuristics, switching to chrome when JavaScript is needed to render the page.
  • http: Performs a basic HTTP request. This is the fastest and most cost-efficient option, ideal for static content or simple HTML responses.
  • chrome: Uses a headless Chrome browser to fetch the page. Use it for pages that need JavaScript rendering or on-page interactions; it is slower than the http and smart modes. A sketch showing both forced modes follows this list.
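
If you already know a site is fully static, forcing http avoids browser overhead; if it relies on client-side rendering, chrome guarantees the page is rendered before extraction. A minimal sketch, reusing the headers from the examples above and sending the same URL with each mode for comparison:

import requests
import os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

# Force a plain HTTP fetch for a static page (fastest, cheapest)
static_params = {
    "url": "https://www.example.com",
    "request": "http",
    "return_format": "markdown",
}

# Force headless Chrome for a JavaScript-heavy page
dynamic_params = {
    "url": "https://www.example.com",
    "request": "chrome",
    "return_format": "markdown",
}

for params in (static_params, dynamic_params):
    response = requests.post('https://api.spider.cloud/scrape', headers=headers, json=params)
    print(params["request"], response.status_code)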

Streaming Responses

Use streaming to process pages as they finish crawling instead of waiting for the entire result set.
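
The sketch below is a client-side sketch only: it assumes the /crawl endpoint emits one JSON object per line when results are streamed, and uses the requests library's stream=True option so pages can be handled as they arrive rather than after the full crawl completes. Check the API reference for the exact option that enables streaming on the server side.

import requests
import os
import json

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

params = {
    "url": "https://www.example.com",
    "limit": 30,
    "return_format": "markdown",
}

# stream=True tells requests not to buffer the whole body before returning
with requests.post('https://api.spider.cloud/crawl', headers=headers, json=params, stream=True) as response:
    for line in response.iter_lines():
        if not line:
            continue
        page = json.loads(line)  # assumes one JSON object per line (see note above)
        print(page.get('url'), page.get('status'))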

Response Fields

Each page in the response array contains the following fields.

  • url (string): The URL that was crawled or scraped.
  • status (number): HTTP status code returned by the target page (e.g., 200, 404, 500).
  • content (string): The page content in the requested format (HTML, markdown, text, or base64 for screenshots).
  • error (string | null): Error message if the page failed to load. Null on success.
  • costs (object): Cost breakdown for this request in USD.
  • costs.total_cost (number): Total cost of this request.
  • costs.total_cost_formatted (string): Human-readable formatted total cost.
  • costs.ai_cost (number): Cost of AI processing (extraction, labeling, etc.).
  • costs.bytes_transferred_cost (number): Cost based on the amount of data transferred.
  • costs.compute_cost (number): Cost of compute resources used (browser rendering, etc.).

Example Response

[ { "url": "https://example.com", "status": 200, "content": "...", "error": null, "costs": { "ai_cost": 0.0, "ai_cost_formatted": "0", "bytes_transferred_cost": 3.165e-9, "bytes_transferred_cost_formatted": "0.0000000032", "compute_cost": 0.0, "compute_cost_formatted": "0", "file_cost": 0.000029, "file_cost_formatted": "0.0000290000", "total_cost": 0.000029, "total_cost_formatted": "0.0000290000", "transform_cost": 0.0, "transform_cost_formatted": "0" } } ]

Handling Errors

Some pages in a crawl will fail — 404s, timeouts, bot blocks. The API still returns a 200 HTTP response, but individual pages in the array may have non-200 status values or an error field. Always check per-page status codes. See Error Codes for the full reference.

Handle per-page errors in a crawl

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json={"url": "https://example.com", "limit": 20, "return_format": "markdown"}
)

# Check the HTTP status of the Spider API request itself
if response.status_code != 200:
    print(f"API error: {response.status_code} — {response.text}")
else:
    data = response.json()
    succeeded = [p for p in data if p.get('status') == 200]
    failed = [p for p in data if p.get('status') != 200]

    print(f"Crawled {len(data)} pages: {len(succeeded)} succeeded, {len(failed)} failed")

    for page in failed:
        print(f"  {page['url']} — status {page.get('status')} — {page.get('error', 'no details')}")

    for page in succeeded:
        content = page.get('content', '')
        if len(content) > 50:  # Skip near-empty pages
            print(f"  {page['url']} — {len(content)} chars")