Concepts

How crawling, scraping, and delivery fit together — and the knobs you'll reach for most often.

Core operations

Crawling vs. scraping

Spider supports two core operations. Scraping fetches a single page and returns its content. Crawling starts from a URL and follows links to discover and fetch multiple pages across a site. Both accept the same parameters for output format, proxy usage, and request mode. See Scraping and Crawling for endpoint specifics.
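As a rough sketch of the difference, the two operations can share one payload shape and differ only in the endpoint. The /crawl URL appears in the streaming example on this page; the /scrape path, the `build_request` helper, and the rule that `limit` applies only to crawls are assumptions for illustration, so check the endpoint references before relying on them.

```python
import os

API_BASE = "https://api.spider.cloud"  # base URL taken from the /crawl example on this page


def build_request(operation, url, return_format="markdown", limit=None):
    """Return (endpoint, headers, payload) for a scrape or crawl call.

    `operation` is "scrape" (single page) or "crawl" (follow links).
    The /scrape path is an assumption; only /crawl appears on this page.
    """
    if operation not in ("scrape", "crawl"):
        raise ValueError(f"unknown operation: {operation!r}")
    headers = {
        "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY', '')}",
        "Content-Type": "application/json",
    }
    payload = {"url": url, "return_format": return_format}
    # A page limit only makes sense when links are being followed.
    if operation == "crawl" and limit is not None:
        payload["limit"] = limit
    return f"{API_BASE}/{operation}", headers, payload


scrape_endpoint, _, scrape_payload = build_request("scrape", "https://example.com")
crawl_endpoint, _, crawl_payload = build_request("crawl", "https://example.com", limit=50)
```

Each tuple can be passed to an HTTP client, for example `requests.post(endpoint, headers=headers, json=payload)`.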

Request modes

Every request uses one of three modes. The default smart inspects each page and picks between a lightweight HTTP fetch and a full Chrome browser based on what the page actually needs.

| Mode | When to use | Speed | Cost | JS rendering |
|---|---|---|---|---|
| smart | Default. Works for most sites. | Fast | Low – medium | Auto-detected |
| http | Static HTML, APIs, known simple pages. | Fastest | Lowest | No |
| chrome | SPAs, JS-rendered content, bot-protected sites. | Slower | Higher | Yes |
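
The mode table can be read as a small decision rule. The sketch below encodes it in a hypothetical `pick_mode` helper. The `request` payload key is an assumption: this page names the three modes but not the parameter that carries them.

```python
VALID_MODES = ("smart", "http", "chrome")  # the three modes listed on this page


def pick_mode(needs_js=None, bot_protected=False):
    """Map what is known about a site to one of the three request modes.

    needs_js is True, False, or None when unknown.
    """
    if bot_protected or needs_js is True:
        return "chrome"  # SPAs, JS-rendered content, bot-protected sites
    if needs_js is False:
        return "http"    # static HTML, APIs, known simple pages
    return "smart"       # unknown: let Spider decide per page


def with_mode(payload, mode):
    """Return a copy of payload with the mode set under an assumed `request` key."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return {**payload, "request": mode}


payload = with_mode({"url": "https://example.com"}, pick_mode(needs_js=False))
```

Leaving the mode at smart is usually the safer starting point; pinning http or chrome is worth it mainly when the site's behavior is already known.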

Concurrent crawling

The Rust engine runs crawls with full concurrency. Pages are fetched, rendered, and processed in parallel, so a 500-page crawl doesn't take 500× longer than a single page. Concurrency is managed server-side, so there are no thread pools or connection limits to wire up. For large jobs, pair concurrency with streaming so you can process pages the moment they arrive.

Output

Output formats

The return_format parameter controls how Spider delivers page content. Markdown is the default for AI workloads — structure preserved, navigation and ads stripped, clean LLM context at a fraction of the token cost of raw HTML.

| Format | What you get | Best for |
|---|---|---|
| raw | Original HTML as returned by the server. | Parsing with your own tools, archiving. |
| markdown | Clean text with structure preserved. Navigation, scripts, and boilerplate stripped. | LLMs, RAG pipelines, content analysis. |
| text | Plain text without any markup. | Simple text extraction, word counts. |
| bytes | Binary data for non-HTML resources. | PDFs, images, file downloads. |
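
Because `return_format` is a plain string, a typo can slip into a request unnoticed. The following sketch validates the value client-side against the four formats listed here before building a request body. The `crawl_payload` helper is hypothetical, and the API may well accept values this page does not mention.

```python
# Formats and their typical content, copied from the format list on this page.
RETURN_FORMATS = {
    "raw": "original HTML",
    "markdown": "clean structured text for LLMs",
    "text": "plain text",
    "bytes": "binary data for non-HTML resources",
}


def crawl_payload(url, return_format="markdown", limit=None):
    """Build a request body, rejecting formats not listed on this page."""
    if return_format not in RETURN_FORMATS:
        raise ValueError(
            f"return_format must be one of {sorted(RETURN_FORMATS)}, got {return_format!r}"
        )
    payload = {"url": url, "return_format": return_format}
    if limit is not None:
        payload["limit"] = limit
    return payload


rag_payload = crawl_payload("https://example.com", "markdown", limit=50)
pdf_payload = crawl_payload("https://example.com/report.pdf", "bytes")
```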

Streaming

With streaming on, Spider returns each page as a JSON line the moment it finishes instead of buffering the full result set. This lowers memory use, avoids HTTP timeouts on long crawls, and shortens time to first result. See Concurrent Streaming for full examples.

    import json
    import os

    import requests

    headers = {
        'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
        'Content-Type': 'application/jsonl',  # Enable streaming
    }

    response = requests.post(
        'https://api.spider.cloud/crawl',
        headers=headers,
        json={"url": "https://example.com", "limit": 50, "return_format": "markdown"},
        stream=True,  # Read the body incrementally instead of buffering it
    )
    response.raise_for_status()  # Fail early on auth or request errors

    # Each non-empty line is one crawled page encoded as JSON
    for line in response.iter_lines():
        if line:
            page = json.loads(line)
            print(f"Received: {page['url']} ({page.get('status')})")
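
The streaming loop above assumes every line is valid JSON. A slightly more defensive variant, sketched here as a hypothetical `iter_pages` helper, skips blank and malformed lines instead of raising. It accepts any iterable of bytes or strings, so it can wrap `response.iter_lines()` directly. Whether silently skipping bad lines is acceptable depends on the job; logging them may be the better choice.

```python
import json


def iter_pages(lines):
    """Yield one dict per JSON line, skipping blank or malformed lines."""
    for raw in lines:
        if not raw:
            continue  # iter_lines() can yield empty keep-alive lines
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8", errors="replace")
        try:
            page = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed line: skip rather than abort the whole crawl
        if isinstance(page, dict):
            yield page


# Simulated stream; with requests this would be response.iter_lines().
sample = [
    b'{"url": "https://example.com/", "status": 200}',
    b"",
    b"not json",
    b'{"url": "https://example.com/about", "status": 200}',
]
pages = list(iter_pages(sample))
```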

Screenshots

The /screenshot endpoint captures full-page or viewport-sized images as PNG, JPEG, or WebP, returned as base64 or raw binary. It is useful for visual regression tests, archiving page appearances, or pairing visual context with extracted text. The endpoint always uses Chrome rendering, so JavaScript-heavy pages render correctly.
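
When the image comes back as base64, it has to be decoded before it is written to disk. The sketch below decodes a base64 string and picks a file extension from the image's leading magic bytes. Where the base64 string sits in the response is not specified on this page, so the hypothetical `decode_screenshot` helper takes the string directly.

```python
import base64


def decode_screenshot(b64_data):
    """Decode a base64 image and return (extension, raw_bytes)."""
    raw = base64.b64decode(b64_data)
    if raw.startswith(b"\x89PNG\r\n\x1a\n"):
        ext = "png"
    elif raw.startswith(b"\xff\xd8\xff"):
        ext = "jpg"
    elif raw[:4] == b"RIFF" and raw[8:12] == b"WEBP":
        ext = "webp"
    else:
        ext = "bin"  # unrecognized: keep the bytes, do not guess a type
    return ext, raw


# Demo with a fake payload that carries only the PNG signature plus padding.
fake_b64 = base64.b64encode(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8).decode("ascii")
ext, image_bytes = decode_screenshot(fake_b64)
# To save the result: open(f"page.{ext}", "wb").write(image_bytes)
```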

Advanced

AI extraction

Spider can pull structured data from pages with AI. With an AI Studio subscription, describe the fields you want and Spider returns structured JSON instead of raw content. This works well for product details, contact info, or any repeatable shape, with no CSS selectors required. See JSON Scraping for the parameter reference.

Credits

Usage is measured in credits at $1 per 10,000 credits. Each crawled page has a base cost; Chrome rendering, proxy usage, and AI extraction add on top. Failed requests, timeouts, and blocked pages cost zero. Every response includes a costs field with a per-request breakdown — view live balance and history on the usage page.
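
The stated rate of $1 per 10,000 credits makes conversions simple arithmetic: one credit is $0.0001. The sketch below converts in both directions using `Decimal` to avoid floating-point drift. Per-page base costs are not given on this page, so the example works only with credit totals, and both helper names are hypothetical.

```python
from decimal import Decimal

USD_PER_CREDIT = Decimal(1) / Decimal(10_000)  # $1 per 10,000 credits


def credits_to_usd(credits):
    """Convert a credit total to US dollars."""
    return Decimal(credits) * USD_PER_CREDIT


def usd_to_credits(usd):
    """Convert a dollar budget to whole credits (rounded down)."""
    return int(Decimal(str(usd)) / USD_PER_CREDIT)


spend = credits_to_usd(25_000)   # 25,000 credits at $0.0001 each
budget = usd_to_credits(5)       # credits that a $5 budget buys
```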