Key Concepts

The building blocks behind Spider — how crawling, scraping, and data delivery work together.

Crawling vs. Scraping

Spider supports two core operations. Scraping fetches a single page and returns its content. Crawling starts from a URL and follows links to discover and fetch multiple pages across a site. Both operations accept the same parameters for output format, proxy usage, and request mode. See the Scraping and Crawling docs for details on each endpoint.
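As a minimal sketch of the two calls, assuming an API key in the SPIDER_API_KEY environment variable and a /scrape endpoint alongside the /crawl endpoint used in the streaming example below (check the Scraping and Crawling docs for the exact paths and parameters):

```python
import os
import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

# Scrape: fetch a single page and return its content.
scrape = requests.post(
    'https://api.spider.cloud/scrape',  # assumed path; see the Scraping docs
    headers=headers,
    json={'url': 'https://example.com', 'return_format': 'markdown'},
)

# Crawl: start at one URL and follow links, capped by a page limit.
crawl = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json={'url': 'https://example.com', 'limit': 25, 'return_format': 'markdown'},
)

print(scrape.status_code, crawl.status_code)
```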

Request Modes

Every request uses one of three modes. The default smart mode inspects each page and decides whether a lightweight HTTP fetch is enough or if a full Chrome browser is needed for JavaScript rendering.

| Mode | When to use | Speed | Cost | JS Rendering |
| --- | --- | --- | --- | --- |
| smart | Default. Works for most sites. | Fast | Low-Medium | Auto-detected |
| http | Static HTML, APIs, known simple pages. | Fastest | Lowest | No |
| chrome | SPAs, JS-rendered content, bot-protected sites. | Slower | Higher | Yes |
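To pin a request to one mode rather than letting smart mode decide, pass the mode in the request body. A rough sketch, assuming the parameter is named request (confirm the name and accepted values in the Scraping docs):

```python
import os
import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

# Force a full Chrome render instead of letting smart mode decide.
# The parameter name "request" is an assumption; verify it in the docs.
response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json={
        'url': 'https://example.com',
        'limit': 10,
        'request': 'chrome',  # or 'http' / 'smart'
        'return_format': 'markdown',
    },
)
print(response.status_code)
```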

Concurrent Crawling

Spider's Rust-based engine runs crawls with full concurrency — multiple pages are fetched, rendered, and processed in parallel. This means a 500-page crawl does not take 500x longer than a single page. Concurrency is managed server-side, so you don't need to manage thread pools or connection limits in your code. For large crawls, pair concurrency with streaming to process pages as they arrive.

Output Formats

The return_format parameter controls how Spider delivers page content. Markdown is the most common choice for AI workflows.

| Format | What you get | Best for |
| --- | --- | --- |
| raw | Original HTML as returned by the server. | Parsing with your own tools, archiving. |
| markdown | Clean text with structure preserved. Navigation, scripts, and boilerplate stripped. | LLMs, RAG pipelines, content analysis. |
| text | Plain text without any markup. | Simple text extraction, word counts. |
| bytes | Binary data for non-HTML resources. | PDFs, images, file downloads. |
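As a quick comparison, the sketch below fetches the same page as raw HTML and as markdown. The content field name and the list-shaped response are assumptions based on the streaming example, so verify them against a live response:

```python
import os
import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

# Fetch the same page in two formats to compare the output size.
for fmt in ('raw', 'markdown'):
    response = requests.post(
        'https://api.spider.cloud/crawl',
        headers=headers,
        json={'url': 'https://example.com', 'limit': 1, 'return_format': fmt},
    )
    page = response.json()[0]  # assumes a list of page objects with a 'content' field
    print(fmt, len(page.get('content') or ''), 'characters')
```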

Streaming

When streaming is enabled, Spider sends each page as a JSON line the moment it finishes processing, rather than buffering the entire result set. This reduces memory usage, avoids HTTP timeouts, and gives you faster time-to-first-result. See Concurrent Streaming for full examples.

Stream pages as they arrive

```python
import requests, json, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',  # Enable streaming
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json={"url": "https://example.com", "limit": 50, "return_format": "markdown"},
    stream=True,
)

for line in response.iter_lines():
    if line:
        page = json.loads(line)
        print(f"Received: {page['url']} ({page.get('status')})")
```

Screenshot Capabilities

The /screenshot endpoint captures a full-page screenshot of any URL and returns it as a base64-encoded PNG. This is useful for visual regression testing, archiving page appearances, or providing visual context alongside extracted text. Screenshots use Chrome rendering automatically, so they work on JavaScript-heavy pages.
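A minimal sketch of capturing and saving a screenshot, assuming the base64 PNG comes back in a content field (the exact response shape may differ; check the endpoint reference):

```python
import base64
import os
import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

response = requests.post(
    'https://api.spider.cloud/screenshot',
    headers=headers,
    json={'url': 'https://example.com'},
)

# The 'content' field holding the base64 PNG is an assumption.
data = response.json()
page = data[0] if isinstance(data, list) else data
with open('example.png', 'wb') as f:
    f.write(base64.b64decode(page['content']))
```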

AI-Powered Extraction

Spider can extract structured data from pages using AI models. Pass a gpt_config object with a prompt describing the fields you want, and Spider returns structured JSON instead of raw page content. This is ideal for pulling product details, contact information, or any repeatable data pattern from pages without writing CSS selectors. See JSON Scraping for more.
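A hedged sketch of an extraction request; the gpt_config field names and the model value shown here are assumptions, so consult JSON Scraping for the supported options:

```python
import os
import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

# Ask Spider to return structured fields instead of raw page content.
# The 'prompt' and 'model' keys are assumptions; see JSON Scraping.
response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json={
        'url': 'https://example.com/products',
        'limit': 5,
        'gpt_config': {
            'prompt': 'Extract the product name, price, and availability as JSON.',
            'model': 'gpt-4o-mini',
        },
    },
)
for page in response.json():
    print(page.get('url'))
```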

Credits System

Usage is measured in credits at a rate of $1 / 10,000 credits. Each crawled page costs a base amount, with additional credits for features like Chrome rendering, proxy usage, and AI extraction. The cost breakdown is included in every API response under the costs field, so you can track spend per request. View your balance and usage history on the usage page.
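To inspect spend programmatically, read the costs field from each returned page. The exact shape of that object is not documented here, so treat this as an illustrative sketch:

```python
import os
import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json={'url': 'https://example.com', 'limit': 10, 'return_format': 'markdown'},
)

# Each page carries its own cost breakdown under 'costs'; the fields inside
# that object are not assumed here, so print it and inspect a live response.
for page in response.json():
    print(page.get('url'), page.get('costs'))
```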