Key Concepts
The building blocks behind Spider — how crawling, scraping, and data delivery work together.
Crawling vs. Scraping
Spider supports two core operations. Scraping fetches a single page and returns its content. Crawling starts from a URL and follows links to discover and fetch multiple pages across a site. Both operations accept the same parameters for output format, proxy usage, and request mode. See the Scraping and Crawling docs for details on each endpoint.
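As a rough sketch of the difference, here is the same target fetched once as a scrape and once as a crawl. The /scrape path shown here is an assumption; the /crawl endpoint and the return_format and limit parameters appear elsewhere in this section, and the Scraping and Crawling docs have the authoritative details.

```python
import os
import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

# Scrape: fetch a single page (the /scrape path is an assumption; see the Scraping docs).
single = requests.post(
    'https://api.spider.cloud/scrape',
    headers=headers,
    json={"url": "https://example.com", "return_format": "markdown"},
)

# Crawl: start at the same URL and follow links, capped by `limit`.
site = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json={"url": "https://example.com", "limit": 25, "return_format": "markdown"},
)
print(single.status_code, site.status_code)
```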
Request Modes
Every request uses one of three modes. The default smart mode inspects each page and decides whether a lightweight HTTP fetch is enough or if a full Chrome browser is needed for JavaScript rendering.
| Mode | When to use | Speed | Cost | JS Rendering |
|---|---|---|---|---|
| smart | Default. Works for most sites. | Fast | Low-Medium | Auto-detected |
| http | Static HTML, APIs, known simple pages. | Fastest | Lowest | No |
| chrome | SPAs, JS-rendered content, bot-protected sites. | Slower | Higher | Yes |
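A minimal sketch of forcing a specific mode, assuming the mode is passed as a request-body field (shown here as request); check the endpoint docs for the exact field name and accepted values.

```python
import os
import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

# Force full Chrome rendering for a JavaScript-heavy site. The field name
# "request" is an assumption; the mode values mirror the table above.
resp = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json={
        "url": "https://example.com",
        "limit": 10,
        "return_format": "markdown",
        "request": "chrome",
    },
)
print(resp.status_code)
```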
Concurrent Crawling
Spider's Rust-based engine runs crawls with full concurrency — multiple pages are fetched, rendered, and processed in parallel. This means a 500-page crawl does not take 500x longer than a single page. Concurrency is handled server-side, so you don't need to manage thread pools or connection limits in your own code. For large crawls, pair concurrency with streaming to process pages as they arrive.
Output Formats
The return_format parameter controls how Spider delivers page content. Markdown is the most common choice for AI workflows.
| Format | What you get | Best for |
|---|---|---|
| raw | Original HTML as returned by the server. | Parsing with your own tools, archiving. |
| markdown | Clean text with structure preserved. Navigation, scripts, and boilerplate stripped. | LLMs, RAG pipelines, content analysis. |
| text | Plain text without any markup. | Simple text extraction, word counts. |
| bytes | Binary data for non-HTML resources. | PDFs, images, file downloads. |
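As a quick sketch, here is the same crawl issued with two different return_format values. The response shape assumed below (a JSON array of page objects with a content field) is not spelled out in this section, so check the Crawling docs for the exact structure.

```python
import os
import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

# Fetch the same page twice, changing only return_format.
for fmt in ("markdown", "raw"):
    resp = requests.post(
        'https://api.spider.cloud/crawl',
        headers=headers,
        json={"url": "https://example.com", "limit": 1, "return_format": fmt},
    )
    # Assumed response shape: a JSON array of page objects with a "content" field.
    page = resp.json()[0]
    print(fmt, len(page.get("content") or ""))
```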
Streaming
When streaming is enabled, Spider sends each page as a JSON line the moment it finishes processing, rather than buffering the entire result set. This reduces memory usage, avoids HTTP timeouts, and gives you faster time-to-first-result. See Concurrent Streaming for full examples.
Stream pages as they arrive
```python
import json
import os

import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',  # Enable streaming (JSON Lines)
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json={"url": "https://example.com", "limit": 50, "return_format": "markdown"},
    stream=True,
)

# Each line is a complete JSON object for one finished page.
for line in response.iter_lines():
    if line:
        page = json.loads(line)
        print(f"Received: {page['url']} ({page.get('status')})")
```

Screenshot Capabilities
The /screenshot endpoint captures a full-page screenshot of any URL and returns it as a base64-encoded PNG. This is useful for visual regression testing, archiving page appearances, or providing visual context alongside extracted text. Screenshots use Chrome rendering automatically, so they work on JavaScript-heavy pages.
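A minimal sketch of capturing and saving a screenshot; the response field holding the base64 PNG is an assumption here (shown as content), so check the screenshot docs for the real key.

```python
import base64
import os
import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

resp = requests.post(
    'https://api.spider.cloud/screenshot',
    headers=headers,
    json={"url": "https://example.com"},
)

# Assumed response shape: a JSON array of results with the base64 PNG under "content".
shot = resp.json()[0]
with open("example.png", "wb") as f:
    f.write(base64.b64decode(shot["content"]))
```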
AI-Powered Extraction
Spider can extract structured data from pages using AI models. Pass a gpt_config object with a prompt describing the fields you want, and Spider returns structured JSON instead of raw page content. This is ideal for pulling product details, contact information, or any repeatable data pattern from pages without writing CSS selectors. See JSON Scraping for more.
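A short sketch of an extraction request. The gpt_config object and its prompt field come from the description above; the example URL is a placeholder, and the JSON Scraping docs cover the full set of gpt_config options.

```python
import os
import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

resp = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json={
        "url": "https://example.com/products",  # placeholder URL
        "limit": 1,
        # Describe the fields you want; Spider returns structured JSON instead
        # of raw page content. Other gpt_config options are in the JSON Scraping docs.
        "gpt_config": {
            "prompt": "Extract the product name, price, and availability as JSON."
        },
    },
)
print(resp.json())
```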
Credits System
Usage is measured in credits at a rate of $1 per 10,000 credits. Each crawled page costs a base amount, with additional credits for features like Chrome rendering, proxy usage, and AI extraction. The cost breakdown is included in every API response under the costs field, so you can track spend per request. View your balance and usage history on the usage page.
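A rough sketch of per-request cost tracking, assuming the non-streaming response is a JSON array of page objects that each carry a costs entry; the field names inside that object may differ, so treat this as illustrative.

```python
import os
import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

resp = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json={"url": "https://example.com", "limit": 5, "return_format": "markdown"},
)

# Assumed response shape: each page object includes a "costs" breakdown.
for page in resp.json():
    print(page.get("url"), page.get("costs"))

# At $1 per 10,000 credits, a crawl that consumed 2,500 credits costs $0.25.
```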