Spider API Overview
What the API Does
Spider’s API turns any URL into structured data. You send a URL and configuration, and get back page content in the format you need: HTML, markdown, plain text, or structured JSON. The API handles proxy rotation, JavaScript rendering, rate limiting, and anti-bot detection on your behalf.
Core Capabilities
- Proxy rotation: Datacenter, residential, and mobile proxies with automatic failover
- Full concurrency: Crawl thousands of pages per minute using the Rust-based engine
- Smart mode: Automatically picks HTTP or Chrome rendering based on each page’s needs
- Caching: Repeated crawls of the same pages return faster
- Output formats: Markdown, HTML, plain text, or raw bytes
- AI extraction: Pull structured fields from pages using built-in LLM integration
- Anti-bot measures: Browser fingerprinting and proxy rotation to reduce blocks
See the full API reference for every endpoint and parameter.
Endpoints
| Endpoint | Purpose |
|---|---|
| /crawl | Start from a URL, follow links, return content for each page |
| /scrape | Fetch a single URL and return its content |
| /search | Search the web and optionally scrape the results |
| /screenshot | Capture a full-page screenshot as base64 PNG |
| /links | Get the link graph for a URL |
| /pipeline/extract-contacts | Extract contact information from pages |
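For a first feel of the request shape these endpoints share, here is a minimal sketch of a /scrape call with Python's requests library. The single-element response array is an assumption carried over from the /crawl example in the Quick Start below.

```python
import os

import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

# /scrape fetches one URL; unlike /crawl, it does not follow links.
response = requests.post('https://api.spider.cloud/scrape',
    headers=headers,
    json={
        "url": "https://example.com",
        "return_format": "markdown"
    }
)

# Assumption: the response mirrors /crawl's shape, a JSON array of page objects.
for page in response.json():
    print(page['url'])
```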
Quick Start
```python
import os

import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

# Crawl a site and get markdown
response = requests.post('https://api.spider.cloud/crawl',
    headers=headers,
    json={
        "url": "https://example.com",
        "limit": 10,
        "return_format": "markdown",
        "request": "smart"
    }
)

for page in response.json():
    print(f"{page['url']} — {len(page['content'])} chars")
```
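Each element of the returned array is one crawled page. To fail fast on authentication or quota errors, call response.raise_for_status() before iterating.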
Request Modes
The request parameter controls how Spider fetches each page:
| Mode | When to use |
|---|---|
| smart (default) | Automatically picks HTTP or Chrome based on page requirements |
| http | Static pages, sitemaps, APIs. Fastest and cheapest |
| chrome | SPAs, JS-rendered content, pages behind Cloudflare or similar protections |
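When you already know a page needs a real browser, you can pin the mode instead of relying on smart detection. A minimal sketch, reusing the Quick Start request with the mode overridden (the content field is assumed from the Quick Start example):

```python
import os

import requests

# Force Chrome rendering for a JavaScript-heavy page. "http" would be
# the faster, cheaper choice for static content.
response = requests.post('https://api.spider.cloud/crawl',
    headers={
        'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
        'Content-Type': 'application/json',
    },
    json={
        "url": "https://example.com",
        "limit": 1,
        "return_format": "markdown",
        "request": "chrome"
    }
)

for page in response.json():
    print(f"{page['url']}: {len(page['content'])} chars")
```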
Output Formats
Set return_format to control what you get back:
| Format | Output | Best for |
|---|---|---|
| raw | Original HTML | Parsing with your own tools |
| markdown | Clean markdown | LLM ingestion, RAG pipelines |
| text | Plain text | Search indexing, NLP tasks |
| bytes | Raw bytes | Binary content, downloads |
For AI workflows, the markdown conversion strips navigation, ads, and boilerplate, leaving just the main page content.
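To compare formats directly, here is a small sketch that fetches the same page once per text format and prints the size of each; it assumes the single-element /scrape response shape used above.

```python
import os

import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

# Fetch the same page in each text-based format and compare sizes;
# markdown and text are typically much smaller than raw HTML.
for fmt in ("raw", "markdown", "text"):
    response = requests.post('https://api.spider.cloud/scrape',
        headers=headers,
        json={"url": "https://example.com", "return_format": fmt}
    )
    page = response.json()[0]  # assumption: single-element array of page objects
    print(f"{fmt}: {len(page['content'])} chars")
```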
Streaming Large Crawls
For crawls over a few dozen pages, use streaming to process results as they arrive:
```python
import json
import os

import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

response = requests.post('https://api.spider.cloud/crawl',
    headers=headers,
    json={
        "url": "https://example.com",
        "limit": 250,
        "return_format": "markdown"
    },
    stream=True
)

# Each line of the response is one JSON object describing a crawled page.
for line in response.iter_lines(decode_unicode=True):
    if line:
        page = json.loads(line)
        print(f"Crawled: {page['url']} ({page['status']})")
```
Set Content-Type to application/jsonl and enable streaming in your HTTP client (stream=True with requests). See the streaming docs for more details.
SDK Libraries
Official SDKs handle authentication and provide typed methods for all endpoints:
- Python SDK: pip install spider_client
- JavaScript SDK: npm install @spider-cloud/spider-client
- Rust SDK: cargo add spider-client
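A short sketch of the Python SDK; it assumes the client reads SPIDER_API_KEY from the environment and exposes scrape_url and crawl_url methods, so double-check names against the SDK's README.

```python
from spider import Spider

# Assumption: the client falls back to the SPIDER_API_KEY environment
# variable when no key is passed explicitly.
app = Spider()

# Single page as markdown.
page = app.scrape_url('https://example.com', params={'return_format': 'markdown'})

# Small crawl; params mirror the raw /crawl endpoint's JSON body.
pages = app.crawl_url('https://example.com', params={'limit': 5, 'return_format': 'markdown'})
```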
Spider also integrates with LangChain, LlamaIndex, CrewAI, and other AI frameworks as a document loader.
Getting Set Up
- Register or sign in.
- Purchase credits. Pay-as-you-go at $1 per 10,000 credits.
- Create an API key.
- Start making requests.
Track your usage on the usage page.