Web Crawling API
Point Spider at any website and it will recursively discover and collect every page. Follow links across an entire domain, capture content in your preferred format, and stream results back as they're found, all from a single API call.
How It Works
Submit a URL
Send one or more seed URLs to the crawl endpoint. Spider begins by loading each page and identifying every link on it.
Recursive Discovery
Spider follows links within the same domain, expanding outward until it reaches your configured depth or page limit. Duplicate URLs are automatically skipped.
Structured Output
Each page is returned in your chosen format (markdown, plain text, HTML, or raw bytes) with optional metadata, links, and headers included.
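The three steps above map onto a single request. As a minimal sketch (using the Python client shown in the Code Examples section below, with the API key taken from the environment as in the cURL example), a crawl is one call whose results you iterate as pages are collected:

from spider import Spider

client = Spider()  # API key read from the environment (SPIDER_API_KEY)

# One seed URL in, structured pages out; link discovery and
# deduplication happen server-side.
pages = client.crawl(
    "https://example.com",
    params={"return_format": "markdown", "limit": 25},
)

for page in pages:
    print(page["url"])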
Without Spider
- Build and maintain your own crawler infrastructure
- Handle URL deduplication, rate limiting, and politeness yourself
- Parse HTML and extract meaningful content from messy markup
- Manage headless browsers, proxies, and JavaScript rendering
With Spider
- One POST request to crawl an entire website
- Automatic dedup, robots.txt compliance, and smart rate control
- Clean markdown or text output ready for AI pipelines
- Built-in JS rendering, proxy rotation, and anti-bot handling
Key Capabilities
Depth & Page Limits
Control how deep the crawler goes with the depth parameter and cap total pages with limit. Set both to zero for unlimited crawling.
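For example, the sketch below (same Python client as in the Code Examples section) caps a crawl at three link-hops and 200 pages; setting both values to 0 instead would remove the caps.

from spider import Spider

client = Spider()

# Stop after 3 hops from the seed URL or 200 collected pages,
# whichever comes first.
pages = client.crawl(
    "https://example.com",
    params={"depth": 3, "limit": 200, "return_format": "markdown"},
)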
Multiple Output Formats
Get results as clean markdown, raw HTML, plain text, or bytes. Markdown output strips navigation, ads, and boilerplate so content is ready for LLM ingestion.
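The format is a single parameter. The sketch below requests raw HTML instead of the markdown used elsewhere on this page:

from spider import Spider

client = Spider()

# return_format accepts markdown, html, text, or bytes.
pages = client.crawl(
    "https://example.com",
    params={"return_format": "html", "limit": 50},
)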
Smart Request Modes
Choose between HTTP-only for speed, Chrome rendering for JavaScript-heavy sites, or Smart mode that picks the right approach automatically.
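The mode is set with the request parameter listed in the table below. Forcing Chrome rendering looks like this sketch; omitting the parameter leaves the default smart mode in place:

from spider import Spider

client = Spider()

# Force Chrome rendering for JavaScript-heavy pages; use "http" for
# plain fetches, or omit the parameter to let "smart" decide.
pages = client.crawl(
    "https://spa.example.com",
    params={"request": "chrome", "return_format": "markdown", "limit": 20},
)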
Subdomain & TLD Expansion
Extend crawling beyond the seed domain. Include subdomains like docs.example.com or follow links to related TLDs.
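A sketch of widening the crawl scope. The subdomains and tld flag names below are assumptions (they are not in the parameter table on this page), so confirm them against the full API reference:

from spider import Spider

client = Spider()

# Assumed flag names: "subdomains" to include hosts like
# docs.example.com, "tld" to follow links to related top-level
# domains. Verify both in the API reference.
pages = client.crawl(
    "https://example.com",
    params={"subdomains": True, "tld": True, "limit": 300},
)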
Content Chunking
Automatically segment output by words, lines, characters, or sentences. Perfect for fitting content into embedding model context windows.
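The sketch below asks for word-based chunks. The chunking parameter name and value shape are assumptions (they do not appear in the table on this page), so treat them as illustrative and check the API reference:

from spider import Spider

client = Spider()

# Hypothetical chunking config: split each page's text into roughly
# 256-word segments for embedding. Confirm the real parameter name
# and shape in the API reference.
pages = client.crawl(
    "https://example.com/docs",
    params={
        "return_format": "text",
        "limit": 100,
        "chunking_alg": {"type": "bywords", "size": 256},
    },
)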
CSS & XPath Selectors
Target specific elements on every page using CSS or XPath selectors via css_extraction_map. Extract only the data you need.
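Pulling just titles and prices from every page might look like the sketch below; the exact shape expected by css_extraction_map is an assumption here, so check the API reference for the real schema:

from spider import Spider

client = Spider()

# Assumed css_extraction_map shape: field names mapped to CSS
# selectors, applied to every crawled page.
pages = client.crawl(
    "https://shop.example.com",
    params={
        "limit": 200,
        "css_extraction_map": {
            "title": "h1.product-title",
            "price": "span.price",
        },
    },
)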
Metadata & Headers
Collect page titles, descriptions, keywords, HTTP response headers, and cookies alongside content. Enable with simple boolean flags.
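metadata appears in the parameter table below; the header and cookie flag names in this sketch are assumptions, so confirm them in the API reference:

from spider import Spider

client = Spider()

# "metadata" is documented in the table below; "return_headers" and
# "return_cookies" are assumed names for the response-header and
# cookie flags.
pages = client.crawl(
    "https://example.com",
    params={
        "limit": 50,
        "metadata": True,
        "return_headers": True,
        "return_cookies": True,
    },
)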
External Domain Linking
Treat additional domains as part of the same crawl with external_domains. Supports exact matches and regex patterns.
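A sketch of pulling additional domains into the same crawl; the value shape (a list mixing exact domains and regex patterns) is an assumption beyond what is stated above, so verify it in the API reference:

from spider import Spider

client = Spider()

# Treat blog.example.org (exact match) and any *.example.net host
# (regex) as in-scope for this crawl. Value shape assumed.
pages = client.crawl(
    "https://example.com",
    params={
        "limit": 400,
        "external_domains": ["blog.example.org", ".*\\.example\\.net"],
    },
)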
Budget Controls
Set credit budgets per crawl or per page to cap spending. The crawler stops when the limit is reached so there are no surprises on your bill.
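The parameter names in this sketch are purely illustrative: per-crawl and per-page credit caps are described above, but their names are not in the table below, so take the real ones from the full API reference:

from spider import Spider

client = Spider()

# Hypothetical budget parameters: cap the whole crawl at 1,000
# credits and any single page at 5. Replace with the real names
# from the API reference.
pages = client.crawl(
    "https://example.com",
    params={
        "limit": 0,  # rely on the credit budget rather than a page count
        "crawl_budget_credits": 1000,
        "max_credits_per_page": 5,
    },
)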
Code Examples
Python

from spider import Spider

client = Spider()

# Crawl up to 500 pages, return markdown
pages = client.crawl(
    "https://example.com",
    params={
        "return_format": "markdown",
        "limit": 500,
        "depth": 10,
        "metadata": True,
    }
)

for page in pages:
    print(page["url"], len(page["content"]))

cURL

curl -X POST https://api.spider.cloud/crawl \
  -H "Authorization: Bearer $SPIDER_API_KEY" \
  -H "Content-Type: application/jsonl" \
  -d '{
    "url": "https://example.com",
    "limit": 100,
    "return_format": "markdown",
    "metadata": true,
    "return_page_links": true
  }'

JavaScript

import Spider from "@spider-cloud/spider-client";

const client = new Spider();

const pages = await client.crawl("https://example.com", {
  return_format: "markdown",
  limit: 500,
  depth: 10,
  metadata: true,
});

pages.forEach(page => console.log(page.url));

Common Parameters
| Parameter | Type | Description |
|---|---|---|
| url | string | The starting URL to crawl. Comma-separate for multiple seed URLs. |
| limit | integer | Maximum number of pages to collect. Defaults to 0 (unlimited). |
| depth | integer | How many link-hops from the seed URL. Default 25. |
| return_format | string | Output format: markdown, html, text, or bytes. |
| request | string | Rendering mode: http, chrome, or smart (default). |
| metadata | boolean | Include page title, description, and keywords in the response. |
See the full API reference for all available parameters, including proxy configuration, caching, and network filtering.
Popular Crawling Use Cases
AI Training Datasets
Crawl documentation sites, blogs, and knowledge bases to build high-quality training corpora. Markdown output feeds directly into LLM fine-tuning pipelines.
RAG Knowledge Bases
Keep retrieval-augmented generation systems current by periodically crawling source websites. Use chunking to produce embedding-ready segments.
Content Migration
Migrate an entire website to a new CMS by crawling all pages and extracting clean content with metadata intact.
Competitive Analysis
Index competitor websites to understand their content strategy, product catalog, or pricing structure across hundreds of pages.
Ready to crawl the web?
Start collecting web content at scale in minutes. No infrastructure to manage.