Crawl an entire site from one URL.
Spider follows links across the domain — respecting your depth and page limits — and streams clean content back as each page completes. Markdown, HTML, plain text, or bytes.
- Pages / sec: 100K+
- Depth: ∞
- Req / min: 10K
- Formats: 5
One seed. Hundreds of pages.
Each depth level discovers new pages by following links from the previous level.
Three steps from URL to data.
Submit a seed URL
Send one or more starting URLs. Spider loads each page and identifies every link on it.
Recursive discovery
Links within the domain are followed until your depth or page limits are reached. Duplicates are automatically skipped.
Stream structured output
Each discovered page is returned in your chosen format — markdown, HTML, text, or bytes — with optional metadata, links, and headers.
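The three steps above can be sketched as a breadth-first traversal. This is an illustration of the general technique, not Spider's implementation: the in-memory `SITE` map, the `crawl` function, and the choice to count depth as link hops from the seed are all assumptions made for the example. Zero means "unlimited" for both `depth` and `limit`, as described under crawl control on this page.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "site": each URL maps to the links found on that page.
SITE = {
    "https://example.com/": [
        "https://example.com/a",
        "https://example.com/b",
        "https://other.org/",  # off-domain, never followed
    ],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}


def crawl(seed, depth=0, limit=0, get_links=lambda url: SITE.get(url, [])):
    """Breadth-first crawl that yields (url, level) as each page completes.

    depth: maximum link hops from the seed (0 = unlimited, an assumed convention)
    limit: maximum number of pages to return (0 = unlimited)
    """
    domain = urlparse(seed).netloc
    seen = {seed}                 # dedup: every URL is queued at most once
    queue = deque([(seed, 0)])
    count = 0
    while queue and (limit == 0 or count < limit):
        url, level = queue.popleft()
        count += 1
        yield url, level          # stream the page before visiting the next one
        if depth != 0 and level >= depth:
            continue              # depth cap reached: do not follow this page's links
        for link in get_links(url):
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, level + 1))


for url, level in crawl("https://example.com/", depth=2, limit=10):
    print(level, url)
```

Because the function is a generator, a caller can start processing the first page while later levels are still being discovered, which mirrors the streaming behaviour described above.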
What you stop owning.
Without Spider
- Build and maintain crawler infrastructure
- Handle dedup, rate limits, and politeness
- Parse HTML and extract content manually
- Manage browsers, proxies, JS rendering
With Spider
- One POST request to crawl an entire site
- Auto dedup, robots.txt, smart rate control
- Clean markdown or text for AI pipelines
- Built-in JS rendering, proxy rotation, anti-bot
What you can tune.
Crawl control
Depth & page limits
Control how deep the crawler goes with depth and cap total pages with limit. Set both to zero for unlimited.
Smart request modes
Choose HTTP-only for speed, Chrome for JS-heavy sites, or Smart mode that picks automatically.
Subdomain & TLD expansion
Extend crawling beyond the seed domain. Include subdomains like docs.example.com or follow links to related TLDs.
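As a rough sketch, the crawl-control options above could be collected into one params dictionary before a request is sent. `limit`, `depth`, `request`, and `return_format` appear in the parameter list on this page; the `subdomains` and `tld` keys, the mode values `"http"`, `"chrome"`, and `"smart"`, and the `build_crawl_params` helper itself are assumptions for illustration, so check the API reference for the exact names.

```python
# Assumed values for the "request" parameter; confirm against the API reference.
REQUEST_MODES = {"http", "chrome", "smart"}


def build_crawl_params(limit=0, depth=0, request="smart",
                       subdomains=False, tld=False):
    """Hypothetical helper that validates and assembles crawl-control params."""
    if limit < 0 or depth < 0:
        raise ValueError("limit and depth must be zero (unlimited) or positive")
    if request not in REQUEST_MODES:
        raise ValueError(f"unknown request mode: {request!r}")
    return {
        "limit": limit,            # cap on total pages, 0 = unlimited
        "depth": depth,            # cap on crawl depth, 0 = unlimited
        "request": request,        # HTTP-only, Chrome, or automatic selection
        "subdomains": subdomains,  # assumed key: include hosts like docs.example.com
        "tld": tld,                # assumed key: follow links to related TLDs
        "return_format": "markdown",
    }


params = build_crawl_params(limit=500, depth=3, request="chrome", subdomains=True)
print(params)
```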
Output & extraction
Multiple output formats
Markdown, raw HTML, plain text, or bytes. Markdown strips nav, ads, and boilerplate for LLM-ready content.
Content chunking
Segment output by words, lines, characters, or sentences. Fit content into embedding model context windows.
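To make the idea concrete, here is a minimal local sketch of word-based and sentence-based chunking. It illustrates the technique only and does not show how Spider segments content; the function names and the simple punctuation-based sentence split are assumptions.

```python
import re


def chunk_by_words(text, size):
    """Split text into chunks of at most `size` whitespace-separated words."""
    if size <= 0:
        raise ValueError("size must be positive")
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def chunk_by_sentences(text, size):
    """Split text into chunks of at most `size` sentences.

    Sentences are detected naively: a split happens at whitespace that
    follows '.', '!' or '?'. Abbreviations will be split incorrectly.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [" ".join(sentences[i:i + size]) for i in range(0, len(sentences), size)]


print(chunk_by_words("one two three four five", 2))
print(chunk_by_sentences("First point. Second point! Third point?", 2))
```

Word counts are only a proxy for model tokens, so a chunk size chosen this way should leave some headroom below the embedding model's actual context limit.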
CSS & XPath selectors
Target specific elements on every page with css_extraction_map. Extract only the data you need.
Data & controls
Metadata & headers
Collect page titles, descriptions, keywords, HTTP headers, and cookies alongside content.
External domain linking
Treat additional domains as part of the same crawl with external_domains. Supports exact matches and regex.
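One way to picture the scope check behind `external_domains` is the sketch below. The page does not say how exact entries and regex entries are told apart, so this example assumes plain strings are exact host names and compiled patterns are regexes; the `in_scope` function is hypothetical.

```python
import re
from urllib.parse import urlparse


def in_scope(url, seed_host, external_domains=()):
    """Return True if `url` belongs to the seed host or an allowed external domain.

    external_domains may mix plain strings (exact host match) and compiled
    regular expressions (full match against the host). This convention is an
    assumption for illustration.
    """
    host = urlparse(url).hostname or ""
    if host == seed_host:
        return True
    for entry in external_domains:
        if isinstance(entry, re.Pattern):
            if entry.fullmatch(host):
                return True
        elif host == entry:
            return True
    return False


allowed = ["cdn.example.net", re.compile(r".*\.example\.org")]
for candidate in (
    "https://example.com/pricing",
    "https://cdn.example.net/logo.png",
    "https://docs.example.org/guide",
    "https://unrelated.io/",
):
    print(candidate, in_scope(candidate, "example.com", allowed))
```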
Budget controls
Set credit budgets per crawl or per page to cap spending. The crawler stops when the limit is reached.
cURL, Python, Node.
```python
from spider import Spider

client = Spider()

# Crawl up to 500 pages, return markdown
pages = client.crawl(
    "https://example.com",
    params={
        "return_format": "markdown",
        "limit": 500,
        "depth": 10,
        "metadata": True,
    },
)

# Print each page's URL and the length of its content
for page in pages:
    print(page["url"], len(page["content"]))
```
- url: string
- limit: int
- depth: int
- return_format: string
- request: string
- metadata: bool
See the full API reference for all available parameters including proxy configuration, caching, and network filtering.
Where teams reach for it.
AI training datasets
Crawl documentation sites, blogs, and knowledge bases to build high-quality training corpora. Markdown output feeds directly into LLM fine-tuning pipelines.
RAG knowledge bases
Keep retrieval-augmented generation systems current by periodically crawling source websites. Use chunking to produce embedding-ready segments.
Content migration
Migrate an entire website to a new CMS by crawling all pages and extracting clean content with metadata intact.
Competitive analysis
Index competitor websites to understand their content strategy, product catalog, or pricing structure across hundreds of pages.
Ready to crawl the web?
Start collecting web content at scale in minutes. No infrastructure to manage.