Web Crawling API
Point Spider at any website and it will recursively discover and collect every page. Follow links across an entire domain, capture content in your preferred format, and stream results back as they're found, all from a single API call.
How It Works
Submit a URL
Send one or more seed URLs to the crawl endpoint. Spider begins by loading each page and identifying every link on it.
Recursive Discovery
Spider follows links within the same domain, expanding outward until it reaches your configured depth or page limit. Duplicate URLs are automatically skipped.
Structured Output
Each page is returned in your chosen format (markdown, plain text, HTML, or raw bytes) with optional metadata, links, and headers included.
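The three steps above map onto a single request. As a minimal sketch (using the Python client shown in the Code Examples section below, with the API key taken from the environment as in the cURL example), a crawl is one call whose results you iterate as pages are collected:

from spider import Spider

client = Spider()  # API key read from the environment (SPIDER_API_KEY)

# One seed URL in, structured pages out; link discovery and
# deduplication happen server-side.
pages = client.crawl(
    "https://example.com",
    params={"return_format": "markdown", "limit": 25},
)

for page in pages:
    print(page["url"])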
Without Spider
- Build and maintain your own crawler infrastructure
- Handle URL deduplication, rate limiting, and politeness yourself
- Parse HTML and extract meaningful content from messy markup
- Manage headless browsers, proxies, and JavaScript rendering
With Spider
- One POST request to crawl an entire website
- Automatic dedup, robots.txt compliance, and smart rate control
- Clean markdown or text output ready for AI pipelines
- Built-in JS rendering, proxy rotation, and anti-bot handling
Key Capabilities
Depth & Page Limits
Control how deep the crawler goes with the depth parameter and cap total pages with limit. Set both to zero for unlimited crawling.
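For example, the sketch below (same Python client as in the Code Examples section) caps a crawl at three link-hops and 200 pages; setting both values to 0 instead would remove the caps.

from spider import Spider

client = Spider()

# Stop after 3 hops from the seed URL or 200 collected pages,
# whichever comes first.
pages = client.crawl(
    "https://example.com",
    params={"depth": 3, "limit": 200, "return_format": "markdown"},
)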
Multiple Output Formats
Get results as clean markdown, raw HTML, plain text, or bytes. Markdown output strips navigation, ads, and boilerplate so content is ready for LLM ingestion.
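The format is a single parameter. The sketch below requests raw HTML instead of the markdown used elsewhere on this page:

from spider import Spider

client = Spider()

# return_format accepts markdown, html, text, or bytes.
pages = client.crawl(
    "https://example.com",
    params={"return_format": "html", "limit": 50},
)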
Smart Request Modes
Choose between HTTP-only for speed, Chrome rendering for JavaScript-heavy sites, or Smart mode that picks the right approach automatically.
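The mode is set with the request parameter listed in the table below. Forcing Chrome rendering looks like this sketch; omitting the parameter leaves the default smart mode in place:

from spider import Spider

client = Spider()

# Force Chrome rendering for JavaScript-heavy pages; use "http" for
# plain fetches, or omit the parameter to let "smart" decide.
pages = client.crawl(
    "https://spa.example.com",
    params={"request": "chrome", "return_format": "markdown", "limit": 20},
)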
Subdomain & TLD Expansion
Extend crawling beyond the seed domain. Include subdomains like docs.example.com or follow links to related TLDs.
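A sketch of widening the crawl scope. The subdomains and tld flag names below are assumptions (they are not in the parameter table on this page), so confirm them against the full API reference:

from spider import Spider

client = Spider()

# Assumed flag names: "subdomains" to include hosts like
# docs.example.com, "tld" to follow links to related top-level
# domains. Verify both in the API reference.
pages = client.crawl(
    "https://example.com",
    params={"subdomains": True, "tld": True, "limit": 300},
)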
Content Chunking
Automatically segment output by words, lines, characters, or sentences. Perfect for fitting content into embedding model context windows.
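The sketch below asks for word-based chunks. The chunking parameter name and value shape are assumptions (they do not appear in the table on this page), so treat them as illustrative and check the API reference:

from spider import Spider

client = Spider()

# Hypothetical chunking config: split each page's text into roughly
# 256-word segments for embedding. Confirm the real parameter name
# and shape in the API reference.
pages = client.crawl(
    "https://example.com/docs",
    params={
        "return_format": "text",
        "limit": 100,
        "chunking_alg": {"type": "bywords", "size": 256},
    },
)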
CSS & XPath Selectors
Target specific elements on every page using CSS or XPath selectors via css_extraction_map. Extract only the data you need.
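Pulling just titles and prices from every page might look like the sketch below; the exact shape expected by css_extraction_map is an assumption here, so check the API reference for the real schema:

from spider import Spider

client = Spider()

# Assumed css_extraction_map shape: field names mapped to CSS
# selectors, applied to every crawled page.
pages = client.crawl(
    "https://shop.example.com",
    params={
        "limit": 200,
        "css_extraction_map": {
            "title": "h1.product-title",
            "price": "span.price",
        },
    },
)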
Metadata & Headers
Collect page titles, descriptions, keywords, HTTP response headers, and cookies alongside content. Enable with simple boolean flags.
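metadata appears in the parameter table below; the header and cookie flag names in this sketch are assumptions, so confirm them in the API reference:

from spider import Spider

client = Spider()

# "metadata" is documented in the table below; "return_headers" and
# "return_cookies" are assumed names for the response-header and
# cookie flags.
pages = client.crawl(
    "https://example.com",
    params={
        "limit": 50,
        "metadata": True,
        "return_headers": True,
        "return_cookies": True,
    },
)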
External Domain Linking
Treat additional domains as part of the same crawl with external_domains. Supports exact matches and regex patterns.
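A sketch of pulling additional domains into the same crawl; the value shape (a list mixing exact domains and regex patterns) is an assumption beyond what is stated above, so verify it in the API reference:

from spider import Spider

client = Spider()

# Treat blog.example.org (exact match) and any *.example.net host
# (regex) as in-scope for this crawl. Value shape assumed.
pages = client.crawl(
    "https://example.com",
    params={
        "limit": 400,
        "external_domains": ["blog.example.org", ".*\\.example\\.net"],
    },
)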
Budget Controls
Set credit budgets per crawl or per page to cap spending. The crawler stops when the limit is reached so there are no surprises on your bill.
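The parameter names in this sketch are purely illustrative: per-crawl and per-page credit caps are described above, but their names are not in the table below, so take the real ones from the full API reference:

from spider import Spider

client = Spider()

# Hypothetical budget parameters: cap the whole crawl at 1,000
# credits and any single page at 5. Replace with the real names
# from the API reference.
pages = client.crawl(
    "https://example.com",
    params={
        "limit": 0,  # rely on the credit budget rather than a page count
        "crawl_budget_credits": 1000,
        "max_credits_per_page": 5,
    },
)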
Code Examples
Python

from spider import Spider

client = Spider()

# Crawl up to 500 pages, return markdown
pages = client.crawl(
    "https://example.com",
    params={
        "return_format": "markdown",
        "limit": 500,
        "depth": 10,
        "metadata": True,
    }
)

for page in pages:
    print(page["url"], len(page["content"]))

cURL

curl -X POST https://api.spider.cloud/crawl \
  -H "Authorization: Bearer $SPIDER_API_KEY" \
  -H "Content-Type: application/jsonl" \
  -d '{
    "url": "https://example.com",
    "limit": 100,
    "return_format": "markdown",
    "metadata": true,
    "return_page_links": true
  }'

JavaScript

import Spider from "@spider-cloud/spider-client";

const client = new Spider();

const pages = await client.crawl("https://example.com", {
  return_format: "markdown",
  limit: 500,
  depth: 10,
  metadata: true,
});

pages.forEach(page => console.log(page.url));

Common Parameters
| Parameter | Type | Description |
|---|---|---|
| url | string | The starting URL to crawl. Comma-separate for multiple seed URLs. |
| limit | integer | Maximum number of pages to collect. Defaults to 0 (unlimited). |
| depth | integer | How many link-hops from the seed URL. Default 25. |
| return_format | string | Output format: markdown, html, text, or bytes. |
| request | string | Rendering mode: http, chrome, or smart (default). |
| metadata | boolean | Include page title, description, and keywords in the response. |
See the full API reference for all available parameters, including proxy configuration, caching, and network filtering.
Popular Crawling Use Cases
AI Training Datasets
Crawl documentation sites, blogs, and knowledge bases to build high-quality training corpora. Markdown output feeds directly into LLM fine-tuning pipelines.
RAG Knowledge Bases
Keep retrieval-augmented generation systems current by periodically crawling source websites. Use chunking to produce embedding-ready segments.
Content Migration
Migrate an entire website to a new CMS by crawling all pages and extracting clean content with metadata intact.
Competitive Analysis
Index competitor websites to understand their content strategy, product catalog, or pricing structure across hundreds of pages.
Ready to crawl the web?
Start collecting web content at scale in minutes. No infrastructure to manage.