Crawl API POST /crawl

Crawl an entire site from one URL.

Spider follows links across the domain, respecting your depth and page limits, and streams clean content back as each page completes. Output is available as Markdown, HTML, plain text, or bytes.

example.com depth 2
example.com
├── /about
├── /blog
│ ├── /blog/post-1
│ ├── /blog/post-2
│ └── /blog/post-3
├── /docs
│ ├── /docs/getting-started
│ └── /docs/api-reference
├── /pricing
└── /contact
→ 10 pages discovered
Pages / sec: 100K+
Depth
Req / min: 10K
Formats: 5
Recursive expansion

One seed. Hundreds of pages.

Each depth level discovers new pages by following links from the previous level.

  • Depth 0: 1 page (seed URL)
  • Depth 1: 8 pages (first links)
  • Depth 2: 47 pages (two hops)
  • Depth 3: 200+ pages (full discovery)
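
The level-by-level expansion can be sketched as a breadth-first traversal. This is a minimal illustration over an in-memory link graph loosely modeled on the example.com tree, not Spider's actual implementation; the `LINKS` graph and the `pages_per_depth` helper are hypothetical names introduced here.

```python
from typing import Dict, List

# Hypothetical link graph: each page maps to the links found on it.
LINKS: Dict[str, List[str]] = {
    "/": ["/about", "/blog", "/docs", "/pricing", "/contact"],
    "/blog": ["/blog/post-1", "/blog/post-2", "/blog/post-3", "/"],
    "/docs": ["/docs/getting-started", "/docs/api-reference", "/blog"],
}


def pages_per_depth(seed: str, max_depth: int) -> List[int]:
    """Return how many new pages are discovered at each depth level."""
    seen = {seed}
    frontier = [seed]
    counts = [1]  # depth 0 is the seed itself
    for _ in range(max_depth):
        next_frontier = []
        for page in frontier:
            for link in LINKS.get(page, []):
                if link not in seen:  # skip pages already discovered
                    seen.add(link)
                    next_frontier.append(link)
        counts.append(len(next_frontier))
        frontier = next_frontier
    return counts


print(pages_per_depth("/", 2))  # prints [1, 5, 5]
```

Each level only counts pages not seen at an earlier level, which is why links back to `/` or `/blog` do not inflate the totals.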
How it works

Three steps from URL to data.

01. Submit a seed URL

Send one or more starting URLs. Spider loads each page and identifies every link on it.

02. Recursive discovery

Links within the domain are followed until your depth or page limits are reached. Duplicates are automatically skipped.

03. Stream structured output

Each discovered page is returned in your chosen format — markdown, HTML, text, or bytes — with optional metadata, links, and headers.
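
The discovery step can be illustrated with a small link filter. This is a sketch under stated assumptions, not the service's own logic: it resolves relative links, drops fragments, keeps only links on the same host, and skips duplicates. The `discover_links` function is a hypothetical helper built on Python's standard `urllib.parse` module.

```python
from typing import Iterable, List
from urllib.parse import urldefrag, urljoin, urlparse


def discover_links(page_url: str, hrefs: Iterable[str]) -> List[str]:
    """Resolve hrefs against page_url and keep unique same-host links."""
    host = urlparse(page_url).netloc
    seen = set()
    result = []
    for href in hrefs:
        # Make the link absolute, then strip any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # ignore mailto:, javascript:, and similar
        if parsed.netloc != host:
            continue  # stay within the seed's host
        if absolute not in seen:  # duplicates are skipped
            seen.add(absolute)
            result.append(absolute)
    return result


links = discover_links(
    "https://example.com/blog",
    ["/about", "post-1", "/about#team", "https://other.com/x", "mailto:hi@example.com"],
)
print(links)  # ['https://example.com/about', 'https://example.com/post-1']
```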

Before / after

What you stop owning.

Without Spider

  • Build and maintain crawler infrastructure
  • Handle dedup, rate limits, and politeness
  • Parse HTML and extract content manually
  • Manage browsers, proxies, JS rendering

With Spider

  • One POST request to crawl an entire site
  • Auto dedup, robots.txt, smart rate control
  • Clean markdown or text for AI pipelines
  • Built-in JS rendering, proxy rotation, anti-bot
Capabilities

What you can tune.

Crawl control

Depth & page limits

Control how deep the crawler goes with depth and cap total pages with limit. Set both to zero for unlimited.
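
A minimal sketch of how the two limits might combine, assuming that zero means "no limit" as described above. The `within_limits` helper is hypothetical and only illustrates the semantics; it is not part of the API.

```python
def within_limits(pages_crawled: int, link_depth: int,
                  depth: int = 0, limit: int = 0) -> bool:
    """Return True if a page found at link_depth may still be crawled.

    Assumption: depth=0 or limit=0 disables that particular limit.
    """
    depth_ok = depth == 0 or link_depth <= depth
    limit_ok = limit == 0 or pages_crawled < limit
    return depth_ok and limit_ok


print(within_limits(pages_crawled=499, link_depth=3, depth=10, limit=500))  # True
print(within_limits(pages_crawled=500, link_depth=3, depth=10, limit=500))  # False
```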

Smart request modes

Choose HTTP-only for speed, Chrome for JS-heavy sites, or Smart mode that picks automatically.

Subdomain & TLD expansion

Extend crawling beyond the seed domain. Include subdomains like docs.example.com or follow links to related TLDs.

Output & extraction

Multiple output formats

Markdown, raw HTML, plain text, or bytes. Markdown strips nav, ads, and boilerplate for LLM-ready content.

Content chunking

Segment output by words, lines, characters, or sentences. Fit content into embedding model context windows.
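
To make the idea concrete, here is a rough local equivalent of word-based chunking. It is a sketch only: the `chunk_by_words` function is a hypothetical helper, and the service's own chunking may split text differently.

```python
from typing import List


def chunk_by_words(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


print(chunk_by_words("one two three four five", 2))
# prints ['one two', 'three four', 'five']
```

Choosing `max_words` below the embedding model's input limit keeps every chunk usable without truncation.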

CSS & XPath selectors

Target specific elements on every page with css_extraction_map. Extract only the data you need.

Data & controls

Metadata & headers

Collect page titles, descriptions, keywords, HTTP headers, and cookies alongside content.

External domain linking

Treat additional domains as part of the same crawl with external_domains. Supports exact matches and regex.
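
The scoping rules described in this section (seed domain, optional subdomains, extra domains by exact match or regex) can be sketched as a single predicate. This is an illustration under assumptions: `in_scope` and `allow_subdomains` are hypothetical names, and representing regex rules as compiled patterns is a choice made here, not the API's documented format.

```python
import re
from typing import Sequence, Union

DomainRule = Union[str, re.Pattern]  # plain string = exact match, pattern = regex


def in_scope(host: str, seed_host: str, allow_subdomains: bool = False,
             external_domains: Sequence[DomainRule] = ()) -> bool:
    """Decide whether a link's host belongs to the crawl."""
    host = host.lower()
    seed_host = seed_host.lower()
    if host == seed_host:
        return True
    # "docs.example.com" is a subdomain of "example.com"; "badexample.com" is not.
    if allow_subdomains and host.endswith("." + seed_host):
        return True
    for rule in external_domains:
        if isinstance(rule, str):
            if host == rule.lower():
                return True
        elif rule.fullmatch(host):
            return True
    return False


rules = ["partner.org", re.compile(r"cdn\d+\.assets\.net")]
print(in_scope("docs.example.com", "example.com", allow_subdomains=True))  # True
print(in_scope("cdn3.assets.net", "example.com", external_domains=rules))  # True
print(in_scope("other.net", "example.com", external_domains=rules))        # False
```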

Budget controls

Set credit budgets per crawl or per page to cap spending. The crawler stops when the limit is reached.

Examples

cURL, Python, Node.

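from spider import Spider

# Assumes the API key is supplied via the SPIDER_API_KEY environment variable
client = Spider()

# Crawl up to 500 pages, at most 10 link hops from the seed, and return markdown
pages = client.crawl(
    "https://example.com",
    params={
        "return_format": "markdown",
        "limit": 500,
        "depth": 10,
        "metadata": True,
    },
)

# Print each page's URL and the length of its extracted content
for page in pages:
    print(page["url"], len(page["content"]))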

Common parameters

  • url (string): Starting URL(s) to crawl. Comma-separate for multiple.
  • limit (int): Max pages. Default: unlimited.
  • depth (int): Link-hop depth. Default: 25.
  • return_format (string): markdown, html, text, or bytes.
  • request (string): http, chrome, or smart (default).
  • metadata (bool): Include page metadata in the response.
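
For readers not using the SDK, the parameters above map onto the JSON body of a `POST /crawl` request. The sketch below only builds the request object with the standard library and does not send it. The base URL and the Bearer-token header are assumptions not stated on this page, and `build_crawl_request` is a hypothetical helper; check the full API reference before relying on either.

```python
import json
import urllib.request

# Assumption: the base URL is not given on this page; adjust to the real one.
API_BASE = "https://api.spider.cloud"


def build_crawl_request(url: str, api_key: str, **params) -> urllib.request.Request:
    """Build (but do not send) a POST /crawl request with a JSON body."""
    body = {"url": url, **params}
    return urllib.request.Request(
        API_BASE + "/crawl",
        data=json.dumps(body).encode("utf-8"),
        headers={
            # Assumption: Bearer-token authentication.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_crawl_request(
    "https://example.com",
    api_key="YOUR_API_KEY",
    limit=500,
    depth=10,
    return_format="markdown",
    metadata=True,
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# Sending it would be: urllib.request.urlopen(req)
```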

See the full API reference for all available parameters, including proxy configuration, caching, and network filtering.

Use cases

Where teams reach for it.

AI training datasets

Crawl documentation sites, blogs, and knowledge bases to build high-quality training corpora. Markdown output feeds directly into LLM fine-tuning pipelines.

RAG knowledge bases

Keep retrieval-augmented generation systems current by periodically crawling source websites. Use chunking to produce embedding-ready segments.

Content migration

Migrate an entire website to a new CMS by crawling all pages and extracting clean content with metadata intact.

Competitive analysis

Index competitor websites to understand their content strategy, product catalog, or pricing structure across hundreds of pages.

Get started

Ready to crawl the web?

Start collecting web content at scale in minutes. No infrastructure to manage.