POST /crawl

Web Crawling API

Point Spider at any website and it will recursively discover and collect every page. Follow links across an entire domain, capture content in your preferred format, and stream results back as they're found, all from a single API call.

How It Works

1. Submit a URL

Send one or more seed URLs to the crawl endpoint. Spider begins by loading each page and identifying every link on it.

2. Recursive Discovery

Spider follows links within the same domain, expanding outward until it reaches your configured depth or page limit. Duplicate URLs are automatically skipped.

3. Structured Output

Each page is returned in your chosen format (markdown, plain text, HTML, or raw bytes) with optional metadata, links, and headers included.
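
For illustration, one crawled page in the response looks roughly like the sketch below; the url, content, and metadata fields match the examples later on this page, while the links field name is an assumption tied to the return_page_links option.
page = {
    "url": "https://example.com/docs/getting-started",
    "content": "# Getting Started\n...",  # markdown, html, text, or bytes, depending on return_format
    "metadata": {"title": "Getting Started", "description": "Quickstart guide", "keywords": ["docs"]},  # when metadata is enabled
    "links": ["https://example.com/docs/install"],  # when return_page_links is enabled (field name assumed)
}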

Without Spider

  • Build and maintain your own crawler infrastructure
  • Handle URL deduplication, rate limiting, and politeness yourself
  • Parse HTML and extract meaningful content from messy markup
  • Manage headless browsers, proxies, and JavaScript rendering

With Spider

  • One POST request to crawl an entire website
  • Automatic dedup, robots.txt compliance, and smart rate control
  • Clean markdown or text output ready for AI pipelines
  • Built-in JS rendering, proxy rotation, and anti-bot handling

Key Capabilities

Depth & Page Limits

Control how deep the crawler goes with the depth parameter and cap total pages with limit. Set both to zero for unlimited crawling.
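
For example, to stop after three link hops or 200 pages, whichever comes first, pass both parameters in the params of the crawl call shown under Code Examples:
params = {
    "depth": 3,    # follow links at most 3 hops from the seed URL
    "limit": 200,  # collect at most 200 pages; 0 means unlimited for either setting
}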

Multiple Output Formats

Get results as clean markdown, raw HTML, plain text, or bytes. Markdown output strips navigation, ads, and boilerplate so content is ready for LLM ingestion.
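
The Code Examples section below requests markdown; switching to another format is just a different return_format value:
params = {"return_format": "text"}  # or "markdown", "html", "bytes"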

Smart Request Modes

Choose between HTTP-only for speed, Chrome rendering for JavaScript-heavy sites, or Smart mode that picks the right approach automatically.
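
The rendering mode is set with the request parameter from the Common Parameters table; a minimal sketch:
params = {"request": "smart"}  # "http" for fast static fetches, "chrome" to render JavaScript, "smart" to pick per page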

Subdomain & TLD Expansion

Extend crawling beyond the seed domain. Include subdomains like docs.example.com or follow links to related TLDs.
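
A minimal sketch, assuming the expansion switches are boolean parameters named subdomains and tld (confirm the exact names in the API reference):
params = {
    "subdomains": True,  # also crawl subdomains such as docs.example.com (parameter name assumed)
    "tld": True,         # follow links to related top-level domains (parameter name assumed)
}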

Content Chunking

Automatically segment output by words, lines, characters, or sentences. Perfect for fitting content into embedding model context windows.
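
A sketch of chunking configuration; the chunking parameter's exact name and shape shown here are assumptions, so confirm them in the API reference:
params = {
    "return_format": "markdown",
    "chunking_alg": {"type": "bywords", "size": 512},  # parameter name and shape assumed; split output roughly every 512 words
}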

CSS & XPath Selectors

Target specific elements on every page using CSS or XPath selectors via css_extraction_map. Extract only the data you need.
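
The css_extraction_map parameter name comes from this page; the value shape below (a URL path mapped to named selector groups) is an assumption used to illustrate the idea:
params = {
    "css_extraction_map": {
        "/": [  # apply to matching paths (shape assumed)
            {"name": "headline", "selectors": ["h1"]},
            {"name": "price", "selectors": ["span.price"]},
        ]
    }
}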

Metadata & Headers

Collect page titles, descriptions, keywords, HTTP response headers, and cookies alongside content. Enable with simple boolean flags.
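
The metadata flag appears in the Common Parameters table below; the header and cookie flags shown alongside it are assumed names following the return_* convention used by return_page_links:
params = {
    "metadata": True,        # page title, description, keywords
    "return_headers": True,  # HTTP response headers (parameter name assumed)
    "return_cookies": True,  # response cookies (parameter name assumed)
}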

External Domain Linking

Treat additional domains as part of the same crawl with external_domains. Supports exact matches and regex patterns.
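
A sketch passing external_domains as a list of exact domains; per the description above, regex patterns are also accepted:
params = {
    "external_domains": ["partner-site.com", "blog.partner-site.com"],  # exact matches; regex patterns also supported
}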

Budget Controls

Set credit budgets per crawl or per page to cap spending. The crawler stops when the limit is reached so there are no surprises on your bill.

Code Examples

Crawl a full website into markdown (Python)
from spider import Spider

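# Spider() picks up the API key from the SPIDER_API_KEY environment variable when one isn't passed explicitly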
client = Spider()

# Crawl up to 500 pages, return markdown
pages = client.crawl(
    "https://example.com",
    params={
        "return_format": "markdown",
        "limit": 500,
        "depth": 10,
        "metadata": True,
    }
)

for page in pages:
    print(page["url"], len(page["content"]))

Stream results with cURL
curl -X POST https://api.spider.cloud/crawl \
  -H "Authorization: Bearer $SPIDER_API_KEY" \
  -H "Content-Type: application/jsonl" \
  -d '{
    "url": "https://example.com",
    "limit": 100,
    "return_format": "markdown",
    "metadata": true,
    "return_page_links": true
  }'

Crawl with the JavaScript SDK
import Spider from "@spider-cloud/spider-client";

const client = new Spider();

const pages = await client.crawl("https://example.com", {
  return_format: "markdown",
  limit: 500,
  depth: 10,
  metadata: true,
});

pages.forEach(page => console.log(page.url));

Common Parameters

Parameter       Type      Description
url             string    The starting URL to crawl. Use a comma-separated list for multiple seed URLs.
limit           integer   Maximum number of pages to collect. Defaults to 0 (unlimited).
depth           integer   Maximum number of link hops from the seed URL. Defaults to 25.
return_format   string    Output format: markdown, html, text, or bytes.
request         string    Rendering mode: http, chrome, or smart (default).
metadata        boolean   Include page title, description, and keywords in the response.

See the full API reference for all available parameters, including proxy configuration, caching, and network filtering.

Popular Crawling Use Cases

AI Training Datasets

Crawl documentation sites, blogs, and knowledge bases to build high-quality training corpora. Markdown output feeds directly into LLM fine-tuning pipelines.

RAG Knowledge Bases

Keep retrieval-augmented generation systems current by periodically crawling source websites. Use chunking to produce embedding-ready segments.

Content Migration

Migrate an entire website to a new CMS by crawling all pages and extracting clean content with metadata intact.

Competitive Analysis

Index competitor websites to understand their content strategy, product catalog, or pricing structure across hundreds of pages.

Ready to crawl the web?

Start collecting web content at scale in minutes. No infrastructure to manage.