Self-Hosting Spider: Using the Open Source Rust Crate
The spider Rust crate is the open source core of Spider’s crawling engine. It’s MIT-licensed and runs standalone, so you can build your own crawler without touching the cloud API.
This guide covers what you get with the crate, how to set it up, and when the cloud API makes more sense.
What you get (and what you don’t)
The spider crate gives you a high-performance async web crawler built on tokio. Here’s what’s included and what’s only available through the cloud API:
| Feature | OSS crate | Cloud API |
|---|---|---|
| HTML crawling | Yes | Yes |
| Async concurrent requests | Yes | Yes |
| Depth/limit control | Yes | Yes |
| Robots.txt handling | Yes | Yes |
| CSS selector extraction | Yes | Yes |
| HTTP/2 support | Yes | Yes |
| Configurable user agent | Yes | Yes |
| Markdown conversion | Via spider_transformations | Built-in |
| JavaScript rendering | Chrome feature flag | Smart mode (auto-detect) |
| Anti-bot bypass | No | Built-in |
| Proxy rotation | No (BYO) | Managed (residential, mobile, ISP) |
| AI extraction | No | AI Studio + Spider Browser |
| Managed scaling | No | Auto-scaling |
| Browser automation | No | WebSocket sessions |
| MCP server | No | Built-in |
The crate handles the crawling engine. Anti-bot bypass, proxy rotation, browser automation, and AI features are part of the cloud platform.
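For example, "BYO" proxy support means you hand the crawler endpoints you already manage. A minimal sketch, assuming the crate's with_proxies configuration option and using placeholder proxy URLs:

use spider::website::Website;
use spider::configuration::Configuration;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut config = Configuration::new();

    // Placeholder endpoints: supply proxies you already source and manage yourself.
    let proxies: Vec<String> = vec![
        "http://user:pass@proxy-1.internal:8080".to_string(),
        "http://user:pass@proxy-2.internal:8080".to_string(),
    ];
    config.with_proxies(Some(proxies));

    let mut website = Website::new("https://example.com")
        .with_configuration(config)
        .build()
        .unwrap();

    website.crawl().await;
    println!("Discovered {} links", website.get_links().len());
}

Requests go out through the endpoints you supply; sourcing, rotating, and health-checking those proxies remains your responsibility, which is what the managed option in the cloud column replaces.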
Quick start
Add the crate to your project:
cargo add spider
Basic crawl:
use spider::website::Website;
use spider::tokio; // spider re-exports tokio, so no separate dependency is required

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");

    // scrape() downloads and stores each page's HTML so get_pages() has content.
    website.scrape().await;

    for page in website.get_pages().unwrap().iter() {
        println!("{} - {} bytes", page.get_url(), page.get_html_bytes_u8().len());
    }
}
This crawls example.com, follows same-domain links, stores each page’s HTML, and prints its URL and size.
Configuration
The Configuration struct controls crawl behavior:
use spider::website::Website;
use spider::configuration::Configuration;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut config = Configuration::new();
    config.with_limit(100);                // Max 100 pages
    config.with_depth(3);                  // Max 3 links deep
    config.with_respect_robots_txt(true);  // Honor robots.txt
    config.with_delay(250);                // 250ms between requests
    config.with_user_agent(Some("MyBot/1.0".into()));

    let mut website = Website::new("https://docs.example.com")
        .with_configuration(config)
        .build()
        .unwrap();

    website.scrape().await;
    println!("Crawled {} pages", website.get_pages().unwrap().len());
}
Key configuration options
| Option | Method | Description |
|---|---|---|
| Page limit | with_limit(n) | Maximum number of pages to crawl |
| Depth | with_depth(n) | Maximum link depth from start URL |
| Delay | with_delay(ms) | Milliseconds between requests to same domain |
| User agent | with_user_agent(Some(s)) | Custom User-Agent header |
| Robots.txt | with_respect_robots_txt(true) | Honor robots.txt rules |
| Subdomains | with_subdomains(true) | Include subdomains in crawl |
| TLD | with_tld(true) | Crawl all subdomains under the TLD |
| External domains | with_external_domains(vec) | Allow crawling specific external domains |
| Blacklist | with_blacklist_url(vec) | URL patterns to skip |
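As a rough sketch, the scope options combine with the same Configuration pattern shown above. The shop URL and blacklist patterns below are illustrative, and the exact argument type expected by with_blacklist_url may differ from this sketch, so check the crate docs:

use spider::website::Website;
use spider::configuration::Configuration;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut config = Configuration::new();
    config.with_limit(500);
    config.with_subdomains(true);          // also follow links on subdomains
    config.with_respect_robots_txt(true);
    // Illustrative patterns: skip login and cart pages.
    config.with_blacklist_url(Some(vec!["/login".into(), "/cart".into()]));

    let mut website = Website::new("https://shop.example.com")
        .with_configuration(config)
        .build()
        .unwrap();

    website.scrape().await;
    println!("Crawled {} pages", website.get_pages().unwrap().len());
}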
Markdown conversion
The spider_transformations crate converts HTML to clean markdown suitable for LLM consumption:
cargo add spider_transformations
use spider::website::Website;
use spider::tokio;
use spider_transformations::transformation::content::{transform_content, ReturnFormat, TransformConfig};

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");
    website.scrape().await;

    let mut config = TransformConfig::default();
    config.return_format = ReturnFormat::Markdown; // request markdown output

    for page in website.get_pages().unwrap().iter() {
        let html = page.get_html();
        let markdown = transform_content(&html, &page.get_url(), &config);
        println!("--- {} ---\n{}\n", page.get_url(), markdown);
    }
}
The transformer strips navigation, footers, cookie banners, and other boilerplate to produce clean content markdown.
Docker example
Here’s a minimal Dockerfile for a spider-based service:
FROM rust:1.82-slim AS builder
WORKDIR /app
COPY Cargo.toml Cargo.lock ./
COPY src/ ./src/
RUN cargo build --release
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y ca-certificates && rm -rf /var/lib/apt/lists/*
COPY --from=builder /app/target/release/my-crawler /usr/local/bin/
CMD ["my-crawler"]
Example binary that takes a target URL and page limit from the command line:
// src/main.rs
use spider::website::Website;
use spider::configuration::Configuration;
use spider::tokio;
use std::env;

#[tokio::main]
async fn main() {
    // Target URL and page limit come from the command line.
    let url = env::args().nth(1).unwrap_or_else(|| "https://example.com".into());
    let limit = env::args()
        .nth(2)
        .and_then(|s| s.parse().ok())
        .unwrap_or(50);

    let mut config = Configuration::new();
    config.with_limit(limit);
    config.with_respect_robots_txt(true);

    let mut website = Website::new(&url)
        .with_configuration(config)
        .build()
        .unwrap();

    website.scrape().await;

    let pages = website.get_pages().unwrap();
    println!("Crawled {} pages from {}", pages.len(), url);
    for page in pages.iter() {
        let size = page.get_html_bytes_u8().len();
        println!("  {} ({} bytes)", page.get_url(), size);
    }
}
Build and run:
docker build -t my-crawler .
docker run my-crawler https://docs.example.com 100
Streaming results
For large crawls, you can process pages as they arrive instead of waiting for the full crawl to finish. The crate’s subscription channel (part of the default sync feature) broadcasts each page as it is fetched:
use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");

    // Subscribe before starting the crawl; each fetched page is broadcast on the channel.
    let mut rx = website.subscribe(16).unwrap();

    let reader = tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            println!("Found: {} ({} bytes)", page.get_url(), page.get_html_bytes_u8().len());
        }
    });

    website.crawl().await;
    website.unsubscribe(); // close the channel so the reader task can finish
    reader.await.unwrap();
}
When to use the cloud API instead
The OSS crate is the right choice for:
- Internal tools and scripts
- Research and prototyping
- Workloads that don’t hit anti-bot protections
- Teams with Rust expertise who want full control
The cloud API makes more sense when you need:
- Anti-bot bypass: Cloudflare, DataDome, PerimeterX, Akamai. The crate doesn’t include bypass logic.
- Proxy rotation: Managed residential, mobile, and ISP proxies across 100+ countries.
- Browser automation: Live WebSocket browser sessions with AI methods (extract(), act(), agent()).
- AI extraction: Natural language endpoints that return structured data without writing parsers.
- Smart rendering: Auto-detection of which pages need JavaScript rendering (saves compute on static pages).
- Managed scaling: No infrastructure to run. Send requests, get results.
Upgrading from self-hosted to cloud
If you start with the crate and later need cloud features, the migration is straightforward. Replace the crate’s crawl logic with an HTTP request to the API:
// Before: spider crate
let mut website = Website::new("https://example.com");
website.scrape().await;
let pages = website.get_pages().unwrap();

// After: cloud API (using reqwest); api_key comes from your configuration or environment
let client = reqwest::Client::new();
let response = client
    .post("https://api.spider.cloud/crawl")
    .header("Authorization", format!("Bearer {}", api_key))
    .json(&serde_json::json!({
        "url": "https://example.com",
        "limit": 100,
        "return_format": "markdown",
    }))
    .send()
    .await?;
let pages: Vec<serde_json::Value> = response.json().await?;
Or use the Rust SDK for a higher-level interface:
use spider_client::Spider;
let spider = Spider::new(None)?; // Uses SPIDER_API_KEY env var
let response = spider.crawl_url("https://example.com", Default::default(), false, None).await?;
The response format is the same regardless of whether you use the crate or the API: URL, content, status, and optional metadata per page.
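If you want typed results on the Rust side, a small deserialization target is enough. The field names below (url, content, status, metadata) are assumptions based on the description above, so check the API reference for the exact keys:

use serde::Deserialize;

// Field names are assumptions taken from the description above; verify against the API reference.
#[derive(Debug, Deserialize)]
struct CrawledPage {
    url: String,
    content: Option<String>,
    status: Option<u16>,
    #[serde(default)]
    metadata: Option<serde_json::Value>,
}

// Usage with the reqwest response from the earlier snippet:
// let pages: Vec<CrawledPage> = response.json().await?;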
Get your API key. Free credits to start, no card required.