
Self-Hosting Spider: Using the Open Source Rust Crate

Build your own web crawler with the open source spider Rust crate. Quick start, Docker setup, configuration, and when to upgrade to the cloud API.

Jeff Mendez · 4 min read


The spider Rust crate is the open source core of Spider’s crawling engine. It’s MIT-licensed and runs standalone, so you can build your own crawler without touching the cloud API.

This guide covers what you get with the crate, how to set it up, and when the cloud API makes more sense.

What you get (and what you don’t)

The spider crate gives you a high-performance async web crawler built on tokio. Here’s what’s included and what’s only available through the cloud API:

| Feature | OSS crate | Cloud API |
| --- | --- | --- |
| HTML crawling | Yes | Yes |
| Async concurrent requests | Yes | Yes |
| Depth/limit control | Yes | Yes |
| Robots.txt handling | Yes | Yes |
| CSS selector extraction | Yes | Yes |
| HTTP/2 support | Yes | Yes |
| Configurable user agent | Yes | Yes |
| Markdown conversion | Via spider_transformations | Built-in |
| JavaScript rendering | Chrome feature flag | Smart mode (auto-detect) |
| Anti-bot bypass | No | Built-in |
| Proxy rotation | No (BYO) | Managed (residential, mobile, ISP) |
| AI extraction | No | AI Studio + Spider Browser |
| Managed scaling | No | Auto-scaling |
| Browser automation | No | WebSocket sessions |
| MCP server | No | Built-in |

The crate covers the crawling engine itself; anti-bot bypass, proxy rotation, browser automation, and the AI features are part of the cloud platform.
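JavaScript rendering in the crate sits behind a Cargo feature flag; in recent releases the flag is named chrome, but check the crate's feature list for your version:

cargo add spider --features chrome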

Quick start

Add the crate to your project:

cargo add spider

Basic crawl:

use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");

    website.scrape().await; // scrape() stores each page's HTML; crawl() only collects links

    for page in website.get_pages().unwrap().iter() {
        println!("{} - {} bytes", page.get_url(), page.get_html_bytes_u8().len());
    }
}

This crawls example.com, follows links on the same domain, and prints each page’s URL and size. scrape() (rather than crawl()) keeps each page’s HTML in memory, which is what lets get_pages() return content.
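If you only need the discovered URLs rather than page bodies, crawl() is the lighter call. A minimal sketch, assuming the get_links() accessor (some newer releases expose this as the async get_all_links_visited() instead):

use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");

    // crawl() follows links but does not keep the page HTML around.
    website.crawl().await;

    // Visited URLs; on some versions this accessor is the async get_all_links_visited().
    for link in website.get_links() {
        println!("{}", link.as_ref());
    }
}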

Configuration

The Configuration struct controls crawl behavior:

use spider::website::Website;
use spider::configuration::Configuration;

#[tokio::main]
async fn main() {
    let mut config = Configuration::new();
    config.with_limit(100);              // Max 100 pages
    config.with_depth(3);                // Max 3 links deep
    config.with_respect_robots_txt(true); // Honor robots.txt
    config.with_delay(250);               // 250ms between requests
    config.with_user_agent(Some("MyBot/1.0".into()));

    let mut website = Website::new("https://docs.example.com")
        .with_configuration(config)
        .build()
        .unwrap();

    website.scrape().await; // scrape() stores pages so get_pages() below has results

    println!("Crawled {} pages", website.get_pages().unwrap().len());
}

Key configuration options

| Option | Method | Description |
| --- | --- | --- |
| Page limit | with_limit(n) | Maximum number of pages to crawl |
| Depth | with_depth(n) | Maximum link depth from the start URL |
| Delay | with_delay(ms) | Milliseconds between requests to the same domain |
| User agent | with_user_agent(Some(s)) | Custom User-Agent header |
| Robots.txt | with_respect_robots_txt(true) | Honor robots.txt rules |
| Subdomains | with_subdomains(true) | Include subdomains in the crawl |
| TLD | with_tld(true) | Crawl all subdomains under the TLD |
| External domains | with_external_domains(vec) | Allow crawling specific external domains |
| Blacklist | with_blacklist_url(vec) | URL patterns to skip |
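As a sketch of the scope-related options, the crawl below stays on example.com and its subdomains but does not wander across the whole TLD. The list-valued setters (with_blacklist_url, with_external_domains) take Option-wrapped collections whose item types vary by release, so they are omitted here:

use spider::configuration::Configuration;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut config = Configuration::new();
    config.with_limit(500);
    config.with_subdomains(true); // also follow links on e.g. docs.example.com
    config.with_tld(false);       // but not every site under the same TLD
    config.with_respect_robots_txt(true);

    let mut website = Website::new("https://example.com")
        .with_configuration(config)
        .build()
        .unwrap();

    website.scrape().await;

    for page in website.get_pages().unwrap().iter() {
        println!("{}", page.get_url());
    }
}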

Markdown conversion

The spider_transformations crate converts HTML to clean markdown suitable for LLM consumption:

cargo add spider_transformations

Then convert the scraped pages:

use spider::website::Website;
use spider_transformations::transformation::content::{
    transform_content, ReturnFormat, TransformConfig,
};

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");
    website.scrape().await;

    let mut config = TransformConfig::default();
    config.return_format = ReturnFormat::Markdown;

    for page in website.get_pages().unwrap().iter() {
        // transform_content takes the Page itself; the trailing arguments are
        // optional encoding, base URL, and selector overrides (check the docs
        // for the exact signature in your spider_transformations version).
        let markdown = transform_content(page, &config, &None, &None, &None);
        println!("--- {} ---\n{}\n", page.get_url(), markdown);
    }
}

The transformer strips navigation, footers, cookie banners, and other boilerplate to produce clean, content-only markdown.

Docker example

Here’s a minimal Dockerfile for a spider-based service:

FROM rust:1.82-slim AS builder
WORKDIR /app
COPY Cargo.toml Cargo.lock ./
COPY src/ ./src/
RUN cargo build --release

FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y ca-certificates && rm -rf /var/lib/apt/lists/*
COPY --from=builder /app/target/release/my-crawler /usr/local/bin/
CMD ["my-crawler"]
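The COPY --from=builder line assumes the Cargo package, and therefore the release binary, is named my-crawler. A minimal Cargo.toml matching that assumption, with illustrative version numbers:

[package]
name = "my-crawler"   # must match the binary copied in the Dockerfile
version = "0.1.0"
edition = "2021"

[dependencies]
# Versions are illustrative; pin whatever is current when you build.
spider = "2"
tokio = { version = "1", features = ["full"] }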

An example crawler binary that takes a start URL and page limit as command-line arguments:

// src/main.rs
use spider::website::Website;
use spider::configuration::Configuration;
use std::env;

#[tokio::main]
async fn main() {
    let url = env::args().nth(1).unwrap_or_else(|| "https://example.com".into());
    let limit: usize = env::args()
        .nth(2)
        .and_then(|s| s.parse().ok())
        .unwrap_or(50);

    let mut config = Configuration::new();
    config.with_limit(limit);
    config.with_respect_robots_txt(true);

    let mut website = Website::new(&url)
        .with_configuration(config)
        .build()
        .unwrap();

    website.scrape().await; // scrape() stores pages for the listing below

    let pages = website.get_pages().unwrap();
    println!("Crawled {} pages from {}", pages.len(), url);

    for page in pages.iter() {
        let size = page.get_html_bytes_u8().len();
        println!("  {} ({} bytes)", page.get_url(), size);
    }
}

Build and run:

docker build -t my-crawler .
docker run my-crawler https://docs.example.com 100

Streaming results

For large crawls, you can process pages as they arrive instead of waiting for the full crawl to finish. The crate’s subscribe API (part of the default sync feature) hands you a broadcast receiver that yields each page as it is processed:

use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");

    // subscribe(capacity) returns a broadcast receiver of pages.
    let mut rx = website.subscribe(16).unwrap();

    let reader = tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            println!("Found: {}", page.get_url());
        }
    });

    website.crawl().await;

    // Dropping the channel ends the reader task's loop.
    website.unsubscribe();
    let _ = reader.await;
}

When to use the cloud API instead

The OSS crate is the right choice for:

  • Internal tools and scripts
  • Research and prototyping
  • Workloads that don’t hit anti-bot protections
  • Teams with Rust expertise who want full control

The cloud API makes more sense when you need:

  • Anti-bot bypass: Cloudflare, DataDome, PerimeterX, Akamai. The crate doesn’t include bypass logic.
  • Proxy rotation: Managed residential, mobile, and ISP proxies across 100+ countries.
  • Browser automation: Live WebSocket browser sessions with AI methods (extract(), act(), agent()).
  • AI extraction: Natural language endpoints that return structured data without writing parsers.
  • Smart rendering: Auto-detection of which pages need JavaScript rendering (saves compute on static pages).
  • Managed scaling: No infrastructure to run. Send requests, get results.

Upgrading from self-hosted to cloud

If you start with the crate and later need cloud features, the migration is straightforward. Replace the crate’s crawl logic with an HTTP request to the API:

// Before: spider crate
let mut website = Website::new("https://example.com");
website.crawl().await;
let pages = website.get_pages().unwrap();

// After: cloud API (using reqwest)
let client = reqwest::Client::new();
let response = client
    .post("https://api.spider.cloud/crawl")
    .header("Authorization", format!("Bearer {}", api_key))
    .json(&serde_json::json!({
        "url": "https://example.com",
        "limit": 100,
        "return_format": "markdown",
    }))
    .send()
    .await?;
let pages: Vec<serde_json::Value> = response.json().await?;
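Wrapped in a runnable program, the API call looks roughly like the sketch below; it assumes reqwest with its json feature, serde_json, and tokio as dependencies, and the key exported as SPIDER_API_KEY:

use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumes the key is exported as SPIDER_API_KEY.
    let api_key = std::env::var("SPIDER_API_KEY")?;

    let client = reqwest::Client::new();
    let response = client
        .post("https://api.spider.cloud/crawl")
        .header("Authorization", format!("Bearer {}", api_key))
        .json(&json!({
            "url": "https://example.com",
            "limit": 100,
            "return_format": "markdown",
        }))
        .send()
        .await?;

    let pages: Vec<serde_json::Value> = response.json().await?;
    println!("Received {} pages", pages.len());
    Ok(())
}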

Or use the Rust SDK for a higher-level interface:

use spider_client::Spider;

let spider = Spider::new(None)?; // Uses SPIDER_API_KEY env var
let response = spider.crawl_url("https://example.com", Default::default(), false, None).await?;

Either way, each result carries the same basic fields per page: URL, content, status, and optional metadata.

Get your API key. Free credits to start, no card required.
