Self-Hosting Spider: Using the Open Source Rust Crate
The spider Rust crate is the open source core of Spider’s crawling engine. It’s MIT-licensed and runs standalone, so you can build your own crawler without touching the cloud API.
This guide covers what you get with the crate, how to set it up, and when the cloud API makes more sense.
What you get (and what you don’t)
The spider crate gives you a high-performance async web crawler built on tokio. Here’s what’s included and what’s only available through the cloud API:
| Feature | OSS crate | Cloud API |
|---|---|---|
| HTML crawling | Yes | Yes |
| Async concurrent requests | Yes | Yes |
| Depth/limit control | Yes | Yes |
| Robots.txt handling | Yes | Yes |
| CSS selector extraction | Yes | Yes |
| HTTP/2 support | Yes | Yes |
| Configurable user agent | Yes | Yes |
| Markdown conversion | Via spider_transformations | Built-in |
| JavaScript rendering | Chrome feature flag | Smart mode (auto-detect) |
| Anti-bot bypass | No | Built-in |
| Proxy rotation | No (BYO) | Managed (residential, mobile, ISP) |
| AI extraction | No | AI Studio + Spider Browser |
| Managed scaling | No | Auto-scaling |
| Browser automation | No | WebSocket sessions |
| MCP server | No | Built-in |
The crate handles the crawling engine. Anti-bot bypass, proxy rotation, browser automation, and AI features are part of the cloud platform.
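For example, "BYO" proxy support means you hand the crawler endpoints you already manage. A minimal sketch, assuming the crate's with_proxies configuration option and using placeholder proxy URLs:

use spider::website::Website;
use spider::configuration::Configuration;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut config = Configuration::new();

    // Placeholder endpoints: supply proxies you already source and manage yourself.
    let proxies: Vec<String> = vec![
        "http://user:pass@proxy-1.internal:8080".to_string(),
        "http://user:pass@proxy-2.internal:8080".to_string(),
    ];
    config.with_proxies(Some(proxies));

    let mut website = Website::new("https://example.com")
        .with_configuration(config)
        .build()
        .unwrap();

    website.crawl().await;
    println!("Discovered {} links", website.get_links().len());
}

Requests go out through the endpoints you supply; sourcing, rotating, and health-checking those proxies remains your responsibility, which is what the managed option in the cloud column replaces.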
Quick start
Add the crate to your project:
cargo add spider
Basic crawl:
use spider::website::Website;
use spider::tokio; // spider re-exports tokio, so no separate dependency is required

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");

    // scrape() downloads and stores each page's HTML so get_pages() has content.
    website.scrape().await;

    for page in website.get_pages().unwrap().iter() {
        println!("{} - {} bytes", page.get_url(), page.get_html_bytes_u8().len());
    }
}
This crawls example.com, follows same-domain links, stores each page’s HTML, and prints its URL and size.
Configuration
The Configuration struct controls crawl behavior:
use spider::website::Website;
use spider::configuration::Configuration;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut config = Configuration::new();
    config.with_limit(100);                // Max 100 pages
    config.with_depth(3);                  // Max 3 links deep
    config.with_respect_robots_txt(true);  // Honor robots.txt
    config.with_delay(250);                // 250ms between requests
    config.with_user_agent(Some("MyBot/1.0".into()));

    let mut website = Website::new("https://docs.example.com")
        .with_configuration(config)
        .build()
        .unwrap();

    website.scrape().await;
    println!("Crawled {} pages", website.get_pages().unwrap().len());
}
Key configuration options
| Option | Method | Description |
|---|---|---|
| Page limit | with_limit(n) | Maximum number of pages to crawl |
| Depth | with_depth(n) | Maximum link depth from start URL |
| Delay | with_delay(ms) | Milliseconds between requests to same domain |
| User agent | with_user_agent(Some(s)) | Custom User-Agent header |
| Robots.txt | with_respect_robots_txt(true) | Honor robots.txt rules |
| Subdomains | with_subdomains(true) | Include subdomains in crawl |
| TLD | with_tld(true) | Crawl all subdomains under the TLD |
| External domains | with_external_domains(vec) | Allow crawling specific external domains |
| Blacklist | with_blacklist_url(vec) | URL patterns to skip |
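As a rough sketch, the scope options combine with the same Configuration pattern shown above. The shop URL and blacklist patterns below are illustrative, and the exact argument type expected by with_blacklist_url may differ from this sketch, so check the crate docs:

use spider::website::Website;
use spider::configuration::Configuration;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut config = Configuration::new();
    config.with_limit(500);
    config.with_subdomains(true);          // also follow links on subdomains
    config.with_respect_robots_txt(true);
    // Illustrative patterns: skip login and cart pages.
    config.with_blacklist_url(Some(vec!["/login".into(), "/cart".into()]));

    let mut website = Website::new("https://shop.example.com")
        .with_configuration(config)
        .build()
        .unwrap();

    website.scrape().await;
    println!("Crawled {} pages", website.get_pages().unwrap().len());
}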
Markdown conversion
The spider_transformations crate converts HTML to clean markdown suitable for LLM consumption:
cargo add spider_transformations
use spider::website::Website;
use spider::tokio;
use spider_transformations::transformation::content::{transform_content, ReturnFormat, TransformConfig};

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");
    website.scrape().await;

    let mut config = TransformConfig::default();
    config.return_format = ReturnFormat::Markdown; // request markdown output

    for page in website.get_pages().unwrap().iter() {
        let html = page.get_html();
        let markdown = transform_content(&html, &page.get_url(), &config);
        println!("--- {} ---\n{}\n", page.get_url(), markdown);
    }
}
The transformer strips navigation, footers, cookie banners, and other boilerplate to produce clean content markdown.
Docker example
Here’s a minimal Dockerfile for a spider-based service:
FROM rust:1.82-slim AS builder
WORKDIR /app
COPY Cargo.toml Cargo.lock ./
COPY src/ ./src/
RUN cargo build --release
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y ca-certificates && rm -rf /var/lib/apt/lists/*
COPY --from=builder /app/target/release/my-crawler /usr/local/bin/
CMD ["my-crawler"]
Example binary that takes a target URL and page limit from the command line:
// src/main.rs
use spider::website::Website;
use spider::configuration::Configuration;
use spider::tokio;
use std::env;

#[tokio::main]
async fn main() {
    // Target URL and page limit come from the command line.
    let url = env::args().nth(1).unwrap_or_else(|| "https://example.com".into());
    let limit = env::args()
        .nth(2)
        .and_then(|s| s.parse().ok())
        .unwrap_or(50);

    let mut config = Configuration::new();
    config.with_limit(limit);
    config.with_respect_robots_txt(true);

    let mut website = Website::new(&url)
        .with_configuration(config)
        .build()
        .unwrap();

    website.scrape().await;

    let pages = website.get_pages().unwrap();
    println!("Crawled {} pages from {}", pages.len(), url);
    for page in pages.iter() {
        let size = page.get_html_bytes_u8().len();
        println!("  {} ({} bytes)", page.get_url(), size);
    }
}
Build and run:
docker build -t my-crawler .
docker run my-crawler https://docs.example.com 100
Streaming results
For large crawls, you can process pages as they arrive instead of waiting for the full crawl to finish. The crate’s subscription channel (part of the default sync feature) broadcasts each page as it is fetched:
use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");

    // Subscribe before starting the crawl; each fetched page is broadcast on the channel.
    let mut rx = website.subscribe(16).unwrap();

    let reader = tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            println!("Found: {} ({} bytes)", page.get_url(), page.get_html_bytes_u8().len());
        }
    });

    website.crawl().await;
    website.unsubscribe(); // close the channel so the reader task can finish
    reader.await.unwrap();
}
When to use the cloud API instead
The OSS crate is the right choice for:
- Internal tools and scripts
- Research and prototyping
- Workloads that don’t hit anti-bot protections
- Teams with Rust expertise who want full control
The cloud API makes more sense when you need:
- Anti-bot bypass: Cloudflare, DataDome, PerimeterX, Akamai. The crate doesn’t include bypass logic.
- Proxy rotation: Managed residential, mobile, and ISP proxies across 100+ countries.
- Browser automation: Live WebSocket browser sessions with AI methods (extract(), act(), agent()).
- AI extraction: Natural language endpoints that return structured data without writing parsers.
- Smart rendering: Auto-detection of which pages need JavaScript rendering (saves compute on static pages).
- Managed scaling: No infrastructure to run. Send requests, get results.
Upgrading from self-hosted to cloud
If you start with the crate and later need cloud features, the migration is straightforward. Replace the crate’s crawl logic with an HTTP request to the API:
// Before: spider crate
let mut website = Website::new("https://example.com");
website.scrape().await;
let pages = website.get_pages().unwrap();

// After: cloud API (using reqwest); api_key comes from your configuration or environment
let client = reqwest::Client::new();
let response = client
    .post("https://api.spider.cloud/crawl")
    .header("Authorization", format!("Bearer {}", api_key))
    .json(&serde_json::json!({
        "url": "https://example.com",
        "limit": 100,
        "return_format": "markdown",
    }))
    .send()
    .await?;
let pages: Vec<serde_json::Value> = response.json().await?;
Or use the Rust SDK for a higher-level interface:
use spider_client::Spider;
let spider = Spider::new(None)?; // Uses SPIDER_API_KEY env var
let response = spider.crawl_url("https://example.com", Default::default(), false, None).await?;
The response format is the same regardless of whether you use the crate or the API: URL, content, status, and optional metadata per page.
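If you want typed results on the Rust side, a small deserialization target is enough. The field names below (url, content, status, metadata) are assumptions based on the description above, so check the API reference for the exact keys:

use serde::Deserialize;

// Field names are assumptions taken from the description above; verify against the API reference.
#[derive(Debug, Deserialize)]
struct CrawledPage {
    url: String,
    content: Option<String>,
    status: Option<u16>,
    #[serde(default)]
    metadata: Option<serde_json::Value>,
}

// Usage with the reqwest response from the earlier snippet:
// let pages: Vec<CrawledPage> = response.json().await?;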
Get your API key. Free credits to start, no card required.