The Spider Open Source Ecosystem
Spider’s cloud API is backed by a collection of open source libraries. Each one is independently useful, and you can pull in just the parts you need for your own projects.
This guide covers every OSS project in the ecosystem, what it does, and how to get started with it.
Overview
| Project | Language | License | What it does |
|---|---|---|---|
| spider | Rust | MIT | Core async web crawler |
| spider-browser | TypeScript, Python, Rust | MIT | Browser automation client |
| spider-clients | Python, JS, Rust, Go | MIT | API client SDKs |
| spider_transformations | Rust | MIT | HTML to markdown/text conversion |
| spider_fingerprint | Rust | MIT | TLS/HTTP fingerprinting |
| spider_firewall | Rust | MIT | URL blocking and filtering rules |
spider — Core Crawler
GitHub: spider-rs/spider
Crate: crates.io/crates/spider
The core engine. A high-performance async web crawler built on tokio with zero-copy HTML parsing.
Quick start
cargo add spider
use spider::website::Website;
use spider::configuration::Configuration;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut config = Configuration::new();
    config.with_limit(50);
    config.with_respect_robots_txt(true);

    let mut website = Website::new("https://example.com")
        .with_configuration(config)
        .build()
        .unwrap();

    // scrape() stores the page HTML so get_pages() has content to return;
    // crawl() only gathers links.
    website.scrape().await;

    for page in website.get_pages().unwrap().iter() {
        println!("{}: {} bytes", page.get_url(), page.get_html_bytes_u8().len());
    }
}
Key features
- Async concurrent crawling with configurable limits
- Link following with depth control
- Robots.txt compliance
- Domain blacklisting/whitelisting
- External domain support
- Custom user agents
- Configurable delays between requests
See the self-hosting guide for detailed configuration and Docker setup.
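Most of the options above are set through the same Configuration builder used in the quick start. Here is a minimal sketch; the with_delay, with_depth, and with_user_agent method names are assumptions based on the builder pattern shown above, so verify the exact names and argument types against the crate documentation for your version.

```rust
use spider::configuration::Configuration;
use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut config = Configuration::new();
    config.with_limit(50);
    config.with_respect_robots_txt(true);
    // Assumed builder methods for the features listed above; check the
    // Configuration docs for the exact signatures in your spider version.
    config.with_delay(250); // delay between requests, in milliseconds
    config.with_depth(3); // maximum link-following depth
    config.with_user_agent(Some("my-crawler/1.0")); // custom user agent

    let mut website = Website::new("https://example.com")
        .with_configuration(config)
        .build()
        .unwrap();

    website.crawl().await;
    println!("visited {} links", website.get_links().len());
}
```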
spider-browser — Browser Automation Client
GitHub: spider-rs/spider-browser
npm: spider-browser
A WebSocket client for Spider’s browser automation service. Provides CDP (Chrome DevTools Protocol) access plus AI-powered methods for natural language interaction.
Quick start (TypeScript)
npm install spider-browser
import { SpiderBrowser } from "spider-browser";

const spider = new SpiderBrowser({
  apiKey: process.env.SPIDER_API_KEY,
  stealth: 0, // auto-escalates when blocked
});

await spider.init();
await spider.page.goto("https://example.com");

// Standard CDP methods
const title = await spider.page.title();
console.log("Title:", title);

// AI methods
const data = await spider.page.extract(
  "Get the main heading and first paragraph"
);
console.log("Extracted:", data);

await spider.close();
Key features
- Full CDP protocol access (navigate, click, type, scroll, screenshot)
- AI extract(): describe what data you want in plain English, get structured JSON
- AI act(): describe actions in plain English (“click the login button”)
- AI agent(): multi-step autonomous workflows
- AI observe(): describe what to watch for, get notified when it appears (see the sketch after this list)
- Stealth mode with auto-escalation when blocked
- Available in TypeScript, Python, and Rust
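The AI methods follow the same pattern as extract() in the quick start. A hedged sketch of how act(), observe(), and agent() might be wired together; the argument shapes, return values, and whether they hang off spider.page are assumptions for illustration, so check the spider-browser README for the exact signatures.

```typescript
import { SpiderBrowser } from "spider-browser";

const spider = new SpiderBrowser({ apiKey: process.env.SPIDER_API_KEY });
await spider.init();
await spider.page.goto("https://example.com/login");

// act(): plain-English action (call shape assumed)
await spider.page.act("click the login button");

// observe(): plain-English condition to watch for (assumed to resolve
// when the condition is met)
await spider.page.observe("the account dashboard is visible");

// agent(): multi-step autonomous workflow (return value assumed)
const result = await spider.page.agent(
  "open the billing page and summarize the latest invoice"
);
console.log(result);

await spider.close();
```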
spider-clients — API Client SDKs
GitHub: spider-rs/spider-clients
Official client libraries for the Spider cloud API. Available in Python, JavaScript/TypeScript, Rust, and Go.
Python
pip install spider-client
from spider import Spider

spider = Spider()  # Uses the SPIDER_API_KEY environment variable

# Crawl a site
pages = spider.crawl_url(
    "https://example.com",
    params={"limit": 10, "return_format": "markdown"},
)

for page in pages:
    print(f"{page['url']}: {len(page['content'])} chars")

# AI extraction
result = spider.ai_scrape(
    "https://example.com",
    "Extract the main content and any contact information",
)
print(result)
JavaScript/TypeScript
npm install @spider-cloud/spider-client
import { Spider } from "@spider-cloud/spider-client";

const spider = new Spider({ apiKey: process.env.SPIDER_API_KEY });

// Crawl with streaming
await spider.crawlUrl(
  "https://example.com",
  { limit: 10, return_format: "markdown" },
  true,
  (page) => {
    console.log(`${page.url}: ${page.content?.length} chars`);
  }
);
Go
go get github.com/spider-rs/spider-clients/go
package main

import (
    "context"
    "fmt"

    spider "github.com/spider-rs/spider-clients/go"
)

func main() {
    client := spider.New("") // Uses SPIDER_API_KEY env var

    pages, err := client.CrawlURL(context.Background(), "https://example.com", &spider.SpiderParams{
        Limit:        10,
        ReturnFormat: spider.FormatMarkdown,
    })
    if err != nil {
        panic(err)
    }

    for _, page := range pages {
        fmt.Printf("%s: %d chars\n", page.URL, len(page.Content))
    }
}
Rust
cargo add spider-client
use spider_client::Spider;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let spider = Spider::new(None)?;

    let pages = spider.crawl_url(
        "https://example.com",
        Default::default(),
        false,
        None,
    ).await?;

    for page in &pages {
        println!(
            "{}: {} chars",
            page.url,
            page.content.as_ref().map(|c| c.len()).unwrap_or(0)
        );
    }

    Ok(())
}
spider_transformations — HTML to Markdown
Part of the spider monorepo. Converts raw HTML into clean markdown optimized for LLM consumption.
Quick start
cargo add spider_transformations
use spider_transformations::transformation::content::{transform_content, TransformConfig};

fn main() {
    let html = r#"
    <html>
      <head><title>Example</title></head>
      <body>
        <nav>Home | About | Contact</nav>
        <main>
          <h1>Welcome</h1>
          <p>This is the main content of the page.</p>
          <ul>
            <li>Feature one</li>
            <li>Feature two</li>
          </ul>
        </main>
        <footer>Copyright 2026</footer>
      </body>
    </html>
    "#;

    let config = TransformConfig::default();
    let markdown = transform_content(html, "https://example.com", &config);

    println!("{}", markdown);
    // Output:
    // # Welcome
    //
    // This is the main content of the page.
    //
    // - Feature one
    // - Feature two
}
What it does
- Strips navigation, sidebars, footers, cookie banners
- Preserves semantic structure (headings, lists, tables, code blocks)
- Converts links to markdown format
- Handles nested HTML structures
- Configurable boilerplate removal
spider_fingerprint — TLS Fingerprinting
Part of the spider monorepo. Manages TLS/HTTP fingerprints to avoid detection by anti-bot systems.
What it does
- Generates realistic TLS client hello fingerprints
- Rotates HTTP/2 settings and header order
- Mimics real browser fingerprint patterns
- Helps avoid bot detection at the TLS layer
This crate is used internally by the spider crawler when the fingerprint feature is enabled.
spider_firewall — URL Blocking Rules
Part of the spider monorepo. Implements URL filtering and blocking rules for crawlers.
What it does
- Pattern-based URL blocking (regex and glob)
- Domain-level allow/deny lists
- Path-based filtering
- Resource type filtering (images, scripts, stylesheets)
Used internally to implement blacklist/whitelist behavior and resource optimization during crawls.
How the pieces fit together
For a self-hosted crawling pipeline:
spider (crawl) → spider_transformations (convert to markdown) → your pipeline
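A minimal sketch of that self-hosted pipeline, reusing the APIs from the quick starts above: scrape() retains the page HTML, and transform_content is called the same way as in the spider_transformations example (its exact signature may vary by version).

```rust
use spider::website::Website;
use spider::tokio;
use spider_transformations::transformation::content::{transform_content, TransformConfig};

#[tokio::main]
async fn main() {
    // Crawl the site and keep the raw HTML in memory.
    let mut website = Website::new("https://example.com");
    website.scrape().await;

    let config = TransformConfig::default();

    if let Some(pages) = website.get_pages() {
        for page in pages.iter() {
            // Convert each page to markdown before handing it off to your
            // pipeline (search index, vector store, LLM context, etc.).
            let markdown = transform_content(page.get_html().as_str(), page.get_url(), &config);
            println!("--- {} ---\n{}", page.get_url(), markdown);
        }
    }
}
```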
For cloud API usage:
spider-clients (API call) → Spider Cloud → results
For browser automation:
spider-browser (WebSocket) → Spider Browser Cloud → live browser session
The OSS ecosystem gives you building blocks. The cloud API assembles them into a managed service with additional capabilities (proxies, anti-bot, AI, scaling) that aren’t practical to self-host.
Contributing
All Spider OSS projects accept contributions. The main repos:
- Core crawler: spider-rs/spider
- Browser client: spider-rs/spider-browser
- API clients: spider-rs/spider-clients
Issues and PRs welcome. All projects are MIT licensed, and no CLA is required.
Try the cloud API. Free credits to start, no card required.