The Spider Open Source Ecosystem

A guide to all open source Spider projects — the core crawler, browser client, HTML transformer, TLS fingerprinting, and more. Quick-start examples for each.

Jeff Mendez · 5 min read

Spider’s cloud API is backed by a collection of open source libraries. Each one is independently useful, and you can pull in just the parts you need for your own projects.

This guide covers every OSS project in the ecosystem, what it does, and how to get started with it.

Overview

Project                  Language                   License   What it does
spider                   Rust                       MIT       Core async web crawler
spider-browser           TypeScript, Python, Rust   MIT       Browser automation client
spider-clients           Python, JS, Rust, Go       MIT       API client SDKs
spider_transformations   Rust                       MIT       HTML to markdown/text conversion
spider_fingerprint       Rust                       MIT       TLS/HTTP fingerprinting
spider_firewall          Rust                       MIT       URL blocking and filtering rules

spider — Core Crawler

GitHub: spider-rs/spider · Crate: crates.io/crates/spider

The core engine. A high-performance async web crawler built on tokio with zero-copy HTML parsing.

Quick start

cargo add spider
use spider::website::Website;
use spider::configuration::Configuration;

#[tokio::main]
async fn main() {
    let mut config = Configuration::new();
    config.with_limit(50);
    config.with_respect_robots_txt(true);

    let mut website = Website::new("https://example.com")
        .with_configuration(config)
        .build()
        .unwrap();

    website.scrape().await; // scrape() stores the page HTML; crawl() only collects links

    for page in website.get_pages().unwrap().iter() {
        println!("{}: {} bytes", page.get_url(), page.get_html_bytes_u8().len());
    }
}

Key features

  • Async concurrent crawling with configurable limits
  • Link following with depth control
  • Robots.txt compliance
  • Domain blacklisting/whitelisting
  • External domain support
  • Custom user agents
  • Configurable delays between requests
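
Several of these options map onto builder methods on Website. A minimal sketch, assuming the with_depth, with_user_agent, and with_delay builders present in recent spider releases (check your version if the names differ):

use spider::website::Website;

#[tokio::main]
async fn main() {
    // Builder names (with_depth / with_user_agent / with_delay) assume a recent spider release.
    let mut website = Website::new("https://example.com")
        .with_depth(3)                      // follow links at most 3 hops deep
        .with_user_agent(Some("MyBot/1.0")) // identify your crawler
        .with_delay(250)                    // wait 250 ms between requests
        .build()
        .unwrap();

    website.crawl().await;
    println!("visited {} links", website.get_links().len());
}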

See the self-hosting guide for detailed configuration and Docker setup.

spider-browser — Browser Automation Client

GitHub: spider-rs/spider-browser · npm: spider-browser

A WebSocket client for Spider’s browser automation service. Provides CDP (Chrome DevTools Protocol) access plus AI-powered methods for natural language interaction.

Quick start (TypeScript)

npm install spider-browser
import { SpiderBrowser } from "spider-browser";

const spider = new SpiderBrowser({
  apiKey: process.env.SPIDER_API_KEY,
  stealth: 0, // auto-escalates when blocked
});

await spider.init();
await spider.page.goto("https://example.com");

// Standard CDP methods
const title = await spider.page.title();
console.log("Title:", title);

// AI methods
const data = await spider.page.extract(
  "Get the main heading and first paragraph"
);
console.log("Extracted:", data);

await spider.close();

Key features

  • Full CDP protocol access (navigate, click, type, scroll, screenshot)
  • AI extract(): describe what data you want in plain English, get structured JSON
  • AI act(): describe actions in plain English (“click the login button”)
  • AI agent(): multi-step autonomous workflows
  • AI observe(): describe what to watch for, get notified when it appears
  • Stealth mode with auto-escalation when blocked
  • Available in TypeScript, Python, and Rust

spider-clients — API Client SDKs

GitHub: spider-rs/spider-clients

Official client libraries for the Spider cloud API. Available in Python, JavaScript/TypeScript, Rust, and Go.

Python

pip install spider-client
from spider import Spider

spider = Spider()  # reads SPIDER_API_KEY from the environment

# Crawl a site
pages = spider.crawl_url(
    "https://example.com",
    params={"limit": 10, "return_format": "markdown"},
)
for page in pages:
    print(f"{page['url']}: {len(page['content'])} chars")

# AI extraction
result = spider.ai_scrape(
    "https://example.com",
    "Extract the main content and any contact information",
)
print(result)

JavaScript/TypeScript

npm install @spider-cloud/spider-client
import { Spider } from "@spider-cloud/spider-client";

const spider = new Spider({ apiKey: process.env.SPIDER_API_KEY });

// Crawl with streaming
await spider.crawlUrl(
  "https://example.com",
  { limit: 10, return_format: "markdown" },
  true,
  (page) => {
    console.log(`${page.url}: ${page.content?.length} chars`);
  }
);

Go

go get github.com/spider-rs/spider-clients/go
package main

import (
    "context"
    "fmt"
    spider "github.com/spider-rs/spider-clients/go"
)

func main() {
    client := spider.New("")  // Uses SPIDER_API_KEY env var

    pages, err := client.CrawlURL(context.Background(), "https://example.com", &spider.SpiderParams{
        Limit:        10,
        ReturnFormat: spider.FormatMarkdown,
    })
    if err != nil {
        panic(err)
    }

    for _, page := range pages {
        fmt.Printf("%s: %d chars\n", page.URL, len(page.Content))
    }
}

Rust

cargo add spider-client
use spider_client::Spider;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let spider = Spider::new(None)?; // None: read SPIDER_API_KEY from the environment

    let pages = spider.crawl_url(
        "https://example.com",
        Default::default(),
        false,
        None,
    ).await?;

    for page in &pages {
        println!("{}: {} chars", page.url, page.content.as_ref().map(|c| c.len()).unwrap_or(0));
    }
    Ok(())
}

spider_transformations — HTML to Markdown

Part of the spider monorepo. Converts raw HTML into clean markdown optimized for LLM consumption.

Quick start

cargo add spider_transformations
use spider_transformations::transformation::content::{transform_content, TransformConfig};

fn main() {
    let html = r#"
        <html>
        <head><title>Example</title></head>
        <body>
            <nav>Home | About | Contact</nav>
            <main>
                <h1>Welcome</h1>
                <p>This is the main content of the page.</p>
                <ul>
                    <li>Feature one</li>
                    <li>Feature two</li>
                </ul>
            </main>
            <footer>Copyright 2026</footer>
        </body>
        </html>
    "#;

    let config = TransformConfig::default();
    let markdown = transform_content(html, "https://example.com", &config);
    println!("{}", markdown);
    // Output:
    // # Welcome
    //
    // This is the main content of the page.
    //
    // - Feature one
    // - Feature two
}

What it does

  • Strips navigation, sidebars, footers, cookie banners
  • Preserves semantic structure (headings, lists, tables, code blocks)
  • Converts links to markdown format
  • Handles nested HTML structures
  • Configurable boilerplate removal

spider_fingerprint — TLS Fingerprinting

Part of the spider monorepo. Manages TLS/HTTP fingerprints to avoid detection by anti-bot systems.

What it does

  • Generates realistic TLS client hello fingerprints
  • Rotates HTTP/2 settings and header order
  • Mimics real browser fingerprint patterns
  • Helps avoid bot detection at the TLS layer

This crate is used internally by the spider crawler when the fingerprint feature is enabled.
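
If you build the crawler from source, you opt in through Cargo features rather than calling this crate directly. A minimal sketch, assuming the feature flag is literally named fingerprint as described above:

cargo add spider --features fingerprint
# feature name taken from the sentence above; verify it against the crate's feature list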

spider_firewall — URL Blocking Rules

Part of the spider monorepo. Implements URL filtering and blocking rules for crawlers.

What it does

  • Pattern-based URL blocking (regex and glob)
  • Domain-level allow/deny lists
  • Path-based filtering
  • Resource type filtering (images, scripts, stylesheets)

Used internally to implement blacklist/whitelist behavior and resource optimization during crawls.
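
From the crawler's side, the same behavior is exposed as allow/deny lists on the builder. A hedged sketch, assuming the with_blacklist_url and with_whitelist_url builders found in recent spider releases (verify the names against your version):

use spider::website::Website;

#[tokio::main]
async fn main() {
    // Skip anything under /admin and only follow URLs under /docs.
    // with_blacklist_url / with_whitelist_url are assumed from recent spider versions.
    let mut website = Website::new("https://example.com")
        .with_blacklist_url(Some(Vec::from(["https://example.com/admin".into()])))
        .with_whitelist_url(Some(Vec::from(["https://example.com/docs".into()])))
        .build()
        .unwrap();

    website.crawl().await;

    for link in website.get_links() {
        println!("- {:?}", link.as_ref());
    }
}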

How the pieces fit together

For a self-hosted crawling pipeline:

spider (crawl) → spider_transformations (convert to markdown) → your pipeline

For cloud API usage:

spider-clients (API call) → Spider Cloud → results

For browser automation:

spider-browser (WebSocket) → Spider Browser Cloud → live browser session
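
As a sketch of the first, self-hosted pipeline, here are the two quick starts above stitched together; the transform_content call mirrors the earlier example, so adjust it if your version of the crate takes different arguments:

use spider::website::Website;
use spider_transformations::transformation::content::{transform_content, TransformConfig};

#[tokio::main]
async fn main() {
    // 1. Crawl and keep the HTML (scrape stores page bodies, crawl only collects links).
    let mut website = Website::new("https://example.com").build().unwrap();
    website.scrape().await;

    // 2. Convert each page to markdown and hand the result to your own pipeline.
    //    transform_content arguments follow the quick start above; adjust if your version differs.
    let config = TransformConfig::default();
    if let Some(pages) = website.get_pages() {
        for page in pages.iter() {
            let markdown = transform_content(&page.get_html(), page.get_url(), &config);
            println!("{}\n{}", page.get_url(), markdown);
        }
    }
}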

The OSS ecosystem gives you building blocks. The cloud API assembles them into a managed service with additional capabilities (proxies, anti-bot, AI, scaling) that aren’t practical to self-host.

Contributing

All Spider OSS projects accept contributions. The main repos:

  • spider-rs/spider: the core crawler monorepo, which also hosts spider_transformations, spider_fingerprint, and spider_firewall
  • spider-rs/spider-browser: the browser automation client
  • spider-rs/spider-clients: the API client SDKs

Issues and PRs welcome. The projects use MIT licensing, so no CLA is required.

Try the cloud API. Free credits to start, no card required.
