RAG Pipelines

Your AI hallucinates when its context is stale

RAG quality depends heavily on data freshness. Spider crawls your knowledge sources at 100K+ pages/sec and returns clean, embeddable markdown so your retrieval layer always reflects what is actually on the web right now.


crawl_and_embed.py

from spider import Spider

client = Spider()  # reads the API key from the SPIDER_API_KEY environment variable
pages = client.crawl_url(
    "https://docs.example.com",
    params={
        "return_format": "markdown",
        "limit": 500,
    }
)

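# 342 pages → clean markdown in 4.2s
# split(), embed(), and db are placeholders for your own chunker,
# embedding model, and vector store client.
for page in pages:
    chunks = split(page["content"])         # markdown -> list of text chunks
    vectors = embed(chunks)                 # chunks -> list of vectors
    db.upsert(vectors, source=page["url"])  # keep the source URL with the vectors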
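
The loop above relies on a `split` helper that is not shown. A minimal sketch of one possible version follows, assuming a heading-aware splitter with a character budget. The implementation and the `max_chars` parameter are illustrative assumptions and not part of Spider.

```python
import re


def split(markdown: str, max_chars: int = 1200) -> list[str]:
    """Split markdown into chunks: first at headings, then by size.

    Hypothetical helper; a single paragraph longer than max_chars is
    kept whole rather than cut mid-sentence.
    """
    # Break before every ATX heading line so each section starts a new chunk.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks: list[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Pack the section's paragraphs into chunks under the budget.
        current = ""
        for para in section.split("\n\n"):
            candidate = f"{current}\n\n{para}" if current else para
            if len(candidate) > max_chars and current:
                chunks.append(current)
                current = para
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks
```

Splitting at headings keeps each chunk on one topic, which tends to help retrieval more than fixed-width windows do.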

WITHOUT SPIDER

Documentation changes weekly but your embeddings are months old

Raw HTML is noisy — nav bars, ads, footers tank embedding quality

Re-crawling everything is expensive when only 5% of pages changed

Building and maintaining crawlers is a distraction from your actual product

WITH SPIDER

Crawl any site and get back clean markdown — nav, ads, and boilerplate stripped automatically

Incremental updates: only re-fetch pages that actually changed since your last run

Stream results into your embedding pipeline as pages finish — don't wait for the full crawl

Drop-in integrations for LangChain, LlamaIndex, CrewAI, and MCP

Works with your existing stack

Spider ships as a native LangChain document loader. Crawl any site, get back one document per page with metadata, and feed them straight into your vector store.


from langchain_community.document_loaders import SpiderLoader

loader = SpiderLoader(
    url="https://docs.example.com",
    api_key="your-api-key",
    mode="crawl",
)

documents = loader.load()

# Each document has .page_content (markdown)
# and .metadata (source URL, title, timestamp)
# vector_store is any LangChain vector store you have already created
vector_store.add_documents(documents)

Built for RAG workflows

LLM-Ready Markdown

Every page is converted to clean, semantic markdown stripped of navigation, ads, and boilerplate. Output that embedding models actually perform well on.

Incremental Crawling

Only fetch pages that changed since your last run. Delta updates keep your vector store current without re-processing your entire corpus.
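
How Spider detects changes on its side is not described here. On the client side, a delta update can be sketched by comparing content hashes between runs. The `changed_pages` function and the `seen` mapping below are hypothetical names, and the page dictionaries are assumed to carry `url` and `content` keys as in the first snippet.

```python
import hashlib


def changed_pages(pages: list[dict], seen: dict[str, str]) -> list[dict]:
    """Return pages whose content differs from the last run and update `seen`.

    `seen` maps URL -> SHA-256 hex digest of the content last embedded.
    Persist it between runs (file, database) for this to be useful.
    """
    changed = []
    for page in pages:
        digest = hashlib.sha256(page["content"].encode("utf-8")).hexdigest()
        if seen.get(page["url"]) != digest:
            changed.append(page)
            seen[page["url"]] = digest
    return changed
```

Only the pages returned by `changed_pages` need to be re-chunked and re-embedded, so unchanged content never reaches the embedding model.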

Streaming Results

Process documents as they arrive instead of waiting for the full crawl. Start embedding while Spider is still fetching.
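
On the consuming side, streaming usually means pulling pages from an iterator and embedding them in small batches. The sketch below assumes some iterable of page dictionaries; how Spider exposes the stream is not shown here, and `batched` is a hypothetical helper.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(pages: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield lists of up to `size` pages as soon as they are available.

    Works lazily: nothing is read from `pages` until a batch is requested,
    so embedding can start before the crawl has finished.
    """
    it = iter(pages)
    while batch := list(islice(it, size)):
        yield batch
```

A loop such as `for batch in batched(page_stream, 16): ...` would then embed sixteen pages at a time, where `page_stream` stands in for whatever iterator your crawl produces.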

Source Attribution

Every document includes its source URL, crawl timestamp, and page metadata. Ground your citations in verifiable sources.
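
To keep citations traceable, the source metadata has to travel with every chunk, not only with the page. A minimal sketch, assuming each page dictionary has a `url` key and a `crawled_at` key; the latter is a hypothetical field name for the crawl timestamp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source_url: str
    crawled_at: str  # hypothetical field name for the crawl timestamp
    position: int    # index of the chunk within its page


def attribute(page: dict, chunks: list[str]) -> list[Chunk]:
    """Attach the page's source URL and crawl timestamp to each chunk."""
    return [
        Chunk(text, page["url"], page.get("crawled_at", ""), i)
        for i, text in enumerate(chunks)
    ]


def cite(chunk: Chunk) -> str:
    """Format a citation string for a retrieved chunk."""
    return f"{chunk.source_url} (chunk {chunk.position}, crawled {chunk.crawled_at})"
```

Storing these fields alongside each vector lets the generation step quote the exact page an answer came from.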

Webhook Delivery

Push results directly to your pipeline via webhooks. New content flows into your vector database the moment it is crawled.
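
On the receiving end, a webhook handler mostly needs to parse the request body and pass valid pages on to the embedding step. The sketch below assumes a JSON payload shaped like `{"pages": [{"url": ..., "content": ...}]}`. That shape and the `parse_webhook` name are assumptions for illustration, so check the actual payload format before relying on them.

```python
import json


def parse_webhook(body: bytes) -> list[dict]:
    """Extract usable pages from a webhook request body.

    ASSUMPTION: the payload is JSON shaped like
    {"pages": [{"url": "...", "content": "..."}]}.
    """
    payload = json.loads(body)
    if not isinstance(payload, dict):
        return []
    pages = payload.get("pages", [])
    # Keep only entries that carry both fields the pipeline needs.
    return [p for p in pages if p.get("url") and p.get("content")]
```

The returned list can be fed to the same chunk, embed, and upsert steps used for a regular crawl.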

Native Integrations

First-class document loaders for LangChain and LlamaIndex. Drop Spider into your existing stack with zero glue code.

GUIDE