RAG Pipelines
Your AI hallucinates when its context is stale
RAG quality depends entirely on data freshness. Spider crawls your knowledge sources at 100K+ pages/sec and returns clean, embeddable markdown so your retrieval layer always reflects what is actually on the web right now.
from spider import Spider

client = Spider()

pages = client.crawl_url(
    "https://docs.example.com",
    params={
        "return_format": "markdown",
        "limit": 500,
    },
)
# 342 pages → clean markdown in 4.2s
for page in pages:
    chunks = split(page["content"])
    vectors = embed(chunks)
    db.upsert(vectors, source=page["url"])
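The `split` and `embed` calls in the loop above are placeholders for your own chunking and embedding logic. A minimal sketch of a `split` helper, assuming a simple fixed-size character window with overlap (real pipelines often split on headings or sentences instead):

```python
def split(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split markdown into overlapping fixed-size chunks.

    Illustrative only: a character window with `overlap` characters of
    context carried between consecutive chunks.
    """
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

Overlap keeps sentences that straddle a chunk boundary retrievable from both sides of the cut.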
Works with your existing stack
Spider ships as a native LangChain document loader. Crawl any site, get back chunked documents with metadata, and feed them straight into your vector store.
from langchain_community.document_loaders import SpiderLoader
loader = SpiderLoader(
    url="https://docs.example.com",
    api_key="your-api-key",
    mode="crawl",
)
documents = loader.load()
# Each document has .page_content (markdown)
# and .metadata (source URL, title, timestamp)
vector_store.add_documents(documents)

Built for RAG workflows
LLM-Ready Markdown
Every page is converted to clean, semantic markdown stripped of navigation, ads, and boilerplate. Output that embedding models actually perform well on.
Incremental Crawling
Only fetch pages that changed since your last run. Delta updates keep your vector store current without re-processing your entire corpus.
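The exact API surface for delta updates isn't shown here; as an illustration of the idea, a client-side sketch that hashes page content between runs and keeps only what changed (all names are hypothetical):

```python
import hashlib

def changed_pages(pages: list[dict], seen_hashes: dict[str, str]) -> list[dict]:
    """Return only pages whose content differs from the previous run.

    seen_hashes maps URL -> content hash from the last crawl; it is
    updated in place so the next run diffs against the new state.
    """
    fresh = []
    for page in pages:
        digest = hashlib.sha256(page["content"].encode()).hexdigest()
        if seen_hashes.get(page["url"]) != digest:
            seen_hashes[page["url"]] = digest
            fresh.append(page)
    return fresh
```

Only the pages this returns need re-chunking and re-embedding; everything else in the vector store stays as-is.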
Streaming Results
Process documents as they arrive instead of waiting for the full crawl. Start embedding while Spider is still fetching.
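From the pipeline's side, a streamed crawl is just an iterator consumed page by page. The sketch below uses a stand-in generator rather than Spider's actual streaming interface; the processing loop is the same either way:

```python
from typing import Iterator

def fake_stream() -> Iterator[dict]:
    """Stand-in for a streaming crawl: yields pages one at a time,
    as a live crawl would deliver them from the network."""
    for i in range(3):
        yield {"url": f"https://docs.example.com/p{i}", "content": f"# Page {i}"}

def process_stream(pages: Iterator[dict]) -> list[str]:
    """Embed each page as it arrives instead of after the full crawl."""
    processed = []
    for page in pages:
        # chunk, embed, and upsert page["content"] here,
        # while the crawl is still fetching the rest
        processed.append(page["url"])
    return processed
```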
Source Attribution
Every document includes its source URL, crawl timestamp, and page metadata. Ground your citations in verifiable sources.
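With a source URL and crawl timestamp on every document, grounding a citation is a matter of string formatting. A sketch, assuming the metadata keys named above:

```python
def cite(metadata: dict) -> str:
    """Format a retrieval hit's metadata as a verifiable citation."""
    title = metadata.get("title", "Untitled")
    return f'{title} ({metadata["source"]}, crawled {metadata["timestamp"]})'
```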
Webhook Delivery
Push results directly to your pipeline via webhooks. New content flows into your vector database the moment it is crawled.
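A webhook consumer is an ordinary HTTP endpoint. A minimal stdlib-only sketch; the payload shape (a JSON list of pages with `url` and `content` fields) is an assumption here, so check the webhook docs for the real schema:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def handle_payload(raw: bytes) -> list[str]:
    """Parse a webhook delivery and return the URLs it contained.

    Assumed payload shape: a JSON array of {"url": ..., "content": ...}.
    A real handler would chunk, embed, and upsert each page here.
    """
    pages = json.loads(raw)
    return [page["url"] for page in pages]

class CrawlWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        handle_payload(body)
        self.send_response(204)  # acknowledge so the delivery isn't retried
        self.end_headers()

# HTTPServer(("", 8000), CrawlWebhook).serve_forever()  # run to receive deliveries
```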
Native Integrations
First-class document loaders for LangChain and LlamaIndex. Drop Spider into your existing stack with zero glue code.