RAG pipelines

Clean markdown for your retrieval layer.

RAG quality is a function of what goes into your vector store. If you embedded your docs three months ago and the source has changed since, your retrieval layer is returning outdated context. Spider gives you clean, current markdown so every re-embedding run works from the latest source.

crawl_and_embed.py Python
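from spider import Spider

# With no arguments, the client reads the API key from SPIDER_API_KEY.
client = Spider()
pages = client.crawl_url(
    "https://docs.example.com",
    params={
        "return_format": "markdown",
        "limit": 500,  # maximum number of pages to crawl
    }
)

# Example run: 342 pages returned.
# split(), embed(), and db are placeholders for your own chunker,
# embedding model, and vector store client.
for page in pages:
    chunks = split(page["content"])
    vectors = embed(chunks)
    db.upsert(vectors, source=page["url"])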
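The split() call in the loop above is not defined on this page. One minimal way to fill it in, assuming you want one chunk per markdown heading, is sketched below. The function body is an illustration rather than part of Spider; a production chunker would usually also cap chunk size.

```python
import re

# Matches ATX headings such as "# Title" or "## Section".
HEADING = re.compile(r"^#{1,6}\s")
# Matches the opening or closing line of a fenced code block.
FENCE = re.compile(r"^(```|~~~)")


def split(markdown: str) -> list[str]:
    """Split markdown into one chunk per heading-delimited section."""
    chunks: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in markdown.splitlines():
        if FENCE.match(line):
            in_fence = not in_fence
        # Start a new chunk at each heading, but never inside a code fence,
        # where a leading "#" is usually a comment rather than a heading.
        if HEADING.match(line) and not in_fence and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]
```

Splitting on headings keeps each chunk topically coherent, which tends to suit documentation pages where every section answers one question.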
01 · Freshness

Embeddings do not update themselves.

The moment your source docs change, your retrieval layer starts drifting from reality. A scheduled re-crawl with content-hash diffing keeps only the changed pages in the re-embedding queue.

Stale pipeline
  1. Jan: Crawled docs, built embeddings. Everything is accurate.
  2. Feb: Source docs updated. Your embeddings still reflect January.
  3. Mar: API endpoints changed. The AI cites deprecated methods.
  4. Apr: Users get wrong answers. Trust erodes. They stop asking.
Incremental updates
  1. Jan: Crawled docs, built embeddings. Everything is accurate.
  2. Feb: Re-crawl with Spider, compare content hashes. Only 12 changed pages get re-embedded.
  3. Mar: Scheduled re-crawl catches API changes within hours. Pipeline re-embeds only what changed.
  4. Apr: AI answers match current documentation. Users trust the system.
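The content-hash comparison in the incremental flow can be sketched in a few lines. This is a minimal illustration, assuming each crawled page is a dict with "url" and "content" keys as in the crawl example; the helper names content_hash and changed_pages and the `seen` mapping are hypothetical and not part of Spider.

```python
import hashlib


def content_hash(markdown: str) -> str:
    """Return a stable SHA-256 fingerprint of a page's markdown."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()


def changed_pages(pages: list[dict], seen: dict[str, str]) -> list[dict]:
    """Return pages whose content differs from the previous crawl.

    `seen` maps URL -> hash from the last run and is updated in place.
    """
    changed = []
    for page in pages:
        digest = content_hash(page["content"])
        if seen.get(page["url"]) != digest:
            changed.append(page)
            seen[page["url"]] = digest
    return changed
```

Persist `seen` between runs (a JSON file or a database table both work) and pass only the returned pages to your chunking and embedding steps.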
02 · What your embeddings see

Embedding models cannot tell content from boilerplate.

If half the tokens are navigation links, cookie banners, and ad scripts, similarity search ends up matching on noise. Cleaning the input is one of the highest-leverage improvements you can make to retrieval quality.

Without Spider ~3,200 tokens
<!DOCTYPE html>
<html><head>
<script src="/analytics.js"></script>
<script src="/hotjar.js"></script>
</head><body>
<nav class="main-nav">
  <a href="/">Home</a>
  <a href="/docs">Docs</a>
  <a href="/pricing">Pricing</a>
  <a href="/blog">Blog</a>
  <a href="/login">Sign In</a>
</nav>
<div class="cookie-banner">
  We use cookies...
</div>
<div class="sidebar">
  <a href="/docs/intro">Introduction</a>
  <a href="/docs/auth">Authentication</a>
  <a href="/docs/api">API Reference</a>
  ... 40 more sidebar links ...
</div>
<article>
  <h1>Authentication</h1>
  <p>Pass your API key as a
  Bearer token in the header.</p>
</article>
<footer>
  ... 200 lines of footer ...
</footer>
<script src="/chat-widget.js"></script>
</body></html>
Most tokens are boilerplate
With Spider ~180 tokens
# Authentication

Pass your API key as a Bearer token
in the Authorization header.

```bash
curl https://api.example.com/v1/data \
  -H "Authorization: Bearer sk-your-key"
```

## Rate Limits

Each key allows 1,000 requests per
minute. Exceeding this returns a
`429 Too Many Requests` response.

## Error Handling

All errors follow a standard format:

```json
{
  "error": {
    "code": "rate_limited",
    "message": "Retry after 60s"
  }
}
```

---
source: docs.example.com/auth
crawled: 2026-04-02T08:14:22Z
Only semantic content remains

Token reduction varies by page. Content-heavy documentation pages typically see 80-95% fewer tokens after Spider strips navigation, scripts, and boilerplate. More pages fit in your context window and every chunk carries actual meaning.
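As a quick check, the approximate counts from the example page above (about 3,200 tokens of raw HTML against about 180 tokens of markdown) fall at the upper end of that range. The helper below is only illustrative arithmetic; real counts depend on your tokenizer.

```python
def token_reduction(before: int, after: int) -> float:
    """Percentage of tokens removed when going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100 * (before - after) / before


# Approximate counts taken from the example page above.
print(round(token_reduction(3200, 180), 1))  # 94.4
```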

03 · Integrations

Drop into your existing stack.

Spider ships as a native document loader for LangChain and LlamaIndex. Call loader.load(), get back documents with metadata attached, and upsert them into your vector store. Each document includes source URL and crawl timestamp for attribution.

LangChain LlamaIndex CrewAI MCP
LangChain loader Python
from langchain_community.document_loaders import SpiderLoader

loader = SpiderLoader(
    url="https://docs.example.com",
    api_key="your-api-key",
    mode="crawl",
)

documents = loader.load()

# .page_content = clean markdown
# .metadata = source URL, title, timestamp
vector_store.add_documents(documents)
04 · At scale

Built for thousands of pages.

RAG pipelines often cover large documentation sites. Here is what to expect when you scale up.

Streaming

Streaming delivery

Use lazy_load() with the LangChain loader or stream from the API. Pages arrive as they are crawled, so your embedding pipeline starts processing without waiting for the full crawl.
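When pages arrive one at a time, it usually helps to group them into small batches before calling the embedding model. The sketch below works with any iterable, including the generator returned by a loader's lazy_load(); the batch size of 32 and the vector_store object in the usage comment are assumptions for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield lists of up to `size` items without materializing the input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice pulls at most `size` items per pass; an empty list ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch


# Hypothetical usage with a LangChain loader and vector store:
# for batch in batched(loader.lazy_load(), 32):
#     vector_store.add_documents(batch)
```

Because the generator is consumed lazily, memory use stays proportional to the batch size rather than to the size of the crawl.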

Webhooks

Webhook notifications

Set a webhook URL and Spider POSTs results as pages complete. Useful for large crawls where you want to decouple the crawl from your ingestion pipeline. Events include on_find and on_website_status.
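On the receiving side, decoupling usually means acknowledging the POST quickly and handing the page to a queue. The handler below is a framework-agnostic sketch: the payload field names "url" and "content" are assumptions made for illustration, so check the payload Spider actually sends before relying on them.

```python
import json
import queue

# Pages waiting to be chunked and embedded by a separate worker.
ingest_queue: "queue.Queue[dict]" = queue.Queue()


def handle_webhook(body: bytes) -> int:
    """Parse one webhook POST body, enqueue it, and return an HTTP status.

    The "url" and "content" field names are hypothetical.
    """
    try:
        payload = json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400
    if not isinstance(payload, dict) or "url" not in payload:
        return 400
    ingest_queue.put(
        {"url": payload["url"], "content": payload.get("content", "")}
    )
    # Acknowledge quickly; the slow embedding work happens off the request path.
    return 202
```

Call handle_webhook from the POST route of whichever web framework you use, and run a worker that drains ingest_queue into your embedding pipeline.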

Cost

Cost at volume

Spider charges per page based on bandwidth and compute. Crawling 10,000 documentation pages in markdown mode typically costs a few dollars, depending on page size. See the pricing page for current rates.

05 · Resources

Keep reading.

Start

Feed your retrieval layer fresh content.

Your retrieval layer is only as good as its data. Start crawling clean, structured markdown today.

spider crawl --return-format markdown