Clean markdown for your retrieval layer.
RAG quality is a function of what goes into your vector store. If you embedded your docs three months ago and the source has changed since, your retrieval layer is returning outdated context. Spider gives you clean, current markdown so re-embedding always works from the latest source.
from spider import Spider

client = Spider()
pages = client.crawl_url(
    "https://docs.example.com",
    params={
        "return_format": "markdown",
        "limit": 500,
    }
)
# 342 pages, fast

# split, embed, and db stand in for your own chunker, embedding model, and vector store
for page in pages:
    chunks = split(page["content"])
    vectors = embed(chunks)
    db.upsert(vectors, source=page["url"])

Embeddings do not update themselves.
The moment your source docs change, your retrieval layer starts drifting from reality. A scheduled re-crawl with content-hash diffing keeps only the changed pages in the re-embedding queue.
Without re-crawling:
- Jan: Crawled docs, built embeddings. Everything is accurate.
- Feb: Source docs updated. Your embeddings still reflect January.
- Mar: API endpoints changed. The AI cites deprecated methods.
- Apr: Users get wrong answers. Trust erodes. They stop asking.

With scheduled re-crawls:
- Jan: Crawled docs, built embeddings. Everything is accurate.
- Feb: Re-crawl with Spider, compare content hashes. Only 12 changed pages get re-embedded.
- Mar: Scheduled re-crawl catches API changes within hours. Pipeline re-embeds only what changed.
- Apr: AI answers match current documentation. Users trust the system.
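The content-hash diffing described above can be sketched in a few lines. This is a minimal illustration rather than Spider's own implementation: `content_hash` and `diff_crawl` are hypothetical helper names, and the sketch assumes each crawled page is a dict with `url` and `content` keys, as in the crawl loop earlier on this page.

```python
import hashlib


def content_hash(markdown: str) -> str:
    """Return a stable SHA-256 digest of a page's markdown."""
    # Normalize line endings and outer whitespace so cosmetic differences
    # do not register as a content change.
    normalized = markdown.replace("\r\n", "\n").strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def diff_crawl(pages, stored_hashes):
    """Compare a fresh crawl against the hashes saved from the previous one.

    pages: iterable of dicts with "url" and "content" keys.
    stored_hashes: dict mapping url -> digest from the previous crawl.
    Returns (changed_pages, removed_urls, new_hashes).
    """
    changed = []
    new_hashes = {}
    for page in pages:
        digest = content_hash(page["content"])
        new_hashes[page["url"]] = digest
        if stored_hashes.get(page["url"]) != digest:
            changed.append(page)  # new page or modified content
    # URLs missing from the fresh crawl are candidates for deletion.
    removed = sorted(set(stored_hashes) - set(new_hashes))
    return changed, removed, new_hashes
```

Only the pages in `changed` need to go through the embedding step again. Persist `new_hashes` after each run so the next crawl has something to compare against.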
Embedding models treat every token equally.
If half the tokens are navigation links, cookie banners, and ad scripts, similarity search is matching on noise. Clean input is the single highest-leverage improvement you can make to retrieval quality.
<!DOCTYPE html>
<html><head>
  <script src="/analytics.js"></script>
  <script src="/hotjar.js"></script>
</head><body>
  <nav class="main-nav">
    <a href="/">Home</a>
    <a href="/docs">Docs</a>
    <a href="/pricing">Pricing</a>
    <a href="/blog">Blog</a>
    <a href="/login">Sign In</a>
  </nav>
  <div class="cookie-banner"> We use cookies... </div>
  <div class="sidebar">
    <a href="/docs/intro">Introduction</a>
    <a href="/docs/auth">Authentication</a>
    <a href="/docs/api">API Reference</a>
    ... 40 more sidebar links ...
  </div>
  <article>
    <h1>Authentication</h1>
    <p>Pass your API key as a Bearer token in the header.</p>
  </article>
  <footer> ... 200 lines of footer ... </footer>
  <script src="/chat-widget.js"></script>
</body></html>
# Authentication
Pass your API key as a Bearer token
in the Authorization header.
```bash
curl https://api.example.com/v1/data \
  -H "Authorization: Bearer sk-your-key"
```
## Rate Limits
Each key allows 1,000 requests per
minute. Exceeding this returns a
`429 Too Many Requests` response.
## Error Handling
All errors follow a standard format:
```json
{
  "error": {
    "code": "rate_limited",
    "message": "Retry after 60s"
  }
}
```
---
source: docs.example.com/auth
crawled: 2026-04-02T08:14:22Z

Token reduction varies by page. Content-heavy documentation pages typically see 80-95% fewer tokens after Spider strips navigation, scripts, and boilerplate. More pages fit in your context window and every chunk carries actual meaning.
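Because clean markdown keeps its heading structure, one way to fill in the `split` step from the earlier crawl loop is to chunk on headings, so each chunk covers a single topic. The sketch below is one possible approach, not part of Spider; `split_by_heading` is a hypothetical name, and the fence handling is deliberately simple (it toggles on any line that starts with three backticks or tildes).

```python
import re

HEADING = re.compile(r"^#{1,6}\s+\S")
FENCE = re.compile(r"^(```|~~~)")


def split_by_heading(markdown: str) -> list[str]:
    """Split markdown into chunks that each begin at a heading."""
    chunks, current, in_fence = [], [], False
    for line in markdown.splitlines():
        if FENCE.match(line):
            # Track code fences so "# comment" lines inside code
            # are not mistaken for headings.
            in_fence = not in_fence
        elif not in_fence and HEADING.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]
```

Applied to the sample page above, this would yield one chunk each for Authentication, Rate Limits, and Error Handling. Long sections may still need a secondary size-based split before embedding.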
Drop into your existing stack.
Spider ships as a native document loader for LangChain and LlamaIndex. Call loader.load(), get back documents with metadata attached, and upsert them into your vector store. Each document includes source URL and crawl timestamp for attribution.
from langchain_community.document_loaders import SpiderLoader

loader = SpiderLoader(
    url="https://docs.example.com",
    api_key="your-api-key",
    mode="crawl",
)
documents = loader.load()
# .page_content = clean markdown
# .metadata = source URL, title, timestamp
vector_store.add_documents(documents)

Built for thousands of pages.
RAG pipelines often cover large documentation sites. Here is what to expect when you scale up.
Streaming delivery
Use lazy_load() with the LangChain loader or stream from the API. Pages arrive as they are crawled, so your embedding pipeline starts processing without waiting for the full crawl.
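To take advantage of streaming, the consumer has to avoid collecting everything into a list first. A minimal sketch of that idea: group documents into small batches as they arrive and hand each batch to the vector store. `batched` is a hypothetical helper, and the batch size of 32 in the usage comment is an arbitrary assumption.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items without materializing the input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Usage, assuming `loader` is a SpiderLoader and `vector_store` is your store:
# for batch in batched(loader.lazy_load(), 32):
#     vector_store.add_documents(batch)
```

Batching keeps memory flat on large crawls and lets embedding begin as soon as the first pages arrive.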
Webhook notifications
Set a webhook URL and Spider POSTs results as pages complete. Useful for large crawls where you want to decouple the crawl from your ingestion pipeline. Events include on_find and on_website_status.
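On the receiving side, a webhook endpoint only needs to accept the POST, queue the page, and acknowledge quickly. The sketch below uses Python's standard library. The payload schema is an assumption made for illustration: the `event`, `url`, and `content` field names are hypothetical, so check the actual webhook payload before relying on them. `route_event`, `WebhookHandler`, and `serve` are likewise illustrative names.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from queue import Queue

# Pages waiting to be chunked and embedded by a separate worker.
ingest_queue: Queue = Queue()


def route_event(payload: dict) -> str:
    """Queue page events and report which branch handled the payload.

    ASSUMPTION: the payload has an "event" field naming the event type and,
    for on_find, "url" and "content" fields. Adjust to the real schema.
    """
    event = payload.get("event")
    if event == "on_find":
        ingest_queue.put(
            {"url": payload.get("url"), "content": payload.get("content")}
        )
        return "queued"
    if event == "on_website_status":
        return "status"
    return "ignored"


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            payload = None
        if not isinstance(payload, dict):
            self.send_response(400)  # reject bodies that are not JSON objects
            self.end_headers()
            return
        route_event(payload)
        # Acknowledge right away; embedding happens in the queue consumer.
        self.send_response(204)
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Run the receiver until interrupted (port 8000 is an arbitrary choice)."""
    HTTPServer(("0.0.0.0", port), WebhookHandler).serve_forever()
```

Keeping the handler thin means a slow embedding model or vector store never delays the acknowledgement, which is the decoupling the webhook approach is meant to provide.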
Cost at volume
Spider charges per page based on bandwidth and compute. Crawling 10,000 documentation pages in markdown mode typically costs a few dollars, depending on page size. See the pricing page for current rates.
Feed your retrieval layer fresh content.
Your retrieval layer is only as good as its data. Start crawling clean, structured markdown today.