AI & Machine Learning
Data quality beats model size. The research keeps proving it.
Research consistently shows data quality has outsized impact on model performance. Microsoft's Phi series demonstrated that small models trained on curated text can match larger models trained on unfiltered web data. The bottleneck in most training pipelines is not compute. It is the hours spent writing parsers, filtering boilerplate, and deduplicating URLs. Spider handles the collection and cleaning layer, returning tokenizer-ready markdown with provenance metadata so you can focus on dataset curation.
The token efficiency gap
Same page. Same information. Dramatically different token counts. Illustrative example based on a typical documentation page.
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <link rel="stylesheet" href="...">
    <script src="analytics.js"></script>
    <script src="tracking.js"></script>
  </head>
  <body>
    <nav class="site-nav">
      <ul><li>Home</li><li>Docs</li>
      <li>API</li><li>Blog</li></ul>
    </nav>
    <div class="sidebar">...</div>
    <main>
      <h1>Fine-Tuning Guide</h1>
      <p>Prepare your dataset...</p>
    </main>
    <footer>...280 tokens...</footer>
    <script>/* bundle.min.js */</script>
  </body>
</html>
~80% of tokens are nav, scripts, styles, and markup that add zero training signal.
# Fine-Tuning Guide
Prepare your dataset by collecting
high-quality examples that represent
your target domain. Each example
should include an input prompt and
the expected completion.
## Data Format
Use JSONL with one example per line:
```json
{"prompt": "...", "completion": "..."}
```
## Quality Checks
Remove duplicates, filter short
examples, and validate JSON before
starting your training run.
Pure semantic content. Every token carries training signal.
Actual reduction varies by page complexity. Navigation-heavy sites see the largest gains, while text-dense pages see smaller but still significant reductions.
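One rough way to see the gap yourself is to measure how much of a page is markup rather than visible text. A minimal stdlib sketch, using character counts as a crude proxy for tokens (a real measurement would run both versions through your tokenizer); the sample page and all names here are illustrative:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the visible text, skipping script/style contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def markup_overhead(html: str) -> float:
    """Fraction of characters that are markup rather than visible text."""
    parser = TextExtractor()
    parser.feed(html)
    text = "".join(parser.parts)
    return 1 - len(text) / len(html)


page = (
    '<html><head><script src="analytics.js"></script></head>'
    "<body><nav><ul><li>Home</li><li>Docs</li></ul></nav>"
    "<main><h1>Fine-Tuning Guide</h1>"
    "<p>Prepare your dataset by collecting high-quality examples.</p>"
    "</main><footer>Example Corp</footer></body></html>"
)
print(f"{markup_overhead(page):.0%} of this page is markup")
```

Even this tiny sample, with no stylesheets or inline scripts, is mostly markup; production pages with trackers and bundled JavaScript skew far further.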
From raw web to training-ready in three steps
Point at a domain
Submit a URL with a page limit. Spider maps the entire site, follows sitemaps, and handles pagination. One request covers the full domain. No sitemap parsing, no link extraction logic on your side.
Get clean markdown
Navigation, ads, footers, cookie banners, and script tags are stripped automatically. You get the article body with headings and structure preserved. Each page includes its source URL, title, and crawl timestamp for provenance tracking.
Chunk and train
Pipe the output directly into your tokenizer. Split on headings for natural chunk boundaries. The metadata lets you trace every training example back to its source page, which matters when you need to audit or update your dataset later.
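Splitting on headings can be as simple as a regex over the returned markdown. A minimal sketch of two helpers of the kind used in the example further down (`split_on_headings`, `save_jsonl` are illustrative names, not part of the Spider client):

```python
import json
import re


def split_on_headings(markdown: str) -> list[str]:
    """Split markdown into chunks, starting a new chunk at each heading line."""
    # Lookahead split: cut at line starts that begin with 1-6 '#' chars.
    chunks = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [c.strip() for c in chunks if c.strip()]


def save_jsonl(records: list[dict], path: str) -> None:
    """Write one JSON object per line, the usual fine-tuning format."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


doc = "# Fine-Tuning Guide\nPrepare your dataset...\n## Data Format\nUse JSONL."
chunks = split_on_headings(doc)
# One chunk per section: the first begins "# Fine-Tuning Guide",
# the second "## Data Format".
```

Because the split keeps each heading attached to the text below it, every chunk carries its own local context into the training example.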
One API, multiple output formats
Markdown, plain text, raw HTML, CommonMark, XML, and more. Switch formats per request with
the return_format parameter.
from spider import Spider

client = Spider()

# Crawl a documentation site for training data
pages = client.crawl_url(
    "https://docs.example.com",
    params={
        "return_format": "markdown",
        "limit": 5000,
    },
)

# Build your training dataset with provenance.
# Split markdown output into training examples;
# Spider also supports server-side chunking via chunking_alg.
dataset = []
for page in pages:
    chunks = split_on_headings(page["content"])
    for chunk in chunks:
        dataset.append({
            "text": chunk,
            "source": page["url"],
            "crawled_at": page["timestamp"],
        })

save_jsonl(dataset, "training_data.jsonl")
AI Scraper Guide
Build a resilient web scraper for AI applications, from first crawl to production pipeline.
LangChain
Use Spider as a document loader in LangChain pipelines. Native integration, no glue code.
LlamaIndex
Spider Reader for LlamaIndex data ingestion and index construction.