AI training data

Tokenizer-ready markdown with provenance.

Research consistently shows that data quality has an outsized impact on model performance. Microsoft's Phi series demonstrated that small models trained on curated text can match larger models trained on unfiltered web data. Spider handles the collection and cleaning layer so you can focus on dataset curation.

Internal test: 1,000-page docs site
  • 12.4 MB clean markdown output
  • 4.2 s total crawl time
  • 0 scrapers to maintain
01 · Token efficiency

Same page, dramatically different token counts.

Illustrative example based on a typical documentation page. Token reduction varies by page complexity. Navigation-heavy sites see the largest gains; text-dense pages see smaller but still significant reductions.

Raw HTML ~2,400 tokens/page
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <link rel="stylesheet" href="...">
  <script src="analytics.js"></script>
  <script src="tracking.js"></script>
</head>
<body>
  <nav class="site-nav">
    <ul><li>Home</li><li>Docs</li>
    <li>API</li><li>Blog</li></ul>
  </nav>
  <div class="sidebar">...</div>
  <main>
    <h1>Fine-Tuning Guide</h1>
    <p>Prepare your dataset...</p>
  </main>
  <footer>...280 tokens...</footer>
  <script>/* bundle.min.js */</script>
</body></html>
~80% nav, scripts, styles, markup
Spider markdown ~450 tokens/page
# Fine-Tuning Guide

Prepare your dataset by collecting
high-quality examples that represent
your target domain. Each example
should include an input prompt and
the expected completion.

## Data Format

Use JSONL with one example per line:

```json
{"prompt": "...", "completion": "..."}
```

## Quality Checks

Remove duplicates, filter short
examples, and validate JSON before
starting your training run.
Pure semantic content
~80% fewer tokens, same information
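
The "~80% fewer tokens" figure follows directly from the two per-page estimates in this comparison. A minimal sketch of that arithmetic, using the illustrative counts above rather than measured values; the function name is hypothetical:

```python
def token_reduction(raw_tokens: int, clean_tokens: int) -> float:
    """Return the fraction of tokens removed by cleaning (0.0 to 1.0)."""
    if raw_tokens <= 0:
        raise ValueError("raw_tokens must be positive")
    return (raw_tokens - clean_tokens) / raw_tokens


# Illustrative per-page estimates from the comparison above.
reduction = token_reduction(raw_tokens=2400, clean_tokens=450)
print(f"{reduction:.0%} fewer tokens")
```

To check the figure on your own pages, replace the two constants with counts from the tokenizer you actually train with, since counts differ between tokenizers.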
02 · Pipeline

From raw web to training-ready in three steps.

Step 01

Point at a domain

Submit a URL with a page limit. Spider maps the site up to that limit, follows sitemaps, and handles pagination, so one request covers the full domain. You write no sitemap parsing and no link extraction logic.

Step 02

Get clean markdown

Navigation, ads, footers, cookie banners, and script tags are stripped automatically. You get the article body with headings and structure preserved. Each page includes its source URL, title, and crawl timestamp for provenance tracking.
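
One way to make the provenance fields hard to lose is to validate them as soon as a page arrives. The sketch below is an illustration, not Spider's response schema: the key names `url`, `title`, `timestamp`, and `content` are assumptions, and `PageRecord` and `to_record` are hypothetical helpers.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PageRecord:
    """A cleaned page plus the provenance fields described above."""
    url: str
    title: str
    crawled_at: str
    content: str


def to_record(page: dict) -> PageRecord:
    """Build a PageRecord, failing loudly if provenance is missing.

    The key names read from `page` are assumptions about the response shape.
    """
    missing = [k for k in ("url", "timestamp", "content") if not page.get(k)]
    if missing:
        raise ValueError(f"page is missing required fields: {missing}")
    return PageRecord(
        url=page["url"],
        title=page.get("title", ""),
        crawled_at=page["timestamp"],
        content=page["content"],
    )


# Hypothetical page, shaped like the fields named above.
page = {
    "url": "https://docs.example.com/fine-tuning",
    "title": "Fine-Tuning Guide",
    "timestamp": "2024-01-01T00:00:00Z",
    "content": "# Fine-Tuning Guide\n\nPrepare your dataset...",
}
record = to_record(page)
print(asdict(record)["url"])
```

Rejecting pages without a URL or timestamp at this stage means that every example produced later can still be traced to a source.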

Step 03

Chunk and train

Pipe the output directly into your tokenizer. Split on headings for natural chunk boundaries. The metadata lets you trace every training example back to its source page, which matters when you need to audit or update your dataset later.
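
Splitting on headings can be sketched in a few lines. The version below is one possible implementation, not Spider's own chunker: it assumes ATX-style headings (`#` through `######`) and skips lines inside fenced code blocks, so a `# comment` in a code sample does not start a new chunk.

```python
import re

FENCE = "`" * 3  # a markdown code fence marker
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_on_headings(markdown: str) -> list:
    """Split markdown into one chunk per heading section.

    Returns a list of {"heading": ..., "text": ...} dicts.
    """
    chunks = []
    current = {"heading": "", "lines": []}
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            # Close the previous section if it holds any real content.
            if any(ln.strip() for ln in current["lines"]):
                chunks.append(current)
            current = {"heading": match.group(2).strip(), "lines": [line]}
        else:
            current["lines"].append(line)
    if any(ln.strip() for ln in current["lines"]):
        chunks.append(current)
    return [
        {"heading": c["heading"], "text": "\n".join(c["lines"]).strip()}
        for c in chunks
    ]


sample = "\n".join([
    "# Fine-Tuning Guide",
    "",
    "Prepare your dataset.",
    "",
    "## Data Format",
    "",
    FENCE + "python",
    "# not a heading",
    FENCE,
])
chunks = split_on_headings(sample)
for chunk in chunks:
    print(chunk["heading"], len(chunk["text"]))
```

Keeping the heading text alongside each chunk gives every training example a human-readable label in addition to its source URL.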

03 · Output formats

One API, multiple output formats.

Markdown, plain text, raw HTML, CommonMark, XML, and more. Switch formats per request with the return_format parameter.

return_format values: markdown, text, raw, commonmark, xml
training_pipeline.py Python
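import json
import re

from spider import Spider

# Assumes SPIDER_API_KEY is set in the environment
client = Spider()


def split_on_headings(markdown):
    """Minimal sketch: start a new chunk at each markdown heading line.

    Does not special-case fenced code blocks.
    """
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [part.strip() for part in parts if part.strip()]


def save_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Crawl a documentation site for training data
pages = client.crawl_url(
    "https://docs.example.com",
    params={
        "return_format": "markdown",
        "limit": 5000,
    }
)

# Split markdown output into training examples, keeping provenance.
# Spider also supports server-side chunking via chunking_alg.
dataset = []
for page in pages:
    for chunk in split_on_headings(page["content"]):
        dataset.append({
            "text": chunk,
            "source": page["url"],
            "crawled_at": page["timestamp"],
        })

save_jsonl(dataset, "training_data.jsonl")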
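
Because every record carries its source URL, an audit can be a simple group-by over the JSONL file. The sketch below assumes the record shape written by the script above (`text`, `source`, `crawled_at`); the helper names, demo rows, and temporary file path are hypothetical.

```python
import json
import os
import tempfile
from collections import Counter


def load_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def examples_per_source(records):
    """Count the training examples contributed by each source URL."""
    return Counter(r["source"] for r in records)


def drop_source(records, url):
    """Remove every example traced to one page, e.g. after a takedown."""
    return [r for r in records if r["source"] != url]


# Demo rows so the sketch runs without a crawl.
demo = [
    {"text": "# Intro", "source": "https://docs.example.com/a",
     "crawled_at": "2024-01-01T00:00:00Z"},
    {"text": "## Setup", "source": "https://docs.example.com/a",
     "crawled_at": "2024-01-01T00:00:00Z"},
    {"text": "# API", "source": "https://docs.example.com/b",
     "crawled_at": "2024-01-01T00:00:00Z"},
]
path = os.path.join(tempfile.mkdtemp(), "training_data.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for row in demo:
        f.write(json.dumps(row) + "\n")

records = load_jsonl(path)
counts = examples_per_source(records)
kept = drop_source(records, "https://docs.example.com/a")
print(counts.most_common(1), len(kept))
```

The same filter works for refreshing a dataset: drop the examples for a page, crawl it again, and append the new chunks.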
Start

Better data, fewer scrapers.

Start collecting clean, structured training data in minutes. No scrapers to build, no HTML to parse, no boilerplate to filter.

spider crawl --format markdown --limit 10000