
Introducing Silk: Our Custom AI Model for Web Data Extraction

Spider runs Silk, a purpose-built extraction model that converts raw HTML into structured data and solves captchas on dedicated GPU infrastructure. No external API calls, no per-token billing, no data leaving our network.

7 min read · Jeff Mendez

Most scraping tools that advertise “AI extraction” are forwarding your HTML to OpenAI or Anthropic behind the scenes. You pay for the round trip in latency, cost, and data exposure. Your pages pass through someone else’s servers. Your bill scales with token count. Your pipeline breaks when their API goes down.

We built something different. Spider runs Silk, our own finetuned extraction model, on dedicated GPU infrastructure inside our network. Zero external API calls. Zero per-token billing. Zero data leaving our VPC.

Why we stopped using third-party models

We started where everyone starts. We sent HTML to GPT-4, got JSON back, and called it a day. Then we looked at the numbers.

Cost was brutal at scale. A dense product page with specs, reviews, and nested pricing tables can easily hit 15,000 tokens of HTML. At GPT-4o rates, that is roughly $0.04 per page for input tokens alone. At a million pages, that is $40,000 in extraction costs sitting on top of your crawl infrastructure.
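The arithmetic behind those numbers is easy to check. A quick sketch (the $2.50 per million input tokens rate is our assumption based on published GPT-4o pricing at the time of writing; check current rates):

```python
# Back-of-envelope cost of per-token LLM extraction at crawl scale.
# Assumed rate: ~$2.50 per 1M GPT-4o input tokens (verify against current pricing).
TOKENS_PER_PAGE = 15_000            # dense product page as HTML
RATE_PER_TOKEN = 2.50 / 1_000_000   # USD per input token (assumption)

cost_per_page = TOKENS_PER_PAGE * RATE_PER_TOKEN
cost_per_million_pages = cost_per_page * 1_000_000

print(f"${cost_per_page:.4f} per page")                 # ~$0.0375, roughly $0.04
print(f"${cost_per_million_pages:,.0f} per 1M pages")   # ~$37,500
```

And that is input tokens only, before output tokens, retries, or failed extractions.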

Latency compounded. Each external API round trip added 1 to 5 seconds per page. At thousands of pages, extraction became the bottleneck. The crawl engine would finish a page in under a second, then sit idle waiting for the LLM to respond.

Reliability was out of our hands. Rate limits, API outages, model deprecations. When a third party’s infrastructure hiccups, your extraction pipeline stops. We had incidents where OpenAI rate limits caused cascading timeouts across entire crawl jobs.

We needed extraction that was fast, predictable, and entirely ours to control.

Silk: purpose-built for extraction

Silk is a compact language model finetuned on a single task: converting messy, real-world HTML into clean structured JSON.

It is not a general-purpose chat model. It will not summarize your meeting notes or draft your emails. Every parameter in Silk is dedicated to understanding the relationship between HTML structure and the data you actually want from it.

What it does

Hand Silk a product page buried under nested divs, inline styles, tracking pixels, and ad containers. It pulls out the product name, price, description, specifications, images, and availability into a consistent JSON schema.

It handles the things that break rule-based extraction: class names that change between page loads, dynamic attributes generated by JavaScript frameworks, DOM structures that vary across the same site. Where CSS selectors and regex patterns are brittle, Silk generalizes.
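For the product page described above, the output looks something like this (a hypothetical shape for illustration only; the actual schema depends on your request):

```python
# Hypothetical example of the structured JSON an extraction model returns
# for a product page. Field names and values are illustrative, not a spec.
import json

extracted = {
    "name": "Example Widget Pro",
    "price": {"amount": 49.99, "currency": "USD"},
    "description": "A compact widget for everyday use.",
    "specifications": {"weight": "1.2 kg", "color": "black"},
    "images": ["https://example.com/img/widget-front.jpg"],
    "availability": "in_stock",
}

print(json.dumps(extracted, indent=2))
```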

Why a small model works

Extraction is fundamentally pattern matching between HTML structure and output schema. It does not require world knowledge or multi-step reasoning. A compact, task-specific model handles this efficiently on a single GPU with low latency.

Larger models cost more to run per page without meaningfully improving extraction quality. We tested this extensively. Going from our current architecture to a 70B+ parameter model improved extraction accuracy by less than 2% while increasing per-page compute cost by over 10x. The economics do not justify it.

Silk runs on dedicated GPU instances with optimized inference serving. Typical inference latency is 200 to 800 milliseconds per page depending on HTML complexity. That is 5 to 15 times faster than a round trip to an external API.

Vision models and captcha solving

Silk handles the text extraction side. For pages where the visual layout carries information the HTML does not, we run separate vision-capable models.

Screenshot-aware extraction

Some pages carry meaning in the visual layer: infographics, charts, pricing tables rendered as images, visually structured layouts where the CSS grid conveys meaning the markup alone misses. Our vision stack processes screenshots alongside the HTML, combining both signals into a single structured output.

The vision models support native JSON mode, so results come back as valid structured data without post-processing. Multiple models run in parallel for redundancy, with Spider’s routing layer picking the best available model based on the task and current infrastructure health.

Solving captchas without a captcha farm

When a crawl hits a captcha or challenge page, Spider does not hand it off to a third-party solving service. Our vision models analyze the challenge directly and act through the browser.

The system handles the full range of what you encounter in the wild:

  • reCAPTCHA v2 image grids. The model identifies which tiles match the prompt and clicks the correct ones.
  • Cloudflare Turnstile. Browser automation locates the challenge iframe and interacts with it using real browser context.
  • Slider and puzzle captchas. The model calculates the drag path from the puzzle piece to its target position.
  • PerimeterX press-and-hold. The system finds the button element and performs a timed hold action.
  • DataDome. Handled through iframe interaction with the challenge delivery system.
  • Text-based captchas. The model reads the distorted characters and submits the answer.

No humans in the loop. The model sees the challenge, reasons about it, acts through the browser. Everything runs on our infrastructure.

Every successful solve and every failure gets captured as training data. The models improve continuously from real-world crawl traffic.
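To make the slider case concrete: once the model has located the puzzle gap, solving reduces to turning that offset into a sequence of pointer positions. A conceptual sketch, not Spider's actual solver:

```python
# Hypothetical sketch: map a detected puzzle-gap offset to a humanized drag
# path (list of x-positions), easing in and out instead of moving linearly.
def drag_path(start_x: float, target_x: float, steps: int = 20) -> list[float]:
    path = []
    for i in range(steps + 1):
        t = i / steps
        eased = 3 * t**2 - 2 * t**3   # smoothstep: slow start, slow finish
        path.append(start_x + (target_x - start_x) * eased)
    return path

path = drag_path(0, 180)
print(path[0], path[-1])   # starts at 0.0, ends exactly on the target, 180.0
```

The easing matters: a perfectly linear drag at constant velocity is itself a bot signal.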

How it fits into the crawl pipeline

You do not interact with Silk directly. You make a standard API call, and Spider routes everything internally.

curl -X POST https://api.spider.cloud/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product/123",
    "return_format": "markdown",
    "extra_ai_data": true
  }'

Here is what happens behind that request:

  1. Fetch. Smart mode decides whether the page needs a lightweight HTTP request or a full browser render.
  2. Extract. Silk processes the HTML and produces structured output. For pages with significant visual content, the vision model processes a screenshot alongside the markup.
  3. Fallback. If the primary model is unavailable or returns low-confidence output, Spider falls back through the model stack automatically. No intervention needed.
  4. Stream. Results flow back as they complete. You get clean, structured data without knowing or caring which model produced it.
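The fallback step can be pictured as a simple confidence-gated chain. A conceptual sketch with made-up models and a made-up threshold; Spider's real routing logic is internal:

```python
# Conceptual sketch of confidence-based model fallback. The stand-in models
# and the 0.8 threshold are hypothetical, not Spider's actual stack.
def extract_with_fallback(html, models, threshold=0.8):
    for model in models:
        try:
            result, confidence = model(html)
        except RuntimeError:            # model unavailable: try the next one
            continue
        if confidence >= threshold:     # accept high-confidence output
            return result
    return None                         # exhausted the stack

def primary(html):                      # simulates an outage
    raise RuntimeError("model unavailable")

def backup(html):                       # healthy fallback model
    return {"title": "Example Product"}, 0.95

print(extract_with_fallback("<html>...</html>", [primary, backup]))
# → {'title': 'Example Product'}
```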

The extra_ai_data flag activates the full AI extraction pipeline. Without it, you get standard readability-based markdown conversion, which already handles most use cases well. With it, Silk and the vision models handle complex pages where rule-based approaches fall apart.
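For reference, the same request from Python using only the standard library (a sketch mirroring the curl example above; actually sending it requires a real API key):

```python
# Sketch of the same crawl request via the stdlib. The endpoint and fields
# mirror the curl example; sending requires a valid Spider API key.
import json
import urllib.request

payload = {
    "url": "https://example.com/product/123",
    "return_format": "markdown",
    "extra_ai_data": True,   # activates the full AI extraction pipeline
}

req = urllib.request.Request(
    "https://api.spider.cloud/crawl",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
    method="POST",
)
# with urllib.request.urlopen(req) as resp:   # uncomment with a real key
#     print(resp.read().decode())
print(req.full_url, req.method)
```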

What this costs

Approach | Cost per 1K pages | Latency per page | Your data leaves?
Spider (Silk) | Included in crawl credits | 200-800 ms | No
GPT-4o via API | ~$40 (input tokens) | 1-5 s | Yes
Claude Sonnet via API | ~$24 | 1-4 s | Yes
Gemini Flash via API | ~$3 | 0.5-2 s | Yes
Self-hosted large model | ~$8 (GPU compute) | 1-3 s | No

Spider’s extraction cost is baked into the per-page crawl credit. No separate line item. When you pay $0.48 per 1,000 pages, that includes Silk’s inference.
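Put next to the earlier GPT-4o estimate, the gap is stark (using the ~$0.04 per page input-token figure from above):

```python
# Compare Spider's all-in crawl credit with the earlier GPT-4o extraction estimate.
spider_per_page = 0.48 / 1000        # $0.48 per 1,000 pages, extraction included
gpt4o_extraction_per_page = 0.04     # input-token-only estimate from earlier

print(f"Spider: ${spider_per_page:.5f}/page")
ratio = gpt4o_extraction_per_page / spider_per_page
print(f"GPT-4o extraction alone costs ~{ratio:.0f}x the entire Spider crawl credit")
```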

We can afford to include it because Silk is small and efficient. A compact, task-specific model on a single GPU processes pages at a fraction of the compute cost of a general-purpose 70B+ parameter model.

Why finetuning beats prompting

We tried the prompt engineering route for months before building Silk. Here is why we moved on.

Consistency. A general-purpose model with a system prompt produces slightly different output structures on every call. Field names drift. Nesting changes. Optional fields appear and disappear. Silk was trained on thousands of extraction examples where consistency is the objective. It produces the same schema shape every time.

Speed. Silk does not waste tokens on preambles, disclaimers, or asking if you need anything else. It sees HTML, it outputs JSON. That directness translates to lower latency and lower compute cost per page.

Predictable costs. Fixed GPU infrastructure means we know exactly what extraction costs per page. No surprise bills from a provider changing their token pricing overnight.

Independence. No rate limits from external providers. No dependency on someone else’s uptime. If our GPUs are healthy, extraction works. Full stop.

What this means for your stack

The extraction layer determines the quality of everything downstream. If you are building a RAG pipeline, training dataset, or AI agent that consumes web data, bad extraction means bad embeddings, bad retrieval, and bad answers.

With Silk:

  • Your pipeline does not depend on OpenAI’s uptime. Or Anthropic’s, or Google’s. Your extraction runs on infrastructure we control.
  • Your costs are predictable. Credits, not token metering. You know what a crawl will cost before you start it.
  • Your data stays contained. Page content never leaves Spider’s GPU infrastructure. It is never forwarded to a third-party model provider.
  • Captchas do not break your crawl. When a challenge page appears mid-job, Spider solves it and keeps going. No manual intervention, no third-party CAPTCHA service, no gaps in your data.

We are training Silk continuously. Every crawl that runs through Spider generates signal that makes the next extraction better. The feedback loop is built into the product, and it compounds over time.

Try it

Silk is available on all Spider plans, including free credits. Standard extraction works out of the box. For the full AI pipeline with vision model support, add extra_ai_data: true to your request.

Get started free
