Content aggregation

Batch your sources, get clean markdown back.

Pass a list of URLs to the crawl endpoint. Spider handles JavaScript rendering, strips navigation and ads with readability extraction, and returns the same markdown shape every time, regardless of source.

Aggregated feed Live
  • NEWS 2m ago
    EU passes landmark AI regulation framework
  • BLOG 8m ago
    Building vector search from scratch in Rust
  • DOCS 12m ago
    Cloudflare Workers: new streaming API
  • NEWS 15m ago
    OpenAI open-sources reasoning model weights
4 sources · markdown · all clean
01 · Why custom scrapers fail

One API replaces every per-site scraper.

Every custom scraper is a ticking clock. Sites redesign. Paywalls change their cookie flow. JavaScript frameworks swap out the DOM. You find out when the pipeline goes silent.

Batch URLs into a single request. No selectors to maintain, no rendering to manage, no format glue code.

pipeline.log 03:41
03:41:12 ERR reuters scraper: selector .article-body not found (site redesign?)
03:41:12 WARN reuters: falling back to RSS, got 2 sentences + "read more"
03:41:14 ERR medium scraper: 403 Forbidden (cloudflare challenge)
03:41:15 ERR substack scraper: empty body (content loads via JS, HTTP-only)
03:41:18 WARN dev.to: HTML returned but readability extraction null
03:41:19 ERR techcrunch: paywall cookie expired, need manual login
03:41:19 INFO pipeline: 1/6 sources returned usable content
03:41:19 INFO paging on-call engineer
02 · Readability extraction

Article only. No nav, ads, or banners.

Most of a typical web page is navigation, ads, and layout. Spider isolates the article and returns just the content, along with structured metadata.

Spider output readability: true · metadata: true
{
  "url": "https://example.com/post",
  "status": 200,
  "metadata": {
    "title": "The Actual Title",
    "description": "A clear ...",
    "keywords": ["web", "dev"],
    "og_image": "https://..."
  },
  "content": "# The Actual Title\n\nThe content you actually wanted, in clean markdown.\nNo nav. No ads. No cookie banners.\n..."
}
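
A minimal sketch of consuming that output in Python, assuming each page arrives as a JSON object with the fields shown above. The `Page` class and `parse_page` helper are hypothetical names for illustration, not part of any Spider client.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """One crawled page, mirroring the fields in the sample output."""
    url: str
    status: int
    content: str
    title: str = ""
    description: str = ""
    keywords: list[str] = field(default_factory=list)
    og_image: str = ""


def parse_page(raw: dict) -> Page:
    """Build a Page from one response object, tolerating missing metadata."""
    meta = raw.get("metadata") or {}
    return Page(
        url=raw["url"],
        status=int(raw.get("status", 0)),
        content=raw.get("content") or "",
        title=meta.get("title") or "",
        description=meta.get("description") or "",
        keywords=list(meta.get("keywords") or []),
        og_image=meta.get("og_image") or "",
    )


# Sample object shaped like the output above (values shortened).
sample = {
    "url": "https://example.com/post",
    "status": 200,
    "metadata": {"title": "The Actual Title", "keywords": ["web", "dev"]},
    "content": "# The Actual Title\n\nThe content you actually wanted.",
}
page = parse_page(sample)
print(page.title, page.status, page.keywords)
```

Keeping the parsing in one place means downstream code reads `page.title` or `page.content` and never touches raw dictionaries.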
03 · Every source, same shape

React SPA, server HTML, REST API. One output.

Reuters wraps articles in a React app. Substack uses server-rendered HTML. Dev.to has an API with a different schema. Spider renders, extracts, and normalizes each into the same markdown-plus-metadata structure.

reuters.com · React SPA
substack.com · Server HTML
dev.to · REST API
blog.cloudflare.com · Static + JS
Unified markdown: title + description + keywords + content
04 · What it replaces

Delete the per-site scraper directory.

Deleted 14 files
scrapers/
  reuters.ts
  medium.ts
  substack.ts
  techcrunch.ts
  devto.ts
  ... 4 more
formatters/
  normalize-html.ts
  rss-parser.ts
  metadata-extract.ts
puppeteer-pool.ts
selector-registry.json
Added 1 file
content-feed.ts

Import the client, pass URLs as a comma-separated list, get clean markdown. content-feed.ts calls spider.crawlUrl() with readability: true and metadata: true.

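
A rough Python counterpart to that one file might look like the sketch below. It assumes the client exposes `crawl_url(url, params=...)` and returns a list of page objects with `status` and `content` fields. `fetch_feed`, `FEED_PARAMS`, and `FakeClient` are hypothetical names, and the fake client stands in for the real one so the sketch runs without network access.

```python
from typing import Any, Iterable, Protocol


class CrawlClient(Protocol):
    """Anything with a crawl_url(url, params) method (assumed client shape)."""
    def crawl_url(self, url: str, params: dict[str, Any]) -> Any: ...


# Options named on this page: readability, metadata, and markdown output.
FEED_PARAMS = {"readability": True, "metadata": True, "return_format": "markdown"}


def fetch_feed(client: CrawlClient, urls: Iterable[str]) -> list[dict]:
    """Send all URLs in one request and keep pages that returned content."""
    joined = ",".join(urls)  # comma-separated list, as described above
    pages = client.crawl_url(joined, params=FEED_PARAMS) or []
    return [p for p in pages if p.get("status") == 200 and p.get("content")]


class FakeClient:
    """Stand-in for the real client; returns one canned page per URL."""
    def crawl_url(self, url: str, params: dict[str, Any]) -> list[dict]:
        return [
            {"url": u, "status": 200, "content": f"# Page at {u}"}
            for u in url.split(",")
        ]


feed = fetch_feed(FakeClient(), ["https://example.com/a", "https://example.com/b"])
print([p["url"] for p in feed])
```

To run it against the live API, you would pass a real client object in place of `FakeClient()`; the filtering logic stays the same.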
05 · Who builds with this

Teams shipping content daily.

Newsrooms

Editorial collection

Wire services, local papers, trade publications, competitor blogs. Instead of 40 browser tabs every morning, your content pipeline delivers a unified feed. Editors spend their time writing and curating.

Newsletters

Curate at scale

Pull from your source list, extract key paragraphs, feed them into your template. Hours of tab-switching collapse into one API call.

Research

Track topics across the web

Regulatory updates, competitor announcements, academic pre-prints. Aggregate specialized sources into your analysis workflow or knowledge base.

AI products

Feed your RAG pipeline

Clean markdown with consistent metadata, ready to chunk, embed, and retrieve. Keep your model's answers grounded in current information rather than only its training data.
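
As an illustration of the "ready to chunk" step, here is one simple way to split returned markdown before embedding: break at headings, then pack paragraphs up to a size limit. This is a generic sketch, not something the API does for you; `chunk_markdown` and the 1200-character default are assumptions chosen for the example.

```python
import re


def chunk_markdown(markdown: str, max_chars: int = 1200) -> list[str]:
    """Split at headings, then pack paragraphs into chunks of about max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    # Zero-width split before any line that starts with 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks: list[str] = []
    for section in sections:
        current = ""
        for para in section.split("\n\n"):
            para = para.strip()
            if not para:
                continue
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append(current)
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        if current:
            chunks.append(current)
    return chunks


doc = "# A\n\nalpha\n\n## B\n\nbeta"
print(chunk_markdown(doc))
```

Splitting on headings first keeps each chunk inside one section, so a retrieved chunk carries its own heading as context.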

06 · Under the hood

What ships with the API.

Rendering Core

Smart JavaScript rendering

By default, request: "smart" detects when a page needs JavaScript and only then falls back to Chrome rendering. For JS-heavy sources, set request: "chrome" to force full browser rendering on every page.
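
One way to apply this per source is to keep a short list of hosts you already know need a full browser and default everything else to smart mode. In the sketch below, the `request` values come from the text above; `crawl_params` and the `JS_HEAVY_HOSTS` list are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical list: hosts you have found to need full browser rendering.
JS_HEAVY_HOSTS = frozenset({"reuters.com"})


def crawl_params(url: str, js_heavy_hosts: frozenset = JS_HEAVY_HOSTS) -> dict:
    """Pick request: "chrome" for known JS-heavy hosts, "smart" otherwise."""
    host = (urlparse(url).hostname or "").removeprefix("www.")
    mode = "chrome" if host in js_heavy_hosts else "smart"
    return {"request": mode, "return_format": "markdown"}


print(crawl_params("https://www.reuters.com/world/"))
print(crawl_params("https://blog.cloudflare.com/post"))
```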

Delivery Webhooks

Push, do not poll

Set up a webhook endpoint and Spider pushes content the moment it is ready. No cron jobs checking for updates on a loop.
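
A bare-bones receiver could look like the sketch below, using only the Python standard library. The payload shape is an assumption: it treats the pushed body as JSON holding either one page object or a list of them, each with a `content` field, so check the actual webhook payload before relying on it. `handle_payload`, `WebhookHandler`, and `serve` are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory store for the sketch; a real pipeline would write to a queue or DB.
received: list[dict] = []


def handle_payload(body: bytes) -> int:
    """Store pages from one delivery and return how many were accepted.

    Assumes the body is a JSON object or list of objects (not confirmed here).
    """
    data = json.loads(body or b"[]")
    pages = data if isinstance(data, list) else [data]
    accepted = [p for p in pages if isinstance(p, dict) and p.get("content")]
    received.extend(accepted)
    return len(accepted)


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        try:
            count = handle_payload(self.rfile.read(length))
        except json.JSONDecodeError:
            self.send_response(400)  # reject bodies that are not valid JSON
            self.end_headers()
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(str(count).encode())


def serve(port: int = 8080) -> None:
    """Block and listen for webhook deliveries on localhost."""
    HTTPServer(("127.0.0.1", port), WebhookHandler).serve_forever()
```

Calling `serve()` starts the listener; the parsing lives in `handle_payload` so it can be tested without opening a socket.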

Metadata Extraction

Structured fields on every page

Enable metadata: true to get title, description, keywords, Open Graph image, domain, file size, and resource type on every page. Combine with return_headers: true for full HTTP response headers.

Pricing Pay-per-use

JS rendering included

Crawl 10 URLs or 10,000. JavaScript rendering and bot-protection bypass are included; bandwidth and compute scale with what you actually fetch.

Start

One API replaces every custom scraper in the pipeline.

Use it alongside RSS or as a complete replacement when feeds are unavailable or truncated.

spider.crawl_url("https://...", params={"readability": True, "return_format": "markdown"})