Content aggregation

Batch your sources, get clean markdown back.

Pass a list of URLs to the crawl endpoint. Spider handles JavaScript rendering, strips navigation and ads with readability extraction, and returns the same markdown shape every time, regardless of source.


Aggregated feed Live

NEWS 2m ago
EU passes landmark AI regulation framework
BLOG 8m ago
Building vector search from scratch in Rust
DOCS 12m ago
Cloudflare Workers: new streaming API
NEWS 15m ago
OpenAI open-sources reasoning model weights

4 sources · markdown · all clean

01 · Why custom scrapers fail

One API replaces every per-site scraper.

Every custom scraper is a ticking clock. Sites redesign. Paywalls change their cookie flow. JavaScript frameworks swap out the DOM. You find out when the pipeline goes silent.

Batch URLs into a single request. No selectors to maintain, no rendering to manage, no format glue code.

pipeline.log 03:41

03:41:12 ERR reuters scraper: selector .article-body not found (site redesign?)

03:41:12 WARN reuters: falling back to RSS, got 2 sentences + "read more"

03:41:14 ERR medium scraper: 403 Forbidden (cloudflare challenge)

03:41:15 ERR substack scraper: empty body (content loads via JS, HTTP-only)

03:41:18 WARN dev.to: HTML returned but readability extraction null

03:41:19 ERR techcrunch: paywall cookie expired, need manual login

03:41:19 INFO pipeline: 1/6 sources returned usable content

03:41:19 INFO paging on-call engineer

02 · Readability extraction

Article only. No nav, ads, or banners.

Most of a typical web page is navigation, ads, and layout. Spider isolates the article and returns just the content with structured metadata.

Spider output readability: true · metadata: true

{
  "url": "https://example.com/post",
  "status": 200,
  "metadata": {
    "title": "The Actual Title",
    "description": "A clear ...",
    "keywords": ["web", "dev"],
    "og_image": "https://..."
  },
  "content": "# The Actual Title\n\nThe content you actually wanted, in clean markdown.\nNo nav. No ads. No cookie banners.\n..."
}
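A minimal Python sketch of turning one response object of the shape above into a feed entry. The FeedItem class and to_feed_item helper are hypothetical names for illustration, not part of any Spider client, and the sample values are copied from the example output.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FeedItem:
    """One normalized entry in the aggregated feed (hypothetical type)."""
    url: str
    title: str
    description: str
    keywords: List[str] = field(default_factory=list)
    markdown: str = ""


def to_feed_item(page: dict) -> Optional[FeedItem]:
    """Map one response object to a FeedItem; return None for unusable pages."""
    # Skip failed fetches and pages where extraction produced no content.
    if page.get("status") != 200 or not page.get("content"):
        return None
    meta = page.get("metadata") or {}
    return FeedItem(
        url=page.get("url", ""),
        title=meta.get("title", ""),
        description=meta.get("description", ""),
        keywords=list(meta.get("keywords") or []),
        markdown=page["content"],
    )


# Sample object mirroring the example output above.
sample = {
    "url": "https://example.com/post",
    "status": 200,
    "metadata": {"title": "The Actual Title", "keywords": ["web", "dev"]},
    "content": "# The Actual Title\n\nThe content you actually wanted.",
}
item = to_feed_item(sample)
```

Because every source arrives in the same shape, one mapper like this can stand in for the per-site formatters.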

03 · Every source, same shape

React SPA, server HTML, REST API. One output.

Reuters wraps articles in a React app. Substack uses server-rendered HTML. Dev.to has an API with a different schema. Spider renders, extracts, and normalizes each into the same markdown plus metadata structure.

reuters.com React SPA

substack.com Server HTML

dev.to REST API

blog.cloudflare.com Static + JS

Unified markdown title + description + keywords + content

04 · What it replaces

Delete the per-site scraper directory.

Deleted 14 files

scrapers/
  reuters.ts
  medium.ts
  substack.ts
  techcrunch.ts
  devto.ts
  ... 4 more
formatters/
  normalize-html.ts
  rss-parser.ts
  metadata-extract.ts
puppeteer-pool.ts
selector-registry.json

Added 1 file

content-feed.ts

Import the client, pass URLs as a comma-separated list, get clean markdown. content-feed.ts calls spider.crawlUrl() with readability: true and metadata: true.
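For comparison, a rough Python equivalent of that one file. It assumes a client object exposing crawl_url(url, params=...), the call shape shown in the Python snippet on this page. The helper names build_feed_request and fetch_feed are hypothetical.

```python
from typing import Any, Dict, Iterable, Tuple


def build_feed_request(urls: Iterable[str]) -> Tuple[str, Dict[str, Any]]:
    """Join the source URLs into one comma-separated string and set the flags."""
    joined = ",".join(u.strip() for u in urls if u.strip())
    params = {
        "readability": True,          # article only, no nav or ads
        "metadata": True,             # title, description, keywords, ...
        "return_format": "markdown",
    }
    return joined, params


def fetch_feed(client: Any, urls: Iterable[str]) -> Any:
    """Send every source in a single request.

    `client` is assumed to expose crawl_url(url, params=...); pass in
    whatever client instance you have constructed with your API key.
    """
    joined, params = build_feed_request(urls)
    return client.crawl_url(joined, params=params)
```

Keeping the request builder separate from the network call makes the batching logic easy to test without an API key.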


05 · Who builds with this

Teams shipping content daily.

Newsrooms

Editorial collection

Wire services, local papers, trade publications, competitor blogs. Instead of 40 browser tabs every morning, your content pipeline delivers a unified feed. Editors spend their time writing and curating.

Newsletters

Curate at scale

Pull from your source list, extract key paragraphs, feed them into your template. Hours of tab-switching collapse into one API call.

Research

Track topics across the web

Regulatory updates, competitor announcements, academic pre-prints. Aggregate specialized sources into your analysis workflow or knowledge base.

AI products

Feed your RAG pipeline

Clean markdown with consistent metadata. Ready to chunk, embed, and retrieve. Keep your model grounded in current information rather than stale training data.
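One possible chunking step before embedding, sketched in Python. The heading-then-paragraph strategy and the 1200-character default are illustrative assumptions, not something the API prescribes; chunk_markdown is a hypothetical helper.

```python
import re
from typing import List


def chunk_markdown(markdown: str, max_chars: int = 1200) -> List[str]:
    """Split on markdown headings, then pack paragraphs into chunks.

    Chunks never cross a heading boundary. A single paragraph longer than
    max_chars is kept whole rather than cut mid-sentence.
    """
    # Zero-width split: each section starts at a line beginning with '#'.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks: List[str] = []
    for section in sections:
        current = ""
        for para in (p.strip() for p in section.split("\n\n")):
            if not para:
                continue
            # Start a new chunk when adding this paragraph would overflow.
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append(current)
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        if current:
            chunks.append(current)
    return chunks
```

Each chunk can then be embedded alongside the page's url and title so retrieved passages stay attributable to their source.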

06 · Under the hood

What ships with the API.

Rendering Core

Smart JavaScript rendering

The default mode, request: "smart", detects when a page needs JavaScript and falls back to full browser rendering only for those pages. For JS-heavy sources, set request: "browser" to force full browser rendering on every page.
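A small Python sketch of choosing the mode per source. The JS_HEAVY_HOSTS set and both helper names are hypothetical; only the request values "smart" and "browser" and the return_format parameter come from this page.

```python
from urllib.parse import urlparse

# Hypothetical list of sources you already know need a full browser.
JS_HEAVY_HOSTS = {"reuters.com"}


def request_mode(url: str) -> str:
    """Return "browser" for known JS-heavy hosts, otherwise the "smart" default."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return "browser" if host in JS_HEAVY_HOSTS else "smart"


def crawl_params(url: str) -> dict:
    """Build request parameters with the rendering mode chosen for this URL."""
    return {"request": request_mode(url), "return_format": "markdown"}
```

Leaving everything else on "smart" keeps full browser rendering limited to the sources that need it.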

Delivery Webhooks

Push, do not poll

Set up a webhook endpoint and Spider pushes content the moment it is ready. No cron jobs checking for updates on a loop.
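A minimal receiving endpoint, sketched with Python's standard library. The payload shape is an assumption: this sketch expects a JSON object, or a list of objects, carrying the url and content fields shown in the example output. Check the API docs for the actual webhook body before relying on it.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import List, Tuple


def extract_pages(body: bytes) -> List[Tuple[str, str]]:
    """Return (url, markdown) pairs from a webhook body (assumed shape)."""
    payload = json.loads(body)
    pages = payload if isinstance(payload, list) else [payload]
    return [
        (p.get("url", ""), p["content"])
        for p in pages
        if isinstance(p, dict) and p.get("content")
    ]


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        try:
            pages = extract_pages(self.rfile.read(length))
        except ValueError:  # malformed JSON
            self.send_response(400)
            self.end_headers()
            return
        for url, markdown in pages:
            # Replace with a write to your feed store or queue.
            print(f"received {len(markdown)} chars from {url}")
        self.send_response(204)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Block and handle incoming webhook deliveries on the given port."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Call serve() from your own entry point. In production you would also want to verify that each request really comes from the sender you expect.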

Metadata Extraction

Structured fields on every page

Enable metadata: true to get title, description, keywords, Open Graph image, domain, file size, and resource type on every page. Combine with return_headers: true for full HTTP response headers.

Pricing Pay-per-use

JS rendering included

Crawl 10 URLs or 10,000. JavaScript rendering and bot-protection bypass are included; bandwidth and compute scale with what you actually fetch.

07 · Resources

Keep reading.

Docs

One API replaces every custom scraper in the pipeline.

Use it alongside RSS or as a complete replacement when feeds are unavailable or truncated.

spider.crawl_url("https://...", params={"readability": True, "return_format": "markdown"})
