Batch your sources, get clean markdown back.
Pass a list of URLs to the crawl endpoint. Spider handles JavaScript rendering, strips navigation and ads with readability extraction, and returns the same markdown shape every time, regardless of source.
- NEWS, 2m ago: EU passes landmark AI regulation framework
- BLOG, 8m ago: Building vector search from scratch in Rust
- DOCS, 12m ago: Cloudflare Workers: new streaming API
- NEWS, 15m ago: OpenAI open-sources reasoning model weights
One API replaces every per-site scraper.
Every custom scraper is a ticking clock. Sites redesign. Paywalls change their cookie flow. JavaScript frameworks swap out the DOM. You find out when the pipeline goes silent.
Batch URLs into a single request. No selectors to maintain, no rendering to manage, no format glue code.
Article only. No nav, ads, or banners.
Most of a typical web page is navigation, ads, and layout. Spider isolates the article and returns just the content with structured metadata.
{
  "url": "https://example.com/post",
  "status": 200,
  "metadata": {
    "title": "The Actual Title",
    "description": "A clear ...",
    "keywords": ["web", "dev"],
    "og_image": "https://..."
  },
  "content": "# The Actual Title\n\nThe content you actually wanted, in clean markdown.\nNo nav. No ads. No cookie banners.\n..."
}
React SPA, server HTML, REST API. One output.
Reuters wraps articles in a React app. Substack uses server-rendered HTML. Dev.to has an API with a different schema. Spider renders, extracts, and normalizes each into the same markdown plus metadata structure.
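Because every source comes back in the same structure, downstream code can treat all pages alike. The sketch below types the sample response shown earlier and turns a batch of results into feed entries. The names PageResult, isPageResult, FeedItem, and toFeedItems are hypothetical helpers for illustration, and only the fields visible in the sample are assumed.

```typescript
// Shape of one page result, mirroring the sample response on this page.
// Only fields visible in that sample are modeled here.
interface PageMetadata {
  title?: string;
  description?: string;
  keywords?: string[];
  og_image?: string;
}

interface PageResult {
  url: string;
  status: number;
  metadata?: PageMetadata;
  content: string;
}

// Hypothetical helper: narrow an unknown JSON value to PageResult.
function isPageResult(value: unknown): value is PageResult {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.url === "string" &&
    typeof v.status === "number" &&
    typeof v.content === "string"
  );
}

// Hypothetical feed entry built from one page.
interface FeedItem {
  url: string;
  title: string;
  markdown: string;
}

// Keep pages that returned 200 with non-empty content, whatever the source site.
function toFeedItems(results: unknown[]): FeedItem[] {
  return results
    .filter(isPageResult)
    .filter((page) => page.status === 200 && page.content.trim().length > 0)
    .map((page) => ({
      url: page.url,
      // Fall back to the URL when a page has no title metadata.
      title: page.metadata?.title ?? page.url,
      markdown: page.content,
    }));
}
```

The same function handles a Reuters page and a Substack page, since neither needs source-specific handling once the shape is normalized.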
Delete the per-site scraper directory.
scrapers/ reuters.ts medium.ts substack.ts techcrunch.ts devto.ts ... 4 more
formatters/ normalize-html.ts rss-parser.ts metadata-extract.ts puppeteer-pool.ts selector-registry.json
content-feed.ts
content-feed.ts imports the client, passes URLs as a comma-separated list, and gets clean markdown back by calling spider.crawlUrl() with readability: true and metadata: true.
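A minimal sketch of what such a content-feed.ts might contain follows. It is written against a small CrawlClient interface rather than the real client, so the crawlUrl(url, params) signature and the array return value are assumptions. The return_format parameter is also an assumption that this page does not mention. buildCrawlRequest and fetchFeed are hypothetical names.

```typescript
// Assumed minimal surface of the client: the page names spider.crawlUrl(),
// but this exact signature and return type are assumptions.
interface CrawlClient {
  crawlUrl(url: string, params: Record<string, unknown>): Promise<unknown>;
}

interface CrawlRequest {
  url: string;
  params: Record<string, unknown>;
}

// Build one batched request from a list of source URLs.
function buildCrawlRequest(sources: string[]): CrawlRequest {
  // Trim, then drop blanks and duplicates so each source is fetched once.
  const unique = sources
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
    .filter((s, i, all) => all.indexOf(s) === i);
  if (unique.length === 0) {
    throw new Error("at least one URL is required");
  }
  return {
    // URLs are passed as a comma-separated list, as described on this page.
    url: unique.join(","),
    params: {
      readability: true,
      metadata: true,
      return_format: "markdown", // assumption: not named on this page
    },
  };
}

// Send the batch through any client exposing crawlUrl().
async function fetchFeed(client: CrawlClient, sources: string[]): Promise<unknown[]> {
  const { url, params } = buildCrawlRequest(sources);
  const result = await client.crawlUrl(url, params);
  // Assumption: a successful call resolves to an array of page objects.
  return Array.isArray(result) ? result : [];
}
```

Keeping the request builder separate from the network call makes the batching logic easy to unit test without an API key.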
Teams shipping content daily.
Editorial collection
Wire services, local papers, trade publications, competitor blogs. Instead of 40 browser tabs every morning, your content pipeline delivers a unified feed. Editors spend their time writing and curating.
Curate at scale
Pull from your source list, extract key paragraphs, feed them into your template. Hours of tab-switching collapse into one API call.
Track topics across the web
Regulatory updates, competitor announcements, academic pre-prints. Aggregate specialized sources into your analysis workflow or knowledge base.
Feed your RAG pipeline
Clean markdown with consistent metadata. Ready to chunk, embed, and retrieve. Keep model answers grounded in current information rather than stale training data.
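As an illustration of the chunking step, the sketch below splits returned markdown at heading lines and caps chunk size at paragraph breaks. chunkMarkdown and the 1000-character default are hypothetical choices, not part of the API, and embedding and retrieval are left to whatever stack you already use.

```typescript
interface Chunk {
  heading: string; // nearest heading above the chunk, "" if none
  text: string;
}

// Split markdown into chunks at heading lines; within a section, start a new
// chunk at the first blank line once the buffer has reached maxChars.
// Limitation: "#" lines inside fenced code blocks are treated as headings.
function chunkMarkdown(markdown: string, maxChars = 1000): Chunk[] {
  const chunks: Chunk[] = [];
  let heading = "";
  let buffer: string[] = [];

  const flush = (): void => {
    const text = buffer.join("\n").trim();
    if (text.length > 0) chunks.push({ heading, text });
    buffer = [];
  };

  for (const line of markdown.split("\n")) {
    const match = /^#{1,6}\s+(.*)$/.exec(line);
    if (match) {
      flush();
      heading = match[1].trim();
      continue;
    }
    if (line.trim() === "" && buffer.join("\n").length >= maxChars) {
      flush();
      continue;
    }
    buffer.push(line);
  }
  flush();
  return chunks;
}
```

Carrying the heading with each chunk gives the retriever some context even when a section is split into several pieces.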
What ships with the API.
Smart JavaScript rendering
The default mode, request: "smart", detects when a page needs JavaScript and falls back to Chrome rendering. For JS-heavy sources, set request: "chrome" to force full browser rendering on every page.
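One way to apply this per source is to keep a short list of hosts that always need a browser and leave everything else on the default. The sketch below assumes such a list. JS_HEAVY_HOSTS, requestModeFor, and groupByMode are hypothetical names, and reuters.com appears only because this page describes Reuters as a React app.

```typescript
type RequestMode = "smart" | "chrome";

// Hypothetical list of hosts known to need full browser rendering.
const JS_HEAVY_HOSTS: string[] = ["reuters.com"];

// Pick the request mode for one URL: "chrome" for listed hosts and their
// subdomains, "smart" for everything else.
function requestModeFor(url: string, jsHeavyHosts: string[] = JS_HEAVY_HOSTS): RequestMode {
  const host = new URL(url).hostname.toLowerCase();
  const forced = jsHeavyHosts.some((h) => host === h || host.endsWith("." + h));
  return forced ? "chrome" : "smart";
}

// Group a source list so each group can be sent as its own batched request.
function groupByMode(urls: string[]): Record<RequestMode, string[]> {
  const groups: Record<RequestMode, string[]> = { smart: [], chrome: [] };
  for (const url of urls) {
    groups[requestModeFor(url)].push(url);
  }
  return groups;
}
```

Splitting the batch this way reserves forced browser rendering for the sources that need it.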
Push, do not poll
Set up a webhook endpoint and Spider pushes content the moment it is ready. No cron jobs checking for updates on a loop.
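The receiving side can be a small handler that validates the pushed body and hands each page to your pipeline. The sketch below is framework-agnostic and rests on an assumption: that the webhook body is JSON containing one page object or an array of them, each with url and content fields as in the sample response. The actual payload format is not described here, so check it before relying on this. handleWebhook and PageSink are hypothetical names.

```typescript
interface PushedPage {
  url: string;
  content: string;
}

// Callback that receives each accepted page (for example, a queue writer).
type PageSink = (page: PushedPage) => void;

// Process one webhook request body and return the HTTP status to respond with.
// Assumption: the body is JSON holding one page object or an array of them.
function handleWebhook(rawBody: string, sink: PageSink): number {
  let payload: unknown;
  try {
    payload = JSON.parse(rawBody);
  } catch (err) {
    return 400; // body was not valid JSON
  }

  const candidates: unknown[] = Array.isArray(payload) ? payload : [payload];
  let accepted = 0;
  for (const item of candidates) {
    if (typeof item !== "object" || item === null) continue;
    const page = item as Record<string, unknown>;
    if (typeof page.url === "string" && typeof page.content === "string") {
      sink({ url: page.url, content: page.content });
      accepted += 1;
    }
  }
  // 422 signals valid JSON that contained no usable page.
  return accepted > 0 ? 200 : 422;
}
```

Wire the function into whatever HTTP server you run by passing it the raw request body and replying with the returned status code.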
Structured fields on every page
Enable metadata: true to get title, description, keywords, Open Graph image, domain, file size, and resource type on every page. Combine with return_headers: true for full HTTP response headers.
JS rendering included
Crawl 10 URLs or 10,000. JavaScript rendering and bot-protection bypass are included; bandwidth and compute scale with what you actually fetch.
Keep reading.
One API replaces every custom scraper in the pipeline.
Use it alongside RSS or as a complete replacement when feeds are unavailable or truncated.
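When running alongside RSS, a simple rule can decide which feed items need a crawl at all. The sketch below flags items whose feed body is missing, short, or ends in an ellipsis. RssItem, urlsNeedingCrawl, and the 500-character threshold are hypothetical, and the truncation test is a heuristic rather than anything the API provides.

```typescript
// Hypothetical minimal view of a parsed feed item.
interface RssItem {
  link: string;
  content?: string;
}

// Return the links whose feed content looks missing or truncated, so only
// those are sent to the crawl endpoint.
function urlsNeedingCrawl(items: RssItem[], minChars = 500): string[] {
  return items
    .filter((item) => {
      const body = (item.content ?? "").trim();
      // Heuristic: a trailing "..." or ellipsis character, optionally inside brackets.
      const truncated = /(\.\.\.|\u2026)\]?$/.test(body);
      return body.length < minChars || truncated;
    })
    .map((item) => item.link);
}
```

Items that already carry full text stay on the RSS path, and the rest are batched into a single crawl request.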