Content & Publishing
Batch your sources,
zero format wrangling
You need full articles from dozens of sites in a consistent format. Not truncated RSS. Not raw HTML. Pass a comma-separated list of URLs to Spider's crawl endpoint. It handles JavaScript rendering when needed, strips the noise with readability extraction, and hands you clean markdown. Same shape every time, regardless of where it came from.
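A minimal sketch of that batched request, assuming the field names suggested by this page (`url` as a comma-separated list, `readability`, and a markdown return format); the helper name and exact payload shape are illustrative, not the client's documented API:

```typescript
// Sketch: batch several article URLs into one Spider crawl payload.
// Field names follow this page's description; treat the shape as illustrative.
type CrawlPayload = {
  url: string;            // comma-separated list of URLs
  return_format: string;  // ask for clean markdown back
  readability: boolean;   // strip nav, ads, and layout
};

function buildCrawlPayload(urls: string[]): CrawlPayload {
  return {
    url: urls.join(","),
    return_format: "markdown",
    readability: true,
  };
}

const payload = buildCrawlPayload([
  "https://example.com/post-1",
  "https://example.com/post-2",
]);
// payload.url is "https://example.com/post-1,https://example.com/post-2"
```

One payload, many sources: the join is the entire batching step.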
Your scrapers keep breaking
Every custom scraper is a ticking clock. Sites redesign. Paywalls change their cookie flow. JavaScript frameworks swap out the DOM. You find out at 2am when your pipeline goes silent.
One API replaces all of them. Batch your URLs into a single request. No selectors to maintain, no rendering to manage, no format glue code.
From noise to signal
Every web page is 90% navigation, ads, and layout. Spider isolates the article and returns just the content you need.
<nav class="site-header">
  <a href="/">Home</a>
  <a href="/about">About</a>
  ... 47 more links ...
</nav>
<div class="ad-banner">
  <script src="ads.js"></script>
</div>
<div class="cookie-popup">
  <button>Accept All</button>
</div>
<article>
  <h1>The Actual Title</h1>
  <p>The content you
  actually wanted...</p>
</article>
<div class="sidebar">
  <div class="related">...</div>
  <div class="newsletter">...</div>
</div>
<footer>
  ... 200 lines of footer ...
</footer>
</footer>
{
  "url": "https://example.com/post",
  "status": 200,
  "metadata": {
    "title": "The Actual Title",
    "description": "A clear ...",
    "keywords": ["web", "dev"],
    "og_image": "https://..."
  },
  "content": "# The Actual Title\nThe content you actually wanted, in clean markdown.\nNo nav. No ads. No cookie banners.\n..."
}
Every source, same shape
Reuters wraps articles in a React app. Substack uses server-rendered HTML. Dev.to has an API but it returns a different schema. Your newsletter tool expects one format. Spider makes that possible.
What your codebase loses (and gains)
Import the client, pass your URLs as a comma-separated list, get clean markdown. The entire aggregation layer is one function that calls spider.crawlUrl() with readability: true and metadata: true.
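Here is a sketch of that one-function aggregation layer. The client interface mirrors the crawlUrl call named above, but the actual client package and response shape are assumptions based on this page's response example, not a documented contract:

```typescript
// Sketch: the entire aggregation layer as one function.
// SpiderClient is a hypothetical interface modeled on spider.crawlUrl();
// the response element shape follows the JSON example shown earlier.
interface SpiderClient {
  crawlUrl(
    url: string,
    params: Record<string, unknown>
  ): Promise<Array<{ url: string; content: string; metadata: { title: string } }>>;
}

async function aggregate(client: SpiderClient, urls: string[]) {
  // One batched call: comma-separated URLs, readability extraction,
  // and structured metadata on every page.
  return client.crawlUrl(urls.join(","), {
    return_format: "markdown",
    readability: true,
    metadata: true,
  });
}
```

Swapping sources means editing the URL list, not the function.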
Built for teams that ship content daily
Your editorial team reads the news. Your API collects it.
Wire services, local papers, trade publications, competitor blogs. Instead of 40 browser tabs open every morning, your content pipeline delivers a unified feed. Editors spend their time writing and curating, not copying and pasting.
Curate at scale
Pull from your source list, extract the key paragraphs, feed them into your template. What used to take 2 hours of tab-switching becomes one API call.
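The "feed them into your template" step can be as small as a map over the pages returned. A sketch, assuming the page shape from the response example above; the digest template itself is illustrative:

```typescript
// Sketch: render crawled pages into a newsletter-style digest.
// Page mirrors the JSON response example shown earlier on this page.
type Page = { url: string; content: string; metadata: { title: string } };

function digest(pages: Page[]): string {
  return pages
    .map((p) => {
      const firstParagraph = p.content.split("\n")[0]; // lead line as the teaser
      return `## ${p.metadata.title}\n${firstParagraph}\n${p.url}`;
    })
    .join("\n\n");
}
```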
Track topics across the web
Regulatory updates, competitor announcements, academic pre-prints. Aggregate specialized sources into your analysis workflow or knowledge base.
Feed your RAG pipeline
Clean markdown with consistent metadata. Ready to chunk, embed, and retrieve. Keep your AI grounded in current information, not stale training data.
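The "ready to chunk" claim in practice: a minimal sliding-window chunker over the markdown Spider returns. The window size and overlap here are arbitrary illustrative numbers, and real pipelines often split on headings or sentences instead:

```typescript
// Sketch: fixed-size chunking with overlap, ready for an embedding step.
// size and overlap are illustrative defaults, not recommendations.
function chunkMarkdown(text: string, size = 800, overlap = 100): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

Because every source arrives as the same clean markdown, one chunker covers the whole feed.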
Under the hood
Smart JavaScript rendering
The default request: "smart" mode detects when a page needs JavaScript and automatically falls back to Chrome rendering. For JS-heavy sources, set request: "chrome" to force full browser rendering on every page.
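One way to apply that per source, sketched under the assumption that you keep a list of domains known to need full rendering (the domain list and helper are hypothetical; the "smart" and "chrome" mode names come from this section):

```typescript
// Sketch: pick the rendering mode per source.
// JS_HEAVY is an illustrative allowlist, not shipped configuration.
const JS_HEAVY = new Set(["reuters.com"]);

function renderMode(url: string): "smart" | "chrome" {
  const host = new URL(url).hostname.replace(/^www\./, "");
  // Force Chrome for known JS-heavy domains; let "smart" decide elsewhere.
  return JS_HEAVY.has(host) ? "chrome" : "smart";
}
```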
Push, do not poll
Set up a webhook endpoint and Spider pushes content to your app the moment it is ready. No cron jobs checking for updates on a loop.
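The receiving side of that push reduces to validating one JSON body. A sketch of the parsing step only; the pushed payload shape is an assumption modeled on the response example earlier on this page:

```typescript
// Sketch: validate a pushed page before handing it to the pipeline.
// PushedPage mirrors the JSON response example shown earlier; the real
// webhook body may carry more fields.
type PushedPage = { url: string; status: number; content: string };

function parseWebhookBody(body: string): PushedPage | null {
  try {
    const data = JSON.parse(body);
    if (typeof data.url === "string" && typeof data.content === "string") {
      return { url: data.url, status: data.status ?? 200, content: data.content };
    }
  } catch {
    // malformed JSON: fall through and reject
  }
  return null;
}
```

Reject-and-log beats crash-and-retry: a bad push never stalls the feed.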
Structured fields on every page
Enable metadata: true to get title, description, keywords, Open Graph image, domain, file size, and resource type on every page. Combine with return_headers: true for full HTTP response headers.
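Combined with the flags above, a request body might look like the following. Field names mirror this section's prose; the overall shape is illustrative:

```typescript
// Sketch: a request body combining the structured-fields flags.
const params = {
  url: "https://example.com/post",
  metadata: true,        // title, description, keywords, og_image, ...
  return_headers: true,  // full HTTP response headers per page
};
```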
No per-page surcharges
Crawl 10 URLs or 10,000. No credit multipliers for JavaScript rendering or "premium" domains. Costs stay predictable as your source list grows.