Preserve websites before they change.
Regulatory pages update. Terms of service disappear. Spider builds incremental archives that track what changed between crawls so every version is accounted for.
- 2026-02-18 v14 +3 pages changed
- 2026-02-11 v13 +1 page changed
- 2026-02-04 v12 +7 pages changed
- 2026-01-28 v11 no changes
- 2026-01-21 v10 +2 pages changed
- 2026-01-14 v9 +12 pages changed
- 2026-01-07 v8 initial crawl
Only fetch what changed.
Spider stores crawl data and compares it against the live site on each subsequent run. Pages that have not changed are skipped. A documentation portal that updates a few pages per week will see large bandwidth savings; a news site with daily churn will see less. Either way, you only pay for what actually changed between snapshots.
Only changed pages are re-fetched on subsequent runs.
Crawl data is persisted between runs for comparison.
Full version history maintained across every run.
Capture, store, and deliver every version.
Full resource capture
Store complete snapshots including HTML, images, stylesheets, and scripts. Every crawl captures the page as it appeared in the browser, not a stripped-down approximation.
Compliance metadata
Every archived page includes canonical URL, HTTP status, and request duration. Enable return_headers to capture full response headers. Add your own capture timestamps at ingest for a complete audit trail.
Multiple output formats
Store as raw HTML, clean markdown, or plain text. Request multiple formats in a single crawl with return_format: "raw,markdown". Pick the format that fits your archive system.
Direct-to-storage delivery
Stream archived pages to Amazon S3, Google Cloud Storage, Azure Blob, or Supabase via data connectors as they are crawled. Webhooks fire on page discovery for custom integrations.
Scope control
Archive specific sections, subsites, or full domains. Set page limits per run and configure crawl depth to match your retention policy and budget.
Scheduled crawls
Set up daily, weekly, or monthly archive runs. Each run detects changes and only fetches the delta. Costs stay predictable as the archive grows over months and years.
Schedule weekly. Pay for the delta.
Spider detects which pages changed since the last snapshot and only re-fetches those. Pair with data connectors or webhooks to push new versions directly into archive storage.
# Incremental archive crawl
import spider
client = spider.Spider()
# First run crawls everything.
# Subsequent runs only fetch changed pages.
result = client.crawl_url(
"https://docs.example.com",
params={
"limit": 500,
"metadata": True,
"return_format": "raw,markdown",
"return_headers": True,
"store_data": True,
},
)
# Each page includes url, status, content
for page in result:
print(page["url"], page["status"])Keep reading.
Start building your web archive.
Incremental crawling keeps costs predictable and your records complete. No infrastructure to manage.