Data Preservation

Archive websites for the long term

Preserve website content for compliance, research, or historical records. Spider's incremental crawling captures only what changed since your last run, building comprehensive archives without redundant data or wasted resources.

Archive History: docs.example.com
  2026-02-18  v14  +3 pages changed
  2026-02-11  v13  +1 page changed
  2026-02-04  v12  +7 pages changed
  2026-01-28  v11  no changes
  2026-01-21  v10  +2 pages changed
  2026-01-14  v9   +12 pages changed
  2026-01-07  v8   initial crawl
7 snapshots · 234 pages total · incremental

Why It's Difficult

  • Websites change constantly without notice
  • Full re-crawls are wasteful and slow
  • Compliance requires proof of content at specific times
  • Storing complete page snapshots consumes significant storage
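Incremental crawling sidesteps the full re-crawl problem with content fingerprinting. A minimal sketch of the idea (not Spider's internal implementation): hash each page body and re-fetch only pages whose digest changed since the last snapshot.

```python
import hashlib


def fingerprint(body: bytes) -> str:
    """Content digest used to detect changes between crawls."""
    return hashlib.sha256(body).hexdigest()


def changed_pages(previous: dict, current: dict) -> list:
    """Return URLs whose content differs from the last snapshot.

    `previous` maps URL -> stored digest; `current` maps URL -> page body.
    New URLs count as changed; matching digests are skipped.
    """
    changed = []
    for url, body in current.items():
        if fingerprint(body) != previous.get(url):
            changed.append(url)
    return changed


# The last snapshot only needs digests, not full bodies.
snapshot = {"https://docs.example.com/a": fingerprint(b"<html>v1</html>")}
pages = {
    "https://docs.example.com/a": b"<html>v2</html>",  # edited since last run
    "https://docs.example.com/b": b"<html>new</html>",  # new page
}
print(changed_pages(snapshot, pages))
```

Storing digests instead of full bodies for the comparison step is what keeps per-run storage and bandwidth proportional to the delta.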

Features

Built for archiving

FULL CAPTURE Storage

Full Resource Capture

Store complete snapshots including HTML, images, stylesheets, and scripts. Every crawl captures the page exactly as it appeared.

METADATA Compliance

Metadata Preservation

Capture timestamps, canonical URLs, HTTP headers, and response codes for every archived page. Build a verifiable record.
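A verifiable record pairs the captured metadata with a digest of the stored body, so you can later prove the archive is unaltered. The sketch below illustrates the shape of such a record; the field names are illustrative, not the exact schema of Spider's metadata response.

```python
import hashlib
import json
from datetime import datetime, timezone


def archive_record(url: str, status: int, headers: dict, body: bytes) -> dict:
    """Build a per-page record: what was captured, when, and a
    digest that proves the stored body matches what was fetched."""
    return {
        "url": url,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "status": status,
        "headers": headers,
        "sha256": hashlib.sha256(body).hexdigest(),
    }


record = archive_record(
    "https://docs.example.com/intro",
    200,
    {"content-type": "text/html"},
    b"<html>...</html>",
)
print(json.dumps(record, indent=2))
```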

FORMATS Flexible

Multiple Output Formats

Store as raw HTML, clean markdown, or plain text. Choose the format that fits your archive system and downstream workflows.

WEBHOOKS Streaming

Webhook Delivery

Push archived content directly to your storage systems as pages are crawled. S3, GCS, or your own endpoints.
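On the receiving end, a webhook endpoint just needs to accept each page payload and persist it. A minimal stdlib sketch, assuming a JSON payload with a `url` field (check the actual webhook schema before relying on these field names):

```python
import json
import pathlib
from http.server import BaseHTTPRequestHandler, HTTPServer

ARCHIVE_DIR = pathlib.Path("archive")


def save_page(payload: dict) -> pathlib.Path:
    """Persist one webhook payload to disk, keyed by its URL.
    The payload shape here is an assumption, not Spider's schema."""
    name = payload["url"].split("://", 1)[1].replace("/", "_")
    ARCHIVE_DIR.mkdir(exist_ok=True)
    path = ARCHIVE_DIR / f"{name}.json"
    path.write_text(json.dumps(payload))
    return path


class ArchiveHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        save_page(json.loads(self.rfile.read(length)))
        self.send_response(204)
        self.end_headers()


# To run the receiver:
# HTTPServer(("", 8080), ArchiveHandler).serve_forever()
```

In production you would swap the local write for an S3 or GCS upload, but the handler shape stays the same.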

Quick Start

Set up incremental archiving

Schedule weekly crawls of your target site. Spider fingerprints every page and only re-fetches content that changed since the last snapshot. Pair with webhooks to push new versions directly into your archive storage.

Python
# Incremental archive crawl
import spider

client = spider.Spider()

# First run crawls everything.
# Subsequent runs only fetch changed pages.
result = client.crawl(
    "https://docs.example.com",
    params={
        "limit": 500,
        "metadata": True,
        "return_format": "html,markdown",
        "store_data": True,
    },
)

# Changed pages only
for page in result:
    print(page["url"], page["status"])

Configuration

Fine-tune your archive scope

DEPTH Config

Depth Control

Configure how deep to crawl. Archive specific sections, subsites, or entire domains. Set page limits per run to control costs.
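As a sketch, a scoped run might look like the configuration below. `limit`, `metadata`, and `store_data` mirror the Quick Start example; `depth` is assumed here and should be checked against the current API reference.

```python
# Scope the crawl to one section and cap the page budget.
# `depth` is an assumption; verify the parameter name in the API docs.
archive_params = {
    "limit": 200,      # hard cap on pages per run, for cost control
    "depth": 3,        # how many links deep to follow
    "metadata": True,
    "store_data": True,
}

# Archiving a subsection rather than the whole domain:
target = "https://docs.example.com/guides"
print(target, archive_params["limit"])
```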

SCHEDULING Cron

Scheduled Crawls

Set up recurring crawls on a daily, weekly, or monthly schedule. Each run automatically detects what changed and only fetches the delta.
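Spider can run the schedule for you, but if you trigger runs from your own infrastructure, a small helper like this (a generic sketch, not part of the Spider client) computes the next weekly crawl window:

```python
from datetime import datetime, timedelta


def next_weekly_run(now: datetime, weekday: int, hour: int) -> datetime:
    """Next occurrence of `weekday` (Mon=0) at `hour`:00 strictly after `now`."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


# From a Wednesday afternoon, the next Monday 03:00 crawl window:
print(next_weekly_run(datetime(2026, 2, 18, 15, 30), weekday=0, hour=3))
```

The equivalent cron expression for a weekly Monday 03:00 run would be `0 3 * * 1`.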

STORAGE S3

Direct-to-Storage

Send archived pages to S3, GCS, or your own webhook endpoint as they are crawled. No intermediate storage needed.
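On the storage side, each archived page needs a stable, versioned object key. A hedged sketch, assuming a payload with `url` and `html` fields and a bucket named `my-archive` (both hypothetical); the boto3 upload itself is shown commented out since it requires AWS credentials:

```python
import hashlib


def object_for(page: dict, version: int) -> tuple:
    """Build an S3 object key, body, and digest for an archived page.
    The payload field names ("url", "html") are assumptions."""
    key = f"archive/v{version}/{page['url'].split('://', 1)[1]}"
    body = page["html"].encode()
    return key, body, hashlib.sha256(body).hexdigest()


key, body, digest = object_for(
    {"url": "https://docs.example.com/intro", "html": "<html>...</html>"}, 14
)
print(key)

# Upload with boto3 (requires AWS credentials):
# import boto3
# boto3.client("s3").put_object(
#     Bucket="my-archive",
#     Key=key,
#     Body=body,
#     Metadata={"sha256": digest},
# )
```

Keying objects by snapshot version keeps every historical copy addressable, which is what a point-in-time compliance record requires.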

Resources

Keep exploring

Ready to preserve your websites?

Start building a comprehensive web archive today. Incremental crawling keeps costs low and your records complete.