Archive websites
for the long term
Preserve website content for compliance, research, or historical records. Spider's incremental crawling captures only what changed since your last run, building comprehensive archives without redundant data or wasted resources.
Why It's Difficult
- Websites change constantly without notice
- Full re-crawls are wasteful and slow
- Compliance requires proof of content at specific times
- Complete page snapshots consume significant storage
What Spider Handles
- Incremental crawling: only fetch what changed
- Full resource capture (HTML, images, CSS)
- Timestamps for compliance verification
- Webhook delivery for your storage systems
Features
Built for archiving
Incremental Crawling
Only fetch pages that changed since your last crawl. Spider compares page fingerprints across runs so you never re-download unchanged content. This saves bandwidth, time, and credits while building a complete version history of every page on the site.
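Conceptually, change detection works by comparing content fingerprints between runs. A minimal sketch of the idea (illustrative only; Spider manages fingerprints server-side, and these helper names are not part of its API):

```python
import hashlib

def fingerprint(html: str) -> str:
    """Stable fingerprint of a page's content."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def changed_pages(pages: dict[str, str], cache: dict[str, str]) -> list[str]:
    """Return URLs whose content differs from the cached fingerprint."""
    changed = []
    for url, html in pages.items():
        fp = fingerprint(html)
        if cache.get(url) != fp:
            changed.append(url)
            cache[url] = fp  # update the cache so the next run sees no change
    return changed
```

A page is re-fetched only when its fingerprint differs from the previous run, which is what keeps incremental archives cheap.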
Full Resource Capture
Store complete snapshots including HTML, images, stylesheets, and scripts. Every crawl captures the page exactly as it appeared.
Metadata Preservation
Capture timestamps, canonical URLs, HTTP headers, and response codes for every archived page. Build a verifiable record.
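A verifiable record pairs the page body with its capture metadata. A sketch of what one archive entry might look like (field names are illustrative, not Spider's exact schema):

```python
import hashlib
from datetime import datetime, timezone

def archive_record(url: str, html: str, status: int, headers: dict) -> dict:
    """Build a verifiable archive entry: capture metadata plus a content hash."""
    return {
        "url": url,
        "status": status,
        "headers": headers,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(html.encode("utf-8")).hexdigest(),
        "content": html,
    }
```

The content hash lets you later prove that a stored snapshot has not been altered since capture.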
Multiple Output Formats
Store as raw HTML, clean markdown, or plain text. Choose the format that fits your archive system and downstream workflows.
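In practice this is the return_format parameter on the crawl call; for example, requesting both a raw snapshot and a searchable copy in one run:

```python
# Comma-separate formats to receive each page in every listed format.
params = {
    "return_format": "html,markdown",  # raw snapshot + searchable copy
    "metadata": True,                  # keep timestamps and headers alongside
}
```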
Webhook Delivery
Push archived content directly to your storage systems as pages are crawled. S3, GCS, or your own endpoints.
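A minimal receiving endpoint can be sketched with the standard library; the payload fields (url, content) are assumptions for illustration, not Spider's documented webhook schema:

```python
import json
from http.server import BaseHTTPRequestHandler

ARCHIVE = {}  # url -> content; stand-in for real archive storage

class ArchiveHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        page = json.loads(self.rfile.read(length))
        # Assumed payload fields: "url" and "content".
        ARCHIVE[page["url"]] = page["content"]
        self.send_response(200)
        self.end_headers()
```

In production the handler would write to durable storage instead of an in-memory dict.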
Quick Start
Set up incremental archiving
Schedule weekly crawls of your target site. Spider fingerprints every page and only re-fetches content that changed since the last snapshot. Pair with webhooks to push new versions directly into your archive storage.
import spider

client = spider.Spider()

# First run crawls everything;
# subsequent runs only fetch changed pages.
result = client.crawl(
    "https://docs.example.com",
    params={
        "limit": 500,
        "metadata": True,
        "return_format": "html,markdown",
        "store_data": True,
    },
)

# Changed pages only
for page in result:
    print(page["url"], page["status"])
Configuration
Fine-tune your archive scope
Depth Control
Configure how deep to crawl. Archive specific sections, subsites, or entire domains. Set page limits per run to control costs.
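Scope can also be reasoned about client-side. A hypothetical filter that keeps URLs within a section and below a depth limit (illustrative only; Spider applies its own depth and limit settings server-side):

```python
from urllib.parse import urlparse

def in_scope(url: str, root: str, max_depth: int) -> bool:
    """Keep a URL if it lives under `root` and its path is at most
    max_depth segments deep."""
    if not url.startswith(root):
        return False
    path = urlparse(url).path.strip("/")
    depth = len(path.split("/")) if path else 0
    return depth <= max_depth
```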
Scheduled Crawls
Set up recurring crawls on a daily, weekly, or monthly schedule. Each run automatically detects what changed and only fetches the delta.
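If you drive Spider from your own scripts rather than a managed schedule, a recurring run reduces to a simple loop (a naive sketch; a cron job is the usual production choice):

```python
import time

def run_on_schedule(crawl_once, interval_seconds: float, runs: int) -> int:
    """Invoke the crawl, wait, repeat. Returns the number of completed runs."""
    done = 0
    for i in range(runs):
        crawl_once()
        done += 1
        if i < runs - 1:
            time.sleep(interval_seconds)
    return done
```

Each invocation of crawl_once would be the incremental crawl from the Quick Start, so every scheduled run fetches only the delta.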
Direct-to-Storage
Send archived pages to S3, GCS, or your own webhook endpoint as they are crawled. No intermediate storage needed.
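One practical detail is choosing object keys so every snapshot stays addressable. A sketch using a date prefix plus a URL hash (the bucket name and the boto3 call are assumptions, not part of Spider's API):

```python
import hashlib
from datetime import datetime

def archive_key(url: str, captured_at: datetime) -> str:
    """Deterministic object key: date prefix + URL hash, so each
    snapshot of a page gets its own versioned path."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return f"{captured_at:%Y/%m/%d}/{digest}.html"

# Upload sketch (assumes boto3 and a bucket you own):
# import boto3
# s3 = boto3.client("s3")
# s3.put_object(Bucket="my-archive", Key=archive_key(url, now), Body=html)
```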
Resources