Skip to main content gottem  — one API for every scraper.
Website archiving

Preserve websites before they change.

Regulatory pages update. Terms of service disappear. Spider builds incremental archives that track what changed between crawls so every version is accounted for.

Archive history docs.example.com
  • 2026-02-18 v14 +3 pages changed
  • 2026-02-11 v13 +1 page changed
  • 2026-02-04 v12 +7 pages changed
  • 2026-01-28 v11 no changes
  • 2026-01-21 v10 +2 pages changed
  • 2026-01-14 v9 +12 pages changed
  • 2026-01-07 v8 initial crawl
7 snapshots 234 pages total incremental
01 · Core value

Only fetch what changed.

Spider stores crawl data and compares it against the live site on each subsequent run. Pages that have not changed are skipped. A documentation portal that updates a few pages per week will see large bandwidth savings; a news site with daily churn will see less. Either way, you only pay for what actually changed between snapshots.

Delta

Only changed pages are re-fetched on subsequent runs.

Stored

Crawl data is persisted between runs for comparison.

Complete

Full version history maintained across every run.

02 · How Spider handles archiving

Capture, store, and deliver every version.

Capture Storage

Full resource capture

Store complete snapshots including HTML, images, stylesheets, and scripts. Every crawl captures the page as it appeared in the browser, not a stripped-down approximation.

Compliance Legal

Compliance metadata

Every archived page includes canonical URL, HTTP status, and request duration. Enable return_headers to capture full response headers. Add your own capture timestamps at ingest for a complete audit trail.

Formats Flexible

Multiple output formats

Store as raw HTML, clean markdown, or plain text. Request multiple formats in a single crawl with return_format: "raw,markdown". Pick the format that fits your archive system.

Delivery Streaming

Direct-to-storage delivery

Stream archived pages to Amazon S3, Google Cloud Storage, Azure Blob, or Supabase via data connectors as they are crawled. Webhooks fire on page discovery for custom integrations.

Scope Config

Scope control

Archive specific sections, subsites, or full domains. Set page limits per run and configure crawl depth to match your retention policy and budget.

Schedule Cron

Scheduled crawls

Set up daily, weekly, or monthly archive runs. Each run detects changes and only fetches the delta. Costs stay predictable as the archive grows over months and years.

03 · Incremental archiving

Schedule weekly. Pay for the delta.

Spider detects which pages changed since the last snapshot and only re-fetches those. Pair with data connectors or webhooks to push new versions directly into archive storage.

Python Node.js Rust Go
Archive crawl Python
# Incremental archive crawl
import spider

client = spider.Spider()

# First run crawls everything.
# Subsequent runs only fetch changed pages.
result = client.crawl_url(
    "https://docs.example.com",
    params={
        "limit": 500,
        "metadata": True,
        "return_format": "raw,markdown",
        "return_headers": True,
        "store_data": True,
    },
)

# Each page includes url, status, content
for page in result:
    print(page["url"], page["status"])
04 · Resources

Keep reading.

Start

Start building your web archive.

Incremental crawling keeps costs predictable and your records complete. No infrastructure to manage.

spider crawl --store-data --metadata