
Compliance & Preservation

Preserve websites before they change

Regulatory pages get updated. Terms of service disappear. Evidence goes offline. If you did not capture it when it was live, it is gone. Spider builds incremental archives that track what changed between crawls, so every version is accounted for and nothing slips through the cracks.

Archive History — docs.example.com
2026-02-18  v14  +3 pages changed
2026-02-11  v13  +1 page changed
2026-02-04  v12  +7 pages changed
2026-01-28  v11  no changes
2026-01-21  v10  +2 pages changed
2026-01-14  v9   +12 pages changed
2026-01-07  v8   initial crawl
7 snapshots · 234 pages total · incremental
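The per-version deltas above come from comparing each crawl against the previous snapshot. A minimal sketch of that comparison using content hashes (the function and the `{url: html}` snapshot shape are illustrative, not Spider's internals):

```python
import hashlib

def snapshot_delta(previous, current):
    """Compare two snapshots ({url: html}) and report what changed."""
    def digest(html):
        return hashlib.sha256(html.encode("utf-8")).hexdigest()

    prev_hashes = {url: digest(html) for url, html in previous.items()}
    # A page counts as changed if it is new or its content hash differs
    changed = [
        url for url, html in current.items()
        if prev_hashes.get(url) != digest(html)
    ]
    removed = [url for url in previous if url not in current]
    return {"changed": changed, "removed": removed}

v8 = {"/intro": "<p>Welcome</p>", "/terms": "<p>v1 terms</p>"}
v9 = {"/intro": "<p>Welcome</p>", "/terms": "<p>v2 terms</p>", "/privacy": "<p>New</p>"}

delta = snapshot_delta(v8, v9)
print(delta["changed"])  # ['/terms', '/privacy']
```

Only the pages in `changed` need to be re-stored; unchanged pages are simply referenced from the prior version.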

How Spider handles archiving

CAPTURE Storage

Full resource capture

Store complete snapshots including HTML, images, stylesheets, and scripts. Every crawl captures the page exactly as it appeared in the browser, not a stripped-down approximation.

COMPLIANCE Legal

Compliance metadata

Every archived page includes the canonical URL, HTTP status code, and request duration. Enable return_headers to capture full response headers. Add your own capture timestamps at ingest time for a complete compliance trail.
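Adding your own capture timestamp at ingest time can be as simple as stamping each page record as it comes back from a crawl. A minimal sketch (the record shape here is a simplified stand-in for illustration, not the exact API response):

```python
from datetime import datetime, timezone

def stamp_page(page):
    """Attach an ISO 8601 UTC capture timestamp to an archived page record."""
    page = dict(page)  # copy so the original record is not mutated
    page["captured_at"] = datetime.now(timezone.utc).isoformat()
    return page

# Simulated page record with the metadata fields described above
raw = {"url": "https://docs.example.com/terms", "status": 200, "duration": 312}
record = stamp_page(raw)
print(record["captured_at"])  # e.g. "2026-02-18T09:41:07.123456+00:00"
```

Stamping at ingest, rather than at review time, keeps the timestamp tied to when the content was actually captured.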

FORMATS Flexible

Multiple output formats

Store as raw HTML (raw), clean markdown (markdown), or plain text (text). Request multiple formats in a single crawl with return_format: "raw,markdown". Choose the format that fits your archive system and downstream review workflows.
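Since the combined value is a comma-separated string, it can help to validate it before sending a crawl. A small helper for that (the helper itself is illustrative, not part of the client library; the three format names come from the list above):

```python
VALID_FORMATS = {"raw", "markdown", "text"}

def parse_return_format(value):
    """Split a comma-separated return_format string and validate each entry."""
    formats = [f.strip() for f in value.split(",") if f.strip()]
    unknown = [f for f in formats if f not in VALID_FORMATS]
    if unknown:
        raise ValueError(f"unsupported formats: {unknown}")
    return formats

print(parse_return_format("raw,markdown"))  # ['raw', 'markdown']
```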

DELIVERY Streaming

Direct-to-storage delivery

Use data connectors to stream archived pages directly to Amazon S3, Google Cloud Storage, Azure Blob, or Supabase as they are crawled. Webhooks fire on page discovery for custom integrations. Pages land in your system the moment they are captured.
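For custom webhook integrations, each captured page needs a deterministic object key in the target bucket so successive versions stay organized. One possible key scheme, versioned by crawl date (the layout and function are illustrative; Spider's built-in connectors handle naming for you):

```python
from urllib.parse import urlparse

def archive_key(page_url, crawl_date, fmt="raw"):
    """Derive a bucket key like 'host/2026-02-18/path/index.raw'."""
    parsed = urlparse(page_url)
    path = parsed.path.strip("/") or "index"
    return f"{parsed.netloc}/{crawl_date}/{path}/index.{fmt}"

key = archive_key("https://docs.example.com/legal/terms", "2026-02-18")
print(key)  # docs.example.com/2026-02-18/legal/terms/index.raw
# A webhook handler would then upload with this key,
# e.g. boto3's s3.put_object(Bucket=..., Key=key, Body=html)
```

Keying by date keeps every version addressable, which matters when an archive has to prove what a page said on a given day.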

SCOPE Config

Scope control

Archive specific sections, subsites, or full domains. Set page limits per run and configure crawl depth to match your retention policy and budget.

SCHEDULE Cron

Scheduled crawls

Set up daily, weekly, or monthly archive runs. Each one automatically detects changes and only fetches the delta. Costs stay predictable as your archive grows over months and years.
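If you drive scheduled runs from your own cron job or worker instead of Spider's scheduler, the only state you need is the last run time. A minimal sketch of a weekly due-check (the helper is illustrative):

```python
from datetime import datetime, timedelta

def run_is_due(last_run, now, interval_days=7):
    """Return True when the next scheduled archive run should start."""
    return now - last_run >= timedelta(days=interval_days)

last = datetime(2026, 2, 11, 3, 0)
print(run_is_due(last, datetime(2026, 2, 18, 3, 0)))  # True: a full week has passed
print(run_is_due(last, datetime(2026, 2, 15, 3, 0)))  # False: only four days
```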

Set up incremental archiving

Schedule weekly crawls. Spider detects which pages changed since the last snapshot and only re-fetches those. Pair with data connectors or webhooks to push new versions directly into your archive storage.

Python Node.js Rust Go
# Incremental archive crawl
import spider

client = spider.Spider()

# First run crawls everything.
# Subsequent runs only fetch changed pages.
result = client.crawl_url(
    "https://docs.example.com",
    params={
        "limit": 500,
        "metadata": True,
        "return_format": "raw,markdown",
        "return_headers": True,
        "store_data": True,
    },
)

# Each page includes url, status, content
for page in result:
    print(page["url"], page["status"])

Start building your web archive today

Incremental crawling keeps costs predictable and your records complete. No infrastructure to manage.