Archive websites
for the long term
Preserve website content for compliance, research, or historical records. Spider's incremental crawling captures only what changed since your last run, building comprehensive archives without redundant data or wasted resources.
Why It's Difficult
- Websites change constantly without notice
- Full re-crawls are wasteful and slow
- Compliance requires proof of content at specific times
- Complete page snapshots consume significant storage
What Spider Handles
- Incremental crawling: only fetch what changed
- Full resource capture (HTML, images, CSS)
- Timestamps for compliance verification
- Webhook delivery for your storage systems
Features
Built for archiving
Incremental Crawling
Only fetch pages that changed since your last crawl. Spider compares page fingerprints across runs so you never re-download unchanged content. This saves bandwidth, time, and credits while building a complete version history of every page on the site.
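Conceptually, change detection works by comparing content fingerprints between runs. A minimal sketch of the idea (illustrative only; Spider manages fingerprints server-side, and these helper names are not part of its API):

```python
import hashlib

def fingerprint(html: str) -> str:
    """Stable fingerprint of a page's content."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def changed_pages(pages: dict[str, str], cache: dict[str, str]) -> list[str]:
    """Return URLs whose content differs from the cached fingerprint."""
    changed = []
    for url, html in pages.items():
        fp = fingerprint(html)
        if cache.get(url) != fp:
            changed.append(url)
            cache[url] = fp  # update the cache so the next run sees no change
    return changed
```

A page is re-fetched only when its fingerprint differs from the previous run, which is what keeps incremental archives cheap.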
Full Resource Capture
Store complete snapshots including HTML, images, stylesheets, and scripts. Every crawl captures the page exactly as it appeared.
Metadata Preservation
Capture timestamps, canonical URLs, HTTP headers, and response codes for every archived page. Build a verifiable record.
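A verifiable record pairs the page body with its capture metadata. A sketch of what one archive entry might look like (field names are illustrative, not Spider's exact schema):

```python
import hashlib
from datetime import datetime, timezone

def archive_record(url: str, html: str, status: int, headers: dict) -> dict:
    """Build a verifiable archive entry: capture metadata plus a content hash."""
    return {
        "url": url,
        "status": status,
        "headers": headers,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(html.encode("utf-8")).hexdigest(),
        "content": html,
    }
```

The content hash lets you later prove that a stored snapshot has not been altered since capture.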
Multiple Output Formats
Store as raw HTML, clean markdown, or plain text. Choose the format that fits your archive system and downstream workflows.
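In practice this is the return_format parameter on the crawl call; for example, requesting both a raw snapshot and a searchable copy in one run:

```python
# Comma-separate formats to receive each page in every listed format.
params = {
    "return_format": "html,markdown",  # raw snapshot + searchable copy
    "metadata": True,                  # keep timestamps and headers alongside
}
```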
Webhook Delivery
Push archived content directly to your storage systems as pages are crawled. S3, GCS, or your own endpoints.
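A minimal receiving endpoint can be sketched with the standard library; the payload fields (url, content) are assumptions for illustration, not Spider's documented webhook schema:

```python
import json
from http.server import BaseHTTPRequestHandler

ARCHIVE = {}  # url -> content; stand-in for real archive storage

class ArchiveHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        page = json.loads(self.rfile.read(length))
        # Assumed payload fields: "url" and "content".
        ARCHIVE[page["url"]] = page["content"]
        self.send_response(200)
        self.end_headers()
```

In production the handler would write to durable storage instead of an in-memory dict.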
Quick Start
Set up incremental archiving
Schedule weekly crawls of your target site. Spider fingerprints every page and only re-fetches content that changed since the last snapshot. Pair with webhooks to push new versions directly into your archive storage.
import spider

client = spider.Spider()

# First run crawls everything;
# subsequent runs only fetch changed pages.
result = client.crawl(
    "https://docs.example.com",
    params={
        "limit": 500,
        "metadata": True,
        "return_format": "html,markdown",
        "store_data": True,
    },
)

# Changed pages only
for page in result:
    print(page["url"], page["status"])
Configuration
Fine-tune your archive scope
Depth Control
Configure how deep to crawl. Archive specific sections, subsites, or entire domains. Set page limits per run to control costs.
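Scope can also be reasoned about client-side. A hypothetical filter that keeps URLs within a section and below a depth limit (illustrative only; Spider applies its own depth and limit settings server-side):

```python
from urllib.parse import urlparse

def in_scope(url: str, root: str, max_depth: int) -> bool:
    """Keep a URL if it lives under `root` and its path is at most
    max_depth segments deep."""
    if not url.startswith(root):
        return False
    path = urlparse(url).path.strip("/")
    depth = len(path.split("/")) if path else 0
    return depth <= max_depth
```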
Scheduled Crawls
Set up recurring crawls on a daily, weekly, or monthly schedule. Each run automatically detects what changed and only fetches the delta.
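If you drive Spider from your own scripts rather than a managed schedule, a recurring run reduces to a simple loop (a naive sketch; a cron job is the usual production choice):

```python
import time

def run_on_schedule(crawl_once, interval_seconds: float, runs: int) -> int:
    """Invoke the crawl, wait, repeat. Returns the number of completed runs."""
    done = 0
    for i in range(runs):
        crawl_once()
        done += 1
        if i < runs - 1:
            time.sleep(interval_seconds)
    return done
```

Each invocation of crawl_once would be the incremental crawl from the Quick Start, so every scheduled run fetches only the delta.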
Direct-to-Storage
Send archived pages to S3, GCS, or your own webhook endpoint as they are crawled. No intermediate storage needed.
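One practical detail is choosing object keys so every snapshot stays addressable. A sketch using a date prefix plus a URL hash (the bucket name and the boto3 call are assumptions, not part of Spider's API):

```python
import hashlib
from datetime import datetime

def archive_key(url: str, captured_at: datetime) -> str:
    """Deterministic object key: date prefix + URL hash, so each
    snapshot of a page gets its own versioned path."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return f"{captured_at:%Y/%m/%d}/{digest}.html"

# Upload sketch (assumes boto3 and a bucket you own):
# import boto3
# s3 = boto3.client("s3")
# s3.put_object(Bucket="my-archive", Key=archive_key(url, now), Body=html)
```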
Resources