Compliance & Preservation
Preserve websites
before they change
Regulatory pages get updated. Terms of service disappear. Evidence goes offline. If you did not capture it when it was live, it is gone. Spider builds incremental archives that track what changed between crawls, so every version is accounted for and nothing slips through the cracks.
Only fetch what changed
Spider stores crawl data and compares it against the live site on each subsequent run. Pages that have not changed are skipped. How much bandwidth you save depends on your site: a documentation portal that updates a few pages per week will see large savings, while a news site with daily content churn will see less. Either way, you only pay for what actually changed between snapshots.
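The change-detection idea above can be sketched locally: a minimal content-hash comparator, assuming you keep one hash per URL between runs. The function names and the hash-store shape here are illustrative, not part of the Spider API.

```python
import hashlib


def content_hash(body: str) -> str:
    """Stable fingerprint of a page body."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def changed_pages(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return URLs whose content differs from the last snapshot.

    `previous` and `current` map URL -> content hash. New URLs count as
    changed; unchanged URLs are skipped, mirroring how an incremental
    crawl only re-fetches the delta.
    """
    return [url for url, h in current.items() if previous.get(url) != h]
```

On a slow-changing documentation portal most entries compare equal and the delta stays small, which is exactly where the bandwidth savings come from.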
How Spider handles archiving
Full resource capture
Store complete snapshots including HTML, images, stylesheets, and scripts. Every crawl captures the page exactly as it appeared in the browser, not a stripped-down approximation.
Compliance metadata
Every archived page includes the canonical URL, HTTP status code, and request duration.
Enable return_headers to capture full response headers. Add your own capture timestamps at ingest time for a complete compliance trail.
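One way to add those ingest-time timestamps, as a sketch: assume each archived page arrives as a dict (the url and status keys come from the example below; the captured_at field name is our own convention, not a Spider field).

```python
from datetime import datetime, timezone


def stamp_page(page: dict) -> dict:
    """Attach an ingest-time capture timestamp before archiving.

    The original page dict is left untouched; a new dict is returned
    with a `captured_at` ISO-8601 UTC timestamp added.
    """
    return {**page, "captured_at": datetime.now(timezone.utc).isoformat()}
```

Stamping at ingest, rather than trusting server-side dates, gives your compliance trail a timestamp you control.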
Multiple output formats
Store as raw HTML (raw), clean markdown (markdown), or plain text (text). Request multiple formats in a single crawl with return_format: "raw,markdown". Choose the format that fits your archive system and downstream review workflows.
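When writing multiple formats to disk, you need a filename convention per format. A small sketch: the format names match the return_format values above, but the extension mapping and helper names are our own choices.

```python
# Our own convention: one file extension per Spider return_format value.
EXTENSIONS = {"raw": ".html", "markdown": ".md", "text": ".txt"}


def formats_from_param(return_format: str) -> list[str]:
    """Split a combined value like 'raw,markdown' into its parts."""
    return [f.strip() for f in return_format.split(",")]


def archive_filename(slug: str, fmt: str) -> str:
    """Map a page slug and a return_format value to an archive filename."""
    return slug + EXTENSIONS[fmt]
```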
Direct-to-storage delivery
Use data connectors to stream archived pages directly to Amazon S3, Google Cloud Storage, Azure Blob, or Supabase as they are crawled. Webhooks fire on page discovery for custom integrations. Pages land in your system the moment they are captured.
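If you stream pages into a bucket, a deterministic object-key layout keeps snapshots organized by domain and date. A sketch of one possible convention; the key scheme is ours, not something Spider or the storage connectors enforce.

```python
from urllib.parse import urlparse


def object_key(url: str, snapshot: str) -> str:
    """Build a bucket key like 'docs.example.com/2024-01-01/guide/intro.html'.

    `snapshot` is a per-run label (for example the crawl date) so each
    archive run lands in its own prefix and old versions stay intact.
    """
    parts = urlparse(url)
    path = parts.path.strip("/") or "index"  # root URL becomes 'index'
    return f"{parts.netloc}/{snapshot}/{path}.html"
```

A webhook handler or connector-side transform can apply the same scheme so every version of a page is addressable by run.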
Scope control
Archive specific sections, subsites, or full domains. Set page limits per run and configure crawl depth to match your retention policy and budget.
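Scope settings translate directly into crawl parameters. A sketch of a per-run config: limit appears in Spider's own example below, while depth is our assumption for the max-crawl-depth parameter, so verify the name against the current API reference before relying on it.

```python
def archive_params(limit: int, depth: int) -> dict:
    """Build crawl params bounding a run by page count and link depth.

    `limit` is taken from Spider's documented example; `depth` is an
    assumed parameter name for maximum crawl depth.
    """
    return {
        "limit": limit,        # hard cap on pages per run
        "depth": depth,        # assumed: how far to follow links
        "store_data": True,    # keep snapshots for incremental comparison
    }
```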
Scheduled crawls
Set up daily, weekly, or monthly archive runs. Each one automatically detects changes and only fetches the delta. Costs stay predictable as your archive grows over months and years.
Set up incremental archiving
Schedule weekly crawls. Spider detects which pages changed since the last snapshot and only re-fetches those. Pair with data connectors or webhooks to push new versions directly into your archive storage.
import spider

client = spider.Spider()

# First run crawls everything.
# Subsequent runs only fetch changed pages.
result = client.crawl_url(
    "https://docs.example.com",
    params={
        "limit": 500,
        "metadata": True,
        "return_format": "raw,markdown",
        "return_headers": True,
        "store_data": True,
    },
)

# Each page includes url, status, and content.
for page in result:
    print(page["url"], page["status"])