# Website Archiving with Spider
Spider can capture and store complete web pages for long-term archiving. This guide covers how to set up full-resource archiving, automate recurring crawls, and use incremental crawling to keep archives current.
## Full Resource Storage
By default, Spider returns only a page's primary content. To archive a complete page (HTML, images, scripts, stylesheets), enable full resource storing in the dashboard:
- Go to your website’s settings in the Spider dashboard.
- Toggle Full Resource Storing on.
This captures every resource the page loads, not just the HTML.
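If you drive archiving from scripts rather than the dashboard, the crawl request can ask for the same behavior. Below is a minimal sketch that assumes the API accepts a `full_resources` flag mirroring the dashboard toggle; confirm the parameter name against Spider's API reference before relying on it.

```python
import os
import requests

API_URL = "https://api.spider.cloud/crawl"

def build_archive_request(url: str, limit: int = 100) -> dict:
    """Build a crawl payload that asks Spider to keep every resource.

    `full_resources` is assumed here to mirror the dashboard's
    Full Resource Storing toggle; verify the name in the API docs.
    """
    return {
        "url": url,
        "limit": limit,
        "store_data": True,       # persist results in Spider's storage
        "full_resources": True,   # capture images, scripts, stylesheets too
    }

def submit_archive_crawl(url: str, limit: int = 100) -> int:
    """POST the crawl request and return the HTTP status code."""
    headers = {
        "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
        "Content-Type": "application/json",
    }
    resp = requests.post(API_URL, headers=headers,
                         json=build_archive_request(url, limit))
    return resp.status_code
```

Call `submit_archive_crawl("https://example.com")` from any script once `SPIDER_API_KEY` is set in the environment.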
## Automate Recurring Crawls
For up-to-date archives, schedule crawls using external tools. Spider provides webhooks to push results to your storage system as pages are crawled.
```python
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json={
        "url": "https://example.com",
        "limit": 500,
        "return_format": "raw",
        "store_data": True,
        "webhooks": {
            "destination": "https://your-server.com/archive-webhook"
        }
    },
)

print(f"Crawl started: {response.status_code}")
```
Trigger this from a cron job, GitHub Actions workflow, or cloud scheduler (Lambda, Cloud Functions) on whatever cadence you need.
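The webhook destination needs something listening on the other end. Here is a minimal receiver sketch using only Python's standard library; it assumes Spider POSTs one JSON object per page with `url` and `content` fields, which you should verify against the actual payload shape documented for your account.

```python
import hashlib
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

ARCHIVE_ROOT = Path("archive")

def archive_path(url: str) -> Path:
    """Map a URL to a stable on-disk path: archive/<sha256-of-url>.html."""
    digest = hashlib.sha256(url.encode()).hexdigest()
    return ARCHIVE_ROOT / f"{digest}.html"

class ArchiveHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Assumed payload shape: {"url": "...", "content": "..."}
        length = int(self.headers.get("Content-Length", 0))
        page = json.loads(self.rfile.read(length))
        path = archive_path(page["url"])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(page.get("content", ""))
        self.send_response(200)
        self.end_headers()

def run(port: int = 8000) -> None:
    """Start the receiver; point the webhook destination at this host/port."""
    HTTPServer(("", port), ArchiveHandler).serve_forever()
```

Hashing the URL gives filesystem-safe, collision-resistant filenames; swap in a human-readable scheme if you browse the archive by hand.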
## Incremental Crawling
After the first full crawl, incremental crawling captures only pages that changed since the last run. This saves time and storage.
- Enable Incremental Crawling in the website configuration panel.
- Configure change detection criteria (file modifications, specific elements).
- Run crawls on schedule. Spider only stores the differences.
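If you also want change-skipping on your side, for example when post-processing webhook deliveries, a content-hash check works as a rough client-side equivalent. This is a sketch of that idea, not Spider's internal change detection:

```python
import hashlib
import json
from pathlib import Path

STATE_FILE = Path("crawl_state.json")

def load_state() -> dict:
    """Load the URL -> content-hash map from the previous run, if any."""
    return json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}

def changed_pages(pages: dict, state: dict) -> dict:
    """Return only pages whose content hash differs from the last run.

    `pages` maps URL -> raw content from the latest crawl;
    `state` maps URL -> sha256 hex digest and is updated in place.
    """
    diff = {}
    for url, content in pages.items():
        digest = hashlib.sha256(content.encode()).hexdigest()
        if state.get(url) != digest:
            diff[url] = content
            state[url] = digest
    return diff

def save_state(state: dict) -> None:
    STATE_FILE.write_text(json.dumps(state))
```

Persisting only the hashes keeps the state file small even for large archives, since the full content lives in your storage backend.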
## Batch Archiving
Loop over multiple URLs, submitting one crawl request each, to archive pages across different domains from a single script:
```python
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

urls = [
    "https://example.com",
    "https://docs.example.com",
    "https://blog.example.com",
]

for url in urls:
    response = requests.post(
        'https://api.spider.cloud/crawl',
        headers=headers,
        json={
            "url": url,
            "limit": 100,
            "return_format": "raw",
            "store_data": True,
        },
    )
    print(f"Archiving {url}: {response.status_code}")
```
## Download Archived Data
After crawling, download your archived data from the Spider dashboard. Organize by date or domain for easy retrieval. Back up to an external storage service (S3, GCS, local NAS) to prevent data loss.
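The domain-and-date layout is easy to encode in a small helper when you automate downloads. A sketch using a hypothetical `storage_key` naming scheme (the `archive/<domain>/<date>/<slug>.html` layout is an illustration, not a Spider convention):

```python
from datetime import date
from pathlib import Path
from urllib.parse import urlparse

def storage_key(url: str, crawl_date: date) -> Path:
    """Build an archive path organized by domain, then crawl date,
    e.g. archive/docs.example.com/2024-05-01/guide_setup.html."""
    parsed = urlparse(url)
    # Flatten the URL path into a single filename segment.
    slug = (parsed.path.strip("/") or "index").replace("/", "_")
    return (Path("archive") / parsed.netloc
            / crawl_date.isoformat() / f"{slug}.html")
```

Keying by domain first makes per-site backups to S3, GCS, or a NAS a simple recursive copy of one directory.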
For production archiving workflows, pair full resource storage with webhooks and incremental crawling. This gives you complete captures, automated delivery, and efficient updates without re-crawling unchanged content.