Website Archiving

Archive web pages with Spider. Capture full page resources, automate regular crawls, and store content for long-term access.

Jeff Mendez

Website Archiving with Spider

Spider can capture and store complete web pages for long-term archiving. This guide covers how to set up full-resource archiving, automate recurring crawls, and use incremental crawling to keep archives current.

Full Resource Storage

By default, Spider returns page content. To archive a complete page (HTML, images, scripts, stylesheets), enable full resource storing in the dashboard:

  1. Go to your website’s settings in the Spider dashboard.
  2. Toggle Full Resource Storing on.

This captures every resource the page loads, not just the HTML.
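Full resource capture can also be requested per crawl through the API. Below is a minimal sketch of the request body, assuming a `full_resources` flag that mirrors the dashboard toggle (confirm the exact parameter name in the Spider API reference):

```python
def build_archive_payload(url: str, limit: int = 100) -> dict:
    """Crawl request body that asks for a complete page capture.

    ``full_resources`` is assumed to mirror the dashboard toggle;
    verify the exact name against the Spider API reference.
    """
    return {
        "url": url,
        "limit": limit,
        "return_format": "raw",   # keep the original bytes for archiving
        "store_data": True,       # persist results in Spider storage
        "full_resources": True,   # assumed flag: also capture images, scripts, CSS
    }
```

Pass the resulting dict as the `json` body of a POST to the crawl endpoint, as shown in the next section.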

Automate Recurring Crawls

For up-to-date archives, schedule crawls using external tools. Spider provides webhooks to push results to your storage system as pages are crawled.

```python
import os

import requests

headers = {
    "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
    "Content-Type": "application/json",
}

# Start a crawl that stores results and pushes each page to the
# webhook endpoint as it is captured.
response = requests.post(
    "https://api.spider.cloud/crawl",
    headers=headers,
    json={
        "url": "https://example.com",
        "limit": 500,
        "return_format": "raw",
        "store_data": True,
        "webhooks": {
            "destination": "https://your-server.com/archive-webhook"
        },
    },
)

print(f"Crawl started: {response.status_code}")
```
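On the receiving end, the webhook destination only needs to persist each delivery. A sketch of that storage step, assuming each delivered item carries at least a `url` and the page `content` (check the field names against a real delivery):

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import urlparse


def save_page(page: dict, root: str = "archive") -> Path:
    """Write one crawled page to archive/<date>/<host>/<slug>.json.

    Assumes each webhook delivery item has at least a ``url`` and the
    page ``content``; adapt the field names to the real payload.
    """
    parsed = urlparse(page["url"])
    # Turn the URL path into a safe, flat filename.
    slug = parsed.path.strip("/").replace("/", "_") or "index"
    day = datetime.now(timezone.utc).date().isoformat()
    target = Path(root) / day / parsed.netloc / f"{slug}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(page), encoding="utf-8")
    return target
```

Organizing by date and host as pages arrive makes later retrieval and backup straightforward.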

Trigger this from a cron job, GitHub Actions workflow, or cloud scheduler (Lambda, Cloud Functions) on whatever cadence you need.

Incremental Crawling

After the first full crawl, incremental crawling captures only pages that changed since the last run. This saves time and storage.

  1. Enable Incremental Crawling in the website configuration panel.
  2. Configure change detection criteria (file modifications, specific elements).
  3. Run crawls on schedule. Spider only stores the differences.
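If you also want to verify on your side which archived pages actually changed between runs, a content-hash comparison is one simple approach (this is local bookkeeping, not part of Spider's API):

```python
import hashlib


def content_hash(content: str) -> str:
    """Stable fingerprint of a page body."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def changed_urls(previous: dict, pages: list) -> list:
    """Return URLs whose content differs from the previous run.

    ``previous`` maps url -> hash from the prior crawl; ``pages`` is a
    list of {"url": ..., "content": ...} dicts from the current one.
    """
    changed = []
    for page in pages:
        digest = content_hash(page["content"])
        if previous.get(page["url"]) != digest:
            changed.append(page["url"])
    return changed
```

Persist the hash map between runs (a JSON file is enough) and only back up the URLs this function returns.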

Batch Archiving

Loop over a list of URLs to archive pages across different domains, one crawl request per site:

```python
import os

import requests

headers = {
    "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
    "Content-Type": "application/json",
}

urls = [
    "https://example.com",
    "https://docs.example.com",
    "https://blog.example.com",
]

# One crawl request per site; each crawl is stored independently.
for url in urls:
    response = requests.post(
        "https://api.spider.cloud/crawl",
        headers=headers,
        json={
            "url": url,
            "limit": 100,
            "return_format": "raw",
            "store_data": True,
        },
    )
    print(f"Archiving {url}: {response.status_code}")
```

Download Archived Data

After crawling, download your archived data from the Spider dashboard. Organize by date or domain for easy retrieval. Back up to an external storage service (S3, GCS, local NAS) to prevent data loss.
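The external backup can be automated with a small sync script. A sketch assuming boto3 and a placeholder bucket name (both are illustrative, not part of Spider):

```python
from pathlib import Path


def files_to_backup(root: str = "archive") -> list:
    """All archived files under ``root``, paired with their remote keys."""
    base = Path(root)
    return sorted(
        (str(p.relative_to(base)), p) for p in base.rglob("*") if p.is_file()
    )


def backup_to_s3(root: str = "archive", bucket: str = "my-archive-bucket") -> None:
    """Mirror the local archive into an S3 bucket (name is a placeholder)."""
    import boto3  # external dependency: pip install boto3

    s3 = boto3.client("s3")
    for key, path in files_to_backup(root):
        s3.upload_file(str(path), bucket, key)
```

Run it from the same scheduler that triggers your crawls so the off-site copy never lags far behind the archive.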

For production archiving workflows, pair full resource storage with webhooks and incremental crawling. This gives you complete captures, automated delivery, and efficient updates without re-crawling unchanged content.
