Data Connectors

Pipe results straight into cloud storage or a database as pages come back. The data_connectors parameter works on every endpoint (crawl, scrape, search, screenshot). Spider sends each result to your destination as soon as it's ready.

Event Triggers

Two boolean flags on the data_connectors object control when results are sent; a minimal sketch follows the list below.

  • on_find: Send the full page content as soon as it's ready. Most common option.
  • on_find_metadata: Send lightweight metadata only (URL, status, headers) without the body.
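The flags sit at the top level of data_connectors, alongside the connector blocks themselves. Here is a minimal sketch of that shape, assuming the flags can be combined with any connector from the sections below; the s3 block is just a placeholder, and its fields are documented in the next section.

# Shape of the data_connectors object (sketch)
data_connectors = {
    "s3": {                          # any connector block from the sections below
        "bucket": "my-crawl-data",
        "access_key_id": "...",
        "secret_access_key": "...",
    },
    "on_find": True,                 # send full page content as soon as it's ready
    # "on_find_metadata": True,      # or: send only URL, status, and headers
}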

Amazon S3

Upload each page as a JSON object to an S3 bucket. Objects are keyed by domain and timestamp.

Field               Required  Description
bucket              Yes       The S3 bucket name.
access_key_id       Yes       AWS access key ID.
secret_access_key   Yes       AWS secret access key.
region              No        AWS region. Defaults to us-east-1.
prefix              No        Key prefix for uploaded objects (e.g. "crawls/2024/").
content_type        No        MIME type for objects. Defaults to application/json.

Stream results to S3

import requests, os

headers = {
    "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
    "Content-Type": "application/json",
}

response = requests.post(
    "https://api.spider.cloud/crawl",
    headers=headers,
    json={
        "url": "https://example.com",
        "limit": 50,
        "return_format": "markdown",
        "data_connectors": {
            "s3": {
                "bucket": "my-crawl-data",
                "access_key_id": os.getenv("AWS_ACCESS_KEY_ID"),
                "secret_access_key": os.getenv("AWS_SECRET_ACCESS_KEY"),
                "region": "us-west-2",
                "prefix": "crawls/"
            },
            "on_find": True
        }
    },
)
print(response.json())
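Once the crawl is running, each page arrives in the bucket as its own JSON object, keyed by domain and timestamp. If you want to verify the uploads from your side, a boto3 sketch like the one below works; boto3 is not needed for the connector itself, and the exact key layout under your prefix (and the fields inside each object) are determined by Spider and your return_format.

import json
import boto3

# Sketch: list and read back the objects Spider uploaded under the "crawls/" prefix.
s3 = boto3.client("s3", region_name="us-west-2")
listing = s3.list_objects_v2(Bucket="my-crawl-data", Prefix="crawls/")
for obj in listing.get("Contents", []):
    body = s3.get_object(Bucket="my-crawl-data", Key=obj["Key"])["Body"].read()
    page = json.loads(body)  # objects default to application/json
    print(obj["Key"], type(page))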

Google Cloud Storage

Upload pages to a GCS bucket. Pass the service account JSON key you download from the IAM & Admin console, base64-encoded.

Field                    Required  Description
bucket                   Yes       The GCS bucket name.
service_account_base64   Yes       Base64-encoded service account JSON key.
prefix                   No        Key prefix for uploaded objects.

Stream results to GCS

import requests, os, base64

headers = {
    "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
    "Content-Type": "application/json",
}

# base64-encode your service account JSON file
with open("service-account.json", "rb") as f:
    sa_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "https://api.spider.cloud/crawl",
    headers=headers,
    json={
        "url": "https://example.com",
        "limit": 50,
        "return_format": "markdown",
        "data_connectors": {
            "gcs": {
                "bucket": "my-gcs-bucket",
                "service_account_base64": sa_b64,
                "prefix": "spider-data/"
            },
            "on_find": True
        }
    },
)
print(response.json())
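As with S3, you can sanity-check the uploads with the google-cloud-storage client, reusing the same service account file. A sketch follows; the client library is only needed for this check, not for the connector itself.

from google.cloud import storage

# Sketch: list the blobs Spider wrote under the "spider-data/" prefix.
client = storage.Client.from_service_account_json("service-account.json")
for blob in client.list_blobs("my-gcs-bucket", prefix="spider-data/"):
    data = blob.download_as_bytes()  # JSON page objects; fields depend on return_format
    print(blob.name, len(data), "bytes")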

Google Sheets

Append results as rows to a Google Sheets spreadsheet. Share the spreadsheet with your service account email and set sheet_name to target a specific tab.

Field                    Required  Description
spreadsheet_id           Yes       The spreadsheet ID from the Google Sheets URL.
service_account_base64   Yes       Base64-encoded service account JSON key.
sheet_name               No        Target sheet tab. Defaults to "Sheet1".

Stream results to Google Sheets

import requests, os, base64

headers = {
    "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
    "Content-Type": "application/json",
}

with open("service-account.json", "rb") as f:
    sa_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "https://api.spider.cloud/crawl",
    headers=headers,
    json={
        "url": "https://example.com",
        "limit": 20,
        "return_format": "markdown",
        "data_connectors": {
            "google_sheets": {
                "spreadsheet_id": "1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms",
                "service_account_base64": sa_b64,
                "sheet_name": "Crawl Results"
            },
            "on_find": True
        }
    },
)
print(response.json())
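To pull the appended rows back out, or simply to confirm the service account can see the spreadsheet, a gspread sketch like this works. The column layout of the appended rows is determined by Spider and isn't documented here, so the sketch just prints raw rows.

import gspread

# Sketch: read back the rows Spider appended to the "Crawl Results" tab.
gc = gspread.service_account(filename="service-account.json")
sheet = gc.open_by_key("1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms")
worksheet = sheet.worksheet("Crawl Results")
for row in worksheet.get_all_values():
    print(row)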

Azure Blob Storage

Write each page to an Azure Storage container. Pass the full connection string from the Azure portal.

Field               Required  Description
connection_string   Yes       Azure Storage connection string.
container           Yes       The container name.
prefix              No        Blob name prefix for uploaded objects.

Stream results to Azure Blob Storage

import requests, os

headers = {
    "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
    "Content-Type": "application/json",
}

response = requests.post(
    "https://api.spider.cloud/crawl",
    headers=headers,
    json={
        "url": "https://example.com",
        "limit": 50,
        "return_format": "markdown",
        "data_connectors": {
            "azure_blob": {
                "connection_string": os.getenv("AZURE_STORAGE_CONNECTION_STRING"),
                "container": "crawl-data",
                "prefix": "results/"
            },
            "on_find": True
        }
    },
)
print(response.json())
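A quick way to confirm the blobs are arriving is the azure-storage-blob client, using the same connection string. A sketch follows; the SDK is only needed for your own verification, not for the connector.

import os
from azure.storage.blob import BlobServiceClient

# Sketch: list the blobs Spider wrote under the "results/" prefix.
service = BlobServiceClient.from_connection_string(os.getenv("AZURE_STORAGE_CONNECTION_STRING"))
container = service.get_container_client("crawl-data")
for blob in container.list_blobs(name_starts_with="results/"):
    print(blob.name, blob.size)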

Supabase

Insert results into a Supabase Postgres table via PostgREST. Rows are batched automatically so you don't need to handle pagination.

Field      Required  Description
url        Yes       Supabase project URL (e.g. https://xxx.supabase.co).
anon_key   Yes       Supabase anon or service role key.
table      Yes       Target table name for row inserts.

Stream results to Supabase

import requests, os

headers = {
    "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
    "Content-Type": "application/json",
}

response = requests.post(
    "https://api.spider.cloud/crawl",
    headers=headers,
    json={
        "url": "https://example.com",
        "limit": 50,
        "return_format": "markdown",
        "data_connectors": {
            "supabase": {
                "url": "https://your-project.supabase.co",
                "anon_key": os.getenv("SUPABASE_ANON_KEY"),
                "table": "crawled_pages"
            },
            "on_find": True
        }
    },
)
print(response.json())
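On the Supabase side you can query the target table as usual once rows start arriving. Here is a supabase-py sketch; the exact columns Spider inserts aren't listed here, so it selects everything, and reads with the anon key are subject to your row-level security policies.

import os
from supabase import create_client

# Sketch: read back a few rows Spider inserted into the crawled_pages table.
supabase = create_client("https://your-project.supabase.co", os.getenv("SUPABASE_ANON_KEY"))
rows = supabase.table("crawled_pages").select("*").limit(5).execute()
print(rows.data)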

Multiple Connectors

You can configure more than one connector in the same request. All active connectors fire in parallel alongside webhooks.

S3 + Supabase in a single request

{ "url": "https://example.com", "limit": 100, "data_connectors": { "s3": { "bucket": "archive-bucket", "access_key_id": "AKIA...", "secret_access_key": "wJal..." }, "supabase": { "url": "https://xxx.supabase.co", "anon_key": "eyJhbGci...", "table": "pages" }, "on_find": true } }