Data Connectors

Pipe results straight into cloud storage or a database as pages come back. The data_connectors parameter works on every endpoint (crawl, scrape, search, screenshot). Spider sends each result to your destination as soon as it's ready.

Event Triggers

Two boolean flags on the data_connectors object control when results are sent; a minimal sketch follows the list below.

  • on_find: Send the full page content as soon as it's ready. Most common option.
  • on_find_metadata: Send lightweight metadata only (URL, status, headers) without the body.
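The flags sit at the top level of data_connectors, alongside the connector blocks themselves. Here is a minimal sketch of that shape, assuming the flags can be combined with any connector from the sections below; the s3 block is just a placeholder, and its fields are documented in the next section.

# Shape of the data_connectors object (sketch)
data_connectors = {
    "s3": {                          # any connector block from the sections below
        "bucket": "my-crawl-data",
        "access_key_id": "...",
        "secret_access_key": "...",
    },
    "on_find": True,                 # send full page content as soon as it's ready
    # "on_find_metadata": True,      # or: send only URL, status, and headers
}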

Amazon S3

Upload each page as a JSON object to an S3 bucket. Objects are keyed by domain and timestamp.

Field               Required  Description
bucket              Yes       The S3 bucket name.
access_key_id       Yes       AWS access key ID.
secret_access_key   Yes       AWS secret access key.
region              No        AWS region. Defaults to us-east-1.
prefix              No        Key prefix for uploaded objects (e.g. "crawls/2024/").
content_type        No        MIME type for objects. Defaults to application/json.

Stream results to S3

import requests, os

headers = {
    "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
    "Content-Type": "application/json",
}

response = requests.post(
    "https://api.spider.cloud/crawl",
    headers=headers,
    json={
        "url": "https://example.com",
        "limit": 50,
        "return_format": "markdown",
        "data_connectors": {
            "s3": {
                "bucket": "my-crawl-data",
                "access_key_id": os.getenv("AWS_ACCESS_KEY_ID"),
                "secret_access_key": os.getenv("AWS_SECRET_ACCESS_KEY"),
                "region": "us-west-2",
                "prefix": "crawls/"
            },
            "on_find": True
        }
    },
)
print(response.json())
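Once the crawl is running, each page arrives in the bucket as its own JSON object, keyed by domain and timestamp. If you want to verify the uploads from your side, a boto3 sketch like the one below works; boto3 is not needed for the connector itself, and the exact key layout under your prefix (and the fields inside each object) are determined by Spider and your return_format.

import json
import boto3

# Sketch: list and read back the objects Spider uploaded under the "crawls/" prefix.
s3 = boto3.client("s3", region_name="us-west-2")
listing = s3.list_objects_v2(Bucket="my-crawl-data", Prefix="crawls/")
for obj in listing.get("Contents", []):
    body = s3.get_object(Bucket="my-crawl-data", Key=obj["Key"])["Body"].read()
    page = json.loads(body)  # objects default to application/json
    print(obj["Key"], type(page))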

Google Cloud Storage

Upload pages to a GCS bucket. Pass the service account JSON key you download from the IAM & Admin console, base64-encoded.

Field                    Required  Description
bucket                   Yes       The GCS bucket name.
service_account_base64   Yes       Base64-encoded service account JSON key.
prefix                   No        Key prefix for uploaded objects.

Stream results to GCS

import requests, os, base64

headers = {
    "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
    "Content-Type": "application/json",
}

# base64-encode your service account JSON file
with open("service-account.json", "rb") as f:
    sa_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "https://api.spider.cloud/crawl",
    headers=headers,
    json={
        "url": "https://example.com",
        "limit": 50,
        "return_format": "markdown",
        "data_connectors": {
            "gcs": {
                "bucket": "my-gcs-bucket",
                "service_account_base64": sa_b64,
                "prefix": "spider-data/"
            },
            "on_find": True
        }
    },
)
print(response.json())
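As with S3, you can sanity-check the uploads with the google-cloud-storage client, reusing the same service account file. A sketch follows; the client library is only needed for this check, not for the connector itself.

from google.cloud import storage

# Sketch: list the blobs Spider wrote under the "spider-data/" prefix.
client = storage.Client.from_service_account_json("service-account.json")
for blob in client.list_blobs("my-gcs-bucket", prefix="spider-data/"):
    data = blob.download_as_bytes()  # JSON page objects; fields depend on return_format
    print(blob.name, len(data), "bytes")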

Google Sheets

Append results as rows to a Google Sheets spreadsheet. Share the spreadsheet with your service account email and set sheet_name to target a specific tab.

Field                    Required  Description
spreadsheet_id           Yes       The spreadsheet ID from the Google Sheets URL.
service_account_base64   Yes       Base64-encoded service account JSON key.
sheet_name               No        Target sheet tab. Defaults to "Sheet1".

Stream results to Google Sheets

import requests, os, base64

headers = {
    "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
    "Content-Type": "application/json",
}

with open("service-account.json", "rb") as f:
    sa_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "https://api.spider.cloud/crawl",
    headers=headers,
    json={
        "url": "https://example.com",
        "limit": 20,
        "return_format": "markdown",
        "data_connectors": {
            "google_sheets": {
                "spreadsheet_id": "1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms",
                "service_account_base64": sa_b64,
                "sheet_name": "Crawl Results"
            },
            "on_find": True
        }
    },
)
print(response.json())
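To pull the appended rows back out, or simply to confirm the service account can see the spreadsheet, a gspread sketch like this works. The column layout of the appended rows is determined by Spider and isn't documented here, so the sketch just prints raw rows.

import gspread

# Sketch: read back the rows Spider appended to the "Crawl Results" tab.
gc = gspread.service_account(filename="service-account.json")
sheet = gc.open_by_key("1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms")
worksheet = sheet.worksheet("Crawl Results")
for row in worksheet.get_all_values():
    print(row)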

Azure Blob Storage

Write each page to an Azure Storage container. Pass the full connection string from the Azure portal.

Field               Required  Description
connection_string   Yes       Azure Storage connection string.
container           Yes       The container name.
prefix              No        Blob name prefix for uploaded objects.

Stream results to Azure Blob Storage

import requests, os

headers = {
    "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
    "Content-Type": "application/json",
}

response = requests.post(
    "https://api.spider.cloud/crawl",
    headers=headers,
    json={
        "url": "https://example.com",
        "limit": 50,
        "return_format": "markdown",
        "data_connectors": {
            "azure_blob": {
                "connection_string": os.getenv("AZURE_STORAGE_CONNECTION_STRING"),
                "container": "crawl-data",
                "prefix": "results/"
            },
            "on_find": True
        }
    },
)
print(response.json())
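A quick way to confirm the blobs are arriving is the azure-storage-blob client, using the same connection string. A sketch follows; the SDK is only needed for your own verification, not for the connector.

import os
from azure.storage.blob import BlobServiceClient

# Sketch: list the blobs Spider wrote under the "results/" prefix.
service = BlobServiceClient.from_connection_string(os.getenv("AZURE_STORAGE_CONNECTION_STRING"))
container = service.get_container_client("crawl-data")
for blob in container.list_blobs(name_starts_with="results/"):
    print(blob.name, blob.size)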

Supabase

Insert results into a Supabase Postgres table via PostgREST. Rows are batched automatically so you don't need to handle pagination.

Field      Required  Description
url        Yes       Supabase project URL (e.g. https://xxx.supabase.co).
anon_key   Yes       Supabase anon or service role key.
table      Yes       Target table name for row inserts.

Stream results to Supabase

import requests, os

headers = {
    "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
    "Content-Type": "application/json",
}

response = requests.post(
    "https://api.spider.cloud/crawl",
    headers=headers,
    json={
        "url": "https://example.com",
        "limit": 50,
        "return_format": "markdown",
        "data_connectors": {
            "supabase": {
                "url": "https://your-project.supabase.co",
                "anon_key": os.getenv("SUPABASE_ANON_KEY"),
                "table": "crawled_pages"
            },
            "on_find": True
        }
    },
)
print(response.json())
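On the Supabase side you can query the target table as usual once rows start arriving. Here is a supabase-py sketch; the exact columns Spider inserts aren't listed here, so it selects everything, and reads with the anon key are subject to your row-level security policies.

import os
from supabase import create_client

# Sketch: read back a few rows Spider inserted into the crawled_pages table.
supabase = create_client("https://your-project.supabase.co", os.getenv("SUPABASE_ANON_KEY"))
rows = supabase.table("crawled_pages").select("*").limit(5).execute()
print(rows.data)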

Multiple Connectors

You can configure more than one connector in the same request. All active connectors fire in parallel alongside webhooks.

S3 + Supabase in a single request

{ "url": "https://example.com", "limit": 100, "data_connectors": { "s3": { "bucket": "archive-bucket", "access_key_id": "AKIA...", "secret_access_key": "wJal..." }, "supabase": { "url": "https://xxx.supabase.co", "anon_key": "eyJhbGci...", "table": "pages" }, "on_find": true } }