API Reference

Download OpenAPI Specification: Download

The Spider API is based on REST. It is predictable, returns JSON-encoded responses, and uses standard HTTP response codes, authentication, and verbs. To get started, set your API secret key in the Authorization header using the format Bearer $TOKEN. You can set the Content-Type header to application/json, application/xml, text/csv, or application/jsonl to shape the response.

The Spider API varies for each account as we release new versions and tailor functionality. You can add v1 before any path to lock in that version. Executing a request on the page by pressing the Run button will consume live credits and treat the response as a genuine result. The system is constantly improving to ensure you can handle the dynamic aspects of the web. Spider provides all the tools you need to collect data from any website.

Just getting started?

Check out our development quickstart guide.

Not a developer?

Use Spider's no-code options or apps to get started with Spider and do more with your Spider account, no code required.

Base URL
https://api.spider.cloud
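
As a minimal sketch of the conventions described above, the helper below builds the request headers and an optionally version-pinned URL. The names `build_headers` and `endpoint_url` are illustrative and not part of any Spider SDK, and the `/v1/crawl` form assumes the version segment sits directly after the base URL.

```python
import os

BASE_URL = "https://api.spider.cloud"

# Content types the reference lists for shaping the response.
CONTENT_TYPES = {
    "application/json",
    "application/xml",
    "text/csv",
    "application/jsonl",
}


def build_headers(token, content_type="application/json"):
    """Return headers with the key in the Bearer $TOKEN format."""
    if content_type not in CONTENT_TYPES:
        raise ValueError(f"unsupported content type: {content_type}")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": content_type,
    }


def endpoint_url(path, version=None):
    """Join the base URL and a path, optionally pinning a version such as 'v1'."""
    parts = [BASE_URL]
    if version:
        parts.append(version.strip("/"))
    parts.append(path.strip("/"))
    return "/".join(parts)


if __name__ == "__main__":
    headers = build_headers(os.getenv("SPIDER_API_KEY", "YOUR-API-KEY"))
    print(headers["Content-Type"])
    print(endpoint_url("crawl", version="v1"))
```

Keeping the allowed content types in one set makes a typo fail locally instead of producing an unexpected response format.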

Crawl websites

Start crawling a website or websites to collect resources. You can pass an array of objects for the request body.

POST https://api.spider.cloud/crawl

Request body

  • url required string

    The URI resource to crawl. This can be a comma-separated list for multiple URLs.

    To reduce latency, improve performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.

  • limit number

    The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.

    It is better to set a limit upfront on websites whose size you do not know. Re-crawling can effectively use the cache to keep costs low as new pages are found.

  • store_data boolean

    Decide whether to store data. If enabled, this option overrides storageless. Defaults to false. Storage costs $0.30 per gigabyte per month.

    Set to true to collect resources to download and reuse later. Please note: storing data incurs a fee of $0.30 per gigabyte per month, or $0.01 daily.

Request
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/crawl', 
  headers=headers, json=json_data)

print(response.json())
Response
[
  {
    "content": "<resource>...",
    "error": null,
    "status": 200,
    "costs": {
      "ai_cost": 0,
      "compute_cost": 0.0001,
      "file_cost": 0.0002,
      "bytes_transferred_cost": 0.0002,
      "total_cost": 0.0004,
      "transform_cost": 0.0001
    },
    "url": "https://spider.cloud"
  },
  // more content...
]
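
The sketch below shows one way to batch several URLs into a single request body and to add up the per-page costs from a response. The helper names are hypothetical, and the sample data only mirrors the shape of the response shown above; it is not real output.

```python
def build_crawl_payload(urls, limit=0, return_format="markdown"):
    """Build a /crawl request body; multiple URLs are sent comma-separated."""
    return {
        "url": ",".join(urls),
        "limit": limit,
        "return_format": return_format,
    }


def total_cost(pages):
    """Sum costs.total_cost over the pages in a response list."""
    return sum((page.get("costs") or {}).get("total_cost", 0) for page in pages)


# Sample data shaped like the response above (values are illustrative).
sample_response = [
    {"url": "https://spider.cloud", "status": 200, "error": None,
     "costs": {"total_cost": 0.0004}},
    {"url": "https://spider.cloud/guides", "status": 200, "error": None,
     "costs": {"total_cost": 0.0004}},
]

payload = build_crawl_payload(["https://spider.cloud", "https://example.com"], limit=5)
print(payload["url"])
print(round(total_cost(sample_response), 4))
```

The payload can be passed as `json=payload` to the `requests.post` call shown in the request example.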

Search

Perform a search and gather a list of websites to start crawling and collect resources.

POST https://api.spider.cloud/search

Request body

  • search required string

    The search query to run.

  • limit number

    The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.

    It is better to set a limit upfront on websites whose size you do not know. Re-crawling can effectively use the cache to keep costs low as new pages are found.

  • store_data boolean

    Decide whether to store data. If enabled, this option overrides storageless. Defaults to false. Storage costs $0.30 per gigabyte per month.

    Set to true to collect resources to download and reuse later. Please note: storing data incurs a fee of $0.30 per gigabyte per month, or $0.01 daily.

Request
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

json_data = {"search":"a sports website","search_limit":3,"limit":5,"return_format":"markdown"}

response = requests.post('https://api.spider.cloud/search', 
  headers=headers, json=json_data)

print(response.json())
Response
[
  {
    "content": "<resource>...",
    "error": null,
    "status": 200,
    "costs": {
      "ai_cost": 0,
      "compute_cost": 0.0001,
      "file_cost": 0.0002,
      "bytes_transferred_cost": 0.0002,
      "total_cost": 0.0004,
      "transform_cost": 0.0001
    },
    "url": "https://spider.cloud"
  },
  // more content...
]

Links

Start crawling one or more websites to collect the links found. You can pass an array of objects for the request body.

POST https://api.spider.cloud/links

Request body

  • url required string

    The URI resource to crawl. This can be a comma-separated list for multiple URLs.

    To reduce latency, improve performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.

  • limit number

    The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.

    It is better to set a limit upfront on websites whose size you do not know. Re-crawling can effectively use the cache to keep costs low as new pages are found.

  • store_data boolean

    Decide whether to store data. If enabled, this option overrides storageless. Defaults to false. Storage costs $0.30 per gigabyte per month.

    Set to true to collect resources to download and reuse later. Please note: storing data incurs a fee of $0.30 per gigabyte per month, or $0.01 daily.

Request
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/links', 
  headers=headers, json=json_data)

print(response.json())
Response
[
  {
    "url": "https://spider.cloud",
    "status": 200,
    "error": null
  },
  // more content...
]

Screenshot websites

Take screenshots of websites, returned in base64 or binary encoding.

POST https://api.spider.cloud/screenshot

Request body

  • url required string

    The URI resource to crawl. This can be a comma-separated list for multiple URLs.

    To reduce latency, improve performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.

  • limit number

    The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.

    It is better to set a limit upfront on websites whose size you do not know. Re-crawling can effectively use the cache to keep costs low as new pages are found.

  • store_data boolean

    Decide whether to store data. If enabled, this option overrides storageless. Defaults to false. Storage costs $0.30 per gigabyte per month.

    Set to true to collect resources to download and reuse later. Please note: storing data incurs a fee of $0.30 per gigabyte per month, or $0.01 daily.

Request
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

json_data = {"limit":5,"url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/screenshot', 
  headers=headers, json=json_data)

print(response.json())
Response
[
  {
    "content": "<resource>...",
    "error": null,
    "status": 200,
    "costs": {
      "ai_cost": 0,
      "compute_cost": 0.0001,
      "file_cost": 0.0002,
      "bytes_transferred_cost": 0.0002,
      "total_cost": 0.0004,
      "transform_cost": 0.0001
    },
    "url": "https://spider.cloud"
  },
  // more content...
]
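
A minimal sketch of handling a base64 screenshot follows. It assumes the `content` field of each item carries the base64-encoded image bytes, which the reference implies but does not spell out; `decode_screenshot` is a hypothetical helper, and the sample item is a stand-in rather than real output.

```python
import base64


def decode_screenshot(page):
    """Return raw image bytes from a response item, or None if it failed."""
    if page.get("error") or not page.get("content"):
        return None
    return base64.b64decode(page["content"])


# Stand-in item: the content here is just the PNG signature, base64-encoded.
png_signature = b"\x89PNG\r\n\x1a\n"
sample_page = {
    "url": "https://spider.cloud",
    "status": 200,
    "error": None,
    "content": base64.b64encode(png_signature).decode("ascii"),
}

image_bytes = decode_screenshot(sample_page)
print(image_bytes == png_signature)
```

The decoded bytes can then be written to disk with a file opened in binary mode.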

Transform HTML

Transform HTML to Markdown or text fast. Each HTML transformation costs 1 credit. You can send up to 10MB of data at once. The transform API is also built into the /crawl endpoint by using return_format.

POST https://api.spider.cloud/transform

Request body

  • data required object

    A list of HTML data to transform. Each object in the list takes the keys html and url. The url key is optional and only used when readability is enabled.

Request
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

json_data = {"return_format":"markdown","data":[{"html":"<html>\n<head>\n  <title>Example Transform</title>  \n</head>\n<body>\n<div>\n    <h1>Example Website</h1>\n    <p>This is some example markup to use to test the transform function.</p>\n    <p><a href=\"https://spider.cloud/guides\">Guides</a></p>\n</div>\n</body></html>","url":"https://example.com"}]}

response = requests.post('https://api.spider.cloud/transform', 
  headers=headers, json=json_data)

print(response.json())
Response
{
    "content": [
      "Example Domain
Example Domain
==========
This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.
[More information...](https://www.iana.org/domains/example)"
    ],
    "cost": {
        "ai_cost": 0,
        "compute_cost": 0,
        "file_cost": 0,
        "bytes_transferred_cost": 0,
        "total_cost": 0,
        "transform_cost": 0.0001
    },
    "error": null,
    "status": 200
  }
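
The sketch below builds a transform request body from several HTML documents and checks it against the 10MB limit before sending. `build_transform_payload` is a hypothetical helper, and treating 10MB as 10 * 1024 * 1024 bytes of serialized JSON is an assumption, since the reference does not say how the limit is measured.

```python
import json

MAX_BYTES = 10 * 1024 * 1024  # assumed interpretation of the 10MB limit


def build_transform_payload(documents, return_format="markdown"):
    """Build a /transform body from (html, url) pairs; url may be None."""
    data = []
    for html, url in documents:
        item = {"html": html}
        if url:  # url is optional per the reference
            item["url"] = url
        data.append(item)
    payload = {"return_format": return_format, "data": data}
    size = len(json.dumps(payload).encode("utf-8"))
    if size > MAX_BYTES:
        raise ValueError(f"payload is {size} bytes, over the {MAX_BYTES}-byte limit")
    return payload


payload = build_transform_payload([
    ("<html><body><h1>Example Website</h1></body></html>", "https://example.com"),
    ("<p>No URL for this one.</p>", None),
])
print(len(payload["data"]))
```

Checking the size locally avoids spending a request on a body that would be rejected for being too large.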

Query

Query a resource from the global database instead of crawling a website. Each successful retrieval costs 1 credit.

POST https://api.spider.cloud/data/query

Request body

  • url string

    The exact path of the url that you want to get.

  • domain string

    The website domain you want to query.

  • pathname string

    The website pathname you want to query.

Request
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

json_data = {"url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/data/query', 
  headers=headers, json=json_data)

print(response.json())
Response
{
  "content": "<html>
    <body>
      <div>
          <h1>Example Website</h1>
      </div>
    </body>
  </html>",
  "error": null,
  "status": 200
}

Proxy-Mode (Alpha)

Spider also offers a proxy front-end to the service. The Spider proxy handles requests just like any standard request, with the option to use high-performance residential proxies at up to 1 TB/s.

  • HTTP address: proxy.spider.cloud:8888
  • HTTPS address: proxy.spider.cloud:8889
  • Username: YOUR-API-KEY
  • Password: PARAMETERS

Example proxy request
import requests, os


# Proxy configuration
proxies = {
    'http': f"http://{os.getenv('SPIDER_API_KEY')}:request=Raw&premium_proxy=False@proxy.spider.cloud:8888",
    'https': f"https://{os.getenv('SPIDER_API_KEY')}:request=Raw&premium_proxy=False@proxy.spider.cloud:8889"
}

# Function to make a request through the proxy
def get_via_proxy(url):
    try:
        response = requests.get(url, proxies=proxies)
        response.raise_for_status()
        print('Response HTTP Status Code: ', response.status_code)
        print('Response HTTP Response Body: ', response.content)
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")
        return None

# Example usage
if __name__ == "__main__":
    get_via_proxy("https://www.example.com")
    get_via_proxy("https://www.example.com/community")
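
The proxy URLs in the example above are written by hand. As a sketch, they can also be assembled from a dictionary of parameters, with the API key as the username and the encoded parameters as the password. `build_proxy_url` is a hypothetical helper; the host, ports, and parameter names are taken from the example above.

```python
from urllib.parse import urlencode

PROXY_HOST = "proxy.spider.cloud"
PORTS = {"http": 8888, "https": 8889}


def build_proxy_url(api_key, params, scheme="http"):
    """Return a proxy URL with the API key as username and parameters as password."""
    password = urlencode(params)  # e.g. request=Raw&premium_proxy=False
    return f"{scheme}://{api_key}:{password}@{PROXY_HOST}:{PORTS[scheme]}"


params = {"request": "Raw", "premium_proxy": False}
proxies = {scheme: build_proxy_url("YOUR-API-KEY", params, scheme) for scheme in PORTS}
print(proxies["http"])
```

The resulting `proxies` dictionary has the same shape as the one passed to `requests.get` in the example.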

Pipelines

Create powerful workflows with our pipeline API endpoints. Use AI to extract leads from any website or easily filter links with prompts.

Extract leads

Start crawling one or more websites to collect all leads using AI. A minimum of $25 in credits is required for extraction.

POST https://api.spider.cloud/pipeline/extract-contacts

Request body

  • url required string

    The URI resource to crawl. This can be a comma-separated list for multiple URLs.

    To reduce latency, improve performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.

  • limit number

    The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.

    It is better to set a limit upfront on websites whose size you do not know. Re-crawling can effectively use the cache to keep costs low as new pages are found.

  • store_data boolean

    Decide whether to store data. If enabled, this option overrides storageless. Defaults to false. Storage costs $0.30 per gigabyte per month.

    Set to true to collect resources to download and reuse later. Please note: storing data incurs a fee of $0.30 per gigabyte per month, or $0.01 daily.

Request
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/pipeline/extract-contacts', 
  headers=headers, json=json_data)

print(response.json())
Response
[
  {
    "content": "<resource>...",
    "error": null,
    "status": 200,
    "costs": {
      "ai_cost": 0,
      "compute_cost": 0.0001,
      "file_cost": 0.0002,
      "bytes_transferred_cost": 0.0002,
      "total_cost": 0.0004,
      "transform_cost": 0.0001
    },
    "url": "https://spider.cloud"
  },
  // more content...
]

Label website

Crawl a website and accurately categorize it using AI.

POST https://api.spider.cloud/pipeline/label

Request body

  • url required string

    The URI resource to crawl. This can be a comma-separated list for multiple URLs.

    To reduce latency, improve performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.

  • limit number

    The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.

    It is better to set a limit upfront on websites whose size you do not know. Re-crawling can effectively use the cache to keep costs low as new pages are found.

  • store_data boolean

    Decide whether to store data. If enabled, this option overrides storageless. Defaults to false. Storage costs $0.30 per gigabyte per month.

    Set to true to collect resources to download and reuse later. Please note: storing data incurs a fee of $0.30 per gigabyte per month, or $0.01 daily.

Request
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/pipeline/label', 
  headers=headers, json=json_data)

print(response.json())
Response
[
  {
    "content": "<resource>...",
    "error": null,
    "status": 200,
    "costs": {
      "ai_cost": 0,
      "compute_cost": 0.0001,
      "file_cost": 0.0002,
      "bytes_transferred_cost": 0.0002,
      "total_cost": 0.0004,
      "transform_cost": 0.0001
    },
    "url": "https://spider.cloud"
  },
  // more content...
]

Crawl websites from text

Crawl websites found in raw text or markdown.

POST https://api.spider.cloud/pipeline/crawl-text

Request body

  • text required string

    The text string to extract URLs from. The maximum size of the text is 10 MB.

  • limit number

    The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.

    It is better to set a limit upfront on websites whose size you do not know. Re-crawling can effectively use the cache to keep costs low as new pages are found.

  • store_data boolean

    Decide whether to store data. If enabled, this option overrides storageless. Defaults to false. Storage costs $0.30 per gigabyte per month.

    Set to true to collect resources to download and reuse later. Please note: storing data incurs a fee of $0.30 per gigabyte per month, or $0.01 daily.

Request
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

json_data = {"text":"Check this link: https://example.com and email to example@email.com","limit":5,"return_format":"markdown"}

response = requests.post('https://api.spider.cloud/pipeline/crawl-text', 
  headers=headers, json=json_data)

print(response.json())
Response
[
  {
    "content": "<resource>...",
    "error": null,
    "status": 200,
    "costs": {
      "ai_cost": 0,
      "compute_cost": 0.0001,
      "file_cost": 0.0002,
      "bytes_transferred_cost": 0.0002,
      "total_cost": 0.0004,
      "transform_cost": 0.0001
    },
    "url": "https://spider.cloud"
  },
  // more content...
]

Filter links

Filter links using AI and advanced metadata.

POST https://api.spider.cloud/pipeline/filter-links

Request body

  • url required array

    The urls to filter.

    You can pass up to 4k tokens for the links and prompt.

Request
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

json_data = {"limit":5,"return_format":"markdown","url":["https://spider.cloud","https://foodnetwork.com"]}

response = requests.post('https://api.spider.cloud/pipeline/filter-links', 
  headers=headers, json=json_data)

print(response.json())
Response
{
    "content": [
        {
            "relevant_urls": [
                "https://spider.cloud",
                "https://foodnetwork.com"
            ]
        }
    ],
    "cost": {
        "ai_cost": 0.0005,
        "compute_cost": 0,
        "file_cost": 0,
        "bytes_transferred_cost": 0,
        "total_cost": 0,
        "transform_cost": 0
    },
    "error": "",
    "status": 200
}

Questions and Answers

Get a question-and-answer list for a website based on any inquiry.

POST https://api.spider.cloud/pipeline/extract-qa

Request body

  • url required string

    The URI resource to crawl. This can be a comma-separated list for multiple URLs.

    To reduce latency, improve performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.

  • limit number

    The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.

    It is better to set a limit upfront on websites whose size you do not know. Re-crawling can effectively use the cache to keep costs low as new pages are found.

  • store_data boolean

    Decide whether to store data. If enabled, this option overrides storageless. Defaults to false. Storage costs $0.30 per gigabyte per month.

    Set to true to collect resources to download and reuse later. Please note: storing data incurs a fee of $0.30 per gigabyte per month, or $0.01 daily.

Request
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/pipeline/extract-qa', 
  headers=headers, json=json_data)

print(response.json())
Response
{
  "content": [
    {
        "answer": "Spider is a data collecting solution designed for web crawling and scraping.",
        "question": "What is the primary function of Spider?"
    },
    {
        "answer": "You can kickstart your data collecting projects by signing up for a free trial or taking advantage of the promotional credits offered.",
        "question": "How can I get started with Spider?"
    },
    {
        "answer": "Spider offers unmatched speed, scalability, and comprehensive data curation, making it trusted by leading tech businesses.",
        "question": "What are the benefits of using Spider for data collection?"
    },
    {
        "answer": "Spider can easily crawl, search, and extract data from various sources, including social media platforms.",
        "question": "What kind of data can Spider extract?"
    },
    {
        "answer": "Spider is built fully in Rust for next-generation scalability.",
        "question": "What programming language is Spider built with?"
    }
  ],
    "cost": {
        "ai_cost": 0.0009,
        "compute_cost": 0.0001,
        "file_cost": 0,
        "bytes_transferred_cost": 0,
        "total_cost": 0,
        "transform_cost": 0
  },
  "error": null,
  "status": 200,
  "url": "https://spider.cloud"
}
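
As a small sketch of consuming this response, the helper below turns the `content` list into question and answer pairs. `qa_pairs` is a hypothetical name, and the sample is a shortened copy of the response above.

```python
def qa_pairs(response):
    """Return (question, answer) tuples from an extract-qa response."""
    return [
        (item["question"], item["answer"])
        for item in response.get("content", [])
    ]


# Shortened copy of the response shown above.
sample = {
    "content": [
        {
            "answer": "Spider is built fully in Rust for next-generation scalability.",
            "question": "What programming language is Spider built with?",
        },
    ],
    "error": None,
    "status": 200,
    "url": "https://spider.cloud",
}

for question, answer in qa_pairs(sample):
    print(f"Q: {question}")
    print(f"A: {answer}")
```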

Queries

Query the data that you collect during crawling and scraping. Add dynamic filters for extracting exactly what is needed.

Websites Collection

Get the websites stored.

GET https://api.spider.cloud/data/websites

Request params

  • limit string

    The limit of records to get.

  • url string

    Filter a single url record.

  • page number

    The current page to get.

  • domain string

    Filter a single domain record.

Request
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/websites?limit=5&return_format=markdown',
  headers=headers)

print(response.json())
Response
{
  "data": [
    {
      "id": "2a503c02-f161-444b-b1fa-03a3914667b6",
      "user_id": "6bd06efa-bb0b-4f1f-a23f-05db0c4b1bfd",
      "url": "6bd06efa-bb0b-4f1f-a29f-05db0c4b1bfd/example.com/index.html",
      "domain": "spider.cloud",
      "created_at": "2024-04-18T15:40:25.667063+00:00",
      "updated_at": "2024-04-18T15:40:25.667063+00:00",
      "pathname": "/",
      "fts": "",
      "scheme": "https:",
      "last_checked_at": "2024-05-10T13:39:32.293017+00:00",
      "screenshot": null
    }
  ],
  "count": 100
}
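
The request params listed above recur across the collection endpoints in this section. The sketch below builds the query string with `urllib.parse.urlencode` so that `=` and `&` are left as separators rather than percent-encoded. `collection_url` and `page_count` are hypothetical helpers; treating `count` as the total number of matching records, and the first page as page 1, are assumptions the reference does not confirm.

```python
import math
from urllib.parse import urlencode

BASE_URL = "https://api.spider.cloud"


def collection_url(collection, **filters):
    """Build a /data/<collection> URL, dropping filters that are None."""
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    url = f"{BASE_URL}/data/{collection}"
    return f"{url}?{query}" if query else url


def page_count(total, limit):
    """Number of pages needed to read `total` records, `limit` at a time."""
    return math.ceil(total / limit)


# page=1 assumes one-based paging; adjust if the API counts from 0.
print(collection_url("websites", limit=5, page=1, domain="spider.cloud"))
print(page_count(100, 5))
```

The same URL can be produced by passing the filters as `params=` to `requests.get`, which encodes them the same way.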

Pages Collection

Get the pages/resources stored.

GET https://api.spider.cloud/data/pages

Request params

  • limit string

    The limit of records to get.

  • url string

    Filter a single url record.

  • page number

    The current page to get.

  • domain string

    Filter a single domain record.

Request
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/pages?limit=5&return_format=markdown',
  headers=headers)

print(response.json())
Response
{
  "data": [
    {
      "id": "733b0d0f-e406-4229-949d-8068ade54752",
      "user_id": "6bd06efa-bb0b-4f1f-a29f-05db0c4b1bfd",
      "url": "https://spider.cloud",
      "domain": "spider.cloud",
      "created_at": "2024-04-17T01:28:15.016975+00:00",
      "updated_at": "2024-04-17T01:28:15.016975+00:00",
      "proxy": true,
      "headless": true,
      "crawl_budget": null,
      "scheme": "https:",
      "last_checked_at": "2024-04-17T01:28:15.016975+00:00",
      "full_resources": false,
      "metadata": true,
      "gpt_config": null,
      "smart_mode": false,
      "fts": "'spider.cloud':1"
    }
  ],
  "count": 100
}

Pages Metadata Collection

Get the pages metadata/resources stored.

GET https://api.spider.cloud/data/pages_metadata

Request params

  • limit string

    The limit of records to get.

  • url string

    Filter a single url record.

  • page number

    The current page to get.

  • domain string

    Filter a single domain record.

Request
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/pages_metadata?limit=5&return_format=markdown',
  headers=headers)

print(response.json())
Response
{
  "data": [
    {
      "id": "e27a1995-2abe-4319-acd1-3dd8258f0f49",
      "user_id": "253524cd-3f94-4ed1-83b3-f7fab134c3ff",
      "url": "253524cd-3f94-4ed1-83b3-f7fab134c3ff/www.google.com/search?query=spider.cloud.html",
      "domain": "www.google.com",
      "resource_type": "html",
      "title": "spider.cloud - Google Search",
      "description": "",
      "file_size": 1253960,
      "embedding": null,
      "pathname": "/search",
      "created_at": "2024-05-18T17:40:16.4808+00:00",
      "updated_at": "2024-05-18T17:40:16.4808+00:00",
      "keywords": [
        "Fastest Web Crawler spider",
        "Web scraping"
      ],
      "labels": "Search Engine",
      "extracted_data": null,
      "fts": "'/search':1"
    }
  ],
  "count": 100
}

Leads Collection

Get the stored page contacts.

GET https://api.spider.cloud/data/contacts

Request params

  • limit string

    The limit of records to get.

  • url string

    Filter a single url record.

  • page number

    The current page to get.

  • domain string

    Filter a single domain record.

Request
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/contacts?limit=5&return_format=markdown',
  headers=headers)

print(response.json())
Response
{
  "data": [
    {
      "full_name": "John Doe",
      "email": "johndoe@gmail.com",
      "phone": "555-555-555",
      "title": "Baker"
    }
  ],
  "count": 100
}

Crawl State

Get the state of the crawl for the domain.

GET https://api.spider.cloud/data/crawl_state

Request params

  • limit string

    The limit of records to get.

  • url string

    Filter a single url record.

  • page number

    The current page to get.

  • domain string

    Filter a single domain record.

Request
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/crawl_state?limit=5&return_format=markdown',
  headers=headers)

print(response.json())
Response
{
  "data": {
    "id": "195bf2f2-2821-421d-b89c-f27e57ca71fh",
    "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg",
    "domain": "spider.cloud",
    "url": "https://spider.cloud/",
    "links": 1,
    "credits_used": 3,
    "mode": 2,
    "crawl_duration": 340,
    "message": null,
    "request_user_agent": "Spider",
    "level": "info",
    "status_code": 0,
    "created_at": "2024-04-21T01:21:32.886863+00:00",
    "updated_at": "2024-04-21T01:21:32.886863+00:00"
  },
  "error": null
}

Crawl Logs

Get the last 24 hours of logs.

GET https://api.spider.cloud/data/crawl_logs

Request params

  • limit string

    The limit of records to get.

  • url string

    Filter a single url record.

  • page number

    The current page to get.

  • domain string

    Filter a single domain record.

Request
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/crawl_logs?limit=5&return_format=markdown',
  headers=headers)

print(response.json())
Response
{
  "data": {
    "id": "195bf2f2-2821-421d-b89c-f27e57ca71fh",
    "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg",
    "domain": "spider.cloud",
    "url": "https://spider.cloud/",
    "links": 1,
    "credits_used": 3,
    "mode": 2,
    "crawl_duration": 340,
    "message": null,
    "request_user_agent": "Spider",
    "level": "info",
    "status_code": 0,
    "created_at": "2024-04-21T01:21:32.886863+00:00",
    "updated_at": "2024-04-21T01:21:32.886863+00:00"
  },
  "error": null
}

Credits

Get the remaining credits available.

GET https://api.spider.cloud/data/credits

Request
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/credits?limit=5&return_format=markdown',
  headers=headers)

print(response.json())
Response
{
  "data": {
    "id": "8d662167-5a5f-41aa-9cb8-0cbb7d536891",
    "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg",
    "credits": 53334,
    "created_at": "2024-04-21T01:21:32.886863+00:00",
    "updated_at": "2024-04-21T01:21:32.886863+00:00"
  }
}
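
One use for this endpoint is checking the balance before starting a larger job. The sketch below does that with two hypothetical helpers; the sample mirrors the response above, and the job size is made up for illustration.

```python
def remaining_credits(response):
    """Read the credit balance from a /data/credits response."""
    return response["data"]["credits"]


def can_afford(response, credits_needed):
    """True when the balance covers the credits a job is expected to use."""
    return remaining_credits(response) >= credits_needed


# Sample shaped like the response above.
sample = {"data": {"credits": 53334}}

# The reference prices each HTML transformation at 1 credit.
planned_transforms = 1000
print(can_afford(sample, planned_transforms * 1))
```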

Crons

Get the cron jobs that are set to keep data fresh.

GET https://api.spider.cloud/data/crons

Request params

  • limit string

    The limit of records to get.

  • url string

    Filter a single url record.

  • page number

    The current page to get.

  • domain string

    Filter a single domain record.

Request
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/crons?limit=5&return_format=markdown',
  headers=headers)

print(response.json())
Response

User Profile

Get the profile of the user. This returns data such as approved limits and usage for the month.

GET https://api.spider.cloud/data/profile

Request
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/profile?limit=5&return_format=markdown',
  headers=headers)

print(response.json())
Response
{
  "data": {
    "id": "6bd06efa-bb0b-4f1f-a29f-05db0c4b1bfd",
    "email": "user@gmail.com",
    "stripe_id": "cus_OYO2rAhSQaYqHT",
    "is_deleted": null,
    "proxy": null,
    "headless": false,
    "billing_limit": 50,
    "billing_limit_soft": 120,
    "approved_usage": 0,
    "crawl_budget": {
      "*": 200
    },
    "usage": null,
    "has_subscription": false,
    "depth": null,
    "full_resources": false,
    "meta_data": true,
    "billing_allowed": false,
    "initial_promo": false
  }
}

User-Agents

Get a real user agent to use for crawling.

GET https://api.spider.cloud/data/user_agents

Request params

  • limit string

    The limit of records to get.

  • os string

    Filter by operating system, e.g. Android, Mac OS, Windows, Linux, and more.

  • page number

    The current page to get.

  • platform string

    Filter by platform, e.g. Chrome, Edge, Safari, Firefox, and more.

Request
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/user_agents?limit=5&return_format=markdown',
  headers=headers)

print(response.json())
Response
{
  "data": {
    "agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "platform": "Chrome",
    "platform_version": "123.0.0.0",
    "device": "Macintosh",
    "os": "Mac OS",
    "os_version": "10.15.7",
    "cpu_architecture": "",
    "mobile": false,
    "device_type": "desktop"
  }
}

Download file

Download a resource from storage.

GET https://api.spider.cloud/data/download

Request params

  • url string

    The exact path of the url that you want to get.

  • domain string

    The website domain you want to query.

  • pathname string

    The website pathname you want to query.

Request
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/download',
  headers=headers, params={"url": "https://spider.cloud"})

print(response.json())
Response
{
  "data": "<file>"
}
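
A stored resource is addressed through the `url`, `domain`, and `pathname` parameters. The sketch below prepares such a request with the standard library without sending it. It assumes the parameters travel in the query string and that at least one of them must be supplied. Neither assumption is confirmed by this reference, and `build_download_request` is a hypothetical helper name.

```python
import os
from urllib.parse import urlencode
from urllib.request import Request

DOWNLOAD_ENDPOINT = "https://api.spider.cloud/data/download"


def build_download_request(api_key, url=None, domain=None, pathname=None):
    """Prepare (but do not send) a GET request for a stored resource."""
    candidates = {"url": url, "domain": domain, "pathname": pathname}
    params = {key: value for key, value in candidates.items() if value is not None}
    if not params:
        # Assumption: the API needs at least one selector to locate a file.
        raise ValueError("pass at least one of url, domain, or pathname")
    return Request(
        f"{DOWNLOAD_ENDPOINT}?{urlencode(params)}",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


request = build_download_request(
    os.getenv("SPIDER_API_KEY", "YOUR_API_KEY"),
    domain="spider.cloud",
    pathname="/",
)
print(request.get_method(), request.full_url)
```

Passing the prepared object to `urllib.request.urlopen(request)` would perform the actual download against your account.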

Manage

Configure data to enhance crawl efficiency: create, update, and delete records.

Website

Create or update a website by configuration.

POST https://api.spider.cloud/data/websites

Request body

  • url required string

    The URI resource to crawl. This can be a comma-separated list for multiple URLs.

    To reduce latency, enhance performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.

  • limit number

    The maximum number of pages to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.

    It is better to set a limit upfront on websites whose size you do not know. Re-crawling can effectively use cache to keep costs low as new pages are found.

  • store_data boolean

    Decide whether to store data. If enabled, this option overrides storageless. The default setting is false. Storage costs are $0.30 per gigabyte per month.

    Set to true to collect resources to download and reuse later. Please note: storing data incurs a fee of $0.30 per gigabyte per month, or $0.01 daily.

Request
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/data/websites',
  headers=headers, json=json_data)

print(response.json())
Response
{
  "data": null
}
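
The three body fields documented for this endpoint (`url`, `limit`, `store_data`) can be assembled before the call. This is a minimal sketch: `website_payload` is a hypothetical helper, and the validation rules are assumptions drawn from the field descriptions rather than confirmed server behavior.

```python
import json


def website_payload(urls, limit=0, store_data=False):
    """Build a request body for POST /data/websites (hypothetical helper)."""
    if isinstance(urls, str):
        urls = [urls]
    if not urls:
        raise ValueError("url is required")
    if limit < 0:
        raise ValueError("limit must be 0 (crawl all pages) or a positive number")
    return {
        # Multiple URLs are sent as one comma-separated string.
        "url": ",".join(urls),
        "limit": limit,
        "store_data": store_data,
    }


payload = website_payload(
    ["https://spider.cloud", "https://example.com"], limit=5, store_data=True
)
print(json.dumps(payload))
```

The dictionary can be passed as `json=payload` to `requests.post` in place of `json_data`. Keeping `limit` at its default of 0 crawls all pages, so setting an explicit value is the safer choice for websites of unknown size.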

Website

Delete a website from your collection. Remove the url body to delete all websites.

DELETE https://api.spider.cloud/data/websites

Request body

  • url required string

    The URI resource to delete. This can be a comma-separated list for multiple URLs.

    To reduce latency, enhance performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.

Request
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

json_data = {"url":"https://spider.cloud"}

response = requests.delete('https://api.spider.cloud/data/websites',
  headers=headers, json=json_data)

print(response.json())
Response
{
  "data": null
}

Pages

Delete a web page from your collection. Remove the url body to delete all pages.

DELETE https://api.spider.cloud/data/pages

Request body

  • url required string

    The URI resource to delete. This can be a comma-separated list for multiple URLs.

    To reduce latency, enhance performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.

Request
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

json_data = {"url":"https://spider.cloud"}

response = requests.delete('https://api.spider.cloud/data/pages',
  headers=headers, json=json_data)

print(response.json())
Response
{
  "data": null
}

Pages Metadata

Delete a web page's metadata from your collection. Remove the url body to delete all page metadata.

DELETE https://api.spider.cloud/data/pages_metadata

Request body

  • url required string

    The URI resource to delete. This can be a comma-separated list for multiple URLs.

    To reduce latency, enhance performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.

Request
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

json_data = {"url":"https://spider.cloud"}

response = requests.delete('https://api.spider.cloud/data/pages_metadata',
  headers=headers, json=json_data)

print(response.json())
Response
{
  "data": null
}

Leads

Delete a contact or lead from your collection. Remove the url body to delete all contacts.

DELETE https://api.spider.cloud/data/contacts

Request body

  • url required string

    The URI resource to delete. This can be a comma-separated list for multiple URLs.

    To reduce latency, enhance performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.

Request
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

json_data = {"url":"https://spider.cloud"}

response = requests.delete('https://api.spider.cloud/data/contacts',
  headers=headers, json=json_data)

print(response.json())
Response
{
  "data": null
}
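
The four DELETE endpoints under Manage share one shape: a collection path and an optional `url` body. The sketch below prepares such a request with the standard library without sending it. `build_delete_request` and its `delete_all` flag are hypothetical. The flag is a client-side guard added here because omitting `url` is documented to delete every record, and the `application/json` content type is an assumption.

```python
import json
from urllib.request import Request

# Collection names taken from the four DELETE endpoints documented above.
COLLECTIONS = ("websites", "pages", "pages_metadata", "contacts")


def build_delete_request(collection, api_key, url=None, delete_all=False):
    """Prepare (but do not send) a DELETE request for one collection."""
    if collection not in COLLECTIONS:
        raise ValueError(f"unknown collection: {collection}")
    if url is None and not delete_all:
        # Guard against wiping a whole collection by accident.
        raise ValueError("pass url, or set delete_all=True to delete every record")
    # An empty body corresponds to "remove the url body" in the reference.
    body = {"url": url} if url is not None else {}
    return Request(
        f"https://api.spider.cloud/data/{collection}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="DELETE",
    )


request = build_delete_request("pages", "YOUR_API_KEY", url="https://spider.cloud")
print(request.get_method(), request.full_url, request.data.decode("utf-8"))
```

Sending the prepared object with `urllib.request.urlopen(request)` would delete live data, so it is worth printing the request first, as the last line does.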