Spider API Reference

Crawl

Start crawling website(s) to collect resources. You can pass an array of objects for the request body.

POSThttps://api.spider.cloud/crawl

Request body

url required string
The URI resource to crawl. This can be a comma split list for multiple URLs.
To reduce latency, enhance performance, and save on rate limits batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
Test Url
limit number
The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.
It is better to set a limit upfront on websites where you do not know the size. Re-crawling can effectively use cache to keep costs low as new pages are found.
Crawl Limit
store_data boolean
Decide whether to store data. If enabled, this option overrides storageless. The default setting is false. Storage costs $0.30 per GB per month.
Set to true to collect resources to download and re-use later on. *Please note: Storing data incurs fee of $0.30/gigabytes/month or daily $0.01.
lite_mode boolean
Lite mode reduces data transfer costs by 70%, with trade-offs in speed, accuracy, geo-targeting, and reliability. It’s best suited for non-urgent data collection or when targeting websites with minimal anti-bot protections.

request string
The request type to perform. Possible values are http, chrome, and smart. Use smart to perform HTTP request by default until JavaScript rendering is needed for the HTML. Defaults to smart.
The request greatly influences how the output is going to look. If the page is server-side rendered, you can stick to the defaults for the most part.
depth number
The crawl limit for maximum depth. If 0, no limit will be applied. The default is set to 25.
Depth allows you to place a distance between the base URL path and sub paths.
Crawl Depth
metadata boolean
Collect metadata about the content found like page title, description, keywards and etc. This could help improve AI interoperability. Defaults to false.
Using metadata can help extract critical information to use for AI.
session boolean
Persist the session for the client that you use on a website. This allows the HTTP headers and cookies to be set like a real browser session. Defaults to true.
request_timeout number
The timeout to use for request. Timeouts can be from 5-255. The Defaults to 60 seconds.
The timeout of the request helps prevent long request times from hanging.
wait_for object
The wait_for parameter allows you to specify various waiting conditions for a website operation. If provided, it contains the following sub-parameters:
The key idle_network specifies the conditions to wait for the network to be idle. It can include an optional timeout value.
The key selector specifies the conditions to wait for a particular CSS selector to be found on the page. It includes an optional timeout value, and the CSS selector to wait for.
The key dom specifies the conditions to wait for a particular element to stop updating for a duration on the page. It includes an optional timeout value, and the CSS selector to wait for.
The key delay specifies a delay to wait for, with an optional timeout value.
The key page_navigationsset to true then waiting for all page navigations will be handled.
If wait_for is not provided, the default behavior is to wait for the network to be idle for 500 milliseconds. All of the durations are capped at capped at 60 seconds.
The values for the timeout duration are in the object shape { secs: 10, nanos: 0 }.
webhooks object
Use webhooks to get notified on events like credit depleted, new pages, metadata, and website status. { destination: string, on_credits_depleted: bool, on_credits_half_depleted: bool, on_website_status: bool, on_find: bool, on_find_metadata: bool }
user_agent string
Add a custom HTTP user agent to the request. By default this is set to a random agent.
sitemap boolean
Include the sitemap results to crawl. The Defaults to false.
The sitemap allows you to include links that may not be exposed in the HTML.
sitemap_only boolean
Only include the sitemap results to crawl. The Defaults to false.
Using this option allows you to get only the pages on the sitemap without crawling the entire website.
sitemap_path string
The sitemap url to use when using sitemap. The default is sitemap.xml.
The sitemap_path allows you to include the url of a unique sitemap.
subdomains boolean
Allow subdomains to be included. Defaults to false.
tld boolean
Allow TLD's to be included. Defaults to false.
root_selector string
The root CSS query selector to use extracting content from the markup for the response.
Test Query Selector
preserve_host boolean
Preserve the default HOST header for the client. This may help bypass pages that require a HOST, and when the TLS cannot be determined. The default setting is false.
full_resources boolean
Crawl and download all the resources for a website.
Collect all the content from the website, including assets like images, videos, etc.
redirect_policy string
The network redirect policy to use when performing HTTP request. Set the value to Loose to allow all redirects to any domain. Defaults to Loose only allowing redirects on the same domain. Possible values are Loose, Strict, and None.
Loose will only capture the initial page redirect to the resource. You need to include the website in external_domains to allow the request to continue crawling if it is outside of the domain.
external_domains array
A list of external domains to treat as one domain. You can use regex paths to include the domains. Set one of the array values to * to include all domains.
exclude_selector string
A CSS query selector to use for ignoring content from the markup of the response.
concurrency_limit number
Set the concurrency limit to help balance request for slower websites. The default is unlimited.
execution_scripts object
Run custom JavaScript on certain paths. This helps with authenticated pages. The request mode to be made through chrome or smart to run. The values for the timeout duration should be in the object shape "/path_or_url": "custom js".
Custom scripts allow you to take control of the browser with events for up to 60 seconds at a time per page.
disable_intercept boolean
Disable request interception when running request as chrome or smart. This may help if you bypass pages that use third-party scripts or external domains. The default setting is false. Note that cost and speed may increase when disabling this feature, as it removes our native Chrome interception.
block_ads boolean
Block advertisements when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
block_analytics boolean
Block analytics when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
block_stylesheets boolean
Block stylesheets when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
run_in_background boolean
Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard. This param needs storageless set to false or webhooks to be enabled.
chunking_alg object
Use a chunking algorithm to segment your content output. Possible values are ByWords, ByLines, ByCharacterLength, and BySentence. The following is an example of the object shape for "chunking_alg": { "type": "bysentence", "value": 2 } which splits the text into an array by every 2 sentences found. This works well when used with markdown or text formats.
The chunking algorithm allows you to prepare the content for the AI quick, without needing to add extra code or loaders.
budget object
Object that has paths with a counter for limiting the amount of pages example {"*":1} for only crawling the root page. The wildcard matches all routes and you can set child paths preventing a depth level, example of limiting { "/docs/colors": 10, "/docs/": 100 } which only allows a max of 100 pages if the route matches /docs/:pathname and only 10 pages if it matches /docs/colors/:pathname.
The budget explicitly allows you to set paths and limits for the crawl.
Crawl Budget
max_credits_per_page number
Set the maximum number of credits to use per page. Credits are measured in decimal units, where 10,000 credits equal one dollar (100 credits per penny).
event_tracker object
Track the event request and responses sent when using browser rendering.
crawl_timeout object
The crawl_timeout parameter allows you to put a max duration on the entire crawl. The default setting is 2 mins.
The values for the timeout duration are in the object shape { secs: 300, nanos: 0 }.

return_format string | array
The format to return the data in. Possible values are markdown, commonmark, raw, text, xml, bytes, and empty. Use raw to return the default format of the page like HTML etc. Defaults to raw.
Usually you want to use markdown for LLM processing or text. If you need to store the files without losing any encoding, use the bytes or raw format.
readability boolean
Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage. Defaults to false.
This uses the Safari Reader Mode algorithm to extract only important information from the content.
css_extraction_map object
Use CSS or XPath selectors to scrape contents from the web page. Set the paths and the extraction object map to perform extractions per path or page.
Scrape content using CSS selectors to get data. You can scrape using selectors at no extra cost.
clean_html bool
Clean the HTML of unwanted attributes.
return_json_data boolean
Return the json data found in scripts used for SSR. Defaults to false unless you have the website already stored with the configuration enabled.
This is useful for getting JSON ready data for LLM's and getting data from websites built with Next.js etc.
return_headers boolean
Return the HTTP response headers with the results. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the HTTP headers can help setup authentication flows.
return_cookies boolean
Return the HTTP response cookies with the results. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the HTTP cookies can help setup authentication SSR flows.
return_page_links boolean
Return the links found on each page. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the links can help index the reference locations found for the resource.
filter_output_svg bool
Filter the svg tags from the output.
filter_output_images bool
Filter the images from the output.
filter_output_main_only bool
Filter the nav, aside, and footer from the output.
encoding string
The type of encoding to use like UTF-8, SHIFT_JIS, or etc.
Perform the encoding on the server when you know in advance the type of website.
return_embeddings boolean
Include OpenAI embeddings for title and description. The default is false and the param requires metadata to be used in conjunction.
If you are embedding data, you can use these matrices as staples for most vector baseline operations.

proxy 'residential' | 'residential_fast' | 'residential_static' | 'mobile' | 'isp' | 'residential_premium' | 'residential_core' | 'residential_plus'
Select the proxy pool you’d like to use for this request. Leave it blank to disable proxy routing. Supported values are:
- residential – entry-level residential pool.
- residential_fast – high-speed residential pool for heavier throughput.
- residential_static – static residential IPs for long-lived sessions.
- mobile – 4G / 5G mobile proxies for maximum stealth.
- isp (datacenter alias) – ISP-grade / datacenter-style routing with residential ASN.
- residential_premium – lower-latency premium residential pool.
- residential_core – balanced cost/quality residential pool.
- residential_plus – largest, highest-quality residential pool.
Each pool carries a different price multiplier (from ×1.2 for residential up to ×2.0 for mobile, and ×4.0 for residential_premium). See the pricing table for full details. Using this param overrides all other proxy_* shorthand configurations.
remote_proxy string
Use a remote external proxy connection. You also save 70% on data transfer costs when you bring your own proxy.
Use your own proxy to bypass any firewall as needed or connect to private web servers.
cookies string
Add HTTP cookies to use for request.
Set the cookie value for pages that use SSR authentication.
headers object
Forward HTTP headers to use for all request. The object is expected to be a map of key value pairs.
Using HTTP headers can help with authenticated pages that use the authorization header field.
blacklist array
Blacklist a set of paths that you do not want to crawl. You can use Regex patterns to help with the list.
whitelist array
Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns to help with the list.
fingerprint boolean
Use advanced fingerprint detection for chrome. Defaults to true.
Set this value to help crawl when websites require a fingerprint.
anti_bot boolean
Enable anti-bot mode using various techniques to increase the chance of success. Defaults to true.
This config will attempt to make request resemble a real human. If the request fails on chrome it will retry using a virtual display for the request that is slower difficult to block at the cost of speed.
stealth boolean
Use stealth mode for headless chrome request to help prevent being blocked. This defaults to true.
Set to true to almost guarantee not being detected by anything.
proxy_datacenter boolean
Enable large pool of datacenter proxies. Defaults to false.
proxy_enabled boolean
Enable premium high performance proxies to prevent detection and increase speed. Defaults to false.
Using this configuration can help when network requests are blocked. This setup increases the cost for file_cost and bytes_transferred_cost, but only by 1.5 times.
proxy_lightning boolean
Enable high performance residential proxies. Defaults to false.
proxy_mobile boolean
Enable mobile proxies for undetectable crawling at the network layer. Defaults to false. The credits cost for file_cost and bytes_transferred_cost is doubled when set to true.

scroll number
Infinite scroll the page as new content loads, up to a duration in milliseconds. The duration represents the maximum time you would wait to scroll. You may still need to use the wait_for parameters. You also need to ensure that the request is made using chrome.
Use the wait_for configuration to scroll until and disable_intercept to make sure you get data from the network regardless of hostname.
device object
Configure the device for chrome. One of mobile, tablet, or desktop. Defaults to desktop.
viewport object
Configure the viewport for chrome. Defaults to 800x600.
If you need to get data from a website as a mobile, set the viewport to a phone device's size ex: 375x414.
automation_scripts object
Run custom web automated tasks on certain paths. The request mode to be made through chrome or smart to run.
Custom web automation allows you to take control of the browser with events for up to 60 seconds at a time per page.
Below are the available actions for web automation:
Evaluate: Runs custom JavaScript code.
{ "Evaluate": "console.log('Hello, World!');" }
Click: Clicks on an element identified by a CSS selector.
{ "Click": "button#submit" }
Wait: Waits for a specified duration in milliseconds.
{ "Wait": 2000 }
WaitForNavigation: Waits for the next navigation event.
{ "WaitForNavigation": true }
WaitFor: Waits for an element to appear identified by a CSS selector.
{ "WaitFor": "div#content" }
WaitForAndClick: Waits for an element to appear and then clicks on it, identified by a CSS selector.
{ "WaitForAndClick": "button#loadMore" }
ScrollX: Scrolls the screen horizontally by a specified number of pixels.
{ "ScrollX": 100 }
ScrollY: Scrolls the screen vertically by a specified number of pixels.
{ "ScrollY": 200 }
Fill: Fills an input element with a specified value.
{ "Fill": { "selector": "input#name", "value": "John Doe" } }
ValidateChain: Set this before a step to validate the prior action to break out of the chain.
{ "ValidateChain": true }
InfiniteScroll: Scrolls the page until the end for certain duration.
{ "InfiniteScroll": 3000 }
gpt_config object
Use AI to generate actions to perform during the crawl. You can pass an array for the prompt to chain steps.
Perform event-driven browser actions on a web page or extract data using the gpt_config, you need to use your own API key or have at least $100 in credits to use AI.
evaluate_on_new_document string
Set a custom script to evaluate on new document creation.

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/crawl', 
  headers=headers, json=json_data)

print(response.json())

Response

[
  {
    "content": "<resource>...",
    "error": null,
    "status": 200,
    "costs": {
      "ai_cost": 0,
      "compute_cost": 0.00001,
      "file_cost": 0.00002,
      "bytes_transferred_cost": 0.00002,
      "total_cost": 0.00004,
      "transform_cost": 0.0001
    },
    "url": "https://spider.cloud"
  },
  // more content...
]

Scrape

Start scraping a single page on website(s) to collect resources. You can pass an array of objects for the request body.

POSThttps://api.spider.cloud/scrape

Request body

url required string
The URI resource to crawl. This can be a comma split list for multiple URLs.
To reduce latency, enhance performance, and save on rate limits batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
Test Url
store_data boolean
Decide whether to store data. If enabled, this option overrides storageless. The default setting is false. Storage costs $0.30 per GB per month.
Set to true to collect resources to download and re-use later on. *Please note: Storing data incurs fee of $0.30/gigabytes/month or daily $0.01.
lite_mode boolean
Lite mode reduces data transfer costs by 70%, with trade-offs in speed, accuracy, geo-targeting, and reliability. It’s best suited for non-urgent data collection or when targeting websites with minimal anti-bot protections.

request string
The request type to perform. Possible values are http, chrome, and smart. Use smart to perform HTTP request by default until JavaScript rendering is needed for the HTML. Defaults to smart.
The request greatly influences how the output is going to look. If the page is server-side rendered, you can stick to the defaults for the most part.
metadata boolean
Collect metadata about the content found like page title, description, keywards and etc. This could help improve AI interoperability. Defaults to false.
Using metadata can help extract critical information to use for AI.
session boolean
Persist the session for the client that you use on a website. This allows the HTTP headers and cookies to be set like a real browser session. Defaults to true.
request_timeout number
The timeout to use for request. Timeouts can be from 5-255. The Defaults to 60 seconds.
The timeout of the request helps prevent long request times from hanging.
wait_for object
The wait_for parameter allows you to specify various waiting conditions for a website operation. If provided, it contains the following sub-parameters:
The key idle_network specifies the conditions to wait for the network to be idle. It can include an optional timeout value.
The key selector specifies the conditions to wait for a particular CSS selector to be found on the page. It includes an optional timeout value, and the CSS selector to wait for.
The key dom specifies the conditions to wait for a particular element to stop updating for a duration on the page. It includes an optional timeout value, and the CSS selector to wait for.
The key delay specifies a delay to wait for, with an optional timeout value.
The key page_navigationsset to true then waiting for all page navigations will be handled.
If wait_for is not provided, the default behavior is to wait for the network to be idle for 500 milliseconds. All of the durations are capped at capped at 60 seconds.
The values for the timeout duration are in the object shape { secs: 10, nanos: 0 }.
webhooks object
Use webhooks to get notified on events like credit depleted, new pages, metadata, and website status. { destination: string, on_credits_depleted: bool, on_credits_half_depleted: bool, on_website_status: bool, on_find: bool, on_find_metadata: bool }
user_agent string
Add a custom HTTP user agent to the request. By default this is set to a random agent.
sitemap boolean
Include the sitemap results to crawl. The Defaults to false.
The sitemap allows you to include links that may not be exposed in the HTML.
sitemap_only boolean
Only include the sitemap results to crawl. The Defaults to false.
Using this option allows you to get only the pages on the sitemap without crawling the entire website.
sitemap_path string
The sitemap url to use when using sitemap. The default is sitemap.xml.
The sitemap_path allows you to include the url of a unique sitemap.
subdomains boolean
Allow subdomains to be included. Defaults to false.
tld boolean
Allow TLD's to be included. Defaults to false.
root_selector string
The root CSS query selector to use extracting content from the markup for the response.
Test Query Selector
preserve_host boolean
Preserve the default HOST header for the client. This may help bypass pages that require a HOST, and when the TLS cannot be determined. The default setting is false.
full_resources boolean
Crawl and download all the resources for a website.
Collect all the content from the website, including assets like images, videos, etc.
redirect_policy string
The network redirect policy to use when performing HTTP request. Set the value to Loose to allow all redirects to any domain. Defaults to Loose only allowing redirects on the same domain. Possible values are Loose, Strict, and None.
Loose will only capture the initial page redirect to the resource. You need to include the website in external_domains to allow the request to continue crawling if it is outside of the domain.
external_domains array
A list of external domains to treat as one domain. You can use regex paths to include the domains. Set one of the array values to * to include all domains.
exclude_selector string
A CSS query selector to use for ignoring content from the markup of the response.
concurrency_limit number
Set the concurrency limit to help balance request for slower websites. The default is unlimited.
execution_scripts object
Run custom JavaScript on certain paths. This helps with authenticated pages. The request mode to be made through chrome or smart to run. The values for the timeout duration should be in the object shape "/path_or_url": "custom js".
Custom scripts allow you to take control of the browser with events for up to 60 seconds at a time per page.
disable_intercept boolean
Disable request interception when running request as chrome or smart. This may help if you bypass pages that use third-party scripts or external domains. The default setting is false. Note that cost and speed may increase when disabling this feature, as it removes our native Chrome interception.
block_ads boolean
Block advertisements when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
block_analytics boolean
Block analytics when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
block_stylesheets boolean
Block stylesheets when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
run_in_background boolean
Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard. This param needs storageless set to false or webhooks to be enabled.
chunking_alg object
Use a chunking algorithm to segment your content output. Possible values are ByWords, ByLines, ByCharacterLength, and BySentence. The following is an example of the object shape for "chunking_alg": { "type": "bysentence", "value": 2 } which splits the text into an array by every 2 sentences found. This works well when used with markdown or text formats.
The chunking algorithm allows you to prepare the content for the AI quick, without needing to add extra code or loaders.
budget object
Object that has paths with a counter for limiting the amount of pages example {"*":1} for only crawling the root page. The wildcard matches all routes and you can set child paths preventing a depth level, example of limiting { "/docs/colors": 10, "/docs/": 100 } which only allows a max of 100 pages if the route matches /docs/:pathname and only 10 pages if it matches /docs/colors/:pathname.
The budget explicitly allows you to set paths and limits for the crawl.
Crawl Budget
max_credits_per_page number
Set the maximum number of credits to use per page. Credits are measured in decimal units, where 10,000 credits equal one dollar (100 credits per penny).
event_tracker object
Track the event request and responses sent when using browser rendering.
crawl_timeout object
The crawl_timeout parameter allows you to put a max duration on the entire crawl. The default setting is 2 mins.
The values for the timeout duration are in the object shape { secs: 300, nanos: 0 }.
full_page boolean
Take a screenshot of the full page. Defaults to true.

return_format string | array
The format to return the data in. Possible values are markdown, commonmark, raw, text, xml, bytes, and empty. Use raw to return the default format of the page like HTML etc. Defaults to raw.
Usually you want to use markdown for LLM processing or text. If you need to store the files without losing any encoding, use the bytes or raw format.
readability boolean
Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage. Defaults to false.
This uses the Safari Reader Mode algorithm to extract only important information from the content.
css_extraction_map object
Use CSS or XPath selectors to scrape contents from the web page. Set the paths and the extraction object map to perform extractions per path or page.
Scrape content using CSS selectors to get data. You can scrape using selectors at no extra cost.
clean_html bool
Clean the HTML of unwanted attributes.
return_json_data boolean
Return the json data found in scripts used for SSR. Defaults to false unless you have the website already stored with the configuration enabled.
This is useful for getting JSON ready data for LLM's and getting data from websites built with Next.js etc.
return_headers boolean
Return the HTTP response headers with the results. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the HTTP headers can help setup authentication flows.
return_cookies boolean
Return the HTTP response cookies with the results. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the HTTP cookies can help setup authentication SSR flows.
return_page_links boolean
Return the links found on each page. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the links can help index the reference locations found for the resource.
filter_output_svg bool
Filter the svg tags from the output.
filter_output_images bool
Filter the images from the output.
filter_output_main_only bool
Filter the nav, aside, and footer from the output.
encoding string
The type of encoding to use like UTF-8, SHIFT_JIS, or etc.
Perform the encoding on the server when you know in advance the type of website.
return_embeddings boolean
Include OpenAI embeddings for title and description. The default is false and the param requires metadata to be used in conjunction.
If you are embedding data, you can use these matrices as staples for most vector baseline operations.
binary boolean
Return the image as binary instead of base64.
cdp_params object
The settings to use to adjust clip, format, quality, and more. Defaults to null.

proxy 'residential' | 'residential_fast' | 'residential_static' | 'mobile' | 'isp' | 'residential_premium' | 'residential_core' | 'residential_plus'
Select the proxy pool you’d like to use for this request. Leave it blank to disable proxy routing. Supported values are:
- residential – entry-level residential pool.
- residential_fast – high-speed residential pool for heavier throughput.
- residential_static – static residential IPs for long-lived sessions.
- mobile – 4G / 5G mobile proxies for maximum stealth.
- isp (datacenter alias) – ISP-grade / datacenter-style routing with residential ASN.
- residential_premium – lower-latency premium residential pool.
- residential_core – balanced cost/quality residential pool.
- residential_plus – largest, highest-quality residential pool.
Each pool carries a different price multiplier (from ×1.2 for residential up to ×2.0 for mobile, and ×4.0 for residential_premium). See the pricing table for full details. Using this param overrides all other proxy_* shorthand configurations.
remote_proxy string
Use a remote external proxy connection. You also save 70% on data transfer costs when you bring your own proxy.
Use your own proxy to bypass any firewall as needed or connect to private web servers.
cookies string
Add HTTP cookies to use for request.
Set the cookie value for pages that use SSR authentication.
headers object
Forward HTTP headers to use for all request. The object is expected to be a map of key value pairs.
Using HTTP headers can help with authenticated pages that use the authorization header field.
blacklist array
Blacklist a set of paths that you do not want to crawl. You can use Regex patterns to help with the list.
whitelist array
Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns to help with the list.
fingerprint boolean
Use advanced fingerprint detection for chrome. Defaults to true.
Set this value to help crawl when websites require a fingerprint.
anti_bot boolean
Enable anti-bot mode using various techniques to increase the chance of success. Defaults to true.
This config will attempt to make request resemble a real human. If the request fails on chrome it will retry using a virtual display for the request that is slower difficult to block at the cost of speed.
stealth boolean
Use stealth mode for headless chrome request to help prevent being blocked. This defaults to true.
Set to true to almost guarantee not being detected by anything.
proxy_datacenter boolean
Enable large pool of datacenter proxies. Defaults to false.
proxy_enabled boolean
Enable premium high performance proxies to prevent detection and increase speed. Defaults to false.
Using this configuration can help when network requests are blocked. This setup increases the cost for file_cost and bytes_transferred_cost, but only by 1.5 times.
proxy_lightning boolean
Enable high performance residential proxies. Defaults to false.
proxy_mobile boolean
Enable mobile proxies for undetectable crawling at the network layer. Defaults to false. The credits cost for file_cost and bytes_transferred_cost is doubled when set to true.

scroll number
Infinite scroll the page as new content loads, up to a duration in milliseconds. The duration represents the maximum time you would wait to scroll. You may still need to use the wait_for parameters. You also need to ensure that the request is made using chrome.
Use the wait_for configuration to scroll until and disable_intercept to make sure you get data from the network regardless of hostname.
device object
Configure the device for chrome. One of mobile, tablet, or desktop. Defaults to desktop.
viewport object
Configure the viewport for chrome. Defaults to 800x600.
If you need to get data from a website as a mobile, set the viewport to a phone device's size ex: 375x414.
automation_scripts object
Run custom web automated tasks on certain paths. The request mode to be made through chrome or smart to run.
Custom web automation allows you to take control of the browser with events for up to 60 seconds at a time per page.
Below are the available actions for web automation:
Evaluate: Runs custom JavaScript code.
{ "Evaluate": "console.log('Hello, World!');" }
Click: Clicks on an element identified by a CSS selector.
{ "Click": "button#submit" }
Wait: Waits for a specified duration in milliseconds.
{ "Wait": 2000 }
WaitForNavigation: Waits for the next navigation event.
{ "WaitForNavigation": true }
WaitFor: Waits for an element to appear identified by a CSS selector.
{ "WaitFor": "div#content" }
WaitForAndClick: Waits for an element to appear and then clicks on it, identified by a CSS selector.
{ "WaitForAndClick": "button#loadMore" }
ScrollX: Scrolls the screen horizontally by a specified number of pixels.
{ "ScrollX": 100 }
ScrollY: Scrolls the screen vertically by a specified number of pixels.
{ "ScrollY": 200 }
Fill: Fills an input element with a specified value.
{ "Fill": { "selector": "input#name", "value": "John Doe" } }
ValidateChain: Set this before a step to validate the prior action to break out of the chain.
{ "ValidateChain": true }
InfiniteScroll: Scrolls the page until the end for certain duration.
{ "InfiniteScroll": 3000 }
gpt_config object
Use AI to generate actions to perform during the crawl. You can pass an array for the prompt to chain steps.
Perform event-driven browser actions on a web page or extract data using the gpt_config, you need to use your own API key or have at least $100 in credits to use AI.
evaluate_on_new_document string
Set a custom script to evaluate on new document creation.

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

json_data = {"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/scrape', 
  headers=headers, json=json_data)

print(response.json())

Response

[
  {
    "content": "<resource>...",
    "error": null,
    "status": 200,
    "costs": {
      "ai_cost": 0,
      "compute_cost": 0.00001,
      "file_cost": 0.00002,
      "bytes_transferred_cost": 0.00002,
      "total_cost": 0.00004,
      "transform_cost": 0.0001
    },
    "url": "https://spider.cloud"
  },
  // more content...
]

Search

Perform a Google search to gather a list of websites for crawling and resource collection, including fallback options if the query yields no results. You can pass an array of objects for the request body.

POSThttps://api.spider.cloud/search

Request body

search required string
The search query you want to search for.
Search
limit number
The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.
It is better to set a limit upfront on websites where you do not know the size. Re-crawling can effectively use cache to keep costs low as new pages are found.
Crawl Limit
quick_search boolean
Prioritize speed over output quantity.
store_data boolean
Decide whether to store data. If enabled, this option overrides storageless. The default setting is false. Storage costs $0.30 per GB per month.
Set to true to collect resources to download and re-use later on. *Please note: Storing data incurs fee of $0.30/gigabytes/month or daily $0.01.
lite_mode boolean
Lite mode reduces data transfer costs by 70%, with trade-offs in speed, accuracy, geo-targeting, and reliability. It’s best suited for non-urgent data collection or when targeting websites with minimal anti-bot protections.

search_limit number
The limit amount of urls to fetch or crawl from the search results. Remove the value or set it to 0 to crawl all URLs from the realtime search results. This is a shorthand if you do not want to use num.
fetch_page_content boolean
Fetch all the content of the websites by performing crawls. If this is disabled, only the search results are returned instead with the meta title and description. Defaults to false.
request string
The request type to perform. Possible values are http, chrome, and smart. Use smart to perform HTTP request by default until JavaScript rendering is needed for the HTML. Defaults to smart.
The request greatly influences how the output is going to look. If the page is server-side rendered, you can stick to the defaults for the most part.
depth number
The crawl limit for maximum depth. If 0, no limit will be applied. The default is set to 25.
Depth allows you to place a distance between the base URL path and sub paths.
Crawl Depth
metadata boolean
Collect metadata about the content found like page title, description, keywards and etc. This could help improve AI interoperability. Defaults to false.
Using metadata can help extract critical information to use for AI.
session boolean
Persist the session for the client that you use on a website. This allows the HTTP headers and cookies to be set like a real browser session. Defaults to true.
request_timeout number
The timeout to use for request. Timeouts can be from 5-255. The Defaults to 60 seconds.
The timeout of the request helps prevent long request times from hanging.
wait_for object
The wait_for parameter allows you to specify various waiting conditions for a website operation. If provided, it contains the following sub-parameters:
The key idle_network specifies the conditions to wait for the network to be idle. It can include an optional timeout value.
The key selector specifies the conditions to wait for a particular CSS selector to be found on the page. It includes an optional timeout value, and the CSS selector to wait for.
The key dom specifies the conditions to wait for a particular element to stop updating for a duration on the page. It includes an optional timeout value, and the CSS selector to wait for.
The key delay specifies a delay to wait for, with an optional timeout value.
The key page_navigationsset to true then waiting for all page navigations will be handled.
If wait_for is not provided, the default behavior is to wait for the network to be idle for 500 milliseconds. All of the durations are capped at capped at 60 seconds.
The values for the timeout duration are in the object shape { secs: 10, nanos: 0 }.
webhooks object
Use webhooks to get notified on events like credit depleted, new pages, metadata, and website status. { destination: string, on_credits_depleted: bool, on_credits_half_depleted: bool, on_website_status: bool, on_find: bool, on_find_metadata: bool }
user_agent string
Add a custom HTTP user agent to the request. By default this is set to a random agent.
sitemap boolean
Include the sitemap results to crawl. The Defaults to false.
The sitemap allows you to include links that may not be exposed in the HTML.
sitemap_only boolean
Only include the sitemap results to crawl. The Defaults to false.
Using this option allows you to get only the pages on the sitemap without crawling the entire website.
sitemap_path string
The sitemap url to use when using sitemap. The default is sitemap.xml.
The sitemap_path allows you to include the url of a unique sitemap.
subdomains boolean
Allow subdomains to be included. Defaults to false.
tld boolean
Allow TLD's to be included. Defaults to false.
root_selector string
The root CSS query selector to use extracting content from the markup for the response.
Test Query Selector
preserve_host boolean
Preserve the default HOST header for the client. This may help bypass pages that require a HOST, and when the TLS cannot be determined. The default setting is false.
full_resources boolean
Crawl and download all the resources for a website.
Collect all the content from the website, including assets like images, videos, etc.
redirect_policy string
The network redirect policy to use when performing HTTP request. Set the value to Loose to allow all redirects to any domain. Defaults to Loose only allowing redirects on the same domain. Possible values are Loose, Strict, and None.
Loose will only capture the initial page redirect to the resource. You need to include the website in external_domains to allow the request to continue crawling if it is outside of the domain.
external_domains array
A list of external domains to treat as one domain. You can use regex paths to include the domains. Set one of the array values to * to include all domains.
exclude_selector string
A CSS query selector to use for ignoring content from the markup of the response.
concurrency_limit number
Set the concurrency limit to help balance request for slower websites. The default is unlimited.
execution_scripts object
Run custom JavaScript on certain paths. This helps with authenticated pages. The request mode to be made through chrome or smart to run. The values for the timeout duration should be in the object shape "/path_or_url": "custom js".
Custom scripts allow you to take control of the browser with events for up to 60 seconds at a time per page.
disable_intercept boolean
Disable request interception when running request as chrome or smart. This may help if you bypass pages that use third-party scripts or external domains. The default setting is false. Note that cost and speed may increase when disabling this feature, as it removes our native Chrome interception.
block_ads boolean
Block advertisements when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
block_analytics boolean
Block analytics when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
block_stylesheets boolean
Block stylesheets when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
run_in_background boolean
Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard. This param needs storageless set to false or webhooks to be enabled.
chunking_alg object
Use a chunking algorithm to segment your content output. Possible values are ByWords, ByLines, ByCharacterLength, and BySentence. The following is an example of the object shape for "chunking_alg": { "type": "bysentence", "value": 2 } which splits the text into an array by every 2 sentences found. This works well when used with markdown or text formats.
The chunking algorithm allows you to prepare the content for the AI quick, without needing to add extra code or loaders.
budget object
Object that has paths with a counter for limiting the amount of pages example {"*":1} for only crawling the root page. The wildcard matches all routes and you can set child paths preventing a depth level, example of limiting { "/docs/colors": 10, "/docs/": 100 } which only allows a max of 100 pages if the route matches /docs/:pathname and only 10 pages if it matches /docs/colors/:pathname.
The budget explicitly allows you to set paths and limits for the crawl.
Crawl Budget
max_credits_per_page number
Set the maximum number of credits to use per page. Credits are measured in decimal units, where 10,000 credits equal one dollar (100 credits per penny).
event_tracker object
Track the event request and responses sent when using browser rendering.
auto_pagination boolean
Automatically paginates to fetch the exact number of desired results, as specified by the num parameter. Note that credit usage may increase, and response time may be slower when retrieving larger result sets.
crawl_timeout object
The crawl_timeout parameter allows you to put a max duration on the entire crawl. The default setting is 2 mins.
The values for the timeout duration are in the object shape { secs: 300, nanos: 0 }.
num number
The maximum number of results to return for the search.
page number
The page number for the search results.
tbs 'qdr:h' | 'qdr:d' | 'qdr:w' | 'qdr:m' | 'qdr:y'
Restrict results to a specific time range. Common options:qdr:h (past hour), qdr:d (past 24 hours), qdr:w (past week),qdr:m (past month), qdr:y (past year).

return_format string | array
The format to return the data in. Possible values are markdown, commonmark, raw, text, xml, bytes, and empty. Use raw to return the default format of the page like HTML etc. Defaults to raw.
Usually you want to use markdown for LLM processing or text. If you need to store the files without losing any encoding, use the bytes or raw format.
readability boolean
Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage. Defaults to false.
This uses the Safari Reader Mode algorithm to extract only important information from the content.
css_extraction_map object
Use CSS or XPath selectors to scrape contents from the web page. Set the paths and the extraction object map to perform extractions per path or page.
Scrape content using CSS selectors to get data. You can scrape using selectors at no extra cost.
clean_html bool
Clean the HTML of unwanted attributes.
return_json_data boolean
Return the json data found in scripts used for SSR. Defaults to false unless you have the website already stored with the configuration enabled.
This is useful for getting JSON ready data for LLM's and getting data from websites built with Next.js etc.
return_headers boolean
Return the HTTP response headers with the results. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the HTTP headers can help setup authentication flows.
return_cookies boolean
Return the HTTP response cookies with the results. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the HTTP cookies can help setup authentication SSR flows.
return_page_links boolean
Return the links found on each page. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the links can help index the reference locations found for the resource.
filter_output_svg bool
Filter the svg tags from the output.
filter_output_images bool
Filter the images from the output.
filter_output_main_only bool
Filter the nav, aside, and footer from the output.
encoding string
The type of encoding to use like UTF-8, SHIFT_JIS, or etc.
Perform the encoding on the server when you know in advance the type of website.
return_embeddings boolean
Include OpenAI embeddings for title and description. The default is false and the param requires metadata to be used in conjunction.
If you are embedding data, you can use these matrices as staples for most vector baseline operations.

proxy 'residential' | 'residential_fast' | 'residential_static' | 'mobile' | 'isp' | 'residential_premium' | 'residential_core' | 'residential_plus'
Select the proxy pool you’d like to use for this request. Leave it blank to disable proxy routing. Supported values are:
- residential – entry-level residential pool.
- residential_fast – high-speed residential pool for heavier throughput.
- residential_static – static residential IPs for long-lived sessions.
- mobile – 4G / 5G mobile proxies for maximum stealth.
- isp (datacenter alias) – ISP-grade / datacenter-style routing with residential ASN.
- residential_premium – lower-latency premium residential pool.
- residential_core – balanced cost/quality residential pool.
- residential_plus – largest, highest-quality residential pool.
Each pool carries a different price multiplier (from ×1.2 for residential up to ×2.0 for mobile, and ×4.0 for residential_premium). See the pricing table for full details. Using this param overrides all other proxy_* shorthand configurations.
remote_proxy string
Use a remote external proxy connection. You also save 70% on data transfer costs when you bring your own proxy.
Use your own proxy to bypass any firewall as needed or connect to private web servers.
cookies string
Add HTTP cookies to use for request.
Set the cookie value for pages that use SSR authentication.
headers object
Forward HTTP headers to use for all request. The object is expected to be a map of key value pairs.
Using HTTP headers can help with authenticated pages that use the authorization header field.
blacklist array
Blacklist a set of paths that you do not want to crawl. You can use Regex patterns to help with the list.
whitelist array
Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns to help with the list.
fingerprint boolean
Use advanced fingerprint detection for chrome. Defaults to true.
Set this value to help crawl when websites require a fingerprint.
anti_bot boolean
Enable anti-bot mode using various techniques to increase the chance of success. Defaults to true.
This config will attempt to make request resemble a real human. If the request fails on chrome it will retry using a virtual display for the request that is slower difficult to block at the cost of speed.
stealth boolean
Use stealth mode for headless chrome request to help prevent being blocked. This defaults to true.
Set to true to almost guarantee not being detected by anything.
proxy_datacenter boolean
Enable large pool of datacenter proxies. Defaults to false.
proxy_enabled boolean
Enable premium high performance proxies to prevent detection and increase speed. Defaults to false.
Using this configuration can help when network requests are blocked. This setup increases the cost for file_cost and bytes_transferred_cost, but only by 1.5 times.
proxy_lightning boolean
Enable high performance residential proxies. Defaults to false.
proxy_mobile boolean
Enable mobile proxies for undetectable crawling at the network layer. Defaults to false. The credits cost for file_cost and bytes_transferred_cost is doubled when set to true.

scroll number
Infinite scroll the page as new content loads, up to a duration in milliseconds. The duration represents the maximum time you would wait to scroll. You may still need to use the wait_for parameters. You also need to ensure that the request is made using chrome.
Use the wait_for configuration to scroll until and disable_intercept to make sure you get data from the network regardless of hostname.
device object
Configure the device for chrome. One of mobile, tablet, or desktop. Defaults to desktop.
viewport object
Configure the viewport for chrome. Defaults to 800x600.
If you need to get data from a website as a mobile, set the viewport to a phone device's size ex: 375x414.
automation_scripts object
Run custom web automated tasks on certain paths. The request mode to be made through chrome or smart to run.
Custom web automation allows you to take control of the browser with events for up to 60 seconds at a time per page.
Below are the available actions for web automation:
Evaluate: Runs custom JavaScript code.
{ "Evaluate": "console.log('Hello, World!');" }
Click: Clicks on an element identified by a CSS selector.
{ "Click": "button#submit" }
Wait: Waits for a specified duration in milliseconds.
{ "Wait": 2000 }
WaitForNavigation: Waits for the next navigation event.
{ "WaitForNavigation": true }
WaitFor: Waits for an element to appear identified by a CSS selector.
{ "WaitFor": "div#content" }
WaitForAndClick: Waits for an element to appear and then clicks on it, identified by a CSS selector.
{ "WaitForAndClick": "button#loadMore" }
ScrollX: Scrolls the screen horizontally by a specified number of pixels.
{ "ScrollX": 100 }
ScrollY: Scrolls the screen vertically by a specified number of pixels.
{ "ScrollY": 200 }
Fill: Fills an input element with a specified value.
{ "Fill": { "selector": "input#name", "value": "John Doe" } }
ValidateChain: Set this before a step to validate the prior action to break out of the chain.
{ "ValidateChain": true }
InfiniteScroll: Scrolls the page until the end for certain duration.
{ "InfiniteScroll": 3000 }
gpt_config object
Use AI to generate actions to perform during the crawl. You can pass an array for the prompt to chain steps.
Perform event-driven browser actions on a web page or extract data using the gpt_config, you need to use your own API key or have at least $100 in credits to use AI.
evaluate_on_new_document string
Set a custom script to evaluate on new document creation.

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

json_data = {"search":"sports news today","search_limit":3,"limit":5,"return_format":"markdown"}

response = requests.post('https://api.spider.cloud/search', 
  headers=headers, json=json_data)

print(response.json())

Response

{
  "content": [
      {
          "description": "Visit ESPN for live scores, highlights and sports news. Stream exclusive games on ESPN+ and play fantasy sports.",
          "title": "ESPN - Serving Sports Fans. Anytime. Anywhere.",
          "url": "https://www.espn.com/"
      },
      {
          "description": "Sports Illustrated, SI.com provides sports news, expert analysis, highlights, stats and scores for the NFL, NBA, MLB, NHL, college football, soccer,&nbsp;...",
          "title": "Sports Illustrated",
          "url": "https://www.si.com/"
      },
      {
          "description": "CBS Sports features live scoring, news, stats, and player info for NFL football, MLB baseball, NBA basketball, NHL hockey, college basketball and football.",
          "title": "CBS Sports - News, Live Scores, Schedules, Fantasy ...",
          "url": "https://www.cbssports.com/"
      },
      {
          "description": "Sport is a form of physical activity or game. Often competitive and organized, sports use, maintain, or improve physical ability and skills.",
          "title": "Sport",
          "url": "https://en.wikipedia.org/wiki/Sport"
      },
      {
          "description": "Watch FOX Sports and view live scores, odds, team news, player news, streams, videos, stats, standings &amp; schedules covering NFL, MLB, NASCAR, WWE, NBA, NHL,&nbsp;...",
          "title": "FOX Sports News, Scores, Schedules, Odds, Shows, Streams ...",
          "url": "https://www.foxsports.com/"
      },
      {
          "description": "Founded in 1974 by tennis legend, Billie Jean King, the Women's Sports Foundation is dedicated to creating leaders by providing girls access to sports.",
          "title": "Women's Sports Foundation: Home",
          "url": "https://www.womenssportsfoundation.org/"
      },
      {
          "description": "List of sports · Running. Marathon · Sprint · Mascot race · Airsoft · Laser tag · Paintball · Bobsleigh · Jack jumping · Luge · Shovel racing · Card stacking&nbsp;...",
          "title": "List of sports",
          "url": "https://en.wikipedia.org/wiki/List_of_sports"
      },
      {
          "description": "Stay up-to-date with the latest sports news and scores from NBC Sports.",
          "title": "NBC Sports - news, scores, stats, rumors, videos, and more",
          "url": "https://www.nbcsports.com/"
      },
      {
          "description": "r/sports: Sports News and Highlights from the NFL, NBA, NHL, MLB, MLS, and leagues around the world.",
          "title": "r/sports",
          "url": "https://www.reddit.com/r/sports/"
      },
      {
          "description": "The A-Z of sports covered by the BBC Sport team. Find all the latest live sports coverage, breaking news, results, scores, fixtures, tables,&nbsp;...",
          "title": "AZ Sport",
          "url": "https://www.bbc.com/sport/all-sports"
      }
  ]
}

Links

Start crawling a website(s) to collect links found. You can pass an array of objects for the request body. This endpoint can save on latency if you only need to index the content URLs.

POSThttps://api.spider.cloud/links

Request body

url required string
The URI resource to crawl. This can be a comma split list for multiple URLs.
To reduce latency, enhance performance, and save on rate limits batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
Test Url
limit number
The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.
It is better to set a limit upfront on websites where you do not know the size. Re-crawling can effectively use cache to keep costs low as new pages are found.
Crawl Limit
store_data boolean
Decide whether to store data. If enabled, this option overrides storageless. The default setting is false. Storage costs $0.30 per GB per month.
Set to true to collect resources to download and re-use later on. *Please note: Storing data incurs fee of $0.30/gigabytes/month or daily $0.01.
lite_mode boolean
Lite mode reduces data transfer costs by 70%, with trade-offs in speed, accuracy, geo-targeting, and reliability. It’s best suited for non-urgent data collection or when targeting websites with minimal anti-bot protections.

request string
The request type to perform. Possible values are http, chrome, and smart. Use smart to perform HTTP request by default until JavaScript rendering is needed for the HTML. Defaults to smart.
The request greatly influences how the output is going to look. If the page is server-side rendered, you can stick to the defaults for the most part.
depth number
The crawl limit for maximum depth. If 0, no limit will be applied. The default is set to 25.
Depth allows you to place a distance between the base URL path and sub paths.
Crawl Depth
metadata boolean
Collect metadata about the content found like page title, description, keywards and etc. This could help improve AI interoperability. Defaults to false.
Using metadata can help extract critical information to use for AI.
session boolean
Persist the session for the client that you use on a website. This allows the HTTP headers and cookies to be set like a real browser session. Defaults to true.
request_timeout number
The timeout to use for request. Timeouts can be from 5-255. The Defaults to 60 seconds.
The timeout of the request helps prevent long request times from hanging.
wait_for object
The wait_for parameter allows you to specify various waiting conditions for a website operation. If provided, it contains the following sub-parameters:
The key idle_network specifies the conditions to wait for the network to be idle. It can include an optional timeout value.
The key selector specifies the conditions to wait for a particular CSS selector to be found on the page. It includes an optional timeout value, and the CSS selector to wait for.
The key dom specifies the conditions to wait for a particular element to stop updating for a duration on the page. It includes an optional timeout value, and the CSS selector to wait for.
The key delay specifies a delay to wait for, with an optional timeout value.
The key page_navigationsset to true then waiting for all page navigations will be handled.
If wait_for is not provided, the default behavior is to wait for the network to be idle for 500 milliseconds. All of the durations are capped at capped at 60 seconds.
The values for the timeout duration are in the object shape { secs: 10, nanos: 0 }.
webhooks object
Use webhooks to get notified on events like credit depleted, new pages, metadata, and website status. { destination: string, on_credits_depleted: bool, on_credits_half_depleted: bool, on_website_status: bool, on_find: bool, on_find_metadata: bool }
user_agent string
Add a custom HTTP user agent to the request. By default this is set to a random agent.
sitemap boolean
Include the sitemap results to crawl. The Defaults to false.
The sitemap allows you to include links that may not be exposed in the HTML.
sitemap_only boolean
Only include the sitemap results to crawl. The Defaults to false.
Using this option allows you to get only the pages on the sitemap without crawling the entire website.
sitemap_path string
The sitemap url to use when using sitemap. The default is sitemap.xml.
The sitemap_path allows you to include the url of a unique sitemap.
subdomains boolean
Allow subdomains to be included. Defaults to false.
tld boolean
Allow TLD's to be included. Defaults to false.
root_selector string
The root CSS query selector to use extracting content from the markup for the response.
Test Query Selector
preserve_host boolean
Preserve the default HOST header for the client. This may help bypass pages that require a HOST, and when the TLS cannot be determined. The default setting is false.
full_resources boolean
Crawl and download all the resources for a website.
Collect all the content from the website, including assets like images, videos, etc.
redirect_policy string
The network redirect policy to use when performing HTTP request. Set the value to Loose to allow all redirects to any domain. Defaults to Loose only allowing redirects on the same domain. Possible values are Loose, Strict, and None.
Loose will only capture the initial page redirect to the resource. You need to include the website in external_domains to allow the request to continue crawling if it is outside of the domain.
external_domains array
A list of external domains to treat as one domain. You can use regex paths to include the domains. Set one of the array values to * to include all domains.
exclude_selector string
A CSS query selector to use for ignoring content from the markup of the response.
concurrency_limit number
Set the concurrency limit to help balance request for slower websites. The default is unlimited.
execution_scripts object
Run custom JavaScript on certain paths. This helps with authenticated pages. The request mode to be made through chrome or smart to run. The values for the timeout duration should be in the object shape "/path_or_url": "custom js".
Custom scripts allow you to take control of the browser with events for up to 60 seconds at a time per page.
disable_intercept boolean
Disable request interception when running request as chrome or smart. This may help if you bypass pages that use third-party scripts or external domains. The default setting is false. Note that cost and speed may increase when disabling this feature, as it removes our native Chrome interception.
block_ads boolean
Block advertisements when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
block_analytics boolean
Block analytics when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
block_stylesheets boolean
Block stylesheets when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
run_in_background boolean
Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard. This param needs storageless set to false or webhooks to be enabled.
chunking_alg object
Use a chunking algorithm to segment your content output. Possible values are ByWords, ByLines, ByCharacterLength, and BySentence. The following is an example of the object shape for "chunking_alg": { "type": "bysentence", "value": 2 } which splits the text into an array by every 2 sentences found. This works well when used with markdown or text formats.
The chunking algorithm allows you to prepare the content for the AI quick, without needing to add extra code or loaders.
budget object
Object that has paths with a counter for limiting the amount of pages example {"*":1} for only crawling the root page. The wildcard matches all routes and you can set child paths preventing a depth level, example of limiting { "/docs/colors": 10, "/docs/": 100 } which only allows a max of 100 pages if the route matches /docs/:pathname and only 10 pages if it matches /docs/colors/:pathname.
The budget explicitly allows you to set paths and limits for the crawl.
Crawl Budget
max_credits_per_page number
Set the maximum number of credits to use per page. Credits are measured in decimal units, where 10,000 credits equal one dollar (100 credits per penny).
event_tracker object
Track the event request and responses sent when using browser rendering.
crawl_timeout object
The crawl_timeout parameter allows you to put a max duration on the entire crawl. The default setting is 2 mins.
The values for the timeout duration are in the object shape { secs: 300, nanos: 0 }.

return_format string | array
The format to return the data in. Possible values are markdown, commonmark, raw, text, xml, bytes, and empty. Use raw to return the default format of the page like HTML etc. Defaults to raw.
Usually you want to use markdown for LLM processing or text. If you need to store the files without losing any encoding, use the bytes or raw format.
readability boolean
Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage. Defaults to false.
This uses the Safari Reader Mode algorithm to extract only important information from the content.
css_extraction_map object
Use CSS or XPath selectors to scrape contents from the web page. Set the paths and the extraction object map to perform extractions per path or page.
Scrape content using CSS selectors to get data. You can scrape using selectors at no extra cost.
clean_html bool
Clean the HTML of unwanted attributes.
return_json_data boolean
Return the json data found in scripts used for SSR. Defaults to false unless you have the website already stored with the configuration enabled.
This is useful for getting JSON ready data for LLM's and getting data from websites built with Next.js etc.
return_headers boolean
Return the HTTP response headers with the results. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the HTTP headers can help setup authentication flows.
return_cookies boolean
Return the HTTP response cookies with the results. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the HTTP cookies can help setup authentication SSR flows.
return_page_links boolean
Return the links found on each page. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the links can help index the reference locations found for the resource.
filter_output_svg bool
Filter the svg tags from the output.
filter_output_images bool
Filter the images from the output.
filter_output_main_only bool
Filter the nav, aside, and footer from the output.
encoding string
The type of encoding to use like UTF-8, SHIFT_JIS, or etc.
Perform the encoding on the server when you know in advance the type of website.
return_embeddings boolean
Include OpenAI embeddings for title and description. The default is false and the param requires metadata to be used in conjunction.
If you are embedding data, you can use these matrices as staples for most vector baseline operations.

proxy 'residential' | 'residential_fast' | 'residential_static' | 'mobile' | 'isp' | 'residential_premium' | 'residential_core' | 'residential_plus'
Select the proxy pool you’d like to use for this request. Leave it blank to disable proxy routing. Supported values are:
- residential – entry-level residential pool.
- residential_fast – high-speed residential pool for heavier throughput.
- residential_static – static residential IPs for long-lived sessions.
- mobile – 4G / 5G mobile proxies for maximum stealth.
- isp (datacenter alias) – ISP-grade / datacenter-style routing with residential ASN.
- residential_premium – lower-latency premium residential pool.
- residential_core – balanced cost/quality residential pool.
- residential_plus – largest, highest-quality residential pool.
Each pool carries a different price multiplier (from ×1.2 for residential up to ×2.0 for mobile, and ×4.0 for residential_premium). See the pricing table for full details. Using this param overrides all other proxy_* shorthand configurations.
remote_proxy string
Use a remote external proxy connection. You also save 70% on data transfer costs when you bring your own proxy.
Use your own proxy to bypass any firewall as needed or connect to private web servers.
cookies string
Add HTTP cookies to use for request.
Set the cookie value for pages that use SSR authentication.
headers object
Forward HTTP headers to use for all request. The object is expected to be a map of key value pairs.
Using HTTP headers can help with authenticated pages that use the authorization header field.
blacklist array
Blacklist a set of paths that you do not want to crawl. You can use Regex patterns to help with the list.
whitelist array
Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns to help with the list.
fingerprint boolean
Use advanced fingerprint detection for chrome. Defaults to true.
Set this value to help crawl when websites require a fingerprint.
anti_bot boolean
Enable anti-bot mode using various techniques to increase the chance of success. Defaults to true.
This config will attempt to make request resemble a real human. If the request fails on chrome it will retry using a virtual display for the request that is slower difficult to block at the cost of speed.
stealth boolean
Use stealth mode for headless chrome request to help prevent being blocked. This defaults to true.
Set to true to almost guarantee not being detected by anything.
proxy_datacenter boolean
Enable large pool of datacenter proxies. Defaults to false.
proxy_enabled boolean
Enable premium high performance proxies to prevent detection and increase speed. Defaults to false.
Using this configuration can help when network requests are blocked. This setup increases the cost for file_cost and bytes_transferred_cost, but only by 1.5 times.
proxy_lightning boolean
Enable high performance residential proxies. Defaults to false.
proxy_mobile boolean
Enable mobile proxies for undetectable crawling at the network layer. Defaults to false. The credits cost for file_cost and bytes_transferred_cost is doubled when set to true.

scroll number
Infinite scroll the page as new content loads, up to a duration in milliseconds. The duration represents the maximum time you would wait to scroll. You may still need to use the wait_for parameters. You also need to ensure that the request is made using chrome.
Use the wait_for configuration to scroll until and disable_intercept to make sure you get data from the network regardless of hostname.
device object
Configure the device for chrome. One of mobile, tablet, or desktop. Defaults to desktop.
viewport object
Configure the viewport for chrome. Defaults to 800x600.
If you need to get data from a website as a mobile, set the viewport to a phone device's size ex: 375x414.
automation_scripts object
Run custom web automated tasks on certain paths. The request mode to be made through chrome or smart to run.
Custom web automation allows you to take control of the browser with events for up to 60 seconds at a time per page.
Below are the available actions for web automation:
Evaluate: Runs custom JavaScript code.
{ "Evaluate": "console.log('Hello, World!');" }
Click: Clicks on an element identified by a CSS selector.
{ "Click": "button#submit" }
Wait: Waits for a specified duration in milliseconds.
{ "Wait": 2000 }
WaitForNavigation: Waits for the next navigation event.
{ "WaitForNavigation": true }
WaitFor: Waits for an element to appear identified by a CSS selector.
{ "WaitFor": "div#content" }
WaitForAndClick: Waits for an element to appear and then clicks on it, identified by a CSS selector.
{ "WaitForAndClick": "button#loadMore" }
ScrollX: Scrolls the screen horizontally by a specified number of pixels.
{ "ScrollX": 100 }
ScrollY: Scrolls the screen vertically by a specified number of pixels.
{ "ScrollY": 200 }
Fill: Fills an input element with a specified value.
{ "Fill": { "selector": "input#name", "value": "John Doe" } }
ValidateChain: Set this before a step to validate the prior action to break out of the chain.
{ "ValidateChain": true }
InfiniteScroll: Scrolls the page until the end for certain duration.
{ "InfiniteScroll": 3000 }
gpt_config object
Use AI to generate actions to perform during the crawl. You can pass an array for the prompt to chain steps.
Perform event-driven browser actions on a web page or extract data using the gpt_config, you need to use your own API key or have at least $100 in credits to use AI.
evaluate_on_new_document string
Set a custom script to evaluate on new document creation.

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/links', 
  headers=headers, json=json_data)

print(response.json())

Response

[
  {
    "url": "https://spider.cloud",
    "status": 200,
    "error": null
  },
  // more content...
]

Screenshot

Take screenshots to base64 or binary encoding. You can pass an array of objects for the request body.

POSThttps://api.spider.cloud/screenshot

Request body

url required string
The URI resource to crawl. This can be a comma split list for multiple URLs.
To reduce latency, enhance performance, and save on rate limits batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
Test Url
limit number
The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.
It is better to set a limit upfront on websites where you do not know the size. Re-crawling can effectively use cache to keep costs low as new pages are found.
Crawl Limit
store_data boolean
Decide whether to store data. If enabled, this option overrides storageless. The default setting is false. Storage costs $0.30 per GB per month.
Set to true to collect resources to download and re-use later on. *Please note: Storing data incurs fee of $0.30/gigabytes/month or daily $0.01.
lite_mode boolean
Lite mode reduces data transfer costs by 70%, with trade-offs in speed, accuracy, geo-targeting, and reliability. It’s best suited for non-urgent data collection or when targeting websites with minimal anti-bot protections.

depth number
The crawl limit for maximum depth. If 0, no limit will be applied. The default is set to 25.
Depth allows you to place a distance between the base URL path and sub paths.
Crawl Depth
metadata boolean
Collect metadata about the content found like page title, description, keywards and etc. This could help improve AI interoperability. Defaults to false.
Using metadata can help extract critical information to use for AI.
session boolean
Persist the session for the client that you use on a website. This allows the HTTP headers and cookies to be set like a real browser session. Defaults to true.
request_timeout number
The timeout to use for request. Timeouts can be from 5-255. The Defaults to 60 seconds.
The timeout of the request helps prevent long request times from hanging.
wait_for object
The wait_for parameter allows you to specify various waiting conditions for a website operation. If provided, it contains the following sub-parameters:
The key idle_network specifies the conditions to wait for the network to be idle. It can include an optional timeout value.
The key selector specifies the conditions to wait for a particular CSS selector to be found on the page. It includes an optional timeout value, and the CSS selector to wait for.
The key dom specifies the conditions to wait for a particular element to stop updating for a duration on the page. It includes an optional timeout value, and the CSS selector to wait for.
The key delay specifies a delay to wait for, with an optional timeout value.
The key page_navigationsset to true then waiting for all page navigations will be handled.
If wait_for is not provided, the default behavior is to wait for the network to be idle for 500 milliseconds. All of the durations are capped at capped at 60 seconds.
The values for the timeout duration are in the object shape { secs: 10, nanos: 0 }.
webhooks object
Use webhooks to get notified on events like credit depleted, new pages, metadata, and website status. { destination: string, on_credits_depleted: bool, on_credits_half_depleted: bool, on_website_status: bool, on_find: bool, on_find_metadata: bool }
user_agent string
Add a custom HTTP user agent to the request. By default this is set to a random agent.
sitemap boolean
Include the sitemap results to crawl. The Defaults to false.
The sitemap allows you to include links that may not be exposed in the HTML.
sitemap_only boolean
Only include the sitemap results to crawl. The Defaults to false.
Using this option allows you to get only the pages on the sitemap without crawling the entire website.
sitemap_path string
The sitemap url to use when using sitemap. The default is sitemap.xml.
The sitemap_path allows you to include the url of a unique sitemap.
subdomains boolean
Allow subdomains to be included. Defaults to false.
tld boolean
Allow TLD's to be included. Defaults to false.
root_selector string
The root CSS query selector to use extracting content from the markup for the response.
Test Query Selector
preserve_host boolean
Preserve the default HOST header for the client. This may help bypass pages that require a HOST, and when the TLS cannot be determined. The default setting is false.
full_resources boolean
Crawl and download all the resources for a website.
Collect all the content from the website, including assets like images, videos, etc.
redirect_policy string
The network redirect policy to use when performing HTTP request. Set the value to Loose to allow all redirects to any domain. Defaults to Loose only allowing redirects on the same domain. Possible values are Loose, Strict, and None.
Loose will only capture the initial page redirect to the resource. You need to include the website in external_domains to allow the request to continue crawling if it is outside of the domain.
external_domains array
A list of external domains to treat as one domain. You can use regex paths to include the domains. Set one of the array values to * to include all domains.
exclude_selector string
A CSS query selector to use for ignoring content from the markup of the response.
concurrency_limit number
Set the concurrency limit to help balance request for slower websites. The default is unlimited.
execution_scripts object
Run custom JavaScript on certain paths. This helps with authenticated pages. The request mode to be made through chrome or smart to run. The values for the timeout duration should be in the object shape "/path_or_url": "custom js".
Custom scripts allow you to take control of the browser with events for up to 60 seconds at a time per page.
disable_intercept boolean
Disable request interception when running request as chrome or smart. This may help if you bypass pages that use third-party scripts or external domains. The default setting is false. Note that cost and speed may increase when disabling this feature, as it removes our native Chrome interception.
block_ads boolean
Block advertisements when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
block_analytics boolean
Block analytics when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
block_stylesheets boolean
Block stylesheets when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
run_in_background boolean
Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard. This param needs storageless set to false or webhooks to be enabled.
chunking_alg object
Use a chunking algorithm to segment your content output. Possible values are ByWords, ByLines, ByCharacterLength, and BySentence. The following is an example of the object shape for "chunking_alg": { "type": "bysentence", "value": 2 } which splits the text into an array by every 2 sentences found. This works well when used with markdown or text formats.
The chunking algorithm allows you to prepare the content for the AI quick, without needing to add extra code or loaders.
budget object
Object that has paths with a counter for limiting the amount of pages example {"*":1} for only crawling the root page. The wildcard matches all routes and you can set child paths preventing a depth level, example of limiting { "/docs/colors": 10, "/docs/": 100 } which only allows a max of 100 pages if the route matches /docs/:pathname and only 10 pages if it matches /docs/colors/:pathname.
The budget explicitly allows you to set paths and limits for the crawl.
Crawl Budget
max_credits_per_page number
Set the maximum number of credits to use per page. Credits are measured in decimal units, where 10,000 credits equal one dollar (100 credits per penny).
event_tracker object
Track the event request and responses sent when using browser rendering.
crawl_timeout object
The crawl_timeout parameter allows you to put a max duration on the entire crawl. The default setting is 2 mins.
The values for the timeout duration are in the object shape { secs: 300, nanos: 0 }.
full_page boolean
Take a screenshot of the full page. Defaults to true.

css_extraction_map object
Use CSS or XPath selectors to scrape contents from the web page. Set the paths and the extraction object map to perform extractions per path or page.
Scrape content using CSS selectors to get data. You can scrape using selectors at no extra cost.
clean_html bool
Clean the HTML of unwanted attributes.
return_json_data boolean
Return the json data found in scripts used for SSR. Defaults to false unless you have the website already stored with the configuration enabled.
This is useful for getting JSON ready data for LLM's and getting data from websites built with Next.js etc.
return_headers boolean
Return the HTTP response headers with the results. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the HTTP headers can help setup authentication flows.
return_cookies boolean
Return the HTTP response cookies with the results. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the HTTP cookies can help setup authentication SSR flows.
return_page_links boolean
Return the links found on each page. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the links can help index the reference locations found for the resource.
filter_output_svg bool
Filter the svg tags from the output.
filter_output_images bool
Filter the images from the output.
filter_output_main_only bool
Filter the nav, aside, and footer from the output.
encoding string
The type of encoding to use like UTF-8, SHIFT_JIS, or etc.
Perform the encoding on the server when you know in advance the type of website.
return_embeddings boolean
Include OpenAI embeddings for title and description. The default is false and the param requires metadata to be used in conjunction.
If you are embedding data, you can use these matrices as staples for most vector baseline operations.
binary boolean
Return the image as binary instead of base64.
cdp_params object
The settings to use to adjust clip, format, quality, and more. Defaults to null.

proxy 'residential' | 'residential_fast' | 'residential_static' | 'mobile' | 'isp' | 'residential_premium' | 'residential_core' | 'residential_plus'
Select the proxy pool you’d like to use for this request. Leave it blank to disable proxy routing. Supported values are:
- residential – entry-level residential pool.
- residential_fast – high-speed residential pool for heavier throughput.
- residential_static – static residential IPs for long-lived sessions.
- mobile – 4G / 5G mobile proxies for maximum stealth.
- isp (datacenter alias) – ISP-grade / datacenter-style routing with residential ASN.
- residential_premium – lower-latency premium residential pool.
- residential_core – balanced cost/quality residential pool.
- residential_plus – largest, highest-quality residential pool.
Each pool carries a different price multiplier (from ×1.2 for residential up to ×2.0 for mobile, and ×4.0 for residential_premium). See the pricing table for full details. Using this param overrides all other proxy_* shorthand configurations.
remote_proxy string
Use a remote external proxy connection. You also save 70% on data transfer costs when you bring your own proxy.
Use your own proxy to bypass any firewall as needed or connect to private web servers.
cookies string
Add HTTP cookies to use for request.
Set the cookie value for pages that use SSR authentication.
headers object
Forward HTTP headers to use for all request. The object is expected to be a map of key value pairs.
Using HTTP headers can help with authenticated pages that use the authorization header field.
blacklist array
Blacklist a set of paths that you do not want to crawl. You can use Regex patterns to help with the list.
whitelist array
Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns to help with the list.
fingerprint boolean
Use advanced fingerprint detection for chrome. Defaults to true.
Set this value to help crawl when websites require a fingerprint.
anti_bot boolean
Enable anti-bot mode using various techniques to increase the chance of success. Defaults to true.
This config will attempt to make request resemble a real human. If the request fails on chrome it will retry using a virtual display for the request that is slower difficult to block at the cost of speed.
stealth boolean
Use stealth mode for headless chrome request to help prevent being blocked. This defaults to true.
Set to true to almost guarantee not being detected by anything.
proxy_datacenter boolean
Enable large pool of datacenter proxies. Defaults to false.
proxy_enabled boolean
Enable premium high performance proxies to prevent detection and increase speed. Defaults to false.
Using this configuration can help when network requests are blocked. This setup increases the cost for file_cost and bytes_transferred_cost, but only by 1.5 times.
proxy_lightning boolean
Enable high performance residential proxies. Defaults to false.
proxy_mobile boolean
Enable mobile proxies for undetectable crawling at the network layer. Defaults to false. The credits cost for file_cost and bytes_transferred_cost is doubled when set to true.

scroll number
Infinite scroll the page as new content loads, up to a duration in milliseconds. The duration represents the maximum time you would wait to scroll. You may still need to use the wait_for parameters. You also need to ensure that the request is made using chrome.
Use the wait_for configuration to scroll until and disable_intercept to make sure you get data from the network regardless of hostname.
device object
Configure the device for chrome. One of mobile, tablet, or desktop. Defaults to desktop.
viewport object
Configure the viewport for chrome. Defaults to 800x600.
If you need to get data from a website as a mobile, set the viewport to a phone device's size ex: 375x414.
automation_scripts object
Run custom web automated tasks on certain paths. The request mode to be made through chrome or smart to run.
Custom web automation allows you to take control of the browser with events for up to 60 seconds at a time per page.
Below are the available actions for web automation:
Evaluate: Runs custom JavaScript code.
{ "Evaluate": "console.log('Hello, World!');" }
Click: Clicks on an element identified by a CSS selector.
{ "Click": "button#submit" }
Wait: Waits for a specified duration in milliseconds.
{ "Wait": 2000 }
WaitForNavigation: Waits for the next navigation event.
{ "WaitForNavigation": true }
WaitFor: Waits for an element to appear identified by a CSS selector.
{ "WaitFor": "div#content" }
WaitForAndClick: Waits for an element to appear and then clicks on it, identified by a CSS selector.
{ "WaitForAndClick": "button#loadMore" }
ScrollX: Scrolls the screen horizontally by a specified number of pixels.
{ "ScrollX": 100 }
ScrollY: Scrolls the screen vertically by a specified number of pixels.
{ "ScrollY": 200 }
Fill: Fills an input element with a specified value.
{ "Fill": { "selector": "input#name", "value": "John Doe" } }
ValidateChain: Set this before a step to validate the prior action to break out of the chain.
{ "ValidateChain": true }
InfiniteScroll: Scrolls the page until the end for certain duration.
{ "InfiniteScroll": 3000 }
gpt_config object
Use AI to generate actions to perform during the crawl. You can pass an array for the prompt to chain steps.
Perform event-driven browser actions on a web page or extract data using the gpt_config, you need to use your own API key or have at least $100 in credits to use AI.
evaluate_on_new_document string
Set a custom script to evaluate on new document creation.

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

json_data = {"limit":5,"url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/screenshot', 
  headers=headers, json=json_data)

print(response.json())

Response

[
  {
    "content": "<resource>...",
    "error": null,
    "status": 200,
    "costs": {
      "ai_cost": 0,
      "compute_cost": 0.00001,
      "file_cost": 0.00002,
      "bytes_transferred_cost": 0.00002,
      "total_cost": 0.00004,
      "transform_cost": 0.0001
    },
    "url": "https://spider.cloud"
  },
  // more content...
]

Transform HTML

Transform HTML to Markdown or text fast. Each HTML transformation costs 0.1 credits. You can send up to 10MB of data at once. The transform API is also built into the /crawl endpoint by using return_format.

POSThttps://api.spider.cloud/transform

Request body

data required object
A list of html data to transform. The object list takes the keys html and url. The url key is optional and only used when the readability is enabled.
Data<html><body> <h1>Example Website</h1> <p>This is some example markup to use to test the transform function.</p> <p><a href="https://spider.cloud/guides">Guides</a></p> </body></html>

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

json_data = {"return_format":"markdown","data":[{"html":"<html><body>\n<h1>Example Website</h1>\n<p>This is some example markup to use to test the transform function.</p>\n<p><a href=\"https://spider.cloud/guides\">Guides</a></p>\n</body></html>","url":"https://example.com"}]}

response = requests.post('https://api.spider.cloud/transform', 
  headers=headers, json=json_data)

print(response.json())

Response

{
    "content": [
      "# Example Website
This is some example markup to use to test the transform function.
[Guides](https://spider.cloud/guides)"
    ],
    "cost": {
        "ai_cost": 0,
        "compute_cost": 0,
        "file_cost": 0,
        "bytes_transferred_cost": 0,
        "total_cost": 0,
        "transform_cost": 0.0001
    },
    "error": null,
    "status": 200
  }

Query

Query a resource from the global database instead of crawling a website. 0.1 credits per successful retrieval.

POSThttps://api.spider.cloud/data/query

Request body

url string
The exact path of the url that you want to get.
Test Url
domain string
The website domain you want to query.
pathname string
The website pathname you want to query.

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

json_data = {"url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/data/query', 
  headers=headers, json=json_data)

print(response.json())

Response

{
  "content": "<html>
    <body>
      <div>
          <h1>Example Website</h1>
      </div>
    </body>
  </html>",
  "error": null,
  "status": 200
}

Proxy-Mode

Spider also offers a proxy front-end to the service. The Spider proxy will then handle requests just like any standard request, with the option to use high-performance and residential proxies up to 10GB per/s. Take a look at all of our proxy locations to see if we support the country.

**HTTP address**: proxy.spider.cloud:8888**HTTPS address**: proxy.spider.cloud:8889**Username**: YOUR-API-KEY**Password**: PARAMETERS

Residential

Speed: Up to 1GB/s
Purpose: Real-User IPs, Global Reach, High Anonymity
Cost: $1/GB - $4/GB

ISP

Speed: Up to 10GB/s
Purpose: Stable Datacenter IPs, Highest Performance
Cost: $1/GB

Mobile

Speed: Up to 100MB/s
Purpose: Real Mobile Devices, Avoid Detection
Cost: $2/GB

Use the country_code parameter to determine the proxy geolocation and the proxy parameter to change the proxy.

Proxy Type	Price	Multiplier	Description
residential	$1.00/GB	×1	Entry-level residential pool
residential_static	$1.00/GB	×1	Static IPs for long-lived sessions
residential_fast	$1.50/GB	×1.5	High-speed residential for heavy throughput
residential_core	$1.50/GB	×1.5	Balanced quality and cost
residential_plus	$3.00/GB	×3.0	Largest, highest-quality residential pool
residential_premium	$4.00/GB	×4.0	Low-latency premium residential pool
mobile	$2.00/GB	×2.0	4G/5G mobile proxies for stealth
isp	$1.00/GB	×1	ISP-grade residential routing

Example proxy request

import requests, os


# Proxy configuration
proxies = {
    'http': f"http://{os.getenv('SPIDER_API_KEY')}:proxy=residential@proxy.spider.cloud:8888",
    'https': f"https://{os.getenv('SPIDER_API_KEY')}:proxy=residential@proxy.spider.cloud:8889"
}

# Function to make a request through the proxy
def get_via_proxy(url):
    try:
        response = requests.get(url, proxies=proxies)
        response.raise_for_status()
        print('Response HTTP Status Code: ', response.status_code)
        print('Response HTTP Response Body: ', response.content)
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")
        return None

# Example usage
if __name__ == "__main__":
     get_via_proxy("https://www.example.com")
     get_via_proxy("https://www.example.com/community")

Extract leads

Start crawling a website(s) to collect all leads utilizing AI. A minimum of $25 in credits is necessary for extraction.

POSThttps://api.spider.cloud/pipeline/extract-contacts

Request body

url required string
The URI resource to crawl. This can be a comma split list for multiple URLs.
To reduce latency, enhance performance, and save on rate limits batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
Test Url
limit number
The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.
It is better to set a limit upfront on websites where you do not know the size. Re-crawling can effectively use cache to keep costs low as new pages are found.
Crawl Limit
store_data boolean
Decide whether to store data. If enabled, this option overrides storageless. The default setting is false. Storage costs $0.30 per GB per month.
Set to true to collect resources to download and re-use later on. *Please note: Storing data incurs fee of $0.30/gigabytes/month or daily $0.01.
lite_mode boolean
Lite mode reduces data transfer costs by 70%, with trade-offs in speed, accuracy, geo-targeting, and reliability. It’s best suited for non-urgent data collection or when targeting websites with minimal anti-bot protections.

request string
The request type to perform. Possible values are http, chrome, and smart. Use smart to perform HTTP request by default until JavaScript rendering is needed for the HTML. Defaults to smart.
The request greatly influences how the output is going to look. If the page is server-side rendered, you can stick to the defaults for the most part.
depth number
The crawl limit for maximum depth. If 0, no limit will be applied. The default is set to 25.
Depth allows you to place a distance between the base URL path and sub paths.
Crawl Depth
metadata boolean
Collect metadata about the content found like page title, description, keywards and etc. This could help improve AI interoperability. Defaults to false.
Using metadata can help extract critical information to use for AI.
session boolean
Persist the session for the client that you use on a website. This allows the HTTP headers and cookies to be set like a real browser session. Defaults to true.
request_timeout number
The timeout to use for request. Timeouts can be from 5-255. The Defaults to 60 seconds.
The timeout of the request helps prevent long request times from hanging.
wait_for object
The wait_for parameter allows you to specify various waiting conditions for a website operation. If provided, it contains the following sub-parameters:
The key idle_network specifies the conditions to wait for the network to be idle. It can include an optional timeout value.
The key selector specifies the conditions to wait for a particular CSS selector to be found on the page. It includes an optional timeout value, and the CSS selector to wait for.
The key dom specifies the conditions to wait for a particular element to stop updating for a duration on the page. It includes an optional timeout value, and the CSS selector to wait for.
The key delay specifies a delay to wait for, with an optional timeout value.
The key page_navigationsset to true then waiting for all page navigations will be handled.
If wait_for is not provided, the default behavior is to wait for the network to be idle for 500 milliseconds. All of the durations are capped at capped at 60 seconds.
The values for the timeout duration are in the object shape { secs: 10, nanos: 0 }.
webhooks object
Use webhooks to get notified on events like credit depleted, new pages, metadata, and website status. { destination: string, on_credits_depleted: bool, on_credits_half_depleted: bool, on_website_status: bool, on_find: bool, on_find_metadata: bool }
user_agent string
Add a custom HTTP user agent to the request. By default this is set to a random agent.
sitemap boolean
Include the sitemap results to crawl. The Defaults to false.
The sitemap allows you to include links that may not be exposed in the HTML.
sitemap_only boolean
Only include the sitemap results to crawl. The Defaults to false.
Using this option allows you to get only the pages on the sitemap without crawling the entire website.
sitemap_path string
The sitemap url to use when using sitemap. The default is sitemap.xml.
The sitemap_path allows you to include the url of a unique sitemap.
subdomains boolean
Allow subdomains to be included. Defaults to false.
tld boolean
Allow TLD's to be included. Defaults to false.
root_selector string
The root CSS query selector to use extracting content from the markup for the response.
Test Query Selector
preserve_host boolean
Preserve the default HOST header for the client. This may help bypass pages that require a HOST, and when the TLS cannot be determined. The default setting is false.
full_resources boolean
Crawl and download all the resources for a website.
Collect all the content from the website, including assets like images, videos, etc.
redirect_policy string
The network redirect policy to use when performing HTTP request. Set the value to Loose to allow all redirects to any domain. Defaults to Loose only allowing redirects on the same domain. Possible values are Loose, Strict, and None.
Loose will only capture the initial page redirect to the resource. You need to include the website in external_domains to allow the request to continue crawling if it is outside of the domain.
external_domains array
A list of external domains to treat as one domain. You can use regex paths to include the domains. Set one of the array values to * to include all domains.
exclude_selector string
A CSS query selector to use for ignoring content from the markup of the response.
concurrency_limit number
Set the concurrency limit to help balance request for slower websites. The default is unlimited.
execution_scripts object
Run custom JavaScript on certain paths. This helps with authenticated pages. The request mode to be made through chrome or smart to run. The values for the timeout duration should be in the object shape "/path_or_url": "custom js".
Custom scripts allow you to take control of the browser with events for up to 60 seconds at a time per page.
disable_intercept boolean
Disable request interception when running request as chrome or smart. This may help if you bypass pages that use third-party scripts or external domains. The default setting is false. Note that cost and speed may increase when disabling this feature, as it removes our native Chrome interception.
block_ads boolean
Block advertisements when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
block_analytics boolean
Block analytics when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
block_stylesheets boolean
Block stylesheets when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
run_in_background boolean
Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard. This param needs storageless set to false or webhooks to be enabled.
chunking_alg object
Use a chunking algorithm to segment your content output. Possible values are ByWords, ByLines, ByCharacterLength, and BySentence. The following is an example of the object shape for "chunking_alg": { "type": "bysentence", "value": 2 } which splits the text into an array by every 2 sentences found. This works well when used with markdown or text formats.
The chunking algorithm allows you to prepare the content for the AI quick, without needing to add extra code or loaders.
budget object
Object that has paths with a counter for limiting the amount of pages example {"*":1} for only crawling the root page. The wildcard matches all routes and you can set child paths preventing a depth level, example of limiting { "/docs/colors": 10, "/docs/": 100 } which only allows a max of 100 pages if the route matches /docs/:pathname and only 10 pages if it matches /docs/colors/:pathname.
The budget explicitly allows you to set paths and limits for the crawl.
Crawl Budget
max_credits_per_page number
Set the maximum number of credits to use per page. Credits are measured in decimal units, where 10,000 credits equal one dollar (100 credits per penny).
event_tracker object
Track the event request and responses sent when using browser rendering.
crawl_timeout object
The crawl_timeout parameter allows you to put a max duration on the entire crawl. The default setting is 2 mins.
The values for the timeout duration are in the object shape { secs: 300, nanos: 0 }.

return_format string | array
The format to return the data in. Possible values are markdown, commonmark, raw, text, xml, bytes, and empty. Use raw to return the default format of the page like HTML etc. Defaults to raw.
Usually you want to use markdown for LLM processing or text. If you need to store the files without losing any encoding, use the bytes or raw format.
readability boolean
Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage. Defaults to false.
This uses the Safari Reader Mode algorithm to extract only important information from the content.
css_extraction_map object
Use CSS or XPath selectors to scrape contents from the web page. Set the paths and the extraction object map to perform extractions per path or page.
Scrape content using CSS selectors to get data. You can scrape using selectors at no extra cost.
clean_html bool
Clean the HTML of unwanted attributes.
return_json_data boolean
Return the json data found in scripts used for SSR. Defaults to false unless you have the website already stored with the configuration enabled.
This is useful for getting JSON ready data for LLM's and getting data from websites built with Next.js etc.
return_headers boolean
Return the HTTP response headers with the results. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the HTTP headers can help setup authentication flows.
return_cookies boolean
Return the HTTP response cookies with the results. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the HTTP cookies can help setup authentication SSR flows.
return_page_links boolean
Return the links found on each page. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the links can help index the reference locations found for the resource.
filter_output_svg bool
Filter the svg tags from the output.
filter_output_images bool
Filter the images from the output.
filter_output_main_only bool
Filter the nav, aside, and footer from the output.
encoding string
The type of encoding to use like UTF-8, SHIFT_JIS, or etc.
Perform the encoding on the server when you know in advance the type of website.
return_embeddings boolean
Include OpenAI embeddings for title and description. The default is false and the param requires metadata to be used in conjunction.
If you are embedding data, you can use these matrices as staples for most vector baseline operations.

proxy 'residential' | 'residential_fast' | 'residential_static' | 'mobile' | 'isp' | 'residential_premium' | 'residential_core' | 'residential_plus'
Select the proxy pool you’d like to use for this request. Leave it blank to disable proxy routing. Supported values are:
- residential – entry-level residential pool.
- residential_fast – high-speed residential pool for heavier throughput.
- residential_static – static residential IPs for long-lived sessions.
- mobile – 4G / 5G mobile proxies for maximum stealth.
- isp (datacenter alias) – ISP-grade / datacenter-style routing with residential ASN.
- residential_premium – lower-latency premium residential pool.
- residential_core – balanced cost/quality residential pool.
- residential_plus – largest, highest-quality residential pool.
Each pool carries a different price multiplier (from ×1.2 for residential up to ×2.0 for mobile, and ×4.0 for residential_premium). See the pricing table for full details. Using this param overrides all other proxy_* shorthand configurations.
remote_proxy string
Use a remote external proxy connection. You also save 70% on data transfer costs when you bring your own proxy.
Use your own proxy to bypass any firewall as needed or connect to private web servers.
cookies string
Add HTTP cookies to use for request.
Set the cookie value for pages that use SSR authentication.
headers object
Forward HTTP headers to use for all request. The object is expected to be a map of key value pairs.
Using HTTP headers can help with authenticated pages that use the authorization header field.
blacklist array
Blacklist a set of paths that you do not want to crawl. You can use Regex patterns to help with the list.
whitelist array
Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns to help with the list.
fingerprint boolean
Use advanced fingerprint detection for chrome. Defaults to true.
Set this value to help crawl when websites require a fingerprint.
anti_bot boolean
Enable anti-bot mode using various techniques to increase the chance of success. Defaults to true.
This config will attempt to make request resemble a real human. If the request fails on chrome it will retry using a virtual display for the request that is slower difficult to block at the cost of speed.
stealth boolean
Use stealth mode for headless chrome request to help prevent being blocked. This defaults to true.
Set to true to almost guarantee not being detected by anything.
proxy_datacenter boolean
Enable large pool of datacenter proxies. Defaults to false.
proxy_enabled boolean
Enable premium high performance proxies to prevent detection and increase speed. Defaults to false.
Using this configuration can help when network requests are blocked. This setup increases the cost for file_cost and bytes_transferred_cost, but only by 1.5 times.
proxy_lightning boolean
Enable high performance residential proxies. Defaults to false.
proxy_mobile boolean
Enable mobile proxies for undetectable crawling at the network layer. Defaults to false. The credits cost for file_cost and bytes_transferred_cost is doubled when set to true.

scroll number
Infinite scroll the page as new content loads, up to a duration in milliseconds. The duration represents the maximum time you would wait to scroll. You may still need to use the wait_for parameters. You also need to ensure that the request is made using chrome.
Use the wait_for configuration to scroll until and disable_intercept to make sure you get data from the network regardless of hostname.
device object
Configure the device for chrome. One of mobile, tablet, or desktop. Defaults to desktop.
viewport object
Configure the viewport for chrome. Defaults to 800x600.
If you need to get data from a website as a mobile, set the viewport to a phone device's size ex: 375x414.
automation_scripts object
Run custom web automated tasks on certain paths. The request mode to be made through chrome or smart to run.
Custom web automation allows you to take control of the browser with events for up to 60 seconds at a time per page.
Below are the available actions for web automation:
Evaluate: Runs custom JavaScript code.
{ "Evaluate": "console.log('Hello, World!');" }
Click: Clicks on an element identified by a CSS selector.
{ "Click": "button#submit" }
Wait: Waits for a specified duration in milliseconds.
{ "Wait": 2000 }
WaitForNavigation: Waits for the next navigation event.
{ "WaitForNavigation": true }
WaitFor: Waits for an element to appear identified by a CSS selector.
{ "WaitFor": "div#content" }
WaitForAndClick: Waits for an element to appear and then clicks on it, identified by a CSS selector.
{ "WaitForAndClick": "button#loadMore" }
ScrollX: Scrolls the screen horizontally by a specified number of pixels.
{ "ScrollX": 100 }
ScrollY: Scrolls the screen vertically by a specified number of pixels.
{ "ScrollY": 200 }
Fill: Fills an input element with a specified value.
{ "Fill": { "selector": "input#name", "value": "John Doe" } }
ValidateChain: Set this before a step to validate the prior action to break out of the chain.
{ "ValidateChain": true }
InfiniteScroll: Scrolls the page until the end for certain duration.
{ "InfiniteScroll": 3000 }
gpt_config object
Use AI to generate actions to perform during the crawl. You can pass an array for the prompt to chain steps.
Perform event-driven browser actions on a web page or extract data using the gpt_config, you need to use your own API key or have at least $100 in credits to use AI.
evaluate_on_new_document string
Set a custom script to evaluate on new document creation.

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/pipeline/extract-contacts', 
  headers=headers, json=json_data)

print(response.json())

Response

[
  {
    "content": "<resource>...",
    "error": null,
    "status": 200,
    "costs": {
      "ai_cost": 0,
      "compute_cost": 0.00001,
      "file_cost": 0.00002,
      "bytes_transferred_cost": 0.00002,
      "total_cost": 0.00004,
      "transform_cost": 0.0001
    },
    "url": "https://spider.cloud"
  },
  // more content...
]

Label website

Crawl a website and accurately categorize it using AI.

POSThttps://api.spider.cloud/pipeline/label

Request body

url required string
The URI resource to crawl. This can be a comma split list for multiple URLs.
To reduce latency, enhance performance, and save on rate limits batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
Test Url
limit number
The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.
It is better to set a limit upfront on websites where you do not know the size. Re-crawling can effectively use cache to keep costs low as new pages are found.
Crawl Limit
store_data boolean
Decide whether to store data. If enabled, this option overrides storageless. The default setting is false. Storage costs $0.30 per GB per month.
Set to true to collect resources to download and re-use later on. *Please note: Storing data incurs fee of $0.30/gigabytes/month or daily $0.01.
lite_mode boolean
Lite mode reduces data transfer costs by 70%, with trade-offs in speed, accuracy, geo-targeting, and reliability. It’s best suited for non-urgent data collection or when targeting websites with minimal anti-bot protections.

request string
The request type to perform. Possible values are http, chrome, and smart. Use smart to perform HTTP request by default until JavaScript rendering is needed for the HTML. Defaults to smart.
The request greatly influences how the output is going to look. If the page is server-side rendered, you can stick to the defaults for the most part.
depth number
The crawl limit for maximum depth. If 0, no limit will be applied. The default is set to 25.
Depth allows you to place a distance between the base URL path and sub paths.
Crawl Depth
metadata boolean
Collect metadata about the content found like page title, description, keywards and etc. This could help improve AI interoperability. Defaults to false.
Using metadata can help extract critical information to use for AI.
session boolean
Persist the session for the client that you use on a website. This allows the HTTP headers and cookies to be set like a real browser session. Defaults to true.
request_timeout number
The timeout to use for request. Timeouts can be from 5-255. The Defaults to 60 seconds.
The timeout of the request helps prevent long request times from hanging.
wait_for object
The wait_for parameter allows you to specify various waiting conditions for a website operation. If provided, it contains the following sub-parameters:
The key idle_network specifies the conditions to wait for the network to be idle. It can include an optional timeout value.
The key selector specifies the conditions to wait for a particular CSS selector to be found on the page. It includes an optional timeout value, and the CSS selector to wait for.
The key dom specifies the conditions to wait for a particular element to stop updating for a duration on the page. It includes an optional timeout value, and the CSS selector to wait for.
The key delay specifies a delay to wait for, with an optional timeout value.
The key page_navigationsset to true then waiting for all page navigations will be handled.
If wait_for is not provided, the default behavior is to wait for the network to be idle for 500 milliseconds. All of the durations are capped at capped at 60 seconds.
The values for the timeout duration are in the object shape { secs: 10, nanos: 0 }.
webhooks object
Use webhooks to get notified on events like credit depleted, new pages, metadata, and website status. { destination: string, on_credits_depleted: bool, on_credits_half_depleted: bool, on_website_status: bool, on_find: bool, on_find_metadata: bool }
user_agent string
Add a custom HTTP user agent to the request. By default this is set to a random agent.
sitemap boolean
Include the sitemap results to crawl. The Defaults to false.
The sitemap allows you to include links that may not be exposed in the HTML.
sitemap_only boolean
Only include the sitemap results to crawl. The Defaults to false.
Using this option allows you to get only the pages on the sitemap without crawling the entire website.
sitemap_path string
The sitemap url to use when using sitemap. The default is sitemap.xml.
The sitemap_path allows you to include the url of a unique sitemap.
subdomains boolean
Allow subdomains to be included. Defaults to false.
tld boolean
Allow TLD's to be included. Defaults to false.
root_selector string
The root CSS query selector to use extracting content from the markup for the response.
Test Query Selector
preserve_host boolean
Preserve the default HOST header for the client. This may help bypass pages that require a HOST, and when the TLS cannot be determined. The default setting is false.
full_resources boolean
Crawl and download all the resources for a website.
Collect all the content from the website, including assets like images, videos, etc.
redirect_policy string
The network redirect policy to use when performing HTTP request. Set the value to Loose to allow all redirects to any domain. Defaults to Loose only allowing redirects on the same domain. Possible values are Loose, Strict, and None.
Loose will only capture the initial page redirect to the resource. You need to include the website in external_domains to allow the request to continue crawling if it is outside of the domain.
external_domains array
A list of external domains to treat as one domain. You can use regex paths to include the domains. Set one of the array values to * to include all domains.
exclude_selector string
A CSS query selector to use for ignoring content from the markup of the response.
concurrency_limit number
Set the concurrency limit to help balance request for slower websites. The default is unlimited.
execution_scripts object
Run custom JavaScript on certain paths. This helps with authenticated pages. The request mode to be made through chrome or smart to run. The values for the timeout duration should be in the object shape "/path_or_url": "custom js".
Custom scripts allow you to take control of the browser with events for up to 60 seconds at a time per page.
disable_intercept boolean
Disable request interception when running request as chrome or smart. This may help if you bypass pages that use third-party scripts or external domains. The default setting is false. Note that cost and speed may increase when disabling this feature, as it removes our native Chrome interception.
block_ads boolean
Block advertisements when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
block_analytics boolean
Block analytics when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
block_stylesheets boolean
Block stylesheets when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
run_in_background boolean
Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard. This param needs storageless set to false or webhooks to be enabled.
chunking_alg object
Use a chunking algorithm to segment your content output. Possible values are ByWords, ByLines, ByCharacterLength, and BySentence. The following is an example of the object shape for "chunking_alg": { "type": "bysentence", "value": 2 } which splits the text into an array by every 2 sentences found. This works well when used with markdown or text formats.
The chunking algorithm allows you to prepare the content for the AI quick, without needing to add extra code or loaders.
budget object
Object that has paths with a counter for limiting the amount of pages example {"*":1} for only crawling the root page. The wildcard matches all routes and you can set child paths preventing a depth level, example of limiting { "/docs/colors": 10, "/docs/": 100 } which only allows a max of 100 pages if the route matches /docs/:pathname and only 10 pages if it matches /docs/colors/:pathname.
The budget explicitly allows you to set paths and limits for the crawl.
Crawl Budget
max_credits_per_page number
Set the maximum number of credits to use per page. Credits are measured in decimal units, where 10,000 credits equal one dollar (100 credits per penny).
event_tracker object
Track the event request and responses sent when using browser rendering.
crawl_timeout object
The crawl_timeout parameter allows you to put a max duration on the entire crawl. The default setting is 2 mins.
The values for the timeout duration are in the object shape { secs: 300, nanos: 0 }.

return_format string | array
The format to return the data in. Possible values are markdown, commonmark, raw, text, xml, bytes, and empty. Use raw to return the default format of the page like HTML etc. Defaults to raw.
Usually you want to use markdown for LLM processing or text. If you need to store the files without losing any encoding, use the bytes or raw format.
readability boolean
Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage. Defaults to false.
This uses the Safari Reader Mode algorithm to extract only important information from the content.
css_extraction_map object
Use CSS or XPath selectors to scrape contents from the web page. Set the paths and the extraction object map to perform extractions per path or page.
Scrape content using CSS selectors to get data. You can scrape using selectors at no extra cost.
clean_html bool
Clean the HTML of unwanted attributes.
return_json_data boolean
Return the json data found in scripts used for SSR. Defaults to false unless you have the website already stored with the configuration enabled.
This is useful for getting JSON ready data for LLM's and getting data from websites built with Next.js etc.
return_headers boolean
Return the HTTP response headers with the results. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the HTTP headers can help setup authentication flows.
return_cookies boolean
Return the HTTP response cookies with the results. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the HTTP cookies can help setup authentication SSR flows.
return_page_links boolean
Return the links found on each page. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the links can help index the reference locations found for the resource.
filter_output_svg bool
Filter the svg tags from the output.
filter_output_images bool
Filter the images from the output.
filter_output_main_only bool
Filter the nav, aside, and footer from the output.
encoding string
The type of encoding to use like UTF-8, SHIFT_JIS, or etc.
Perform the encoding on the server when you know in advance the type of website.
return_embeddings boolean
Include OpenAI embeddings for title and description. The default is false and the param requires metadata to be used in conjunction.
If you are embedding data, you can use these matrices as staples for most vector baseline operations.

proxy 'residential' | 'residential_fast' | 'residential_static' | 'mobile' | 'isp' | 'residential_premium' | 'residential_core' | 'residential_plus'
Select the proxy pool you’d like to use for this request. Leave it blank to disable proxy routing. Supported values are:
- residential – entry-level residential pool.
- residential_fast – high-speed residential pool for heavier throughput.
- residential_static – static residential IPs for long-lived sessions.
- mobile – 4G / 5G mobile proxies for maximum stealth.
- isp (datacenter alias) – ISP-grade / datacenter-style routing with residential ASN.
- residential_premium – lower-latency premium residential pool.
- residential_core – balanced cost/quality residential pool.
- residential_plus – largest, highest-quality residential pool.
Each pool carries a different price multiplier (from ×1.2 for residential up to ×2.0 for mobile, and ×4.0 for residential_premium). See the pricing table for full details. Using this param overrides all other proxy_* shorthand configurations.
remote_proxy string
Use a remote external proxy connection. You also save 70% on data transfer costs when you bring your own proxy.
Use your own proxy to bypass any firewall as needed or connect to private web servers.
cookies string
Add HTTP cookies to use for request.
Set the cookie value for pages that use SSR authentication.
headers object
Forward HTTP headers to use for all request. The object is expected to be a map of key value pairs.
Using HTTP headers can help with authenticated pages that use the authorization header field.
blacklist array
Blacklist a set of paths that you do not want to crawl. You can use Regex patterns to help with the list.
whitelist array
Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns to help with the list.
fingerprint boolean
Use advanced fingerprint detection for chrome. Defaults to true.
Set this value to help crawl when websites require a fingerprint.
anti_bot boolean
Enable anti-bot mode using various techniques to increase the chance of success. Defaults to true.
This config will attempt to make request resemble a real human. If the request fails on chrome it will retry using a virtual display for the request that is slower difficult to block at the cost of speed.
stealth boolean
Use stealth mode for headless chrome request to help prevent being blocked. This defaults to true.
Set to true to almost guarantee not being detected by anything.
proxy_datacenter boolean
Enable large pool of datacenter proxies. Defaults to false.
proxy_enabled boolean
Enable premium high performance proxies to prevent detection and increase speed. Defaults to false.
Using this configuration can help when network requests are blocked. This setup increases the cost for file_cost and bytes_transferred_cost, but only by 1.5 times.
proxy_lightning boolean
Enable high performance residential proxies. Defaults to false.
proxy_mobile boolean
Enable mobile proxies for undetectable crawling at the network layer. Defaults to false. The credits cost for file_cost and bytes_transferred_cost is doubled when set to true.

scroll number
Infinite scroll the page as new content loads, up to a duration in milliseconds. The duration represents the maximum time you would wait to scroll. You may still need to use the wait_for parameters. You also need to ensure that the request is made using chrome.
Use the wait_for configuration to scroll until and disable_intercept to make sure you get data from the network regardless of hostname.
device object
Configure the device for chrome. One of mobile, tablet, or desktop. Defaults to desktop.
viewport object
Configure the viewport for chrome. Defaults to 800x600.
If you need to get data from a website as a mobile, set the viewport to a phone device's size ex: 375x414.
automation_scripts object
Run custom web automated tasks on certain paths. The request mode to be made through chrome or smart to run.
Custom web automation allows you to take control of the browser with events for up to 60 seconds at a time per page.
Below are the available actions for web automation:
Evaluate: Runs custom JavaScript code.
{ "Evaluate": "console.log('Hello, World!');" }
Click: Clicks on an element identified by a CSS selector.
{ "Click": "button#submit" }
Wait: Waits for a specified duration in milliseconds.
{ "Wait": 2000 }
WaitForNavigation: Waits for the next navigation event.
{ "WaitForNavigation": true }
WaitFor: Waits for an element to appear identified by a CSS selector.
{ "WaitFor": "div#content" }
WaitForAndClick: Waits for an element to appear and then clicks on it, identified by a CSS selector.
{ "WaitForAndClick": "button#loadMore" }
ScrollX: Scrolls the screen horizontally by a specified number of pixels.
{ "ScrollX": 100 }
ScrollY: Scrolls the screen vertically by a specified number of pixels.
{ "ScrollY": 200 }
Fill: Fills an input element with a specified value.
{ "Fill": { "selector": "input#name", "value": "John Doe" } }
ValidateChain: Set this before a step to validate the prior action to break out of the chain.
{ "ValidateChain": true }
InfiniteScroll: Scrolls the page until the end for certain duration.
{ "InfiniteScroll": 3000 }
gpt_config object
Use AI to generate actions to perform during the crawl. You can pass an array for the prompt to chain steps.
Perform event-driven browser actions on a web page or extract data using the gpt_config, you need to use your own API key or have at least $100 in credits to use AI.
evaluate_on_new_document string
Set a custom script to evaluate on new document creation.

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/pipeline/label', 
  headers=headers, json=json_data)

print(response.json())

Response

[
  {
    "content": "<resource>...",
    "error": null,
    "status": 200,
    "costs": {
      "ai_cost": 0,
      "compute_cost": 0.00001,
      "file_cost": 0.00002,
      "bytes_transferred_cost": 0.00002,
      "total_cost": 0.00004,
      "transform_cost": 0.0001
    },
    "url": "https://spider.cloud"
  },
  // more content...
]

Crawl from text

Crawl website(s) found from raw text or markdown.

POSThttps://api.spider.cloud/pipeline/crawl-text

Request body

text required string
The text string to extract URLs from. The max limit for the text is 10mb.
limit number
The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.
It is better to set a limit upfront on websites where you do not know the size. Re-crawling can effectively use cache to keep costs low as new pages are found.
Crawl Limit
store_data boolean
Decide whether to store data. If enabled, this option overrides storageless. The default setting is false. Storage costs $0.30 per GB per month.
Set to true to collect resources to download and re-use later on. *Please note: Storing data incurs fee of $0.30/gigabytes/month or daily $0.01.
lite_mode boolean
Lite mode reduces data transfer costs by 70%, with trade-offs in speed, accuracy, geo-targeting, and reliability. It’s best suited for non-urgent data collection or when targeting websites with minimal anti-bot protections.

request string
The request type to perform. Possible values are http, chrome, and smart. Use smart to perform HTTP request by default until JavaScript rendering is needed for the HTML. Defaults to smart.
The request greatly influences how the output is going to look. If the page is server-side rendered, you can stick to the defaults for the most part.
depth number
The crawl limit for maximum depth. If 0, no limit will be applied. The default is set to 25.
Depth allows you to place a distance between the base URL path and sub paths.
Crawl Depth
metadata boolean
Collect metadata about the content found like page title, description, keywards and etc. This could help improve AI interoperability. Defaults to false.
Using metadata can help extract critical information to use for AI.
session boolean
Persist the session for the client that you use on a website. This allows the HTTP headers and cookies to be set like a real browser session. Defaults to true.
request_timeout number
The timeout to use for request. Timeouts can be from 5-255. The Defaults to 60 seconds.
The timeout of the request helps prevent long request times from hanging.
wait_for object
The wait_for parameter allows you to specify various waiting conditions for a website operation. If provided, it contains the following sub-parameters:
The key idle_network specifies the conditions to wait for the network to be idle. It can include an optional timeout value.
The key selector specifies the conditions to wait for a particular CSS selector to be found on the page. It includes an optional timeout value, and the CSS selector to wait for.
The key dom specifies the conditions to wait for a particular element to stop updating for a duration on the page. It includes an optional timeout value, and the CSS selector to wait for.
The key delay specifies a delay to wait for, with an optional timeout value.
The key page_navigationsset to true then waiting for all page navigations will be handled.
If wait_for is not provided, the default behavior is to wait for the network to be idle for 500 milliseconds. All of the durations are capped at capped at 60 seconds.
The values for the timeout duration are in the object shape { secs: 10, nanos: 0 }.
webhooks object
Use webhooks to get notified on events like credit depleted, new pages, metadata, and website status. { destination: string, on_credits_depleted: bool, on_credits_half_depleted: bool, on_website_status: bool, on_find: bool, on_find_metadata: bool }
user_agent string
Add a custom HTTP user agent to the request. By default this is set to a random agent.
sitemap boolean
Include the sitemap results to crawl. The Defaults to false.
The sitemap allows you to include links that may not be exposed in the HTML.
sitemap_only boolean
Only include the sitemap results to crawl. The Defaults to false.
Using this option allows you to get only the pages on the sitemap without crawling the entire website.
sitemap_path string
The sitemap url to use when using sitemap. The default is sitemap.xml.
The sitemap_path allows you to include the url of a unique sitemap.
subdomains boolean
Allow subdomains to be included. Defaults to false.
tld boolean
Allow TLD's to be included. Defaults to false.
root_selector string
The root CSS query selector to use extracting content from the markup for the response.
Test Query Selector
preserve_host boolean
Preserve the default HOST header for the client. This may help bypass pages that require a HOST, and when the TLS cannot be determined. The default setting is false.
full_resources boolean
Crawl and download all the resources for a website.
Collect all the content from the website, including assets like images, videos, etc.
redirect_policy string
The network redirect policy to use when performing HTTP request. Set the value to Loose to allow all redirects to any domain. Defaults to Loose only allowing redirects on the same domain. Possible values are Loose, Strict, and None.
Loose will only capture the initial page redirect to the resource. You need to include the website in external_domains to allow the request to continue crawling if it is outside of the domain.
external_domains array
A list of external domains to treat as one domain. You can use regex paths to include the domains. Set one of the array values to * to include all domains.
exclude_selector string
A CSS query selector to use for ignoring content from the markup of the response.
concurrency_limit number
Set the concurrency limit to help balance request for slower websites. The default is unlimited.
execution_scripts object
Run custom JavaScript on certain paths. This helps with authenticated pages. The request mode to be made through chrome or smart to run. The values for the timeout duration should be in the object shape "/path_or_url": "custom js".
Custom scripts allow you to take control of the browser with events for up to 60 seconds at a time per page.
disable_intercept boolean
Disable request interception when running request as chrome or smart. This may help if you bypass pages that use third-party scripts or external domains. The default setting is false. Note that cost and speed may increase when disabling this feature, as it removes our native Chrome interception.
block_ads boolean
Block advertisements when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
block_analytics boolean
Block analytics when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
block_stylesheets boolean
Block stylesheets when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
run_in_background boolean
Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard. This param needs storageless set to false or webhooks to be enabled.
chunking_alg object
Use a chunking algorithm to segment your content output. Possible values are ByWords, ByLines, ByCharacterLength, and BySentence. The following is an example of the object shape for "chunking_alg": { "type": "bysentence", "value": 2 } which splits the text into an array by every 2 sentences found. This works well when used with markdown or text formats.
The chunking algorithm allows you to prepare the content for the AI quick, without needing to add extra code or loaders.
budget object
Object that has paths with a counter for limiting the amount of pages example {"*":1} for only crawling the root page. The wildcard matches all routes and you can set child paths preventing a depth level, example of limiting { "/docs/colors": 10, "/docs/": 100 } which only allows a max of 100 pages if the route matches /docs/:pathname and only 10 pages if it matches /docs/colors/:pathname.
The budget explicitly allows you to set paths and limits for the crawl.
Crawl Budget
max_credits_per_page number
Set the maximum number of credits to use per page. Credits are measured in decimal units, where 10,000 credits equal one dollar (100 credits per penny).
event_tracker object
Track the event request and responses sent when using browser rendering.
crawl_timeout object
The crawl_timeout parameter allows you to put a max duration on the entire crawl. The default setting is 2 mins.
The values for the timeout duration are in the object shape { secs: 300, nanos: 0 }.

return_format string | array
The format to return the data in. Possible values are markdown, commonmark, raw, text, xml, bytes, and empty. Use raw to return the default format of the page like HTML etc. Defaults to raw.
Usually you want to use markdown for LLM processing or text. If you need to store the files without losing any encoding, use the bytes or raw format.
readability boolean
Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage. Defaults to false.
This uses the Safari Reader Mode algorithm to extract only important information from the content.
css_extraction_map object
Use CSS or XPath selectors to scrape contents from the web page. Set the paths and the extraction object map to perform extractions per path or page.
Scrape content using CSS selectors to get data. You can scrape using selectors at no extra cost.
clean_html bool
Clean the HTML of unwanted attributes.
return_json_data boolean
Return the json data found in scripts used for SSR. Defaults to false unless you have the website already stored with the configuration enabled.
This is useful for getting JSON ready data for LLM's and getting data from websites built with Next.js etc.
return_headers boolean
Return the HTTP response headers with the results. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the HTTP headers can help setup authentication flows.
return_cookies boolean
Return the HTTP response cookies with the results. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the HTTP cookies can help setup authentication SSR flows.
return_page_links boolean
Return the links found on each page. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the links can help index the reference locations found for the resource.
filter_output_svg bool
Filter the svg tags from the output.
filter_output_images bool
Filter the images from the output.
filter_output_main_only bool
Filter the nav, aside, and footer from the output.
encoding string
The type of encoding to use like UTF-8, SHIFT_JIS, or etc.
Perform the encoding on the server when you know in advance the type of website.
return_embeddings boolean
Include OpenAI embeddings for title and description. The default is false and the param requires metadata to be used in conjunction.
If you are embedding data, you can use these matrices as staples for most vector baseline operations.

proxy 'residential' | 'residential_fast' | 'residential_static' | 'mobile' | 'isp' | 'residential_premium' | 'residential_core' | 'residential_plus'
Select the proxy pool you’d like to use for this request. Leave it blank to disable proxy routing. Supported values are:
- residential – entry-level residential pool.
- residential_fast – high-speed residential pool for heavier throughput.
- residential_static – static residential IPs for long-lived sessions.
- mobile – 4G / 5G mobile proxies for maximum stealth.
- isp (datacenter alias) – ISP-grade / datacenter-style routing with residential ASN.
- residential_premium – lower-latency premium residential pool.
- residential_core – balanced cost/quality residential pool.
- residential_plus – largest, highest-quality residential pool.
Each pool carries a different price multiplier (from ×1.2 for residential up to ×2.0 for mobile, and ×4.0 for residential_premium). See the pricing table for full details. Using this param overrides all other proxy_* shorthand configurations.
remote_proxy string
Use a remote external proxy connection. You also save 70% on data transfer costs when you bring your own proxy.
Use your own proxy to bypass any firewall as needed or connect to private web servers.
cookies string
Add HTTP cookies to use for request.
Set the cookie value for pages that use SSR authentication.
headers object
Forward HTTP headers to use for all request. The object is expected to be a map of key value pairs.
Using HTTP headers can help with authenticated pages that use the authorization header field.
blacklist array
Blacklist a set of paths that you do not want to crawl. You can use Regex patterns to help with the list.
whitelist array
Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns to help with the list.
fingerprint boolean
Use advanced fingerprint detection for chrome. Defaults to true.
Set this value to help crawl when websites require a fingerprint.
anti_bot boolean
Enable anti-bot mode using various techniques to increase the chance of success. Defaults to true.
This config will attempt to make request resemble a real human. If the request fails on chrome it will retry using a virtual display for the request that is slower difficult to block at the cost of speed.
stealth boolean
Use stealth mode for headless chrome request to help prevent being blocked. This defaults to true.
Set to true to almost guarantee not being detected by anything.
proxy_datacenter boolean
Enable large pool of datacenter proxies. Defaults to false.
proxy_enabled boolean
Enable premium high performance proxies to prevent detection and increase speed. Defaults to false.
Using this configuration can help when network requests are blocked. This setup increases the cost for file_cost and bytes_transferred_cost, but only by 1.5 times.
proxy_lightning boolean
Enable high performance residential proxies. Defaults to false.
proxy_mobile boolean
Enable mobile proxies for undetectable crawling at the network layer. Defaults to false. The credits cost for file_cost and bytes_transferred_cost is doubled when set to true.

scroll number
Infinite scroll the page as new content loads, up to a duration in milliseconds. The duration represents the maximum time you would wait to scroll. You may still need to use the wait_for parameters. You also need to ensure that the request is made using chrome.
Use the wait_for configuration to scroll until and disable_intercept to make sure you get data from the network regardless of hostname.
device object
Configure the device for chrome. One of mobile, tablet, or desktop. Defaults to desktop.
viewport object
Configure the viewport for chrome. Defaults to 800x600.
If you need to get data from a website as a mobile, set the viewport to a phone device's size ex: 375x414.
automation_scripts object
Run custom web automated tasks on certain paths. The request mode to be made through chrome or smart to run.
Custom web automation allows you to take control of the browser with events for up to 60 seconds at a time per page.
Below are the available actions for web automation:
Evaluate: Runs custom JavaScript code.
{ "Evaluate": "console.log('Hello, World!');" }
Click: Clicks on an element identified by a CSS selector.
{ "Click": "button#submit" }
Wait: Waits for a specified duration in milliseconds.
{ "Wait": 2000 }
WaitForNavigation: Waits for the next navigation event.
{ "WaitForNavigation": true }
WaitFor: Waits for an element to appear identified by a CSS selector.
{ "WaitFor": "div#content" }
WaitForAndClick: Waits for an element to appear and then clicks on it, identified by a CSS selector.
{ "WaitForAndClick": "button#loadMore" }
ScrollX: Scrolls the screen horizontally by a specified number of pixels.
{ "ScrollX": 100 }
ScrollY: Scrolls the screen vertically by a specified number of pixels.
{ "ScrollY": 200 }
Fill: Fills an input element with a specified value.
{ "Fill": { "selector": "input#name", "value": "John Doe" } }
ValidateChain: Set this before a step to validate the prior action to break out of the chain.
{ "ValidateChain": true }
InfiniteScroll: Scrolls the page until the end for certain duration.
{ "InfiniteScroll": 3000 }
gpt_config object
Use AI to generate actions to perform during the crawl. You can pass an array for the prompt to chain steps.
Perform event-driven browser actions on a web page or extract data using the gpt_config, you need to use your own API key or have at least $100 in credits to use AI.
evaluate_on_new_document string
Set a custom script to evaluate on new document creation.

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

json_data = {"text":"Check this link: https://example.com and email to example@email.com","limit":5,"return_format":"markdown"}

response = requests.post('https://api.spider.cloud/pipeline/crawl-text', 
  headers=headers, json=json_data)

print(response.json())

Response

[
  {
    "content": "<resource>...",
    "error": null,
    "status": 200,
    "costs": {
      "ai_cost": 0,
      "compute_cost": 0.00001,
      "file_cost": 0.00002,
      "bytes_transferred_cost": 0.00002,
      "total_cost": 0.00004,
      "transform_cost": 0.0001
    },
    "url": "https://spider.cloud"
  },
  // more content...
]

Filter links

Filter links using AI and advanced metadata.

POSThttps://api.spider.cloud/pipeline/filter-links

Request body

url required array
The URLs to filter.
You can pass up to 4k tokens for the links and prompt.
Test Url

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/pipeline/filter-links', 
  headers=headers, json=json_data)

print(response.json())

Response

{
    "content": [
        {
            "relevant_urls": [
                "https://spider.cloud",
                "https://foodnetwork.com"
            ]
        }
    ],
    "cost": {
        "ai_cost": 0.0005,
        "compute_cost": 0,
        "file_cost": 0,
        "bytes_transferred_cost": 0,
        "total_cost": 0,
        "transform_cost": 0
    },
    "error": "",
    "status": 200
}

Questions and Answers

Get a question-and-answer list for a website based on any inquiry.

POSThttps://api.spider.cloud/pipeline/extract-qa

Request body

url required string
The URI resource to crawl. This can be a comma split list for multiple URLs.
To reduce latency, enhance performance, and save on rate limits batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
Test Url
limit number
The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.
It is better to set a limit upfront on websites where you do not know the size. Re-crawling can effectively use cache to keep costs low as new pages are found.
Crawl Limit
store_data boolean
Decide whether to store data. If enabled, this option overrides storageless. The default setting is false. Storage costs $0.30 per GB per month.
Set to true to collect resources to download and re-use later on. *Please note: Storing data incurs fee of $0.30/gigabytes/month or daily $0.01.
lite_mode boolean
Lite mode reduces data transfer costs by 70%, with trade-offs in speed, accuracy, geo-targeting, and reliability. It’s best suited for non-urgent data collection or when targeting websites with minimal anti-bot protections.

depth number
The crawl limit for maximum depth. If 0, no limit will be applied. The default is set to 25.
Depth allows you to place a distance between the base URL path and sub paths.
Crawl Depth
metadata boolean
Collect metadata about the content found like page title, description, keywards and etc. This could help improve AI interoperability. Defaults to false.
Using metadata can help extract critical information to use for AI.
session boolean
Persist the session for the client that you use on a website. This allows the HTTP headers and cookies to be set like a real browser session. Defaults to true.
request_timeout number
The timeout to use for request. Timeouts can be from 5-255. The Defaults to 60 seconds.
The timeout of the request helps prevent long request times from hanging.
wait_for object
The wait_for parameter allows you to specify various waiting conditions for a website operation. If provided, it contains the following sub-parameters:
The key idle_network specifies the conditions to wait for the network to be idle. It can include an optional timeout value.
The key selector specifies the conditions to wait for a particular CSS selector to be found on the page. It includes an optional timeout value, and the CSS selector to wait for.
The key dom specifies the conditions to wait for a particular element to stop updating for a duration on the page. It includes an optional timeout value, and the CSS selector to wait for.
The key delay specifies a delay to wait for, with an optional timeout value.
The key page_navigationsset to true then waiting for all page navigations will be handled.
If wait_for is not provided, the default behavior is to wait for the network to be idle for 500 milliseconds. All of the durations are capped at capped at 60 seconds.
The values for the timeout duration are in the object shape { secs: 10, nanos: 0 }.
webhooks object
Use webhooks to get notified on events like credit depleted, new pages, metadata, and website status. { destination: string, on_credits_depleted: bool, on_credits_half_depleted: bool, on_website_status: bool, on_find: bool, on_find_metadata: bool }
user_agent string
Add a custom HTTP user agent to the request. By default this is set to a random agent.
sitemap boolean
Include the sitemap results to crawl. The Defaults to false.
The sitemap allows you to include links that may not be exposed in the HTML.
sitemap_only boolean
Only include the sitemap results to crawl. The Defaults to false.
Using this option allows you to get only the pages on the sitemap without crawling the entire website.
sitemap_path string
The sitemap url to use when using sitemap. The default is sitemap.xml.
The sitemap_path allows you to include the url of a unique sitemap.
subdomains boolean
Allow subdomains to be included. Defaults to false.
tld boolean
Allow TLD's to be included. Defaults to false.
root_selector string
The root CSS query selector to use extracting content from the markup for the response.
Test Query Selector
preserve_host boolean
Preserve the default HOST header for the client. This may help bypass pages that require a HOST, and when the TLS cannot be determined. The default setting is false.
full_resources boolean
Crawl and download all the resources for a website.
Collect all the content from the website, including assets like images, videos, etc.
redirect_policy string
The network redirect policy to use when performing HTTP request. Set the value to Loose to allow all redirects to any domain. Defaults to Loose only allowing redirects on the same domain. Possible values are Loose, Strict, and None.
Loose will only capture the initial page redirect to the resource. You need to include the website in external_domains to allow the request to continue crawling if it is outside of the domain.
external_domains array
A list of external domains to treat as one domain. You can use regex paths to include the domains. Set one of the array values to * to include all domains.
exclude_selector string
A CSS query selector to use for ignoring content from the markup of the response.
concurrency_limit number
Set the concurrency limit to help balance request for slower websites. The default is unlimited.
execution_scripts object
Run custom JavaScript on certain paths. This helps with authenticated pages. The request mode to be made through chrome or smart to run. The values for the timeout duration should be in the object shape "/path_or_url": "custom js".
Custom scripts allow you to take control of the browser with events for up to 60 seconds at a time per page.
disable_intercept boolean
Disable request interception when running request as chrome or smart. This may help if you bypass pages that use third-party scripts or external domains. The default setting is false. Note that cost and speed may increase when disabling this feature, as it removes our native Chrome interception.
block_ads boolean
Block advertisements when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
block_analytics boolean
Block analytics when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
block_stylesheets boolean
Block stylesheets when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
run_in_background boolean
Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard. This param needs storageless set to false or webhooks to be enabled.
chunking_alg object
Use a chunking algorithm to segment your content output. Possible values are ByWords, ByLines, ByCharacterLength, and BySentence. The following is an example of the object shape for "chunking_alg": { "type": "bysentence", "value": 2 } which splits the text into an array by every 2 sentences found. This works well when used with markdown or text formats.
The chunking algorithm allows you to prepare the content for the AI quick, without needing to add extra code or loaders.
budget object
Object that has paths with a counter for limiting the amount of pages example {"*":1} for only crawling the root page. The wildcard matches all routes and you can set child paths preventing a depth level, example of limiting { "/docs/colors": 10, "/docs/": 100 } which only allows a max of 100 pages if the route matches /docs/:pathname and only 10 pages if it matches /docs/colors/:pathname.
The budget explicitly allows you to set paths and limits for the crawl.
Crawl Budget
max_credits_per_page number
Set the maximum number of credits to use per page. Credits are measured in decimal units, where 10,000 credits equal one dollar (100 credits per penny).
event_tracker object
Track the event request and responses sent when using browser rendering.
crawl_timeout object
The crawl_timeout parameter allows you to put a max duration on the entire crawl. The default setting is 2 mins.
The values for the timeout duration are in the object shape { secs: 300, nanos: 0 }.
custom_func string
A custom function to run before processing. Contact us for getting a custom function tailored to you.
custom_prompt string
A custom prompt to pass in to the model.
model string
The type of AI model to use like gpt-4.1, gpt-4o, gpt-4o-mini, or gpt-3.5-turbo etc. Defaults to gpt-4o-mini.

css_extraction_map object
Use CSS or XPath selectors to scrape contents from the web page. Set the paths and the extraction object map to perform extractions per path or page.
Scrape content using CSS selectors to get data. You can scrape using selectors at no extra cost.
clean_html bool
Clean the HTML of unwanted attributes.
return_json_data boolean
Return the json data found in scripts used for SSR. Defaults to false unless you have the website already stored with the configuration enabled.
This is useful for getting JSON ready data for LLM's and getting data from websites built with Next.js etc.
return_headers boolean
Return the HTTP response headers with the results. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the HTTP headers can help setup authentication flows.
return_cookies boolean
Return the HTTP response cookies with the results. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the HTTP cookies can help setup authentication SSR flows.
return_page_links boolean
Return the links found on each page. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the links can help index the reference locations found for the resource.
filter_output_svg bool
Filter the svg tags from the output.
filter_output_images bool
Filter the images from the output.
filter_output_main_only bool
Filter the nav, aside, and footer from the output.
encoding string
The type of encoding to use like UTF-8, SHIFT_JIS, or etc.
Perform the encoding on the server when you know in advance the type of website.
return_embeddings boolean
Include OpenAI embeddings for title and description. The default is false and the param requires metadata to be used in conjunction.
If you are embedding data, you can use these matrices as staples for most vector baseline operations.

proxy 'residential' | 'residential_fast' | 'residential_static' | 'mobile' | 'isp' | 'residential_premium' | 'residential_core' | 'residential_plus'
Select the proxy pool you’d like to use for this request. Leave it blank to disable proxy routing. Supported values are:
- residential – entry-level residential pool.
- residential_fast – high-speed residential pool for heavier throughput.
- residential_static – static residential IPs for long-lived sessions.
- mobile – 4G / 5G mobile proxies for maximum stealth.
- isp (datacenter alias) – ISP-grade / datacenter-style routing with residential ASN.
- residential_premium – lower-latency premium residential pool.
- residential_core – balanced cost/quality residential pool.
- residential_plus – largest, highest-quality residential pool.
Each pool carries a different price multiplier (from ×1.2 for residential up to ×2.0 for mobile, and ×4.0 for residential_premium). See the pricing table for full details. Using this param overrides all other proxy_* shorthand configurations.
remote_proxy string
Use a remote external proxy connection. You also save 70% on data transfer costs when you bring your own proxy.
Use your own proxy to bypass any firewall as needed or connect to private web servers.
cookies string
Add HTTP cookies to use for request.
Set the cookie value for pages that use SSR authentication.
headers object
Forward HTTP headers to use for all request. The object is expected to be a map of key value pairs.
Using HTTP headers can help with authenticated pages that use the authorization header field.
blacklist array
Blacklist a set of paths that you do not want to crawl. You can use Regex patterns to help with the list.
whitelist array
Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns to help with the list.
fingerprint boolean
Use advanced fingerprint detection for chrome. Defaults to true.
Set this value to help crawl when websites require a fingerprint.
anti_bot boolean
Enable anti-bot mode using various techniques to increase the chance of success. Defaults to true.
This config will attempt to make request resemble a real human. If the request fails on chrome it will retry using a virtual display for the request that is slower difficult to block at the cost of speed.
stealth boolean
Use stealth mode for headless chrome request to help prevent being blocked. This defaults to true.
Set to true to almost guarantee not being detected by anything.
proxy_datacenter boolean
Enable large pool of datacenter proxies. Defaults to false.
proxy_enabled boolean
Enable premium high performance proxies to prevent detection and increase speed. Defaults to false.
Using this configuration can help when network requests are blocked. This setup increases the cost for file_cost and bytes_transferred_cost, but only by 1.5 times.
proxy_lightning boolean
Enable high performance residential proxies. Defaults to false.
proxy_mobile boolean
Enable mobile proxies for undetectable crawling at the network layer. Defaults to false. The credits cost for file_cost and bytes_transferred_cost is doubled when set to true.

scroll number
Infinite scroll the page as new content loads, up to a duration in milliseconds. The duration represents the maximum time you would wait to scroll. You may still need to use the wait_for parameters. You also need to ensure that the request is made using chrome.
Use the wait_for configuration to scroll until and disable_intercept to make sure you get data from the network regardless of hostname.
device object
Configure the device for chrome. One of mobile, tablet, or desktop. Defaults to desktop.
viewport object
Configure the viewport for chrome. Defaults to 800x600.
If you need to get data from a website as a mobile, set the viewport to a phone device's size ex: 375x414.
automation_scripts object
Run custom web automated tasks on certain paths. The request mode to be made through chrome or smart to run.
Custom web automation allows you to take control of the browser with events for up to 60 seconds at a time per page.
Below are the available actions for web automation:
Evaluate: Runs custom JavaScript code.
{ "Evaluate": "console.log('Hello, World!');" }
Click: Clicks on an element identified by a CSS selector.
{ "Click": "button#submit" }
Wait: Waits for a specified duration in milliseconds.
{ "Wait": 2000 }
WaitForNavigation: Waits for the next navigation event.
{ "WaitForNavigation": true }
WaitFor: Waits for an element to appear identified by a CSS selector.
{ "WaitFor": "div#content" }
WaitForAndClick: Waits for an element to appear and then clicks on it, identified by a CSS selector.
{ "WaitForAndClick": "button#loadMore" }
ScrollX: Scrolls the screen horizontally by a specified number of pixels.
{ "ScrollX": 100 }
ScrollY: Scrolls the screen vertically by a specified number of pixels.
{ "ScrollY": 200 }
Fill: Fills an input element with a specified value.
{ "Fill": { "selector": "input#name", "value": "John Doe" } }
ValidateChain: Set this before a step to validate the prior action to break out of the chain.
{ "ValidateChain": true }
InfiniteScroll: Scrolls the page until the end for certain duration.
{ "InfiniteScroll": 3000 }
gpt_config object
Use AI to generate actions to perform during the crawl. You can pass an array for the prompt to chain steps.
Perform event-driven browser actions on a web page or extract data using the gpt_config, you need to use your own API key or have at least $100 in credits to use AI.
evaluate_on_new_document string
Set a custom script to evaluate on new document creation.

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/pipeline/extract-qa', 
  headers=headers, json=json_data)

print(response.json())

Response

{
  "content": [
    {
        "answer": "Spider is a data collecting solution designed for web crawling and scraping.",
        "question": "What is the primary function of Spider?"
    },
    {
        "answer": "You can kickstart your data collecting projects by signing up for a free trial or taking advantage of the promotional credits offered.",
        "question": "How can I get started with Spider?"
    },
    {
        "answer": "Spider offers unmatched speed, scalability, and comprehensive data curation, making it trusted by leading tech businesses.",
        "question": "What are the benefits of using Spider for data collection?"
    },
    {
        "answer": "Spider can easily crawl, search, and extract data from various sources, including social media platforms.",
        "question": "What kind of data can Spider extract?"
    },
    {
        "answer": "Spider is built fully in Rust for next-generation scalability.",
        "question": "What programming language is Spider built with?"
    }
  ],
    "cost": {
        "ai_cost": 0.0009,
        "compute_cost": 0.0001,
        "file_cost": 0,
        "bytes_transferred_cost": 0,
        "total_cost": 0,
        "transform_cost": 0
  },
  "error": null,
  "status": 200,
  "url": "https://spider.cloud"
}

Websites Collection

Get the websites stored.

GEThttps://api.spider.cloud/data/websites

Request params

url string
Filter a single url record.
Test Url
limit string
The limit of records to get.
Crawl Limit
domain string
Filter a single domain record.
page number
The current page to get.

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/websites?limit=5&return_format=markdown&url=https%253A%252F%252Fspider.cloud', 
  headers=headers)

print(response.json())

Response

{
  "data": [
    {
      "id": "2a503c02-f161-444b-b1fa-03a3914667b6",
      "user_id": "6bd06efa-bb0b-4f1f-a23f-05db0c4b1bfd",
      "url": "6bd06efa-bb0b-4f1f-a29f-05db0c4b1bfd/example.com/11377799497277734985.html",
      "domain": "spider.cloud",
      "created_at": "2024-04-18T15:40:25.667063+00:00",
      "updated_at": "2024-04-18T15:40:25.667063+00:00",
      "pathname": "/",
      "fts": "",
      "scheme": "https:",
      "last_checked_at": "2024-05-10T13:39:32.293017+00:00",
      "screenshot": null
    }
  ],
  "count": 100
}

Pages Collection

Get the pages/resources stored.

GEThttps://api.spider.cloud/data/pages

Request params

url string
Filter a single url record.
Test Url
limit string
The limit of records to get.
Crawl Limit
domain string
Filter a single domain record.
page number
The current page to get.

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/pages?limit=5&return_format=markdown&url=https%253A%252F%252Fspider.cloud', 
  headers=headers)

print(response.json())

Response

{
  "data": [
    {
      "id": "733b0d0f-e406-4229-949d-8068ade54752",
      "url": "6bd06efa-bb0b-4f1f-a29f-05db0c4b1bfd/spider.cloud/11377799497277734985.html",
      "url": "https://spider.cloud",
      "domain": "spider.cloud",
      "created_at": "2024-04-17T01:28:15.016975+00:00",
      "updated_at": "2024-04-17T01:28:15.016975+00:00",
      "proxy": true,
      "headless": true,
      "crawl_budget": null,
      "scheme": "https:",
      "last_checked_at": "2024-04-17T01:28:15.016975+00:00",
      "full_resources": false,
      "metadata": true,
      "gpt_config": null,
      "smart_mode": false,
      "fts": "'spider.cloud':1"
    }
  ],
  "count": 100
}

Pages Metadata Collection

Get the pages metadata/resources stored.

GEThttps://api.spider.cloud/data/pages_metadata

Request params

url string
Filter a single url record.
Test Url
limit string
The limit of records to get.
Crawl Limit
domain string
Filter a single domain record.
page number
The current page to get.

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/pages_metadata?limit=5&return_format=markdown&url=https%253A%252F%252Fspider.cloud', 
  headers=headers)

print(response.json())

Response

{
  "data": [
    {
      "id": "e27a1995-2abe-4319-acd1-3dd8258f0f49",
      "user_id": "253524cd-3f94-4ed1-83b3-f7fab134c3ff",
      "url": "253524cd-3f94-4ed1-83b3-f7fab134c3ff/www.google.com/11377799497277734985.html",
      "domain": "www.google.com",
      "resource_type": "html",
      "title": "spider.cloud - Google Search",
      "description": "",
      "file_size": 1253960,
      "embedding": null,
      "pathname": "/search",
      "created_at": "2024-05-18T17:40:16.4808+00:00",
      "updated_at": "2024-05-18T17:40:16.4808+00:00",
      "keywords": [
        "Fastest Web Crawler spider",
        "Web scraping"
      ],
      "labels": "Search Engine",
      "extracted_data": null,
      "fts": "'/search':1"
    }
  ],
  "count": 100
}

Leads Collection

Get the pages contacts stored.

GEThttps://api.spider.cloud/data/contacts

Request params

url string
Filter a single url record.
Test Url
limit string
The limit of records to get.
Crawl Limit
domain string
Filter a single domain record.
page number
The current page to get.

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/contacts?limit=5&return_format=markdown&url=https%253A%252F%252Fspider.cloud', 
  headers=headers)

print(response.json())

Response

{
  "data": [
    {
      "full_name": "John Doe",
      "email": "johndoe@gmail.com",
      "phone": "555-555-555",
      "title": "Baker"
    }
  ],
  "count": 100
}

Crawl State

Get the state of the crawl for the domain.

GEThttps://api.spider.cloud/data/crawl_state

Request params

url string
Filter a single url record.
Test Url
limit string
The limit of records to get.
Crawl Limit
domain string
Filter a single domain record.
page number
The current page to get.

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/crawl_state?limit=5&return_format=markdown&url=https%253A%252F%252Fspider.cloud', 
  headers=headers)

print(response.json())

Response

{
  "data": {
    "id": "195bf2f2-2821-421d-b89c-f27e57ca71fh",
    "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg",
    "domain": "spider.cloud",
    "url": "https://spider.cloud",
    "links": 1,
    "credits_used": 3,
    "mode": 2,
    "crawl_duration": 340,
    "message": null,
    "request_user_agent": "Spider",
    "level": "UI",
    "status_code": 0,
    "created_at": "2024-04-21T01:21:32.886863+00:00",
    "updated_at": "2024-04-21T01:21:32.886863+00:00"
  },
  "error": null
}

Logs

Get the last 24 hours of logs.

GEThttps://api.spider.cloud/data/crawl_logs

Request params

url string
Filter a single url record.
Test Url
limit string
The limit of records to get.
Crawl Limit
domain string
Filter a single domain record.
page number
The current page to get.

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/crawl_logs?limit=5&return_format=markdown&url=https%253A%252F%252Fspider.cloud', 
  headers=headers)

print(response.json())

Response

{
  "data": {
    "id": "195bf2f2-2821-421d-b89c-f27e57ca71fh",
    "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg",
    "domain": "spider.cloud",
    "url": "https://spider.cloud",
    "links": 1,
    "credits_used": 3,
    "mode": 2,
    "crawl_duration": 340,
    "message": null,
    "request_user_agent": "Spider",
    "level": "UI",
    "status_code": 0,
    "created_at": "2024-04-21T01:21:32.886863+00:00",
    "updated_at": "2024-04-21T01:21:32.886863+00:00"
  },
  "error": null
}

Credits

Get the remaining credits available.

GEThttps://api.spider.cloud/data/credits

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/credits?limit=5&return_format=markdown&url=https%253A%252F%252Fspider.cloud', 
  headers=headers)

print(response.json())

Response

{
  "data": {
    "id": "8d662167-5a5f-41aa-9cb8-0cbb7d536891",
    "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg",
    "credits": 53334,
    "created_at": "2024-04-21T01:21:32.886863+00:00",
    "updated_at": "2024-04-21T01:21:32.886863+00:00"
  }
}

Crons

Get the cron jobs that are set to keep data fresh.

GEThttps://api.spider.cloud/data/crons

Request params

url string
Filter a single url record.
Test Url
limit string
The limit of records to get.
Crawl Limit
domain string
Filter a single domain record.
page number
The current page to get.

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/crons?limit=5&return_format=markdown&url=https%253A%252F%252Fspider.cloud', 
  headers=headers)

print(response.json())

Response

User Profile

Get the profile of the user. This returns data such as approved limits and usage for the month.

GEThttps://api.spider.cloud/data/profile

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/crawl?limit=5&return_format=markdown&url=https%253A%252F%252Fspider.cloud', 
  headers=headers)

print(response.json())

Response

{
  "data": {
    "id": "6bd06efa-bb0b-4f1f-a29f-05db0c4b1bfd",
    "email": "user@gmail.com",
    "stripe_id": "cus_OYO2rAhSQaYqHT",
    "is_deleted": null,
    "proxy": null,
    "headless": false,
    "billing_limit": 50,
    "billing_limit_soft": 120,
    "approved_usage": 0,
    "crawl_budget": {
      "*": 200
    },
    "usage": null,
    "has_subscription": false,
    "depth": null,
    "full_resources": false,
    "meta_data": true,
    "billing_allowed": false,
    "initial_promo": false
  }
}

User-Agents

Get a real user agent to use for crawling.

GEThttps://api.spider.cloud/data/user_agents

Request params

limit string
The limit of records to get.
Crawl Limit
os string
Filter a by a device ex: Android, Mac OS, Android, Windows, Linux and more.
page number
The current page to get.
platform string
Filter a by a platform ex: Chrome, Edge, Safari, Firefox and more.

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/user_agents?limit=5&return_format=markdown&url=https%253A%252F%252Fspider.cloud', 
  headers=headers)

print(response.json())

Response

{
  "data": {
    "agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "platform": "Chrome",
    "platform_version": "123.0.0.0",
    "device": "Macintosh",
    "os": "Mac OS",
    "os_version": "10.15.7",
    "cpu_architecture": "",
    "mobile": false,
    "device_type": "desktop"
  }
}

Download file

Download a resource from storage.

GEThttps://api.spider.cloud/data/download

Request body

url string
The exact path of the url that you want to get.
Test Url
domain string
The website domain you want to query.
pathname string
The website pathname you want to query.

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/download?url=https%253A%252F%252Fspider.cloud', 
  headers=headers)

print(response.json())

Response

{
  "data": "<file>"
}

Website

Create or update a website by configuration.

POSThttps://api.spider.cloud/data/websites

Request body

url required string
The URI resource to crawl. This can be a comma split list for multiple URLs.
To reduce latency, enhance performance, and save on rate limits batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
Test Url
limit number
The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.
It is better to set a limit upfront on websites where you do not know the size. Re-crawling can effectively use cache to keep costs low as new pages are found.
Crawl Limit
store_data boolean
Decide whether to store data. If enabled, this option overrides storageless. The default setting is false. Storage costs $0.30 per GB per month.
Set to true to collect resources to download and re-use later on. *Please note: Storing data incurs fee of $0.30/gigabytes/month or daily $0.01.
lite_mode boolean
Lite mode reduces data transfer costs by 70%, with trade-offs in speed, accuracy, geo-targeting, and reliability. It’s best suited for non-urgent data collection or when targeting websites with minimal anti-bot protections.

request string
The request type to perform. Possible values are http, chrome, and smart. Use smart to perform HTTP request by default until JavaScript rendering is needed for the HTML. Defaults to smart.
The request greatly influences how the output is going to look. If the page is server-side rendered, you can stick to the defaults for the most part.
depth number
The crawl limit for maximum depth. If 0, no limit will be applied. The default is set to 25.
Depth allows you to place a distance between the base URL path and sub paths.
Crawl Depth
metadata boolean
Collect metadata about the content found like page title, description, keywards and etc. This could help improve AI interoperability. Defaults to false.
Using metadata can help extract critical information to use for AI.
session boolean
Persist the session for the client that you use on a website. This allows the HTTP headers and cookies to be set like a real browser session. Defaults to true.
request_timeout number
The timeout to use for request. Timeouts can be from 5-255. The Defaults to 60 seconds.
The timeout of the request helps prevent long request times from hanging.
wait_for object
The wait_for parameter allows you to specify various waiting conditions for a website operation. If provided, it contains the following sub-parameters:
The key idle_network specifies the conditions to wait for the network to be idle. It can include an optional timeout value.
The key selector specifies the conditions to wait for a particular CSS selector to be found on the page. It includes an optional timeout value, and the CSS selector to wait for.
The key dom specifies the conditions to wait for a particular element to stop updating for a duration on the page. It includes an optional timeout value, and the CSS selector to wait for.
The key delay specifies a delay to wait for, with an optional timeout value.
The key page_navigationsset to true then waiting for all page navigations will be handled.
If wait_for is not provided, the default behavior is to wait for the network to be idle for 500 milliseconds. All of the durations are capped at capped at 60 seconds.
The values for the timeout duration are in the object shape { secs: 10, nanos: 0 }.
webhooks object
Use webhooks to get notified on events like credit depleted, new pages, metadata, and website status. { destination: string, on_credits_depleted: bool, on_credits_half_depleted: bool, on_website_status: bool, on_find: bool, on_find_metadata: bool }
user_agent string
Add a custom HTTP user agent to the request. By default this is set to a random agent.
sitemap boolean
Include the sitemap results to crawl. The Defaults to false.
The sitemap allows you to include links that may not be exposed in the HTML.
sitemap_only boolean
Only include the sitemap results to crawl. The Defaults to false.
Using this option allows you to get only the pages on the sitemap without crawling the entire website.
sitemap_path string
The sitemap url to use when using sitemap. The default is sitemap.xml.
The sitemap_path allows you to include the url of a unique sitemap.
subdomains boolean
Allow subdomains to be included. Defaults to false.
tld boolean
Allow TLD's to be included. Defaults to false.
root_selector string
The root CSS query selector to use extracting content from the markup for the response.
Test Query Selector
preserve_host boolean
Preserve the default HOST header for the client. This may help bypass pages that require a HOST, and when the TLS cannot be determined. The default setting is false.
full_resources boolean
Crawl and download all the resources for a website.
Collect all the content from the website, including assets like images, videos, etc.
redirect_policy string
The network redirect policy to use when performing HTTP request. Set the value to Loose to allow all redirects to any domain. Defaults to Loose only allowing redirects on the same domain. Possible values are Loose, Strict, and None.
Loose will only capture the initial page redirect to the resource. You need to include the website in external_domains to allow the request to continue crawling if it is outside of the domain.
external_domains array
A list of external domains to treat as one domain. You can use regex paths to include the domains. Set one of the array values to * to include all domains.
exclude_selector string
A CSS query selector to use for ignoring content from the markup of the response.
concurrency_limit number
Set the concurrency limit to help balance request for slower websites. The default is unlimited.
execution_scripts object
Run custom JavaScript on certain paths. This helps with authenticated pages. The request mode to be made through chrome or smart to run. The values for the timeout duration should be in the object shape "/path_or_url": "custom js".
Custom scripts allow you to take control of the browser with events for up to 60 seconds at a time per page.
disable_intercept boolean
Disable request interception when running request as chrome or smart. This may help if you bypass pages that use third-party scripts or external domains. The default setting is false. Note that cost and speed may increase when disabling this feature, as it removes our native Chrome interception.
block_ads boolean
Block advertisements when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
block_analytics boolean
Block analytics when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
block_stylesheets boolean
Block stylesheets when running request as chrome or smart. This can greatly increase performance. The default setting is true. Note that cost and speed might increase when disabling this feature.
run_in_background boolean
Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard. This param needs storageless set to false or webhooks to be enabled.
chunking_alg object
Use a chunking algorithm to segment your content output. Possible values are ByWords, ByLines, ByCharacterLength, and BySentence. The following is an example of the object shape for "chunking_alg": { "type": "bysentence", "value": 2 } which splits the text into an array by every 2 sentences found. This works well when used with markdown or text formats.
The chunking algorithm allows you to prepare the content for the AI quick, without needing to add extra code or loaders.
budget object
Object that has paths with a counter for limiting the amount of pages example {"*":1} for only crawling the root page. The wildcard matches all routes and you can set child paths preventing a depth level, example of limiting { "/docs/colors": 10, "/docs/": 100 } which only allows a max of 100 pages if the route matches /docs/:pathname and only 10 pages if it matches /docs/colors/:pathname.
The budget explicitly allows you to set paths and limits for the crawl.
Crawl Budget
max_credits_per_page number
Set the maximum number of credits to use per page. Credits are measured in decimal units, where 10,000 credits equal one dollar (100 credits per penny).
event_tracker object
Track the event request and responses sent when using browser rendering.
crawl_timeout object
The crawl_timeout parameter allows you to put a max duration on the entire crawl. The default setting is 2 mins.
The values for the timeout duration are in the object shape { secs: 300, nanos: 0 }.
cron string
Set a cron period to run the website crawls automatically. Possible values are daily, weekly, and monthly.

return_format string | array
The format to return the data in. Possible values are markdown, commonmark, raw, text, xml, bytes, and empty. Use raw to return the default format of the page like HTML etc. Defaults to raw.
Usually you want to use markdown for LLM processing or text. If you need to store the files without losing any encoding, use the bytes or raw format.
readability boolean
Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage. Defaults to false.
This uses the Safari Reader Mode algorithm to extract only important information from the content.
css_extraction_map object
Use CSS or XPath selectors to scrape contents from the web page. Set the paths and the extraction object map to perform extractions per path or page.
Scrape content using CSS selectors to get data. You can scrape using selectors at no extra cost.
clean_html bool
Clean the HTML of unwanted attributes.
return_json_data boolean
Return the json data found in scripts used for SSR. Defaults to false unless you have the website already stored with the configuration enabled.
This is useful for getting JSON ready data for LLM's and getting data from websites built with Next.js etc.
return_headers boolean
Return the HTTP response headers with the results. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the HTTP headers can help setup authentication flows.
return_cookies boolean
Return the HTTP response cookies with the results. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the HTTP cookies can help setup authentication SSR flows.
return_page_links boolean
Return the links found on each page. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the links can help index the reference locations found for the resource.
filter_output_svg bool
Filter the svg tags from the output.
filter_output_images bool
Filter the images from the output.
filter_output_main_only bool
Filter the nav, aside, and footer from the output.
encoding string
The type of encoding to use like UTF-8, SHIFT_JIS, or etc.
Perform the encoding on the server when you know in advance the type of website.
return_embeddings boolean
Include OpenAI embeddings for title and description. The default is false and the param requires metadata to be used in conjunction.
If you are embedding data, you can use these matrices as staples for most vector baseline operations.

proxy 'residential' | 'residential_fast' | 'residential_static' | 'mobile' | 'isp' | 'residential_premium' | 'residential_core' | 'residential_plus'
Select the proxy pool you’d like to use for this request. Leave it blank to disable proxy routing. Supported values are:
- residential – entry-level residential pool.
- residential_fast – high-speed residential pool for heavier throughput.
- residential_static – static residential IPs for long-lived sessions.
- mobile – 4G / 5G mobile proxies for maximum stealth.
- isp (datacenter alias) – ISP-grade / datacenter-style routing with residential ASN.
- residential_premium – lower-latency premium residential pool.
- residential_core – balanced cost/quality residential pool.
- residential_plus – largest, highest-quality residential pool.
Each pool carries a different price multiplier (from ×1.2 for residential up to ×2.0 for mobile, and ×4.0 for residential_premium). See the pricing table for full details. Using this param overrides all other proxy_* shorthand configurations.
remote_proxy string
Use a remote external proxy connection. You also save 70% on data transfer costs when you bring your own proxy.
Use your own proxy to bypass any firewall as needed or connect to private web servers.
cookies string
Add HTTP cookies to use for request.
Set the cookie value for pages that use SSR authentication.
headers object
Forward HTTP headers to use for all request. The object is expected to be a map of key value pairs.
Using HTTP headers can help with authenticated pages that use the authorization header field.
blacklist array
Blacklist a set of paths that you do not want to crawl. You can use Regex patterns to help with the list.
whitelist array
Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns to help with the list.
fingerprint boolean
Use advanced fingerprint detection for chrome. Defaults to true.
Set this value to help crawl when websites require a fingerprint.
anti_bot boolean
Enable anti-bot mode using various techniques to increase the chance of success. Defaults to true.
This config will attempt to make request resemble a real human. If the request fails on chrome it will retry using a virtual display for the request that is slower difficult to block at the cost of speed.
stealth boolean
Use stealth mode for headless chrome request to help prevent being blocked. This defaults to true.
Set to true to almost guarantee not being detected by anything.
proxy_datacenter boolean
Enable large pool of datacenter proxies. Defaults to false.
proxy_enabled boolean
Enable premium high performance proxies to prevent detection and increase speed. Defaults to false.
Using this configuration can help when network requests are blocked. This setup increases the cost for file_cost and bytes_transferred_cost, but only by 1.5 times.
proxy_lightning boolean
Enable high performance residential proxies. Defaults to false.
proxy_mobile boolean
Enable mobile proxies for undetectable crawling at the network layer. Defaults to false. The credits cost for file_cost and bytes_transferred_cost is doubled when set to true.

scroll number
Infinite scroll the page as new content loads, up to a duration in milliseconds. The duration represents the maximum time you would wait to scroll. You may still need to use the wait_for parameters. You also need to ensure that the request is made using chrome.
Use the wait_for configuration to scroll until and disable_intercept to make sure you get data from the network regardless of hostname.
device object
Configure the device for chrome. One of mobile, tablet, or desktop. Defaults to desktop.
viewport object
Configure the viewport for chrome. Defaults to 800x600.
If you need to get data from a website as a mobile, set the viewport to a phone device's size ex: 375x414.
automation_scripts object
Run custom web automated tasks on certain paths. The request mode to be made through chrome or smart to run.
Custom web automation allows you to take control of the browser with events for up to 60 seconds at a time per page.
Below are the available actions for web automation:
Evaluate: Runs custom JavaScript code.
{ "Evaluate": "console.log('Hello, World!');" }
Click: Clicks on an element identified by a CSS selector.
{ "Click": "button#submit" }
Wait: Waits for a specified duration in milliseconds.
{ "Wait": 2000 }
WaitForNavigation: Waits for the next navigation event.
{ "WaitForNavigation": true }
WaitFor: Waits for an element to appear identified by a CSS selector.
{ "WaitFor": "div#content" }
WaitForAndClick: Waits for an element to appear and then clicks on it, identified by a CSS selector.
{ "WaitForAndClick": "button#loadMore" }
ScrollX: Scrolls the screen horizontally by a specified number of pixels.
{ "ScrollX": 100 }
ScrollY: Scrolls the screen vertically by a specified number of pixels.
{ "ScrollY": 200 }
Fill: Fills an input element with a specified value.
{ "Fill": { "selector": "input#name", "value": "John Doe" } }
ValidateChain: Set this before a step to validate the prior action to break out of the chain.
{ "ValidateChain": true }
InfiniteScroll: Scrolls the page until the end for certain duration.
{ "InfiniteScroll": 3000 }
gpt_config object
Use AI to generate actions to perform during the crawl. You can pass an array for the prompt to chain steps.
Perform event-driven browser actions on a web page or extract data using the gpt_config, you need to use your own API key or have at least $100 in credits to use AI.
evaluate_on_new_document string
Set a custom script to evaluate on new document creation.

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/crawl', 
  headers=headers, json=json_data)

print(response.json())

Response

{
  "data": null
}

Website

Delete a website from your collection. Remove the url body to delete all websites.

DELETEhttps://api.spider.cloud/data/websites

Request body

url required string
The URI resource to crawl. This can be a comma split list for multiple URLs.
To reduce latency, enhance performance, and save on rate limits batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
Test Url

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/data/websites', 
  headers=headers, json=json_data)

print(response.json())

Response

{
  "data": null
}

Pages

Delete a web page from your collection. Remove the url body to delete all pages.

DELETEhttps://api.spider.cloud/data/pages

Request body

url required string
The URI resource to crawl. This can be a comma split list for multiple URLs.
To reduce latency, enhance performance, and save on rate limits batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
Test Url

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/data/pages', 
  headers=headers, json=json_data)

print(response.json())

Response

{
  "data": null
}

Pages Metadata

Delete a web page metadata from your collection. Remove the url body to delete all pages metadata.

DELETEhttps://api.spider.cloud/data/pages_metadata

Request body

url required string
The URI resource to crawl. This can be a comma split list for multiple URLs.
To reduce latency, enhance performance, and save on rate limits batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
Test Url

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/data/pages_metadata', 
  headers=headers, json=json_data)

print(response.json())

Response

{
  "data": null
}

Leads

Delete a contact or lead from your collection. Remove the url body to delete all contacts.

DELETEhttps://api.spider.cloud/data/contacts

Request body

url required string
The URI resource to crawl. This can be a comma split list for multiple URLs.
To reduce latency, enhance performance, and save on rate limits batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
Test Url

Request

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/data/contacts', 
  headers=headers, json=json_data)

print(response.json())

Response

{
  "data": null
}

API Reference

Just getting started?

Not a developer?

Crawl

url required string

limit number

store_data boolean

lite_mode boolean

request string

depth number

metadata boolean

session boolean

request_timeout number

wait_for object

webhooks object

user_agent string

sitemap boolean

sitemap_only boolean

sitemap_path string

subdomains boolean

tld boolean

root_selector string

preserve_host boolean

full_resources boolean

redirect_policy string

external_domains array

exclude_selector string

concurrency_limit number

execution_scripts object

disable_intercept boolean

block_ads boolean

block_analytics boolean

block_stylesheets boolean

run_in_background boolean

chunking_alg object

budget object

max_credits_per_page number

event_tracker object

crawl_timeout object

return_format string | array

readability boolean

css_extraction_map object

clean_html bool

return_json_data boolean

return_headers boolean

return_cookies boolean

return_page_links boolean

filter_output_svg bool

filter_output_images bool

filter_output_main_only bool

encoding string

return_embeddings boolean

proxy 'residential' | 'residential_fast' | 'residential_static' | 'mobile' | 'isp' | 'residential_premium' | 'residential_core' | 'residential_plus'

remote_proxy string

cookies string

headers object

blacklist array

whitelist array

fingerprint boolean

anti_bot boolean

stealth boolean

proxy_datacenter boolean

proxy_enabled boolean

proxy_lightning boolean

proxy_mobile boolean

cache boolean

delay number

respect_robots boolean

skip_config_checks boolean

service_worker_enabled boolean

storageless boolean

scroll number

device object

viewport object

automation_scripts object

gpt_config object

evaluate_on_new_document string

country_code string

locale string

Scrape