API Reference
The Spider API is based on REST. Our API is predictable, returns JSON-encoded responses, uses standardized HTTP response codes, and requires authentication. The API supports bulk updates: you can work on multiple objects per request for the core endpoints.
Authentication
Include your API key in the Authorization header.
Authorization: Bearer sk-xxxx...
Response formats
Set the Content-Type header to shape the response.
Prefix any path with v1 to lock the version. Requests on this page consume live credits.
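For example, a minimal authenticated request in Python, mirroring the request examples later on this page (the /v1 prefix is optional and shown only to pin the API version):
import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',  # your API key
    'Content-Type': 'application/json',
}
# The /v1 prefix locks the API version; omit it to track the latest.
response = requests.post('https://api.spider.cloud/v1/crawl',
    headers=headers, json={'url': 'https://spider.cloud', 'limit': 1})
print(response.status_code)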
Just getting started? Quickstart guide →
Not a developer? Use Spider's no-code options to get started without writing code.
Base URL: https://api.spider.cloud
Crawl
Start crawling website(s) to collect resources. You can pass an array of objects for the request body. This endpoint is also available via Proxy-Mode.
POST https://api.spider.cloud/crawl
- url (string, required)
  The URI resource to crawl. This can be a comma-separated list for multiple URLs.
  Tip: To reduce latency, enhance performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
- limit (number, default: 0)
  The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.
  Tip: It is better to set a limit upfront on websites where you do not know the size. Re-crawling can effectively use cache to keep costs low as new pages are found.
- disable_hints (boolean)
  Disables service-provided hints that automatically optimize request types, geo-region selection, and network filters (for example, updating network_blacklist/network_whitelist recommendations based on observed request-pattern outcomes). Hints are enabled by default for all smart request modes. Enable this if you want fully manual control over filtering behavior, are debugging request load order or coverage, or need deterministic behavior across runs.
  Tip: If you're tuning filters, keep hints enabled and pair with event_tracker to see the complete URL list; once stable, you can flip disable_hints on to lock behavior.
- lite_mode (boolean)
  Lite mode reduces data transfer costs by 50%, with trade-offs in speed, accuracy, geo-targeting, and reliability. It's best suited for non-urgent data collection or when targeting websites with minimal anti-bot protections.
- network_blacklist (string[])
  Blocks matching network requests from being fetched/loaded. Use this to reduce bandwidth and noise by preventing known-unneeded third-party resources from ever being requested. Each entry is a string match pattern (commonly a hostname, domain, or URL substring). If both whitelist and blacklist are set, whitelist takes precedence.
  Good targets: googletagmanager.com, doubleclick.net, maps.googleapis.com. Prefer specific domains over broad substrings to avoid breaking essential assets.
  Tip: Pair this with event_tracker to capture the full list of URLs your session attempted to fetch, so you can quickly discover what to block (or allow) next.
- network_whitelist (string[])
  Allows only matching network requests to be fetched/loaded. Use this for a strict "allowlist-first" approach: keep the crawl lightweight while still permitting the essential scripts/styles needed for rendering and JS execution. Each entry is a string match pattern (commonly a hostname, domain, or URL substring). When set, requests not matching any whitelist entry are blocked by default.
  Start with first-party domains such as example.com and cdn.example.com, add only what you observe you truly need (fonts/CDNs), then iterate. A combined example follows this entry.
  Tip: Pair this with event_tracker to capture the full list of URLs your session attempted to fetch, so you can tune your allowlist quickly and safely.
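For example, a request body that trims third-party noise while keeping control of what loads might look like this sketch (the domains are illustrative placeholders):
json_data = {
    'url': 'https://example.com',
    'limit': 10,
    # Block known third-party resources from ever being requested.
    'network_blacklist': ['googletagmanager.com', 'doubleclick.net'],
    # Or allow only first-party hosts; the whitelist wins if both are set.
    # 'network_whitelist': ['example.com', 'cdn.example.com'],
}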
- request (string, default: smart; options: http, chrome, smart)
  The request type to perform. Use smart to perform an HTTP request by default, upgrading to JavaScript rendering only when it is needed for the HTML.
  Tip: The request mode greatly influences how the output will look. If the page is server-side rendered, you can stick to the defaults.
- depth (number, default: 25)
  The maximum crawl depth. If 0, no limit is applied.
  Tip: Depth lets you place a distance between the base URL path and sub-paths.
- metadata (boolean, default: false)
  Collect metadata about the content found, like page title, description, and keywords. This can help improve AI interoperability.
  Tip: Using metadata can help extract critical information to use for AI.
- session (boolean, default: true)
  Persist the session for the client that you use on a website. This allows the HTTP headers and cookies to be set like a real browser session.
- request_timeout (number, default: 60)
  The timeout to use for the request, from 5 to 255 seconds.
  Tip: The timeout helps prevent long requests from hanging.
- wait_for (object)
  The wait_for parameter allows you to specify various waiting conditions for a website operation. If provided, it can contain the following sub-parameters (a request sketch follows this entry):
  - idle_network: wait for the network to be idle within a period. Can include an optional timeout value.
  - idle_network0: wait for the network to be idle, with a max timeout. Can include an optional timeout value.
  - almost_idle_network0: wait for the network to be almost idle, with a max timeout. Can include an optional timeout value.
  - selector: wait for a particular CSS selector to be found on the page. Includes an optional timeout value and the CSS selector to wait for.
  - dom: wait for a particular element to stop updating for a duration on the page. Includes an optional timeout value and the CSS selector to wait for.
  - delay: a delay to wait for, with an optional timeout value.
  - page_navigations: when set to true, all page navigations are waited for.
  If wait_for is not provided, the default behavior is to wait for the network to be idle for 500 milliseconds. All durations are capped at 60 seconds. Timeout durations use the object shape { secs: 10, nanos: 0 }.
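A sketch of a wait_for block based on the conditions above; the exact field names inside each condition are assumptions drawn from the descriptions, so treat them as illustrative:
json_data = {
    'url': 'https://example.com',
    'request': 'chrome',
    'wait_for': {
        # Wait for a specific element; assumed shape: selector plus optional timeout.
        'selector': {'selector': 'div#content', 'timeout': {'secs': 10, 'nanos': 0}},
        # Let the network settle before capturing the page.
        'idle_network': {'timeout': {'secs': 5, 'nanos': 0}},
    },
}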
- webhooks (object)
  Use webhooks to get notified on events like credits depleted, new pages, metadata, and website status. Shape: { destination: string, on_credits_depleted: bool, on_credits_half_depleted: bool, on_website_status: bool, on_find: bool, on_find_metadata: bool }
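For example, using the shape above (the destination URL is a placeholder):
json_data = {
    'url': 'https://example.com',
    'limit': 50,
    'webhooks': {
        'destination': 'https://hooks.example.com/spider',  # endpoint that receives the events
        'on_find': True,               # notify as new pages are found
        'on_credits_depleted': True,   # notify when credits run out
    },
}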
- user_agent (string)
  Add a custom HTTP user agent to the request. By default this is set to a random agent.
- sitemap (boolean, default: false)
  Include the sitemap results in the crawl.
  Tip: The sitemap allows you to include links that may not be exposed in the HTML.
- sitemap_only (boolean, default: false)
  Only include the sitemap results in the crawl.
  Tip: Using this option allows you to get only the pages on the sitemap without crawling the entire website.
- sitemap_path (string, default: sitemap.xml)
  The sitemap URL to use when sitemap is enabled.
- subdomains (boolean, default: false)
  Allow subdomains to be included.
- tld (boolean, default: false)
  Allow TLDs to be included.
- root_selector (string)
  The root CSS query selector to use when extracting content from the markup for the response.
- preserve_host (boolean, default: false)
  Preserve the default HOST header for the client. This may help bypass pages that require a HOST and when the TLS cannot be determined.
- full_resources (boolean)
  Crawl and download all the resources for a website.
  Tip: Collect all the content from the website, including assets like images and videos.
- redirect_policy (string, default: Loose; options: Loose, Strict, None)
  The network redirect policy to use when performing HTTP requests.
  Tip: Loose will only capture the initial page redirect to the resource. Include the website in external_domains to allow crawling outside of the domain.
- external_domains (array)
  A list of external domains to treat as one domain. You can use regex paths to include the domains. Set one of the array values to * to include all domains.
- exclude_selector (string)
  A CSS query selector to use for ignoring content from the markup of the response.
- concurrency_limit (number)
  Set the concurrency limit to help balance requests for slower websites. The default is unlimited.
- execution_scripts (object)
  Run custom JavaScript on certain paths. Requires chrome or smart request mode. The values should be in the shape "/path_or_url": "custom js".
  Tip: Custom scripts allow you to take control of the browser with events for up to 60 seconds at a time per page.
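For example, using the shape above (the path and script are placeholders):
json_data = {
    'url': 'https://example.com',
    'request': 'chrome',
    'execution_scripts': {
        # Run this snippet on the /pricing path before the content is captured.
        '/pricing': "document.querySelector('#show-more')?.click();",
    },
}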
- disable_intercept (boolean, default: false)
  Disable request interception when running the request as chrome or smart. This may help bypass pages that use third-party scripts or external domains.
  Tip: Cost and speed may increase when disabling this feature, as it removes native Chrome interception.
- block_ads (boolean, default: true)
  Block advertisements when running the request as chrome or smart. This can greatly increase performance.
  Tip: Cost and speed might increase when disabling this feature.
- block_analytics (boolean, default: true)
  Block analytics when running the request as chrome or smart. This can greatly increase performance.
  Tip: Cost and speed might increase when disabling this feature.
- block_stylesheets (boolean, default: true)
  Block stylesheets when running the request as chrome or smart. This can greatly increase performance.
  Tip: Cost and speed might increase when disabling this feature.
- run_in_background (boolean, default: false)
  Run the request in the background. Useful if you are storing data and want to trigger crawls to the dashboard.
  Tip: Requires storageless set to false or webhooks to be enabled.
- chunking_alg (object; types: ByWords, ByLines, ByCharacterLength, BySentence)
  Use a chunking algorithm to segment your content output. Pass an object like { "type": "bysentence", "value": 2 } to split the text into an array by every 2 sentences. Works well with markdown or text formats.
  Tip: The chunking algorithm allows you to prepare content for AI without needing extra code or loaders.
- budget (object)
  An object mapping paths to a counter that limits the number of pages crawled. Use {"*": 1} to crawl only the root page. The wildcard matches all routes, and you can set child paths to limit depth, e.g. { "/docs/colors": 10, "/docs/": 100 }.
  Tip: The budget explicitly allows you to set paths and limits for the crawl.
- max_credits_per_page (number)
  Set the maximum number of credits to use per page. Credits are measured in decimal units, where 10,000 credits equal one dollar (100 credits per penny).
- max_credits_allowed (number)
  Set the maximum number of credits to use per run. This will return a blocked-by-client response if the initial response is empty. Credits are measured in decimal units, where 10,000 credits equal one dollar (100 credits per penny).
- event_tracker (object)
  Track the event requests, responses, and automation output when using browser rendering. Pass an object with requests and responses for the network output of the page. automation will send detailed information, including a screenshot of each automation step used under automation_scripts.
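A sketch of tracking the network activity of a rendered page; the boolean flags are an assumption based on the field names above:
json_data = {
    'url': 'https://example.com',
    'request': 'chrome',
    'event_tracker': {
        'requests': True,    # record every URL the page attempted to fetch
        'responses': True,   # record the responses that came back
    },
}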
- blacklist (array)
  Blacklist a set of paths that you do not want to crawl. You can use regex patterns to help with the list.
- whitelist (array)
  Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns to help with the list.
- crawl_timeout (object)
  The crawl_timeout parameter allows you to put a max duration on the entire crawl. The default is 2 minutes. Timeout durations use the object shape { secs: 300, nanos: 0 }.
- data_connectors (object)
  Stream crawl results directly to cloud storage and data services. Configure one or more connectors to automatically receive page data as it is crawled. Supports S3, Google Cloud Storage, Google Sheets, Azure Blob Storage, and Supabase. Shape: { s3: { bucket, access_key_id, secret_access_key, region?, prefix?, content_type? }, gcs: { bucket, service_account_base64, prefix? }, google_sheets: { spreadsheet_id, service_account_base64, sheet_name? }, azure_blob: { connection_string, container, prefix? }, supabase: { url, anon_key, table }, on_find: bool, on_find_metadata: bool }
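For example, streaming results to an S3 bucket using the shape above (the bucket and credentials are placeholders):
json_data = {
    'url': 'https://example.com',
    'limit': 100,
    'data_connectors': {
        's3': {
            'bucket': 'my-crawl-bucket',
            'access_key_id': 'AKIA...',
            'secret_access_key': '...',
            'region': 'us-east-1',
            'prefix': 'spider/',
        },
        'on_find': True,  # push each page as soon as it is found
    },
}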
- return_format (string | array, default: raw; options: markdown, commonmark, raw, text, xml, bytes, empty)
  The format to return the data in. Use raw to return the default format of the page, such as HTML.
  Tip: Usually you want markdown or text for LLM processing. If you need to store the files without losing any encoding, use bytes or raw. PDF transformations may cost up to 1 cent per page for high accuracy.
- readability (boolean, default: false)
  Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
  Tip: This uses the Safari Reader Mode algorithm to extract only important information from the content.
- css_extraction_map (object)
  Use CSS or XPath selectors to scrape contents from the web page. Set the paths and the extraction object map to perform extractions per path or page.
  Tip: You can scrape using CSS selectors at no extra cost.
- link_rewrite (json)
  Optional URL rewrite rule applied to every discovered link before it's crawled. This lets you normalize or redirect URLs (for example, rewriting paths or mapping one host pattern to another). The value must be a JSON object with a type field. Supported types (an example follows this entry):
  - "replace": simple substring replacement. Fields: host? (string, optional): only apply when the link's host matches this value (e.g. "blog.example.com"); find (string): substring to search for in the URL; replace_with (string): replacement substring.
  - "regex": regex-based rewrite with capture groups. Fields: host? (string, optional): only apply for this host; pattern (string): regex applied to the full URL; replace_with (string): replacement string supporting $1, $2, etc.
  Invalid or unsafe regex patterns (overly long, unbalanced parentheses, advanced lookbehind constructs, etc.) are rejected by the server and ignored.
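For example, a substring rewrite scoped to one host:
json_data = {
    'url': 'https://example.com',
    'link_rewrite': {
        'type': 'replace',
        'host': 'blog.example.com',   # only rewrite links on this host
        'find': '/old-docs/',
        'replace_with': '/docs/',
    },
}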
- clean_htmlCrawl API -boolean
Clean the HTML of unwanted attributes.
- filter_svgCrawl API -boolean
Filter SVG elements from the markup.
- filter_imagesCrawl API -boolean
Filter image elements from the markup.
- filter_main_onlyCrawl API -booleanDefault: true
Filter the main content from the markup excluding
nav,footer, andasideelements. - return_json_dataCrawl API -booleanDefault: false
Return the JSON data found in scripts used for SSR.
TipUseful for getting JSON-ready data for LLMs and data from websites built with Next.js etc. - return_headersCrawl API -booleanDefault: false
Return the HTTP response headers with the results.
TipGetting the HTTP headers can help setup authentication flows. - return_cookiesCrawl API -booleanDefault: false
Return the HTTP response cookies with the results.
TipGetting the HTTP cookies can help setup authentication SSR flows. - return_page_linksCrawl API -booleanDefault: false
Return the links found on each page.
TipGetting the links can help index the reference locations found for the resource. - filter_output_svgCrawl API -boolean
Filter the svg tags from the output.
- filter_output_imagesCrawl API -boolean
Filter the images from the output.
- filter_output_main_onlyCrawl API -boolean
Filter the nav, aside, and footer from the output.
- encodingCrawl API -string
The type of encoding to use like
UTF-8,SHIFT_JIS, or etc. - return_embeddingsCrawl API -booleanDefault: false
Include OpenAI embeddings for
titleanddescription. Requiresmetadatato be enabled.TipIf you are embedding data, you can use these matrices as staples for most vector baseline operations.
- proxy (string; options: residential, mobile, isp)
  Select the proxy pool for this request. Leave blank to disable proxy routing. Using this param overrides all other proxy_* shorthand configurations. See the pricing table for full details. Alternatively, use Proxy-Mode to route standard HTTP traffic through Spider's proxy endpoint.
  Tip: Each pool carries a different price multiplier (from ×1.2 for residential up to ×2 for mobile).
- remote_proxy (string)
  Use a remote external proxy connection. You also save 50% on data transfer costs when you bring your own proxy.
  Tip: Use your own proxy to bypass a firewall as needed or to connect to private web servers.
- cookies (string)
  Add HTTP cookies to use for the request.
  Tip: Set the cookie value for pages that use SSR authentication.
- headers (object)
  Forward HTTP headers to use for all requests. The object is expected to be a map of key-value pairs.
  Tip: Using HTTP headers can help with authenticated pages that use the authorization header field.
- fingerprint (boolean, default: true)
  Use advanced fingerprint detection for chrome.
  Tip: Set this value to help crawl websites that require a fingerprint.
- stealth (boolean, default: true)
  Use stealth mode for headless chrome requests to help prevent being blocked.
  Tip: Set to true to greatly reduce the chance of being detected.
- proxy_enabled (boolean, default: false)
  Enable premium high-performance proxies to prevent detection and increase speed. You can also use Proxy-Mode to route requests through Spider's proxy front-end instead.
  Tip: Using this configuration can help when network requests are blocked. This setup increases the cost for file_cost and bytes_transferred_cost, but only by 1.5×.
- cache (boolean | { maxAge?: number; allowStale?: boolean; period?: string; skipBrowser?: boolean }, default: true)
  Use HTTP caching for the crawl to speed up repeated runs. Accepts either true/false or a cache control object with the following fields (an example follows this entry):
  - maxAge (ms): freshness window (default: 172800000 = 2 days). Set 0 to always fetch fresh.
  - allowStale: serve cached results even if stale.
  - period: RFC 3339 timestamp cutoff that overrides maxAge, e.g. "2025-11-29T12:00:00Z".
  - skipBrowser: skip the browser entirely if cached HTML exists; cached HTML is returned directly without launching Chrome, for instant responses.
  Default behavior by route type:
  - Standard routes (/crawl, /scrape, /unblocker): cache is true with skipBrowser enabled by default. Cached pages return instantly without re-launching Chrome. To force a fresh browser fetch, set cache: false or { "skipBrowser": false }.
  - AI routes (/ai/crawl, /ai/scrape, etc.): cache is true but skipBrowser is not enabled. AI routes always use the browser to ensure live page content for extraction.
  Tip: Caching saves costs on repeated runs. Standard routes skip the browser entirely when cached HTML exists, providing instant responses.
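For example, tightening the freshness window while keeping the browserless fast path:
json_data = {
    'url': 'https://example.com',
    'cache': {
        'maxAge': 3600000,     # consider cached pages older than 1 hour stale
        'allowStale': False,   # never serve stale entries
        'skipBrowser': True,   # return cached HTML without launching Chrome when possible
    },
}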
- delay (number, default: 0)
  Add a crawl delay of up to 60 seconds, disabling concurrency. The delay must be given in milliseconds.
  Tip: Using a delay can help with websites that are crawled on a cron schedule and do not require immediate data retrieval.
- respect_robots (boolean, default: true)
  Respect the robots.txt file when crawling.
  Tip: If you have trouble crawling a website, it may be an issue with the robots.txt file. Setting the value to false could help. Use this config sparingly.
- skip_config_checks (boolean, default: true)
  Skip checking the database for website configuration. This will increase performance for requests that use limit=1.
- service_worker_enabled (boolean, default: true)
  Allow the website to use Service Workers as needed.
  Tip: Enabling service workers can allow websites that explicitly run background tasks to load data.
- storageless (boolean, default: true)
  Prevent storing any type of data for the request, including storage.
- scroll (number)
  Infinite-scroll the page as new content loads, up to a duration in milliseconds. You may still need to use the wait_for parameters. Requires chrome request mode.
  Tip: Use wait_for to scroll until a condition is met and disable_intercept to get data from the network regardless of hostname.
- viewport (object)
  Configure the viewport for chrome.
  Tip: To emulate a mobile device, set the viewport to a phone device's size (e.g. 375x414).
- automation_scripts (object)
  Run custom automated web tasks on certain paths. Requires chrome or smart request mode. The available actions for web automation are listed below, followed by a request sketch.
  - Evaluate: Runs custom JavaScript code. { "Evaluate": "console.log('Hello, World!');" }
  - Click: Clicks on an element identified by a CSS selector. { "Click": "button#submit" }
  - ClickAll: Clicks on all elements matching a CSS selector. { "ClickAll": "button.loadMore" }
  - ClickPoint: Clicks at the given x and y coordinates. { "ClickPoint": { "x": 120.5, "y": 340.25 } }
  - ClickAllClickable: Clicks on common clickable elements (buttons/inputs/role=button/etc.). { "ClickAllClickable": true }
  - ClickHold: Clicks and holds on an element (via selector) for a duration in milliseconds. { "ClickHold": { "selector": "#sliderThumb", "hold_for_ms": 750 } }
  - ClickHoldPoint: Clicks and holds at a point for a duration in milliseconds. { "ClickHoldPoint": { "x": 250.0, "y": 410.0, "hold_for_ms": 750 } }
  - ClickDrag: Click-and-drag from one element to another (selector → selector) with an optional modifier. { "ClickDrag": { "from": "#handle", "to": "#target", "modifier": 8 } }
  - ClickDragPoint: Click-and-drag from one point to another with an optional modifier. { "ClickDragPoint": { "from_x": 100.0, "from_y": 200.0, "to_x": 500.0, "to_y": 220.0, "modifier": 0 } }
  - Wait: Waits for a specified duration in milliseconds. { "Wait": 2000 }
  - WaitForNavigation: Waits for the next navigation event. { "WaitForNavigation": true }
  - WaitFor: Waits for an element to appear, identified by a CSS selector. { "WaitFor": "div#content" }
  - WaitForWithTimeout: Waits for an element to appear, with a timeout (ms). { "WaitForWithTimeout": { "selector": "div#content", "timeout": 8000 } }
  - WaitForAndClick: Waits for an element to appear and then clicks on it, identified by a CSS selector. { "WaitForAndClick": "button#loadMore" }
  - WaitForDom: Waits for DOM updates to settle (quiet/stable) on a selector (or body), with a timeout (ms). { "WaitForDom": { "selector": "main", "timeout": 12000 } }
  - ScrollX: Scrolls the screen horizontally by a specified number of pixels. { "ScrollX": 100 }
  - ScrollY: Scrolls the screen vertically by a specified number of pixels. { "ScrollY": 200 }
  - Fill: Fills an input element with a specified value. { "Fill": { "selector": "input#name", "value": "John Doe" } }
  - Type: Types a key into the browser with an optional modifier. { "Type": { "value": "John Doe", "modifier": 0 } }
  - InfiniteScroll: Scrolls the page until the end for a certain duration. { "InfiniteScroll": 3000 }
  - Screenshot: Takes a screenshot of the page. { "Screenshot": { "full_page": true, "omit_background": true, "output": "out.png" } }
  - ValidateChain: Set this before a step to validate the prior action and break out of the chain if it failed. { "ValidateChain": true }
  Tip: Custom web automation allows you to take control of the browser with events for up to 60 seconds at a time per page.
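As mentioned above, here is a request sketch combining a few actions; the outer per-path keying mirrors execution_scripts and is an assumption, and the selectors are placeholders:
json_data = {
    'url': 'https://example.com',
    'request': 'chrome',
    'automation_scripts': {
        '/products': [
            {'WaitFor': 'div#grid'},          # wait for the listing to render
            {'ClickAll': 'button.loadMore'},  # expand lazy-loaded sections
            {'ScrollY': 2000},                # scroll to trigger more content
            {'Wait': 1500},                   # give pending requests time to finish
        ],
    },
}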
- evaluate_on_new_document (string)
  Set a custom script to evaluate on new document creation.
- country_code (string)
  Set an ISO country code for proxy connections. View the locations list for available countries.
  Tip: The country code allows you to run requests in regions where access to the website is restricted to that specific region.
- locale (string)
  The locale to use for the request, for example en-US.
import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

json_data = {"limit": 5, "return_format": "markdown", "url": "https://spider.cloud"}

response = requests.post('https://api.spider.cloud/crawl',
    headers=headers, json=json_data)
print(response.json())

Response:
[
  {
    "content": "<resource>...",
    "error": null,
    "status": 200,
    "duration_elapsed_ms": 122,
    "costs": {
      "ai_cost": 0,
      "compute_cost": 0.00001,
      "file_cost": 0.00002,
      "bytes_transferred_cost": 0.00002,
      "total_cost": 0.00004,
      "transform_cost": 0.0001
    },
    "url": "https://spider.cloud"
  },
  // more content...
]
Scrape
Start scraping a single page on website(s) to collect resources. You can pass an array of objects for the request body. This endpoint is also available via Proxy-Mode.
POST https://api.spider.cloud/scrape
- url (string, required)
  The URI resource to crawl. This can be a comma-separated list for multiple URLs.
  Tip: To reduce latency, enhance performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
- disable_hints (boolean)
  Disables service-provided hints that automatically optimize request types, geo-region selection, and network filters (for example, updating network_blacklist/network_whitelist recommendations based on observed request-pattern outcomes). Hints are enabled by default for all smart request modes. Enable this if you want fully manual control over filtering behavior, are debugging request load order or coverage, or need deterministic behavior across runs.
  Tip: If you're tuning filters, keep hints enabled and pair with event_tracker to see the complete URL list; once stable, you can flip disable_hints on to lock behavior.
- lite_mode (boolean)
  Lite mode reduces data transfer costs by 50%, with trade-offs in speed, accuracy, geo-targeting, and reliability. It's best suited for non-urgent data collection or when targeting websites with minimal anti-bot protections.
- network_blacklist (string[])
  Blocks matching network requests from being fetched/loaded. Use this to reduce bandwidth and noise by preventing known-unneeded third-party resources from ever being requested. Each entry is a string match pattern (commonly a hostname, domain, or URL substring). If both whitelist and blacklist are set, whitelist takes precedence.
  Good targets: googletagmanager.com, doubleclick.net, maps.googleapis.com. Prefer specific domains over broad substrings to avoid breaking essential assets.
  Tip: Pair this with event_tracker to capture the full list of URLs your session attempted to fetch, so you can quickly discover what to block (or allow) next.
- network_whitelist (string[])
  Allows only matching network requests to be fetched/loaded. Use this for a strict "allowlist-first" approach: keep the crawl lightweight while still permitting the essential scripts/styles needed for rendering and JS execution. Each entry is a string match pattern (commonly a hostname, domain, or URL substring). When set, requests not matching any whitelist entry are blocked by default.
  Start with first-party domains such as example.com and cdn.example.com, add only what you observe you truly need (fonts/CDNs), then iterate.
  Tip: Pair this with event_tracker to capture the full list of URLs your session attempted to fetch, so you can tune your allowlist quickly and safely.
- request (string, default: smart; options: http, chrome, smart)
  The request type to perform. Use smart to perform an HTTP request by default, upgrading to JavaScript rendering only when it is needed for the HTML.
  Tip: The request mode greatly influences how the output will look. If the page is server-side rendered, you can stick to the defaults.
- metadata (boolean, default: false)
  Collect metadata about the content found, like page title, description, and keywords. This can help improve AI interoperability.
  Tip: Using metadata can help extract critical information to use for AI.
- session (boolean, default: true)
  Persist the session for the client that you use on a website. This allows the HTTP headers and cookies to be set like a real browser session.
- request_timeout (number, default: 60)
  The timeout to use for the request, from 5 to 255 seconds.
  Tip: The timeout helps prevent long requests from hanging.
- wait_for (object)
  The wait_for parameter allows you to specify various waiting conditions for a website operation. If provided, it can contain the following sub-parameters:
  - idle_network: wait for the network to be idle within a period. Can include an optional timeout value.
  - idle_network0: wait for the network to be idle, with a max timeout. Can include an optional timeout value.
  - almost_idle_network0: wait for the network to be almost idle, with a max timeout. Can include an optional timeout value.
  - selector: wait for a particular CSS selector to be found on the page. Includes an optional timeout value and the CSS selector to wait for.
  - dom: wait for a particular element to stop updating for a duration on the page. Includes an optional timeout value and the CSS selector to wait for.
  - delay: a delay to wait for, with an optional timeout value.
  - page_navigations: when set to true, all page navigations are waited for.
  If wait_for is not provided, the default behavior is to wait for the network to be idle for 500 milliseconds. All durations are capped at 60 seconds. Timeout durations use the object shape { secs: 10, nanos: 0 }.
- webhooks (object)
  Use webhooks to get notified on events like credits depleted, new pages, metadata, and website status. Shape: { destination: string, on_credits_depleted: bool, on_credits_half_depleted: bool, on_website_status: bool, on_find: bool, on_find_metadata: bool }
- user_agent (string)
  Add a custom HTTP user agent to the request. By default this is set to a random agent.
- sitemap (boolean, default: false)
  Include the sitemap results in the crawl.
  Tip: The sitemap allows you to include links that may not be exposed in the HTML.
- sitemap_only (boolean, default: false)
  Only include the sitemap results in the crawl.
  Tip: Using this option allows you to get only the pages on the sitemap without crawling the entire website.
- sitemap_path (string, default: sitemap.xml)
  The sitemap URL to use when sitemap is enabled.
- subdomains (boolean, default: false)
  Allow subdomains to be included.
- tld (boolean, default: false)
  Allow TLDs to be included.
- root_selector (string)
  The root CSS query selector to use when extracting content from the markup for the response.
- preserve_host (boolean, default: false)
  Preserve the default HOST header for the client. This may help bypass pages that require a HOST and when the TLS cannot be determined.
- full_resources (boolean)
  Crawl and download all the resources for a website.
  Tip: Collect all the content from the website, including assets like images and videos.
- redirect_policy (string, default: Loose; options: Loose, Strict, None)
  The network redirect policy to use when performing HTTP requests.
  Tip: Loose will only capture the initial page redirect to the resource. Include the website in external_domains to allow crawling outside of the domain.
- external_domains (array)
  A list of external domains to treat as one domain. You can use regex paths to include the domains. Set one of the array values to * to include all domains.
- exclude_selector (string)
  A CSS query selector to use for ignoring content from the markup of the response.
- concurrency_limit (number)
  Set the concurrency limit to help balance requests for slower websites. The default is unlimited.
- execution_scripts (object)
  Run custom JavaScript on certain paths. Requires chrome or smart request mode. The values should be in the shape "/path_or_url": "custom js".
  Tip: Custom scripts allow you to take control of the browser with events for up to 60 seconds at a time per page.
- disable_intercept (boolean, default: false)
  Disable request interception when running the request as chrome or smart. This may help bypass pages that use third-party scripts or external domains.
  Tip: Cost and speed may increase when disabling this feature, as it removes native Chrome interception.
- block_ads (boolean, default: true)
  Block advertisements when running the request as chrome or smart. This can greatly increase performance.
  Tip: Cost and speed might increase when disabling this feature.
- block_analytics (boolean, default: true)
  Block analytics when running the request as chrome or smart. This can greatly increase performance.
  Tip: Cost and speed might increase when disabling this feature.
- block_stylesheets (boolean, default: true)
  Block stylesheets when running the request as chrome or smart. This can greatly increase performance.
  Tip: Cost and speed might increase when disabling this feature.
- run_in_background (boolean, default: false)
  Run the request in the background. Useful if you are storing data and want to trigger crawls to the dashboard.
  Tip: Requires storageless set to false or webhooks to be enabled.
- chunking_alg (object; types: ByWords, ByLines, ByCharacterLength, BySentence)
  Use a chunking algorithm to segment your content output. Pass an object like { "type": "bysentence", "value": 2 } to split the text into an array by every 2 sentences. Works well with markdown or text formats.
  Tip: The chunking algorithm allows you to prepare content for AI without needing extra code or loaders.
- budget (object)
  An object mapping paths to a counter that limits the number of pages crawled. Use {"*": 1} to crawl only the root page. The wildcard matches all routes, and you can set child paths to limit depth, e.g. { "/docs/colors": 10, "/docs/": 100 }.
  Tip: The budget explicitly allows you to set paths and limits for the crawl.
- max_credits_per_page (number)
  Set the maximum number of credits to use per page. Credits are measured in decimal units, where 10,000 credits equal one dollar (100 credits per penny).
- max_credits_allowed (number)
  Set the maximum number of credits to use per run. This will return a blocked-by-client response if the initial response is empty. Credits are measured in decimal units, where 10,000 credits equal one dollar (100 credits per penny).
- event_tracker (object)
  Track the event requests, responses, and automation output when using browser rendering. Pass an object with requests and responses for the network output of the page. automation will send detailed information, including a screenshot of each automation step used under automation_scripts.
- blacklist (array)
  Blacklist a set of paths that you do not want to crawl. You can use regex patterns to help with the list.
- whitelist (array)
  Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns to help with the list.
- crawl_timeout (object)
  The crawl_timeout parameter allows you to put a max duration on the entire crawl. The default is 2 minutes. Timeout durations use the object shape { secs: 300, nanos: 0 }.
- data_connectors (object)
  Stream crawl results directly to cloud storage and data services. Configure one or more connectors to automatically receive page data as it is crawled. Supports S3, Google Cloud Storage, Google Sheets, Azure Blob Storage, and Supabase. Shape: { s3: { bucket, access_key_id, secret_access_key, region?, prefix?, content_type? }, gcs: { bucket, service_account_base64, prefix? }, google_sheets: { spreadsheet_id, service_account_base64, sheet_name? }, azure_blob: { connection_string, container, prefix? }, supabase: { url, anon_key, table }, on_find: bool, on_find_metadata: bool }
- full_page (boolean, default: true)
  Take a screenshot of the full page.
- return_format (string | array, default: raw; options: markdown, commonmark, raw, text, xml, bytes, empty)
  The format to return the data in. Use raw to return the default format of the page, such as HTML.
  Tip: Usually you want markdown or text for LLM processing. If you need to store the files without losing any encoding, use bytes or raw. PDF transformations may cost up to 1 cent per page for high accuracy.
- readability (boolean, default: false)
  Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
  Tip: This uses the Safari Reader Mode algorithm to extract only important information from the content.
- css_extraction_map (object)
  Use CSS or XPath selectors to scrape contents from the web page. Set the paths and the extraction object map to perform extractions per path or page.
  Tip: You can scrape using CSS selectors at no extra cost.
- link_rewrite (json)
  Optional URL rewrite rule applied to every discovered link before it's crawled. This lets you normalize or redirect URLs (for example, rewriting paths or mapping one host pattern to another). The value must be a JSON object with a type field. Supported types:
  - "replace": simple substring replacement. Fields: host? (string, optional): only apply when the link's host matches this value (e.g. "blog.example.com"); find (string): substring to search for in the URL; replace_with (string): replacement substring.
  - "regex": regex-based rewrite with capture groups. Fields: host? (string, optional): only apply for this host; pattern (string): regex applied to the full URL; replace_with (string): replacement string supporting $1, $2, etc.
  Invalid or unsafe regex patterns (overly long, unbalanced parentheses, advanced lookbehind constructs, etc.) are rejected by the server and ignored.
- clean_html (boolean)
  Clean the HTML of unwanted attributes.
- filter_svg (boolean)
  Filter SVG elements from the markup.
- filter_images (boolean)
  Filter image elements from the markup.
- filter_main_only (boolean, default: true)
  Filter for the main content of the markup, excluding nav, footer, and aside elements.
- return_json_data (boolean, default: false)
  Return the JSON data found in scripts used for SSR.
  Tip: Useful for getting JSON-ready data for LLMs and data from websites built with Next.js, etc.
- return_headers (boolean, default: false)
  Return the HTTP response headers with the results.
  Tip: Getting the HTTP headers can help set up authentication flows.
- return_cookies (boolean, default: false)
  Return the HTTP response cookies with the results.
  Tip: Getting the HTTP cookies can help set up authenticated SSR flows.
- return_page_links (boolean, default: false)
  Return the links found on each page.
  Tip: Getting the links can help index the reference locations found for the resource.
- filter_output_svg (boolean)
  Filter the svg tags from the output.
- filter_output_images (boolean)
  Filter the images from the output.
- filter_output_main_only (boolean)
  Filter the nav, aside, and footer from the output.
- encoding (string)
  The type of encoding to use, like UTF-8 or SHIFT_JIS.
- return_embeddings (boolean, default: false)
  Include OpenAI embeddings for title and description. Requires metadata to be enabled.
  Tip: If you are embedding data, you can use these vectors as a baseline for most vector operations.
- binary (boolean)
  Return the image as binary instead of base64.
- cdp_params (object, default: null)
  The settings to use to adjust clip, format, quality, and more.
- proxy (string; options: residential, mobile, isp)
  Select the proxy pool for this request. Leave blank to disable proxy routing. Using this param overrides all other proxy_* shorthand configurations. See the pricing table for full details. Alternatively, use Proxy-Mode to route standard HTTP traffic through Spider's proxy endpoint.
  Tip: Each pool carries a different price multiplier (from ×1.2 for residential up to ×2 for mobile).
- remote_proxy (string)
  Use a remote external proxy connection. You also save 50% on data transfer costs when you bring your own proxy.
  Tip: Use your own proxy to bypass a firewall as needed or to connect to private web servers.
- cookies (string)
  Add HTTP cookies to use for the request.
  Tip: Set the cookie value for pages that use SSR authentication.
- headers (object)
  Forward HTTP headers to use for all requests. The object is expected to be a map of key-value pairs.
  Tip: Using HTTP headers can help with authenticated pages that use the authorization header field.
- fingerprint (boolean, default: true)
  Use advanced fingerprint detection for chrome.
  Tip: Set this value to help crawl websites that require a fingerprint.
- stealth (boolean, default: true)
  Use stealth mode for headless chrome requests to help prevent being blocked.
  Tip: Set to true to greatly reduce the chance of being detected.
- proxy_enabled (boolean, default: false)
  Enable premium high-performance proxies to prevent detection and increase speed. You can also use Proxy-Mode to route requests through Spider's proxy front-end instead.
  Tip: Using this configuration can help when network requests are blocked. This setup increases the cost for file_cost and bytes_transferred_cost, but only by 1.5×.
- cache (boolean | { maxAge?: number; allowStale?: boolean; period?: string; skipBrowser?: boolean }, default: true)
  Use HTTP caching for the crawl to speed up repeated runs. Accepts either true/false or a cache control object with the following fields:
  - maxAge (ms): freshness window (default: 172800000 = 2 days). Set 0 to always fetch fresh.
  - allowStale: serve cached results even if stale.
  - period: RFC 3339 timestamp cutoff that overrides maxAge, e.g. "2025-11-29T12:00:00Z".
  - skipBrowser: skip the browser entirely if cached HTML exists; cached HTML is returned directly without launching Chrome, for instant responses.
  Default behavior by route type:
  - Standard routes (/crawl, /scrape, /unblocker): cache is true with skipBrowser enabled by default. Cached pages return instantly without re-launching Chrome. To force a fresh browser fetch, set cache: false or { "skipBrowser": false }.
  - AI routes (/ai/crawl, /ai/scrape, etc.): cache is true but skipBrowser is not enabled. AI routes always use the browser to ensure live page content for extraction.
  Tip: Caching saves costs on repeated runs. Standard routes skip the browser entirely when cached HTML exists, providing instant responses.
- respect_robots (boolean, default: true)
  Respect the robots.txt file when crawling.
  Tip: If you have trouble crawling a website, it may be an issue with the robots.txt file. Setting the value to false could help. Use this config sparingly.
- skip_config_checks (boolean, default: true)
  Skip checking the database for website configuration. This will increase performance for requests that use limit=1.
- service_worker_enabled (boolean, default: true)
  Allow the website to use Service Workers as needed.
  Tip: Enabling service workers can allow websites that explicitly run background tasks to load data.
- storageless (boolean, default: true)
  Prevent storing any type of data for the request, including storage.
- block_images (boolean, default: false)
  Block images from loading to speed up the screenshot.
- fast (boolean, default: true)
  Use fast screenshot mode for speed-optimized rendering. Set to false for high-fidelity rendering that supports iframes, complex PDFs, and accurate visual output.
- omit_background (boolean, default: false)
  Omit the background from loading. A combined example of the screenshot-related flags follows this entry.
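A sketch combining the screenshot-related flags above (values are illustrative; how the capture is returned is controlled by binary):
json_data = {
    'url': 'https://example.com',
    'request': 'chrome',
    'full_page': True,        # capture the entire page height
    'omit_background': True,  # drop the default background
    'fast': False,            # favor fidelity (iframes, complex PDFs) over speed
    'binary': True,           # return the image as binary instead of base64
}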
- scroll (number)
  Infinite-scroll the page as new content loads, up to a duration in milliseconds. You may still need to use the wait_for parameters. Requires chrome request mode.
  Tip: Use wait_for to scroll until a condition is met and disable_intercept to get data from the network regardless of hostname.
- viewport (object)
  Configure the viewport for chrome.
  Tip: To emulate a mobile device, set the viewport to a phone device's size (e.g. 375x414).
- automation_scripts (object)
  Run custom automated web tasks on certain paths. Requires chrome or smart request mode. The available actions for web automation are listed below.
  - Evaluate: Runs custom JavaScript code. { "Evaluate": "console.log('Hello, World!');" }
  - Click: Clicks on an element identified by a CSS selector. { "Click": "button#submit" }
  - ClickAll: Clicks on all elements matching a CSS selector. { "ClickAll": "button.loadMore" }
  - ClickPoint: Clicks at the given x and y coordinates. { "ClickPoint": { "x": 120.5, "y": 340.25 } }
  - ClickAllClickable: Clicks on common clickable elements (buttons/inputs/role=button/etc.). { "ClickAllClickable": true }
  - ClickHold: Clicks and holds on an element (via selector) for a duration in milliseconds. { "ClickHold": { "selector": "#sliderThumb", "hold_for_ms": 750 } }
  - ClickHoldPoint: Clicks and holds at a point for a duration in milliseconds. { "ClickHoldPoint": { "x": 250.0, "y": 410.0, "hold_for_ms": 750 } }
  - ClickDrag: Click-and-drag from one element to another (selector → selector) with an optional modifier. { "ClickDrag": { "from": "#handle", "to": "#target", "modifier": 8 } }
  - ClickDragPoint: Click-and-drag from one point to another with an optional modifier. { "ClickDragPoint": { "from_x": 100.0, "from_y": 200.0, "to_x": 500.0, "to_y": 220.0, "modifier": 0 } }
  - Wait: Waits for a specified duration in milliseconds. { "Wait": 2000 }
  - WaitForNavigation: Waits for the next navigation event. { "WaitForNavigation": true }
  - WaitFor: Waits for an element to appear, identified by a CSS selector. { "WaitFor": "div#content" }
  - WaitForWithTimeout: Waits for an element to appear, with a timeout (ms). { "WaitForWithTimeout": { "selector": "div#content", "timeout": 8000 } }
  - WaitForAndClick: Waits for an element to appear and then clicks on it, identified by a CSS selector. { "WaitForAndClick": "button#loadMore" }
  - WaitForDom: Waits for DOM updates to settle (quiet/stable) on a selector (or body), with a timeout (ms). { "WaitForDom": { "selector": "main", "timeout": 12000 } }
  - ScrollX: Scrolls the screen horizontally by a specified number of pixels. { "ScrollX": 100 }
  - ScrollY: Scrolls the screen vertically by a specified number of pixels. { "ScrollY": 200 }
  - Fill: Fills an input element with a specified value. { "Fill": { "selector": "input#name", "value": "John Doe" } }
  - Type: Types a key into the browser with an optional modifier. { "Type": { "value": "John Doe", "modifier": 0 } }
  - InfiniteScroll: Scrolls the page until the end for a certain duration. { "InfiniteScroll": 3000 }
  - Screenshot: Takes a screenshot of the page. { "Screenshot": { "full_page": true, "omit_background": true, "output": "out.png" } }
  - ValidateChain: Set this before a step to validate the prior action and break out of the chain if it failed. { "ValidateChain": true }
  Tip: Custom web automation allows you to take control of the browser with events for up to 60 seconds at a time per page.
- evaluate_on_new_document (string)
  Set a custom script to evaluate on new document creation.
- country_code (string)
  Set an ISO country code for proxy connections. View the locations list for available countries.
  Tip: The country code allows you to run requests in regions where access to the website is restricted to that specific region.
- locale (string)
  The locale to use for the request, for example en-US.
import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

json_data = {"return_format": "markdown", "url": "https://spider.cloud"}

response = requests.post('https://api.spider.cloud/scrape',
    headers=headers, json=json_data)
print(response.json())

Response:
[
  {
    "content": "<resource>...",
    "error": null,
    "status": 200,
    "duration_elapsed_ms": 122,
    "costs": {
      "ai_cost": 0,
      "compute_cost": 0.00001,
      "file_cost": 0.00002,
      "bytes_transferred_cost": 0.00002,
      "total_cost": 0.00004,
      "transform_cost": 0.0001
    },
    "url": "https://spider.cloud"
  },
  // more content...
]
Unblocker
Start unblocking challenging website(s) to collect data. You can pass an array of objects for the request body. Costs an additional 10-40 credits per successful request.
POST https://api.spider.cloud/unblocker
- url (string, required)
  The URI resource to crawl. This can be a comma-separated list for multiple URLs.
  Tip: To reduce latency, enhance performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
- disable_hints (boolean)
  Disables service-provided hints that automatically optimize request types, geo-region selection, and network filters (for example, updating network_blacklist/network_whitelist recommendations based on observed request-pattern outcomes). Hints are enabled by default for all smart request modes. Enable this if you want fully manual control over filtering behavior, are debugging request load order or coverage, or need deterministic behavior across runs.
  Tip: If you're tuning filters, keep hints enabled and pair with event_tracker to see the complete URL list; once stable, you can flip disable_hints on to lock behavior.
- lite_mode (boolean)
  Lite mode reduces data transfer costs by 50%, with trade-offs in speed, accuracy, geo-targeting, and reliability. It's best suited for non-urgent data collection or when targeting websites with minimal anti-bot protections.
- network_blacklist (string[])
  Blocks matching network requests from being fetched/loaded. Use this to reduce bandwidth and noise by preventing known-unneeded third-party resources from ever being requested. Each entry is a string match pattern (commonly a hostname, domain, or URL substring). If both whitelist and blacklist are set, whitelist takes precedence.
  Good targets: googletagmanager.com, doubleclick.net, maps.googleapis.com. Prefer specific domains over broad substrings to avoid breaking essential assets.
  Tip: Pair this with event_tracker to capture the full list of URLs your session attempted to fetch, so you can quickly discover what to block (or allow) next.
- network_whitelist (string[])
  Allows only matching network requests to be fetched/loaded. Use this for a strict "allowlist-first" approach: keep the crawl lightweight while still permitting the essential scripts/styles needed for rendering and JS execution. Each entry is a string match pattern (commonly a hostname, domain, or URL substring). When set, requests not matching any whitelist entry are blocked by default.
  Start with first-party domains such as example.com and cdn.example.com, add only what you observe you truly need (fonts/CDNs), then iterate.
  Tip: Pair this with event_tracker to capture the full list of URLs your session attempted to fetch, so you can tune your allowlist quickly and safely.
- request (string, default: smart; options: http, chrome, smart)
  The request type to perform. Use smart to perform an HTTP request by default, upgrading to JavaScript rendering only when it is needed for the HTML.
  Tip: The request mode greatly influences how the output will look. If the page is server-side rendered, you can stick to the defaults.
- metadata (boolean, default: false)
  Collect metadata about the content found, like page title, description, and keywords. This can help improve AI interoperability.
  Tip: Using metadata can help extract critical information to use for AI.
- session (boolean, default: true)
  Persist the session for the client that you use on a website. This allows the HTTP headers and cookies to be set like a real browser session.
- request_timeout (number, default: 60)
  The timeout to use for the request, from 5 to 255 seconds.
  Tip: The timeout helps prevent long requests from hanging.
- wait_for (object)
  The wait_for parameter allows you to specify various waiting conditions for a website operation. If provided, it can contain the following sub-parameters:
  - idle_network: wait for the network to be idle within a period. Can include an optional timeout value.
  - idle_network0: wait for the network to be idle, with a max timeout. Can include an optional timeout value.
  - almost_idle_network0: wait for the network to be almost idle, with a max timeout. Can include an optional timeout value.
  - selector: wait for a particular CSS selector to be found on the page. Includes an optional timeout value and the CSS selector to wait for.
  - dom: wait for a particular element to stop updating for a duration on the page. Includes an optional timeout value and the CSS selector to wait for.
  - delay: a delay to wait for, with an optional timeout value.
  - page_navigations: when set to true, all page navigations are waited for.
  If wait_for is not provided, the default behavior is to wait for the network to be idle for 500 milliseconds. All durations are capped at 60 seconds. Timeout durations use the object shape { secs: 10, nanos: 0 }.
- webhooks (object)
  Use webhooks to get notified on events like credits depleted, new pages, metadata, and website status. Shape: { destination: string, on_credits_depleted: bool, on_credits_half_depleted: bool, on_website_status: bool, on_find: bool, on_find_metadata: bool }
Add a custom HTTP user agent to the request. By default this is set to a random agent.
- sitemapUnblocker API -booleanDefault: false
Include the sitemap results to crawl.
TipThe sitemap allows you to include links that may not be exposed in the HTML. - sitemap_onlyUnblocker API -booleanDefault: false
Only include the sitemap results to crawl.
TipUsing this option allows you to get only the pages on the sitemap without crawling the entire website. - sitemap_pathUnblocker API -stringDefault: sitemap.xml
The sitemap URL to use when using
sitemap. - subdomainsUnblocker API -booleanDefault: false
Allow subdomains to be included.
- tldUnblocker API -booleanDefault: false
Allow TLD's to be included.
- root_selectorUnblocker API -string
The root CSS query selector to use extracting content from the markup for the response.
- preserve_hostUnblocker API -booleanDefault: false
Preserve the default HOST header for the client. This may help bypass pages that require a HOST, and when the TLS cannot be determined.
- full_resourcesUnblocker API -boolean
Crawl and download all the resources for a website.
TipCollect all the content from the website, including assets like images, videos, etc. - redirect_policyUnblocker API -stringDefault: LooseLooseStrictNone
The network redirect policy to use when performing HTTP request.
TipLoosewill only capture the initial page redirect to the resource. Include the website inexternal_domainsto allow crawling outside of the domain. - external_domainsUnblocker API -array
A list of external domains to treat as one domain. You can use regex paths to include the domains. Set one of the array values to
*to include all domains. - exclude_selectorUnblocker API -string
A CSS query selector to use for ignoring content from the markup of the response.
- concurrency_limitUnblocker API -number
Set the concurrency limit to help balance request for slower websites. The default is unlimited.
- execution_scriptsUnblocker API -object
Run custom JavaScript on certain paths. Requires
chromeorsmartrequest mode. The values should be in the shape"/path_or_url": "custom js".TipCustom scripts allow you to take control of the browser with events for up to 60 seconds at a time per page. - disable_interceptUnblocker API -booleanDefault: false
Disable request interception when running request as
chromeorsmart. This may help bypass pages that use third-party scripts or external domains.TipCost and speed may increase when disabling this feature, as it removes native Chrome interception. - block_adsUnblocker API -booleanDefault: true
Block advertisements when running request as
chromeorsmart. This can greatly increase performance.TipCost and speed might increase when disabling this feature. - block_analyticsUnblocker API -booleanDefault: true
Block analytics when running request as
chromeorsmart. This can greatly increase performance.TipCost and speed might increase when disabling this feature. - block_stylesheetsUnblocker API -booleanDefault: true
Block stylesheets when running request as
chromeorsmart. This can greatly increase performance.TipCost and speed might increase when disabling this feature. - run_in_backgroundUnblocker API -booleanDefault: false
Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard.
TipRequiresstoragelessset to false orwebhooksto be enabled. - chunking_algUnblocker API -objectByWordsByLinesByCharacterLengthBySentence
Use a chunking algorithm to segment your content output. Pass an object like
{ "type": "bysentence", "value": 2 }to split the text into an array by every 2 sentences. Works well with markdown or text formats.TipThe chunking algorithm allows you to prepare content for AI without needing extra code or loaders. - budgetUnblocker API -object
Object that has paths with a counter for limiting the amount of pages. Use
{"*":1}for only crawling the root page. The wildcard matches all routes and you can set child paths to limit depth, e.g.{ "/docs/colors": 10, "/docs/": 100 }.TipThe budget explicitly allows you to set paths and limits for the crawl. - max_credits_per_pageUnblocker API -numberSet the maximum number of credits to use per page. Credits are measured in decimal units, where 10,000 credits equal one dollar (100 credits per penny).
- max_credits_allowedUnblocker API -numberSet the maximum number of credits to use per run. If the initial response is empty, the request returns a blocked-by-client error. Credits are measured in decimal units, where 10,000 credits equal one dollar (100 credits per penny).
- event_trackerUnblocker API -object
Track the event request, responses, and automation output when using browser rendering. Pass in the object with the following
requestsandresponsesfor the network output of the page.automationwill send detailed information including a screenshot of each automation step used underautomation_scripts. - blacklistUnblocker API -array
Blacklist a set of paths that you do not want to crawl. You can use regex patterns to help with the list.
- whitelistUnblocker API -array
Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns to help with the list.
- crawl_timeoutUnblocker API -object
The
crawl_timeoutparameter allows you to put a max duration on the entire crawl. The default setting is 2 mins.The values for the timeout duration are in the object shape
{ secs: 300, nanos: 0 }. - data_connectorsUnblocker API -object
Stream crawl results directly to cloud storage and data services. Configure one or more connectors to automatically receive page data as it is crawled. Supports S3, Google Cloud Storage, Google Sheets, Azure Blob Storage, and Supabase.
{ s3: { bucket, access_key_id, secret_access_key, region?, prefix?, content_type? }, gcs: { bucket, service_account_base64, prefix? }, google_sheets: { spreadsheet_id, service_account_base64, sheet_name? }, azure_blob: { connection_string, container, prefix? }, supabase: { url, anon_key, table }, on_find: bool, on_find_metadata: bool } - full_pageUnblocker API -booleanDefault: true
Take a screenshot of the full page.
- return_formatUnblocker API -string | arrayDefault: rawmarkdowncommonmarkrawtextxmlbytesempty
The format to return the data in. Possible values are
markdown,commonmark,raw,text,xml,bytes, andempty. Userawto return the default format of the page likeHTMLetc.TipUsually you want to usemarkdownfor LLM processing ortext. If you need to store the files without losing any encoding, usebytesorraw. PDF transformations may take up to 1 cent per page for high accuracy. - readabilityUnblocker API -booleanDefault: false
Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
TipThis uses the Safari Reader Mode algorithm to extract only important information from the content. - css_extraction_mapUnblocker API -object
Use CSS or XPath selectors to scrape contents from the web page. Set the paths and the extraction object map to perform extractions per path or page.
TipYou can scrape using CSS selectors at no extra cost. - link_rewriteUnblocker API -json
Optional URL rewrite rule applied to every discovered link before it's crawled. This lets you normalize or redirect URLs (for example, rewriting paths or mapping one host pattern to another).
The value must be a JSON object with a
typefield. Supported types:"replace"– simple substring replacement.
Fields:host?: string(optional) – only apply when the link's host matches this value (e.g."blog.example.com").find: string– substring to search for in the URL.replace_with: string– replacement substring.
"regex"– regex-based rewrite with capture groups.
Fields:host?: string(optional) – only apply for this host.pattern: string– regex applied to the full URL.replace_with: string– replacement string supporting$1,$2, etc.
Invalid or unsafe regex patterns (overly long, unbalanced parentheses, advanced lookbehind constructs, etc.) are rejected by the server and ignored.
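For example, the documented "replace" rule can normalize one path segment to another before discovered links are crawled. A minimal sketch against the unblocker endpoint; the host and path values are placeholders, not values from this reference:

import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

json_data = {
    "url": "https://spider.cloud",
    "return_format": "markdown",
    "link_rewrite": {
        "type": "replace",
        "host": "spider.cloud",        # optional: only rewrite links on this host
        "find": "/blog/",              # substring to look for in each discovered URL
        "replace_with": "/articles/",  # substring that replaces it
    },
}

response = requests.post('https://api.spider.cloud/unblocker',
    headers=headers, json=json_data)
print(response.json())

A "regex" rule follows the same pattern, with pattern and replace_with (supporting $1, $2 capture groups) in place of find.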
- clean_htmlUnblocker API -boolean
Clean the HTML of unwanted attributes.
- filter_svgUnblocker API -boolean
Filter SVG elements from the markup.
- filter_imagesUnblocker API -boolean
Filter image elements from the markup.
- filter_main_onlyUnblocker API -booleanDefault: true
Filter the main content from the markup excluding
nav,footer, andasideelements. - return_json_dataUnblocker API -booleanDefault: false
Return the JSON data found in scripts used for SSR.
TipUseful for getting JSON-ready data for LLMs and data from websites built with Next.js etc. - return_headersUnblocker API -booleanDefault: false
Return the HTTP response headers with the results.
TipGetting the HTTP headers can help setup authentication flows. - return_cookiesUnblocker API -booleanDefault: false
Return the HTTP response cookies with the results.
TipGetting the HTTP cookies can help setup authentication SSR flows. - return_page_linksUnblocker API -booleanDefault: false
Return the links found on each page.
TipGetting the links can help index the reference locations found for the resource. - filter_output_svgUnblocker API -boolean
Filter the svg tags from the output.
- filter_output_imagesUnblocker API -boolean
Filter the images from the output.
- filter_output_main_onlyUnblocker API -boolean
Filter the nav, aside, and footer from the output.
- encodingUnblocker API -string
The type of encoding to use like
UTF-8, SHIFT_JIS, etc. - return_embeddingsUnblocker API -booleanDefault: false
Include OpenAI embeddings for
titleanddescription. Requiresmetadatato be enabled.TipIf you are embedding data, you can use these embeddings as a baseline for most vector operations. - binaryUnblocker API -boolean
Return the image as binary instead of base64.
- cdp_paramsUnblocker API -objectDefault: null
The settings to use to adjust clip, format, quality, and more.
- proxyUnblocker API -'residential' | 'mobile' | 'isp'residentialmobileisp
Select the proxy pool for this request. Leave blank to disable proxy routing. Using this param overrides all other
proxy_*shorthand configurations. See the pricing table for full details. Alternatively, use Proxy-Mode to route standard HTTP traffic through Spider's proxy endpoint.TipEach pool carries a different price multiplier (from ×1.2 forresidentialup to ×2 formobile). - remote_proxyUnblocker API -string
Use a remote external proxy connection. You also save 50% on data transfer costs when you bring your own proxy.
TipUse your own proxy to bypass any firewall as needed or connect to private web servers. - cookiesUnblocker API -string
Add HTTP cookies to use for request.
TipSet the cookie value for pages that use SSR authentication. - headersUnblocker API -object
Forward HTTP headers to use for all requests. The object is expected to be a map of key value pairs.
TipUsing HTTP headers can help with authenticated pages that use theauthorizationheader field. - fingerprintUnblocker API -booleanDefault: true
Use advanced fingerprint detection for chrome.
TipSet this value to help crawl when websites require a fingerprint. - stealthUnblocker API -booleanDefault: true
Use stealth mode for headless chrome request to help prevent being blocked.
TipSet to true to almost guarantee not being detected by anything. - proxy_enabledUnblocker API -booleanDefault: false
Enable premium high performance proxies to prevent detection and increase speed. You can also use Proxy-Mode to route requests through Spider's proxy front-end instead.
TipUsing this configuration can help when network requests are blocked. This setup increases the cost forfile_costandbytes_transferred_cost, but only by 1.5×.
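A minimal sketch combining the proxy options above: it routes the request through the residential pool and pins the proxy region with country_code (documented further down this list). The pool and region values are illustrative:

import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

json_data = {
    "url": "https://spider.cloud",
    "return_format": "markdown",
    "proxy": "residential",  # overrides the other proxy_* shorthand configurations
    "country_code": "us",    # ISO country code for the proxy connection (see the locations list)
}

response = requests.post('https://api.spider.cloud/unblocker',
    headers=headers, json=json_data)
print(response.json())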
- cacheUnblocker API -boolean | { maxAge?: number; allowStale?: boolean; period?: string; skipBrowser?: boolean }Default: true
Use HTTP caching for the crawl to speed up repeated runs. Defaults to
true.Accepts either:
true/false- A cache control object:
maxAge(ms) — freshness window (default:172800000= 2 days). Set0for always fetch fresh.allowStale— serve cached results even if stale.period— RFC3339 timestamp cutoff (overridesmaxAge), e.g."2025-11-29T12:00:00Z"skipBrowser— skip browser entirely if cached HTML exists. Returns cached HTML directly without launching Chrome for instant responses.
Default behavior by route type:
- Standard routes (
/crawl,/scrape,/unblocker) — cache istruewithskipBrowserenabled by default. Cached pages return instantly without re-launching Chrome. To force a fresh browser fetch, setcache: falseor{ "skipBrowser": false }. - AI routes (
/ai/crawl,/ai/scrape, etc.) — cache istruebutskipBrowseris not enabled. AI routes always use the browser to ensure live page content for extraction.
TipCaching saves costs on repeated runs. Standard routes skip the browser entirely when cached HTML exists, providing instant responses. - respect_robotsUnblocker API -booleanDefault: true
Respect the robots.txt file for crawling.
TipIf you have trouble crawling a website it may be an issue with the robots.txt file. Setting the value tofalsecould help. Use this config sparingly. - skip_config_checksUnblocker API -booleanDefault: true
Skip checking the database for website configuration. This will increase performance for requests that use limit=1.
- service_worker_enabledUnblocker API -booleanDefault: true
Allow the website to use Service Workers as needed.
TipEnabling service workers can allow websites that explicitly run background tasks to load data. - storagelessUnblocker API -booleanDefault: true
Prevent storing any type of data for the request including storage.
- block_imagesUnblocker API -booleanDefault: false
Block the images from loading to speed up the screenshot.
- fastUnblocker API -booleanDefault: true
Use fast screenshot mode for speed-optimized rendering. Set to
falsefor high-fidelity rendering that supports iframes, complex PDFs, and accurate visual output. - omit_backgroundUnblocker API -booleanDefault: false
Omit the background from loading.
- scrollUnblocker API -number
Infinite scroll the page as new content loads, up to a duration in milliseconds. You may still need to use the
wait_forparameters. Requireschromerequest mode.TipUsewait_forto scroll until a condition is met anddisable_interceptto get data from the network regardless of hostname. - viewportUnblocker API -object
Configure the viewport for chrome.
TipTo emulate a mobile device, set the viewport to a phone device's size (e.g. 375x414). - automation_scriptsUnblocker API -object
Run custom web automated tasks on certain paths. Requires
chromeorsmartrequest mode.Below are the available actions for web automation:- Evaluate: Runs custom JavaScript code.
{ "Evaluate": "console.log('Hello, World!');" } - Click: Clicks on an element identified by a CSS selector.
{ "Click": "button#submit" } - ClickAll: Clicks on all elements matching a CSS selector.
{ "ClickAll": "button.loadMore" } - ClickPoint: Clicks at the position x and y coordinates.
{ "ClickPoint": { "x": 120.5, "y": 340.25 } } - ClickAllClickable: Clicks on common clickable elements (buttons/inputs/role=button/etc.).
{ "ClickAllClickable": true } - ClickHold: Clicks and holds on an element (via selector) for a duration in milliseconds.
{ "ClickHold": { "selector": "#sliderThumb", "hold_for_ms": 750 } } - ClickHoldPoint: Clicks and holds at a point for a duration in milliseconds.
{ "ClickHoldPoint": { "x": 250.0, "y": 410.0, "hold_for_ms": 750 } } - ClickDrag: Click-and-drag from one element to another (selector → selector) with optional modifier.
{ "ClickDrag": { "from": "#handle", "to": "#target", "modifier": 8 } } - ClickDragPoint: Click-and-drag from one point to another with optional modifier.
{ "ClickDragPoint": { "from_x": 100.0, "from_y": 200.0, "to_x": 500.0, "to_y": 220.0, "modifier": 0 } } - Wait: Waits for a specified duration in milliseconds.
{ "Wait": 2000 } - WaitForNavigation: Waits for the next navigation event.
{ "WaitForNavigation": true } - WaitFor: Waits for an element to appear identified by a CSS selector.
{ "WaitFor": "div#content" } - WaitForWithTimeout: Waits for an element to appear with a timeout (ms).
{ "WaitForWithTimeout": { "selector": "div#content", "timeout": 8000 } } - WaitForAndClick: Waits for an element to appear and then clicks on it, identified by a CSS selector.
{ "WaitForAndClick": "button#loadMore" } - WaitForDom: Waits for DOM updates to settle (quiet/stable) on a selector (or body) with timeout (ms).
{ "WaitForDom": { "selector": "main", "timeout": 12000 } } - ScrollX: Scrolls the screen horizontally by a specified number of pixels.
{ "ScrollX": 100 } - ScrollY: Scrolls the screen vertically by a specified number of pixels.
{ "ScrollY": 200 } - Fill: Fills an input element with a specified value.
{ "Fill": { "selector": "input#name", "value": "John Doe" } } - Type: Type a key into the browser with an optional modifier.
{ "Type": { "value": "John Doe", "modifier": 0 } } - InfiniteScroll: Scrolls the page until the end for certain duration.
{ "InfiniteScroll": 3000 } - Screenshot: Perform a screenshot on the page.
{ "Screenshot": { "full_page": true, "omit_background": true, "output": "out.png" } } - ValidateChain: Set this before a step to validate the prior action to break out of the chain.
{ "ValidateChain": true }
TipCustom web automation allows you to take control of the browser with events for up to 60 seconds at a time per page.
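The reference above does not spell out the top-level shape of automation_scripts, so the sketch below assumes the same "/path_or_url"-keyed mapping used by execution_scripts, with each path holding an ordered list of the documented actions:

import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

json_data = {
    "url": "https://spider.cloud",
    "request": "chrome",  # automation requires the chrome or smart request mode
    "automation_scripts": {
        # Assumed shape: path (or URL) mapped to an ordered list of actions.
        "/": [
            {"WaitFor": "main"},                                    # wait for the main element
            {"ScrollY": 1200},                                      # scroll to trigger lazy content
            {"WaitForDom": {"selector": "main", "timeout": 8000}},  # let the DOM settle
            {"Screenshot": {"full_page": True, "omit_background": False, "output": "out.png"}},
        ]
    },
}

response = requests.post('https://api.spider.cloud/unblocker',
    headers=headers, json=json_data)
print(response.json())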
- evaluate_on_new_documentUnblocker API -stringSet a custom script to evaluate on new document creation.
- country_codeUnblocker API -string
Set an ISO country code for proxy connections. View the locations list for available countries.
TipThe country code allows you to run requests in regions where access to the website is restricted to within that specific region. - localeUnblocker API -string
The locale to use for the request, for example
en-US.
import requests
headers = {
'Authorization': 'Bearer $SPIDER_API_KEY',
'Content-Type': 'application/json',
}
json_data = {"return_format":"markdown","url":"https://spider.cloud"}
response = requests.post('https://api.spider.cloud/unblocker',
headers=headers, json=json_data)
print(response.json())
[ { "url": "https://spider.cloud", "status": 200, "cookies": { "a": "something", "b": "something2" }, "headers": { "x-id": 123, "x-cookie": 123 }, "costs": { "ai_cost": 0.001, "ai_cost_formatted": "0.0010", "bytes_transferred_cost": 3.1649999999999997e-9, "bytes_transferred_cost_formatted": "0.0000000031649999999999997240", "compute_cost": 0.0, "compute_cost_formatted": "0", "file_cost": 0.000029291250000000002, "file_cost_formatted": "0.0000292912499999999997868372", "total_cost": 0.0010292944150000001, "total_cost_formatted": "0.0010292944149999999997865612", "transform_cost": 0.0, "transform_cost_formatted": "0" }, "content": "<html>...</html>", "error": null }, // more content... ]
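Beyond the basic request above, the cache, crawl_timeout, and webhooks objects documented in this section can be combined to keep repeat runs cheap and bounded. A hedged sketch; the webhook destination is a placeholder:

import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

json_data = {
    "url": "https://spider.cloud",
    "return_format": "markdown",
    # Serve cached HTML up to one day old and skip Chrome entirely when it exists.
    "cache": {"maxAge": 86400000, "allowStale": True, "skipBrowser": True},
    # Cap the entire crawl at 5 minutes (durations use the secs/nanos shape).
    "crawl_timeout": {"secs": 300, "nanos": 0},
    # Placeholder destination; the flags mirror the documented webhook shape.
    "webhooks": {
        "destination": "https://example.com/spider-webhook",
        "on_credits_depleted": True,
        "on_credits_half_depleted": False,
        "on_website_status": True,
        "on_find": True,
        "on_find_metadata": False,
    },
}

response = requests.post('https://api.spider.cloud/unblocker',
    headers=headers, json=json_data)
print(response.json())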
Search
View detailsPerform a Google search to gather a list of websites for crawling and resource collection, including fallback options if the query yields no results. You can pass an array of objects for the request body. This endpoint is also available via Proxy-Mode.
https://api.spider.cloud/searchResponse
- searchSearch API -stringrequired
The search query to perform.
- limitSearch API -numberDefault: 0
The maximum amount of pages allowed to crawl per website. Remove the value or set it to
0to crawl all pages.TipIt is better to set a limit upfront on websites where you do not know the size. Re-crawling can effectively use cache to keep costs low as new pages are found. - quick_searchSearch API -boolean
Prioritize speed over output quantity.
- disable_hintsSearch API -boolean
Disables service-provided hints that automatically optimize request types, geo-region selection, and network filters (for example, updating
network_blacklist/network_whitelistrecommendations based on observed request-pattern outcomes). Hints are enabled by default for allsmartrequest modes.Enable this if you want fully manual control over filtering behavior, are debugging request load order/coverage, or need deterministic behavior across runs.
TipIf you're tuning filters, keep hints enabled and pair withevent_trackerto see the complete URL list; once stable, you can flipdisable_hintson to lock behavior. - lite_modeSearch API -boolean
Lite mode reduces data transfer costs by 50%, with trade-offs in speed, accuracy, geo-targeting, and reliability. It’s best suited for non-urgent data collection or when targeting websites with minimal anti-bot protections.
- network_blacklistSearch API -string[]
Blocks matching network requests from being fetched/loaded. Use this to reduce bandwidth and noise by preventing known-unneeded third-party resources from ever being requested.
Each entry is a string match pattern (commonly a hostname, domain, or URL substring). If both whitelist and blacklist are set, whitelist takes precedence.
- Good targets:
googletagmanager.com,doubleclick.net,maps.googleapis.com - Prefer specific domains over broad substrings to avoid breaking essential assets.
Tip Pair this withevent_trackerto capture the full list of URLs your session attempted to fetch, so you can quickly discover what to block (or allow) next.
- network_whitelistSearch API -string[]
Allows only matching network requests to be fetched/loaded. Use this for a strict "allowlist-first" approach: keep the crawl lightweight while still permitting the essential scripts/styles needed for rendering and JS execution.
Each entry is a string match pattern (commonly a hostname, domain, or URL substring). When set, requests not matching any whitelist entry are blocked by default.
- Start with first-party:
example.com,cdn.example.com - Add only what you observe you truly need (fonts/CDNs), then iterate.
TipPair this withevent_trackerto capture the full list of URLs your session attempted to fetch, so you can tune your allowlist quickly and safely.
- search_limitSearch API -number
The limit amount of URLs to fetch or crawl from the search results. Remove the value or set it to
0to crawl all URLs from the realtime search results. This is a shorthand if you do not want to usenum. - fetch_page_contentSearch API -booleanDefault: false
Fetch all the content of the websites by performing crawls. If disabled, only the search results are returned with the meta
titleanddescription. - requestSearch API -stringDefault: smarthttpchromesmart
The request type to perform. Use
smartto perform HTTP request by default until JavaScript rendering is needed for the HTML.TipThe request mode greatly influences how the output will look. If the page is server-side rendered, you can stick to the defaults. - depthSearch API -numberDefault: 25
The crawl limit for maximum depth. If
0, no limit will be applied.TipDepth allows you to place a distance between the base URL path and sub paths. - metadataSearch API -booleanDefault: false
Collect metadata about the content found like page title, description, keywords and etc. This could help improve AI interoperability.
TipUsing metadata can help extract critical information to use for AI. - sessionSearch API -booleanDefault: true
Persist the session for the client that you use on a website. This allows the HTTP headers and cookies to be set like a real browser session.
- request_timeoutSearch API -numberDefault: 60
The timeout to use for request. Timeouts can be from
5-255seconds.TipThe timeout helps prevent long request times from hanging. - wait_forSearch API -object
The
wait_forparameter allows you to specify various waiting conditions for a website operation. If provided, it contains the following sub-parameters:The key
idle_networkspecifies the conditions to wait for the network request to be idle within a period. It can include an optional timeout value.The key
idle_network0specifies the conditions to wait for the network request to be idle with a max timeout. It can include an optional timeout value.The key
almost_idle_network0specifies the conditions to wait for the network request to be almost idle with a max timeout. It can include an optional timeout value.The key
selectorspecifies the conditions to wait for a particular CSS selector to be found on the page. It includes an optional timeout value, and the CSS selector to wait for.The key
domspecifies the conditions to wait for a particular element to stop updating for a duration on the page. It includes an optional timeout value, and the CSS selector to wait for.The key
delayspecifies a delay to wait for, with an optional timeout value.The key
page_navigationsset totruethen waiting for all page navigations will be handled.If
wait_for is not provided, the default behavior is to wait for the network to be idle for 500 milliseconds. All of the durations are capped at 60 seconds. The values for the timeout duration are in the object shape
{ secs: 10, nanos: 0 }. - webhooksSearch API -object
Use webhooks to get notified on events like credit depleted, new pages, metadata, and website status.
{ destination: string, on_credits_depleted: bool, on_credits_half_depleted: bool, on_website_status: bool, on_find: bool, on_find_metadata: bool } - user_agentSearch API -string
Add a custom HTTP user agent to the request. By default this is set to a random agent.
- sitemapSearch API -booleanDefault: false
Include the sitemap results to crawl.
TipThe sitemap allows you to include links that may not be exposed in the HTML. - sitemap_onlySearch API -booleanDefault: false
Only include the sitemap results to crawl.
TipUsing this option allows you to get only the pages on the sitemap without crawling the entire website. - sitemap_pathSearch API -stringDefault: sitemap.xml
The sitemap URL to use when using
sitemap. - subdomainsSearch API -booleanDefault: false
Allow subdomains to be included.
- tldSearch API -booleanDefault: false
Allow TLDs to be included.
- root_selectorSearch API -string
The root CSS query selector to use when extracting content from the markup for the response.
- preserve_hostSearch API -booleanDefault: false
Preserve the default HOST header for the client. This may help bypass pages that require a HOST, and when the TLS cannot be determined.
- full_resourcesSearch API -boolean
Crawl and download all the resources for a website.
TipCollect all the content from the website, including assets like images, videos, etc. - redirect_policySearch API -stringDefault: LooseLooseStrictNone
The network redirect policy to use when performing HTTP requests.
TipLoosewill only capture the initial page redirect to the resource. Include the website inexternal_domainsto allow crawling outside of the domain. - external_domainsSearch API -array
A list of external domains to treat as one domain. You can use regex paths to include the domains. Set one of the array values to
*to include all domains. - exclude_selectorSearch API -string
A CSS query selector to use for ignoring content from the markup of the response.
- concurrency_limitSearch API -number
Set the concurrency limit to help balance requests for slower websites. The default is unlimited.
- execution_scriptsSearch API -object
Run custom JavaScript on certain paths. Requires
chromeorsmartrequest mode. The values should be in the shape"/path_or_url": "custom js".TipCustom scripts allow you to take control of the browser with events for up to 60 seconds at a time per page. - disable_interceptSearch API -booleanDefault: false
Disable request interception when running request as
chromeorsmart. This may help bypass pages that use third-party scripts or external domains.TipCost and speed may increase when disabling this feature, as it removes native Chrome interception. - block_adsSearch API -booleanDefault: true
Block advertisements when running request as
chromeorsmart. This can greatly increase performance.TipCost and speed might increase when disabling this feature. - block_analyticsSearch API -booleanDefault: true
Block analytics when running request as
chromeorsmart. This can greatly increase performance.TipCost and speed might increase when disabling this feature. - block_stylesheetsSearch API -booleanDefault: true
Block stylesheets when running request as
chromeorsmart. This can greatly increase performance.TipCost and speed might increase when disabling this feature. - run_in_backgroundSearch API -booleanDefault: false
Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard.
TipRequiresstoragelessset to false orwebhooksto be enabled. - chunking_algSearch API -objectByWordsByLinesByCharacterLengthBySentence
Use a chunking algorithm to segment your content output. Pass an object like
{ "type": "bysentence", "value": 2 }to split the text into an array by every 2 sentences. Works well with markdown or text formats.TipThe chunking algorithm allows you to prepare content for AI without needing extra code or loaders. - budgetSearch API -object
Object that has paths with a counter for limiting the amount of pages. Use
{"*":1}for only crawling the root page. The wildcard matches all routes and you can set child paths to limit depth, e.g.{ "/docs/colors": 10, "/docs/": 100 }.TipThe budget explicitly allows you to set paths and limits for the crawl. - max_credits_per_pageSearch API -numberSet the maximum number of credits to use per page. Credits are measured in decimal units, where 10,000 credits equal one dollar (100 credits per penny).
- max_credits_allowedSearch API -numberSet the maximum number of credits to use per run. If the initial response is empty, the request returns a blocked-by-client error. Credits are measured in decimal units, where 10,000 credits equal one dollar (100 credits per penny).
- event_trackerSearch API -object
Track the event request, responses, and automation output when using browser rendering. Pass in the object with the following
requestsandresponsesfor the network output of the page.automationwill send detailed information including a screenshot of each automation step used underautomation_scripts. - blacklistSearch API -array
Blacklist a set of paths that you do not want to crawl. You can use regex patterns to help with the list.
- whitelistSearch API -array
Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns to help with the list.
- auto_paginationSearch API -boolean
Automatically paginates to fetch the exact number of desired results, as specified by the
numparameter. Note that credit usage may increase, and response time may be slower when retrieving larger result sets. - crawl_timeoutSearch API -object
The
crawl_timeoutparameter allows you to put a max duration on the entire crawl. The default setting is 2 mins.The values for the timeout duration are in the object shape
{ secs: 300, nanos: 0 }. - data_connectorsSearch API -object
Stream crawl results directly to cloud storage and data services. Configure one or more connectors to automatically receive page data as it is crawled. Supports S3, Google Cloud Storage, Google Sheets, Azure Blob Storage, and Supabase.
{ s3: { bucket, access_key_id, secret_access_key, region?, prefix?, content_type? }, gcs: { bucket, service_account_base64, prefix? }, google_sheets: { spreadsheet_id, service_account_base64, sheet_name? }, azure_blob: { connection_string, container, prefix? }, supabase: { url, anon_key, table }, on_find: bool, on_find_metadata: bool } - numSearch API -number
The maximum number of results to return for the search.
- pageSearch API -number
The page number for the search results.
- tbsSearch API -'qdr:h' | 'qdr:d' | 'qdr:w' | 'qdr:m' | 'qdr:y'
Restrict results to a specific time range. Common options:
qdr:h(past hour),qdr:d(past 24 hours),qdr:w(past week),qdr:m(past month),qdr:y(past year).
- countrySearch API -string
The country code to use for the search. It's a two-letter country code (e.g.
usfor the United States). - locationSearch API -string
The location from where you want the search to originate.
- languageSearch API -string
The language to use for the search. It's a two-letter language code (e.g.,
enfor English). - country_codeSearch API -string
Set an ISO country code for proxy connections. View the locations list for available countries.
TipThe country code allows you to run requests in regions where access to the website is restricted to within that specific region. - localeSearch API -string
The locale to use for the request, for example
en-US.
- return_formatSearch API -string | arrayDefault: rawmarkdowncommonmarkrawtextxmlbytesempty
The format to return the data in. Possible values are
markdown,commonmark,raw,text,xml,bytes, andempty. Userawto return the default format of the page likeHTMLetc.TipUsually you want to usemarkdownfor LLM processing ortext. If you need to store the files without losing any encoding, usebytesorraw. PDF transformations may take up to 1 cent per page for high accuracy. - readabilitySearch API -booleanDefault: false
Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
TipThis uses the Safari Reader Mode algorithm to extract only important information from the content. - css_extraction_mapSearch API -object
Use CSS or XPath selectors to scrape contents from the web page. Set the paths and the extraction object map to perform extractions per path or page.
TipYou can scrape using CSS selectors at no extra cost. - link_rewriteSearch API -json
Optional URL rewrite rule applied to every discovered link before it's crawled. This lets you normalize or redirect URLs (for example, rewriting paths or mapping one host pattern to another).
The value must be a JSON object with a
typefield. Supported types:"replace"– simple substring replacement.
Fields:host?: string(optional) – only apply when the link's host matches this value (e.g."blog.example.com").find: string– substring to search for in the URL.replace_with: string– replacement substring.
"regex"– regex-based rewrite with capture groups.
Fields:host?: string(optional) – only apply for this host.pattern: string– regex applied to the full URL.replace_with: string– replacement string supporting$1,$2, etc.
Invalid or unsafe regex patterns (overly long, unbalanced parentheses, advanced lookbehind constructs, etc.) are rejected by the server and ignored.
- clean_htmlSearch API -boolean
Clean the HTML of unwanted attributes.
- filter_svgSearch API -boolean
Filter SVG elements from the markup.
- filter_imagesSearch API -boolean
Filter image elements from the markup.
- filter_main_onlySearch API -booleanDefault: true
Filter the main content from the markup excluding
nav,footer, andasideelements. - return_json_dataSearch API -booleanDefault: false
Return the JSON data found in scripts used for SSR.
TipUseful for getting JSON-ready data for LLMs and data from websites built with Next.js etc. - return_headersSearch API -booleanDefault: false
Return the HTTP response headers with the results.
TipGetting the HTTP headers can help setup authentication flows. - return_cookiesSearch API -booleanDefault: false
Return the HTTP response cookies with the results.
TipGetting the HTTP cookies can help setup authentication SSR flows. - return_page_linksSearch API -booleanDefault: false
Return the links found on each page.
TipGetting the links can help index the reference locations found for the resource. - filter_output_svgSearch API -boolean
Filter the svg tags from the output.
- filter_output_imagesSearch API -boolean
Filter the images from the output.
- filter_output_main_onlySearch API -boolean
Filter the nav, aside, and footer from the output.
- encodingSearch API -string
The type of encoding to use like
UTF-8, SHIFT_JIS, etc. - return_embeddingsSearch API -booleanDefault: false
Include OpenAI embeddings for
titleanddescription. Requiresmetadatato be enabled.TipIf you are embedding data, you can use these embeddings as a baseline for most vector operations.
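Because return_embeddings depends on metadata, a minimal sketch enabling both on the search endpoint:

import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

json_data = {
    "search": "sports news today",
    "metadata": True,           # required for return_embeddings
    "return_embeddings": True,  # OpenAI embeddings for title and description
}

response = requests.post('https://api.spider.cloud/search',
    headers=headers, json=json_data)
print(response.json())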
- proxySearch API -'residential' | 'mobile' | 'isp'residentialmobileisp
Select the proxy pool for this request. Leave blank to disable proxy routing. Using this param overrides all other
proxy_*shorthand configurations. See the pricing table for full details. Alternatively, use Proxy-Mode to route standard HTTP traffic through Spider's proxy endpoint.TipEach pool carries a different price multiplier (from ×1.2 forresidentialup to ×2 formobile). - remote_proxySearch API -string
Use a remote external proxy connection. You also save 50% on data transfer costs when you bring your own proxy.
TipUse your own proxy to bypass any firewall as needed or connect to private web servers. - cookiesSearch API -string
Add HTTP cookies to use for request.
TipSet the cookie value for pages that use SSR authentication. - headersSearch API -object
Forward HTTP headers to use for all requests. The object is expected to be a map of key value pairs.
TipUsing HTTP headers can help with authenticated pages that use theauthorizationheader field. - fingerprintSearch API -booleanDefault: true
Use advanced fingerprint detection for chrome.
TipSet this value to help crawl when websites require a fingerprint. - stealthSearch API -booleanDefault: true
Use stealth mode for headless chrome request to help prevent being blocked.
TipSet to true to almost guarantee not being detected by anything. - proxy_enabledSearch API -booleanDefault: false
Enable premium high performance proxies to prevent detection and increase speed. You can also use Proxy-Mode to route requests through Spider's proxy front-end instead.
TipUsing this configuration can help when network requests are blocked. This setup increases the cost forfile_costandbytes_transferred_cost, but only by 1.5×.
- cacheSearch API -boolean | { maxAge?: number; allowStale?: boolean; period?: string; skipBrowser?: boolean }Default: true
Use HTTP caching for the crawl to speed up repeated runs. Defaults to
true.Accepts either:
true/false- A cache control object:
maxAge(ms) — freshness window (default:172800000= 2 days). Set0for always fetch fresh.allowStale— serve cached results even if stale.period— RFC3339 timestamp cutoff (overridesmaxAge), e.g."2025-11-29T12:00:00Z"skipBrowser— skip browser entirely if cached HTML exists. Returns cached HTML directly without launching Chrome for instant responses.
Default behavior by route type:
- Standard routes (
/crawl,/scrape,/unblocker) — cache istruewithskipBrowserenabled by default. Cached pages return instantly without re-launching Chrome. To force a fresh browser fetch, setcache: falseor{ "skipBrowser": false }. - AI routes (
/ai/crawl,/ai/scrape, etc.) — cache istruebutskipBrowseris not enabled. AI routes always use the browser to ensure live page content for extraction.
TipCaching saves costs on repeated runs. Standard routes skip the browser entirely when cached HTML exists, providing instant responses. - delaySearch API -numberDefault: 0
Add a crawl delay of up to 60 seconds, disabling concurrency. The delay is specified in milliseconds.
TipUsing a delay can help with websites that are set on a cron and do not require immediate data retrieval. - respect_robotsSearch API -booleanDefault: true
Respect the robots.txt file for crawling.
TipIf you have trouble crawling a website it may be an issue with the robots.txt file. Setting the value tofalsecould help. Use this config sparingly. - skip_config_checksSearch API -booleanDefault: true
Skip checking the database for website configuration. This will increase performance for requests that use limit=1.
- service_worker_enabledSearch API -booleanDefault: true
Allow the website to use Service Workers as needed.
TipEnabling service workers can allow websites that explicitly run background tasks to load data. - storagelessSearch API -booleanDefault: true
Prevent storing any type of data for the request including storage.
- scrollSearch API -number
Infinite scroll the page as new content loads, up to a duration in milliseconds. You may still need to use the
wait_forparameters. Requireschromerequest mode.TipUsewait_forto scroll until a condition is met anddisable_interceptto get data from the network regardless of hostname. - viewportSearch API -object
Configure the viewport for chrome.
TipTo emulate a mobile device, set the viewport to a phone device's size (e.g. 375x414). - automation_scriptsSearch API -object
Run custom web automated tasks on certain paths. Requires
chromeorsmartrequest mode.Below are the available actions for web automation:- Evaluate: Runs custom JavaScript code.
{ "Evaluate": "console.log('Hello, World!');" } - Click: Clicks on an element identified by a CSS selector.
{ "Click": "button#submit" } - ClickAll: Clicks on all elements matching a CSS selector.
{ "ClickAll": "button.loadMore" } - ClickPoint: Clicks at the position x and y coordinates.
{ "ClickPoint": { "x": 120.5, "y": 340.25 } } - ClickAllClickable: Clicks on common clickable elements (buttons/inputs/role=button/etc.).
{ "ClickAllClickable": true } - ClickHold: Clicks and holds on an element (via selector) for a duration in milliseconds.
{ "ClickHold": { "selector": "#sliderThumb", "hold_for_ms": 750 } } - ClickHoldPoint: Clicks and holds at a point for a duration in milliseconds.
{ "ClickHoldPoint": { "x": 250.0, "y": 410.0, "hold_for_ms": 750 } } - ClickDrag: Click-and-drag from one element to another (selector → selector) with optional modifier.
{ "ClickDrag": { "from": "#handle", "to": "#target", "modifier": 8 } } - ClickDragPoint: Click-and-drag from one point to another with optional modifier.
{ "ClickDragPoint": { "from_x": 100.0, "from_y": 200.0, "to_x": 500.0, "to_y": 220.0, "modifier": 0 } } - Wait: Waits for a specified duration in milliseconds.
{ "Wait": 2000 } - WaitForNavigation: Waits for the next navigation event.
{ "WaitForNavigation": true } - WaitFor: Waits for an element to appear identified by a CSS selector.
{ "WaitFor": "div#content" } - WaitForWithTimeout: Waits for an element to appear with a timeout (ms).
{ "WaitForWithTimeout": { "selector": "div#content", "timeout": 8000 } } - WaitForAndClick: Waits for an element to appear and then clicks on it, identified by a CSS selector.
{ "WaitForAndClick": "button#loadMore" } - WaitForDom: Waits for DOM updates to settle (quiet/stable) on a selector (or body) with timeout (ms).
{ "WaitForDom": { "selector": "main", "timeout": 12000 } } - ScrollX: Scrolls the screen horizontally by a specified number of pixels.
{ "ScrollX": 100 } - ScrollY: Scrolls the screen vertically by a specified number of pixels.
{ "ScrollY": 200 } - Fill: Fills an input element with a specified value.
{ "Fill": { "selector": "input#name", "value": "John Doe" } } - Type: Type a key into the browser with an optional modifier.
{ "Type": { "value": "John Doe", "modifier": 0 } } - InfiniteScroll: Scrolls the page until the end for certain duration.
{ "InfiniteScroll": 3000 } - Screenshot: Perform a screenshot on the page.
{ "Screenshot": { "full_page": true, "omit_background": true, "output": "out.png" } } - ValidateChain: Set this before a step to validate the prior action to break out of the chain.
{ "ValidateChain": true }
TipCustom web automation allows you to take control of the browser with events for up to 60 seconds at a time per page.
- evaluate_on_new_documentSearch API -stringSet a custom script to evaluate on new document creation.
import requests
headers = {
'Authorization': 'Bearer $SPIDER_API_KEY',
'Content-Type': 'application/json',
}
json_data = {"search":"sports news today","search_limit":3,"limit":5,"return_format":"markdown"}
response = requests.post('https://api.spider.cloud/search',
headers=headers, json=json_data)
print(response.json())
{ "content": [ { "description": "Visit ESPN for live scores, highlights and sports news. Stream exclusive games on ESPN+ and play fantasy sports.", "title": "ESPN - Serving Sports Fans. Anytime. Anywhere.", "url": "https://www.espn.com/" }, { "description": "Sports Illustrated, SI.com provides sports news, expert analysis, highlights, stats and scores for the NFL, NBA, MLB, NHL, college football, soccer, ...", "title": "Sports Illustrated", "url": "https://www.si.com/" }, { "description": "CBS Sports features live scoring, news, stats, and player info for NFL football, MLB baseball, NBA basketball, NHL hockey, college basketball and football.", "title": "CBS Sports - News, Live Scores, Schedules, Fantasy ...", "url": "https://www.cbssports.com/" }, { "description": "Sport is a form of physical activity or game. Often competitive and organized, sports use, maintain, or improve physical ability and skills.", "title": "Sport", "url": "https://en.wikipedia.org/wiki/Sport" }, { "description": "Watch FOX Sports and view live scores, odds, team news, player news, streams, videos, stats, standings & schedules covering NFL, MLB, NASCAR, WWE, NBA, NHL, ...", "title": "FOX Sports News, Scores, Schedules, Odds, Shows, Streams ...", "url": "https://www.foxsports.com/" }, { "description": "Founded in 1974 by tennis legend, Billie Jean King, the Women's Sports Foundation is dedicated to creating leaders by providing girls access to sports.", "title": "Women's Sports Foundation: Home", "url": "https://www.womenssportsfoundation.org/" }, { "description": "List of sports · Running. Marathon · Sprint · Mascot race · Airsoft · Laser tag · Paintball · Bobsleigh · Jack jumping · Luge · Shovel racing · Card stacking ...", "title": "List of sports", "url": "https://en.wikipedia.org/wiki/List_of_sports" }, { "description": "Stay up-to-date with the latest sports news and scores from NBC Sports.", "title": "NBC Sports - news, scores, stats, rumors, videos, and more", "url": "https://www.nbcsports.com/" }, { "description": "r/sports: Sports News and Highlights from the NFL, NBA, NHL, MLB, MLS, and leagues around the world.", "title": "r/sports", "url": "https://www.reddit.com/r/sports/" }, { "description": "The A-Z of sports covered by the BBC Sport team. Find all the latest live sports coverage, breaking news, results, scores, fixtures, tables, ...", "title": "AZ Sport", "url": "https://www.bbc.com/sport/all-sports" } ] }
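The search-specific parameters documented above (num, page, tbs, country, location, and language) can be combined to page through localized, time-filtered results. A sketch with illustrative values:

import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

json_data = {
    "search": "sports news today",
    "num": 20,                    # up to 20 results
    "page": 2,                    # second page of results
    "tbs": "qdr:d",               # restrict to the past 24 hours
    "country": "us",
    "language": "en",
    "fetch_page_content": False,  # return only titles, descriptions, and URLs
}

response = requests.post('https://api.spider.cloud/search',
    headers=headers, json=json_data)
print(response.json())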
Links
View detailsStart crawling website(s) to collect the links found. You can pass an array of objects for the request body. This endpoint can save on latency if you only need to index the content URLs. Also available via Proxy-Mode.
https://api.spider.cloud/linksResponse
- urlGet API -stringrequired
The URI resource to crawl. This can be a comma-separated list for multiple URLs.
TipTo reduce latency, enhance performance, and save on rate limits batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually. - limitGet API -numberDefault: 0
The maximum amount of pages allowed to crawl per website. Remove the value or set it to
0to crawl all pages.TipIt is better to set a limit upfront on websites where you do not know the size. Re-crawling can effectively use cache to keep costs low as new pages are found. - disable_hintsGet API -boolean
Disables service-provided hints that automatically optimize request types, geo-region selection, and network filters (for example, updating
network_blacklist/network_whitelistrecommendations based on observed request-pattern outcomes). Hints are enabled by default for allsmartrequest modes.Enable this if you want fully manual control over filtering behavior, are debugging request load order/coverage, or need deterministic behavior across runs.
TipIf you're tuning filters, keep hints enabled and pair withevent_trackerto see the complete URL list; once stable, you can flipdisable_hintson to lock behavior. - lite_modeGet API -boolean
Lite mode reduces data transfer costs by 50%, with trade-offs in speed, accuracy, geo-targeting, and reliability. It’s best suited for non-urgent data collection or when targeting websites with minimal anti-bot protections.
- network_blacklistGet API -string[]
Blocks matching network requests from being fetched/loaded. Use this to reduce bandwidth and noise by preventing known-unneeded third-party resources from ever being requested.
Each entry is a string match pattern (commonly a hostname, domain, or URL substring). If both whitelist and blacklist are set, whitelist takes precedence.
- Good targets:
googletagmanager.com,doubleclick.net,maps.googleapis.com - Prefer specific domains over broad substrings to avoid breaking essential assets.
Tip Pair this withevent_trackerto capture the full list of URLs your session attempted to fetch, so you can quickly discover what to block (or allow) next.
- network_whitelistGet API -string[]
Allows only matching network requests to be fetched/loaded. Use this for a strict "allowlist-first" approach: keep the crawl lightweight while still permitting the essential scripts/styles needed for rendering and JS execution.
Each entry is a string match pattern (commonly a hostname, domain, or URL substring). When set, requests not matching any whitelist entry are blocked by default.
- Start with first-party:
example.com,cdn.example.com - Add only what you observe you truly need (fonts/CDNs), then iterate.
TipPair this withevent_trackerto capture the full list of URLs your session attempted to fetch, so you can tune your allowlist quickly and safely.
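A sketch pairing the network filters above with event_tracker on this endpoint. The reference does not spell out the event_tracker field types, so the boolean flags below are an assumption:

import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

json_data = {
    "url": "https://spider.cloud",
    "request": "chrome",  # network filtering applies to browser-loaded resources
    "limit": 5,
    # Block common third-party noise; a whitelist, if set, would take precedence.
    "network_blacklist": ["googletagmanager.com", "doubleclick.net"],
    # Assumed shape: boolean flags for the documented requests/responses output.
    "event_tracker": {"requests": True, "responses": True},
}

response = requests.post('https://api.spider.cloud/links',
    headers=headers, json=json_data)
print(response.json())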
- requestGet API -stringDefault: smarthttpchromesmart
The request type to perform. Use
smartto perform HTTP request by default until JavaScript rendering is needed for the HTML.TipThe request mode greatly influences how the output will look. If the page is server-side rendered, you can stick to the defaults. - depthGet API -numberDefault: 25
The crawl limit for maximum depth. If
0, no limit will be applied.TipDepth allows you to place a distance between the base URL path and sub paths. - metadataGet API -booleanDefault: false
Collect metadata about the content found like page title, description, keywords and etc. This could help improve AI interoperability.
TipUsing metadata can help extract critical information to use for AI. - sessionGet API -booleanDefault: true
Persist the session for the client that you use on a website. This allows the HTTP headers and cookies to be set like a real browser session.
- request_timeoutGet API -numberDefault: 60
The timeout to use for request. Timeouts can be from
5-255seconds.TipThe timeout helps prevent long request times from hanging. - wait_forGet API -object
The
wait_forparameter allows you to specify various waiting conditions for a website operation. If provided, it contains the following sub-parameters:The key
idle_networkspecifies the conditions to wait for the network request to be idle within a period. It can include an optional timeout value.The key
idle_network0specifies the conditions to wait for the network request to be idle with a max timeout. It can include an optional timeout value.The key
almost_idle_network0specifies the conditions to wait for the network request to be almost idle with a max timeout. It can include an optional timeout value.The key
selectorspecifies the conditions to wait for a particular CSS selector to be found on the page. It includes an optional timeout value, and the CSS selector to wait for.The key
domspecifies the conditions to wait for a particular element to stop updating for a duration on the page. It includes an optional timeout value, and the CSS selector to wait for.The key
delayspecifies a delay to wait for, with an optional timeout value.The key
page_navigationsset totruethen waiting for all page navigations will be handled.If
wait_for is not provided, the default behavior is to wait for the network to be idle for 500 milliseconds. All of the durations are capped at 60 seconds. The values for the timeout duration are in the object shape
{ secs: 10, nanos: 0 }. - webhooksGet API -object
Use webhooks to get notified on events like credit depleted, new pages, metadata, and website status.
{ destination: string, on_credits_depleted: bool, on_credits_half_depleted: bool, on_website_status: bool, on_find: bool, on_find_metadata: bool } - user_agentGet API -string
Add a custom HTTP user agent to the request. By default this is set to a random agent.
- sitemapGet API -booleanDefault: false
Include the sitemap results to crawl.
TipThe sitemap allows you to include links that may not be exposed in the HTML. - sitemap_onlyGet API -booleanDefault: false
Only include the sitemap results to crawl.
TipUsing this option allows you to get only the pages on the sitemap without crawling the entire website. - sitemap_pathGet API -stringDefault: sitemap.xml
The sitemap URL to use when using
sitemap. - subdomainsGet API -booleanDefault: false
Allow subdomains to be included.
- tldGet API -booleanDefault: false
Allow TLDs to be included.
- root_selectorGet API -string
The root CSS query selector to use when extracting content from the markup for the response.
- preserve_hostGet API -booleanDefault: false
Preserve the default HOST header for the client. This may help bypass pages that require a HOST, and when the TLS cannot be determined.
- full_resourcesGet API -boolean
Crawl and download all the resources for a website.
TipCollect all the content from the website, including assets like images, videos, etc. - redirect_policyGet API -stringDefault: LooseLooseStrictNone
The network redirect policy to use when performing HTTP requests.
TipLoosewill only capture the initial page redirect to the resource. Include the website inexternal_domainsto allow crawling outside of the domain. - external_domainsGet API -array
A list of external domains to treat as one domain. You can use regex paths to include the domains. Set one of the array values to
*to include all domains. - exclude_selectorGet API -string
A CSS query selector to use for ignoring content from the markup of the response.
- concurrency_limitGet API -number
Set the concurrency limit to help balance requests for slower websites. The default is unlimited.
- execution_scriptsGet API -object
Run custom JavaScript on certain paths. Requires
chromeorsmartrequest mode. The values should be in the shape"/path_or_url": "custom js".TipCustom scripts allow you to take control of the browser with events for up to 60 seconds at a time per page. - disable_interceptGet API -booleanDefault: false
Disable request interception when running request as
chromeorsmart. This may help bypass pages that use third-party scripts or external domains.TipCost and speed may increase when disabling this feature, as it removes native Chrome interception. - block_adsGet API -booleanDefault: true
Block advertisements when running request as
chromeorsmart. This can greatly increase performance.TipCost and speed might increase when disabling this feature. - block_analyticsGet API -booleanDefault: true
Block analytics when running request as
chromeorsmart. This can greatly increase performance.TipCost and speed might increase when disabling this feature. - block_stylesheetsGet API -booleanDefault: true
Block stylesheets when running request as
chromeorsmart. This can greatly increase performance.TipCost and speed might increase when disabling this feature. - run_in_backgroundGet API -booleanDefault: false
Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard.
TipRequiresstoragelessset to false orwebhooksto be enabled. - chunking_algGet API -objectByWordsByLinesByCharacterLengthBySentence
Use a chunking algorithm to segment your content output. Pass an object like
{ "type": "bysentence", "value": 2 }to split the text into an array by every 2 sentences. Works well with markdown or text formats.TipThe chunking algorithm allows you to prepare content for AI without needing extra code or loaders. - budgetGet API -object
Object that has paths with a counter for limiting the amount of pages. Use
{"*":1}for only crawling the root page. The wildcard matches all routes and you can set child paths to limit depth, e.g.{ "/docs/colors": 10, "/docs/": 100 }.TipThe budget explicitly allows you to set paths and limits for the crawl. - max_credits_per_pageGet API -numberSet the maximum number of credits to use per page. Credits are measured in decimal units, where 10,000 credits equal one dollar (100 credits per penny).
- max_credits_allowedGet API -numberSet the maximum number of credits to use per run. If the initial response is empty, the request returns a blocked-by-client error. Credits are measured in decimal units, where 10,000 credits equal one dollar (100 credits per penny).
- event_trackerGet API -object
Track the event request, responses, and automation output when using browser rendering. Pass in the object with the following
requestsandresponsesfor the network output of the page.automationwill send detailed information including a screenshot of each automation step used underautomation_scripts. - blacklistGet API -array
Blacklist a set of paths that you do not want to crawl. You can use regex patterns to help with the list.
- whitelistGet API -array
Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns to help with the list.
- crawl_timeoutGet API -object
The
crawl_timeoutparameter allows you to put a max duration on the entire crawl. The default setting is 2 mins.The values for the timeout duration are in the object shape
{ secs: 300, nanos: 0 }. - data_connectorsGet API -object
Stream crawl results directly to cloud storage and data services. Configure one or more connectors to automatically receive page data as it is crawled. Supports S3, Google Cloud Storage, Google Sheets, Azure Blob Storage, and Supabase.
{ s3: { bucket, access_key_id, secret_access_key, region?, prefix?, content_type? }, gcs: { bucket, service_account_base64, prefix? }, google_sheets: { spreadsheet_id, service_account_base64, sheet_name? }, azure_blob: { connection_string, container, prefix? }, supabase: { url, anon_key, table }, on_find: bool, on_find_metadata: bool }
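Following the connector shapes above, a sketch that streams each crawled page to an S3 bucket; the bucket, credentials, region, and prefix are placeholders:

import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

json_data = {
    "url": "https://spider.cloud",
    "limit": 25,
    "data_connectors": {
        "s3": {
            "bucket": "my-crawl-bucket",
            "access_key_id": "AKIA...",
            "secret_access_key": "...",
            "region": "us-east-1",
            "prefix": "spider/",
        },
        "on_find": True,            # push pages as they are found
        "on_find_metadata": False,
    },
}

response = requests.post('https://api.spider.cloud/links',
    headers=headers, json=json_data)
print(response.json())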
- return_formatGet API -string | arrayDefault: rawmarkdowncommonmarkrawtextxmlbytesempty
The format to return the data in. Possible values are
markdown,commonmark,raw,text,xml,bytes, andempty. Userawto return the default format of the page likeHTMLetc.TipUsually you want to usemarkdownfor LLM processing ortext. If you need to store the files without losing any encoding, usebytesorraw. PDF transformations may take up to 1 cent per page for high accuracy. - readabilityGet API -booleanDefault: false
Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
TipThis uses the Safari Reader Mode algorithm to extract only important information from the content. - css_extraction_mapGet API -object
Use CSS or XPath selectors to scrape contents from the web page. Set the paths and the extraction object map to perform extractions per path or page.
TipYou can scrape using CSS selectors at no extra cost. - link_rewriteGet API -json
Optional URL rewrite rule applied to every discovered link before it's crawled. This lets you normalize or redirect URLs (for example, rewriting paths or mapping one host pattern to another).
The value must be a JSON object with a
typefield. Supported types:"replace"– simple substring replacement.
Fields:host?: string(optional) – only apply when the link's host matches this value (e.g."blog.example.com").find: string– substring to search for in the URL.replace_with: string– replacement substring.
"regex"– regex-based rewrite with capture groups.
Fields:host?: string(optional) – only apply for this host.pattern: string– regex applied to the full URL.replace_with: string– replacement string supporting$1,$2, etc.
Invalid or unsafe regex patterns (overly long, unbalanced parentheses, advanced lookbehind constructs, etc.) are rejected by the server and ignored.
- clean_htmlGet API -boolean
Clean the HTML of unwanted attributes.
- filter_svgGet API -boolean
Filter SVG elements from the markup.
- filter_imagesGet API -boolean
Filter image elements from the markup.
- filter_main_onlyGet API -booleanDefault: true
Filter the main content from the markup excluding
nav,footer, andasideelements. - return_json_dataGet API -booleanDefault: false
Return the JSON data found in scripts used for SSR.
TipUseful for getting JSON-ready data for LLMs and data from websites built with Next.js etc. - return_headersGet API -booleanDefault: false
Return the HTTP response headers with the results.
TipGetting the HTTP headers can help set up authentication flows. - return_cookiesGet API -booleanDefault: false
Return the HTTP response cookies with the results.
TipGetting the HTTP cookies can help set up authentication SSR flows. - return_page_linksGet API -booleanDefault: false
Return the links found on each page.
TipGetting the links can help index the reference locations found for the resource. - filter_output_svgGet API -boolean
Filter the svg tags from the output.
- filter_output_imagesGet API -boolean
Filter the images from the output.
- filter_output_main_onlyGet API -boolean
Filter the nav, aside, and footer from the output.
- encodingGet API -string
The type of encoding to use like
UTF-8,SHIFT_JIS, etc. - return_embeddingsGet API -booleanDefault: false
Include OpenAI embeddings for
titleanddescription. Requiresmetadatato be enabled.TipIf you are embedding data, you can use these embeddings as a baseline for most vector operations.
- proxyGet API -'residential' | 'mobile' | 'isp'residentialmobileisp
Select the proxy pool for this request. Leave blank to disable proxy routing. Using this param overrides all other
proxy_*shorthand configurations. See the pricing table for full details. Alternatively, use Proxy-Mode to route standard HTTP traffic through Spider's proxy endpoint.TipEach pool carries a different price multiplier (from ×1.2 forresidentialup to ×2 formobile). - remote_proxyGet API -string
Use a remote external proxy connection. You also save 50% on data transfer costs when you bring your own proxy.
TipUse your own proxy to bypass any firewall as needed or connect to private web servers. - cookiesGet API -string
Add HTTP cookies to use for the request.
TipSet the cookie value for pages that use SSR authentication. - headersGet API -object
Forward HTTP headers to use for all requests. The object is expected to be a map of key value pairs.
TipUsing HTTP headers can help with authenticated pages that use theauthorizationheader field. - fingerprintGet API -booleanDefault: true
Use advanced fingerprint detection for chrome.
TipSet this value to help crawl when websites require a fingerprint. - stealthGet API -booleanDefault: true
Use stealth mode for headless chrome request to help prevent being blocked.
TipKeep this set to true to minimize the chance of being detected. - proxy_enabledGet API -booleanDefault: false
Enable premium high performance proxies to prevent detection and increase speed. You can also use Proxy-Mode to route requests through Spider's proxy front-end instead.
TipUsing this configuration can help when network requests are blocked. This setup increases the cost forfile_costandbytes_transferred_cost, but only by 1.5×.
- cacheGet API -boolean | { maxAge?: number; allowStale?: boolean; period?: string; skipBrowser?: boolean }Default: true
Use HTTP caching for the crawl to speed up repeated runs. Defaults to
true.Accepts either:
true/false- A cache control object:
maxAge(ms) — freshness window (default:172800000= 2 days). Set0for always fetch fresh.allowStale— serve cached results even if stale.period— RFC3339 timestamp cutoff (overridesmaxAge), e.g."2025-11-29T12:00:00Z"skipBrowser— skip browser entirely if cached HTML exists. Returns cached HTML directly without launching Chrome for instant responses.
Default behavior by route type:
- Standard routes (
/crawl,/scrape,/unblocker) — cache istruewithskipBrowserenabled by default. Cached pages return instantly without re-launching Chrome. To force a fresh browser fetch, setcache: falseor{ "skipBrowser": false }. - AI routes (
/ai/crawl,/ai/scrape, etc.) — cache istruebutskipBrowseris not enabled. AI routes always use the browser to ensure live page content for extraction.
TipCaching saves costs on repeated runs. Standard routes skip the browser entirely when cached HTML exists, providing instant responses.
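A payload sketch for the cache control object using the documented fields; values are illustrative.
json_data = {
    "url": "https://spider.cloud",
    "cache": {
        # treat cached pages older than 1 hour as stale
        "maxAge": 3600000,
        # still serve stale entries rather than refetching
        "allowStale": True,
        # return cached HTML directly without launching Chrome
        "skipBrowser": True,
    },
}
# or force a fresh browser fetch on a standard route
json_data = {"url": "https://spider.cloud", "cache": False}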
- delayGet API -numberDefault: 0
Add a crawl delay between requests; setting a delay disables concurrency. The delay is specified in milliseconds, up to 60 seconds (60,000 ms).
TipUsing a delay can help with websites that are set on a cron and do not require immediate data retrieval. - respect_robotsGet API -booleanDefault: true
Respect the robots.txt file for crawling.
TipIf you have trouble crawling a website it may be an issue with the robots.txt file. Setting the value tofalsecould help. Use this config sparingly. - skip_config_checksGet API -booleanDefault: true
Skip checking the database for website configuration. This will increase performance for requests that use limit=1.
- service_worker_enabledGet API -booleanDefault: true
Allow the website to use Service Workers as needed.
TipEnabling service workers can allow websites that explicitly run background tasks to load data. - storagelessGet API -booleanDefault: true
Prevent storing any type of data for the request including storage.
- scrollGet API -number
Infinite scroll the page as new content loads, up to a duration in milliseconds. You may still need to use the
wait_forparameters. Requireschromerequest mode.TipUsewait_forto scroll until a condition is met anddisable_interceptto get data from the network regardless of hostname. - viewportGet API -object
Configure the viewport for chrome.
TipTo emulate a mobile device, set the viewport to a phone device's size (e.g. 375x414). - automation_scriptsGet API -object
Run custom web automated tasks on certain paths. Requires
chromeorsmartrequest mode.Below are the available actions for web automation:- Evaluate: Runs custom JavaScript code.
{ "Evaluate": "console.log('Hello, World!');" } - Click: Clicks on an element identified by a CSS selector.
{ "Click": "button#submit" } - ClickAll: Clicks on all elements matching a CSS selector.
{ "ClickAll": "button.loadMore" } - ClickPoint: Clicks at the position x and y coordinates.
{ "ClickPoint": { "x": 120.5, "y": 340.25 } } - ClickAllClickable: Clicks on common clickable elements (buttons/inputs/role=button/etc.).
{ "ClickAllClickable": true } - ClickHold: Clicks and holds on an element (via selector) for a duration in milliseconds.
{ "ClickHold": { "selector": "#sliderThumb", "hold_for_ms": 750 } } - ClickHoldPoint: Clicks and holds at a point for a duration in milliseconds.
{ "ClickHoldPoint": { "x": 250.0, "y": 410.0, "hold_for_ms": 750 } } - ClickDrag: Click-and-drag from one element to another (selector → selector) with optional modifier.
{ "ClickDrag": { "from": "#handle", "to": "#target", "modifier": 8 } } - ClickDragPoint: Click-and-drag from one point to another with optional modifier.
{ "ClickDragPoint": { "from_x": 100.0, "from_y": 200.0, "to_x": 500.0, "to_y": 220.0, "modifier": 0 } } - Wait: Waits for a specified duration in milliseconds.
{ "Wait": 2000 } - WaitForNavigation: Waits for the next navigation event.
{ "WaitForNavigation": true } - WaitFor: Waits for an element to appear identified by a CSS selector.
{ "WaitFor": "div#content" } - WaitForWithTimeout: Waits for an element to appear with a timeout (ms).
{ "WaitForWithTimeout": { "selector": "div#content", "timeout": 8000 } } - WaitForAndClick: Waits for an element to appear and then clicks on it, identified by a CSS selector.
{ "WaitForAndClick": "button#loadMore" } - WaitForDom: Waits for DOM updates to settle (quiet/stable) on a selector (or body) with timeout (ms).
{ "WaitForDom": { "selector": "main", "timeout": 12000 } } - ScrollX: Scrolls the screen horizontally by a specified number of pixels.
{ "ScrollX": 100 } - ScrollY: Scrolls the screen vertically by a specified number of pixels.
{ "ScrollY": 200 } - Fill: Fills an input element with a specified value.
{ "Fill": { "selector": "input#name", "value": "John Doe" } } - Type: Type a key into the browser with an optional modifier.
{ "Type": { "value": "John Doe", "modifier": 0 } } - InfiniteScroll: Scrolls the page until the end for certain duration.
{ "InfiniteScroll": 3000 } - Screenshot: Perform a screenshot on the page.
{ "Screenshot": { "full_page": true, "omit_background": true, "output": "out.png" } } - ValidateChain: Set this before a step to validate the prior action to break out of the chain.
{ "ValidateChain": true }
TipCustom web automation allows you to take control of the browser with events for up to 60 seconds at a time per page.
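A payload sketch for automation_scripts that runs a short action chain on one path. The individual action shapes follow the list above; the path-keyed mapping ("/path": [actions]) mirrors the documented execution_scripts shape and should be treated as an assumption, and the selectors and values are illustrative.
json_data = {
    "url": "https://spider.cloud",
    # assumed shape: each path maps to an ordered list of actions
    "automation_scripts": {
        "/login": [
            {"WaitFor": "input#email"},
            {"Fill": {"selector": "input#email", "value": "user@example.com"}},
            {"Click": "button#submit"},
            {"WaitForNavigation": True},
            {"Screenshot": {"full_page": True, "omit_background": False, "output": "login.png"}},
        ]
    },
}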
- evaluate_on_new_documentGet API -stringSet a custom script to evaluate on new document creation.
- country_codeGet API -string
Set an ISO country code for proxy connections. View the locations list for available countries.
TipThe country code allows you to run requests in regions where access to the website is restricted to within that specific region. - localeGet API -string
The locale to use for the request, for example
en-US.
import requests
headers = {
'Authorization': 'Bearer $SPIDER_API_KEY',
'Content-Type': 'application/json',
}
json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}
response = requests.post('https://api.spider.cloud/links',
headers=headers, json=json_data)
print(response.json())[ { "url": "https://spider.cloud", "status": 200, "duration_elasped_ms": 112 "error": null }, // more content... ]
Screenshot
View detailsTake screenshots of a website to base64 or binary encoding. You can pass an array of objects for the request body. This endpoint is also available via Proxy-Mode.
https://api.spider.cloud/screenshotResponse
- urlScreenshot API -stringrequired
The URI resource to crawl. This can be a comma split list for multiple URLs.
TipTo reduce latency, enhance performance, and save on rate limits batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually. - limitScreenshot API -numberDefault: 0
The maximum amount of pages allowed to crawl per website. Remove the value or set it to
0to crawl all pages.TipIt is better to set a limit upfront on websites where you do not know the size. Re-crawling can effectively use cache to keep costs low as new pages are found. - disable_hintsScreenshot API -boolean
Disables service-provided hints that automatically optimize request types, geo-region selection, and network filters (for example, updating
network_blacklist/network_whitelistrecommendations based on observed request-pattern outcomes). Hints are enabled by default for allsmartrequest modes.Enable this if you want fully manual control over filtering behavior, are debugging request load order/coverage, or need deterministic behavior across runs.
TipIf you're tuning filters, keep hints enabled and pair withevent_trackerto see the complete URL list; once stable, you can flipdisable_hintson to lock behavior. - lite_modeScreenshot API -boolean
Lite mode reduces data transfer costs by 50%, with trade-offs in speed, accuracy, geo-targeting, and reliability. It’s best suited for non-urgent data collection or when targeting websites with minimal anti-bot protections.
- network_blacklistScreenshot API -string[]
Blocks matching network requests from being fetched/loaded. Use this to reduce bandwidth and noise by preventing known-unneeded third-party resources from ever being requested.
Each entry is a string match pattern (commonly a hostname, domain, or URL substring). If both whitelist and blacklist are set, whitelist takes precedence.
- Good targets:
googletagmanager.com,doubleclick.net,maps.googleapis.com - Prefer specific domains over broad substrings to avoid breaking essential assets.
Tip Pair this withevent_trackerto capture the full list of URLs your session attempted to fetch, so you can quickly discover what to block (or allow) next.
- network_whitelistScreenshot API -string[]
Allows only matching network requests to be fetched/loaded. Use this for a strict "allowlist-first" approach: keep the crawl lightweight while still permitting the essential scripts/styles needed for rendering and JS execution.
Each entry is a string match pattern (commonly a hostname, domain, or URL substring). When set, requests not matching any whitelist entry are blocked by default.
- Start with first-party:
example.com,cdn.example.com - Add only what you observe you truly need (fonts/CDNs), then iterate.
TipPair this withevent_trackerto capture the full list of URLs your session attempted to fetch, so you can tune your allowlist quickly and safely.
- depthScreenshot API -numberDefault: 25
The crawl limit for maximum depth. If
0, no limit will be applied.TipDepth allows you to place a distance between the base URL path and sub paths. - metadataScreenshot API -booleanDefault: false
Collect metadata about the content found, like page title, description, keywords, etc. This could help improve AI interoperability.
TipUsing metadata can help extract critical information to use for AI. - sessionScreenshot API -booleanDefault: true
Persist the session for the client that you use on a website. This allows the HTTP headers and cookies to be set like a real browser session.
- request_timeoutScreenshot API -numberDefault: 60
The timeout to use for the request. Timeouts can be from
5-255seconds.TipThe timeout helps prevent long request times from hanging. - wait_forScreenshot API -object
The
wait_forparameter allows you to specify various waiting conditions for a website operation. If provided, it contains the following sub-parameters:The key
idle_networkspecifies the conditions to wait for the network request to be idle within a period. It can include an optional timeout value.The key
idle_network0specifies the conditions to wait for the network request to be idle with a max timeout. It can include an optional timeout value.The key
almost_idle_network0specifies the conditions to wait for the network request to be almost idle with a max timeout. It can include an optional timeout value.The key
selectorspecifies the conditions to wait for a particular CSS selector to be found on the page. It includes an optional timeout value, and the CSS selector to wait for.The key
domspecifies the conditions to wait for a particular element to stop updating for a duration on the page. It includes an optional timeout value, and the CSS selector to wait for.The key
delayspecifies a delay to wait for, with an optional timeout value.The key
page_navigationsset totruethen waiting for all page navigations will be handled.If
wait_foris not provided, the default behavior is to wait for the network to be idle for 500 milliseconds. All of the durations are capped at 60 seconds.The values for the timeout duration are in the object shape
{ secs: 10, nanos: 0 }.
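A payload sketch for wait_for using the documented keys; the nesting of the timeout and selector values inside each key is an assumption based on the descriptions, and the durations use the { secs, nanos } shape above.
json_data = {
    "url": "https://spider.cloud",
    "wait_for": {
        # wait for the network to be idle, with a max timeout (assumed nesting)
        "idle_network0": {"timeout": {"secs": 10, "nanos": 0}},
        # additionally wait for a CSS selector to appear
        "selector": {"selector": "div#content", "timeout": {"secs": 5, "nanos": 0}},
        # handle every page navigation
        "page_navigations": True,
    },
}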
- webhooksScreenshot API -object
Use webhooks to get notified on events like credit depleted, new pages, metadata, and website status.
{ destination: string, on_credits_depleted: bool, on_credits_half_depleted: bool, on_website_status: bool, on_find: bool, on_find_metadata: bool }
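A payload sketch for webhooks using the documented shape; the destination URL is a placeholder.
json_data = {
    "url": "https://spider.cloud",
    "webhooks": {
        # placeholder endpoint that receives the event payloads
        "destination": "https://example.com/spider-webhook",
        "on_credits_depleted": True,
        "on_credits_half_depleted": True,
        "on_website_status": True,
        # receive each page (and its metadata) as it is found
        "on_find": True,
        "on_find_metadata": True,
    },
}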
- user_agentScreenshot API -string
Add a custom HTTP user agent to the request. By default this is set to a random agent.
- sitemapScreenshot API -booleanDefault: false
Include the sitemap results to crawl.
TipThe sitemap allows you to include links that may not be exposed in the HTML. - sitemap_onlyScreenshot API -booleanDefault: false
Only include the sitemap results to crawl.
TipUsing this option allows you to get only the pages on the sitemap without crawling the entire website. - sitemap_pathScreenshot API -stringDefault: sitemap.xml
The sitemap URL to use when using
sitemap. - subdomainsScreenshot API -booleanDefault: false
Allow subdomains to be included.
- tldScreenshot API -booleanDefault: false
Allow TLDs to be included.
- root_selectorScreenshot API -string
The root CSS query selector to use extracting content from the markup for the response.
- preserve_hostScreenshot API -booleanDefault: false
Preserve the default HOST header for the client. This may help bypass pages that require a HOST header, or when TLS cannot be determined.
- full_resourcesScreenshot API -boolean
Crawl and download all the resources for a website.
TipCollect all the content from the website, including assets like images, videos, etc. - redirect_policyScreenshot API -stringDefault: LooseLooseStrictNone
The network redirect policy to use when performing HTTP request.
TipLoosewill only capture the initial page redirect to the resource. Include the website inexternal_domainsto allow crawling outside of the domain. - external_domainsScreenshot API -array
A list of external domains to treat as one domain. You can use regex paths to include the domains. Set one of the array values to
*to include all domains. - exclude_selectorScreenshot API -string
A CSS query selector to use for ignoring content from the markup of the response.
- concurrency_limitScreenshot API -number
Set the concurrency limit to help balance requests for slower websites. The default is unlimited.
- execution_scriptsScreenshot API -object
Run custom JavaScript on certain paths. Requires
chromeorsmartrequest mode. The values should be in the shape"/path_or_url": "custom js".TipCustom scripts allow you to take control of the browser with events for up to 60 seconds at a time per page. - disable_interceptScreenshot API -booleanDefault: false
Disable request interception when running request as
chromeorsmart. This may help bypass pages that use third-party scripts or external domains.TipCost and speed may increase when disabling this feature, as it removes native Chrome interception. - block_adsScreenshot API -booleanDefault: true
Block advertisements when running request as
chromeorsmart. This can greatly increase performance.TipCost and speed might increase when disabling this feature. - block_analyticsScreenshot API -booleanDefault: true
Block analytics when running request as
chromeorsmart. This can greatly increase performance.TipCost and speed might increase when disabling this feature. - block_stylesheetsScreenshot API -booleanDefault: true
Block stylesheets when running request as
chromeorsmart. This can greatly increase performance.TipCost and speed might increase when disabling this feature. - run_in_backgroundScreenshot API -booleanDefault: false
Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard.
TipRequiresstoragelessset to false orwebhooksto be enabled. - chunking_algScreenshot API -objectByWordsByLinesByCharacterLengthBySentence
Use a chunking algorithm to segment your content output. Pass an object like
{ "type": "bysentence", "value": 2 }to split the text into an array by every 2 sentences. Works well with markdown or text formats.TipThe chunking algorithm allows you to prepare content for AI without needing extra code or loaders. - budgetScreenshot API -object
Object that has paths with a counter for limiting the amount of pages. Use
{"*":1}for only crawling the root page. The wildcard matches all routes and you can set child paths to limit depth, e.g.{ "/docs/colors": 10, "/docs/": 100 }.TipThe budget explicitly allows you to set paths and limits for the crawl. - max_credits_per_pageScreenshot API -numberSet the maximum number of credits to use per page. Credits are measured in decimal units, where 10,000 credits equal one dollar (100 credits per penny).
- max_credits_allowedScreenshot API -numberSet the maximum number of credits to use per run. A blocked-by-client status is returned if the initial response is empty. Credits are measured in decimal units, where 10,000 credits equal one dollar (100 credits per penny).
- event_trackerScreenshot API -object
Track the event request, responses, and automation output when using browser rendering. Pass in the object with the following
requestsandresponsesfor the network output of the page.automationwill send detailed information including a screenshot of each automation step used underautomation_scripts. - blacklistScreenshot API -array
Blacklist a set of paths that you do not want to crawl. You can use regex patterns to help with the list.
- whitelistScreenshot API -array
Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns to help with the list.
- crawl_timeoutScreenshot API -object
The
crawl_timeoutparameter allows you to put a max duration on the entire crawl. The default setting is 2 mins.The values for the timeout duration are in the object shape
{ secs: 300, nanos: 0 }. - data_connectorsScreenshot API -object
Stream crawl results directly to cloud storage and data services. Configure one or more connectors to automatically receive page data as it is crawled. Supports S3, Google Cloud Storage, Google Sheets, Azure Blob Storage, and Supabase.
{ s3: { bucket, access_key_id, secret_access_key, region?, prefix?, content_type? }, gcs: { bucket, service_account_base64, prefix? }, google_sheets: { spreadsheet_id, service_account_base64, sheet_name? }, azure_blob: { connection_string, container, prefix? }, supabase: { url, anon_key, table }, on_find: bool, on_find_metadata: bool } - full_pageScreenshot API -booleanDefault: true
Take a screenshot of the full page.
- css_extraction_mapScreenshot API -object
Use CSS or XPath selectors to scrape contents from the web page. Set the paths and the extraction object map to perform extractions per path or page.
TipYou can scrape using CSS selectors at no extra cost. - link_rewriteScreenshot API -json
Optional URL rewrite rule applied to every discovered link before it's crawled. This lets you normalize or redirect URLs (for example, rewriting paths or mapping one host pattern to another).
The value must be a JSON object with a
typefield. Supported types:"replace"– simple substring replacement.
Fields:host?: string(optional) – only apply when the link's host matches this value (e.g."blog.example.com").find: string– substring to search for in the URL.replace_with: string– replacement substring.
"regex"– regex-based rewrite with capture groups.
Fields:host?: string(optional) – only apply for this host.pattern: string– regex applied to the full URL.replace_with: string– replacement string supporting$1,$2, etc.
Invalid or unsafe regex patterns (overly long, unbalanced parentheses, advanced lookbehind constructs, etc.) are rejected by the server and ignored.
- clean_htmlScreenshot API -boolean
Clean the HTML of unwanted attributes.
- filter_svgScreenshot API -boolean
Filter SVG elements from the markup.
- filter_imagesScreenshot API -boolean
Filter image elements from the markup.
- filter_main_onlyScreenshot API -booleanDefault: true
Filter the main content from the markup excluding
nav,footer, andasideelements. - return_json_dataScreenshot API -booleanDefault: false
Return the JSON data found in scripts used for SSR.
TipUseful for getting JSON-ready data for LLMs and data from websites built with Next.js etc. - return_headersScreenshot API -booleanDefault: false
Return the HTTP response headers with the results.
TipGetting the HTTP headers can help set up authentication flows. - return_cookiesScreenshot API -booleanDefault: false
Return the HTTP response cookies with the results.
TipGetting the HTTP cookies can help set up authentication SSR flows. - return_page_linksScreenshot API -booleanDefault: false
Return the links found on each page.
TipGetting the links can help index the reference locations found for the resource. - filter_output_svgScreenshot API -boolean
Filter the svg tags from the output.
- filter_output_imagesScreenshot API -boolean
Filter the images from the output.
- filter_output_main_onlyScreenshot API -boolean
Filter the nav, aside, and footer from the output.
- encodingScreenshot API -string
The type of encoding to use like
UTF-8,SHIFT_JIS, etc. - return_embeddingsScreenshot API -booleanDefault: false
Include OpenAI embeddings for
titleanddescription. Requiresmetadatato be enabled.TipIf you are embedding data, you can use these embeddings as a baseline for most vector operations. - binaryScreenshot API -boolean
Return the image as binary instead of base64.
- cdp_paramsScreenshot API -objectDefault: null
The settings to use to adjust clip, format, quality, and more.
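A payload sketch for cdp_params; the format and quality keys follow the settings named above (clip, format, quality), but the exact accepted keys and values are an assumption.
json_data = {
    "url": "https://spider.cloud",
    # assumed CDP-style screenshot settings: JPEG output at 80% quality
    "cdp_params": {"format": "jpeg", "quality": 80},
}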
- proxyScreenshot API -'residential' | 'mobile' | 'isp'residentialmobileisp
Select the proxy pool for this request. Leave blank to disable proxy routing. Using this param overrides all other
proxy_*shorthand configurations. See the pricing table for full details. Alternatively, use Proxy-Mode to route standard HTTP traffic through Spider's proxy endpoint.TipEach pool carries a different price multiplier (from ×1.2 forresidentialup to ×2 formobile). - remote_proxyScreenshot API -string
Use a remote external proxy connection. You also save 50% on data transfer costs when you bring your own proxy.
TipUse your own proxy to bypass any firewall as needed or connect to private web servers. - cookiesScreenshot API -string
Add HTTP cookies to use for the request.
TipSet the cookie value for pages that use SSR authentication. - headersScreenshot API -object
Forward HTTP headers to use for all requests. The object is expected to be a map of key value pairs.
TipUsing HTTP headers can help with authenticated pages that use theauthorizationheader field. - fingerprintScreenshot API -booleanDefault: true
Use advanced fingerprint detection for chrome.
TipSet this value to help crawl when websites require a fingerprint. - stealthScreenshot API -booleanDefault: true
Use stealth mode for headless chrome request to help prevent being blocked.
TipSet to true to almost guarantee not being detected by anything. - proxy_enabledScreenshot API -booleanDefault: false
Enable premium high performance proxies to prevent detection and increase speed. You can also use Proxy-Mode to route requests through Spider's proxy front-end instead.
TipUsing this configuration can help when network requests are blocked. This setup increases the cost forfile_costandbytes_transferred_cost, but only by 1.5×.
- cacheScreenshot API -boolean | { maxAge?: number; allowStale?: boolean; period?: string; skipBrowser?: boolean }Default: true
Use HTTP caching for the crawl to speed up repeated runs. Defaults to
true.Accepts either:
true/false- A cache control object:
maxAge(ms) — freshness window (default:172800000= 2 days). Set0for always fetch fresh.allowStale— serve cached results even if stale.period— RFC3339 timestamp cutoff (overridesmaxAge), e.g."2025-11-29T12:00:00Z"skipBrowser— skip browser entirely if cached HTML exists. Returns cached HTML directly without launching Chrome for instant responses.
Default behavior by route type:
- Standard routes (
/crawl,/scrape,/unblocker) — cache istruewithskipBrowserenabled by default. Cached pages return instantly without re-launching Chrome. To force a fresh browser fetch, setcache: falseor{ "skipBrowser": false }. - AI routes (
/ai/crawl,/ai/scrape, etc.) — cache istruebutskipBrowseris not enabled. AI routes always use the browser to ensure live page content for extraction.
TipCaching saves costs on repeated runs. Standard routes skip the browser entirely when cached HTML exists, providing instant responses. - delayScreenshot API -numberDefault: 0
Add a crawl delay between requests; setting a delay disables concurrency. The delay is specified in milliseconds, up to 60 seconds (60,000 ms).
TipUsing a delay can help with websites that are set on a cron and do not require immediate data retrieval. - respect_robotsScreenshot API -booleanDefault: true
Respect the robots.txt file for crawling.
TipIf you have trouble crawling a website it may be an issue with the robots.txt file. Setting the value tofalsecould help. Use this config sparingly. - skip_config_checksScreenshot API -booleanDefault: true
Skip checking the database for website configuration. This will increase performance for requests that use limit=1.
- service_worker_enabledScreenshot API -booleanDefault: true
Allow the website to use Service Workers as needed.
TipEnabling service workers can allow websites that explicitly run background tasks to load data. - storagelessScreenshot API -booleanDefault: true
Prevent storing any type of data for the request including storage.
- block_imagesScreenshot API -booleanDefault: false
Block the images from loading to speed up the screenshot.
- fastScreenshot API -booleanDefault: true
Use fast screenshot mode for speed-optimized rendering. Set to
falsefor high-fidelity rendering that supports iframes, complex PDFs, and accurate visual output. - omit_backgroundScreenshot API -booleanDefault: false
Omit the background from loading.
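A payload sketch combining the screenshot output flags documented above, passed as the JSON body like the full request example at the end of this section; values are illustrative.
json_data = {
    "url": "https://spider.cloud",
    "limit": 1,
    # capture the whole page and return binary output instead of base64
    "full_page": True,
    "binary": True,
    # trade fidelity for speed and skip loading images and the background
    "fast": True,
    "block_images": True,
    "omit_background": True,
}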
- scrollScreenshot API -number
Infinite scroll the page as new content loads, up to a duration in milliseconds. You may still need to use the
wait_forparameters. Requireschromerequest mode.TipUsewait_forto scroll until a condition is met anddisable_interceptto get data from the network regardless of hostname. - viewportScreenshot API -object
Configure the viewport for chrome.
TipTo emulate a mobile device, set the viewport to a phone device's size (e.g. 375x414). - automation_scriptsScreenshot API -object
Run custom web automated tasks on certain paths. Requires
chromeorsmartrequest mode.Below are the available actions for web automation:- Evaluate: Runs custom JavaScript code.
{ "Evaluate": "console.log('Hello, World!');" } - Click: Clicks on an element identified by a CSS selector.
{ "Click": "button#submit" } - ClickAll: Clicks on all elements matching a CSS selector.
{ "ClickAll": "button.loadMore" } - ClickPoint: Clicks at the position x and y coordinates.
{ "ClickPoint": { "x": 120.5, "y": 340.25 } } - ClickAllClickable: Clicks on common clickable elements (buttons/inputs/role=button/etc.).
{ "ClickAllClickable": true } - ClickHold: Clicks and holds on an element (via selector) for a duration in milliseconds.
{ "ClickHold": { "selector": "#sliderThumb", "hold_for_ms": 750 } } - ClickHoldPoint: Clicks and holds at a point for a duration in milliseconds.
{ "ClickHoldPoint": { "x": 250.0, "y": 410.0, "hold_for_ms": 750 } } - ClickDrag: Click-and-drag from one element to another (selector → selector) with optional modifier.
{ "ClickDrag": { "from": "#handle", "to": "#target", "modifier": 8 } } - ClickDragPoint: Click-and-drag from one point to another with optional modifier.
{ "ClickDragPoint": { "from_x": 100.0, "from_y": 200.0, "to_x": 500.0, "to_y": 220.0, "modifier": 0 } } - Wait: Waits for a specified duration in milliseconds.
{ "Wait": 2000 } - WaitForNavigation: Waits for the next navigation event.
{ "WaitForNavigation": true } - WaitFor: Waits for an element to appear identified by a CSS selector.
{ "WaitFor": "div#content" } - WaitForWithTimeout: Waits for an element to appear with a timeout (ms).
{ "WaitForWithTimeout": { "selector": "div#content", "timeout": 8000 } } - WaitForAndClick: Waits for an element to appear and then clicks on it, identified by a CSS selector.
{ "WaitForAndClick": "button#loadMore" } - WaitForDom: Waits for DOM updates to settle (quiet/stable) on a selector (or body) with timeout (ms).
{ "WaitForDom": { "selector": "main", "timeout": 12000 } } - ScrollX: Scrolls the screen horizontally by a specified number of pixels.
{ "ScrollX": 100 } - ScrollY: Scrolls the screen vertically by a specified number of pixels.
{ "ScrollY": 200 } - Fill: Fills an input element with a specified value.
{ "Fill": { "selector": "input#name", "value": "John Doe" } } - Type: Type a key into the browser with an optional modifier.
{ "Type": { "value": "John Doe", "modifier": 0 } } - InfiniteScroll: Scrolls the page until the end for certain duration.
{ "InfiniteScroll": 3000 } - Screenshot: Perform a screenshot on the page.
{ "Screenshot": { "full_page": true, "omit_background": true, "output": "out.png" } } - ValidateChain: Set this before a step to validate the prior action to break out of the chain.
{ "ValidateChain": true }
TipCustom web automation allows you to take control of the browser with events for up to 60 seconds at a time per page.
- evaluate_on_new_documentScreenshot API -stringSet a custom script to evaluate on new document creation.
- country_codeScreenshot API -string
Set an ISO country code for proxy connections. View the locations list for available countries.
TipThe country code allows you to run requests in regions where access to the website is restricted to within that specific region. - localeScreenshot API -string
The locale to use for the request, for example
en-US.
import requests
headers = {
'Authorization': 'Bearer $SPIDER_API_KEY',
'Content-Type': 'application/json',
}
json_data = {"limit":5,"url":"https://spider.cloud"}
response = requests.post('https://api.spider.cloud/screenshot',
headers=headers, json=json_data)
print(response.json())[ { "content": "<resource>...", "error": null, "status": 200, "duration_elapsed_ms": 122, "costs": { "ai_cost": 0, "compute_cost": 0.00001, "file_cost": 0.00002, "bytes_transferred_cost": 0.00002, "total_cost": 0.00004, "transform_cost": 0.0001 }, "url": "https://spider.cloud" }, // more content... ]
Transform HTML
View detailsTransform HTML into Markdown or plain text quickly. Each HTML transformation starts at 0.1 credits, while PDF transformations can cost up to 10 credits per page. You can submit up to 10 MB of data per request. The Transform API is also integrated into the /crawl endpoint via the return_format parameter.
https://api.spider.cloud/transformResponse
- dataTransform API -objectrequired
A list of html data to transform. The object list takes the keys
htmlandurl. The url key is optional and only used when readability is enabled.
- return_formatTransform API -string | arrayDefault: rawmarkdowncommonmarkrawtextxmlbytesempty
The format to return the data in. Possible values are
markdown,commonmark,raw,text,xml,bytes, andempty. Userawto return the default format of the page likeHTMLetc.TipUsually you want to usemarkdownfor LLM processing ortext. If you need to store the files without losing any encoding, usebytesorraw. PDF transformations may take up to 1 cent per page for high accuracy. - readabilityTransform API -booleanDefault: false
Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
TipThis uses the Safari Reader Mode algorithm to extract only important information from the content. - clean_fullTransform API -booleanDefault: false
Clean the HTML fully of unwanted attributes.
- cleanTransform API -booleanDefault: false
Clean the markdown or text for AI removing footers, navigation, and more.
import requests
headers = {
'Authorization': 'Bearer $SPIDER_API_KEY',
'Content-Type': 'application/json',
}
json_data = {"return_format":"markdown","data":[{"html":"<html><body>\n<h1>Example Website</h1>\n<p>This is some example markup to use to test the transform function.</p>\n<p><a href=\"https://spider.cloud/guides\">Guides</a></p>\n</body></html>","url":"https://example.com"}]}
response = requests.post('https://api.spider.cloud/transform',
headers=headers, json=json_data)
print(response.json()){ "content": [ "# Example Website This is some example markup to use to test the transform function. [Guides](https://spider.cloud/guides)" ], "cost": { "ai_cost": 0, "compute_cost": 0, "file_cost": 0, "bytes_transferred_cost": 0, "total_cost": 0, "transform_cost": 0.0001 }, "error": null, "status": 200 }
Proxy-Mode
Spider also offers a proxy front-end to the service. The Spider proxy handles requests just like any standard proxy request, with the option to use high-performance and residential proxies at up to 10 GB/s. Take a look at all of our proxy locations to see if we support the country.
Proxy-Mode works with all core endpoints: Crawl, Scrape, Screenshot, Search, and Links. Pass API parameters in the password field to configure rendering, proxies, and more.
**HTTP address**: proxy.spider.cloud:80
**HTTPS address**: proxy.spider.cloud:443
**Username**: YOUR-API-KEY
**Password**: PARAMETERS
- •Residential — real-user IPs across 100+ countries. High anonymity, up to 1 GB/s. $1–4/GB
- •ISP — stable datacenter IPs with ISP-grade routing. Highest performance, up to 10 GB/s. $1/GB
- •Mobile — real 4G/5G device IPs for maximum stealth. $2/GB
Use country_code to set geolocation and proxy to select the pool type.
| Proxy Type | Price | Multiplier | Description |
|---|---|---|---|
| residential | $2.00/GB | ×2–×4 | Entry-level residential pool |
| mobile | $2.00/GB | ×2 | 4G/5G mobile proxies for stealth |
| isp | $1.00/GB | ×1 | ISP-grade residential routing |
import requests, os
# Proxy configuration
proxies = {
'http': f"http://{os.getenv('SPIDER_API_KEY')}:proxy=residential@proxy.spider.cloud:8888",
'https': f"https://{os.getenv('SPIDER_API_KEY')}:proxy=residential@proxy.spider.cloud:8889"
}
# Function to make a request through the proxy
def get_via_proxy(url):
try:
response = requests.get(url, proxies=proxies)
response.raise_for_status()
print('Response HTTP Status Code: ', response.status_code)
print('Response HTTP Response Body: ', response.content)
return response.text
except requests.exceptions.RequestException as e:
print(f"Error: {e}")
return None
# Example usage
if __name__ == "__main__":
get_via_proxy("https://www.example.com")
get_via_proxy("https://www.example.com/community")Browser
Spider Browser is a Rust-based cloud browser for automation, scraping, and AI extraction. Connect via the browser.spider.cloud WebSocket endpoint using any Playwright or Puppeteer compatible client, or use the spider-browser TypeScript library for a higher-level API with built-in AI actions.
**WebSocket endpoint**: wss://browser.spider.cloud/v1/browser
**Authentication**: ?token=YOUR-API-KEY
**Protocol**: CDP, WebDriver BiDi
- •AI extraction & actions — extract structured data or perform actions with natural language. Vision models handle complex pages.
- •Stealth & proxies — automatic fingerprint rotation, residential proxies, and a retry engine that recovers sessions on its own.
- •100 concurrent browsers — per user on all plans. Pass
stealth,browser, andcountryquery params to configure each session.
Sessions can be recorded and replayed from the dashboard. See the spider-browser repo for full documentation and examples.
import { SpiderBrowser } from "spider-browser"
const browser = new SpiderBrowser({
apiKey: process.env.SPIDER_API_KEY!,
})
await browser.init()
await browser.page.goto("https://example.com")
// extract structured data with AI
const prices = await browser.extract("Get all product prices")
// perform actions with natural language
await browser.act("Add the cheapest item to the cart")
// take a screenshot
const screenshot = await browser.page.screenshot()
await browser.close()
import { SpiderBrowser } from "spider-browser"
// Connect via raw CDP WebSocket for full control
const browser = new SpiderBrowser({
apiKey: process.env.SPIDER_API_KEY!,
})
await browser.init()
// Use the CDP session directly
const client = await browser.page.context().newCDPSession(browser.page)
// Enable network interception
await client.send("Network.enable")
client.on("Network.responseReceived", (params) => {
console.log(params.response.url, params.response.status)
})
await browser.page.goto("https://example.com")
await browser.close()
import { SpiderBrowser } from "spider-browser"
// Enable session recording — replay later in the dashboard
const browser = new SpiderBrowser({
apiKey: process.env.SPIDER_API_KEY!,
record: true, // screencast + interaction capture
})
await browser.init()
await browser.page.goto("https://example.com")
await browser.act("Click the login button")
await browser.act("Fill in the email field with test@example.com")
// Recording is automatically saved when the session ends
await browser.close()
// View recordings at spider.cloud/account/recordings
Queries
Query the data that you collect during crawling and scraping. Add dynamic filters for extracting exactly what is needed.
Logs
Get the last 24 hours of logs.
https://api.spider.cloud/data/crawl_logsResponse
- urlLogs API -string
Filter a single url record.
- limitLogs API -string
The limit of records to get.
- domainLogs API -string
Filter a single domain record.
- pageLogs API -number
The current page to get.
import requests
headers = {
'Authorization': 'Bearer $SPIDER_API_KEY',
'Content-Type': 'application/jsonl',
}
response = requests.get('https://api.spider.cloud/data/crawl_logs?limit=5&return_format=markdown&url=https%3A%2F%2Fspider.cloud',
headers=headers)
print(response.json()){ "data": { "id": "195bf2f2-2821-421d-b89c-f27e57ca71fh", "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg", "domain": "spider.cloud", "url": "https://spider.cloud", "links": 1, "credits_used": 3, "mode": 2, "crawl_duration": 340, "message": null, "request_user_agent": "Spider", "level": "UI", "status_code": 0, "created_at": "2024-04-21T01:21:32.886863+00:00", "updated_at": "2024-04-21T01:21:32.886863+00:00" }, "error": null }
Credits
Get the remaining credits available.
https://api.spider.cloud/data/creditsResponse
import requests
headers = {
'Authorization': 'Bearer $SPIDER_API_KEY',
'Content-Type': 'application/jsonl',
}
response = requests.get('https://api.spider.cloud/data/credits?limit=5&return_format=markdown&url=https%3A%2F%2Fspider.cloud',
headers=headers)
print(response.json()){ "data": { "id": "8d662167-5a5f-41aa-9cb8-0cbb7d536891", "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg", "credits": 53334, "created_at": "2024-04-21T01:21:32.886863+00:00", "updated_at": "2024-04-21T01:21:32.886863+00:00" } }
Scraper Configs Alpha
Browse optimized scraper configs for popular websites. Each config defines extraction rules (selectors, AI prompts, stealth settings, and more) curated for the best results out of the box.
Scraper Directory Alpha
Browse optimized scraper configs for popular websites. Filter by domain, category, or search term. Each config is curated to deliver the best extraction results out of the box. No authentication required.
https://api.spider.cloud/data/scraper-directoryResponse
- urlScraper API -string
Filter a single url record.
- limitScraper API -string
The limit of records to get.
- domainScraper API -string
Filter a single domain record.
- pageScraper API -number
The current page to get.
import requests
headers = {
'Authorization': 'Bearer $SPIDER_API_KEY',
'Content-Type': 'application/jsonl',
}
response = requests.get('https://api.spider.cloud/data/scraper-directory?limit=5&return_format=markdown&url=https%3A%2F%2Fspider.cloud',
headers=headers)
print(response.json()){ "data": [ { "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890", "domain": "example.com", "path_pattern": "/blog/*", "display_name": "Example Blog Scraper", "description": "Extracts blog posts with title, author, and content.", "category": "news", "tags": ["blog", "articles"], "confidence_score": 0.95, "validation_count": 12, "slug": "example-com-blog", "created_at": "2025-12-01T10:00:00+00:00", "updated_at": "2026-01-15T08:30:00+00:00" } ], "total": 1, "page": 1, "limit": 20, "total_pages": 1 }