API Reference
The Spider API is based on REST. Our API is predictable, returns JSON-encoded responses, and uses standard HTTP response codes, authentication, and verbs. Set your API secret key in the authorization header using the format Bearer $TOKEN. You can use the content-type header with application/json, application/xml, text/csv, or application/jsonl to shape the response.
The Spider API varies for each account as we release new versions and tailor functionality. You can add v1 before any path to lock in that version. Executing a request on this page by pressing the Run button consumes live credits, and the response is a genuine result. The system is constantly improving to ensure you can handle the dynamic aspects of the web. Spider provides all the tools you need to collect data from any website.
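The authentication and versioning rules above can be sketched as two small helpers. This is a minimal sketch, not part of the official client: the function names build_headers and endpoint are hypothetical, and placing v1 directly after the host is an assumption based on the sentence "add v1 before any path".

```python
BASE_URL = "https://api.spider.cloud"

# Content types the reference lists for shaping the response.
ALLOWED_CONTENT_TYPES = {
    "application/json",
    "application/xml",
    "text/csv",
    "application/jsonl",
}


def build_headers(token, content_type="application/json"):
    """Return headers with Bearer authentication and the desired content type."""
    if content_type not in ALLOWED_CONTENT_TYPES:
        raise ValueError(f"unsupported content type: {content_type}")
    return {"Authorization": f"Bearer {token}", "Content-Type": content_type}


def endpoint(path, version=None):
    """Join the base URL and a path, optionally pinning a version such as 'v1'.

    Assumption: the version segment goes between the host and the path.
    """
    parts = [BASE_URL]
    if version:
        parts.append(version.strip("/"))
    parts.append(path.strip("/"))
    return "/".join(parts)
```

In practice the token would come from the environment, for example build_headers(os.getenv("SPIDER_API_KEY")), and the result would be passed to requests.post together with endpoint("crawl", version="v1").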
Just getting started?
Check out our development quickstart guide.
Not a developer?
Use Spider's no-code options or apps to get started with Spider and do more with your Spider account, no code required.
https://api.spider.cloud
Client libraries
Crawl websites
Start crawling a website or websites to collect resources. You can pass an array of objects for the request body.
POST https://api.spider.cloud/crawl
Request body
url required string
The URI resource to crawl. This can be a comma-separated list for multiple URLs.
To reduce latency, improve performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
limit number
The maximum number of pages to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.
It is better to set a limit upfront on websites whose size you do not know. Re-crawling can effectively use the cache to keep costs low as new pages are found.
store_data boolean
Decide whether to store data. If enabled, this option overrides storageless. Defaults to false.
Set to true to collect resources to download and re-use later on. Note: storing data costs $0.30 per gigabyte per month (about $0.01 per day).
request string
The request type to perform. Possible values are http, chrome, and smart. Use smart to perform an HTTP request by default until JavaScript rendering is needed for the HTML. Defaults to smart.
The request type greatly influences how the output is going to look. If the page is server-side rendered, you can stick to the defaults for the most part.
metadata boolean
Collect metadata about the content found, such as page title, description, and keywords. This could help improve AI interoperability. Defaults to false.
Using metadata can help extract critical information to use for AI.
tld boolean
Allow TLDs to be included. Defaults to false.
return_format string | array
The format to return the data in. Possible values are markdown, commonmark, raw, text, xml, bytes, and empty. Use raw to return the default format of the page, such as HTML. Defaults to raw.
Usually you want to use markdown or text for LLM processing. If you need to store the files without losing any encoding, use the bytes or raw format.
readability boolean
Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage. Defaults to false.
This uses the Safari Reader Mode algorithm to extract only important information from the content.
css_extraction_map object
Use CSS or XPath selectors to scrape contents from the web page. Set the paths and the extraction object map to perform extractions per path or page.
Scrape content using CSS selectors to get data. You can scrape using selectors at no extra cost.
anti_bot boolean
Enable anti-bot mode, which uses various techniques to increase the chance of success. Defaults to true.
This config attempts to make the request resemble a real human. If the request fails on chrome, it is retried using a virtual display, which is more difficult to block at the cost of speed.
cookies string
Add HTTP cookies to use for the request.
Set the cookie value for pages that use SSR authentication.
stealth boolean
Use stealth mode for headless chrome requests to help prevent being blocked. Defaults to true.
Set to true to minimize the chance of being detected.
cache boolean
Use HTTP caching for the crawl to speed up repeated runs. Defaults to true.
Enabling caching can save costs when you need to perform transformations on different files or handle various events on a website.
delay number
Add a crawl delay of up to 60 seconds, which disables concurrency. The delay must be given in milliseconds. Defaults to 0, which means it is disabled.
Using a delay can help with crawls that run on a cron schedule and do not require immediate data retrieval.
storageless boolean
Prevent storing any type of data for the request. Defaults to true unless you have the website already stored.
scroll number
Infinite scroll the page as new content loads, up to a duration in milliseconds. The duration is the maximum time to spend scrolling. You may still need to use the wait_for parameters, and the request must be made using chrome.
Use the wait_for configuration to decide when scrolling stops, and disable_intercept to make sure you get data from the network regardless of hostname.
device object
Configure the device for chrome. One of mobile, tablet, or desktop. Defaults to desktop.
viewport object
Configure the viewport for chrome. Defaults to 800x600.
If you need to get data from a website as a mobile device, set the viewport to a phone's size, for example 375x414.
locale string
The locale to use for the request, for example en-US.
country_code string
Set an ISO country code for proxy connections.
The country code allows you to run requests in regions where access to the website is restricted to that specific region.
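Since the url parameter accepts a comma-separated list, several URLs can be batched into one request body. The sketch below is illustrative only: batch_crawl_payload is a hypothetical helper name, and the default limit and return_format values simply mirror the example request on this page.

```python
def batch_crawl_payload(urls, limit=5, return_format="markdown"):
    """Build one /crawl request body covering several URLs.

    The URLs are joined with commas, as the url parameter describes.
    """
    cleaned = [u.strip() for u in urls if u and u.strip()]
    if not cleaned:
        raise ValueError("at least one URL is required")
    return {
        "url": ",".join(cleaned),
        "limit": limit,
        "return_format": return_format,
    }


payload = batch_crawl_payload(["https://spider.cloud", " https://example.com "])
print(payload["url"])
```

The resulting dictionary can be passed as json=payload to requests.post. For large websites with high page limits, the guidance above still applies: send those requests individually.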
import requests, os
headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}
json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}
response = requests.post('https://api.spider.cloud/crawl',
    headers=headers, json=json_data)
print(response.json())
[ { "content": "<resource>...", "error": null, "status": 200, "costs": { "ai_cost": 0, "compute_cost": 0.0001, "file_cost": 0.0002, "bytes_transferred_cost": 0.0002, "total_cost": 0.0004, "transform_cost": 0.0001 }, "url": "https://spider.cloud" }, // more content... ]
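Each item in the response carries a costs object, so the spend of a crawl can be tallied on the client. The following is a minimal sketch that assumes the response is a list shaped like the example above; summarize_crawl is a hypothetical name, and the failed entry in the sample data is invented for illustration.

```python
def summarize_crawl(pages):
    """Return the total cost and the URLs that did not come back cleanly."""
    total = 0.0
    failed = []
    for page in pages:
        # Missing cost fields are treated as zero rather than raising.
        total += (page.get("costs") or {}).get("total_cost", 0.0)
        if page.get("error") is not None or page.get("status") != 200:
            failed.append(page.get("url"))
    return {"pages": len(pages), "total_cost": total, "failed": failed}


# Sample data: the first item mirrors the documented response shape,
# the second is a hypothetical failure used only for this example.
sample = [
    {"url": "https://spider.cloud", "status": 200, "error": None,
     "costs": {"total_cost": 0.0004}},
    {"url": "https://spider.cloud/missing", "status": 404, "error": "not found",
     "costs": {"total_cost": 0.0001}},
]
print(summarize_crawl(sample))
```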
Search
Perform a search and gather a list of websites to start crawling and collect resources.
POST https://api.spider.cloud/search
Request body
search required string
The search query to run.
limit number
The maximum number of pages to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.
It is better to set a limit upfront on websites whose size you do not know. Re-crawling can effectively use the cache to keep costs low as new pages are found.
store_data boolean
Decide whether to store data. If enabled, this option overrides storageless. Defaults to false.
Set to true to collect resources to download and re-use later on. Note: storing data costs $0.30 per gigabyte per month (about $0.01 per day).
search_limit number
The maximum number of URLs to fetch or crawl from the search results. Remove the value or set it to 0 to crawl all URLs from the real-time search results. This is a shorthand if you do not want to use num.
fetch_page_content boolean
Fetch all the content of the websites by performing crawls. Defaults to true; if disabled, only the search results are returned.
request string
The request type to perform. Possible values are http, chrome, and smart. Use smart to perform an HTTP request by default until JavaScript rendering is needed for the HTML. Defaults to smart.
The request type greatly influences how the output is going to look. If the page is server-side rendered, you can stick to the defaults for the most part.
country string
The two-letter country code to use for the search (e.g., us for the United States).
location string
The location from where you want the search to originate.
language string
The two-letter language code to use for the search (e.g., en for English).
return_format string | array
The format to return the data in. Possible values are markdown, commonmark, raw, text, xml, bytes, and empty. Use raw to return the default format of the page, such as HTML. Defaults to raw.
Usually you want to use markdown or text for LLM processing. If you need to store the files without losing any encoding, use the bytes or raw format.
readability boolean
Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage. Defaults to false.
This uses the Safari Reader Mode algorithm to extract only important information from the content.
css_extraction_map object
Use CSS or XPath selectors to scrape contents from the web page. Set the paths and the extraction object map to perform extractions per path or page.
Scrape content using CSS selectors to get data. You can scrape using selectors at no extra cost.
anti_bot boolean
Enable anti-bot mode, which uses various techniques to increase the chance of success. Defaults to true.
This config attempts to make the request resemble a real human. If the request fails on chrome, it is retried using a virtual display, which is more difficult to block at the cost of speed.
cookies string
Add HTTP cookies to use for the request.
Set the cookie value for pages that use SSR authentication.
stealth boolean
Use stealth mode for headless chrome requests to help prevent being blocked. Defaults to true.
Set to true to minimize the chance of being detected.
cache boolean
Use HTTP caching for the crawl to speed up repeated runs. Defaults to true.
Enabling caching can save costs when you need to perform transformations on different files or handle various events on a website.
delay number
Add a crawl delay of up to 60 seconds, which disables concurrency. The delay must be given in milliseconds. Defaults to 0, which means it is disabled.
Using a delay can help with crawls that run on a cron schedule and do not require immediate data retrieval.
storageless boolean
Prevent storing any type of data for the request. Defaults to true unless you have the website already stored.
scroll number
Infinite scroll the page as new content loads, up to a duration in milliseconds. The duration is the maximum time to spend scrolling. You may still need to use the wait_for parameters, and the request must be made using chrome.
Use the wait_for configuration to decide when scrolling stops, and disable_intercept to make sure you get data from the network regardless of hostname.
device object
Configure the device for chrome. One of mobile, tablet, or desktop. Defaults to desktop.
viewport object
Configure the viewport for chrome. Defaults to 800x600.
If you need to get data from a website as a mobile device, set the viewport to a phone's size, for example 375x414.
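The search-specific parameters can be assembled and checked before sending. This is a sketch under assumptions: build_search_payload is a hypothetical helper, and the two-letter check reflects only what the country and language descriptions state, not server-side validation.

```python
def build_search_payload(query, search_limit=3, limit=5,
                         country=None, language=None,
                         fetch_page_content=True):
    """Build a /search request body and validate the two-letter codes."""
    if not query or not query.strip():
        raise ValueError("search query is required")
    payload = {
        "search": query.strip(),
        "search_limit": search_limit,
        "limit": limit,
        "fetch_page_content": fetch_page_content,
    }
    for key, value in (("country", country), ("language", language)):
        if value is None:
            continue
        if len(value) != 2 or not value.isalpha():
            raise ValueError(f"{key} must be a two-letter code, got {value!r}")
        payload[key] = value.lower()
    return payload


print(build_search_payload("a sports website", country="US", language="en"))
```

Setting fetch_page_content to False in this helper corresponds to returning only the search results, as described for that parameter.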
import requests, os
headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}
json_data = {"search":"a sports website","search_limit":3,"limit":5,"return_format":"markdown"}
response = requests.post('https://api.spider.cloud/search',
    headers=headers, json=json_data)
print(response.json())
[ { "content": "<resource>...", "error": null, "status": 200, "costs": { "ai_cost": 0, "compute_cost": 0.0001, "file_cost": 0.0002, "bytes_transferred_cost": 0.0002, "total_cost": 0.0004, "transform_cost": 0.0001 }, "url": "https://spider.cloud" }, // more content... ]
Crawl websites get links
Start crawling one or more websites to collect the links found. You can pass an array of objects for the request body.
POST https://api.spider.cloud/links
Request body
url required string
The URI resource to crawl. This can be a comma-separated list for multiple URLs.
To reduce latency, improve performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
limit number
The maximum number of pages to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.
It is better to set a limit upfront on websites whose size you do not know. Re-crawling can effectively use the cache to keep costs low as new pages are found.
store_data boolean
Decide whether to store data. If enabled, this option overrides storageless. Defaults to false.
Set to true to collect resources to download and re-use later on. Note: storing data costs $0.30 per gigabyte per month (about $0.01 per day).
request string
The request type to perform. Possible values are http, chrome, and smart. Use smart to perform an HTTP request by default until JavaScript rendering is needed for the HTML. Defaults to smart.
The request type greatly influences how the output is going to look. If the page is server-side rendered, you can stick to the defaults for the most part.
metadata boolean
Collect metadata about the content found, such as page title, description, and keywords. This could help improve AI interoperability. Defaults to false.
Using metadata can help extract critical information to use for AI.
tld boolean
Allow TLDs to be included. Defaults to false.
return_format string | array
The format to return the data in. Possible values are markdown, commonmark, raw, text, xml, bytes, and empty. Use raw to return the default format of the page, such as HTML. Defaults to raw.
Usually you want to use markdown or text for LLM processing. If you need to store the files without losing any encoding, use the bytes or raw format.
readability boolean
Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage. Defaults to false.
This uses the Safari Reader Mode algorithm to extract only important information from the content.
css_extraction_map object
Use CSS or XPath selectors to scrape contents from the web page. Set the paths and the extraction object map to perform extractions per path or page.
Scrape content using CSS selectors to get data. You can scrape using selectors at no extra cost.
anti_bot boolean
Enable anti-bot mode, which uses various techniques to increase the chance of success. Defaults to true.
This config attempts to make the request resemble a real human. If the request fails on chrome, it is retried using a virtual display, which is more difficult to block at the cost of speed.
cookies string
Add HTTP cookies to use for the request.
Set the cookie value for pages that use SSR authentication.
stealth boolean
Use stealth mode for headless chrome requests to help prevent being blocked. Defaults to true.
Set to true to minimize the chance of being detected.
cache boolean
Use HTTP caching for the crawl to speed up repeated runs. Defaults to true.
Enabling caching can save costs when you need to perform transformations on different files or handle various events on a website.
delay number
Add a crawl delay of up to 60 seconds, which disables concurrency. The delay must be given in milliseconds. Defaults to 0, which means it is disabled.
Using a delay can help with crawls that run on a cron schedule and do not require immediate data retrieval.
storageless boolean
Prevent storing any type of data for the request. Defaults to true unless you have the website already stored.
scroll number
Infinite scroll the page as new content loads, up to a duration in milliseconds. The duration is the maximum time to spend scrolling. You may still need to use the wait_for parameters, and the request must be made using chrome.
Use the wait_for configuration to decide when scrolling stops, and disable_intercept to make sure you get data from the network regardless of hostname.
device object
Configure the device for chrome. One of mobile, tablet, or desktop. Defaults to desktop.
viewport object
Configure the viewport for chrome. Defaults to 800x600.
If you need to get data from a website as a mobile device, set the viewport to a phone's size, for example 375x414.
locale string
The locale to use for the request, for example en-US.
country_code string
Set an ISO country code for proxy connections.
The country code allows you to run requests in regions where access to the website is restricted to that specific region.
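Before enabling store_data, it can help to estimate what storage would cost. The arithmetic below is a rough sketch that assumes the $0.30 per gigabyte per month figure quoted for store_data is billed linearly; estimate_storage_cost is a hypothetical helper, not an API call.

```python
STORAGE_USD_PER_GB_MONTH = 0.30  # figure quoted in the store_data description


def estimate_storage_cost(gigabytes, months=1.0):
    """Estimate storage fees in USD, assuming simple linear billing."""
    if gigabytes < 0 or months < 0:
        raise ValueError("gigabytes and months must be non-negative")
    return round(gigabytes * months * STORAGE_USD_PER_GB_MONTH, 4)


# 50 GB kept for 3 months: 50 * 3 * 0.30 = 45.0 USD
print(estimate_storage_cost(50, months=3))
```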
import requests, os
headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}
json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}
response = requests.post('https://api.spider.cloud/links',
    headers=headers, json=json_data)
print(response.json())
[ { "url": "https://spider.cloud", "status": 200, "error": null }, // more content... ]
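A common follow-up is to keep only the links that were fetched successfully. The sketch below assumes the response is a list of objects with url, status, and error keys, as in the example above; successful_links is a hypothetical helper name and the sample data is invented.

```python
def successful_links(results):
    """Return unique URLs with status 200 and no error, in original order."""
    seen = set()
    links = []
    for item in results:
        url = item.get("url")
        ok = item.get("status") == 200 and item.get("error") is None
        if url and ok and url not in seen:
            seen.add(url)
            links.append(url)
    return links


# Hypothetical response data for illustration.
sample = [
    {"url": "https://spider.cloud", "status": 200, "error": None},
    {"url": "https://spider.cloud/guides", "status": 200, "error": None},
    {"url": "https://spider.cloud", "status": 200, "error": None},
    {"url": "https://spider.cloud/broken", "status": 500, "error": "server error"},
]
print(successful_links(sample))
```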
Screenshot websites
Take screenshots of websites, returned in base64 or binary encoding.
POST https://api.spider.cloud/screenshot
Request body
url required string
The URI resource to crawl. This can be a comma-separated list for multiple URLs.
To reduce latency, improve performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
limit number
The maximum number of pages to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.
It is better to set a limit upfront on websites whose size you do not know. Re-crawling can effectively use the cache to keep costs low as new pages are found.
store_data boolean
Decide whether to store data. If enabled, this option overrides storageless. Defaults to false.
Set to true to collect resources to download and re-use later on. Note: storing data costs $0.30 per gigabyte per month (about $0.01 per day).
metadata boolean
Collect metadata about the content found, such as page title, description, and keywords. This could help improve AI interoperability. Defaults to false.
Using metadata can help extract critical information to use for AI.
tld boolean
Allow TLDs to be included. Defaults to false.
depth number
The crawl limit for maximum depth. If 0, no limit is applied.
Depth lets you cap how far the crawl goes from the base URL path into subfolders.
css_extraction_map object
Use CSS or XPath selectors to scrape contents from the web page. Set the paths and the extraction object map to perform extractions per path or page.
Scrape content using CSS selectors to get data. You can scrape using selectors at no extra cost.
encoding string
The type of encoding to use, such as UTF-8 or SHIFT_JIS.
Perform the encoding on the server when you know the type of website in advance.
return_headers boolean
Return the HTTP response headers with the results. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the HTTP headers can help set up authentication flows.
anti_bot boolean
Enable anti-bot mode, which uses various techniques to increase the chance of success. Defaults to true.
This config attempts to make the request resemble a real human. If the request fails on chrome, it is retried using a virtual display, which is more difficult to block at the cost of speed.
cookies string
Add HTTP cookies to use for the request.
Set the cookie value for pages that use SSR authentication.
stealth boolean
Use stealth mode for headless chrome requests to help prevent being blocked. Defaults to true.
Set to true to minimize the chance of being detected.
cache boolean
Use HTTP caching for the crawl to speed up repeated runs. Defaults to true.
Enabling caching can save costs when you need to perform transformations on different files or handle various events on a website.
delay number
Add a crawl delay of up to 60 seconds, which disables concurrency. The delay must be given in milliseconds. Defaults to 0, which means it is disabled.
Using a delay can help with crawls that run on a cron schedule and do not require immediate data retrieval.
storageless boolean
Prevent storing any type of data for the request. Defaults to true unless you have the website already stored.
scroll number
Infinite scroll the page as new content loads, up to a duration in milliseconds. The duration is the maximum time to spend scrolling. You may still need to use the wait_for parameters, and the request must be made using chrome.
Use the wait_for configuration to decide when scrolling stops, and disable_intercept to make sure you get data from the network regardless of hostname.
device object
Configure the device for chrome. One of mobile, tablet, or desktop. Defaults to desktop.
viewport object
Configure the viewport for chrome. Defaults to 800x600.
If you need to get data from a website as a mobile device, set the viewport to a phone's size, for example 375x414.
locale string
The locale to use for the request, for example en-US.
country_code string
Set an ISO country code for proxy connections.
The country code allows you to run requests in regions where access to the website is restricted to that specific region.
import requests, os
headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}
json_data = {"limit":5,"url":"https://spider.cloud"}
response = requests.post('https://api.spider.cloud/screenshot',
    headers=headers, json=json_data)
print(response.json())
[ { "content": "<resource>...", "error": null, "status": 200, "costs": { "ai_cost": 0, "compute_cost": 0.0001, "file_cost": 0.0002, "bytes_transferred_cost": 0.0002, "total_cost": 0.0004, "transform_cost": 0.0001 }, "url": "https://spider.cloud" }, // more content... ]
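If the content field holds a base64 string, it has to be decoded before it can be written to an image file. The sketch below assumes exactly that and also assumes PNG output for the signature check; neither detail is confirmed by this reference, and decode_screenshot and looks_like_png are hypothetical helpers.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_screenshot(content):
    """Decode a base64-encoded screenshot string into raw bytes."""
    return base64.b64decode(content, validate=True)


def looks_like_png(data):
    """Check the leading bytes against the standard PNG file signature."""
    return data.startswith(PNG_SIGNATURE)


# Stand-in for a real response value: base64 of the PNG signature only.
sample_content = base64.b64encode(PNG_SIGNATURE).decode("ascii")
image_bytes = decode_screenshot(sample_content)
print(looks_like_png(image_bytes))
```

The decoded bytes can then be saved with open(path, "wb").write(image_bytes).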
Transform HTML
Transform HTML to Markdown or text fast. Each HTML transformation costs 1 credit. You can send up to 10MB of data at once. The transform API is also built into the /crawl endpoint by using return_format.
POST https://api.spider.cloud/transform
Request body
data required object
A list of HTML data to transform. Each object in the list takes the keys html and url. The url key is optional and only used when readability is enabled.
return_format string | array
The format to return the data in. Possible values are markdown, commonmark, raw, text, xml, bytes, and empty. Use raw to return the default format of the page, such as HTML. Defaults to raw.
Usually you want to use markdown or text for LLM processing. If you need to store the files without losing any encoding, use the bytes or raw format.
readability boolean
Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage. Defaults to false.
This uses the Safari Reader Mode algorithm to extract only important information from the content.
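Because a transform request accepts at most 10MB of data, it can be useful to check the payload size before sending. This is a sketch under assumptions: build_transform_payload is a hypothetical helper, and treating 10MB as 10 * 1024 * 1024 bytes of serialized JSON is a guess at how the limit is measured.

```python
import json

MAX_PAYLOAD_BYTES = 10 * 1024 * 1024  # assumed interpretation of "10MB"


def build_transform_payload(documents, return_format="markdown",
                            readability=False):
    """Build a /transform body from dicts with 'html' and optional 'url'."""
    data = []
    for doc in documents:
        entry = {"html": doc["html"]}
        if doc.get("url"):
            entry["url"] = doc["url"]  # only used when readability is enabled
        data.append(entry)
    payload = {
        "data": data,
        "return_format": return_format,
        "readability": readability,
    }
    size = len(json.dumps(payload).encode("utf-8"))
    if size > MAX_PAYLOAD_BYTES:
        raise ValueError(f"payload is {size} bytes, above the 10MB limit")
    return payload


payload = build_transform_payload(
    [{"html": "<h1>Example Website</h1>", "url": "https://example.com"}]
)
print(payload["data"][0]["url"])
```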
import requests, os
headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}
json_data = {"return_format":"markdown","data":[{"html":"<html>\n<head>\n <title>Example Transform</title> \n</head>\n<body>\n<div>\n <h1>Example Website</h1>\n <p>This is some example markup to use to test the transform function.</p>\n <p><a href=\"https://spider.cloud/guides\">Guides</a></p>\n</div>\n</body></html>","url":"https://example.com"}]}
response = requests.post('https://api.spider.cloud/transform',
    headers=headers, json=json_data)
print(response.json())
{ "content": [ "Example Domain Example Domain ========== This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission. [More information...](https://www.iana.org/domains/example)" ], "cost": { "ai_cost": 0, "compute_cost": 0, "file_cost": 0, "bytes_transferred_cost": 0, "total_cost": 0, "transform_cost": 0.0001 }, "error": null, "status": 200 }
Query
Query a resource from the global database instead of crawling a website. Each successful retrieval costs 1 credit.
POST https://api.spider.cloud/data/query
Request body
url string
The exact path of the url that you want to get.
domain string
The website domain you want to query.
pathname string
The website pathname you want to query.
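The three query parameters can be combined in a small builder. This is only a sketch: build_query_payload is a hypothetical helper, and the rule that url takes precedence over domain and pathname is an assumption, since the reference does not say how the fields interact.

```python
def build_query_payload(url=None, domain=None, pathname=None):
    """Build a /data/query body from an exact URL or a domain and pathname."""
    if url:
        # Assumption: an exact URL is sufficient on its own.
        return {"url": url}
    if domain:
        payload = {"domain": domain}
        if pathname:
            payload["pathname"] = pathname
        return payload
    raise ValueError("provide either url or domain")


print(build_query_payload(url="https://spider.cloud"))
print(build_query_payload(domain="spider.cloud", pathname="/guides"))
```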
import requests, os
headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}
json_data = {"url":"https://spider.cloud"}
response = requests.post('https://api.spider.cloud/data/query',
    headers=headers, json=json_data)
print(response.json())
{ "content": "<html> <body> <div> <h1>Example Website</h1> </div> </body> </html>", "error": null, "status": 200 }
Proxy-Mode Alpha
Spider also offers a proxy front-end to the service. The Spider proxy handles requests just like any standard request, with the option to use high-performance residential proxies at up to 1 TB/s.
**HTTP address**: proxy.spider.cloud:8888
**HTTPS address**: proxy.spider.cloud:8889
**Username**: YOUR-API-KEY
**Password**: PARAMETERS
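The username and password fields above map onto a standard proxy URL, with the request parameters encoded as a query string in the password position. The helper below is a minimal sketch of that mapping; build_proxy_url is a hypothetical name, and the parameter names request and premium_proxy are taken from the example on this page.

```python
from urllib.parse import quote, urlencode

PROXY_HOST = "proxy.spider.cloud"


def build_proxy_url(api_key, params, secure=False):
    """Build a proxy URL with the API key as username and params as password."""
    scheme, port = ("https", 8889) if secure else ("http", 8888)
    password = urlencode(params)  # e.g. request=Raw&premium_proxy=False
    return f"{scheme}://{quote(api_key, safe='')}:{password}@{PROXY_HOST}:{port}"


print(build_proxy_url("YOUR-API-KEY", {"request": "Raw", "premium_proxy": False}))
```

The returned strings can be used as the values of the proxies dictionary passed to requests.get.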
import os
import requests

# Proxy configuration: the API key is the username, the parameters are the password
proxies = {
    'http': f"http://{os.getenv('SPIDER_API_KEY')}:request=Raw&premium_proxy=False@proxy.spider.cloud:8888",
    'https': f"https://{os.getenv('SPIDER_API_KEY')}:request=Raw&premium_proxy=False@proxy.spider.cloud:8889"
}

# Function to make a request through the proxy
def get_via_proxy(url):
    try:
        response = requests.get(url, proxies=proxies)
        response.raise_for_status()
        print('Response HTTP Status Code: ', response.status_code)
        print('Response HTTP Response Body: ', response.content)
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")
        return None

# Example usage
if __name__ == "__main__":
    get_via_proxy("https://www.example.com")
    get_via_proxy("https://www.example.com/community")
Pipelines
Create powerful workflows with our pipeline API endpoints. Use AI to extract leads from any website or filter links with prompts with ease.
Extract leads
Start crawling one or more websites to collect all leads using AI. A minimum of $25 in credits is required for extraction.
POST https://api.spider.cloud/pipeline/extract-contacts
Request body
url required string
The URI resource to crawl. This can be a comma-separated list for multiple URLs.
To reduce latency, improve performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
limit number
The maximum number of pages to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.
It is better to set a limit upfront on websites whose size you do not know. Re-crawling can effectively use the cache to keep costs low as new pages are found.
store_data boolean
Decide whether to store data. If enabled, this option overrides storageless. Defaults to false.
Set to true to collect resources to download and re-use later on. Note: storing data costs $0.30 per gigabyte per month (about $0.01 per day).
request string
The request type to perform. Possible values are http, chrome, and smart. Use smart to perform an HTTP request by default until JavaScript rendering is needed for the HTML. Defaults to smart.
The request type greatly influences how the output is going to look. If the page is server-side rendered, you can stick to the defaults for the most part.
metadata boolean
Collect metadata about the content found, such as page title, description, and keywords. This could help improve AI interoperability. Defaults to false.
Using metadata can help extract critical information to use for AI.
tld boolean
Allow TLDs to be included. Defaults to false.
return_format string | array
The format to return the data in. Possible values are markdown, commonmark, raw, text, xml, bytes, and empty. Use raw to return the default format of the page, such as HTML. Defaults to raw.
Usually you want to use markdown or text for LLM processing. If you need to store the files without losing any encoding, use the bytes or raw format.
readability boolean
Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage. Defaults to false.
This uses the Safari Reader Mode algorithm to extract only important information from the content.
css_extraction_map object
Use CSS or XPath selectors to scrape contents from the web page. Set the paths and the extraction object map to perform extractions per path or page.
Scrape content using CSS selectors to get data. You can scrape using selectors at no extra cost.
anti_bot boolean
Enable anti-bot mode, which uses various techniques to increase the chance of success. Defaults to true.
This config attempts to make the request resemble a real human. If the request fails on chrome, it is retried using a virtual display, which is more difficult to block at the cost of speed.
cookies string
Add HTTP cookies to use for the request.
Set the cookie value for pages that use SSR authentication.
stealth boolean
Use stealth mode for headless chrome requests to help prevent being blocked. Defaults to true.
Set to true to minimize the chance of being detected.
cache boolean
Use HTTP caching for the crawl to speed up repeated runs. Defaults to true.
Enabling caching can save costs when you need to perform transformations on different files or handle various events on a website.
delay number
Add a crawl delay of up to 60 seconds, which disables concurrency. The delay must be given in milliseconds. Defaults to 0, which means it is disabled.
Using a delay can help with crawls that run on a cron schedule and do not require immediate data retrieval.
storageless boolean
Prevent storing any type of data for the request. Defaults to true unless you have the website already stored.
scroll number
Infinite scroll the page as new content loads, up to a duration in milliseconds. The duration is the maximum time to spend scrolling. You may still need to use the wait_for parameters, and the request must be made using chrome.
Use the wait_for configuration to decide when scrolling stops, and disable_intercept to make sure you get data from the network regardless of hostname.
device object
Configure the device for chrome. One of mobile, tablet, or desktop. Defaults to desktop.
viewport object
Configure the viewport for chrome. Defaults to 800x600.
If you need to get data from a website as a mobile device, set the viewport to a phone's size, for example 375x414.
locale string
The locale to use for the request, for example en-US.
country_code string
Set an ISO country code for proxy connections.
The country code allows you to run requests in regions where access to the website is restricted to that specific region.
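The delay parameter is expressed in milliseconds and capped at 60 seconds, which is easy to get wrong when thinking in seconds. The conversion below is a small sketch; delay_ms is a hypothetical helper, and rejecting out-of-range values on the client is a design choice here, not documented server behavior.

```python
MAX_DELAY_MS = 60_000  # the documented ceiling of 60 seconds


def delay_ms(seconds):
    """Convert a delay in seconds to the millisecond value the API expects."""
    ms = int(round(seconds * 1000))
    if ms < 0 or ms > MAX_DELAY_MS:
        raise ValueError(f"delay must be between 0 and 60 seconds, got {seconds}")
    return ms


# A 1.5 second pause between requests becomes 1500 milliseconds.
print(delay_ms(1.5))
```

Keep in mind that any non-zero delay disables concurrency, so it trades throughput for politeness.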
import requests, os
headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}
json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}
response = requests.post('https://api.spider.cloud/pipeline/extract-contacts',
    headers=headers, json=json_data)
print(response.json())
[ { "content": "<resource>...", "error": null, "status": 200, "costs": { "ai_cost": 0, "compute_cost": 0.0001, "file_cost": 0.0002, "bytes_transferred_cost": 0.0002, "total_cost": 0.0004, "transform_cost": 0.0001 }, "url": "https://spider.cloud" }, // more content... ]
Label website
Crawl a website and accurately categorize it using AI.
POST https://api.spider.cloud/pipeline/label
Request body
url required string
The URI resource to crawl. This can be a comma-separated list for multiple URLs.
To reduce latency, improve performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
limit number
The maximum number of pages to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.
It is better to set a limit upfront on websites whose size you do not know. Re-crawling can effectively use the cache to keep costs low as new pages are found.
store_data boolean
Decide whether to store data. If enabled, this option overrides storageless. Defaults to false.
Set to true to collect resources to download and re-use later on. Note: storing data costs $0.30 per gigabyte per month (about $0.01 per day).
request string
The request type to perform. Possible values are http, chrome, and smart. Use smart to perform an HTTP request by default until JavaScript rendering is needed for the HTML. Defaults to smart.
The request type greatly influences how the output is going to look. If the page is server-side rendered, you can stick to the defaults for the most part.
metadata boolean
Collect metadata about the content found, such as page title, description, and keywords. This could help improve AI interoperability. Defaults to false.
Using metadata can help extract critical information to use for AI.
tld boolean
Allow TLDs to be included. Defaults to false.
return_format string | array
The format to return the data in. Possible values are
markdown
,commonmark
,raw
,text
,xml
,bytes
, andempty
. Useraw
to return the default format of the page likeHTML
etc. Defaults toraw
.Usually you want to use
markdown
for LLM processing ortext
. If you need to store the files without losing any encoding, use thebytes
orraw
format.readability boolean
Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage. Defaults to
false
.This uses the Safari Reader Mode algorithm to extract only important information from the content.
css_extraction_map object
Use CSS or XPath selectors to scrape contents from the web page. Set the paths and the extraction object map to perform extractions per path or page.
Scrape content using CSS selectors to get data. You can scrape using selectors at no extra cost.
anti_bot boolean
Enable anti-bot mode using various techniques to increase the chance of success. Defaults to true.
This config will attempt to make the request resemble a real human. If the request fails on chrome, it will retry using a virtual display that is harder to block, at the cost of speed.
cookies string
Add HTTP cookies to use for the request.
Set the cookie value for pages that use SSR authentication.
stealth boolean
Use stealth mode for headless chrome requests to help prevent being blocked. Defaults to true.
Set to true to greatly reduce the chance of being detected.
cache boolean
Use HTTP caching for the crawl to speed up repeated runs. Defaults to true.
Enabling caching can save costs when you need to perform transformations on different files or handle various events on a website.
delay number
Add a crawl delay of up to 60 seconds, disabling concurrency. The delay must be given in milliseconds. Defaults to 0, which indicates it is disabled.
Using a delay can help with websites that are set on a cron and do not require immediate data retrieval.
storageless boolean
Prevent storing any type of data for the request. Defaults to true unless you have the website already stored.
scroll number
Infinite scroll the page as new content loads, up to a duration in milliseconds. The duration represents the maximum time you would wait to scroll. You may still need to use the wait_for parameters. You also need to ensure that the request is made using chrome.
Use the wait_for configuration to scroll until a condition is met, and disable_intercept to make sure you get data from the network regardless of hostname.
device object
Configure the device for chrome. One of mobile, tablet, or desktop. Defaults to desktop.
viewport object
Configure the viewport for chrome. Defaults to 800x600.
If you need to get data from a website as a mobile device, set the viewport to a phone's size, e.g. 375x414.
locale string
The locale to use for the request, for example en-US.
country_code string
Set an ISO country code for proxy connections.
The country code allows you to run requests in regions where access to the website is restricted to within that specific region.
import requests, os
headers = {
'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
'Content-Type': 'application/jsonl',
}
json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}
response = requests.post('https://api.spider.cloud/pipeline/label',
headers=headers, json=json_data)
print(response.json())
[ { "content": "<resource>...", "error": null, "status": 200, "costs": { "ai_cost": 0, "compute_cost": 0.0001, "file_cost": 0.0002, "bytes_transferred_cost": 0.0002, "total_cost": 0.0004, "transform_cost": 0.0001 }, "url": "https://spider.cloud" }, // more content... ]
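Several of the parameters above constrain each other: scroll needs the chrome request type, and delay is given in milliseconds with a ceiling of 60 seconds. The sketch below, with a hypothetical helper named build_crawl_payload, checks those rules on the client before a request body is sent. The checks mirror the prose in this reference only; the API itself may validate differently.

```python
def build_crawl_payload(url, limit=0, request="smart", scroll_ms=None, delay_ms=0):
    """Build a request body from the parameters described above.

    Hypothetical helper: it only enforces what the parameter
    descriptions state (allowed request types, delay range, and
    scroll requiring chrome).
    """
    if request not in ("http", "chrome", "smart"):
        raise ValueError("request must be http, chrome, or smart")
    if not 0 <= delay_ms <= 60_000:
        raise ValueError("delay must be between 0 and 60000 milliseconds")

    payload = {"url": url, "limit": limit, "request": request}
    if delay_ms:
        payload["delay"] = delay_ms
    if scroll_ms is not None:
        if request != "chrome":
            raise ValueError("scroll requires request='chrome'")
        payload["scroll"] = scroll_ms
    return payload


print(build_crawl_payload("https://spider.cloud", limit=5, request="chrome", scroll_ms=5000))
```

The returned dictionary can be passed as the json argument of requests.post in the example above.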
Crawl websites from text
Crawl website(s) found from raw text or markdown.
POST https://api.spider.cloud/pipeline/crawl-text
Request body
text required string
The text to extract URLs from. The maximum size of the text is 10 MB.
limit number
The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.
It is better to set a limit upfront on websites where you do not know the size. Re-crawling can effectively use cache to keep costs low as new pages are found.
store_data boolean
Decide whether to store data. If enabled, this option overrides storageless. The default setting is false.
Set to true to collect resources to download and re-use later on. Please note: storing data incurs a fee of $0.30 per gigabyte per month, or $0.01 daily.
request string
The request type to perform. Possible values are http, chrome, and smart. Use smart to perform an HTTP request by default until JavaScript rendering is needed for the HTML. Defaults to smart.
The request type greatly influences how the output is going to look. If the page is server-side rendered, you can stick to the defaults for the most part.
metadata boolean
Collect metadata about the content found, such as page title, description, and keywords. This could help improve AI interoperability. Defaults to false.
Using metadata can help extract critical information to use for AI.
tld boolean
Allow TLDs to be included. Defaults to false.
return_format string | array
The format to return the data in. Possible values are markdown, commonmark, raw, text, xml, bytes, and empty. Use raw to return the default format of the page, such as HTML. Defaults to raw.
Usually you want to use markdown or text for LLM processing. If you need to store the files without losing any encoding, use the bytes or raw format.
readability boolean
Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage. Defaults to false.
This uses the Safari Reader Mode algorithm to extract only important information from the content.
css_extraction_map object
Use CSS or XPath selectors to scrape contents from the web page. Set the paths and the extraction object map to perform extractions per path or page.
Scrape content using CSS selectors to get data. You can scrape using selectors at no extra cost.
anti_bot boolean
Enable anti-bot mode using various techniques to increase the chance of success. Defaults to true.
This config will attempt to make the request resemble a real human. If the request fails on chrome, it will retry using a virtual display that is harder to block, at the cost of speed.
cookies string
Add HTTP cookies to use for the request.
Set the cookie value for pages that use SSR authentication.
stealth boolean
Use stealth mode for headless chrome requests to help prevent being blocked. Defaults to true.
Set to true to greatly reduce the chance of being detected.
cache boolean
Use HTTP caching for the crawl to speed up repeated runs. Defaults to true.
Enabling caching can save costs when you need to perform transformations on different files or handle various events on a website.
delay number
Add a crawl delay of up to 60 seconds, disabling concurrency. The delay must be given in milliseconds. Defaults to 0, which indicates it is disabled.
Using a delay can help with websites that are set on a cron and do not require immediate data retrieval.
storageless boolean
Prevent storing any type of data for the request. Defaults to true unless you have the website already stored.
scroll number
Infinite scroll the page as new content loads, up to a duration in milliseconds. The duration represents the maximum time you would wait to scroll. You may still need to use the wait_for parameters. You also need to ensure that the request is made using chrome.
Use the wait_for configuration to scroll until a condition is met, and disable_intercept to make sure you get data from the network regardless of hostname.
device object
Configure the device for chrome. One of mobile, tablet, or desktop. Defaults to desktop.
viewport object
Configure the viewport for chrome. Defaults to 800x600.
If you need to get data from a website as a mobile device, set the viewport to a phone's size, e.g. 375x414.
locale string
The locale to use for the request, for example en-US.
country_code string
Set an ISO country code for proxy connections.
The country code allows you to run requests in regions where access to the website is restricted to within that specific region.
import requests, os
headers = {
'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
'Content-Type': 'application/jsonl',
}
json_data = {"text":"Check this link: https://example.com and email to example@email.com","limit":5,"return_format":"markdown"}
response = requests.post('https://api.spider.cloud/pipeline/crawl-text',
headers=headers, json=json_data)
print(response.json())
[ { "content": "<resource>...", "error": null, "status": 200, "costs": { "ai_cost": 0, "compute_cost": 0.0001, "file_cost": 0.0002, "bytes_transferred_cost": 0.0002, "total_cost": 0.0004, "transform_cost": 0.0001 }, "url": "https://spider.cloud" }, // more content... ]
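The text parameter for this endpoint has a stated size ceiling of 10 MB. A small client-side check can catch oversized input before it is sent. This is only a sketch: it assumes the limit is measured on the UTF-8 encoded bytes and that 10 MB means 10 * 1024 * 1024 bytes, neither of which this reference confirms. The names MAX_TEXT_BYTES and text_within_limit are hypothetical.

```python
# Assumption: the documented 10 MB limit is counted in UTF-8 bytes
# and uses binary megabytes. Adjust if the API measures differently.
MAX_TEXT_BYTES = 10 * 1024 * 1024


def text_within_limit(text, max_bytes=MAX_TEXT_BYTES):
    """Return True if the UTF-8 encoded text fits under the limit."""
    return len(text.encode("utf-8")) <= max_bytes


text = "Check this link: https://example.com and email to example@email.com"
if text_within_limit(text):
    json_data = {"text": text, "limit": 5, "return_format": "markdown"}
    print(json_data)
```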
Filter links
Filter links using AI and advanced metadata.
POST https://api.spider.cloud/pipeline/filter-links
Request body
url required array
The urls to filter.
You can pass up to 4k tokens for the links and prompt.
model string
The type of AI model to use, such as gpt-4o, gpt-4o-mini, gpt-4-1106-preview, or gpt-3.5-turbo. Defaults to gpt-4o-mini.
custom_func string
A custom function to run before processing. Contact us to get a custom function tailored to you.
custom_prompt string
A custom prompt to pass in to the model.
import requests, os
headers = {
'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
'Content-Type': 'application/jsonl',
}
json_data = {"url":["https://spider.cloud","https://foodnetwork.com"],"model":"gpt-4o-mini"}
response = requests.post('https://api.spider.cloud/pipeline/filter-links',
headers=headers, json=json_data)
print(response.json())
{ "content": [ { "relevant_urls": [ "https://spider.cloud", "https://foodnetwork.com" ] } ], "cost": { "ai_cost": 0.0005, "compute_cost": 0, "file_cost": 0, "bytes_transferred_cost": 0, "total_cost": 0, "transform_cost": 0 }, "error": "", "status": 200 }
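In the sample response above, the filtered links sit under content[].relevant_urls. A minimal sketch, assuming every entry in content follows that shape, for flattening them into one de-duplicated list. The helper name relevant_urls is hypothetical.

```python
def relevant_urls(response_body):
    """Collect relevant_urls from every content entry, in order, without duplicates."""
    seen = []
    for entry in response_body.get("content") or []:
        for url in entry.get("relevant_urls", []):
            if url not in seen:
                seen.append(url)
    return seen


# Sample data copied from the response shown above.
sample = {
    "content": [
        {"relevant_urls": ["https://spider.cloud", "https://foodnetwork.com"]}
    ],
    "error": "",
    "status": 200,
}
print(relevant_urls(sample))
```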
Questions and Answers
Get a question-and-answer list for a website based on any inquiry.
POST https://api.spider.cloud/pipeline/extract-qa
Request body
url required string
The URI resource to crawl. This can be a comma split list for multiple urls.
To reduce latency, enhance performance, and save on rate limits batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
limit number
The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.
It is better to set a limit upfront on websites where you do not know the size. Re-crawling can effectively use cache to keep costs low as new pages are found.
store_data boolean
Decide whether to store data. If enabled, this option overrides storageless. The default setting is false.
Set to true to collect resources to download and re-use later on. Please note: storing data incurs a fee of $0.30 per gigabyte per month, or $0.01 daily.
metadata boolean
Collect metadata about the content found, such as page title, description, and keywords. This could help improve AI interoperability. Defaults to false.
Using metadata can help extract critical information to use for AI.
model string
The type of AI model to use, such as gpt-4o, gpt-4o-mini, gpt-4-1106-preview, or gpt-3.5-turbo. Defaults to gpt-4o-mini.
custom_func string
A custom function to run before processing. Contact us to get a custom function tailored to you.
css_extraction_map object
Use CSS or XPath selectors to scrape contents from the web page. Set the paths and the extraction object map to perform extractions per path or page.
Scrape content using CSS selectors to get data. You can scrape using selectors at no extra cost.
encoding string
The type of encoding to use, such as UTF-8 or SHIFT_JIS.
Perform the encoding on the server when you know the type of website in advance.
return_headers boolean
Return the HTTP response headers with the results. Defaults to false unless you have the website already stored with the configuration enabled.
Getting the HTTP headers can help set up authentication flows.
anti_bot boolean
Enable anti-bot mode using various techniques to increase the chance of success. Defaults to true.
This config will attempt to make the request resemble a real human. If the request fails on chrome, it will retry using a virtual display that is harder to block, at the cost of speed.
cookies string
Add HTTP cookies to use for the request.
Set the cookie value for pages that use SSR authentication.
stealth boolean
Use stealth mode for headless chrome requests to help prevent being blocked. Defaults to true.
Set to true to greatly reduce the chance of being detected.
cache boolean
Use HTTP caching for the crawl to speed up repeated runs. Defaults to true.
Enabling caching can save costs when you need to perform transformations on different files or handle various events on a website.
delay number
Add a crawl delay of up to 60 seconds, disabling concurrency. The delay must be given in milliseconds. Defaults to 0, which indicates it is disabled.
Using a delay can help with websites that are set on a cron and do not require immediate data retrieval.
storageless boolean
Prevent storing any type of data for the request. Defaults to true unless you have the website already stored.
scroll number
Infinite scroll the page as new content loads, up to a duration in milliseconds. The duration represents the maximum time you would wait to scroll. You may still need to use the wait_for parameters. You also need to ensure that the request is made using chrome.
Use the wait_for configuration to scroll until a condition is met, and disable_intercept to make sure you get data from the network regardless of hostname.
device object
Configure the device for chrome. One of mobile, tablet, or desktop. Defaults to desktop.
viewport object
Configure the viewport for chrome. Defaults to 800x600.
If you need to get data from a website as a mobile device, set the viewport to a phone's size, e.g. 375x414.
locale string
The locale to use for the request, for example en-US.
country_code string
Set an ISO country code for proxy connections.
The country code allows you to run requests in regions where access to the website is restricted to within that specific region.
import requests, os
headers = {
'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
'Content-Type': 'application/jsonl',
}
json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}
response = requests.post('https://api.spider.cloud/pipeline/extract-qa',
headers=headers, json=json_data)
print(response.json())
{ "content": [ { "answer": "Spider is a data collecting solution designed for web crawling and scraping.", "question": "What is the primary function of Spider?" }, { "answer": "You can kickstart your data collecting projects by signing up for a free trial or taking advantage of the promotional credits offered.", "question": "How can I get started with Spider?" }, { "answer": "Spider offers unmatched speed, scalability, and comprehensive data curation, making it trusted by leading tech businesses.", "question": "What are the benefits of using Spider for data collection?" }, { "answer": "Spider can easily crawl, search, and extract data from various sources, including social media platforms.", "question": "What kind of data can Spider extract?" }, { "answer": "Spider is built fully in Rust for next-generation scalability.", "question": "What programming language is Spider built with?" } ], "cost": { "ai_cost": 0.0009, "compute_cost": 0.0001, "file_cost": 0, "bytes_transferred_cost": 0, "total_cost": 0, "transform_cost": 0 }, "error": null, "status": 200, "url": "https://spider.cloud" }
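The sample response above returns the pairs as objects with question and answer keys under content. As a small sketch of consuming that shape (the helper name format_qa is hypothetical), the pairs can be rendered as plain text like this:

```python
def format_qa(body):
    """Render each question/answer pair in content as a two-line string."""
    pairs = []
    for item in body.get("content") or []:
        question = item.get("question", "").strip()
        answer = item.get("answer", "").strip()
        if question and answer:
            pairs.append(f"Q: {question}\nA: {answer}")
    return pairs


# One pair copied from the sample response above.
sample = {
    "content": [
        {
            "answer": "Spider is built fully in Rust for next-generation scalability.",
            "question": "What programming language is Spider built with?",
        }
    ],
    "error": None,
    "status": 200,
    "url": "https://spider.cloud",
}
print("\n\n".join(format_qa(sample)))
```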
Queries
Query the data that you collect during crawling and scraping. Add dynamic filters for extracting exactly what is needed.
Websites Collection
Get the websites stored.
GET https://api.spider.cloud/data/websites
Request params
limit string
The limit of records to get.
url string
Filter a single url record.
page number
The current page to get.
domain string
Filter a single domain record.
import requests, os
headers = {
'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
'Content-Type': 'application/jsonl',
}
response = requests.get('https://api.spider.cloud/data/websites?limit=5&return_format=markdown',
headers=headers)
print(response.json())
{ "data": [ { "id": "2a503c02-f161-444b-b1fa-03a3914667b6", "user_id": "6bd06efa-bb0b-4f1f-a23f-05db0c4b1bfd", "url": "6bd06efa-bb0b-4f1f-a29f-05db0c4b1bfd/example.com/index.html", "domain": "spider.cloud", "created_at": "2024-04-18T15:40:25.667063+00:00", "updated_at": "2024-04-18T15:40:25.667063+00:00", "pathname": "/", "fts": "", "scheme": "https:", "last_checked_at": "2024-05-10T13:39:32.293017+00:00", "screenshot": null } ], "count": 100 }
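The request params listed above (limit, url, page, domain) travel in the query string. The sketch below builds such a URL with the standard library so that values are escaped while the = and & separators stay literal. The helper name collection_url is hypothetical, and the set of collections it accepts is simply whatever path segment is passed in.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.spider.cloud"


def collection_url(collection, **filters):
    """Build a query URL for one of the /data collections.

    Filters set to None are left out. urlencode escapes the values
    but keeps the '=' and '&' separators intact.
    """
    params = {key: value for key, value in filters.items() if value is not None}
    query = urlencode(params)
    url = f"{BASE_URL}/data/{collection}"
    return f"{url}?{query}" if query else url


print(collection_url("websites", limit=5, domain="spider.cloud"))
```

With the requests library, passing the same dictionary as params= to requests.get achieves the same encoding.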
Pages Collection
Get the pages/resources stored.
GET https://api.spider.cloud/data/pages
Request params
limit string
The limit of records to get.
url string
Filter a single url record.
page number
The current page to get.
domain string
Filter a single domain record.
import requests, os
headers = {
'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
'Content-Type': 'application/jsonl',
}
response = requests.get('https://api.spider.cloud/data/pages?limit=5&return_format=markdown',
headers=headers)
print(response.json())
{ "data": [ { "id": "733b0d0f-e406-4229-949d-8068ade54752", "user_id": "6bd06efa-bb0b-4f1f-a29f-05db0c4b1bfd", "url": "https://spider.cloud", "domain": "spider.cloud", "created_at": "2024-04-17T01:28:15.016975+00:00", "updated_at": "2024-04-17T01:28:15.016975+00:00", "proxy": true, "headless": true, "crawl_budget": null, "scheme": "https:", "last_checked_at": "2024-04-17T01:28:15.016975+00:00", "full_resources": false, "metadata": true, "gpt_config": null, "smart_mode": false, "fts": "'spider.cloud':1" } ], "count": 100 }
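The limit and page params suggest page-by-page retrieval for collections whose data field is a list. The sketch below walks pages until one comes back short. It rests on assumptions this reference does not confirm: that page numbering starts at 0 and that a page with fewer than limit records is the last one. The names iter_records and fetch_page are hypothetical; fetch_page stands in for whatever function performs the actual GET and returns the decoded JSON body.

```python
def iter_records(fetch_page, limit=100, start_page=0, max_pages=1000):
    """Yield records page by page until a page comes back empty or short.

    fetch_page(limit=..., page=...) must return a dict with a list
    under "data". start_page=0 is an assumption; pass 1 if the API
    turns out to number pages from one.
    """
    for page in range(start_page, start_page + max_pages):
        body = fetch_page(limit=limit, page=page)
        records = body.get("data") or []
        yield from records
        if len(records) < limit:
            break


# Stand-in for a real request, so the sketch runs without network access.
FAKE_PAGES = {0: [{"id": 1}, {"id": 2}], 1: [{"id": 3}]}


def fake_fetch(limit, page):
    return {"data": FAKE_PAGES.get(page, []), "count": 3}


print(list(iter_records(fake_fetch, limit=2)))
```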
Pages Metadata Collection
Get the pages metadata/resources stored.
GET https://api.spider.cloud/data/pages_metadata
Request params
limit string
The limit of records to get.
url string
Filter a single url record.
page number
The current page to get.
domain string
Filter a single domain record.
import requests, os
headers = {
'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
'Content-Type': 'application/jsonl',
}
response = requests.get('https://api.spider.cloud/data/pages_metadata?limit=5&return_format=markdown',
headers=headers)
print(response.json())
{ "data": [ { "id": "e27a1995-2abe-4319-acd1-3dd8258f0f49", "user_id": "253524cd-3f94-4ed1-83b3-f7fab134c3ff", "url": "253524cd-3f94-4ed1-83b3-f7fab134c3ff/www.google.com/search?query=spider.cloud.html", "domain": "www.google.com", "resource_type": "html", "title": "spider.cloud - Google Search", "description": "", "file_size": 1253960, "embedding": null, "pathname": "/search", "created_at": "2024-05-18T17:40:16.4808+00:00", "updated_at": "2024-05-18T17:40:16.4808+00:00", "keywords": [ "Fastest Web Crawler spider", "Web scraping" ], "labels": "Search Engine", "extracted_data": null, "fts": "'/search':1" } ], "count": 100 }
Leads Collection
Get the pages contacts stored.
GET https://api.spider.cloud/data/contacts
Request params
limit string
The limit of records to get.
url string
Filter a single url record.
page number
The current page to get.
domain string
Filter a single domain record.
import requests, os
headers = {
'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
'Content-Type': 'application/jsonl',
}
response = requests.get('https://api.spider.cloud/data/contacts?limit=5&return_format=markdown',
headers=headers)
print(response.json())
{ "data": [ { "full_name": "John Doe", "email": "johndoe@gmail.com", "phone": "555-555-555", "title": "Baker" } ], "count": 100 }
Crawl State
Get the state of the crawl for the domain.
GET https://api.spider.cloud/data/crawl_state
Request params
limit string
The limit of records to get.
url string
Filter a single url record.
page number
The current page to get.
domain string
Filter a single domain record.
import requests, os
headers = {
'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
'Content-Type': 'application/jsonl',
}
response = requests.get('https://api.spider.cloud/data/crawl_state?limit=5&return_format=markdown',
headers=headers)
print(response.json())
{ "data": { "id": "195bf2f2-2821-421d-b89c-f27e57ca71fh", "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg", "domain": "spider.cloud", "url": "https://spider.cloud/", "links": 1, "credits_used": 3, "mode": 2, "crawl_duration": 340, "message": null, "request_user_agent": "Spider", "level": "info", "status_code": 0, "created_at": "2024-04-21T01:21:32.886863+00:00", "updated_at": "2024-04-21T01:21:32.886863+00:00" }, "error": null }
Crawl Logs
Get the last 24 hours of logs.
GET https://api.spider.cloud/data/crawl_logs
Request params
limit string
The limit of records to get.
url string
Filter a single url record.
page number
The current page to get.
domain string
Filter a single domain record.
import requests, os
headers = {
'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
'Content-Type': 'application/jsonl',
}
response = requests.get('https://api.spider.cloud/data/crawl_logs?limit=5&return_format=markdown',
headers=headers)
print(response.json())
{ "data": { "id": "195bf2f2-2821-421d-b89c-f27e57ca71fh", "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg", "domain": "spider.cloud", "url": "https://spider.cloud/", "links": 1, "credits_used": 3, "mode": 2, "crawl_duration": 340, "message": null, "request_user_agent": "Spider", "level": "info", "status_code": 0, "created_at": "2024-04-21T01:21:32.886863+00:00", "updated_at": "2024-04-21T01:21:32.886863+00:00" }, "error": null }
Credits
Get the remaining credits available.
GET https://api.spider.cloud/data/credits
import requests, os
headers = {
'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
'Content-Type': 'application/jsonl',
}
response = requests.get('https://api.spider.cloud/data/credits',
headers=headers)
print(response.json())
{ "data": { "id": "8d662167-5a5f-41aa-9cb8-0cbb7d536891", "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg", "credits": 53334, "created_at": "2024-04-21T01:21:32.886863+00:00", "updated_at": "2024-04-21T01:21:32.886863+00:00" } }
Crons
Get the cron jobs that are set to keep data fresh.
GET https://api.spider.cloud/data/crons
Request params
limit string
The limit of records to get.
url string
Filter a single url record.
page number
The current page to get.
domain string
Filter a single domain record.
import requests, os
headers = {
'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
'Content-Type': 'application/jsonl',
}
response = requests.get('https://api.spider.cloud/data/crons?limit=5&return_format=markdown',
headers=headers)
print(response.json())
User Profile
Get the profile of the user. This returns data such as approved limits and usage for the month.
GET https://api.spider.cloud/data/profile
import requests, os
headers = {
'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
'Content-Type': 'application/jsonl',
}
response = requests.get('https://api.spider.cloud/data/profile',
headers=headers)
print(response.json())
{ "data": { "id": "6bd06efa-bb0b-4f1f-a29f-05db0c4b1bfd", "email": "user@gmail.com", "stripe_id": "cus_OYO2rAhSQaYqHT", "is_deleted": null, "proxy": null, "headless": false, "billing_limit": 50, "billing_limit_soft": 120, "approved_usage": 0, "crawl_budget": { "*": 200 }, "usage": null, "has_subscription": false, "depth": null, "full_resources": false, "meta_data": true, "billing_allowed": false, "initial_promo": false } }
User-Agents
Get a real user agent to use for crawling.
GET https://api.spider.cloud/data/user_agents
Request params
limit string
The limit of records to get.
os string
Filter by a device OS, e.g. Android, Mac OS, Windows, Linux, and more.
page number
The current page to get.
platform string
Filter by a platform, e.g. Chrome, Edge, Safari, Firefox, and more.
import requests, os
headers = {
'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
'Content-Type': 'application/jsonl',
}
response = requests.get('https://api.spider.cloud/data/user_agents?limit=5&return_format=markdown',
headers=headers)
print(response.json())
{ "data": { "agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36", "platform": "Chrome", "platform_version": "123.0.0.0", "device": "Macintosh", "os": "Mac OS", "os_version": "10.15.7", "cpu_architecture": "", "mobile": false, "device_type": "desktop" } }
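The agent string in the sample response above is meant to be used when crawling. A minimal sketch, assuming the response has the data.agent shape shown in that sample, that turns it into a headers dictionary for an HTTP client. The helper name headers_from_agent is hypothetical.

```python
def headers_from_agent(body, extra=None):
    """Build request headers with the User-Agent taken from data.agent."""
    agent = (body.get("data") or {}).get("agent")
    if not agent:
        raise ValueError("response has no data.agent field")
    headers = {"User-Agent": agent}
    headers.update(extra or {})
    return headers


# Trimmed copy of the sample response shown above.
sample = {
    "data": {
        "agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
        "platform": "Chrome",
        "os": "Mac OS",
        "mobile": False,
        "device_type": "desktop",
    }
}
print(headers_from_agent(sample, extra={"Accept-Language": "en-US"}))
```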
Download file
Download a resource from storage.
GET https://api.spider.cloud/data/download
Request body
url string
The exact path of the url that you want to get.
domain string
The website domain you want to query.
pathname string
The website pathname you want to query.
import requests, os
headers = {
'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
'Content-Type': 'application/jsonl',
}
response = requests.get('https://api.spider.cloud/data/download?',
headers=headers)
print(response.json())
{ "data": "<file>" }
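The download example above sends no parameters. The sketch below shows one way to attach url, domain, and pathname, under the assumption that they are passed as query-string parameters. This reference labels them as a request body, but the endpoint is a GET and the example URL ends in a question mark, so treat the placement as a guess. The helper name download_url is hypothetical.

```python
from urllib.parse import urlencode

DOWNLOAD_ENDPOINT = "https://api.spider.cloud/data/download"


def download_url(url=None, domain=None, pathname=None):
    """Build a download URL from whichever identifiers are provided."""
    candidates = (("url", url), ("domain", domain), ("pathname", pathname))
    params = {key: value for key, value in candidates if value}
    if not params:
        raise ValueError("provide url, domain, or pathname")
    return f"{DOWNLOAD_ENDPOINT}?{urlencode(params)}"


print(download_url(domain="spider.cloud", pathname="/"))
```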
Manage
Configure data to enhance crawl efficiency: create, update, and delete records.
Website
Create or update a website by configuration.
POST https://api.spider.cloud/data/websites
Request body
url required string
The URI resource to crawl. This can be a comma split list for multiple urls.
To reduce latency, enhance performance, and save on rate limits batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
limit number
The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.
It is better to set a limit upfront on websites where you do not know the size. Re-crawling can effectively use cache to keep costs low as new pages are found.
store_data boolean
Decide whether to store data. If enabled, this option overrides storageless. The default setting is false.
Set to true to collect resources to download and re-use later on. Please note: storing data incurs a fee of $0.30 per gigabyte per month, or $0.01 daily.
request string
The request type to perform. Possible values are http, chrome, and smart. Use smart to perform an HTTP request by default until JavaScript rendering is needed for the HTML. Defaults to smart.
The request type greatly influences how the output is going to look. If the page is server-side rendered, you can stick to the defaults for the most part.
metadata boolean
Collect metadata about the content found, such as page title, description, and keywords. This could help improve AI interoperability. Defaults to false.
Using metadata can help extract critical information to use for AI.
tld boolean
Allow TLDs to be included. Defaults to false.
return_format string | array
The format to return the data in. Possible values are markdown, commonmark, raw, text, xml, bytes, and empty. Use raw to return the default format of the page, such as HTML. Defaults to raw.
Usually you want to use markdown or text for LLM processing. If you need to store the files without losing any encoding, use the bytes or raw format.
readability boolean
Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage. Defaults to false.
This uses the Safari Reader Mode algorithm to extract only important information from the content.
css_extraction_map object
Use CSS or XPath selectors to scrape contents from the web page. Set the paths and the extraction object map to perform extractions per path or page.
Scrape content using CSS selectors to get data. You can scrape using selectors at no extra cost.
anti_bot boolean
Enable anti-bot mode using various techniques to increase the chance of success. Defaults to true.
This config will attempt to make the request resemble a real human. If the request fails on chrome, it will retry using a virtual display that is harder to block, at the cost of speed.
cookies string
Add HTTP cookies to use for the request.
Set the cookie value for pages that use SSR authentication.
stealth boolean
Use stealth mode for headless chrome requests to help prevent being blocked. Defaults to true.
Set to true to greatly reduce the chance of being detected.
cache boolean
Use HTTP caching for the crawl to speed up repeated runs. Defaults to true.
Enabling caching can save costs when you need to perform transformations on different files or handle various events on a website.
delay number
Add a crawl delay of up to 60 seconds, disabling concurrency. The delay must be given in milliseconds. Defaults to 0, which indicates it is disabled.
Using a delay can help with websites that are set on a cron and do not require immediate data retrieval.
storageless boolean
Prevent storing any type of data for the request. Defaults to true unless you have the website already stored.
scroll number
Infinite scroll the page as new content loads, up to a duration in milliseconds. The duration represents the maximum time you would wait to scroll. You may still need to use the wait_for parameters. You also need to ensure that the request is made using chrome.
Use the wait_for configuration to scroll until a condition is met, and disable_intercept to make sure you get data from the network regardless of hostname.
device object
Configure the device for chrome. One of mobile, tablet, or desktop. Defaults to desktop.
viewport object
Configure the viewport for chrome. Defaults to 800x600.
If you need to get data from a website as a mobile device, set the viewport to a phone's size, e.g. 375x414.
locale string
The locale to use for the request, for example en-US.
country_code string
Set an ISO country code for proxy connections.
The country code allows you to run requests in regions where access to the website is restricted to within that specific region.
import requests, os
headers = {
'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
'Content-Type': 'application/jsonl',
}
json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}
response = requests.post('https://api.spider.cloud/data/websites',
headers=headers, json=json_data)
print(response.json())
{ "data": null }
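The store_data note above quotes a storage fee of $0.30 per gigabyte per month, or $0.01 daily. The arithmetic can be sketched as follows, assuming the daily figure is also charged per gigabyte, which the note does not state outright. The function name estimate_storage_cost is hypothetical and the result is an estimate, not a billing figure.

```python
# Rate taken from the store_data note above.
# Assumption: the daily rate applies per gigabyte stored.
DAILY_RATE_PER_GB = 0.01


def estimate_storage_cost(gigabytes, days):
    """Estimate the storage fee in dollars for a volume held for some days."""
    if gigabytes < 0 or days < 0:
        raise ValueError("gigabytes and days must be non-negative")
    return gigabytes * DAILY_RATE_PER_GB * days


# 10 GB held for 30 days comes to roughly the monthly rate times ten.
print(round(estimate_storage_cost(10, 30), 2))
```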
Website
Delete a website from your collection. Omit url from the request body to delete all websites.
DELETE https://api.spider.cloud/data/websites
Request body
url required string
The URI resource to delete. This can be a comma-separated list for multiple URLs.
To reduce latency, enhance performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
import os
import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}
# The request body for this endpoint only documents url.
json_data = {"url": "https://spider.cloud"}
# The endpoint is documented as DELETE, so use requests.delete rather than requests.post.
response = requests.delete('https://api.spider.cloud/data/websites',
                           headers=headers, json=json_data)
print(response.json())
{ "data": null }
Pages
Delete a web page from your collection. Omit url from the request body to delete all pages.
DELETE https://api.spider.cloud/data/pages
Request body
url required string
The URI resource to delete. This can be a comma-separated list for multiple URLs.
To reduce latency, enhance performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
import os
import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}
# The request body for this endpoint only documents url.
json_data = {"url": "https://spider.cloud"}
# The endpoint is documented as DELETE, so use requests.delete rather than requests.post.
response = requests.delete('https://api.spider.cloud/data/pages',
                           headers=headers, json=json_data)
print(response.json())
{ "data": null }
Pages Metadata
Delete a web page's metadata from your collection. Omit url from the request body to delete all page metadata.
DELETE https://api.spider.cloud/data/pages_metadata
Request body
url required string
The URI resource to delete. This can be a comma-separated list for multiple URLs.
To reduce latency, enhance performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
import os
import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}
# The request body for this endpoint only documents url.
json_data = {"url": "https://spider.cloud"}
# The endpoint is documented as DELETE, so use requests.delete rather than requests.post.
response = requests.delete('https://api.spider.cloud/data/pages_metadata',
                           headers=headers, json=json_data)
print(response.json())
{ "data": null }
Leads
Delete a contact or lead from your collection. Omit url from the request body to delete all contacts.
DELETE https://api.spider.cloud/data/contacts
Request body
url required string
The URI resource to delete. This can be a comma-separated list for multiple URLs.
To reduce latency, enhance performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.
import os
import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}
# The request body for this endpoint only documents url.
json_data = {"url": "https://spider.cloud"}
# The endpoint is documented as DELETE, so use requests.delete rather than requests.post.
response = requests.delete('https://api.spider.cloud/data/contacts',
                           headers=headers, json=json_data)
print(response.json())
{ "data": null }
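The four delete endpoints above share the same shape: a DELETE request to /data/<resource> with an optional url in the body. The sketch below wraps that pattern using only the Python standard library so it has no third-party dependency. build_delete_request and send are hypothetical helper names, sending an empty JSON object when url is omitted is an assumption about how "delete all" is expressed, and application/json is chosen from the content types the API is documented to accept.

```python
import json
import os
import urllib.request
from typing import Any, Optional

BASE_URL = "https://api.spider.cloud"
# Resource names taken from the DELETE paths documented above.
DELETE_RESOURCES = {"websites", "pages", "pages_metadata", "contacts"}


def build_delete_request(
    resource: str,
    url: Optional[str] = None,
    api_key: Optional[str] = None,
) -> urllib.request.Request:
    """Prepare (but do not send) a DELETE request for one /data resource."""
    if resource not in DELETE_RESOURCES:
        raise ValueError(f"resource must be one of {sorted(DELETE_RESOURCES)}")
    token = api_key or os.getenv("SPIDER_API_KEY", "")
    # Assumption: omitting url (an empty body object) deletes every item.
    body = {"url": url} if url else {}
    return urllib.request.Request(
        f"{BASE_URL}/data/{resource}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="DELETE",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> Any:
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, send(build_delete_request("pages", url="https://spider.cloud")) would delete the stored pages for that URL. Because omitting url removes the whole collection, building the request separately from sending it leaves room to log or confirm the target first.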