Scraping and Crawling

Scraping a Single Page

Once you have an API key, use any of our SDK libraries or the API directly. We recommend testing with a variety of URLs and assessing the responses you get back. The following are good settings to try first:

Single Page Scrape Using API in Python

import requests
import os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

params = {
    "url": "https://www.example.com",
    "request": "smart",  # Automatically decides which mode to use
    "return_format": "markdown",  # LLM-friendly format
    "proxy_enabled": True
}

response = requests.post(
    'https://api.spider.cloud/scrape',  # set to the scrape endpoint
    headers=headers,
    json=params
)

print(response.json())
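To follow the testing recommendation above, a small loop over a handful of URLs is enough to compare responses before scaling up. This is an illustrative sketch that reuses the headers and params from the example; the URL list is hypothetical:

test_urls = [
    "https://www.example.com",
    "https://www.example.org",  # hypothetical second test URL
]

for url in test_urls:
    params["url"] = url  # reuse the same settings across test URLs
    response = requests.post(
        'https://api.spider.cloud/scrape',
        headers=headers,
        json=params,
    )
    print(url, response.status_code)  # inspect each response before scaling up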

Crawling

To crawl from a starting URL, use the crawl endpoint, or the crawl_url method in the SDK library. Always set a maximum number of pages to crawl. When testing against a website that may have thousands of links, it's useful to set a reasonable cap (e.g. a limit of 30 pages) before crawling more.

Crawling Pages Using API in Python

import requests
import os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

params = {
    "url": "https://www.example.com",
    "limit": 30,  # Maximum number of pages to crawl
    "depth": 3,  # Reasonable depth for small sites
    "request": "smart",
    "return_format": "markdown"
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params
)

print(response.json())
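After a test crawl, it is worth checking how many pages actually came back against the limit you set. A minimal sketch, assuming the endpoint returns a JSON array of page objects as in the example response below:

data = response.json()  # assumed to be a list of page objects
print(f"Crawled {len(data)} pages (limit was {params['limit']})")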

Example Response

Example Response from Crawling and Scraping

import requests
import os

# ... truncated code

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params
)

for result in response.json():  # iterate over the parsed JSON, not the raw response object
    print(result['content'])  # Main HTML content or text
    print(result['url'])  # URL of the page
    print(result['status'])  # HTTP status code
    print(result['error'])  # Error message if available
    print(result['costs'])  # Cost breakdown in USD for the page

Request Types

  • smart (default): Automatically determines whether to use http or chrome based on heuristics, switching to chrome when JavaScript is needed to render the page.
  • http: Performs a basic HTTP request. This is the fastest and most cost-efficient option, ideal for pages with static content or simple HTML responses.
  • chrome: Uses a headless Chrome browser to fetch the page. Use it for pages that require JavaScript rendering or on-page interactions; it is slower than the http and smart modes. To pin a mode explicitly, see the sketch after this list.
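Rather than relying on smart, you can force a specific mode by setting the request parameter. A minimal sketch, reusing the headers from the examples above:

# Force a plain HTTP fetch for a static page (fastest, cheapest).
params_static = {
    "url": "https://www.example.com",
    "request": "http",
    "return_format": "markdown",
}

# Force headless Chrome for a JavaScript-heavy page.
params_dynamic = {
    "url": "https://www.example.com",
    "request": "chrome",
    "return_format": "markdown",
}

response = requests.post(
    'https://api.spider.cloud/scrape',
    headers=headers,
    json=params_static,
)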

Streaming Responses

Use streaming to take full advantage of the crawling speed and process results as they are finished, as sketched below. Check out the docs on streaming for the exact wire format.
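As an illustrative sketch only: the example below assumes the crawl endpoint emits one JSON object per finished page on its own line (newline-delimited JSON) when the body is read incrementally; confirm the actual format in the streaming docs.

import requests
import os
import json

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

params = {
    "url": "https://www.example.com",
    "limit": 30,
    "return_format": "markdown",
}

with requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params,
    stream=True,  # let requests yield the body incrementally
) as response:
    for line in response.iter_lines():
        if not line:
            continue  # skip keep-alive blank lines
        # Assumption: each non-empty line is a complete JSON page object.
        result = json.loads(line)
        print(result['url'], result['status'])  # process each page as it finishes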