Scraping and Crawling

Scrape or crawl the web reliably to collect data from any site.

Scraping a Single Page

Once you have an API key, use any of our SDK libraries or call the API directly. The following settings are a good starting point:

Single Page Scrape Using API in Python

```python
import requests
import os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

params = {
    "url": "https://www.example.com",
    "request": "smart",           # Automatically decides which mode to use
    "return_format": "markdown",  # LLM-friendly format
    "proxy_enabled": True
}

response = requests.post(
    'https://api.spider.cloud/scrape',  # set to the scrape endpoint
    headers=headers,
    json=params
)

print(response.json())
```

Crawling

To crawl from a starting URL, use the crawl endpoint, or the crawl_url method in the SDK library. Always set a maximum number of pages to crawl: when testing against a website that may contain thousands of links, a reasonable limit (e.g. 30 pages) keeps the run fast and inexpensive before you scale up.

Crawling Pages Using API in Python

```python
import requests
import os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

params = {
    "url": "https://www.example.com",
    "limit": 30,                 # Maximum number of pages to crawl
    "depth": 3,                  # Reasonable depth for small sites
    "request": "smart",
    "return_format": "markdown"
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params
)

print(response.json())
```

Example Response from Crawling and Scraping

```python
import requests
import os

# ... truncated code

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params
)

# Iterate over the parsed JSON body, not the Response object itself
for result in response.json():
    print(result['content'])  # Main HTML content or text
    print(result['url'])      # URL of the page
    print(result['status'])   # HTTP status code
    print(result['error'])    # Error message, if any
    print(result['costs'])    # Cost breakdown in USD for the page
```

Request Types

The following request types are supported:

  • smart (default): Automatically chooses between http and chrome based on heuristics, switching to chrome when JavaScript is needed to render the page.
  • http: Performs a basic HTTP request. The fastest and most cost-efficient option, ideal for pages with static content or simple HTML responses.
  • chrome: Uses a headless Chrome browser to fetch the page. Required for pages that need JavaScript rendering or on-page interactions; slower than the http and smart modes.
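For illustration, switching modes is just a matter of setting the `request` field in the payload. The following is a minimal sketch; the `build_params` helper is hypothetical (not part of the SDK), while the parameter names match the examples above:

```python
def build_params(url, mode="smart"):
    """Build a request payload with an explicit request type.

    mode is one of "smart", "http", or "chrome", per the list above.
    """
    if mode not in ("smart", "http", "chrome"):
        raise ValueError(f"unsupported request type: {mode}")
    return {
        "url": url,
        "request": mode,
        "return_format": "markdown",
    }

# Force headless Chrome for a JavaScript-heavy page
params = build_params("https://www.example.com", mode="chrome")
print(params["request"])  # chrome
```

Validating the mode up front turns a typo into an immediate error rather than an unexpected response from the API.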

Streaming Responses

Use streaming to take full advantage of crawl speed and process each result as soon as it finishes, instead of waiting for the entire crawl to complete.
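As a sketch of the idea: pass `stream=True` to requests so the response body is consumed incrementally, and handle each record as it arrives. This assumes the endpoint streams newline-delimited JSON, one object per crawled page; check the API reference for the exact wire format.

```python
import json

def iter_stream_results(lines):
    """Parse newline-delimited JSON records from a streamed response body.

    Assumes each non-empty line is one JSON object describing a page;
    blank keep-alive lines are skipped.
    """
    for line in lines:
        if line:
            yield json.loads(line)

def crawl_streaming(headers, params):
    """Stream crawl results and handle each page as soon as it arrives."""
    import requests  # third-party: pip install requests

    response = requests.post(
        'https://api.spider.cloud/crawl',
        headers=headers,
        json=params,
        stream=True,  # do not buffer the whole crawl in memory
    )
    for result in iter_stream_results(response.iter_lines()):
        print(result.get('url'), result.get('status'))
```

Separating the parsing (`iter_stream_results`) from the transport makes the record handling easy to test without a network connection.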