
How to Scrape the Web at Scale from Your Terminal

A hands-on guide to using the Spider CLI for web crawling, scraping, and data extraction. Real examples, every crawl mode explained, and how to go from one page to millions without leaving your terminal.

11 min read · Jeff Mendez

Most scraping workflows start the same way: you open a terminal and try to pull data from a URL. Maybe you reach for curl, maybe a Python script. It works for one page. Then you need a hundred pages. Then ten thousand. Suddenly you’re dealing with rate limits, anti-bot detection, retries, and output formatting, all before you’ve touched the actual data you wanted.

The Spider CLI is built to handle that entire curve. One binary, written in Rust, that goes from scraping a single page to crawling entire domains with cloud-backed proxy rotation and anti-bot bypass. No Python environment to manage, no Docker containers to spin up.

This guide walks through every crawl mode with real output from real sites. By the end, you’ll know exactly which mode to reach for and why.

Install

Spider ships as a single Rust binary. If you have Cargo installed:

cargo install spider_cli

Verify it’s working:

spider --version

That’s it. No runtime dependencies, no config files, no setup wizard.

Your first crawl

The simplest thing you can do is crawl a site and see what links it finds:

spider -u https://choosealicense.com --limit 5 -v crawl -o

The -v flag shows each fetch as it happens, and -o prints the discovered links to stdout:

fetch https://choosealicense.com
fetch https://choosealicense.com/licenses/gpl-3.0/
fetch https://choosealicense.com/licenses/
fetch https://choosealicense.com/licenses/mit/
fetch https://choosealicense.com/community/
https://choosealicense.com
https://choosealicense.com/community/
https://choosealicense.com/licenses/
https://choosealicense.com/licenses/mit/
https://choosealicense.com/licenses/gpl-3.0/

Five URLs, fetched concurrently, in under a second. The crawler discovered all outbound links from the homepage and followed them up to the limit. This is the foundation everything else builds on.

Scraping: getting the content

Crawling finds links. Scraping gets the content. The scrape command returns structured JSONL with the page content, status code, and URL for every page:

spider -u https://quotes.toscrape.com --limit 1 \
  --return-format markdown \
  scrape --output-html
{
  "content": "Quotes to Scrape\n# Quotes to Scrape\n\"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\"\nby Albert Einstein\n...",
  "headers": null,
  "links": null,
  "status_code": 200,
  "url": "https://quotes.toscrape.com"
}

The --return-format flag controls the output transformation. Your options:

| Format | What you get | Good for |
|---|---|---|
| raw | Original HTML | Parsing with your own tools |
| markdown | Clean markdown | LLM context windows, RAG pipelines |
| commonmark | Strict CommonMark | Documentation ingestion |
| text | Plain text, no markup | Search indexing, text analysis |
| xml | XML structure | Structured data pipelines |

For most AI workflows, markdown is the right choice. It strips navigation, scripts, and styling while preserving the content hierarchy. The output is clean enough to drop directly into a prompt.

Note: When you run with --spider-cloud-mode, the markdown format is processed server-side by Spider Cloud, which produces noticeably cleaner output than the local transformation. The cloud pipeline applies smarter content detection, better nav/footer stripping, and tighter whitespace handling. If you’re feeding content into an LLM or embedding pipeline, the cloud markdown is worth the difference.

Controlling what gets crawled

Page limits

The --limit flag caps the total number of pages:

spider -u https://books.toscrape.com --limit 10 crawl -o

This crawls up to 10 pages and stops. Simple, predictable, good for testing.

Depth limits

Depth controls how far from the starting URL the crawler will go. A depth of 1 means “only pages linked from the homepage.” A depth of 2 means “pages linked from those pages, too.”

spider -u https://books.toscrape.com --limit 10 -d 2 -v crawl -o
fetch https://books.toscrape.com
fetch https://books.toscrape.com/index.html
fetch https://books.toscrape.com/catalogue/page-2.html
fetch https://books.toscrape.com/catalogue/page-3.html
fetch https://books.toscrape.com/catalogue/page-1.html

Depth is how you avoid crawling a 50,000-page site when you only need the top-level content. Set --limit for a hard cap and -d for structural control. Use both together for precise targeting.

Crawl budgets

Budgets give you path-level control. The syntax is path,limit pairs:

spider -u https://books.toscrape.com -B "catalogue,5" crawl -o

This allows up to 5 pages under /catalogue and unlimited pages elsewhere. You can set a global budget with *:

spider -u https://quotes.toscrape.com -B "*,1" \
  --return-format markdown \
  scrape --output-html

That fetches exactly one page, no matter how many links the crawler finds. Budgets are useful when you know the site structure and want specific sections without over-crawling.

Blacklisting paths

If there are paths you never want to touch:

spider -u https://example.com \
  --blacklist-url "/admin,/api,/internal" \
  crawl -o

Comma-separated URL patterns. Any link matching these patterns gets skipped.
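If your deny-list lives in a file, you can assemble the comma-separated value in the shell. A small sketch, assuming a hypothetical blacklist.txt with one pattern per line:

```shell
# One pattern per line -> the comma-separated form --blacklist-url expects
cat > blacklist.txt <<'EOF'
/admin
/api
/internal
EOF

paste -sd, blacklist.txt
# Then pass it along:
#   spider -u https://example.com --blacklist-url "$(paste -sd, blacklist.txt)" crawl -o
```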

Going remote with Spider Cloud

Local crawling hits a ceiling fast. Your residential IP gets blocked after a few hundred requests. JavaScript-heavy sites need a real browser. Anti-bot systems like Cloudflare and DataDome return challenge pages instead of content. You can try to solve each of these yourself, but you’ll end up building and maintaining proxy infrastructure, browser pools, and retry logic that has nothing to do with the data you actually want.

Spider Cloud handles all of that behind the same CLI you already know. One flag changes where your requests are routed. The commands stay identical, but now you have 200M+ rotating proxies across 199 countries, automatic CAPTCHA solving, headless browser rendering, and smarter content extraction. You don’t run any of that infrastructure. You just add --spider-cloud-mode and keep working.

Authenticate once

spider authenticate sk-your-api-key

This stores your key at ~/.spider/credentials. Every subsequent command picks it up automatically. You can also use the SPIDER_CLOUD_API_KEY environment variable, or pass --spider-cloud-key per command.

The key resolution order:

  1. --spider-cloud-key flag (explicit, per-command)
  2. SPIDER_CLOUD_API_KEY environment variable
  3. Stored credentials from spider authenticate
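For scripts and CI pipelines, the environment-variable form keeps the key out of per-command flags and shell history. A minimal sketch (the key value here is a placeholder):

```shell
# Export once; every spider invocation in this shell session picks it up
export SPIDER_CLOUD_API_KEY="sk-your-api-key"

# Any child process now sees the key
sh -c 'test -n "$SPIDER_CLOUD_API_KEY" && echo "key visible"'
```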

Sign up at spider.cloud to get an API key. You get 2,500 free credits on signup, no card required.

Cloud modes

This is where the CLI gets powerful. The --spider-cloud-mode flag controls how your requests are routed through Spider’s infrastructure:

Proxy mode (default)

spider -u https://example.com --spider-cloud-mode proxy crawl -o

Routes your request through Spider’s rotating proxy network. Your IP stays hidden, you get automatic geographic rotation across 199 countries. This is the default because it works for most sites and costs the least.

When to use it: General crawling where you just need a clean IP. News sites, blogs, public documentation, e-commerce product pages.

Smart mode

spider -u https://example.com --spider-cloud-mode smart --limit 50 crawl -o

Smart mode inspects each response and decides the cheapest way to get the data. Static pages go through HTTP proxies. Pages with bot protection automatically escalate to browser rendering. Pages with CAPTCHAs get vision AI solving.

You don’t configure any of this. The system figures it out per-page.

When to use it: Mixed sites where some pages are simple HTML and others have JavaScript rendering or bot checks. This is the mode you want for most production scraping because it optimizes cost automatically.

API mode

spider -u https://example.com --spider-cloud-mode api \
  --return-format markdown \
  scrape --output-html

Sends your request directly to Spider’s /crawl API endpoint. Full feature access: AI extraction, screenshots, custom headers, everything the REST API supports.

When to use it: When you need the full power of the API but prefer working from the terminal. Scripting batch jobs, CI/CD pipelines, or when you want consistent behavior with your API-based workflows.

Unblocker mode

spider -u https://protected-site.com --spider-cloud-mode unblocker \
  scrape --output-html

Dedicated anti-bot bypass. Every request goes through antidetect browsers with fingerprint rotation, automatic CAPTCHA solving, and JavaScript execution. Heavier than proxy mode, but gets through protections that simpler approaches can’t.

When to use it: Sites with aggressive bot detection. Cloudflare Under Attack mode, DataDome, PerimeterX, hCaptcha walls. If proxy mode returns 403s or challenge pages, switch to unblocker.

Fallback mode

spider -u https://example.com --spider-cloud-mode fallback \
  --limit 100 \
  crawl -o

Tries a direct fetch first. If the response comes back as a 403, 429, or 503, it automatically retries through Spider’s proxy infrastructure. This saves money on sites that don’t block you while still handling the ones that do.

When to use it: Crawling a list of mixed domains where some will block you and some won’t. You don’t want to pay for proxy overhead on sites that serve content freely, but you don’t want failures on the ones that don’t.
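The escalation rule is simple enough to sketch in shell. This is a conceptual illustration of the routing decision, not the real implementation (`should_fallback` is a hypothetical helper; the status codes come from the list above):

```shell
# Conceptual sketch: retry through the proxy network only on block/rate-limit codes
should_fallback() {
  case "$1" in
    403|429|503) return 0 ;;  # blocked, rate-limited, or unavailable
    *)           return 1 ;;  # direct fetch worked; keep the cheap path
  esac
}

for code in 200 403 429; do
  if should_fallback "$code"; then
    echo "$code -> retry via Spider proxy"
  else
    echo "$code -> direct fetch is fine"
  fi
done
```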

Browser rendering

Some pages don’t exist until JavaScript runs. SPAs, client-rendered dashboards, lazy-loaded content. For these, add --spider-cloud-browser:

spider -u https://example.com \
  --spider-cloud-browser --headless \
  --return-format markdown \
  scrape --output-html

This spins up a remote headless Chrome instance through Spider Browser Cloud. The page renders fully, JavaScript executes, and then the content gets extracted. Combine it with any cloud mode:

spider -u https://example.com \
  --spider-cloud-mode smart \
  --spider-cloud-browser --headless \
  --wait-for-idle-network 5000 \
  scrape --output-html

The --wait-for-idle-network flag waits until there are no network requests for 500ms (with a 5-second timeout). This catches lazy-loaded content that fires after the initial page load.

Other wait strategies for tricky pages:

# Wait for a specific element to appear
--wait-for-selector ".product-grid"

# Wait for the DOM to stop changing
--wait-for-idle-dom

# Wait a fixed delay (last resort)
--wait-for-delay 3000

Practical patterns

Scrape a site to markdown for RAG

spider -u https://docs.example.com \
  --limit 500 \
  --return-format markdown \
  --respect-robots-txt \
  scrape --output-html > docs.jsonl

Each line of docs.jsonl is a JSON object with the markdown content, URL, and status code. Pipe it into your embedding pipeline.
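From there, splitting the JSONL into one markdown file per page takes a few lines of shell. A sketch using jq, run here against synthetic lines that mimic the schema above (the slug naming scheme is an arbitrary choice for illustration, not something the CLI produces):

```shell
# Synthetic stand-in for docs.jsonl, matching the schema shown above
cat > docs.jsonl <<'EOF'
{"content":"# Intro","url":"https://docs.example.com/intro","status_code":200}
{"content":"","url":"https://docs.example.com/missing","status_code":404}
EOF

# One markdown file per successful page, named by a URL-derived slug
jq -r 'select(.status_code == 200) | [.url, .content] | @tsv' docs.jsonl \
| while IFS="$(printf '\t')" read -r url content; do
    slug=$(printf '%s' "$url" | tr -c 'A-Za-z0-9' '_')
    printf '%b\n' "$content" > "${slug}.md"
  done
```

`@tsv` escapes embedded newlines as `\n`, and `printf '%b'` expands them back, so multi-line markdown survives the round trip.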

Crawl with a polite delay

spider -u https://example.com \
  --limit 1000 \
  -D 500 \
  --respect-robots-txt \
  crawl -o > links.txt

The -D 500 flag adds a 500ms delay between requests. Combined with --respect-robots-txt, this is the responsible way to crawl sites that haven’t explicitly invited scraping.

Scrape through a proxy

If you’re already running your own proxy infrastructure:

spider -u https://example.com \
  -p http://user:pass@proxy.example.com:8080 \
  scrape --output-html

The -p flag routes all requests through your proxy. Works with HTTP and SOCKS proxies.

Connect to an existing Chrome instance

If you’re already running Chrome with remote debugging:

spider -u https://example.com \
  --chrome-connection-url ws://127.0.0.1:9222 \
  --headless \
  scrape --output-html

This connects to your existing Chrome DevTools Protocol endpoint instead of launching a new browser. Useful for debugging or for environments where Chrome is already running as a service.

Stealth crawling

spider -u https://protected-site.com \
  --stealth \
  --headless \
  scrape --output-html

The --stealth flag enables anti-detection measures: randomized fingerprints, proper navigator properties, WebGL spoofing, and other techniques that make automated Chrome look like a real browser.

Include cookies for authenticated scraping

spider -u https://example.com/dashboard \
  --cookie "session=abc123; token=xyz789" \
  --headless \
  scrape --output-html

Pass your session cookies directly. This is how you scrape pages that require authentication, like logged-in dashboards or member-only content.

Crawl across subdomains

spider -u https://example.com \
  --subdomains \
  --limit 1000 \
  crawl -o

The --subdomains flag allows the crawler to follow links to blog.example.com, docs.example.com, and any other subdomain it discovers. Without this flag, the crawler stays strictly on the exact domain you specify.

Group external domains into one crawl

spider -u https://example.com \
  -E "docs.example.com,blog.example.com" \
  --limit 500 \
  crawl -o

The -E flag treats the specified external domains as part of the same crawl. Links between them get followed instead of ignored.

Piping output into your stack

The CLI outputs structured JSONL, which means it plays well with standard Unix tools:

# Count pages by status code
spider -u https://example.com --limit 100 \
  scrape --output-html | jq -r '.status_code' | sort | uniq -c

# Extract just the URLs
spider -u https://example.com --limit 100 \
  scrape --output-html | jq -r '.url'

# Get markdown content for pages that returned 200
spider -u https://example.com --limit 100 \
  --return-format markdown \
  scrape --output-html | jq -r 'select(.status_code == 200) | .content'

Feed this into whatever comes next: an embedding model, a database, a search index, a file system. The CLI handles the crawling; you handle the data.
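Content volume matters once an LLM sits downstream. A rough sizing pass with awk (the sample line mimics the JSONL schema above; the 0.75 words-per-token ratio is a common rule of thumb, not an exact tokenizer):

```shell
printf '%s\n' '{"content":"hello world from spider","url":"https://example.com","status_code":200}' \
| jq -r '.content' \
| awk '{ words += NF } END { printf "words=%d approx_tokens=%d\n", words, words / 0.75 }'
```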

Local vs. cloud: when to upgrade

The local CLI is genuinely useful on its own. For small crawls, testing, or sites that serve content without any protection, you don’t need anything else. But here’s what changes when you add Spider Cloud:

| Capability | Local | Cloud |
|---|---|---|
| IP rotation | Your single IP | 200M+ IPs across 199 countries |
| Anti-bot bypass | None (you get blocked) | Automatic, per-request |
| Browser rendering | Requires local Chrome | Remote headless Chrome, no install |
| CAPTCHA solving | Manual | Vision AI, automatic |
| Markdown quality | Local transformation | Server-side pipeline, cleaner output |
| Concurrent scale | Limited by your machine | Limited by your plan |
| Cost | Free + your bandwidth/compute | Pay-per-page, starting under $0.001 |

The inflection point is usually around the first 403 or the first site that needs JavaScript. Once you’re past “scrape a few static pages,” the cloud mode pays for itself in time you don’t spend debugging proxy setups and browser configurations.

When to use which mode

If you’re not sure which combination of flags to reach for, here’s the decision tree:

“I need links from a site” → crawl -o with --limit and -d

“I need page content” → scrape --output-html with --return-format markdown

“Sites are blocking me” → Add --spider-cloud-mode smart after authenticating

“Pages need JavaScript” → Add --spider-cloud-browser --headless

“I’m scraping millions of pages” → --spider-cloud-mode smart handles cost optimization automatically

“I need to be polite” → -D 500 --respect-robots-txt

“I need clean data for AI” → --spider-cloud-mode smart --return-format markdown for the best output quality

For most real workloads, the fastest path is: install the CLI, authenticate with Spider Cloud, and use smart mode. You can always drop back to local for quick tests.

What’s next

The Spider CLI ships updates alongside the core library. Upcoming features include WARC archiving for compliance workflows and direct integration with Spider’s AI extraction models for structured data from unstructured pages.

The CLI source code is MIT-licensed. If something doesn’t work the way you expect, open an issue or submit a PR.

Get started:

cargo install spider_cli
spider authenticate sk-your-api-key
spider -u https://your-target.com --spider-cloud-mode smart --return-format markdown scrape --output-html
