Most scraping workflows start the same way: you open a terminal and try to pull data from a URL. Maybe you reach for curl, maybe a Python script. It works for one page. Then you need a hundred pages. Then ten thousand. Suddenly you’re dealing with rate limits, anti-bot detection, retries, and output formatting, all before you’ve touched the actual data you wanted.
The Spider CLI is built to handle that entire curve. One binary, written in Rust, that goes from scraping a single page to crawling entire domains with cloud-backed proxy rotation and anti-bot bypass. No Python environment to manage, no Docker containers to spin up.
This guide walks through every crawl mode with real output from real sites. By the end, you’ll know exactly which mode to reach for and why.
Install
Spider ships as a single Rust binary. If you have Cargo installed:
cargo install spider_cli
Verify it’s working:
spider --version
That’s it. No runtime dependencies, no config files, no setup wizard.
Your first crawl
The simplest thing you can do is crawl a site and see what links it finds:
spider -u https://choosealicense.com --limit 5 -v crawl -o
The -v flag shows each fetch as it happens, and -o prints the discovered links to stdout:
fetch https://choosealicense.com
fetch https://choosealicense.com/licenses/gpl-3.0/
fetch https://choosealicense.com/licenses/
fetch https://choosealicense.com/licenses/mit/
fetch https://choosealicense.com/community/
https://choosealicense.com
https://choosealicense.com/community/
https://choosealicense.com/licenses/
https://choosealicense.com/licenses/mit/
https://choosealicense.com/licenses/gpl-3.0/
Five URLs, fetched concurrently, in under a second. The crawler discovered all outbound links from the homepage and followed them up to the limit. This is the foundation everything else builds on.
Scraping: getting the content
Crawling finds links. Scraping gets the content. The scrape command returns structured JSONL with the page content, status code, and URL for every page:
spider -u https://quotes.toscrape.com --limit 1 \
--return-format markdown \
scrape --output-html
{
"content": "Quotes to Scrape\n# Quotes to Scrape\n\"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\"\nby Albert Einstein\n...",
"headers": null,
"links": null,
"status_code": 200,
"url": "https://quotes.toscrape.com"
}
The --return-format flag controls the output transformation. Your options:
| Format | What you get | Good for |
|---|---|---|
| raw | Original HTML | Parsing with your own tools |
| markdown | Clean markdown | LLM context windows, RAG pipelines |
| commonmark | Strict CommonMark | Documentation ingestion |
| text | Plain text, no markup | Search indexing, text analysis |
| xml | XML structure | Structured data pipelines |
For most AI workflows, markdown is the right choice. It strips navigation, scripts, and styling while preserving the content hierarchy. The output is clean enough to drop directly into a prompt.
Note: When you run with --spider-cloud-mode, the markdown format is processed server-side by Spider Cloud, which produces noticeably cleaner output than the local transformation. The cloud pipeline applies smarter content detection, better nav/footer stripping, and tighter whitespace handling. If you’re feeding content into an LLM or embedding pipeline, the cloud markdown is worth the difference.
Controlling what gets crawled
Page limits
The --limit flag caps the total number of pages:
spider -u https://books.toscrape.com --limit 10 crawl -o
This crawls up to 10 pages and stops. Simple, predictable, good for testing.
Depth limits
Depth controls how far from the starting URL the crawler will go. A depth of 1 means “only pages linked from the homepage.” A depth of 2 means “pages linked from those pages, too.”
spider -u https://books.toscrape.com --limit 10 -d 2 -v crawl -o
fetch https://books.toscrape.com
fetch https://books.toscrape.com/index.html
fetch https://books.toscrape.com/catalogue/page-2.html
fetch https://books.toscrape.com/catalogue/page-3.html
fetch https://books.toscrape.com/catalogue/page-1.html
Depth is how you avoid crawling a 50,000-page site when you only need the top-level content. Set --limit for a hard cap and -d for structural control. Use both together for precise targeting.
Crawl budgets
Budgets give you path-level control. The syntax is path,limit pairs:
spider -u https://books.toscrape.com -B "catalogue,5" crawl -o
This allows up to 5 pages under /catalogue and unlimited pages elsewhere. You can set a global budget with *:
spider -u https://quotes.toscrape.com -B "*,1" \
--return-format markdown \
scrape --output-html
That fetches exactly one page, no matter how many links the crawler finds. Budgets are useful when you know the site structure and want specific sections without over-crawling.
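Conceptually, a budget is a per-path counter checked before each fetch. Here is a simplified Python sketch of that bookkeeping, keyed on the first path segment with * as the global fallback; this illustrates the idea, not the CLI’s actual implementation:

```python
from urllib.parse import urlparse

def within_budget(url, budgets, counts):
    """Return True and increment the counter if the URL's first path
    segment is still under its budget. '*' acts as a catch-all cap."""
    segment = urlparse(url).path.strip("/").split("/")[0]
    for key in (segment, "*"):
        if key in budgets:
            if counts.get(key, 0) >= budgets[key]:
                return False
            counts[key] = counts.get(key, 0) + 1
            return True
    return True  # no budget covers this path: allow it

# Mirrors -B "catalogue,2": at most 2 pages under /catalogue
budgets = {"catalogue": 2}
counts = {}
allowed = [within_budget(u, budgets, counts) for u in [
    "https://books.toscrape.com/catalogue/page-1.html",
    "https://books.toscrape.com/catalogue/page-2.html",
    "https://books.toscrape.com/catalogue/page-3.html",
    "https://books.toscrape.com/index.html",
]]
```

The third /catalogue page is rejected while the homepage-level URL passes untouched, which is exactly the behavior the -B flag gives you.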
Blacklisting paths
If there are paths you never want to touch:
spider -u https://example.com \
--blacklist-url "/admin,/api,/internal" \
crawl -o
Comma-separated URL patterns. Any link matching these patterns gets skipped.
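The matching is easy to reason about if you think of it as a substring check against each discovered link. A minimal sketch of that filter, assuming plain substring patterns (the CLI may accept richer patterns):

```python
def is_blacklisted(url, patterns):
    """Skip any link that contains one of the blacklist patterns."""
    return any(pattern in url for pattern in patterns)

patterns = ["/admin", "/api", "/internal"]
```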
Going remote with Spider Cloud
Local crawling hits a ceiling fast. Your residential IP gets blocked after a few hundred requests. JavaScript-heavy sites need a real browser. Anti-bot systems like Cloudflare and DataDome return challenge pages instead of content. You can try to solve each of these yourself, but you’ll end up building and maintaining proxy infrastructure, browser pools, and retry logic that has nothing to do with the data you actually want.
Spider Cloud handles all of that behind the same CLI you already know. One flag changes where your requests are routed. The commands stay identical, but now you have 200M+ rotating proxies across 199 countries, automatic CAPTCHA solving, headless browser rendering, and smarter content extraction. You don’t run any of that infrastructure. You just add --spider-cloud-mode and keep working.
Authenticate once
spider authenticate sk-your-api-key
This stores your key at ~/.spider/credentials. Every subsequent command picks it up automatically. You can also use the SPIDER_CLOUD_API_KEY environment variable, or pass --spider-cloud-key per command.
The key resolution order:
1. The --spider-cloud-key flag (explicit, per-command)
2. The SPIDER_CLOUD_API_KEY environment variable
3. Stored credentials from spider authenticate
Sign up at spider.cloud to get an API key. You get 2,500 free credits on signup, no card required.
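The precedence is simple first-match-wins. A small Python sketch of that resolution logic, written as an illustration of the documented order rather than the CLI’s own code:

```python
import os

def resolve_api_key(flag_value=None, env=None, stored=None):
    """First match wins: explicit --spider-cloud-key flag, then the
    SPIDER_CLOUD_API_KEY environment variable, then stored credentials
    written by `spider authenticate`."""
    if env is None:
        env = os.environ
    return flag_value or env.get("SPIDER_CLOUD_API_KEY") or stored
```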
Cloud modes
This is where the CLI gets powerful. The --spider-cloud-mode flag controls how your requests are routed through Spider’s infrastructure:
Proxy mode (default)
spider -u https://example.com --spider-cloud-mode proxy crawl -o
Routes your request through Spider’s rotating proxy network. Your IP stays hidden, and you get automatic geographic rotation across 199 countries. This is the default because it works for most sites and costs the least.
When to use it: General crawling where you just need a clean IP. News sites, blogs, public documentation, e-commerce product pages.
Smart mode
spider -u https://example.com --spider-cloud-mode smart --limit 50 crawl -o
Smart mode inspects each response and decides the cheapest way to get the data. Static pages go through HTTP proxies. Pages with bot protection automatically escalate to browser rendering. Pages with CAPTCHAs get vision AI solving.
You don’t configure any of this. The system figures it out per-page.
When to use it: Mixed sites where some pages are simple HTML and others have JavaScript rendering or bot checks. This is the mode you want for most production scraping because it optimizes cost automatically.
API mode
spider -u https://example.com --spider-cloud-mode api \
--return-format markdown \
scrape --output-html
Sends your request directly to Spider’s /crawl API endpoint. Full feature access: AI extraction, screenshots, custom headers, everything the REST API supports.
When to use it: When you need the full power of the API but prefer working from the terminal. Scripting batch jobs, CI/CD pipelines, or when you want consistent behavior with your API-based workflows.
Unblocker mode
spider -u https://protected-site.com --spider-cloud-mode unblocker \
scrape --output-html
Dedicated anti-bot bypass. Every request goes through antidetect browsers with fingerprint rotation, automatic CAPTCHA solving, and JavaScript execution. Heavier than proxy mode, but gets through protections that simpler approaches can’t.
When to use it: Sites with aggressive bot detection. Cloudflare Under Attack mode, DataDome, PerimeterX, hCaptcha walls. If proxy mode returns 403s or challenge pages, switch to unblocker.
Fallback mode
spider -u https://example.com --spider-cloud-mode fallback \
--limit 100 \
crawl -o
Tries a direct fetch first. If the response comes back as a 403, 429, or 503, it automatically retries through Spider’s proxy infrastructure. This saves money on sites that don’t block you while still handling the ones that do.
When to use it: Crawling a list of mixed domains where some will block you and some won’t. You don’t want to pay for proxy overhead on sites that serve content freely, but you don’t want failures on the ones that don’t.
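The escalation rule is worth internalizing: only block-shaped status codes trigger the retry. A stubbed-out Python sketch of that control flow, with hypothetical fetcher callables standing in for real HTTP calls:

```python
RETRYABLE = {403, 429, 503}

def fetch_with_fallback(url, direct_fetch, proxy_fetch):
    """Try a plain fetch first; only when the status suggests blocking
    (403, 429, 503) retry the same URL through the proxy tier."""
    status, body = direct_fetch(url)
    if status in RETRYABLE:
        return proxy_fetch(url)
    return status, body

# Stub fetchers standing in for real HTTP calls
blocked_direct = lambda url: (403, "")
open_direct = lambda url: (200, "fine")
via_proxy = lambda url: (200, "<html>ok</html>")

blocked_result = fetch_with_fallback("https://example.com", blocked_direct, via_proxy)
open_result = fetch_with_fallback("https://example.com", open_direct, via_proxy)
```

A site that serves content freely never touches the proxy tier, which is why fallback mode is the cheap option for mixed domain lists.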
Browser rendering
Some pages don’t exist until JavaScript runs. SPAs, client-rendered dashboards, lazy-loaded content. For these, add --spider-cloud-browser:
spider -u https://example.com \
--spider-cloud-browser --headless \
--return-format markdown \
scrape --output-html
This spins up a remote headless Chrome instance through Spider Browser Cloud. The page renders fully, JavaScript executes, and then the content gets extracted. Combine it with any cloud mode:
spider -u https://example.com \
--spider-cloud-mode smart \
--spider-cloud-browser --headless \
--wait-for-idle-network 5000 \
scrape --output-html
The --wait-for-idle-network flag waits until there have been no network requests for 500ms, giving up after the timeout you pass (5000ms here). This catches lazy-loaded content that fires after the initial page load.
Other wait strategies for tricky pages:
# Wait for a specific element to appear
--wait-for-selector ".product-grid"
# Wait for the DOM to stop changing
--wait-for-idle-dom
# Wait a fixed delay (last resort)
--wait-for-delay 3000
Practical patterns
Scrape a site to markdown for RAG
spider -u https://docs.example.com \
--limit 500 \
--return-format markdown \
--respect-robots-txt \
scrape --output-html > docs.jsonl
Each line of docs.jsonl is a JSON object with the markdown content, URL, and status code. Pipe it into your embedding pipeline.
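Since each line is a standalone JSON object, turning docs.jsonl into embedding-ready (url, markdown) pairs is a few lines of Python. A minimal sketch, using sample lines in the shape shown earlier (the URLs are hypothetical):

```python
import json

def markdown_chunks(jsonl_text):
    """Parse scrape output (one JSON object per line), keep pages
    that returned 200, and pair each URL with its markdown content."""
    pairs = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("status_code") == 200:
            pairs.append((record["url"], record["content"]))
    return pairs

# Two sample lines in the shape the scrape command emits
sample = (
    '{"content": "# Page one", "status_code": 200, "url": "https://docs.example.com/a"}\n'
    '{"content": "", "status_code": 404, "url": "https://docs.example.com/missing"}\n'
)
pairs = markdown_chunks(sample)
```

From here, each pair goes straight into your chunker or embedding model.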
Crawl with a polite delay
spider -u https://example.com \
--limit 1000 \
-D 500 \
--respect-robots-txt \
crawl -o > links.txt
The -D 500 flag adds a 500ms delay between requests. Combined with --respect-robots-txt, this is the responsible way to crawl sites that haven’t explicitly invited scraping.
Scrape through a proxy
If you’re already running your own proxy infrastructure:
spider -u https://example.com \
-p http://user:pass@proxy.example.com:8080 \
scrape --output-html
The -p flag routes all requests through your proxy. Works with HTTP and SOCKS proxies.
Connect to an existing Chrome instance
If you’re already running Chrome with remote debugging:
spider -u https://example.com \
--chrome-connection-url ws://127.0.0.1:9222 \
--headless \
scrape --output-html
This connects to your existing Chrome DevTools Protocol endpoint instead of launching a new browser. Useful for debugging or for environments where Chrome is already running as a service.
Stealth crawling
spider -u https://protected-site.com \
--stealth \
--headless \
scrape --output-html
The --stealth flag enables anti-detection measures: randomized fingerprints, proper navigator properties, WebGL spoofing, and other techniques that make automated Chrome look like a real browser.
Include cookies for authenticated scraping
spider -u https://example.com/dashboard \
--cookie "session=abc123; token=xyz789" \
--headless \
scrape --output-html
Pass your session cookies directly. This is how you scrape pages that require authentication, like logged-in dashboards or member-only content.
Crawl across subdomains
spider -u https://example.com \
--subdomains \
--limit 1000 \
crawl -o
The --subdomains flag allows the crawler to follow links to blog.example.com, docs.example.com, and any other subdomain it discovers. Without this flag, the crawler stays strictly on the exact domain you specify.
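The scoping rule amounts to a host check on every discovered link. A simplified Python sketch of that decision, assuming a plain suffix match on the root domain (an illustration, not the crawler’s actual code):

```python
from urllib.parse import urlparse

def in_scope(link, root="example.com", subdomains=False):
    """Without --subdomains the crawler stays on the exact host;
    with it, any host under the root domain is followed too."""
    host = urlparse(link).hostname or ""
    if host == root:
        return True
    return subdomains and host.endswith("." + root)
```

Note the leading dot in the suffix check: it keeps lookalike domains such as evil-example.com out of scope.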
Group external domains into one crawl
spider -u https://example.com \
-E "docs.example.com,blog.example.com" \
--limit 500 \
crawl -o
The -E flag treats the specified external domains as part of the same crawl. Links between them get followed instead of ignored.
Piping output into your stack
The CLI outputs structured JSONL, which means it plays well with standard Unix tools:
# Count pages by status code
spider -u https://example.com --limit 100 \
scrape --output-html | jq -r '.status_code' | sort | uniq -c
# Extract just the URLs
spider -u https://example.com --limit 100 \
scrape --output-html | jq -r '.url'
# Get markdown content for pages that returned 200
spider -u https://example.com --limit 100 \
--return-format markdown \
scrape --output-html | jq -r 'select(.status_code == 200) | .content'
Feed this into whatever comes next: an embedding model, a database, a search index, a file system. The CLI handles the crawling; you handle the data.
Local vs. cloud: when to upgrade
The local CLI is genuinely useful on its own. For small crawls, testing, or sites that serve content without any protection, you don’t need anything else. But here’s what changes when you add Spider Cloud:
| | Local | Cloud |
|---|---|---|
| IP rotation | Your single IP | 200M+ IPs across 199 countries |
| Anti-bot bypass | None (you get blocked) | Automatic, per-request |
| Browser rendering | Requires local Chrome | Remote headless Chrome, no install |
| CAPTCHA solving | Manual | Vision AI, automatic |
| Markdown quality | Local transformation | Server-side pipeline, cleaner output |
| Concurrent scale | Limited by your machine | Limited by your plan |
| Cost | Free + your bandwidth/compute | Pay-per-page, starting under $0.001 |
The inflection point is usually around the first 403 or the first site that needs JavaScript. Once you’re past “scrape a few static pages,” the cloud mode pays for itself in time you don’t spend debugging proxy setups and browser configurations.
When to use which mode
If you’re not sure which combination of flags to reach for, here’s the decision tree:
“I need links from a site” → crawl -o with --limit and -d
“I need page content” → scrape --output-html with --return-format markdown
“Sites are blocking me” → Add --spider-cloud-mode smart after authenticating
“Pages need JavaScript” → Add --spider-cloud-browser --headless
“I’m scraping millions of pages” → --spider-cloud-mode smart handles cost optimization automatically
“I need to be polite” → -D 500 --respect-robots-txt
“I need clean data for AI” → --spider-cloud-mode smart --return-format markdown for the best output quality
For most real workloads, the fastest path is: install the CLI, authenticate with Spider Cloud, and use smart mode. You can always drop back to local for quick tests.
What’s next
The Spider CLI ships updates alongside the core library. Upcoming features include WARC archiving for compliance workflows and direct integration with Spider’s AI extraction models for structured data from unstructured pages.
The CLI source code is MIT-licensed. If something doesn’t work the way you expect, open an issue or submit a PR.
Get started:
cargo install spider_cli
spider authenticate sk-your-api-key
spider -u https://your-target.com --spider-cloud-mode smart --return-format markdown scrape --output-html