Building an MCP Server for Web Scraping
AI agents need web access. An MCP server is the cleanest way to give it to them — one implementation, every client.
The Model Context Protocol (MCP) is an open standard for connecting AI models to external tools. Build the scraping server once, and Claude Desktop, VS Code, Cursor, and any other MCP client can use it.
What is MCP?
MCP is an open protocol created by Anthropic for connecting AI models to external tools and data sources. Instead of writing bespoke integrations for every model provider, you build a single MCP server that exposes capabilities (tools, resources, prompts) over a well-defined JSON-RPC transport. Any MCP-compatible client, including Claude Desktop, VS Code with Copilot, Cursor, and custom agent frameworks, can discover and call those tools without additional glue code.
The protocol uses a client-server architecture:
- MCP Server: Exposes tools, resources, and prompts. Runs as a local process or remote service.
- MCP Client: Discovers available tools and invokes them on behalf of the model.
- Transport: Connects client and server over stdio (local) or HTTP with Server-Sent Events (remote).
This separation means you can build a web scraping MCP server today and immediately use it from any client that speaks the protocol.
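Concretely, when a model invokes one of the tools defined later in this tutorial, the client sends a single JSON-RPC request over the transport and the server answers with a content block. A scrape_page call looks roughly like this on the wire (simplified):

{"jsonrpc": "2.0", "id": 2, "method": "tools/call", "params": {"name": "scrape_page", "arguments": {"url": "https://example.com"}}}

and the server replies:

{"jsonrpc": "2.0", "id": 2, "result": {"content": [{"type": "text", "text": "...page content as markdown..."}]}}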
Why Spider as the backend
MCP tools need to be fast. When a model calls a tool mid-conversation, the user is waiting. A scraping backend that responds in under a second feels instant; one that takes ten seconds or more breaks the interaction.
We are using Spider’s API for this tutorial because it returns clean markdown by default — no post-processing needed before feeding content into an LLM context window. It also handles proxy rotation and anti-bot bypass, which means the MCP server does not need its own infrastructure for dealing with Cloudflare or CAPTCHAs.
Any scraping API that returns clean text would work here. The MCP implementation is the same regardless of backend. Spider happens to be fast enough that tool calls feel instant to the user.
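If you want to sanity-check the backend before writing any MCP code, a single request against the /crawl endpoint used throughout this tutorial (with your own key exported as SPIDER_API_KEY) should come back as markdown:

curl -s https://api.spider.cloud/crawl \
  -H "Authorization: Bearer $SPIDER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "limit": 1, "return_format": "markdown"}'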
Project Setup
Initialize a new TypeScript project and install the dependencies:
mkdir spider-mcp-server && cd spider-mcp-server
npm init -y
npm install @modelcontextprotocol/sdk zod
npm install -D typescript @types/node
Create a tsconfig.json:
{
"compilerOptions": {
"target": "ES2022",
"module": "Node16",
"moduleResolution": "Node16",
"outDir": "./dist",
"rootDir": "./src",
"strict": true,
"esModuleInterop": true,
"skipLibCheck": true,
"declaration": true
},
"include": ["src/**/*"]
}
Update your package.json to include the build script and binary entry point:
{
"name": "spider-mcp-server",
"version": "1.0.0",
"type": "module",
"bin": {
"spider-mcp-server": "./dist/index.js"
},
"scripts": {
"build": "tsc",
"start": "node dist/index.js"
}
}
Defining the Tools
Our MCP server will expose four tools that cover the most common web data retrieval patterns:
| Tool | Purpose |
|---|---|
| crawl_url | Start from a URL, follow links, return content for multiple pages |
| scrape_page | Fetch a single URL and return its content |
| search_web | Search the web and optionally scrape the results |
| extract_data | Pull structured fields from a page using natural language |
Each tool is backed by a Spider API endpoint: crawl_url, scrape_page, and extract_data all call /crawl with different parameters, while search_web calls /search. The MCP server is a thin, typed interface between the model and the scraping engine.
Full Implementation
Create src/index.ts. This is the complete MCP server:
#!/usr/bin/env node
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";
const SPIDER_API_KEY = process.env.SPIDER_API_KEY;
const SPIDER_BASE_URL = "https://api.spider.cloud";
if (!SPIDER_API_KEY) {
console.error("SPIDER_API_KEY environment variable is required");
process.exit(1);
}
// ── Spider API client ───────────────────────────────────────────────
interface SpiderRequestOptions {
endpoint: string;
body: Record<string, unknown>;
}
async function callSpiderAPI({ endpoint, body }: SpiderRequestOptions): Promise<unknown> {
const response = await fetch(`${SPIDER_BASE_URL}${endpoint}`, {
method: "POST",
headers: {
"Authorization": `Bearer ${SPIDER_API_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify(body),
});
if (!response.ok) {
const errorText = await response.text();
throw new Error(
`Spider API error (${response.status}): ${errorText}`
);
}
return response.json();
}
// ── MCP Server ──────────────────────────────────────────────────────
const server = new McpServer({
name: "spider-web-scraping",
version: "1.0.0",
});
// Tool: crawl_url
server.tool(
"crawl_url",
"Crawl a website starting from a URL. Follows links and returns content for each discovered page. Use this when you need to gather content from multiple pages on the same site.",
{
url: z.string().url().describe("The starting URL to crawl"),
limit: z
.number()
.int()
.min(1)
.max(500)
.default(10)
.describe("Maximum number of pages to crawl (default: 10)"),
return_format: z
.enum(["markdown", "html", "text"])
.default("markdown")
.describe("Output format for page content (default: markdown)"),
request: z
.enum(["http", "chrome", "smart"])
.default("smart")
.describe(
"Request mode: http (static pages), chrome (JS-rendered), smart (auto-detect)"
),
},
async ({ url, limit, return_format, request }) => {
try {
const result = await callSpiderAPI({
endpoint: "/crawl",
body: { url, limit, return_format, request },
});
const pages = result as Array<{ url: string; content: string }>;
const summary = pages
.map(
(page, i) =>
`## Page ${i + 1}: ${page.url}\n\n${page.content || "(no content)"}`
)
.join("\n\n---\n\n");
return {
content: [
{
type: "text" as const,
text: `Crawled ${pages.length} page(s) from ${url}:\n\n${summary}`,
},
],
};
} catch (error) {
return {
content: [
{
type: "text" as const,
text: `Crawl failed: ${error instanceof Error ? error.message : String(error)}`,
},
],
isError: true,
};
}
}
);
// Tool: scrape_page
server.tool(
"scrape_page",
"Scrape a single web page and return its content. Use this when you need content from one specific URL.",
{
url: z.string().url().describe("The URL to scrape"),
return_format: z
.enum(["markdown", "html", "text"])
.default("markdown")
.describe("Output format (default: markdown)"),
request: z
.enum(["http", "chrome", "smart"])
.default("smart")
.describe("Request mode (default: smart)"),
},
async ({ url, return_format, request }) => {
try {
const result = await callSpiderAPI({
endpoint: "/crawl",
body: { url, limit: 1, return_format, request },
});
const pages = result as Array<{ url: string; content: string }>;
const page = pages[0];
return {
content: [
{
type: "text" as const,
text: page?.content || "No content returned for this URL.",
},
],
};
} catch (error) {
return {
content: [
{
type: "text" as const,
text: `Scrape failed: ${error instanceof Error ? error.message : String(error)}`,
},
],
isError: true,
};
}
}
);
// Tool: search_web
server.tool(
"search_web",
"Search the web using a query and return results with their content. Use this when you need to find pages about a topic rather than scraping a known URL.",
{
query: z.string().describe("The search query"),
limit: z
.number()
.int()
.min(1)
.max(50)
.default(5)
.describe("Maximum number of results to return (default: 5)"),
return_format: z
.enum(["markdown", "html", "text"])
.default("markdown")
.describe("Output format for scraped content (default: markdown)"),
fetch_page_content: z
.boolean()
.default(true)
.describe(
"Whether to fetch and return the full page content for each result (default: true)"
),
},
async ({ query, limit, return_format, fetch_page_content }) => {
try {
const result = await callSpiderAPI({
endpoint: "/search",
body: {
search: query,
limit,
return_format,
fetch_page_content,
},
});
const pages = result as Array<{
url: string;
title?: string;
description?: string;
content?: string;
}>;
const formatted = pages
.map((page, i) => {
const header = `## Result ${i + 1}: ${page.title || page.url}`;
const url = `**URL:** ${page.url}`;
const desc = page.description
? `**Description:** ${page.description}`
: "";
const content = page.content
? `\n\n${page.content}`
: "\n\n(content not fetched)";
return [header, url, desc, content].filter(Boolean).join("\n");
})
.join("\n\n---\n\n");
return {
content: [
{
type: "text" as const,
text: `Found ${pages.length} result(s) for "${query}":\n\n${formatted}`,
},
],
};
} catch (error) {
return {
content: [
{
type: "text" as const,
text: `Search failed: ${error instanceof Error ? error.message : String(error)}`,
},
],
isError: true,
};
}
}
);
// Tool: extract_data
server.tool(
"extract_data",
"Extract structured data from a web page using a natural language prompt. Returns JSON. Use this when you need specific fields (prices, names, dates, etc.) pulled from a page.",
{
url: z.string().url().describe("The URL to extract data from"),
prompt: z
.string()
.describe(
"Natural language description of what data to extract, e.g. 'Extract all product names and prices'"
),
request: z
.enum(["http", "chrome", "smart"])
.default("smart")
.describe("Request mode (default: smart)"),
},
async ({ url, prompt, request }) => {
try {
const result = await callSpiderAPI({
endpoint: "/crawl",
body: {
url,
limit: 1,
return_format: "markdown",
request,
extra_ai_data: true,
prompt,
},
});
const pages = result as Array<{
url: string;
content: string;
extra_ai_data?: string;
}>;
const page = pages[0];
const extracted = page?.extra_ai_data || page?.content;
return {
content: [
{
type: "text" as const,
text: extracted || "No data could be extracted from this URL.",
},
],
};
} catch (error) {
return {
content: [
{
type: "text" as const,
text: `Extraction failed: ${error instanceof Error ? error.message : String(error)}`,
},
],
isError: true,
};
}
}
);
// ── Start the server ────────────────────────────────────────────────
async function main() {
const transport = new StdioServerTransport();
await server.connect(transport);
console.error("Spider MCP server running on stdio");
}
main().catch((error) => {
console.error("Fatal error:", error);
process.exit(1);
});
Build the server:
npm run build
That produces dist/index.js, a standalone Node.js script that speaks MCP over stdio.
How the Code Works
Let us walk through the key decisions.
Transport Layer
The server uses StdioServerTransport, which means the MCP client spawns the server as a child process and communicates over stdin/stdout. This is the simplest transport for local development and is what Claude Desktop, VS Code, and Cursor all expect for locally installed MCP servers.
For remote deployments (shared team servers, hosted infrastructure), you would swap to the SSE transport. The tool definitions and handlers stay identical.
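For reference, here is a minimal single-client sketch of the remote setup, assuming the SDK's SSEServerTransport and an Express app (a production deployment needs per-session transport management, and newer protocol revisions favor Streamable HTTP):

import express from "express";
import { SSEServerTransport } from "@modelcontextprotocol/sdk/server/sse.js";

const app = express();
let transport: SSEServerTransport | undefined;

// The client opens an SSE stream here and receives server-to-client messages on it.
app.get("/sse", async (_req, res) => {
  transport = new SSEServerTransport("/messages", res);
  await server.connect(transport);
});

// The client POSTs its JSON-RPC messages back to this endpoint.
app.post("/messages", async (req, res) => {
  if (!transport) {
    res.status(400).send("No active SSE connection");
    return;
  }
  await transport.handlePostMessage(req, res);
});

app.listen(3000);

The rest of the file, including every server.tool() definition, stays unchanged.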
Tool Definitions with Zod
Each server.tool() call takes a Zod schema that defines the tool’s parameters. The MCP SDK converts these schemas into JSON Schema for the client, which means the AI model sees typed parameter descriptions and can call tools correctly without guessing at field names or types.
The schemas also validate incoming parameters before your handler runs. If the model passes a string where a number is expected, the SDK rejects the call with a clear error rather than letting it propagate to the Spider API.
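For example, the url parameter of scrape_page is declared once in Zod, and the client receives roughly this JSON Schema for it (approximate; the exact output depends on the SDK's converter):

// Zod declaration in the tool definition:
url: z.string().url().describe("The URL to scrape"),

// Approximately what the MCP client sees for that parameter:
// { "type": "string", "format": "uri", "description": "The URL to scrape" }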
Error Handling Pattern
Every handler wraps its Spider API call in a try/catch and returns isError: true on failure. This is important for MCP. When a tool returns an error, the model knows the call failed and can decide whether to retry, try a different approach, or report the failure to the user. Catching inside the handler keeps the error message readable and under your control, rather than leaving it to whatever the SDK does with an unhandled exception.
The callSpiderAPI Wrapper
A single function handles all communication with Spider’s API: authentication, JSON serialization, and HTTP error checking. If you later want to add retries, request logging, or response caching, there is exactly one place to do it.
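For example, a retry with exponential backoff wraps the existing function without touching any tool handler (a sketch; tune the attempt count and decide which failures you consider retryable):

async function callSpiderAPIWithRetry(
  options: SpiderRequestOptions,
  maxAttempts = 3
): Promise<unknown> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await callSpiderAPI(options);
    } catch (error) {
      lastError = error;
      // Back off 500ms, 1s, 2s, ... before the next attempt.
      if (attempt < maxAttempts) {
        await new Promise((resolve) => setTimeout(resolve, 500 * 2 ** (attempt - 1)));
      }
    }
  }
  throw lastError;
}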
Connecting to MCP Clients
Claude Desktop
Edit your Claude Desktop configuration file.
On macOS, open ~/Library/Application Support/Claude/claude_desktop_config.json:
{
"mcpServers": {
"spider": {
"command": "node",
"args": ["/absolute/path/to/spider-mcp-server/dist/index.js"],
"env": {
"SPIDER_API_KEY": "your-spider-api-key"
}
}
}
}
On Windows, the file is at %APPDATA%\Claude\claude_desktop_config.json.
Restart Claude Desktop. The spider tools will appear in the tool list, and Claude will call them when a conversation requires web data.
VS Code (Copilot Chat)
Add the server to your VS Code settings.json or workspace .vscode/mcp.json:
{
"mcp": {
"servers": {
"spider": {
"command": "node",
"args": ["/absolute/path/to/spider-mcp-server/dist/index.js"],
"env": {
"SPIDER_API_KEY": "your-spider-api-key"
}
}
}
}
}
Cursor
Cursor reads MCP configuration from ~/.cursor/mcp.json:
{
"mcpServers": {
"spider": {
"command": "node",
"args": ["/absolute/path/to/spider-mcp-server/dist/index.js"],
"env": {
"SPIDER_API_KEY": "your-spider-api-key"
}
}
}
}
Claude Code (CLI)
If you use Claude Code in the terminal, add the server to your project’s .mcp.json at the repository root:
{
"mcpServers": {
"spider": {
"command": "node",
"args": ["/absolute/path/to/spider-mcp-server/dist/index.js"],
"env": {
"SPIDER_API_KEY": "your-spider-api-key"
}
}
}
}
In every case, the pattern is the same: point the client at the compiled JavaScript file and pass the API key as an environment variable. The MCP protocol handles discovery, so the client automatically learns what tools are available.
Testing the Server
You can test the server directly using the MCP Inspector, a debugging tool included in the SDK:
npx @modelcontextprotocol/inspector node dist/index.js
This opens a web UI where you can call each tool manually and inspect the JSON-RPC messages flowing between client and server. It is the fastest way to verify that your tool definitions, parameter validation, and response formatting all work before connecting a real model.
You can also test with a quick stdin script:
echo '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' | \
SPIDER_API_KEY=your-key node dist/index.js 2>/dev/null
This sends a tools/list request and prints the server’s response, which should include all four tool definitions.
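Depending on the SDK version, the server may refuse tools/list until the client has completed the MCP initialize handshake. If the one-liner above returns an initialization error, pipe the full sequence instead (the protocolVersion string here is an example; use the one your SDK reports):

printf '%s\n' \
  '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"test","version":"1.0.0"}}}' \
  '{"jsonrpc":"2.0","method":"notifications/initialized"}' \
  '{"jsonrpc":"2.0","id":2,"method":"tools/list","params":{}}' | \
  SPIDER_API_KEY=your-key node dist/index.js 2>/dev/null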
Production Considerations
The server above works for personal use and small teams. For production deployments, there are a few areas to harden.
Rate Limiting
Spider supports up to 50,000 requests per minute, but your MCP server should still implement its own rate limiting. A model in a tight loop can generate tool calls faster than you want to spend credits.
A simple approach using a token bucket:
class RateLimiter {
private tokens: number;
private lastRefill: number;
constructor(
private maxTokens: number = 60,
private refillRate: number = 60 // tokens per second
) {
this.tokens = maxTokens;
this.lastRefill = Date.now();
}
async acquire(): Promise<void> {
const now = Date.now();
const elapsed = (now - this.lastRefill) / 1000;
this.tokens = Math.min(this.maxTokens, this.tokens + elapsed * this.refillRate);
this.lastRefill = now;
if (this.tokens < 1) {
  const waitTime = (1 - this.tokens) / this.refillRate;
  await new Promise((resolve) => setTimeout(resolve, waitTime * 1000));
  // The token that accrues during the wait is consumed immediately;
  // reset lastRefill so the wait period is not counted again on the next call.
  this.tokens = 0;
  this.lastRefill = Date.now();
} else {
  this.tokens -= 1;
}
}
}
const limiter = new RateLimiter(60, 1); // 60 requests per minute
// In callSpiderAPI, add:
// await limiter.acquire();
Response Caching
If the model asks about the same URL multiple times in a conversation, you can avoid redundant API calls with a simple TTL cache:
const cache = new Map<string, { data: unknown; expiry: number }>();
const CACHE_TTL_MS = 5 * 60 * 1000; // 5 minutes
function getCached(key: string): unknown | null {
const entry = cache.get(key);
if (!entry) return null;
if (Date.now() > entry.expiry) {
cache.delete(key);
return null;
}
return entry.data;
}
function setCache(key: string, data: unknown): void {
cache.set(key, { data, expiry: Date.now() + CACHE_TTL_MS });
}
Build cache keys from the endpoint and request body so that identical requests hit the cache. For scrape_page and extract_data in particular, caching prevents the model from re-scraping the same page when it decides to ask follow-up questions about it.
Credit Controls
Spider supports max_credits_per_page and max_credits_allowed parameters that cap spending on a per-request basis. Adding these to your callSpiderAPI wrapper prevents a single tool call from consuming an unexpected amount of credits:
const DEFAULT_CREDIT_LIMITS = {
max_credits_per_page: 5,
max_credits_allowed: 50,
};
// Merge into every request body:
body: { ...DEFAULT_CREDIT_LIMITS, ...body }
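Putting the production pieces together, tool handlers can call a hardened wrapper instead of callSpiderAPI directly (a sketch combining the limiter, the TTL cache, and the credit defaults from above):

async function callSpiderAPIHardened({ endpoint, body }: SpiderRequestOptions): Promise<unknown> {
  // Credit caps are defaults; explicit values in the request body win.
  const mergedBody = { ...DEFAULT_CREDIT_LIMITS, ...body };

  // Identical requests within the TTL are served from the cache.
  const cacheKey = `${endpoint}:${JSON.stringify(mergedBody)}`;
  const cached = getCached(cacheKey);
  if (cached !== null) return cached;

  // Throttle before spending credits.
  await limiter.acquire();

  const result = await callSpiderAPI({ endpoint, body: mergedBody });
  setCache(cacheKey, result);
  return result;
}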
Logging
MCP servers communicate over stdio, so you cannot log to stdout without corrupting the protocol stream. Use console.error() for all diagnostic output, or write to a file:
import { appendFileSync } from "fs";
function log(message: string): void {
const timestamp = new Date().toISOString();
appendFileSync("/tmp/spider-mcp.log", `${timestamp} ${message}\n`);
}
This gives you a persistent log you can tail while debugging tool calls in real time.
Content Truncation
Model context windows are finite. A crawl_url call with limit: 100 could return megabytes of markdown, which would overflow the context and break the conversation. Truncate large responses to a reasonable size:
const MAX_CONTENT_LENGTH = 50_000; // characters
function truncate(text: string): string {
if (text.length <= MAX_CONTENT_LENGTH) return text;
return (
text.slice(0, MAX_CONTENT_LENGTH) +
"\n\n[Content truncated. Request fewer pages or a specific URL for full content.]"
);
}
Apply this in each tool handler before returning the response.
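In the crawl_url handler, for example, the change is a single line:

text: truncate(`Crawled ${pages.length} page(s) from ${url}:\n\n${summary}`),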
Extending the Server
The four tools above cover the most common patterns, but Spider’s API has more endpoints you can wrap as additional tools:
- /screenshot: Capture full-page screenshots as base64 PNG. Useful for visual analysis or when the model needs to "see" a page layout.
- /links: Return the link graph for a URL. Useful for sitemap discovery or finding related pages before deciding what to crawl.
- /pipeline/extract-contacts: Extract emails, phone numbers, and social profiles from a page.
Each additional tool follows the same pattern: define a Zod schema, write a handler that calls callSpiderAPI, and format the response as MCP content blocks.
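As a sketch, wrapping /links as a fifth tool could look like this (the request body and the response shape are assumptions here, so check Spider's API documentation for the actual fields):

server.tool(
  "get_links",
  "Return the links discovered on a page. Use this for sitemap discovery before deciding what to crawl.",
  {
    url: z.string().url().describe("The URL to collect links from"),
  },
  async ({ url }) => {
    try {
      const result = await callSpiderAPI({
        endpoint: "/links",
        body: { url, limit: 1 },
      });
      return {
        content: [
          {
            type: "text" as const,
            // The exact response shape is backend-specific; pass it through as JSON.
            text: JSON.stringify(result, null, 2),
          },
        ],
      };
    } catch (error) {
      return {
        content: [
          {
            type: "text" as const,
            text: `Link collection failed: ${error instanceof Error ? error.message : String(error)}`,
          },
        ],
        isError: true,
      };
    }
  }
);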
You can also add MCP resources (read-only data the model can reference) and prompts (reusable prompt templates). For example, a resource that exposes your Spider account’s credit balance, or a prompt template for common extraction patterns.
What You Have Built
At this point you have a production-capable MCP server that gives any AI model four core web data capabilities:
- Crawl entire sites and get clean markdown back.
- Scrape individual pages on demand.
- Search the web and return results with full page content.
- Extract structured JSON from any page using natural language.
The server works with Claude Desktop, VS Code, Cursor, Claude Code, and any other client that supports the MCP protocol. It validates inputs with Zod and handles errors without crashing the server process.
MCP is still early. The spec is evolving, SSE transport is being reworked, and most clients still only support stdio. But even in its current form, wrapping a scraping API as an MCP server took under 200 lines of tool definitions and immediately worked across four different clients. That is worth the bet.
A few things to keep in mind: pin your @modelcontextprotocol/sdk version in package.json, since the SDK still moves quickly and breaking changes can land between releases. Add a health check that tests the Spider API on startup so the MCP server fails fast if credentials are wrong. And if you extend this with additional tools, consider publishing the server as a package so others can use it too.
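A startup health check can be a single cheap request made before connecting the transport (a sketch; the /data/credits path is an assumption, so substitute whatever lightweight authenticated endpoint your Spider plan exposes):

async function healthCheck(): Promise<void> {
  // Hypothetical credits endpoint, used only to verify the API key works.
  const response = await fetch(`${SPIDER_BASE_URL}/data/credits`, {
    headers: { Authorization: `Bearer ${SPIDER_API_KEY}` },
  });
  if (!response.ok) {
    console.error(`Spider API health check failed (${response.status}). Check SPIDER_API_KEY.`);
    process.exit(1);
  }
}

// In main(), call it before server.connect(transport):
// await healthCheck();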