
Building AI Agents That Browse the Web

Architecture patterns and working code for web-browsing AI agents. Covers research, monitoring, and data extraction agents using CrewAI and AutoGen with Spider as the scraping backend.

Jeff Mendez


Production AI agents browse the web. The hard parts are not the LLM reasoning — they are the infrastructure: fetching pages reliably at the speed the agent needs, keeping token costs under control, and handling the inevitable failures when target sites block you or return garbage.

This post covers three agent patterns that hold up past the demo stage, with implementations in CrewAI and AutoGen. Each pattern uses Spider as the scraping backend, though the architecture applies regardless of which API you use.

Why agents need a different kind of scraper

A traditional scraper fetches one page, parses it, and stores the result. An agent does something fundamentally different. It decides which pages to fetch based on intermediate reasoning, issues requests in tight loops, and feeds the content directly into an LLM for processing. That changes the requirements.

Latency matters more than you think. An agent making 40 requests per task cannot afford 30 seconds per page. At that rate, a single research question takes 20 minutes of wall-clock time before the LLM even starts reasoning. Spider’s HTTP mode returns cached pages in roughly 10-15ms. Uncached pages requiring Chrome rendering take 1-5 seconds each. With batching (sending all 40 URLs in one request), the scraping step drops from minutes to seconds. The point: your scraper should not be the bottleneck in the agent loop.

Token cost is a function of content quality. Raw HTML from a typical page runs 10,000 to 80,000 tokens depending on page complexity. Clean markdown of the same page runs 1,000 to 5,000 tokens. When every page goes through an LLM, the difference is often 10x or more in token cost. Spider strips navigation, ads, footers, and boilerplate, returning markdown that is ready for an LLM context window without post-processing.
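
To see the difference on your own pages, count tokens for both formats. A minimal sketch using tiktoken; the "raw" return format for unprocessed HTML is an assumption, so check Spider's docs for the exact value your client version supports.

import os
import tiktoken
from spider import Spider

spider = Spider(api_key=os.getenv("SPIDER_API_KEY"))
enc = tiktoken.get_encoding("cl100k_base")

def token_count(text: str) -> int:
    return len(enc.encode(text))

url = "https://example.com/some-article"  # placeholder URL

# Markdown version -- what you would actually feed the LLM.
md = spider.scrape_url(url, params={"return_format": "markdown"})
md_tokens = token_count(md[0].get("content", "")) if md else 0

# Unprocessed HTML. "raw" is an assumption; verify the parameter value in Spider's docs.
raw = spider.scrape_url(url, params={"return_format": "raw"})
raw_tokens = token_count(raw[0].get("content", "")) if raw else 0

print(f"raw HTML: {raw_tokens} tokens, markdown: {md_tokens} tokens")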

Batch support reduces round trips. Spider accepts multiple URLs in a single API call. Instead of issuing 20 sequential HTTP requests, an agent can send all 20 URLs at once and get results back in a single response. This matters for agents that collect a list of URLs from a search step and then need to scrape all of them.

Structured extraction without selectors. Spider’s prompt-based extraction lets you describe what you want in natural language and get JSON back. For data extraction agents, this eliminates brittle parsing logic. The trade-off: prompt-based extraction is slower and less deterministic than a well-maintained CSS selector. For sites you control or scrape frequently, selectors are still the better choice.

Pattern 1: Research agent

A research agent takes a question, searches the web, crawls relevant pages, synthesizes the content, and produces an answer. The key challenge is deciding when to stop searching and start answering.

Architecture

User question
    |
    v
[Query planner] --> generates search queries
    |
    v
[Spider search] --> returns ranked URLs
    |
    v
[Spider scrape]  --> fetches pages as markdown (batch)
    |
    v
[Evaluator]     --> checks if content answers the question
    |                    |
    | (insufficient)     | (sufficient)
    v                    v
[Refine query]      [Synthesizer] --> final answer

The agent loops between searching, scraping, and evaluating until it has enough material. A hard cap on iterations prevents runaway costs.
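
Stripped of any framework, the control flow looks roughly like this. The search, scrape, evaluate, and synthesize helpers are placeholders for the tool functions defined in the implementations below; the point is the hard iteration cap.

MAX_ITERATIONS = 5

def research(question: str) -> str:
    notes = []
    query = question
    for _ in range(MAX_ITERATIONS):           # hard cap prevents runaway loops
        urls = search(query)                  # placeholder: returns ranked URLs
        notes.extend(scrape(urls[:8]))        # placeholder: returns markdown per page
        verdict = evaluate(question, notes)   # placeholder: LLM judges sufficiency
        if verdict["sufficient"]:
            break
        query = verdict["refined_query"]      # refine the query and loop again
    return synthesize(question, notes)        # placeholder: LLM writes the final answer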

CrewAI implementation

import os
from spider import Spider
from crewai import Agent, Task, Crew, Process
from crewai_tools import tool

spider = Spider(api_key=os.getenv("SPIDER_API_KEY"))

@tool("web_search")
def web_search(query: str) -> str:
    """Search the web and return titles, URLs, and descriptions."""
    results = spider.search(query, params={"limit": 10, "fetch_page_content": False})
    entries = results.get("content", results) if isinstance(results, dict) else results
    output = []
    for r in entries:
        title = r.get("title", "")
        url = r.get("url", "")
        desc = r.get("description", "")
        output.append(f"- [{title}]({url}): {desc}")
    return "\n".join(output)

@tool("scrape_urls")
def scrape_urls(urls: str) -> str:
    """Scrape one or more URLs (comma-separated) and return their content as markdown."""
    url_list = [u.strip() for u in urls.split(",")]
    results = []
    for url in url_list[:10]:  # cap at 10 pages per call
        data = spider.scrape_url(url, params={"return_format": "markdown"})
        if data and len(data) > 0:
            content = data[0].get("content", "")
            results.append(f"## {url}\n\n{content[:8000]}")
    return "\n\n---\n\n".join(results)

researcher = Agent(
    role="Senior Research Analyst",
    goal="Find accurate, comprehensive information to answer: {question}",
    backstory=(
        "You are a meticulous researcher. You search the web, read sources, "
        "and cross-reference claims before drawing conclusions. You never "
        "fabricate information. If sources conflict, you note the disagreement."
    ),
    tools=[web_search, scrape_urls],
    verbose=True,
    memory=True,
    allow_delegation=False,
)

synthesizer = Agent(
    role="Technical Writer",
    goal="Write a clear, well-sourced answer to: {question}",
    backstory=(
        "You turn raw research into polished answers. You cite sources, "
        "highlight key findings, and flag uncertainties. You never add "
        "information beyond what the research provides."
    ),
    verbose=True,
    memory=True,
    allow_delegation=False,
)

research_task = Task(
    description=(
        "Research the following question thoroughly: {question}\n\n"
        "1. Search the web for relevant sources.\n"
        "2. Scrape the top 5-8 most relevant pages.\n"
        "3. Extract key facts, data points, and expert opinions.\n"
        "4. If initial results are insufficient, refine your query and search again.\n"
        "5. Compile your findings with source URLs."
    ),
    expected_output="Detailed research notes with source URLs and key findings.",
    tools=[web_search, scrape_urls],
    agent=researcher,
)

synthesis_task = Task(
    description=(
        "Using the research notes provided, write a comprehensive answer "
        "to the question: {question}\n\n"
        "Include source citations. Flag any conflicting information."
    ),
    expected_output="A well-structured answer with citations, 4-8 paragraphs.",
    agent=synthesizer,
)

crew = Crew(
    agents=[researcher, synthesizer],
    tasks=[research_task, synthesis_task],
    process=Process.sequential,
    memory=True,
    cache=True,
    max_rpm=100,
)

result = crew.kickoff(inputs={
    "question": "What are the current best practices for fine-tuning LLMs on domain-specific data?"
})
print(result)

The researcher can scrape up to 10 pages per tool call, and each page is truncated to 8,000 characters to stay within context limits. The synthesizer works only with material the researcher collected, which reduces hallucination.

AutoGen implementation

import os
from spider import Spider
from autogen import ConversableAgent, register_function
from typing import List, Dict, Any
from typing_extensions import Annotated

config_list = [
    {"model": "gpt-4o", "api_key": os.getenv("OPENAI_API_KEY")}
]

spider = Spider(api_key=os.getenv("SPIDER_API_KEY"))

def search_web(
    query: Annotated[str, "The search query"]
) -> Annotated[str, "Search results with titles and URLs"]:
    """Search the web for information on a topic."""
    results = spider.search(query, params={"limit": 8, "fetch_page_content": False})
    entries = results.get("content", results) if isinstance(results, dict) else results
    output = []
    for r in entries:
        output.append(f"- {r.get('title', '')}: {r.get('url', '')}")
    return "\n".join(output)

def scrape_pages(
    urls: Annotated[str, "Comma-separated URLs to scrape"]
) -> Annotated[str, "Markdown content from scraped pages"]:
    """Scrape web pages and return their content as markdown."""
    url_list = [u.strip() for u in urls.split(",")]
    results = []
    for url in url_list[:8]:
        data = spider.scrape_url(url, params={"return_format": "markdown"})
        if data and len(data) > 0:
            content = data[0].get("content", "")[:6000]
            results.append(f"## Source: {url}\n{content}")
    return "\n\n".join(results)

research_agent = ConversableAgent(
    "Researcher",
    llm_config={"config_list": config_list},
    system_message=(
        "You are a research agent. Search the web, scrape relevant pages, "
        "and compile findings. When you have enough information to answer "
        "the question thoroughly, write your final answer and end with TERMINATE."
    ),
)

user_proxy = ConversableAgent(
    "UserProxy",
    llm_config=False,
    human_input_mode="NEVER",
    code_execution_config=False,
    is_termination_msg=lambda x: x.get("content", "") is not None
        and "terminate" in x.get("content", "").lower(),
    default_auto_reply="Continue researching if not done. Otherwise TERMINATE.",
)

register_function(search_web, caller=research_agent, executor=user_proxy,
    name="search_web", description="Search the web for a topic.")
register_function(scrape_pages, caller=research_agent, executor=user_proxy,
    name="scrape_pages", description="Scrape URLs and return markdown content.")

result = user_proxy.initiate_chat(
    research_agent,
    message="Research: What are the performance differences between PostgreSQL and CockroachDB for OLTP workloads?",
    summary_method="reflection_with_llm",
    max_turns=10,
)
print(result.summary)

The max_turns parameter acts as a cost ceiling. The agent can search and scrape multiple times within that budget, but it cannot loop indefinitely.

Pattern 2: Monitoring agent

A monitoring agent watches a set of URLs for changes, compares new content against a baseline, and triggers alerts when something meaningful shifts. This is useful for tracking competitor pricing, regulatory updates, documentation changes, or news on a specific topic.

Architecture

Scheduled trigger (cron / interval)
    |
    v
[Spider scrape] --> fetches current page content
    |
    v
[Diff engine]   --> compares against stored baseline
    |
    v
[LLM classifier] --> determines if change is meaningful
    |                    |
    | (noise)            | (meaningful)
    v                    v
[Skip]              [Alert + update baseline]

The LLM classifier is what separates this from a simple diff. Copyright year changes, minor formatting tweaks, and rotated ad blocks are noise. New product launches, price changes, and policy updates are meaningful. The LLM makes that judgment call.

Implementation

import os
import json
import hashlib
import difflib
from datetime import datetime
from spider import Spider
from openai import OpenAI

spider = Spider(api_key=os.getenv("SPIDER_API_KEY"))
openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
BASELINE_FILE = "baselines.json"

def load_baselines() -> dict:
    if os.path.exists(BASELINE_FILE):
        with open(BASELINE_FILE, "r") as f:
            return json.load(f)
    return {}

def save_baselines(baselines: dict):
    with open(BASELINE_FILE, "w") as f:
        json.dump(baselines, f, indent=2)

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def get_diff(old: str, new: str) -> str:
    old_lines = old.splitlines()
    new_lines = new.splitlines()
    diff = difflib.unified_diff(old_lines, new_lines, lineterm="")
    return "\n".join(list(diff)[:200])  # cap diff size for LLM context

def classify_change(url: str, diff_text: str, context: str) -> dict:
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "You classify website changes. Respond with JSON: "
                    '{"meaningful": true/false, "category": "...", "summary": "..."}\n'
                    "Categories: pricing, product, policy, content, cosmetic, error.\n"
                    "Cosmetic changes (formatting, copyright year, ad rotation) are not meaningful."
                ),
            },
            {
                "role": "user",
                "content": f"URL: {url}\n\nDiff:\n{diff_text}\n\nContext:\n{context[:3000]}",
            },
        ],
    )
    return json.loads(response.choices[0].message.content)

def send_alert(url: str, classification: dict):
    # Replace with your preferred notification method:
    # Slack webhook, email, PagerDuty, database insert, etc.
    print(f"ALERT [{classification['category'].upper()}] {url}")
    print(f"  Summary: {classification['summary']}")

def monitor(urls: list[str]):
    baselines = load_baselines()

    for url in urls:
        data = spider.scrape_url(url, params={"return_format": "markdown"})
        if not data or len(data) == 0:
            print(f"  Failed to fetch {url}, skipping.")
            continue

        current_content = data[0].get("content", "")
        current_hash = content_hash(current_content)

        if url not in baselines:
            baselines[url] = {
                "hash": current_hash,
                "content": current_content,
                "last_checked": datetime.now().isoformat(),
            }
            print(f"  Baseline set for {url}")
            continue

        if current_hash == baselines[url]["hash"]:
            baselines[url]["last_checked"] = datetime.now().isoformat()
            continue

        # Content changed. Classify the change.
        diff_text = get_diff(baselines[url]["content"], current_content)
        classification = classify_change(url, diff_text, current_content)

        if classification.get("meaningful", False):
            send_alert(url, classification)

        # Update baseline regardless of whether the change was meaningful.
        baselines[url] = {
            "hash": current_hash,
            "content": current_content,
            "last_checked": datetime.now().isoformat(),
            "last_change": classification,
        }

    save_baselines(baselines)

# Monitor these URLs on a schedule (cron, Celery beat, etc.)
watch_list = [
    "https://openai.com/pricing",
    "https://docs.anthropic.com/en/docs/about-claude/models",
    "https://cloud.google.com/vertex-ai/generative-ai/pricing",
]

monitor(watch_list)

Run this on a schedule (every hour, every 15 minutes, whatever fits the use case). Spider’s caching can be disabled with "cache": False when you need guaranteed fresh content, or left enabled to reduce costs when slightly stale content is acceptable.
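
If you do not already have cron or Celery beat in place, a plain interval loop is enough to start with; the one-hour interval and bare exception handler below are placeholders to adjust for your setup.

import time

CHECK_INTERVAL_SECONDS = 3600  # example: check once an hour

while True:
    try:
        monitor(watch_list)
    except Exception as exc:
        # Log and keep going; one failed cycle should not kill the monitor.
        print(f"Monitoring cycle failed: {exc}")
    time.sleep(CHECK_INTERVAL_SECONDS)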

Scaling the watch list

The naive implementation above scrapes URLs sequentially. For larger watch lists (hundreds or thousands of URLs), parallelize the fetches with Python’s concurrent.futures, or send many URLs per request using Spider’s batch support.

from concurrent.futures import ThreadPoolExecutor, as_completed

def monitor_batch(urls: list[str], max_workers: int = 10):
    baselines = load_baselines()

    def check_url(url):
        data = spider.scrape_url(url, params={
            "return_format": "markdown",
            "cache": False,
        })
        if not data or len(data) == 0:
            return url, None
        return url, data[0].get("content", "")

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(check_url, url): url for url in urls}
        for future in as_completed(futures):
            url, content = future.result()
            if content is None:
                continue
            # Same diff/classify/alert logic as above
            # ...

    save_baselines(baselines)

Between parallel fetches and Spider’s batch support, you can monitor thousands of pages per cycle and keep the wall-clock time of each monitoring run under a minute.

Pattern 3: Data extraction agent

A data extraction agent visits pages and pulls structured data into a predefined schema. Unlike traditional scraping with CSS selectors or XPath, the agent uses natural language prompts to describe what it wants, and Spider’s extraction layer returns JSON.

This pattern is ideal for pulling product catalogs, job listings, event schedules, contact directories, or any semi-structured content where the page layout varies across sites.

Architecture

Input: list of URLs + target schema
    |
    v
[Spider scrape with extraction prompt]
    |
    v
[Schema validator] --> checks extracted JSON against schema
    |                    |
    | (valid)            | (invalid / partial)
    v                    v
[Store result]       [Retry with refined prompt or fallback scrape]

Implementation with Spider’s extraction

Spider supports prompt-based extraction natively. You send a natural language prompt describing the data you want, and Spider returns structured JSON. No selectors, no parsing code.

import os
import json
from spider import Spider

spider = Spider(api_key=os.getenv("SPIDER_API_KEY"))

def extract_structured_data(urls: list[str], extraction_prompt: str) -> list[dict]:
    """
    Extract structured data from a list of URLs using Spider's
    prompt-based extraction.
    """
    results = []

    for url in urls:
        data = spider.scrape_url(url, params={
            "return_format": "markdown",
            "extra_ai_data": True,
            "prompt": extraction_prompt,
        })

        if data and len(data) > 0:
            page = data[0]
            extracted = page.get("extra_ai_data", page.get("content", ""))
            results.append({
                "url": url,
                "data": extracted,
            })

    return results

# Example: extract product data from e-commerce pages
product_prompt = (
    "Extract product information as JSON with these fields: "
    "name (string), price (number), currency (string), "
    "description (string, max 200 chars), in_stock (boolean), "
    "rating (number, 0-5), review_count (number). "
    "If a field is not found, use null."
)

urls = [
    "https://example.com/product/widget-pro",
    "https://example.com/product/widget-lite",
    "https://example.com/product/widget-max",
]

products = extract_structured_data(urls, product_prompt)
for p in products:
    print(json.dumps(p, indent=2))
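
The architecture above includes a schema validation step with a retry that extract_structured_data leaves out. A sketch of that piece, assuming the extraction returns a dict or a JSON string; the required field set is hypothetical, and the fallback here is simply to give up after a stricter second prompt.

REQUIRED_FIELDS = {"name", "price", "currency"}  # hypothetical schema

def validate_record(record) -> bool:
    """Check that the extracted record is a dict containing every required field."""
    if isinstance(record, str):
        try:
            record = json.loads(record)
        except json.JSONDecodeError:
            return False
    return isinstance(record, dict) and REQUIRED_FIELDS.issubset(record.keys())

def extract_with_retry(url: str, prompt: str, max_attempts: int = 2):
    for attempt in range(max_attempts):
        data = spider.scrape_url(url, params={
            "return_format": "markdown",
            "extra_ai_data": True,
            "prompt": prompt,
        })
        record = data[0].get("extra_ai_data") if data else None
        if validate_record(record):
            return record
        # Second pass: tighten the prompt before giving up.
        prompt += " Return ONLY a JSON object with exactly these fields."
    return None  # caller decides whether to fall back to a plain scrape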

CrewAI data extraction agent

For more complex extraction tasks where the agent needs to discover URLs, navigate pagination, or handle multiple page types, CrewAI provides the orchestration layer.

import os
import json
from spider import Spider
from crewai import Agent, Task, Crew, Process
from crewai_tools import tool

spider = Spider(api_key=os.getenv("SPIDER_API_KEY"))

@tool("crawl_site")
def crawl_site(url: str) -> str:
    """Crawl a website and return a list of discovered page URLs."""
    data = spider.crawl_url(url, params={
        "return_format": "markdown",
        "limit": 50,
    })
    if not data:
        return "No pages found."
    urls = [page.get("url", "") for page in data if page.get("url")]
    return "\n".join(urls)

@tool("extract_data")
def extract_data(url_and_prompt: str) -> str:
    """Extract structured data from a URL.
    Input format: URL ||| extraction prompt"""
    parts = url_and_prompt.split("|||")
    if len(parts) != 2:
        return "Error: provide input as 'URL ||| extraction prompt'"

    url = parts[0].strip()
    prompt = parts[1].strip()

    data = spider.scrape_url(url, params={
        "return_format": "markdown",
        "extra_ai_data": True,
        "prompt": prompt,
    })

    if data and len(data) > 0:
        extracted = data[0].get("extra_ai_data", data[0].get("content", ""))
        return json.dumps({"url": url, "data": extracted}, indent=2)
    return json.dumps({"url": url, "error": "extraction failed"})

extractor = Agent(
    role="Data Extraction Specialist",
    goal="Extract structured {data_type} data from {target_site}",
    backstory=(
        "You are an expert at extracting structured data from websites. "
        "You first crawl the site to discover relevant pages, then extract "
        "data from each page using specific prompts. You validate that "
        "extracted data matches the expected schema before including it."
    ),
    tools=[crawl_site, extract_data],
    verbose=True,
    memory=True,
)

extraction_task = Task(
    description=(
        "Extract {data_type} data from {target_site}.\n\n"
        "Steps:\n"
        "1. Crawl the site to discover relevant pages.\n"
        "2. Identify pages that contain {data_type} information.\n"
        "3. Extract data from each page using this prompt: {extraction_prompt}\n"
        "4. Compile all extracted data into a single JSON array.\n"
        "5. Remove duplicates and validate completeness."
    ),
    expected_output="A JSON array of extracted {data_type} records.",
    agent=extractor,
)

crew = Crew(
    agents=[extractor],
    tasks=[extraction_task],
    process=Process.sequential,
    max_rpm=100,
)

result = crew.kickoff(inputs={
    "target_site": "https://example.com/jobs",
    "data_type": "job listing",
    "extraction_prompt": (
        "Extract job listing as JSON: title (string), company (string), "
        "location (string), salary_range (string or null), "
        "posted_date (string), requirements (list of strings)."
    ),
})
print(result)

The agent discovers pages through crawling, then applies extraction prompts to each relevant page. Spider handles both the crawling and the structured extraction, so the agent logic stays focused on orchestration.

Common failure modes

Building agents that work in demos is straightforward. Building agents that work at 3 AM on a Saturday without supervision is harder. Here are the failure modes that show up in production.

Rate limiting and throttling

Most scraping APIs impose rate limits that agents can hit quickly. An agent that fans out to 50 URLs in a tight loop will get throttled by most providers, causing retries, backoff delays, and unpredictable task completion times.

Spider’s rate limit is high enough that most agent workloads never hit it. If you do approach the limit, the API returns standard rate limit headers (X-RateLimit-Remaining, Retry-After) so your agent can back off gracefully rather than retrying blindly.
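
If your client version does not surface those headers directly, a generic exponential backoff around the call is the safe default. A sketch; the exception handling is deliberately broad because the exact exception types depend on the client version.

import time
import random

def scrape_with_backoff(url: str, params: dict, max_attempts: int = 4):
    """Retry a scrape with exponential backoff and jitter on transient failures."""
    delay = 1.0
    for attempt in range(max_attempts):
        try:
            data = spider.scrape_url(url, params=params)
            if data:
                return data
        except Exception as exc:  # client-specific exceptions vary by version
            print(f"Attempt {attempt + 1} failed for {url}: {exc}")
        time.sleep(delay + random.uniform(0, 0.5))  # jitter avoids synchronized retries
        delay *= 2
    return None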

Stale data from caching

Caching helps with cost but can cause problems for monitoring agents. If your agent checks a pricing page and gets a cached version from 6 hours ago, the price change you are watching for will be invisible.

Use "cache": False in Spider requests when freshness matters. For research agents where slight staleness is acceptable, leave caching enabled to reduce costs and improve response times.
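
In practice this is one flag that differs per agent type:

# Monitoring agent: always fetch fresh content.
monitor_params = {"return_format": "markdown", "cache": False}

# Research agent: cached pages are acceptable and cheaper.
research_params = {"return_format": "markdown", "cache": True}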

Hallucination from bad scrapes

When a scrape fails silently (returns a login wall, a CAPTCHA page, or a cookie consent overlay instead of the actual content), the agent feeds garbage into the LLM. The LLM then hallucinates an answer based on whatever fragments it received. The user sees a confident, well-formatted response built on nothing.

Defend against this at two levels. First, check the scraped content before passing it to the LLM. A page that contains fewer than 100 characters, or that contains strings like “please enable JavaScript” or “access denied,” is probably not real content. Second, use Spider’s smart mode, which automatically detects when JavaScript rendering is needed and switches to headless Chrome.

def validate_content(content: str) -> bool:
    """Check if scraped content looks like real page content."""
    if not content or len(content.strip()) < 100:
        return False
    noise_signals = [
        "please enable javascript",
        "access denied",
        "captcha",
        "cookies must be enabled",
        "403 forbidden",
    ]
    content_lower = content.lower()
    return not any(signal in content_lower for signal in noise_signals)
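
Wire the check into the scrape step so suspect pages never reach the model. A sketch; whether you retry with forced rendering or simply drop the page depends on your setup, so this version just skips it and logs.

def scrape_clean(urls: list[str]) -> list[str]:
    """Scrape URLs and keep only pages that pass the content check."""
    good_pages = []
    for url in urls:
        data = spider.scrape_url(url, params={"return_format": "markdown"})
        content = data[0].get("content", "") if data else ""
        if validate_content(content):
            good_pages.append(f"## {url}\n\n{content}")
        else:
            # Do not feed suspect content to the LLM; log it for inspection instead.
            print(f"Skipping {url}: content failed validation")
    return good_pages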

Cost runaway

An agent with a search-scrape-evaluate loop and no iteration cap can burn through API credits fast. Each loop iteration costs LLM tokens (for evaluation) plus scraping credits (for new pages). Without guardrails, a vague question can trigger dozens of iterations.

Set hard limits at multiple levels:

# Cap iterations in the agent loop
MAX_ITERATIONS = 5

# Cap pages per scrape call
MAX_PAGES_PER_REQUEST = 10

# Cap total content length sent to LLM
MAX_CONTEXT_CHARS = 50000

# Use Spider's cost controls
params = {
    "return_format": "markdown",
    "max_credits_per_page": 10,   # cap per-page cost
    "max_credits_allowed": 500,   # cap total cost per request
}

These limits turn unbounded agent behavior into predictable, budgeted operations.
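
Enforcing the context cap takes a few lines. A sketch that reuses the MAX_CONTEXT_CHARS constant above and stops adding pages once the budget is spent.

def build_context(pages: list[str], budget: int = MAX_CONTEXT_CHARS) -> str:
    """Concatenate page content until the character budget is exhausted."""
    chunks, used = [], 0
    for page in pages:
        remaining = budget - used
        if remaining <= 0:
            break
        chunk = page[:remaining]  # truncate the last page rather than overflow
        chunks.append(chunk)
        used += len(chunk)
    return "\n\n---\n\n".join(chunks)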

Brittle selectors in a world of changing layouts

Traditional scrapers break when a site changes its HTML structure. CSS selectors and XPath expressions are coupled to specific DOM layouts. When a site redesigns, every selector needs to be updated.

Spider’s prompt-based extraction sidesteps this entirely. Instead of div.price > span.amount, you write “extract the product price as a number.” The extraction layer understands the page semantically, so layout changes rarely break it. This is especially valuable for data extraction agents that target many different sites with varying page structures.

The path to fully autonomous browsing

The patterns above still rely on external LLMs (OpenAI, Anthropic, etc.) for reasoning. Every LLM call adds latency, cost, and a dependency on a third-party API.

Spider already runs its own extraction models that handle most structured HTML-to-JSON conversion without calling an external LLM. For agent builders, this means the extraction step adds negligible latency and zero per-token cost. The scraping layer and the intelligence layer collapse into a single call.

Getting started

The code above is a starting point. In production, you will need to add: retry logic for transient API failures, token counting to prevent context overflow, cost tracking per agent run, and evaluation harnesses to measure whether your agent actually answers questions correctly. Start with the research agent pattern — it is the most general and the easiest to evaluate.

One thing the examples above do not address: how do you know the agent is producing good answers? Agent evaluation is the hardest unsolved problem in this space. At minimum, build a test set of 20-30 questions with known answers and run your agent against it after every change. Without that, you are shipping blind.
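
A minimal harness can be as crude as keyword checks against known answers; it will not catch subtle regressions, but it catches the catastrophic ones. The questions and expected keywords below are placeholders, and run_agent stands in for whatever entry point your agent exposes (crew.kickoff, initiate_chat, and so on).

EVAL_SET = [  # placeholder cases -- replace with real questions from your domain
    {"question": "What is the default port for PostgreSQL?", "must_include": ["5432"]},
    {"question": "What year was Python 3.0 released?", "must_include": ["2008"]},
]

def run_eval(run_agent) -> float:
    """Run the agent over the eval set and return the pass rate."""
    passed = 0
    for case in EVAL_SET:
        answer = str(run_agent(case["question"])).lower()
        if any(kw.lower() in answer for kw in case["must_include"]):
            passed += 1
        else:
            print(f"FAIL: {case['question']}")
    rate = passed / len(EVAL_SET)
    print(f"Pass rate: {rate:.0%}")
    return rate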

A note on framework versions: AutoGen is undergoing significant API restructuring (the AG2 fork and autogen-agentchat 0.4+). Pin your dependency version and check the migration guide before upgrading. CrewAI’s tool-calling behavior can be unpredictable with complex tool schemas — test thoroughly with your actual tools before trusting it in production.

Join thousands of developers using Spider to power their data pipelines.