Guides / Scrape & Crawl Agent with Microsoft's Autogen

Scrape & Crawl Agent with Microsoft's Autogen

Set up an Autogen agent that scrapes and crawls websites using the Spider API.

William Espegren · 6 min read

Scrape & Crawl Agent with Microsoft’s Autogen

Set up an Autogen agent that scrapes and crawls websites using the Spider API.

Set up OpenAI

Get OpenAI running with the following steps:

  1. Create an account and get an API Key on OpenAI.

  2. Install OpenAI and set up the API key in your project as an environment variable. This approach prevents you from hardcoding the key in your code.

pip install openai

In your terminal:

export OPENAI_API_KEY=<your-api-key-here>

Alternatively, you can use the dotenv package to load the environment variables from a .env file. Create a .env file in your project root and add the following:

OPENAI_API_KEY=<your-api-key-here>

Then, in your Python code:

from dotenv import load_dotenv
from openai import OpenAI
import os

load_dotenv()

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
)
  3. Test OpenAI to see if things are working correctly:
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
)

chat_completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": "What are large language models?",
        }
    ]
)

# Print the model's reply to confirm the key and client work.
print(chat_completion.choices[0].message.content)

Setup Spider & Autogen

After you get your Spider API key, install the Spider client and Autogen. For the full API reference, see the Spider API Guide.

Install the Spider Python client library and autogen:

pip install spider_client pyautogen

Set up the Autogen LLM configuration:

import os

config_list = [
    {"model": "gpt-4o", "api_key": os.getenv("OPENAI_API_KEY")},
]

We also need to set the Spider API key:

spider_api_key = os.getenv("SPIDER_API_KEY")
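Before wiring Spider into Autogen, it is worth verifying the key with a direct call. This is a minimal sketch; the URL is just an example, and the exact shape of the returned items is described in the Spider API Guide:

from spider import Spider

# Spider falls back to SPIDER_API_KEY from the environment if no key is passed.
client = Spider(spider_api_key)

# Scrape a single page as markdown and print the raw response item.
data = client.scrape_url("https://spider.cloud", {"return_format": "markdown"})
print(data[0])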

Creating Scrape & Crawl Functions

Import Spider to call the API:

from spider import Spider

Defining functions for the agents

Define the scrape and crawl functions the agent will call. We use the Spider SDK and default to return_format: markdown for LLM-ready data.

from typing_extensions import Annotated
from typing import List, Dict, Any

def scrape_page(url: Annotated[str, "The URL of the web page to scrape"], params: Annotated[dict, "Dictionary of additional params."] = None) -> Annotated[Dict[str, Any], "Scraped content"]:
    # Initialize the Spider client with your API key; if no key is passed,
    # it looks for SPIDER_API_KEY in your environment variables.
    client = Spider(spider_api_key)

    if params is None:
        params = {
            "return_format": "markdown"
        }

    scraped_data = client.scrape_url(url, params)
    return scraped_data[0]

def crawl_page(url: Annotated[str, "The url of the domain to be crawled"], params: Annotated[dict, "Dictionary of additional params."] = None) -> Annotated[List[Dict[str, Any]], "Scraped content"]:
    # Initialize the Spider client with your API key; if no key is passed,
    # it looks for SPIDER_API_KEY in your environment variables.
    client = Spider(spider_api_key)

    if params is None:
        params = {
            "return_format": "markdown"
        }

    crawled_data = client.crawl_url(url, params)
    return crawled_data
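
To keep test crawls small, you can pass extra options through the same params dict. This is only a sketch, assuming a limit parameter as described in the Spider API Guide; verify the exact option names there:

# Cap the crawl to a handful of pages and return markdown.
# "limit" is assumed from the Spider API reference; check the exact name there.
pages = crawl_page("https://william-espegren.com", {"return_format": "markdown", "limit": 5})
print(f"Crawled {len(pages)} pages")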

Create the scrape and crawl agents; we will assign them the functions above in the registration step below:

from autogen import ConversableAgent

# Create web scraper agent.
scraper_agent = ConversableAgent(
    "WebScraper",
    llm_config={"config_list": config_list},
    system_message="You are a web scraper and you can scrape any web page to retrieve its contents."
    "Returns 'TERMINATE' when the scraping is done.",
)

# Create web crawler agent.
crawler_agent = ConversableAgent(
    "WebCrawler",
    llm_config={"config_list": config_list},
    system_message="You are a web crawler and you can crawl any page with deeper crawling following subpages."
    "Returns 'TERMINATE' when the scraping is done.",
)

How do we tell the agents to do things?

A user proxy agent sends messages to the other agents on our behalf and executes their tool calls. Here we configure a ConversableAgent with no LLM to act as that proxy; see the UserProxyAgent docs for details.

user_proxy_agent = ConversableAgent(
    "UserProxy",
    llm_config=False,  # No LLM for this agent.
    human_input_mode="NEVER",
    code_execution_config=False,  # No code execution for this agent.
    is_termination_msg=lambda x: x.get("content") is not None and "terminate" in x["content"].lower(),
    default_auto_reply="Please continue if not finished, otherwise return 'TERMINATE'.",
)

Registering the functions

Register the functions with the correct agents using register_function:

from autogen import register_function

register_function(
    scrape_page,
    caller=scraper_agent,
    executor=user_proxy_agent,
    name="scrape_page",
    description="Scrape a web page and return the content.",
)

register_function(
    crawl_page,
    caller=crawler_agent,
    executor=user_proxy_agent,
    name="crawl_page",
    description="Crawl an entire domain, following subpages and return the content.",
)
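
register_function advertises each tool schema to the caller's LLM and lets the executor run the underlying Python function. As a quick check, you can list what each caller agent is now able to call (a sketch, assuming pyautogen stores registered schemas under the agent's llm_config "tools" key):

# List the tool names registered on each caller agent.
# Assumes pyautogen keeps registered tool schemas in llm_config["tools"].
print([tool["function"]["name"] for tool in scraper_agent.llm_config["tools"]])
print([tool["function"]["name"] for tool in crawler_agent.llm_config["tools"]])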

Using the agents

Start the conversation with user_proxy_agent, then summarize the results with Autogen’s built-in reflection_with_llm summary method:

# Scrape page
scraped_chat_result = user_proxy_agent.initiate_chat(
    scraper_agent,
    message="Can you scrape william-espegren.com for me?",
    summary_method="reflection_with_llm",
    summary_args={
        "summary_prompt": """Summarize the scraped content"""
    },
)

# Crawl page
crawled_chat_result = user_proxy_agent.initiate_chat(
    crawler_agent,
    message="Can you crawl william-espegren.com for me, I want the whole domains information?",
    summary_method="reflection_with_llm",
    summary_args={
        "summary_prompt": """Summarize the crawled content"""
    },
)

The output is stored in the summary:

print(scraped_chat_result.summary)
print(crawled_chat_result.summary)
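
Beyond the summary, the returned ChatResult keeps the full message exchange, which is handy for debugging tool calls. A minimal sketch using the chat_history attribute:

# Inspect the raw conversation, including tool calls and their results.
for message in scraped_chat_result.chat_history:
    print(message.get("role"), ":", str(message.get("content"))[:100])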

Full code

The full example below gives us two agents: one that scrapes a page and one that crawls its subpages. Both can be combined with your other Autogen agents.

import os
from spider import Spider
from typing_extensions import Annotated
from typing import List, Dict, Any
from autogen import ConversableAgent
from autogen import register_function

config_list = [
    {"model": "gpt-4o", "api_key": os.getenv("OPENAI_API_KEY")},
]

spider_api_key = os.getenv("SPIDER_API_KEY")

def scrape_page(url: Annotated[str, "The URL of the web page to scrape"], params: Annotated[dict, "Dictionary of additional params."] = None) -> Annotated[Dict[str, Any], "Scraped content"]:
    # Initialize the Spider client with your API key; if no key is passed,
    # it looks for SPIDER_API_KEY in your environment variables.
    client = Spider(spider_api_key)

    if params is None:
        params = {
            "return_format": "markdown"
        }

    scraped_data = client.scrape_url(url, params)
    return scraped_data[0]

def crawl_page(url: Annotated[str, "The url of the domain to be crawled"], params: Annotated[dict, "Dictionary of additional params."] = None) -> Annotated[List[Dict[str, Any]], "Scraped content"]:
    # Initialize the Spider client with your API key; if no key is passed,
    # it looks for SPIDER_API_KEY in your environment variables.
    client = Spider(spider_api_key)

    if params is None:
        params = {
            "return_format": "markdown"
        }

    crawled_data = client.crawl_url(url, params)
    return crawled_data

# Create web scraper agent.
scraper_agent = ConversableAgent(
    "WebScraper",
    llm_config={"config_list": config_list},
    system_message="You are a web scraper and you can scrape any web page to retrieve its contents."
    "Returns 'TERMINATE' when the scraping is done.",
)

# Create web crawler agent.
crawler_agent = ConversableAgent(
    "WebCrawler",
    llm_config={"config_list": config_list},
    system_message="You are a web crawler and you can crawl any page with deeper crawling following subpages."
    "Returns 'TERMINATE' when the scraping is done.",
)

user_proxy_agent = ConversableAgent(
    "UserProxy",
    llm_config=False,  # No LLM for this agent.
    human_input_mode="NEVER",
    code_execution_config=False,  # No code execution for this agent.
    is_termination_msg=lambda x: x.get("content") is not None and "terminate" in x["content"].lower(),
    default_auto_reply="Please continue if not finished, otherwise return 'TERMINATE'.",
)

register_function(
    scrape_page,
    caller=scraper_agent,
    executor=user_proxy_agent,
    name="scrape_page",
    description="Scrape a web page and return the content.",
)

register_function(
    crawl_page,
    caller=crawler_agent,
    executor=user_proxy_agent,
    name="crawl_page",
    description="Crawl an entire domain, following subpages and return the content.",
)

# Scrape page
scraped_chat_result = user_proxy_agent.initiate_chat(
    scraper_agent,
    message="Can you scrape william-espegren.com for me?",
    summary_method="reflection_with_llm",
    summary_args={
        "summary_prompt": """Summarize the scraped content"""
    },
)

print(scraped_chat_result.summary)

If you liked this guide, consider checking out me and Spider on Twitter.
