
Data Connector

Ingest the entire web into your LlamaIndex pipeline

SpiderWebReader crawls sites and returns clean LlamaIndex Documents. Plug them straight into your vector store, knowledge graph, or any index type. No parsing glue code needed.

Ingestion Pipeline

Spider Crawl → Reader → Chunking → Embedding → Index → Query

from llama_index.readers.web import SpiderWebReader

# Initialize the reader in crawl mode (falls back to the SPIDER_API_KEY
# environment variable when api_key is not passed)
reader = SpiderWebReader(
    mode="crawl",
    params={
        "return_format": "markdown",
        "limit": 100,
    },
)

# Crawl a docs site and get LlamaIndex Documents
documents = reader.load_data(url="https://docs.example.com")

# Build a vector index from the crawl results
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query("How do I configure authentication?")
print(response)

Native Reader

SpiderWebReader implements the LlamaIndex BaseReader interface. It returns Document objects with text and metadata that work with every index type out of the box.

Crawl, Scrape, or Search

Point it at a single URL or let it discover an entire site. Switch between scrape, crawl, and search modes with one parameter.

Chunk-Friendly Markdown

Spider strips navigation, ads, and scripts, then returns clean markdown. Sentence splitters and token chunkers perform better on this output than raw HTML.
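A toy comparison shows why: a naive sentence splitter (simple regex, hypothetical snippets) produces clean chunks from markdown, while the HTML version drags markup into every chunk.

```python
import re

# Hypothetical page content: the same sentence, with and without markup
raw_html = (
    '<div class="nav"><a href="/">Home</a></div>'
    "<p>Tokens expire after one hour. Refresh them with the /token endpoint.</p>"
)
markdown = "Tokens expire after one hour. Refresh them with the /token endpoint."

def naive_sentences(text: str) -> list[str]:
    """Split on sentence-ending punctuation followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

print(naive_sentences(markdown))   # two clean sentences
print(naive_sentences(raw_html))   # nav markup leaks into the first chunk
```

Real chunkers are smarter than this regex, but the failure mode is the same: boilerplate markup dilutes every embedding it lands in.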

Metadata Preserved

Each Document carries its source URL, page title, and crawl metadata. Your query engine can filter by source, and citations trace back to the original page.

All Spider Parameters

Proxy modes, browser rendering, readability, CSS selectors, depth limits. Everything Spider supports passes through the reader's params dict.
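A sketch of what that looks like in practice — the parameter names below are drawn from Spider's API and should be checked against its current reference before use:

```python
# Hypothetical params dict; each key passes through the reader untouched
params = {
    "proxy_enabled": True,       # assumption: route through Spider's proxy pool
    "request": "chrome",         # assumption: force headless-browser rendering
    "readability": True,         # strip page chrome before extraction
    "root_selector": "main",     # assumption: CSS selector scoping extraction
    "depth": 3,                  # crawl depth limit
    "limit": 500,                # max pages
    "return_format": "markdown",
}
# reader = SpiderWebReader(mode="crawl", params=params)
```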

Scales to Thousands

Spider crawls pages concurrently on its infrastructure, not yours. Ingest a 10,000-page site without managing headless browsers or rate limits locally.

Web Search as a Query Tool

Give your LlamaIndex agent access to live web results. Spider searches, fetches, and returns content your agent can reason over.

from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI
from llama_index.core.agent import ReActAgent
from spider import Spider

spider = Spider()

def web_search(query: str) -> str:
    """Search the web and return page contents."""
    results = spider.search(query, params={
        "search_limit": 5,
        "fetch_page_content": True,
        "return_format": "markdown",
    })
    return "\n---\n".join(
        f"Source: {r['url']}\n{r['content'][:2000]}"
        for r in results if r.get("content")
    )

search_tool = FunctionTool.from_defaults(fn=web_search)
agent = ReActAgent.from_tools([search_tool], llm=OpenAI(model="gpt-4o"))

response = agent.chat("What are the latest OWASP Top 10 changes?")
print(response)

Start building with LlamaIndex + Spider

Free credits on signup. No subscription required.