LlamaIndex integration

Spider as a LlamaIndex data connector.

SpiderWebReader crawls sites and returns clean LlamaIndex Documents. Plug them into a vector store, knowledge graph, or any index type. No parsing glue code.

Tags: Python · llama-index · SpiderWebReader · RAG
Ingestion pipeline stages
  • 01 Crawl (Spider)
  • 02 Reader (Spider)
  • 03 Chunking
  • 04 Embedding
  • 05 Index
  • 06 Query
01 · Data connector

Crawl a site, build a vector index.

SpiderWebReader returns Document objects with text and metadata. Pass them to VectorStoreIndex or any other LlamaIndex index type.

SpiderWebReader (Python)
from llama_index.core import VectorStoreIndex
from llama_index.readers.web import SpiderWebReader

# Initialize the reader. The mode and the Spider params are set on the
# constructor. With no api_key argument, the Spider client reads the
# SPIDER_API_KEY environment variable.
reader = SpiderWebReader(
    mode="crawl",
    params={
        "return_format": "markdown",
        "limit": 100,
    },
)

# Crawl a docs site and get LlamaIndex Documents
documents = reader.load_data(url="https://docs.example.com")

# Build a vector index from the crawl results.
# LlamaIndex defaults to OpenAI embeddings and LLM unless configured otherwise.
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query("How do I configure authentication?")
print(response)
02 · What you get

Built for the LlamaIndex stack.

Reader

Native BaseReader

SpiderWebReader implements the LlamaIndex BaseReader interface. It returns Document objects with text and metadata that work with every index type.

Modes

Crawl, scrape, or search

Point it at a single URL or let it discover an entire site. Switch between scrape, crawl, and search modes with one parameter.

Output

Chunk-friendly markdown

Spider strips navigation, ads, and scripts before returning markdown. Sentence splitters and token chunkers perform better on clean markdown than on raw HTML.
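To illustrate why clean markdown chunks well, here is a minimal standard-library sketch of a heading-aware splitter. It is not how LlamaIndex splits text; the function name `split_markdown_by_heading` and the `max_chars` budget are made up for this example.

```python
import re
from typing import List

# ATX headings: one to six '#' characters followed by whitespace.
# Caveat: a '# comment' line inside a code fence also matches.
HEADING = re.compile(r"^#{1,6}\s")


def split_markdown_by_heading(markdown: str, max_chars: int = 1000) -> List[str]:
    """Start a new chunk at each heading, or when the chunk would pass max_chars.

    A single line longer than max_chars is kept whole rather than cut.
    """
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for line in markdown.splitlines():
        starts_section = bool(HEADING.match(line))
        too_big = size + len(line) + 1 > max_chars
        if current and (starts_section or too_big):
            chunks.append("\n".join(current).strip())
            current, size = [], 0
        current.append(line)
        size += len(line) + 1  # +1 for the newline
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


page = "# Auth\nUse API keys.\n\n## Rotation\nRotate keys monthly.\n"
chunks = split_markdown_by_heading(page)
for chunk in chunks:
    print(repr(chunk))
```

In a real pipeline you would more likely use the splitters that ship with LlamaIndex, such as `SentenceSplitter` or `MarkdownNodeParser` from `llama_index.core.node_parser`. The sketch only shows that headings in clean markdown give a splitter natural boundaries that raw HTML does not.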

Metadata

Source preserved

Each Document carries its source URL, page title, and crawl metadata. Filter by source in your query engine and trace citations back to the original page.
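As a rough sketch of filtering by source and building citations, the example below uses a small stand-in class instead of a real LlamaIndex Document so that it runs without any dependencies. The metadata key names `"url"` and `"title"` are assumptions; inspect `documents[0].metadata` from an actual crawl to confirm what the reader returns.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a LlamaIndex Document (text plus metadata)."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def filter_by_source(docs: List[Doc], url_prefix: str) -> List[Doc]:
    """Keep documents whose assumed "url" metadata starts with url_prefix."""
    return [d for d in docs if d.metadata.get("url", "").startswith(url_prefix)]


def cite(doc: Doc) -> str:
    """Format a citation from the assumed "title" and "url" metadata keys."""
    title = doc.metadata.get("title", "Untitled")
    url = doc.metadata.get("url", "unknown source")
    return f"{title} <{url}>"


docs = [
    Doc("Create a token.", {"url": "https://docs.example.com/api/auth", "title": "Authentication"}),
    Doc("Pricing tiers.", {"url": "https://docs.example.com/pricing", "title": "Pricing"}),
    Doc("No metadata."),
]

api_docs = filter_by_source(docs, "https://docs.example.com/api")
citations = [cite(d) for d in api_docs]
print(citations)
```

Real LlamaIndex Document objects expose the same `text` and `metadata` attributes, so the two helpers should carry over once the key names are confirmed.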

Params

All Spider parameters

Proxy modes, browser rendering, readability, CSS selectors, depth limits. Everything Spider supports passes through the reader params dict.
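One way to keep crawl settings in one place is a small helper that merges overrides onto shared defaults, sketched below. Only `return_format` and `limit` appear on this page; `depth`, `readability`, and `request` are assumed key names and should be checked against the Spider API reference before use. `build_params` is a hypothetical helper, not part of either library.

```python
from typing import Any, Dict

# Defaults taken from the crawl example on this page.
DEFAULT_PARAMS: Dict[str, Any] = {
    "return_format": "markdown",
    "limit": 100,
}


def build_params(**overrides: Any) -> Dict[str, Any]:
    """Return a new params dict: defaults first, then any non-None overrides."""
    params = dict(DEFAULT_PARAMS)  # copy so the shared defaults are not mutated
    params.update({k: v for k, v in overrides.items() if v is not None})
    return params


# Assumed key names (verify against the Spider API reference):
#   depth       - how many links deep to follow
#   readability - strip page chrome before conversion
#   request     - rendering mode, e.g. a headless browser
params = build_params(limit=25, depth=2, readability=True, request="chrome")
print(params)
```

The resulting dict is what you would pass as the reader's `params`.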

Scale

Concurrent crawl

Spider crawls pages concurrently on its infrastructure. Ingest large sites without managing headless browsers or rate limits locally.

03 · Search as a query tool

Give your ReAct agent live web access.

Wrap Spider's search call as a FunctionTool. The agent decides when to search, fetches content, and reasons over the results.

ReActAgent + Spider search (Python)
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI
from llama_index.core.agent import ReActAgent
from spider import Spider

# With no api_key argument, the client reads SPIDER_API_KEY from the environment
spider = Spider()

def web_search(query: str) -> str:
    """Search the web and return page contents."""
    results = spider.search(query, params={
        "search_limit": 5,
        "fetch_page_content": True,
        "return_format": "markdown",
    })
    return "\n---\n".join(
        f"Source: {r['url']}\n{r['content'][:2000]}"
        for r in results if r.get("content")
    )

search_tool = FunctionTool.from_defaults(fn=web_search)
agent = ReActAgent.from_tools([search_tool], llm=OpenAI(model="gpt-4o"))

response = agent.chat("What are the latest OWASP Top 10 changes?")
print(response)
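The formatting step inside `web_search` can be pulled into a pure function, which makes it testable without network access. The sketch below assumes the result shape used above (a list of dicts with `url` and `content` keys) and has not been checked against the live API response. `format_results` and the fallback message are hypothetical names chosen for illustration.

```python
from typing import Any, Dict, Iterable


def format_results(results: Iterable[Dict[str, Any]], max_chars: int = 2000) -> str:
    """Join search hits into one string, skipping hits without page content."""
    blocks = []
    for r in results:
        content = r.get("content")
        if not content:
            continue  # nothing for the agent to reason over
        url = r.get("url", "unknown")  # tolerate a missing URL instead of raising
        blocks.append(f"Source: {url}\n{content[:max_chars]}")
    # Return an explicit message rather than an empty string for the agent
    return "\n---\n".join(blocks) or "No results with page content."


# Fake results in the assumed shape, for a dry run without the API
fake = [
    {"url": "https://a.example/x", "content": "A" * 3000},
    {"url": "https://b.example/y", "content": ""},
    {"content": "no url"},
]
out = format_results(fake, max_chars=10)
print(out)
```

Truncating each page keeps the tool output within the LLM's context budget. The explicit fallback gives the agent something to react to when a search returns no page content.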
Start

Ingest the web into a LlamaIndex pipeline.

Free balance on sign-up. No subscription required.

pip install spider-client llama-index llama-index-readers-web
See also