Data Connector
Ingest the entire web into your LlamaIndex pipeline
SpiderWebReader crawls sites and returns clean LlamaIndex Documents. Plug them straight into your vector store, knowledge graph, or any index type. No parsing glue code needed.
Ingestion Pipeline
from llama_index.readers.web import SpiderWebReader
# Initialize the reader
reader = SpiderWebReader()
# Crawl a docs site and get LlamaIndex Documents
documents = reader.load_data(
    url="https://docs.example.com",
    mode="crawl",
    params={
        "return_format": "markdown",
        "limit": 100,
    },
)
# Build a vector index from the crawl results
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("How do I configure authentication?")
print(response)
Native Reader
SpiderWebReader implements the LlamaIndex BaseReader interface. It returns Document objects with text and metadata that work with every index type out of the box.
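Because the reader returns standard Documents, the same crawl output can feed any LlamaIndex index type. A minimal sketch (index classes are real LlamaIndex imports; `kind` is a hypothetical switch for illustration):

```python
def build_index(documents, kind: str = "vector"):
    """Feed the same SpiderWebReader Documents into different index types.

    Sketch only: imports are deferred so the function stays self-contained;
    check your installed LlamaIndex version for exact import paths.
    """
    if kind == "vector":
        from llama_index.core import VectorStoreIndex
        return VectorStoreIndex.from_documents(documents)
    if kind == "summary":
        from llama_index.core import SummaryIndex
        return SummaryIndex.from_documents(documents)
    raise ValueError(f"unknown index kind: {kind}")
```

The point is that no conversion step sits between the crawl and the index: whatever `load_data` returns goes straight into `from_documents`.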
Crawl, Scrape, or Search
Point it at a single URL or let it discover an entire site. Switch between scrape, crawl, and search modes with one parameter.
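One way to sketch that mode switch is a thin wrapper over `load_data` (the import is deferred so the sketch stays self-contained; it assumes the `mode` values described above):

```python
def load_web(url_or_query: str, mode: str = "crawl"):
    """Fetch web content via SpiderWebReader in any of its modes.

    mode="scrape"  -> one page
    mode="crawl"   -> follow links across the whole site
    mode="search"  -> treat the input as a search query
    Requires llama-index-readers-web and a Spider API key.
    """
    from llama_index.readers.web import SpiderWebReader

    reader = SpiderWebReader()
    return reader.load_data(
        url=url_or_query,
        mode=mode,
        params={"return_format": "markdown"},
    )
```

Everything else about the call stays the same; only `mode` changes.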
Chunk-Friendly Markdown
Spider strips navigation, ads, and scripts, then returns clean markdown. Sentence splitters and token chunkers perform better on this output than raw HTML.
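Because the output has real paragraph breaks, even a naive chunker does reasonable work. A minimal stdlib sketch (in practice you would reach for LlamaIndex's SentenceSplitter):

```python
def chunk_markdown(text: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks."""
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        # Start a new chunk when adding this paragraph would overflow.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Raw HTML offers no such boundaries, which is why splitters degrade on it.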
Metadata Preserved
Each Document carries its source URL, page title, and crawl metadata. Your query engine can filter by source, and citations trace back to the original page.
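A sketch of source filtering over that metadata (it assumes a `"url"` key on each Document's metadata, as described above):

```python
def filter_by_source(documents, url_prefix: str):
    """Keep only Documents whose crawl metadata URL starts with url_prefix."""
    return [
        doc for doc in documents
        if (getattr(doc, "metadata", {}) or {}).get("url", "").startswith(url_prefix)
    ]
```

The same metadata powers citation display: when a query engine returns source nodes, each one traces back to the page it was crawled from.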
All Spider Parameters
Proxy modes, browser rendering, readability, CSS selectors, depth limits. Everything Spider supports passes through the reader's params dict.
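A fuller params dict might look like the sketch below. The key names mirror Spider options mentioned above, but verify exact names and values against Spider's API docs before relying on them:

```python
# Sketch of a richer Spider params dict; confirm key names in Spider's docs.
spider_params = {
    "return_format": "markdown",
    "limit": 250,           # hard cap on pages crawled
    "depth": 3,             # how many links deep to follow
    "request": "chrome",    # full browser rendering for JS-heavy pages
    "proxy_enabled": True,  # route requests through Spider's proxy pool
    "readability": True,    # strip boilerplate before returning content
}
```

The reader forwards this dict unchanged, so new Spider options work without waiting for a connector update.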
Scales to Thousands
Spider crawls pages concurrently on its infrastructure, not yours. Ingest a 10,000-page site without managing headless browsers or rate limits locally.
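End to end, a large-site ingest is one function. This sketch assumes the `limit` param caps the crawl and defers imports so it stays self-contained; it needs llama-index-readers-web and a Spider API key to actually run:

```python
def ingest_site(url: str, max_pages: int = 10_000):
    """Crawl up to max_pages (concurrently, on Spider's side) and index locally."""
    from llama_index.core import VectorStoreIndex
    from llama_index.readers.web import SpiderWebReader

    reader = SpiderWebReader()
    documents = reader.load_data(
        url=url,
        mode="crawl",
        params={"return_format": "markdown", "limit": max_pages},
    )
    return VectorStoreIndex.from_documents(documents)
```

No headless browser pool, proxy rotation, or rate limiter runs on your machine; your process just receives Documents.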
Web Search as a Query Tool
Give your LlamaIndex agent access to live web results. Spider searches, fetches, and returns content your agent can reason over.
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI
from llama_index.core.agent import ReActAgent
from spider import Spider
spider = Spider()
def web_search(query: str) -> str:
"""Search the web and return page contents."""
results = spider.search(query, params={
"search_limit": 5,
"fetch_page_content": True,
"return_format": "markdown",
})
return "\n---\n".join(
f"Source: {r['url']}\n{r['content'][:2000]}"
for r in results if r.get("content")
)
search_tool = FunctionTool.from_defaults(fn=web_search)
agent = ReActAgent.from_tools([search_tool], llm=OpenAI(model="gpt-4o"))
response = agent.chat("What are the latest OWASP Top 10 changes?")
print(response)