Spider as a LlamaIndex data connector.
SpiderWebReader crawls sites and returns clean LlamaIndex Documents. Plug them into a vector store, knowledge graph, or any index type. No parsing glue code.
- 01 Crawl (Spider)
- 02 Reader (Spider)
- 03 Chunking
- 04 Embedding
- 05 Index
- 06 Query
Crawl a site, build a vector index.
SpiderWebReader returns Document objects with text and metadata. Pass them to VectorStoreIndex or any other LlamaIndex index type.

    from llama_index.core import VectorStoreIndex
    from llama_index.readers.web import SpiderWebReader

    # Initialize the reader. In the released llama-index-readers-web package,
    # mode and params are set on the reader and load_data takes only the URL.
    # Assumes the API key is available via the SPIDER_API_KEY environment
    # variable; otherwise pass api_key="..." here.
    reader = SpiderWebReader(
        mode="crawl",
        params={
            "return_format": "markdown",
            "limit": 100,
        },
    )

    # Crawl a docs site and get LlamaIndex Documents
    documents = reader.load_data(url="https://docs.example.com")

    # Build a vector index from the crawl results
    index = VectorStoreIndex.from_documents(documents)
    query_engine = index.as_query_engine()
    response = query_engine.query("How do I configure authentication?")
    print(response)

Built for the LlamaIndex stack.
Native BaseReader
SpiderWebReader implements the LlamaIndex BaseReader interface. It returns Document objects with text and metadata that work with every index type.
Crawl, scrape, or search
Point it at a single URL or let it discover an entire site. Switch between scrape, crawl, and search modes with one parameter.
Chunk-friendly markdown
Navigation, ads, and scripts are stripped. Sentence splitters and token chunkers perform better on clean markdown than raw HTML.
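To illustrate why clean markdown is easy to chunk, here is a minimal standard-library sketch that splits a page at its headings. It is not LlamaIndex's splitter, only a stand-in to show the idea; the function name `split_markdown_by_heading` and the sample page are hypothetical.

```python
import re


def split_markdown_by_heading(markdown: str) -> list[str]:
    """Split markdown into sections, starting a new one at each ATX heading."""
    sections: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in markdown.splitlines():
        # Track fenced code so "# comment" lines inside code are not headings.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        is_heading = not in_fence and re.match(r"#{1,6}\s", line) is not None
        if is_heading and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    # Drop sections that contained only whitespace.
    return [s for s in sections if s]


# Hypothetical sample page, standing in for one crawled Document's text.
page = "# Auth\nUse an API key.\n\n## Rotation\nRotate keys every 90 days.\n"
chunks = split_markdown_by_heading(page)
for chunk in chunks:
    print(repr(chunk))
```

With raw HTML, the same boundaries would have to be recovered from nested tags first. In a real pipeline, a LlamaIndex node parser would normally do this work.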
Source preserved
Each Document carries its source URL, page title, and crawl metadata. Filter by source in your query engine and trace citations back to the original page.
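As a rough sketch of source filtering and citation, the snippet below works on any object that has `text` and `metadata` attributes, which LlamaIndex Documents do. The `Doc` class is a stand-in so the example runs without network access, and the metadata keys `url` and `title` are assumptions; check the keys your crawl actually returns.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LlamaIndex Document: text plus a metadata dict."""
    text: str
    metadata: dict = field(default_factory=dict)


def filter_by_source(docs, url_prefix, key="url"):
    """Keep documents whose source URL starts with url_prefix.

    The metadata key name is an assumption, so it is a parameter.
    """
    return [d for d in docs if str(d.metadata.get(key, "")).startswith(url_prefix)]


def cite(doc, key="url"):
    """Format a short citation from a document's metadata."""
    title = doc.metadata.get("title", "untitled")
    return f"{title} <{doc.metadata.get(key, 'unknown source')}>"


# Hypothetical crawl results.
docs = [
    Doc("Set the API key header.",
        {"url": "https://docs.example.com/auth/keys", "title": "API keys"}),
    Doc("Release notes.",
        {"url": "https://docs.example.com/changelog", "title": "Changelog"}),
]
auth_docs = filter_by_source(docs, "https://docs.example.com/auth")
citations = [cite(d) for d in auth_docs]
print(citations)
```

Filtering before indexing keeps unrelated pages out of the index. LlamaIndex also supports metadata filters at query time.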
All Spider parameters
Proxy modes, browser rendering, readability, CSS selectors, depth limits. Everything Spider supports passes through the reader params dict.
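A params dict can be assembled with a small helper that drops unset options, as sketched below. `return_format` and `limit` come from the crawl example on this page. `depth`, `readability`, `request`, and `root_selector` are assumed parameter names and should be checked against the Spider API reference. `build_params` itself is a hypothetical helper, not part of any library.

```python
def build_params(**options):
    """Return a params dict with unset (None) options removed."""
    return {name: value for name, value in options.items() if value is not None}


params = build_params(
    return_format="markdown",  # used in the crawl example on this page
    limit=100,                 # used in the crawl example on this page
    depth=3,                   # assumed name: maximum crawl depth
    readability=True,          # assumed name: readability preprocessing
    request="chrome",          # assumed name and value: browser rendering
    root_selector=None,        # assumed name: CSS selector; unset, so dropped
)
print(params)
```

The resulting dict is what would be handed to the reader as its params.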
Concurrent crawl
Spider crawls pages concurrently on its infrastructure. Ingest large sites without managing headless browsers or rate limits locally.
Give your ReAct agent live web access.
Wrap Spider's search call as a FunctionTool. The agent decides when to search, fetches content, and reasons over the results.
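The formatting step inside such a tool can be pulled out and tested without network access. The sketch below assumes search results arrive as a list of dicts with `url` and `content` keys, as the example that follows does. `format_results` and the sample data are hypothetical.

```python
def format_results(results, max_chars=2000):
    """Join search hits into one string, skipping hits without content."""
    blocks = []
    for hit in results or []:  # tolerate None or an empty list
        content = hit.get("content")
        if not content:
            continue
        # Truncate each page so the combined text stays within the LLM context.
        blocks.append(f"Source: {hit.get('url', 'unknown')}\n{content[:max_chars]}")
    return "\n---\n".join(blocks)


# Hypothetical search hits in the assumed shape.
sample = [
    {"url": "https://a.example/x", "content": "alpha " * 3},
    {"url": "https://b.example/y", "content": None},
    {"url": "https://c.example/z", "content": "gamma"},
]
formatted = format_results(sample, max_chars=5)
print(formatted)
```

Keeping the source URL next to each excerpt lets the agent cite where an answer came from.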

    from llama_index.core.tools import FunctionTool
    from llama_index.llms.openai import OpenAI
    from llama_index.core.agent import ReActAgent
    from spider import Spider

    spider = Spider()


    def web_search(query: str) -> str:
        """Search the web and return page contents."""
        results = spider.search(query, params={
            "search_limit": 5,
            "fetch_page_content": True,
            "return_format": "markdown",
        })
        return "\n---\n".join(
            f"Source: {r['url']}\n{r['content'][:2000]}"
            for r in results if r.get("content")
        )


    search_tool = FunctionTool.from_defaults(fn=web_search)
    agent = ReActAgent.from_tools([search_tool], llm=OpenAI(model="gpt-4o"))
    response = agent.chat("What are the latest OWASP Top 10 changes?")
    print(response)

Ingest the web into a LlamaIndex pipeline.
Free balance on sign-up. No subscription required.