LangChain + Groq + Spider Integration Guide
Crawl multiple URLs with Spider’s LangChain loader, then summarize the results through Groq using Meta’s Llama 3 model.
Setup Groq
Get Groq running with the following steps:
- Create an API Key here
- Install Groq and set up the API key as an environment variable:
pip install groq
In your terminal:
export GROQ_API_KEY=<your-api-key-here>
Alternatively, you can use the dotenv package to load the environment variables from a .env file. Create a .env file in your project root and add the following:
GROQ_API_KEY=<your-api-key-here>
Then in your Python code:
from dotenv import load_dotenv
from groq import Groq
import os

load_dotenv()

client = Groq(
    api_key=os.environ.get("GROQ_API_KEY"),
)
- Test Groq to verify the setup:
import os
from groq import Groq

client = Groq(
    api_key=os.environ.get("GROQ_API_KEY"),
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "What are large language models?",
        }
    ],
    model="llama3-8b-8192",
)
print(chat_completion.choices[0].message.content)
We’ll use the Llama 3 8B model for this example, which gives about 800 tokens per second. The larger 70B model gives around 150 tokens per second.
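As a rough back-of-the-envelope check, generation time scales with output length divided by throughput (the token counts below are made-up examples, and this ignores network and queueing latency):

```python
def estimated_seconds(output_tokens, tokens_per_second):
    """Rough lower bound on generation time: output length / throughput."""
    return output_tokens / tokens_per_second

# A ~500-token summary at the 8B model's ~800 tokens/s:
print(f"{estimated_seconds(500, 800):.2f}s")
# The same summary at the 70B model's ~150 tokens/s:
print(f"{estimated_seconds(500, 150):.2f}s")
```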
Setup Spider
After you get your API key, you can use the Spider LangChain document loader. For the full API reference, see the Spider API Guide.
Install the Spider Python client library:
pip install spider_client
Set up the API key in your .env file:
SPIDER_API_KEY=<your-api-key-here>
Then import the SpiderLoader from the document loaders module:
from langchain_community.document_loaders import SpiderLoader
Set up the Spider loader for our use case:
def load_markdown_from_urls(urls):
    loader = SpiderLoader(
        url=urls,
        mode="crawl",
        params={
            "return_format": "markdown",
            "proxy_enabled": False,
            "request": "http",
            "request_timeout": 60,
            "limit": 1,
        },
    )
    data = loader.load()
    return data
We set the mode to crawl and use the return_format parameter to specify that we want markdown content. The rest of the parameters are optional.
Reminder
Spider handles concurrency and IP rotation automatically. Higher credit balances unlock higher concurrency limits, so make sure you have enough credits if you crawl more than one page.
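If you want to stay under a given concurrency limit while crawling many pages, one option is to submit the URLs in fixed-size batches. A sketch (the `batch` helper and the batch size of 2 are our own, not part of Spider's client):

```python
def batch(items, size):
    """Yield successive fixed-size chunks of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
]

# Each chunk could then be passed to the loader as a comma-joined string,
# mirroring how the full example below joins its URLs.
for chunk in batch(urls, 2):
    print(",".join(chunk))
```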
For now, we’ll turn off the proxy and move on to setting up LangChain.
Setup LangChain
Install LangChain if you haven’t yet:
pip install langchain
Let’s install the Groq LangChain package:
pip install langchain-groq
LangChain is an LLM orchestration framework. In this example we’ll use it to assemble a prompt and run a chain. See the full LangChain docs for more.
Set up a simple chain using the chat prompt template:
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq
import os

chat = ChatGroq(
    temperature=0,
    groq_api_key=os.environ.get("GROQ_API_KEY"),
    model_name="llama3-8b-8192",
)

markdown_content = "Ancana: Marketplace to buy managed vacation homes through fractional ownership | Y Combinator...."

system = "You are a helpful assistant. Please summarize this company in bullet points."
prompt = ChatPromptTemplate.from_messages(
    [("system", system), ("human", "{markdown_text}")]
)

chain = prompt | chat

summary = chain.invoke({"markdown_text": markdown_content})
print(summary.content)
And the results should be something like this:
## output
**Ancana:**
- Founded in 2019
- Marketplace for buying and owning fractional shares of vacation homes
- Allows users to purchase a share of a property and split expenses with other co-owners
- Offers features like property management, furnishing, and maintenance
- Founded by Andres Barrios and Ryan Black
The goal is to crawl two startup pages from Y Combinator, summarize them, and have the LLM compare the two.
But first we need to combine the markdown content from both companies and format it so that the LLM can easily understand it.
def concat_markdown_content(markdown_content):
    final_content = ""
    separator = "=" * 40
    for content_dict in markdown_content:
        url = content_dict["url"]
        markdown_text = content_dict["markdown_text"]
        title = content_dict["title"]
        final_content += f"{separator}\nURL: {url}\nPage Title: {title}\nMarkdown:\n{separator}\n{markdown_text}\n"
    return final_content
By adding a demarcation and some metadata, the LLM should have enough context to summarize and compare the two companies.
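To see the shape of the combined context, here is the function above run on two tiny stand-in entries (the URLs, titles, and markdown text are placeholders):

```python
def concat_markdown_content(markdown_content):
    final_content = ""
    separator = "=" * 40
    for content_dict in markdown_content:
        url = content_dict["url"]
        markdown_text = content_dict["markdown_text"]
        title = content_dict["title"]
        final_content += f"{separator}\nURL: {url}\nPage Title: {title}\nMarkdown:\n{separator}\n{markdown_text}\n"
    return final_content

sample = [
    {"url": "https://example.com/a", "title": "Company A", "markdown_text": "# A\nDetails about A."},
    {"url": "https://example.com/b", "title": "Company B", "markdown_text": "# B\nDetails about B."},
]
print(concat_markdown_content(sample))
```

Each entry is framed by the separator plus its URL and page title, so the LLM can attribute each chunk of markdown to the right company.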
Update the system prompt to compare both companies:
system = "You are a helpful assistant. Please summarize the markdown content for Ancana and AirBNB provided in the context in bullet points separately. And then provide a summary of how they are similar and different."
That’s all. Here’s the full code:
Full code
from langchain_community.document_loaders import SpiderLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq
from dotenv import load_dotenv
import os
import time

load_dotenv()

chat = ChatGroq(
    temperature=0,
    groq_api_key=os.environ.get("GROQ_API_KEY"),
    model_name="llama3-70b-8192",
)

def load_markdown_from_urls(url):
    loader = SpiderLoader(
        url=url,
        mode="crawl",
        params={
            "return_format": "markdown",
            "proxy_enabled": False,
            "request": "http",
            "request_timeout": 60,
            "limit": 1,
            "cache": False,
        },
    )
    data = loader.load()
    if data:
        return data
    else:
        return None

def concat_markdown_content(markdown_content):
    final_content = ""
    separator = "=" * 40
    for content_dict in markdown_content:
        url = content_dict["url"]
        markdown_text = content_dict["markdown_text"]
        title = content_dict["title"]
        final_content += f"{separator}\nURL: {url}\nPage Title: {title}\nMarkdown:\n{separator}\n{markdown_text}\n"
    return final_content

system = "You are a helpful assistant. Please summarize the markdown content for Ancana and AirBNB provided in the context in bullet points separately. And then provide a summary of how they are similar and different."
prompt = ChatPromptTemplate.from_messages(
    [("system", system), ("human", "{markdown_text}")]
)
chain = prompt | chat

start_time = time.time()

urls = [
    "https://www.ycombinator.com/companies/airbnb",
    "https://www.ycombinator.com/companies/ancana",
]
url_join = ",".join(urls)
markdown_contents = load_markdown_from_urls(url_join)

all_contents = []
for content in markdown_contents:
    if content.page_content:
        all_contents.append(
            {
                "markdown_text": content.page_content,
                "url": content.metadata["url"],
                "title": content.metadata["title"],
            }
        )

concatenated_markdown = concat_markdown_content(all_contents)
summary = chain.invoke({"markdown_text": concatenated_markdown})

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time:.2f} seconds")
print(summary.content)
# Output
Here are the bullet points summarizing Ancana:
**Ancana**
- Marketplace to buy managed vacation homes through fractional ownership
- Allows individuals to own a share of a vacation home (e.g. 1/4 or 1/8) and split expenses with other co-owners
- Properties are furnished and managed by Ancana, with costs split amongst co-owners
- Owners have access to the property for a set period of time (e.g. 3 months per year)
- Founded in 2019, based in Mérida, Mexico
- Team size: 7
- Founders: Andres Barrios, Ryan Black
And here are the bullet points summarizing Airbnb:
**Airbnb:**
- Online marketplace for booking unique accommodations around the world
- Founded in 2008 and based in San Francisco, California
- Allows users to list and discover unique spaces, from apartments to castles
- Offers a trusted community marketplace for people to monetize their extra space
- Has over 33,000 cities and 192 countries listed on the platform
- Team size: 6,132
**Similarities:**
- Both Ancana and Airbnb operate in the travel and accommodation space
- Both platforms allow users to book and stay in unique properties
- Both companies focus on providing a trusted and community-driven marketplace for users
**Differences:**
- Ancana focuses on fractional ownership of vacation homes, while Airbnb is a booking platform for short-term rentals
- Ancana properties are fully managed and furnished, while Airbnb listings can vary in terms of amenities and services
- Ancana is a newer company with a smaller team size, while Airbnb is a more established company with a larger team and global presence
Benchmarking this example with caching disabled gave an elapsed time of about 7.58 seconds total.