Crawling Authenticated Pages with Spider.cloud
Contents
- Prerequisites
- Setting Up Your Environment
- Using the Spider.cloud API
- Additional Parameters
- Conclusion
Prerequisites
- Ensure you have your Spider.cloud API key.
- Basic knowledge of making HTTP requests with Python.
- Install the Spider client for Python.
Setting Up Your Environment
pip install spider-client
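The examples below read the key from the SPIDER_API_KEY environment variable. Here is a minimal setup check, assuming you have exported the key under that name:
import os
from spider import Spider

# Read the Spider.cloud API key from the environment instead of hard-coding it
api_key = os.getenv("SPIDER_API_KEY")
if not api_key:
    raise RuntimeError("Set the SPIDER_API_KEY environment variable before crawling")

app = Spider(api_key=api_key)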
Using the Spider.cloud API
Directly Setting the Cookie
One of the simplest ways to handle authenticated pages is by setting the cookie directly in your request.
Example
from spider import Spider
import os

# Initialize the Spider client with your API key
app = Spider(api_key=os.getenv("SPIDER_API_KEY"))

# Authorization carries your Spider.cloud API key;
# Cookie is the session cookie for the authenticated site you want to crawl
headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Cookie': 'YOUR_AUTH_COOKIE',
}

json_data = {"url": "http://www.example.com", "return_format": "markdown"}

response = app.crawl_url(json_data['url'], params=json_data, headers=headers)
print(response)
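If you prefer to skip the Python client, the same cookie approach works with a plain HTTP request. A minimal sketch using the requests library, assuming the crawl endpoint is https://api.spider.cloud/crawl (check the API reference for the exact URL):
import os
import requests

# Authorization carries the Spider.cloud key; Cookie is the target site's session cookie
headers = {
    "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
    "Content-Type": "application/json",
    "Cookie": "YOUR_AUTH_COOKIE",
}

json_data = {"url": "http://www.example.com", "return_format": "markdown"}

response = requests.post("https://api.spider.cloud/crawl", headers=headers, json=json_data)
print(response.json())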
Using execution_scripts to Run Custom JavaScript
The execution_scripts parameter allows you to run custom JavaScript code on the page. This can be particularly useful for actions like logging in through a form.
Example
First, let’s create a script that handles the login process:
document.querySelector('input[name="username"]').value = 'your_username';
document.querySelector('input[name="password"]').value = 'your_password';
document.querySelector('form').submit();
Now incorporate this into the request:
from spider import Spider
import os

# Initialize the Spider client with your API key
app = Spider(api_key=os.getenv("SPIDER_API_KEY"))

# Map each URL to the JavaScript that should run on that page
execution_scripts = {
    "http://www.example.com/login": """
        document.addEventListener("DOMContentLoaded", function() {
            document.querySelector("#username").value = "your_username";
            document.querySelector("#password").value = "your_password";
            document.querySelector("form").submit();
        });
    """
}

json_data = {
    "return_cookies": True,       # include the cookies set after login in the response
    "return_headers": True,       # include the response headers
    "request": "chrome",          # use headless Chrome so the script can execute
    "execution_scripts": execution_scripts,
    "return_format": "markdown",
    "url": "http://www.example.com/login"
}

response = app.crawl_url(json_data['url'], params=json_data)
print(response)
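Because return_cookies is enabled, the response should carry the session cookies set after the form submits. Here is a hedged sketch of reusing them for a follow-up crawl of a protected page; the "cookies" field name and the /account URL are assumptions, so check the actual response shape returned for your account:
# Hedged sketch: pull post-login cookies out of the response and reuse them.
pages = response if isinstance(response, list) else [response]
session_cookie = next((p.get("cookies") for p in pages if isinstance(p, dict) and p.get("cookies")), None)

if session_cookie:
    protected = app.crawl_url(
        "http://www.example.com/account",  # hypothetical protected page
        params={"return_format": "markdown"},
        headers={"Cookie": session_cookie},
    )
    print(protected)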
Using GPT Config for AI-Driven Actions
You can use the gpt_config option to run custom AI actions that interact with the page, such as filling in forms or navigating through login sequences, using models like GPT-4. If your account does not have at least $50 in credits, you will need to provide your own OpenAI API key.
from spider import Spider
import os

# Initialize the Spider client with your API key
app = Spider(api_key=os.getenv("SPIDER_API_KEY"))

gpt_config = {
    "prompt": "extract the main article",
    "model": "gpt-4",
    "max_tokens": 2000,
    "temperature": 0.54,
    "top_p": 0.17,
    "api_key": None  # your OpenAI API key is only necessary if your credits are under $50
}

json_data = {
    "url": "http://www.example.com/authenticated-page",
    "gpt_config": gpt_config,
    "return_format": "markdown"
}

response = app.crawl_url(json_data['url'], params=json_data)
print(response)
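If your balance is under $50 and you need to supply your own OpenAI key, avoid hard-coding it. A minimal sketch, assuming the key is stored in an OPENAI_API_KEY environment variable:
import os

# Only required when your Spider.cloud credits are under $50
gpt_config["api_key"] = os.getenv("OPENAI_API_KEY")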
GPT Configs Structure
The full gpt_config structure in JSON, for reference:
{
    "prompt": "extract the main article",
    "model": "gpt-4",
    "max_tokens": 2000,
    "temperature": 0.54,
    "user": null,
    "top_p": 0.17,
    "prompt_url_map": null,
    "extra_ai_data": true,
    "paths_map": true,
    "screenshot": false,
    "api_key": "your_openai_api_key",
    "cache": null
}
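The prompt_url_map field suggests that prompts can be targeted per URL rather than applied globally. A hedged sketch of what that might look like, assuming it accepts a mapping from URL to prompt (verify the exact shape against the API reference):
# Assumption: prompt_url_map maps each URL to the prompt used on that page
gpt_config = {
    "model": "gpt-4",
    "max_tokens": 2000,
    "prompt_url_map": {
        "http://www.example.com/login": "fill in the login form and submit it",
        "http://www.example.com/account": "extract the account details as a list",
    },
    "api_key": None,
}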
Additional Parameters
Spider.cloud provides additional parameters to control the behavior of your crawl. Here are some useful ones, with a combined example after the list:
- wait_for: Specifies various waiting conditions.
- blacklist: Blacklist certain paths from being crawled.
- whitelist: Whitelist certain paths for crawling.
- subdomains: Allow subdomains to be included.
- user_agent: Set a custom user agent.
- fingerprint: Use advanced fingerprinting for Chrome.
- storageless: Disable storing of crawled data.
- readability: Pre-process the content for reading.
- chunking_alg: Segment your content output.
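A minimal sketch combining a few of these parameters in one crawl; the parameter names come from the list above, while the specific values are illustrative:
from spider import Spider
import os

app = Spider(api_key=os.getenv("SPIDER_API_KEY"))

params = {
    "url": "http://www.example.com",
    "return_format": "markdown",
    "subdomains": True,                                       # include subdomains
    "blacklist": ["/logout"],                                 # skip paths that would end the session
    "user_agent": "Mozilla/5.0 (compatible; MyCrawler/1.0)",  # custom user agent
    "readability": True,                                      # pre-process content for reading
}

response = app.crawl_url(params["url"], params=params)
print(response)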
For a full list of parameters, refer to the Spider.cloud API Reference.
Conclusion
Using the Spider.cloud API, you can handle authenticated pages by setting cookies directly, running custom JavaScript, or using AI-driven actions. With these methods, your crawls can reach the protected resources they need.
For more details, refer to the official Spider API documentation. Happy Crawling!