Crawling Authenticated Pages with Spider
Prerequisites
- Ensure you have your Spider API key.
- Basic knowledge of making HTTP requests with Python.
- Install the Spider client for Python.
Setting Up Your Environment
pip install spider-client
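Rather than hard-coding your API key, you can read it from an environment variable before creating the client. The sketch below is only one way to do this; the SPIDER_API_KEY variable name is an assumption for illustration, not a requirement of the client.
import os
from spider import Spider

# A minimal sketch: load the key from an environment variable
# (SPIDER_API_KEY is an assumed name) and fail early if it is missing.
api_key = os.environ.get("SPIDER_API_KEY")
if not api_key:
    raise RuntimeError("Set SPIDER_API_KEY before running the crawler")

app = Spider(api_key=api_key)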
Using the Spider API
Directly Setting the Cookie
One of the simplest ways to handle authenticated pages is by setting the cookie directly in your request.
Example
from spider import Spider

app = Spider(api_key='your_api_key')

json_data = {
    "url": "http://www.example.com",
    "return_format": "markdown",
    "cookies": "mycookie"  # the cookie string to send with the request
}

response = app.crawl_url(json_data['url'], params=json_data)
print(response)
Using execution_scripts to Run Custom JavaScript
The execution_scripts parameter allows you to run custom JavaScript code on the page. This can be particularly useful for actions like logging in through a form.
Example
First, let’s create a script that handles the login process:
document.querySelector('input[name="username"]').value = 'your_username';
document.querySelector('input[name="password"]').value = 'your_password';
document.querySelector('form').submit();
Now incorporate this into the request:
from spider import Spider

app = Spider(api_key='your_api_key')

# Map each URL to the JavaScript that should run on it.
execution_scripts = {
    "http://www.example.com/login": """
        document.addEventListener("DOMContentLoaded", function() {
            document.querySelector("#username").value = "your_username";
            document.querySelector("#password").value = "your_password";
            document.querySelector("form").submit();
        });
    """
}

json_data = {
    "return_cookies": True,
    "return_headers": True,
    "request": "chrome",
    "execution_scripts": execution_scripts,
    "return_format": "markdown",
    "url": "http://www.example.com/login"
}

response = app.crawl_url(json_data['url'], params=json_data)
print(response)
Using GPT Config for AI-Driven Actions
You can use the gpt_config option to run custom AI actions that interact with the page, such as filling in forms or navigating through login sequences, using models like GPT-4. If your account does not have at least $50 in credits, you will need to provide your own OpenAI API key.
from spider import Spider

app = Spider(api_key='your_api_key')

gpt_config = {
    "prompt": ["login with the username 'henry@mail.com' and password `something`.", "extract the main article"],
    "model": "gpt-4o",
    "max_tokens": 2000,
    "temperature": 0.54,
    "top_p": 0.17,
    "api_key": None  # your_openai_api_key is only necessary if your credits are under $50
}

json_data = {
    "url": "http://www.example.com/authenticated-page",
    "gpt_config": gpt_config,
    "return_format": "markdown"
}

response = app.crawl_url(json_data['url'], params=json_data)
print(response)
GPT Configs Structure
Here is the full gpt_config structure in JSON format for your reference:
{
  "prompt": ["login with the username 'henry@mail.com' and password `something`.", "extract the main article"],
  "model": "gpt-4o",
  "max_tokens": 2000,
  "temperature": 0.54,
  "user": null,
  "top_p": 0.17,
  "prompt_url_map": null,
  "extra_ai_data": true,
  "paths_map": true,
  "screenshot": false,
  "api_key": "your_openai_api_key",
  "cache": null
}
Additional Parameters
Spider provides additional parameters to control the behavior of your crawl. Here are some useful ones:
- wait_for: Specifies various waiting conditions.
- blacklist: Blacklist certain paths from being crawled.
- whitelist: Whitelist certain paths for crawling.
- subdomains: Allow subdomains to be included.
- user_agent: Set a custom user agent.
- fingerprint: Use advanced fingerprinting for Chrome.
- storageless: Disable storing of crawled data.
- readability: Pre-process the content for reading.
- chunking_alg: Segment your content output.
For a full list of parameters, refer to the Spider API Reference.
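Several of these parameters can be combined in the same params dictionary used in the earlier examples. The sketch below is illustrative: the parameter names come from the list above, but the specific values (the blacklisted path, the user agent string, and so on) are placeholders, and the exact value formats should be confirmed against the Spider API Reference.
from spider import Spider

app = Spider(api_key='your_api_key')

# A rough sketch combining a few of the additional parameters;
# values are placeholders and formats may differ from the API reference.
json_data = {
    "url": "http://www.example.com",
    "return_format": "markdown",
    "subdomains": True,             # include subdomains in the crawl
    "blacklist": ["/logout"],       # skip paths that would end the session
    "user_agent": "MyCrawler/1.0",  # custom user agent
    "readability": True,            # pre-process content for reading
    "storageless": True             # do not store the crawled data
}

response = app.crawl_url(json_data['url'], params=json_data)
print(response)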
Conclusion
Using the Spider API, you can effectively handle authenticated pages by setting cookies directly, running custom JavaScript, or using AI-driven actions. With these methods, you can ensure your crawling tasks access the necessary protected resources.