Crawling Authenticated Pages with Spider
Prerequisites
- Ensure you have your Spider API key.
- Basic knowledge of making HTTP requests with Python.
- Install the Spider client for Python.
Setting Up Your Environment
pip install spider-client
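Rather than hard-coding your API key, you can read it from an environment variable before creating the client. The sketch below is only one way to do this; the SPIDER_API_KEY variable name is an assumption for illustration, not a requirement of the client.
import os
from spider import Spider

# A minimal sketch: load the key from an environment variable
# (SPIDER_API_KEY is an assumed name) and fail early if it is missing.
api_key = os.environ.get("SPIDER_API_KEY")
if not api_key:
    raise RuntimeError("Set SPIDER_API_KEY before running the crawler")

app = Spider(api_key=api_key)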
Using the Spider API
Directly Setting the Cookie
One of the simplest ways to handle authenticated pages is by setting the cookie directly in your request.
Example
from spider import Spider

app = Spider(api_key='your_api_key')

json_data = {
    "url": "http://www.example.com",
    "return_format": "markdown",
    "cookies": "mycookie"  # the cookie string to send with the request
}

response = app.crawl_url(json_data['url'], params=json_data)
print(response)
Using execution_scripts to Run Custom JavaScript
The execution_scripts parameter allows you to run custom JavaScript code on the page. This can be particularly useful for actions like logging in through a form.
Example
First, let’s create a script that handles the login process:
document.querySelector('input[name="username"]').value = 'your_username';
document.querySelector('input[name="password"]').value = 'your_password';
document.querySelector('form').submit();
Now incorporate this into the request:
from spider import Spider

app = Spider(api_key='your_api_key')

# Map each URL to the JavaScript that should run on it.
execution_scripts = {
    "http://www.example.com/login": """
        document.addEventListener("DOMContentLoaded", function() {
            document.querySelector("#username").value = "your_username";
            document.querySelector("#password").value = "your_password";
            document.querySelector("form").submit();
        });
    """
}

json_data = {
    "return_cookies": True,
    "return_headers": True,
    "request": "chrome",
    "execution_scripts": execution_scripts,
    "return_format": "markdown",
    "url": "http://www.example.com/login"
}

response = app.crawl_url(json_data['url'], params=json_data)
print(response)
Using GPT Config for AI-Driven Actions
You can use the gpt_config option to run custom AI actions that interact with the page, such as filling in forms or navigating through login sequences, using models like GPT-4. If your account does not have at least $50 in credits, you will need to provide your own OpenAI API key.
from spider import Spider

app = Spider(api_key='your_api_key')

gpt_config = {
    "prompt": ["login with the username 'henry@mail.com' and password `something`.", "extract the main article"],
    "model": "gpt-4o",
    "max_tokens": 2000,
    "temperature": 0.54,
    "top_p": 0.17,
    "api_key": None  # your_openai_api_key is only necessary if your credits are under $50
}

json_data = {
    "url": "http://www.example.com/authenticated-page",
    "gpt_config": gpt_config,
    "return_format": "markdown"
}

response = app.crawl_url(json_data['url'], params=json_data)
print(response)
GPT Configs Structure
Here is the full gpt_config structure in JSON format for your reference:
{
  "prompt": ["login with the username 'henry@mail.com' and password `something`.", "extract the main article"],
  "model": "gpt-4o",
  "max_tokens": 2000,
  "temperature": 0.54,
  "user": null,
  "top_p": 0.17,
  "prompt_url_map": null,
  "extra_ai_data": true,
  "paths_map": true,
  "screenshot": false,
  "api_key": "your_openai_api_key",
  "cache": null
}
Additional Parameters
Spider provides additional parameters to control the behavior of your crawl. Here are some useful ones:
- wait_for: Specifies various waiting conditions.
- blacklist: Blacklist certain paths from being crawled.
- whitelist: Whitelist certain paths for crawling.
- subdomains: Allow subdomains to be included.
- user_agent: Set a custom user agent.
- fingerprint: Use advanced fingerprinting for Chrome.
- storageless: Disable storing of crawled data.
- readability: Pre-process the content for reading.
- chunking_alg: Segment your content output.
For a full list of parameters, refer to the Spider API Reference.
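Several of these parameters can be combined in the same params dictionary used in the earlier examples. The sketch below is illustrative: the parameter names come from the list above, but the specific values (the blacklisted path, the user agent string, and so on) are placeholders, and the exact value formats should be confirmed against the Spider API Reference.
from spider import Spider

app = Spider(api_key='your_api_key')

# A rough sketch combining a few of the additional parameters;
# values are placeholders and formats may differ from the API reference.
json_data = {
    "url": "http://www.example.com",
    "return_format": "markdown",
    "subdomains": True,             # include subdomains in the crawl
    "blacklist": ["/logout"],       # skip paths that would end the session
    "user_agent": "MyCrawler/1.0",  # custom user agent
    "readability": True,            # pre-process content for reading
    "storageless": True             # do not store the crawled data
}

response = app.crawl_url(json_data['url'], params=json_data)
print(response)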
Conclusion
Using the Spider API, you can effectively handle authenticated pages by setting cookies directly, running custom JavaScript, or using AI-driven actions. With these methods, you can ensure your crawling tasks access the necessary protected resources.