Crawling Authenticated Pages

Two methods for crawling pages behind login walls: cookies and execution scripts.

By Jeff Mendez

Some pages require authentication before you can access their content. Spider supports two methods for handling login-protected pages:

  1. Cookies: Pass a session cookie directly in the request.
  2. Execution scripts: Run custom JavaScript to log in through a form.

Using Cookies

If you already have a valid session cookie (from your browser's dev tools or a login API), pass it with the cookies parameter:

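import os

import requests

# Read the API key from the environment rather than hard-coding it.
headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json={
        "url": "https://example.com/dashboard",
        "return_format": "markdown",
        # Session cookies copied from a logged-in browser session.
        "cookies": "session_id=abc123; auth_token=xyz789",
        "request": "chrome",
    },
)
response.raise_for_status()  # Fail early on an HTTP error such as 401.

# Print each crawled page's URL and content length.
for page in response.json():
    print(page['url'], len(page.get('content') or ''), 'chars')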

To get the cookie values, log into the site in your browser, open dev tools (F12), go to the Application tab (Storage in Firefox), and copy the name and value of each relevant cookie. Join them as name=value pairs separated by semicolons.
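Dev tools list cookies in a table, so the values usually have to be joined by hand. A minimal sketch of a helper that builds the string for the cookies parameter follows; the function name build_cookie_string is hypothetical and not part of Spider's API, and the cookie names are the placeholders from the example above.

```python
def build_cookie_string(cookies: dict) -> str:
    """Join cookie name/value pairs into 'name=value; name2=value2' form."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


# Placeholder values, copied by hand from the browser's cookie table.
cookie_string = build_cookie_string({
    "session_id": "abc123",
    "auth_token": "xyz789",
})
print(cookie_string)
```

The resulting string can be passed as the "cookies" value in the request body.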

Using Execution Scripts

The execution_scripts parameter runs JavaScript on the page before Spider extracts content. This is useful for filling in login forms programmatically.

import os

import requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

# JavaScript that fills in the login form and submits it.
# Adjust the selectors to match the target site's form.
login_script = """
document.querySelector('#username').value = 'your_username';
document.querySelector('#password').value = 'your_password';
document.querySelector('form').submit();
"""

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json={
        "url": "https://example.com/login",
        "return_format": "markdown",
        "request": "chrome",
        # Include the session cookies in the response for later requests.
        "return_cookies": True,
        # Keyed by the URL the script should run on.
        "execution_scripts": {
            "https://example.com/login": login_script,
        },
    },
)
response.raise_for_status()  # Fail early on an HTTP error.

print(response.json())

The script runs after the page loads in headless Chrome. After the form submits, Spider follows the redirect and crawls the authenticated pages. Set return_cookies: true to capture the session cookies for subsequent requests.
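Pasting credentials straight into the script string is fragile: a quote or backslash in a password would break the JavaScript. The sketch below is one way to avoid that, assuming the same form selectors as the example above. It encodes each value with json.dumps, which yields a properly escaped string literal that JavaScript also accepts. The helper names and the LOGIN_USERNAME and LOGIN_PASSWORD environment variables are hypothetical, not part of Spider's API.

```python
import json
import os


def build_login_script(username: str, password: str) -> str:
    """Return JavaScript that fills in and submits the login form.

    json.dumps produces an escaped, double-quoted literal, so quotes or
    backslashes in the credentials cannot terminate the string early.
    """
    return (
        f"document.querySelector('#username').value = {json.dumps(username)};\n"
        f"document.querySelector('#password').value = {json.dumps(password)};\n"
        "document.querySelector('form').submit();"
    )


def build_login_payload(login_url: str, username: str, password: str) -> dict:
    """Assemble the same request body as the example above for a login page."""
    return {
        "url": login_url,
        "return_format": "markdown",
        "request": "chrome",
        "return_cookies": True,
        "execution_scripts": {login_url: build_login_script(username, password)},
    }


# Hypothetical environment variable names; the fallbacks are placeholders.
payload = build_login_payload(
    "https://example.com/login",
    os.getenv("LOGIN_USERNAME", "your_username"),
    os.getenv("LOGIN_PASSWORD", "your_password"),
)
print(payload["execution_scripts"]["https://example.com/login"])
```

The payload can be sent with requests.post(..., json=payload) exactly as in the example above.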

Tips

  • Use Chrome mode: Execution scripts need a browser to run, and many authenticated pages render their content with JavaScript. Set request: "chrome".
  • Cookie format: Pass cookies as a single string of name=value pairs separated by semicolons, the same format as a Cookie request header.
  • wait_for: If the login redirects to a page that loads dynamically, use wait_for to wait for a specific element before extraction.
  • Scope with whitelist/blacklist: After login, control which paths Spider crawls using the whitelist and blacklist parameters.
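The tips above can be combined into one request body. The sketch below only builds the payload. The value shapes for wait_for, whitelist, and blacklist are assumptions that this guide does not show, so check them against the API reference before use. The helper name, selector, and paths are hypothetical.

```python
def build_scoped_payload(start_url, cookies, ready_selector, allow, deny):
    """Build a crawl request body for an already authenticated session."""
    return {
        "url": start_url,
        "return_format": "markdown",
        "request": "chrome",
        "cookies": cookies,
        # Assumed shape: wait until this CSS selector appears, up to 10 seconds.
        "wait_for": {
            "selector": {
                "selector": ready_selector,
                "timeout": {"secs": 10, "nanos": 0},
            }
        },
        # Assumed shape: lists of path patterns to include and to skip.
        "whitelist": list(allow),
        "blacklist": list(deny),
    }


payload = build_scoped_payload(
    "https://example.com/dashboard",
    "session_id=abc123; auth_token=xyz789",
    "#dashboard-content",   # hypothetical element that marks a loaded page
    allow=["/dashboard"],   # hypothetical paths
    deny=["/logout"],
)
print(sorted(payload))
```

Excluding a logout path is a common precaution, since crawling it could end the session partway through the crawl.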

For all available parameters, see the API reference.
