Crawling Authenticated Pages
Some pages require authentication before you can access their content. Spider supports two methods for handling login-protected pages:
- Cookies: Pass a session cookie directly in the request.
- Execution scripts: Run custom JavaScript to log in through a form.
Setting a Cookie
If you already have a valid session cookie (from your browser dev tools or a login API), pass it with the cookies parameter:
import os
import requests

# Authenticate to the Spider API with the key stored in SPIDER_API_KEY.
headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

response = requests.post('https://api.spider.cloud/crawl',
    headers=headers,
    json={
        "url": "https://example.com/dashboard",
        "return_format": "markdown",
        # Session cookies as a single semicolon-separated string.
        "cookies": "session_id=abc123; auth_token=xyz789",
        "request": "chrome"
    }
)

# Print each crawled page's URL and the length of its content.
for page in response.json():
    print(page['url'], len(page.get('content', '')), 'chars')

To get the cookie value, log into the site in your browser, open dev tools (F12), go to the Application tab (Storage in Firefox), and copy the relevant cookies.
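Dev tools list cookies one per row, so they have to be joined into the single string that the cookies parameter expects. A minimal sketch of that step, assuming a hypothetical helper named build_cookie_string (it is not part of Spider) and the cookie names from the example above:

```python
def build_cookie_string(cookies: dict[str, str]) -> str:
    """Join name/value pairs into a 'name=value; name=value' string.

    Hypothetical helper for illustration; not part of the Spider API.
    """
    parts = []
    for name, value in cookies.items():
        name, value = name.strip(), value.strip()
        # Semicolons separate cookie pairs, so neither part may contain one,
        # and a name may not contain '=' or be empty.
        if not name or ";" in name or "=" in name or ";" in value:
            raise ValueError(f"invalid cookie pair: {name!r}")
        parts.append(f"{name}={value}")
    return "; ".join(parts)


cookie_string = build_cookie_string({
    "session_id": "abc123",
    "auth_token": "xyz789",
})
print(cookie_string)  # session_id=abc123; auth_token=xyz789
```

The resulting string can be passed as the value of "cookies" in the request body.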
Using Execution Scripts
The execution_scripts parameter runs JavaScript on the page before Spider extracts content. This is useful for filling in login forms programmatically.
import os
import requests

# Authenticate to the Spider API with the key stored in SPIDER_API_KEY.
headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

# JavaScript that fills in the login form and submits it.
# Adjust the selectors to match the target site's form.
login_script = """
document.querySelector('#username').value = 'your_username';
document.querySelector('#password').value = 'your_password';
document.querySelector('form').submit();
"""

response = requests.post('https://api.spider.cloud/crawl',
    headers=headers,
    json={
        "url": "https://example.com/login",
        "return_format": "markdown",
        "request": "chrome",
        "return_cookies": True,
        # Map each page URL to the script that should run on it.
        "execution_scripts": {
            "https://example.com/login": login_script
        }
    }
)

print(response.json())

The script runs after the page loads in headless Chrome. After the form submits, Spider follows the redirect and crawls the authenticated pages. Set return_cookies: true to capture the session cookies for subsequent requests.
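Credentials written directly into the script string are easy to leak, and a password containing a quote character would break the JavaScript. One possible alternative is sketched here, assuming a hypothetical helper build_login_script and hypothetical environment variable names SITE_USERNAME and SITE_PASSWORD (none of these are part of Spider). It uses json.dumps to turn each value into an escaped JavaScript string literal:

```python
import json
import os


def build_login_script(username: str, password: str) -> str:
    """Return login JavaScript with both values escaped as JS string literals.

    Hypothetical helper; the selectors match the example above and will
    differ on a real site.
    """
    return (
        f"document.querySelector('#username').value = {json.dumps(username)};\n"
        f"document.querySelector('#password').value = {json.dumps(password)};\n"
        "document.querySelector('form').submit();"
    )


# SITE_USERNAME and SITE_PASSWORD are assumed names, not Spider settings.
login_script = build_login_script(
    os.getenv("SITE_USERNAME", "your_username"),
    os.getenv("SITE_PASSWORD", "your_password"),
)
```

The returned string can be used as the value for the login URL in execution_scripts, in place of the hard-coded script.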
Tips
- Use Chrome mode: Authentication flows require JavaScript rendering. Always set request: "chrome".
- Cookie format: Pass cookies as a semicolon-separated string, matching the format from browser dev tools.
- wait_for: If the login redirects to a page that loads dynamically, use wait_for to wait for a specific element before extraction.
- Scope with whitelist/blacklist: After login, control which paths Spider crawls using the whitelist and blacklist parameters.
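The scoping tip can be sketched as a request body. This is an illustration under assumptions: build_scoped_payload is a hypothetical helper, the paths are made up, and passing whitelist and blacklist as lists of path patterns should be confirmed against the API reference.

```python
def build_scoped_payload(url: str, cookies: str,
                         whitelist: list[str], blacklist: list[str]) -> dict:
    """Build a request body that limits the crawl to chosen paths.

    Hypothetical helper. The list-of-patterns shape for whitelist and
    blacklist is an assumption; check the API reference.
    """
    return {
        "url": url,
        "return_format": "markdown",
        "request": "chrome",  # authentication flows need JavaScript rendering
        "cookies": cookies,
        "whitelist": whitelist,
        "blacklist": blacklist,
    }


payload = build_scoped_payload(
    "https://example.com/dashboard",
    "session_id=abc123; auth_token=xyz789",
    whitelist=["/dashboard", "/reports"],  # hypothetical paths to include
    blacklist=["/logout"],                 # avoid ending the session mid-crawl
)
```

Excluding the logout path is worth considering on any authenticated crawl, since following that link can invalidate the session for the remaining pages.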
For all available parameters, see the API reference.