Web Scraping for AI Training Data: Legal and Technical Guide 2026
Every large language model, every RAG pipeline, every fine-tuned classifier starts with data. Most of that data comes from the web. The question is no longer whether scraping is useful for AI. It’s whether you can do it legally, responsibly, and at scale without exposing your organization to regulatory action or litigation.
This guide covers both sides: the legal framework as it stands in early 2026, and the technical architecture for building compliant scraping pipelines that produce high-quality training data.
Disclaimer: This post is informational and does not constitute legal advice. The legal landscape around AI training data is evolving rapidly across jurisdictions. Consult qualified legal counsel for decisions about your specific use case.
Part 1: The Legal Landscape
EU AI Act: Training Data Transparency
The EU AI Act entered into force in August 2024 and began its phased application in 2025, imposing direct obligations on anyone building or deploying AI systems within the EU. For training data, the requirements center on transparency and documentation.
Key obligations for general-purpose AI models (Article 53):
- Maintain detailed documentation of training data sources, including the domains scraped, the time period of collection, and the volume of data gathered.
- Provide a “sufficiently detailed summary” of training data content, published in a format the AI Office specifies.
- Implement a policy to comply with EU copyright law, specifically the text and data mining (TDM) opt-out mechanism under the Digital Single Market Directive (Article 4).
The TDM opt-out is critical. Under EU law, rights holders can reserve their content from text and data mining by expressing that reservation “in an appropriate manner.” In practice, this means checking robots.txt for TDM-specific directives, honoring meta tags like <meta name="robots" content="noai">, and respecting HTTP headers such as X-Robots-Tag: noai. If a site opts out, scraping that content for training purposes creates legal exposure under EU copyright law, regardless of where your servers are located, as long as your model serves EU users.
What this means for your pipeline: You need provenance tracking. Every document in your training set should have a record of where it came from, when it was collected, and whether the source had an active TDM opt-out at the time of collection. Retroactive compliance is expensive. Build this from day one.
US Fair Use Doctrine
US law takes a different approach. There is no comprehensive AI training data statute. Instead, the legality of scraping for AI training rests primarily on the fair use doctrine under copyright law and on the Computer Fraud and Abuse Act (CFAA) for access-related claims.
hiQ Labs v. LinkedIn (2022): The Ninth Circuit held that scraping publicly available data is likely not "unauthorized access" under the CFAA, so LinkedIn could not use the statute to block hiQ from collecting public profile data. The case established that accessing genuinely public web pages (no login required), even at scale, does not violate federal computer fraud law. The ruling is narrow, though: it does not address copyright, and the dispute ultimately settled after the district court found hiQ had breached LinkedIn's user agreement, a reminder that contract claims can succeed where CFAA claims fail.
Clearview AI: Multiple lawsuits and regulatory actions against Clearview AI for scraping public photos to build a facial recognition database. The Illinois BIPA case resulted in a settlement, and several countries (including the UK and Australia) found Clearview’s practices violated privacy law. The lesson: even if the data is technically public, biometric data and personal images carry elevated legal risk.
Fair use factors for AI training (17 U.S.C. 107):
- Purpose and character of use. Commercial use weighs against fair use, but transformative use (creating something fundamentally new) weighs in favor. Training a model that generates novel outputs from millions of sources has a strong transformative argument. The District of Delaware's 2025 ruling in Thomson Reuters v. Ross Intelligence rejected fair use for a legal research AI product, but the facts were narrow. The ongoing NYT v. OpenAI case will likely produce the most significant guidance.
- Nature of the copyrighted work. Factual content (news articles, encyclopedias, technical documentation) receives less copyright protection than creative works (novels, poetry, visual art). Scraping factual content for training is on firmer legal ground.
- Amount used. Training on the entirety of a copyrighted work weighs against fair use, but courts have recognized that some uses require the whole work (Google Books, for example).
- Effect on the market. If the trained model competes directly with the scraped source (a news summarizer that replaces the original article), this factor weighs heavily against fair use.
Practical takeaway: US law is unsettled. The safest position is to document your fair use rationale, avoid scraping content that your model will directly reproduce or compete with, and maintain records that demonstrate transformative use.
robots.txt: Legal Status and Best Practices
robots.txt is a protocol, not a law. No US or EU statute makes robots.txt legally binding on its own. However, its legal significance has grown.
Where robots.txt matters legally:
- EU TDM opt-out. The EU AI Act and the DSM Directive recognize robots.txt as a valid mechanism for rights holders to opt out of text and data mining. Ignoring a TDM-related robots.txt directive creates liability under EU copyright law.
- Contract law. Some courts have treated robots.txt as part of a site’s terms of use. If a site’s ToS says “you agree to follow robots.txt,” violating it becomes a breach of contract claim.
- Good faith evidence. Even where robots.txt is not legally binding, respecting it demonstrates good faith. In litigation, showing that your scraper honored robots.txt strengthens your position.
Best practices:
- Always fetch and parse robots.txt before crawling a domain.
- Respect `Disallow` directives for your user-agent and for the wildcard `*` agent.
- Check for TDM-specific directives (`Disallow` entries aimed at AI/ML bots), along with opt-out signals delivered outside robots.txt, such as `noai` meta tags and the `X-Robots-Tag` HTTP header.
- Cache robots.txt per domain and refresh it periodically (every 24 hours is standard).
- Log the robots.txt content at the time of crawl for your provenance records.
Terms of Service: Contractual vs. Technical Restrictions
Website terms of service (ToS) can create contractual obligations that go beyond what robots.txt covers. However, the enforceability of ToS against scrapers depends on how the terms are presented.
Browsewrap agreements (terms accessible via a footer link, no affirmative acceptance required) are generally difficult to enforce against automated scrapers. Courts have repeatedly held that merely visiting a website does not constitute acceptance of browsewrap terms, especially for bots that never render the page.
Clickwrap agreements (requiring a click or account creation to accept terms) are more enforceable. If your scraper creates an account, checks a box, or logs in, the ToS likely binds you.
Key considerations:
- Scraping publicly accessible pages without logging in rarely triggers enforceable ToS obligations.
- Scraping behind a login, or after account creation, likely does.
- Some ToS explicitly prohibit scraping. While enforcement is uncertain for browsewrap, the existence of such terms increases litigation risk.
- If a site sends you a cease-and-desist letter, take it seriously. Continued scraping after notice strengthens their legal position.
GDPR Considerations for Scraping EU Sites
If you scrape personal data relating to people in the EU, GDPR applies, regardless of where the website or your servers are hosted. "Personal data" is broad: names, email addresses, IP addresses, photos, usernames, and any information that can identify a natural person.
Practical GDPR requirements for scraping:
- Lawful basis. You need one. The most plausible basis for scraping is “legitimate interest” (Article 6(1)(f)), but this requires a balancing test: your interest in the data must not override the data subjects’ rights. Scraping and publishing personal data in an AI model’s outputs will almost certainly fail this test.
- Data minimization. Collect only what you need. If you’re building a language model, you likely don’t need names, emails, or phone numbers. Strip PII before it enters your training pipeline.
- Transparency. GDPR requires you to inform data subjects that you’re processing their data. This is nearly impossible to do at scale for scraped data. The practical mitigation is to remove PII so that GDPR’s personal data provisions don’t apply to your training set.
- Data subject rights. Individuals can request deletion of their data. If personal data ends up in your training set and someone exercises their right to erasure, you need a mechanism to comply.
Bottom line: The most defensible approach is to aggressively filter PII from scraped content before it enters your training pipeline. If your training data contains no personal data, most GDPR obligations fall away.
Copyright: Facts vs. Expression
Copyright protects original expression, not facts. This distinction is central to the legality of scraping for AI training.
What is not copyrightable:
- Facts, data points, statistics, and measurements.
- Ideas, concepts, and methods.
- Titles, names, short phrases (generally).
- Government works (in the US, federal government publications are public domain).
What is copyrightable:
- The specific way an author expresses facts (word choice, sentence structure, narrative arc).
- Original creative works: fiction, poetry, art, music, photography.
- Compilations of facts, if the selection and arrangement are original (Feist v. Rural Telephone).
Transformative use and AI training: The strongest legal argument for scraping copyrighted content to train AI is that the model learns statistical patterns from the text rather than copying the expression itself. The model’s outputs are new text, not reproductions of training data. This is a transformative use argument. It’s compelling but untested at the Supreme Court level. The safest approach is to focus on factual, technical, and publicly licensed content where possible, and to document your transformative use rationale for everything else.
Part 2: Building Compliant Scraping Pipelines
Architecture Overview
A compliant AI training data pipeline has four stages: collection, cleaning, documentation, and storage. Each stage carries legal and technical requirements.
URLs -> robots.txt check -> rate-limited fetch -> raw content
-> PII removal -> deduplication -> quality filtering
-> provenance tagging -> data card generation
-> clean training data (markdown/JSONL)
Respecting robots.txt Programmatically
Your scraper must fetch and parse robots.txt before making any requests to a domain. This is not optional for compliance.
Implementation requirements:
- Fetch `https://example.com/robots.txt` before any crawl of `example.com`.
- Parse `User-agent`, `Disallow`, `Allow`, and `Crawl-delay` directives.
- Match your bot's user-agent against the rules. If no specific rule matches, fall back to the `*` wildcard rules.
- Respect `Crawl-delay` if specified. This is the minimum interval between requests.
- Handle missing robots.txt (HTTP 404) as "everything allowed."
- Handle server errors (HTTP 5xx) conservatively: either wait and retry, or treat the domain as fully disallowed.
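If you are implementing this yourself rather than relying on a crawling service, Python's standard-library `urllib.robotparser` covers most of it. The sketch below is illustrative: the bot name and contact URL are placeholders, and the error handling follows the conservative defaults described above.

```python
import urllib.error
import urllib.request
from urllib.robotparser import RobotFileParser

# Placeholder identity: swap in your own bot name, policy URL, and contact address.
USER_AGENT = "MyCompanyBot/1.0 (https://mycompany.com/bot; bot@mycompany.com)"

def load_robots(domain: str) -> RobotFileParser | None:
    """Fetch and parse robots.txt for a domain. None means 'treat the domain as disallowed'."""
    rp = RobotFileParser()
    try:
        with urllib.request.urlopen(f"https://{domain}/robots.txt", timeout=10) as resp:
            rp.parse(resp.read().decode("utf-8", errors="replace").splitlines())
    except urllib.error.HTTPError as err:
        if err.code == 404:
            rp.parse([])      # missing robots.txt: everything is allowed
        else:
            return None       # other 4xx and all 5xx: stay conservative
    except OSError:
        return None           # network failure: stay conservative
    return rp

def is_allowed(rp: RobotFileParser | None, page_url: str) -> bool:
    """Check a specific URL against the parsed rules for our user-agent."""
    return rp is not None and rp.can_fetch(USER_AGENT, page_url)

def crawl_delay(rp: RobotFileParser | None, default: float = 1.5) -> float:
    """Use the site's Crawl-delay when present, otherwise a conservative default."""
    if rp is None:
        return default
    delay = rp.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default
```

In a full pipeline you would also store the raw robots.txt text (or its hash) alongside each fetch, which feeds the provenance records discussed later in this guide.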
Spider handles all of this automatically. The crawl engine fetches, parses, and caches robots.txt for every domain, and enforces Disallow and Crawl-delay directives before any page request is made.
Rate Limiting
Aggressive scraping can degrade a target site’s performance, trigger IP bans, and create legal exposure (potential tortious interference or CFAA claims for exceeding authorized access).
Best practices:
- Respect `Crawl-delay` in robots.txt.
- If no `Crawl-delay` is specified, use a conservative default (1 to 2 seconds between requests to the same domain).
- Implement per-domain rate limiting, not just global rate limiting. Hitting 100 different domains at 10 requests per second each is fine. Hitting one domain at 1,000 requests per second is not.
- Use exponential backoff on HTTP 429 (Too Many Requests) and 503 (Service Unavailable) responses.
- Identify your bot with a descriptive `User-Agent` string that includes contact information or a URL where site operators can learn about your crawler.
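If you are writing your own fetcher, a per-domain throttle with backoff takes only a few lines. This is a minimal single-process sketch, not a production scheduler: it tracks one timestamp and one backoff counter per domain.

```python
import random
import time
from collections import defaultdict

class DomainThrottle:
    """Per-domain politeness: minimum spacing between requests, plus backoff on 429/503."""

    def __init__(self, default_delay: float = 1.5):
        self.default_delay = default_delay
        self.next_allowed = defaultdict(float)   # domain -> earliest time of next request
        self.backoff = defaultdict(int)          # domain -> consecutive throttle responses

    def wait(self, domain: str, crawl_delay: float | None = None) -> None:
        """Block until this domain may be hit again, then reserve the next slot."""
        now = time.monotonic()
        if now < self.next_allowed[domain]:
            time.sleep(self.next_allowed[domain] - now)
        delay = crawl_delay if crawl_delay is not None else self.default_delay
        # Exponential backoff (with a little jitter) if the server has been pushing back.
        delay *= 2 ** self.backoff[domain]
        self.next_allowed[domain] = time.monotonic() + delay + random.uniform(0, 0.25)

    def record(self, domain: str, status_code: int) -> None:
        """Call after each response to adjust the backoff state."""
        if status_code in (429, 503):
            self.backoff[domain] = min(self.backoff[domain] + 1, 6)  # cap growth at 64x
        else:
            self.backoff[domain] = 0
```

Call `wait()` before each request to a domain and `record()` after each response.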
Spider enforces per-domain rate limiting by default. The engine tracks request timing per domain and throttles automatically based on robots.txt directives and server response codes. You don’t need to implement this yourself.
Identifying Your Bot
Transparent bot identification is both a legal best practice and a technical courtesy.
- Set a `User-Agent` header that identifies your organization and purpose. Example: `MyCompanyBot/1.0 (https://mycompany.com/bot; bot@mycompany.com)`.
- Include a URL where site operators can find your crawling policy and contact you.
- If you’re using a scraping service like Spider, the requests are made through Spider’s infrastructure with Spider’s bot identification. This is a feature, not a limitation: Spider’s bot is already recognized by major sites, and its reputation for respectful crawling reduces block rates.
Data Cleaning for Training
Raw scraped content is not training data. The gap between “HTML from the web” and “clean text suitable for model training” is where most of the engineering effort lives.
Deduplication
Web content is heavily duplicated. Boilerplate headers, footers, navigation bars, cookie banners, and syndicated content appear across thousands of pages. Training on duplicate data wastes compute, biases the model toward overrepresented content, and inflates dataset size without adding information.
Deduplication strategies:
- Exact deduplication. Hash each document (SHA-256 of the normalized text) and remove exact matches. Fast and easy, but misses near-duplicates.
- Near-duplicate detection. Use MinHash with Locality-Sensitive Hashing (LSH) to identify documents that are substantially similar (e.g., same article syndicated across multiple news sites with minor formatting differences). A Jaccard similarity threshold of 0.8 to 0.9 works well for most training data.
- Paragraph-level deduplication. For large corpora, deduplicate at the paragraph level to remove boilerplate that appears across many pages (navigation text, legal disclaimers, cookie notices) even when the surrounding content is unique.
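As a concrete starting point, here is a minimal sketch of exact and near-duplicate filtering using only the standard library. The pairwise Jaccard comparison is quadratic, so at corpus scale you would swap it for MinHash with LSH, but the logic is the same.

```python
import hashlib
import re

def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip().lower()

def content_hash(text: str) -> str:
    """SHA-256 of the normalized text, used for exact deduplication."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def shingles(text: str, n: int = 5) -> set[str]:
    """Word n-gram shingles used for Jaccard similarity."""
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def deduplicate(docs: list[str], threshold: float = 0.85) -> list[str]:
    """Drop exact duplicates by hash, then near-duplicates by pairwise Jaccard.
    The pairwise pass is O(n^2); replace it with MinHash + LSH for large corpora."""
    seen_hashes: set[str] = set()
    kept: list[tuple[str, set[str]]] = []
    for doc in docs:
        h = content_hash(doc)
        if h in seen_hashes:
            continue
        seen_hashes.add(h)
        sh = shingles(doc)
        if any(jaccard(sh, other) >= threshold for _, other in kept):
            continue
        kept.append((doc, sh))
    return [doc for doc, _ in kept]
```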
PII Removal
Removing personally identifiable information is both a GDPR requirement and a model safety practice. PII in training data can lead to memorization, where the model reproduces personal information in its outputs.
PII removal pipeline:
- Named entity recognition (NER). Use a pre-trained NER model to identify person names, organizations, locations, and other entities.
- Pattern matching. Regex-based detection for structured PII: email addresses, phone numbers, social security numbers, credit card numbers, IP addresses.
- Replacement strategy. Replace detected PII with placeholder tokens (`[EMAIL]`, `[PHONE]`, `[NAME]`) rather than deleting it. This preserves sentence structure for training while removing the sensitive content.
- Validation. Run a second pass with a different detection method to catch what the first pass missed. No single approach catches everything.
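A minimal pattern-matching pass might look like the sketch below. The regexes are illustrative and will both over- and under-match; in a real pipeline they sit alongside an NER pass for names and organizations, plus a second validation method.

```python
import re

# Regex pass for structured PII only. Names and organizations need an NER model;
# these patterns are deliberately loose and will catch some non-PII (e.g. dates).
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[CARD]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "[PHONE]"),
]

def scrub_pii(text: str) -> tuple[str, int]:
    """Replace structured PII with placeholder tokens; returns (text, replacement_count)."""
    total = 0
    for pattern, token in PII_PATTERNS:
        text, n = pattern.subn(token, text)
        total += n
    return text, total
```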
Quality Filtering
Not all web content is useful for training. Quality filtering removes content that would degrade model performance.
Common quality signals:
| Signal | What it catches | Typical threshold |
|---|---|---|
| Document length | Stub pages, error pages, empty templates | Minimum 50 to 100 words |
| Language detection | Pages in languages outside your target set | fastText LID confidence > 0.8 |
| Perplexity scoring | Machine-generated spam, keyword-stuffed pages | Remove top 5% highest perplexity |
| Character ratio | Pages that are mostly code, markup, or special characters | Alphabetic character ratio > 0.7 |
| Repetition ratio | Pages with excessive repeated phrases or paragraphs | Duplicate n-gram ratio < 0.3 |
| Adult/toxic content | Content that would make the model produce harmful outputs | Classifier-based filtering |
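The model-based signals in the table (language ID, perplexity, toxicity) need dedicated models, but the cheap heuristics are a few lines of Python. The thresholds below are illustrative defaults, not tuned values.

```python
import re
from collections import Counter

def quality_ok(text: str,
               min_words: int = 75,
               min_alpha_ratio: float = 0.7,
               max_dup_ngram_ratio: float = 0.3,
               n: int = 3) -> bool:
    """Cheap heuristic filters: length, character composition, repetition.
    Language ID, perplexity, and toxicity scoring run as separate model-based passes."""
    words = text.split()
    if len(words) < min_words:
        return False

    # Share of alphabetic characters among non-whitespace characters.
    non_ws = re.sub(r"\s", "", text)
    if not non_ws or sum(c.isalpha() for c in non_ws) / len(non_ws) < min_alpha_ratio:
        return False

    # Fraction of word n-grams that repeat an already-seen n-gram.
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    if ngrams:
        dup_ratio = 1 - len(Counter(ngrams)) / len(ngrams)
        if dup_ratio > max_dup_ngram_ratio:
            return False
    return True
```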
Documentation Requirements
The EU AI Act requires documentation. Even outside the EU, good documentation protects you legally, helps reproduce your results, and makes audits tractable.
Provenance Tracking
Every document in your training set should have a metadata record:
{
"source_url": "https://example.com/article/12345",
"domain": "example.com",
"crawl_timestamp": "2026-02-15T08:30:00Z",
"robots_txt_status": "allowed",
"robots_txt_snapshot": "sha256:abc123...",
"http_status": 200,
"content_type": "text/html",
"content_hash": "sha256:def456...",
"license_detected": "CC-BY-4.0",
"pii_removed": true,
"pii_removal_method": "ner+regex_v2",
"quality_score": 0.87,
"word_count": 1423,
"language": "en",
"tdm_opt_out": false
}
This metadata serves multiple purposes: legal compliance (EU AI Act Article 53), reproducibility (you can re-crawl or verify any document), and filtering (you can remove entire domains or time ranges from your dataset without reprocessing everything).
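A small helper that assembles this record at crawl time and appends it to a JSONL log is usually enough to start. The sketch below covers a subset of the fields in the example record above; the inputs are whatever your fetcher already has in hand.

```python
import hashlib
import json
from datetime import datetime, timezone

def provenance_record(url: str, domain: str, status: int, content_type: str,
                      text: str, robots_txt: str, allowed_by_robots: bool,
                      tdm_opt_out: bool) -> dict:
    """Assemble a provenance record mirroring (a subset of) the schema above."""
    return {
        "source_url": url,
        "domain": domain,
        "crawl_timestamp": datetime.now(timezone.utc).isoformat(),
        "robots_txt_status": "allowed" if allowed_by_robots else "disallowed",
        "robots_txt_snapshot": "sha256:" + hashlib.sha256(robots_txt.encode()).hexdigest(),
        "http_status": status,
        "content_type": content_type,
        "content_hash": "sha256:" + hashlib.sha256(text.encode()).hexdigest(),
        "word_count": len(text.split()),
        "tdm_opt_out": tdm_opt_out,
    }

def append_record(path: str, record: dict) -> None:
    """Append one record per line to a JSONL provenance log."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```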
Data Cards
A data card is a standardized document that describes your training dataset. It should include:
- Dataset name and version.
- Collection methodology. How was the data gathered? What tools were used? What filters were applied?
- Source distribution. How many domains? What categories of content? What languages?
- Temporal coverage. When was the data collected? Over what time period?
- Known limitations. What biases exist in the data? What content types are overrepresented or underrepresented?
- PII handling. What PII detection and removal methods were used?
- Legal basis. What is the legal justification for the data collection? How were opt-outs handled?
- Contact information. Who is responsible for the dataset?
Format Optimization: Markdown Over HTML
If you’re scraping web content for AI training, the format of your cleaned output matters more than most teams realize.
Raw HTML is a poor training format. A typical web page is 60% to 80% markup by character count: <div> tags, CSS classes, inline styles, script blocks, SVG paths, tracking pixels, and ad containers. All of that markup consumes tokens during training without contributing meaningful semantic content. Training on raw HTML teaches the model to generate HTML, which is rarely what you want.
Markdown preserves structure without noise. Converting HTML to clean markdown strips all presentational markup while preserving the semantic structure that matters: headings, paragraphs, lists, tables, links, code blocks, and emphasis. The result is typically 70% to 90% smaller by token count than the source HTML.
Concrete comparison:
| Metric | Raw HTML | Clean Markdown |
|---|---|---|
| Typical token count (1 page) | 3,000 to 15,000 | 500 to 2,000 |
| Semantic signal ratio | 20% to 40% | 85% to 95% |
| Training compute per page | High | Low |
| Noise (nav, ads, boilerplate) | Present | Removed |
Spider returns clean markdown by default when you set return_format: "markdown". The conversion strips navigation, ads, footers, cookie banners, and boilerplate, then outputs well-structured markdown with headings, lists, and tables preserved. For training data pipelines, this means you can go directly from Spider’s API response to your training data store with minimal post-processing.
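As a sketch of what "directly from the API response to your training data store" can look like, the snippet below requests markdown for a crawl and writes one JSONL row per page. The endpoint, parameters, and response shape shown here are assumptions built around Spider's `return_format` option; check the current API reference before relying on them.

```python
import json
import os
import requests  # third-party dependency: pip install requests

# Assumed endpoint and auth scheme; verify against Spider's API documentation.
API_URL = "https://api.spider.cloud/crawl"
API_KEY = os.environ["SPIDER_API_KEY"]

def crawl_to_jsonl(start_url: str, out_path: str, limit: int = 50) -> None:
    """Crawl a site via Spider, requesting markdown, and write one JSONL row per page."""
    payload = {"url": start_url, "return_format": "markdown", "limit": limit}
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
        json=payload,
        timeout=120,
    )
    resp.raise_for_status()
    with open(out_path, "a", encoding="utf-8") as f:
        # Assumed response shape: a list of objects with "url" and "content" fields.
        for page in resp.json():
            if page.get("content"):
                row = {"url": page.get("url"), "text": page["content"]}
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
```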
Part 3: How Spider Fits Into a Compliant Pipeline
Spider is a scraping engine, not a compliance tool. It does not track TDM opt-out signals in HTML meta tags (<meta name="robots" content="noai">) or HTTP response headers (X-Robots-Tag: noai), does not perform PII removal, and does not generate provenance metadata beyond standard HTTP response fields. What it does provide is robots.txt enforcement, per-domain rate limiting, and clean markdown output — which covers the collection stage of a compliant pipeline. You still need to build the filtering, documentation, and cleaning stages yourself.
Built-in robots.txt Compliance
Every crawl begins with a robots.txt fetch. The engine parses User-agent, Disallow, Allow, and Crawl-delay directives and enforces them before any page request. This is enabled by default. Note that robots.txt enforcement covers path-level blocking only. It does not detect TDM opt-out signals in HTML meta tags or HTTP headers. If your compliance requirements include those signals, you need a post-fetch filtering step.
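A post-fetch filter for those signals can be as simple as the sketch below. Two caveats: `noai`-style tokens are a de facto convention rather than a formal standard, and regex scanning of HTML is brittle compared to a real parser, so treat this as a starting point rather than a complete opt-out detector.

```python
import re

# "noai"-style tokens are a convention, not a standard; extend the set as needed.
NOAI_TOKENS = {"noai", "noimageai"}

META_TAG = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)
CONTENT_ATTR = re.compile(r'content\s*=\s*["\']([^"\']*)["\']', re.IGNORECASE)

def tdm_opt_out(html: str, headers: dict[str, str]) -> bool:
    """Return True if the page signals an AI/TDM opt-out via X-Robots-Tag or a meta tag."""
    # Header check, e.g. X-Robots-Tag: noai
    header_val = next((v for k, v in headers.items() if k.lower() == "x-robots-tag"), "")
    header_tokens = {t.strip().lower() for t in header_val.split(",")}
    if header_tokens & NOAI_TOKENS:
        return True
    # Meta tag check, e.g. <meta name="robots" content="noai">
    for tag in META_TAG.findall(html):
        m = CONTENT_ATTR.search(tag)
        if not m:
            continue
        tokens = {t.strip().lower() for t in m.group(1).split(",")}
        if tokens & NOAI_TOKENS:
            return True
    return False
```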
Automatic Rate Limiting
Spider enforces per-domain rate limiting based on Crawl-delay directives in robots.txt. When no Crawl-delay is specified, the engine applies conservative defaults to avoid overloading target servers. HTTP 429 and 503 responses trigger automatic backoff. The result is that your crawl respects the target site’s capacity without you writing any throttling logic.
Clean Markdown Output
Set return_format: "markdown" and Spider returns content with HTML markup, navigation, ads, and boilerplate removed. The output is ready for tokenization and training with minimal post-processing. For teams building training datasets, this eliminates the entire HTML-to-text conversion pipeline.
Metadata for Provenance
Every Spider API response includes metadata: the source URL, HTTP status code, content type, and response headers. This metadata feeds directly into your provenance tracking system. Combined with the timestamp of your API call and the robots.txt status (which Spider enforces), you have the core fields needed for EU AI Act documentation.
Proxy Infrastructure That Respects Target Sites
Spider routes requests through datacenter, residential, and mobile proxies. The proxy layer handles IP rotation, geographic targeting, and anti-bot bypass. Critically, the proxy infrastructure is designed for sustainable scraping: requests are distributed across IPs to avoid concentrating load on any single target, and retry logic respects server signals (429s, 503s, connection resets) rather than hammering through them.
Compliance Checklist
Use this checklist before starting any web scraping project for AI training data.
| Category | Requirement | Status |
|---|---|---|
| robots.txt | Fetch and parse robots.txt before crawling each domain | Required |
| robots.txt | Respect Disallow directives for your user-agent | Required |
| robots.txt | Respect Crawl-delay directives | Required |
| robots.txt | Log robots.txt content at time of crawl | Recommended |
| TDM opt-out | Check for TDM opt-out signals (robots.txt, meta tags, HTTP headers) | Required (EU) |
| TDM opt-out | Exclude opted-out content from training data | Required (EU) |
| Rate limiting | Implement per-domain rate limiting | Required |
| Rate limiting | Honor HTTP 429/503 with exponential backoff | Required |
| Bot identification | Set descriptive User-Agent with contact info | Recommended |
| Terms of service | Avoid scraping behind login walls unless ToS permits it | Recommended |
| Terms of service | Honor cease-and-desist requests | Required |
| PII removal | Run NER and pattern-based PII detection | Required (GDPR) |
| PII removal | Replace PII with placeholder tokens | Required (GDPR) |
| PII removal | Validate with a second detection pass | Recommended |
| Deduplication | Exact deduplication (hash-based) | Recommended |
| Deduplication | Near-duplicate detection (MinHash/LSH) | Recommended |
| Quality filtering | Filter by document length, language, and quality signals | Recommended |
| Documentation | Maintain provenance metadata for every document | Required (EU AI Act) |
| Documentation | Create and publish a data card for the dataset | Required (EU AI Act) |
| Documentation | Document fair use rationale (US) | Recommended |
| Copyright | Assess transformative use for copyrighted content | Recommended |
| Copyright | Prefer factual, technical, and openly licensed content | Recommended |
| Format | Convert to markdown or plain text before training | Recommended |
Looking Ahead
The legal ground under AI training data is shifting fast. The EU AI Act enforcement timeline continues through 2026 and 2027, with new obligations taking effect at each phase. In the US, the NYT v. OpenAI case and several other pending lawsuits will produce precedent that could reshape fair use analysis for AI training. Multiple states are considering their own AI legislation, and the Copyright Office has signaled interest in rulemaking.
The teams that will navigate this well are the ones building compliance into their data pipelines now, not retrofitting it after a regulatory action or lawsuit. The technical requirements (provenance tracking, TDM opt-out detection, PII removal, robots.txt respect, deduplication, and documentation) are all solvable today. They are not overhead; they are the foundation that lets you keep training models as the legal requirements tighten. The legal questions are harder, and they require qualified counsel, not a blog post.
Empower any project with AI-ready data
Join thousands of developers using Spider to power their data pipelines.