Web Scraping for AI Training Data: Legal and Technical Guide 2026
Every large language model, every RAG pipeline, every fine-tuned classifier starts with data. Most of that data comes from the web. The question is no longer whether scraping is useful for AI. It’s whether you can do it legally, responsibly, and at scale without exposing your organization to regulatory action or litigation.
This guide covers both sides: the legal framework as it stands in early 2026, and the technical architecture for building compliant scraping pipelines that produce high-quality training data.
Disclaimer: This post is informational and does not constitute legal advice. The legal landscape around AI training data is evolving rapidly across jurisdictions. Consult qualified legal counsel for decisions about your specific use case.
Part 1: The Legal Landscape
EU AI Act: Training Data Transparency
The EU AI Act entered into force in August 2024 and began its phased application in 2025, imposing direct obligations on anyone building or deploying AI systems within the EU. For training data, the requirements center on transparency and documentation.
Key obligations for general-purpose AI models (Article 53):
- Maintain detailed documentation of training data sources, including the domains scraped, the time period of collection, and the volume of data gathered.
- Provide a “sufficiently detailed summary” of training data content, published in a format the AI Office specifies.
- Implement a policy to comply with EU copyright law, specifically the text and data mining (TDM) opt-out mechanism under the Digital Single Market Directive (Article 4).
The TDM opt-out is critical. Under EU law, rights holders can reserve their content from text and data mining by expressing that reservation “in an appropriate manner.” In practice, this means checking robots.txt for TDM-specific directives, honoring meta tags like <meta name="robots" content="noai">, and respecting HTTP headers such as X-Robots-Tag: noai. If a site opts out, scraping that content for training purposes creates legal exposure under EU copyright law, regardless of where your servers are located, as long as your model serves EU users.
What this means for your pipeline: You need provenance tracking. Every document in your training set should have a record of where it came from, when it was collected, and whether the source had an active TDM opt-out at the time of collection. Retroactive compliance is expensive. Build this from day one.
US Fair Use Doctrine
US law takes a different approach. There is no comprehensive AI training data statute. Instead, the legality of scraping for AI training rests primarily on the fair use doctrine under copyright law and on the Computer Fraud and Abuse Act (CFAA) for access-related claims.
hiQ Labs v. LinkedIn (2022): The Ninth Circuit held that scraping publicly available data is likely not "unauthorized access" under the CFAA, so LinkedIn could not use the statute to block hiQ from collecting public profile data. The case established that accessing genuinely public web pages (no login required), even at scale, does not violate federal computer fraud law. The ruling is narrow, though: it does not address copyright, and the dispute ultimately settled after the district court found hiQ had breached LinkedIn's user agreement, a reminder that contract claims can succeed where CFAA claims fail.
Clearview AI: Multiple lawsuits and regulatory actions against Clearview AI for scraping public photos to build a facial recognition database. The Illinois BIPA case resulted in a settlement, and several countries (including the UK and Australia) found Clearview’s practices violated privacy law. The lesson: even if the data is technically public, biometric data and personal images carry elevated legal risk.
Fair use factors for AI training (17 U.S.C. 107):
- Purpose and character of use. Commercial use weighs against fair use, but transformative use (creating something fundamentally new) weighs in favor. Training a model that generates novel outputs from millions of sources has a strong transformative argument. The District of Delaware's 2025 ruling in Thomson Reuters v. Ross Intelligence rejected fair use for a legal research AI product, but the facts were narrow. The ongoing NYT v. OpenAI case will likely produce the most significant guidance.
- Nature of the copyrighted work. Factual content (news articles, encyclopedias, technical documentation) receives less copyright protection than creative works (novels, poetry, visual art). Scraping factual content for training is on firmer legal ground.
- Amount used. Training on the entirety of a copyrighted work weighs against fair use, but courts have recognized that some uses require the whole work (Google Books, for example).
- Effect on the market. If the trained model competes directly with the scraped source (a news summarizer that replaces the original article), this factor weighs heavily against fair use.
Practical takeaway: US law is unsettled. The safest position is to document your fair use rationale, avoid scraping content that your model will directly reproduce or compete with, and maintain records that demonstrate transformative use.
robots.txt: Legal Status and Best Practices
robots.txt is a protocol, not a law. No US or EU statute makes robots.txt legally binding on its own. However, its legal significance has grown.
Where robots.txt matters legally:
- EU TDM opt-out. The EU AI Act and the DSM Directive recognize robots.txt as a valid mechanism for rights holders to opt out of text and data mining. Ignoring a TDM-related robots.txt directive creates liability under EU copyright law.
- Contract law. Some courts have treated robots.txt as part of a site’s terms of use. If a site’s ToS says “you agree to follow robots.txt,” violating it becomes a breach of contract claim.
- Good faith evidence. Even where robots.txt is not legally binding, respecting it demonstrates good faith. In litigation, showing that your scraper honored robots.txt strengthens your position.
Best practices:
- Always fetch and parse robots.txt before crawling a domain.
- Respect `Disallow` directives for your user-agent and for the wildcard `*` agent.
- Check for TDM-specific directives (`Disallow` entries aimed at AI/ML bots), along with opt-out signals delivered outside robots.txt, such as `noai` meta tags and the `X-Robots-Tag` HTTP header.
- Cache robots.txt per domain and refresh it periodically (every 24 hours is standard).
- Log the robots.txt content at the time of crawl for your provenance records.
Terms of Service: Contractual vs. Technical Restrictions
Website terms of service (ToS) can create contractual obligations that go beyond what robots.txt covers. However, the enforceability of ToS against scrapers depends on how the terms are presented.
Browsewrap agreements (terms accessible via a footer link, no affirmative acceptance required) are generally difficult to enforce against automated scrapers. Courts have repeatedly held that merely visiting a website does not constitute acceptance of browsewrap terms, especially for bots that never render the page.
Clickwrap agreements (requiring a click or account creation to accept terms) are more enforceable. If your scraper creates an account, checks a box, or logs in, the ToS likely binds you.
Key considerations:
- Scraping publicly accessible pages without logging in rarely triggers enforceable ToS obligations.
- Scraping behind a login, or after account creation, likely does.
- Some ToS explicitly prohibit scraping. While enforcement is uncertain for browsewrap, the existence of such terms increases litigation risk.
- If a site sends you a cease-and-desist letter, take it seriously. Continued scraping after notice strengthens their legal position.
GDPR Considerations for Scraping EU Sites
If you scrape personal data relating to people in the EU, GDPR applies, regardless of where the website or your servers are hosted. "Personal data" is broad: names, email addresses, IP addresses, photos, usernames, and any information that can identify a natural person.
Practical GDPR requirements for scraping:
- Lawful basis. You need one. The most plausible basis for scraping is “legitimate interest” (Article 6(1)(f)), but this requires a balancing test: your interest in the data must not override the data subjects’ rights. Scraping and publishing personal data in an AI model’s outputs will almost certainly fail this test.
- Data minimization. Collect only what you need. If you’re building a language model, you likely don’t need names, emails, or phone numbers. Strip PII before it enters your training pipeline.
- Transparency. GDPR requires you to inform data subjects that you’re processing their data. This is nearly impossible to do at scale for scraped data. The practical mitigation is to remove PII so that GDPR’s personal data provisions don’t apply to your training set.
- Data subject rights. Individuals can request deletion of their data. If personal data ends up in your training set and someone exercises their right to erasure, you need a mechanism to comply.
Bottom line: The most defensible approach is to aggressively filter PII from scraped content before it enters your training pipeline. If your training data contains no personal data, most GDPR obligations fall away.
Copyright: Facts vs. Expression
Copyright protects original expression, not facts. This distinction is central to the legality of scraping for AI training.
What is not copyrightable:
- Facts, data points, statistics, and measurements.
- Ideas, concepts, and methods.
- Titles, names, short phrases (generally).
- Government works (in the US, federal government publications are public domain).
What is copyrightable:
- The specific way an author expresses facts (word choice, sentence structure, narrative arc).
- Original creative works: fiction, poetry, art, music, photography.
- Compilations of facts, if the selection and arrangement are original (Feist v. Rural Telephone).
Transformative use and AI training: The strongest legal argument for scraping copyrighted content to train AI is that the model learns statistical patterns from the text rather than copying the expression itself. The model’s outputs are new text, not reproductions of training data. This is a transformative use argument. It’s compelling but untested at the Supreme Court level. The safest approach is to focus on factual, technical, and publicly licensed content where possible, and to document your transformative use rationale for everything else.
Part 2: Building Compliant Scraping Pipelines
Architecture Overview
A compliant AI training data pipeline has four stages: collection, cleaning, documentation, and storage. Each stage carries legal and technical requirements.
URLs -> robots.txt check -> rate-limited fetch -> raw content
-> PII removal -> deduplication -> quality filtering
-> provenance tagging -> data card generation
-> clean training data (markdown/JSONL)
Respecting robots.txt Programmatically
Your scraper must fetch and parse robots.txt before making any requests to a domain. This is not optional for compliance.
Implementation requirements:
- Fetch `https://example.com/robots.txt` before any crawl of `example.com`.
- Parse `User-agent`, `Disallow`, `Allow`, and `Crawl-delay` directives.
- Match your bot's user-agent against the rules. If no specific rule matches, fall back to the `*` wildcard rules.
- Respect `Crawl-delay` if specified. This is the minimum interval between requests.
- Handle missing robots.txt (HTTP 404) as "everything allowed."
- Handle server errors (HTTP 5xx) conservatively: either wait and retry, or treat the domain as fully disallowed.
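If you are implementing this yourself rather than relying on a crawling service, Python's standard-library `urllib.robotparser` covers most of it. The sketch below is illustrative: the bot name and contact URL are placeholders, and the error handling follows the conservative defaults described above.

```python
import urllib.error
import urllib.request
from urllib.robotparser import RobotFileParser

# Placeholder identity: swap in your own bot name, policy URL, and contact address.
USER_AGENT = "MyCompanyBot/1.0 (https://mycompany.com/bot; bot@mycompany.com)"

def load_robots(domain: str) -> RobotFileParser | None:
    """Fetch and parse robots.txt for a domain. None means 'treat the domain as disallowed'."""
    rp = RobotFileParser()
    try:
        with urllib.request.urlopen(f"https://{domain}/robots.txt", timeout=10) as resp:
            rp.parse(resp.read().decode("utf-8", errors="replace").splitlines())
    except urllib.error.HTTPError as err:
        if err.code == 404:
            rp.parse([])      # missing robots.txt: everything is allowed
        else:
            return None       # other 4xx and all 5xx: stay conservative
    except OSError:
        return None           # network failure: stay conservative
    return rp

def is_allowed(rp: RobotFileParser | None, page_url: str) -> bool:
    """Check a specific URL against the parsed rules for our user-agent."""
    return rp is not None and rp.can_fetch(USER_AGENT, page_url)

def crawl_delay(rp: RobotFileParser | None, default: float = 1.5) -> float:
    """Use the site's Crawl-delay when present, otherwise a conservative default."""
    if rp is None:
        return default
    delay = rp.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default
```

In a full pipeline you would also store the raw robots.txt text (or its hash) alongside each fetch, which feeds the provenance records discussed later in this guide.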
Spider handles all of this automatically. The crawl engine fetches, parses, and caches robots.txt for every domain, and enforces Disallow and Crawl-delay directives before any page request is made.
Rate Limiting
Aggressive scraping can degrade a target site’s performance, trigger IP bans, and create legal exposure (potential tortious interference or CFAA claims for exceeding authorized access).
Best practices:
- Respect `Crawl-delay` in robots.txt.
- If no `Crawl-delay` is specified, use a conservative default (1 to 2 seconds between requests to the same domain).
- Implement per-domain rate limiting, not just global rate limiting. Hitting 100 different domains at 10 requests per second each is fine. Hitting one domain at 1,000 requests per second is not.
- Use exponential backoff on HTTP 429 (Too Many Requests) and 503 (Service Unavailable) responses.
- Identify your bot with a descriptive `User-Agent` string that includes contact information or a URL where site operators can learn about your crawler.
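If you are writing your own fetcher, a per-domain throttle with backoff takes only a few lines. This is a minimal single-process sketch, not a production scheduler: it tracks one timestamp and one backoff counter per domain.

```python
import random
import time
from collections import defaultdict

class DomainThrottle:
    """Per-domain politeness: minimum spacing between requests, plus backoff on 429/503."""

    def __init__(self, default_delay: float = 1.5):
        self.default_delay = default_delay
        self.next_allowed = defaultdict(float)   # domain -> earliest time of next request
        self.backoff = defaultdict(int)          # domain -> consecutive throttle responses

    def wait(self, domain: str, crawl_delay: float | None = None) -> None:
        """Block until this domain may be hit again, then reserve the next slot."""
        now = time.monotonic()
        if now < self.next_allowed[domain]:
            time.sleep(self.next_allowed[domain] - now)
        delay = crawl_delay if crawl_delay is not None else self.default_delay
        # Exponential backoff (with a little jitter) if the server has been pushing back.
        delay *= 2 ** self.backoff[domain]
        self.next_allowed[domain] = time.monotonic() + delay + random.uniform(0, 0.25)

    def record(self, domain: str, status_code: int) -> None:
        """Call after each response to adjust the backoff state."""
        if status_code in (429, 503):
            self.backoff[domain] = min(self.backoff[domain] + 1, 6)  # cap growth at 64x
        else:
            self.backoff[domain] = 0
```

Call `wait()` before each request to a domain and `record()` after each response.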
Spider enforces per-domain rate limiting by default. The engine tracks request timing per domain and throttles automatically based on robots.txt directives and server response codes. You don’t need to implement this yourself.
Identifying Your Bot
Transparent bot identification is both a legal best practice and a technical courtesy.
- Set a `User-Agent` header that identifies your organization and purpose. Example: `MyCompanyBot/1.0 (https://mycompany.com/bot; bot@mycompany.com)`.
- Include a URL where site operators can find your crawling policy and contact you.
- If you’re using a scraping service like Spider, the requests are made through Spider’s infrastructure with Spider’s bot identification. This is a feature, not a limitation: Spider’s bot is already recognized by major sites, and its reputation for respectful crawling reduces block rates.
Data Cleaning for Training
Raw scraped content is not training data. The gap between “HTML from the web” and “clean text suitable for model training” is where most of the engineering effort lives.
Deduplication
Web content is heavily duplicated. Boilerplate headers, footers, navigation bars, cookie banners, and syndicated content appear across thousands of pages. Training on duplicate data wastes compute, biases the model toward overrepresented content, and inflates dataset size without adding information.
Deduplication strategies:
- Exact deduplication. Hash each document (SHA-256 of the normalized text) and remove exact matches. Fast and easy, but misses near-duplicates.
- Near-duplicate detection. Use MinHash with Locality-Sensitive Hashing (LSH) to identify documents that are substantially similar (e.g., same article syndicated across multiple news sites with minor formatting differences). A Jaccard similarity threshold of 0.8 to 0.9 works well for most training data.
- Paragraph-level deduplication. For large corpora, deduplicate at the paragraph level to remove boilerplate that appears across many pages (navigation text, legal disclaimers, cookie notices) even when the surrounding content is unique.
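As a concrete starting point, here is a minimal sketch of exact and near-duplicate filtering using only the standard library. The pairwise Jaccard comparison is quadratic, so at corpus scale you would swap it for MinHash with LSH, but the logic is the same.

```python
import hashlib
import re

def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip().lower()

def content_hash(text: str) -> str:
    """SHA-256 of the normalized text, used for exact deduplication."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def shingles(text: str, n: int = 5) -> set[str]:
    """Word n-gram shingles used for Jaccard similarity."""
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def deduplicate(docs: list[str], threshold: float = 0.85) -> list[str]:
    """Drop exact duplicates by hash, then near-duplicates by pairwise Jaccard.
    The pairwise pass is O(n^2); replace it with MinHash + LSH for large corpora."""
    seen_hashes: set[str] = set()
    kept: list[tuple[str, set[str]]] = []
    for doc in docs:
        h = content_hash(doc)
        if h in seen_hashes:
            continue
        seen_hashes.add(h)
        sh = shingles(doc)
        if any(jaccard(sh, other) >= threshold for _, other in kept):
            continue
        kept.append((doc, sh))
    return [doc for doc, _ in kept]
```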
PII Removal
Removing personally identifiable information is both a GDPR requirement and a model safety practice. PII in training data can lead to memorization, where the model reproduces personal information in its outputs.
PII removal pipeline:
- Named entity recognition (NER). Use a pre-trained NER model to identify person names, organizations, locations, and other entities.
- Pattern matching. Regex-based detection for structured PII: email addresses, phone numbers, social security numbers, credit card numbers, IP addresses.
- Replacement strategy. Replace detected PII with placeholder tokens (`[EMAIL]`, `[PHONE]`, `[NAME]`) rather than deleting it. This preserves sentence structure for training while removing the sensitive content.
- Validation. Run a second pass with a different detection method to catch what the first pass missed. No single approach catches everything.
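A minimal pattern-matching pass might look like the sketch below. The regexes are illustrative and will both over- and under-match; in a real pipeline they sit alongside an NER pass for names and organizations, plus a second validation method.

```python
import re

# Regex pass for structured PII only. Names and organizations need an NER model;
# these patterns are deliberately loose and will catch some non-PII (e.g. dates).
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[CARD]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "[PHONE]"),
]

def scrub_pii(text: str) -> tuple[str, int]:
    """Replace structured PII with placeholder tokens; returns (text, replacement_count)."""
    total = 0
    for pattern, token in PII_PATTERNS:
        text, n = pattern.subn(token, text)
        total += n
    return text, total
```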
Quality Filtering
Not all web content is useful for training. Quality filtering removes content that would degrade model performance.
Common quality signals:
| Signal | What it catches | Typical threshold |
|---|---|---|
| Document length | Stub pages, error pages, empty templates | Minimum 50 to 100 words |
| Language detection | Pages in languages outside your target set | fastText LID confidence > 0.8 |
| Perplexity scoring | Machine-generated spam, keyword-stuffed pages | Remove top 5% highest perplexity |
| Character ratio | Pages that are mostly code, markup, or special characters | Alphabetic character ratio > 0.7 |
| Repetition ratio | Pages with excessive repeated phrases or paragraphs | Duplicate n-gram ratio < 0.3 |
| Adult/toxic content | Content that would make the model produce harmful outputs | Classifier-based filtering |
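The model-based signals in the table (language ID, perplexity, toxicity) need dedicated models, but the cheap heuristics are a few lines of Python. The thresholds below are illustrative defaults, not tuned values.

```python
import re
from collections import Counter

def quality_ok(text: str,
               min_words: int = 75,
               min_alpha_ratio: float = 0.7,
               max_dup_ngram_ratio: float = 0.3,
               n: int = 3) -> bool:
    """Cheap heuristic filters: length, character composition, repetition.
    Language ID, perplexity, and toxicity scoring run as separate model-based passes."""
    words = text.split()
    if len(words) < min_words:
        return False

    # Share of alphabetic characters among non-whitespace characters.
    non_ws = re.sub(r"\s", "", text)
    if not non_ws or sum(c.isalpha() for c in non_ws) / len(non_ws) < min_alpha_ratio:
        return False

    # Fraction of word n-grams that repeat an already-seen n-gram.
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    if ngrams:
        dup_ratio = 1 - len(Counter(ngrams)) / len(ngrams)
        if dup_ratio > max_dup_ngram_ratio:
            return False
    return True
```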
Documentation Requirements
The EU AI Act requires documentation. Even outside the EU, good documentation protects you legally, helps reproduce your results, and makes audits tractable.
Provenance Tracking
Every document in your training set should have a metadata record:
{
"source_url": "https://example.com/article/12345",
"domain": "example.com",
"crawl_timestamp": "2026-02-15T08:30:00Z",
"robots_txt_status": "allowed",
"robots_txt_snapshot": "sha256:abc123...",
"http_status": 200,
"content_type": "text/html",
"content_hash": "sha256:def456...",
"license_detected": "CC-BY-4.0",
"pii_removed": true,
"pii_removal_method": "ner+regex_v2",
"quality_score": 0.87,
"word_count": 1423,
"language": "en",
"tdm_opt_out": false
}
This metadata serves multiple purposes: legal compliance (EU AI Act Article 53), reproducibility (you can re-crawl or verify any document), and filtering (you can remove entire domains or time ranges from your dataset without reprocessing everything).
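A small helper that assembles this record at crawl time and appends it to a JSONL log is usually enough to start. The sketch below covers a subset of the fields in the example record above; the inputs are whatever your fetcher already has in hand.

```python
import hashlib
import json
from datetime import datetime, timezone

def provenance_record(url: str, domain: str, status: int, content_type: str,
                      text: str, robots_txt: str, allowed_by_robots: bool,
                      tdm_opt_out: bool) -> dict:
    """Assemble a provenance record mirroring (a subset of) the schema above."""
    return {
        "source_url": url,
        "domain": domain,
        "crawl_timestamp": datetime.now(timezone.utc).isoformat(),
        "robots_txt_status": "allowed" if allowed_by_robots else "disallowed",
        "robots_txt_snapshot": "sha256:" + hashlib.sha256(robots_txt.encode()).hexdigest(),
        "http_status": status,
        "content_type": content_type,
        "content_hash": "sha256:" + hashlib.sha256(text.encode()).hexdigest(),
        "word_count": len(text.split()),
        "tdm_opt_out": tdm_opt_out,
    }

def append_record(path: str, record: dict) -> None:
    """Append one record per line to a JSONL provenance log."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```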
Data Cards
A data card is a standardized document that describes your training dataset. It should include:
- Dataset name and version.
- Collection methodology. How was the data gathered? What tools were used? What filters were applied?
- Source distribution. How many domains? What categories of content? What languages?
- Temporal coverage. When was the data collected? Over what time period?
- Known limitations. What biases exist in the data? What content types are overrepresented or underrepresented?
- PII handling. What PII detection and removal methods were used?
- Legal basis. What is the legal justification for the data collection? How were opt-outs handled?
- Contact information. Who is responsible for the dataset?
Format Optimization: Markdown Over HTML
If you’re scraping web content for AI training, the format of your cleaned output matters more than most teams realize.
Raw HTML is a poor training format. A typical web page is 60% to 80% markup by character count: <div> tags, CSS classes, inline styles, script blocks, SVG paths, tracking pixels, and ad containers. All of that markup consumes tokens during training without contributing meaningful semantic content. Training on raw HTML teaches the model to generate HTML, which is rarely what you want.
Markdown preserves structure without noise. Converting HTML to clean markdown strips all presentational markup while preserving the semantic structure that matters: headings, paragraphs, lists, tables, links, code blocks, and emphasis. The result is typically 70% to 90% smaller by token count than the source HTML.
Concrete comparison:
| Metric | Raw HTML | Clean Markdown |
|---|---|---|
| Typical token count (1 page) | 3,000 to 15,000 | 500 to 2,000 |
| Semantic signal ratio | 20% to 40% | 85% to 95% |
| Training compute per page | High | Low |
| Noise (nav, ads, boilerplate) | Present | Removed |
Spider returns clean markdown by default when you set return_format: "markdown". The conversion strips navigation, ads, footers, cookie banners, and boilerplate, then outputs well-structured markdown with headings, lists, and tables preserved. For training data pipelines, this means you can go directly from Spider’s API response to your training data store with minimal post-processing.
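As a sketch of what "directly from the API response to your training data store" can look like, the snippet below requests markdown for a crawl and writes one JSONL row per page. The endpoint, parameters, and response shape shown here are assumptions built around Spider's `return_format` option; check the current API reference before relying on them.

```python
import json
import os
import requests  # third-party dependency: pip install requests

# Assumed endpoint and auth scheme; verify against Spider's API documentation.
API_URL = "https://api.spider.cloud/crawl"
API_KEY = os.environ["SPIDER_API_KEY"]

def crawl_to_jsonl(start_url: str, out_path: str, limit: int = 50) -> None:
    """Crawl a site via Spider, requesting markdown, and write one JSONL row per page."""
    payload = {"url": start_url, "return_format": "markdown", "limit": limit}
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
        json=payload,
        timeout=120,
    )
    resp.raise_for_status()
    with open(out_path, "a", encoding="utf-8") as f:
        # Assumed response shape: a list of objects with "url" and "content" fields.
        for page in resp.json():
            if page.get("content"):
                row = {"url": page.get("url"), "text": page["content"]}
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
```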
Part 3: How Spider Fits Into a Compliant Pipeline
Spider is a scraping engine, not a compliance tool. It does not track TDM opt-out signals in HTML meta tags (<meta name="robots" content="noai">) or HTTP response headers (X-Robots-Tag: noai), does not perform PII removal, and does not generate provenance metadata beyond standard HTTP response fields. What it does provide is robots.txt enforcement, per-domain rate limiting, and clean markdown output — which covers the collection stage of a compliant pipeline. You still need to build the filtering, documentation, and cleaning stages yourself.
Built-in robots.txt Compliance
Every crawl begins with a robots.txt fetch. The engine parses User-agent, Disallow, Allow, and Crawl-delay directives and enforces them before any page request. This is enabled by default. Note that robots.txt enforcement covers path-level blocking only. It does not detect TDM opt-out signals in HTML meta tags or HTTP headers. If your compliance requirements include those signals, you need a post-fetch filtering step.
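A post-fetch filter for those signals can be as simple as the sketch below. Two caveats: `noai`-style tokens are a de facto convention rather than a formal standard, and regex scanning of HTML is brittle compared to a real parser, so treat this as a starting point rather than a complete opt-out detector.

```python
import re

# "noai"-style tokens are a convention, not a standard; extend the set as needed.
NOAI_TOKENS = {"noai", "noimageai"}

META_TAG = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)
CONTENT_ATTR = re.compile(r'content\s*=\s*["\']([^"\']*)["\']', re.IGNORECASE)

def tdm_opt_out(html: str, headers: dict[str, str]) -> bool:
    """Return True if the page signals an AI/TDM opt-out via X-Robots-Tag or a meta tag."""
    # Header check, e.g. X-Robots-Tag: noai
    header_val = next((v for k, v in headers.items() if k.lower() == "x-robots-tag"), "")
    header_tokens = {t.strip().lower() for t in header_val.split(",")}
    if header_tokens & NOAI_TOKENS:
        return True
    # Meta tag check, e.g. <meta name="robots" content="noai">
    for tag in META_TAG.findall(html):
        m = CONTENT_ATTR.search(tag)
        if not m:
            continue
        tokens = {t.strip().lower() for t in m.group(1).split(",")}
        if tokens & NOAI_TOKENS:
            return True
    return False
```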
Automatic Rate Limiting
Spider enforces per-domain rate limiting based on Crawl-delay directives in robots.txt. When no Crawl-delay is specified, the engine applies conservative defaults to avoid overloading target servers. HTTP 429 and 503 responses trigger automatic backoff. The result is that your crawl respects the target site’s capacity without you writing any throttling logic.
Clean Markdown Output
Set return_format: "markdown" and Spider returns content with HTML markup, navigation, ads, and boilerplate removed. The output is ready for tokenization and training with minimal post-processing. For teams building training datasets, this eliminates the entire HTML-to-text conversion pipeline.
Metadata for Provenance
Every Spider API response includes metadata: the source URL, HTTP status code, content type, and response headers. This metadata feeds directly into your provenance tracking system. Combined with the timestamp of your API call and the robots.txt status (which Spider enforces), you have the core fields needed for EU AI Act documentation.
Proxy Infrastructure That Respects Target Sites
Spider routes requests through datacenter, residential, and mobile proxies. The proxy layer handles IP rotation, geographic targeting, and anti-bot bypass. Critically, the proxy infrastructure is designed for sustainable scraping: requests are distributed across IPs to avoid concentrating load on any single target, and retry logic respects server signals (429s, 503s, connection resets) rather than hammering through them.
Compliance Checklist
Use this checklist before starting any web scraping project for AI training data.
| Category | Requirement | Status |
|---|---|---|
| robots.txt | Fetch and parse robots.txt before crawling each domain | Required |
| robots.txt | Respect Disallow directives for your user-agent | Required |
| robots.txt | Respect Crawl-delay directives | Required |
| robots.txt | Log robots.txt content at time of crawl | Recommended |
| TDM opt-out | Check for TDM opt-out signals (robots.txt, meta tags, HTTP headers) | Required (EU) |
| TDM opt-out | Exclude opted-out content from training data | Required (EU) |
| Rate limiting | Implement per-domain rate limiting | Required |
| Rate limiting | Honor HTTP 429/503 with exponential backoff | Required |
| Bot identification | Set descriptive User-Agent with contact info | Recommended |
| Terms of service | Avoid scraping behind login walls unless ToS permits it | Recommended |
| Terms of service | Honor cease-and-desist requests | Required |
| PII removal | Run NER and pattern-based PII detection | Required (GDPR) |
| PII removal | Replace PII with placeholder tokens | Required (GDPR) |
| PII removal | Validate with a second detection pass | Recommended |
| Deduplication | Exact deduplication (hash-based) | Recommended |
| Deduplication | Near-duplicate detection (MinHash/LSH) | Recommended |
| Quality filtering | Filter by document length, language, and quality signals | Recommended |
| Documentation | Maintain provenance metadata for every document | Required (EU AI Act) |
| Documentation | Create and publish a data card for the dataset | Required (EU AI Act) |
| Documentation | Document fair use rationale (US) | Recommended |
| Copyright | Assess transformative use for copyrighted content | Recommended |
| Copyright | Prefer factual, technical, and openly licensed content | Recommended |
| Format | Convert to markdown or plain text before training | Recommended |
Looking Ahead
The legal ground under AI training data is shifting fast. The EU AI Act enforcement timeline continues through 2026 and 2027, with new obligations taking effect at each phase. In the US, the NYT v. OpenAI case and several other pending lawsuits will produce precedent that could reshape fair use analysis for AI training. Multiple states are considering their own AI legislation, and the Copyright Office has signaled interest in rulemaking.
The teams that will navigate this well are the ones building compliance into their data pipelines now, not retrofitting it after a regulatory action or lawsuit. The technical requirements (provenance tracking, TDM opt-out detection, PII removal, robots.txt respect, deduplication, and documentation) are all solvable today. They are not overhead; they are the foundation that lets you keep training models as the legal requirements tighten. The legal questions are harder, and they require qualified counsel, not a blog post.
Empower any project with AI-ready data
Join thousands of developers using Spider to power their data pipelines.