HTML Transform API
Already have the HTML? Transform it into clean markdown, plain text, or sanitized HTML without re-crawling. Send raw HTML and get back structured, readable content at 0.1 credits per page, which makes it the most cost-efficient way to process web content you've already collected.
Why Use Transform Instead of Scrape?
Already Have the HTML
If you've already fetched pages from the web (via your own crawlers, browser extensions, or cached content), Transform lets you convert them without paying for another network request.
Cost Efficient
At just 0.1 credits per HTML document (up to 10 credits for PDFs), Transform is the cheapest way to get clean content. No browser rendering or proxy costs.
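As a rough illustration of the pricing above, the sketch below computes an upper-bound credit estimate for a job. The function name `estimate_max_credits` and the `pdf_units` parameter are hypothetical, and the PDF figure is treated as a ceiling per billed PDF unit; actual usage is whatever the API reports.

```python
from decimal import Decimal

HTML_CREDITS = Decimal("0.1")    # per HTML document, as stated above
PDF_MAX_CREDITS = Decimal("10")  # stated ceiling for PDFs (assumed per billed unit)


def estimate_max_credits(html_docs: int, pdf_units: int = 0) -> Decimal:
    """Return an upper-bound credit estimate for a Transform job."""
    if html_docs < 0 or pdf_units < 0:
        raise ValueError("counts must be non-negative")
    return HTML_CREDITS * html_docs + PDF_MAX_CREDITS * pdf_units


print(estimate_max_credits(500))     # 500 HTML documents
print(estimate_max_credits(500, 3))  # plus 3 PDF units at the ceiling
```

Decimal arithmetic is used so that fractional credits add up exactly instead of accumulating floating-point error.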
Batch Processing
Send an array of HTML documents in one request. Process entire collections of saved pages in a single API call, up to 10 MB total.
Three Cleaning Levels
Standard
Basic HTML-to-format conversion. Preserves all content structure including navigation, footers, and sidebars.
No cleaning flags needed
AI Clean
Removes navigation, footers, ads, and boilerplate. Keeps the main article or body content, optimized for feeding into language models.
{"clean": true}
Full Clean
Strips all non-essential HTML attributes like classes, IDs, and inline styles. Produces minimal, semantic markup.
{"clean_full": true}
Key Capabilities
Readability Extraction
Enable readability to extract just the main content using Mozilla's readability algorithm. Perfect for articles and blog posts.
Multiple Output Formats
Convert to markdown, text, or sanitized html. Markdown is ideal for LLMs; text for NLP; clean HTML for re-rendering.
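To make the combination of output format and cleaning level concrete, here is a minimal sketch that assembles a request body. The helper `build_transform_body` is hypothetical, the format strings simply mirror the names listed above, and placing `clean` and `clean_full` at the top level of the body is an assumption based on where `return_format` and `readability` appear in the cURL example on this page.

```python
import json

# Format names as listed in the text; the exact accepted values are an assumption.
FORMATS = ("markdown", "text", "html")

# Cleaning levels mapped to the flags shown under "Three Cleaning Levels".
CLEANING = {
    "standard": {},
    "ai": {"clean": True},
    "full": {"clean_full": True},
}


def build_transform_body(items, return_format="markdown", cleaning="standard",
                         readability=False):
    """Return a JSON-serializable body for one or more {html, url} items."""
    if return_format not in FORMATS:
        raise ValueError(f"unknown return_format: {return_format!r}")
    if cleaning not in CLEANING:
        raise ValueError(f"unknown cleaning level: {cleaning!r}")
    body = {"data": list(items), "return_format": return_format}
    body.update(CLEANING[cleaning])
    if readability:
        body["readability"] = True
    return body


body = build_transform_body(
    [{"html": "<h1>Hello</h1><p>World</p>", "url": "https://example.com"}],
    cleaning="ai",
)
print(json.dumps(body, indent=2))
```

Validating the options locally catches typos in a format or cleaning name before any request is sent.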
URL Context
Pass the source URL alongside HTML so relative links can be resolved to absolute URLs in the output. Ensures links in markdown work correctly.
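The effect of supplying a source URL can be illustrated with Python's standard `urljoin`. This shows ordinary URL resolution, not Transform's internal implementation, and the example URLs are made up.

```python
from urllib.parse import urljoin

# The URL the HTML was originally fetched from (illustrative).
source_url = "https://example.com/blog/post.html"

# Link targets as they might appear in the page's href/src attributes.
hrefs = [
    "/about",
    "images/chart.png",
    "../pricing",
    "https://other.example/x",
    "#section-2",
]

# Resolve each link against the source URL.
resolved = {href: urljoin(source_url, href) for href in hrefs}

for href, absolute in resolved.items():
    print(f"{href!r:28} -> {absolute}")
```

Without a base URL, a link such as `images/chart.png` has no meaning outside the original page, which is why relative links break once content is converted to markdown and moved elsewhere.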
Batch Input
Send an array of {html, url} objects. Transform dozens of pages in a single request to minimize round-trips.
PDF Support
Transform also handles PDF content extraction. Convert PDF documents to markdown or text at up to 10 credits per page.
10 MB Payload
Process up to 10 MB of HTML per request. Large pages, long articles, and complex documents are handled without truncation.
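For collections larger than one request allows, the documents can be split into batches on the client side. The sketch below is one possible approach: `chunk_items` is a hypothetical helper, "10 MB" is assumed to mean 10 MiB, and size is measured on the JSON-encoded items only, so it is worth leaving headroom for the rest of the request body.

```python
import json

# Assumption: "10 MB" is read as 10 MiB; how the API counts bytes is not specified.
MAX_BYTES = 10 * 1024 * 1024


def item_size(item):
    """Size in bytes of one {html, url} item once JSON-encoded as UTF-8."""
    return len(json.dumps(item).encode("utf-8"))


def chunk_items(items, max_bytes=MAX_BYTES):
    """Yield lists of items whose combined encoded size stays within max_bytes."""
    batch, used = [], 0
    for item in items:
        size = item_size(item)
        if size > max_bytes:
            raise ValueError("a single document exceeds the payload limit")
        if batch and used + size > max_bytes:
            yield batch
            batch, used = [], 0
        batch.append(item)
        used += size
    if batch:
        yield batch


# Small demonstration with an artificially low limit.
pages = [
    {"html": "<p>" + "x" * 40 + "</p>", "url": f"https://example.com/{i}"}
    for i in range(5)
]
batches = list(chunk_items(pages, max_bytes=200))
print([len(b) for b in batches])
```

Each yielded batch can then be sent as the array of documents in one Transform request.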
Code Examples
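from spider import Spider

# Assumes the API key is available to the client,
# for example through the SPIDER_API_KEY environment variable.
client = Spider()

html_content = "<html><body><h1>Hello</h1><p>World</p></body></html>"

# Convert one HTML document to markdown with AI Clean enabled.
result = client.transform(
    [{"html": html_content, "url": "https://example.com"}],
    params={
        "return_format": "markdown",
        "clean": True,
    },
)

print(result[0]["content"])
# Output:
# # Hello
# World

curl -X POST https://api.spider.cloud/transform \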
  -H "Authorization: Bearer $SPIDER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "data": [
      {"html": "<h1>Page One</h1><p>Content...</p>", "url": "https://example.com/1"},
      {"html": "<h1>Page Two</h1><p>Content...</p>", "url": "https://example.com/2"}
    ],
    "return_format": "markdown",
    "readability": true
  }'
Popular Transform Use Cases
Post-Processing Cached Content
You've already saved HTML from your own crawlers or a CDN cache. Transform converts it to clean markdown without consuming browser or proxy credits.
Email & Newsletter Parsing
Convert HTML emails into readable text or markdown for indexing, summarization, or feeding into language models.
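As a sketch of the preparation step, the HTML part of an email can be pulled out with Python's standard `email` package before it is sent to Transform. The helper name `email_to_transform_item` and the sample message are made up, and omitting `url` from the item assumes the field is optional.

```python
from email import message_from_string
from email.policy import default

# A made-up multipart message with plain-text and HTML alternatives.
RAW_EMAIL = """\
From: news@example.com
To: reader@example.com
Subject: Weekly update
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/plain; charset="utf-8"

Plain fallback
--BOUNDARY
Content-Type: text/html; charset="utf-8"

<html><body><h1>Weekly update</h1><p>Hello</p></body></html>
--BOUNDARY--
"""


def email_to_transform_item(raw):
    """Extract the HTML body of a raw email as a Transform input item."""
    msg = message_from_string(raw, policy=default)
    part = msg.get_body(preferencelist=("html",))
    if part is None:
        raise ValueError("message has no HTML part")
    # Assumption: "url" is optional, since an email has no source URL.
    return {"html": part.get_content()}


item = email_to_transform_item(RAW_EMAIL)
print(item["html"])
```

The resulting item can be placed in the array of documents alongside any other HTML input.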
CMS Content Migration
Export HTML from one CMS and transform it to markdown for import into a static site generator, wiki, or headless CMS.
Document Preprocessing
Clean and normalize HTML documents before embedding or indexing. Strip formatting artifacts and extract pure semantic content.
Transform HTML at scale
The most cost-efficient way to convert web content into clean, structured formats.