At some point in nearly every Astro project, you hit the same wall: a pile of HTML files — scraped content, a CMS export, hand-authored pages from a previous site — that needs to become Markdown so Astro's content collections can own it. Copying and pasting is out of the question. You need a script that runs once, produces clean .md files with correct frontmatter, and gets out of your way.
This guide covers two practical paths: Turndown for most projects, and the unified / rehype-remark pipeline for cases where you need more control over the AST. Both are well-maintained, work in Node.js, and produce Markdown that Astro can parse without complaints.
Why Markdown matters in Astro projects
Astro's content collections expect .md or .mdx files (or a custom loader). When you store content as Markdown, you get frontmatter validation via Zod, type-safe getCollection() calls, and automatic slug generation — none of which you get if you're loading raw HTML. Converting HTML to Markdown once is almost always worth the effort compared to maintaining a custom HTML loader forever.
Tailwind CSS v4 projects in particular benefit because Tailwind's @tailwindcss/typography plugin — which styles the prose class — operates on whatever HTML Astro renders from Markdown. That render step gives you a clean hook to apply consistent typographic styles without touching individual HTML files.
Option 1: Turndown (the direct approach)
Turndown (currently v7.2.4) is a standalone JavaScript library that takes an HTML string or DOM node and returns a Markdown string. It runs in Node.js and in the browser, has no peer dependencies, and exposes a clean rule-based customization API.
Installation
npm install turndown turndown-plugin-gfm
The optional turndown-plugin-gfm package adds GitHub Flavored Markdown support — tables, strikethrough, and task lists. It is worth including for most content migrations because real-world HTML almost always contains tables.
Basic usage
import TurndownService from 'turndown';
import { gfm } from 'turndown-plugin-gfm';

const service = new TurndownService({
  headingStyle: 'atx', // # H1, ## H2 (matches Astro/remark conventions)
  bulletListMarker: '-',
  codeBlockStyle: 'fenced', // ```code``` blocks instead of indented
  hr: '---',
});
service.use(gfm);

const html = '<h1>Hello</h1><p>A <strong>quick</strong> example.</p>';
const markdown = service.turndown(html);
console.log(markdown);
// # Hello
//
// A **quick** example.
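The headingStyle: 'atx' option is worth setting explicitly. Turndown's default is setext, which underlines level-1 and level-2 headings with = and - characters and falls back to the # prefix for deeper levels. Setext headings are valid Markdown, but the mixed output is harder to scan and to search-and-replace. ATX headings (# prefix) are the far more common convention.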
Adding custom rules
Turndown's addRule() method lets you intercept specific HTML elements. A common need in content migrations is stripping wrapper divs or converting custom elements to plain text:
// Remove <figure> wrappers but keep their children
service.addRule('unwrapFigure', {
  filter: 'figure',
  replacement: (content) => content,
});

// Convert <mark> highlights to bold
service.addRule('highlight', {
  filter: 'mark',
  replacement: (content) => `**${content}**`,
});

// Strip analytics/tracking elements entirely.
// 'pixel' stands in for a custom tracking element; use your own tag names.
service.addRule('removeTracking', {
  filter: ['noscript', 'pixel'],
  replacement: () => '',
});
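Turndown uses the first rule whose filter matches an element, and it checks rules you add before its built-in CommonMark rules. Among added rules, the most recently added one is checked first, so a later addRule() call takes precedence over an earlier one that matches the same element. The built-in rules act as a fallback, so you only need to define overrides for elements that Turndown handles incorrectly for your specific content.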
A complete migration script for Astro content collections
Here is a full Node.js script you can drop into your project's scripts/ directory. It reads every .html file from a source directory, extracts a title from the <title> tag (or falls back to the filename), converts the body content to Markdown, writes a .md file with Astro-compatible frontmatter, and reports what it did.
// scripts/html-to-md.mjs
// Usage: node scripts/html-to-md.mjs <input-dir> <output-dir>
//
// Example: node scripts/html-to-md.mjs ./old-html ./src/content/blog

import { readdir, readFile, writeFile, mkdir } from 'node:fs/promises';
import { join, basename, extname } from 'node:path';
import TurndownService from 'turndown';
import { gfm } from 'turndown-plugin-gfm';

const [, , inputDir, outputDir] = process.argv;

if (!inputDir || !outputDir) {
  console.error('Usage: node scripts/html-to-md.mjs <input-dir> <output-dir>');
  process.exit(1);
}

// Configure Turndown once; reuse the instance for performance
const td = new TurndownService({
  headingStyle: 'atx',
  bulletListMarker: '-',
  codeBlockStyle: 'fenced',
});
td.use(gfm);

// Strip elements that should not appear in content
td.addRule('removeScripts', {
  filter: ['script', 'style', 'nav', 'header', 'footer'],
  replacement: () => '',
});

function slugify(filename) {
  return basename(filename, extname(filename))
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-|-$/g, '');
}

function extractTitle(html) {
  const match = html.match(/<title[^>]*>([^<]+)<\/title>/i);
  // Collapse internal whitespace so a multi-line <title> stays on one line;
  // an empty title falls through to the filename fallback
  const title = match ? match[1].replace(/\s+/g, ' ').trim() : '';
  return title || null;
}

function extractBody(html) {
  // Prefer <main>, fall back to <body>, fall back to the full document
  const mainMatch = html.match(/<main[^>]*>([\s\S]*?)<\/main>/i);
  if (mainMatch) return mainMatch[1];
  const bodyMatch = html.match(/<body[^>]*>([\s\S]*?)<\/body>/i);
  if (bodyMatch) return bodyMatch[1];
  return html;
}

async function convertFile(inputPath, outputPath) {
  const html = await readFile(inputPath, 'utf-8');
  const title = extractTitle(html) ?? slugify(inputPath);
  const body = extractBody(html);
  const markdown = td.turndown(body);

  const frontmatter = [
    '---',
    // JSON.stringify yields a valid YAML double-quoted string and escapes
    // both quotes and backslashes in the title
    `title: ${JSON.stringify(title)}`,
    `slug: "${slugify(inputPath)}"`,
    `pubDate: "${new Date().toISOString().slice(0, 10)}"`,
    '---',
  ].join('\n');

  await writeFile(outputPath, `${frontmatter}\n\n${markdown}\n`, 'utf-8');
  console.log(` ✓ ${basename(inputPath)} → ${basename(outputPath)}`);
}

async function run() {
  await mkdir(outputDir, { recursive: true });

  const files = (await readdir(inputDir)).filter((f) => f.endsWith('.html'));
  if (files.length === 0) {
    console.log('No .html files found in', inputDir);
    return;
  }

  console.log(`Converting ${files.length} file(s)…`);
  for (const file of files) {
    const inputPath = join(inputDir, file);
    const outputPath = join(outputDir, file.replace(/\.html$/, '.md'));
    await convertFile(inputPath, outputPath);
  }
  console.log('Done.');
}

run().catch((err) => {
  console.error(err);
  process.exit(1);
});
Run it once with a real content directory, spot-check a few output files, then wire it into your content collection schema. The script deliberately processes files sequentially rather than in parallel — this keeps error messages readable and avoids hitting any file-descriptor limits on large batches.
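One thing to look for while spot-checking: extractTitle returns the raw text of the <title> tag, so entities such as &amp; end up in the frontmatter verbatim. The following is a minimal sketch of a decoder for the few entities that commonly appear in titles. decodeBasicEntities is a hypothetical helper, not part of the script above, and it is deliberately not a full HTML entity decoder.

```javascript
// Hypothetical helper: decode a small, fixed set of named entities plus
// numeric references. Anything it does not recognize is left untouched.
const NAMED_ENTITIES = new Map([
  ['amp', '&'],
  ['lt', '<'],
  ['gt', '>'],
  ['quot', '"'],
  ['apos', "'"],
]);

function decodeBasicEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references as they are instead of throwing.
      if (Number.isNaN(code) || code > 0x10ffff) return match;
      return String.fromCodePoint(code);
    }
    return NAMED_ENTITIES.get(body) ?? match;
  });
}

console.log(decodeBasicEntities('Tom &amp; Jerry&#39;s &quot;Guide&quot;'));
```

In the migration script, this would wrap the value returned by extractTitle before it is written to the frontmatter. If your titles use a wider range of named entities, a dedicated entity-decoding package is the safer choice.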
Option 2: the unified / rehype-remark pipeline
If you are already using Astro's remark/rehype plugin system, or if you need precise control over the conversion — preserving specific attributes, walking the AST to extract structured metadata, or integrating with a larger content pipeline — the rehype-remark package is the right tool.
The unified ecosystem models both HTML and Markdown as abstract syntax trees (ASTs). rehype handles the HTML AST (hast), and remark handles the Markdown AST (mdast). rehype-remark is the bridge between them.
Installation
npm install unified rehype-parse rehype-remark remark-stringify
All four packages are ESM-only as of their current versions. Make sure your script uses .mjs or that your package.json has "type": "module".
Basic pipeline
import { unified } from 'unified';
import rehypeParse from 'rehype-parse';
import rehypeRemark from 'rehype-remark';
import remarkStringify from 'remark-stringify';

const processor = unified()
  .use(rehypeParse, { fragment: false }) // parse full HTML document
  .use(rehypeRemark) // convert hast → mdast
  .use(remarkStringify, {
    bullet: '-',
    fence: '`',
    fences: true,
    incrementListMarker: false,
  });

const html = '<article><h2>Section</h2><p>Content here.</p></article>';
const file = await processor.process(html);
console.log(String(file));
Because you have access to both ASTs, you can insert custom plugins at any step. For example, a plugin that strips all <div> wrappers from the hast before conversion, or one that walks the mdast afterward to extract all heading text for a generated table of contents.
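As a rough illustration of the second idea, here is a sketch of a plugin that collects heading text from the mdast. The names remarkCollectHeadings, collectHeadings, and textOf are made up for this example, and the tree is a hand-built stand-in for what rehype-remark would produce, so the snippet runs without any packages installed. In a real pipeline you would more likely reach for a tree-walking utility such as unist-util-visit than a hand-rolled recursion.

```javascript
// Recursively collect { depth, text } for every heading node in an mdast tree.
function collectHeadings(node, out = []) {
  if (node.type === 'heading') {
    out.push({ depth: node.depth, text: textOf(node) });
  }
  for (const child of node.children ?? []) collectHeadings(child, out);
  return out;
}

// Concatenate the string values of a node and all of its descendants.
function textOf(node) {
  if (typeof node.value === 'string') return node.value;
  return (node.children ?? []).map(textOf).join('');
}

// unified plugin shape: an attacher that returns a transformer(tree, file).
function remarkCollectHeadings() {
  return (tree, file) => {
    file.data.headings = collectHeadings(tree);
  };
}

// Hand-built mdast sample (assumption: mirrors the earlier <article> input).
const sampleTree = {
  type: 'root',
  children: [
    { type: 'heading', depth: 2, children: [{ type: 'text', value: 'Section' }] },
    { type: 'paragraph', children: [{ type: 'text', value: 'Content here.' }] },
  ],
};
const sampleFile = { data: {} };
remarkCollectHeadings()(sampleTree, sampleFile);
console.log(sampleFile.data.headings);
```

In the pipeline above, the plugin would sit between .use(rehypeRemark) and .use(remarkStringify, ...), and the collected headings would then be available on file.data.headings after processor.process() resolves.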
Choosing between Turndown and unified
Both approaches produce valid Markdown. The practical differences come down to what you are doing around the conversion:
- Use Turndown when you want a simple, dependency-light script that you run once and discard. The API is intuitive, custom rules are easy to add, and the output is predictable for standard HTML.
- Use unified when the conversion is part of a larger content pipeline, for instance when you are already running remark plugins for syntax highlighting, link validation, or reading-time calculation. Adding rehype-remark to an existing pipeline costs almost nothing.
- Use node-html-markdown if you are processing very large volumes of HTML (gigabytes) in a tight server-side loop. It benchmarks roughly 1.6x faster than Turndown for reused instances, at the cost of a less flexible customization API.
Cleaning up the output
No converter produces perfect output every time. After running your migration script, a quick search-and-replace pass catches the most common issues:
- Excessive blank lines: Markdown is fine with one blank line between blocks; three or more consecutive newlines usually mean the source had empty wrapper elements. Run a regex like /\n{3,}/g and replace with \n\n.
- Escaped characters that did not need escaping: Turndown sometimes escapes underscores inside words (e.g., some\_thing). These are harmless but ugly. A targeted replace handles them.
- Absolute URLs pointing at a domain you no longer own: worth a grep pass before committing to a content collection.
- Missing alt text on images: Turndown preserves whatever alt text the HTML had, which may be empty. Astro will build fine, but accessibility and SEO suffer. Flag empty alts for a manual review pass.
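The first, second, and last items in that list are easy to script. The sketch below is one possible shape for such a pass; tidyMarkdown and findEmptyAlts are hypothetical helper names, and the regexes are intentionally simple. Note that the blank-line collapse also applies inside fenced code blocks, so review any converted code samples that rely on multiple blank lines.

```javascript
// Hypothetical helper: tidy Markdown produced by the converter.
function tidyMarkdown(markdown) {
  return markdown
    // Collapse runs of three or more newlines into a single blank line.
    .replace(/\n{3,}/g, '\n\n')
    // Unescape underscores that sit between two word characters (some\_thing).
    .replace(/(?<=\w)\\_(?=\w)/g, '_');
}

// Hypothetical helper: 1-based line numbers of images with empty alt text,
// i.e. lines containing ![](...).
function findEmptyAlts(markdown) {
  const hits = [];
  markdown.split('\n').forEach((line, index) => {
    if (/!\[\s*\]\(/.test(line)) hits.push(index + 1);
  });
  return hits;
}

const sample = 'Intro\n\n\n\nsome\\_thing and ![](/img/a.png)\n';
const tidied = tidyMarkdown(sample);
console.log(JSON.stringify(tidied));
console.log(findEmptyAlts(tidied)); // line numbers to review by hand
```

Only underscores between word characters are unescaped, because an underscore at a word boundary can start emphasis and its escape may be doing real work.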
Wiring the output into an Astro content collection
Once you have .md files with frontmatter, define a collection in src/content.config.ts:
import { defineCollection, z } from 'astro:content';
import { glob } from 'astro/loaders';

const blog = defineCollection({
  loader: glob({ pattern: '**/*.md', base: './src/content/blog' }),
  schema: z.object({
    title: z.string(),
    slug: z.string().optional(),
    pubDate: z.coerce.date(),
  }),
});

export const collections = { blog };
Then query it in any .astro page:
---
import { getCollection } from 'astro:content';

const posts = await getCollection('blog');
// posts is typed; posts[0].data.title, posts[0].data.pubDate, etc.
---
Zod will surface any frontmatter mismatches at build time, which is exactly what you want — better to catch a missing pubDate during npm run build than to discover it after deploy.
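Because the schema coerces pubDate into a Date, the entries can be ordered directly. The sketch below shows a newest-first sort; the posts array is a made-up stand-in for the result of getCollection('blog') so that the snippet runs on its own.

```javascript
// Stand-in for `await getCollection('blog')`; these entries are invented.
const posts = [
  { id: 'older', data: { title: 'Older', pubDate: new Date('2023-01-15') } },
  { id: 'newer', data: { title: 'Newer', pubDate: new Date('2024-06-01') } },
];

// Copy before sorting so the original array is left untouched.
const newestFirst = [...posts].sort(
  (a, b) => b.data.pubDate.valueOf() - a.data.pubDate.valueOf(),
);

console.log(newestFirst.map((post) => post.id));
```

Since the migration script stamps every file with the date it ran, the imported posts will all share one pubDate until you backfill the real publication dates.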
Conclusion
Converting HTML to Markdown in JavaScript is a solved problem. Turndown handles most real-world cases with minimal setup; the unified pipeline covers the cases where you need AST-level control. The migration script above is a working starting point: adjust the regexes in extractBody to match your source HTML structure, add custom rules for any elements your content uses heavily, and let Astro's content collections do the rest. The payoff is a codebase where every piece of content is version-controlled, typed, and renderable without any per-file HTML editing.