Rebuilding PDF + DOCX Export: WeasyPrint, Pandoc Lua Filters, and Trusting the Frontend
For a document editor, "Export" has to mean one thing: what you see is what comes out of the file. Anything else feels broken, even when the broken-ness is small — a heading that's centered on screen but left-aligned in the PDF, a paragraph that wraps differently because the server's font metrics disagree with the browser's, an exported DOCX whose blue heading turned black because Word doesn't speak CSS.
Over the last two weeks we rewrote the entire export pipeline behind AgentDoc – the AI-native docs editor (also: agent doc, agentdocs, docedit) – to make WYSIWYG actually true. Two formats, two engines, one architectural insight: the frontend is already the source of truth for layout, so the server should follow it, not re-derive it.
The old design and why it kept being subtly wrong
Pre-rewrite, both PDF and DOCX export worked by sending the document's raw markdown to
the backend, where a renderer would (a) re-paginate it server-side, (b) apply CSS that
was a hand-maintained subset of the editor's style.css, and (c) hand the
result to WeasyPrint or to a quick docx writer. Three failure modes followed:
- Pagination disagreed. The browser's paginator measured each block in real DOM with real fonts loaded; the server measured a different DOM with subtly different fonts and different available widths. Ten-page documents reliably came out as nine or eleven, and the page where a heading appeared shifted.
- CSS divergence. The export CSS was a copy of style.css with edits applied lazily. Tables got centered headers in the editor and left-aligned ones in the PDF; <blockquote> had a left border on screen and not in the export. We were debugging individual selectors instead of the architecture.
- DOCX was a different beast entirely. The previous quick-and-dirty path inlined header/footer as italic text at the top and bottom of the body, dropped colour and font classes, and produced something Word users were polite about but did not actually use.
Server-side pagination of editor content is a caching problem disguised as a layout problem. The browser already did the work; sending the result down is cheaper than re-doing it.
The architectural shift: the frontend supplies an html_map
The editor already knows where every page break lands. Its paginator runs the layout, walks
the DOM, and produces an array of HTML chunks – one per page – plus the active layout
meta (page_size, margin_mm). All we needed was a way to
send that to the export endpoints.
The new export contract: when the user clicks "Export PDF" or "Export DOCX", the frontend
attaches the pagination map (html_map) to the request body. The server
treats it as the authoritative layout and stops trying to compute its own.
POST /api/doc/{id}/pdf
{
  "html_map": [
    "<div class='page-content'>…page 1 HTML…</div>",
    "<div class='page-content'>…page 2 HTML…</div>",
    …
  ],
  "header_html": "…",
  "footer_html": "…",
  "page_size": "A4",
  "margin_mm": 20
}
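A minimal sketch of how a server might validate this request body before trusting it as the layout. The names ExportRequest and parse_export_request, the defaults, and the set of accepted page sizes are assumptions made for illustration; only the field names come from the contract above.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical whitelist: the contract above only shows "A4".
ALLOWED_PAGE_SIZES = {"A4", "Letter"}


@dataclass
class ExportRequest:
    html_map: List[str] = field(default_factory=list)
    header_html: str = ""
    footer_html: str = ""
    page_size: str = "A4"
    margin_mm: float = 20.0

    @property
    def has_client_layout(self) -> bool:
        # True when the browser supplied at least one paginated chunk.
        return len(self.html_map) > 0


def parse_export_request(body: dict) -> ExportRequest:
    """Validate the JSON body and return a typed request object."""
    html_map = body.get("html_map") or []
    if not isinstance(html_map, list) or not all(isinstance(p, str) for p in html_map):
        raise ValueError("html_map must be a list of HTML strings")

    page_size = body.get("page_size", "A4")
    if page_size not in ALLOWED_PAGE_SIZES:
        raise ValueError(f"unsupported page_size: {page_size!r}")

    margin_mm = float(body.get("margin_mm", 20))
    if margin_mm < 0:
        raise ValueError("margin_mm must not be negative")

    return ExportRequest(
        html_map=html_map,
        header_html=str(body.get("header_html", "")),
        footer_html=str(body.get("footer_html", "")),
        page_size=page_size,
        margin_mm=margin_mm,
    )
```

An empty or missing html_map is accepted on purpose here, so a caller can tell "no client layout yet" apart from a malformed request.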
The PDF generator wraps each entry in a fixed-size .page-content block,
forces a hard page break between them via CSS break-after: page, and lets
WeasyPrint do the actual rendering. Header and footer go through WeasyPrint's
running() machinery so they land in the top-center / bottom-center margin
boxes on every physical page – including the rare second physical page that an oversized
single map entry might spill onto.
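The wrapping step described above can be sketched as a function that assembles one HTML document from the map. This is an illustration under assumptions, not the actual generator: the function name build_paged_html, the wrapper class export-page, and the running-element names doc-header / doc-footer are made up here, and the fixed block size mentioned above is left out for brevity.

```python
def build_paged_html(html_map, header_html="", footer_html="",
                     page_size="A4", margin_mm=20):
    """Return one HTML document with a forced page break after each map entry."""
    css = f"""
@page {{
  size: {page_size};
  margin: {margin_mm}mm;
  @top-center {{ content: element(doc-header); }}
  @bottom-center {{ content: element(doc-footer); }}
}}
.doc-header {{ position: running(doc-header); }}
.doc-footer {{ position: running(doc-footer); }}
.export-page {{ break-after: page; }}
.export-page:last-child {{ break-after: auto; }}
"""
    # One wrapper per browser-computed page; the chunk HTML is passed through as-is.
    pages = "\n".join(
        f'<section class="export-page">{chunk}</section>' for chunk in html_map
    )
    # Running elements come first in the body so they are available from page 1.
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        f"<style>{css}</style></head><body>"
        f'<div class="doc-header">{header_html}</div>'
        f'<div class="doc-footer">{footer_html}</div>'
        f"{pages}</body></html>"
    )
```

The resulting string could then be handed to WeasyPrint, for example with weasyprint.HTML(string=html).write_pdf(path). Because the margin boxes belong to @page rather than to a wrapper, the header and footer repeat on any extra physical page an oversized entry spills onto.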
Two pipelines, one fallback
Not every export request comes from the browser. External agents calling our MCP server
can issue a "download this document" tool call before a human has ever opened the
document in the editor – which means there is no html_map yet, because no
browser has done the layout work.
For that case, the server falls back to WeasyPrint's native pagination via @page
rules. We had a fallback before, but it had the bug that triggered the rewrite in the
first place: a fixed-height clipping rule meant to handle the html_map case was also
firing on the fallback path, and any content past the first page was silently truncated.
Investigations like that are the reason we now keep two distinct codepaths instead of one
clever generalisation.
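One way to picture the split is a dispatcher that picks the path once and only emits the clipping rule on the html_map path. This is a hedged sketch: choose_pdf_path, css_for_path, the export-page class, and the exact form of the clipping rule are assumptions standing in for whatever the real code uses.

```python
# Physical page heights in millimetres (A4 and US Letter).
PAGE_HEIGHT_MM = {"A4": 297.0, "Letter": 279.4}


def choose_pdf_path(html_map):
    """Use the browser's layout when it exists, else WeasyPrint's own pagination."""
    return "html_map" if html_map else "fallback"


def css_for_path(path, page_size="A4", margin_mm=20):
    """Build the stylesheet for one path; fixed-height clipping never reaches the fallback."""
    rules = [f"@page {{ size: {page_size}; margin: {margin_mm}mm; }}"]
    if path == "html_map":
        content_height = PAGE_HEIGHT_MM[page_size] - 2 * margin_mm
        # Stand-in for the fixed-height rule: safe only because each block
        # already holds exactly one browser-computed page.
        rules.append(
            f".export-page {{ height: {content_height}mm; "
            "overflow: hidden; break-after: page; }"
        )
    # The fallback path gets no height or overflow rule, so content flows
    # across as many pages as it needs.
    return "\n".join(rules)
```

Keeping the rule behind an explicit branch makes the old failure mode visible in a test: the fallback stylesheet can be asserted to contain no clipping at all.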
DOCX: Pandoc + a hand-built reference.docx
DOCX deserves special treatment because Word does not speak CSS at all. The naive approach – HTML → some library → .docx – throws away every class on every span and gives you a document that has the right text but none of the styling.
We use Pandoc for the body, python-docx for the
Word-specific bits Pandoc can't reach (page header, page footer), and a hand-built
reference.docx as a style template.
The reference.docx trick
Pandoc's --reference-doc=… flag lets you supply a Word document whose styles
define your "Heading 1", "Normal", "Title", and so on. Pandoc renders the
markdown, applies these styles, and the result inherits everything you put into the
reference – fonts, colours, line-height, margins, list indentation.
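As a rough sketch, the two Pandoc invocations involved can be built as argument lists and passed to subprocess.run. The function names and the example file paths are assumptions; the flags themselves (--reference-doc, --lua-filter, --print-default-data-file) are standard Pandoc options.

```python
def default_reference_command(output_path):
    """Command that writes Pandoc's built-in reference.docx to output_path."""
    return ["pandoc", "-o", output_path,
            "--print-default-data-file", "reference.docx"]


def pandoc_docx_command(markdown_path, output_path, reference_doc, lua_filter=None):
    """Command that converts markdown to DOCX using a style template."""
    cmd = [
        "pandoc", markdown_path,
        "--from=markdown",
        "--to=docx",
        f"--reference-doc={reference_doc}",
    ]
    if lua_filter:
        # Optional filter script, applied to the AST before the docx writer runs.
        cmd.append(f"--lua-filter={lua_filter}")
    cmd += ["-o", output_path]
    return cmd
```

Returning a list rather than a shell string avoids quoting problems with document paths, and makes the exact invocation easy to log or assert on.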
We generate reference.docx from a small one-shot build script
(templates/build_reference_docx.py) that emits Pandoc's default reference
and then patches it with python-docx so:
- Heading 1 / 2 / 3 / Title use Playfair Display, bold, in our editor's text-primary navy (#111827) – exactly the same colour the heading has in the browser.
- Normal / List Paragraph / Body use Inter 11pt in slate (#374151), with line-height 1.7 and 0.25" list indentation – matched against the editor.
- Header / Footer styles are Inter 10pt with the thin bottom border the editor renders via CSS, so the visual divider survives the format change.
- The fontTable.xml is extended with one <w:font> entry per editor typeface, each carrying a <w:altName> hint pointing at the nearest Word-standard family (Inter → Calibri, Playfair Display → Cambria, Roboto Mono → Consolas, etc.). Word users without our exact fonts installed see the alt substitution instead of falling back to Times New Roman.
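The fontTable extension can be illustrated with the standard library alone. This is a simplified sketch, not the build script itself: add_alt_fonts is a hypothetical helper, the mapping holds only the three pairs named above, and a real fontTable.xml declares more namespaces and child elements than this handles.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)

# Only the pairs named in the text; the full editor list is not shown there.
ALT_NAMES = {
    "Inter": "Calibri",
    "Playfair Display": "Cambria",
    "Roboto Mono": "Consolas",
}


def add_alt_fonts(font_table_xml, alt_names=ALT_NAMES):
    """Append a <w:font> entry with a <w:altName> hint for each missing typeface."""
    root = ET.fromstring(font_table_xml)
    existing = {f.get(f"{{{W_NS}}}name") for f in root.findall(f"{{{W_NS}}}font")}
    for name, alt in alt_names.items():
        if name in existing:
            continue  # never duplicate an entry that is already present
        font = ET.SubElement(root, f"{{{W_NS}}}font", {f"{{{W_NS}}}name": name})
        ET.SubElement(font, f"{{{W_NS}}}altName", {f"{{{W_NS}}}val": alt})
    return ET.tostring(root, encoding="unicode")
```

Skipping names that already exist keeps the helper idempotent, which matters for a build script that may be re-run against its own output.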
Inline classes: the Lua filter
Pandoc's docx writer drops CSS-style attributes on Span elements. That's
fine for plain markdown but a problem for AgentDoc, where every inline style is encoded
as a stand-off class on a span: [text]{.color-red},
[text]{.highlight-yellow}, [text]{.font-lora .size-xl}, and so
on.
The fix: a small Pandoc Lua filter
(templates/inline_styles.lua) that runs over the AST during the docx writer
pass and translates classes into raw OOXML run properties:
function Span(el)
  for _, cls in ipairs(el.classes) do
    if cls:match("^color%-") then
      local hex = COLORS[cls:sub(7)]
      if hex then
        return pandoc.RawInline("openxml",
          '<w:rPr><w:color w:val="'..hex..'"/></w:rPr>'
        ), el.content
      end
    elseif cls:match("^highlight%-") then …
    elseif cls:match("^font%-") then …
    elseif cls:match("^size%-") then …
    end
  end
end
Coverage is exhaustive on purpose:
- Colours: 17 named editor colours mapped to the actual hex used by CSS, not W3C keyword hexes. color-red is the editor's softer alizarin #E74C3C, not #FF0000; color-blue is #3498DB, not #0000FF. A blue heading in the browser now renders as the same blue in Word.
- Highlights: Word's full named-highlight set (yellow / green / cyan / magenta / blue / red plus dark/light variants); anything outside that set falls back to yellow rather than dropping silently.
- Fonts: 12 entries emitting <w:rFonts ascii="…" hAnsi="…" cs="…" eastAsia="…"/> for every editor typeface.
- Sizes: 7 tokens (xs..3xl) emitted as <w:sz w:val="halfPt"/> values, em-derived against the 13pt body baseline used by reference.docx.
- Decorations: decoration-bold / -italic / -strikethrough are translated to standard markdown before Pandoc, so we get real <w:b/>, <w:i/> etc. runs instead of raw OOXML.
Combinations stack. A span like [text]{.color-red .size-xl .font-lora}
emits a single run with all three properties set in a single <w:rPr>.
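The stacking behaviour is easier to see outside Lua. Below is a Python sketch of the same mapping, written as an illustration and not a port of the filter: the function run_properties, the font table, and the em multipliers for the size tokens are assumptions, and only the two colours named above are included.

```python
COLORS = {"red": "E74C3C", "blue": "3498DB"}   # the two hexes the text names
FONTS = {"lora": "Lora"}                       # assumed class-to-family mapping
SIZE_EM = {"xs": 0.75, "sm": 0.875, "base": 1.0, "lg": 1.125,
           "xl": 1.25, "2xl": 1.5, "3xl": 1.875}  # assumed em scale
BODY_PT = 13                                   # body baseline stated above
WORD_HIGHLIGHTS = {"yellow", "green", "cyan", "magenta", "blue", "red",
                   "darkBlue", "darkCyan", "darkGreen", "darkMagenta",
                   "darkRed", "darkYellow", "darkGray", "lightGray"}


def run_properties(classes):
    """Fold a span's classes into one <w:rPr> string, or '' if none apply."""
    font = color = size = highlight = None
    for cls in classes:
        prefix, _, value = cls.partition("-")
        if prefix == "color" and value in COLORS:
            color = COLORS[value]
        elif prefix == "highlight":
            # Unknown highlight names fall back to yellow instead of vanishing.
            highlight = value if value in WORD_HIGHLIGHTS else "yellow"
        elif prefix == "font" and value in FONTS:
            font = FONTS[value]
        elif prefix == "size" and value in SIZE_EM:
            # w:sz is expressed in half-points; round half up.
            size = int(BODY_PT * SIZE_EM[value] * 2 + 0.5)

    parts = []  # emitted in schema order: rFonts, color, sz, highlight
    if font:
        parts.append(f'<w:rFonts w:ascii="{font}" w:hAnsi="{font}" '
                     f'w:cs="{font}" w:eastAsia="{font}"/>')
    if color:
        parts.append(f'<w:color w:val="{color}"/>')
    if size:
        parts.append(f'<w:sz w:val="{size}"/>')
    if highlight:
        parts.append(f'<w:highlight w:val="{highlight}"/>')
    if not parts:
        return ""
    return "<w:rPr>" + "".join(parts) + "</w:rPr>"
```

Collecting every property first and serialising once is what lets combinations share a single <w:rPr>, independent of the order in which the classes appear on the span.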
Header and footer: the python-docx post-pass
Pandoc's docx writer produces a body but doesn't fill in section header / footer. Our previous workaround – inlining them as italicised lead-in / closing lines – was an ergonomic disaster: any user who exported a real letter saw their header text floating above the recipient's address.
After Pandoc runs, a python-docx post-pass opens the resulting docx, walks
section.header / section.footer, and writes the
header/footer-markdown through a small inline parser. The parser handles bold (**…**),
italic (*…*), and the same stand-off span syntax the body uses
(color-X, highlight-X, decoration-*) – mapped to
Word run colours, highlights, and bold/italic <w:rPr> properties
respectively.
The result lives in the actual Word section header, which means it shows up on every page and prints correctly – including on re-paginations Word does itself when the recipient opens the document.
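A minimal sketch of what such an inline parser could look like, assuming a regex tokenizer; parse_inline, the token pattern, and the run representation (text plus a dict of properties) are illustrative choices, not the actual post-pass. Nested markers are deliberately not handled.

```python
import re

COLORS = {"red": "E74C3C", "blue": "3498DB"}  # the two hexes named earlier

# Bold is tried before italic so "**x**" is not read as two italic markers.
TOKEN = re.compile(
    r"\*\*(?P<bold>.+?)\*\*"
    r"|\*(?P<italic>.+?)\*"
    r"|\[(?P<span>[^\]]+)\]\{(?P<classes>[^}]*)\}"
)


def _props_from_classes(classes):
    """Translate '.color-red .decoration-bold' into a property dict."""
    props = {}
    for cls in classes.split():
        cls = cls.lstrip(".")
        prefix, _, value = cls.partition("-")
        if prefix == "color" and value in COLORS:
            props["color"] = COLORS[value]
        elif prefix == "highlight":
            props["highlight"] = value
        elif cls == "decoration-bold":
            props["bold"] = True
        elif cls == "decoration-italic":
            props["italic"] = True
    return props


def parse_inline(text):
    """Split header/footer markdown into (text, properties) runs."""
    runs, pos = [], 0
    for m in TOKEN.finditer(text):
        if m.start() > pos:
            runs.append((text[pos:m.start()], {}))  # plain text between tokens
        if m.group("bold") is not None:
            runs.append((m.group("bold"), {"bold": True}))
        elif m.group("italic") is not None:
            runs.append((m.group("italic"), {"italic": True}))
        else:
            runs.append((m.group("span"), _props_from_classes(m.group("classes"))))
        pos = m.end()
    if pos < len(text):
        runs.append((text[pos:], {}))
    return runs
```

Each run could then be written with python-docx through paragraph.add_run(text), setting run.bold, run.italic, and run.font.color.rgb from the property dict.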
Workflow gating: T+ gets the new export, T stays bit-stable
DOCX export is, deliberately, a feature only available on Workflow T+. T stays bit-stable because the tool granularity benchmarks ran against its exact tool surface, and we want those CSV files to stay reproducible. T+ is the post-benchmark mutable production track and gets the additions.
This is enforced in two places to make leaks impossible:
- agent/routers/chat.py defines T_PLUS_EXCLUSIVE_TOOLS = {"trigger_docx_download"}; T's branch explicitly excludes that set so a newly registered tool cannot accidentally land in T's frozen schema.
- mcp_workflow_middleware.WORKFLOW_T_DENY also lists the tool, so external agents (which are pinned to T's surface by design) do not see it via the MCP endpoint either.
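The double gate might look roughly like the sketch below. The two set names and the tool name come from the text; the functions tools_for_workflow and mcp_visible_tools, and the second tool name used in the usage example, are hypothetical.

```python
# Mirrors the set defined in agent/routers/chat.py.
T_PLUS_EXCLUSIVE_TOOLS = {"trigger_docx_download"}

# Mirrors mcp_workflow_middleware.WORKFLOW_T_DENY.
WORKFLOW_T_DENY = {"trigger_docx_download"}


def tools_for_workflow(registered, workflow):
    """First gate: the chat router's per-workflow tool list."""
    if workflow == "T+":
        return sorted(registered)
    if workflow == "T":
        # T's schema is frozen, so T+ additions are subtracted explicitly.
        return sorted(set(registered) - T_PLUS_EXCLUSIVE_TOOLS)
    raise ValueError(f"unknown workflow: {workflow!r}")


def mcp_visible_tools(registered):
    """Second gate: external agents are pinned to T's surface."""
    return sorted(t for t in registered if t not in WORKFLOW_T_DENY)
```

With a registry of {"trigger_pdf_download", "trigger_docx_download"} (the first name is made up for the example), workflow T and the MCP endpoint both see only the PDF tool, while T+ sees both.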
What we explicitly didn't do
Two things stayed undone on purpose; both are tradeoffs worth being explicit about:
- Font embedding. A truly 1:1 fix would embed the actual Inter / Playfair Display / etc. TTFs into every generated DOCX. It would also add ~400 KB per export, plus per-font legal review for redistribution. We rely on Word's font substitution + the altName hints in our extended fontTable instead. Most of the editor's families are popular Google Fonts that many Office 365 users already have.
- One-pipeline-to-rule-them-all. We now keep two distinct PDF codepaths (html_map vs. fallback) and a separate DOCX path. Generalising further would let the kind of fixed-height-clipping bug from before recur silently. Two codepaths plus integration tests beats one clever one.
The integration tests that actually catch regressions
The thing that makes this rewrite stick is a set of manual audit scripts that introspect the rendered output:
backend/tests/manual_pdf_audit.py # uses pypdf to read text + font metadata
backend/tests/manual_docx_audit.py # walks the OOXML and checks run properties
backend/tests/manual_format_parity.py # diffs editor classes against Word run props
These are not unit tests; they're tools we run after non-trivial changes to confirm that
the exported file actually has the colour we expect on the heading we expect. We added
poppler-utils and pypdf to the backend Dockerfile so they work
on the production image too, which means we can run them against actual user documents
when investigating a complaint.
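To give a flavour of what walking the OOXML means in practice, here is a small standard-library sketch that lists every coloured run in a DOCX. It is an assumption about the general approach, not the contents of manual_docx_audit.py; the function name run_colours is made up.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import List, Tuple

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def run_colours(docx_bytes: bytes) -> List[Tuple[str, str]]:
    """Return (run text, hex colour) for every run with an explicit colour."""
    # A .docx file is a zip archive; the body lives in word/document.xml.
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    results = []
    for run in root.iter(f"{{{W_NS}}}r"):
        text = "".join(t.text or "" for t in run.iter(f"{{{W_NS}}}t"))
        color = run.find(f"{{{W_NS}}}rPr/{{{W_NS}}}color")
        if color is not None:
            results.append((text, color.get(f"{{{W_NS}}}val")))
    return results
```

Called on the bytes of an exported file, the returned pairs can be compared against the editor's colour table, so a heading that silently turned black shows up as a missing or mismatched entry.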
What we learned worth keeping
Three durable lessons from this rewrite:
- Don't re-derive what the client already computed. If the browser paginated the document, send the pagination down. Any server-side recomputation will disagree in subtle ways and you will spend weeks chasing the disagreements.
- Use the format's own template mechanism instead of fighting it. Pandoc --reference-doc=… is the right escape hatch for DOCX. WeasyPrint's @page rules are the right escape hatch for PDF. Hand-rolled HTML-to-DOCX libraries always lose to Pandoc + a reference document on any non-trivial output.
- Two codepaths and a test that catches the difference is better than one codepath you have to trust. The cost of the parallel paths is small; the cost of a silent truncation bug in the fallback is large.
The next post in this stream is on the voice-first architecture – including some recent reliability work on Gemini Live session resumption that's also worth a write-up of its own. If there's a particular detail you'd like more depth on, the feedback widget on the bottom-right of every page on this site goes straight to our inbox.