Rebuilding PDF + DOCX Export: WeasyPrint, Pandoc Lua Filters, and Trusting the Frontend

Engineering May 6, 2026 · 11 min read

For a document editor, "Export" has to mean one thing: what you see on screen is what ends up in the file. Anything else feels broken, even when the discrepancy is small: a heading that's centered on screen but left-aligned in the PDF, a paragraph that wraps differently because the server's font metrics disagree with the browser's, an exported DOCX whose blue heading turned black because Word doesn't speak CSS.

Over the last two weeks we rewrote the entire export pipeline behind AgentDoc – the AI-native docs editor (also: agent doc, agentdocs, docedit) – to make WYSIWYG actually true. Two formats, two engines, one architectural insight: the frontend is already the source of truth for layout, so the server should follow it, not re-derive it.

The old design and why it kept being subtly wrong

Pre-rewrite, both PDF and DOCX export worked by sending the document's raw markdown to the backend, where a renderer would (a) re-paginate it server-side, (b) apply CSS that was a hand-maintained subset of the editor's style.css, and (c) hand the result to WeasyPrint or to a quick docx writer. Three failure modes followed:

Pagination disagreed. The browser's paginator measured each block in real DOM with real fonts loaded; the server measured a different DOM with subtly different fonts and different available widths. Ten-page documents reliably came out as nine or eleven, and the page where a heading appeared shifted.
CSS divergence. The export CSS was a copy of style.css with edits applied lazily. Tables got centered headers in the editor and left-aligned ones in the PDF; <blockquote> had a left border on screen and not in the export. We were debugging individual selectors instead of the architecture.
DOCX was a different beast entirely. The previous quick-and-dirty path inlined header/footer as italic text at the top and bottom of the body, dropped colour and font classes, and produced something Word users were polite about but did not actually use.

Server-side pagination of editor content is a caching problem disguised as a layout problem. The browser already did the work; sending the result down is cheaper than re-doing it.

The architectural shift: the frontend supplies an html_map

The editor already knows where every page break lands. Its paginator runs the layout, walks the DOM, and produces an array of HTML chunks – one per page – plus the active layout meta (page_size, margin_mm). All we needed was a way to send that to the export endpoints.

The new export contract: when the user clicks "Export PDF" or "Export DOCX", the frontend attaches the pagination map (html_map) to the request body. The server treats it as the authoritative layout and stops trying to compute its own.

POST /api/doc/{id}/pdf
{
  "html_map": [
    "<div class='page-content'>…page 1 HTML…</div>",
    "<div class='page-content'>…page 2 HTML…</div>",
    …
  ],
  "header_html": "…",
  "footer_html": "…",
  "page_size": "A4",
  "margin_mm": 20
}

The PDF generator wraps each entry in a fixed-size .page-content block, forces a hard page break between them via CSS break-after: page, and lets WeasyPrint do the actual rendering (WeasyPrint lays the HTML out into a vector PDF; nothing is rasterised). Header and footer go through WeasyPrint's support for CSS running elements (position: running() paired with content: element()) so they land in the top-center / bottom-center margin boxes on every physical page – including the rare second physical page that an oversized single map entry might spill onto.
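As a rough illustration of that assembly step, the sketch below builds one printable HTML string from a pagination map. It is not our production generator: the function name build_print_html and the doc-header / doc-footer identifiers are made up for the example, the fixed-size rule is omitted, and the map entries are assumed to be bare page HTML.

```python
def build_print_html(html_map, header_html, footer_html, page_size="A4", margin_mm=20):
    """Assemble one HTML document from a browser-supplied pagination map."""
    # The running elements are pulled out of the flow and re-inserted into the
    # page margin boxes on every physical page.
    css = f"""
@page {{
  size: {page_size};
  margin: {margin_mm}mm;
  @top-center {{ content: element(doc-header); }}
  @bottom-center {{ content: element(doc-footer); }}
}}
.doc-header {{ position: running(doc-header); }}
.doc-footer {{ position: running(doc-footer); }}
.page-content {{ break-after: page; }}
.page-content:last-child {{ break-after: auto; }}
"""
    # One wrapper per map entry, so each entry starts on a fresh page.
    pages = "\n".join(
        f'<div class="page-content">{chunk}</div>' for chunk in html_map
    )
    # Running elements come first so they are available from page 1 onwards.
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        f"<style>{css}</style></head><body>"
        f'<div class="doc-header">{header_html}</div>'
        f'<div class="doc-footer">{footer_html}</div>'
        f"{pages}</body></html>"
    )
```

The resulting string can then be handed to WeasyPrint, for example with weasyprint.HTML(string=html).write_pdf(path).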

Two pipelines, one fallback

Not every export request comes from the browser. External agents calling our MCP server can issue a "download this document" tool call before a human ever opened the document in the editor – which means there is no html_map yet, because no browser has done the layout work.

For that case, the server falls back to WeasyPrint's native pagination via @page rules. We had a fallback before, but it had the bug that triggered the rewrite in the first place: a fixed-height clipping rule meant to handle the html_map case was also firing on the fallback path, and any content past the first page was silently truncated. Investigations like that are the reason we now keep two distinct codepaths instead of one clever generalisation.
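A minimal sketch of keeping the two paths apart might look like the following. The function name, the page-height table, and the exact clipping declaration are assumptions made for illustration; the point is only that the fixed-height rule is generated inside the html_map branch and nowhere else.

```python
# Physical page heights in millimetres (illustrative subset).
PAGE_HEIGHT_MM = {"A4": 297, "Letter": 279.4}


def pipeline_css(payload):
    """Return (pipeline_name, css) for an export request body."""
    size = payload.get("page_size", "A4")
    margin = payload.get("margin_mm", 20)
    base = f"@page {{ size: {size}; margin: {margin}mm; }}"

    if payload.get("html_map"):
        # Browser-supplied layout: one fixed-height block per map entry.
        content_height = PAGE_HEIGHT_MM[size] - 2 * margin
        page_rule = (
            f" .page-content {{ height: {content_height}mm;"
            " overflow: hidden; break-after: page; }"
        )
        return "html_map", base + page_rule

    # No html_map (for example an MCP tool call): let WeasyPrint paginate
    # natively, with no height or overflow rule that could truncate content.
    return "fallback", base
```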

DOCX: Pandoc + a hand-built reference.docx

DOCX deserves special treatment because Word does not speak CSS at all. The naive approach – HTML → some library → .docx – throws away every class on every span and gives you a document that has the right text but none of the styling.

We use Pandoc for the body, python-docx for the Word-specific bits Pandoc can't reach (page header, page footer), and a hand-built reference.docx as a style template.

The reference.docx trick

Pandoc's --reference-doc=… flag lets you supply a Word document whose styles define your "Heading 1", "Normal", "Title", and so on. Pandoc renders the markdown, applies these styles, and the result inherits everything you put into the reference – fonts, colours, line-height, margins, list indentation.
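For orientation, a Pandoc invocation using that flag could be assembled roughly like this. The helper names and the file paths in the usage note are placeholders for the example, not our actual call site; the flags themselves (--from, --to, --reference-doc, --lua-filter, --output) are standard Pandoc options.

```python
import subprocess


def pandoc_docx_command(src_md, out_docx, reference_doc, lua_filter=None):
    """Build the argument list for a markdown -> docx conversion."""
    cmd = [
        "pandoc", src_md,
        "--from", "markdown",
        "--to", "docx",
        f"--reference-doc={reference_doc}",  # styles come from this file
        "--output", out_docx,
    ]
    if lua_filter:
        cmd.append(f"--lua-filter={lua_filter}")
    return cmd


def run_pandoc(cmd):
    """Run the conversion; raises CalledProcessError if Pandoc fails."""
    subprocess.run(cmd, check=True)
```

Something like run_pandoc(pandoc_docx_command("doc.md", "doc.docx", "reference.docx")) would then perform the conversion, assuming Pandoc is on the PATH.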

We generate reference.docx from a small one-shot build script (templates/build_reference_docx.py) that emits Pandoc's default reference and then patches it with python-docx so:

Heading 1 / 2 / 3 / Title use Playfair Display, bold, in our editor's text-primary navy (#111827) – exactly the same colour the heading has in the browser.
Normal / List Paragraph / Body use Inter 11pt in slate (#374151), with line-height 1.7 and 0.25" list indentation – matched against the editor.
Header / Footer styles are Inter 10pt with the thin bottom border the editor renders via CSS, so the visual divider survives the format change.
The fontTable.xml is extended with one <w:font> entry per editor typeface, each carrying a <w:altName> hint pointing at the nearest Word-standard family (Inter → Calibri, Playfair Display → Cambria, Roboto Mono → Consolas, etc.). Word users without our exact fonts installed see the alt substitution instead of falling back to Times New Roman.
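The fontTable extension can be sketched with the standard library alone. This is a simplified stand-in for the build script, not a copy of it: the function name is hypothetical, and a real fontTable.xml declares several more namespaces that a production script would need to preserve (lxml is a common choice for that).

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)

# Substitution hints named in the list above.
ALT_NAMES = {
    "Inter": "Calibri",
    "Playfair Display": "Cambria",
    "Roboto Mono": "Consolas",
}


def add_alt_fonts(font_table_xml, alt_names):
    """Append one <w:font> with a <w:altName> hint per missing typeface."""
    root = ET.fromstring(font_table_xml)
    existing = {f.get(f"{{{W_NS}}}name") for f in root.findall(f"{{{W_NS}}}font")}
    for name, alt in alt_names.items():
        if name in existing:
            continue  # never duplicate an entry that is already declared
        font = ET.SubElement(root, f"{{{W_NS}}}font", {f"{{{W_NS}}}name": name})
        ET.SubElement(font, f"{{{W_NS}}}altName", {f"{{{W_NS}}}val": alt})
    return ET.tostring(root, encoding="unicode")
```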

Inline classes: the Lua filter

Pandoc's docx writer drops CSS-style attributes on Span elements. That's fine for plain markdown but a problem for AgentDoc, where every inline style is encoded as a stand-off class on a span: [text]{.color-red}, [text]{.highlight-yellow}, [text]{.font-lora .size-xl}, and so on.

The fix: a small Pandoc Lua filter (templates/inline_styles.lua) that walks the AST before it reaches the docx writer and translates classes into raw OOXML run properties:

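-- COLORS maps editor colour names to hex values without the leading "#".
local function xml_escape(s)
  return (s:gsub("&", "&amp;"):gsub("<", "&lt;"):gsub(">", "&gt;"))
end

function Span(el)
  local props = {}  -- collect every property so combined classes stack
  for _, cls in ipairs(el.classes) do
    if cls:match("^color%-") then
      local hex = COLORS[cls:sub(7)]  -- strip the "color-" prefix
      if hex then
        props[#props + 1] = '<w:color w:val="' .. hex .. '"/>'
      end
    elseif cls:match("^highlight%-") then …
    elseif cls:match("^font%-")      then …
    elseif cls:match("^size%-")      then …
    end
  end
  if #props == 0 then return nil end  -- nothing to style: keep the span as-is
  -- Emit one complete run; a bare <w:rPr> outside <w:r> is not valid OOXML.
  -- Simplification: stringify flattens any nested inline formatting.
  local text = xml_escape(pandoc.utils.stringify(el.content))
  return pandoc.RawInline("openxml",
    '<w:r><w:rPr>' .. table.concat(props) .. '</w:rPr>'
    .. '<w:t xml:space="preserve">' .. text .. '</w:t></w:r>')
end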

Coverage is exhaustive on purpose:

Colours: 17 named editor colours mapped to the actual hex used by CSS, not W3C keyword hexes. color-red is the editor's softer alizarin #E74C3C, not #FF0000; color-blue is #3498DB, not #0000FF. A blue heading in the browser now renders as the same blue in Word.
Highlights: Word's full named-highlight set (yellow / green / cyan / magenta / blue / red plus dark/light variants); anything outside that set falls back to yellow rather than dropping silently.
Fonts: 12 entries emitting <w:rFonts w:ascii="…" w:hAnsi="…" w:cs="…" w:eastAsia="…"/> for every editor typeface.
Sizes: 7 tokens (xs..3xl) emitted as <w:sz w:val="halfPt"/> values, em-derived against the 13pt body baseline used by reference.docx.
Decorations: decoration-bold / -italic / -strikethrough are translated to standard markdown before Pandoc, so we get real <w:b/>, <w:i/> etc. runs instead of raw OOXML.
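The decoration step in the last item is a plain text rewrite that happens before Pandoc sees the markdown. A minimal sketch, assuming spans are never nested and using hypothetical names (lower_decorations, WRAP), could look like this:

```python
import re

# Matches [text]{.class-a .class-b}; nested brackets are not handled here.
SPAN = re.compile(r"\[([^\]]+)\]\{([^}]*)\}")

# Decoration classes and the markdown delimiter each one becomes.
WRAP = {
    "decoration-bold": "**",
    "decoration-italic": "*",
    "decoration-strikethrough": "~~",
}


def lower_decorations(markdown):
    """Rewrite decoration-* classes as native markdown emphasis."""
    def repl(match):
        text, classes = match.group(1), match.group(2).split()
        remaining = []
        for cls in classes:
            mark = WRAP.get(cls.lstrip("."))
            if mark:
                text = f"{mark}{text}{mark}"
            else:
                remaining.append(cls)  # colour, font, size stay on the span
        if remaining:
            return f"[{text}]{{{' '.join(remaining)}}}"
        return text

    return SPAN.sub(repl, markdown)
```

Classes that are not decorations stay attached to the span, so the Lua filter still sees them.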

Combinations stack. A span like [text]{.color-red .size-xl .font-lora} emits a single run with all three properties set in a single <w:rPr>.
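The stacking behaviour is easier to see in a small Python model of the class-to-property mapping. This mirrors the idea rather than the Lua filter itself: the two colour values come from the list above, while the font family name, the em values in SIZE_EM, and the function name are assumptions made for the example.

```python
COLORS = {"red": "E74C3C", "blue": "3498DB"}  # two of the editor colours
FONTS = {"lora": "Lora"}                       # assumed family name
SIZE_EM = {"sm": 0.875, "xl": 1.25}            # hypothetical em values
WORD_HIGHLIGHTS = {
    "yellow", "green", "cyan", "magenta", "blue", "red", "black", "white",
    "darkBlue", "darkCyan", "darkGreen", "darkMagenta", "darkRed",
    "darkYellow", "darkGray", "lightGray",
}


def run_properties(classes, base_pt):
    """Fold all style classes of one span into a single <w:rPr> string."""
    font = color = size = highlight = None
    for cls in classes:
        kind, _, value = cls.partition("-")
        if kind == "font" and value in FONTS:
            f = FONTS[value]
            font = (f'<w:rFonts w:ascii="{f}" w:hAnsi="{f}" '
                    f'w:cs="{f}" w:eastAsia="{f}"/>')
        elif kind == "color" and value in COLORS:
            color = f'<w:color w:val="{COLORS[value]}"/>'
        elif kind == "size" and value in SIZE_EM:
            # w:sz is expressed in half-points.
            size = f'<w:sz w:val="{round(SIZE_EM[value] * base_pt * 2)}"/>'
        elif kind == "highlight":
            name = value if value in WORD_HIGHLIGHTS else "yellow"
            highlight = f'<w:highlight w:val="{name}"/>'
    # Emit in the order the WordprocessingML schema lists these children.
    parts = [p for p in (font, color, size, highlight) if p]
    return f"<w:rPr>{''.join(parts)}</w:rPr>" if parts else ""
```

Calling run_properties(["color-red", "size-xl", "font-lora"], base_pt=13) yields one <w:rPr> element carrying the font, colour, and size together.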

Header and footer: the python-docx post-pass

Pandoc's docx writer produces a body but doesn't fill in section header / footer. Our previous workaround – inlining them as italicised lead-in / closing lines – was an ergonomic disaster: any user who exported a real letter saw their header text floating above the recipient's address.

After Pandoc runs, a python-docx post-pass opens the resulting docx, walks section.header / section.footer for each section, and runs the header and footer markdown through a small inline parser whose output it writes as Word runs. The parser handles bold (**…**), italic (*…*), and the same stand-off span syntax the body uses (color-X, highlight-X, decoration-*) – mapped to Word run colours, highlights, and bold/italic <w:rPr> properties respectively.
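To give a feel for the parsing half of that post-pass, here is a small tokenizer that turns a header or footer string into a list of run descriptions. It is a simplified sketch, not our parser: the name parse_inline and the dict shape are invented for the example, and nesting is not supported.

```python
import re

TOKEN = re.compile(
    r"\*\*(?P<bold>.+?)\*\*"                            # **bold**
    r"|\*(?P<italic>.+?)\*"                             # *italic*
    r"|\[(?P<span>[^\]]+)\]\{(?P<classes>[^}]*)\}"      # [text]{.class ...}
)


def parse_inline(text):
    """Split header/footer markdown into a list of run dicts."""
    runs, pos = [], 0
    for m in TOKEN.finditer(text):
        if m.start() > pos:
            runs.append({"text": text[pos:m.start()]})  # plain text in between
        if m.group("bold") is not None:
            runs.append({"text": m.group("bold"), "bold": True})
        elif m.group("italic") is not None:
            runs.append({"text": m.group("italic"), "italic": True})
        else:
            run = {"text": m.group("span")}
            for cls in m.group("classes").split():
                kind, _, value = cls.lstrip(".").partition("-")
                if kind in ("color", "highlight"):
                    run[kind] = value
                elif kind == "decoration" and value in ("bold", "italic"):
                    run[value] = True
            runs.append(run)
        pos = m.end()
    if pos < len(text):
        runs.append({"text": text[pos:]})
    return runs
```

Each dict then maps onto one python-docx run: paragraph.add_run(text), followed by setting run.bold, run.italic, or run.font.color.rgb as needed.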

The result lives in the actual Word section header, which means it shows up on every page and prints correctly – including on re-paginations Word does itself when the recipient opens the document.

Workflow gating: T+ gets the new export, T stays bit-stable

DOCX export is, deliberately, a feature only available on Workflow T+. T stays bit-stable because the tool granularity benchmarks ran against its exact tool surface, and we want those CSV files to stay reproducible. T+ is the post-benchmark mutable production track and gets the additions.

This is enforced in two places to make leaks impossible:

agent/routers/chat.py defines T_PLUS_EXCLUSIVE_TOOLS = {"trigger_docx_download"}; T's branch explicitly excludes that set so a newly registered tool cannot accidentally land in T's frozen schema.
mcp_workflow_middleware.WORKFLOW_T_DENY also lists the tool, so external agents (which are pinned to T's surface by design) do not see it via the MCP endpoint either.
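The double gate boils down to two independent set checks. The sketch below reuses the two constant names from the list above, but the helper functions and the second tool name in the usage note are hypothetical and only illustrate the filtering.

```python
T_PLUS_EXCLUSIVE_TOOLS = {"trigger_docx_download"}
WORKFLOW_T_DENY = {"trigger_docx_download"}


def tools_for_workflow(workflow, registered_tools):
    """Chat-router gate: T never sees tools reserved for T+."""
    if workflow == "T+":
        return list(registered_tools)
    return [t for t in registered_tools if t not in T_PLUS_EXCLUSIVE_TOOLS]


def mcp_visible_tools(registered_tools):
    """MCP gate: external agents are pinned to T's surface."""
    return [t for t in registered_tools if t not in WORKFLOW_T_DENY]
```

With a registry such as ["trigger_pdf_download", "trigger_docx_download"], both the T branch and the MCP endpoint expose only the first tool, while T+ exposes both. Because the gates are separate, forgetting to update one of them still leaves the other in place.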

What we explicitly didn't do

Two things stayed undone on purpose; both are tradeoffs worth being explicit about:

Font embedding. A truly 1:1 fix would embed the actual Inter / Playfair Display / etc. TTFs into every generated DOCX. It would also add ~400 KB per export, plus per-font legal review for redistribution. We rely on Word's font substitution + the altName hints in our extended fontTable instead. Most of the editor's families are popular Google Fonts that many Office 365 users already have.
One-pipeline-to-rule-them-all. We now keep two distinct PDF codepaths (html_map vs. fallback) and a separate DOCX path. Generalising further would let the kind of fixed-height-clipping bug from before recur silently. Two codepaths plus integration tests beats one clever one.

The integration tests that actually catch regressions

The thing that makes this rewrite stick is a set of manual audit scripts that introspect the rendered output:

backend/tests/manual_pdf_audit.py    # uses pypdf to read text + font metadata
backend/tests/manual_docx_audit.py   # walks the OOXML and checks run properties
backend/tests/manual_format_parity.py # diffs editor classes against Word run props

These are not unit tests; they're tools we run after non-trivial changes to confirm that the exported file actually has the colour we expect on the heading we expect. We added poppler-utils and pypdf to the backend Dockerfile so they work on the production image too, which means we can run them against actual user documents when investigating a complaint.
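The DOCX side of such an audit needs little more than the standard library, since a .docx is a zip archive with the body in word/document.xml. The following is a stripped-down sketch of the idea, not the contents of manual_docx_audit.py; the function name is made up for the example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def run_colours(docx_bytes):
    """Return (text, colour hex or None) for every run in the body."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    result = []
    for run in root.iter(f"{W}r"):
        text = "".join(t.text or "" for t in run.iter(f"{W}t"))
        colour = run.find(f"{W}rPr/{W}color")
        value = colour.get(f"{W}val") if colour is not None else None
        result.append((text, value))
    return result
```

Comparing that list against the colour classes in the source markdown is enough to spot a heading that lost its colour on the way into Word.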

What we learned worth keeping

Three durable lessons from this rewrite:

Don't re-derive what the client already computed. If the browser paginated the document, send the pagination down. Any server-side recomputation will disagree in subtle ways and you will spend weeks chasing the disagreements.
Use the format's own template mechanism instead of fighting it. Pandoc --reference-doc=… is the right escape hatch for DOCX. WeasyPrint's @page rules are the right escape hatch for PDF. Hand-rolled HTML-to-DOCX libraries always lose to Pandoc + a reference document on any non-trivial output.
Two codepaths and a test that catches the difference is better than one codepath you have to trust. The cost of the parallel paths is small; the cost of a silent truncation bug in the fallback is large.

The next post in this stream is on the voice-first architecture – including some recent reliability work on Gemini Live session resumption that's also worth a write-up of its own. If there's a particular detail you'd like more depth on, the feedback widget on the bottom-right of every page on this site goes straight to our inbox.
