Rebuilding PDF + DOCX Export: WeasyPrint, Pandoc Lua Filters, and Trusting the Frontend

Engineering · 11 min read

For a document editor, "Export" has to mean one thing: what you see on screen is what ends up in the file. Anything else feels broken, even when the defect is small: a heading that's centered on screen but left-aligned in the PDF, a paragraph that wraps differently because the server's font metrics disagree with the browser's, an exported DOCX whose blue heading turned black because Word doesn't speak CSS.

Over the last two weeks we rewrote the entire export pipeline behind AgentDoc, the AI-native docs editor, to make WYSIWYG actually true. Two formats, two engines, one architectural insight: the frontend is already the source of truth for layout, so the server should follow it, not re-derive it.

The old design and why it kept being subtly wrong

Pre-rewrite, both PDF and DOCX export worked by sending the document's raw markdown to the backend, where a renderer would (a) re-paginate it server-side, (b) apply CSS that was a hand-maintained subset of the editor's style.css, and (c) hand the result to WeasyPrint or to a quick docx writer. Three failure modes followed:

Server-side pagination of editor content is a caching problem disguised as a layout problem. The browser has already done the work; sending its result to the server is cheaper than redoing it there.

The architectural shift: the frontend supplies an html_map

The editor already knows where every page break lands. Its paginator runs the layout, walks the DOM, and produces an array of HTML chunks – one per page – plus the active layout meta (page_size, margin_mm). All we needed was a way to send that to the export endpoints.

The new export contract: when the user clicks "Export PDF" or "Export DOCX", the frontend attaches the pagination map (html_map) to the request body. The server treats it as the authoritative layout and stops trying to compute its own.

POST /api/doc/{id}/pdf
{
  "html_map": [
    "<div class='page-content'>…page 1 HTML…</div>",
    "<div class='page-content'>…page 2 HTML…</div>",
    …
  ],
  "header_html": "…",
  "footer_html": "…",
  "page_size": "A4",
  "margin_mm": 20
}

The PDF generator wraps each entry in a fixed-size .page-content block, forces a hard page break between them via CSS break-after: page, and lets WeasyPrint do the actual rendering. Header and footer go through WeasyPrint's running() machinery so they land in the top-center / bottom-center margin boxes on every physical page – including the rare second physical page that an oversized single map entry might spill onto.
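The assembly step described above can be sketched roughly as follows. This is a minimal illustration, not our generator: the function name, the pdf-page wrapper class (used here instead of nesting a second .page-content), and the doc-header / doc-footer names are assumptions made for the example, and the fixed page height is left out for brevity.

```python
from typing import Sequence


def build_print_html(
    html_map: Sequence[str],
    header_html: str,
    footer_html: str,
    page_size: str = "A4",
    margin_mm: int = 20,
) -> str:
    """Assemble one HTML document with a forced page break after each map entry."""
    css = f"""
    @page {{
      size: {page_size};
      margin: {margin_mm}mm;
      @top-center {{ content: element(doc-header); }}
      @bottom-center {{ content: element(doc-footer); }}
    }}
    /* Running elements leave the normal flow and repeat in the margin boxes. */
    .doc-header {{ position: running(doc-header); }}
    .doc-footer {{ position: running(doc-footer); }}
    /* "pdf-page" is a hypothetical wrapper class, one per html_map entry. */
    .pdf-page {{ break-after: page; }}
    .pdf-page:last-child {{ break-after: auto; }}
    """
    pages = "\n".join(
        f'<section class="pdf-page">{chunk}</section>' for chunk in html_map
    )
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        f"<style>{css}</style></head><body>"
        # Running elements come first so they are available from page 1 onward.
        f'<div class="doc-header">{header_html}</div>'
        f'<div class="doc-footer">{footer_html}</div>'
        f"{pages}</body></html>"
    )
```

With WeasyPrint installed, passing the result to weasyprint.HTML(string=...).write_pdf("out.pdf") should produce the file; the last wrapper drops its break so no empty trailing page is emitted.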

Two pipelines, one fallback

Not every export request comes from the browser. External agents calling our MCP server can issue a "download this document" tool call before a human has ever opened the document in the editor – which means there is no html_map yet, because no browser has done the layout work.

For that case, the server falls back to WeasyPrint's native pagination via @page rules. We had a fallback before, but it had the bug that triggered the rewrite in the first place: a fixed-height clipping rule meant to handle the html_map case was also firing on the fallback path, and any content past the first page was silently truncated. Investigations like that are the reason we now keep two distinct codepaths instead of one clever generalisation.
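A rough sketch of what "two distinct codepaths" means in practice, assuming hypothetical function names and a hypothetical pdf-page wrapper class: each path builds its own stylesheet, so a rule written for browser-paginated input cannot leak into the fallback.

```python
from typing import Optional, Sequence


def render_from_html_map(html_map: Sequence[str], page_size: str, margin_mm: int) -> str:
    """Browser-paginated path: one forced break per map entry."""
    css = (
        f"@page {{ size: {page_size}; margin: {margin_mm}mm; }} "
        ".pdf-page { break-after: page; } "
        ".pdf-page:last-child { break-after: auto; }"
    )
    body = "".join(f'<section class="pdf-page">{chunk}</section>' for chunk in html_map)
    return f"<style>{css}</style>{body}"


def render_native(body_html: str, page_size: str, margin_mm: int) -> str:
    """Fallback path: only @page rules, so WeasyPrint decides where pages break."""
    css = f"@page {{ size: {page_size}; margin: {margin_mm}mm; }}"
    return f"<style>{css}</style><article>{body_html}</article>"


def build_pdf_html(
    html_map: Optional[Sequence[str]],
    body_html: str,  # server-rendered HTML of the document (assumed input)
    page_size: str = "A4",
    margin_mm: int = 20,
) -> str:
    # A missing or empty map means no browser has paginated the document yet.
    if html_map:
        return render_from_html_map(html_map, page_size, margin_mm)
    return render_native(body_html, page_size, margin_mm)
```

The duplication of the @page rule is deliberate in this sketch: sharing it through a common helper is exactly the kind of generalisation that let the clipping rule fire on the wrong path.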

DOCX: Pandoc + a hand-built reference.docx

DOCX deserves special treatment because Word does not speak CSS at all. The naive approach – HTML → some library → .docx – throws away every class on every span and gives you a document that has the right text but none of the styling.

We use Pandoc for the body, python-docx for the Word-specific bits Pandoc can't reach (page header, page footer), and a hand-built reference.docx as a style template.

The reference.docx trick

Pandoc's --reference-doc=… flag lets you supply a Word document whose styles define your "Heading 1", "Normal", "Title", and so on. Pandoc renders the markdown, applies these styles, and the result inherits everything you put into the reference – fonts, colours, line-height, margins, list indentation.
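As a minimal sketch of how such a conversion might be invoked from Python: the function names and file paths below are placeholders, not our actual module layout; only the Pandoc flags themselves are real.

```python
import subprocess
from pathlib import Path
from typing import List


def pandoc_docx_command(
    markdown_path: Path, output_path: Path, reference_docx: Path
) -> List[str]:
    """Build the argv for a markdown -> docx conversion styled by a reference doc."""
    return [
        "pandoc",
        str(markdown_path),
        "--from=markdown",
        "--to=docx",
        f"--reference-doc={reference_docx}",
        f"--output={output_path}",
    ]


def export_docx(markdown_path: Path, output_path: Path, reference_docx: Path) -> None:
    """Run Pandoc; raises CalledProcessError if the conversion fails."""
    cmd = pandoc_docx_command(markdown_path, output_path, reference_docx)
    subprocess.run(cmd, check=True)
```

Keeping the argv construction separate from the subprocess call makes the command easy to log and to assert on without Pandoc installed.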

We generate reference.docx from a small one-shot build script (templates/build_reference_docx.py) that emits Pandoc's default reference and then patches it with python-docx so:

Inline classes: the Lua filter

Pandoc's docx writer drops CSS-style attributes on Span elements. That's fine for plain markdown but a problem for AgentDoc, where every inline style is encoded as a stand-off class on a span: [text]{.color-red}, [text]{.highlight-yellow}, [text]{.font-lora .size-xl}, and so on.

The fix: a small Pandoc Lua filter (templates/inline_styles.lua) that runs over the AST before the docx writer sees it and translates classes into raw OOXML run properties:

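-- COLORS maps a colour name ("red") to an OOXML hex value ("FF0000").
function Span(el)
  for _, cls in ipairs(el.classes) do
    if cls:match("^color%-") then
      local hex = COLORS[cls:sub(7)]  -- strip the "color-" prefix
      if hex then
        -- Emit one complete run: <w:rPr> is only valid inside <w:r>,
        -- next to the text it styles. Nested formatting is flattened here.
        local text = pandoc.utils.stringify(el.content)
          :gsub("&", "&amp;"):gsub("<", "&lt;"):gsub(">", "&gt;")
        return pandoc.RawInline("openxml",
          '<w:r><w:rPr><w:color w:val="' .. hex .. '"/></w:rPr>'
          .. '<w:t xml:space="preserve">' .. text .. '</w:t></w:r>')
      end
    elseif cls:match("^highlight%-") then …
    elseif cls:match("^font%-")      then …
    elseif cls:match("^size%-")      then …
    end
  end
end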

Coverage is exhaustive on purpose:

Combinations stack. A span like [text]{.color-red .size-xl .font-lora} emits a single run with all three properties set in a single <w:rPr>.
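The stacking rule can be illustrated in Python (this is a sketch of the idea, not the Lua filter itself). The lookup tables and their values are hypothetical; the real tables are not shown in this post. Properties are collected first and emitted once, in the order the OOXML schema expects inside <w:rPr>.

```python
from typing import Dict, Iterable

# Hypothetical lookup tables, for illustration only.
COLORS = {"red": "FF0000", "blue": "0000FF"}
HIGHLIGHTS = {"yellow": "yellow", "green": "green"}
FONTS = {"lora": "Lora"}
SIZES_HALF_POINTS = {"sm": 20, "xl": 36}  # w:sz is measured in half-points


def classes_to_rpr(classes: Iterable[str]) -> str:
    """Translate span classes into a single <w:rPr> element ('' if none apply)."""
    props: Dict[str, str] = {}
    for cls in classes:
        prefix, _, name = cls.partition("-")
        if prefix == "font" and name in FONTS:
            face = FONTS[name]
            props["rFonts"] = f'<w:rFonts w:ascii="{face}" w:hAnsi="{face}"/>'
        elif prefix == "color" and name in COLORS:
            props["color"] = f'<w:color w:val="{COLORS[name]}"/>'
        elif prefix == "size" and name in SIZES_HALF_POINTS:
            props["sz"] = f'<w:sz w:val="{SIZES_HALF_POINTS[name]}"/>'
        elif prefix == "highlight" and name in HIGHLIGHTS:
            props["highlight"] = f'<w:highlight w:val="{HIGHLIGHTS[name]}"/>'
    if not props:
        return ""
    # Emit in schema order regardless of the order the classes were written in.
    ordered = [props[k] for k in ("rFonts", "color", "sz", "highlight") if k in props]
    return "<w:rPr>" + "".join(ordered) + "</w:rPr>"
```

Collecting into a dict before emitting is what keeps the output to one <w:rPr> per run, however many classes the span carries.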

Header and footer: the python-docx post-pass

Pandoc's docx writer produces a body but doesn't populate the section header or footer. Our previous workaround – inlining them as italicised lead-in / closing lines – was an ergonomic disaster: any user who exported a real letter saw their header text floating above the recipient's address.

After Pandoc runs, a python-docx post-pass opens the resulting docx, walks section.header / section.footer, and renders the header and footer markdown through a small inline parser. The parser handles bold (**…**), italic (*…*), and the same stand-off span syntax the body uses (color-X, highlight-X, decoration-*) – mapped to Word run colours, highlights, and bold/italic <w:rPr> properties respectively.

The result lives in the actual Word section header, which means it shows up on every page and prints correctly – including on re-paginations Word does itself when the recipient opens the document.
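A simplified sketch of such a post-pass, covering only the bold and italic cases (the stand-off span classes are omitted). The names Run, parse_inline and write_header are made up for this example; the python-docx calls are the library's standard section, paragraph and run API.

```python
import re
from typing import List, NamedTuple


class Run(NamedTuple):
    text: str
    bold: bool = False
    italic: bool = False


# **bold** is tried before *italic* so double asterisks are not split.
_INLINE = re.compile(r"\*\*(.+?)\*\*|\*(.+?)\*")


def parse_inline(markdown: str) -> List[Run]:
    """Split a one-line markdown string into plain, bold, and italic runs."""
    runs: List[Run] = []
    pos = 0
    for m in _INLINE.finditer(markdown):
        if m.start() > pos:
            runs.append(Run(markdown[pos:m.start()]))
        if m.group(1) is not None:
            runs.append(Run(m.group(1), bold=True))
        else:
            runs.append(Run(m.group(2), italic=True))
        pos = m.end()
    if pos < len(markdown):
        runs.append(Run(markdown[pos:]))
    return runs


def write_header(docx_path: str, header_markdown: str) -> None:
    """Write the parsed runs into every section header of an existing docx."""
    from docx import Document  # python-docx, imported lazily

    doc = Document(docx_path)
    for section in doc.sections:
        header = section.header
        header.is_linked_to_previous = False  # give this section its own header
        para = header.paragraphs[0] if header.paragraphs else header.add_paragraph()
        for run in parse_inline(header_markdown):
            r = para.add_run(run.text)
            r.bold = run.bold or None      # None means "inherit from style"
            r.italic = run.italic or None
    doc.save(docx_path)
```

The footer would follow the same pattern through section.footer.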

Workflow gating: T+ gets the new export, T stays bit-stable

DOCX export is, deliberately, a feature only available on Workflow T+. T stays bit-stable because the tool granularity benchmarks ran against its exact tool surface, and we want those CSV files to stay reproducible. T+ is the post-benchmark mutable production track and gets the additions.

This is enforced in two places to make leaks impossible:

What we explicitly didn't do

Two things stayed undone on purpose; both are tradeoffs worth being explicit about:

The integration tests that actually catch regressions

The thing that makes this rewrite stick is a set of manual audit scripts that introspect the rendered output:

backend/tests/manual_pdf_audit.py    # uses pypdf to read text + font metadata
backend/tests/manual_docx_audit.py   # walks the OOXML and checks run properties
backend/tests/manual_format_parity.py # diffs editor classes against Word run props

These are not unit tests; they're tools we run after non-trivial changes to confirm that the exported file actually has the colour we expect on the heading we expect. We added poppler-utils and pypdf to the backend Dockerfile so they work on the production image too, which means we can run them against actual user documents when investigating a complaint.
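To give a flavour of what "walking the OOXML" looks like, here is a standard-library-only sketch that lists each run's text and colour. It is not the contents of manual_docx_audit.py; the function name is invented for the example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def run_colours(docx_bytes: bytes) -> List[Tuple[str, Optional[str]]]:
    """Return (text, colour hex or None) for every run in the document body."""
    # A .docx file is a zip archive; the body lives in word/document.xml.
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    results: List[Tuple[str, Optional[str]]] = []
    for run in root.iter(f"{{{W}}}r"):
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        colour = run.find("w:rPr/w:color", NS)
        value = colour.get(f"{{{W}}}val") if colour is not None else None
        results.append((text, value))
    return results
```

An audit can then assert that the run carrying a given heading's text has the expected hex value, which is the kind of check a visual spot-check tends to miss.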

What we learned worth keeping

Three durable lessons from this rewrite:

The next post in this stream is on the voice-first architecture – including some recent reliability work on Gemini Live session resumption that's also worth a write-up of its own. If there's a particular detail you'd like more depth on, the feedback widget on the bottom-right of every page on this site goes straight to our inbox.
