DOCX Import: Round-Tripping Word Documents Through an AI Editor
Two weeks ago we shipped the DOCX/PDF export rewrite: a document edited in AgentDoc (also: agent doc, agentdocs, docedit) now leaves the editor as a Word file with its fonts, colours, page layout and headers/footers intact. That solved exactly half of the actually-useful workflow. The other half, taking a Word document into the editor without losing structure, is what landed this week.
This post is the companion to the export rewrite: same data model, opposite direction, same insistence on losing nothing the user can see on screen.
The contract
A user uploads contract_v3.docx they've been editing in Word. After import,
opening the same document in the editor should show:
- Headings, paragraphs, lists, tables: at the right level, in the right order.
- Inline formatting: bold, italic, underline, strikethrough, sub/superscript.
- Colours and highlights: the actual editor-palette colours, not whatever default Word substituted.
- Fonts and sizes: mapped to the editor's twelve tokens (Inter, Playfair Display, Roboto Mono, etc.).
- Alignment, indentation, line spacing.
- Hyperlinks: clickable, with their original URLs.
- Page breaks where the author put them.
- The page header and footer, with their own formatting preserved.
Exported back to DOCX, the file should be structurally comparable to the original: not bit-identical, since Word writes a lot of incidental XML, but visually indistinguishable for a reader.
Why parsing DOCX is not as friendly as parsing HTML
DOCX is a ZIP containing XML. The schema is OOXML (ECMA-376) and python-docx wraps it in a nice paragraph/run object model. The trouble is that most of what makes a real Word document interesting lives outside the friendly object model:
- Page breaks are <w:br w:type="page"/> elements embedded inside runs. They do appear on paragraph.runs, but as a side-effect a \n creeps into the run text: count it once and you have a stray newline.
- Hyperlinks live in <w:hyperlink> elements that are siblings of <w:r> elements inside the paragraph. paragraph.runs skips them entirely: iterate runs only and the link text vanishes (or stays, but the URL is lost).
- Section properties (page size, headers, footers) hang off sections[*].header / .footer with cascading inheritance flags (is_linked_to_previous) that the friendly API silently resolves.
- Line spacing is a float multiplier on paragraph_format.line_spacing; mapping it back to the editor's discrete tokens (tight / normal / relaxed / loose / double) requires a snap-to-nearest pass.
If your DOCX import only uses paragraph.runs, your document quietly loses
every hyperlink and every page break the moment it touches your pipeline. Both round-trip
as plain text or vanish entirely. We hit both bugs on the first integration run.
The structure: import_docx_bytes returns six keys
Before this work, our import path returned a 2-tuple (markdown_body,
formatting_array): body content plus the stand-off formatting offsets. That was
fine for body-only documents but lost any authored header / footer.
We changed the return type to a six-key dict that mirrors the editor's storage model:
{
    "body_md": "# Heading\n\nFirst paragraph...",
    "body_formatting": [{"start": 0, "end": 9, "classes": "font-playfair"}, ...],
    "header_md": "Confidential - Q3 2026",
    "header_formatting": [{"start": 0, "end": 11, "classes": "decoration-bold"}],
    "footer_md": "Page",
    "footer_formatting": [],
}
Each *_md uses its own zero-based index space; header / footer offsets don't
share with the body. The three walks share one helper (walk_container) that
resets md_parts, formatting and the cursor on entry, so they
don't bleed offsets into each other.
We take sections[0] only (the editor enforces one header / one footer per
document) and honour is_linked_to_previous: when a section inherits from a
previous one (the default for sections[0]) it has no authored header / footer and we
contribute empty strings. The whole header / footer block is wrapped in try / except, so a
malformed sectPr degrades to "no header / footer" instead of failing the entire
import.
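The independent index spaces can be sketched as follows. This is a minimal illustration, not the production walk_container: the input shape (blocks of (text, classes) runs), the Span type and the safe_walk wrapper are assumptions made for the example, and markdown prefixes such as "# " are left out to keep the offset arithmetic visible.

```python
from typing import Callable, TypedDict


class Span(TypedDict):
    start: int
    end: int
    classes: str


# Hypothetical input shape: a container is a list of blocks,
# each block a list of (text, classes) runs.
Blocks = list[list[tuple[str, str]]]


def walk_container(blocks: Blocks) -> tuple[str, list[Span]]:
    """Flatten one container into text plus stand-off formatting spans.

    All state is local, so every call starts again at offset 0 and
    body / header / footer never share an index space.
    """
    md_parts: list[str] = []
    formatting: list[Span] = []
    cursor = 0
    for i, runs in enumerate(blocks):
        if i > 0:  # blank line between blocks
            md_parts.append("\n\n")
            cursor += 2
        for text, classes in runs:
            if text and classes:
                formatting.append(
                    {"start": cursor, "end": cursor + len(text), "classes": classes}
                )
            md_parts.append(text)
            cursor += len(text)
    return "".join(md_parts), formatting


def safe_walk(load_blocks: Callable[[], Blocks]) -> tuple[str, list[Span]]:
    """Degrade to an empty container if reading it fails (e.g. a bad sectPr)."""
    try:
        return walk_container(load_blocks())
    except Exception:
        return "", []
```

Calling walk_container once each for body, header and footer gives three results whose offsets all start at zero, and safe_walk mirrors the try / except behaviour for the header / footer path.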
Page breaks: walking the raw XML
python-docx exposes a run's text as run.text, a concatenation of its
<w:t> children with <w:br> collapsed to \n.
That gives us text but loses the distinction between a soft break and a hard page break.
Fix: walk paragraph._element.iter(qn("w:br")) on every paragraph and check each
break's w:type attribute. When w:type == "page", emit our
[PAGE BREAK]\n\n marker before the paragraph content, so a
page-break-before-heading in Word survives as [PAGE BREAK]\n\n# Heading in our
markdown. If the paragraph contains nothing but the break, skip the empty trailing block to
avoid stray double-newlines, and strip the page-break-induced \n from the run's
text so the marker isn't double-counted.
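As a rough sketch of that walk, the example below uses only the standard library's ElementTree on a raw <w:p> element instead of python-docx, so it can run on its own. The function name paragraph_to_markdown is hypothetical, and every non-page break (including column breaks) is treated as a soft line break, which is a simplification.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"
PAGE_BREAK_MARKER = "[PAGE BREAK]\n\n"


def paragraph_to_markdown(p: ET.Element) -> str:
    """Return the paragraph text, prefixed by the marker if it holds a page break."""
    has_page_break = False
    parts: list[str] = []
    for node in p.iter():  # document order, all descendants
        if node.tag == W + "t":
            parts.append(node.text or "")
        elif node.tag == W + "tab":
            parts.append("\t")
        elif node.tag == W + "br":
            if node.get(W + "type") == "page":
                # Hard page break: remember it, but add no "\n" to the text,
                # so the break is not counted twice.
                has_page_break = True
            else:
                parts.append("\n")  # treated as a soft line break in this sketch
    text = "".join(parts)
    return PAGE_BREAK_MARKER + text if has_page_break else text


xml = (
    f'<w:p xmlns:w="{W_NS}"><w:r><w:br w:type="page"/>'
    "<w:t>Heading</w:t></w:r></w:p>"
)
print(repr(paragraph_to_markdown(ET.fromstring(xml))))
```

A paragraph that contains nothing but the break comes back as the bare marker, which lets the caller skip the otherwise empty block.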
Hyperlinks: iterate paragraph XML children directly
For hyperlinks, the trick is to give up on paragraph.runs entirely and iterate
the paragraph's XML children, dispatching on tag:
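from lxml import etree  # python-docx elements are lxml elements

for child in paragraph._element:
    # Dispatch on the local tag name, ignoring the w: namespace prefix.
    tag = etree.QName(child).localname
    if tag == "r":
        emit_run(child)  # plain run
    elif tag == "hyperlink":
        emit_hyperlink(child)  # wraps one or more runs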
emit_hyperlink resolves r:id against the paragraph's relationship
table to get the external URL, with a fallback to #anchor for internal-link
hyperlinks (TOC entries pointing at heading bookmarks) so the structure survives even when
we can't resolve the bookmark. We emit native markdown [text](url), so the
renderer's existing markdown-link path produces a real <a href> without
a separate post-pass.
Internal styling inside the hyperlink (bold link text, coloured link text) goes through the
same emit_run pipeline as plain runs, so a bold blue link stays a bold blue
link in the editor.
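The resolution order (relationship id first, then anchor, then plain text) might look like the sketch below. It is written against the standard library's ElementTree so it runs without python-docx; the relationship table is modelled as a plain dict, and the names paragraph_with_links and hyperlink_to_markdown are hypothetical. Run-level styling is omitted.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
W = f"{{{W_NS}}}"
R = f"{{{R_NS}}}"


def run_text(r: ET.Element) -> str:
    """Concatenate the <w:t> children of one run."""
    return "".join(t.text or "" for t in r.iter(W + "t"))


def hyperlink_to_markdown(h: ET.Element, rels: dict[str, str]) -> str:
    """Emit [text](url), falling back to #anchor, then to bare text."""
    text = "".join(run_text(r) for r in h.iter(W + "r"))
    r_id = h.get(R + "id")
    anchor = h.get(W + "anchor")
    if r_id and r_id in rels:
        return f"[{text}]({rels[r_id]})"  # external URL from the relationship table
    if anchor:
        return f"[{text}](#{anchor})"  # internal link, e.g. a TOC entry
    return text  # unresolvable: keep the visible text at least


def paragraph_with_links(p: ET.Element, rels: dict[str, str]) -> str:
    """Walk direct children so <w:hyperlink> siblings of <w:r> are not skipped."""
    parts: list[str] = []
    for child in p:
        if child.tag == W + "r":
            parts.append(run_text(child))
        elif child.tag == W + "hyperlink":
            parts.append(hyperlink_to_markdown(child, rels))
    return "".join(parts)
```

Falling back to the bare text when neither an id nor an anchor resolves is a choice made for this sketch; the point is that the visible link text never disappears.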
Line spacing: snap-to-nearest, both directions
The editor doesn't accept arbitrary line-spacing floats: it has five tokens
(tight / normal / relaxed / loose / double) that map to 1.2 / 1.6 / 2.0
/ 2.5 / 3.0 respectively. Going from the editor to DOCX, the mapping is direct.
Coming back is fuzzier: a Word document authored with 1.5x line spacing (typical for body
copy) should round-trip to linespacing-normal, not get rejected.
Implementation is the inverse of the export-side _nearest_size_token: a
snap-to-nearest pass against the same five reference values. The result is emitted as a
paragraph-level {: .linespacing-X } block-attribute alongside alignment and
indent.
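A minimal sketch of the snap, using the five reference values quoted above. The function and constant names are hypothetical, and the sketch assumes the input is already a float multiplier; python-docx can also report None (inherited) or a fixed length for line_spacing, which a real importer would have to handle first.

```python
# Reference multipliers for the editor's five line-spacing tokens.
LINE_SPACING_TOKENS = {
    "tight": 1.2,
    "normal": 1.6,
    "relaxed": 2.0,
    "loose": 2.5,
    "double": 3.0,
}


def nearest_line_spacing_token(multiplier: float) -> str:
    """Return the token whose reference value is closest to the multiplier."""
    return min(
        LINE_SPACING_TOKENS,
        key=lambda name: abs(LINE_SPACING_TOKENS[name] - multiplier),
    )


def line_spacing_attribute(multiplier: float) -> str:
    """Format the paragraph-level block attribute for a given multiplier."""
    return f"{{: .linespacing-{nearest_line_spacing_token(multiplier)} }}"


print(line_spacing_attribute(1.5))  # {: .linespacing-normal }
```

Word's 1.5x lands on normal because 1.6 is 0.1 away while 1.2 is 0.3 away. Values outside the range simply clamp to the nearest end, so single spacing (1.0) becomes tight.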
The roundtrip test
Two-way conversions are easy to break and hard to spot. We added
backend/tests/manual_docx_roundtrip.py, a manual audit script (in the spirit
of the other manual_*.py audits) that does:
doc = build_test_document()
docx_1 = generate_docx_bytes(doc) # Editor -> Word
parsed = import_docx_bytes(docx_1) # Word -> Editor
docx_2 = generate_docx_bytes(parsed) # Editor -> Word again
assert paragraph_count(docx_1) == paragraph_count(docx_2)
assert heading_levels(docx_1) == heading_levels(docx_2)
assert run_properties(docx_1) == run_properties(docx_2)
It's not a pytest test: it runs against the live backend container and inspects the actual OOXML of the two generated DOCX files. The output is human-readable diffs of what changed across the two passes. We add a new assertion every time a regression is discovered, so the next time the same edge case appears, the audit fails loudly instead of getting silently lost in a complex document.
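For illustration, two of the structural helpers named in the audit could be written against the raw package with only the standard library, as below. These are hypothetical stand-ins, not the script's actual implementations, and they assume headings carry the built-in style ids Heading1 to Heading9.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _paragraphs(docx_bytes: bytes) -> list[ET.Element]:
    """Read word/document.xml out of the ZIP and return every <w:p>."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    return list(root.iter(W + "p"))


def paragraph_count(docx_bytes: bytes) -> int:
    return len(_paragraphs(docx_bytes))


def heading_levels(docx_bytes: bytes) -> list[int]:
    """Heading levels in document order, read from <w:pStyle w:val="HeadingN"/>."""
    levels: list[int] = []
    for p in _paragraphs(docx_bytes):
        style = p.find(f"{W}pPr/{W}pStyle")
        style_id = style.get(W + "val", "") if style is not None else ""
        match = re.fullmatch(r"Heading(\d)", style_id)  # assumed style-id scheme
        if match:
            levels.append(int(match.group(1)))
    return levels
```

Comparing these summaries rather than raw bytes keeps the audit insensitive to the incidental XML that differs between two otherwise equivalent files.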
What is still imperfect
- Multiple sections. A document with section breaks (different headers on chapter 1 vs chapter 2) gets flattened to the first section's chrome. The editor's data model only supports one header / one footer per document, and changing that is a much bigger surgery than the import path alone.
- Comments and tracked changes. Both are present in the OOXML but we drop them on the floor today. The editor has no UI for either, so importing them would just mean discarding them at the next save anyway.
- Images. We do extract image references but only as relinked placeholders. Pulling the actual image bytes out of the DOCX media folder, persisting them, and rewriting the image reference in the markdown is the next pass.
- Custom styles. A document that uses a custom Word style ("Body Indented Quote") that isn't one of the editor's known styles gets the closest match from our style table. Authoritative round-trip of arbitrary custom styles would require carrying their definitions through the editor's data model, which we don't.
What this changes for users
The headline workflow is now symmetric. You can:
- Take an existing Word document: a draft letter, a research paper, a contract.
- Upload it to AgentDoc (a single click on the "Import .docx" button in the sidebar, or a single POST /api/docs/import/docx for autonomous agents; see the agent docs).
- Edit it by speaking or by chat, with the agent making structured edits while the page layout stays exactly as you authored it.
- Export back to Word, and your collaborator opens the file in the same Word they started with, with the same fonts, the same colours, the same page geometry.
For the agent-side workflow specifically, this also closes the loop where an autonomous
MCP client could only build documents from scratch. With import_docx_bytes
wired up, an agent can ingest a templated DOCX (e.g. a corporate letterhead with
pre-filled fields), drive edits via the MCP tool surface, and export the result:
exactly the kind of "fill out this form" use case where re-typing from scratch is the
bottleneck.
Companion reads
- Rebuilding PDF + DOCX Export: the mirror of this post on the way out of the editor.
- Tool Granularity in LLM Agents: the tool-design framework that drives the MCP surface autonomous agents use against imported documents.