DOCX Import: Round-Tripping Word Documents Through an AI Editor

Engineering May 9, 2026 · 9 min read

Two weeks ago we shipped the DOCX/PDF export rewrite: a document edited in AgentDoc now leaves the editor as a Word file with its fonts, colours, page layout and headers/footers intact. That solved exactly half of the useful workflow. The other half, taking a Word document into the editor without losing structure, is what landed this week.

This post is the companion to the export rewrite: same data model, opposite direction, same insistence on losing nothing the user can see on screen.

The contract

A user uploads contract_v3.docx, a file they have been editing in Word. After import, opening the same document in the editor should show:

Headings, paragraphs, lists, tables — at the right level, in the right order.
Inline formatting — bold, italic, underline, strikethrough, sub/superscript.
Colours and highlights — the actual editor-palette colours, not whatever default Word substituted.
Fonts and sizes — mapped to the editor's twelve tokens (Inter, Playfair Display, Roboto Mono, etc.).
Alignment, indentation, line spacing.
Hyperlinks — clickable, with their original URLs.
Page breaks where the author put them.
The page header and footer, with their own formatting preserved.

Exported back to DOCX, the file should be structurally equivalent to the original: not byte-identical, because Word writes a lot of incidental XML, but visually indistinguishable for a reader.

Why parsing DOCX is not as friendly as parsing HTML

DOCX is a ZIP containing XML. The schema is OOXML (ECMA-376) and python-docx wraps it in a nice paragraph/run object model. The trouble is that most of what makes a real Word document interesting lives outside the friendly object model:

Page breaks are <w:br w:type="page"/> elements embedded inside runs. They do appear on paragraph.runs, but as a side-effect a \n creeps into the run text — count it once and you have a stray newline.
Hyperlinks live in <w:hyperlink> elements that are siblings of <w:r> elements inside the paragraph. paragraph.runs skips them entirely — iterate runs only and the link text vanishes (or stays, but the URL is lost).
Section properties (page size, headers, footers) hang off sections[*].header / .footer with cascading inheritance flags (is_linked_to_previous) that the friendly API silently resolves.
Line spacing is a float multiplier on paragraph_format.line_spacing; mapping it back to the editor's discrete tokens (tight / normal / relaxed / loose / double) requires a snap-to-nearest pass.

If your DOCX import only uses paragraph.runs, your document quietly loses every hyperlink and every page break the moment it touches your pipeline. Both round-trip as plain text or vanish entirely. We hit both bugs on the first integration run.

The structure: import_docx_bytes returns six keys

Before this work, our import path returned a 2-tuple (markdown_body, formatting_array) — body content plus the stand-off formatting offsets. That was fine for body-only documents but lost any authored header / footer.

We changed the return type to a six-key dict that mirrors the editor's storage model:

{
  "body_md":           "# Heading\n\nFirst paragraph...",
  "body_formatting":   [{"start": 0, "end": 9, "classes": "font-playfair"}, …],
  "header_md":         "Confidential — Q3 2026",
  "header_formatting": [{"start": 0, "end": 11, "classes": "decoration-bold"}],
  "footer_md":         "Page",
  "footer_formatting": [],
}

Each *_md uses its own zero-based index space; header / footer offsets don't share with the body. The three walks share one helper (walk_container) that resets md_parts, formatting and the cursor on entry, so they don't bleed offsets into each other.
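A minimal sketch of how one shared helper can keep the three offset spaces separate. Only the name walk_container and the six keys come from the description above; the input shape (each paragraph as a list of (text, classes) runs), the helper's body and build_result are assumptions made for illustration.

```python
from typing import Optional

# Hypothetical input shape: a paragraph is a list of (text, classes) runs,
# with classes set to None for unformatted text.
Run = tuple[str, Optional[str]]


def walk_container(paragraphs: list[list[Run]]) -> tuple[str, list[dict]]:
    """Return (markdown, formatting) with offsets local to this container."""
    md_parts: list[str] = []       # fresh state on every call
    formatting: list[dict] = []
    cursor = 0                     # zero-based offset into this container only
    for index, runs in enumerate(paragraphs):
        if index > 0:
            md_parts.append("\n\n")
            cursor += 2
        for text, classes in runs:
            if classes and text:
                formatting.append(
                    {"start": cursor, "end": cursor + len(text), "classes": classes}
                )
            md_parts.append(text)
            cursor += len(text)
    return "".join(md_parts), formatting


def build_result(body, header, footer) -> dict:
    """Assemble the six-key dict from three independent walks."""
    result = {}
    for name, paragraphs in (("body", body), ("header", header), ("footer", footer)):
        md, fmt = walk_container(paragraphs)
        result[f"{name}_md"] = md
        result[f"{name}_formatting"] = fmt
    return result
```

Because the cursor is a local variable rather than shared state, a header offset of 0 always refers to the first character of the header text, regardless of how long the body is.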

We take sections[0] only (the editor enforces one header / one footer per document) and honour is_linked_to_previous: when the flag is set, the section has no header / footer definition of its own (for sections[0], which has nothing to inherit from, that is the default and means none was authored), so we contribute empty strings. The whole header / footer block is wrapped in try / except, so a malformed sectPr degrades to "no header / footer" instead of failing the entire import.
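The first-section rule and the defensive wrapper could look roughly like the sketch below. It relies only on attributes python-docx section objects expose (header, footer, is_linked_to_previous, paragraphs, text); the function name read_chrome is hypothetical, formatting extraction is left out, and the SimpleNamespace objects are stand-ins so the example runs without a real document.

```python
from types import SimpleNamespace


def read_chrome(sections) -> dict:
    """Return header/footer text from the first section, or empty strings."""
    empty = {"header_md": "", "footer_md": ""}
    chrome = dict(empty)
    try:
        first = sections[0]
        for key, part in (("header_md", first.header), ("footer_md", first.footer)):
            # A linked header/footer has no definition of its own to import.
            if part.is_linked_to_previous:
                continue
            chrome[key] = "\n\n".join(p.text for p in part.paragraphs if p.text)
    except Exception:
        # Malformed section properties degrade to "no header / footer".
        return empty
    return chrome


# Stand-ins for python-docx objects: an authored header and a linked footer.
authored = SimpleNamespace(
    is_linked_to_previous=False,
    paragraphs=[SimpleNamespace(text="Confidential")],
)
linked = SimpleNamespace(is_linked_to_previous=True, paragraphs=[])
example = read_chrome([SimpleNamespace(header=authored, footer=linked)])
```

Catching a broad Exception is deliberate in this sketch: the goal is that no failure in header / footer parsing can abort the body import.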

Page breaks: walking the raw XML

python-docx exposes a run's text as run.text — a concatenation of its <w:t> children with <w:br> collapsed to \n. That gives us text but loses the distinction between a soft break and a hard page break.

Fix: walk paragraph._element.iter(qn("w:br")) on every paragraph and check each break's w:type attribute. When w:type == "page", emit our [PAGE BREAK]\n\n marker before the paragraph content, so a page-break-before-heading in Word survives as [PAGE BREAK]\n\n# Heading in our markdown. If the paragraph contains nothing but the break, skip the empty trailing block to avoid stray double-newlines, and strip the page-break-induced \n from the run's text so the marker isn't double-counted.
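As an illustration of the same idea, here is a sketch that uses the standard library's ElementTree on a raw <w:p> fragment instead of python-docx's lxml element. The function name paragraph_to_md is hypothetical, and treating every non-page break as a soft line break is a simplifying assumption.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
PAGE_BREAK = "[PAGE BREAK]\n\n"


def paragraph_to_md(p_xml: str) -> str:
    """Convert one <w:p> fragment to text, hoisting hard page breaks to a marker."""
    paragraph = ET.fromstring(p_xml)
    has_page_break = False
    parts = []
    for node in paragraph.iter():          # document order, all descendants
        if node.tag == f"{{{W}}}t":
            parts.append(node.text or "")
        elif node.tag == f"{{{W}}}br":
            if node.get(f"{{{W}}}type") == "page":
                has_page_break = True      # hard break: adds no newline to the text
            else:
                parts.append("\n")         # anything else is kept as a soft break
    # The marker goes before the content; a paragraph that holds only the
    # break yields just the marker, with no empty trailing block.
    return (PAGE_BREAK if has_page_break else "") + "".join(parts)
```

Because the page break never contributes a newline to the collected text, the marker is counted exactly once.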

Hyperlinks: iterate paragraph XML children directly

For hyperlinks, the trick is to give up on paragraph.runs entirely and iterate the paragraph's XML children, dispatching on tag:

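from lxml import etree

for child in paragraph._element:
    # Dispatch on the tag's local name, ignoring the w: namespace.
    tag = etree.QName(child).localname
    if tag == "r":
        emit_run(child)          # plain run
    elif tag == "hyperlink":
        emit_hyperlink(child)    # wrapper around one or more runs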

emit_hyperlink resolves r:id against the relationship table of the part that contains the paragraph (relationships are stored per part, not per paragraph) to get the external URL. For internal-link hyperlinks (TOC entries pointing at heading bookmarks) it falls back to #anchor, so the structure survives even when we can't resolve the bookmark. We emit native markdown [text](url), and the renderer's existing markdown-link path produces a real <a href> without a separate post-pass.

Internal styling inside the hyperlink (bold link text, coloured link text) goes through the same emit_run pipeline as plain runs, so a bold blue link stays a bold blue link in the editor.
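The child-dispatch and URL resolution described in this section can be sketched with the standard library alone. Everything here is illustrative: the function names are hypothetical, rels is assumed to be a plain dict from relationship id to target URL (in python-docx the same information lives on the containing part's relationships), and inline styling is omitted.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def run_text(node: ET.Element) -> str:
    """Concatenate the <w:t> text under a run or hyperlink element."""
    return "".join(t.text or "" for t in node.iter(f"{{{W}}}t"))


def paragraph_to_md(paragraph: ET.Element, rels: dict[str, str]) -> str:
    """Walk direct children so <w:hyperlink> siblings of <w:r> are not skipped."""
    parts = []
    for child in paragraph:
        if child.tag == f"{{{W}}}r":
            parts.append(run_text(child))
        elif child.tag == f"{{{W}}}hyperlink":
            rel_id = child.get(f"{{{R}}}id")
            anchor = child.get(f"{{{W}}}anchor")
            if rel_id in rels:
                url = rels[rel_id]         # external link
            elif anchor:
                url = f"#{anchor}"         # internal bookmark link
            else:
                url = ""                   # unresolvable: keep the text only
            text = run_text(child)
            parts.append(f"[{text}]({url})" if url else text)
    return "".join(parts)
```

A hyperlink with neither a resolvable relationship nor an anchor keeps its visible text, so nothing the reader could see is dropped.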

Line spacing: snap-to-nearest, both directions

The editor doesn't accept arbitrary line-spacing floats — it has five tokens (tight / normal / relaxed / loose / double) that map to 1.2 / 1.6 / 2.0 / 2.5 / 3.0 respectively. Going from the editor to DOCX, the mapping is direct. Coming back is fuzzier: a Word document authored with 1.5x line spacing (typical for body copy) should round-trip to linespacing-normal, not get rejected.

Implementation is the inverse of the export-side _nearest_size_token: a snap-to-nearest pass against the same five reference values. The result is emitted as a paragraph-level {: .linespacing-X } block-attribute alongside alignment and indent.
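A minimal sketch of the snap-to-nearest pass, using the five reference values listed above. The function and constant names are hypothetical, and the sketch handles float multipliers only; python-docx can also report line_spacing as a fixed length or None, which a real importer would need to handle separately.

```python
LINE_SPACING_TOKENS = {
    "tight": 1.2,
    "normal": 1.6,
    "relaxed": 2.0,
    "loose": 2.5,
    "double": 3.0,
}


def nearest_line_spacing_token(multiplier: float) -> str:
    """Return the token whose reference value is closest to the multiplier."""
    return min(
        LINE_SPACING_TOKENS,
        key=lambda name: abs(LINE_SPACING_TOKENS[name] - multiplier),
    )


def line_spacing_attribute(multiplier: float) -> str:
    """Format the paragraph-level block attribute for a Word multiplier."""
    return f"{{: .linespacing-{nearest_line_spacing_token(multiplier)} }}"
```

With these values, Word's common 1.5x spacing lands on normal (0.1 away from 1.6, against 0.3 from 1.2), and anything above 3.0 clamps to double.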

The roundtrip test

Two-way conversions are easy to break and hard to spot. We added backend/tests/manual_docx_roundtrip.py — a manual audit script (in the spirit of the other manual_*.py audits) that does:

doc = build_test_document()
docx_1 = generate_docx_bytes(doc)        # Editor -> Word
parsed = import_docx_bytes(docx_1)       # Word -> Editor
docx_2 = generate_docx_bytes(parsed)     # Editor -> Word again

assert paragraph_count(docx_1) == paragraph_count(docx_2)
assert heading_levels(docx_1) == heading_levels(docx_2)
assert run_properties(docx_1) == run_properties(docx_2)

It's not a pytest test: it runs against the live backend container and inspects the actual OOXML of the two generated DOCX files. The output is a human-readable diff of what changed across the two passes. We add a new assertion every time a regression is discovered, so the next time the same edge case appears, the audit fails loudly instead of getting silently lost in a complex document.
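For readers who want to try a similar audit, two of the structural probes could be written against the raw ZIP with the standard library, as sketched below. These are not the audit script's actual helpers: the bodies are assumptions, and the heading detection assumes Word's default English style ids (Heading1, Heading2, and so on).

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def _paragraphs(docx_bytes: bytes) -> list[ET.Element]:
    """Read word/document.xml out of the DOCX ZIP and return its <w:p> elements."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return list(root.iter(f"{{{W}}}p"))


def paragraph_count(docx_bytes: bytes) -> int:
    return len(_paragraphs(docx_bytes))


def heading_levels(docx_bytes: bytes) -> list[int]:
    """Heading levels in document order, read from style ids like 'Heading2'."""
    levels = []
    for paragraph in _paragraphs(docx_bytes):
        style = paragraph.find(f"{{{W}}}pPr/{{{W}}}pStyle")
        if style is None:
            continue
        match = re.fullmatch(r"Heading(\d)", style.get(f"{{{W}}}val", ""))
        if match:
            levels.append(int(match.group(1)))
    return levels
```

Comparing these values for docx_1 and docx_2 checks structure rather than bytes, which is the level at which a round trip is expected to be stable.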

What is still imperfect

Multiple sections. A document with section breaks (different headers on chapter 1 vs chapter 2) gets flattened to the first section's chrome. The editor's data model only supports one header / one footer per document, and changing that is a much bigger surgery than the import path alone.
Comments and tracked changes. Both are present in the OOXML but we drop them on the floor today. The editor has no UI for either, so importing them would just mean discarding them at the next save anyway.
Images. We do extract image references but only as relinked placeholders. Pulling the actual image bytes out of the DOCX media folder, persisting them, and rewriting the image reference in the markdown is the next pass.
Custom styles. A document that uses a custom Word style ("Body Indented Quote") that isn't one of the editor's known styles gets the closest match from our style table. Authoritative round-trip of arbitrary custom styles would require carrying their definitions through the editor's data model, which we don't do.

What this changes for users

The headline workflow is now symmetric. You can:

Take an existing Word document — a draft letter, a research paper, a contract.
Upload it to AgentDoc (a single click on the "Import .docx" button in the sidebar, or a single POST /api/docs/import/docx for autonomous agents — see the agent docs).
Edit it by speaking or by chat, with the agent making structured edits while the page layout stays exactly as you authored it.
Export back to Word, and your collaborator opens the file in the same Word they started with, with the same fonts, the same colours, the same page geometry.

For the agent-side workflow specifically, this also closes the loop where an autonomous MCP client could only build documents from scratch. With import_docx_bytes wired up, an agent can ingest a templated DOCX (e.g. a corporate letterhead with pre-filled fields), drive edits via the MCP tool surface, and export the result — exactly the kind of "fill out this form" use case where re-typing from scratch is the bottleneck.

Companion reads

Rebuilding PDF + DOCX Export — the mirror of this post on the way out of the editor.
Tool Granularity in LLM Agents — the tool-design framework that drives the MCP surface autonomous agents use against imported documents.
