DOCX Import: Round-Tripping Word Documents Through an AI Editor

Engineering · 9 min read

Two weeks ago we shipped the DOCX/PDF export rewrite: a document edited in AgentDoc (also: agent doc, agentdocs, docedit) now leaves the editor as a Word file with its fonts, colours, page layout and headers/footers intact. That solved exactly half of the actually useful workflow. The other half, taking a Word document into the editor without losing structure, is what landed this week.

This post is the companion to the export rewrite: same data model, opposite direction, same insistence on losing nothing the user can see on screen.

The contract

A user uploads contract_v3.docx they've been editing in Word. After import, opening the same document in the editor should show:

Exported back to DOCX, the same file should be structurally comparable to the original: not bit-identical, because Word writes a lot of incidental XML, but visually indistinguishable to a reader.

Why parsing DOCX is not as friendly as parsing HTML

DOCX is a ZIP containing XML. The schema is OOXML (ECMA-376) and python-docx wraps it in a nice paragraph/run object model. The trouble is that most of what makes a real Word document interesting lives outside the friendly object model:

If your DOCX import only uses paragraph.runs, your document quietly loses every hyperlink and every page break the moment it touches your pipeline. Both round-trip as plain text or vanish entirely. We hit both bugs on the first integration run.

The structure: import_docx_bytes returns six keys

Before this work, our import path returned a 2-tuple (markdown_body, formatting_array): body content plus the stand-off formatting offsets. That was fine for body-only documents but lost any authored header / footer.

We changed the return type to a six-key dict that mirrors the editor's storage model:

{
  "body_md":           "# Heading\n\nFirst paragraph...",
  "body_formatting":   [{"start": 0, "end": 9, "classes": "font-playfair"}, ...],
  "header_md":         "Confidential \u2014 Q3 2026",
  "header_formatting": [{"start": 0, "end": 11, "classes": "decoration-bold"}],
  "footer_md":         "Page",
  "footer_formatting": [],
}

Each *_md uses its own zero-based index space; header / footer offsets don't share with the body. The three walks share one helper (walk_container) that resets md_parts, formatting and the cursor on entry, so they don't bleed offsets into each other.
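The separate index spaces can be illustrated with a minimal sketch. It assumes a simplified walk_container that takes (text, classes) pairs instead of real python-docx paragraphs, and it treats end offsets as exclusive; both are assumptions for illustration, and import_containers is a hypothetical name, not the production function.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# (text, css classes or None): a simplified stand-in for a Word run
Run = Tuple[str, Optional[str]]


def walk_container(runs: Iterable[Run]) -> Tuple[str, List[dict]]:
    """Walk one container with fresh state, so offsets are local to it."""
    md_parts: List[str] = []
    formatting: List[dict] = []
    cursor = 0  # reset on entry: every container starts counting at zero
    for text, classes in runs:
        if classes:
            formatting.append(
                {"start": cursor, "end": cursor + len(text), "classes": classes}
            )
        md_parts.append(text)
        cursor += len(text)
    return "".join(md_parts), formatting


def import_containers(body, header, footer) -> Dict[str, object]:
    """Run the three walks independently and assemble the six-key dict."""
    result: Dict[str, object] = {}
    for name, runs in (("body", body), ("header", header), ("footer", footer)):
        md, formatting = walk_container(runs)
        result[f"{name}_md"] = md
        result[f"{name}_formatting"] = formatting
    return result
```

Because the cursor is a local variable of walk_container, a formatted run at the start of the header gets start 0 regardless of how long the body was.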

We take sections[0] only (the editor enforces one header / one footer per document) and honour is_linked_to_previous: when a section inherits from a previous one (the default for sections[0]) it has no authored header / footer and we contribute empty strings. The whole header / footer block is wrapped in try / except, so a malformed sectPr degrades to "no header / footer" instead of failing the entire import.
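A rough sketch of that guard follows. It is duck-typed so it runs without python-docx: with the real library the argument would be document.sections[0], whose header and footer objects expose is_linked_to_previous and paragraphs. The function name authored_text and the blank-line join are assumptions for illustration.

```python
from types import SimpleNamespace


def authored_text(section, which: str) -> str:
    """Return the authored header or footer text of a section, or "" if none.

    `which` is "header" or "footer". Any failure degrades to an empty string
    instead of aborting the whole import.
    """
    try:
        part = getattr(section, which)
        if part.is_linked_to_previous:
            return ""  # inherited: nothing was authored on this section
        return "\n\n".join(p.text for p in part.paragraphs if p.text)
    except Exception:
        return ""  # malformed section properties: treat as "no header / footer"


# Stand-in objects that mimic the attributes used above
section = SimpleNamespace(
    header=SimpleNamespace(
        is_linked_to_previous=False,
        paragraphs=[SimpleNamespace(text="Confidential")],
    ),
    footer=SimpleNamespace(is_linked_to_previous=True, paragraphs=[]),
)
header_md = authored_text(section, "header")
footer_md = authored_text(section, "footer")
```

Catching a broad Exception is deliberate in this sketch: the goal stated above is that a broken header never costs the user the body of the document.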

Page breaks: walking the raw XML

python-docx exposes a run's text as run.text, a concatenation of its <w:t> children with <w:br> collapsed to \n. That gives us text but loses the distinction between a soft line break and a hard page break.

Fix: walk paragraph._element.iter(qn("w:br")) on every paragraph and check each break's w:type attribute. When w:type == "page", emit our [PAGE BREAK]\n\n marker before the paragraph content, so a page-break-before-heading in Word survives as [PAGE BREAK]\n\n# Heading in our markdown. If the paragraph contains nothing but the break, skip the empty trailing block to avoid stray double-newlines, and strip the page-break-induced \n from the run's text so the marker isn't double-counted.
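The idea can be sketched with only the standard library by parsing a paragraph's XML directly. This is an illustration under assumptions, not the production walker: paragraph_to_md is a hypothetical name, the input is a raw <w:p> string rather than a python-docx paragraph, and any break that is not a page break (including column breaks) is treated as a soft line break.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W_T = f"{{{W}}}t"
W_BR = f"{{{W}}}br"
W_TYPE = f"{{{W}}}type"
PAGE_BREAK = "[PAGE BREAK]\n\n"


def paragraph_to_md(p_xml: str) -> str:
    """Convert one <w:p> element to text, hoisting page breaks to a marker."""
    paragraph = ET.fromstring(p_xml)
    parts = []
    has_page_break = False
    for node in paragraph.iter():  # document order
        if node.tag == W_T:
            parts.append(node.text or "")
        elif node.tag == W_BR:
            if node.get(W_TYPE) == "page":
                has_page_break = True  # becomes the marker, adds no "\n"
            else:
                parts.append("\n")  # soft break stays inline
    text = "".join(parts)
    # A break-only paragraph yields just the marker, with no empty block after it
    return (PAGE_BREAK if has_page_break else "") + text


NS = f'xmlns:w="{W}"'
heading = f'<w:p {NS}><w:r><w:br w:type="page"/><w:t>Heading</w:t></w:r></w:p>'
soft = f"<w:p {NS}><w:r><w:t>a</w:t><w:br/><w:t>b</w:t></w:r></w:p>"
```

Here paragraph_to_md(heading) starts with the marker followed by the paragraph text, while paragraph_to_md(soft) keeps the break as an inline newline.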

Hyperlinks: iterate paragraph XML children directly

For hyperlinks, the trick is to give up on paragraph.runs entirely and iterate the paragraph's XML children, dispatching on tag:

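from lxml import etree  # python-docx elements are lxml elements

for child in paragraph._element:
    # Dispatch on the local tag name, ignoring the w: namespace
    tag = etree.QName(child).localname
    if tag == "r":
        emit_run(child)        # plain run outside any link
    elif tag == "hyperlink":
        emit_hyperlink(child)  # wrapper whose child runs carry the link text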

emit_hyperlink resolves r:id against the relationship table of the part that contains the paragraph to get the external URL, with a fallback to #anchor for internal-link hyperlinks (TOC entries pointing at heading bookmarks) so the structure survives even when we can't resolve the bookmark. We emit native markdown [text](url), so the renderer's existing markdown-link path produces a real <a href> without a separate post-pass.

Internal styling inside the hyperlink (bold link text, coloured link text) goes through the same emit_run pipeline as plain runs, so a bold blue link stays a bold blue link in the editor.
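A stdlib-only sketch of the resolution logic, under stated assumptions: the relationship table is modelled as a plain dict from relationship id to URL, run styling is ignored, and the names hyperlink_to_md and paragraph_to_md are hypothetical rather than the real emit_* functions.

```python
import xml.etree.ElementTree as ET
from typing import Dict

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def run_text(element) -> str:
    """Concatenate all <w:t> text below an element."""
    return "".join(t.text or "" for t in element.iter(f"{{{W}}}t"))


def hyperlink_to_md(link, rels: Dict[str, str]) -> str:
    """Render a <w:hyperlink> as markdown, preferring the external target."""
    text = run_text(link)
    r_id = link.get(f"{{{R}}}id")
    anchor = link.get(f"{{{W}}}anchor")
    if r_id in rels:
        return f"[{text}]({rels[r_id]})"  # external URL from the rels table
    if anchor:
        return f"[{text}](#{anchor})"     # internal bookmark fallback
    return text                           # unresolvable: keep the text


def paragraph_to_md(p_xml: str, rels: Dict[str, str]) -> str:
    """Walk the direct children of <w:p>, dispatching on tag."""
    out = []
    for child in ET.fromstring(p_xml):
        if child.tag == f"{{{W}}}r":
            out.append(run_text(child))
        elif child.tag == f"{{{W}}}hyperlink":
            out.append(hyperlink_to_md(child, rels))
    return "".join(out)


NS = f'xmlns:w="{W}" xmlns:r="{R}"'
external = (
    f'<w:p {NS}><w:r><w:t xml:space="preserve">See </w:t></w:r>'
    f'<w:hyperlink r:id="rId5"><w:r><w:t>the spec</w:t></w:r></w:hyperlink></w:p>'
)
internal = (
    f'<w:p {NS}><w:hyperlink w:anchor="_Toc1">'
    f"<w:r><w:t>Intro</w:t></w:r></w:hyperlink></w:p>"
)
rels = {"rId5": "https://example.com/spec"}
```

Iterating the paragraph's direct children is what keeps the link: the runs inside <w:hyperlink> are grandchildren of the paragraph, which is why a walk over top-level runs alone never sees them.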

Line spacing: snap-to-nearest, both directions

The editor doesn't accept arbitrary line-spacing floats: it has five tokens (tight / normal / relaxed / loose / double) that map to 1.2 / 1.6 / 2.0 / 2.5 / 3.0 respectively. Going from the editor to DOCX, the mapping is direct. Coming back is fuzzier: a Word document authored with 1.5x line spacing (typical for body copy) should round-trip to linespacing-normal, not get rejected.

Implementation is the inverse of the export-side _nearest_size_token: a snap-to-nearest pass against the same five reference values. The result is emitted as a paragraph-level {: .linespacing-X } block-attribute alongside alignment and indent.
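A minimal sketch of the snap-to-nearest pass, using the five reference values listed above. The function names are hypothetical, and tie-breaking between two equally distant tokens is not specified in this post, so the sketch simply takes whatever min returns.

```python
# Token -> line-spacing multiplier, as listed in the section above
LINE_SPACING_TOKENS = {
    "tight": 1.2,
    "normal": 1.6,
    "relaxed": 2.0,
    "loose": 2.5,
    "double": 3.0,
}


def nearest_linespacing_token(value: float) -> str:
    """Return the token whose reference value is closest to `value`."""
    return min(
        LINE_SPACING_TOKENS,
        key=lambda name: abs(LINE_SPACING_TOKENS[name] - value),
    )


def linespacing_attr(value: float) -> str:
    """Format the paragraph-level block attribute for a Word spacing value."""
    return "{: .linespacing-%s }" % nearest_linespacing_token(value)


word_default = linespacing_attr(1.5)  # "{: .linespacing-normal }"
```

Values outside the token range clamp naturally: single spacing (1.0) lands on tight, and anything above 3.0 lands on double, so no Word document is rejected for its spacing.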

The roundtrip test

Two-way conversions are easy to break and hard to spot. We added backend/tests/manual_docx_roundtrip.py, a manual audit script (in the spirit of the other manual_*.py audits) that does:

doc = build_test_document()
docx_1 = generate_docx_bytes(doc)        # Editor -> Word
parsed = import_docx_bytes(docx_1)       # Word -> Editor
docx_2 = generate_docx_bytes(parsed)     # Editor -> Word again

assert paragraph_count(docx_1) == paragraph_count(docx_2)
assert heading_levels(docx_1) == heading_levels(docx_2)
assert run_properties(docx_1) == run_properties(docx_2)

It's not a pytest test: it runs against the live backend container and inspects the actual OOXML of the two generated docx files. The output is a set of human-readable diffs showing what changed across the two passes. We add a new assertion every time a regression is discovered, so the next time the same edge case appears, the audit fails loudly instead of getting silently lost in a complex document.
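The comparison helpers are not shown in this post. As a guess at their shape, here is a stdlib-only sketch of paragraph_count and heading_levels that reads word/document.xml straight out of the DOCX ZIP. It assumes heading paragraphs carry style ids beginning with "Heading" (the usual ids in English-language templates), which may not hold for every document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import List

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def _document_root(docx_bytes: bytes):
    """Parse the main document part out of the DOCX container."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return ET.fromstring(archive.read("word/document.xml"))


def paragraph_count(docx_bytes: bytes) -> int:
    """Count every <w:p> in the main document part."""
    return sum(1 for _ in _document_root(docx_bytes).iter(f"{{{W}}}p"))


def heading_levels(docx_bytes: bytes) -> List[str]:
    """List heading style ids in document order, e.g. ["Heading1", "Heading2"]."""
    levels = []
    for paragraph in _document_root(docx_bytes).iter(f"{{{W}}}p"):
        style = paragraph.find(f"{{{W}}}pPr/{{{W}}}pStyle")
        if style is not None:
            style_id = style.get(f"{{{W}}}val", "")
            if style_id.startswith("Heading"):  # assumption about style ids
                levels.append(style_id)
    return levels
```

Comparing these lists across the two exports checks order as well as count, so a heading that silently drops a level or moves position shows up as a mismatch.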

What is still imperfect

What this changes for users

The headline workflow is now symmetric. You can:

  1. Take an existing Word document: a draft letter, a research paper, a contract.
  2. Upload it to AgentDoc (a single click on the "Import .docx" button in the sidebar, or a single POST /api/docs/import/docx for autonomous agents; see the agent docs).
  3. Edit it by speaking or by chat, with the agent making structured edits while the page layout stays exactly as you authored it.
  4. Export back to Word, and your collaborator opens the file in the same Word they started with, with the same fonts, the same colours, the same page geometry.

For the agent-side workflow specifically, this also closes the loop where an autonomous MCP client could only build documents from scratch. With import_docx_bytes wired up, an agent can ingest a templated DOCX (e.g. a corporate letterhead with pre-filled fields), drive edits via the MCP tool surface, and export the result. This is exactly the kind of "fill out this form" use case where re-typing from scratch is the bottleneck.
