Extract docx headers, footers, text, footnotes, endnotes, properties, and images to a Python object.
Note to Users / Contributors

I will be doing very little coding in 2022. I will address "show stopper" bugs in docx2python, and I will accept pull requests if they are complete with
Back to docx2python

For a summary of what's new in docx2python 2, scroll down to New in docx2python Version 2.

The code is an expansion/contraction of python-docx2txt (Copyright (c) 2015 Ankush Shah). The original code is mostly gone, but some of the bones may still be here.

shared features:
additions:
subtractions:
Installation

    pip install docx2python

Use

    from docx2python import docx2python

    # extract docx content
    docx2python('path/to/file.docx')

    # extract docx content, write images to image_directory
    docx2python('path/to/file.docx', 'path/to/image_directory')

    # extract docx content, ignore images
    docx2python('path/to/file.docx', extract_image=False)

    # extract docx content with basic font styles converted to html
    docx2python('path/to/file.docx', html=True)

Note on html feature:
Return Value

The function returns an object with the following attributes:

header - contents of the docx headers in the return format described herein
footer - contents of the docx footers in the return format described herein
body - contents of the docx body in the return format described herein
footnotes - contents of the docx footnotes in the return format described herein
endnotes - contents of the docx endnotes in the return format described herein
document - header + body + footer (read only)
text - all docx text as one string, similar to what you'd get from
properties - docx property names mapped to values (e.g.,
images - image names mapped to images in binary format. Write to filesystem with
docx_reader - a DocxReader (see

Return Format

Some structure will be maintained. Text will be returned in a nested list, with paragraphs always at depth 4 (i.e.,

If your docx has no tables, output.body will appear as one table with all content in one cell:

    [  # document
        [  # table
            [  # row
                [  # cell
                    "Paragraph 1",
                    "Paragraph 2",
                    "-- bulleted list",
                    "-- continuing bulleted list",
                    "1) numbered list",
                    "2) continuing numbered list",
                    " a) sublist",
                    " i) sublist of sublist",
                    "3) keeps track of indention levels",
                    " a) resets sublist counters"
                ]
            ]
        ]
    ]

Table cells will appear as table cells. Text outside tables will appear as table cells.

A docx document can contain tables within tables within tables. Docx2python flattens most of this nesting to make the content easier to navigate.

Working with output

This package provides several documented helper functions in the docx2python.iterators module:

    from docx2python.iterators import enum_cells

    def remove_empty_paragraphs(tables):
        # rebuild each cell without its empty paragraphs (modifies tables in place)
        for (i, j, k), cell in enum_cells(tables):
            tables[i][j][k] = [x for x in cell if x]
    from docx2python.iterators import enum_at_depth

    def html_map(tables) -> str:
        """Create an HTML map of document contents.

        Render this in a browser to visually search for data.

        :tables: value could come from, e.g.,
            * docx_to_text_output.document
            * docx_to_text_output.body

        The nested lists in tables are modified in place.
        """
        # prepend index tuple to each paragraph
        for (i, j, k, l), paragraph in enum_at_depth(tables, 4):
            tables[i][j][k][l] = " ".join([str((i, j, k, l)), paragraph])

        # wrap each paragraph in <pre> tags
        for (i, j, k), cell in enum_at_depth(tables, 3):
            tables[i][j][k] = "".join(f"<pre>{x}</pre>" for x in cell)

        # wrap each cell in <td> tags
        for (i, j), row in enum_at_depth(tables, 2):
            tables[i][j] = "".join(f"<td>{x}</td>" for x in row)

        # wrap each row in <tr> tags
        for (i,), table in enum_at_depth(tables, 1):
            tables[i] = "".join(f"<tr>{x}</tr>" for x in table)

        # wrap each table in <table> tags
        tables = "".join(f'<table border="1">{x}</table>' for x in tables)

        # return one string, matching the declared return type
        return "<html><body>" + tables + "</body></html>"
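Both recipes above depend on walking the nested list down to a fixed depth. For readers who want to experiment without a docx file at hand, here is a minimal plain-Python sketch of that idea. The helper name `walk_at_depth` and the sample data are hypothetical stand-ins, not part of docx2python; the library's own `enum_at_depth` may differ in detail.

```python
from typing import Any, Iterator, List, Tuple


def walk_at_depth(
    nested: List[Any], depth: int
) -> Iterator[Tuple[Tuple[int, ...], Any]]:
    """Yield (index_tuple, item) for every item found `depth` levels down.

    Hypothetical stand-in for illustration; not the docx2python implementation.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    for i, item in enumerate(nested):
        if depth == 1:
            yield (i,), item
        else:
            for sub_index, sub_item in walk_at_depth(item, depth - 1):
                yield (i,) + sub_index, sub_item


# Hand-written sample in the documented shape: tables > rows > cells > paragraphs.
sample_body = [
    [  # table
        [  # row
            ["Paragraph 1", "", "Paragraph 2"],  # cell
            ["-- bulleted list"],  # cell
        ]
    ]
]

# Paragraphs sit at depth 4.
paragraphs = list(walk_at_depth(sample_body, 4))

# Cells sit at depth 3; drop the empty paragraphs from each one.
for (i, j, k), cell in list(walk_at_depth(sample_body, 3)):
    sample_body[i][j][k] = [x for x in cell if x]
```

The index tuple is what makes in-place edits possible: each recipe reads an item at some depth and assigns a replacement back through the same indices.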
See helper functions.

Some fine print about checkboxes:

MS Word has checkboxes that can be checked any time, and others that can only be checked when the form is locked. The previous print as.

New in docx2python Version 2

merge consecutive runs with identical formatting

MS Word will break up text runs arbitrarily, often in the middle of a word.
This makes things like algorithmic search-and-replace problematic. Docx2python does not currently write docx files, but I often use docx templates with placeholders (e.g.,

Docx2python v1 merges such runs together when exporting text. Docx2python v2 will merge such runs in the XML as a pre-processing step. This will allow saving such "repaired" XML later on.

merge consecutive links with identical hrefs

MS Word will break up links, giving each link a different
This is similar to the broken-up runs, but the cause is a little deeper in. Docx2python v1 makes a mess of these.
Docx2python v2 will merge such links together in the XML as a pre-processing step. As above, this will allow saving such "repaired" XML later on.

correctly handle nested paragraphs

MS Word will nest paragraphs
I haven't been able to create such a paragraph, but I've found a few files that have them. Docx2python v1 will omit closing html tags when a new paragraph is opened before the old paragraph is closed.
Docx2python v2 will correctly handle such cases, but this will require substantial internal changes to the way docx2python opens and closes paragraphs.
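
The two merging steps described in this section (consecutive runs with identical formatting, consecutive links with identical hrefs) share one pattern: adjacent items with the same key are collapsed into one. Below is a minimal sketch of that pattern using plain tuples rather than XML elements. The `(text, style)` representation, the function name `merge_consecutive`, and the `#NAME#` placeholder are assumptions made for illustration; docx2python performs this step on the XML itself.

```python
from itertools import groupby
from typing import Hashable, List, Tuple

# (text, formatting key); a hypothetical representation, not docx2python's
Run = Tuple[str, Hashable]


def merge_consecutive(runs: List[Run]) -> List[Run]:
    """Join the text of adjacent runs that share the same formatting key."""
    merged = []
    # groupby only groups neighbours, so non-adjacent runs stay separate
    for style, group in groupby(runs, key=lambda run: run[1]):
        merged.append(("".join(text for text, _ in group), style))
    return merged


# A placeholder that has been split across three runs with identical formatting.
broken = [
    ("Dear ", "normal"),
    ("#NA", "bold"),
    ("M", "bold"),
    ("E#", "bold"),
    (", welcome.", "normal"),
]
repaired = merge_consecutive(broken)
```

After merging, the placeholder sits in a single run, so a plain string search can find it. The same function would merge links if the key were the href instead of the formatting.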
paragraph styles

The internal changes allow for easy access to paragraph styles (e.g.,
export xml

To allow above-described light editing (e.g., search and replace), docx2python v2 will give the user access to
The user can only go so far with this. A docx file is built from folders full of xml files. None of these xml files are self-contained. But search and replace is enough to make document templates (documents with placeholders for data), and that's pretty useful in itself.

expose some intermediate functionality

Navigating through XML is straightforward with

See utilities.py for examples of major new features.
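
As a rough illustration of the template idea mentioned earlier (search and replace in a docx's XML), the sketch below uses only the standard library rather than docx2python's own utilities. It assumes the main text lives in `word/document.xml` inside the docx zip archive, that the file is UTF-8 encoded, and that each placeholder sits inside a single run. The function name `fill_template` and the `#NAME#` placeholder are hypothetical.

```python
import io
import zipfile
from typing import Dict
from xml.sax.saxutils import escape


def fill_template(docx_bytes: bytes, replacements: Dict[str, str]) -> bytes:
    """Return a copy of a docx archive with placeholders replaced in document.xml."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, zipfile.ZipFile(
        out, "w", zipfile.ZIP_DEFLATED
    ) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            # assumption: the main document part is stored under this name
            if item.filename == "word/document.xml":
                xml = data.decode("utf-8")
                for placeholder, value in replacements.items():
                    # escape the value so that &, < and > remain valid XML
                    xml = xml.replace(placeholder, escape(value))
                data = xml.encode("utf-8")
            # every other file in the archive is copied unchanged
            dst.writestr(item, data)
    return out.getvalue()


# Build a tiny stand-in archive (not a complete docx) to exercise the function.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<w:t>Hello, #NAME#</w:t>")

filled = fill_template(buffer.getvalue(), {"#NAME#": "Ann & Bob"})
with zipfile.ZipFile(io.BytesIO(filled)) as archive:
    result_xml = archive.read("word/document.xml").decode("utf-8")
```

A plain string replacement like this only finds placeholders that have not been split across runs, which is why merging runs as a pre-processing step matters for templates.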