DOCX
Dan Tobias (Talk | contribs) (→Other links and references) |
Dan Tobias (Talk | contribs) (→Software) |
Revision as of 12:13, 5 August 2014
DOCX is the Office Open XML (OOXML) word-processing format and has been the default file format for documents created by Microsoft Word since Word 2007. It consists of XML component files stored in a ZIP-based container. It replaced the binary DOC format used in earlier Word versions, and comes in two flavours, 'strict' and 'transitional' (see below).
History
DOCX, along with the other Office Open XML document types PPTX and XLSX, was initially standardized as ECMA-376 in 2006. Three versions of ECMA-376 have been produced; the second version corresponds to ISO/IEC 29500:2008, approved as an ISO/IEC standard in April 2008. Changes to the standard between 2008 and 2012 were primarily corrections for individual defects that were reported as implementation of the standard proceeded and that had to be fixed to ensure functional interoperability with existing applications. They did not introduce new functionality.
Format
High-level structure
Like the other "Open XML" formats, a DOCX file is a ZIP archive containing several compressed files (mostly XML). The distinct file extension hides this from the end user.
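Because the container is an ordinary ZIP archive, its contents can be listed with any ZIP tool. The following is a minimal Python sketch using only the standard library. The helper names `list_package_parts` and `build_demo_package` are hypothetical, and the demo archive merely imitates part names commonly found in DOCX packages (`[Content_Types].xml`, `_rels/.rels`, `word/document.xml`); it is not a valid Word document.

```python
import io
import zipfile


def list_package_parts(docx_bytes: bytes) -> list[str]:
    """Return the names of the files stored inside a DOCX (ZIP) package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return package.namelist()


def build_demo_package() -> bytes:
    """Build a tiny stand-in archive with DOCX-like part names.

    Assumption: the part names are typical of Word output; the part
    contents are placeholders and do NOT form a valid document.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("_rels/.rels", "<Relationships/>")
        package.writestr("word/document.xml", "<document/>")
    return buffer.getvalue()


demo = build_demo_package()
for name in list_package_parts(demo):
    print(name)
```

For a real file, read its bytes first (for example `open("report.docx", "rb").read()`, where the filename is a placeholder) and pass them to `list_package_parts`.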
Strict versus Transitional
The OOXML standard defines two variants of the format: 'strict' and 'transitional'. The transitional form is not fully specified in the standard documentation, as it is closely bound to the specific behaviour of Microsoft Office and the older binary formats. The strict form is fully standardised, but Microsoft have been slow to fully support OOXML-Strict as the default format for Office documents, leading to interoperability problems. See this blog post for a more detailed look at the interoperability issues, and here for some context from 2014 concerning government support for open formats. Some more commentary is here.
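One rough way to tell the two flavours apart is to inspect the XML namespace of the main document part. The Python sketch below rests on several assumptions that should be checked against the specification: that the main part is stored at `word/document.xml` (the conventional location, though the package relationships are what formally identify it), and that the two namespace URIs shown are the ones used by transitional and strict WordprocessingML respectively. The function names `detect_flavour` and `make_demo` are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# Assumed namespace URIs; verify against ECMA-376 / ISO/IEC 29500.
TRANSITIONAL_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
STRICT_NS = "http://purl.oclc.org/ooxml/wordprocessingml/main"


def detect_flavour(docx_bytes: bytes, part_name: str = "word/document.xml") -> str:
    """Guess 'strict', 'transitional' or 'unknown' from the root namespace."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        root = ET.fromstring(package.read(part_name))
    # ElementTree renders qualified tags as "{namespace}localname".
    namespace = root.tag[1:].split("}", 1)[0] if root.tag.startswith("{") else ""
    if namespace == STRICT_NS:
        return "strict"
    if namespace == TRANSITIONAL_NS:
        return "transitional"
    return "unknown"


def make_demo(namespace: str) -> bytes:
    """Build a placeholder package whose document root uses the given namespace."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr(
            "word/document.xml",
            f'<w:document xmlns:w="{namespace}"><w:body/></w:document>',
        )
    return buffer.getvalue()


print(detect_flavour(make_demo(STRICT_NS)))
print(detect_flavour(make_demo(TRANSITIONAL_NS)))
```

This check looks only at the root element, so it is a heuristic rather than a conformance test.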
Specs
- ECMA-376 specification
- ISO publicly available standards, including the latest ISO/IEC 29500 specification (as of November 2012, this is ISO/IEC 29500:2012)
Sample files
Software
- docx2pandoc: translate DOCX to Pandoc output formats
- Textract: extract text from various document formats including DOCX