Tech Brief
Open XML
The new default file format in the 2007 Microsoft Office system.
Developers have witnessed a transition from binary file formats to XML since Microsoft Office 2000. Binary files (.DOC, .XLS and .PPT), which for years did a great job of storing and transporting data, couldn't keep up with the need to move information between disparate line-of-business applications and to gather insight from that data. The Office Open XML file formats, which are the default formats for the 2007 Microsoft Office system (Word, Excel and PowerPoint), address this need by storing documents as XML. The community site OpenXMLDeveloper.org links to software development kits (SDKs) and examples of solutions written for various platforms.
More Style than Substance
Finding a way to describe the actual meaning contained in a document has been a central focus of the XML community for nearly 20 years, going back to XML's predecessor, Standard Generalized Markup Language (SGML).
Traditionally, the way a document was created didn't include information about its actual content. All that was captured was the content's styling -- its size, whether the words were bold or italicized, the font and so on.
Those of us in the XML field have long believed that if we could separate the actual content, or meaning, from the presentation of a document, then users would be able to "tag" parts of their document with labels.
In a resume, for instance, a user could tag the candidate's name, address, career goals, qualifications and so on. In this way, documents of any kind could become a source of information as rich as a database.
Custom Schemas
Open XML supports custom schemas, which means that users can define the structure and the type of content that each data element in a document can contain. The parts of the document (for example, XML files that describe the application data or metadata, and binary files) are packaged in a .ZIP file container.
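The packaging described above can be illustrated with a minimal Python sketch that lists the parts inside a package using only the standard zipfile module. The function names and the tiny in-memory sample package are assumptions made for illustration; a real Office document contains many more parts.

```python
import io
import zipfile


def list_parts(package_bytes):
    """Return the sorted names of all parts stored in a .ZIP-based package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return sorted(package.namelist())


def build_sample_package():
    """Build a tiny in-memory stand-in for a .docx file (illustrative only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        # XML parts describing content types and the document body.
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("word/document.xml", "<document/>")
        # A binary part, such as an embedded image (placeholder bytes).
        package.writestr("word/media/image1.png", b"\x89PNG")
    return buffer.getvalue()


if __name__ == "__main__":
    for name in list_parts(build_sample_package()):
        print(name)
```

To inspect a real file, the bytes could be read from disk, for example with open("report.docx", "rb").read(), where the file name is hypothetical.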
Data Exchange
Information that was once locked in a structured binary format is accessible with Open XML, and, therefore, documents can serve as exchangeable data sources. The contents of an Office document (in the new file formats) can now be accessed using any tool or technology capable of working with .ZIP archives. The document content can be manipulated using any standard XML-processing techniques or, for parts that exist as embedded native formats, such as images, processed using any appropriate tool for that object type.
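As a rough sketch of that idea, the following Python code opens a package as a .ZIP archive and reads paragraph text from the main document part with a standard XML parser. It assumes a Word document whose body lives in word/document.xml; the helper names and the sample content are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as w:p and w:t.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_paragraphs(package_bytes):
    """Return the text of each paragraph in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread across one or more w:t elements.
        runs = paragraph.iter(f"{{{W_NS}}}t")
        paragraphs.append("".join(run.text or "" for run in runs))
    return paragraphs


def build_sample_package():
    """Build a minimal in-memory package for demonstration purposes."""
    document_xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello</w:t></w:r>"
        '<w:r><w:t xml:space="preserve"> world</w:t></w:r></w:p>'
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("word/document.xml", document_xml)
    return buffer.getvalue()


if __name__ == "__main__":
    print(extract_paragraphs(build_sample_package()))
```

Embedded native parts such as images would be read with package.read() in the same way and passed to a tool suited to that object type.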
In addition, being able to open the container file of a 2007 Microsoft Office system document manually as a .ZIP archive has benefits for developers. For example, when building Office-based solutions, developers can examine the contents and structure of a document without having to write any code.
Once inside a 2007 Microsoft Office system document, the structure makes it easy to navigate a document's parts and its relationships, whether it's to locate information, change content or remove elements from a document. The use of XML, along with the published Office reference schemas, means developers can create additional documents, add data to existing documents or search for specific content in a body of documents.
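Navigating parts through their relationships might look like the sketch below. It assumes the package-level relationships are stored in the _rels/.rels part and that the main document is identified by a relationship type ending in /officeDocument; the function names and the sample package are illustrative.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the relationship parts inside a package.
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
OFFICE_DOC_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/"
    "relationships/officeDocument"
)


def read_relationships(package_bytes, rels_part="_rels/.rels"):
    """Map each relationship Id to a (Type, Target) pair."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        root = ET.fromstring(package.read(rels_part))
    return {
        rel.get("Id"): (rel.get("Type"), rel.get("Target"))
        for rel in root.iter(f"{{{REL_NS}}}Relationship")
    }


def find_main_document(package_bytes):
    """Return the target of the main document relationship, or None."""
    for rel_type, target in read_relationships(package_bytes).values():
        if rel_type.endswith("/officeDocument"):
            return target
    return None


def build_sample_package():
    """Build a minimal in-memory package with one root relationship."""
    rels_xml = (
        f'<Relationships xmlns="{REL_NS}">'
        f'<Relationship Id="rId1" Type="{OFFICE_DOC_TYPE}" '
        'Target="word/document.xml"/>'
        "</Relationships>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("_rels/.rels", rels_xml)
        package.writestr("word/document.xml", "<document/>")
    return buffer.getvalue()


if __name__ == "__main__":
    print(find_main_document(build_sample_package()))
```

Following relationships instead of hard-coding part names lets the same code locate the main part in a workbook or presentation, where the target path differs.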
Open XML is designed to be backward compatible with the binary formats: a Microsoft converter enables the migration of existing documents to the new XML-based formats, unlocking the data stored in them. Because a document is divided into separate parts, developers can access specific contents within a file without having to parse the entire document.
Smaller Files
Open XML documents typically are 50 percent to 75 percent smaller than documents based on the old binary formats. Open XML formats use .ZIP compression technology to store documents, reducing the disk space required to store files and decreasing the bandwidth needed to transport files by e-mail, over networks and across the Web. The file formats' architecture also improves recovery of damaged files.
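The effect of .ZIP compression on XML content can be observed with a short Python sketch that compares the stored and uncompressed sizes of the parts in a package. It does not compare against the old binary formats, and the sample package is an artificial, highly repetitive stand-in, so the ratio it reports says nothing about real documents.

```python
import io
import zipfile


def package_sizes(package_bytes):
    """Return (uncompressed, compressed) byte totals across all parts."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        infos = package.infolist()
    uncompressed = sum(info.file_size for info in infos)
    compressed = sum(info.compress_size for info in infos)
    return uncompressed, compressed


def build_sample_package():
    """Build an in-memory package holding repetitive XML (illustrative)."""
    body = "<w:p><w:r><w:t>Quarterly report</w:t></w:r></w:p>" * 500
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("word/document.xml", "<w:document>" + body + "</w:document>")
    return buffer.getvalue()


if __name__ == "__main__":
    uncompressed, compressed = package_sizes(build_sample_package())
    print(f"uncompressed: {uncompressed} bytes, stored: {compressed} bytes")
```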
To help improve security, Open XML allows files with embedded code and macros to be easily identified and isolated. Security is enhanced by restricting embedded code to those situations in which an administrator has enabled it. Documents can be shared with increased confidentiality because personally identifiable information and business-sensitive information (user names, comments, tracked changes, file paths) can be identified and removed.
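A simple way to picture both checks is the Python sketch below. It assumes that macro code is stored in a part named vbaProject.bin and that author names appear in the docProps/core.xml part; these are common conventions, not a complete security scan, and the function names and the sample author are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
CP_NS = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
CORE_PART = "docProps/core.xml"


def has_vba_project(package_bytes):
    """Report whether any part looks like an embedded VBA macro project."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        names = package.namelist()
    return any(name.lower().endswith("vbaproject.bin") for name in names)


def read_identity_fields(package_bytes):
    """Return author-related fields found in the core properties part."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        if CORE_PART not in package.namelist():
            return {}
        root = ET.fromstring(package.read(CORE_PART))
    fields = {}
    creator = root.find(f"{{{DC_NS}}}creator")
    modifier = root.find(f"{{{CP_NS}}}lastModifiedBy")
    if creator is not None:
        fields["creator"] = creator.text
    if modifier is not None:
        fields["lastModifiedBy"] = modifier.text
    return fields


def build_sample_package(with_macros):
    """Build a minimal in-memory package; contents are placeholders."""
    core_xml = (
        f'<cp:coreProperties xmlns:cp="{CP_NS}" xmlns:dc="{DC_NS}">'
        "<dc:creator>Jane Example</dc:creator>"
        "<cp:lastModifiedBy>Jane Example</cp:lastModifiedBy>"
        "</cp:coreProperties>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("word/document.xml", "<document/>")
        package.writestr(CORE_PART, core_xml)
        if with_macros:
            package.writestr("word/vbaProject.bin", b"\x00")
    return buffer.getvalue()


if __name__ == "__main__":
    sample = build_sample_package(with_macros=True)
    print(has_vba_project(sample), read_identity_fields(sample))
```

Removing the identified information would mean writing a new package without those values, which this sketch leaves out.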
In December 2006, Ecma International approved Office Open XML as an Ecma standard (ECMA-376). The file format is now being considered by the International Organization for Standardization (ISO) for adoption as an international standard.