Tech Brief

Open XML

The new default file format in the 2007 Microsoft Office system.

Developers have witnessed a transition from binary file formats to XML since Microsoft Office 2000. Binary files (.DOC, .XLS and .PPT files), which for years did a great job of storing and transporting data, couldn't keep up with the need to move information between disparate line-of-business apps and gather insight from that data. The Office Open XML file formats, which are the default formats for the 2007 Microsoft Office system (Word, Excel and PowerPoint), address this need by storing document content as XML. The community site OpenXMLDeveloper.org links to software development kits (SDKs) and examples of solutions written for various platforms.

More Style than Substance
Finding a way to describe the actual meaning contained in a document has been a central focus of the XML community for nearly 20 years, going back to when the technology was Standard Generalized Markup Language (SGML).

Traditionally, the way a document was created didn't include information about its actual content. All that was captured was the content's styling -- its size, whether the words were bold or italicized, the font and so on.

Those of us in the XML field have long believed that if we could separate the actual content, or meaning, from the presentation of a document, then users would be able to "tag" parts of their document with labels.

In a resume, for instance, a user could tag the candidate's name, address, career goals, qualifications and so on. In this way, documents of any kind could become a source of information as rich as a database.
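
As a rough illustration of that idea, the sketch below treats a tagged resume as queryable data. The element names (resume, name, address, goal, qualification) are hypothetical, invented for this example and not taken from any Office schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged resume; these element names are illustrative only.
RESUME_XML = """
<resume>
  <name>Pat Example</name>
  <address>1 Sample Street</address>
  <goal>Lead a documentation team</goal>
  <qualification>Ten years of technical writing</qualification>
  <qualification>XML and SGML experience</qualification>
</resume>
"""


def resume_fields(xml_text):
    """Collect the text of each tagged element, grouped by tag name."""
    root = ET.fromstring(xml_text)
    fields = {}
    for element in root:
        fields.setdefault(element.tag, []).append((element.text or "").strip())
    return fields
```

Calling resume_fields(RESUME_XML) returns a dictionary keyed by tag, so a program can ask for every "qualification" the same way it would query a column in a table.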

Custom Schemas
Open XML supports custom schemas, which means that users are able to define the structure and the type of content that each data element in a document can contain. The parts of the document -- XML files that describe the application data or meta-data, and binary files, for example -- are packaged in a .ZIP file container.

Data Exchange
Information that was once locked in a structured binary format is accessible with Open XML, and, therefore, documents can serve as exchangeable data sources. The contents of an Office document (in the new file formats) can now be accessed using any tool or technology capable of working with .ZIP archives. The document content can be manipulated using any standard XML-processing techniques or, for parts that exist as embedded native formats, such as images, processed using any appropriate tool for that object type.
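
A minimal sketch of that access pattern, using only the Python standard library, might look like the following. It assumes a Word document whose main part sits at word/document.xml, which is the conventional location but not guaranteed; the function names are chosen for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the main WordprocessingML vocabulary.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def list_parts(path):
    """Return the names of every part stored in the .ZIP container."""
    with zipfile.ZipFile(path) as package:
        return package.namelist()


def read_text_runs(path, part="word/document.xml"):
    """Return the text of each w:t element in the given part.

    Assumption: the main document part is at word/document.xml.
    """
    with zipfile.ZipFile(path) as package:
        root = ET.fromstring(package.read(part))
    return [t.text or "" for t in root.iter(f"{{{W_NS}}}t")]
```

Embedded images and other native-format parts appear in the same listing and can be pulled out with package.read() and handed to whatever tool understands that format.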

In addition, being able to open the container file of a 2007 Microsoft Office system document manually as a .ZIP archive has benefits for developers. For example, when building Office-based solutions, developers can examine the contents and structure of a document without having to write any code.

The container file in the 2007 Microsoft Office system.

Once inside a 2007 Microsoft Office system document, the structure makes it easy to navigate a document's parts and its relationships, whether it's to locate information, change content or remove elements from a document. The use of XML, along with the published Office reference schemas, means developers can create additional documents, add data to existing documents or search for specific content in a body of documents.
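
To make the idea of parts and relationships concrete, here is a short sketch that reads the package-level relationships part and follows it to the main document. It assumes the relationships live in _rels/.rels and that the main part is identified by a relationship type ending in /officeDocument; the helper names are invented for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by relationship parts in the package.
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def package_relationships(path, rels_part="_rels/.rels"):
    """Return each package-level relationship as a small dictionary."""
    with zipfile.ZipFile(path) as package:
        root = ET.fromstring(package.read(rels_part))
    return [
        {"id": r.get("Id"), "type": r.get("Type"), "target": r.get("Target")}
        for r in root.iter(f"{{{REL_NS}}}Relationship")
    ]


def find_main_part(path):
    """Return the target of the officeDocument relationship, or None."""
    for rel in package_relationships(path):
        if (rel["type"] or "").endswith("/officeDocument"):
            return (rel["target"] or "").lstrip("/")
    return None
```

Following relationships instead of hard-coding part names is the safer choice, because the package format does not require the main part to have a fixed name.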

Open XML is designed to be backward compatible with the binary formats. A Microsoft converter enables the migration of existing documents to the new XML-based formats, unlocking the data stored in them. XML-based file formats also let developers access specific contents within files without having to parse entire documents.

Smaller Files
Open XML documents typically are 50 percent to 75 percent smaller than documents based on the old binary formats. Open XML formats use .ZIP compression technology to store documents, reducing the disk space required to store files and decreasing the bandwidth needed to transport files by e-mail, over networks and across the Web. The file formats' architecture also improves recovery of damaged files.
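
The effect of the .ZIP compression on a given file can be checked with a few lines of code. The sketch below compares the stored size of the parts with their uncompressed size; note that this measures compression inside the container only and is not a comparison against the old binary formats. The function name is an assumption made for this example.

```python
import zipfile


def compression_summary(path):
    """Return (uncompressed_bytes, compressed_bytes, fraction_saved)."""
    with zipfile.ZipFile(path) as package:
        infos = package.infolist()
    raw = sum(info.file_size for info in infos)
    packed = sum(info.compress_size for info in infos)
    # Guard against an empty archive to avoid dividing by zero.
    saved = 0.0 if raw == 0 else 1 - packed / raw
    return raw, packed, saved
```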

To help improve security, Open XML allows files with embedded code and macros to be easily identified and isolated. Embedded code can be confined to those situations where an administrator has enabled it. Documents can be shared with increased confidentiality because personally identifiable information and business-sensitive information -- user names, comments, tracked changes, file paths -- can be identified and removed.
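
Both checks can be approximated from outside Office. The sketch below flags a package that contains a VBA project part and reads author names from the core properties part. It assumes the conventional part names (a file ending in vbaProject.bin, and docProps/core.xml) and covers identification only, not removal; the function names are invented for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
CP_NS = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"


def has_vba_project(path):
    """Return True if any part name ends in vbaProject.bin (assumed convention)."""
    with zipfile.ZipFile(path) as package:
        return any(
            name.lower().endswith("vbaproject.bin") for name in package.namelist()
        )


def author_metadata(path, core_part="docProps/core.xml"):
    """Return creator and last-modified-by names found in the core properties."""
    with zipfile.ZipFile(path) as package:
        if core_part not in package.namelist():
            return {}
        root = ET.fromstring(package.read(core_part))
    found = {}
    tags = (
        ("creator", f"{{{DC_NS}}}creator"),
        ("lastModifiedBy", f"{{{CP_NS}}}lastModifiedBy"),
    )
    for label, tag in tags:
        element = root.find(tag)
        if element is not None and element.text:
            found[label] = element.text
    return found
```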

In December 2006, Ecma International approved Office Open XML as an Ecma standard. The file format is now being considered by the International Organization for Standardization (ISO) for standardization.