Tech Brief

Open XML

The new default file format in the 2007 Microsoft Office system.

Developers have witnessed a transition from binary file formats to XML since Microsoft Office 2000. Binary files (.DOC, .XLS and .PPT), which for years did a great job of storing and transporting data, couldn't keep up with the need to move information between disparate line-of-business applications and to gather insight from that data. The Office Open XML file formats, the defaults for Word, Excel and PowerPoint in the 2007 Microsoft Office system, address this need by storing documents as XML. The community site OpenXMLDeveloper.org links to software development kits (SDKs) and examples of solutions written for various platforms.

More Style than Substance
Finding a way to describe the actual meaning contained in a document has been a central focus of the XML community for nearly 20 years, going back to the days of XML's predecessor, Standard Generalized Markup Language (SGML).

Traditionally, the way a document was created didn't include information about its actual content. All that was captured was the content's styling -- its size, whether the words were bold or italicized, the font and so on.

Those of us in the XML field have long believed that if we could separate the actual content, or meaning, from the presentation of a document, then users would be able to "tag" parts of their document with labels.

In a resume, for instance, a user could tag the candidate's name, address, career goals, qualifications and so on. In this way, documents of any kind could become a source of information as rich as a database.
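As a rough illustration of the idea, the sketch below tags a resume and then reads it back as a structured record. The element names (resume, name, address, goal, qualification) and the sample data are hypothetical, chosen only for this example; they do not come from any Office schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged resume; the element names are illustrative only.
RESUME_XML = """
<resume>
  <name>Pat Example</name>
  <address>1 Main Street</address>
  <goal>Lead a documentation team</goal>
  <qualification>10 years of technical writing</qualification>
  <qualification>XML and SGML experience</qualification>
</resume>
"""


def read_resume(xml_text):
    """Turn the tagged document into a dictionary, much like a database row."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.findtext("name"),
        "address": root.findtext("address"),
        "goal": root.findtext("goal"),
        "qualifications": [q.text for q in root.findall("qualification")],
    }


record = read_resume(RESUME_XML)
print(record["name"])
print(record["qualifications"])
```

Because each piece of content carries a label, a program can ask for the candidate's name directly instead of guessing from font size or position on the page.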

Custom Schemas
Open XML supports custom schemas, which means that users are able to define the structure and the type of content that each data element in a document can contain. The parts of the document -- for example, XML files that describe the application data or metadata, along with binary files -- are packaged in a .ZIP file container.

Data Exchange
Information that was once locked in a structured binary format is accessible with Open XML, and, therefore, documents can serve as exchangeable data sources. The contents of an Office document (in the new file formats) can now be accessed using any tool or technology capable of working with .ZIP archives. The document content can be manipulated using any standard XML-processing techniques or, for parts that exist as embedded native formats, such as images, processed using any appropriate tool for that object type.
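A minimal sketch of this kind of access follows, using only Python's standard zipfile and XML modules. It assumes the main body of a Word document lives in a part named word/document.xml, which is the usual convention, and it builds a tiny stand-in package in memory (not a complete, valid Word file) so the example runs on its own. The helper names are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

DOCUMENT_XML = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Hello, Open XML</w:t></w:r></w:p>"
    "</w:body></w:document>"
)


def build_sample_package():
    """Create a tiny stand-in package in memory (not a full, valid .docx)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("word/document.xml", DOCUMENT_XML)
    buffer.seek(0)
    return buffer


def read_paragraphs(package_file):
    """Open the container as a ZIP archive and return the text of each paragraph."""
    with zipfile.ZipFile(package_file) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread over one or more <w:t> elements.
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return paragraphs


paragraphs = read_paragraphs(build_sample_package())
print(paragraphs)
```

With a real document, a file path could be passed to read_paragraphs in place of the in-memory buffer.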

In addition, being able to open the container file of a 2007 Microsoft Office system document manually as a .ZIP archive has benefits for developers. For example, when building Office-based solutions, developers can examine the contents and structure of a document without having to write any code.

The container file in the 2007 Microsoft Office system.

Once inside a 2007 Microsoft Office system document, the structure makes it easy to navigate the document's parts and their relationships, whether to locate information, change content or remove elements. The use of XML, along with the published Office reference schemas, means developers can create additional documents, add data to existing documents or search for specific content in a body of documents.
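The sketch below shows one way such navigation might look: it reads the package-level relationships part (_rels/.rels) and follows the relationship that points to the main document part. The sample package is a simplified stand-in built in memory, and the function names are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of relationship parts and the type that marks the main document.
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
OFFICE_DOC_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/"
    "relationships/officeDocument"
)

ROOT_RELS = (
    f'<Relationships xmlns="{REL_NS}">'
    f'<Relationship Id="rId1" Type="{OFFICE_DOC_TYPE}" '
    'Target="word/document.xml"/>'
    "</Relationships>"
)


def build_sample_package():
    """Create a simplified stand-in package in memory (not a full .docx)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("_rels/.rels", ROOT_RELS)
        package.writestr("word/document.xml", "<document/>")
    buffer.seek(0)
    return buffer


def find_main_part(package_file):
    """Return the name of the part that the root relationships mark as the main document."""
    with zipfile.ZipFile(package_file) as package:
        rels = ET.fromstring(package.read("_rels/.rels"))
    for rel in rels.findall(f"{{{REL_NS}}}Relationship"):
        if rel.get("Type") == OFFICE_DOC_TYPE:
            # Targets may be written with a leading slash; ZIP names have none.
            return rel.get("Target").lstrip("/")
    return None


main_part = find_main_part(build_sample_package())
print(main_part)
```

Following relationships rather than hard-coding part names is the more robust choice, since a package is free to store its main part under a different name.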

Open XML is designed to be backward compatible with the binary formats: a Microsoft converter enables the migration of existing documents to the new XML-based formats, unlocking the data stored in them. Because the formats are XML-based, developers can access specific content within a file without having to parse the entire document.

Smaller Files
Open XML documents typically are 50 percent to 75 percent smaller than documents based on the old binary formats. Open XML formats use .ZIP compression technology to store documents, reducing the disk space required to store files and decreasing the bandwidth needed to transport files by e-mail, over networks and across the Web. The file formats' architecture also improves recovery of damaged files.
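To give a feel for why compression helps, the sketch below stores the same repetitive XML payload in a ZIP container twice, once uncompressed and once with DEFLATE, and compares the sizes. The payload is invented for the example, and the comparison is between stored and compressed XML, not between Open XML and the old binary formats, so it does not reproduce the figures quoted above.

```python
import io
import zipfile

# Invented, highly repetitive markup standing in for a document part.
xml_payload = (
    "<w:p><w:r><w:t>Quarterly sales report</w:t></w:r></w:p>" * 500
).encode("utf-8")


def zipped_size(data, compression):
    """Return the size in bytes of a ZIP archive holding `data` as a single part."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression) as package:
        package.writestr("word/document.xml", data)
    return len(buffer.getvalue())


stored = zipped_size(xml_payload, zipfile.ZIP_STORED)
deflated = zipped_size(xml_payload, zipfile.ZIP_DEFLATED)
print(f"raw XML: {len(xml_payload)} bytes")
print(f"ZIP, stored: {stored} bytes")
print(f"ZIP, deflated: {deflated} bytes")
```

XML markup repeats the same tag names many times, which is exactly the kind of redundancy that ZIP compression removes well.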

To help improve security, Open XML allows files with embedded code and macros to be easily identified and isolated. Embedded code can be restricted to those situations in which an administrator has enabled it. Documents can be shared with increased confidentiality because personally identifiable information and business-sensitive information -- user names, comments, tracked changes, file paths -- can be identified and removed.
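One simple way to spot embedded macro code, sketched below, is to look inside the container for a part named vbaProject.bin, the name under which VBA projects are commonly stored. This is a heuristic under that assumption rather than a complete check; a more thorough tool would also inspect the package's content types. The sample packages are stand-ins built in memory, and the function names are hypothetical.

```python
import io
import zipfile


def build_sample_package(with_macros):
    """Create a stand-in package in memory, optionally with a fake macro part."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", "<document/>")
        if with_macros:
            # Placeholder bytes only; a real VBA project is a binary stream.
            package.writestr("word/vbaProject.bin", b"\x00placeholder")
    buffer.seek(0)
    return buffer


def contains_macro_part(package_file):
    """Return True if any part in the container is named vbaProject.bin."""
    with zipfile.ZipFile(package_file) as package:
        return any(
            name.lower().endswith("vbaproject.bin")
            for name in package.namelist()
        )


print(contains_macro_part(build_sample_package(with_macros=True)))
print(contains_macro_part(build_sample_package(with_macros=False)))
```

Because the check only lists the archive's contents, it never has to open or run the embedded code.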

In December, Ecma International approved Office Open XML as an Ecma standard. The file format is now being considered by the International Organization for Standardization (ISO) for standardization.