UI Code Expert

From Unstructured to Structured Data with HTML 5

The emerging HTML 5 specification provides for structured documents that can support rich hierarchies and enable deep interoperability.

Although HTML 5 is still an emerging technology, its importance should not be underestimated. Major browser vendors are jostling for the lead position in the degree to which they conform to the HTML 5 specification. The discussion around HTML 5's prominence as a potential user interface technology indicates that as soon as the development tools catches up, we will quickly see HTML 5 emerge as one of the leading options under consideration for UI development. This fact is made even more urgent by the lack of support for "plug-ins" like Flash and Silverlight from platforms such as iPhone, iPad and even Linux -- making well used Web sites like youtube.com inaccessible in spite of the sizeable market.

Over the next several UI Code Expert column, I will be drilling down into HTML 5, exploring features like multimedia, new forms functionality, canvas capabilities and more. To start off, we are going to focus on the new HTML 5 document structure and how it converts unidentifiable data into structured data. To begin with, consider the simple HTML 4 document code below:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
   "http://www.w3.org/TR/html4/loose.dtd">
<html>
  <head>
    <title></title>
  </head>
  <body>
    <div>
      <div> </div>
      <div> </div>
      <div> </div>
      <div> </div>
  </div>
  </body>
</html>

Although there is certainly some structure in an HTML 4 document -- there is, after all, a header and body section -- the structure fades away once inside the body and we are left with mostly formatting. Given Ids and class names, we could possibly guess at title and header sections within a body along with left and right sections after that. However, HTML 4 itself doesn't explicitly identify these sections and certainly not anything more detailed. In fact, without combining it with CSS, there is little known from the div structure; to any program that loads the page, the body may as well be a set of "<thing>" tags.

In contrast, HTML 5 provides a set of elements specifically designed to address the lack of structure in HTML 4. Specifically, HTML 5 provides the ability to label content within the body. For example, rather than just a collection of divisions, HTML 5 provides support for naming the various divisions and indicating their purpose. The original HTML 4 sample, therefore, can be re-written in HTML 5 using header, nav, article and footer tags, as shown below:

<!DOCTYPE HTML>
<html>
  <head>
    <title></title>
    <meta charset="utf-8" />
  </head>
  <body>
      <header> </header>
      <nav> </nav>
      <article> </article>
      <footer> </footer>
  </body>
</html>

A quick comparison reveals a simplified doctype declaration, but more significantly, it demonstrates some new HTML 5 elements: header, nav, article, and footer. These are just a sample of new elements, but they represent the type of structure commonly available in an HTML document. In fact, as part of coming up with a list of elements to include in HTML 5, the specification writers analyzed pages on the Internet looking to determine the names of classes and ids that were most common on the Internet.

We will postpone a detailed discussion of each of these new HTML 5 elements for a later post. However, it is important to note that the structure doesn't stop with what is in this example. Rather, elements can appear within each other, enabling a rich hierarchy of content. Each article, for example, can contain its own header and footer elements, along with a child article element. Furthermore, there are many more new elements such as blockquote, section, figure, video, and time.

The end result is that using the HTML 5 provides a an outlining algorithm, enabling functionality similar to that found in recent versions of Microsoft Word, whereby a document outline is available and can be navigated by expanding or compressing various nodes in the outline. This can be quite powerful when you consider a table of contents could be generated not only for a page but an entire site.

Furthermore, given an HTML 5 structured page, search engines and other agents can begin to make sense out of the page content they crawl or index. For example, imagine an RSS reader that could subscribe to the article section as providing new content to aggregate. Similarly, screen readers that support HTML 5 can interpret the relevance of various elements far more readily than is possible with HTML 4.

The div tag is still fully supported, but it remains in its HTML 4 role -- that of identifying sections for the purposes of styling them. In contrast, the new HTML 5 elements not only support the same styling that the div tag supported, but also enable the identification of structure and the further interoperability between pages on the Internet.

About the Author

Mark Michaelis (http://IntelliTect.com/Mark) is the founder of IntelliTect and serves as the Chief Technical Architect and Trainer. Since 1996, he has been a Microsoft MVP for C#, Visual Studio Team System, and the Windows SDK and in 2007 he was recognized as a Microsoft Regional Director. He also serves on several Microsoft software design review teams, including C#, the Connected Systems Division, and VSTS. Mark speaks at developer conferences and has written numerous articles and books - Essential C# 5.0 is his most recent. Mark holds a Bachelor of Arts in Philosophy from the University of Illinois and a Masters in Computer Science from the Illinois Institute of Technology. When not bonding with his computer, Mark is busy with his family or training for another triathlon (having completed the Ironman in 2008). Mark lives in Spokane, Washington, with his wife Elisabeth and three children, Benjamin, Hanna and Abigail.

comments powered by Disqus

Featured

  • AI for GitHub Collaboration? Maybe Not So Much

    No doubt GitHub Copilot has been a boon for developers, but AI might not be the best tool for collaboration, according to developers weighing in on a recent social media post from the GitHub team.

  • Visual Studio 2022 Getting VS Code 'Command Palette' Equivalent

    As any Visual Studio Code user knows, the editor's command palette is a powerful tool for getting things done quickly, without having to navigate through menus and dialogs. Now, we learn how an equivalent is coming for Microsoft's flagship Visual Studio IDE, invoked by the same familiar Ctrl+Shift+P keyboard shortcut.

  • .NET 9 Preview 3: 'I've Been Waiting 9 Years for This API!'

    Microsoft's third preview of .NET 9 sees a lot of minor tweaks and fixes with no earth-shaking new functionality, but little things can be important to individual developers.

  • Data Anomaly Detection Using a Neural Autoencoder with C#

    Dr. James McCaffrey of Microsoft Research tackles the process of examining a set of source data to find data items that are different in some way from the majority of the source items.

  • What's New for Python, Java in Visual Studio Code

    Microsoft announced March 2024 updates to its Python and Java extensions for Visual Studio Code, the open source-based, cross-platform code editor that has repeatedly been named the No. 1 tool in major development surveys.

Subscribe on YouTube