UI Code Expert

From Unstructured to Structured Data with HTML 5

The emerging HTML 5 specification provides for structured documents that can support rich hierarchies and enable deep interoperability.

Although HTML 5 is still an emerging technology, its importance should not be underestimated. Major browser vendors are jostling for the lead position in the degree to which they conform to the HTML 5 specification. The discussion around HTML 5's prominence as a potential user interface technology indicates that as soon as the development tools catches up, we will quickly see HTML 5 emerge as one of the leading options under consideration for UI development. This fact is made even more urgent by the lack of support for "plug-ins" like Flash and Silverlight from platforms such as iPhone, iPad and even Linux -- making well used Web sites like youtube.com inaccessible in spite of the sizeable market.

Over the next several UI Code Expert column, I will be drilling down into HTML 5, exploring features like multimedia, new forms functionality, canvas capabilities and more. To start off, we are going to focus on the new HTML 5 document structure and how it converts unidentifiable data into structured data. To begin with, consider the simple HTML 4 document code below:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
   "http://www.w3.org/TR/html4/loose.dtd">
<html>
  <head>
    <title></title>
  </head>
  <body>
    <div>
      <div> </div>
      <div> </div>
      <div> </div>
      <div> </div>
  </div>
  </body>
</html>

Although there is certainly some structure in an HTML 4 document -- there is, after all, a header and body section -- the structure fades away once inside the body and we are left with mostly formatting. Given Ids and class names, we could possibly guess at title and header sections within a body along with left and right sections after that. However, HTML 4 itself doesn't explicitly identify these sections and certainly not anything more detailed. In fact, without combining it with CSS, there is little known from the div structure; to any program that loads the page, the body may as well be a set of "<thing>" tags.

In contrast, HTML 5 provides a set of elements specifically designed to address the lack of structure in HTML 4. Specifically, HTML 5 provides the ability to label content within the body. For example, rather than just a collection of divisions, HTML 5 provides support for naming the various divisions and indicating their purpose. The original HTML 4 sample, therefore, can be re-written in HTML 5 using header, nav, article and footer tags, as shown below:

<!DOCTYPE HTML>
<html>
  <head>
    <title></title>
    <meta charset="utf-8" />
  </head>
  <body>
      <header> </header>
      <nav> </nav>
      <article> </article>
      <footer> </footer>
  </body>
</html>

A quick comparison reveals a simplified doctype declaration, but more significantly, it demonstrates some new HTML 5 elements: header, nav, article, and footer. These are just a sample of new elements, but they represent the type of structure commonly available in an HTML document. In fact, as part of coming up with a list of elements to include in HTML 5, the specification writers analyzed pages on the Internet looking to determine the names of classes and ids that were most common on the Internet.

We will postpone a detailed discussion of each of these new HTML 5 elements for a later post. However, it is important to note that the structure doesn't stop with what is in this example. Rather, elements can appear within each other, enabling a rich hierarchy of content. Each article, for example, can contain its own header and footer elements, along with a child article element. Furthermore, there are many more new elements such as blockquote, section, figure, video, and time.

The end result is that using the HTML 5 provides a an outlining algorithm, enabling functionality similar to that found in recent versions of Microsoft Word, whereby a document outline is available and can be navigated by expanding or compressing various nodes in the outline. This can be quite powerful when you consider a table of contents could be generated not only for a page but an entire site.

Furthermore, given an HTML 5 structured page, search engines and other agents can begin to make sense out of the page content they crawl or index. For example, imagine an RSS reader that could subscribe to the article section as providing new content to aggregate. Similarly, screen readers that support HTML 5 can interpret the relevance of various elements far more readily than is possible with HTML 4.

The div tag is still fully supported, but it remains in its HTML 4 role -- that of identifying sections for the purposes of styling them. In contrast, the new HTML 5 elements not only support the same styling that the div tag supported, but also enable the identification of structure and the further interoperability between pages on the Internet.

About the Author

Mark Michaelis started IntelliTect, and serves as chief technical architect and trainer. Since 1996, he's been a Microsoft MVP for C#, Visual Studio Team System and the Windows SDK. In 2007, he was recognized as a Microsoft regional director. You can visit his blog at http://intellitect.com/mark.

comments powered by Disqus

Reader Comments:

Thu, May 12, 2011 Philip South Africa

I agree. Silverlight is a far more elegant architecture.

Wed, May 11, 2011

HTML + Javascript + jQuery + (9 million libraries) = hacking. This is true regardless of how much overblown marketing hype you put behind it.

Add Your Comments Now:

Your Name:(optional)
Your Email:(optional)
Your Location:(optional)
Comment:
Please type the letters/numbers you see above