UI Code Expert

From Unstructured to Structured Data with HTML 5

The emerging HTML 5 specification provides for structured documents that can support rich hierarchies and enable deep interoperability.

Although HTML 5 is still an emerging technology, its importance should not be underestimated. Major browser vendors are competing to lead in conformance to the HTML 5 specification. The discussion around HTML 5 as a user interface technology suggests that as soon as the development tools catch up, HTML 5 will quickly become one of the leading options for UI development. The shift is made more urgent by the lack of support for plug-ins such as Flash and Silverlight on platforms such as iPhone, iPad and even Linux, which leaves plug-in-dependent content on well-used Web sites like youtube.com inaccessible to a sizeable market.

Over the next several UI Code Expert columns, I will be drilling down into HTML 5, exploring features like multimedia, new forms functionality, canvas capabilities and more. To start off, we are going to focus on the new HTML 5 document structure and how it turns unidentified content into structured data. Consider the simple HTML 4 document code below:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
   "http://www.w3.org/TR/html4/loose.dtd">
<html>
  <head>
    <title></title>
  </head>
  <body>
    <div>
      <div> </div>
      <div> </div>
      <div> </div>
      <div> </div>
    </div>
  </body>
</html>

Although there is certainly some structure in an HTML 4 document (there is, after all, a head and a body section), the structure fades away once inside the body and we are left with mostly formatting. Given ids and class names, we could possibly guess at title and header sections within a body, along with left and right sections after that. However, HTML 4 itself doesn't explicitly identify these sections, and certainly not anything more detailed. In fact, without the accompanying CSS, little can be learned from the div structure; to any program that loads the page, the body may as well be a set of "<thing>" tags.
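
To make the guesswork concrete, here is a minimal sketch of how such a page is often labeled in HTML 4. The id values (header, nav, content, footer) and the placeholder text are assumptions chosen for illustration. Nothing in HTML 4 assigns them any meaning, so a program reading the page would have to guess at their intent.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
   "http://www.w3.org/TR/html4/loose.dtd">
<html>
  <head>
    <title>Sample page</title>
  </head>
  <body>
    <!-- The id values below are hypothetical author conventions, not part of HTML 4 -->
    <div id="header">Site title</div>
    <div id="nav">Links to other pages</div>
    <div id="content">The main story</div>
    <div id="footer">Copyright notice</div>
  </body>
</html>
```

Another author could just as easily have written id="top" or id="masthead" for the first div, which is why a program cannot rely on these names.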

In contrast, HTML 5 provides a set of elements designed to address the lack of structure in HTML 4 by labeling content within the body. Rather than just a collection of divisions, HTML 5 supports naming the various divisions and indicating their purpose. The original HTML 4 sample, therefore, can be rewritten in HTML 5 using header, nav, article and footer tags, as shown below:

<!DOCTYPE HTML>
<html>
  <head>
    <title></title>
    <meta charset="utf-8" />
  </head>
  <body>
    <header> </header>
    <nav> </nav>
    <article> </article>
    <footer> </footer>
  </body>
</html>

A quick comparison reveals a simplified doctype declaration, but more significantly, it demonstrates some new HTML 5 elements: header, nav, article, and footer. These are just a sample of the new elements, but they represent the kind of structure commonly found in an HTML document. In fact, when deciding which elements to include in HTML 5, the specification writers analyzed pages on the Internet to determine which class names and ids were most common.

We will postpone a detailed discussion of each of these new HTML 5 elements for a later column. However, it is important to note that the structure doesn't stop with what is in this example. Rather, elements can appear within each other, enabling a rich hierarchy of content. Each article, for example, can contain its own header and footer elements, along with a child article element. Furthermore, there are many more new elements such as section, aside, figure, video, and time.
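
As a rough sketch of that nesting, the markup below places a header, a section, a child article and a footer inside a parent article. The titles, the placeholder date and the idea that the child article is a reader comment are all assumptions made for illustration.

```html
<!DOCTYPE HTML>
<html>
  <head>
    <meta charset="utf-8" />
    <title>Nested structure sketch</title>
  </head>
  <body>
    <article>
      <header>
        <h1>Hypothetical article title</h1>
        <!-- The date is a placeholder, not taken from any real page -->
        <p>Posted <time datetime="2010-01-01">January 1, 2010</time></p>
      </header>
      <section>
        <h2>First section</h2>
        <p>Body text for the section.</p>
      </section>
      <!-- A child article, for example a reader comment on the parent -->
      <article>
        <header>
          <h2>Reader comment</h2>
        </header>
        <p>Comment text.</p>
        <footer>Comment author details</footer>
      </article>
      <footer>Article author details</footer>
    </article>
  </body>
</html>
```

Each header and footer here belongs to its nearest enclosing article, so the same element names can be reused at every level without ambiguity.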

The end result is that HTML 5 provides an outlining algorithm, enabling functionality similar to that found in recent versions of Microsoft Word, whereby a document outline is available and can be navigated by expanding or collapsing various nodes in the outline. This can be quite powerful when you consider that a table of contents could be generated not only for a page but for an entire site.
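
The sketch below suggests how nested sectioning elements and their headings could map to such an outline. The headings are hypothetical, and the outline in the trailing comment is an illustration of the intended result rather than the output of any particular browser or tool; whether a given user agent exposes an outline like this is an assumption.

```html
<!DOCTYPE HTML>
<html>
  <head>
    <meta charset="utf-8" />
    <title>Outline sketch</title>
  </head>
  <body>
    <h1>Site title</h1>
    <article>
      <h2>Article title</h2>
      <section>
        <h3>Background</h3>
      </section>
      <section>
        <h3>Details</h3>
      </section>
    </article>
    <!--
      Intended outline (an illustration, not the output of any specific tool):
        1. Site title
           1.1 Article title
               1.1.1 Background
               1.1.2 Details
    -->
  </body>
</html>
```

The heading levels are chosen to match the nesting depth so that the outline reads the same whether a tool follows the element nesting or only the heading ranks.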

Furthermore, given an HTML 5 structured page, search engines and other agents can begin to make sense of the page content they crawl or index. For example, imagine an RSS reader that could subscribe to a page's article elements and treat each new one as content to aggregate. Similarly, screen readers that support HTML 5 can interpret the relevance of various elements far more readily than is possible with HTML 4.

The div tag is still fully supported, but it remains in its HTML 4 role -- that of identifying sections for the purposes of styling them. In contrast, the new HTML 5 elements not only support the same styling that the div tag supported, but also enable the identification of structure and the further interoperability between pages on the Internet.
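
A minimal fragment showing the two roles side by side might look like the following. The class name "page" is a hypothetical styling hook, not something defined by HTML 5.

```html
<body>
  <!-- "page" is a hypothetical class name used only as a styling hook -->
  <div class="page">
    <header> </header>
    <nav> </nav>
    <article> </article>
    <footer> </footer>
  </div>
</body>
```

Here the div carries no meaning of its own and exists only so that a stylesheet can target the wrapper, while the elements inside it describe what the content is.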

About the Author

Mark Michaelis (http://IntelliTect.com/Mark) is the founder of IntelliTect and serves as the Chief Technical Architect and Trainer. Since 1996, he has been a Microsoft MVP for C#, Visual Studio Team System, and the Windows SDK and in 2007 he was recognized as a Microsoft Regional Director. He also serves on several Microsoft software design review teams, including C#, the Connected Systems Division, and VSTS. Mark speaks at developer conferences and has written numerous articles and books - Essential C# 5.0 is his most recent. Mark holds a Bachelor of Arts in Philosophy from the University of Illinois and a Masters in Computer Science from the Illinois Institute of Technology. When not bonding with his computer, Mark is busy with his family or training for another triathlon (having completed the Ironman in 2008). Mark lives in Spokane, Washington, with his wife Elisabeth and three children, Benjamin, Hanna and Abigail.
