From Unstructured to Structured Data with HTML 5 -- Visual Studio Magazine

The emerging HTML 5 specification provides for structured documents that can support rich hierarchies and enable deep interoperability.

By Mark Michaelis
05/11/2011

Although HTML 5 is still an emerging technology, its importance should not be underestimated. Major browser vendors are jostling for the lead in how closely they conform to the HTML 5 specification. The discussion around HTML 5's prominence as a potential user interface technology indicates that as soon as the development tools catch up, HTML 5 will quickly emerge as one of the leading options under consideration for UI development. The matter is made even more urgent by the lack of support for "plug-ins" like Flash and Silverlight on platforms such as iPhone, iPad and even Linux -- making well-used Web sites like youtube.com inaccessible in spite of the sizeable market.

Over the next several UI Code Expert columns, I will be drilling down into HTML 5, exploring features like multimedia, new forms functionality, canvas capabilities and more. To start off, we are going to focus on the new HTML 5 document structure and how it converts unidentifiable data into structured data. Consider the simple HTML 4 document code below:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
   "http://www.w3.org/TR/html4/loose.dtd">
<html>
  <head>
    <title></title>
  </head>
  <body>
    <div>
      <div> </div>
      <div> </div>
      <div> </div>
      <div> </div>
    </div>
  </body>
</html>

Although there is certainly some structure in an HTML 4 document -- there is, after all, a head and a body section -- the structure fades away once inside the body and we are left with mostly formatting. Given IDs and class names, we could possibly guess at title and header sections within a body, along with left and right sections after that. However, HTML 4 itself doesn't explicitly identify these sections, and certainly not anything more detailed. In fact, without combining it with CSS, little can be learned from the div structure; to any program that loads the page, the body may as well be a set of "<thing>" tags.
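
To make the guesswork concrete, here is a sketch of how such a page might label its divisions in HTML 4. The id and class values (page, header, nav, content, article, footer) and the placeholder text are hypothetical author conventions chosen for illustration; HTML 4 assigns them no meaning.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
   "http://www.w3.org/TR/html4/loose.dtd">
<html>
  <head>
    <title>Sample page</title>
  </head>
  <body>
    <!-- All id and class values below are hypothetical naming conventions. -->
    <div id="page">
      <div id="header">Site title</div>
      <div id="nav">Links to other pages</div>
      <div id="content" class="article">Main text</div>
      <div id="footer">Copyright notice</div>
    </div>
  </body>
</html>
```

A person reading the source can infer each division's role from its name, but a program still sees only nested div elements whose names carry no defined semantics.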

In contrast, HTML 5 provides a set of elements designed to address the lack of structure in HTML 4. Specifically, HTML 5 provides the ability to label content within the body. Rather than just a collection of divisions, HTML 5 supports naming the various divisions and indicating their purpose. The original HTML 4 sample, therefore, can be rewritten in HTML 5 using header, nav, article and footer tags, as shown below:

<!DOCTYPE HTML>
<html>
  <head>
    <title></title>
    <meta charset="utf-8" />
  </head>
  <body>
    <header> </header>
    <nav> </nav>
    <article> </article>
    <footer> </footer>
  </body>
</html>

A quick comparison reveals a simplified doctype declaration, but more significantly, it demonstrates some new HTML 5 elements: header, nav, article, and footer. These are just a sample of the new elements, but they represent the type of structure commonly found in an HTML document. In fact, in coming up with the list of elements to include in HTML 5, the specification writers analyzed pages on the Internet to determine which class names and IDs were most common.

We will postpone a detailed discussion of each of these new HTML 5 elements for a later column. However, it is important to note that the structure doesn't stop with what is in this example. Rather, elements can appear within each other, enabling a rich hierarchy of content. Each article, for example, can contain its own header and footer elements, along with a child article element. Furthermore, there are many more new elements such as section, aside, figure, video, and time.
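
The nesting described above can be sketched as follows. This is a minimal illustration rather than a recommended layout; the headings, paragraph text and date are placeholders, and treating the inner article as a reader comment is just one possible use of a child article.

```html
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8" />
    <title>Nested structure sketch</title>
  </head>
  <body>
    <article>
      <header>
        <h1>Parent article title</h1>
      </header>
      <p>Body of the parent article.</p>
      <!-- A child article, here imagined as a reader comment on the parent -->
      <article>
        <header>
          <h2>Comment title</h2>
        </header>
        <p>Body of the comment.</p>
        <footer>
          <!-- Placeholder date; datetime holds the machine-readable form -->
          Posted on <time datetime="2011-05-11">May 11, 2011</time>
        </footer>
      </article>
      <footer>Footer of the parent article.</footer>
    </article>
  </body>
</html>
```

Each header and footer here belongs to its nearest enclosing article, so the same element names describe different levels of the hierarchy without any extra ids.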

The end result is that HTML 5 provides an outlining algorithm, enabling functionality similar to that found in recent versions of Microsoft Word, whereby a document outline is available and can be navigated by expanding or collapsing various nodes in the outline. This can be quite powerful when you consider that a table of contents could be generated not only for a page but for an entire site.
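
As a rough sketch of what such an outline might look like, consider the page below. The headings are placeholders, and the outline shown in the closing comment is what we would expect from the section nesting; whether a particular browser or tool actually exposes that outline is an assumption to verify, which is why the heading levels here also match the nesting depth.

```html
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8" />
    <title>Outline sketch</title>
  </head>
  <body>
    <h1>Site title</h1>
    <section>
      <h2>Chapter one</h2>
      <section>
        <h3>Topic A</h3>
        <p>Text for topic A.</p>
      </section>
      <section>
        <h3>Topic B</h3>
        <p>Text for topic B.</p>
      </section>
    </section>
    <section>
      <h2>Chapter two</h2>
      <p>Text for chapter two.</p>
    </section>
    <!--
      Expected outline, derived from the nesting above:
      1. Site title
         1.1 Chapter one
             1.1.1 Topic A
             1.1.2 Topic B
         1.2 Chapter two
    -->
  </body>
</html>
```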

Furthermore, given an HTML 5 structured page, search engines and other agents can begin to make sense out of the page content they crawl or index. For example, imagine an RSS reader that could subscribe to the article section as providing new content to aggregate. Similarly, screen readers that support HTML 5 can interpret the relevance of various elements far more readily than is possible with HTML 4.

The div tag is still fully supported, but it remains in its HTML 4 role -- that of identifying sections for the purposes of styling them. In contrast, the new HTML 5 elements not only support the same styling that the div tag supports, but also enable the identification of structure and further interoperability between pages on the Internet.
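
The division of labor between div and the new elements might look like the sketch below. The wrapper class name and the specific CSS values are hypothetical, picked only to show a div acting as a styling hook around structural elements.

```html
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8" />
    <title>div as a styling hook</title>
    <style>
      /* Hypothetical class: the div exists only to center the page */
      .wrapper { width: 960px; margin: 0 auto; }
      /* The structural elements can be styled directly by element name */
      nav { background-color: #eeeeee; }
      footer { font-size: small; }
    </style>
  </head>
  <body>
    <div class="wrapper">
      <header> </header>
      <nav> </nav>
      <article> </article>
      <footer> </footer>
    </div>
  </body>
</html>
```

Removing the div would change how the page looks but not what it means, whereas replacing nav or article with div would keep the look and lose the structure.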

About the Author

Mark Michaelis (http://IntelliTect.com/Mark) is the founder of IntelliTect and serves as the Chief Technical Architect and Trainer. Since 1996, he has been a Microsoft MVP for C#, Visual Studio Team System, and the Windows SDK, and in 2007 he was recognized as a Microsoft Regional Director. He also serves on several Microsoft software design review teams, including C#, the Connected Systems Division, and VSTS. Mark speaks at developer conferences and has written numerous articles and books; Essential C# 5.0 is his most recent. Mark holds a Bachelor of Arts in Philosophy from the University of Illinois and a Master's in Computer Science from the Illinois Institute of Technology. When not bonding with his computer, Mark is busy with his family or training for another triathlon (having completed the Ironman in 2008). Mark lives in Spokane, Washington, with his wife Elisabeth and three children, Benjamin, Hanna and Abigail.
