In-Depth

Access Web Data Programmatically

Use the power of the DHTML Document Object Model to access elements of your Web Browser documents and automate data gathering.

Technology Toolbox: VB.NET

Finding information on the Web is still a manual process, by and large. The problem is compounded if your search requires clicking through Web page after Web page. Users benefit greatly from automating their searches. And developers who know how to provide programmatic Web data access should enjoy widespread demand for their services.

Suppose you're creating a point-of-sale system and you'd like it to display a suggested price whenever a user enters a new inventory item. Moreover, suppose that accomplishing this requires your application to access several competitors' Web sites transparently, searching for the item in question, then retrieving pricing information. A user would need several minutes to do this manually, while a program could do it in seconds—if you know how to code this process.

Fortunately, you can use the Web Browser control and its best friend, the Dynamic HTML Document Object Model (DHTML DOM), to mine data from the Internet easily. These tools can enable your apps to efficiently surf the Net and extract the information you need without user intervention. An app can search through tables of data, fill in and submit forms, and even rip images from Web pages. None of these steps is hard to code, but you do have to understand the DOM.

I started to learn more about the DOM when I needed to write a load-testing application for a software system I'd been working on. I wanted to simulate a user logging into a Web site, making some requests, and logging out. I ran this app on several client PCs at once to ensure that the specialized back-end server software (which created PDF files on the fly) could handle the load of multiple users making simultaneous requests.

Unfortunately, I found few resources about how to do this in books or on the Internet. It was one hurdle after another, especially when it came to dealing with a form, automating the login process, analyzing the contents of a Web page, and simulating a user clicking on a link. Each of these tasks seemed a bit daunting at first, due to the lack of information available. But I found they're all pretty easy to do once you learn a few secrets.

Fortunately, you start with something familiar: the Web Browser control, one of the more useful components in Visual Basic's Toolbox. Drop the control onto your form, add a few lines of code, and you have an instant Web browser. Its various properties and methods make navigation a snap. It also provides access to individual elements of the currently loaded page through its Document property, your gateway to the DHTML DOM.

The Document property returns a reference to the automation object of the active document, so the model it exposes can change depending on the kind of page being accessed. For example, the Document property returns a reference to the DHTML DOM—the focus of this article—when you access an HTML document. Other document types might give you entirely different object models. For example, you'd have to deal with the Microsoft Office DOM if you were loading an Excel document.

Organize HTML Docs With DOM
Microsoft's DHTML DOM implementation stems from the architecture set forth in the World Wide Web Consortium (W3C) DOM specification (see the sidebar, "You Never Know What You're Going to Get"). The DOM organizes the elements of an HTML document into a hierarchy where every element (or node) has a parent and can have its own children.

Consider a Web page that displays a table of foreign exchange rate multipliers. The first HTML element in such a document is, appropriately, the <HTML> tag, which contains a single child element, the <BODY> tag (see Listing 1). This in turn has one child element, the <TABLE> tag, which has four children (the <TR> tags). Each <TR> element has five children (the <TD> tags). End tags are considered part of the opening tag and are ignored, so there's no node for the </HTML>, </BODY>, </TABLE>, </TR>, or </TD> tags. Also, the DOM may provide a little more than what was in the document's source, because tags such as <HEAD> and <TITLE> are added even if they weren't present. You also get comments and unknown or misspelled tags.
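
Listing 1 isn't reproduced here, but a hypothetical Table.htm with that structure (four rows of five cells each, holding made-up rate multipliers) looks something like this:

<HTML>
<BODY>
<TABLE>
   <TR><TD>From/To</TD><TD>USD</TD>
      <TD>EUR</TD><TD>GBP</TD><TD>JPY</TD></TR>
   <TR><TD>USD</TD><TD>1.00</TD>
      <TD>1.08</TD><TD>0.69</TD><TD>122.20</TD></TR>
   <TR><TD>EUR</TD><TD>0.93</TD>
      <TD>1.00</TD><TD>0.64</TD><TD>113.50</TD></TR>
   <TR><TD>GBP</TD><TD>1.45</TD>
      <TD>1.57</TD><TD>1.00</TD><TD>177.10</TD></TR>
</TABLE>
</BODY>
</HTML>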

This hierarchy is represented in the DOM interface through collections. Every element object has a ChildNodes collection comprising the document element objects it contains. The FirstChild and LastChild properties provide references to the element's first and last child nodes, respectively, while the ParentNode property provides a reference back to the element's parent.
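
For example, with a page like Table.htm loaded, you can walk down to the <TABLE> element and back up again. This is a minimal sketch; the exact steps can vary because of nodes the parser adds, such as <HEAD>:

Dim oBody As Object
Dim oTable As Object

' ChildNodes(0) is the <HTML> element; <BODY> is
' its last child because the parser may insert a
' <HEAD> before it.
oBody = WebBrowser1.Document. _
   ChildNodes(0).LastChild
oTable = oBody.FirstChild
Debug.WriteLine(oTable.TagName)   ' "TABLE"
Debug.WriteLine( _
   oTable.ParentNode.TagName)     ' "BODY"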

The DOM also exposes an All collection that groups together every element object in the document. It works in most cases, but I've found it can be inconsistent, depending on the kind of page you're loading, so I recommend navigating document elements through the ChildNodes hierarchy I just described.

Most element objects have other properties that provide important information about the document element. The TagName property returns the actual HTML tag (minus the opening and closing brackets), while the InnerHTML property gives you the HTML source contained within the element. Usually you can retrieve the source for the entire document by accessing the Document's first child node:

sSource = _
   WebBrowser1.Document.ChildNodes(0). _
   InnerHTML

I use a recursive subroutine to list all the HTML tags in a document (see Listing 2). This approach requires some error trapping because not all elements support the TagName property.
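
Listing 2 isn't shown here, but a minimal version of such a routine might look like the following sketch. The Try...Catch supplies the error trapping for nodes that lack a TagName:

Private Sub ListTags(ByVal oNode As Object)

   Dim i As Integer

   Try
      Debug.WriteLine(oNode.TagName)
   Catch
      ' Skip nodes that don't support TagName.
   End Try

   ' Recurse into the node's children.
   For i = 0 To oNode.ChildNodes.Length - 1
      ListTags(oNode.ChildNodes(i))
   Next

End Sub

Start the recursion at the top of the document with Call ListTags(WebBrowser1.Document.ChildNodes(0)).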

Elements expose properties based on the parameters available to their tag type. For example, an <IMG> element has properties such as SRC and BORDER, while an <A> element has properties such as HREF or NAME.

You can employ the GetElementByID function to access directly those elements whose names you know. This lets you obtain the information you need without searching through all of a document's element nodes. For example, you could retrieve the URL for the "CompanyLogo" image by reading its SRC property:

sImageURL = _ 
   WebBrowser1.Document. _
   GetElementByID("CompanyLogo").SRC

The DOM provides a manifest of the individual elements in an HTML document. It also contains several collections of document objects grouped by class type, including Anchors, Applets, Embeds, Forms, Images, Links, and Plugins (see Figure 1). These special collections come in handy when you want to deal exclusively with certain aspects of the document. Every DOM collection includes a Length property that returns the total number of elements it contains, and all the collections are zero-based.
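
For example, assuming a page with at least one image is loaded, you can report how many links the document contains and grab the first image's URL (index 0, because the collections are zero-based):

Debug.WriteLine("Links: " & _
   WebBrowser1.Document.Links.Length)
Debug.WriteLine("First image: " & _
   WebBrowser1.Document.Images(0).SRC)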

Sift Through Tables
You're most likely to find the Web page data you need inside a table. You can pull table data into your application by looping through the element objects of the HTML document and extracting the contents of its <TD> tags with the InnerHTML property.

Write a subroutine that takes data from the Table.htm file you built earlier (see Listing 1) and displays it in a TextBox control (see Figure 2 and Listing 3). Looping through the elements' child nodes recursively lets you cycle methodically through the entire document and process elements based on their tag type. Searching for table-specific tags lets you "shake the tree" quickly for the data you need to extract.
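
Listing 3 has the complete routine; its heart is a recursive walk that reacts only to <TD> tags, along these lines (a sketch with hypothetical names):

Private Sub ExtractCells(ByVal oNode As Object)

   Dim i As Integer

   Try
      If oNode.TagName = "TD" Then
         ' Append the cell's contents.
         TextBox1.Text &= _
            oNode.InnerHTML & " "
      End If
   Catch
      ' Ignore nodes without a TagName.
   End Try

   For i = 0 To oNode.ChildNodes.Length - 1
      ExtractCells(oNode.ChildNodes(i))
   Next

End Sub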

Now you have a page loaded into the Web Browser control. You can find ideas about where to go next in the DOM's Links collection. This collection contains every A and AREA object in the document that has an HREF attribute. Anchor tags lack the HREF attribute but have a NAME or ID attribute, and are stored in the appropriately named Anchors collection.

Simulate a click on a link by calling the Link object's Click method. For example, here's how to jump to the URL specified by the third link on the page:

WebBrowser1.Document.Links(2).Click

While writing my load-testing app, I wanted to mix up the options my phantom user clicked on, so I created a subset of the Links collection with only the links I wanted to use. I looped through the DOM's links and added to my collection the ones satisfying certain criteria in their HREF properties. Then my code selected one at random and called its Click method. This allowed my application to simulate user activity more realistically.
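
Here's a sketch of that technique; the filter string and collection name are hypothetical:

Dim i As Integer
Dim colLinks As New Collection()

' Gather only the links that meet the criteria.
For i = 0 To WebBrowser1.Document. _
   Links.Length - 1
   If InStr(WebBrowser1.Document. _
      Links(i).HREF, "report") > 0 Then
      colLinks.Add( _
         WebBrowser1.Document.Links(i))
   End If
Next

' Pick one at random; a VB Collection is one-based.
If colLinks.Count > 0 Then
   Randomize()
   Call colLinks(CInt(Int(Rnd() * _
      colLinks.Count)) + 1).Click()
End If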

At this point, you probably need to code access to forms. Whether it's a search box or a username/password interface, you must be able to fill in and submit the form programmatically. Fortunately, the DOM lets you access form fields directly, including textboxes and buttons. This enables you to add an auto-fill feature to a Web browser app so your users don't have to enter form data over and over again.

You might already know the name of the object you'd like to access. If so, obtain a reference by passing its name to GetElementByID. For example, consider a simple login box (see Figure 3). It has two textboxes (txtUsername and txtPassword) and an image button (imbLogin). Set the values of the textboxes by passing a string to their Value properties:

WebBrowser1.Document. _
   GetElementByID("txtUsername"). _
   Value = "Name"

WebBrowser1.Document. _
   GetElementByID("txtPassword"). _
   Value = "test"

You can also retrieve the current value of a textbox by reading its Value property:

sUserName = _ 
   WebBrowser1.Document. _
   GetElementByID("txtUsername").Value 

Of course, you might not know the name of the form object you want to access. In this case, get an inventory of the elements available to you by looping through the DOM's Forms collection, then through each Form object's ChildNodes collection.
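
This sketch walks every form and writes out the tag of each child node it finds; the Try...Catch covers nodes without a TagName:

Dim i As Integer
Dim j As Integer
Dim oForm As Object

For i = 0 To WebBrowser1.Document. _
   Forms.Length - 1
   oForm = WebBrowser1.Document.Forms(i)
   For j = 0 To oForm.ChildNodes.Length - 1
      Try
         Debug.WriteLine( _
            oForm.ChildNodes(j).TagName)
      Catch
         ' Skip nodes without a TagName.
      End Try
   Next
Next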

Simulate a mouse click on the form's Submit button by calling the button's Click method, or use the GetElementByID function to access a known Button object:

Call WebBrowser1.Document. _
   GetElementByID("imbLogin").Click()

Get the Picture(s)
Sometimes the data you want is an image. Ripping images from a Web site is easy—the DOM's Images collection stores parameters for each IMG tag in an HTML document. Simply obtain the image's URL (stored in the SRC parameter) and download the file.

The DOM doesn't facilitate downloading images, but the .NET Framework includes the DownloadFile method of the System.Net.WebClient class, which makes it easy to obtain a file from a given URL:

Public Sub DownloadFile(ByVal sURL As _
   String, ByVal sFileName As String)

   ' Create the WebClient before using it.
   Dim oClient As New System.Net.WebClient()
   oClient.DownloadFile(sURL, _
      sFileName)

End Sub

You can create an image-ripper utility that extracts the graphics from a given Web page by looping through the DOM's Images collection and calling the WebClient's DownloadFile method. The same approach can download other file types, such as EXEs, PDFs, and DOCs. Just remember that you need permission to download copyrighted items, including many images, and obtaining it is one process you can't program.
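
A sketch of such a ripper, using the DownloadFile routine above and saving each file under its original name to a hypothetical C:\Images folder:

Dim i As Integer
Dim sURL As String

For i = 0 To WebBrowser1.Document. _
   Images.Length - 1
   sURL = WebBrowser1.Document.Images(i).SRC
   ' Derive a local file name from the URL.
   DownloadFile(sURL, "C:\Images\" & _
      System.IO.Path.GetFileName(sURL))
Next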

You can learn more about programmatic data access by experimenting with some of the DOM's other collections, such as Applets and Scripts. You can also try loading different types of documents to see how the object model changes. I've learned that it takes hands-on practice to truly learn the DOM's features and foibles.

About the Author

Rob Thayer has been programming in VB since its first version and even fiddled with VB for DOS many years ago. A freelance programmer and author of several programming books, he currently lives in Phoenix. Reach him at [email protected].
