Machine-Readable Web Still a Ways Off
- By Joab Jackson
Despite recent initiatives, the possibility of a machine-readable Web extolled by World Wide Web creator Sir Tim Berners-Lee still faces many obstacles, he admitted during a talk at the International Semantic Web Conference, held this week in Chantilly, Va.
Formats such as spreadsheets or even application programming interfaces (APIs) don't do enough to help the reusability of data, he said. Nor are there enough commercial products available to make Web site transitions to the new semantic Web formats easy.
"When you look at putting government data on the Web, one of the concerns is...to not just put it out there on Excel files," he said. "You should put these things in" the Resource Description Framework (RDF).
Berners-Lee has long extolled the virtues of annotating the Web with machine-readable data. This week's conference of semantic Web enthusiasts, however, offered him the chance to discuss in depth the challenges of getting the rest of the Web to start using the technologies and approaches he advocates.
Few Web site managers are trained in RDF, and not many Web development applications use the standard, Berners-Lee admitted. "I'm not sure we have a grasp of our needs for the next phase of products," he said. He implored the semantic Web community in the audience to educate and inspire their peers. The people they need to talk to, he said, "are not going to be found in these corridors," referring to the conference attendees themselves.
Part of the issue is the inherent complexity of the concept of the semantic Web, which was Berners-Lee's original name for his concept of a machine-readable Web. Even simple sets of data linked by RDF, which was one simple component of his grand vision, "is still remarkably difficult as a paradigm shift," he said.
During the Q&A period, an audience member asked why exposing the API isn't sufficient for exposing data. Berners-Lee pointed out that to use an API, a system administrator or developer must write some sort of program to get at the data. With RDF, the data should be able to be reused directly within the browser itself, involving no additional work on the part of the user.
Berners-Lee noted that if the Web manager uses common uniform resource identifiers to identify people, cities or countries in the data, the browser could automatically pull information from other Web sites about these entities. "So there is very much more value to data for me, if I'm just browsing," he said.
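The role of shared identifiers can be illustrated with a minimal Python sketch. It is not drawn from the talk: the example.org URIs, the property names and the merge_by_uri helper are all hypothetical. The sketch assumes two sites each publish (subject, predicate, object) statements and shows that, when both use the same URI for a city, their statements can be combined with no site-specific code.

```python
from collections import defaultdict

def merge_by_uri(*sources):
    """Group statements from several sources by their subject URI.

    Simplification: if two sources give the same predicate for one
    subject, the later value overwrites the earlier one.
    """
    merged = defaultdict(dict)
    for triples in sources:
        for subject, predicate, obj in triples:
            merged[subject][predicate] = obj
    return dict(merged)

# Hypothetical identifier that both sites agree to use for the same city.
CITY = "http://example.org/id/chantilly"

# Hypothetical statements published independently by two Web sites.
site_a = [(CITY, "http://example.org/prop/name", "Chantilly")]
site_b = [(CITY, "http://example.org/prop/state", "Virginia")]

combined = merge_by_uri(site_a, site_b)
```

Because both sources name the city with the same URI, the combined record carries facts from each site under a single key.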
He said that the use of RDF should not require building new systems or changing the way site administrators work, reminiscing about how many of the original Web sites were linked back to legacy mainframe systems. Instead, scripts can be written in Python, Perl or other languages that can convert data in spreadsheets or relational databases into RDF for the end users. "You will want to leave the social processes in place, leave the technical systems in place," he said.
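A conversion script of the kind described might look like the following Python sketch. It is an illustration under stated assumptions, not a tool mentioned in the talk: the example.org namespace, the csv_to_ntriples function and the URI layout are hypothetical, and every cell is emitted as a plain string literal in the N-Triples syntax for RDF.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical namespace for illustration; a real site would use its own URIs.
BASE = "http://example.org/"

def escape_literal(value):
    """Escape characters that are not allowed raw inside an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))

def csv_to_ntriples(csv_text, key_column):
    """Turn each non-empty cell of a CSV export into one N-Triples statement.

    The value in key_column names the row's subject; every other column
    becomes a predicate. All objects are plain string literals.
    """
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}resource/{quote(row[key_column], safe='')}>"
        for column, value in row.items():
            # Skip the key itself, overflow cells (column None), and blanks.
            if column is None or column == key_column or not value:
                continue
            predicate = f"<{BASE}property/{quote(column, safe='')}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines
```

The spreadsheet or database export stays as it is; the script only reads it and writes RDF alongside, which matches the point about leaving existing systems in place.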
Conference attendees admitted that the idea of machine-readable data can be a hard sell to those unfamiliar with it. The idea of linked data, like the idea of a World Wide Web itself, "solves a problem we didn't know we had," said Ronald Reck, head of the consulting firm Rrecktek. In other words, many of the benefits offered by the then-nascent Web, such as the ability to share documents, were already offered through other technologies, such as the File Transfer Protocol. Likewise, it is difficult to understand the concept of a single format for Web-based data when plenty of formats such as relational databases and spreadsheets already annotate data in ways that make it reusable by other systems.
Nonetheless, the idea of making Web data machine-readable so it can be shared seems to be gaining at least some traction, not least because of efforts that set aside some of the semantic Web's more advanced notions, such as ontology-building, in favor of simply linking data sources.
Elsewhere at the conference, some researchers from the Rensselaer Polytechnic Institute demonstrated how they re-rendered all the data from the Data.gov Web sites into RDF. Their work was partially funded by grants from the Defense Advanced Research Projects Agency and the National Science Foundation.
"Our goal is to make the whole thing shareable and replicable for others to re-use," said project researcher Li Ding.
Data rendered into RDF can be more easily interposed with other sets of data to create entirely new datasets and visualizations, Ding said. He showed a Google Map-based graphic that interposed RDF versions of two different data sources from the Environmental Protection Agency, originally rendered in .CSV files.
The graphic derives the new material by linking common elements from the two datasets, Ding explained. The map shows the levels of ozone depletion across the country, the severity of the depletion marked by the circumference of the bubbles over the area where the readings were taken. One data set contained the ozone readings, while the other data source contained the geographical locations of where the readings were taken. The map data was created by joining these two sets of data by their common element -- namely, the names of the locations where the readings took place.
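The join Ding described can be sketched in a few lines of Python. This is an illustration only, not the Rensselaer team's code: the column names (location, ozone, lat, lon) and the join_on_location function are assumptions, and the two inputs are treated as plain CSV text rather than RDF.

```python
import csv
import io

def join_on_location(readings_csv, sites_csv):
    """Join ozone readings to site coordinates by shared location name.

    Column names are hypothetical. Readings whose location does not
    appear in the sites data are dropped.
    """
    # Index the site table by location name for direct lookup.
    sites = {row["location"]: row
             for row in csv.DictReader(io.StringIO(sites_csv))}
    joined = []
    for row in csv.DictReader(io.StringIO(readings_csv)):
        site = sites.get(row["location"])
        if site is None:
            continue
        joined.append({
            "location": row["location"],
            "ozone": float(row["ozone"]),
            "lat": float(site["lat"]),
            "lon": float(site["lon"]),
        })
    return joined
```

Each joined record holds what a map bubble needs: a position from one dataset and a reading from the other. A join on free-text names works only when both sources spell the names identically, which is one reason shared identifiers matter.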
The Rensselaer project is one of a number of interrelated efforts. Linked Open Data, a directory of RDF stores, has documented at last count over 4.2 billion assertions encoded in RDF across a wide variety of sources, such as GeoNames and DBpedia.
Joab Jackson is the chief technology editor of Government Computing News (GCN.com).