World Wide Web Consortium's Ivan Herman talks about the Semantic Web


      Ivan Herman says the Semantic Web will lead to “mash-ups on steroids”. As the World Wide Web Consortium’s Semantic Web activity lead, Herman has a lot of influence on the development of Web 3.0 technologies. The computer scientist has coordinated all of the standards body’s work on the Semantic Web since 2006.

      The Georgia Straight reached Herman by phone in January at his home in Amsterdam. What follows is a partial transcript of the interview.

      What is the Semantic Web?

      The shortest way is to say it is a data Web. Now, that by itself doesn’t say too much. But if you look at the Web as of today, it’s sort of a Web of documents. People put up documents in HTML, or generated from a database or whatever, that are linked together, and humans look at them. In this sense, it’s a bunch of documents that are on the Web. But in fact, there’s a huge amount of data, and you would like to have the data relate to one another directly. So, you have big databases, and the databases together represent knowledge and information, not necessarily individually, and you want the same kind of linkage among the data the way you do it with documents. In this sense, it’s the Web of data. You also find people calling it the data Web, which is more or less the same thing. The Semantic Web technologies, the various things that we do, are all the building blocks to realize that properly. That’s, if you like, the main thrust of it. Of course, there are always various things that people do with it that don’t necessarily go directly along this line, but that’s fine. That’s the way things work...

      What you can imagine is applications: mash-ups on steroids, as I call them sometimes. Because, after all, what mash-ups do is pick data from all over the place, put it together, and create a site for you where you see the result of collecting data from all over the Web. But if you look a little bit inside at what a mash-up site has to do, it has to go after each dataset or database with its own proprietary method and its own ways to find information: scraping a Web page here, this database exporting one way and that database exporting another way, et cetera. So, they have to reinvent the wheel many, many times. If, instead, the data were available the same way documents are available in HTML, with links, so that the Web of data was there, then mash-up sites could be built much more easily than today by combining the various types of data together. That’s one way of looking at it.
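      Herman’s point about mash-ups can be sketched in a few lines of Python. Everything here (the source names, URIs, and facts) is made up for illustration: the idea is only that once every source exposes the same (subject, predicate, object) triples, combining datasets is a set union rather than per-source scraping code.

```python
# A minimal sketch of "mash-ups on steroids": two hypothetical sources
# that both publish plain (subject, predicate, object) triples.
weather_source = {
    ("http://example.org/city/Amsterdam", "ex:temperatureC", "4"),
}
events_source = {
    ("http://example.org/city/Amsterdam", "ex:event", "Concert"),
    ("http://example.org/city/Vancouver", "ex:event", "Hockey game"),
}

# The "mash-up": no per-source extraction code, just one merged graph.
graph = weather_source | events_source

def about(subject, graph):
    """All facts the merged graph holds about one subject."""
    return sorted((p, o) for s, p, o in graph if s == subject)

print(about("http://example.org/city/Amsterdam", graph))
```

      With proprietary exports, each source would instead need its own parser; here the merge step never changes no matter how many sources are added.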

      Do you think RDF will be the basis of the Semantic Web?

      I think it is. Now, you have to be very precise about that, because there are some misconceptions, and frankly maybe bad messages from the community side as well. RDF has an abstract model, a general approach. Then RDF has a serialization in XML, what is called RDF/XML, which is a very complicated thing. For many people, it’s way too complicated...

      That distinction became much clearer when RDF was republished in 2004. The core model is really the essential part, and I really believe that it is in fact a relatively simple thing. That’s the core of it. Whether the RDF/XML serialization will survive the years, or whether there will be others coming up (there are already others), that’s a different question, if you like. Today, when you generate RDF directly from a database, say, very often you don’t even touch RDF/XML. But, yes, the RDF core, I believe, is at the heart of what’s happening there. Again, the RDF core itself is not that complicated. It’s the RDF/XML side that in some ways obscured the whole thing, and that is partially our mistake...

      The 1998 version sort of mixed up the two things together. If you look at the 2004 version, then these two things are very clearly separated. But, by 2004, part of the community had been hit by the XML version of it and drew some negative conclusions...

      It’s the core model, the triple model, which is important. The serialization is just syntax. There might be better syntaxes, and there are better syntaxes. We may look at a new XML syntax at some point maybe. We don’t know. But we may have to do that. But that’s only the syntax.
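      The separation Herman draws between the triple model and its syntax can be illustrated with a short Python sketch (the URI and name are made up): the same abstract statements are rendered below in an N-Triples-style line format and in JSON, and neither serializer changes the underlying model.

```python
import json

# The model: a list of (subject, predicate, object) statements.
triples = [
    ("http://example.org/ivan", "http://xmlns.com/foaf/0.1/name", "Ivan Herman"),
]

def to_ntriples(triples):
    # One '<s> <p> "o" .' statement per line, N-Triples style.
    return "\n".join(f'<{s}> <{p}> "{o}" .' for s, p, o in triples)

def to_json(triples):
    # A JSON rendering of the very same statements.
    return json.dumps([{"s": s, "p": p, "o": o} for s, p, o in triples])

print(to_ntriples(triples))
print(to_json(triples))
```

      Swapping one serializer for another, or adding a third, leaves `triples` untouched, which is why a complicated syntax like RDF/XML need not condemn the model itself.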

      How important is RDFa, which came out in the fall?

      In some ways, RDFa can be considered another serialization of the same model. That’s not necessarily why you would use it, but it is a possible serialization. I think it’s a very essential technology, because one of the obvious sources for getting RDF information is when you mix it up with an HTML file. The fact that, with a few attributes, you can add information to an HTML file and then extract that information in RDF is very essential...

      One of my colleagues had this very cute comparison. It’s like a credit card or a bank card: on the one hand, you can take it and read it, and you see the name and you see the number and you see the other number on the back—I don’t know what they call it. You can see all that, and that’s the way you look at the credit card. But at the same time, if you have an electronic reader, the electronic reader will read the same data, but in a different format, from the same card. It’s a little bit like that. You have an HTML file with information for humans, and a special reader or distiller can extract the information in RDF.
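      The “two readers, one card” idea can be sketched with Python’s standard HTML parser. This is a stand-in for a real RDFa distiller, drastically simplified (real RDFa also handles `about`, `content`, prefixes, and much more), and the HTML snippet is invented: the human-readable text doubles as machine-readable data via a `property` attribute.

```python
from html.parser import HTMLParser

class TinyExtractor(HTMLParser):
    """Toy distiller: collect the text of elements carrying a `property`
    attribute as (subject, predicate, object) statements."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._property = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "property" in attrs:
            self._property = attrs["property"]

    def handle_data(self, data):
        if self._property:
            self.triples.append(("#this-page", self._property, data.strip()))
            self._property = None

html_doc = '<p>My name is <span property="foaf:name">Ivan Herman</span>.</p>'
parser = TinyExtractor()
parser.feed(html_doc)
print(parser.triples)  # the human-readable text, re-read as data
```

      A browser shows the sentence; the extractor reads the same bytes and comes away with a statement instead, which is the credit-card analogy in miniature.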

      Do you think FOAF is going to be very important?

      In practice, it is already. Quite a number of places use it. It’s still something that has to evolve, and there are a number of things that are missing from it. Now, FOAF is not under W3C, so it’s not for me to say. It’s more a community development. I think it’s very important, and it has de facto imposed itself as a vocabulary for this basic information about persons. I know that there are missing things, and I know that it’s evolving constantly. The guy who is really at the heart of it, Dan Brickley, is still around and really tries to push it.

      Is the argument between RDFa and microformats supporters becoming a problem?

      Well, it’s not a problem. It depends what you want to do. For very simple things, microformats are pretty okay, and I don’t have any problem with that. In fact, if you forget about the controversy—I don’t like the controversies—both do essentially the same thing: add information. There are technologies—we have something called GRDDL—that you can run to extract information from microformats. That is fine.

      The problem with microformats is that they don’t scale. If your vocabulary is too complex, then it’s very difficult to map it onto the microformat approach. If you have a scientific text and you want to add information so that it conforms to some vocabulary in a library or something like that, then microformats don’t work anymore. It’s too simplistic. The other big problem is that if you want to mix two or three different microformats in the same file, then you have a clash. Microformats are not prepared for that. The FOAF file that I use in fact uses a whole load of other vocabularies in the same file: geography vocabularies and security vocabularies and God knows what else. If you used microformats for that, then you would have a clash. That’s where RDFa can come in, because RDFa scales. RDFa is in fact oblivious to the number of vocabularies you want to use. It’s just the same thing. It doesn’t have this kind of problem. I don’t want to go into this controversy. I don’t see that it’s either one or the other. Depending on the application area, both can happily coexist, and there are technologies to make them really live in the same space...

      GRDDL creates the bridge. That was not the only reason, but one of the main reasons, why GRDDL was developed: to make this bridge. In some ways, with this bridge, there is no reason for a controversy around that.
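      The vocabulary clash Herman describes, and why RDFa avoids it, can be shown with a small Python sketch of CURIE-style prefix expansion. The two namespace URIs are real (Dublin Core and FOAF); the expansion function is an illustration of the mechanism, not RDFa itself. Both vocabularies define a term with the same local name, `title`, yet they never collide once expanded.

```python
# Per-document prefix declarations, as RDFa allows:
prefixes = {
    "dc": "http://purl.org/dc/terms/",      # bibliographic title
    "foaf": "http://xmlns.com/foaf/0.1/",   # a person's title (Dr., Prof.)
}

def expand(curie):
    """Turn a prefix:local pair into an unambiguous full URI."""
    prefix, local = curie.split(":", 1)
    return prefixes[prefix] + local

# Same local name, no clash once expanded:
print(expand("dc:title"))    # http://purl.org/dc/terms/title
print(expand("foaf:title"))  # http://xmlns.com/foaf/0.1/title
```

      A flat microformat-style class name has no such escape hatch: two vocabularies that both choose `title` simply collide, which is the scaling problem described above.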

      What are some of the big challenges, already overcome and yet to come, in making the Semantic Web widespread?

      Things that are under way but clearly still not solved include how to bridge, for example, two major databases; in general, how to bridge two relational databases. Whether this bridge should be standardized, or whether you let the market play...

      Maybe there are some areas of that that have to be standardized, and some others that can be left to the market. But certainly, a huge amount of data is there, so this is one big area.

      The practicality of scaling, of reasoning over large amounts of data, is coming more and more to the fore, because there are a number of movements that put public data into RDF. Today on the Web, you have billions and billions of triples available, and there are applications that begin to make use of those. But how reasoning could work on that, whether we have to redefine the notion of completeness, whether we really want to have all the results or only the ones that you can get very quickly, et cetera: all these issues are being restated, and there are lots of developments. Part of that is really still research and development.

      On the other hand, we now have technology with which really big ontologies can be built. If you need big ontologies, then those can be built, and people build them. OWL is there, and there are very good reasoners around, so this issue has evolved nicely over the years, even if there are new features that people require. So, we are working on a new version of OWL. That has certainly been a big success. The fact itself that today we can have triple stores in which millions and millions of triples can be stored is a major success. I remember when, a few years ago, you looked around the tools, and if you had a tool that could handle a few hundred thousand triples, then we were all very happy and it was a big thing...
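      What “reasoning over triples” means can be sketched with a toy forward-chaining reasoner in Python. The class hierarchy is invented and the naive double loop is purely illustrative; real stores use indexes and far smarter algorithms to cope with billions of triples. The sketch simply keeps applying one transitivity rule until no new triple appears.

```python
# Toy forward chaining: close a subclass relation under transitivity.
SUB = "rdfs:subClassOf"
triples = {
    ("ex:Cat", SUB, "ex:Mammal"),
    ("ex:Mammal", SUB, "ex:Animal"),
}

changed = True
while changed:
    changed = False
    for (a, _, b) in list(triples):
        for (c, _, d) in list(triples):
            # If a < b and b < d hold, infer a < d.
            if b == c and (a, SUB, d) not in triples:
                triples.add((a, SUB, d))
                changed = True

print(("ex:Cat", SUB, "ex:Animal") in triples)  # the inferred triple
```

      Run over billions of triples, even one such rule explodes in cost, which is why Herman says notions like completeness (do we insist on every inferable triple, or only the quickly reachable ones?) are being restated.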

      What data, what part of the data, which triples should be available to whom, and how the access-control mechanisms that are already in place on the Web would work with this architecture. Things have been done, but there are certainly still things to be done there...

      Some of the mechanisms that are already on the Web for access control work for this as well. So, this is not starting from zero. But there are details to work out and systems to really make use of that, and that still has to crystallize.

      What’s most exciting to you about the Semantic Web?

      The fact that you have a huge amount of data and the kinds of things you are able to do by the integration of that data is something really exciting. You see sometimes applications that have an almost science fiction kind of thing in it...

      That’s doable because all that data is around and you can integrate it; you can create that Web, and there are machines that can do it for you and make it visible. That’s really, really impressive, and there are a bunch of things that you can do. Some of those things we can’t even really grasp. That’s really very exciting. That’s a beautiful thing.

      How do you envision regular people using this in five to 10 years?

      Well, they will not know that they use the Semantic Web. They don’t want to know. When I said mash-ups on steroids, that’s one thing they will see—that they will have much more exciting services available to them without them knowing how it works...

      People will see such engines at their disposal. They will not know what the technology is behind them, and, of course, they are not interested in that. But they will see this kind of thing. They may have much better personalizable and personalized systems, which can be done because there will be a much finer way of describing the various personal wishes and profiles that they want to define, et cetera. People very often ask me, “What is the killer app?”, and I have no idea. I don’t think that there will be one application. This is the kind of technology that will be behind a screen. It is always behind a screen, and people in this sense will not see it right away.

      You can follow Stephen Hui on Twitter at twitter.com/stephenhui.
