Inspired by conversations with some smart people at a recent Semantic Web Austin event, I’ve undertaken to restart my education on semantic web technologies such as RDF, RDFa, and microformats. When I wear my web developer hat, I’m definitely an advocate of clean semantic markup that correctly describes the structure of the data on the page. These technologies take that approach further (in some cases much, much further). In general, that seems like an unquestionably good idea: more semantic structure means more data portability and data discovery, and therefore a more powerful web. It’s probably even a necessary step towards a WebOS.
However, in my limited research to this point, it seems there’s an elephant in the room in all this advocacy. Discussions of semantic technologies inevitably cite “better search” as a chief raison d’être for their use. We’ll have search engines that “understand” the machine-readable data on our pages or in RDF descriptions, and that can then draw logical inferences from the relationships among the universe of web resources. But what if the semantic data is incorrect or just downright dishonest? Over-reliance on easily spammed meta tags gave us garbage in, garbage out in AltaVista and Excite back in the 90s. It would be trivial to take my RDFa-structured blog post, move it to a spam blog, find the semantically marked-up creator element, change it to someone else, and republish. Poof! My finely crafted blog post on the semantic web is now selling ads for herbal remedies to unsuspecting web users with poor search skills. Of course, it’s also easy to just out-and-out lie when describing content. Maybe I’m not really Angelina Jolie’s spouse or Bill Gates’ neighbor, even though I swear I am in my XFN rel attributes.
I would imagine that one thing that sets these approaches apart from 90s meta tags is that many of them specify relationships between resources, and a relationship can be checked from both ends. For my XFN claim to be “believed” by a semantic web search engine that understands XFN, the resource dereferenced from Angelina’s URI would have to say that I’m her spouse as well. (How Angelina or any of us feel about being boiled down to an authoritative web resource identified by a URI is another issue.) Of course, some people will try to game any system, but I’m sure the vast majority of web users (or publishing tools) will include this structured data for legitimate purposes. Still, all this makes me wonder how much search engines will ultimately be able to rely on semantic data for drawing the intelligent inferences we hope to see from them. Can any of you out there who know more about these technologies help me better understand how we can ensure semantic data isn’t telling lies? If so, leave a comment; I’d love to know more.
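To make the reciprocity idea above a bit more concrete, here is a minimal Python sketch of how a consumer of XFN data might decide whether to “believe” a claimed relationship. It is only an illustration of my guess at how such a check could work, not how any real search engine does it. The URLs, the `pages` dictionary (standing in for documents fetched by dereferencing each URI), and the function names are all made up for the example.

```python
from html.parser import HTMLParser


class XFNLinkParser(HTMLParser):
    """Collects (href, rel-values) pairs from <a> tags that carry a rel attribute."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        rel = attr_map.get("rel")
        if href and rel:
            # rel is a space-separated list of values, e.g. "neighbor met"
            self.links.append((href, set(rel.lower().split())))


def claimed_relationships(html):
    """Return every (href, rel-values) pair asserted by one page."""
    parser = XFNLinkParser()
    parser.feed(html)
    return parser.links


def is_reciprocated(pages, source, target, rel):
    """True only if source claims rel toward target AND target claims it back."""

    def claims(a, b):
        return any(
            href == b and rel in rels
            for href, rels in claimed_relationships(pages.get(a, ""))
        )

    return claims(source, target) and claims(target, source)


# Hypothetical pages keyed by URI; a real crawler would fetch these over HTTP.
pages = {
    "https://me.example/": (
        '<a href="https://angelina.example/" rel="spouse">Angelina</a>'
        '<a href="https://pat.example/" rel="neighbor met">Pat</a>'
    ),
    "https://angelina.example/": (
        '<a href="https://someone-else.example/" rel="spouse">Someone else</a>'
    ),
    "https://pat.example/": '<a href="https://me.example/" rel="neighbor">Me</a>',
}

# My one-sided "spouse" claim is not believed; the mutual "neighbor" claim is.
spouse_ok = is_reciprocated(
    pages, "https://me.example/", "https://angelina.example/", "spouse"
)
neighbor_ok = is_reciprocated(
    pages, "https://me.example/", "https://pat.example/", "neighbor"
)
```

Even in this toy form the limits are visible: a check like this only raises the cost of lying. Two pages controlled by the same spammer can simply vouch for each other, so reciprocity by itself probably can’t settle whether the data is honest.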