Telling semantic lies

Inspired by conversations with some smart people at a recent Semantic Web Austin event, I’ve undertaken to restart my education on semantic web technologies like RDF, RDFa, Microformats, etc. When I wear my web developer hat, I’m definitely an advocate of clean semantic markup that correctly describes the structure of the data on the page. These technologies take that approach further (in some cases much, much further). In general, that seems like an unquestionably good idea. More semantic structure means more data portability and data discovery and therefore a more powerful web. It’s probably even a necessary step towards a WebOS.

However, in my limited research to this point, it seems there’s an elephant in the room in all this advocacy. Inevitably, discussions of semantic technologies include “better search” as a chief raison d’être for their use. We’ll have search engines that “understand” the machine-readable data on our pages or in RDF descriptions, and that can then draw logical inferences from the relationships among the universe of web resources. But what if the semantic data is incorrect or just downright dishonest? Over-reliance on easily spammed meta tags gave us garbage in and garbage out in AltaVista and Excite back in the 90s. It would be trivial to take my RDFa-structured blog post, move it to a spam blog, find the semantically marked-up creator element, change it to someone else and republish. Poof! My finely crafted blog post on the semantic web is now selling ads for herbal remedies to unsuspecting web users with poor search skills. Of course, it’s also easy to just out and out lie when describing content. Maybe I’m not really Angelina Jolie’s spouse or Bill Gates’ neighbor, even though I swear I am in my XFN standard rel attributes.

I would imagine that one thing that sets these approaches apart from 90s meta tags is the fact that many of them are used to specify relationships between resources, and those relationships must be symmetric. Angelina’s resource, dereferenced from her URI, must indicate that I’m her spouse as well for that XFN relationship to be “believed” by a semantic web search that understands XFN. (How Angelina or any of us feel about being boiled down to an authoritative web resource identified by a URI is another issue.) Of course, some people will try to game any system, but I’m sure the vast majority of web users (or publishing tools) will include this structured data for legitimate purposes. Still, all this does make me wonder how much search engines will ultimately be able to rely on semantic data for drawing the intelligent inferences we hope to see from them. Can any of you out there who know more about these technologies help me better understand how we can ensure semantic data isn’t telling lies? If so, leave a comment; I’d love to know more.
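To make the symmetry idea concrete, here is a minimal sketch of how a crawler might refuse to “believe” a relationship unless both sides publish it. Everything in it is an assumption made for illustration: the URIs are invented, the `claims` dictionary stands in for whatever a real crawler would extract from each page’s rel attributes, and `is_reciprocated` is a hypothetical helper, not part of XFN or any library.

```python
from typing import Dict, Set, Tuple

# Publisher URI -> set of (rel value, target URI) claims found at that URI.
# All URIs and claims below are made up for this example.
Claims = Dict[str, Set[Tuple[str, str]]]

ME = "http://example.org/me"
ANGELINA = "http://example.org/angelina"
ALICE = "http://example.org/alice"

CLAIMS: Claims = {
    ME: {("spouse", ANGELINA), ("neighbor", ALICE)},
    ANGELINA: {("spouse", "http://example.org/someone-else")},
    ALICE: {("neighbor", ME)},
}


def is_reciprocated(claims: Claims, source: str, rel: str, target: str) -> bool:
    """Return True only if source claims rel -> target AND target claims rel -> source."""
    forward = (rel, target) in claims.get(source, set())
    backward = (rel, source) in claims.get(target, set())
    return forward and backward


# My "spouse" claim is not echoed at Angelina's URI, so it is not believed.
print(is_reciprocated(CLAIMS, ME, "spouse", ANGELINA))
# The "neighbor" claim is published on both sides, so it is corroborated.
print(is_reciprocated(CLAIMS, ME, "neighbor", ALICE))
```

A check like this only helps for relationships that are symmetric by nature, and it does nothing against two colluding pages that corroborate each other, so it would be one signal among several rather than a proof of honesty.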



  1. I’m not someone who espouses a “better search” when describing the benefits of a semantic web. Instead, I like to use the automation example. I ask people to imagine programs running automated queries on items like scientific research papers or published news reports. The program would be able to analyze the connections between discoveries and events independently and then report back when it ‘discovers’ something interesting.

    But, to answer your question, describing data semantically is only half of the equation. If it stopped there, then it really would be no different than other existing meta information out there (meta tags, categories, even keywords). The other half is being able to write algorithms that can emulate the human abilities of inference and deductive reasoning. A semantic search engine should be able to deduce that a page is not what it purports to be because the data it describes does not correlate well to similarly described data from other sites.

  2. As Ryan suggests, I think “better search” is something of a red herring when it comes to the Semantic Web. To a great extent the kind of search we know and love today (Google etc) is an artifact of generally only having human-readable documents available from which to build indexes.

    But your point is a very good one – as we get more machine-readable data on the Web, questions of trust and provenance become considerably more significant. So far we’ve only really seen “lower case” semantic web approaches in the wild, e.g. I’m likely to trust the tags you use in your feed because I trust you. (Note also that these pieces of metadata are out-of-line, not actually in the documents being referenced).

    If you can make rich descriptions of resources (as Semantic Web technologies allow), then one at least has a way of saying “I believe this” or “this is a lie”. Generally the statements in question are associated with URIs, and with the Resource Description languages there is a means to express levels of confidence in those resources (thus the “upper case” Semantic Web approach).

    There’s a paper around called “Named Graphs, Provenance and Trust” (Google!) which talks of RDF graphs, but if you bear in mind that an HTML doc (at a given URI) can be considered a named graph (named with the URI, the graph being all the machine-readable statements it contains) then there’s a relatively straightforward mechanism for dealing with this stuff.

    While I don’t imagine it occurring outside of specialist applications any time soon (again as Ryan suggests) inference can also help. An advanced example of this is Tim Berners-Lee’s “Oh Yeah?” button, which you press to discover the chain of authority behind any statements you encounter.

  3. Trust has at least been identified as a major part of the Semantic Web. I haven’t read much on it, but came across this resource a bit ago. It’s a daunting list, but has some good research articles that will get your mind working.

    You frame the issue of trust nicely though. Very practical description.

    And I agree with the previous commenters; ‘better search’ isn’t my grand vision of the Semantic Web either. Intelligent agents, improved data quality, and easier ways to exchange data are the things I’m really looking forward to.

  4. […] Troy Williams, CEO of PeoplePad, and Robert Pettengill. Several folks, like John Eric Metcalf and Hayes Davis, had thoughtful write-ups of the event; and Michelle Greer, armed with a flash she borrowed from […]

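The named-graph idea raised in the second comment can be sketched roughly as follows: every statement carries the URI of the document it came from, and a consumer only accepts statements from documents it has decided to trust. This is a toy illustration under stated assumptions, not the mechanism from the paper. The `Quad` type, the `trusted_statements` helper, the `dc:creator` string and all URIs are hypothetical names chosen for the example.

```python
from typing import Iterable, List, NamedTuple, Set


class Quad(NamedTuple):
    """One statement plus the URI of the document (named graph) that asserts it."""
    subject: str
    predicate: str
    obj: str
    graph: str


POST = "urn:example:telling-semantic-lies"
ORIGINAL_BLOG = "http://example.org/original-blog/post"
SPAM_BLOG = "http://spam.example.net/copied-post"

# The same post described by two documents that disagree about its creator.
QUADS: List[Quad] = [
    Quad(POST, "dc:creator", "Original Author", ORIGINAL_BLOG),
    Quad(POST, "dc:creator", "Someone Else", SPAM_BLOG),
]


def trusted_statements(quads: Iterable[Quad], trusted_graphs: Set[str]) -> List[Quad]:
    """Keep only the statements asserted by a graph the consumer trusts."""
    return [q for q in quads if q.graph in trusted_graphs]


# A consumer that trusts only the original blog sees a single creator.
creators = {q.obj for q in trusted_statements(QUADS, {ORIGINAL_BLOG})}
print(creators)
```

The hard part, deciding which graphs belong in the trusted set, is exactly the open question of the post; the sketch only shows that once provenance travels with each statement, a copied and altered description can be filtered out rather than blended in.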
