Sunday, 1 May 2011

A proposed criterion for what counts as semantic in the Semantic Web

Last year Richard MacManus introduced the Modigliani test as a "semantic web tipping point".  His argument was, "The tipping point for the long-awaited Semantic Web may be when you can query a set of data about someone not too famous, and get a long list of structured results in return."  This would represent significant progress in the implementation of semantic web technology and, to be sure, structured data would be very helpful in realizing an effective implementation of the semantic web. Being able to integrate this information from disparate sources is exactly the sort of thing that the implementation of the semantic web is supposed to help us realize.

However, this example gives me pause as a semantic web tipping point because realizing it doesn't, to my mind, necessarily require much in terms of semantics. Similarly, many alleged examples of semantic search leave me unsatisfied. Consider a recent discussion entitled "Exploring the Semantics of Yahoo Direct Search": the discussion points to the categorization of results and the ability to auto-complete queries and/or hint at the results they might produce. But none of these things seems particularly semantic, or at least not necessarily semantic.

Of course, "semantics" is notoriously vague and under-specified in the context of semantic web and semantic search discussions.  It certainly doesn't mean the same thing as what we're discussing when we consider, say, the semantics of first order logic.  Rather, it usually means something vaguer, like implementing the concepts that tokens denote rather than the tokens themselves in search and in representation. But even given this vague notion, I think that there's a relatively clear criterion (one that, I might add, many extant alleged semantic search and semantic web implementations fail to meet) that will allow us to pass over the linked data vs. RDF debate and jump to the very crux of the matter with respect to semantics. I propose that a representation and query system is semantic to the extent that it's able to identify correct or useful query responses despite the fact that some terms in the query or salient query disjunct are not present in the response.  That, to my mind, is a fair and interesting test of the extent to which the system is implementing the concepts that tokens represent rather than just the tokens themselves.
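To make the criterion concrete, here is a minimal sketch in Python. It assumes a query is just a list of terms and that relevance is judged by someone (or something) outside the code; the function names are mine and purely illustrative. All it checks is whether a response that was judged useful lacks at least one of the literal query terms.

```python
def absent_query_terms(query_terms, response_text):
    """Return the query terms that do not appear literally in the response.

    Matching is a naive case-insensitive substring test (an assumption made
    for brevity; a real system would tokenize and stem first).
    """
    text = response_text.lower()
    return [term for term in query_terms if term.lower() not in text]


def meets_criterion(query_terms, response_text, judged_relevant):
    """True when a response judged relevant is missing some literal query term."""
    missing = absent_query_terms(query_terms, response_text)
    return bool(judged_relevant) and len(missing) > 0


query = ["heart disease", "drug"]
semantic_hit = "Aspirin is often discussed after a myocardial infarction."
literal_hit = "A drug for heart disease."

print(absent_query_terms(query, semantic_hit))
print(meets_criterion(query, semantic_hit, judged_relevant=True))
print(meets_criterion(query, literal_hit, judged_relevant=True))
```

The first response counts toward the criterion because it is useful without containing either query term; the second is a plain token match and does not.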

The condition is, I realize, neither necessary nor completely sufficient* to establish the presence of the implementation of semantics, but it is a fairly strong and reliable indicator. It's difficult to realize this condition without being able to do some reasoning about the concepts in the query. As such, I would suggest there are really a series of "tipping points" or at least types of queries and responses that realize this condition that would suggest we're well on the way to realizing a truly semantic web.  Here's a set of example queries, or query types, and the kinds of results they'd need to meet the criterion.
  1. Synonyms: To my mind, the simplest search implementation that would meet the criterion and have a legitimate claim to implementing semantics is any system that recognizes synonyms. Google already does simple synonym matching. Synonym recognition doesn't require extensive understanding of meaning, but it does require some sort of semantic model for supplementing search. An example would be searches for 'car parts' that return results containing phrases like 'automotive parts' or 'auto parts'.
  2. Subclass:  Search engine users have high expectations of search engines in terms of natural language understanding. However, they tend to be very forgiving of the fact that state of the art search engines are for the most part completely incapable of doing any sort of subtype reasoning, although natural language questions do involve this.  Why shouldn't we expect a query on, say, "graph traversal algorithm AND scripting language" to be able to identify documents discussing a depth first search algorithm written in Perl?  Querying over subtypes is a far better test for the implementation of semantics and has the potential to make the querying system far more powerful as an information retrieval tool. Simple examples include queries such as 'heart disease drug' that return results without 'heart disease' or 'drug' but containing instead, for example, 'myocardial infarction' and 'aspirin'. Or we might imagine queries for 'vegetable side dishes for poultry' returning documents that lack those terms but refer to green bean casserole recipes to accompany turkey.  Of course, it's worth noting that such semantic search tools exist already and don't require the maturity of the more formal semantic web to be realized. Consider, for example, a search for 'heart disease drug' in Search Medica or a search for 'meat with vegetables' in Yummly.
  3. Instances: We can also imagine a search system that allowed us to search 'NHL team bankruptcy' and returned documents about, say, the Buffalo Sabres' financial plight of some years ago even if the document failed to contain the phrase 'NHL team', i.e., based on recognition that 'Buffalo Sabres' is an instance of NHL team. Or, why shouldn't we expect a search tool to allow us to query "SCOTUS judges Harvard" and be able to retrieve documents containing references to Harvard and particular SCOTUS judges?
  4. Parthood: Another useful kind of subsumption reasoning is the recognition of parthood. This would be particularly useful for queries referring to geographical entities, e.g., in travel queries, "find airports in Northeastern USA".  Other examples include a search for 'baseball teams in Southern United States' that recognized that references to, say, Alabama, are relevant or queries on 'cancer treatment in Canada' that recognized references to British Columbia as potentially salient.
  5. Negation reasoning: Another particularly useful test for semantics is the ability to do actual conceptual negation in a query. For example, I often like to search for soup recipes that don't include meat. However, searches for 'soup NOT meat' typically exclude only the results that contain the token "meat"; it would be far more useful if they also left out results mentioning "chicken", "beef", etc.
  6. Common sense/rule following: In a recent article about the ITA Software acquisition, a Google VP, Jeff Huber,  asked  "How cool would it be if you could type ‘flights to somewhere sunny for under $500 in May’ into Google and get not just a set of links but also flight times, fares and a link to sites where you can actually buy tickets quickly and easily?" This, I would argue, is an excellent "tipping point" query for the semantic web. While linked data is required for such a query, it wouldn't be sufficient. Recognizing which locations satisfied 'somewhere sunny' would indeed be indicative that the system is implementing semantics.
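Several of the query types above (synonyms, subclasses, instances, parthood, and conceptual negation) can be sketched with nothing more than a small concept graph and query expansion. The following Python is a toy illustration under loud assumptions: the SYNONYMS and NARROWER tables are hand-made for this post and do not come from any real ontology, the function names are hypothetical, and matching is naive substring search. It is not how any of the search engines mentioned here actually work.

```python
# Hypothetical toy knowledge base; every entry is illustrative only.
SYNONYMS = {"car": {"auto", "automotive"}}

# concept -> directly narrower terms (subclasses, instances, or parts)
NARROWER = {
    "heart disease": {"myocardial infarction"},
    "drug": {"aspirin"},
    "nhl team": {"buffalo sabres"},
    "canada": {"british columbia"},
    "meat": {"chicken", "beef"},
}


def expand(term):
    """Return the term plus its synonyms and all transitively narrower terms."""
    seen = set()
    stack = [term.lower()]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(SYNONYMS.get(current, ()))
        stack.extend(NARROWER.get(current, ()))
    return seen


def matches(doc, required=(), excluded=()):
    """True if the doc mentions every required concept and no excluded one.

    A concept counts as mentioned when any term in its expansion occurs in
    the text (case-insensitive substring match, an assumption for brevity).
    """
    text = doc.lower()
    has_all = all(any(v in text for v in expand(t)) for t in required)
    has_none = not any(v in text for t in excluded for v in expand(t))
    return has_all and has_none


print(matches("Discount auto parts", required=["car", "parts"]))
print(matches("Aspirin after a myocardial infarction",
              required=["heart disease", "drug"]))
print(matches("Buffalo Sabres file for bankruptcy",
              required=["NHL team", "bankruptcy"]))
print(matches("Chicken noodle soup", required=["soup"], excluded=["meat"]))
print(matches("Lentil soup", required=["soup"], excluded=["meat"]))
```

Note that the synonym table here is one-directional (expanding 'auto' would not yield 'car'), and that the "somewhere sunny" query in item 6 is out of reach for this kind of lookup: it needs rules over data such as weather and fares, not just term expansion.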
There are, of course, lots of improvements to be realized in search that don't meet the criterion I've spelled out. Improvements in ranking and categorizing and search suggestion and result extraction may, in fact, be of as much utility as improvements that implement these kinds of semantics. I just wish we'd stop using "semantics" for any kind of addition of structured data to documents or results.

*To the sufficiency question, there are some query response systems that meet the criterion I propose but which fail to be "semantic" in any reasonable sense of the term. As mentioned, stemming variants with respect to the query probably aren't good examples. Nor are search tools that allow us to constrain dates and values, e.g., a Craigslist search that allows one to specify a maximum price (or age) or a newspaper archive search that allows one to constrain dates. Any satisfaction of the criterion that is realized almost solely via arithmetic probably doesn't count as semantic.
