Semantic Search: different meanings

• Definition 1: semantic search as the problem of searching documents beyond the syntactic level of matching keywords
  • Hakia, Powerset, SearchMonkey
• Definition 2: semantic search as the problem of searching large Semantic Web datasets
  • Watson, PowerAqua, Swoogle, Sindice, SWSE

Facing keyword-based search problems
• Relations between search terms:
  • “books about recommender systems” vs. “systems that recommend books”
• Polysemy:
  • “mouth” as part of the body vs. “mouth” as part of a stream
• Synonymy:
  • “movies” vs. “films”
• Documents about individuals in which the query keywords do not appear:
  • query “English banks”, individual “Abbey”
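The problems listed above can be reproduced with a few lines of naive keyword matching. The following is a minimal sketch, not any particular engine's behaviour: the stopword list, the one-entry stemming table, and the sample documents are assumptions made up for illustration.

```python
# Naive keyword matching over bags of words (illustrative only).
STOPWORDS = {"about", "that", "a", "of", "the", "in", "and"}  # assumed list
STEMS = {"recommender": "recommend"}  # hypothetical one-entry stemming table


def terms(text):
    """Lowercase, split on whitespace, drop stopwords, apply the toy stemmer."""
    words = text.lower().split()
    return {STEMS.get(w, w) for w in words if w not in STOPWORDS}


def keyword_match(query, document):
    """A document matches when it contains every query term."""
    return terms(query) <= terms(document)


# Relations between terms: two different needs collapse into the same bag.
same_bag = terms("books about recommender systems") == terms(
    "systems that recommend books"
)

# Synonymy: a relevant document is missed because it says "films".
synonym_hit = keyword_match("movies", "a list of classic films")

# Polysemy: both senses of "mouth" match, whichever one the user meant.
body_hit = keyword_match("mouth", "pain in the mouth and jaw")
river_hit = keyword_match("mouth", "the mouth of the river")

print(same_bag, synonym_hit, body_hit, river_hit)
```

The first check shows the relation between terms being lost, the second a synonym being missed, and the last two an ambiguous term matching documents about unrelated senses.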

Several attempts from the IR community
• Early 80s: elaboration of conceptual frameworks and their introduction in IR models
  • Taxonomies (categories + hierarchical relations), e.g., the ODP (Open Directory Project)
  • Thesauri (categories + fixed hierarchical and associative relations), e.g., WordNet (used by linguistic approaches)
  • Algebraic methods such as LSA
• Limitations: the level of conceptualization is often shallow (especially at the level of relations)

The emergence of the SW
• Late 90s: introduction of ontologies as a conceptual framework (classes + instances (KBs) + arbitrary semantic relations + rules)
• Semantic search: exploiting ontologies as richer conceptualizations and formal languages to enhance traditional keyword-based document retrieval
• Semantic search: the need to search this emergent and continuously growing structured information space (the Web of Data)
  • DBLP, Geonames, DBpedia, BBC Music, ... (http://esw.w3.org/TaskForces/CommunityProjects/LinkingOpenData/DataSets)

The Web of Data
[Figures: the LOD cloud in 2007, 2008, and 2009]
Extracted from: Linked Data Tutorial (Florianópolis) http://www.slideshare.net/ocorcho/linked-data-tutorial-florianpolis

LOD cloud May 2007
• Facts:
  • Focal points:
    • DBpedia: RDFized version of Wikipedia; many ingoing and outgoing links
    • Music-related datasets
  • Big datasets include FOAF and US Census data
  • Size: approx. 1 billion triples, 250k links
Figure from [4]

LOD cloud September 2008
• Facts:
  • More than 35 datasets interlinked
  • Commercial players joined the cloud, e.g., the BBC
  • Companies began to publish and host datasets, e.g., OpenLink, Talis, or Garlik
  • Size: approx. 2 billion triples, 3 million links

LOD cloud March 2009
• Facts:
  • A large part comes from the Linking Open Drug Data cloud and the Bio2RDF project
  • Notable new datasets: Freebase, OpenCalais, ACM/IEEE
  • Size: > 10 billion triples

The LOD clouds

Commercial interest by publishers

Commercial interest by search engines
• 2007: Yahoo! presents SearchMonkey
• July 2008: Microsoft buys Powerset
• May 2009: Google announces Rich Snippets and its official use of RDFa and Microformats
• April 2010: Facebook announces the use of the Open Graph protocol
• July 2010: Google buys Metaweb (the company behind Freebase)
• November 2010: Google announces support for the GoodRelations vocabulary in Google Rich Snippets

Challenges
• Exploiting this new information space for semantic search purposes opens new research challenges:
  • Scalability
  • Heterogeneity
  • Uncertainty

Scalability
Effective exploitation of linked data requires infrastructure that scales to a large and ever-growing collection of interlinked data.

Heterogeneity
[Figure: the same class and the same individual identified differently across sources]
• Schema level (align): SW:Person, Dbpedia:Professor, Dblp:~ley/db/../author
• Data level (reconcile, combine): SW:/en/rudi_studer, Dbpedia:Rudi_Studer, Dblp:Studer:Rudi.html
Effective exploitation of the data web requires effective mechanisms for:
• finding the relevant data sources
• integrating data sources
• combining elements from different data sources
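Data-level reconciliation can be sketched as grouping identifiers that are declared equivalent and then pooling the facts attached to any of them. The example below is only an illustration under stated assumptions: the identifiers come from the slide, but the equivalence links and the two `type` facts are hypothetical, and union-find is one possible way to compute the groups, not necessarily what any of the systems mentioned here does.

```python
# Group equivalent identifiers with union-find, then merge their facts.

def find(parent, x):
    """Return the representative of x's group, creating the group if needed."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps lookups short
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the groups containing a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


# Hypothetical equivalence links between the three sources.
same_as = [
    ("SW:/en/rudi_studer", "Dbpedia:Rudi_Studer"),
    ("Dbpedia:Rudi_Studer", "Dblp:Studer:Rudi.html"),
]

# Hypothetical facts, each stated against a different identifier.
facts = [
    ("Dbpedia:Rudi_Studer", "type", "Dbpedia:Professor"),
    ("SW:/en/rudi_studer", "type", "SW:Person"),
]

parent = {}
for left, right in same_as:
    union(parent, left, right)

# Combine: attach every fact to the representative of its subject's group.
merged = {}
for subject, predicate, obj in facts:
    merged.setdefault(find(parent, subject), set()).add((predicate, obj))

print(merged)
```

After the merge, a query against any of the three identifiers can be answered from the single pooled description. Schema-level alignment (deciding how SW:Person relates to Dbpedia:Professor) is a separate step that this sketch does not attempt.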

Uncertainty
“Find action films directed by some Hong Kong film director and starring Chinese martial arts actors”
• Incomplete representation of the user’s needs and of content meaning:
  • The user cannot completely specify the need
  • The semantic information in the search space is incomplete
Effective exploitation requires:
• matching the user’s needs to data in an imprecise way
• ranking the results
• being flexible enough to adjust to changes in constraints
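The three requirements above (imprecise matching, ranking, and adjusting to changed constraints) can be illustrated with a simple scoring scheme. This is a minimal sketch under assumptions: the film records, the field names, and the "fraction of satisfied constraints" score are all invented for illustration, and missing fields stand in for incomplete semantic information.

```python
# Rank candidates by the fraction of query constraints they satisfy.

constraints = {
    "genre": "action",
    "director_origin": "Hong Kong",
    "star_profile": "Chinese martial arts actor",
}

# Hypothetical records; a missing key models incomplete metadata.
films = [
    {"title": "Film A", "genre": "action", "director_origin": "Hong Kong",
     "star_profile": "Chinese martial arts actor"},
    {"title": "Film B", "genre": "action", "director_origin": "Hong Kong"},
    {"title": "Film C", "genre": "drama", "director_origin": "Hong Kong",
     "star_profile": "Chinese martial arts actor"},
    {"title": "Film D", "genre": "comedy"},
]


def score(item, wanted):
    """Fraction of constraints the item is known to satisfy."""
    satisfied = sum(1 for key, value in wanted.items() if item.get(key) == value)
    return satisfied / len(wanted)


def rank(items, wanted):
    """Keep partial matches, best first; ties are broken by title."""
    scored = [(score(item, wanted), item["title"]) for item in items]
    kept = [pair for pair in scored if pair[0] > 0]
    return sorted(kept, key=lambda pair: (-pair[0], pair[1]))


ranking = rank(films, constraints)

# Adjusting to a changed constraint: the user drops the genre requirement.
relaxed = {k: v for k, v in constraints.items() if k != "genre"}
relaxed_ranking = rank(films, relaxed)

print([title for _, title in ranking])
print([title for _, title in relaxed_ranking])
```

An exact-match engine would return only Film A; the scored version also surfaces Film B, whose cast is simply unknown, and reorders the results when the genre constraint is dropped.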

The Search Space: different representations

The search space: different representations
• Unstructured search space
  • The Web of documents (textual and multimedia content)
• Structured search space
  • The Web of data (ontologies + knowledge bases)
• Hybrid search space
  • Unstructured content is enriched with metadata
    • Embedded annotations
    • Non-embedded annotations

The unstructured search space
• The Web of human-understandable content
• The Web of documents and links
  • <a href="http://creativecommons.org/licenses/by/3.0/">CC License</a>
[Figure: a search space made of documents]

Search engines

The structured search space
• The Web of machine-understandable content
• The Web of objects and relations
  • <a rel="license" href="http://creativecommons.org/licenses/by/3.0/"> Creative Commons License </a>
[Figure: a search space made of objects]
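The difference between the plain link and the annotated one is that the `rel` attribute names the relation, so a program can read a statement out of the markup. The sketch below uses Python's standard `html.parser` to show this; it is not a full RDFa processor, and the page URI `http://example.org/page` and the class name `RelLinkExtractor` are assumptions introduced for the example.

```python
from html.parser import HTMLParser


class RelLinkExtractor(HTMLParser):
    """Collect (page, relation, target) statements from <a rel=... href=...>."""

    def __init__(self, page_uri):
        super().__init__()
        self.page_uri = page_uri
        self.triples = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        rel, href = attributes.get("rel"), attributes.get("href")
        if rel and href:
            # rel may hold several space-separated relation names
            for relation in rel.split():
                self.triples.append((self.page_uri, relation, href))


def extract(html, page_uri="http://example.org/page"):
    """Parse an HTML fragment and return the statements found in it."""
    parser = RelLinkExtractor(page_uri)
    parser.feed(html)
    parser.close()
    return parser.triples


plain = '<a href="http://creativecommons.org/licenses/by/3.0/">CC License</a>'
annotated = (
    '<a rel="license" href="http://creativecommons.org/licenses/by/3.0/">'
    " Creative Commons License </a>"
)

plain_triples = extract(plain)
annotated_triples = extract(annotated)

print(plain_triples)
print(annotated_triples)
```

The plain link yields nothing a machine can interpret beyond "there is a link", while the annotated link yields one statement saying that the page has the given licence.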

Search engines

The hybrid search space
• Enriching documents with metadata
[Figure: a search space made of documents and objects]
How to interlink documents and data?

Two ways of interlinking metadata and documents
• Information extraction
• Relying on Web publishers
  • More in the section “Data on the (Semantic) Web”

Search engines
