Using TREC for cross-comparison between classic IR and ontology-based search models at a Web scale • Miriam Fernández¹, Vanessa López², Marta Sabou², Victoria Uren², David Vallet¹, Enrico Motta², Pablo Castells¹ • Semantic Search 2009 Workshop (SemSearch 2009) • 18th International World Wide Web Conference (WWW 2009) • 21st April 2009, Madrid
Table of contents • Motivation • Part I. The proposal: a novel evaluation benchmark • Reusing the TREC Web track document collection • Introducing the semantic layer • Reusing ontologies from the Web • Populating ontologies from Wikipedia • Annotating documents • Part II. Analyzing the evaluation benchmark • Using the benchmark to compare an ontology-based search approach against traditional IR baselines • Experimental conditions • Results • Applications of the evaluation benchmark • Conclusions
Motivation (I) • Problem: How can semantic search systems be evaluated and compared with standard IR systems to study whether and how semantic search engines offer competitive advantages? • Traditional IR evaluation • Evaluation methodologies generally based on the Cranfield paradigm (Cleverdon, 1967) • Documents, queries and judgments • Well-known retrieval performance metrics • Precision, Recall, P@10, Average Precision (AP), Mean Average Precision (MAP) • Wide initiatives like TREC to create and use standard evaluation collections, methodologies and metrics • The evaluation methods are systematic, easily reproducible, and scalable
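The metrics named on this slide can be sketched in a few lines. This is a minimal illustration, not the official TREC tooling: the function names and the data layout (a ranked list of document ids per query, plus a set of relevant ids per query) are assumptions made for the example, and unjudged documents are counted as non-relevant.

```python
from typing import Dict, List, Set


def precision_at_k(ranking: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top-k ranked documents that are judged relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant document appears.

    Relevant documents that are never retrieved contribute zero, so the sum
    is divided by the total number of relevant documents.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Average of the per-query AP values over all queries in the run."""
    return sum(average_precision(ranking, qrels.get(query, set()))
               for query, ranking in runs.items()) / len(runs)
```

For a ranking `["d1", "d2", "d3", "d4"]` with relevant set `{"d1", "d3"}`, AP is (1/1 + 2/3) / 2, and P@2 is 0.5.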
Motivation (II) • Ontology-based search evaluation • Ontology-based search approaches • Introduction of a new semantic search space (ontologies and KBs) • Change in the IR vision (input, output, scope) • The evaluation methods rely on user-centered studies, and therefore they tend to be high-cost, non-scalable and difficult to reproduce • There is still a long way to go in defining standard evaluation benchmarks for assessing the quality of ontology-based search approaches • Goal: develop a new reusable evaluation benchmark for cross-comparison between classic IR and ontology-based models on a significant scale
Part I. The proposal: a novel evaluation benchmark
The evaluation benchmark (I) • A benchmark collection for cross-comparison between classic IR and ontology-based search models at a large scale should comprise five main components: • a set of documents, • a set of topics or queries, • a set of relevance judgments (or lists of relevant documents for each topic), • a set of semantic resources, ontologies and KBs, which provide the semantic information needed by ontology-based approaches, • a set of annotations that associate the semantic resources with the document collection (not needed for all ontology-based search approaches)
The evaluation benchmark (II) • Start from a well-known standard IR evaluation benchmark • Reuse of the TREC Web track collection used in the TREC 9 and TREC 2001 editions of the TREC conference • Document collection: WT10g (Bailey, Craswell, & Hawking, 2003). About 10 GB in size, 1.69 million Web pages • The TREC topics and judgments for this text collection are provided with the TREC 9 and TREC 2001 datasets
The evaluation benchmark (III) • Construct the semantic search space • In order to fulfill Web-like conditions, all the semantic search information should be available online • The selected semantic information should cover, or partially cover, the domains involved in the TREC query set • The selected semantic resources should be completed with a larger set of random ontologies and KBs to approximate a fair scenario • If the semantic information available online has to be extended in order to cover the TREC queries, this must be done with information sources which are completely independent from the document collection, and available online
The evaluation benchmark (IV) • Document collection • TREC WT10g • Queries and judgments • TREC 9 and TREC 2001 test corpora • 100 queries with their corresponding judgments • 20 queries selected and adapted to be used by an NLP QA query processing module • Ontologies • 40 public ontologies covering a subset of the TREC domains and queries (370 files comprising 400 MB of RDF, OWL and DAML) • 100 additional repositories (2 GB of RDF and OWL) • Knowledge Bases • Some of the 40 selected ontologies have been semi-automatically populated from Wikipedia • Annotations • 1.2 · 10^8 non-embedded annotations generated and stored in a MySQL database
The evaluation benchmark (V) • Selecting TREC queries • Queries have to be formulated in a way suitable for ontology-based search systems (informational queries) • E.g., queries such as “discuss the financial aspects of retirement planning” (topic 514) are not selected • Ontologies must be available for the domain of the query • We selected 20 queries • Adapting TREC queries
The evaluation benchmark (VI) • Populating ontologies from Wikipedia • The semantic resources available online are still scarce and incomplete (Sabou, Gracia, Angeletou, d'Aquin, & Motta, 2007) • Generation of a simple semi-automatic ontology-population mechanism • Populates ontology classes with new individuals • Extracts ontology relations for a specific ontology individual • Uses Wikipedia lists and tables to extract this information
The evaluation benchmark (VII) • Annotating documents with ontology entities • Identify ontology entities (classes, properties, instances or literals) within the documents to generate new annotations • Do not populate ontologies, but identify already available semantic knowledge within the documents • Support annotation in open domain environments (any document can be associated or linked to any ontology without any predefined restriction). This brings scalability limitations. To solve them we propose to: • Generate ontology indices • Generate document indices • Construct an annotation database which stores non-embedded annotations
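The three steps above (ontology indices, document indices, annotation database) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the function names, the entity URIs, the table layout and the matching rule (all tokens of an entity label must occur somewhere in the document) are hypothetical, and an in-memory SQLite database stands in for the MySQL database mentioned in the slides.

```python
import re
import sqlite3
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_document_index(docs):
    """Inverted index: token -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def build_ontology_index(entities):
    """Map each entity URI to the token list of its textual label."""
    return {uri: tokenize(label) for uri, label in entities.items()}


def annotate(ontology_index, document_index):
    """Yield (entity_uri, doc_id) pairs where every label token occurs in the document."""
    for uri, tokens in ontology_index.items():
        if not tokens:
            continue
        doc_sets = [document_index.get(token, set()) for token in tokens]
        for doc_id in sorted(set.intersection(*doc_sets)):
            yield uri, doc_id


def store_annotations(pairs):
    """Store annotations outside the documents (non-embedded) in a database table."""
    conn = sqlite3.connect(":memory:")  # stand-in for a MySQL database
    conn.execute("CREATE TABLE annotation (entity_uri TEXT, doc_id TEXT)")
    conn.executemany("INSERT INTO annotation VALUES (?, ?)", pairs)
    conn.commit()
    return conn
```

Keeping the annotations in a separate table, rather than embedding them in the documents, means any document can be linked to any ontology entity without modifying the collection. The sketch does no disambiguation; the contextual filtering described on the next slide would sit between `annotate` and `store_annotations`.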
The evaluation benchmark (VIII) • Annotation based on contextual semantic information • Ambiguities: exploit ontologies as background knowledge (increasing precision but reducing the number of annotations)
Part II. Analyzing the evaluation benchmark
Experimental conditions • Keyword-based search (Lucene) • Best TREC automatic search • Best TREC manual search • Semantic-based search (Fernandez, et al., 2008)
Results (I) • Figures in bold correspond to the best result for each topic, excluding the best TREC manual approach (because of the way it constructs the query) • MAP: mean average precision • P@10: precision at 10
Results (II) • By P@10, the semantic retrieval outperforms the other two approaches • It provides maximal quality for 55% of the queries and it is only outperformed by both Lucene and TREC in one query (511) • Semantic retrieval provides better results than Lucene for 60% of the queries and equal for another 20% • Compared to the best TREC automatic engine, our approach improves 65% of the queries and produces comparable results in 5% • By MAP, there is no clear winner • The average performance of TREC automatic is greater than that of semantic retrieval • Semantic retrieval outperforms TREC automatic in 50% of the queries and Lucene in 75% • Bias in the MAP measure • More than half of the documents retrieved by the semantic retrieval approach have not been rated in the TREC judgments • The annotation technique used for the semantic retrieval approach is very conservative (missing potential correct annotations)
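Per-query percentages like the ones above come from comparing two systems topic by topic. A minimal sketch of such a tally is given below; the function name, the tolerance `eps` and the dictionary layout (query id mapped to a metric value such as P@10) are assumptions for illustration, and no scores from the actual experiments are reproduced.

```python
def compare_per_query(a, b, eps=1e-9):
    """Return the fractions of shared queries on which system a is better than,
    equal to, or worse than system b, given per-query metric values."""
    queries = sorted(set(a) & set(b))
    if not queries:
        raise ValueError("the two systems share no queries")
    better = sum(1 for q in queries if a[q] - b[q] > eps)
    equal = sum(1 for q in queries if abs(a[q] - b[q]) <= eps)
    worse = len(queries) - better - equal
    n = len(queries)
    return {"better": better / n, "equal": equal / n, "worse": worse / n}
```

Called with two dictionaries of per-topic P@10 values, it returns the three shares, which always sum to 1.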
Results (III) • For some queries for which the keyword search (Lucene) approach finds no relevant documents, the semantic search does • queries 457 (Chevrolet trucks), 523 (facts about the five main clouds) and 524 (how to erase scar?) • In the queries in which the semantic retrieval did not outperform the keyword baseline, the semantic information obtained by the query processing module was scarce • Still, overall, the keyword baseline only rarely provides significantly better results than semantic search • TREC Web search evaluation topics are conceived for keyword-based search engines • With complex structured queries (involving relationships), the performance of semantic retrieval would improve significantly compared to keyword-based search • The full capabilities of the semantic retrieval model for formal semantic queries were not exploited in this set of experiments
Results (IV) • Studying the impact of retrieved non-evaluated documents • 66% of the results returned by semantic retrieval were not judged • P@10 not affected. Results in the first positions have a higher probability of being evaluated • MAP: evaluating the impact • Informal evaluation of the first 10 unevaluated results returned for every query • 89% of these results occur in the first 100 positions for their respective query • A significant portion, 31.5%, of the documents we judged turned out to be relevant • Even though this cannot be generalized to all the unevaluated results returned by the semantic retrieval approach (the probability of being relevant drops around the first 100 results and then varies very little), we believe that the lack of evaluations for all the results returned by the semantic retrieval impairs its MAP value
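The effect described above can be made concrete with a toy example. In standard TREC-style scoring an unjudged document counts as non-relevant, so a relevant but unjudged result lowers average precision. The sketch below, with hypothetical function names and made-up document ids, measures how much of a ranking was judged and recomputes AP once supplementary judgments are added.

```python
def judged_fraction(ranking, judged):
    """Share of the returned documents that have any relevance judgment."""
    if not ranking:
        return 0.0
    return sum(1 for doc in ranking if doc in judged) / len(ranking)


def average_precision(ranking, relevant):
    """AP where every document outside `relevant` (including unjudged ones)
    is treated as non-relevant."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Toy ranking: u1 and u2 were never judged, r1 is judged relevant,
# n1 is judged non-relevant.
ranking = ["u1", "r1", "u2", "n1"]
judged = {"r1", "n1"}
original_relevant = {"r1"}

ap_before = average_precision(ranking, original_relevant)

# Suppose a supplementary assessment finds u1 to be relevant.
extended_relevant = original_relevant | {"u1"}
ap_after = average_precision(ranking, extended_relevant)
```

In this toy case half of the ranking is unjudged, and AP rises from 0.5 to 1.0 once the top-ranked unjudged document is assessed as relevant. The numbers are illustrative only and say nothing about the size of the effect in the actual experiments.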
Applications of the Benchmark • Goal: How can this benchmark be applied to evaluate other ontology-based search approaches?
Conclusions (I) • In the semantic search community, there is a need for standard evaluation benchmarks to evaluate and compare ontology-based approaches against each other, and against traditional IR models • In this work, we have addressed two issues: • Construction of a potentially widely applicable ontology-based evaluation benchmark from traditional IR datasets, such as the TREC Web track reference collection • Use of the benchmark to evaluate a specific ontology-based search approach (Fernandez, et al., 2008) against different traditional IR models at a large scale
Conclusions (II) • Potential limitations of the above benchmark are: • The need for ontology-based search systems to participate in the pooling methodology to obtain a better set of document judgments • The use of queries with a low level of expressivity in terms of relations, more oriented to traditional IR models • The scarcity of publicly available semantic information to cover the meanings involved in the document search space • A common understanding of ontology-based search in terms of inputs, outputs and scope should be reached before achieving a real standardization in the evaluation of ontology-based search models
Thank you! • http://nets.ii.uam.es/miriam/thesis.pdf (chapter 6) • http://nets.ii.uam.es/publications/icsc08.pdf