Wikipedia as Sense Inventory to Improve Diversity in Web Search Results

Celina Santamaria, Julio Gonzalo, Javier Artiles • nlp.uned.es • UNED, c/ Juan del Rosal, 16, 28040 Madrid, Spain • celina.santamaria@gmail.com • julio@lsi.uned.es • javart@bec.uned.es • ACL 2010

Introduction • Word Sense Disambiguation (WSD) • Promoting diversity in the search results • Present the results as a set of clusters • Complement search results with search suggestions • Two lexical resources • Wikipedia • WordNet 3.0

Introduction • Problem • Coverage • Estimate search results diversity using our senses • Sense frequencies • Classification

Test Set • Nouns that are susceptible to form a one-word query • Nouns that denote one or more named entities • 40 nouns • 15 nouns from the Senseval-3 lexical sample dataset • 25 nouns which satisfy two conditions • They are ambiguous • They are all names for music bands in one of their senses

Test Set • Average of 22 senses per noun in Wikipedia • Average of 4.5 senses per noun in WordNet • Wikipedia has a larger coverage • Retrieve 150 documents for each noun (Google) • Annotate each document with its senses in each of the two dictionaries

Coverage of Web Search Results • If we focus on the top ten results, in the band subset Wikipedia covers 68% of the top ten documents • Among the top ten results that are not covered by Wikipedia • A majority of the missing senses consists of names of companies (45%) and products or services (26%) • The other frequent type (12%) of non-annotated document is disambiguation pages

Coverage of Web Search Results • Wikipedia seems to extend the coverage of WordNet rather than providing complementary sense information • If we want to extend the coverage of Wikipedia, the best strategy seems to be to consider lists of companies, products and services

Diversity in Google Search Results • Use Wikipedia senses to test how well search results respect diversity in terms of this subset of senses • 63% of the pages in search results belong to the most frequent sense of the query word • Diversity may not play a major role in the current Google ranking algorithm

Sense Frequency Estimators for Wikipedia • Frequency information is crucial in a lexicon • But Wikipedia does not provide the relative importance of senses for a given word • Attempt to use two estimators of the expected sense distribution • Incoming links for the sense page • The number of visits for the sense page (May, June and July 2009, http://stats.grok.se/)
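
The two estimators above can be turned into a sense distribution in several ways. The following is a minimal sketch, assuming each signal is normalized into relative frequencies and the two are blended with an equal weight. The function name, the `link_weight` parameter, and the counts are hypothetical and chosen only for illustration; the slides do not state how the two estimators are combined.

```python
def estimate_sense_distribution(inlinks, visits, link_weight=0.5):
    """Blend two count signals into one distribution over senses.

    inlinks, visits: dicts mapping a sense label to a raw count.
    link_weight: assumed mixing weight (not taken from the slides).
    """
    def normalize(counts):
        total = sum(counts.values())
        if total == 0:
            # No evidence at all: fall back to a uniform distribution.
            return {sense: 1.0 / len(counts) for sense in counts}
        return {sense: count / total for sense, count in counts.items()}

    p_links = normalize(inlinks)
    p_visits = normalize(visits)
    return {
        sense: link_weight * p_links[sense] + (1 - link_weight) * p_visits[sense]
        for sense in inlinks
    }


# Hypothetical counts for three senses of one noun (illustration only).
inlinks = {"Oasis (band)": 600, "Oasis": 300, "Oasis (software)": 100}
visits = {"Oasis (band)": 8000, "Oasis": 1500, "Oasis (software)": 500}

distribution = estimate_sense_distribution(inlinks, visits)
most_frequent_sense = max(distribution, key=distribution.get)
```

The most frequent sense under such a distribution is what the "assign the most frequent sense to all documents" baseline would use.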

Association of Wikipedia Senses to Web Pages • Test whether the information can be used to classify search results accurately • Do not consider approaches that require manually created training data • Given a web page p and the set of senses w1, …, wn listed in Wikipedia • Approaches • Vector Space Model (VSM) • Word Sense Disambiguation (WSD) system • Baseline: random • Baseline: assign the most frequent sense to all documents

VSM • Represent each page in a vector space model (tf*idf weights) • VSM: compute idf in the collection of retrieved documents • VSM-GT: use the statistics provided by the Google Terabyte collection • VSM-mix: combine statistics from the collection and from the Google Terabyte collection • VSM-GT+freq
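
As an illustration of the plain VSM variant, here is a minimal sketch that computes idf over the retrieved documents, builds tf*idf vectors, and assigns a page to the sense whose Wikipedia text is closest by cosine similarity. Whitespace tokenization, the smoothed idf formula, and the toy texts are assumptions made for this example; the slides do not specify these details.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()


def idf_table(documents):
    """Smoothed idf computed over the collection of retrieved documents."""
    n = len(documents)
    document_frequency = Counter()
    for doc in documents:
        document_frequency.update(set(tokenize(doc)))
    table = {t: math.log((1 + n) / (1 + df)) + 1.0
             for t, df in document_frequency.items()}
    default_idf = math.log(1 + n) + 1.0  # idf for terms unseen in the collection
    return table, default_idf


def tfidf_vector(text, idf, default_idf):
    term_counts = Counter(tokenize(text))
    return {t: count * idf.get(t, default_idf) for t, count in term_counts.items()}


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def assign_sense(page_text, sense_pages, idf, default_idf):
    """Return (best sense, similarity) for one web page."""
    page_vector = tfidf_vector(page_text, idf, default_idf)
    scores = {
        sense: cosine(page_vector, tfidf_vector(text, idf, default_idf))
        for sense, text in sense_pages.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy data (hypothetical): two senses and two retrieved pages.
sense_pages = {
    "band": "oasis english rock band album concert",
    "place": "oasis fertile area desert water",
}
retrieved = [
    "oasis band concert tickets tour",
    "desert oasis water travel guide",
]
idf, default_idf = idf_table(retrieved)
first_sense, first_score = assign_sense(retrieved[0], sense_pages, idf, default_idf)
second_sense, second_score = assign_sense(retrieved[1], sense_pages, idf, default_idf)
```

Swapping `idf_table` for statistics from a larger reference corpus would correspond to the VSM-GT idea; that corpus is not reproduced here.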

WSD System • Extract learning examples from Wikipedia automatically • Disambiguate all occurrences of word w in the page p • TiMBL-core: use only the examples found in the Wikipedia page • TiMBL-inlinks: use the examples found in Wikipedia pages pointing to the page • TiMBL-all: use both sources of examples • TiMBL-core+freq

Classification Results • VSM is a simpler and more efficient approach • The results may indicate that using frequency estimations is only helpful up to a certain precision ceiling

Precision/Coverage Trade-off • All systems assign a sense to every document in the test collection • It is possible to enhance search results diversity without annotating every document • Set a threshold in the range [0.00, 0.90]
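
The trade-off can be sketched as follows, assuming the threshold is applied to the classifier's similarity score and that coverage here means the fraction of documents that still receive a sense. The function name and the toy predictions are hypothetical and not taken from the slides.

```python
def precision_and_coverage(predictions, gold, threshold):
    """Evaluate sense assignments that pass a score threshold.

    predictions: list of (doc_id, predicted_sense, score) tuples.
    gold: dict mapping doc_id to the manually annotated sense.
    """
    kept = [(doc_id, sense) for doc_id, sense, score in predictions
            if score >= threshold]
    coverage = len(kept) / len(predictions) if predictions else 0.0
    correct = sum(1 for doc_id, sense in kept if gold.get(doc_id) == sense)
    precision = correct / len(kept) if kept else 0.0
    return precision, coverage


# Toy predictions (hypothetical scores).
predictions = [
    ("d1", "band", 0.82),
    ("d2", "band", 0.40),
    ("d3", "place", 0.15),
    ("d4", "place", 0.05),
]
gold = {"d1": "band", "d2": "band", "d3": "band", "d4": "place"}

no_threshold = precision_and_coverage(predictions, gold, 0.00)
with_threshold = precision_and_coverage(predictions, gold, 0.30)
```

In this toy case, raising the threshold keeps fewer documents but the ones that remain are all classified correctly, which is the kind of trade-off the slide refers to.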

Using Classification to Promote Diversity • Use our best classifier (VSM-GT+freq) • Make a list of the top ten documents • Maximize the number of senses • Maximize the similarity scores of the documents to their assigned senses • Algorithm • Fill each position in the rank with the document that has the highest similarity to a sense not yet represented in the rank • Once all senses are represented, start choosing a second representative for each sense
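
The algorithm above can be sketched as a round-robin over senses. This is a minimal sketch, assuming each document already has one assigned sense and a similarity score. The ordering of senses within a round (by descending similarity) and the tie-breaking are assumptions, since the slide does not spell them out.

```python
from collections import defaultdict


def diversify(assignments, k=10):
    """Build a top-k list that covers as many senses as possible.

    assignments: list of (doc_id, sense, similarity) tuples.
    """
    by_sense = defaultdict(list)
    for doc_id, sense, score in assignments:
        by_sense[sense].append((score, doc_id))
    for docs in by_sense.values():
        docs.sort(reverse=True)  # best document of each sense first

    ranking = []
    round_index = 0
    while len(ranking) < k:
        # Round 0 takes the best document of every sense, round 1 the
        # second representative of every sense, and so on.
        candidates = [
            (docs[round_index], sense)
            for sense, docs in by_sense.items()
            if round_index < len(docs)
        ]
        if not candidates:
            break  # every document has been placed
        candidates.sort(reverse=True)
        for (score, doc_id), sense in candidates:
            if len(ranking) == k:
                break
            ranking.append((doc_id, sense, score))
        round_index += 1
    return ranking


# Toy assignments (hypothetical).
assignments = [
    ("d1", "band", 0.9),
    ("d2", "band", 0.8),
    ("d3", "place", 0.6),
    ("d4", "software", 0.3),
    ("d5", "band", 0.7),
]
top_four = diversify(assignments, k=4)
top_four_ids = [doc_id for doc_id, _, _ in top_four]
```

With these toy values the three senses each get one slot before the band sense receives its second representative.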

Using Classification to Promote Diversity • Other approaches • Clustering(centroids) • Clustering(top ranked) • Random • Upper bound

Using Classification to Promote Diversity • Coverage = the number of senses in the top ten results / the number of senses in all search results • Using Wikipedia to enhance diversity seems to work much better than clustering • Note: our evaluation has a bias towards using Wikipedia, because only Wikipedia senses are considered to estimate diversity
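
The coverage measure defined above is simple to compute. The sketch below assumes a ranked list of gold sense labels, with `None` for documents that have no Wikipedia sense; the function name and the example ranking are hypothetical.

```python
def sense_coverage(ranked_senses, k=10):
    """Senses present in the top k divided by senses present in the whole list."""
    all_senses = {s for s in ranked_senses if s is not None}
    if not all_senses:
        return 0.0
    top_senses = {s for s in ranked_senses[:k] if s is not None}
    return len(top_senses) / len(all_senses)


# Toy ranking (hypothetical): the top ten contain two of the four senses.
ranked = ["band"] * 8 + ["place", "band", "software", "place", None, "film"]
coverage_at_ten = sense_coverage(ranked, k=10)
```

Because only annotated senses enter both the numerator and the denominator, documents without a Wikipedia sense do not affect the score, which is consistent with the bias noted on the slide.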

Conclusion • Wikipedia has a much better coverage than WordNet • The distribution of senses can be estimated • Improve search results diversity for one-word queries with a simple and efficient algorithm • Our results do not imply that the Wikipedia-modified rank is better than the original Google rank
