Wikipedia as Sense Inventory to Improve Diversity in Web Search Results. Celina Santamaria, Julio Gonzalo, Javier Artiles. nlp.uned.es, UNED, c/Juan del Rosal, 16, 28040 Madrid, Spain. celina.santamaria@gmail.com, julio@lsi.uned.es, javart@bec.uned.es. ACL 2010.
Introduction • Word Sense Disambiguation (WSD) • Promoting diversity in the search results • Presenting the results as a set of clusters • Complementing search results with search suggestions • Two lexical resources • Wikipedia • WordNet 3.0
Introduction • Problems • Coverage • Estimating the diversity of search results using these senses • Sense frequencies • Classification
Test Set • Nouns that are likely to form a one-word query • Nouns that denote one or more named entities • 40 nouns • 15 nouns from the Senseval-3 lexical sample dataset • 25 nouns that satisfy two conditions • They are ambiguous • They are all names of music bands in one of their senses
Test Set • Average of 22 senses per noun in Wikipedia • Average of 4.5 senses per noun in WordNet • Wikipedia has larger coverage • Retrieve 150 documents for each noun (Google) • Annotate each document against each of the two dictionaries
Coverage of Web Search Results • If we focus on the top ten results, in the band subset Wikipedia covers 68% of the top ten documents • Among the top ten results that are not covered by Wikipedia • most of the missing senses are names of companies (45%) and products or services (26%) • the other frequent type (12%) of non-annotated document is disambiguation pages
Coverage of Web Search Results • Wikipedia seems to extend the coverage of WordNet rather than providing complementary sense information • If we want to extend the coverage of Wikipedia, the best strategy seems to be to consider lists of companies, products and services
Diversity in Google Search Results • Use Wikipedia senses to test how well search results respect diversity in terms of this subset of senses • 63% of the pages in search results belong to the most frequent sense of the query word • Diversity may not play a major role in the current Google ranking algorithm
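The 63% figure above is the share of result pages that fall under a single sense. A minimal sketch of that measurement, assuming each retrieved page has already been annotated with one Wikipedia sense label (the function name and the sample labels are hypothetical, not from the slides):

```python
from collections import Counter


def most_frequent_sense_share(sense_labels):
    """Return (sense, share) for the sense that covers the most pages."""
    counts = Counter(sense_labels)
    sense, count = counts.most_common(1)[0]
    return sense, count / len(sense_labels)


# Hypothetical annotations for five result pages of one query.
labels = ["band", "band", "fruit", "band", "company"]
sense, share = most_frequent_sense_share(labels)
print(sense, share)
```

A share close to 1.0 means the ranking is dominated by one sense; lower values indicate more diverse results.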
Sense Frequency Estimators for Wikipedia • Frequency information is crucial in a lexicon • But Wikipedia does not provide the relative importance of senses for a given word • Attempt to use two estimators of the expected sense distribution • Incoming links for the sense page • The number of visits to the sense page (May, June and July 2009, http://stats.grok.se/)
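Either estimator yields a raw count per sense page, which has to be turned into a relative frequency. A minimal sketch, assuming simple normalization of the counts (the slides do not state how the counts are converted; the function name and the numbers are hypothetical):

```python
def estimate_sense_distribution(counts):
    """Normalize raw per-sense counts (inlinks or page visits) to sum to 1."""
    total = sum(counts.values())
    if total == 0:
        # Assumption: fall back to a uniform distribution when no evidence exists.
        return {sense: 1 / len(counts) for sense in counts}
    return {sense: count / total for sense, count in counts.items()}


# Hypothetical incoming-link counts for three senses of one noun.
inlinks = {"band": 120, "fruit": 60, "company": 20}
print(estimate_sense_distribution(inlinks))
```

The same function applies unchanged to visit counts, so the two estimators can be compared on equal footing.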
Association of Wikipedia Senses to Web Pages • Test whether the information can be used to classify search results accurately • Approaches that require manually created training data are not considered • Input: a web page p and the set of senses w1, …, wn listed in Wikipedia • Approaches • Vector Space Model (VSM) • Word Sense Disambiguation (WSD) system • Random baseline • Most frequent sense baseline: assign the most frequent sense to all documents
VSM • Represent each page in a vector space model (tf*idf weights) • VSM: compute idf in the collection of retrieved documents • VSM-GT: use the statistics provided by the Google Terabyte collection • VSM-mix: combine statistics from the collection and from the Google Terabyte collection • VSM-GT+freq
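The VSM approach can be sketched as a nearest-sense classifier under tf*idf weighting. This is an illustration only: it assumes cosine similarity between the page and each sense's Wikipedia text, a smoothed idf, and pre-tokenized input, none of which the slides specify. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def compute_idf(documents):
    """Smoothed idf over a list of token lists (the smoothing is an assumption)."""
    n = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    return {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse tf*idf vector as a dict; terms missing from idf get weight 0."""
    tf = Counter(tokens)
    return {term: freq * idf.get(term, 0.0) for term, freq in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(page_tokens, sense_tokens, idf):
    """Return (best_sense, similarity) for one page."""
    page_vec = tfidf_vector(page_tokens, idf)
    scores = {
        sense: cosine(page_vec, tfidf_vector(tokens, idf))
        for sense, tokens in sense_tokens.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy data: two senses and one retrieved page.
senses = {
    "band": ["rock", "music", "album", "band"],
    "fruit": ["tree", "fruit", "sweet"],
}
page = ["new", "album", "from", "the", "rock", "band"]
idf = compute_idf(list(senses.values()) + [page])
print(classify(page, senses, idf))
```

Swapping the source of the idf table (retrieved collection, an external corpus, or a mix) is the only difference between the VSM, VSM-GT and VSM-mix variants in this sketch.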
WSD system • Extract training examples from Wikipedia automatically • Disambiguate all occurrences of word w in the page p • TiMBL-core: use only the examples found in the Wikipedia page for the sense • TiMBL-inlinks: use the examples found in Wikipedia pages pointing to the page • TiMBL-all: use both sources of examples • TiMBL-core+freq
Classification Results • VSM is a simpler and more efficient approach • The results may indicate that frequency estimates are only helpful up to a certain precision ceiling
Precision/Coverage Trade-off • All systems assign a sense to every document in the test collection • It is possible to enhance search results diversity without annotating every document • Set a threshold in the range [0.00, 0.90]
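The trade-off can be illustrated by discarding low-confidence assignments and measuring what remains. A minimal sketch, assuming the threshold is applied to the classifier's similarity score and swept in steps of 0.10 (the step size, the function name and the sample data are assumptions):

```python
def precision_and_coverage(predictions, gold, threshold):
    """predictions: list of (sense, score); gold: list of correct senses.

    Coverage is the fraction of documents that keep a label,
    precision is the fraction of kept labels that are correct.
    """
    kept = [
        (sense, truth)
        for (sense, score), truth in zip(predictions, gold)
        if score >= threshold
    ]
    coverage = len(kept) / len(gold) if gold else 0.0
    precision = sum(s == t for s, t in kept) / len(kept) if kept else 0.0
    return precision, coverage


# Hypothetical classifier output and gold annotations for four documents.
predictions = [("band", 0.9), ("fruit", 0.2), ("band", 0.5), ("company", 0.1)]
gold = ["band", "band", "band", "fruit"]

for step in range(10):  # thresholds 0.00, 0.10, ..., 0.90
    threshold = step / 10
    print(threshold, precision_and_coverage(predictions, gold, threshold))
```

Raising the threshold typically trades coverage for precision, which is acceptable here because diversification only needs a reliable label for some of the documents.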
Using Classification to Promote Diversity • Use our best classifier (VSM-GT+freq) • Build a list of the top ten documents • Maximize the number of senses • Maximize the similarity scores of the documents to their assigned senses • Algorithm • Fill each position in the ranking with the highest-similarity document whose sense is not yet represented in the ranking • Once all senses are represented, start choosing a second representative for each sense
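The re-ranking algorithm above can be sketched as repeated passes over the documents sorted by similarity. This is a sketch under the assumption that each document carries exactly one assigned sense and one similarity score; the function name and the sample documents are hypothetical.

```python
def diversify(docs, k=10):
    """docs: list of (doc_id, sense, score). Returns the re-ranked top k.

    Each pass takes the best remaining document of every sense, in
    descending score order; later passes add second, third, ...
    representatives once all senses are present.
    """
    remaining = sorted(docs, key=lambda d: d[2], reverse=True)
    ranked = []
    while remaining and len(ranked) < k:
        seen_this_pass = set()
        leftover = []
        for doc in remaining:
            if doc[1] not in seen_this_pass and len(ranked) < k:
                ranked.append(doc)
                seen_this_pass.add(doc[1])
            else:
                leftover.append(doc)
        remaining = leftover
    return ranked


# Hypothetical classified documents: (id, assigned sense, similarity).
docs = [
    ("d1", "band", 0.9),
    ("d2", "band", 0.8),
    ("d3", "fruit", 0.4),
    ("d4", "company", 0.3),
    ("d5", "fruit", 0.2),
]
print([doc_id for doc_id, _, _ in diversify(docs, k=4)])
```

With these inputs the second "band" document is pushed below the lower-scoring "fruit" and "company" documents, so all three senses appear before any sense is repeated.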
Using Classification to Promote Diversity • Other approaches • Clustering (centroids) • Clustering (top ranked) • Random • Upper bound
Using Classification to Promote Diversity • coverage = (number of senses in the top ten results) / (number of senses in all search results) • Using Wikipedia to enhance diversity seems to work much better than clustering • Note: our evaluation is biased towards Wikipedia, because only Wikipedia senses are considered when estimating diversity
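The coverage measure defined above is straightforward to compute from a ranked list of sense annotations. A minimal sketch, assuming one sense label per ranked document (the function name and the sample ranking are hypothetical):

```python
def sense_coverage(ranked_senses, k=10):
    """Fraction of all senses in the result list that appear in the top k."""
    all_senses = set(ranked_senses)
    if not all_senses:
        return 0.0
    return len(set(ranked_senses[:k])) / len(all_senses)


# Hypothetical ranking: four senses overall, two of them in the top three.
ranking = ["a", "a", "b", "a", "c", "d"]
print(sense_coverage(ranking, k=3))
```

Computing this value before and after re-ranking gives a direct comparison between the original ranking and a diversified one.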
Conclusion • Wikipedia has much better coverage than WordNet • The distribution of senses can be estimated • Search results diversity for one-word queries can be improved with a simple and efficient algorithm • Our results do not imply that the Wikipedia-modified ranking is better than the original Google ranking