Mining Anchor Text for Query Refinement
Reiner Kraft and Jason Zien, IBM Almaden Research Center
Mark Strohmaier

Problem Motivation
• 23% of search queries are single-term
• Expanding the query can lead to more accurate searches
• Previous studies indicate that anchor text is statistically similar to search queries
• Can this similarity be exploited to improve search queries?

What is anchor text?
• <a href="this is the website">This is the anchor text</a>
• Destination pages can have multiple links pointing to them
• Collections of anchor text can give a view of the destination page
• Naïve approach:
  • Find links whose anchor text is similar to the query
  • Return the links' destination pages to the user

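The naïve approach above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `AnchorCollector`, `anchors_by_destination`, and `naive_search` are hypothetical, the example URLs are made up, and "similar to the query" is assumed here to mean that the anchor text contains every query term.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []      # list of (href, anchor_text)
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the anchor text.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def anchors_by_destination(pages):
    """Group anchor texts by the page they point to."""
    index = defaultdict(list)
    for html in pages:
        collector = AnchorCollector()
        collector.feed(html)
        for href, text in collector.links:
            index[href].append(text)
    return index


def naive_search(index, query):
    """Return destinations with an anchor text containing every query term.

    Assumption: whole-term containment stands in for "similarity".
    """
    terms = query.lower().split()
    return sorted(
        href
        for href, texts in index.items()
        if any(all(t in text.lower().split() for t in terms) for text in texts)
    )


# Made-up example pages.
pages = [
    '<p>See the <a href="http://example.com/hr">HR benefits portal</a> and '
    '<a href="http://example.com/travel">travel policy</a>.</p>',
    '<a href="http://example.com/hr">employee benefits</a>',
]
index = anchors_by_destination(pages)
results = naive_search(index, "benefits")
```

Grouping by destination is what lets several incoming links contribute to one view of a page; the next slide lists why matching on raw anchor text alone is not enough.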
Problems with naïve approach
• High term frequency is not directly related to page quality
• Repeated terms may lead to unnatural queries
• IDF is not necessarily relevant
• Anchor text may appear multiple times

Methods of Query Refinement
• Weighting the number of occurrences
  • Weight based on the type of anchor text
• Number of terms in the anchor text
  • Fewer terms are better
• Number of characters in the anchor text
  • More concise queries are better

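As a rough sketch of weighting occurrences by anchor type: the slides do not say which anchor types exist or what weights they carry, so the categories and values in `TYPE_WEIGHTS` below are purely hypothetical, as is the function name `weighted_counts`.

```python
from collections import Counter

# Hypothetical anchor types and weights; the slides give neither.
TYPE_WEIGHTS = {"navigation": 0.5, "content": 1.0}


def weighted_counts(anchors):
    """Sum a type-dependent weight for each occurrence of an anchor text.

    anchors: iterable of (anchor_text, anchor_type) pairs.
    Returns a dict mapping normalized anchor text to its weighted count.
    """
    scores = Counter()
    for text, kind in anchors:
        # Normalize case and whitespace so repeated anchors are merged.
        key = " ".join(text.lower().split())
        # Unknown types fall back to a weight of 1.0 (an assumption).
        scores[key] += TYPE_WEIGHTS.get(kind, 1.0)
    return dict(scores)


scores = weighted_counts([
    ("IBM Research", "content"),
    ("ibm  research", "navigation"),
    ("Research", "content"),
])
```

Merging repeated anchors into one weighted score addresses the earlier concern that the same anchor text may appear many times without saying much about page quality.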
Benefits of the Anchor Text
• There is much less anchor text than document text
• Pages can have many incoming links
• Refined anchor text can capture a degree of site popularity

Mining Anchor Text
• Initial web crawl covered 33 million links on the IBM intranet
• Additionally, roughly 350,000 queries were analyzed
• Both categories showed a similar relationship between length and number of occurrences

Pre-processing Summaries
• Query refinement is sensitive to the number of terms
  • Too few may not lead to much improvement
  • Too many may lead to overspecialization
• Best results were obtained with MAXCOUNT = 3

Studies Performed
• Three different approaches were compared
• ANCHOR
  • Ranked anchor text refinement
• Doc.SW
  • Ranks pages based on the most frequently occurring two- and three-term phrases
• DOC
  • Similar to Doc.SW, but not counting stop words

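The document-based baselines can be illustrated with a simple phrase counter. This is a sketch under stated assumptions rather than the procedure used in the study: the function name `top_phrases` is hypothetical, the tiny stop-word list is illustrative only, tokenization is plain whitespace splitting, and the `drop_stop_words` flag is one reading of "not counting stop words".

```python
from collections import Counter

# Tiny illustrative stop-word list; not the list used in the study.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def top_phrases(text, k=3, drop_stop_words=False):
    """Return the k most frequent two- and three-term phrases in text.

    drop_stop_words=False roughly corresponds to the Doc.SW variant,
    drop_stop_words=True to the DOC variant, as the slides describe them.
    """
    tokens = text.lower().split()
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    counts = Counter()
    for n in (2, 3):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return [phrase for phrase, _ in counts.most_common(k)]


text = "the travel policy and the travel policy form"
with_stop_words = top_phrases(text, k=3)
without_stop_words = top_phrases(text, k=1, drop_stop_words=True)
```

A real tokenizer would also strip punctuation; that step is left out here to keep the sketch short.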
Ranking Anchor Texts
• The results are ranked based on
  • WCOUNT score
  • Number of terms in the anchor summary
  • Number of characters in the anchor summary

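One way to combine the three criteria is a lexicographic sort. The slides do not say how the criteria are combined, so the tie-breaking order below (higher WCOUNT first, then fewer terms, then fewer characters) is an assumption, and `rank_summaries` is a hypothetical name.

```python
def rank_summaries(scored):
    """Order anchor summaries best first.

    scored: dict mapping an anchor summary to its WCOUNT score.
    Assumed ordering: higher score, then fewer terms, then fewer characters.
    """
    return sorted(
        scored,
        key=lambda s: (-scored[s], len(s.split()), len(s)),
    )


# Made-up scores for illustration.
ranked = rank_summaries({
    "ibm research almaden": 3.0,
    "research": 3.0,
    "almaden lab": 3.0,
    "ibm": 5.0,
})
```

With this ordering, shorter summaries win only among candidates that have the same score, which matches the earlier preference for concise refinements without letting length override the weighted count.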
Comparison of Methods
• A second comparison tested 22 different queries
• QUERYLOG processes and dynamically updates user queries based on previous ones, in a manner similar to ANCHOR

Conclusions
• Using anchor text leads to better results than performing similar methods on document collections
• A similar approach can be used to refine user search queries as well

Future Directions
• Broadening search queries
• Lexical analysis, rather than purely textual matching
• Pre- and post-anchor text