Mining Anchor Text for Query Refinement Reiner Kraft and Jason Zien IBM Almaden Research Center

Mining Anchor Text for Query Refinement Reiner Kraft and Jason Zien IBM Almaden Research Center Mark Strohmaier

Problem Motivation • 23% of search queries are single-term • Expanding the query can lead to more accurate searches • Previous studies indicate that anchor text is statistically similar to search queries • Can this similarity be exploited to improve search queries?

What is anchor text? • <a href="this is the website"> This is the anchor text </a> • Destination pages can have multiple links pointing to them • Collections of anchor text can give a view of the destination page • Naïve approach: • Find links whose anchor text is similar to the query • Return the links' destination pages to the user
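The naïve approach above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the names `AnchorCollector`, `anchors_by_destination`, and `naive_lookup` are hypothetical, and "similar to the query" is assumed here to mean that every query term appears in the anchor text.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from one HTML document."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the anchor text.
            text = " ".join("".join(self._text).split())
            if text:
                self.pairs.append((self._href, text))
            self._href = None


def anchors_by_destination(html_docs):
    """Group anchor texts by the destination page they point to."""
    by_dest = defaultdict(list)
    for html in html_docs:
        parser = AnchorCollector()
        parser.feed(html)
        parser.close()
        for href, text in parser.pairs:
            by_dest[href].append(text)
    return dict(by_dest)


def naive_lookup(query, by_dest):
    """Return destinations with an anchor text containing every query term
    (an assumed, simplistic notion of similarity)."""
    terms = set(query.lower().split())
    return sorted(
        dest
        for dest, texts in by_dest.items()
        if any(terms <= set(t.lower().split()) for t in texts)
    )
```

For example, feeding two pages that both link to the same destination yields one collection of anchor texts for that destination, which `naive_lookup("java", ...)` can then match against.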

Problems with naïve approach • High term frequency is not directly related to page quality • Repeated terms may lead to unnatural queries • IDF is not necessarily relevant • Anchor text may appear multiple times

Methods of Query Refinement • Weighting the number of occurrences • Weight based on the type of anchor text • Number of terms in the anchor text • Fewer terms are better • Number of characters in the anchor text • More concise queries are better
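A weighted occurrence count of this kind might look like the sketch below. The slides do not say which anchor types exist or what weights they receive, so the type labels and weight values in the usage example are purely hypothetical, as is the function name `weighted_counts`.

```python
from collections import defaultdict


def weighted_counts(occurrences, type_weights, default_weight=1.0):
    """Sum a per-type weight for each occurrence of an anchor text.

    occurrences:  iterable of (anchor_text, anchor_type) pairs
    type_weights: mapping from anchor type to weight (values are assumptions)
    """
    scores = defaultdict(float)
    for text, anchor_type in occurrences:
        # Normalize case and whitespace so repeated texts share one entry.
        key = " ".join(text.lower().split())
        scores[key] += type_weights.get(anchor_type, default_weight)
    return dict(scores)
```

With illustrative weights such as `{"cross_site": 1.0, "same_site": 0.5}`, two occurrences of the same anchor text, one of each type, would score 1.5 rather than a raw count of 2.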

Benefits of the Anchor Text • There is much less anchor text than document text • Pages can have many incoming links • Refined anchor text can capture a degree of site popularity

Mining Anchor Text • Initial web crawl covered 33 million links on the IBM intranet • Additionally, roughly 350,000 queries were analyzed • Both categories showed a similar relationship between length and number of occurrences

Pre-processing Summaries • Query refinement is sensitive to the number of terms • Too few may not lead to much improvement • Too many may lead to overspecialization • Best results were for MAXCOUNT = 3

Studies Performed • Three different approaches were compared • Anchor • Ranked Anchor Text refinement • Doc.SW • Ranks pages based on the most frequently occurring two- and three-term phrases • DOC • Similar to Doc.SW, but not counting stop words
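The document-based baselines rely on counting frequent two- and three-term phrases. A minimal sketch of that counting step follows; the function name `top_phrases`, the tokenization rule, and the choice to drop stop words before forming phrases are all assumptions, not details taken from the study.

```python
import re
from collections import Counter


def top_phrases(text, sizes=(2, 3), stop_words=frozenset(), k=3):
    """Return the k most frequent phrases of the given sizes.

    Passing a non-empty stop_words set approximates a variant that does
    not count stop words; an empty set keeps every token.
    """
    tokens = [
        t for t in re.findall(r"[a-z0-9]+", text.lower())
        if t not in stop_words
    ]
    counts = Counter()
    for n in sizes:
        # Slide a window of n tokens across the document.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts.most_common(k)
```

Calling it once with an empty stop-word set and once with a stop-word list gives the two flavors of document-based phrase ranking that the anchor-based method is compared against.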

Ranking Anchor Texts • The results are ranked based on • WCOUNT score • Number of terms in the anchor summary • Number of characters in the anchor summary
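One way to read this ranking is as a multi-key sort: highest WCOUNT score first, then fewer terms, then fewer characters. The sketch below assumes that lexicographic ordering and takes the WCOUNT score as a precomputed input, since the slides do not define how it is calculated; `AnchorSummary` and `rank_summaries` are hypothetical names.

```python
from typing import NamedTuple


class AnchorSummary(NamedTuple):
    text: str
    wcount: float  # precomputed weighted count (its definition is assumed)


def rank_summaries(summaries):
    """Sort by WCOUNT descending, then term count and character count ascending."""
    return sorted(
        summaries,
        key=lambda s: (-s.wcount, len(s.text.split()), len(s.text)),
    )
```

Under this ordering, two summaries with equal scores are separated by brevity, which matches the stated preference for shorter, more concise refinements.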

Comparison of Methods • Second comparison tested 22 different queries • QUERYLOG processes and dynamically updates user queries based on previous ones, in a manner similar to ANCHOR

Conclusions • Using anchor text leads to better results than performing similar methods on document collections • A similar approach can be used to refine user search queries as well

Future Directions • Broadening search queries • Lexical analysis, rather than straight textual matching • Pre- and post-anchor text
