210 likes | 362 Views
Learning to Cluster Web Search Results . SIGIR 04. ABSTRACT. Organizing Web search results into clusters facilitates users quick browsing through search results. Traditional clustering techniques They don ’ t generate clusters with highly readable names.
E N D
ABSTRACT • Organizing Web search results into clusters facilitates users quick browsing through search results. • Traditional clustering techniques • They don’t generate clusters with highly readable names. • Need pre-defined categories as in classification method. • Based on a regression model learned from human labeled training data, convert an unsupervised clustering problem to a supervised learning problem.
INTRODUCTION • User submits query “jaguar” into Google • Results related to “big cat”, user should go to the 10th,11th,32nd and 71st results. • A possible solution to this problem is to online cluster search result into different groups. • Ranking salient phrases as cluster names. • Re-formalize the clustering problem as a salient phrases ranking problem.
INTRODUCTION Titles and snippets Salient phrases *Real demonstration of this technique http://vivisimo.com/
INTRODUCTION • Leouski A. V. and Croft W. B. An Evaluation of Techniques for Clustering Search Results. Technical Report IR-76, Department of Computer Science, 1996. • Zamir O., Etzioni O. Web Document Clustering (SIGIR'98), 1998. • Zamir O., Etzioni O. Grouper: A Dynamic Clustering Interface to Web Search Results. (WWW8),1999. • Leuski A. and Allan J. Improving Interactive Retrieval by Combining Ranked List and Clustering. Proceedings of RIAO, 2000. • Liu B., Chin C. W., and Ng, H. T. Mining Topic-Specific Concepts and Definitions on the Web. (WWW'03), 2003
Problem Formalization And Algorithm • Problem Formalization: • Ranked list of search result : • q : current query , di : document • r : some (unknown) function calculate the probability • To find a set of topic-coherent clusters on query q (Traditional): • To find a ranked list of clusters C’,with each cluster associated with a cluster name as well as a new ranked list of documents: • Algorithm:four steps • Search result fetching, • Document parsing and phrase property calculation • Salient phrase ranking,and • Post-processing
Salient Phrases Extraction 1/3 • Five properties: • 1.Phrase frequency / Inverted document frequency (TFIDF) • w: current phrase , D(w) : the set of documents that contains w. • 2.Intra-Cluster Similarity (ICS) • Documents into vector space model: di=(xi1,xi2,…). • Each component of the vectors is weighted by TFIDF • For each candidate cluster calculates its centroid as: • ICS is calculate as:
Salient Phrases Extraction 2/3 • 3.Phrase Length (Len) • Example: Len(big)=1 , Len(big cats)=2. • 4.Cluster Entropy(CE) • For given phrase w, the corresponding document set D(w) might overlaps with other D(wi) where wi != w. • One extreme : • Too general phrase to be a good salient phase. • Other extreme : • D(w) seldom overlap with D(wi) , w may have some distance meaning. • Examples: Take query “jaguar” as an example , “big cat” seldom co-occur with other salient keywords such as “car”, “mac os”,etc.
Salient Phrases Extraction 3/3 • 5.Phrase Independence **(IND) • A phrase is independent when the entropy of its context is high. ** Chien L. F. PAT-Tree-Based Adaptive Keyphrase Extraction for Intelligent Chinese Information Retrieval. (SIGIR'97),1997.
Learning to Rank Salient Phrases 1/3 • Regression is a classic statistical problem which tries to determine the relationship between two random variablesx=(x1,x2,…,xp) and y. • X=(TFIDF,LEN,ICS,CE,IND) • Y can be any real-valued score. • Linear Regression: • Residual e is a random variable. • The coefficients are determined by the condition that the sum of the square residuals is as small as possible.
Learning to Rank Salient Phrases 2/3 • Logistic Regression: • When the dependent variable Y is dichotomy, logistic regression is more suitable. • Because we want to predict is not a precise numerical value of a dependent variable, but rather the probability. • Whereas q can only range from 1 to 0 • Logit(q) ranges from negative infinity to positive infinity.
Learning to Rank Salient Phrases 3/3 • Support vector Regression : • Input x is first mapped on a high dimensional feature space using some nonlinear mapping. • e-insensitive loss function: • SV regression tries to minimize ||w||2 ***Joachims T., Making large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning. Schölkopf B. and Burges C. and Smola A. (ed.), MIT-Press, 1999.
Experiments • Default result numbers from search engines are set to 200. • Evaluation Measure: • Traditional clustering algorithm is difficult to be evaluated. • In this approach, evaluation is relatively easy because the problem is defined to be a ranking problem. • Using classical evaluation method in Information Retrieval. P@N : precision at top N result R : set of top N salient keywords. C : set of manually tagged correct salient keywords.
Experiments • Training Data Collection: • 3 human evaluators to label ground truth data for 30 queries. • Selected from one day’s query log from MSN.
Experiments - Training Data Collection: • For each query extract all the n-gram(n<=3) from the search results as candidate phrases. • 3 evaluators selected the candidates: • 10 “good phrases” ( assign score 100) • 10 “medium phrases” (assign score 50) • Other phrases are zero score. • Finally,three score add together and assign 1 to the y values of phrases with score greater than 100, and assign 0 to the y values of others.
Experimental Results • Property Comparison:
Experimental Results • Learning methods comparison: • Three-fold cross validation to evaluate 3 regression method
CONCLUSION AND FUTURE WORKS • Several properties, as well as several regression models, are proposed to calculate salience score for salient phrase. • Clusters with short names hopefully is more readable,could improve user’s browsing efficiency through search result. • In the future works: • To extract syntactic features for keywords and phrases to assist the salient phrase ranking. • Hierarchical structure of search results is necessary for more efficient browsing. • Some external taxonomies such as Web directories contains much knowledge, thus a combination of classification and clustering might be helpful in this application.