Click Chain Model in Web Search
Fan Guo, Carnegie Mellon University
WWW'09, Madrid, Spain
Joint Work With…
• Chao Liu (MSR, ISRC-Redmond)
• Anitha Kannan (MSR, Search Lab)
• Tom Minka (MSR, Cambridge)
• Mike Taylor (MSR, Cambridge)
• Yi-Min Wang (MSR, ISRC-Redmond)
• Christos Faloutsos (Carnegie Mellon University)
Click Logs
• Automatically generated data that record important information about search activity.
Problem Definition
• Given a click log data set, compute the user-perceived relevance of each query-document pair.
[Figure: impression data and click data]
Relevance Representation
[Figure: relevance on a scale from 0 to 1 (example value 0.75); labels: Human Judge, Previous Click Models, Click Chain Model, Integration]
Applications
• Automated Ranking Alterations
• Search Engine Performance Metric
• Calibrate Human Judgment
• Related Application in Sponsored Search
Roadmap
• Motivation and Problem Definition
• Click Model Basics
• CCM and Algorithms
• Experimental Evaluation
• Related Work and Conclusion
Eye-Tracking User Study
[Figure: fixation heat map]
• Overall: fixation is biased towards higher ranks, and so are the clicks.
• For each position: fixation and clicks are context dependent.
[Figure: normal impression vs. reversed impression]
Problem Definition (Recap)
• Given a click log data set, compute the user-perceived relevance of each query-document pair. The solution should be:
  • Aware of the position bias and context dependency
  • Scalable to terabyte data
  • Incremental, to stay updated
Examination Hypothesis
• User behavior abstraction:
  • Fixation → binary examination variable
  • Click → binary click variable
• A document must be examined before being clicked.
• For each position:
  • P(Click=1) = P(Examination=1) * Relevance
  • Relevance = P(Click=1 | Examination=1)
• The position bias is reflected in the derivation of P(Examination).
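The factorization in the examination hypothesis can be made concrete with a small sketch. The examination probabilities and the relevance value are hypothetical numbers chosen for illustration, not figures from the talk, and `click_probability` and `estimate_relevance` are illustrative names.

```python
def click_probability(p_examination: float, relevance: float) -> float:
    """Examination hypothesis: P(C=1) = P(E=1) * P(C=1 | E=1)."""
    return p_examination * relevance


def estimate_relevance(clicks: int, examinations: int) -> float:
    """Relevance as the click rate among examined impressions only."""
    return clicks / examinations


# Hypothetical examination probabilities that decay with rank (position bias).
examination_by_rank = [1.0, 0.6, 0.4]
relevance = 0.5  # assumed identical for the document at every rank

click_probs = [click_probability(e, relevance) for e in examination_by_rank]
print(click_probs)
```

The same document draws fewer clicks at lower ranks even though its relevance is unchanged, which is why relevance is normalized by examinations rather than by raw impressions.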
Cascade Hypothesis
• The user scans through documents and makes decisions in strict linear order.
• The decision process: E1, C1, E2, C2, …
• Essential part of a click model: what is the probability of “See Next Doc”?
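A minimal sketch of how a "See Next Doc" probability turns into per-position examination probabilities under the cascade and examination hypotheses. The two continuation parameters, `p_next_after_skip` and `p_next_after_click`, are hypothetical placeholders; each click model defines them in its own way.

```python
from typing import List


def examination_probabilities(
    relevance: List[float],
    p_next_after_skip: float,
    p_next_after_click: float,
) -> List[float]:
    """Return P(E_i = 1) for each position, scanning strictly top to bottom."""
    probs = []
    p_exam = 1.0  # the first result is assumed to be examined
    for r in relevance:
        probs.append(p_exam)
        # Given examination: click with probability r, then decide to continue.
        p_continue = (1.0 - r) * p_next_after_skip + r * p_next_after_click
        p_exam *= p_continue
    return probs


# Always continue after a skip, always stop after a click.
print(examination_probabilities([0.5, 0.5, 0.5], 1.0, 0.0))
```

With `p_next_after_skip=1.0` and `p_next_after_click=0.0` the user stops at the first click, which is the behavior of the original cascade model; other settings allow sessions with multiple clicks.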
The Context
• Top-10 organic search results only.
• Query sessions are independent.
• Semantic information is not used.
[Figure labels: Suggestions, Ads, Other Elements]
User Behavior Description
[Flowchart: Examine the Document → Click? Both the No branch and the Yes branch lead to a "See Next Doc?" decision; Yes moves on to examine the next document, No leads to Done.]
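The browsing loop in this flowchart can be simulated in a few lines. This is a sketch under stated assumptions: the parameter names `alpha1`, `alpha2`, `alpha3` and the rule that the continuation probability after a click interpolates between `alpha2` and `alpha3` according to relevance follow the published CCM formulation as I understand it; the slide itself does not spell them out.

```python
import random
from typing import List


def simulate_session(
    relevance: List[float],
    alpha1: float,  # P(see next doc | examined, no click)
    alpha2: float,  # P(see next doc | click, document not relevant)
    alpha3: float,  # P(see next doc | click, document relevant)
    rng: random.Random,
) -> List[int]:
    """Simulate one query session and return the binary click vector."""
    clicks = [0] * len(relevance)
    for i, r in enumerate(relevance):
        # Position i is examined at this point.
        if rng.random() < r:
            clicks[i] = 1
            p_next = alpha2 * (1.0 - r) + alpha3 * r
        else:
            p_next = alpha1
        if rng.random() >= p_next:
            break  # "Done": the user abandons the result list
    return clicks


rng = random.Random(0)
print(simulate_session([0.7, 0.4, 0.2, 0.1], 0.8, 0.6, 0.3, rng))
```

Running many simulated sessions and averaging the click vectors gives a quick way to see how position bias emerges from the continuation probabilities alone.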
Click Chain Model
[Figure: graphical model with relevance variables R1…R5, examination variables E1…E5, and click variables C1…C5]
Why Bayesian?
• Modeling benefit:
  • A principled way of smoothing the relevance estimates;
  • Offers more flexibility, such as computing P(Ri>Rj).
• Computational benefit:
  • Avoids the iterative optimization procedure of maximum-likelihood estimation.
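Once each relevance has a posterior distribution rather than a point estimate, a quantity like P(Ri>Rj) can be approximated by sampling. The sketch below uses Beta distributions purely as stand-in posteriors; that is an assumption for illustration, not a claim about the form of the CCM posterior, and `prob_greater` is a hypothetical helper name.

```python
import random
from typing import Tuple


def prob_greater(
    post_i: Tuple[float, float],
    post_j: Tuple[float, float],
    n_samples: int = 20000,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of P(R_i > R_j) for two Beta(a, b) posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_samples):
        r_i = rng.betavariate(*post_i)
        r_j = rng.betavariate(*post_j)
        if r_i > r_j:
            wins += 1
    return wins / n_samples


# Stand-in posteriors: document i looks more relevant than document j.
print(prob_greater((9.0, 3.0), (3.0, 9.0)))
```

A point estimate would only say which document has the higher mean; the posterior comparison also says how confident that ordering is.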
Relevance Inference
• Given a query and all its click data, compute the posterior for each possible j.
• Let us then focus on the click probability for a particular session and look at the different cases.
Click Chain Model
[Figure: graphical model with R1…R5, E1…E5, and C1…C5; the links between examination variables are labeled Cascade Hypothesis, and the links into the click variables are labeled Examination Hypothesis]
[Figure: the Click Chain Model graph (R1…R5, E1…E5, C1…C5) annotated with the values 0, 1, 0, 1, …]
Putting them together
Summary of the Algorithm
• Initialize (2*10+2) counts for each pair;
• Go through the click log once and update the counts;
• Compute parameter values and get β values;
• Output results (using numerical integration if necessary).
Sanity Check
• The algorithm should be:
  • Aware of the position bias and context dependency
  • Scalable to terabyte data: single pass, linear
  • Incremental, to stay updated: update counts
Data Set
• Collected in 2 weeks in July 2008.
• Preprocessing:
  • Discard no-click sessions for fair comparison.
  • 178 most frequent queries removed.
  • Split into training/test sets according to time stamps.
• After preprocessing:
  • 110,630 distinct queries;
  • 4.8M/4.0M query sessions in the training/test set.
Metric
• Efficiency:
  • Computational time
• Effectiveness: click prediction (resorting to an indirect measure)
  • With known document identities in the test set,
  • using the relevance and parameters learned on the training set.
Competitors
• UBM: User Browsing Model (Dupret et al., SIGIR’08)
  • More parameters
  • Iterative, more expensive algorithm
• DCM: Dependent Click Model (WSDM’09)
  • Modeling 1+ clicks per session
Results - Time
• Environment: Unix Server, 2.8GHz cores, MATLAB R2008b.
Results – Perplexity
• Perplexity: quality of click prediction for each position individually.
• Reference values:
  • Random Guess (pH=0.5): 2.00
  • Best Guess (pH=0.8): 1.65
  • Ground Truth (Cheating): 1.00
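The three reference values can be reproduced with a short sketch, assuming the usual definition of perplexity as 2 raised to the negative average log2-likelihood of the binary outcomes. The ten-outcome sample with an 80% positive rate is a made-up illustration of one way to read the pH=0.8 case.

```python
import math
from typing import Sequence


def perplexity(outcomes: Sequence[int], predicted: Sequence[float]) -> float:
    """2 ** (-(1/N) * sum of log2-probabilities assigned to the observed outcomes)."""
    total = 0.0
    for c, q in zip(outcomes, predicted):
        # Probability the prediction assigned to what actually happened.
        total += math.log2(q if c == 1 else 1.0 - q)
    return 2.0 ** (-total / len(outcomes))


# Hypothetical sample: 8 positives out of 10 (empirical rate 0.8).
outcomes = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

random_guess = perplexity(outcomes, [0.5] * 10)
best_guess = perplexity(outcomes, [0.8] * 10)
ground_truth = perplexity(outcomes, [float(c) for c in outcomes])
print(round(random_guess, 2), round(best_guess, 2), round(ground_truth, 2))
```

Lower is better: a perfect predictor reaches 1.00, while always predicting 0.5 gives 2.00 regardless of the data.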
Results – Perplexity
[Figure: perplexity plot; direction annotations: Worse, Better]
Results – Perplexity
• Average perplexity over the top 10 positions.
Results – Log Likelihood
• Log-likelihood: log of the probability of recovering the entire click vector out of 2^10 (1,024) possibilities.
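As a sketch of what this metric measures, the snippet below scores a whole click vector under a simple chain-style model with fixed (point-estimate) relevance values, then checks that the probabilities of all 2^10 click vectors sum to one. The continuation parameters and relevance values are hypothetical, and this is a simplification for illustration, not the Bayesian computation used in the talk.

```python
import math
from itertools import product
from typing import Sequence


def click_vector_likelihood(
    clicks: Sequence[int],
    relevance: Sequence[float],
    p_next_after_skip: float,
    p_next_after_click: float,
) -> float:
    """P(entire click vector), tracking whether the user is still examining."""
    examining = 1.0  # P(observed prefix, user examines the current position)
    stopped = 0.0    # P(observed prefix, user has already stopped)
    for c, r in zip(clicks, relevance):
        if c == 1:
            # A click requires examination; a stopped user cannot click.
            clicked = examining * r
            examining = clicked * p_next_after_click
            stopped = clicked * (1.0 - p_next_after_click)
        else:
            skipped = examining * (1.0 - r)
            examining = skipped * p_next_after_skip
            stopped = stopped + skipped * (1.0 - p_next_after_skip)
    return examining + stopped


relevance = [0.5] * 10
observed = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
log_likelihood = math.log(click_vector_likelihood(observed, relevance, 0.8, 0.6))

# All 2**10 possible click vectors form a proper probability distribution.
total = sum(
    click_vector_likelihood(v, relevance, 0.8, 0.6)
    for v in product([0, 1], repeat=10)
)
print(log_likelihood, total)
```

Unlike per-position perplexity, this score rewards a model only when it assigns high probability to the full pattern of clicks and skips in a session.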
Results – Log Likelihood
[Figure: log-likelihood plot; direction annotations: Better, Worse]
Related Work
• User behavior study and hypothesis
  • Eye-tracking Study (Joachims et al., KDD’05, ACM TOIS)
  • Examination Hypothesis (Richardson et al., WWW’07)
  • Cascade Hypothesis (Craswell et al., WSDM’08)
• Other click models
  • Logistic Regression (Dupret et al., SIGIR’08)
  • Dynamic Bayesian Network (Chapelle et al., WWW’09)
  • Bayesian Browsing Model (KDD’09, to appear)
Conclusion
• Click Chain Model
  • A probabilistic approach to interpret clicks.
  • A Bayesian approach to model relevance.
  • Both scalable and incremental.
• Future Directions
  • Validation/Bucket Test.
  • Pairwise comparison.
  • More on context dependency.
Thank you :-)
Abstract/Document Relevance
• Relevance of Abstract:
  • Conditional probability of click as defined by the examination hypothesis
• Relevance of Document:
  • Determines the probability of “See Next Doc”
  • A binary random variable (integrated out under CCM)
Alt. User Behavior Description
[Flowchart: Examine the Document → Click? → Relevant?; each outcome is followed by its own "See Next Doc?" decision]
Results – Perplexity (by Freq)
[Figure: perplexity by query frequency; direction annotations: Worse, Better]
Examination/Click Distribution
[Figure: examination and click distribution]