Improving Web Page Classification by Label-propagation over Click Graphs
Soo-Min Kim, Patrick Pantel, Lei Duan and Scott Gaffney
Yahoo! Labs, CIKM 2009
Presenter: Lung-Hao Lee (李龍豪)
January 7, 2010 @ Room 309
Outline • Introduction • Calculating Page Similarity • Finding Similar Pages • Click Data Model (CDM) • Query Constraints (QC) algorithm • Experimental Results • Discussion • Conclusion
Introduction • Large labor cost of annotating training data • Aggregated click data across many users over time provides valuable information • Leverage click logs to augment the training data by propagating class labels to unlabeled similar documents
Hypothesis • "Two pages that tend to be clicked for the same user queries tend to be topically similar" • Example: pages A and B are both clicked for the queries "How to tie a neck tie knots", "How to tie a tie" and "Tying a tie"; A is labeled "Positive" (class "How-to") and B's label is unknown, so B is also labeled "Positive"
Calculating Page Similarity (1/3) • A page is represented as a node in the similarity graph • Normalize all the URLs; e.g., the following 4 URLs are treated as the same • “http://www.acm.org” • “www.acm.org” • “www.acm.org/” • “http://www.acm.org/”
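The normalization step on this slide can be sketched as follows; the function name `normalize_url` and the exact set of rules (strip the scheme and any trailing slash) are our own assumptions about what makes these four variants collapse to one key:

```python
# Hypothetical sketch of the URL normalization step: drop the scheme
# prefix and any trailing slash so equivalent URLs map to one node key.
def normalize_url(url: str) -> str:
    for scheme in ("http://", "https://"):
        if url.startswith(scheme):
            url = url[len(scheme):]
    return url.rstrip("/")

urls = ["http://www.acm.org", "www.acm.org",
        "www.acm.org/", "http://www.acm.org/"]
# All four variants collapse to the single key "www.acm.org"
assert len({normalize_url(u) for u in urls}) == 1
```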
Calculating Page Similarity (2/3) • Each URL is represented as a vector of the queries that users issued and clicked through to the page (Pantel & Lin, 2002)
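Building those query vectors from a click log might look like the sketch below; the log layout (query, clicked URL, click count) and the page names are illustrative assumptions, not the paper's:

```python
from collections import defaultdict

# Hedged sketch: aggregate click-log rows (query, clicked_url, clicks)
# into a sparse query-feature vector per (normalized) URL.
def build_query_vectors(click_log):
    vectors = defaultdict(lambda: defaultdict(int))
    for query, url, clicks in click_log:
        vectors[url][query] += clicks
    return {url: dict(queries) for url, queries in vectors.items()}

# Hypothetical log entries for two pages sharing a tie-related query.
log = [
    ("how to tie a tie", "pageA", 12),
    ("tying a tie", "pageA", 5),
    ("how to tie a tie", "pageB", 7),
]
vecs = build_query_vectors(log)
# pageA's vector has two query features, pageB's has one
```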
Calculating Page Similarity (3/3) • Compute the similarity between two pages as the cosine similarity of their respective feature vectors • sim(p1,p2) > sim(p1,p3) • sim(p1,p2) > sim(p2,p3), because p1 and p2 share more queries with each other than either shares with p3
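Cosine similarity over the sparse query vectors is standard; a minimal sketch (the example vectors for p1, p2, p3 are made up to mirror the slide's inequality):

```python
import math

# Cosine similarity of two sparse query vectors (dict: query -> weight).
# Pages that share more clicked queries score higher.
def cosine(v1, v2):
    dot = sum(w * v2.get(q, 0.0) for q, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Hypothetical vectors: p1 and p2 share queries, p3 shares none.
p1 = {"how to tie a tie": 3.0, "tying a tie": 1.0}
p2 = {"how to tie a tie": 2.0, "tying a tie": 2.0}
p3 = {"neck tie styles": 4.0}
assert cosine(p1, p2) > cosine(p1, p3)
```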
Finding Similar Pages Given Seed Sets • What is a “seed set”? A small set of labeled examples • Two algorithms for seed set expansion • Click Data Model (CDM) • Query Constraints (QC) algorithm
Click Data Model • Two phases • Updating score phase • Filtering phase • Input • S1 (positive set) • S2 (negative set) • G (click graph) • Output • E1 (positive) • E2 (negative) • Thresholds • 0.1<T1<0.6 • 0.6<T2<1.2
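The slide lists CDM's inputs, outputs, and threshold ranges but not the exact scoring rule, so the sketch below rests on our own assumption: each unlabeled click-graph neighbor of a seed accumulates a similarity-weighted score (updating phase) and is kept only if it clears a threshold (filtering phase). All names (`expand`, `sim`, the graph layout) are illustrative:

```python
# Rough two-phase sketch of a CDM-style expansion; the scoring rule is
# an assumption, not the paper's exact algorithm.
def expand(seeds, graph, sim, threshold):
    scores = {}
    # Updating score phase: propagate from seeds to click-graph neighbors.
    for page in seeds:
        for neighbor in graph.get(page, []):
            if neighbor not in seeds:
                scores[neighbor] = scores.get(neighbor, 0.0) + sim(page, neighbor)
    # Filtering phase: keep only candidates scoring above the threshold.
    return {p for p, s in scores.items() if s > threshold}
```

Run once with the positive seeds S1 to obtain E1 and once with the negative seeds S2 to obtain E2; the slide's ranges (0.1 < T1 < 0.6, 0.6 < T2 < 1.2) suggest the thresholds were tuned separately for the two runs.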
Query Constraints • An additional module that checks whether the common queries between two nodes contain certain term patterns
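A minimal sketch of that check, assuming (our reading of the later slides) a class-specific pattern such as "how to" for the How-to class; the function name and signature are hypothetical:

```python
import re

# QC sketch: before propagating a label across an edge, require that at
# least one query common to both pages matches a class-specific pattern.
def passes_qc(queries1, queries2, pattern):
    common = set(queries1) & set(queries2)
    return any(re.search(pattern, q) for q in common)

q_a = {"how to tie a tie", "tying a tie"}
q_b = {"how to tie a tie"}
assert passes_qc(q_a, q_b, r"how[- ]?to")
```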
Active Learning • Reduce the amount of human annotation effort by leveraging the click data • Build an expansion model with labeled training data and use it to select the next round of training data
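The loop described here can be sketched as follows; every name is our own, with `expand` standing in for a CDM-style expansion model and `annotate` for human labeling of the expanded candidates:

```python
# Hedged sketch of the active-learning loop: expand the current seeds,
# have annotators label the candidates, and fold them back in as seeds.
def active_learning_rounds(seed, expand, annotate, rounds=2):
    for _ in range(rounds):
        candidates = expand(seed)
        labeled = annotate(candidates)  # human-labeled examples
        seed = seed | labeled
    return seed
```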
Experimental Setup (1/3) • Click Data • Collected during December 2008 from the Yahoo! Search engine • Only the top 10 URLs are considered • URLs with fewer than 10 clicks are excluded • Three classification tasks • How-to • Adult • Review
Experimental Setup (2/3) • Training sets • 10,000 manually labeled positive and negative examples • For the “Review” classifier, queries such as “digital camera reviews” or “baby swing reviews” • For the “How-to” classifier, queries such as “how to clean uggs” or “best way to loose weight” • Testing sets
Experimental Setup (3/3) • Classifier • Gradient Boosting Decision Tree (GBDT) • Features • Textual, Link, URL, HTML, Other features • Metrics • Area Under the ROC Curve (AUC) (Fawcett, 2003) • F score • Accuracy
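Two of the metrics on this slide, F score and accuracy, have standard definitions for a binary classifier and can be computed directly (AUC is more involved and omitted here); the function name and example labels below are illustrative:

```python
# Standard F score (F1) and accuracy for binary labels (1 = positive).
def f_score_and_accuracy(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    acc = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return f1, acc

f1, acc = f_score_and_accuracy([1, 1, 0, 0], [1, 0, 0, 1])
# tp=1, fp=1, fn=1 -> precision = recall = 0.5, so f1 = 0.5; acc = 0.5
```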
Label Propagation Analysis: Adult • The biggest improvement for CDM is observed with a model using 5,000 labeled examples as the seed set (+1.07% in F-score, +0.81% in Accuracy, and +0.25% in AUC)
Label Propagation Analysis: Review • Reduces the manual labeling effort by 50% • QC (excluding pages that do not have “review” in the query terms) is useful when the labeled data set is small
Label Propagation Analysis: How-to • With 1,000 and 2,000 human-labeled examples, CDM performs worse than the baseline • QC (excluding pages that do not have “How-to” in the query terms)
A comparison between baseline and CDM • Baseline: Type A • CDM: Type C
Evaluating the Active Learning Approach • From the “How-to” classifier • Seed 1 → Seed 2 (human labels from Expand 1) → Expand 2
Intrinsic Analysis of CDM Expanded Data • A random sample of 50 positive and 50 negative examples from the “How-to” classifier • The positive class has 82.3% precision, whereas the negative class has 83.6% precision
Discussion • Is the proposed method always useful for web page classification? • How can we improve the quality of automatically labeled data derived from unlabeled data?
Conclusion • Present a method for improving web page classification by leveraging click data to augment the training data • Augment manually labeled data by modeling the similarity between pages in a click graph
The End • Thank you very much • Questions & Answers