Web Taxonomy Integration through Co-Bootstrapping
Dell Zhang, National University of Singapore
Wee Sun Lee, National University of Singapore
SIGIR'04
Problem Statement
• Desired result: all sites placed into the master taxonomy's categories
  • Games > Roleplaying: Final Fantasy Fan, Dragon Quest Home, EverQuest Addict, Warcraft III Clan
  • Games > Strategy: Shogun: Total War, Warcraft III Clan
• Given: the master taxonomy's own sites
  • Games > Roleplaying: Final Fantasy Fan, Dragon Quest Home
  • Games > Strategy: Shogun: Total War
• …and the source taxonomy's sites
  • Games > Online: EverQuest Addict, Warcraft III Clan
  • Games > Single-Player: Warcraft III Clan
Possible Approach
• Train a classifier on the master taxonomy's own sites:
  • Games > Roleplaying: Final Fantasy Fan, Dragon Quest Home
  • Games > Strategy: Shogun: Total War
• Classify the source sites: EverQuest Addict, Warcraft III Clan
• Problem: this ignores the original Yahoo! categories
Another Approach (1/2)
• Use the Yahoo! categories as additional evidence
• Advantage: the two directories contain similar categories
• Potential problems: the structures differ, and categories do not match exactly
Another Approach (2/2)
• Example: Crayon Shin-chan appears under differently structured paths
  • Entertainment > Comics and Animation > Animation > Anime > Titles > Crayon Shin-chan
  • Arts > Animation > Anime > Titles > C > Crayon Shin-chan
This Paper’s Approach • Weak Learner (as opposed to Naïve Bayes) • Boosting to combine Weak Hypotheses • New Idea: Co-Bootstrapping to exploit source categories
Assumptions
• Multi-category data are reduced to binary (single-category) data, as sketched below
  • "Totoro Fan" in both Cartoon > My Neighbor Totoro and Toys > My Neighbor Totoro
  • is converted into two examples:
    • Totoro Fan → Cartoon > My Neighbor Totoro
    • Totoro Fan → Toys > My Neighbor Totoro
• Hierarchies are ignored
  • Console > Sega and Console > Sega > Dreamcast are treated as unrelated
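A minimal Python sketch of this reduction; the function name and data layout are illustrative, not from the paper.

def to_binary_examples(multi_labeled_docs):
    """Flatten multi-category examples into one example per category."""
    binary = []
    for doc, categories in multi_labeled_docs:
        for category in categories:
            binary.append((doc, category))
    return binary

docs = [("Totoro Fan", ["Cartoon > My Neighbor Totoro",
                        "Toys > My Neighbor Totoro"])]
print(to_binary_examples(docs))
# [('Totoro Fan', 'Cartoon > My Neighbor Totoro'),
#  ('Totoro Fan', 'Toys > My Neighbor Totoro')]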
Outline
• Weak Learner ← (up next)
• Boosting
• Co-Bootstrapping
Weak Learner
• A type of classifier, similar in role to Naïve Bayes
• After training, the weak learner outputs a weak hypothesis: a term-based classifier
  • + = accept
  • − = reject
• The term may be a word, an n-gram, …
Weak Hypothesis Example
• If a site contains "Crayon Shin-chan":
  • it is in "Comics > Crayon Shin-chan"
  • it is not in "Education > Early Childhood"
• If a site does not contain "Crayon Shin-chan":
  • it is not in "Comics > Crayon Shin-chan"
  • it is in "Education > Early Childhood"
Weak Learner Inputs (1/2)
• Training data are in the form (x_1, y_1), (x_2, y_2), …, (x_m, y_m)
  • x_i is a document
  • y_i is a category
  • (x_i, y_i) means document x_i is in category y_i
• D(x, y) is a distribution over all combinations of x_i and y_j
  • D(x_i, y_j) indicates the "importance" of (x_i, y_j)
• w is the term (found automatically)
Weak Learner Algorithm
For each candidate term w and each possible category y, compute four values (following the confidence-rated boosting scheme of Schapire & Singer, on which the paper builds):
• W(j, b, y) = Σ_i D(x_i, y) over the training documents x_i with j = [x_i contains w] and b = [x_i belongs to y], for j ∈ {0, 1} and b ∈ {+, −}
Note: a pair (x_i, y) with greater D(x_i, y) has more influence.
Weak Hypothesis h(x, y)
• Given an unclassified document x and a category y:
• If x contains w: h(x, y) = (1/2) · ln( (W(1, +, y) + ε) / (W(1, −, y) + ε) )
• If x does not contain w: h(x, y) = (1/2) · ln( (W(0, +, y) + ε) / (W(0, −, y) + ε) )
• ε is a small smoothing constant that avoids division by zero
Weak Learner Comments
• If sign[h(x, y)] = +, then x is predicted to be in y
• |h(x, y)| is the confidence of the prediction
• The term w is found as follows:
  • Run the weak learner once for every candidate term w
  • Choose the run with the smallest normalizer Z = Σ_{j, y} 2 · sqrt( W(j, +, y) · W(j, −, y) ) as the model
• Boosting: minimizing Z minimizes a bound on the probability of h(x, y) having the wrong sign
A sketch of this weak learner follows.
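A minimal Python sketch (not the authors' code) of the term-based weak learner above: it tallies the four weights W(j, b, y) for every candidate term, keeps the term with the smallest Z, and returns the confidence-rated hypothesis h. The token-set document representation and all names are assumptions.

import math
from collections import defaultdict

EPS = 1e-4  # smoothing constant for the log-ratio

def train_weak_learner(docs, labels, categories, D, vocabulary):
    """docs: list of token sets; labels: list of category sets;
    D: dict mapping (doc index, category) -> weight.
    Returns (chosen term, weak hypothesis h)."""
    best = None
    for w in vocabulary:
        # W[(j, b, y)]: total weight of pairs (x_i, y) with
        # j = [w in x_i] and b = [x_i belongs to y]
        W = defaultdict(float)
        for i, x in enumerate(docs):
            j = 1 if w in x else 0
            for y in categories:
                b = 1 if y in labels[i] else 0
                W[(j, b, y)] += D[(i, y)]
        # Z bounds the training error; keep the term that minimizes it
        Z = sum(2.0 * math.sqrt(W[(j, 1, y)] * W[(j, 0, y)])
                for j in (0, 1) for y in categories)
        if best is None or Z < best[0]:
            best = (Z, w, dict(W))
    _, w, W = best

    def h(x, y):
        j = 1 if w in x else 0
        return 0.5 * math.log((W.get((j, 1, y), 0.0) + EPS) /
                              (W.get((j, 0, y), 0.0) + EPS))
    return w, h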
Outline
• Weak Learner
• Boosting (AdaBoost.MH) ← (up next)
• Co-Bootstrapping
Boosting Idea
• Train the weak learner on a sequence of distributions D_t(x, y)
• After each round, adjust D_t(x, y) to put more weight on the most often misclassified training data
• Output the final hypothesis as a linear combination of the weak hypotheses
Boosting Algorithm
Given: (x_1, y_1), (x_2, y_2), …, (x_m, y_m), where x_i ∈ X and y_i ∈ Y
Initialize D_1(x, y) = 1/(mk)
for t = 1, …, T do
  • Pass distribution D_t to the weak learner
  • Get weak hypothesis h_t(x, y)
  • Choose α_t ∈ ℝ
  • Update: D_{t+1}(x, y) = D_t(x, y) · exp( −α_t · Y[x, y] · h_t(x, y) ) / Z_t, where Y[x, y] = +1 if x belongs to y and −1 otherwise, and Z_t normalizes D_{t+1} to a distribution
end for
Output the final hypothesis: f(x, y) = Σ_{t=1..T} α_t · h_t(x, y)
Boosting Algorithm: Initialization
Given: (x_1, y_1), (x_2, y_2), …, (x_m, y_m)
Initialize D_1(x, y) = 1/(mk)
• k = total number of categories, m = number of documents
• i.e., start from the uniform distribution
Boosting Algorithm: Loop
for t = 1, …, T do
  • Run the weak learner using distribution D_t
  • Get weak hypothesis h_t(x, y)
  • For each pair (x, y) in the training data: if h_t(x, y) guesses incorrectly, increase D(x, y)
end for
Return the final hypothesis f(x, y) = Σ_t α_t · h_t(x, y) (a sketch follows)
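A minimal Python sketch of the AdaBoost.MH loop, reusing train_weak_learner from the previous sketch. Fixing α_t = 1, the usual choice when the weak hypotheses are already confidence-rated, is an assumption, as are all names.

import math

def adaboost_mh(docs, labels, categories, vocabulary, T=50):
    """Boost the term-based weak learner; returns the final hypothesis f."""
    m = len(docs)
    k = len(categories)
    D = {(i, y): 1.0 / (m * k) for i in range(m) for y in categories}
    hypotheses = []
    for _ in range(T):
        _, h = train_weak_learner(docs, labels, categories, D, vocabulary)
        hypotheses.append(h)
        # Put more weight on (document, category) pairs that h gets wrong
        for i, x in enumerate(docs):
            for y in categories:
                b = 1.0 if y in labels[i] else -1.0
                D[(i, y)] *= math.exp(-b * h(x, y))
        Z = sum(D.values())  # renormalize so D stays a distribution
        for key in D:
            D[key] /= Z

    def f(x, y):  # final hypothesis: sum of the weak hypotheses' votes
        return sum(h(x, y) for h in hypotheses)
    return f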
Outline
• Weak Learner
• Boosting
• Co-Bootstrapping ← (up next)
Co-Bootstrapping Idea • We want to use Yahoo! categories to increase classification accuracy
Recall Example Problem • Games > Roleplaying • Final Fantasy Fan • Dragon Quest Home • Games > Strategy • Shogun: Total War • Games > Online • EverQuest Addict • Warcraft III Clan • Games > Single-Player • Warcraft III Clan
Co-Bootstrapping Algorithm (1/4)
1. Run AdaBoost on the Yahoo! sites → get classifier Y1
2. Run AdaBoost on the Google sites → get classifier G1
3. Run Y1 on the Google sites → get predicted Yahoo! categories for the Google sites
4. Run G1 on the Yahoo! sites → get predicted Google categories for the Yahoo! sites
Co-Bootstrapping Algorithm (2/4)
5. Run AdaBoost on the Yahoo! sites, including the predicted Google categories as features → get classifier Y2
6. Run AdaBoost on the Google sites, including the predicted Yahoo! categories as features → get classifier G2
7. Run Y2 on the original Google sites → get more accurate Yahoo! categories for the Google sites
8. Run G2 on the original Yahoo! sites → get more accurate Google categories for the Yahoo! sites
Co-Bootstrapping Algorithm (3/4)
9. Run AdaBoost on the Yahoo! sites, including the predicted Google categories as features → get classifier Y3
10. Run AdaBoost on the Google sites, including the predicted Yahoo! categories as features → get classifier G3
11. Run Y3 on the original Google sites → get even more accurate Yahoo! categories for the Google sites
12. Run G3 on the original Yahoo! sites → get even more accurate Google categories for the Yahoo! sites
Co-Bootstrapping Algorithm (4/4)
• Repeat, repeat, and repeat… (the loop is sketched below)
• Hopefully, the classification becomes more accurate after each iteration…
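A minimal Python sketch of the whole co-bootstrapping loop, built on the adaboost_mh sketch above. Encoding a predicted category as a "CAT:…" pseudo-term that the weak learner can pick up is one plausible realization of "include the category as a feature"; it and every name here are assumptions.

def augment(docs, predicted_cats):
    """Fold the other taxonomy's predicted category in as a pseudo-term."""
    return [x | {"CAT:" + c} for x, c in zip(docs, predicted_cats)]

def predict(f, docs, categories):
    """Assign each document its highest-scoring category."""
    return [max(categories, key=lambda y: f(x, y)) for x in docs]

def co_bootstrap(g_docs, g_labels, g_cats,
                 y_docs, y_labels, y_cats, vocabulary, rounds=5):
    # Category indicators become candidate terms the weak learner may select
    vocab_g = vocabulary | {"CAT:" + c for c in y_cats}
    vocab_y = vocabulary | {"CAT:" + c for c in g_cats}
    g_in, y_in = g_docs, y_docs
    for _ in range(rounds):
        # Train one boosted classifier per taxonomy on its own (augmented) sites
        G = adaboost_mh(g_in, g_labels, g_cats, vocab_g)
        Y = adaboost_mh(y_in, y_labels, y_cats, vocab_y)
        # Cross-predict, then feed each side's predictions to the other side
        g_in = augment(g_docs, predict(Y, g_docs, y_cats))
        y_in = augment(y_docs, predict(G, y_docs, g_cats))
    return G, Y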
Enhanced Naïve Bayes (Benchmark)
Enhanced Naïve Bayes (1/2)
• Given:
  • a document x
  • the source category S of x
• Predict the master category C
• In NB: Pr[C | x] ∝ Pr[C] · Π_{w ∈ x} Pr[w | C]^{n(x, w)}
  • w: word
  • n(x, w): number of occurrences of w in x
• In ENB: Pr[C | x, S] ∝ Pr[C | S] · Π_{w ∈ x} Pr[w | C]^{n(x, w)}
Enhanced Naïve Bayes (2/2)
• Pr[C]: the prior, estimated from the master taxonomy's training documents
• Estimate Pr[C | S] from |C ∩ S|
  • |C ∩ S|: number of documents in S that are classified into C by the NB classifier
  • how strongly S is trusted is controlled by a tunable weight parameter
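A minimal Python sketch of ENB scoring in log space. The add-one smoothing and the way the trust weight (omega below) enters the Pr[C | S] estimate are illustrative assumptions, not the paper's exact formula.

import math

def enb_score(x_counts, S_size, C_and_S, word_probs, omega=1.0):
    """x_counts: {word: n(x, w)}; C_and_S: |C ∩ S| from the NB classifier;
    word_probs: {word: Pr[w | C]}; omega: trust placed in the source category."""
    # Estimate Pr[C | S] from how many of S's documents NB placed into C
    log_prior = omega * math.log((C_and_S + 1.0) / (S_size + 1.0))
    # Standard NB likelihood, computed in log space
    log_likelihood = sum(n * math.log(word_probs.get(w, 1e-9))
                         for w, n in x_counts.items())
    return log_prior + log_likelihood

# Predict the master category C that maximizes enb_score for document x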
Number of Categories* per Dataset (1/2)
*Top-level categories only
[Table: number of categories in each dataset]
Number of Categories* per Dataset (2/2)
• Example: the Book taxonomy
  • Horror
  • Science Fiction
  • Non-fiction
    • Biography
    • History
• Biography and History are merged into Non-fiction (only top-level categories are kept)
Method (1/2)
• Task G←Y: classify Yahoo! Book sites into Google Book categories
  • Find G ∩ Y for Book (the sites listed in both directories)
  • Hide the Google categories of the sites in G ∩ Y
  • Test set for G←Y: the Yahoo! Book sites in G ∩ Y
  • Training set: randomly take |G ∩ Y| sites from G − Y (Google-only sites)
Method (2/2)
• For each dataset, run G←Y five times and Y←G five times
• macro F-score: calculate the F-score for each category, then average over all categories
• micro F-score: calculate the F-score on the entire dataset
• recall = 100%?
• Doesn't say anything about multi-category ENB
(macro vs. micro is sketched below)
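A minimal Python sketch contrasting the two averages, from per-category true-positive / false-positive / false-negative counts; all names are illustrative.

def f1(tp, fp, fn):
    """F-score from true-positive, false-positive, false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_f1(counts):
    """counts: {category: (tp, fp, fn)}; average the per-category F-scores."""
    return sum(f1(*c) for c in counts.values()) / len(counts)

def micro_f1(counts):
    """Pool the counts over all categories, then compute a single F-score."""
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    return f1(tp, fp, fn)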
Results (1/3)
• Co-Bootstrapping-AdaBoost > AdaBoost
[Charts: macro-averaged and micro-averaged F-scores]
Results (2/3)
• Co-Bootstrapping-AdaBoost iteratively improves AdaBoost
[Chart: F-score per iteration on the Book dataset]
Results (3/3)
• Co-Bootstrapping-AdaBoost > Enhanced Naïve Bayes
[Charts: macro-averaged and micro-averaged F-scores]
Contribution
• Co-Bootstrapping improves boosting performance
• Does not require a manually tuned weight parameter, as ENB does