Web Taxonomy Integration through Co-Bootstrapping
Dell Zhang, National University of Singapore
Wee Sun Lee, National University of Singapore
SIGIR'04
Problem Statement
• Desired result: all sites placed into the master taxonomy's categories
  • Games > Roleplaying: Final Fantasy Fan, Dragon Quest Home, EverQuest Addict, Warcraft III Clan
  • Games > Strategy: Shogun: Total War, Warcraft III Clan
• Given: the master taxonomy's own sites
  • Games > Roleplaying: Final Fantasy Fan, Dragon Quest Home
  • Games > Strategy: Shogun: Total War
• …and the source taxonomy's sites
  • Games > Online: EverQuest Addict, Warcraft III Clan
  • Games > Single-Player: Warcraft III Clan
Possible Approach
• Train a classifier on the master taxonomy's own sites:
  • Games > Roleplaying: Final Fantasy Fan, Dragon Quest Home
  • Games > Strategy: Shogun: Total War
• Classify the source sites: EverQuest Addict, Warcraft III Clan
• Problem: this ignores the original Yahoo! categories
Another Approach (1/2)
• Use the Yahoo! categories as additional evidence
• Advantage: the two directories contain similar categories
• Potential problems: the structures differ, and categories do not match exactly
Another Approach (2/2)
• Example: Crayon Shin-chan appears under differently structured paths
  • Entertainment > Comics and Animation > Animation > Anime > Titles > Crayon Shin-chan
  • Arts > Animation > Anime > Titles > C > Crayon Shin-chan
This Paper’s Approach • Weak Learner (as opposed to Naïve Bayes) • Boosting to combine Weak Hypotheses • New Idea: Co-Bootstrapping to exploit source categories
Assumptions
• Multi-category data are reduced to binary (single-category) data, as sketched below
  • "Totoro Fan" in both Cartoon > My Neighbor Totoro and Toys > My Neighbor Totoro
  • is converted into two examples:
    • Totoro Fan → Cartoon > My Neighbor Totoro
    • Totoro Fan → Toys > My Neighbor Totoro
• Hierarchies are ignored
  • Console > Sega and Console > Sega > Dreamcast are treated as unrelated
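A minimal Python sketch of this reduction; the function name and data layout are illustrative, not from the paper.

def to_binary_examples(multi_labeled_docs):
    """Flatten multi-category examples into one example per category."""
    binary = []
    for doc, categories in multi_labeled_docs:
        for category in categories:
            binary.append((doc, category))
    return binary

docs = [("Totoro Fan", ["Cartoon > My Neighbor Totoro",
                        "Toys > My Neighbor Totoro"])]
print(to_binary_examples(docs))
# [('Totoro Fan', 'Cartoon > My Neighbor Totoro'),
#  ('Totoro Fan', 'Toys > My Neighbor Totoro')]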
Outline
• Weak Learner ← (up next)
• Boosting
• Co-Bootstrapping
Weak Learner
• A type of classifier, similar in role to Naïve Bayes
• After training, the weak learner outputs a weak hypothesis: a term-based classifier
  • + = accept
  • − = reject
• The term may be a word, an n-gram, …
Weak Hypothesis Example
• If a site contains "Crayon Shin-chan":
  • it is in "Comics > Crayon Shin-chan"
  • it is not in "Education > Early Childhood"
• If a site does not contain "Crayon Shin-chan":
  • it is not in "Comics > Crayon Shin-chan"
  • it is in "Education > Early Childhood"
Weak Learner Inputs (1/2)
• Training data are in the form (x_1, y_1), (x_2, y_2), …, (x_m, y_m)
  • x_i is a document
  • y_i is a category
  • (x_i, y_i) means document x_i is in category y_i
• D(x, y) is a distribution over all combinations of x_i and y_j
  • D(x_i, y_j) indicates the "importance" of (x_i, y_j)
• w is the term (found automatically)
Weak Learner Algorithm
For each candidate term w and each possible category y, compute four values (following the confidence-rated boosting scheme of Schapire & Singer, on which the paper builds):
• W(j, b, y) = Σ_i D(x_i, y) over the training documents x_i with j = [x_i contains w] and b = [x_i belongs to y], for j ∈ {0, 1} and b ∈ {+, −}
Note: a pair (x_i, y) with greater D(x_i, y) has more influence.
Weak Hypothesis h(x, y)
• Given an unclassified document x and a category y:
• If x contains w: h(x, y) = (1/2) · ln( (W(1, +, y) + ε) / (W(1, −, y) + ε) )
• If x does not contain w: h(x, y) = (1/2) · ln( (W(0, +, y) + ε) / (W(0, −, y) + ε) )
• ε is a small smoothing constant that avoids division by zero
Weak Learner Comments
• If sign[h(x, y)] = +, then x is predicted to be in y
• |h(x, y)| is the confidence of the prediction
• The term w is found as follows:
  • Run the weak learner once for every candidate term w
  • Choose the run with the smallest normalizer Z = Σ_{j, y} 2 · sqrt( W(j, +, y) · W(j, −, y) ) as the model
• Boosting: minimizing Z minimizes a bound on the probability of h(x, y) having the wrong sign
A sketch of this weak learner follows.
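A minimal Python sketch (not the authors' code) of the term-based weak learner above: it tallies the four weights W(j, b, y) for every candidate term, keeps the term with the smallest Z, and returns the confidence-rated hypothesis h. The token-set document representation and all names are assumptions.

import math
from collections import defaultdict

EPS = 1e-4  # smoothing constant for the log-ratio

def train_weak_learner(docs, labels, categories, D, vocabulary):
    """docs: list of token sets; labels: list of category sets;
    D: dict mapping (doc index, category) -> weight.
    Returns (chosen term, weak hypothesis h)."""
    best = None
    for w in vocabulary:
        # W[(j, b, y)]: total weight of pairs (x_i, y) with
        # j = [w in x_i] and b = [x_i belongs to y]
        W = defaultdict(float)
        for i, x in enumerate(docs):
            j = 1 if w in x else 0
            for y in categories:
                b = 1 if y in labels[i] else 0
                W[(j, b, y)] += D[(i, y)]
        # Z bounds the training error; keep the term that minimizes it
        Z = sum(2.0 * math.sqrt(W[(j, 1, y)] * W[(j, 0, y)])
                for j in (0, 1) for y in categories)
        if best is None or Z < best[0]:
            best = (Z, w, dict(W))
    _, w, W = best

    def h(x, y):
        j = 1 if w in x else 0
        return 0.5 * math.log((W.get((j, 1, y), 0.0) + EPS) /
                              (W.get((j, 0, y), 0.0) + EPS))
    return w, h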
Outline
• Weak Learner
• Boosting (AdaBoost.MH) ← (up next)
• Co-Bootstrapping
Boosting Idea
• Train the weak learner on a sequence of distributions D_t(x, y)
• After each round, adjust D_t(x, y) to put more weight on the most often misclassified training data
• Output the final hypothesis as a linear combination of the weak hypotheses
Boosting Algorithm
Given: (x_1, y_1), (x_2, y_2), …, (x_m, y_m), where x_i ∈ X and y_i ∈ Y
Initialize D_1(x, y) = 1/(mk)
for t = 1, …, T do
  • Pass distribution D_t to the weak learner
  • Get weak hypothesis h_t(x, y)
  • Choose α_t ∈ ℝ
  • Update: D_{t+1}(x, y) = D_t(x, y) · exp( −α_t · Y[x, y] · h_t(x, y) ) / Z_t, where Y[x, y] = +1 if x belongs to y and −1 otherwise, and Z_t normalizes D_{t+1} to a distribution
end for
Output the final hypothesis: f(x, y) = Σ_{t=1..T} α_t · h_t(x, y)
Boosting Algorithm: Initialization
Given: (x_1, y_1), (x_2, y_2), …, (x_m, y_m)
Initialize D_1(x, y) = 1/(mk)
• k = total number of categories, m = number of documents
• i.e., start from the uniform distribution
Boosting Algorithm: Loop
for t = 1, …, T do
  • Run the weak learner using distribution D_t
  • Get weak hypothesis h_t(x, y)
  • For each pair (x, y) in the training data: if h_t(x, y) guesses incorrectly, increase D(x, y)
end for
Return the final hypothesis f(x, y) = Σ_t α_t · h_t(x, y) (a sketch follows)
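A minimal Python sketch of the AdaBoost.MH loop, reusing train_weak_learner from the previous sketch. Fixing α_t = 1, the usual choice when the weak hypotheses are already confidence-rated, is an assumption, as are all names.

import math

def adaboost_mh(docs, labels, categories, vocabulary, T=50):
    """Boost the term-based weak learner; returns the final hypothesis f."""
    m = len(docs)
    k = len(categories)
    D = {(i, y): 1.0 / (m * k) for i in range(m) for y in categories}
    hypotheses = []
    for _ in range(T):
        _, h = train_weak_learner(docs, labels, categories, D, vocabulary)
        hypotheses.append(h)
        # Put more weight on (document, category) pairs that h gets wrong
        for i, x in enumerate(docs):
            for y in categories:
                b = 1.0 if y in labels[i] else -1.0
                D[(i, y)] *= math.exp(-b * h(x, y))
        Z = sum(D.values())  # renormalize so D stays a distribution
        for key in D:
            D[key] /= Z

    def f(x, y):  # final hypothesis: sum of the weak hypotheses' votes
        return sum(h(x, y) for h in hypotheses)
    return f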
Outline
• Weak Learner
• Boosting
• Co-Bootstrapping ← (up next)
Co-Bootstrapping Idea • We want to use Yahoo! categories to increase classification accuracy
Recall Example Problem • Games > Roleplaying • Final Fantasy Fan • Dragon Quest Home • Games > Strategy • Shogun: Total War • Games > Online • EverQuest Addict • Warcraft III Clan • Games > Single-Player • Warcraft III Clan
Co-Bootstrapping Algorithm (1/4)
1. Run AdaBoost on the Yahoo! sites → get classifier Y1
2. Run AdaBoost on the Google sites → get classifier G1
3. Run Y1 on the Google sites → get predicted Yahoo! categories for the Google sites
4. Run G1 on the Yahoo! sites → get predicted Google categories for the Yahoo! sites
Co-Bootstrapping Algorithm (2/4)
5. Run AdaBoost on the Yahoo! sites, including the predicted Google categories as features → get classifier Y2
6. Run AdaBoost on the Google sites, including the predicted Yahoo! categories as features → get classifier G2
7. Run Y2 on the original Google sites → get more accurate Yahoo! categories for the Google sites
8. Run G2 on the original Yahoo! sites → get more accurate Google categories for the Yahoo! sites
Co-Bootstrapping Algorithm (3/4)
9. Run AdaBoost on the Yahoo! sites, including the predicted Google categories as features → get classifier Y3
10. Run AdaBoost on the Google sites, including the predicted Yahoo! categories as features → get classifier G3
11. Run Y3 on the original Google sites → get even more accurate Yahoo! categories for the Google sites
12. Run G3 on the original Yahoo! sites → get even more accurate Google categories for the Yahoo! sites
Co-Bootstrapping Algorithm (4/4)
• Repeat, repeat, and repeat… (the loop is sketched below)
• Hopefully, the classification becomes more accurate after each iteration…
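A minimal Python sketch of the whole co-bootstrapping loop, built on the adaboost_mh sketch above. Encoding a predicted category as a "CAT:…" pseudo-term that the weak learner can pick up is one plausible realization of "include the category as a feature"; it and every name here are assumptions.

def augment(docs, predicted_cats):
    """Fold the other taxonomy's predicted category in as a pseudo-term."""
    return [x | {"CAT:" + c} for x, c in zip(docs, predicted_cats)]

def predict(f, docs, categories):
    """Assign each document its highest-scoring category."""
    return [max(categories, key=lambda y: f(x, y)) for x in docs]

def co_bootstrap(g_docs, g_labels, g_cats,
                 y_docs, y_labels, y_cats, vocabulary, rounds=5):
    # Category indicators become candidate terms the weak learner may select
    vocab_g = vocabulary | {"CAT:" + c for c in y_cats}
    vocab_y = vocabulary | {"CAT:" + c for c in g_cats}
    g_in, y_in = g_docs, y_docs
    for _ in range(rounds):
        # Train one boosted classifier per taxonomy on its own (augmented) sites
        G = adaboost_mh(g_in, g_labels, g_cats, vocab_g)
        Y = adaboost_mh(y_in, y_labels, y_cats, vocab_y)
        # Cross-predict, then feed each side's predictions to the other side
        g_in = augment(g_docs, predict(Y, g_docs, y_cats))
        y_in = augment(y_docs, predict(G, y_docs, g_cats))
    return G, Y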
Enhanced Naïve Bayes (Benchmark)
Enhanced Naïve Bayes (1/2)
• Given:
  • a document x
  • the source category S of x
• Predict the master category C
• In NB: Pr[C | x] ∝ Pr[C] · Π_{w ∈ x} Pr[w | C]^{n(x, w)}
  • w: word
  • n(x, w): number of occurrences of w in x
• In ENB: Pr[C | x, S] ∝ Pr[C | S] · Π_{w ∈ x} Pr[w | C]^{n(x, w)}
Enhanced Naïve Bayes (2/2)
• Pr[C]: the prior, estimated from the master taxonomy's training documents
• Estimate Pr[C | S] from |C ∩ S|
  • |C ∩ S|: number of documents in S that are classified into C by the NB classifier
  • how strongly S is trusted is controlled by a tunable weight parameter
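A minimal Python sketch of ENB scoring in log space. The add-one smoothing and the way the trust weight (omega below) enters the Pr[C | S] estimate are illustrative assumptions, not the paper's exact formula.

import math

def enb_score(x_counts, S_size, C_and_S, word_probs, omega=1.0):
    """x_counts: {word: n(x, w)}; C_and_S: |C ∩ S| from the NB classifier;
    word_probs: {word: Pr[w | C]}; omega: trust placed in the source category."""
    # Estimate Pr[C | S] from how many of S's documents NB placed into C
    log_prior = omega * math.log((C_and_S + 1.0) / (S_size + 1.0))
    # Standard NB likelihood, computed in log space
    log_likelihood = sum(n * math.log(word_probs.get(w, 1e-9))
                         for w, n in x_counts.items())
    return log_prior + log_likelihood

# Predict the master category C that maximizes enb_score for document x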
Number of Categories* per Dataset (1/2)
*Top-level categories only
[Table: number of categories in each dataset]
Number of Categories* per Dataset (2/2)
• Example: the Book taxonomy
  • Horror
  • Science Fiction
  • Non-fiction
    • Biography
    • History
• Biography and History are merged into Non-fiction (only top-level categories are kept)
Method (1/2)
• Task G←Y: classify Yahoo! Book sites into Google Book categories
  • Find G ∩ Y for Book (the sites listed in both directories)
  • Hide the Google categories of the sites in G ∩ Y
  • Test set for G←Y: the Yahoo! Book sites in G ∩ Y
  • Training set: randomly take |G ∩ Y| sites from G − Y (Google-only sites)
Method (2/2)
• For each dataset, run G←Y five times and Y←G five times
• macro F-score: calculate the F-score for each category, then average over all categories
• micro F-score: calculate the F-score on the entire dataset
• recall = 100%?
• Doesn't say anything about multi-category ENB
(macro vs. micro is sketched below)
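A minimal Python sketch contrasting the two averages, from per-category true-positive / false-positive / false-negative counts; all names are illustrative.

def f1(tp, fp, fn):
    """F-score from true-positive, false-positive, false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_f1(counts):
    """counts: {category: (tp, fp, fn)}; average the per-category F-scores."""
    return sum(f1(*c) for c in counts.values()) / len(counts)

def micro_f1(counts):
    """Pool the counts over all categories, then compute a single F-score."""
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    return f1(tp, fp, fn)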
Results (1/3)
• Co-Bootstrapping-AdaBoost > AdaBoost
[Charts: macro-averaged and micro-averaged F-scores]
Results (2/3)
• Co-Bootstrapping-AdaBoost iteratively improves AdaBoost
[Chart: F-score per iteration on the Book dataset]
Results (3/3)
• Co-Bootstrapping-AdaBoost > Enhanced Naïve Bayes
[Charts: macro-averaged and micro-averaged F-scores]
Contribution
• Co-Bootstrapping improves boosting performance
• Does not require a manually tuned weight parameter, as ENB does