Classifying Tags Using Open Content Resources. Simon Overell, Borkur Sigurbjornsson & Roelof van Zwol. WSDM ’09.
Motivation • Classify tags in Flickr as broad categories such as what, where, when and who • Easier indexing and navigation • WordNet is usually used for classification but has limited coverage
Classifying Wikipedia Articles • Using only metadata (i.e. Categories and Templates) – high scalability • Supervised Classifier • Articles as objects • WordNet noun semantic categories as classification classes • Categories and Templates as features • Support Vector Machine (SVM) as classifier
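The feature representation on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `article_features`, the `cat:`/`tpl:` prefixes, and the example article data are all assumptions made for the sketch. It only shows how an article's categories and templates could become a sparse feature vector that a linear SVM would consume.

```python
def article_features(categories, templates):
    """Build a sparse binary feature dict for one Wikipedia article.

    Prefixes keep a category and a template with the same name apart.
    The prefix scheme is a hypothetical choice for this sketch.
    """
    features = {}
    for name in categories:
        features["cat:" + name] = 1.0
    for name in templates:
        features["tpl:" + name] = 1.0
    return features


# Illustrative input only; not taken from the slides.
example = article_features(
    categories=["Capitals in Europe", "Cities in England"],
    templates=["Infobox settlement"],
)
```

Each article becomes one training object, and its label is the WordNet noun semantic category of the matching noun.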
Supervised Classification • Ground Truth • All Wikipedia articles that match WordNet nouns • Data Sparsity • WordNet categories underrepresented (10 out of 25) • Articles have very few features
Reducing Data Sparsity • Using category and template network transclusion • … but noise is added
System Optimization • Number of arcs traversed in • Category network • Template network • Choice of weighting function • Term Frequency (tf) • Term Frequency – Inverse Document Frequency (tf-idf) • Term Frequency – Inverse Layer (tf-il)
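The two tuning knobs above (arcs traversed and weighting function) can be sketched together. The slides do not define tf-il, so this sketch assumes one plausible reading: each time a feature is reached, it contributes 1 divided by the layer at which it was found, where layer 1 holds the features attached directly to the article. The function name `weighted_features`, the `parents` mapping, and the example network are hypothetical.

```python
from collections import defaultdict, deque


def weighted_features(direct_features, parents, max_arcs):
    """Collect features within max_arcs arcs of an article.

    Assumption: the article-to-feature link counts as arc 1, and a feature
    reached at layer k adds 1/k to its weight (an assumed form of tf-il).
    No visited set is kept, so a feature reached by several paths is
    counted several times; the depth limit guarantees termination.
    """
    weights = defaultdict(float)
    queue = deque((feature, 1) for feature in direct_features)
    while queue:
        feature, layer = queue.popleft()
        weights[feature] += 1.0 / layer
        if layer < max_arcs:
            for parent in parents.get(feature, ()):
                queue.append((parent, layer + 1))
    return dict(weights)


# Illustrative category network; not taken from the slides.
parents = {
    "Cities in England": ["Cities in the United Kingdom", "England"],
    "Cities in the United Kingdom": ["Cities by country"],
}
weights = weighted_features(["Cities in England"], parents, max_arcs=3)
```

Raising `max_arcs` adds more features per article, which reduces sparsity, while the 1/layer factor damps the noise contributed by distant ancestors.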
Fine Tuning • Partitioned the ground truth into training and test sets • Criteria • At least 80% precision • Maximum possible recall • Resulting optimal values • Category arcs: 3, Template arcs: 3, Weighting: tf-il • Precision: 87%, F1-measure: 0.696
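The slide reports precision and F1 but not recall. Assuming both figures come from the same test run, the recall can be recovered by rearranging F1 = 2PR / (P + R) into R = F1 · P / (2P − F1). The helper name `recall_from_f1` is hypothetical.

```python
def recall_from_f1(precision, f1):
    """Solve F1 = 2PR / (P + R) for R."""
    return f1 * precision / (2 * precision - f1)


# Values from the slide: precision 87%, F1-measure 0.696.
recall = recall_from_f1(0.87, 0.696)
```

Under that assumption the implied recall is about 0.58.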
SVM Threshold • SVM outputs confidence with which an article is correctly classified as a member of a category • Training experiment with 250 Wikipedia articles (1 assessor)
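Thresholding on the SVM confidence can be sketched as below. The function name `accepted_categories` and the example scores are hypothetical; the sketch returns every category whose confidence reaches the threshold, since an article may end up with more than one class.

```python
def accepted_categories(confidences, threshold):
    """Return categories with confidence >= threshold, most confident first."""
    kept = [c for c, score in confidences.items() if score >= threshold]
    return sorted(kept, key=lambda c: -confidences[c])


# Illustrative confidences for one article; not taken from the slides.
scores = {"person": 0.9, "location": 0.6, "time": 0.1}
lenient = accepted_categories(scores, threshold=0.5)
strict = accepted_categories(scores, threshold=0.8)
```

A lower threshold classifies more articles at lower precision, and a higher one does the opposite, which is the trade-off the threshold experiment is tuning.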
Summary • Optimised for Recall (ClassTag) • 39% of Articles classified • 664,770 Wikipedia articles • Optimised for Precision (ClassTag+) • 21% of Articles classified • 338,061 Wikipedia articles
Comparison with DBpedia • Experimental Setup • 300 pooled articles • 3 Assessors • Blind Assessments • 50 articles overlap • Partial Agreement: 86% • Total Agreement: 78%
Classification of Flickr Tags • Tag -> Anchor Text • String matching • Anchor Text -> Wikipedia Article • Number of times an anchor refers to a Wikipedia article • Wikipedia Article -> Category • Output of SVM decision
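The three mapping steps on this slide can be sketched as one lookup chain. This is an illustration under stated assumptions: the names `normalise` and `classify_tag`, the data layout, and the link counts are hypothetical, and matching is done by lowercasing and stripping whitespace, which is only one way to do the string matching.

```python
def normalise(text):
    """Lowercase and drop all whitespace, as Flickr tags often appear."""
    return "".join(text.lower().split())


def classify_tag(tag, anchor_counts, article_category):
    """Map tag -> anchor text -> most-linked article -> category.

    anchor_counts: {anchor text: {article title: number of links}}
    article_category: {article title: category from the classifier}
    Returns None when no anchor matches or the article is unclassified.
    """
    key = normalise(tag)
    best_article, best_count = None, 0
    for anchor, targets in anchor_counts.items():
        if normalise(anchor) != key:
            continue
        for article, count in targets.items():
            if count > best_count:
                best_article, best_count = article, count
    if best_article is None:
        return None
    return article_category.get(best_article)


# Link counts below are made up for illustration.
anchor_counts = {
    "George Bush": {"George W. Bush": 120, "George Bush Senior": 45},
}
article_category = {"George W. Bush": "person", "George Bush Senior": "person"}
result = classify_tag("georgebush", anchor_counts, article_category)
```

Here the anchor text is ambiguous at the article level, but both candidate articles share one category, so the tag still resolves to a single class.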
Ambiguity • Tag -> Anchor Text • Some ambiguity because tags are often lowercase with no whitespace • Anchor Text -> Wikipedia Article • 13.4% of Anchor Text -> Wikipedia Article mappings ambiguous • 4% of Anchor Text -> Category mappings ambiguous • Example • George Bush -> George W. Bush, George Bush Senior • George Bush -> Person • Wikipedia Article -> Category • 5.7% of classified articles receive multiple classifications
Evaluation • ClassTag extended the vocabulary coverage of the WordNet classification by 115% • Taking tag frequency into account • ClassTag classified 69.2% of Flickr tags • 22% more than the WordNet baseline
Multilanguage Classification • 80% of tags in English, 7% in German and 6% in Dutch • A portion of the unclassified tags may be non-English • Possible alternate language classification • Run ClassTag using an alternate-language Wikipedia and a corresponding lexicon • Translate the English classification using Wikipedia’s interlanguage links
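The second option, translating the English classification through interlanguage links, can be sketched as a simple relabelling. The function name `translate_classification` and the dictionary layout are hypothetical; the example uses a single English-to-German link.

```python
def translate_classification(english_labels, interlanguage_links):
    """Carry each English article's category over to its linked foreign title.

    english_labels: {English article title: category}
    interlanguage_links: {English article title: foreign article title}
    Articles without an interlanguage link are skipped.
    """
    translated = {}
    for english_title, category in english_labels.items():
        foreign_title = interlanguage_links.get(english_title)
        if foreign_title is not None:
            translated[foreign_title] = category
    return translated


# Illustrative data: the German Wikipedia title for Munich is München.
german_labels = translate_classification(
    {"Munich": "location", "Some unlinked article": "artifact"},
    {"Munich": "München"},
)
```

This route needs no retraining, but it only covers articles that have a counterpart in the target language.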
Contributions • Classifying open content resources using their structural patterns • Presenting ClassTag, a system for classifying tags • ClassTag extends the WordNet lexicon using the structural patterns of Wikipedia
Conclusion • Tuneable system for classifying Wikipedia pages • ClassTag: Nearly 40% of articles classified with a precision of 72% • ClassTag+: 21% of articles classified with a precision of 86% (equal to assessor agreement) • Nearly 70% of Flickr tags matched to WordNet categories