Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge Evgeniy Gabrilovich and Shaul Markovitch American Association for Artificial Intelligence (AAAI) 2006 Prepared by Qi Li
Paper Structure • Introduction • Feature Generation with Wikipedia • Wikipedia as a Knowledge Repository • Feature Construction • Feature Generator Design • Using the Link Structure • Empirical Evaluation • Implementation Details • Experimental Methodology • The Effect of Feature Generation • Classifying Short Documents • Conclusions and Future Work
Introduction • Text categorization • Deals with the automatic assignment of category labels to natural language documents • Documents are represented as bags of words (BOW) • Features are derived from the words • Categorization is based on these features • Limitation of BOW: • The representation is confined to the individual word occurrences seen in the training set • Example: "Wal-Mart supply chain goes real time" vs. "Wal-Mart manages its stock with RFID technology": training on the first headline does not help classify the second, even though both concern the same topic (see the sketch below) • Effective for categories of medium difficulty, but performs poorly on small categories and short documents • Proposed remedy: use an encyclopedia to endow the machine with the breadth of knowledge available to humans
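A minimal sketch of the BOW brittleness illustrated above; the naive tokenization and the overlap check are illustrative, not from the paper:

```python
# Minimal sketch: bag-of-words vectors for the two Wal-Mart headlines.
# Tokenization is deliberately naive; a real system would do more.
from collections import Counter

def bow(text):
    return Counter(text.lower().split())

d1 = bow("Wal-Mart supply chain goes real time")
d2 = bow("Wal-Mart manages its stock with RFID technology")

# The only shared feature is "wal-mart": BOW sees almost no connection
# between two headlines about the same topic.
print(set(d1) & set(d2))  # {'wal-mart'}
```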
Auxiliary text classifier: • Matches documents with the most relevant Wikipedia articles • Final representation: conventional bag of words + new concept features • Example of the auxiliary classifier idea: • "Bernanke takes charge" • Generated concepts: BEN BERNANKE, FEDERAL RESERVE, CHAIRMAN OF THE FEDERAL RESERVE, ALAN GREENSPAN, MONETARISM, … • Using Wikipedia • Use text similarity algorithms to automatically identify the encyclopedia articles relevant to each document (a sketch follows below) • Leverage the knowledge gained from these articles
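A hypothetical sketch of that matching step, using plain TF-IDF cosine similarity over a toy set of article snippets; the paper's actual feature generator is more elaborate, so treat this only as an illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy "encyclopedia": article title -> stand-in snippet of its text.
articles = {
    "BEN BERNANKE": "Ben Bernanke is an American economist who chaired the Federal Reserve",
    "FEDERAL RESERVE": "The Federal Reserve is the central banking system of the United States",
    "ALAN GREENSPAN": "Alan Greenspan served as chairman of the Federal Reserve before Bernanke",
}

vectorizer = TfidfVectorizer(stop_words="english")
article_matrix = vectorizer.fit_transform(articles.values())

def top_concepts(document, k=2):
    # Rank articles by cosine similarity to the input document.
    sims = cosine_similarity(vectorizer.transform([document]), article_matrix)[0]
    ranked = sorted(zip(articles, sims), key=lambda pair: -pair[1])
    return [name for name, _ in ranked[:k]]

print(top_concepts("Bernanke takes charge of the Federal Reserve"))
```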
Feature Generation with Wikipedia • Extends the document representation for text categorization with knowledge concepts relevant to the document text • Why Wikipedia? • The largest knowledge repository available • Large-scale hierarchies of concepts • High-quality, standard written English • …
Feature Construction • The feature generator receives a text fragment and maps it to the most relevant Wikipedia articles • E.g., for "Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge" it returns: • ENCYCLOPEDIA, WIKIPEDIA, ENTERPRISE CONTENT MANAGEMENT, BOTTLENECK, PERFORMANCE PROBLEM, HERMENEUTICS • Pipeline: training documents -> generated features (Wikipedia concepts) -> augmented bag of words
Feature Construction (cont.) • What is the unit for feature generation? • A word, sentence, paragraph, or the whole document? • Multi-resolution approach: features are generated for • Individual words • Sentences • Paragraphs • The entire document • A polysemous word is thus mapped to the concepts that correspond to the sense shared by its context words (see the sketch below)
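A minimal sketch of the multi-resolution idea under simplifying assumptions (naive sentence and paragraph splitting; generate_concepts is a placeholder for the Wikipedia matcher sketched earlier):

```python
# Sketch: query the feature generator with contexts of several granularities,
# so polysemous words can be disambiguated by their surroundings.
def generate_concepts(text):
    # Placeholder: in the real system this queries the Wikipedia matcher.
    return []

def multi_resolution_features(document):
    contexts = [document]                      # whole document
    for para in document.split("\n\n"):        # paragraph level
        contexts.append(para)
        for sent in para.split(". "):          # sentence level (naive split)
            contexts.append(sent)
            contexts.extend(sent.split())      # individual words
    features = set()
    for ctx in contexts:
        features.update(generate_concepts(ctx))
    return features
```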
Feature Construction Example • For "jaguar car models", the Wikipedia-based feature generator returns: • JAGUAR (CAR) • DAIMLER and BRITISH LEYLAND MOTOR CORPORATION (companies merged with Jaguar) • V12 (Jaguar's engine) • JAGUAR E-TYPE and JAGUAR XJ (Jaguar models) • For "jaguar Panthera onca", it returns: • JAGUAR • FELIDAE (the feline species family) • Related felines such as LEOPARD, PUMA and BLACK PANTHER, as well as KINKAJOU
Feature Generator Design • A set of simple heuristics prunes the set of Wikipedia concepts, discarding articles that: • Have fewer than 100 non-stop words (too little content) • Have fewer than 5 incoming and outgoing links (too short or too obscure) • Are disambiguation pages • Each remaining concept is represented as an attribute vector of its words, weighted with a TF-IDF scheme (a sketch follows below)
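A sketch of those pruning heuristics, assuming each article record carries its text, link counts, and a disambiguation flag; the field names are invented for illustration:

```python
# Tiny illustrative stop-word list; the real system uses a full list.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

def keep_article(article):
    """Apply the slide's pruning heuristics to one article record."""
    content_words = [w for w in article["text"].lower().split()
                     if w not in STOPWORDS]
    if len(content_words) < 100:                 # too little content
        return False
    if article["incoming_links"] < 5 or article["outgoing_links"] < 5:
        return False                             # too short / too obscure
    if article["is_disambiguation"]:             # disambiguation page
        return False
    return True

# Surviving articles are then turned into TF-IDF-weighted attribute vectors.
```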
Using the Link Structure • Anchor text of links: • Often identical to the canonical name of the target article • Different anchor texts pointing to the same article supply alternative names, variant spellings, and related phrases • Incoming links: their number indicates the significance of an article (sketch below) • Problem: taking all articles linked from a concept is ill-advised, as it pulls in a lot of weakly related material • The authors leave pursuing this direction to future work
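A sketch of mining the link structure under an assumed input format: `links` is an iterable of (source_article, anchor_text, target_article) triples, which is not the paper's actual data layout:

```python
from collections import defaultdict

def mine_links(links):
    """links: iterable of (source_article, anchor_text, target_article)."""
    aliases = defaultdict(set)     # target -> alternative names for it
    in_degree = defaultdict(int)   # target -> number of incoming links
    for source, anchor, target in links:
        in_degree[target] += 1     # incoming-link count ~ significance
        if anchor.lower() != target.lower():
            aliases[target].add(anchor)   # variant spelling / related phrase
    return aliases, in_degree
```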
Empirical Evaluation • Wikipedia snapshot: November 5, 2005 • 1.8 GB of text in 910,989 articles • After removing small and overly specific concepts, 171,332 articles remain • Stop words and rare words removed; the remaining words stemmed • 296,157 distinct terms represent the concepts (see the preprocessing sketch below)
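A rough sketch of that preprocessing, assuming NLTK's Porter stemmer is an acceptable stand-in; the rare-word threshold is illustrative, since the slide does not give the exact cutoff:

```python
from collections import Counter
from nltk.stem import PorterStemmer  # assumes NLTK is available

def preprocess(docs, stopwords, min_freq=3):
    stemmer = PorterStemmer()
    counts = Counter(w for doc in docs for w in doc.lower().split())
    # Keep words that are frequent enough and not stop words.
    vocab = {w for w, c in counts.items()
             if c >= min_freq and w not in stopwords}
    return [[stemmer.stem(w) for w in doc.lower().split() if w in vocab]
            for doc in docs]
```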
Experimental Methodology • Datasets: • 1. Reuters-21578 • 2. Reuters Corpus Volume I (RCV1) • 3. OHSUMED • 4. 20 Newsgroups (20NG) • 5. Movie Reviews (Movies) • Method: SVM with a linear kernel • Metric: precision-recall break-even point (BEP) • Reuters and OHSUMED: micro- and macro-averaged BEP • 20NG and Movies: 4-fold cross-validation • (A worked BEP sketch follows below)
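A short sketch of the BEP metric on toy data: rank documents by classifier score, then read off precision at the cutoff equal to the number of relevant documents, where precision and recall coincide:

```python
def break_even_point(scores, labels):
    # Rank documents by score; BEP equals precision (= recall) at the
    # cutoff whose size is the number of true positives.
    ranked = [y for _, y in sorted(zip(scores, labels), reverse=True)]
    n_pos = sum(labels)
    return sum(ranked[:n_pos]) / n_pos

# Toy example: two relevant documents, one ranked first, one third.
print(break_even_point([0.9, 0.8, 0.4, 0.2], [1, 0, 1, 0]))  # 0.5
```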
The Effect of Feature Generation • Feature generation is more effective in small categories, where the improvement is larger
Classifying Short Documents • Experiment on short documents: only the titles of the articles are used for classification
Conclusions and Future Work • Feature generator: • Identifies the most relevant encyclopedia articles for a document • Creates new features from them • Adds semantics to the conventional BOW representation • Comparison with latent semantic indexing: • LSI + SVM performs poorly • Wikipedia features + SVM improve results • Future work: applying the approach to information retrieval