Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge Evgeniy Gabrilovich and Shaul Markovitch American Association for Artificial Intelligence 2006 Prepared by Qi Li
Paper Structure • Introduction • Feature Generation with Wikipedia • Wikipedia as a Knowledge Repository • Feature Construction • Feature Generator Design • Using the Link Structure • Empirical Evaluation • Implementation Details • Experimental Methodology • The Effect of Feature Generation • Classifying Short Documents • Conclusions and Future Work
Introduction • Text categorization • Deals with the automatic assignment of category labels to natural language documents • Documents are represented as bags of words (BOW) • Features are derived from individual words • Categorization is performed on these features • Limitation of BOW: restricted to the individual word occurrences seen in the training set • "Wal-Mart supply chain goes real time" • "Wal-Mart manages its stock with RFID technology" • These two related headlines share almost no words, so a BOW model cannot connect them • Effective for medium-difficulty categorization, but weak on small categories and short documents • Idea: use an encyclopedia to endow the machine with the breadth of knowledge available to humans
Auxiliary text classifier: • Matches documents with the most relevant Wikipedia articles • Combines the conventional bag of words with new concept-based features • Example of the auxiliary classifier's output: • "Bernanke takes charge" → BEN BERNANKE, FEDERAL RESERVE, CHAIRMAN OF THE FEDERAL RESERVE, ALAN GREENSPAN, MONETARISM, … • Using Wikipedia • Use text similarity algorithms to automatically identify the encyclopedia articles relevant to each document, as sketched below • Leverage the knowledge gained from these articles
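A minimal sketch of the matching step, assuming article texts are available locally from a Wikipedia dump; the exact matcher the authors used is not specified at this level of detail, so TF-IDF cosine similarity stands in here as a plausible choice, and all names below are illustrative:

```python
# Rank Wikipedia articles by TF-IDF cosine similarity to a document.
# `article_titles`/`article_texts` are placeholders for data loaded
# from a local Wikipedia dump (loading not shown).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

article_titles = ["Ben Bernanke", "Federal Reserve", "Alan Greenspan"]
article_texts = ["...article body...", "...article body...", "...article body..."]

vectorizer = TfidfVectorizer(stop_words="english")
article_matrix = vectorizer.fit_transform(article_texts)

def top_concepts(document: str, k: int = 10):
    """Return the k Wikipedia articles most similar to `document`."""
    doc_vec = vectorizer.transform([document])
    scores = cosine_similarity(doc_vec, article_matrix).ravel()
    ranked = scores.argsort()[::-1][:k]
    return [(article_titles[i], float(scores[i])) for i in ranked]

print(top_concepts("Bernanke takes charge"))
```

The returned article titles become the new concept features appended to the document's bag of words.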
Feature Generation with Wikipedia • Extend the document representation for text categorization with knowledge concepts relevant to the document text • Wikipedia • The largest knowledge repository • Large-scale hierarchies • High-quality, standard written English • …
Feature Construction • Receive a text fragment and map it to the most relevant Wikipedia articles • E.g., "overcoming the brittleness bottleneck using wikipedia: enhancing text categorization with encyclopedic knowledge" → ENCYCLOPEDIA, WIKIPEDIA, ENTERPRISE CONTENT MANAGEMENT, BOTTLENECK, PERFORMANCE PROBLEM, HERMENEUTICS • Training documents → features → Wikipedia concepts → augmented bag of words
Feature Construction (cont.) • What is the unit for feature generation? • Word, sentence, paragraph, or document? • Multi-resolution approach: features are generated for • Individual words • Sentences • Paragraphs • The entire document • Polysemous words are mapped to the concepts that correspond to the sense shared by their context words (see the sketch after the example below)
Feature Construction example • "jaguar car models" → the Wikipedia-based feature generator returns: • JAGUAR (CAR) • DAIMLER and BRITISH LEYLAND MOTOR CORPORATION (companies merged with Jaguar) • V12 (Jaguar's engine) • JAGUAR E-TYPE • JAGUAR XJ • "jaguar Panthera onca" → • JAGUAR • FELIDAE (the feline species family), related felines such as LEOPARD, PUMA, and BLACK PANTHER, as well as KINKAJOU
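A sketch of the multi-resolution idea, reusing the `top_concepts` matcher from above. The sentence and paragraph splitting here is deliberately crude and the parameters are illustrative; the point is that local contexts like "jaguar car models" pick out the sense shared by the surrounding words:

```python
# Query the concept matcher at several context sizes: whole document,
# paragraphs, sentences, and individual words. Requires top_concepts()
# from the earlier sketch.
def multi_resolution_features(document: str, k: int = 5):
    contexts = [document]                              # entire document
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    contexts += paragraphs                             # paragraphs
    for p in paragraphs:
        contexts += [s for s in p.split(". ") if s]    # sentences (crude split)
    contexts += document.split()                       # individual words
    features = set()
    for ctx in contexts:
        features.update(title for title, _ in top_concepts(ctx, k))
    return features
```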
Feature generator design • A set of simple heuristics for pruning the set of Wikipedia concepts, discarding: • Articles with fewer than 100 non-stop words (too short) • Articles with fewer than 5 incoming and outgoing links • Disambiguation pages • Each concept is represented as an attribute vector of words, weighted with a TF-IDF scheme
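A sketch of those pruning heuristics. The `Article` fields, the toy stop list, and the reading of the link threshold as applying to each direction are all assumptions for illustration, not the authors' data model:

```python
# Filter Wikipedia articles using the heuristics listed above.
from dataclasses import dataclass

STOP_WORDS = {"the", "a", "of", "and", "in", "to"}  # toy stop list

@dataclass
class Article:
    title: str
    text: str
    incoming_links: int
    outgoing_links: int
    is_disambiguation: bool

def keep(article: Article) -> bool:
    # Fewer than 100 non-stop words: too short to be a useful concept.
    non_stop = [w for w in article.text.lower().split() if w not in STOP_WORDS]
    if len(non_stop) < 100:
        return False
    # Too few links (interpreted here as per direction): overly specific.
    if article.incoming_links < 5 or article.outgoing_links < 5:
        return False
    return not article.is_disambiguation
```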
Using the link structure • Links and anchor text: • Anchor text is often identical to the canonical name of the target article • Different anchor texts can refer to the same article, yielding alternative names, variant spellings, and related phrases • Incoming links indicate the significance of an article • Problem: taking all articles linked from a concept is ill-advised, as it pulls in much weakly related material • The authors leave deeper use of the link structure to future work
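A small sketch of harvesting anchor text, assuming link pairs have been extracted from the dump (the tuples below are made-up examples). Each target article accumulates the anchor strings that point to it, giving alternative names; the incoming-link count doubles as a simple significance signal:

```python
# Collect alternative names for each article from anchor text.
from collections import defaultdict

links = [  # (anchor_text, target_article) pairs, illustrative only
    ("Fed", "Federal Reserve"),
    ("Federal Reserve System", "Federal Reserve"),
    ("the Fed", "Federal Reserve"),
]

aliases = defaultdict(set)
for anchor_text, target_article in links:
    aliases[target_article].add(anchor_text)

# Incoming-link count as a rough measure of article significance.
significance = {article: len(anchors) for article, anchors in aliases.items()}
```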
Empirical Evaluation • Wikipedia snapshot: November 5, 2005 • 1.8 GB of text in 910,989 articles • After removing small and overly specific concepts, 171,332 articles remain • Stop words and rare words removed • Remaining words stemmed • 296,157 distinct terms represent the concepts
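A sketch of that term preprocessing, assuming NLTK's Porter stemmer; the stop list and rare-word threshold below are illustrative, since the slide does not give the exact values:

```python
# Build the concept-term vocabulary: drop stop words, stem, drop rare terms.
from collections import Counter
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is"}  # toy stop list
stemmer = PorterStemmer()

def build_vocabulary(article_texts, min_count=3):
    counts = Counter()
    for text in article_texts:
        for word in text.lower().split():
            if word not in STOP_WORDS:
                counts[stemmer.stem(word)] += 1
    # Keep only terms seen at least min_count times (rare words dropped).
    return {term for term, c in counts.items() if c >= min_count}
```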
Experimental Methodology • Datasets: • Reuters-21578 • Reuters Corpus Volume I (RCV1) • OHSUMED • 20 Newsgroups (20NG) • Movie Reviews (Movies) • Method: SVM with a linear kernel • Metrics: • Precision-recall break-even point (BEP) • Reuters and OHSUMED: micro- and macro-averaged BEP • 20NG and Movies: 4-fold cross-validation
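For reference, the break-even point is the value at which precision equals recall along the precision-recall curve. A minimal sketch of computing it with scikit-learn, approximating the crossing by the closest point on the curve (scores and labels below are toy data):

```python
# Precision-recall break-even point (BEP) on toy data.
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])

precision, recall, _ = precision_recall_curve(y_true, y_score)
i = np.argmin(np.abs(precision - recall))   # point where P and R are closest
bep = (precision[i] + recall[i]) / 2
print(f"BEP = {bep:.3f}")
```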
The Effect of Feature Generation • Feature generation is more effective on small categories, where it yields the largest improvements
Classifying Short Documents • Short documents are simulated by using only the titles of the articles for classification
Conclusion and Future Work • Feature generator: • Identifies the most relevant encyclopedia articles for a document • Creates new features from them • Adds semantics to the conventional BOW representation • Comparison with latent semantic indexing: • LSI + SVM: does not perform well • Wikipedia + SVM: improves results • Future work: applying the approach to information retrieval