Importance of Semantic Representation: Dataless Classification
Ming-Wei Chang, Lev Ratinov, Dan Roth, Vivek Srikumar
University of Illinois, Urbana-Champaign
Text Categorization
Classify the following sentence:
Syd Millar was the chairman of the International Rugby Board in 2003.
Pick a label: Class1 vs. Class2
• Traditionally, we need annotated data to train a classifier
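For contrast, here is a minimal sketch of that traditional route (scikit-learn with Naïve Bayes, the supervised baseline that appears later in this talk); the two toy training documents are fabricated for illustration:

```python
# A minimal sketch of the traditional supervised route: nothing can
# be classified until annotated training examples exist. The two toy
# training documents below are fabricated for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["the striker scored in the final minute",
              "the central bank raised interest rates"]
train_labels = ["Class1", "Class2"]   # opaque label identifiers, as on the slide

vec = CountVectorizer().fit(train_docs)
clf = MultinomialNB().fit(vec.transform(train_docs), train_labels)

test = "Syd Millar was the chairman of the International Rugby Board in 2003."
print(clf.predict(vec.transform([test])))
```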
Text Categorization
• Humans don’t seem to need labeled data
Syd Millar was the chairman of the International Rugby Board in 2003.
Pick a label: Sports vs. Finance
Label names carry a lot of information!
Text Categorization
Do we really always need labeled data?
Contributions
• We can often go quite far without annotated data
• … if we “know” the meaning of the text
• This works for text categorization
• … and is consistent across different domains
Outline
• Semantic Representation
• On-the-fly Classification
• Datasets
• Exploiting unlabeled data
• Robustness to different domains
Semantic Representation
• One common representation is the Bag of Words
• All text is a vector in the space of words
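A toy sketch of the word space (the example text is from the earlier slide; everything else is illustrative):

```python
# Bag-of-words sketch: a text becomes a vector of word counts in the
# space of all words. Toy example for illustration only.
from collections import Counter

def bow(text):
    return Counter(text.lower().split())

print(bow("the chairman of the International Rugby Board"))
# Counter({'the': 2, 'chairman': 1, 'of': 1, 'international': 1,
#          'rugby': 1, 'board': 1})
```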
Semantic Representation
• Explicit Semantic Analysis [Gabrilovich & Markovitch, 2006, 2007]
• Text is a vector in the space of concepts
• Concepts are defined by Wikipedia articles
Explicit Semantic Analysis: Example
Concepts are Wikipedia article titles.
“Monetary Policy” → International Monetary Fund, Monetary policy, Economic and Monetary Union, Hong Kong Monetary Authority, Monetarism, Central bank
“Apple IPod” → IPod mini, IPod photo, IPod nano, Apple Computer, IPod shuffle, ITunes
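A minimal sketch of the core ESA step, assuming a precomputed inverted index from words to weighted Wikipedia articles; the tiny index and its weights below are fabricated for illustration (a real one is built from TF-IDF scores of words over all of Wikipedia):

```python
# ESA sketch: a text's concept vector is the sum of its words'
# concept vectors, where each word maps to the Wikipedia articles
# it is prominent in. The tiny index and weights are fabricated.
from collections import defaultdict

inverted_index = {   # word -> {Wikipedia article title: TF-IDF weight}
    "monetary": {"Monetary policy": 2.1, "International Monetary Fund": 1.8,
                 "Central bank": 1.2},
    "policy":   {"Monetary policy": 1.5, "Monetarism": 0.9},
    "ipod":     {"IPod mini": 2.3, "IPod nano": 2.0, "Apple Computer": 1.4},
}

def esa_vector(text):
    concepts = defaultdict(float)
    for word in text.lower().split():
        for concept, weight in inverted_index.get(word, {}).items():
            concepts[concept] += weight
    return dict(concepts)

print(esa_vector("Monetary Policy"))
# strongest activation on "Monetary policy", then IMF, Central bank, Monetarism
```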
Semantic Representation
• Two semantic representations: Bag of Words and ESA
Traditional Text Categorization
A labeled corpus (documents annotated as Sports or Finance) is mapped into the semantic space, and a classifier is trained on it.
Dataless Classification
There is no labeled corpus, only the label names: Sports and Finance. What can we do using just the labels?
Dataless Classification
A new unlabeled document and the labels (Sports, Finance) are mapped into the same semantic space.
What is Dataless Classification?
• Humans don’t need training for classification
• Annotated training data is not always needed
• Look for the meaning of words
On-the-fly Classification
The new unlabeled document is compared to the labels (Sports, Finance) directly in the semantic space.
On-the-fly Classification
• No training data needed
• We know the meaning of the label names
• Pick the label that is closest in meaning to the document (nearest neighbors)
On-the-fly Classification
The same works for new labels (Hockey, Baseball): a new unlabeled document is matched against them in the semantic space.
On-the-fly Classification
• No need to even know the labels beforehand
• Compare with traditional classification, which needs annotated training data for each label
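A hedged sketch of the whole on-the-fly classifier; `represent` stands in for the chosen semantic mapping (a bare bag of words here, ESA in the stronger variant), and the document and label phrases are illustrative:

```python
# On-the-fly classification sketch: no training step at all.
# `represent` stands in for the chosen semantic mapping (a bare
# bag of words here; ESA in the stronger variant).
import math
from collections import Counter

def represent(text):
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def on_the_fly(document, label_names):
    doc = represent(document)
    # nearest neighbor: the label whose name is closest in meaning wins
    return max(label_names, key=lambda name: cosine(doc, represent(name)))

doc = "Syd Millar was the chairman of the International Rugby Board in 2003."
print(on_the_fly(doc, ["sports rugby", "finance banking"]))  # -> "sports rugby"
```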
Dataset 1: Twenty Newsgroups
• Posts to newsgroups
• Newsgroups have descriptive names:
sci.electronics = Science, Electronics
rec.motorcycles = Motorcycles
Dataset 2: Yahoo! Answers
• Posts to Yahoo! Answers
• Posts are categorized into a two-level hierarchy: 20 top-level categories, 280 categories in total at the second level
Arts and Humanities → Theater Acting
Sports → Rugby League
Experiments
• 20 Newsgroups: 10 binary problems (from [Raina et al., ’06]), e.g.
Religion vs. Politics.guns
Motorcycles vs. MS Windows
• Yahoo! Answers: 20 binary problems, e.g.
Health → Diet Fitness vs. Health → Allergies
Consumer Electronics → DVRs vs. Pets → Rodents
Results: On-the-fly Classification
Two classifiers compared:
• Naïve Bayes: uses annotated data, ignores the label names
• Nearest neighbors: uses the label names, no annotated data
Using Unlabeled Data
• Knowing the data collection helps: we can learn specific biases of the dataset
• Potential for semi-supervised learning
Bootstrapping
• Each label name is a “labeled” document: one “example” in word or concept space
• Train an initial classifier (same as the on-the-fly classifier)
• Loop:
  • Classify all documents with the current classifier
  • Retrain the classifier on its highly confident predictions
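A hedged sketch of this loop in scikit-learn; the toy pool, the three rounds, and the 0.6 confidence cutoff are illustrative choices, not the paper's settings:

```python
# Hedged sketch of the bootstrapping loop: the label names are the
# only initial "labeled" examples; each round, the classifier labels
# the unlabeled pool and is retrained on its most confident
# predictions. Pool, rounds, and the 0.6 cutoff are illustrative.
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

label_names = ["sports game team", "finance money bank"]
pool = ["the team won the game last night",
        "the bank cut rates to lend more money",
        "a great game by the home team",
        "investors moved money into bonds"]

vec = TfidfVectorizer().fit(label_names + pool)
X_seed = vec.transform(label_names)            # one "example" per label
y_seed = np.arange(len(label_names))
X_pool = vec.transform(pool)

clf = MultinomialNB().fit(X_seed, y_seed)      # the on-the-fly starting point
for _ in range(3):                             # a few bootstrapping rounds
    proba = clf.predict_proba(X_pool)
    keep = proba.max(axis=1) > 0.6             # "highly confident" predictions
    clf = MultinomialNB().fit(
        vstack([X_seed, X_pool[keep]]),
        np.concatenate([y_seed, proba.argmax(axis=1)[keep]]))
print(clf.predict(X_pool))                     # pseudo-labels after bootstrapping
```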
Co-training [Blum & Mitchell ’98]
• Words and concepts are two independent “views” of the text
• Each view is a teacher for the other
Co-training
• Train initial classifiers in word space and concept space
• Loop:
  • Classify documents with the current classifiers
  • Retrain with the highly confident predictions of both classifiers
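A hedged sketch of the loop; since the true second view would need a full ESA index, a character n-gram view stands in for it here, and the seeds, pool, rounds, and cutoff are all illustrative:

```python
# Hedged co-training sketch. In the talk the two views are the word
# space and the ESA concept space; here a word-level and a character
# n-gram TF-IDF view stand in so the example is self-contained.
# Seeds, pool, round count, and the 0.6 cutoff are illustrative.
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

label_names = ["sports game team", "finance money bank"]
pool = ["the team won the game last night",
        "the bank cut rates to lend more money"]

views = [TfidfVectorizer().fit(label_names + pool),            # word view
         TfidfVectorizer(analyzer="char_wb",                   # stand-in for
                         ngram_range=(3, 4)).fit(label_names + pool)]  # concepts
y_seed = np.arange(len(label_names))
seeds = [v.transform(label_names) for v in views]
pools = [v.transform(pool) for v in views]
clfs = [MultinomialNB().fit(s, y_seed) for s in seeds]

for _ in range(3):                             # a few co-training rounds
    probas = [c.predict_proba(p) for c, p in zip(clfs, pools)]
    # keep pool documents that both views label confidently and identically
    keep = ((np.minimum(probas[0].max(axis=1), probas[1].max(axis=1)) > 0.6)
            & (probas[0].argmax(axis=1) == probas[1].argmax(axis=1)))
    pseudo = probas[0].argmax(axis=1)[keep]
    clfs = [MultinomialNB().fit(vstack([s, p[keep]]),
                                np.concatenate([y_seed, pseudo]))
            for s, p in zip(seeds, pools)]
```

One simplification worth flagging: in Blum & Mitchell's formulation each view labels examples for the other, whereas this sketch adds documents that both views confidently agree on to both training sets.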
Using Unlabeled Data
Three approaches:
• Bootstrapping with labels, using Bag of Words
• Bootstrapping with labels, using ESA
• Co-training
More Results
• Co-training using just the label names does as well as supervised learning with 100 labeled examples
• No annotated data needed
Domain Adaptation
• Classifiers are trained on one domain and tested on another
• Performance usually decreases across domains
But the label names are the same
• Label names don’t depend on the domain
• Label names are robust across domains
• On-the-fly classifiers are domain independent
Example: Baseball vs. Hockey
Conclusion
• Sometimes, label names tell us more about a class than annotated examples do
• The standard learning practice of treating labels as unique identifiers loses this information
• The right semantic representation helps
• What is the right one?