Text classification: In Search of a Representation Stan Matwin School of Information Technology and Engineering University of Ottawa stan@site.uottawa.ca
Outline • Supervised learning=classification • ML/DM at U of O • Classical approach • Attempt at a linguistic representation • N-grams – how to get them? • Labelling and co-learning • Next steps?…
Supervised learning (classification) Given: • a set of training instances T = {&lt;e, c&gt;}, where each example e is labelled with a class c, one of the classes C1,…,Ck • a concept with k classes C1,…,Ck (but the definition of the concept is NOT known) Find: • a description of each class that performs well in determining (predicting) class membership for unseen instances
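The setting above can be sketched with a toy classifier; the 1-nearest-neighbour rule and the data below are illustrative stand-ins, not the methods from the talk.

```python
# A minimal sketch of the supervised-learning setting: training instances
# are (example, class-label) pairs, and the learned "description" here is
# simply a 1-nearest-neighbour rule. All data below is made up.

def predict_1nn(training, x):
    """Predict the class of an unseen instance x from labelled pairs."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(training, key=lambda ec: dist(ec[0], x))
    return label

T = [((0.0, 0.1), "C1"), ((0.2, 0.0), "C1"), ((1.0, 0.9), "C2")]
print(predict_1nn(T, (0.9, 1.0)))  # an unseen instance
```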
Classification • Prevalent practice: examples are represented as vectors of attribute values • Theoretical wisdom, confirmed empirically: the more training examples, the better the predictive accuracy
ML/DM at U of O • Learning from imbalanced classes: applications in remote sensing • A relational, rather than propositional, representation: learning the maintainability concept • Learning in the presence of background knowledge: Bayesian belief networks and how to get them; application to distributed databases
Why text classification? • Automatic file saving • Internet filters • Recommenders • Information extraction • …
Text classification: standard approach (bag of words) • Remove stop words and markup • All remaining words become attributes • A document becomes a vector of &lt;word, frequency&gt; pairs • Train a boolean classifier for each class • Evaluate the results on an unseen sample
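The first steps of this pipeline can be sketched in a few lines; the stop-word list below is a tiny illustrative one, not the list used in the experiments.

```python
# A toy sketch of the bag-of-words representation: strip stop words, then
# represent the document as <word, frequency> pairs.
from collections import Counter
import re

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # illustrative only

def bag_of_words(document):
    words = re.findall(r"[a-z']+", document.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

doc = "The cat sat in the hat, and the cat slept."
print(bag_of_words(doc)["cat"])  # frequency of "cat"
```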
Text classification: tools • RIPPER: a “covering” learner; works well with large sets of binary features • Naïve Bayes: efficient (no search), simple to program, gives a “degree of belief”
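A compact multinomial naive Bayes over bags of words shows why the slide calls it efficient (one counting pass, no search) and why it yields a "degree of belief" (a posterior probability). The two-document corpus is invented for illustration.

```python
# Multinomial naive Bayes with Laplace smoothing: a sketch, not the
# implementation used in the talk. Toy data only.
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (word-list, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for words, label in docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return word_counts, class_counts, vocab

def posterior(model, words):
    """Degree of belief: P(class | words), Laplace-smoothed."""
    word_counts, class_counts, vocab = model
    n = sum(class_counts.values())
    log_scores = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        logp = math.log(class_counts[c] / n)
        for w in words:
            # Smoothing keeps unseen words from zeroing the product.
            logp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        log_scores[c] = logp
    m = max(log_scores.values())
    z = sum(math.exp(s - m) for s in log_scores.values())
    return {c: math.exp(s - m) / z for c, s in log_scores.items()}

docs = [(["ball", "goal", "score"], "sport"),
        (["vote", "party", "goal"], "politics")]
probs = posterior(train_nb(docs), ["goal", "score"])
print(max(probs, key=probs.get))
```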
“Prior art” • Yang: best results using k-NN: 82.3% micro-averaged accuracy • Joachims: results using a Support Vector Machine + unlabelled data • SVM is insensitive to high dimensionality and to sparseness of examples
SVM in text classification • SVM: training with 17 examples in the 10 most frequent categories gives a test performance of 60% on 3000+ test cases • Transductive SVM: maximizes the separation margin on the test set, which is available during training
Proposed solution (Sam Scott) • Get noun phrases and/or key phrases (Extractor) and add to the feature list • Add hypernyms
Evaluation (Lewis) • Vary the “loss ratio” parameter • For each parameter value • Learn a hypothesis for each class (binary classification) • Micro-average the confusion matrices (add component-wise) • Compute precision and recall • Interpolate (or extrapolate) to find the point where micro-averaged precision and recall are equal
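The micro-averaging step above can be sketched directly: per-class binary confusion matrices are summed component-wise, and precision and recall are computed from the totals. The counts below are invented.

```python
# Lewis-style micro-averaging: sum per-class confusion matrices, then
# compute precision and recall from the pooled counts.

def micro_average(confusions):
    """confusions: one (tp, fp, fn) triple per binary class."""
    tp = sum(c[0] for c in confusions)
    fp = sum(c[1] for c in confusions)
    fn = sum(c[2] for c in confusions)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Two classes, each as (tp, fp, fn):
p, r = micro_average([(50, 10, 5), (30, 5, 15)])
print(round(p, 3), round(r, 3))
```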
Results • No gain over the bag of words (BW) in the alternative representations • But… comprehensibility…
Combining classifiers • Comparable to the best known results (Yang)
Other possibilities • Using hypernyms with a small training set (avoids ambiguous words) • Use Bayes + RIPPER in a cascade scheme (Gama) • Other representations:
Collocations • Do not need to be noun phrases, just pairs of words, possibly separated by stop words • Only the well-discriminating ones are chosen • These are added to the bag of words, and… • RIPPER is run on the result
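The collocation idea can be sketched as pairs of content words that co-occur within a small window (stop words may sit between them), kept only if frequent enough. The window size and count threshold below are arbitrary choices, not the ones used in the talk.

```python
# A rough collocation extractor: adjacent content words within a window,
# pruned by a minimum count. Thresholds are illustrative.
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}

def collocations(tokens, window=3, min_count=2):
    content = [(i, w) for i, w in enumerate(tokens) if w not in STOP_WORDS]
    pairs = Counter()
    for (i, w1), (j, w2) in zip(content, content[1:]):
        if j - i <= window:          # allow stop words in between
            pairs[(w1, w2)] += 1
    return {p for p, n in pairs.items() if n >= min_count}

text = "machine learning is fun and machine learning is hard".split()
print(collocations(text))
```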
N-grams • N-grams are substrings of a given length • Good results on Reuters [Mladenic, Grobelnik] with Bayes; we try RIPPER • A different task: classifying text files (attachments, audio/video, encoded files) • From n-grams to relational features
How to get good n-grams? • We use Ziv-Lempel for frequent substring detection (.gz!) • [diagram: Ziv-Lempel parse of the string abababa]
N-grams • Counting • Pruning: discard substrings whose occurrence ratio falls below an acceptance threshold • Building relations: string A almost always precedes string B • Feeding into a relational learner (FOIL)
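The counting, pruning, and relation-building steps above can be sketched as follows; the occurrence-ratio threshold and the "precedes" test are simplified illustrations of the idea, not the exact definitions used with FOIL.

```python
# Sketch of the n-gram pipeline: count character n-grams, prune by an
# occurrence-ratio threshold, and derive a simple precedence relation.
from collections import Counter

def ngrams(s, n):
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def frequent_ngrams(text, n, threshold):
    counts = Counter(ngrams(text, n))
    total = sum(counts.values())
    # Pruning: keep n-grams whose occurrence ratio meets the threshold.
    return {g for g, c in counts.items() if c / total >= threshold}

def precedes(text, a, b):
    """True if every occurrence of a comes before the first b."""
    return a in text and b in text and text.rindex(a) < text.index(b)

text = "abababacdcd"
print(frequent_ngrams(text, 2, 0.2))
print(precedes(text, "ab", "cd"))
```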
Using grammar induction (text files) • Idea: detect patterns of substrings • Patterns are regular languages • Methods of automata induction: a recognizer for each class of files • We use a modified version of RPNI2 [Dupont, Miclet]
What’s new… • Work with marked up text (Word, Web) • XML with semantic tags: mixed blessing for DM/TM • Co-learning • Text mining
Co-learning • How to use unlabelled data? Or: how to limit the number of examples that need to be labelled? • Two classifiers and two redundantly sufficient representations • Train both, run both on the unlabelled set, • and add each one’s best predictions to the training set
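The loop just described can be sketched schematically. The threshold "learners" below are trivial stand-ins chosen only to make the loop runnable; the two tuple components play the role of the two redundantly sufficient views.

```python
# A schematic co-training loop: two learners, one per view, each add
# their most confident predictions on the unlabelled pool to the shared
# training set. Learners and data below are invented stand-ins.

def co_train(labelled, unlabelled, train, predict, rounds=2, per_round=1):
    labelled = list(labelled)
    pool = list(unlabelled)
    for _ in range(rounds):
        if not pool:
            break
        for view in (0, 1):
            model = train([(x[view], y) for x, y in labelled])
            # Rank pool items by this learner's confidence on its view.
            scored = sorted(pool, key=lambda x: -predict(model, x[view])[1])
            for x in scored[:per_round]:
                label, _ = predict(model, x[view])
                labelled.append((x, label))
                pool.remove(x)
    return labelled

def train(pairs):
    # Stand-in learner: threshold at the midpoint of the class means.
    pos = [v for v, y in pairs if y == "pos"]
    neg = [v for v, y in pairs if y == "neg"]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict(threshold, v):
    # Returns (label, confidence).
    return ("pos" if v > threshold else "neg", abs(v - threshold))

labelled = [((2.0, 2.1), "pos"), ((-2.0, -1.9), "neg")]
unlabelled = [(1.5, 1.4), (-1.6, -1.5)]
result = co_train(labelled, unlabelled, train, predict)
print(len(result))  # training set has grown
```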
Co-learning • The training set grows as… • …each learner predicts independently, thanks to redundant sufficiency (different representations) • Would it also work with our learners if we used Bayes? • Would work for classifying emails
Co-learning • Mitchell experimented with the task of classifying web pages (profs, students, courses, projects), a supervised learning task • Used two views: anchor text and page contents • Error rate halved (from 11% to 5%)
Cog-sci? • Co-learning seems to be cognitively justified • Model: students learning in groups (pairs) • What other social learning mechanisms could provide models for supervised learning?
Conclusion • A practical task, needs a solution • No satisfactory solution so far • Fruitful ground for research