Text classification Day 35 LING 681.02 Computational Linguistics Harry Howard Tulane University
Course organization • http://www.tulane.edu/~ling/NLP/ LING 681.02, Prof. Howard, Tulane University
Learning to classify text NLPP §6
Classification
• What is it?
• Supervision
  • A classifier is supervised if it is built on training corpora that contain the correct label for each input.
    • This usually means that the program can calculate an error whenever the predicted label does not match the correct label.
  • A classifier is unsupervised if it is built on training corpora that do not contain the correct label for each input.
    • There is then no way to calculate an error.
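The error that a supervised classifier can calculate is sketched below as the fraction of inputs whose predicted label differs from the correct one. The function name `error_rate` and the four labels are illustrative assumptions, not taken from the slides or from NLTK.

```python
def error_rate(predicted, gold):
    """Fraction of inputs whose predicted label differs from the correct (gold) label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("need at least one labelled input")
    mistakes = sum(1 for p, g in zip(predicted, gold) if p != g)
    return mistakes / len(gold)


# Hypothetical labels for four inputs (supervised: the gold labels are known).
gold = ["female", "male", "male", "female"]
predicted = ["female", "male", "female", "female"]
rate = error_rate(predicted, gold)  # one mismatch out of four inputs
```

Without the `gold` list, which is the unsupervised case, there is nothing to compare `predicted` against, so no such number can be computed.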
Diagram of supervised classification
Philosophical question
• Does supervised classification work for the majority of stuff that you learned spontaneously as a child?
  • NO, life does not come neatly labelled.
Algorithm
• Divide the corpus into three sets:
  • training set
  • test set
  • development (dev-test) set
• Choose an initial set of features that will be used to classify the corpus.
  • The part of the program that looks for the features in the corpus is called a feature extractor.
• Train the classifier on the training set.
• Run it on the development set.
• Refine the feature extractor based on the errors it makes on the development set, then retrain.
• Run the improved classifier on the test set.
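The steps above can be sketched in plain Python. Everything here is an assumption made for illustration: the ten-name corpus, the 60/20/20 split, the helper names, and the very simple "most frequent label per feature value" learner, which stands in for whatever classifier is actually used in class.

```python
import random
from collections import Counter, defaultdict


def split_corpus(labelled, train_frac=0.6, dev_frac=0.2, seed=0):
    """Shuffle, then cut the corpus into training, dev-test and test sets."""
    items = list(labelled)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train_frac)
    n_dev = int(len(items) * dev_frac)
    return items[:n_train], items[n_train:n_train + n_dev], items[n_train + n_dev:]


def last_letter(name):
    """Feature extractor: reduce an input to the feature the classifier sees."""
    return name[-1].lower()


def train(train_set, extract):
    """Learn the most frequent label for each feature value (a stand-in learner)."""
    counts = defaultdict(Counter)
    for item, label in train_set:
        counts[extract(item)][label] += 1
    # Fall back to the overall majority label for unseen feature values.
    default = Counter(label for _, label in train_set).most_common(1)[0][0]
    table = {feat: c.most_common(1)[0][0] for feat, c in counts.items()}
    return lambda item: table.get(extract(item), default)


def errors(classify, labelled):
    """List (input, correct label, predicted label) for every mistake."""
    return [(x, y, classify(x)) for x, y in labelled if classify(x) != y]


# Hypothetical labelled corpus.
corpus = [
    ("Anna", "female"), ("Maria", "female"), ("Sophie", "female"),
    ("Julia", "female"), ("Emma", "female"), ("Mark", "male"),
    ("Peter", "male"), ("Robert", "male"), ("Frank", "male"), ("Victor", "male"),
]

train_set, dev_set, test_set = split_corpus(corpus)
classifier = train(train_set, last_letter)
dev_errors = errors(classifier, dev_set)    # inspect these to refine the extractor
test_errors = errors(classifier, test_set)  # final check, run once at the end
```

The point of keeping `dev_set` and `test_set` apart is that the feature extractor is tuned by looking at `dev_errors`, so only the untouched test set gives an honest estimate of how the classifier does on new data.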
Choosing the right features
• Use too few, and the classifier will underfit the data.
  • The classifier is too vague and makes too many mistakes.
• Use too many, and the classifier will overfit the data.
  • The classifier is too specific and will not generalize to new examples.
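A toy illustration of the two failure modes follows. It varies how specific a single feature is rather than how many features are used, which is only an approximation of the point on the slide. The names, the data and the table-lookup learner are all hypothetical.

```python
from collections import Counter, defaultdict


def train(train_set, extract):
    """Learn the most frequent label for each feature value (a stand-in learner)."""
    counts = defaultdict(Counter)
    for name, label in train_set:
        counts[extract(name)][label] += 1
    default = Counter(label for _, label in train_set).most_common(1)[0][0]
    table = {feat: c.most_common(1)[0][0] for feat, c in counts.items()}
    return lambda name: table.get(extract(name), default)


def accuracy(classify, labelled):
    """Fraction of inputs that receive the correct label."""
    return sum(classify(name) == label for name, label in labelled) / len(labelled)


# Hypothetical training data and unseen data.
train_set = [
    ("Anna", "female"), ("Maria", "female"), ("Sophie", "female"),
    ("Julia", "female"), ("Mark", "male"), ("Peter", "male"), ("Robert", "male"),
]
new_set = [("Laura", "female"), ("Emma", "female"), ("Frank", "male"), ("Victor", "male")]

too_vague = train(train_set, lambda name: None)     # no information: always the majority label
too_specific = train(train_set, lambda name: name)  # memorizes whole names
balanced = train(train_set, lambda name: name[-1])  # last letter only
```

On this toy data `too_specific` is perfect on `train_set` but no better than `too_vague` on `new_set`, while `balanced` carries over to the unseen names.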
Example: gender id
• What would the features be?
  • A female name ends in a, e, i.
  • A male name ends in k, o, r, s, t.
• Explain how classification would work.
• NLTK code: NLPP pp. 223-4.
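One way to read the two features above is as a hand-written rule, sketched here. This is not the NLTK code from NLPP, which trains a classifier from labelled names instead of hard-coding the endings; the function names and the "unknown" fallback are assumptions.

```python
FEMALE_ENDINGS = set("aei")
MALE_ENDINGS = set("korst")


def gender_features(name):
    """Feature extractor: the only feature is the final letter of the name."""
    return {"last_letter": name[-1].lower()}


def classify(name):
    """Apply the rule from the slide; 'unknown' covers endings the rule does not mention."""
    last = gender_features(name)["last_letter"]
    if last in FEMALE_ENDINGS:
        return "female"
    if last in MALE_ENDINGS:
        return "male"
    return "unknown"


labels = [classify(n) for n in ("Maria", "Frank", "John")]
```

A trained classifier would replace the two hard-coded sets with weights estimated from the training set, and could therefore also make a guess for endings such as n.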
More examples
• Classify movie reviews as positive or negative.
  • How?
• Classify words by part of speech (POS).
  • How?
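One possible answer to both "How?" questions is to change only the feature extractor: word-presence features for a review, and suffix features for a word. The sketch below shows such extractors; the vocabulary, the suffix list and the function names are assumptions chosen for illustration.

```python
def document_features(tokens, vocabulary):
    """For a review: record which vocabulary words occur in it."""
    present = {t.lower() for t in tokens}
    return {f"contains({w})": (w in present) for w in vocabulary}


def suffix_features(word, suffixes=("ing", "ed", "ly", "s")):
    """For a word: record which common suffixes it ends in."""
    w = word.lower()
    return {f"endswith({s})": w.endswith(s) for s in suffixes}


review = "A great , moving film".split()
review_feats = document_features(review, ["great", "boring"])
word_feats = suffix_features("running")
```

Either dictionary, paired with a correct label (positive/negative, or a POS tag), would form one training example for a supervised classifier.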
Beyond the word
• Look at a word's context.
  • As we have seen, this is crucial to POS tagging.
• Classify instant messages (IMs) by the dialogue acts that they instantiate.
  • What could be some such acts?
    • statement, emotion, yes-no question
  • How?
• Recognizing textual entailment (RTE)
  • … is the task of determining whether a given piece of text T entails another text H, called the "hypothesis".
  • How?
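Looking at context amounts to letting the feature extractor see more than the item being classified. The sketch below shows a hypothetical extractor that adds the previous word to a word's own suffix, and a bag-of-words extractor for an instant message. The feature names and the `<START>` marker are assumptions.

```python
def context_features(sentence, i):
    """Features for the word at position i, including the word before it."""
    word = sentence[i]
    prev_word = sentence[i - 1].lower() if i > 0 else "<START>"
    return {"suffix(2)": word[-2:].lower(), "prev-word": prev_word}


def dialogue_act_features(post):
    """Features for an instant message: which whitespace-separated tokens it contains."""
    return {f"contains({token.lower()})": True for token in post.split()}


sentence = ["I", "can", "fly"]
feats_fly = context_features(sentence, 2)
feats_first = context_features(sentence, 0)
feats_post = dialogue_act_features("are you there ?")
```

With "prev-word" available, a classifier can learn that a word following "can" is likely to be a verb, which a suffix alone cannot tell it.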
RTE example
• T: Parviz Davudi was representing Iran at a meeting of the Shanghai Co-operation Organisation (SCO), the fledgling association that binds Russia, China and four former Soviet republics of central Asia together to fight terrorism.
• H: China is a member of SCO.
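A rough way to turn this T/H pair into features is to count how many content words of the hypothesis also occur in the text, and how many do not. The sketch below does this with a small hand-picked stopword list; the tokenization, the stopword list and the feature names are assumptions, and word overlap is only a crude proxy for entailment.

```python
import re

STOPWORDS = {"a", "an", "the", "of", "is", "was", "at", "to", "and", "that"}


def content_words(text):
    """Lower-cased alphabetic tokens, minus a few function words."""
    return {t.lower() for t in re.findall(r"[A-Za-z]+", text)} - STOPWORDS


def rte_features(text, hypothesis):
    """Count hypothesis content words found in the text, and those missing from it."""
    t_words = content_words(text)
    h_words = content_words(hypothesis)
    return {"word_overlap": len(h_words & t_words), "hyp_extra": len(h_words - t_words)}


T = ("Parviz Davudi was representing Iran at a meeting of the Shanghai "
     "Co-operation Organisation (SCO), the fledgling association that binds "
     "Russia, China and four former Soviet republics of central Asia together "
     "to fight terrorism.")
H = "China is a member of SCO."

features = rte_features(T, H)
```

Here "china" and "sco" are shared, while "member" appears only in H. The entailment holds because "the association that binds Russia, China ..." implies membership, which is exactly the kind of inference that overlap counts cannot capture on their own.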
Next time
• Finish NLPP §6
• Go on to NLPP §7, Extracting info from text