CSA3180: Natural Language Processing

CSA3180: Natural Language Processing Text Processing 3 – Double Lecture Discovering Word Associations Text Classification TF.IDF Clustering/Data Mining Linear and Non-Linear Classification Binary Classification Multi-Class Classification CSA3180: Text Processing III

Introduction • Slides partly based on Lectures by Barbara Rosario and Preslav Nakov • Classification • Text categorization (and other applications) • Various issues regarding classification • Clustering vs. classification, binary vs. multi-way, flat vs. hierarchical classification… • Introduce the steps necessary for a classification task • Define classes • Label text • Features • Training and evaluation of a classifier CSA3180: Text Processing III

Classification Goal: Assign ‘objects’ from a universe to two or more classes or categories Problem Object Categories Tagging Word POS Sense Disambiguation Word The word’s senses Information retrieval Document Relevant/not relevant Sentiment classification Document Positive/negative Author identification Document Authors Language identification Document Languages Text Classification Document Topics CSA3180: Text Processing III

Author Identification • They agreed that Mrs. X should only hear of the departure of the family, without being alarmed on the score of the gentleman's conduct; but even this partial communication gave her a great deal of concern, and she bewailed it as exceedingly unlucky that the ladies should happen to go away, just as they were all getting so intimate together. • Gas looming through the fog in divers places in the streets, much as the sun may, from the spongey fields, be seen to loom by husbandman and ploughboy. Most of the shops lighted two hours before their time--as the gas seems to know, for it has a haggard and unwilling look. The raw afternoon is rawest, and the dense fog is densest, and the muddy streets are muddiest near that leaden-headed old obstruction, appropriate ornament for the threshold of a leaden-headed old corporation, Temple Bar. CSA3180: Text Processing III

Author Identification • Jane Austen (1775-1817), Pride and Prejudice • Charles Dickens (1812-70), Bleak House CSA3180: Text Processing III

Author Identification • Federalist papers • 77 short essays written in 1787-1788 by Hamilton, Jay and Madison to persuade NY to ratify the US Constitution; published under a pseudonym • The authorships of 12 papers was in dispute (disputed papers) • In 1964 Mosteller and Wallace* solved the problem • They identified 70 function words as good candidates for authorships analysis • Using statistical inference they concluded the author was Madison CSA3180: Text Processing III

Author IdentificationFunction Words CSA3180: Text Processing III

Author Identification CSA3180: Text Processing III

Language Identification • Tutti gli esseri umani nascono liberi ed eguali in dignità e diritti. Essi sono dotati di ragione e di coscienza e devono agire gli uni verso gli altri in spirito di fratellanza. • Alle Menschen sind frei und gleich an Würde und Rechten geboren. Sie sind mit Vernunft und Gewissen begabt und sollen einander im Geist der Brüderlichkeit begegnen. Universal Declaration of Human Rights, UN, in 363 languages CSA3180: Text Processing III

Language Identification • égaux - French • eguali - Italian • iguales - Spanish • edistämään - Finnish • għ - Maltese CSA3180: Text Processing III

Text Classification • http://news.google.com/ • Reuters • Collection of (21,578) newswire documents. • For research purposes: a standard text collection to compare systems and algorithms • 135 valid topics categories CSA3180: Text Processing III

Reuters Newswire Corpus CSA3180: Text Processing III

Reuters Sample <REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798"> <DATE> 2-MAR-1987 16:51:43.42</DATE> <TOPICS><D>livestock</D><D>hog</D></TOPICS> <TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE> <DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC. Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said. A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter </BODY></TEXT></REUTERS> CSA3180: Text Processing III

Classification vs. Clustering • Classification assumes labeled data: we know how many classes there are and we have examples for each class (labeled data). • Classification is supervised • In Clustering we don’t have labeled data; we just assume that there is a natural division in the data and we may not know how many divisions (clusters) there are • Clustering is unsupervised CSA3180: Text Processing III

Classification Class1 Class2 CSA3180: Text Processing III

Clustering CSA3180: Text Processing III

Categories (Labels, Classes) • Labeling data • 2 problems: • Decide the possible classes (which ones, how many) • Domain and application dependent • http://news.google.com • Label text • Difficult, time consuming, inconsistency between annotators CSA3180: Text Processing III

Reuters • <REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798"> • <DATE> 2-MAR-1987 16:51:43.42</DATE> • <TOPICS><D>livestock</D><D>hog</D></TOPICS> • <TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE> • <DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off • tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC. • Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said. • A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter • </BODY></TEXT></REUTERS> Why not topic = policy ? CSA3180: Text Processing III

Binary vs. Multi-Class Classification • Binary classification: two classes • Multi-way classification: more than two classes • Sometime it can be convenient to treat a multi-way problem like a binary one: one class versus all the others, for all classes CSA3180: Text Processing III

Flat vs. Hierarchical Classification • Flat classification: relations between the classes undetermined • Hierarchical classification: hierarchy where each node is the sub-class of its parent’s node CSA3180: Text Processing III

Single vs. Multi-Category Classification • In single-category text classification each text belongs to exactly one category • In multi-category text classification, each text can have zero or more categories CSA3180: Text Processing III

LabeledText class in NLTK • LabeledText class • >>> text = "Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix." • >>> label = “sport” • >>> labeled_text = LabeledText(text, label) • >>> labeled_text.text() • “Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix.” • >>> labeled_text.label() • “sport” CSA3180: Text Processing III

NLTK Classifier Interface • classify determines which label is most appropriate for a given text token, and returns a labeled text token with that label. • labels returns the list of category labels that are used by the classifier. • >>> token = Token(“The World Health Organization is recommending more importance be attached to the prevention of heart disease and other cardiovascular ailments rather than focusing on treatment.”) • >>> my_classifier.classify(token) “The World Health Organization is recommending more importance be attached to the prevention of heart disease and other cardiovascular ailments rather than focusing on treatment.”/ health >>> my_classifier.labels() • ("sport", "health", "world",…) CSA3180: Text Processing III

Features • >>> text = "Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix." • >>> label = “sport” • >>> labeled_text = LabeledText(text, label) • Here the classification takes as input the whole string • What’s the problem with that? • What are the features that could be useful for this example? CSA3180: Text Processing III

Features • Feature: An aspect of the text that is relevant to the task • Some typical features • Words present in text • Frequency of words • Capitalization • Are there NE? • WordNet • Others? CSA3180: Text Processing III

Features • Feature: An aspect of the text that is relevant to the task • Feature value: the realization of the feature in the text • Words present in text : Kerry, Schumacher, China… • Frequency of word: Kerry(10), Schumacher(1)… • Are there dates? Yes/no • Are there PERSONS? Yes/no • Are there ORGANIZATIONS? Yes/no • WordNet: Holonyms (China is part of Asia), Synonyms(China, People's Republic of China, mainland China) CSA3180: Text Processing III

Features • Boolean (or Binary) Features • Features that generate boolean (binary) values. • Boolean features are the simplest and the most common type of feature. • f1(text) = 1 if text contain “Kerry” 0 otherwise • f2(text) = 1 if text contain PERSON 0 otherwise CSA3180: Text Processing III

Features • Integer Features • Features that generate integer values. • Integer features can be used to give classifiers access to more precise information about the text. • f1(text) = Number of times text contains “Kerry” • f2(text) = Number of times text contains PERSON CSA3180: Text Processing III

Features in NLTK • Feature Detectors • Features can be defined using feature detector functions, which map LabeledTexts to values • Method: detect, which takes a labeled text, and returns a feature value. • >>> def ball(ltext): return (“ball” in ltext.text()) • >>> fdetector = FunctionFeatureDetector(ball) • >>> document1 = "John threw the ball over the fence".split() • >>> fdetector.detect(LabeledText(document1) 1 • >>> document2 = "Mary solved the equation".split() • >>> fdetector.detect(LabeledText(document2) 0 CSA3180: Text Processing III

Features • Linguistic features • Words • lowercase? (should we convert to?) • normalized? (e.g. “texts”  “text”) • Phrases • Word-level n-grams • Character-level n-grams • Punctuation • Part of Speech • Non-linguistic features • document formatting • informative character sequences (e.g. &lt) CSA3180: Text Processing III

When do we need Feature Selection? • If the algorithm cannot handle all possible features • e.g. language identification for 100 languages using all words • text classification using n-grams • Good features can result in higher accuracy • But! Why feature selection? • What if we just keep all features? • Even the unreliable features can be helpful. • But we need to weight them: • In the extreme case, the bad features can have a weight of 0 (or very close), which is… a form of feature selection! CSA3180: Text Processing III

Why do we need Feature Selection? • Not all features are equally good! • Bad features: best to remove • Infrequent • unlikely to be be met again • co-occurrence with a class can be due to chance • Too frequent • mostly function words • Uniform across all categories • Good features: should be kept • Co-occur with a particular category • Do not co-occur with other categories • The rest: good to keep CSA3180: Text Processing III

What types of Feature Selection? • Feature selection reduces the number of features • Usually: • Eliminating features • Weighting features • Normalizing features • Sometimes by transforming parameters • e.g. Latent Semantic Indexing using Singular Value Decomposition • Method may depend on problem type • For classification and filtering, may use information from example documents to guide selection CSA3180: Text Processing III

What types of Feature Selection? • Task independent methods • Document Frequency (DF) • Term Strength (TS) • Task-dependent methods • Information Gain (IG) • Mutual Information (MI) • 2 statistic (CHI) Empirically compared by Yang & Pedersen (1997) CSA3180: Text Processing III

Document Frequency (DF) What about the frequent terms? DF: number of documents a term appears in • Based on Zipf’s Law • Remove the rare terms: (met 1-2 times) • Non-informative • Unreliable – can be just noise • Not influential in the final decision • Unlikely to appear in new documents • Plus • Easy to compute • Task independent: do not need to know the classes • Minus • Ad hoc criterion • Rare terms can be good discriminators (e.g., in IR) What is a “rare” term? CSA3180: Text Processing III

Stop Word Removal • Common words from a predefined list • Mostly from closed-class categories: • unlikely to have a new word added • include: auxiliaries, conjunctions, determiners, prepositions, pronouns, articles • But also some open-class words like numerals • Bad discriminators • uniformly spread across all classes • can be safely removed from the vocabulary • Is this always a good idea? (e.g. author identification) CSA3180: Text Processing III

2 statistic (CHI) • 2 statistic (pronounced “kai square”) • The most commonly used method of comparing proportions. • Checks whether there is a relationship between being in one of two groups and a characteristic under study. • Example: Let us measure the dependency between a term t and a category c. • the groups would be: • 1) the documents from a category ci • 2) all other documents • the characteristic would be: • “document contains term t” CSA3180: Text Processing III

2 statistic (CHI) Is “jaguar” a good predictor for the “auto” class? We want to compare: • the observed distribution above; and • null hypothesis: that jaguar and auto are independent CSA3180: Text Processing III

2 statistic (CHI) Under the null hypothesis: (jaguar and auto – independent): How many co-occurrences of jaguar and auto do we expect? • We would have: Pr(j,a) = Pr(j) Pr(a) • So, there would be: N  Pr(j,a), i.e. N Pr(j) Pr(a) • Pr(j) = (2+3)/N; Pr(a) = (2+500)/N; N=2+3+500+9500 • Which is: N(5/N)(502/N)=2510/N=2510/10005  0.25 CSA3180: Text Processing III

2 statistic (CHI) Under the null hypothesis: (jaguar and auto – independent): How many co-occurrences of jaguar and auto do we expect? • We would have: Pr(j,a) = Pr(j) Pr(a) • So, there would be: N  Pr(j,a), i.e. N Pr(j) Pr(a) • Pr(j) = (2+3)/N; Pr(a) = (2+500)/N; N=2+3+500+9500 • Which is: N(5/N)(502/N)=2510/N=2510/10005  0.25 expected: fe observed: fo CSA3180: Text Processing III

2 statistic (CHI) 2 is interested in(fo– fe)2/fe summed over all table entries: The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the value for .999 confidence). expected: fe observed: fo CSA3180: Text Processing III

2 statistic (CHI) How to use 2 for multiple categories? Compute 2 for each category and then combine: • we can require to discriminate well across all categories, then we need to take the expected value of 2: or to discriminate well for a single category, then we take the maximum: CSA3180: Text Processing III

2 statistic (CHI) Pros • normalized and thus comparable across terms • 2(t,c) is 0, when t and c are independent • can be compared to 2 distribution, 1 degree of freedom • Cons • unreliable for low frequency terms • computationally expensive CSA3180: Text Processing III

CSA3180: Natural Language Processing