SIMS 290-2: Applied Natural Language Processing Preslav Nakov Sept 29, 2004
Today • Feature selection • TF.IDF Term Weighting • Term Normalization
Features for Text Categorization • Linguistic features • Words • lowercase? (should we convert to?) • normalized? (e.g. “texts” → “text”) • Phrases • Word-level n-grams • Character-level n-grams • Punctuation • Part of Speech • Non-linguistic features • document formatting • informative character sequences (e.g. <)
When Do We Need Feature Selection? • If the algorithm cannot handle all possible features • e.g. language identification for 100 languages using all words • text classification using n-grams • Good features can result in higher accuracy • But! Why feature selection? • What if we just keep all features? • Even the unreliable features can be helpful. • But we need to weight them: • In the extreme case, the bad features can have a weight of 0 (or very close), which is… a form of feature selection!
Why Feature Selection? • Not all features are equally good! • Bad features: best to remove • Infrequent • unlikely to be met again • co-occurrence with a class can be due to chance • Too frequent • mostly function words • Uniform across all categories • Good features: should be kept • Co-occur with a particular category • Do not co-occur with other categories • The rest: good to keep
Types Of Feature Selection? • Feature selection reduces the number of features • Usually: • Eliminating features • Weighting features • Normalizing features • Sometimes by transforming parameters • e.g. Latent Semantic Indexing using Singular Value Decomposition • Method may depend on problem type • For classification and filtering, may use information from example documents to guide selection
Feature Selection • Task-independent methods • Document Frequency (DF) • Term Strength (TS) • Task-dependent methods • Information Gain (IG) • Mutual Information (MI) • χ² statistic (CHI) • Empirically compared by Yang & Pedersen (1997)
Yang & Pedersen Experiments • Compared feature selection methods for text categorization • 5 feature selection methods: • DF, MI, CHI, (IG, TS) • Features were just words • 2 classifiers: • kNN: k-Nearest Neighbor (to be covered next week) • LLSF: Linear Least Squares Fit • 2 data collections: • Reuters-22173 • OHSUMED: subset of MEDLINE (1990 & 1991 used)
Document Frequency (DF) DF: number of documents a term appears in • Based on Zipf’s Law • Remove the rare terms: (met 1-2 times) • Non-informative • Unreliable – can be just noise • Not influential in the final decision • Unlikely to appear in new documents • Plus • Easy to compute • Task independent: do not need to know the classes • Minus • Ad hoc criterion • Rare terms can be good discriminators (e.g., in IR) What about the frequent terms? What is a “rare” term?
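The DF criterion on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming documents are already tokenized into lists of words; the function names and the `min_df` threshold are hypothetical choices, not something the slides prescribe.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each term, the number of documents it appears in."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): count each term once per document
    return df

def prune_rare_terms(docs, min_df=2):
    """Return the terms that appear in at least `min_df` documents."""
    df = document_frequency(docs)
    return {term for term, n in df.items() if n >= min_df}

# Toy collection (made up for illustration).
docs = [
    ["jaguar", "car", "engine"],
    ["jaguar", "cat", "jungle"],
    ["car", "engine", "engine", "repair"],
]
df = document_frequency(docs)
vocab = prune_rare_terms(docs, min_df=2)
print(sorted(vocab))
```

Note that "engine" occurs three times in the collection but has DF 2, because DF counts documents rather than occurrences. No class labels are needed, which is what makes the method task independent.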
Examples of Frequent Words: Most Frequent Words in Brown Corpus (table not preserved)
Stop Word Removal • Common words from a predefined list • Mostly from closed-class categories: • unlikely to have a new word added • include: auxiliaries, conjunctions, determiners, prepositions, pronouns, articles • But also some open-class words like numerals • Bad discriminators • uniformly spread across all classes • can be safely removed from the vocabulary • Is this always a good idea? (e.g. author identification)
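Stop word removal against a predefined list might look like the sketch below. The ten-word `STOP_WORDS` set is a tiny made-up sample for illustration; real lists are much longer, and the case-insensitive match is an assumption.

```python
# Hypothetical, deliberately tiny stop list; real lists have hundreds of entries.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercased form is in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "The jaguar is the fastest cat in the jungle".split()
filtered = remove_stop_words(tokens)
print(filtered)
```

For a task such as author identification, the discarded function words may be exactly the useful signal, so the filter is best kept optional.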
χ² statistic (CHI) • χ² statistic (pronounced “kai square”) • The most commonly used method of comparing proportions. • Checks whether there is a relationship between being in one of two groups and a characteristic under study. • Example: Let us measure the dependency between a term t and a category c. • the groups would be: • 1) the documents from a category $c_i$ • 2) all other documents • the characteristic would be: • “document contains term t”
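χ² statistic (CHI) Is “jaguar” a good predictor for the “auto” class? Observed document counts: jaguar and auto: 2; jaguar and not auto: 3; not jaguar and auto: 500; not jaguar and not auto: 9500. We want to compare: • the observed distribution above; and • the null hypothesis: that jaguar and auto are independent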
χ² statistic (CHI) Under the null hypothesis (jaguar and auto are independent): How many co-occurrences of jaguar and auto do we expect? • We would have: Pr(j,a) = Pr(j) Pr(a) • So, there would be: N Pr(j,a), i.e. N Pr(j) Pr(a) • Pr(j) = (2+3)/N; Pr(a) = (2+500)/N; N = 2+3+500+9500 = 10005 • Which is: N(5/N)(502/N) = 2510/N = 2510/10005 ≈ 0.25 • expected: $f_e$ ≈ 0.25; observed: $f_o$ = 2
χ² statistic (CHI) χ² is the quantity $(f_o - f_e)^2 / f_e$ summed over all table entries: $\chi^2 = \sum (f_o - f_e)^2 / f_e$. Here the sum is 12.9, so the null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the critical value for .999 confidence). expected: $f_e$; observed: $f_o$
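The jaguar/auto computation can be checked with a short script. This is a sketch that assumes the observed counts implied by the probabilities on the preceding slides (2, 3, 500, 9500); the row/column orientation and the function name are my own choices. With unrounded expected counts the statistic comes out at about 12.85; the slide's 12.9 results from rounding the expected counts first. Either way it exceeds the 10.83 threshold.

```python
def chi_square_2x2(observed):
    """Chi-square from its definition: sum of (f_o - f_e)^2 / f_e over all cells.

    observed[i][j]: rows are term present / absent,
    columns are document in class / not in class.
    Returns the statistic and the table of expected counts.
    """
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    chi2 = 0.0
    expected = []
    for i, row in enumerate(observed):
        expected_row = []
        for j, f_o in enumerate(row):
            # Under independence: N * Pr(row) * Pr(col).
            f_e = row_totals[i] * col_totals[j] / n
            expected_row.append(f_e)
            chi2 += (f_o - f_e) ** 2 / f_e
        expected.append(expected_row)
    return chi2, expected

observed = [[2, 3], [500, 9500]]  # jaguar x auto
chi2, expected = chi_square_2x2(observed)
print(round(expected[0][0], 2), round(chi2, 2))
print("reject independence at .999:", chi2 > 10.83)
```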
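χ² statistic (CHI) There is a simpler formula for χ² on a 2×2 table: $\chi^2(t,c) = \frac{N\,(AD - CB)^2}{(A+C)(B+D)(A+B)(C+D)}$, where A = number of documents in category c that contain t, B = documents outside c that contain t, C = documents in c without t, D = documents outside c without t, and N = A + B + C + D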
χ² statistic (CHI) How to use χ² for multiple categories? Compute χ² for each category and then combine: • if we require the term to discriminate well across all categories, we take the expected value of χ²: $\chi^2_{avg}(t) = \sum_{i} \Pr(c_i)\, \chi^2(t, c_i)$ • or, to discriminate well for a single category, we take the maximum: $\chi^2_{max}(t) = \max_{i} \chi^2(t, c_i)$
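A sketch of the per-category scoring and the two ways of combining it follows. It assumes the closed-form 2×2 formula from Yang & Pedersen (1997), χ²(t,c) = N(AD − CB)² / ((A+C)(B+D)(A+B)(C+D)), with A, B, C, D as documented in the code. The "auto" counts follow the jaguar example; the "animals" and "other" counts are invented so that the three categories partition the same 10005 documents.

```python
def chi_square(a, b, c, d):
    """Closed-form chi-square for a 2x2 term/category table.

    a: docs in the category that contain the term
    b: docs outside the category that contain the term
    c: docs in the category without the term
    d: docs outside the category without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

# (a, b, c, d) for the term "jaguar"; "animals" and "other" are made up.
counts = {
    "auto": (2, 3, 500, 9500),
    "animals": (3, 2, 297, 9703),
    "other": (0, 5, 9203, 797),
}
n_docs = 10005

scores = {cat: chi_square(*abcd) for cat, abcd in counts.items()}
# Pr(c_i) estimated as the fraction of documents in the category: (a + c) / N.
priors = {cat: (a + c) / n_docs for cat, (a, b, c, d) in counts.items()}

chi2_avg = sum(priors[cat] * scores[cat] for cat in counts)  # expected value
chi2_max = max(scores.values())                              # best single category
print(scores, chi2_avg, chi2_max)
```

The "other" category also scores highly here even though the term never occurs in it: χ² rewards informative absence as well as presence.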
χ² statistic (CHI) • Plus • normalized and thus comparable across terms • χ²(t,c) is 0 when t and c are independent • can be compared to the χ² distribution with 1 degree of freedom • Minus • unreliable for low frequency terms • computationally expensive
Information Gain • A measure of importance of the feature for predicting the presence of the class. • Defined as: • The number of “bits of information” gained by knowing the term is present or absent • Based on Information Theory • We won’t go into this in detail here. • Plus: • sound information theory justification • Minus: • computationally expensive
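Information Gain (IG) IG: number of bits of information gained by knowing the term is present or absent: $IG(t) = H(c) - \Pr(t)\,H(c \mid t) - \Pr(\neg t)\,H(c \mid \neg t)$, where t is the term being scored and $c_i$ is a class variable • entropy: $H(c) = -\sum_i \Pr(c_i) \log \Pr(c_i)$ • specific conditional entropy: $H(c \mid t) = -\sum_i \Pr(c_i \mid t) \log \Pr(c_i \mid t)$ • specific conditional entropy: $H(c \mid \neg t) = -\sum_i \Pr(c_i \mid \neg t) \log \Pr(c_i \mid \neg t)$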
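Information gain can be computed directly from per-class document counts. The sketch below assumes the decomposition IG(t) = H(c) − Pr(t) H(c|t) − Pr(¬t) H(c|¬t), with probabilities estimated as relative frequencies and base-2 logarithms so the result is in bits; the function names and argument layout are illustrative.

```python
import math

def entropy(counts):
    """Entropy in bits of the distribution given by non-negative counts."""
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)

def information_gain(present, absent):
    """IG of a term over a set of classes.

    present[i]: number of docs of class i that contain the term
    absent[i]:  number of docs of class i that do not contain it
    """
    n_present, n_absent = sum(present), sum(absent)
    n = n_present + n_absent
    h_c = entropy([p + a for p, a in zip(present, absent)])
    h_c_given_t = entropy(present) if n_present else 0.0
    h_c_given_not_t = entropy(absent) if n_absent else 0.0
    return (h_c
            - (n_present / n) * h_c_given_t
            - (n_absent / n) * h_c_given_not_t)

ig_perfect = information_gain([10, 0], [0, 10])   # term identifies the class
ig_useless = information_gain([5, 5], [5, 5])     # term independent of class
ig_jaguar = information_gain([2, 3], [500, 9500]) # classes: auto, not auto
print(ig_perfect, ig_useless, ig_jaguar)
```

Unlike the mutual information score discussed next, this quantity uses the documents where the term is absent as well as those where it is present.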
Mutual Information (MI) • The probability of seeing x followed by y vs. the probability of seeing x anywhere times the probability of seeing y anywhere. • log( P(x,y) / (P(x) P(y)) )
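Mutual Information (MI) $I(t,c) = \log \frac{\Pr(t \wedge c)}{\Pr(t)\,\Pr(c)}$ • Approximation: $I(t,c) \approx \log \frac{A \cdot N}{(A+C)(A+B)}$, where A = documents in c that contain t, B = documents outside c that contain t, C = documents in c without t, N = total number of documents • rare terms are scored higher • does not use term absence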
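The bias toward rare terms is easy to see numerically. The sketch below assumes the count-based approximation from Yang & Pedersen (1997), I(t,c) ≈ log(A·N / ((A+C)(A+B))), with A = documents in c containing t, B = documents outside c containing t, and C = documents in c without t. Base-2 logs and the example counts are my own choices.

```python
import math

def mutual_information(a, b, c, n):
    """Approximate MI between a term and a category from document counts.

    a: docs in the category that contain the term
    b: docs outside the category that contain the term
    c: docs in the category without the term
    n: total number of documents
    Documents lacking both the term and the category never enter the formula.
    """
    if a == 0:
        return float("-inf")  # the term never co-occurs with the category
    return math.log2(a * n / ((a + c) * (a + b)))

n = 10005
# A term seen once, in an "auto" document (the category has 502 documents).
rare = mutual_information(1, 0, 501, n)
# A term seen in 400 of the 502 "auto" documents and in 100 other documents.
frequent = mutual_information(400, 100, 102, n)
print(rare, frequent)
```

The single-occurrence term gets the higher score even though its evidence is far weaker, which is the behaviour the slide warns about.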
Using Mutual Information • Compute MI for each category and then combine • If we want to discriminate well across all categories, we take the expected value of MI: $I_{avg}(t) = \sum_{i} \Pr(c_i)\, I(t, c_i)$ • To discriminate well for a single category, we take the maximum: $I_{max}(t) = \max_{i} I(t, c_i)$
Mutual Information • Plus • I(t,c) is 0, when t and c are independent • Sound information-theoretic interpretation • Minus • Small numbers produce unreliable results • Computationally expensive • Does not use term absence
Term strength Mutual information
Comparison: DF, TS, IG, CHI, MI • DF, IG and CHI are good and strongly correlated • thus using DF is good, cheap and task independent • can be used when IG and CHI are too expensive • MI is bad • favors rare terms (which are typically bad) • MI vs. IG (mutual information vs. information gain)
Term Weighting • In the study just shown, terms were (mainly) treated as binary features • If a term occurred in a document, it was assigned 1 • Else 0 • Often it is useful to weight the selected features • Standard technique: tf.idf
TF.IDF Term Weighting • TF: term frequency • definition: TF = $t_{ij}$ • frequency of term i in document j • purpose: makes the frequent words for the document more important • IDF: inverse document frequency • definition: IDF = $\log(N/n_i)$ • $n_i$: number of documents containing term i • N: total number of documents • purpose: makes rare words across documents more important • TF.IDF • definition: $t_{ij} \cdot \log(N/n_i)$
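The definition above translates almost directly into code. This is a minimal sketch using raw counts for TF and the natural logarithm for IDF (the slide does not fix a base); the toy documents and the function name are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency n_i
    weights = []
    for tokens in docs:
        tf = Counter(tokens)    # term frequency t_ij
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [
    ["the", "jaguar", "engine"],
    ["the", "jaguar", "cat"],
    ["the", "engine", "engine", "repair"],
]
weights = tf_idf(docs)
print(weights[2])
```

A term that appears in every document ("the") gets weight 0, so IDF acts as a soft form of stop word removal. In the third document "repair" (once, DF 1) still outweighs "engine" (twice, DF 2).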
Term Normalization • Combine different words into a single representation • Stemming/morphological analysis • bought, buy, buys -> buy • General word categories • $23.45, 5.30 Yen -> MONEY • 1984, 10,000 -> DATE, NUM • PERSON • ORGANIZATION • (Covered in Information Extraction segment) • Generalize with lexical hierarchies • WordNet, MeSH • (Covered later in the semester)
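Mapping tokens to general word categories can be approximated with a few regular expressions. The patterns below are rough heuristics chosen for this illustration (dollar amounts only, four-digit years between 1500 and 2099 treated as dates) and are not the method taught in the course. PERSON and ORGANIZATION need real information extraction and are not attempted, and lowercasing the remaining tokens is an assumption.

```python
import re

# Heuristic patterns, assumptions for this sketch only.
MONEY = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")
YEAR = re.compile(r"(?:1[5-9]|20)\d{2}")
NUM = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")

def normalize_token(token):
    """Replace a token by its word class, or lowercase it if none applies."""
    if MONEY.fullmatch(token):
        return "MONEY"
    if YEAR.fullmatch(token):
        return "DATE"
    if NUM.fullmatch(token):
        return "NUM"
    return token.lower()

tokens = ["Bought", "for", "$23.45", "in", "1984", "with", "10,000", "others"]
normalized = [normalize_token(t) for t in tokens]
print(normalized)
```

The order of the checks matters: "1984" also matches the number pattern, so the year test has to come first.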
Stemming & Lemmatization • Purpose: conflate morphological variants of a word to a single index term • Stemming: normalize to a pseudoword • e.g. “more” and “morals” become “mor” (Porter stemmer) • Lemmatization: convert to the root form • e.g. “more” and “morals” become “more” and “moral” • Plus: • vocabulary size reduction • data sparseness reduction • Minus: • loses important features (even to_lowercase() can be bad!) • questionable utility (maybe just “-s”, “-ing” and “-ed”?)
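The minimal option raised on this slide, stripping only “-s”, “-ing” and “-ed”, can be sketched as follows. This is a toy suffix stripper and not the Porter algorithm; the `min_stem` guard is an assumption added to avoid reducing very short words.

```python
SUFFIXES = ("ing", "ed", "s")  # checked in this order

def strip_suffix(word, min_stem=3):
    """Remove one of the listed suffixes if at least `min_stem` letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["texts", "walking", "walked", "morals", "more", "bus", "this"]:
    print(w, "->", strip_suffix(w))
```

Even this small rule set over-stems: "this" becomes "thi", which illustrates why the slide lists lost features as a minus.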
What Do People Do In Practice? • Feature selection • infrequent term removal • infrequent across the whole collection (i.e. DF) • met in a single document • most frequent term removal (i.e. stop words) • Normalization: • Stemming (often) • Word classes (sometimes) • Feature weighting: TF.IDF or IDF • Dimensionality reduction (occasionally)
Summary • Feature selection • Task-independent methods: DF, TS • Task-dependent methods: IG, MI, χ² statistic • Term weighting • IDF • TF.IDF • Term normalization
References • Feature Selection • Yang Y., J. Pedersen. A comparative study on feature selection in text categorization. In J. D. H. Fisher, editor, The Fourteenth International Conference on Machine Learning (ICML'97), pages 412-420. Morgan Kaufmann, 1997. • Term Weighting • Salton G., C. Buckley, Term-weighting approaches in automatic text retrieval, Information Processing and Management: an International Journal, v.24 n.5, p.513-523, 1988. • Salton, G. 1989. Automatic text processing. Chapter 9.