Feature Selection & Maximum Entropy

Advanced Statistical Methods in NLP, Ling 572, January 26, 2012

Roadmap
• Feature selection and weighting
  • Feature weighting
  • Chi-square feature selection
  • Chi-square feature selection example
• HW #4
• Maximum Entropy
  • Introduction: Maximum Entropy Principle
  • Maximum Entropy NLP examples

Feature Selection Recap
• Problem: Curse of dimensionality
  • Data sparseness, computational cost, overfitting
• Solution: Dimensionality reduction
  • New feature set r’ s.t. |r’| < |r|
• Approaches:
  • Global & local approaches
  • Feature extraction:
    • New features in r’ are transformations of features in r
  • Feature selection:
    • Wrapper techniques
    • Feature scoring

Feature Weighting
• For text classification, typical weights include:
  • Binary: weights in {0,1}
  • Term frequency (tf):
    • # occurrences of $t_k$ in document $d_i$
  • Inverse document frequency (idf):
    • $df_k$: # of docs in which $t_k$ appears; $N$: # docs
    • $idf = \log(N / (1 + df_k))$
  • tfidf = tf × idf
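The weighting scheme above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `tfidf` and the toy corpus are hypothetical, documents are assumed to be pre-tokenized lists of strings, and the logarithm is taken to be natural (the slides do not state a base).

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document (hypothetical helper)."""
    n_docs = len(docs)
    # df_k: number of documents containing term k (each document counted once)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw occurrences of each term in this document
        # tfidf = tf * log(N / (1 + df_k)), following the formula above
        weights.append({t: tf[t] * math.log(n_docs / (1 + df[t])) for t in tf})
    return weights

# Hypothetical toy corpus, already tokenized.
docs = [["cat", "sat", "cat"], ["dog", "sat"], ["dog", "ran"], ["bird", "sang"]]
w = tfidf(docs)
print(w[0])
```

One consequence of the 1 + df_k denominator is worth noting: a term that appears in every document gets log(N / (1 + N)), which is slightly negative, so such terms are pushed below zero rather than to exactly zero.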

Chi Square
• Tests for presence/absence of relation between random variables
• Bivariate analysis tests 2 random variables
  • Can test strength of relationship
  • (Strictly speaking) doesn’t test direction

Chi Square Example
• Can gender predict shoe choice?
  • A: male/female → Features
  • B: shoe choice → Classes: {sandal, sneaker, …}
Due to F. Xia

Comparing Distributions
• Observed distribution (O): [table not preserved]
• Expected distribution (E): [table not preserved]
Due to F. Xia

Computing Chi Square
• Expected value for cell = row_total × column_total / table_total
• $\chi^2 = \frac{(6 - 9.5)^2}{9.5} + \frac{(17 - 11)^2}{11} + \dots = 14.026$
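In general form, the computation above can be written as follows. The symbols are assumed notation, not taken from the slides: $O_{ij}$ and $E_{ij}$ are the observed and expected counts in row $i$, column $j$, and $N$ is the table total.

```latex
E_{ij} = \frac{\Bigl(\sum_{j'} O_{ij'}\Bigr)\Bigl(\sum_{i'} O_{i'j}\Bigr)}{N},
\qquad
\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
```

The two terms written out on the slide, $(6 - 9.5)^2 / 9.5$ and $(17 - 11)^2 / 11$, are the first two summands of this double sum.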

Calculating $\chi^2$
• Tabulate contingency table of observed values: O
• Compute row, column totals
• Compute table of expected values, given row/column totals
  • Assuming no association
• Compute $\chi^2$
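The steps above can be sketched in Python. The observed counts below are an illustrative assumption, since the original table is not reproduced in this text. They were chosen only to agree with the cells quoted in the worked computation (observed 6 and 17, expected 9.5 and 11). The function names are likewise hypothetical.

```python
def expected_counts(observed):
    """Expected cell counts under no association: row_total * col_total / table_total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def chi_square(observed):
    """Sum of (O - E)^2 / E over all cells of the contingency table."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )

# ASSUMED illustrative counts, not the original table.
# Rows: male, female. Columns: five shoe types.
observed = [
    [6, 17, 13, 9, 5],
    [13, 5, 7, 16, 9],
]
expected = expected_counts(observed)
stat = chi_square(observed)
print(expected[0][:2])
print(round(stat, 3))
```

With these assumed counts, the first two expected values match the 9.5 and 11 used in the worked computation, and the statistic should come out close to the reported 14.026, up to rounding.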

For 2x2 Table
• O: [table not preserved]
• E: [table not preserved]

$\chi^2$ Test
• Test whether random variables are independent
• Null hypothesis: the 2 random variables are independent
• Compute the $\chi^2$ statistic
• Compute degrees of freedom
  • df = (# rows - 1)(# cols - 1)
  • Shoe example: df = (2 - 1)(5 - 1) = 4
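As a rough sketch of how the test might be completed, the snippet below computes the degrees of freedom and a p-value for the shoe example's statistic of 14.026. The p-value step goes beyond what is shown here and is an assumption about the usual next step. The helper names are hypothetical. To stay within the standard library, the code uses the closed-form chi-square tail probability, which holds only for even degrees of freedom.

```python
import math

def degrees_of_freedom(n_rows, n_cols):
    """df = (# rows - 1)(# cols - 1) for an r x c contingency table."""
    return (n_rows - 1) * (n_cols - 1)

def chi2_sf_even_df(x, df):
    """P(X >= x) for a chi-square variable with even df.

    Uses exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!, valid only for even df.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0  # i = 0 term of the sum
    for i in range(1, df // 2):
        term *= half / i    # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total

df = degrees_of_freedom(2, 5)          # shoe example: 2 rows, 5 columns
p_value = chi2_sf_even_df(14.026, df)  # statistic reported for the shoe example
print(df, p_value)
```

Under these assumptions the p-value falls below 0.01, so the null hypothesis of independence would be rejected at the conventional 0.05 level. For odd degrees of freedom a statistics library or a printed table would be needed instead.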
