MIT Lincoln Laboratory 244 Wood Street Lexington, MA 02420-9108 A Robust "Black Box" Technique for Pattern Classification Kelly Wallenstein, John Weatherwax, Virginia Hafer MIT Lincoln Laboratory Presented to Group 32 August 2007
Outline • Problem Introduction • Classification Methods • Single Multidimensional Gaussian • Decision Tree • AdaBoost • “Real World” Application: Email Spam • Conclusions
Pattern Classification • Provide a set of data and classes • Features: size, shape, color, etc. • "Train" the computer to associate certain features with particular classes • Introduce new data; have the computer sort it into classes based on its features • Reduce the frequency of misclassification [Flow diagram — Training: Data, Classes → Learn; Testing: New Data → Classify → Reduce Error]
Data Sets [Scatter plots of the two synthetic 2-D data sets: "Clouds" and "Four-Spiral"]
Outline • Problem Introduction • Classification Methods • Single Multidimensional Gaussian • Decision Tree • AdaBoost • “Real World” Application: Email Spam • Conclusions
Method 1: Simple Gaussian Quadratic Classification
• Probability density: $p(\mathbf{x} \mid \omega_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\mu_i)^{T}\Sigma_i^{-1}(\mathbf{x}-\mu_i)\right)$
• Discriminant function: $d(\mathbf{x}) = \ln p(\mathbf{x} \mid \text{Blue}) - \ln p(\mathbf{x} \mid \text{Green})$
• Training: estimate $\mu_i$, $\Sigma_i$ from the data
• Testing: evaluate $d(\mathbf{x})$; $d(\mathbf{x}) > 0$ → Blue, $d(\mathbf{x}) < 0$ → Green
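A minimal sketch of this train/test loop in Python (not from the deck; the synthetic stand-in data and all variable names are illustrative assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
# Synthetic stand-in for the "Clouds" data: one Gaussian blob per class.
X_blue = rng.normal([0.0, 0.0], 1.0, size=(200, 2))
X_green = rng.normal([1.5, 1.5], 1.0, size=(200, 2))

# Training: estimate the mean and covariance of each class.
mu_b, cov_b = X_blue.mean(axis=0), np.cov(X_blue, rowvar=False)
mu_g, cov_g = X_green.mean(axis=0), np.cov(X_green, rowvar=False)

def d(x):
    """Discriminant: log p(x | Blue) - log p(x | Green)."""
    return (multivariate_normal.logpdf(x, mu_b, cov_b)
            - multivariate_normal.logpdf(x, mu_g, cov_g))

# Testing: classify new points by the sign of d(x).
X_test = rng.normal([0.75, 0.75], 1.0, size=(10, 2))
print(np.where(d(X_test) > 0, "Blue", "Green"))
```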
Method 1: Results • Clouds: error = 0.25 • Four-Spiral: error = 0.30 [Decision boundary plots for both data sets]
Outline • Problem Introduction • Classification Methods • Single Multidimensional Gaussian • Decision Tree • AdaBoost • “Real World” Application: Email Spam • Conclusions
Method 2: Decision Tree • Recursively splits the feature space on one feature at a time, assigning a class label to each leaf [Tree diagram] — a minimal sketch follows
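An illustrative sketch (the deck doesn't say which tree implementation was used; scikit-learn's DecisionTreeClassifier is assumed here):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic two-class data as a stand-in for the "Clouds" set.
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(1.5, 1.0, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)  # 0 = Green, 1 = Blue

# max_depth caps the tree's complexity, a simple guard against overfitting.
tree = DecisionTreeClassifier(max_depth=5).fit(X, y)
print("training error:", 1.0 - tree.score(X, y))
```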
Method 2: Results • Clouds: error ≈ 0.14 (vs. 0.25 for the Gaussian) • Four-Spiral: error ≈ 0.12 (vs. 0.30)
Outline • Problem Introduction • Classification Methods • Single Multidimensional Gaussian • Decision Tree • AdaBoost • “Real World” Application: Email Spam • Conclusions
Method 3: AdaBoost — a "meta" algorithm (nests an existing classification algorithm, or "weak learner"); a from-scratch sketch follows this list
1. Train on a portion of the available data to create a classifier
2. Test it on the remaining data
3. Select a new portion of data to train on, after applying higher weights to the previously misclassified instances
4. Train/test on the remaining data to create a second classifier
5. Continue creating more classifiers ("boosts") using the weighted data
6. Use "majority voting" to combine the results of all the classifiers
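A from-scratch sketch of the classic reweighting form of discrete AdaBoost with decision-stump weak learners (assumptions: the deck describes a resampling variant and doesn't name its weak learner; labels are taken to be ±1):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_boosts=50):
    """Discrete AdaBoost; y must contain labels in {-1, +1}."""
    w = np.full(len(y), 1.0 / len(y))   # start with uniform instance weights
    learners, alphas = [], []
    for _ in range(n_boosts):
        # Weak learner: a depth-1 decision tree ("stump") fit to weighted data.
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()        # weighted training error
        if err >= 0.5:                  # no better than chance: stop boosting
            break
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)  # up-weight misclassified instances
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    """Weighted 'majority vote' over all of the boosted classifiers."""
    votes = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(votes)
```

scikit-learn's AdaBoostClassifier packages the same idea for everyday use.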
Method 3: AdaBoost • Combines weak learners into a strong learner • Trains on increasingly hard-to-classify instances • Avoids overfitting, or "memorizing the data" Image: Polikar, Robi. "Ensemble Based Systems in Decision Making." IEEE Circuits and Systems Magazine, 3rd quarter 2006: 31.
Method 3: AdaBoost with Decision Tree • Clouds: error ≈ 0.14 (vs. 0.14 for the decision tree, 0.25 for the Gaussian) [decision boundary plot]
Method 3: AdaBoost with Decision Tree • Four-Spiral: error ≈ 0.06 (vs. 0.12 for the decision tree, 0.30 for the Gaussian) [decision boundary plot]
Method Comparison
• Simple Gaussian: Clouds 0.25, Four-Spiral 0.30
• Decision Tree: Clouds 0.14, Four-Spiral 0.12
• AdaBoost with Decision Tree: Clouds 0.14, Four-Spiral 0.06
Can we improve upon these results?
Bayes Error • The theoretical minimum error; no classifier can do better than the Bayes error • Arises from the fundamental overlap between the class distributions
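For reference, the standard two-class definition (a general fact, not stated in the deck): with class priors $P(\omega_i)$ and densities $p(\mathbf{x} \mid \omega_i)$,

```latex
E_{\text{Bayes}} = \int \min\big( P(\omega_1)\, p(\mathbf{x} \mid \omega_1),\; P(\omega_2)\, p(\mathbf{x} \mid \omega_2) \big)\, d\mathbf{x}
```

Wherever the two weighted densities overlap, the smaller of the two is the unavoidable contribution to the error, no matter which class the classifier picks there.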
Gaussian Mixture Model • Training: estimate the mixture parameters $\mu_i$, $\Sigma_i$ (and mixing weights) from the data • Testing: evaluate $d(\mathbf{x})$; $d(\mathbf{x}) > 0$ → Blue, $d(\mathbf{x}) < 0$ → Green
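The same train/test loop with a mixture per class, sketched with scikit-learn's GaussianMixture (the two-component choice and the stand-in data are illustrative assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_blue = rng.normal([0.0, 0.0], 1.0, size=(200, 2))   # stand-in training data
X_green = rng.normal([1.5, 1.5], 1.0, size=(200, 2))

# Training: fit a 2-component mixture to each class (EM under the hood).
gmm_blue = GaussianMixture(n_components=2, random_state=0).fit(X_blue)
gmm_green = GaussianMixture(n_components=2, random_state=0).fit(X_green)

# Testing: d(x) = log p(x | Blue) - log p(x | Green); the sign gives the class.
X_test = rng.normal([0.75, 0.75], 1.0, size=(10, 2))
d = gmm_blue.score_samples(X_test) - gmm_green.score_samples(X_test)
print(np.where(d > 0, "Blue", "Green"))
```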
Gaussian Mixture Model: Results (Clouds) • Simple Gaussian: 0.25 • Decision Tree: 0.14 • AdaBoost with Decision Tree: 0.14 • Gaussian Mixture Model: 0.10 • Perfect knowledge of the distributions achieves the Bayes error
Outline • Problem Introduction • Classification Methods • Single Multidimensional Gaussian • Decision Tree • AdaBoost • “Real World” Application: Email Spam • Conclusions
Email Spam Dataset • 4601 emails (39.4% spam, 60.6% non-spam) • 57 features plus a class label: • Word/character frequency measures ("money," "free," "credit," "$," etc.) • Lengths of runs of consecutive capital letters • Total number of capital letters in the e-mail • Data from a sample email: • 0,0,0,0,1.16,0,0,0,0,0,0,0.58,0,0,0,1.16,0,1.16,1.16,0,1.75,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.133,0,0.667,0,0,1.131,5,69,1 • Spambase data set, from the University of California, Irvine machine learning repository (1999)
Email Spam: Method Comparison • Simple Gaussian: 0.32 • Decision Tree: 0.22 • AdaBoost with Decision Tree: 0.17
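A hedged sketch of how such a comparison could be reproduced on the UCI Spambase data (assumptions: a local copy named spambase.data with the label in the last column, a simple 70/30 holdout split, and scikit-learn models standing in for the authors' implementations):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = np.loadtxt("spambase.data", delimiter=",")   # hypothetical local path
X, y = data[:, :-1], data[:, -1]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=10),
    "adaboost (stumps)": AdaBoostClassifier(n_estimators=200),
}
for name, model in models.items():
    err = 1.0 - model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: test error ~ {err:.2f}")
```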
Conclusions • Important to avoid "overfitting" (compare testing error vs. training error) • AdaBoost is a useful "black box" algorithm • Improves performance on various types of data sets without overfitting • Doesn't require knowledge of how the data were generated (statistically, physically, etc.) • Can achieve near-optimal results
ROC Curves
Email Spam: AdaBoost • Error = 0.17 [ROC curve for the boosted decision-tree classifier]
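One way to generate the ROC curve behind this slide (illustrative; reuses the fitted AdaBoost model and held-out split from the Spambase sketch above):

```python
from sklearn.metrics import auc, roc_curve

# 'model' is the fitted AdaBoostClassifier from the previous sketch;
# decision_function returns the real-valued weighted-vote margin per email.
scores = model.decision_function(X_te)
fpr, tpr, thresholds = roc_curve(y_te, scores)
print("area under the ROC curve:", auc(fpr, tpr))
```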