MIT Lincoln Laboratory 244 Wood Street Lexington, MA 02420-9108 A Robust "Black Box" Technique for Pattern Classification Kelly Wallenstein, John Weatherwax, Virginia Hafer MIT Lincoln Laboratory Presented to Group 32 August 2007
Outline • Problem Introduction • Classification Methods • Single Multidimensional Gaussian • Decision Tree • AdaBoost • “Real World” Application: Email Spam • Conclusions
Pattern Classification • Provide a set of data and classes • Features: size, shape, color, etc. • "Train" the computer to associate certain features with particular classes • Introduce new data; have the computer sort it into classes based on its features • Reduce the frequency of misclassification [Flow diagram — Training: Data, Classes → Learn; Testing: New Data → Classify → Reduce Error]
Data Sets [Scatter plots of the two synthetic 2-D data sets: "Clouds" and "Four-Spiral"]
Outline • Problem Introduction • Classification Methods • Single Multidimensional Gaussian • Decision Tree • AdaBoost • “Real World” Application: Email Spam • Conclusions
Method 1: Simple Gaussian Quadratic Classification
• Probability density: $p(\mathbf{x} \mid \omega_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\mu_i)^{T}\Sigma_i^{-1}(\mathbf{x}-\mu_i)\right)$
• Discriminant function: $d(\mathbf{x}) = \ln p(\mathbf{x} \mid \text{Blue}) - \ln p(\mathbf{x} \mid \text{Green})$
• Training: estimate $\mu_i$, $\Sigma_i$ from the data
• Testing: evaluate $d(\mathbf{x})$; $d(\mathbf{x}) > 0$ → Blue, $d(\mathbf{x}) < 0$ → Green
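A minimal sketch of this train/test loop in Python (not from the deck; the synthetic stand-in data and all variable names are illustrative assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
# Synthetic stand-in for the "Clouds" data: one Gaussian blob per class.
X_blue = rng.normal([0.0, 0.0], 1.0, size=(200, 2))
X_green = rng.normal([1.5, 1.5], 1.0, size=(200, 2))

# Training: estimate the mean and covariance of each class.
mu_b, cov_b = X_blue.mean(axis=0), np.cov(X_blue, rowvar=False)
mu_g, cov_g = X_green.mean(axis=0), np.cov(X_green, rowvar=False)

def d(x):
    """Discriminant: log p(x | Blue) - log p(x | Green)."""
    return (multivariate_normal.logpdf(x, mu_b, cov_b)
            - multivariate_normal.logpdf(x, mu_g, cov_g))

# Testing: classify new points by the sign of d(x).
X_test = rng.normal([0.75, 0.75], 1.0, size=(10, 2))
print(np.where(d(X_test) > 0, "Blue", "Green"))
```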
Method 1: Results • Clouds: error = 0.25 • Four-Spiral: error = 0.30 [Decision boundary plots for both data sets]
Outline • Problem Introduction • Classification Methods • Single Multidimensional Gaussian • Decision Tree • AdaBoost • “Real World” Application: Email Spam • Conclusions
Method 2: Decision Tree • Recursively splits the feature space on one feature at a time, assigning a class label to each leaf [Tree diagram] — a minimal sketch follows
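An illustrative sketch (the deck doesn't say which tree implementation was used; scikit-learn's DecisionTreeClassifier is assumed here):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic two-class data as a stand-in for the "Clouds" set.
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(1.5, 1.0, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)  # 0 = Green, 1 = Blue

# max_depth caps the tree's complexity, a simple guard against overfitting.
tree = DecisionTreeClassifier(max_depth=5).fit(X, y)
print("training error:", 1.0 - tree.score(X, y))
```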
Method 2: Results • Clouds: error ≈ 0.14 (vs. 0.25 for the Gaussian) • Four-Spiral: error ≈ 0.12 (vs. 0.30)
Outline • Problem Introduction • Classification Methods • Single Multidimensional Gaussian • Decision Tree • AdaBoost • “Real World” Application: Email Spam • Conclusions
Method 3: AdaBoost — a "meta" algorithm (nests an existing classification algorithm, or "weak learner"); a from-scratch sketch follows this list
1. Train on a portion of the available data to create a classifier
2. Test it on the remaining data
3. Select a new portion of data to train on, after applying higher weights to the previously misclassified instances
4. Train/test on the remaining data to create a second classifier
5. Continue creating more classifiers ("boosts") using the weighted data
6. Use "majority voting" to combine the results of all the classifiers
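A from-scratch sketch of the classic reweighting form of discrete AdaBoost with decision-stump weak learners (assumptions: the deck describes a resampling variant and doesn't name its weak learner; labels are taken to be ±1):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_boosts=50):
    """Discrete AdaBoost; y must contain labels in {-1, +1}."""
    w = np.full(len(y), 1.0 / len(y))   # start with uniform instance weights
    learners, alphas = [], []
    for _ in range(n_boosts):
        # Weak learner: a depth-1 decision tree ("stump") fit to weighted data.
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()        # weighted training error
        if err >= 0.5:                  # no better than chance: stop boosting
            break
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)  # up-weight misclassified instances
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    """Weighted 'majority vote' over all of the boosted classifiers."""
    votes = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(votes)
```

scikit-learn's AdaBoostClassifier packages the same idea for everyday use.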
Method 3: AdaBoost • Combines weak learners into a strong learner • Trains on increasingly hard-to-classify instances • Avoids overfitting, or "memorizing the data" Image: Polikar, Robi. "Ensemble Based Systems in Decision Making." IEEE Circuits and Systems Magazine, 3rd quarter 2006: 31.
Method 3: AdaBoost with Decision Tree • Clouds: error ≈ 0.14 (vs. 0.14 for the decision tree, 0.25 for the Gaussian) [decision boundary plot]
Method 3: AdaBoost with Decision Tree • Four-Spiral: error ≈ 0.06 (vs. 0.12 for the decision tree, 0.30 for the Gaussian) [decision boundary plot]
Method Comparison
• Simple Gaussian: Clouds 0.25, Four-Spiral 0.30
• Decision Tree: Clouds 0.14, Four-Spiral 0.12
• AdaBoost with Decision Tree: Clouds 0.14, Four-Spiral 0.06
Can we improve upon these results?
Bayes Error • The theoretical minimum error; no classifier can do better than the Bayes error • Arises from the fundamental overlap between the class distributions
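For reference, the standard two-class definition (a general fact, not stated in the deck): with class priors $P(\omega_i)$ and densities $p(\mathbf{x} \mid \omega_i)$,

```latex
E_{\text{Bayes}} = \int \min\big( P(\omega_1)\, p(\mathbf{x} \mid \omega_1),\; P(\omega_2)\, p(\mathbf{x} \mid \omega_2) \big)\, d\mathbf{x}
```

Wherever the two weighted densities overlap, the smaller of the two is the unavoidable contribution to the error, no matter which class the classifier picks there.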
Gaussian Mixture Model • Training: estimate the mixture parameters $\mu_i$, $\Sigma_i$ (and mixing weights) from the data • Testing: evaluate $d(\mathbf{x})$; $d(\mathbf{x}) > 0$ → Blue, $d(\mathbf{x}) < 0$ → Green
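The same train/test loop with a mixture per class, sketched with scikit-learn's GaussianMixture (the two-component choice and the stand-in data are illustrative assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_blue = rng.normal([0.0, 0.0], 1.0, size=(200, 2))   # stand-in training data
X_green = rng.normal([1.5, 1.5], 1.0, size=(200, 2))

# Training: fit a 2-component mixture to each class (EM under the hood).
gmm_blue = GaussianMixture(n_components=2, random_state=0).fit(X_blue)
gmm_green = GaussianMixture(n_components=2, random_state=0).fit(X_green)

# Testing: d(x) = log p(x | Blue) - log p(x | Green); the sign gives the class.
X_test = rng.normal([0.75, 0.75], 1.0, size=(10, 2))
d = gmm_blue.score_samples(X_test) - gmm_green.score_samples(X_test)
print(np.where(d > 0, "Blue", "Green"))
```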
Gaussian Mixture Model: Results (Clouds) • Simple Gaussian: 0.25 • Decision Tree: 0.14 • AdaBoost with Decision Tree: 0.14 • Gaussian Mixture Model: 0.10 • Perfect knowledge of the distributions achieves the Bayes error
Outline • Problem Introduction • Classification Methods • Single Multidimensional Gaussian • Decision Tree • AdaBoost • “Real World” Application: Email Spam • Conclusions
Email Spam Dataset • 4601 emails (39.4% spam, 60.6% non-spam) • 57 features plus a class label: • Word/character frequency measures ("money," "free," "credit," "$," etc.) • Lengths of runs of consecutive capital letters • Total number of capital letters in the e-mail • Data from a sample email: • 0,0,0,0,1.16,0,0,0,0,0,0,0.58,0,0,0,1.16,0,1.16,1.16,0,1.75,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.133,0,0.667,0,0,1.131,5,69,1 • Spambase data set, from the University of California, Irvine machine learning repository (1999)
Email Spam: Method Comparison • Simple Gaussian: 0.32 • Decision Tree: 0.22 • AdaBoost with Decision Tree: 0.17
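A hedged sketch of how such a comparison could be reproduced on the UCI Spambase data (assumptions: a local copy named spambase.data with the label in the last column, a simple 70/30 holdout split, and scikit-learn models standing in for the authors' implementations):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = np.loadtxt("spambase.data", delimiter=",")   # hypothetical local path
X, y = data[:, :-1], data[:, -1]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=10),
    "adaboost (stumps)": AdaBoostClassifier(n_estimators=200),
}
for name, model in models.items():
    err = 1.0 - model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: test error ~ {err:.2f}")
```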
Conclusions • Important to avoid "overfitting" (compare testing error vs. training error) • AdaBoost is a useful "black box" algorithm • Improves performance on various types of data sets without overfitting • Doesn't require knowledge of how the data were generated (statistically, physically, etc.) • Can achieve near-optimal results
ROC Curves
Email Spam: AdaBoost • Error = 0.17 [ROC curve for the boosted decision-tree classifier]
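One way to generate the ROC curve behind this slide (illustrative; reuses the fitted AdaBoost model and held-out split from the Spambase sketch above):

```python
from sklearn.metrics import auc, roc_curve

# 'model' is the fitted AdaBoostClassifier from the previous sketch;
# decision_function returns the real-valued weighted-vote margin per email.
scores = model.decision_function(X_te)
fpr, tpr, thresholds = roc_curve(y_te, scores)
print("area under the ROC curve:", auc(fpr, tpr))
```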