Computational and Statistical Issues in Data-Mining Yoav Freund Banter Inc.
Plan of talk • Two large scale classification problems. • Generative versus Predictive modeling • Boosting • Applications of boosting • Computational issues in data-mining.
AT&T customer classification [Freund, Mason, Rogers, Pregibon, Cortes 2000] • Distinguish business/residence customers • Classification unavailable for about 30% of known customers. • Calculate a “Buizocity” score • Using statistics from call-detail records • Records contain: • calling number, • called number, • time of day, • length of call.
Massive datasets • 260 Million calls / day • 230 Million telephone numbers to be classified.
Paul Viola’s face recognizer • Training data: 5,000 faces and 10^8 non-faces. [Figure: example face and non-face image patches]
Applications of the face detector • User interfaces • Interactive agents • Security systems • Video compression • Image database analysis
Toy example • Computer receives a telephone call • Measures pitch of the voice • Decides gender of the caller [Figure: human voice split into male and female]
Generative modeling [Figure: probability vs. voice pitch; one Gaussian per class, with parameters mean1/var1 and mean2/var2]
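A minimal sketch of the generative approach on this toy problem (the pitch values, class names, and test point below are hypothetical, for illustration only): fit one Gaussian per class and classify a new pitch by comparing class-conditional log-densities.

```python
import numpy as np

def fit_gaussian(samples):
    """Estimate mean and variance of one class's pitch values."""
    return np.mean(samples), np.var(samples)

def classify_generative(pitch, params_male, params_female):
    """Pick the class whose fitted Gaussian assigns the new pitch higher density."""
    def log_density(x, mean, var):
        return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)
    lm = log_density(pitch, *params_male)
    lf = log_density(pitch, *params_female)
    return "male" if lm > lf else "female"

# Hypothetical training pitches (Hz)
male_pitches = np.array([110.0, 125.0, 130.0, 140.0])
female_pitches = np.array([200.0, 210.0, 225.0, 240.0])
params_m = fit_gaussian(male_pitches)
params_f = fit_gaussian(female_pitches)
print(classify_generative(150.0, params_m, params_f))
```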
Discriminative approach [Figure: number of mistakes as a function of the decision threshold on voice pitch]
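By contrast, a minimal sketch of the discriminative approach (same hypothetical pitch data as above): pick the threshold that directly minimizes the number of training mistakes, with no distributional assumptions.

```python
import numpy as np

def best_threshold(pitches, labels):
    """Try each candidate threshold and keep the one with the fewest mistakes.
    labels are +1 (female) / -1 (male); predict +1 when pitch > threshold."""
    best_t, best_mistakes = None, len(labels) + 1
    for t in np.sort(pitches):
        preds = np.where(pitches > t, 1, -1)
        mistakes = int(np.sum(preds != labels))
        if mistakes < best_mistakes:
            best_t, best_mistakes = t, mistakes
    return best_t, best_mistakes

pitches = np.array([110.0, 125.0, 130.0, 140.0, 200.0, 210.0, 225.0, 240.0])
labels = np.array([-1, -1, -1, -1, 1, 1, 1, 1])
print(best_threshold(pitches, labels))
```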
Ill-behaved data [Figure: probability and number-of-mistakes curves vs. voice pitch for data that is poorly fit by Gaussians; the fitted means are misleading while the mistake-minimizing threshold is not]
Traditional statistics vs. machine learning [Diagram: traditional statistics goes Data → (Statistics) → Estimated world state → (Decision Theory) → Actions; machine learning maps Data directly to Predictions/Actions]
A weak learner • Weighted training set: (x1,y1,w1), (x2,y2,w2), …, (xn,yn,wn) • Instances (feature vectors): x1, x2, …, xn • Binary labels: y1, y2, …, yn ∈ {-1,+1} • Non-negative weights that sum to 1 • The weak learner takes the weighted training set and outputs a weak rule h • The weak requirement: h must do slightly better than random guessing on the weighted set (weighted error at most 1/2 − γ for some γ > 0)
The boosting process [Diagram: start from uniform weights (x1,y1,1/n), …, (xn,yn,1/n); at each round t the weak learner is run on the current weighted set (x1,y1,w1), …, (xn,yn,wn) and returns a weak rule ht, after which the weights are updated; final rule: Sign[a1 h1 + a2 h2 + … + aT hT]]
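A minimal AdaBoost sketch of this loop (the `weak_learner` callable and its interface are assumptions for illustration; the talk does not pin down a specific implementation):

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """Plain AdaBoost: reweight examples each round, combine weak rules by a weighted vote.
    X: (n, d) feature matrix, y: labels in {-1, +1},
    weak_learner(X, y, w) -> h, where h(X) returns predictions in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)               # start from uniform weights
    rules, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, w)         # weak rule for the current weighting
        preds = h(X)
        err = np.clip(np.sum(w[preds != y]), 1e-12, 1 - 1e-12)  # weighted training error
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * preds)   # increase the weight of mistakes
        w /= w.sum()
        rules.append(h)
        alphas.append(alpha)

    def final_rule(X_new):
        """Sign of the weighted vote a1*h1 + ... + aT*hT."""
        score = sum(a * h(X_new) for a, h in zip(alphas, rules))
        return np.sign(score)
    return final_rule
```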
Main properties of AdaBoost • If the advantages of the weak rules over random guessing are γ1, γ2, …, γT, then the in-sample error of the final rule (w.r.t. the initial weights) is at most exp(−2 Σt γt²). • Even after the in-sample error reaches zero, additional boosting iterations usually improve the out-of-sample error. [Schapire, Freund, Bartlett, Lee, Ann. Stat. 1998]
What is a good weak learner? • The set of weak rules (features) should be flexible enough to be (weakly) correlated with most conceivable relations between feature vector and label. • Small enough to allow exhaustive search for the minimal weighted training error. • Small enough to avoid over-fitting. • Should be able to calculate predicted label very efficiently. • Rules can be “specialists” – predict only on a small subset of the input space and abstain from predicting on the rest (output 0).
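One common weak learner meeting these criteria is the decision stump: threshold a single feature, found by exhaustive search for the minimal weighted training error. A sketch, matching the `weak_learner` interface assumed in the AdaBoost sketch above:

```python
import numpy as np

def stump_weak_learner(X, y, w):
    """Exhaustively search (feature, threshold, sign) for the stump with minimal weighted error."""
    n, d = X.shape
    best = (None, None, 1, np.inf)        # (feature index, threshold, sign, weighted error)
    for j in range(d):
        for thresh in np.unique(X[:, j]):
            for sign in (+1, -1):
                preds = sign * np.where(X[:, j] > thresh, 1, -1)
                err = np.sum(w[preds != y])
                if err < best[3]:
                    best = (j, thresh, sign, err)
    j, thresh, sign, _ = best
    # Return the weak rule: a function from a feature matrix to {-1, +1} predictions.
    return lambda X_new: sign * np.where(X_new[:, j] > thresh, 1, -1)
```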
Image features [Figure: Viola-Jones rectangle filters on image sub-windows; thresholding a filter’s value yields a unique binary feature]
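A minimal sketch of how a rectangle feature can be evaluated in constant time using an integral image, a standard trick in Viola’s detector (the specific two-rectangle feature and 24×24 window below are illustrative):

```python
import numpy as np

def integral_image(img):
    """Cumulative sums so any rectangle sum costs four lookups."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, height, width):
    """Sum of pixel values inside a rectangle, read off the integral image."""
    padded = np.pad(ii, ((1, 0), (1, 0)))          # pad so edge rectangles work
    b, r = top + height, left + width
    return padded[b, r] - padded[top, r] - padded[b, left] + padded[top, left]

def two_rect_feature(ii, top, left, height, width):
    """Difference between left and right halves of a box: one Viola-Jones-style feature."""
    half = width // 2
    return rect_sum(ii, top, left, height, half) - rect_sum(ii, top, left + half, height, half)

img = np.random.rand(24, 24)                       # hypothetical 24x24 detection window
ii = integral_image(img)
print(two_rect_feature(ii, 4, 4, 8, 12))
```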
Example classifier for face detection • A classifier with 200 rectangle features was learned using AdaBoost. • 95% correct detection on the test set with 1 in 14,084 false positives. • Not quite competitive... [Figure: ROC curve for the 200-feature classifier]
Alternating Trees Joint work with Llew Mason
Decision trees [Figure: points in the (X, Y) plane split by X>3 and Y>5; tree: if X>3 is no, predict -1; if yes, test Y>5: no predicts -1, yes predicts +1]
Decision tree as a sum [Figure: the same tree written as a sum of real-valued contributions: a constant at the root plus a score for each answer to X>3 (e.g. -0.1 / +0.1) and to Y>5 (e.g. -0.3 / +0.2); the prediction is the sign of the total]
An alternating decision tree [Figure: the sum-of-contributions tree extended with an additional splitter node (Y<1, contributing +0.7 / 0.0) below one of the prediction nodes; the prediction is still the sign of the sum of contributions along all matching paths]
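A minimal sketch of how an alternating decision tree scores an instance (the node structure and numeric values below loosely follow the figure but are made up): each splitter contributes one of two real values depending on its test, contributions along all matching paths are summed, and the sign of the sum is the prediction.

```python
# A tiny hand-built alternating-decision-tree-like scorer.
# Each rule: (precondition, test, value_if_true, value_if_false).
# Preconditions let deeper rules fire only when their parent branch applies.

def adt_score(x, root_value, rules):
    score = root_value
    for precondition, test, v_true, v_false in rules:
        if precondition(x):                   # a rule contributes only where it applies
            score += v_true if test(x) else v_false
    return score

# Illustrative rules roughly mirroring the slide's figure (values are placeholders).
rules = [
    (lambda x: True,        lambda x: x["X"] > 3, +0.1, -0.1),
    (lambda x: x["X"] > 3,  lambda x: x["Y"] > 5, +0.2, -0.3),
    (lambda x: x["X"] <= 3, lambda x: x["Y"] < 1, +0.7,  0.0),
]

x = {"X": 4.0, "Y": 6.0}
score = adt_score(x, root_value=-0.2, rules=rules)
print(score, "+1" if score >= 0 else "-1")
```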
Example: Medical diagnostics • Cleve dataset from the UC Irvine repository. • Heart disease diagnostics (+1 = healthy, -1 = sick). • 13 features from tests (real-valued and discrete). • 303 instances.
Precision/recall graphs [Figure: accuracy as a function of classifier score]
“Drinking out of a fire hose” Allan Wilks, 1997
Data aggregation [Diagram: front-end systems (cashier’s system, telephone switch, web server, web-camera) emit massive distributed data streams into a “data warehouse”, on which analytics are run]
The database bottleneck • Physical limit: a disk “seek” takes about 0.01 sec. • In the same time you can read/write about 10^5 bytes sequentially, or perform about 10^7 CPU operations. • Commercial DBMSs are optimized for varying queries and transactions. • Classification tasks require evaluation of fixed queries on massive data streams.
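A back-of-the-envelope sketch using the numbers above and an assumed average record size of 100 bytes (the record size is an assumption, not from the talk), showing why sequential scans beat per-record random access on this workload:

```python
SEEK_TIME = 0.01           # seconds per disk seek (from the slide)
SEQ_BYTES_PER_SEEK = 1e5   # bytes readable in the time of one seek (from the slide)
N_RECORDS = 230e6          # telephone numbers to classify
RECORD_BYTES = 100         # assumed average record size

# One random seek per record vs. one long sequential scan of the whole file.
random_access_secs = N_RECORDS * SEEK_TIME
sequential_scan_secs = N_RECORDS * RECORD_BYTES / (SEQ_BYTES_PER_SEEK / SEEK_TIME)

print(f"random access:   ~{random_access_secs / 86400:.0f} days")    # roughly a month
print(f"sequential scan: ~{sequential_scan_secs / 3600:.1f} hours")  # well under an hour
```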
Working with large flat files • Sort the file according to X (e.g. “called telephone number”). • Can be done very efficiently for very large files. • Counting occurrences becomes efficient because all records for a given X appear in the same disk block. • Randomly permute records • Reading k consecutive records suffices to estimate a few statistics for a few decisions (splitting a node in a decision tree). • Done by sorting on a random number. • “Hancock” – a system for efficient computation of statistical signatures for data streams. http://www.research.att.com/~kfisher/hancock/
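A minimal sketch of the counting step on a sorted flat file (the file name and column layout are hypothetical): because all records with the same key are adjacent, one sequential pass with no random seeks suffices.

```python
import csv
from itertools import groupby

def count_calls_per_number(path):
    """One sequential pass over a file pre-sorted by called number; no random seeks needed."""
    with open(path, newline="") as f:
        reader = csv.reader(f)   # assumed columns: calling, called, time of day, length
        for called, rows in groupby(reader, key=lambda row: row[1]):
            yield called, sum(1 for _ in rows)

# for number, n_calls in count_calls_per_number("calls_sorted_by_called.csv"):
#     print(number, n_calls)
```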
Working with data streams • “You get to see each record only once” • Example problem: identify the 10 most popular items for each retail-chain customer over the last 12 months. • To learn more: Stanford’s Stream Dream Team: http://www-db.stanford.edu/sdt/
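For the “most popular items” problem, a single-pass, bounded-memory counter sketch in the spirit of Misra-Gries is one standard option; this is a generic illustration, not the specific method used at AT&T or by the Stanford group.

```python
def misra_gries(stream, k):
    """One pass, O(k) memory: returns candidates for all items occurring more than n/(k+1) times."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # Decrement every counter; drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["milk", "bread", "milk", "eggs", "milk", "bread", "jam"]
print(misra_gries(stream, k=10))   # candidates for the 10 most popular items
```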
Analyzing at the source [Diagram: the analytics side generates Java code and downloads it to the front-end systems; the front-ends upload statistics, which are aggregated back into the analytics]
Learn Slowly, Predict Fast! • Buizocity: • 10,000 instances are sufficient for learning. • 300,000,000 have to be labeled (weekly). • Generate ADTree classifier in C, compile it and run it using Hancock.
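A minimal sketch of the “predict fast” idea: emit the learned tree as straight-line C so the compiled predictor costs only a few comparisons per record. The tree, feature names, and `Record` struct below are made-up stand-ins, not the actual Buizocity model.

```python
def tree_to_c(node, indent=1):
    """Emit nested if/else C from a tiny decision-tree dict
    ({'feature', 'thresh', 'left', 'right'}) or a numeric leaf score."""
    pad = "    " * indent
    if not isinstance(node, dict):                  # leaf: return its score
        return f"{pad}return {node};\n"
    code = f"{pad}if (r->{node['feature']} > {node['thresh']}) {{\n"
    code += tree_to_c(node['right'], indent + 1)
    code += f"{pad}}} else {{\n"
    code += tree_to_c(node['left'], indent + 1)
    code += f"{pad}}}\n"
    return code

# Hypothetical learned tree over call-detail statistics.
tree = {"feature": "daytime_calls", "thresh": 20,
        "left": -0.7,
        "right": {"feature": "avg_call_length", "thresh": 180, "left": 0.3, "right": 0.9}}

print("double score(const Record *r) {\n" + tree_to_c(tree) + "}")
```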
Paul Viola’s face detector • Scan 50,000 location/scale boxes in each image, 15 images per second, to detect a few faces. • A cascaded method minimizes average processing time. • Training takes a day on a fast parallel machine. [Diagram: each image box passes through Classifier 1, 2, 3, …; a rejection (F) at any stage immediately outputs NON-FACE, and only boxes that pass every stage (T) are labeled FACE]
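A minimal sketch of the cascade’s early-exit evaluation (the stage functions and thresholds are placeholders): cheap early stages reject most boxes, so the average cost per box stays small.

```python
def cascade_is_face(box, stages):
    """Evaluate classifier stages in order; reject (NON-FACE) as soon as one stage says no."""
    for stage in stages:
        if not stage(box):
            return False        # NON-FACE: stop immediately, no further work on this box
    return True                 # passed every stage: FACE

# Placeholder stages: in the real detector each stage is a boosted classifier over rectangle features.
stages = [
    lambda box: box["mean_intensity"] > 0.1,    # very cheap first filter
    lambda box: box["edge_energy"] > 0.3,       # slightly more expensive
    lambda box: box["boosted_score"] > 0.0,     # full boosted classifier last
]

box = {"mean_intensity": 0.4, "edge_energy": 0.5, "boosted_score": 0.2}
print(cascade_is_face(box, stages))
```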
Summary • Generative vs. Predictive methodology • Boosting • Alternating trees • The database bottleneck • Learning slowly, predicting fast.
Other work 1 • Specialized data compression: • When data is collected in small bins, most bins are empty. • Instead of storing the zeros, smart compression dramatically reduces data size. • Model averaging: • Boosting and Bagging make classifiers more stable. • We need theory that does not use Bayesian assumptions. • Closely related to margin-based analysis of boosting and of SVM. • Zipf’s Law: • The distribution of words in free text is extremely skewed. • Methods should scale exponentially in entropy rather than linearly in the number of words.
Other work 2 • Online methods: • Data distribution changes with time. • Online refinement of the feature set. • Long-term learning. • Effective label collection: • Selective sampling to label only hard cases. • Comparing labels from different people to estimate reliability. • Co-training: different channels train each other. (Blum, Mitchell, McCallum)
Contact me! • Yoav@banter.com • http://www.cs.huji.ac.il/~yoavf