What are the real challenges in data mining? Charles Elkan University of California, San Diego August 21, 2003
Bogosity about learning with unbalanced data • The goal is yes/no classification. • No: ranking, or probability estimation • Often, P(c=minority|x) < 0.5 for all examples x • Decision trees and C4.5 are well-suited • No: model each class separately, then use Bayes’ rule (sketch below) • P(c|x) = P(x|c)P(c) / [ P(x|c)P(c) + P(x|~c)P(~c) ] • With naïve Bayes: P(x|c) = ∏i P(xi | c) • Under/over-sampling are appropriate • No: do cost-based example-specific sampling, then bagging • ROC curves and AUC are important
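A hedged illustration of the “model each class separately, then use Bayes’ rule” point above: a minimal Gaussian naïve Bayes sketch on toy unbalanced data. The data, helper names (fit_class_model, p_minority), and numbers are illustrative, not from the talk.

```python
# A minimal sketch (assumed, not from the talk) of "model each class
# separately, then use Bayes' rule", with an independent-feature Gaussian
# model per class on toy unbalanced data.
import numpy as np

rng = np.random.default_rng(0)
X_majority = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))  # class ~c
X_minority = rng.normal(loc=2.0, scale=1.0, size=(50, 2))    # class c

def fit_class_model(X):
    # Per-feature mean and variance for one class (naive Bayes assumption).
    return X.mean(axis=0), X.var(axis=0) + 1e-9

def log_px_given_c(X, mean, var):
    # log P(x|c) = sum_i log N(x_i; mean_i, var_i)
    return -0.5 * (np.log(2 * np.pi * var) + (X - mean) ** 2 / var).sum(axis=1)

params_c = fit_class_model(X_minority)
params_not_c = fit_class_model(X_majority)
prior_c = len(X_minority) / (len(X_minority) + len(X_majority))

def p_minority(X):
    # Bayes' rule: P(c|x) = P(x|c)P(c) / [P(x|c)P(c) + P(x|~c)P(~c)],
    # computed in log space for numerical stability.
    a = log_px_given_c(X, *params_c) + np.log(prior_c)
    b = log_px_given_c(X, *params_not_c) + np.log(1.0 - prior_c)
    return 1.0 / (1.0 + np.exp(b - a))

# These probabilities support ranking and cost-sensitive decisions even if
# P(c=minority|x) never exceeds 0.5 on any example.
print(p_minority(np.array([[2.0, 2.0], [0.0, 0.0]])))
```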
Learning to predict contact maps: 3D protein → distance map → binary contact map. (Source: Paolo Frasconi et al.)
Issues in contact map prediction • An ML researcher sees O(n²) non-contacts and O(n) contacts. • But to a biologist, the concept “an example of a non-contact” is far from natural. • Moreover, there is no natural probability distribution defining the population of “all” proteins. • A statistician sees simply O(n²) distance measures, but s/he finds least-squares regression is useless!
For the rooftop detection task … • We used […] BUDDS, to extract candidate rooftops (i.e., parallelograms) from six large-area images. Such processing resulted in 17,289 candidates, which an expert labeled as 781 positive examples and 17,048 negative examples of the concept “rooftop.” • (Source: Learning When Data Sets are Imbalanced and When Costs are Unequal and Unknown, Marcus Maloof, this workshop.)
How to detect faces in real-time? • Viola and Jones, CVPR ‘01: • Slide window over image • 45396 features per window • Learn boosted decision-stump classifier
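Not Viola and Jones’ actual code, but a sketch of the model family they learn: AdaBoost over decision stumps, here via scikit-learn on stand-in data. The rectangle-filter feature extraction and the sliding window are omitted; all dataset parameters are illustrative.

```python
# Sketch of a boosted decision-stump classifier (AdaBoost over depth-1
# trees) on stand-in data; X plays the role of per-window feature vectors.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Stand-in data: highly unbalanced, many candidate windows, few faces.
X, y = make_classification(n_samples=5000, n_features=100,
                           weights=[0.98, 0.02], random_state=0)

# AdaBoostClassifier's default base learner is a depth-1 tree (a stump).
clf = AdaBoostClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```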
UCI datasets are small and not highly unbalanced (Source: C4.5 and Imbalanced Data Sets, Nitin Chawla, this workshop.)
Features of the DMEF and similar datasets • At least 10^5 examples and 10^2.5 features. • No single well-defined target class. • Interesting cases have frequency < 0.01. • Much information on costs and benefits, but no overall model of profit/loss. • Different cost matrices for different examples. • Most cost matrix entries are unknown.
Example-dependent costs and benefits • Observations: • Loss or profit depends on the transaction size x. • Figuring out the full profit/loss model is hard. • Opportunity costs are confusing. • Creative management transforms costs into benefits. • How do we account for long-term costs and benefits?
Correct decisions require correct probabilities • Let p = P(legitimate). The optimal decision is “approve” iff • 0.01·x·p − (1−p)·x > (−20)·p + (−10)·(1−p) • (Approving earns a 1% fee on a legitimate transaction of size x but loses x on a fraudulent one; denying costs 20 for a legitimate transaction and 10 for a fraudulent one.) • This calculation requires well-calibrated estimates of p.
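A worked sketch of the rule above (function names are illustrative): solving the inequality for p shows how strongly the approval threshold depends on the transaction size x.

```python
# Worked sketch of the decision rule above; the 1% fee and the -20/-10
# costs of denial come from the slide's cost model.
def expected_profit_approve(p, x):
    return 0.01 * x * p - (1.0 - p) * x

def expected_profit_deny(p):
    return (-20.0) * p + (-10.0) * (1.0 - p)

def approve(p, x):
    return expected_profit_approve(p, x) > expected_profit_deny(p)

def approval_threshold(x):
    # Solve 0.01*x*p - (1-p)*x = -20*p - 10*(1-p) for p.
    return (x - 10.0) / (1.01 * x + 10.0)

# For a $1000 transaction the threshold is about 0.97: the decision hinges
# on a well-calibrated probability, not merely a good ranking.
print(approval_threshold(1000.0))
```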
ROC curves considered harmful (Source: Medical College of Georgia.) • “AUC can give a general idea of the quality of the probabilistic estimates produced by the model” • No, AUC only evaluates the ranking produced. • “Cost curves are equivalent to ROC curves” • No, a single point on the ROC curve is optimal only if costs are the same for all examples. • Advice: Use $ profit to compare methods. • Issue: When is the $ difference statistically significant?
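A hedged sketch of the “compare methods by $ profit” advice, reusing the example-dependent credit-card costs from the previous slide, plus a paired bootstrap as one possible answer to the significance question. All names and inputs are illustrative, not from the talk.

```python
# Compare classifiers by realized $ profit rather than AUC, with
# example-dependent costs; amounts, labels, and predicted probabilities
# are assumed inputs (numpy arrays of equal length).
import numpy as np

def total_profit(p_hat, y, amount):
    # Approve when p_hat clears the example-dependent threshold, then sum
    # the realized profit/loss (y == 1 means the transaction is legitimate).
    thresh = (amount - 10.0) / (1.01 * amount + 10.0)
    approve = p_hat > thresh
    return np.where(approve,
                    np.where(y == 1, 0.01 * amount, -amount),
                    np.where(y == 1, -20.0, -10.0)).sum()

def paired_bootstrap_diff(p1, p2, y, amount, n_boot=1000, seed=0):
    # One answer to "when is the $ difference significant?": a paired
    # bootstrap over test examples gives an interval for the difference.
    rng = np.random.default_rng(seed)
    n = len(y)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        diffs.append(total_profit(p1[idx], y[idx], amount[idx]) -
                     total_profit(p2[idx], y[idx], amount[idx]))
    return np.percentile(diffs, [2.5, 97.5])
```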
Usually we must learn a model to estimate costs • Cost matrix for soliciting donors to a charity. • The donation amount x is always unknown for test examples, so we must use the training data to learn a regression model to predict x.
So, we learn a model to estimate costs … • Issue: The subset in the training set with x > 0 is a skewed sample for learning a model to estimate x. • Reason: Donation amount x and probability of donation p are inversely correlated. • Hence, the training set contains too few examples of large donations, compared to small ones.
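A minimal sketch of the two-model recipe these slides describe, under assumptions: X, donated, and amounts stand in for a real campaign dataset, and the $0.68 mailing cost is an illustrative figure, not taken from the talk. Estimate P(donate|x), estimate the donation amount with a regression fit only on donors, and solicit when the expected donation exceeds the cost.

```python
# Sketch of deciding whom to solicit with two learned models; all names
# and the mail_cost default are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def decide_to_mail(X_train, donated, amounts, X_test, mail_cost=0.68):
    # Model 1: probability of donating at all.
    p_model = LogisticRegression(max_iter=1000).fit(X_train, donated)
    # Model 2: donation amount, fit only on donors (x > 0) -- exactly the
    # skewed subsample this slide warns about.
    donors = donated == 1
    amt_model = LinearRegression().fit(X_train[donors], amounts[donors])
    p = p_model.predict_proba(X_test)[:, 1]
    amt = amt_model.predict(X_test)
    # Solicit when the expected donation exceeds the cost of mailing.
    return p * amt > mail_cost
```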
The “reject inference” problem • Let humans make credit grant/deny decisions. • Collect data about repay/write-off, but only for people to whom credit is granted. • Learn a model from this training data. • Apply the model to all future applicants. • Issue: “All future applicants” is a sample from a different population than “people to whom credit is granted.”
Selection bias makes training labels incorrect • In the Wisconsin Prognostic Breast Cancer Database, average survival time with chemotherapy is lower (58.9 months) than without (63.1)! • Historical actions are not optimal, but they are not chosen randomly either. (Source: William H. Wolberg, M.D.)
Sequences of training sets • Use data collected in 2000 to learn a model; apply this model to select inside the 2001 population. • Use data about the individuals selected in 2001 to learn a new model; apply this model in 2002. • And so on… • Each time a new model is learned, its training set has been created using a different selection bias.
Let’s use the word “unbalanced” in the future • Google: searching the web for imbalanced finds about 53,800 pages. • Searching the web for unbalanced finds about 465,000 pages.
C. Elkan. The Foundations of Cost-Sensitive Learning. IJCAI'01, pp. 973-978. • B. Zadrozny and C. Elkan. Learning and Making Decisions When Costs and Probabilities are Both Unknown. KDD'01, pp. 204-213. • B. Zadrozny and C. Elkan. Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayesian Classifiers. ICML'01, pp. 609-616. • N. Abe et al. Empirical Comparison of Various Reinforcement Learning Strategies for Sequential Targeted Marketing. ICDM'02. • B. Zadrozny, J. Langford, and N. Abe. Cost-Sensitive Learning by Cost-Proportionate Example Weighting. ICDM'03.