What are the real challenges in data mining? Charles Elkan University of California, San Diego August 21, 2003
Bogosity about learning with unbalanced data • The goal is yes/no classification. • No: ranking, or probability estimation • Often, P(c=minority|x) < 0.5 for all examples x • Decision trees and C4.5 are well-suited • No: model each class separately, then use Bayes’ rule (sketch below) • P(c|x) = P(x|c)P(c) / [ P(x|c)P(c) + P(x|~c)P(~c) ] • With naïve Bayes: P(x|c) = ∏i P(xi | c) • Under/over-sampling are appropriate • No: do cost-based example-specific sampling, then bagging • ROC curves and AUC are important
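A hedged illustration of the “model each class separately, then use Bayes’ rule” point above: a minimal Gaussian naïve Bayes sketch on toy unbalanced data. The data, helper names (fit_class_model, p_minority), and numbers are illustrative, not from the talk.

```python
# A minimal sketch (assumed, not from the talk) of "model each class
# separately, then use Bayes' rule", with an independent-feature Gaussian
# model per class on toy unbalanced data.
import numpy as np

rng = np.random.default_rng(0)
X_majority = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))  # class ~c
X_minority = rng.normal(loc=2.0, scale=1.0, size=(50, 2))    # class c

def fit_class_model(X):
    # Per-feature mean and variance for one class (naive Bayes assumption).
    return X.mean(axis=0), X.var(axis=0) + 1e-9

def log_px_given_c(X, mean, var):
    # log P(x|c) = sum_i log N(x_i; mean_i, var_i)
    return -0.5 * (np.log(2 * np.pi * var) + (X - mean) ** 2 / var).sum(axis=1)

params_c = fit_class_model(X_minority)
params_not_c = fit_class_model(X_majority)
prior_c = len(X_minority) / (len(X_minority) + len(X_majority))

def p_minority(X):
    # Bayes' rule: P(c|x) = P(x|c)P(c) / [P(x|c)P(c) + P(x|~c)P(~c)],
    # computed in log space for numerical stability.
    a = log_px_given_c(X, *params_c) + np.log(prior_c)
    b = log_px_given_c(X, *params_not_c) + np.log(1.0 - prior_c)
    return 1.0 / (1.0 + np.exp(b - a))

# These probabilities support ranking and cost-sensitive decisions even if
# P(c=minority|x) never exceeds 0.5 on any example.
print(p_minority(np.array([[2.0, 2.0], [0.0, 0.0]])))
```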
Learning to predict contact maps: 3D protein → distance map → binary contact map. (Source: Paolo Frasconi et al.)
Issues in contact map prediction • An ML researcher sees O(n²) non-contacts and O(n) contacts. • But to a biologist, the concept “an example of a non-contact” is far from natural. • Moreover, there is no natural probability distribution defining the population of “all” proteins. • A statistician sees simply O(n²) distance measures, but s/he finds least-squares regression is useless!
For the rooftop detection task … • We used […] BUDDS, to extract candidate rooftops (i.e., parallelograms) from six large-area images. Such processing resulted in 17,289 candidates, which an expert labeled as 781 positive examples and 17,048 negative examples of the concept “rooftop.” • (Source: Learning When Data Sets are Imbalanced and When Costs are Unequal and Unknown, Marcus Maloof, this workshop.)
How to detect faces in real-time? • Viola and Jones, CVPR ‘01: • Slide window over image • 45396 features per window • Learn boosted decision-stump classifier
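Not Viola and Jones’ actual code, but a sketch of the model family they learn: AdaBoost over decision stumps, here via scikit-learn on stand-in data. The rectangle-filter feature extraction and the sliding window are omitted; all dataset parameters are illustrative.

```python
# Sketch of a boosted decision-stump classifier (AdaBoost over depth-1
# trees) on stand-in data; X plays the role of per-window feature vectors.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Stand-in data: highly unbalanced, many candidate windows, few faces.
X, y = make_classification(n_samples=5000, n_features=100,
                           weights=[0.98, 0.02], random_state=0)

# AdaBoostClassifier's default base learner is a depth-1 tree (a stump).
clf = AdaBoostClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```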
UCI datasets are small and not highly unbalanced (Source: C4.5 and Imbalanced Data Sets, Nitin Chawla, this workshop.)
Features of the DMEF and similar datasets • At least 10^5 examples and 10^2.5 features. • No single well-defined target class. • Interesting cases have frequency < 0.01. • Much information on costs and benefits, but no overall model of profit/loss. • Different cost matrices for different examples. • Most cost matrix entries are unknown.
Example-dependent costs and benefits • Observations: • Loss or profit depends on the transaction size x. • Figuring out the full profit/loss model is hard. • Opportunity costs are confusing. • Creative management transforms costs into benefits. • How do we account for long-term costs and benefits?
Correct decisions require correct probabilities • Let p = P(legitimate). The optimal decision is “approve” iff • 0.01·x·p − (1−p)·x > (−20)·p + (−10)·(1−p) • (Approving earns a 1% fee on a legitimate transaction of size x but loses x on a fraudulent one; denying costs 20 for a legitimate transaction and 10 for a fraudulent one.) • This calculation requires well-calibrated estimates of p.
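A worked sketch of the rule above (function names are illustrative): solving the inequality for p shows how strongly the approval threshold depends on the transaction size x.

```python
# Worked sketch of the decision rule above; the 1% fee and the -20/-10
# costs of denial come from the slide's cost model.
def expected_profit_approve(p, x):
    return 0.01 * x * p - (1.0 - p) * x

def expected_profit_deny(p):
    return (-20.0) * p + (-10.0) * (1.0 - p)

def approve(p, x):
    return expected_profit_approve(p, x) > expected_profit_deny(p)

def approval_threshold(x):
    # Solve 0.01*x*p - (1-p)*x = -20*p - 10*(1-p) for p.
    return (x - 10.0) / (1.01 * x + 10.0)

# For a $1000 transaction the threshold is about 0.97: the decision hinges
# on a well-calibrated probability, not merely a good ranking.
print(approval_threshold(1000.0))
```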
ROC curves considered harmful (Source: Medical College of Georgia.) • “AUC can give a general idea of the quality of the probabilistic estimates produced by the model” • No, AUC only evaluates the ranking produced. • “Cost curves are equivalent to ROC curves” • No, a single point on the ROC curve is optimal only if costs are the same for all examples. • Advice: Use $ profit to compare methods. • Issue: When is the $ difference statistically significant?
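A hedged sketch of the “compare methods by $ profit” advice, reusing the example-dependent credit-card costs from the previous slide, plus a paired bootstrap as one possible answer to the significance question. All names and inputs are illustrative, not from the talk.

```python
# Compare classifiers by realized $ profit rather than AUC, with
# example-dependent costs; amounts, labels, and predicted probabilities
# are assumed inputs (numpy arrays of equal length).
import numpy as np

def total_profit(p_hat, y, amount):
    # Approve when p_hat clears the example-dependent threshold, then sum
    # the realized profit/loss (y == 1 means the transaction is legitimate).
    thresh = (amount - 10.0) / (1.01 * amount + 10.0)
    approve = p_hat > thresh
    return np.where(approve,
                    np.where(y == 1, 0.01 * amount, -amount),
                    np.where(y == 1, -20.0, -10.0)).sum()

def paired_bootstrap_diff(p1, p2, y, amount, n_boot=1000, seed=0):
    # One answer to "when is the $ difference significant?": a paired
    # bootstrap over test examples gives an interval for the difference.
    rng = np.random.default_rng(seed)
    n = len(y)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        diffs.append(total_profit(p1[idx], y[idx], amount[idx]) -
                     total_profit(p2[idx], y[idx], amount[idx]))
    return np.percentile(diffs, [2.5, 97.5])
```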
Usually we must learn a model to estimate costs • Cost matrix for soliciting donors to a charity. • The donation amount x is always unknown for test examples, so we must use the training data to learn a regression model to predict x.
So, we learn a model to estimate costs … • Issue: The subset in the training set with x > 0 is a skewed sample for learning a model to estimate x. • Reason: Donation amount x and probability of donation p are inversely correlated. • Hence, the training set contains too few examples of large donations, compared to small ones.
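A minimal sketch of the two-model recipe these slides describe, under assumptions: X, donated, and amounts stand in for a real campaign dataset, and the $0.68 mailing cost is an illustrative figure, not taken from the talk. Estimate P(donate|x), estimate the donation amount with a regression fit only on donors, and solicit when the expected donation exceeds the cost.

```python
# Sketch of deciding whom to solicit with two learned models; all names
# and the mail_cost default are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def decide_to_mail(X_train, donated, amounts, X_test, mail_cost=0.68):
    # Model 1: probability of donating at all.
    p_model = LogisticRegression(max_iter=1000).fit(X_train, donated)
    # Model 2: donation amount, fit only on donors (x > 0) -- exactly the
    # skewed subsample this slide warns about.
    donors = donated == 1
    amt_model = LinearRegression().fit(X_train[donors], amounts[donors])
    p = p_model.predict_proba(X_test)[:, 1]
    amt = amt_model.predict(X_test)
    # Solicit when the expected donation exceeds the cost of mailing.
    return p * amt > mail_cost
```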
The “reject inference” problem • Let humans make credit grant/deny decisions. • Collect data about repay/write-off, but only for people to whom credit is granted. • Learn a model from this training data. • Apply the model to all future applicants. • Issue: “All future applicants” is a sample from a different population than “people to whom credit is granted.”
Selection bias makes training labels incorrect • In the Wisconsin Prognostic Breast Cancer Database, average survival time with chemotherapy is lower (58.9 months) than without (63.1)! • Historical actions are not optimal, but they are not chosen randomly either. (Source: William H. Wolberg, M.D.)
Sequences of training sets • Use data collected in 2000 to learn a model; apply this model to select inside the 2001 population. • Use data about the individuals selected in 2001 to learn a new model; apply this model in 2002. • And so on… • Each time a new model is learned, its training set has been created using a different selection bias.
Let’s use the word “unbalanced” in the future • Google: searching the web for imbalanced finds about 53,800 pages. • Searching the web for unbalanced finds about 465,000 pages.
C. Elkan. The Foundations of Cost-Sensitive Learning. IJCAI'01, pp. 973-978. • B. Zadrozny and C. Elkan. Learning and Making Decisions When Costs and Probabilities are Both Unknown. KDD'01, pp. 204-213. • B. Zadrozny and C. Elkan. Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayesian Classifiers. ICML'01, pp. 609-616. • N. Abe et al. Empirical Comparison of Various Reinforcement Learning Strategies for Sequential Targeted Marketing. ICDM'02. • B. Zadrozny, J. Langford, and N. Abe. Cost-Sensitive Learning by Cost-Proportionate Example Weighting. ICDM'03.