350 likes | 372 Views
Evaluation. CSC 576: Data Mining. Today…. Evaluation Metrics: Misclassification Rate Confusion Matrix Precision, Recall, F1 Evaluation Experiments: Cross-Validation Bootstrap Out-of-time Sampling. Evaluation Metrics. How should classifier be quantitatively evaluated?
E N D
Evaluation CSC 576: Data Mining
Today… • Evaluation Metrics: • Misclassification Rate • Confusion Matrix • Precision, Recall, F1 • Evaluation Experiments: • Cross-Validation • Bootstrap • Out-of-time Sampling
Evaluation Metrics • How should classifier be quantitatively evaluated? • Misclassification Rate:
Evaluation Metrics • Issues with misclassification rate? • If Accuracy is used: • Example: in CC fraud domain, there are many more legitimate transactions than fraudulent transactions • Presume only 1% of transactions are fraudulent • Then: • Classifier that predicts every transaction as GOOD would have 99% accuracy! • Seems great, but it’s not…
Alternative Metrics • Notation (for binary classification): + (positive class) - (negative class) • Class imblanace: • rare class vs. majority class
Confusion Matrix: Counts For binary problems, there are 4 possible problems: • True Positive (TP):number of positive examples correctly predicted by model • False Negative (FN):number of positive examples wrongly predicted as negative • False Positive (FP):number of negative examples wrongly predicted as positive • True Negative (TN):number of negative examples correctly predicted
Confusion Matrix: Percentages • True Positive Rate (TPR):fraction of positive examples correctly predicted by model • Also referred to as sensitivity • False Negative Rate (FNR):fraction of positive examples wrongly predicted as negative • False Positive Rate (FPR):fraction of negative examples wrongly predicted as positive • True Negative Rate (TNR):fraction of negative examples correctly predicted • Also referred to as specificity TPR = TP/(TP+FN) FNR = FN/(TP+FN) FPR = FP/(TN+FP) TNR = TN/(TN+FP)
Precision and Recall • Widely used metrics when successful detection of one class is considered more significant than detection of the other classes
Precision • Precision: fraction of records that are positive in the set of records that classifier predicted as positive • Interpretation: higher the precision, the lower the number of false positive errors
Recall • Recall: fraction of positive records that are correctly predicted by classifier • Interpretation: higher the recall, the fewer number of positive records misclassified as negative class
Baseline Models • Often naïve models • Example #1: classify every instance as positive (the rare class) • What is accuracy? Precision? Recall? • Precision = poor; Recall = 100% • Baseline models often maximize one metric (precision, recall) but not the other. • Key challenge: building a model that performs well with both metrics
F1 Measure • F-measure: combines precision and recall into a single metric, using the harmonic mean • Harmonic Mean of two numbers tends to be closer to the smaller of two numbers… • …so the only way F1 is high is for both precision and recall to be high.
Evaluation • How to divide dataset into Training set, Validation set, Testing set? • Sample (randomly partition) • Problems? • Requires enough data to create suitably large enough sets • “Lucky split” is possible • Difficult instances were chosen for the training set • Easy instances put into the testing set
Cross Validation • Widely used alternative to singleTraining Set + Validation Set • Multipleevaluations using different portions of data for training and validating/testing • k-fold Cross Validation • k = number of folds (integer)
k-Fold Cross Validation • Available data is divided into equal-sized folds (partitions) • k separate evaluation experiments: • Training set: k-1 folds • Test set: 1 fold • Repeat k-1 more times, each using a different fold for the test set
10-fold Cross Validation 1/10th of training data X 9 = 90% used for training • Repeat k times • Average results • Each instance will be used once in testing 10% used for testing
Cross-Validation • More computationally expensive • k-fold Cross-Validation • k = number of folds (integer) • k = 10 is common • Leave-One-Out Cross-Validation • Extreme: k = n, where there are n-1 observations in training+validation set • Useful when the amount of data available is too small to have big enough training sets in a k-fold cross-validation. • Significantly more computationally expensive
Leave-One-Out Cross-Validation n folds Training set: n-1 instances Testing set: 1 single instance n instances
Cross-Validation Error Estimate Average the error rate for each fold. In leave-one-out cross-validation, since each test contains only one record, the variance of the estimated performance will be high. (Usually either 100% or 0%.)
ε0 Bootstrapping • Also called .632 bootstrap • Preferred over cross-validation approaches in contexts with very small datasets • < 300 instances • Performs multiple evaluation experiments using slightly different training and testing sets • Repeat k times • Typically, k is set greater than 200 • Many more folds than with k-fold cross-validation
Out-of-time Sampling • Some domains have a time dimension • chronological structure in data • One time span used for training set • another time span for the test set
Out-of-time Sampling • Important that out-of-time sample uses time spans large enough to take into account any cyclical behavior patterns
Average Class Accuracy Two models. Which is better? We’re really interested in identifying the churn customers. Model 1 Accuracy: 91% Model 2 Accuracy: 78% Note the imblanaced test set.
Average Class Accuracy • Using average class accuracy instead of classification accuracy. • Average recall for each possible level of a target class.
Average Class Accuracy Two models. Which is better? We’re really interested in identifying the churn customers. Model 1 Accuracy: 91% Model 2 Accuracy: 78% Model 1: Average Class Accuracy: 55% Model 2: Average Class Accuracy: 79%
Average Class Accuracy (Harmonic Mean) • Using average class accuracy can be computed using the harmonic mean instead of the arithmetic mean. • Arithmetic means are susceptible to influence of large outliers. • The harmonic mean results in a more pessimistic view of model performance than the arithmetic mean.
Measuring Profit and Loss • Should all cells in a confusion matrix have the same weight? Not always correct to treat all outcomes equally.
Measuring Profit and Loss • Small cost incurred by company on false positives. • Large cost incurred by company on false negatives. • No cost on true positives and true negatives. • Take into account the costs of different outcomes…
Measuring Profit and Loss • Calculate a profit matrix, profit or loss from each outcome in the confusion matrix. • Used in overall model evaluation. • Example:
Measuring Profit and Loss Average class accuracyHM = 80.761% Average class accuracyHM = 83.824%
References • Fundamentals of Machine Learning for Predictive Data Analytics, 1st Edition, Kelleher et al. • Data Science from Scratch, 1st Edition, Grus • Data Mining and Business Analytics in R, 1st edition, Ledolter • An Introduction to Statistical Learning, 1st edition, James et al. • Discovering Knowledge in Data, 2nd edition, Larose et al. • Introduction to Data Mining, 1st edition, Tam et al.