Cross-Validation
• Foundation is a simple idea – “holdout” – hold out a certain amount of the data for testing and use the rest for training
• Separation should NOT be by “convenience” – it should at least be random
• Better – “stratified” random – the division preserves the relative proportion of classes in both training and test data
• Enhancement: repeated holdout – enables using more data in training while still getting a good test
• 10-fold cross-validation has become standard
• This is improved if the folds are chosen in a “stratified” random way
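The stratified split described above can be sketched in a few lines of Python. This is a minimal illustration, not WEKA’s implementation: class indices are shuffled and then dealt round-robin into k folds, so each fold preserves the class proportions.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Split instance indices into k folds, preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)              # random order within each class
        for j, i in enumerate(idxs):
            folds[j % k].append(i)     # deal out round-robin
    return folds

# 60 "yes" / 40 "no" instances -> every fold gets 6 "yes" and 4 "no";
# each fold serves once as the test set, the other 9 as training data.
labels = ["yes"] * 60 + ["no"] * 40
folds = stratified_folds(labels)
```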
For Small Datasets
• Leave One Out
• Bootstrapping
• To be discussed in turn
Leave One Out
• Train on all but one instance, test on that one (per-instance accuracy is always 100% or 0%)
• Repeat until every instance has been tested on, then average the results
• Really equivalent to N-fold cross-validation where N = number of instances available
• Plusses:
• Always trains on the maximum possible training data (without cheating)
• Efficient to run – no repeated runs needed (since fold contents are not randomized)
• No stratification or random sampling necessary
• Minuses:
• Guarantees a non-stratified sample – the correct class will always be at least a little under-represented in the training data
• Statistical tests are not appropriate
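A sketch of the procedure, using a toy majority-class “classifier” (a stand-in, not from the slides) to show both the mechanics and the non-stratification minus: when the lone minority instance is held out, the training data contains none of its class.

```python
def leave_one_out(instances, train_and_test):
    """N-fold CV with N = len(instances): each instance is the test
    set exactly once; each run scores either 100% or 0%."""
    correct = 0
    for i, test in enumerate(instances):
        train = instances[:i] + instances[i + 1:]
        correct += train_and_test(train, test)   # returns 1 or 0
    return correct / len(instances)

# Toy classifier: always predict the majority label of the training data.
data = [("a", 1), ("b", 1), ("c", 1), ("d", 0)]

def majority(train, test):
    ones = sum(y for _, y in train)
    pred = 1 if ones > len(train) - ones else 0
    return 1 if pred == test[1] else 0

acc = leave_one_out(data, majority)   # the minority instance is always missed
```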
Bootstrapping
• Sampling is done with replacement to form the training dataset
• Particular approach – the 0.632 bootstrap
• A dataset of n instances is sampled n times
• Some instances will be included multiple times
• Those never picked are used as test data
• On a large enough dataset, about 0.632 of the distinct data instances will end up in the training dataset; the rest will be in the test set
• This gives a somewhat pessimistic estimate of performance, since only about 63.2% of the data is used for training (vs 90% in 10-fold cross-validation)
• May try to balance by weighting in performance predicting the training data (p 129) <but this doesn’t seem fair>
• This procedure can be repeated any number of times, allowing statistical tests
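A quick sketch of the sampling step, which also demonstrates the 0.632 figure empirically: with n draws with replacement, the expected fraction of distinct instances landing in the training set tends to 1 − 1/e ≈ 0.632.

```python
import random

def bootstrap_split(n, seed=0):
    """Sample n indices with replacement for training; instances
    never picked form the test set (the 0.632 bootstrap)."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    test = sorted(set(range(n)) - set(train))
    return train, test

n = 100_000
train, test = bootstrap_split(n)
frac_in_train = len(set(train)) / n   # tends to 1 - 1/e ≈ 0.632
```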
Counting the Cost
• Some mistakes are more costly to make than others
• Giving a loan to a defaulter is more costly than denying somebody who would be a good customer
• Sending a mail solicitation to somebody who won’t buy is less costly than missing somebody who would buy (opportunity cost)
• Looking at a confusion matrix, each position could have an associated cost (or benefit, for correct positions)
• The measurement could be average profit/loss per prediction
• To be fair, a cost-benefit analysis should also factor in the cost of collecting and preparing the data, building the model …
Lift Charts
• In practice, costs are frequently not known
• Decisions may be made by comparing possible scenarios
• Book example – promotional mailing
• Situation 1 – previous experience predicts that 0.1% of all (1,000,000) households will respond
• Situation 2 – classifier predicts that 0.4% of the 100,000 most promising households will respond
• Situation 3 – classifier predicts that 0.2% of the 400,000 most promising households will respond
• The increase in response rate is the lift (0.4 / 0.1 = 4 in situation 2, compared to sending to all)
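The lift calculation for the book’s mailing example is just a ratio of response rates; a one-liner makes the two scenarios explicit:

```python
def lift(rate_targeted, rate_overall):
    """Lift = response rate in the targeted subset divided by the
    baseline response rate over the whole population."""
    return rate_targeted / rate_overall

# Rates from the mailing example, expressed as fractions:
lift_sit2 = lift(0.004, 0.001)   # 100,000 most promising households
lift_sit3 = lift(0.002, 0.001)   # 400,000 most promising households
```

Situation 2 gives a lift of 4, situation 3 a lift of 2; which is preferable still depends on the (unknown) costs of mailing and of missed responders.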
Information Retrieval (IR) Measures
• E.g., given a WWW search, a search engine produces a list of hits that are supposedly relevant
• Which is better?
• Retrieving 100, of which 40 are actually relevant
• Retrieving 400, of which 80 are actually relevant
• It really depends on the costs
Information Retrieval (IR) Measures
• The IR community has developed 3 measures:
• Recall = (number of documents retrieved that are relevant) / (total number of documents that are relevant)
• Precision = (number of documents retrieved that are relevant) / (total number of documents that are retrieved)
• F-measure = (2 × recall × precision) / (recall + precision)
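Applying these formulas to the search-engine example above requires knowing how many relevant documents exist in total, which the slides do not give; assuming 200 for illustration, the two retrieval strategies trade recall against precision and end up with the same F-measure:

```python
def ir_measures(retrieved, relevant_retrieved, relevant_total):
    """Recall, precision, and F-measure for one retrieval run."""
    recall = relevant_retrieved / relevant_total
    precision = relevant_retrieved / retrieved
    f = 2 * recall * precision / (recall + precision)
    return recall, precision, f

# Hypothetical: 200 relevant documents exist in the collection.
r1, p1, f1 = ir_measures(100, 40, 200)   # retrieve 100, 40 relevant
r2, p2, f2 = ir_measures(400, 80, 200)   # retrieve 400, 80 relevant
```

Here the first run has precision 0.4 and recall 0.2, the second precision 0.2 and recall 0.4, and both get the same F-measure, which is why the choice between them really comes down to costs.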
WEKA
• Part of the results provided by WEKA (that we’ve ignored so far)
• Let’s look at an example (Naïve Bayes on my-weather-nominal)

=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.667    0.125    0.8        0.667   0.727      yes
0.875    0.333    0.778      0.875   0.824      no

=== Confusion Matrix ===
 a b   <-- classified as
 4 2 | a = yes
 1 7 | b = no

• TP rate and recall are the same = TP / (TP + FN)
• For Yes = 4 / (4 + 2); for No = 7 / (7 + 1)
• FP rate = FP / (FP + TN) – for Yes = 1 / (1 + 7); for No = 2 / (2 + 4)
• Precision = TP / (TP + FP) – for Yes = 4 / (4 + 1); for No = 7 / (7 + 2)
• F-measure = 2TP / (2TP + FP + FN)
• For Yes = 2×4 / (2×4 + 1 + 2) = 8 / 11
• For No = 2×7 / (2×7 + 2 + 1) = 14 / 17
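The per-class numbers in WEKA’s table can be reproduced directly from the confusion matrix; here is the calculation for the "yes" class, matching the slide’s formulas:

```python
# Confusion matrix from the slide, taking "yes" as the positive class:
#              predicted yes   predicted no
# actual yes        4 (TP)         2 (FN)
# actual no         1 (FP)         7 (TN)
TP, FN, FP, TN = 4, 2, 1, 7

tp_rate   = TP / (TP + FN)             # = recall
fp_rate   = FP / (FP + TN)
precision = TP / (TP + FP)
f_measure = 2 * TP / (2 * TP + FP + FN)
```

Rounded to three decimals these are 0.667, 0.125, 0.8, and 0.727 — exactly the "yes" row of WEKA’s Detailed Accuracy By Class output. Swapping the roles of the two classes (TP=7, FN=1, FP=2, TN=4) reproduces the "no" row.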
In Terms of True Positives, etc.
• True positives = TP; false positives = FP
• True negatives = TN; false negatives = FN
• Recall = TP / (TP + FN) // true positives / actually positive
• Precision = TP / (TP + FP) // true positives / predicted positive
• F-measure = 2TP / (2TP + FP + FN)
• This has been generated by algebra from the previous formula
• Easier to understand this way – correct predictions are double-counted, once for recall and once for precision; the denominator includes corrects and incorrects (based on either the recall or the precision idea – relevant but not retrieved, or retrieved but not relevant)
• There is no mathematics that says recall and precision must be combined this way – it is ad hoc (the harmonic mean of the two) – but it does balance the two
WEKA
• On many occasions this borders on “too much information”, but it’s all there
• We can decide: are we more interested in Yes, or No?
• Are we more interested in recall or precision?
WEKA – With More Than Two Classes
• Contact Lenses with Naïve Bayes

=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.8      0.053    0.8        0.8     0.8        soft
0.25     0.1      0.333      0.25    0.286      hard
0.8      0.444    0.75       0.8     0.774      none

=== Confusion Matrix ===
  a  b  c   <-- classified as
  4  0  1 | a = soft
  0  1  3 | b = hard
  1  2 12 | c = none

• Class exercise – show how to calculate recall, precision, and F-measure for each class
Applying Action Rules to Change Detractor to Passive /Accuracy–Precision, Coverage–Recall/
• Let’s assume that we built action rules from the classifiers for Promoter & Detractor
• The goal is to change Detractors -> Promoters
• The confidence of the action rule = 0.993 × 0.849 ≈ 0.84
• Our action rule can target only 4.2 (out of 10.2) detractors
• So we can expect about 4.2 × 0.84 ≈ 3.5 detractors moving to promoter status
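A quick check of the slide’s arithmetic (the slide rounds the confidence to 0.84 before multiplying; carrying full precision gives essentially the same answer):

```python
confidence = 0.993 * 0.849        # composed confidence of the action rule
reachable = 4.2                   # detractors the rule can target (of 10.2)
expected = reachable * confidence # expected conversions to promoter status
```

Both ways of rounding land at roughly 3.5 expected conversions out of the 4.2 targetable detractors.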