Machine Learning and Bioinformatics (機器學習與生物資訊學)
Evaluation: the key to success
Three datasets, for all of which the answers must be known
Note on parameter tuning
• It is important that the testing data is not used in any way to create the classifier
• Some learning schemes operate in two stages
  • build the basic structure
  • optimize parameters
• The testing data cannot be used for parameter tuning
  • the proper procedure uses three sets: training, tuning and testing data (see the sketch below)
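A minimal sketch of the three-way split, not taken from the slides: it assumes Python with scikit-learn and a synthetic feature matrix X and label vector y; the split proportions are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 samples, 20 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = rng.integers(0, 2, size=1000)

# Carve off the testing set first; it must not influence the classifier in any way.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Split the remainder into training (to build the model) and tuning (to pick parameters).
X_train, X_tune, y_train, y_tune = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Parameters are chosen on the tuning set only; the testing set is touched once, at the end.
```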
Data is usually limited
• Error on the training data is NOT a good indicator of performance on future data
  • otherwise 1NN would be the optimum classifier
• Not a problem if lots of (answered) data is available
  • split data into training, tuning and testing sets
• However, (answered) data is usually limited
• More sophisticated techniques need to be used
Issues in evaluation
• Statistical reliability of estimated differences in performance: significance tests
• Choice of performance measures
  • number of correctly classified samples
  • ratio of correctly classified samples
  • error in numeric predictions
• Costs assigned to different types of errors
  • many practical applications involve costs
Training and testing sets
• The testing set must play no part, including parameter tuning, in classifier formation
• Ideally, both training and testing sets are representative samples of the underlying problem, but they may differ in nature
  • e.g., we got data from two different towns A and B and want to estimate the performance of our classifier in a completely new town
Which (training vs. tuning/testing) should be more similar to the target new town?
Making the most of the data
• Once evaluation is complete, all the data can be used to build the final classifier for real (unknown) data
• A dilemma
  • generally, the larger the training data, the better the classifier (but returns diminish)
  • the larger the testing data, the more accurate the error estimate
Holdout procedure
• Method of splitting the original data into training and testing sets
• Reserve a certain amount for testing and use the remainder for training
  • usually one third for testing and the rest for training
• The samples might not be representative
  • e.g., a class might be missing in the testing data
• Stratification
  • ensures that each class is represented with approximately equal proportions in both subsets (see the sketch below)
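A minimal sketch of a stratified one-third holdout, assuming scikit-learn (not named in the slides) and synthetic data; stratify=y keeps the class proportions roughly equal in both subsets.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)

# One third reserved for testing, stratified so that every class appears
# in both subsets with approximately its overall proportion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)
```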
Repeated holdout procedure
• The holdout procedure can be made more reliable by repeating the process with different subsamples
  • in each iteration, a certain proportion is randomly selected for testing (possibly with stratification)
  • the error rates on the different iterations are averaged to yield an overall error rate
• This is called the repeated holdout procedure
• A problem is that the different testing sets overlap
Cross-validation
• Cross-validation avoids overlapping testing sets
  • split the data into n subsets of equal size
  • use each subset in turn for testing, the remainder for training
  • the error estimates are averaged to yield an overall error estimate
• Called n-fold cross-validation
• Often the subsets are stratified before the cross-validation is performed (see the sketch below)
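A minimal sketch of n-fold cross-validation, assuming scikit-learn and a decision tree chosen only for illustration; cross_val_score builds the classifier n times, each time testing on the held-out fold, and the fold scores are averaged.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# 10 stratified folds: each fold is used once for testing, the rest for training.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(scores.mean(), scores.std())   # overall estimate and its spread across folds
```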
More on cross-validation
• Stratified ten-fold cross-validation
• Why ten?
  • extensive experiments have shown that this is the best choice to get an accurate estimate
  • there is also some theoretical evidence for this
• Repeated stratified cross-validation
  • e.g., ten-fold cross-validation is repeated ten times and the results are averaged (reduces the variance), as sketched below
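A minimal sketch of repeated stratified ten-fold cross-validation, again assuming scikit-learn; averaging the 100 fold scores reduces the variance of the estimate.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Ten stratified folds, repeated ten times with different shuffles.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(scores.mean())   # average over all 100 train/test rounds
```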
Leave-One-Out cross-validation
• A particular form of cross-validation
  • set the number of folds to the number of training instances
• Advantages: makes the best use of the data; involves no random subsampling
• Disadvantage: very computationally expensive
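A minimal sketch of LOO-CV, assuming scikit-learn and a k-nearest-neighbour classifier chosen only for illustration: there is one fold per instance, so the classifier is rebuilt once for every sample.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, random_state=0)

# 100 folds: each instance is left out once and used as the test set;
# this is deterministic but expensive for large datasets.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=LeaveOneOut())
print(scores.mean())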
LOO-CV and stratification
• Stratification is not possible
  • there is only one instance in the testing set
• An extreme example (reproduced in the sketch below)
  • random dataset split equally into two classes
  • the best inducer predicts the majority class
  • 50% accuracy on fresh data
  • the LOO-CV estimate is 100% error
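A minimal sketch reproducing the extreme example, assuming scikit-learn's DummyClassifier as the majority-class predictor: on a perfectly balanced random dataset, the left-out instance always belongs to the class that is in the minority of the remaining data, so the prediction is wrong every time.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Random features, labels split exactly 50/50.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.array([0, 1] * 50)

clf = DummyClassifier(strategy="most_frequent")
scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print(scores.mean())   # 0.0 -> the LOO-CV estimate is 100% error
```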
Cost
Counting the cost
• In practice, different types of classification errors often incur different costs
• Examples
  • terrorist profiling, where predicting 'negative' for everyone achieves 99.99% accuracy
  • loan decisions
  • oil-slick detection
  • fault diagnosis
  • promotional mailing
Confusion matrix
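A minimal sketch of a two-class confusion matrix, assuming scikit-learn; passing labels=[1, 0] puts the positive class first, so the matrix reads TP, FN in the first row and FP, TN in the second.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1])

# Rows are actual classes, columns are predicted classes:
# [[TP, FN],
#  [FP, TN]]
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm)   # here: TP=3, FN=1, FP=1, TN=3
```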
Classification with costs
• Two cost matrices
• The error rate is replaced by the average cost per prediction
Cost-sensitive learning
• A basic idea is to predict the high-cost class only when very confident about the prediction
• Instead of predicting the most likely class, we should make the prediction that minimizes the expected cost
  • dot product of the class probabilities and the appropriate column of the cost matrix
  • choose the column (class) that minimizes the expected cost (see the sketch below)
• Not at training time
  • most learning schemes do not perform cost-sensitive learning
  • they generate the same classifier no matter what costs are assigned to the different classes
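A minimal sketch of minimum-expected-cost prediction at classification time, assuming a hypothetical 2x2 cost matrix (not given in the slides) and any scikit-learn classifier with predict_proba: each class's expected cost is the dot product of the class-probability vector with that class's column of the cost matrix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hypothetical cost matrix: cost[true_class, predicted_class].
# Missing a positive (true 1, predicted 0) is ten times worse than a false alarm.
cost = np.array([[0.0, 1.0],     # true class 0
                 [10.0, 0.0]])   # true class 1

clf = LogisticRegression().fit(X_train, y_train)
probs = clf.predict_proba(X_test)        # columns ordered as clf.classes_ = [0, 1]

# Expected cost of predicting class j = probabilities dotted with column j of the
# cost matrix; predict the class that minimises it.
expected_costs = probs @ cost            # shape (n_samples, n_classes)
cost_sensitive_pred = expected_costs.argmin(axis=1)

print(cost[y_test, cost_sensitive_pred].mean())   # average cost per prediction
```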
A simple method for cost-sensitive learning
Resampling of instances according to costs
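A minimal sketch of the resampling idea, with a hypothetical per-class cost vector (not given in the slides): instances of the costly class are drawn more often in a bootstrap sample, so an ordinary, cost-blind learner trained on the resampled data behaves cost-sensitively on average.

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=400, random_state=0)

# Hypothetical misclassification costs: mistakes on class 1 are ten times costlier.
class_cost = {0: 1.0, 1: 10.0}

rng = np.random.default_rng(0)
weights = np.array([class_cost[label] for label in y], dtype=float)
probabilities = weights / weights.sum()

# Bootstrap sample of the same size, biased towards the costly class.
idx = rng.choice(len(y), size=len(y), replace=True, p=probabilities)
X_resampled, y_resampled = X[idx], y[idx]
```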
Measures
Lift charts
• In practice, costs are rarely known
• Decisions are usually made by comparing possible scenarios
• E.g., promotional mail to 1,000,000 households
  • mail to all; 0.1% respond (1,000)
  • a data mining tool identifies a subset of the 100,000 most promising households; 0.4% of these respond (400)
  • another tool identifies a subset of the 400,000 most promising households; 0.2% respond (800)
  • which is better?
• A lift chart allows a visual comparison
Generating a lift chart
• Sort the instances according to the predicted probability of being positive
• x-axis is the sample size; y-axis is the number of true positives
• a minimal computation is sketched below
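A minimal sketch of the lift-chart computation, assuming NumPy arrays of true labels and predicted positive-class probabilities (the toy numbers are made up): sort by probability and accumulate the true positives, giving the values to plot against sample size.

```python
import numpy as np

def lift_curve(y_true, scores):
    """Return (sample_size, cumulative_true_positives) for a lift chart."""
    order = np.argsort(scores)[::-1]          # most promising instances first
    cumulative_tp = np.cumsum(y_true[order])  # true positives found so far
    sample_size = np.arange(1, len(y_true) + 1)
    return sample_size, cumulative_tp

# Toy example: 1 = responds to the mailing, scores = predicted probabilities.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])
print(lift_curve(y_true, scores))
```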
A hypothetical lift chart
ROC curves
• ROC curves are similar to lift charts
  • ROC stands for "receiver operating characteristic"
  • used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel
• Differences from lift charts
  • the y-axis shows the percentage of true positives rather than the absolute number
  • the x-axis shows the percentage of false positives rather than the sample size
A sample ROC curve
• jagged curve: one set of testing data
• smooth curve: use cross-validation
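A minimal sketch of an ROC curve computed on a single test set (the "jagged" case), assuming scikit-learn's roc_curve and a logistic regression chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)[:, 1]

# One test set gives a jagged curve; averaging over cross-validation folds smooths it.
fpr, tpr, thresholds = roc_curve(y_test, scores)   # x-axis: FPR, y-axis: TPR
print(roc_auc_score(y_test, scores))               # area under the ROC curve
```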
More measures
• Precision = TP / (TP + FP), the percentage of reported samples that are positive
• Recall = TP / (TP + FN), the percentage of positive samples that are reported
• Precision/recall curves have a hyperbolic shape
• Three-point average is the average precision at 20%, 50% and 80% recall
• F-measure = 2 × Precision × Recall / (Precision + Recall), the harmonic mean of precision and recall
  • the harmonic mean is dominated by the smaller of the two, so it rewards making precision and recall as equal as possible
• Specificity = TN / (TN + FP), the percentage of negative samples that are not reported
• Area under the ROC curve (AUC)
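A minimal sketch computing these measures directly from the four confusion-matrix counts, using the TP/FP/FN/TN values of the toy matrix above.

```python
def basic_measures(tp, fp, fn, tn):
    precision = tp / (tp + fp)              # reported samples that are truly positive
    recall = tp / (tp + fn)                 # positive samples that are reported
    specificity = tn / (tn + fp)            # negative samples that are not reported
    f_measure = 2 * precision * recall / (precision + recall)   # harmonic mean
    return precision, recall, specificity, f_measure

print(basic_measures(tp=3, fp=1, fn=1, tn=3))
```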
Summary of some measures
Evaluating numeric prediction
The same strategies apply, including independent testing sets, cross-validation, significance tests, etc.
Measures in numeric prediction
• Actual target values: a1, a2, ..., an
• Predicted target values: p1, p2, ..., pn
• The most popular measure is the mean squared error (MSE), ((p1 - a1)^2 + ... + (pn - an)^2) / n, because it is easy to manipulate mathematically
Other measures
• Root mean squared error (RMSE) = sqrt(MSE)
• Mean absolute error (MAE), (|p1 - a1| + ... + |pn - an|) / n, is less sensitive to outliers than MSE
• Sometimes relative error values are more appropriate
Improvement on the mean
• How much does the scheme improve on simply predicting the average?
• Relative squared error = ((p1 - a1)^2 + ... + (pn - an)^2) / ((ā - a1)^2 + ... + (ā - an)^2), where ā is the mean of the actual values
• Relative absolute error = (|p1 - a1| + ... + |pn - an|) / (|ā - a1| + ... + |ā - an|)
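A minimal sketch of these numeric-prediction measures, assuming NumPy arrays a (actual) and p (predicted) with made-up values.

```python
import numpy as np

a = np.array([3.0, -0.5, 2.0, 7.0])   # actual target values
p = np.array([2.5,  0.0, 2.0, 8.0])   # predicted target values

mse  = np.mean((p - a) ** 2)
rmse = np.sqrt(mse)
mae  = np.mean(np.abs(p - a))

# Relative errors compare against simply predicting the mean of the actual values.
rse = np.sum((p - a) ** 2) / np.sum((a.mean() - a) ** 2)
rae = np.sum(np.abs(p - a)) / np.sum(np.abs(a.mean() - a))

print(mse, rmse, mae, rse, rae)
```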
Correlation coefficient (相關係數)
• Measures the statistical correlation between the predicted values and the actual values
• Scale independent, between -1 and +1
• Good performance leads to large values
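A minimal sketch of the correlation coefficient between predictions and actual values, assuming NumPy and reusing the toy arrays above.

```python
import numpy as np

a = np.array([3.0, -0.5, 2.0, 7.0])   # actual
p = np.array([2.5,  0.0, 2.0, 8.0])   # predicted

# Pearson correlation: scale independent, in [-1, +1];
# values near +1 indicate good numeric predictions.
r = np.corrcoef(p, a)[0, 1]
print(r)
```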
http://upload.wikimedia.org/wikipedia/commons/8/86/Correlation_coefficient.gif
Which measure?
• Best to look at all of them
• Often it doesn't matter
• For the example schemes A–D: D is the best; C is the second-best; A and B are arguable
Today’s exercise
Parameter tuning
Design your own select, feature, buy and sell programs. Upload and test them in our simulation system. Finally, commit your best version and send TA Jang a report before 23:59 11/5 (Mon).
Possible ways
• Enlarge the parameter range in CV
• Stratified, repeated…
  • minimize the variance
• Make a tuning set
  • use a large training set; make the tuning set as similar to the target stocks as possible
• Cost matrix
  • resampling, otherwise it would be very difficult
• Change measures
  • or plot ROC curves to understand your classifiers
• The best measure is the transaction profit, but it requires the simulation system. Instead, you can develop a compromise evaluation script, which is more complicated than any theoretical measure but simpler than the real problem. This is usually required in practice.