Classification by Machine Learning Approaches – Exercise Solution
Michael J. Kerner – kerner@cbs.dtu.dk
Center for Biological Sequence Analysis, Technical University of Denmark
Exercise Solution: donors_trainset.arff – all features, trees.J48

=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances        4972               94.5967 %
Incorrectly Classified Instances       284                5.4033 %
Kappa statistic                          0.8381

=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
 0.87      0.034     0.875       0.87     0.872      true
 0.966     0.13      0.965       0.966    0.966      false

=== Confusion Matrix ===
    a    b   <-- classified as
  971  145 |  a = true
  139 4001 |  b = false
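All of the summary statistics Weka reports can be recomputed from the confusion matrix alone. A minimal sketch in plain Python (no Weka required), checked against the J48 numbers above:

```python
def metrics(tp, fn, fp, tn):
    """Recompute Weka's summary statistics from a 2x2 confusion matrix.

    tp/fn: 'true' instances classified as a/b; fp/tn: 'false' instances
    classified as a/b (same layout as Weka's confusion matrix).
    """
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    # Cohen's kappa: observed agreement corrected for chance agreement
    p_chance = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
    kappa = (accuracy - p_chance) / (1 - p_chance)
    # Per-class figures for the 'true' class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, kappa, precision, recall, f_measure

# J48 confusion matrix from the slide: 971 145 / 139 4001
acc, kappa, prec, rec, f1 = metrics(971, 145, 139, 4001)
print(round(acc * 100, 4), round(kappa, 4))          # 94.5967 0.8381
print(round(prec, 3), round(rec, 3), round(f1, 3))   # 0.875 0.87 0.872
```

The recomputed values agree with Weka's summary to the printed precision.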
Exercise Solution: donors_trainset.arff – all features, bayes.NaiveBayes

=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances        4910               93.417  %
Incorrectly Classified Instances       346                6.583  %
Kappa statistic                          0.8056

=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
 0.862     0.046     0.834       0.862    0.848      true
 0.954     0.138     0.962       0.954    0.958      false

=== Confusion Matrix ===
    a    b   <-- classified as
  962  154 |  a = true
  192 3948 |  b = false
Exercise Solution: donors_trainset.arff – all features, functions.SMO

=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances        4986               94.863  %
Incorrectly Classified Instances       270                5.137  %
Kappa statistic                          0.8455

=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
 0.871     0.03      0.885       0.871    0.878      true
 0.97      0.129     0.965       0.97     0.967      false

=== Confusion Matrix ===
    a    b   <-- classified as
  972  144 |  a = true
  126 4014 |  b = false
Exercise Solution: feature encodings

donors_trainset.arff – binary feature encoding, four {0,1} attributes per sequence position:

@RELATION donors.train
@ATTRIBUTE -7_A {0,1}
@ATTRIBUTE -7_T {0,1}
@ATTRIBUTE -7_C {0,1}
[...]
@ATTRIBUTE 6_A {0,1}
@ATTRIBUTE 6_T {0,1}
@ATTRIBUTE 6_C {0,1}
@ATTRIBUTE 6_G {0,1}
@ATTRIBUTE class {true,false}
@DATA
0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,1,0,0,0,0,1,0,0,0,1,0,0,true
0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,true
[...]

donors_trainset_diffencod.arff – fewer features: one nominal attribute with four values per position:

@RELATION donors.train
@ATTRIBUTE -7 {A,C,G,T}
@ATTRIBUTE -6 {A,C,G,T}
@ATTRIBUTE -5 {A,C,G,T}
@ATTRIBUTE -4 {A,C,G,T}
[...]
@ATTRIBUTE +3 {A,C,G,T}
@ATTRIBUTE +4 {A,C,G,T}
@ATTRIBUTE +5 {A,C,G,T}
@ATTRIBUTE +6 {A,C,G,T}
@ATTRIBUTE splicesite {true,false}
@DATA
C,T,C,C,G,A,A,A,G,G,A,T,T,true
T,C,A,G,A,A,G,G,A,G,G,G,C,true
T,T,G,G,A,A,G,T,C,G,C,A,G,true
[...]
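The two files encode the same sequences, so converting the nominal form into the binary form is mechanical. A short sketch in plain Python; the bit order A, T, C, G follows the attribute order (-7_A, -7_T, -7_C, ...) shown above, and the result reproduces the first @DATA row of the binary-encoded file:

```python
BASES = "ATCG"  # order of the binary attributes: pos_A, pos_T, pos_C, pos_G

def to_binary(row):
    """Convert one nominal instance, e.g. ['C', 'T', ..., 'true'],
    into the equivalent one-hot 0/1 encoding (label kept as-is)."""
    *bases, label = row
    bits = []
    for b in bases:
        bits += ["1" if b == x else "0" for x in BASES]
    return bits + [label]

# First @DATA row of donors_trainset_diffencod.arff (from the slide)
nominal = "C,T,C,C,G,A,A,A,G,G,A,T,T,true".split(",")
print(",".join(to_binary(nominal)))
# prints the first @DATA row of donors_trainset.arff:
# 0,0,1,0,0,1,0,0,0,0,1,0,...,0,1,0,0,true
```

Both representations carry identical information; what changes is how the learners can exploit it (e.g. one split per base for J48 versus 52 binary splits).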
Exercise Solution: donors_trainset_diffencod.arff – all features, trees.J48

=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances        4948               94.14   %
Incorrectly Classified Instances       308                5.86   %
Kappa statistic                          0.8248

=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
 0.862     0.037     0.862       0.862    0.862      true
 0.963     0.138     0.963       0.963    0.963      false

=== Confusion Matrix ===
    a    b   <-- classified as
  962  154 |  a = true
  154 3986 |  b = false
Exercise Solution: donors_trainset_diffencod.arff – all features, bayes.NaiveBayes

=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances        4922               93.6454 %
Incorrectly Classified Instances       334                6.3546 %
Kappa statistic                          0.8078

=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
 0.834     0.036     0.862       0.834    0.848      true
 0.964     0.166     0.956       0.964    0.96       false

=== Confusion Matrix ===
    a    b   <-- classified as
  931  185 |  a = true
  149 3991 |  b = false
Exercise Solution: donors_trainset_diffencod.arff – all features, functions.SMO

=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances        4986               94.863  %
Incorrectly Classified Instances       270                5.137  %
Kappa statistic                          0.8456

=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
 0.872     0.031     0.885       0.872    0.878      true
 0.969     0.128     0.966       0.969    0.967      false

=== Confusion Matrix ===
    a    b   <-- classified as
  973  143 |  a = true
  127 4013 |  b = false
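Collecting the six cross-validation results makes the comparison across classifiers and encodings explicit. A small sketch in plain Python, with the numbers copied from the slides above:

```python
# (encoding, classifier) -> (accuracy %, kappa), copied from the slides
results = {
    ("binary",  "J48"):        (94.5967, 0.8381),
    ("binary",  "NaiveBayes"): (93.4170, 0.8056),
    ("binary",  "SMO"):        (94.8630, 0.8455),
    ("nominal", "J48"):        (94.1400, 0.8248),
    ("nominal", "NaiveBayes"): (93.6454, 0.8078),
    ("nominal", "SMO"):        (94.8630, 0.8456),
}

# Rank by kappa, which corrects for the class imbalance (4140 false vs 1116 true)
best = max(results, key=lambda k: results[k][1])
print(best)  # ('nominal', 'SMO')
```

SMO edges out the other two on both encodings and is essentially unaffected by the change of representation, while J48 and NaiveBayes shift slightly (in opposite directions).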
Exercise Solution: Feature Selection

CfsSubsetEval, BestFirst:
  Selected features: -2_A, -1_G, 1_A, 2_A, 3_G
  Correlation coefficients:
    J48:                  0.7981
    NaiveBayes:           0.7762
    SMO:                  0.7388
    MultilayerPerceptron: 0.8053

ClassifierSubsetEval (with NaiveBayes), BestFirst:
  Selected features: -7_A, -7_C, -6_G, -4_A, -1_G, 1_A, 1_T, 1_C, 2_A, 3_G, 4_T, 5_A
  Correlation coefficients:
    J48:                  0.7935
    NaiveBayes:           0.8033
    SMO:                  0.7597
    MultilayerPerceptron: 0.7765
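Both searches above use BestFirst, which at its core is a greedy forward search (Weka's version adds limited backtracking). A conceptual sketch in plain Python; the merit function and its relevance/redundancy values are invented for illustration and are not the merit CfsSubsetEval actually computes:

```python
def forward_select(features, score, max_no_improve=2):
    """Greedy forward search: repeatedly add the feature that most
    improves the subset score; stop after repeated non-improvements.
    Conceptual sketch of BestFirst-style search, not Weka's implementation."""
    selected, best_score, stale = [], float("-inf"), 0
    while stale < max_no_improve and len(selected) < len(features):
        candidate = max((f for f in features if f not in selected),
                        key=lambda f: score(selected + [f]))
        new_score = score(selected + [candidate])
        if new_score > best_score:
            selected.append(candidate)
            best_score, stale = new_score, 0
        else:
            stale += 1
    return selected

# Toy merit: hypothetical per-feature relevance minus a small penalty per
# extra feature (mimicking CFS's relevance-vs-redundancy trade-off).
relevance = {"-2_A": 0.30, "-1_G": 0.25, "1_A": 0.20, "5_C": 0.02}
score = lambda s: sum(relevance[f] for f in s) - 0.04 * (len(s) - 1)
print(forward_select(list(relevance), score))  # ['-2_A', '-1_G', '1_A']
```

With this toy merit the search keeps the three informative features and stops before adding the near-irrelevant one, which is the same qualitative behaviour that produced the small subsets above.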
Summary
• Generally, there is no single 'best' method for all problems.
• Feature representation can influence classification results.
• Feature selection often improves classification performance, but not always.
• Feature selection significantly speeds up classification, which also makes computationally very demanding classifiers feasible.

Always test multiple methods!