Learn about various performance metrics used in computational intelligence, including percent correct, average sum-squared error, and normalized error. Explore the selection of "gold standards," specification of training sets, and the role of decision threshold levels. Discover the effectiveness measures of evolutionary algorithms and the Mann-Whitney U Test.
Chapter 10 Performance Metrics
Introduction • Sometimes, measuring how well a system is performing is relatively straightforward: we calculate a “percent correct” • Other performance metrics are not often discussed in the literature • Issues such as selection of data are reviewed • Choices and uses of specific performance metrics are discussed
Issues to be Addressed • Selection of “gold standards” • Specification of training sets (sizes, # iterations, etc.) • Selection of test sets • Role of decision threshold levels
Computational Intelligence Performance Metrics • Percent correct • Average sum-squared error • Normalized error • Evolutionary algorithm effectiveness measures • Mann-Whitney U Test • Receiver operating characteristic curves • Recall, precision, sensitivity, specificity, etc. • Confusion matrices, cost matrices • Chi-square test
Selecting the “Gold Standard” Issues: 1. Selecting the classification * do experts agree? * involve end users 2. Selecting representative pattern sets * agreed to by experts * distributed over classes appropriately (some near decision hypersurfaces) 3. Selecting person or process to designate gold standard * involve end users * the customer is always right
Specifying Training Sets • Use different sets of patterns for training and testing • Rotate patterns through training and testing, if possible • Use “leave-n-out” method • Select pattern distribution appropriate for paradigm • > equal number in each class for back-propagation • > according to probability distribution for SOFM
Percent Correct • Most commonly used metric • Can be misleading: • > A system that correctly predicts 90% of the stocks that will exceed the Dow average and 60% of those that won't, with half the patterns in each category, scores 0.5(90) + 0.5(60) = 75% correct • > A system that correctly predicts only 85% of those that exceed and 55% of those that won't, with 70% of the patterns in the first category and 30% in the second, scores 0.7(85) + 0.3(55) = 76% correct, even though it is worse on each category individually
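A minimal sketch of the arithmetic behind the two cases above, using the per-class accuracies and class proportions from the slide (the function name is just illustrative):

```python
def percent_correct(per_class_accuracy, class_proportions):
    """Overall percent correct as a weighted average of per-class accuracies."""
    return sum(a * p for a, p in zip(per_class_accuracy, class_proportions))

# System 1: 90% / 60% per class, classes split 50/50
print(percent_correct([0.90, 0.60], [0.5, 0.5]))   # 0.75
# System 2: 85% / 55% per class, classes split 70/30
print(percent_correct([0.85, 0.55], [0.7, 0.3]))   # 0.76 -- higher overall, yet worse on each class
```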
Calculating Average Sum-Squared Error Total error, summed over all P patterns and all K output PEs: E_TOTAL = 0.5 × Σ_p Σ_k (t_pk − o_pk)², where t_pk is the target and o_pk the actual output Average error: E_AVG = E_TOTAL / P (or E_TOTAL / (P × K)) Note: • Inclusion of the 0.5 factor is not universal • The number of output PEs is not always taken into account • Dividing by the number of output PEs is desirable to get results that can be compared across architectures • This metric is also used with other CI paradigms • State your method when publishing results
Selection of Threshold Values • The threshold is often set to 0.5, usually without scientific basis • Another value may give better performance • Example: 10 patterns, threshold = 0.5, 5 should be on and 5 off (1 output PE) • If the on outputs are always 0.6 and the off outputs always 0.4, avg. SSE = 0.16 and the system is 100% correct • If the on outputs are always 0.9, and the off outputs are 0.1 for 3 patterns and 0.6 for 2 patterns, avg. SSE = 0.08 but the system is only 80% correct: (1 − 0.9)² × 5 = 0.05 for the on patterns, (0 − 0.1)² × 3 = 0.03 and (0 − 0.6)² × 2 = 0.72 for the off patterns, giving 0.80/10 = 0.08 avg. SSE • Perhaps calculate SSE only for errors, and use the threshold as the desired value
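A minimal sketch reproducing the two cases above: it computes average sum-squared error (without the 0.5 factor, matching the slide's arithmetic) and percent correct at a 0.5 threshold; the targets and outputs are those of the example:

```python
def avg_sse(targets, outputs):
    """Average sum-squared error over all patterns (one output PE, no 0.5 factor)."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)

def percent_correct_at_threshold(targets, outputs, threshold=0.5):
    """Fraction of patterns classified correctly at the given decision threshold."""
    return sum((o >= threshold) == (t == 1) for t, o in zip(targets, outputs)) / len(targets)

targets = [1] * 5 + [0] * 5
case1 = [0.6] * 5 + [0.4] * 5                 # avg. SSE 0.16, 100% correct
case2 = [0.9] * 5 + [0.1] * 3 + [0.6] * 2     # avg. SSE 0.08,  80% correct
for outputs in (case1, case2):
    print(avg_sse(targets, outputs), percent_correct_at_threshold(targets, outputs))
```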
Absolute Error • More intuitive than sum-squared error • Mean absolute error: E_MAE = (1/(P × K)) Σ_p Σ_k |t_pk − o_pk| • The above formulation is for a neural net; the metric is also useful for other paradigms using optimization, such as fuzzy cognitive maps • Maximum absolute error: E_MAX = max over all p, k of |t_pk − o_pk|
Removing Target Variance Effects • Variance: average of squared deviations from the mean • Standard SSE is corrupted by target variance: a pattern set whose targets vary widely yields a larger SSE for the same relative performance • Pineda developed the normalized error E_NORM using E_MEAN, the error obtained if every output were simply the mean target value, which is constant for a given pattern set
Calculating Normalized Error • First, calculate the total error E_TOTAL (previous slide) and the mean error E_MEAN • Now, normalized error E_NORM = E_TOTAL / E_MEAN • Watch out for output PEs that don't change value (mean error = 0); perhaps add 0.000001 to the mean error to be safe • This metric reflects error relative to the variance of the targets, rather than error that merely mirrors the scale of the pattern set or the network architecture
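A short sketch of this calculation, assuming E_MEAN is the sum-squared deviation of the targets from their own mean, i.e. the error a constant mean-value predictor would make (consistent with it being constant for a given pattern set); the function name and data are illustrative:

```python
def normalized_error(targets, outputs, eps=1e-6):
    """Pineda-style normalized error: total SSE divided by the SSE of a
    mean-value predictor (constant for a given pattern set)."""
    mean_target = sum(targets) / len(targets)
    e_total = sum((t - o) ** 2 for t, o in zip(targets, outputs))
    e_mean = sum((t - mean_target) ** 2 for t in targets) + eps  # eps guards against zero variance
    return e_total / e_mean

print(normalized_error([1, 0, 1, 0], [0.9, 0.2, 0.8, 0.1]))  # well below 1.0: better than predicting the mean
```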
Evolutionary Algorithm Effectiveness Metrics (DeJong 1975) • Offline performance (a measure of convergence): the average over generations of the best fitness, x*(s) = (1/G) Σ_{g=1..G} f*(s, g) • Online performance: the average of all fitness evaluations made through generation G • where G is the latest generation and f*(s, g) is the best fitness for system s in generation g
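The two measures can be sketched as follows for a minimization problem, assuming a run is recorded as a list of per-generation fitness lists (the data and names are illustrative, not from the text):

```python
def offline_performance(fitness_history):
    """Average over generations of the best (lowest) fitness in each generation;
    some formulations use the best-so-far value instead."""
    return sum(min(gen) for gen in fitness_history) / len(fitness_history)

def online_performance(fitness_history):
    """Average of every fitness evaluation performed over all generations."""
    all_evals = [f for gen in fitness_history for f in gen]
    return sum(all_evals) / len(all_evals)

# Illustrative run: fitness of each individual, per generation (minimization)
history = [[0.9, 0.7, 0.8], [0.6, 0.75, 0.5], [0.4, 0.55, 0.45]]
print(offline_performance(history), online_performance(history))
```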
Mann-Whitney U Test • Also known as Mann-Whitney-Wilcoxon test or Wilcoxon rank sum test • Used for analyzing performance of evolutionary algorithms • Evaluates similarity of medians of two data samples • Uses ordinal data (continuous; can tell which of two is greater)
Two Samples: n1 and n2 • The two sample sizes need not be equal • Can often get significant results (.05 level or better) with fewer than 10 samples in each group • We focus on calculating U for relatively small sample sizes
Analyze Best Fitness of Runs • Assume a minimization problem, best possible value = 0.0 • Two configurations, A and B (the baseline), differing in some setting such as mutation rate • We obtain the following best fitness values from 5 runs of A and 4 runs of B: A: .079, .062, .073, .047, .085 (n1 = 5) B: .102, .069, .055, .049 (n2 = 4) • There are two ways to calculate U: • Quick and direct counting • Formula (used by PC statistical packages)
Quick and Direct Method Arrange the measurements in fitness order (best first): .047 (A), .049 (B), .055 (B), .062 (A), .069 (B), .073 (A), .079 (A), .085 (A), .102 (B) Count the number of As that are better than each B: U1 = 1 + 1 + 2 + 5 = 9 Count the number of Bs that are better than each A: U2 = 0 + 2 + 3 + 3 + 3 = 11 Now, U = min[U1, U2] = 9 Note: U2 = n1n2 – U1
Formula Method Calculate R, the sum of ranks, for each sample (rank 1 = best): R1 = 1 + 4 + 6 + 7 + 8 = 26 R2 = 2 + 3 + 5 + 9 = 19 Now, U is the smaller of: U1 = n1n2 + n1(n1 + 1)/2 – R1 = 20 + 15 – 26 = 9 U2 = n1n2 + n2(n2 + 1)/2 – R2 = 20 + 10 – 19 = 11
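A small sketch that reproduces the U calculation for the A and B values above (minimization, so a smaller value is better); it assumes no tied values, as in the example. scipy.stats.mannwhitneyu offers the same test with p-values if a statistical package is preferred.

```python
def mann_whitney_u(a, b):
    """U statistic for two small samples without ties (smaller value = better fitness)."""
    u1 = sum(1 for x in a for y in b if x < y)   # number of (A, B) pairs where A beats B
    u2 = len(a) * len(b) - u1                    # U2 = n1*n2 - U1
    return min(u1, u2)

a = [0.079, 0.062, 0.073, 0.047, 0.085]
b = [0.102, 0.069, 0.055, 0.049]
print(mann_whitney_u(a, b))  # 9 -- compare against the critical value from a U table
```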
Is the Null Hypothesis Rejected? If U is less than or equal to the critical value in the table (here, 2 for n1 = 5 and n2 = 4), the null hypothesis is rejected at the .05 level. Since 9 > 2, it is NOT rejected. Thus, we cannot say one configuration results in significantly better fitness than the other.
Now Test Configuration C We obtain a new set of best fitness values (specific values omitted, since only the ranks matter): the number of Bs better than each C is 0 + 0 + 0 + 0 + 2 = 2 = U Now the null hypothesis is rejected at the .05 level: C is statistically better than B Note: This test can be used for other systems using a variety of fitness measures, such as percent correct.
Receiver Operating Characteristic Curves * Originated in the 1940s in communications systems and psychology * Now being used in diagnostic systems and expert systems * ROC curves are not sensitive to the probability distribution of training or test set patterns, or to decision bias * Good for one class at a time (one output PE) * The most common summary metric is the area under the curve
Contingency Matrix

                              System Diagnosis
                              Positive    Negative
Gold Standard    Positive     TP          FN
Diagnosis        Negative     FP          TN

Recall is TP/(TP + FN) Precision is TP/(TP + FP)
Contingency Matrix * Reflects the four possibilities for one output PE or output class * The ROC curve makes use of two ratios of these numbers: True positive ratio = TP/(TP + FN) = sensitivity False positive ratio = FP/(FP + TN) = 1 – specificity * The major diagonal of the ROC plot represents the situation where no discrimination exists Note: Specificity = TN/(FP + TN)
Plotting the ROC Curve • Plot the true positive ratio against the false positive ratio for various threshold values, or • Plot for various output values of the PE • Probably need about 10 values to get reasonable resolution • Calculate the area under the curve using the trapezoidal rule Note: Calculate each value in the contingency matrix for each threshold or output value.
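A brief sketch of that procedure: compute the contingency counts at each threshold, convert them to the two ratios, and apply the trapezoidal rule. The data and threshold grid are illustrative:

```python
def roc_points(targets, outputs, thresholds):
    """(false positive ratio, true positive ratio) for each decision threshold."""
    points = []
    for th in sorted(thresholds, reverse=True):
        tp = sum(1 for t, o in zip(targets, outputs) if t == 1 and o >= th)
        fn = sum(1 for t, o in zip(targets, outputs) if t == 1 and o < th)
        fp = sum(1 for t, o in zip(targets, outputs) if t == 0 and o >= th)
        tn = sum(1 for t, o in zip(targets, outputs) if t == 0 and o < th)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

def auc_trapezoidal(points):
    """Area under the ROC curve by the trapezoidal rule (points ordered by FPR)."""
    pts = [(0.0, 0.0)] + sorted(points) + [(1.0, 1.0)]
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

targets = [1, 1, 1, 0, 0, 0]
outputs = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
pts = roc_points(targets, outputs, [i / 10 for i in range(1, 10)])
print(auc_trapezoidal(pts))
```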
ROC Curve Interpretation Along the dotted line, no discrimination exists. The system can achieve this performance solely by chance. A perfect system has a true positive ratio of one, and a false positive ratio of zero, for some threshold.
ROC Cautions * Two ROC curves with same area can intersect - one is better on false positives, the other on false negatives * Use a sufficient number of cases * Might want to investigate behavior near other output PE values
Recall and Precision Recall: The number of positive diagnoses correctly made by the system divided by the total number of positive diagnoses made by the gold standard (true positive ratio) Precision: The number of positive diagnoses correctly made by the system divided by the total number of positive diagnoses made by the system
Sensitivity and Specificity Sensitivity = TP/(TP + FN) - Likelihood an event is detected given that it is present Specificity = TN/(TN + FP) - Likelihood the absence of an event is detected given that it is absent Pos. predictive value = TP/(TP + FP) - Likelihood that detection of an event is associated with the event False alarm rate = FP/(FP + TN) - Likelihood of a false signal detection given the absence of the event
Criteria for Correctness • If outputs are mutually exclusive, “winning” PE is PE with largest activation value • If outputs are not mutually exclusive, then a fixed threshold criterion (e.g. 0.5) can be used.
Confusion Matrices • Useful when system has multiple output classes • For n classes, n by n matrix constructed • Rows reflect the “gold standard” • Columns reflect system classifications • Entry represents a count (a frequency of occurrence) • Diagonal values represent correct instances • Off-diagonal values represent row misclassified as column
Using Confusion Matrices • Initially work row-by-row to calculate “class confusion” values by dividing each entry by the total count in its row (each row then sums to 1) • Now we have a “class confusion” matrix • Calculate “average percent correct” by summing the diagonal values and dividing by the number of classes n. (This isn't the true percent correct unless all classes have the same prior probability.)
To Calculate Cost Matrix • Must know the prior probabilities of each class • Multiply each element by the prior probability for its class (row) • Now each value in the matrix is a probability of occurrence; all values sum to one. This is the final confusion matrix. • Multiply each element in the matrix by its cost (diagonal costs are often zero, but not always). This is the cost matrix. • Cost ratios can be used; subjective measures cannot • Use the cost matrix to fine-tune the system (threshold values, membership functions, etc.)
Minimizing Cost An evolutionary algorithm can be used to find a lowest-cost system; the total cost computed from the cost matrix then serves as the fitness function. Sometimes it is sufficient simply to minimize the sum of the off-diagonal entries.
Example: Medical Diagnostic System Three possible diagnoses: A, B, and C; 50 cases of each diagnosis for training and also for testing. Prior probabilities are .60, .30, and .10, respectively. Test results:

                                CI System Diagnoses
                                A      B      C
Gold Standard      A           40      8      2
Diagnoses          B            6     42      2
                   C            1      1     48
Class confusion matrix:

                                CI System Diagnoses
                                A      B      C
Gold Standard      A          0.80   0.16   0.04
Diagnoses          B          0.12   0.84   0.04
                   C          0.02   0.02   0.96

Final confusion matrix (includes prior probabilities):

                                CI System Diagnoses
                                A      B      C
Gold Standard      A          .480   .096   .024
Diagnoses          B          .036   .252   .012
                   C          .002   .002   .096
The sum of the main diagonal gives a system accuracy of 82.8 percent. Costs of correct diagnoses: $10, $100, and $5,000. Misdiagnoses: A as B: $100 + $10 A as C: $5,000 + $10 B as A: $10 + $100 B as C: $5,000 + $100 (plus an angry patient) C as A or B: $80,000 + ($10 or $100 for A or B) (plus lawsuits) Final cost matrix for this system configuration (each probability multiplied by its cost):

                                CI System Diagnoses
                                A        B        C
Gold Standard      A          4.80    10.56   120.24
Diagnoses          B          3.96    25.20    61.20
                   C        160.02   160.20   480.00

The average application of the system thus costs $1,026.18 (the sum of all entries).
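A compact sketch that reproduces this worked example end to end (raw counts → class confusion matrix → final confusion matrix with priors → cost matrix → average cost); the counts, priors, and dollar costs are those given above:

```python
counts = [[40, 8, 2], [6, 42, 2], [1, 1, 48]]            # rows: gold standard A, B, C
priors = [0.60, 0.30, 0.10]                              # prior probability of each class
costs  = [[10, 110, 5010],                               # dollar cost of each (gold, system) outcome
          [110, 100, 5100],
          [80010, 80100, 5000]]

class_confusion = [[c / sum(row) for c in row] for row in counts]
final_confusion = [[p * c for c in row] for p, row in zip(priors, class_confusion)]
cost_matrix     = [[prob * cost for prob, cost in zip(prow, crow)]
                   for prow, crow in zip(final_confusion, costs)]

accuracy     = sum(final_confusion[i][i] for i in range(3))   # 0.828
average_cost = sum(sum(row) for row in cost_matrix)           # 1026.18
print(accuracy, average_cost)
```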
Chi-Square Test What you can use if you don't know what the results are supposed to be * Can be useful in modeling, simulation, or pattern generation, such as music composition * The chi-square test examines how often each category (class) occurs versus how often it is expected to occur: chi-square = Σ (O_i − E_i)² / E_i, summed over the n categories, where E_i is the expected frequency and O_i the observed frequency * The test is an approximation that works best when expected frequencies are not too small * Threshold values play only an indirect role.
Chi-square test case • Four PEs (four output categories), so there are 3 degrees of freedom. • In a test case of 50 patterns, the expected frequency distribution is: 5, 10, 15, 20 • Test 1 results: 4, 10, 16, 20. Chi-square = 0.267 • Test 2 results: 2, 15, 9, 26. Chi-square = 8.50
Chi-Square Example Results For the first test set, the hypothesis of no difference between expected and observed frequencies (the null hypothesis) is sustained at the .95 level: differences at least this large would arise by chance alone more than 95% of the time if the null hypothesis were true For the second test set, the null hypothesis is rejected at the .05 level: differences at least this large would arise by chance alone less than 5% of the time if the null hypothesis were true
Using Excel™ to Calculate Chi-square To find probability that null hypothesis is sustained or rejected, use =CHIDIST(X2, df), where X2 is the chi-square value and df is the degrees of freedom. Example: CHIDIST(0.267,3) yields an answer of 0.966, so the null hypothesis is sustained at the .966 level. Generate chi-square values (as in a table) with =CHIINV(p, df) Example: CHIINV(.95, 3) yields 0.352, which is the value in a table.
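Outside Excel, the same values can be obtained with SciPy's chi-square distribution functions (a sketch; assumes SciPy is available):

```python
from scipy.stats import chi2, chisquare

# Equivalent of CHIDIST(0.267, 3): right-tail probability of the chi-square distribution
print(chi2.sf(0.267, df=3))    # ~0.966

# Equivalent of CHIINV(0.95, 3): chi-square value whose right-tail probability is 0.95
print(chi2.isf(0.95, df=3))    # ~0.352

# Statistic and p-value directly from observed and expected frequencies (Test 1 above)
print(chisquare([4, 10, 16, 20], f_exp=[5, 10, 15, 20]))   # statistic ~0.267, p ~0.966
```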
Chi-Square Summary • Chi-square measures whole system performance at once • Watch out for: • Output combinations not expected (frequencies of 0) • Systems with large number of degrees of freedom • (most tables limited to 30-40 or so) • You are, of course, looking for small chi-square values!