Performance Evaluation in Computer Vision
Kyungnam Kim, Computer Vision Lab, University of Maryland, College Park
Contents
• Error Estimation in Pattern Recognition
  • Jain et al., “Statistical Pattern Recognition: A Review”, IEEE PAMI 2000 (Section 7, Error Estimation).
• Assessing and Comparing Algorithms
  • Adrian Clark and Christine Clark, “Performance Characterization in Computer Vision: A Tutorial”.
  • Receiver Operating Characteristic (ROC) curve
  • Detection Error Trade-off (DET) curve
  • Confusion Matrix
  • McNemar’s test
  • http://peipa.essex.ac.uk/benchmark/
Error Estimation in Pattern Recognition
• Reference: Jain et al., “Statistical Pattern Recognition: A Review”, IEEE PAMI 2000 (Section 7, Error Estimation).
• It is very difficult to obtain a closed-form expression for the error rate $P_e$.
• In practice, the error rate must be estimated from the available samples, which are split into training and test sets.
• Error estimate = percentage of misclassified test samples.
• A reliable error estimate requires (1) a large sample size and (2) independent training and test samples.
Error Estimation in Pattern Recognition
• The error estimate (a function of the specific training and test sets used) is a random variable.
• Given a classifier, let $t$ be the number of misclassified test samples out of $n$. Then $t$ has a binomial distribution.
• The maximum-likelihood estimate $\hat{P}_e$ of $P_e$ is $\hat{P}_e = t/n$, with $E(\hat{P}_e) = P_e$ and $\mathrm{Var}(\hat{P}_e) = P_e(1 - P_e)/n$.
• Since $\hat{P}_e$ is a random variable, a confidence interval can be attached to it; the interval shrinks as $n$ increases (a sketch follows below).
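A minimal sketch of these formulas in Python (the function name and the normal-approximation interval are my choices, not from the paper):

```python
import math

def error_estimate(t, n, z=1.96):
    """ML estimate of the error rate plus a normal-approximation
    confidence interval. t = misclassified test samples, n = test set
    size; z = 1.96 gives a roughly 95% interval."""
    p_hat = t / n                                  # P̂e = t/n
    se = math.sqrt(p_hat * (1 - p_hat) / n)        # sqrt(Pe(1-Pe)/n)
    return p_hat, (max(0.0, p_hat - z * se), min(1.0, p_hat + z * se))

# e.g. 13 errors out of 200 test samples; the interval narrows as n grows
p, (lo, hi) = error_estimate(13, 200)
```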
Error Estimation in Pattern Recognition
• Common error-estimation approaches: “leave all in” (resubstitution), holdout, and versions of cross-validation such as leave-one-out and k-fold (sketched below).
• Bootstrap: resampling based on the analogy that the sample is to the population as a resample is to the sample.
• Further reading:
  http://www.uvm.edu/~dhowell/StatPages/Resampling/Bootstrapping.html
  http://www.childrens-mercy.org/stats/ask/bootstrap.asp
  http://www.cnr.colostate.edu/class_info/fw663/bootstrap.pdf
  http://www.maths.unsw.edu.au/ForStudents/courses/math3811/lecture9.pdf
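A sketch of two of these resampling schemes, assuming a hypothetical `train_fn(samples, labels)` that returns a classifier callable (all names here are illustrative):

```python
import random

def kfold_error(samples, labels, train_fn, k=10, seed=0):
    """k-fold cross-validation: every sample is tested exactly once,
    by a classifier trained on the other k-1 folds."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        clf = train_fn([samples[i] for i in train], [labels[i] for i in train])
        errors += sum(clf(samples[i]) != labels[i] for i in fold)
    return errors / len(samples)

def bootstrap_errors(samples, labels, train_fn, B=100, seed=0):
    """Bootstrap: draw resamples with replacement (sample -> resample,
    mimicking population -> sample) and test on the left-out points."""
    rng = random.Random(seed)
    n = len(samples)
    rates = []
    for _ in range(B):
        boot = [rng.randrange(n) for _ in range(n)]
        test = set(range(n)) - set(boot)
        if not test:
            continue
        clf = train_fn([samples[i] for i in boot], [labels[i] for i in boot])
        rates.append(sum(clf(samples[i]) != labels[i] for i in test) / len(test))
    return rates
```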
Error Estimation in Pattern Recognition
• Receiver Operating Characteristic (ROC) curve: detailed later.
• ‘Reject rate’: reject doubtful patterns near the decision boundary (low confidence).
• A well-known reject option is to reject a pattern if its maximum a posteriori probability is below a threshold (see the sketch after this list).
• There is a trade-off between the ‘reject rate’ and the ‘error rate’.
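A minimal sketch of such a reject option (the function name and the 0.8 threshold are illustrative, not from the slides):

```python
def classify_with_reject(posteriors, threshold=0.8):
    """Reject option: assign the MAP class only if its posterior
    probability clears the threshold; otherwise reject the pattern.
    `posteriors` maps class labels to P(class | x)."""
    label, p = max(posteriors.items(), key=lambda kv: kv[1])
    return label if p >= threshold else None   # None = rejected

# Raising the threshold raises the reject rate but lowers the error
# rate on the patterns that are actually classified.
print(classify_with_reject({"face": 0.55, "non-face": 0.45}))  # -> None
```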
Assessing and Comparing Algorithms
• Reference: Adrian Clark and Christine Clark, “Performance Characterization in Computer Vision: A Tutorial”.
  http://peipa.essex.ac.uk/benchmark/tutorials/essex/tutorial.pdf
• Use the same training and test sets. Some standard sets: FERET, PETS.
• Simply seeing which algorithm has the better success rate is not enough; a standard statistical test, McNemar’s test, is required.
• Two types of testing:
  • Technology evaluation: the response of an underlying generic algorithm to factors such as adjustment of its tuning parameters, noisy input data, etc.
  • Application evaluation: how well an algorithm performs a particular task.
Assessing and Comparing Algorithms
• Receiver Operating Characteristic (ROC) curve: plots the true-positive rate against the false-positive rate as the decision threshold is varied (a sketch follows below).
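A sketch of how such a curve can be computed from raw detector scores, assuming higher scores mean “target” (the function name and score convention are mine):

```python
def roc_points(scores_pos, scores_neg):
    """Sweep a decision threshold over detector scores and return
    (false-positive rate, true-positive rate) pairs: the ROC curve.
    scores_pos / scores_neg: scores for genuine targets / non-targets."""
    thresholds = sorted(set(scores_pos) | set(scores_neg), reverse=True)
    pts = []
    for th in thresholds:
        tpr = sum(s >= th for s in scores_pos) / len(scores_pos)
        fpr = sum(s >= th for s in scores_neg) / len(scores_neg)
        pts.append((fpr, tpr))
    return pts
```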
Assessing and Comparing Algorithms
• Detection Error Trade-off (DET) curve:
  • logarithmic scales on both axes
  • curves are more spread out, making algorithms easier to distinguish
  • curves are close to linear
Assessing and Comparing Algorithms
• Detection Error Trade-off (DET) curve
• Different applications operate at different points on the curve, e.g. forensic applications (tracking down a suspect) versus high-security applications (ATMs).
• EER (equal error rate): the operating point where the two error rates are equal (a sketch of locating it follows below).
• Comparisons of algorithms tend to be performed with a specific set of tuning parameter values; running them with settings that correspond to the EER is probably the most sensible choice.
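A rough sketch of locating the EER from score distributions, again assuming higher scores mean “accept” (names are illustrative):

```python
def equal_error_rate(scores_pos, scores_neg):
    """Find the threshold where the false-positive rate and the miss
    (false-negative) rate are as close as possible; return their mean
    as an approximation of the EER, plus the threshold itself."""
    best = None
    for th in sorted(set(scores_pos) | set(scores_neg)):
        fpr = sum(s >= th for s in scores_neg) / len(scores_neg)
        fnr = sum(s < th for s in scores_pos) / len(scores_pos)
        if best is None or abs(fpr - fnr) < abs(best[0] - best[1]):
            best = (fpr, fnr, th)
    fpr, fnr, th = best
    return (fpr + fnr) / 2, th
```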
Assessing and Comparing Algorithms
• Crossing ROC curves: when two ROC curves cross, neither algorithm is uniformly better; which one wins depends on the operating point. This is why comparisons tend to be performed with a specific set of tuning parameter values (and running them with settings that correspond to the EER is probably the most sensible).
Assessing and Comparing Algorithms
• Confusion Matrices: rows correspond to the true classes and columns to the assigned classes; diagonal entries count correct classifications, and off-diagonal entries show which classes are confused with which (a sketch follows below).
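A minimal sketch of building such a matrix from paired label lists (the names are mine):

```python
from collections import Counter

def confusion_matrix(true_labels, predicted_labels):
    """Count (true class, assigned class) pairs. Diagonal entries are
    correct classifications; off-diagonal entries are confusions."""
    counts = Counter(zip(true_labels, predicted_labels))
    classes = sorted(set(true_labels) | set(predicted_labels))
    matrix = [[counts[(t, p)] for p in classes] for t in classes]
    return matrix, classes
```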
Assessing and Comparing Algorithms
• McNemar’s test (a form of chi-square test): an appropriate statistical test must take into account not only the numbers of false positives, etc., but also the number of tests performed.
  http://www.zephryus.demon.co.uk/geography/resources/fieldwork/stats/chi.html
  http://www.isixsigma.com/dictionary/Chi_Square_Test-67.htm
Assessing and Comparing Algorithms
• McNemar’s test: if the number of tests exceeds about 30, the central limit theorem applies and a normal approximation can be used (a sketch follows below).
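A sketch of the test itself, using the standard chi-square form with continuity correction (variable names are illustrative; 3.84 is the 5% critical value for one degree of freedom):

```python
def mcnemar(results_a, results_b):
    """McNemar's test on paired per-sample outcomes (True = correct).
    Only the discordant pairs matter: cases where exactly one of the
    two algorithms classified the sample correctly."""
    n_ab = sum(a and not b for a, b in zip(results_a, results_b))  # A right, B wrong
    n_ba = sum(b and not a for a, b in zip(results_a, results_b))  # B right, A wrong
    if n_ab + n_ba == 0:
        return 0.0
    # Chi-square statistic with continuity correction, 1 degree of freedom;
    # values above 3.84 reject "both algorithms perform equally" at 5%.
    return (abs(n_ab - n_ba) - 1) ** 2 / (n_ab + n_ba)

# Equivalently, Z = (|n_ab - n_ba| - 1) / sqrt(n_ab + n_ba), with |Z| > 1.96
# significant at the 5% level; the chi-square value above is Z squared.
```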