8. Evaluation Methods

• Errors and Error Rates
• Precision and Recall
• Similarity
• Cross Validation
• Various Presentations of Evaluation Results
• Statistical Tests
Data Mining – Evaluation H. Liu (ASU) & G Dong (WSU)

How to Evaluate/Estimate Error
• Resubstitution
  • one data set is used for both training and testing
• Holdout (separate training and testing sets)
  • 2/3 for training, 1/3 for testing
• Leave-one-out
  • used when the data set is small
• Cross validation
  • 10-fold (why 10?)
  • m×10-fold CV: 10-fold CV repeated m times
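The three estimation schemes above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the 1-nearest-neighbour classifier, the helper names (`nn_predict`, `resubstitution`, `holdout`, `leave_one_out`) and the toy data set are hypothetical choices made for the example, not part of the original slides.

```python
import random

def nn_predict(train, x):
    """Hypothetical classifier: 1-nearest neighbour with squared Euclidean distance."""
    nearest = min(train, key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x)))
    return nearest[1]

def error_rate(train, test):
    """Fraction of test instances misclassified by a model built from `train`."""
    return sum(1 for x, y in test if nn_predict(train, x) != y) / len(test)

def resubstitution(data):
    # The same data set is used for both training and testing (optimistic estimate).
    return error_rate(data, data)

def holdout(data, train_frac=2 / 3, seed=0):
    # Shuffle, then use about 2/3 for training and the remaining 1/3 for testing.
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return error_rate(shuffled[:cut], shuffled[cut:])

def leave_one_out(data):
    # Each instance is tested once, by a model trained on all the other instances.
    errors = 0
    for i, (x, y) in enumerate(data):
        if nn_predict(data[:i] + data[i + 1:], x) != y:
            errors += 1
    return errors / len(data)

# Toy data: (features, label); the "b" point at (0.15, 0.15) is deliberately noisy.
data = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"), ((0.15, 0.15), "b"),
        ((1.0, 1.0), "b"), ((0.9, 1.1), "b"), ((1.1, 0.9), "b")]
print(resubstitution(data), holdout(data), leave_one_out(data))
```

On this toy data, resubstitution reports zero error because every training point is its own nearest neighbour, while leave-one-out exposes the noisy region; this is one way to see why resubstitution is considered optimistic.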

Error and Error Rate
• Mean and median
  • mean $= \frac{1}{n}\sum_{i=1}^{n} x_i$
  • weighted mean $= \frac{\sum_i w_i x_i}{\sum_i w_i}$
  • median (of the sorted values $x_{(1)} \le \dots \le x_{(n)}$) $= x_{((n+1)/2)}$ if $n$ is odd, else $\left(x_{(n/2)} + x_{(n/2+1)}\right)/2$
• Error: disagreement between the true value $y$ and the predicted value $y'$
  • 1 if they disagree, 0 otherwise (0-1 loss, $l_{01}$)
  • other definitions depend on the output of a predictor, e.g. quadratic loss $l_2 = (y - y')^2$ and absolute loss $l_1 = |y - y'|$
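A direct Python transcription of these summary statistics and loss functions may help check the index arithmetic of the median. The function names are illustrative assumptions; the 1-based median positions from the formula are shifted to Python's 0-based indexing.

```python
def mean(xs):
    return sum(xs) / len(xs)

def weighted_mean(xs, ws):
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

def median(xs):
    s = sorted(xs)
    n = len(s)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]           # x_((n+1)/2), 1-based -> 0-based
    return (s[n // 2 - 1] + s[n // 2]) / 2   # average of x_(n/2) and x_(n/2+1)

def zero_one_loss(y, y_pred):
    return 0 if y == y_pred else 1

def quadratic_loss(y, y_pred):
    return (y - y_pred) ** 2

def absolute_loss(y, y_pred):
    return abs(y - y_pred)

print(mean([1, 2, 3, 4]), weighted_mean([1, 3], [1, 3]), median([4, 1, 3, 2]))
```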

Error Estimation
• Error rate $e = \#\text{Errors}/N$, where $N$ is the total number of instances
• Accuracy $A = 1 - e$
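For a list of true labels and a matching list of predictions, the two formulas can be sketched as follows (function names and the example labels are hypothetical).

```python
def error_rate(y_true, y_pred):
    """e = #Errors / N for paired true and predicted labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")
    errors = sum(1 for y, p in zip(y_true, y_pred) if y != p)
    return errors / len(y_true)

def accuracy(y_true, y_pred):
    """A = 1 - e."""
    return 1 - error_rate(y_true, y_pred)

y_true = ["a", "a", "b", "b", "b"]
y_pred = ["a", "b", "b", "b", "a"]
print(error_rate(y_true, y_pred), accuracy(y_true, y_pred))
```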

Precision and Recall
• False negatives and false positives
  • number of error types for k classes $= k^2 - k$ (the off-diagonal cells of the k×k confusion matrix)
  • e.g. k = 3: 3×3 − 3 = 6; k = 2: 2×2 − 2 = 2
• Precision (w.r.t. the retrieved instances)
  • $P = TP/(TP+FP)$
• Recall (w.r.t. all relevant instances)
  • $R = TP/(TP+FN)$
• Precision × Recall (PR) and PR gain
  • PR gain $= (PR' - PR_0)/PR_0$, where $PR_0$ is the baseline value and $PR'$ the new value
• Accuracy
  • $A = (TP+TN)/(TP+TN+FP+FN)$
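The counts and ratios above can be computed from two label lists as sketched below. The helper names are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention, not something the slide specifies.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, FN, TN for the class treated as `positive`."""
    tp = fp = fn = tn = 0
    for y, p in zip(y_true, y_pred):
        if p == positive and y == positive:
            tp += 1
        elif p == positive:      # predicted positive, actually negative
            fp += 1
        elif y == positive:      # predicted negative, actually positive
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision(tp, fp):
    # Assumed convention: 0.0 when nothing is retrieved.
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + tn + fp + fn)

def pr_gain(pr_new, pr_base):
    """Relative gain of precision x recall over a baseline value pr_base."""
    return (pr_new - pr_base) / pr_base

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 0, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive=1)
p, r = precision(tp, fp), recall(tp, fn)
print(tp, fp, fn, tn, p, r, p * r, accuracy(tp, fp, fn, tn))
```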

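Similarity or Dissimilarity Measures
• Distance (dissimilarity) measures, which satisfy the triangle inequality
  • Euclidean: $d(p_i, p_j) = \sqrt{\sum_k (p_{ik} - p_{jk})^2}$
  • City-block, or Manhattan: $d(p_i, p_j) = \sum_k |p_{ik} - p_{jk}|$
• Cosine similarity: $\cos(p_i, p_j) = \frac{\sum_k p_{ik}\, p_{jk}}{\sqrt{\sum_k p_{ik}^2}\,\sqrt{\sum_k p_{jk}^2}}$
• Inter-cluster and intra-cluster distances (clusters $C_i$, $C_j$ with $n_i$, $n_j$ points and means $m_i$, $m_j$)
  • Single linkage vs. complete linkage
    • $D_{\min} = \min |p_i - p_j|$ over pairs of data points $p_i \in C_i$, $p_j \in C_j$
    • $D_{\max} = \max |p_i - p_j|$
  • Centroid methods
    • $D_{avg} = \frac{1}{n_i n_j} \sum_{p_i \in C_i} \sum_{p_j \in C_j} |p_i - p_j|$
    • $D_{mean} = |m_i - m_j|$, the distance between the two cluster means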
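The point-wise measures and the four inter-cluster distances can be sketched in plain Python. Points are assumed to be equal-length numeric tuples, clusters are assumed to be lists of such points, and Euclidean distance is assumed as the default $|p_i - p_j|$; all function names are illustrative.

```python
import math
from itertools import product

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def cosine(p, q):
    """Cosine similarity (not a distance): dot product over the product of norms."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q)))

def centroid(cluster):
    n = len(cluster)
    return [sum(coords) / n for coords in zip(*cluster)]

def d_min(ci, cj, dist=euclidean):   # single linkage: closest pair
    return min(dist(p, q) for p, q in product(ci, cj))

def d_max(ci, cj, dist=euclidean):   # complete linkage: farthest pair
    return max(dist(p, q) for p, q in product(ci, cj))

def d_avg(ci, cj, dist=euclidean):   # average over all n_i * n_j pairs
    return sum(dist(p, q) for p, q in product(ci, cj)) / (len(ci) * len(cj))

def d_mean(ci, cj, dist=euclidean):  # distance between the two cluster means
    return dist(centroid(ci), centroid(cj))

ci = [(0, 0), (0, 2)]
cj = [(4, 0), (4, 2)]
print(d_min(ci, cj), d_max(ci, cj), d_avg(ci, cj), d_mean(ci, cj))
```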

k-Fold Cross Validation
• Cross validation (XV)
  • 1 fold for testing, the rest (k − 1 folds) for training
  • rotate until every fold has been used for testing once
  • calculate the average over the k folds
• m×k-fold cross validation
  • reshuffle the data and repeat XV m times
• What is a suitable k?
• Model complexity
  • use of XV
  • tree complexity vs. training/testing error rates
[Figure: data partitioned into Fold 1, Fold 2 and Fold 3]
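A minimal sketch of k-fold and m×k-fold cross validation in Python follows. The `train_and_test` callback and the majority-class learner are hypothetical stand-ins for a real learning algorithm; the callback is assumed to take a training list and a test list and return the test error rate.

```python
import random
from collections import Counter

def k_fold_indices(n, k, rng):
    """Shuffle indices 0..n-1 and deal them into k disjoint, nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[f::k] for f in range(k)]

def cross_validate(data, k, train_and_test, rng):
    """Each fold is the test set once; the other k - 1 folds form the training set."""
    fold_errors = []
    for test_idx in k_fold_indices(len(data), k, rng):
        held_out = set(test_idx)
        train = [data[i] for i in range(len(data)) if i not in held_out]
        test = [data[i] for i in test_idx]
        fold_errors.append(train_and_test(train, test))
    return sum(fold_errors) / k  # average error over the k folds

def repeated_cv(data, k, m, train_and_test, seed=0):
    """m x k-fold CV: reshuffle and repeat k-fold CV m times, then average."""
    rng = random.Random(seed)
    return sum(cross_validate(data, k, train_and_test, rng) for _ in range(m)) / m

def majority_baseline(train, test):
    """Hypothetical learner: predict the most frequent training label; return test error."""
    majority = Counter(y for _, y in train).most_common(1)[0][0]
    return sum(1 for _, y in test if y != majority) / len(test)

data = [(i, "a") for i in range(7)] + [(i, "b") for i in range(7, 10)]
print(repeated_cv(data, k=5, m=10, train_and_test=majority_baseline))
```

Reshuffling with a single seeded generator keeps the m repetitions different from one another while making the whole run reproducible.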

Presentations of Evaluation Results
• Results are usually about time, space, trend and average case
• Learning ("happy") curves
  • accuracy increases over X
  • its opposite (the error) decreases over X
• Box-plot
  • whiskers: min and max
  • box: confidence interval around the mean
  • graphical equivalent of a t-test
[Figure: box-plot marking max, mean and min]
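The numbers behind such a box-plot can be computed as sketched below. The box is taken as mean ± z·(standard error) with z = 1.96, an assumed normal-approximation 95% level; for small samples a t critical value would be more appropriate. The function name and the accuracy values are hypothetical.

```python
import math
import statistics

def box_summary(values, z=1.96):
    """Whiskers at min/max; box = mean +/- z * standard error (z = 1.96 assumed)."""
    m = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return {"min": min(values), "max": max(values), "mean": m,
            "box_low": m - z * se, "box_high": m + z * se}

# Hypothetical accuracies of one algorithm over 10 runs.
acc = [0.81, 0.84, 0.79, 0.83, 0.85, 0.80, 0.82, 0.84, 0.83, 0.81]
print(box_summary(acc))
```

Drawing two such boxes side by side gives the informal visual comparison the slide describes; a formal conclusion still needs a test such as the t test.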

Statistical Tests
• Null hypothesis and alternative hypothesis
• Type I and Type II errors
• Student’s t test comparing two means
• Paired t test comparing two means
• Chi-square test
  • contingency table

Null Hypothesis
• Null hypothesis ($H_0$)
  • states that there is no difference between the population parameter and its hypothesized value; an observed difference in the test statistic is attributed to chance
  • e.g. $H_0: \mu = \mu_0$
• Alternative hypothesis ($H_1$)
  • specifies the parameter value(s) to be accepted if $H_0$ is rejected
  • e.g. $H_1: \mu \neq \mu_0$ (two-tailed test)
  • or $H_1: \mu > \mu_0$ (one-tailed test)

Type I, II Errors
• Type I error ($\alpha$)
  • rejecting a null hypothesis when it is true (a false positive)
• Type II error ($\beta$)
  • accepting a null hypothesis when it is false (a false negative)
• Power $= 1 - \beta$
• Costs of different errors
  • e.g. a life-saving medicine that appears to be effective, is cheap and has no side effects ($H_0$: the medicine is not effective)
  • Type I error: concluding it is effective when it is not; not costly
  • Type II error: concluding it is not effective when it is; very costly

Test Using Student’s t Distribution
• Using the t distribution to test the difference between two population means is appropriate if
  • the population standard deviations are not known
  • the samples are small (n < 30)
  • the populations are assumed to be approximately normal
  • the two unknown standard deviations are equal ($\sigma_1 = \sigma_2$)
• $H_0: \mu_1 - \mu_2 = 0$, $H_1: \mu_1 - \mu_2 \neq 0$
• Check the difference of the estimated means, normalized by its standard error computed from the pooled (common) standard deviation
• Degrees of freedom and significance level (p level)
  • $df = n_1 + n_2 - 2$
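The pooled two-sample t statistic can be sketched with Python's `statistics` module as $t = (\bar{x}_1 - \bar{x}_2) / (s_p \sqrt{1/n_1 + 1/n_2})$, where $s_p^2$ is the pooled sample variance. The accuracy values are hypothetical, and the critical value 2.306 is the standard t-table entry for df = 8 at a two-tailed α = 0.05; no p-value is computed here.

```python
import math
import statistics

def pooled_t_test(a, b):
    """Two-sample t statistic assuming equal unknown variances; returns (t, df)."""
    n1, n2 = len(a), len(b)
    m1, m2 = statistics.mean(a), statistics.mean(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)  # sample variances (n - 1)
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / df               # pooled variance
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))                  # standard error of m1 - m2
    return (m1 - m2) / se, df

a = [0.82, 0.85, 0.80, 0.84, 0.83]  # hypothetical accuracies, independent sample 1
b = [0.78, 0.80, 0.77, 0.79, 0.81]  # hypothetical accuracies, independent sample 2
t, df = pooled_t_test(a, b)
T_CRIT = 2.306  # two-tailed, alpha = 0.05, df = 8 (from a t table)
print(round(t, 3), df, "reject H0" if abs(t) > T_CRIT else "do not reject H0")
```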

Paired t Test
• With paired observations, use the paired t test
  • now $H_0: \mu_d = 0$ and $H_1: \mu_d \neq 0$, where $\mu_d$ is the mean of the paired differences
  • check the estimated mean difference
• The t statistic is calculated differently in the previous (two-sample) and the current (paired) case
• Both are two-tailed tests: a significance level of 1% means 0.5% in each tail
• Excel can do this for you!
[Figure: t distribution centred at 0, with rejection regions beyond $-t_{\alpha/2}$ and $+t_{\alpha/2}$]
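For paired observations the test works on the differences $d_i$ alone, with $t = \bar{d} / (s_d/\sqrt{n})$ and $df = n - 1$. In this sketch the fold accuracies are hypothetical, and 4.604 is the standard t-table value for df = 4 at a two-tailed α = 0.01 (0.5% in each tail).

```python
import math
import statistics

def paired_t_test(a, b):
    """Paired t test on matched observations: t = mean(d) / (s_d / sqrt(n)), df = n - 1."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need at least two matched pairs")
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    t = statistics.mean(d) / (statistics.stdev(d) / math.sqrt(n))
    return t, n - 1

acc_a = [0.82, 0.85, 0.80, 0.84, 0.83]  # hypothetical: classifier A on folds 1..5
acc_b = [0.78, 0.80, 0.77, 0.79, 0.81]  # hypothetical: classifier B on the same folds
t, df = paired_t_test(acc_a, acc_b)
T_CRIT = 4.604  # two-tailed, alpha = 0.01, df = 4 (from a t table)
print(round(t, 3), df, "reject H0" if abs(t) > T_CRIT else "do not reject H0")
```

Because the fold-to-fold variation cancels in the differences, the paired statistic can be much larger than a two-sample statistic computed from the same numbers.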

Chi-Square Test (Goodness of Fit)
• Tests a null hypothesis that the population distribution of a random variable follows a specified form
• The chi-square statistic is calculated as
  $\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{k} \frac{(A_{ij} - E_{ij})^2}{E_{ij}}$
  where $A_{ij}$ are the actual (observed) and $E_{ij}$ the expected frequencies
• Degrees of freedom $df = k - m - 1$
  • k = number of data categories
  • m = number of parameters estimated
  • m = 0 for a uniform, 1 for a Poisson, 2 for a normal distribution
• Each cell's expected frequency should be at least 5
• One-tailed test: the rejection region is the upper tail
[Figure: chi-square distribution with the rejection region in the upper tail]
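A goodness-of-fit check against a uniform distribution (m = 0) can be sketched as follows, summing over the k cells of a single row of counts. The die-roll counts are hypothetical, the "at least 5" rule is implemented as a warning rather than an error (a design assumption), and 11.070 is the standard chi-square table value for df = 5 at α = 0.05.

```python
import warnings

def chi_square(observed, expected):
    """Chi-square statistic: sum over cells of (A - E)^2 / E."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of cells")
    if any(e < 5 for e in expected):
        warnings.warn("some expected frequency is below 5; the approximation may be poor")
    return sum((a - e) ** 2 / e for a, e in zip(observed, expected))

# Hypothetical example: is a die fair? 120 rolls, so 20 expected per face (uniform, m = 0).
observed = [25, 17, 15, 23, 24, 16]
expected = [sum(observed) / len(observed)] * len(observed)
k, m = len(observed), 0
df = k - m - 1
chi2 = chi_square(observed, expected)
CHI2_CRIT = 11.070  # upper-tail critical value, alpha = 0.05, df = 5 (from a chi-square table)
print(chi2, df, "reject H0" if chi2 > CHI2_CRIT else "do not reject H0")
```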

