
CSci 8980: Data Mining (Fall 2002)


Presentation Transcript


  1. CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance Computing Research Center Department of Computer Science University of Minnesota http://www.cs.umn.edu/~kumar

  2. Model Evaluation • Metrics for Performance Evaluation • How to evaluate the performance of a model? • Methods for Performance Evaluation • How to obtain reliable estimates of performance? • Methods for Model Comparison • How to compare the relative performance of competing models?

  3. Metrics for Performance Evaluation • Focus on the predictive capability of a model, rather than how fast it classifies or builds models, scalability, etc. • Confusion Matrix (rows = actual class, columns = predicted class):
                  Predicted Yes             Predicted No
    Actual Yes    a: TP (true positive)     b: FN (false negative)
    Actual No     c: FP (false positive)    d: TN (true negative)

  4. Metrics for Performance Evaluation… • Most widely-used metric: Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
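
To make the bookkeeping concrete, here is a minimal Python sketch (not from the slides; the label lists are made-up illustrative data) that tallies the four confusion-matrix cells and computes accuracy:

```python
# Minimal sketch: tally confusion-matrix cells and compute accuracy.
# The label lists below are illustrative, not from the slides.
actual    = ["Yes", "Yes", "No", "No",  "Yes", "No", "Yes", "No"]
predicted = ["Yes", "No",  "No", "Yes", "Yes", "No", "Yes", "No"]

a = sum(1 for t, p in zip(actual, predicted) if t == "Yes" and p == "Yes")  # TP
b = sum(1 for t, p in zip(actual, predicted) if t == "Yes" and p == "No")   # FN
c = sum(1 for t, p in zip(actual, predicted) if t == "No"  and p == "Yes")  # FP
d = sum(1 for t, p in zip(actual, predicted) if t == "No"  and p == "No")   # TN

accuracy = (a + d) / (a + b + c + d)
print(f"TP={a} FN={b} FP={c} TN={d} accuracy={accuracy:.2f}")
```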

  5. Cost Matrix • C(i|j): cost of misclassifying a class j example as class i • Accuracy is a useful measure if • C(Yes|No) = C(No|Yes) and C(Yes|Yes) = C(No|No) • P(Yes) = P(No) (class distributions are equal)
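
A small sketch of how a cost matrix is applied; the counts and costs here are made-up values, not the slide's example:

```python
# Sketch: total misclassification cost, keyed by (predicted, actual) class.
# Counts and costs are illustrative assumptions, not from the slides.
counts = {("Yes", "Yes"): 40, ("Yes", "No"): 10,
          ("No",  "Yes"): 5,  ("No",  "No"): 45}
C = {("Yes", "Yes"): 0,  ("Yes", "No"): 1,    # C(i|j): cost of predicting i
     ("No",  "Yes"): 10, ("No",  "No"): 0}    # when the true class is j

total_cost = sum(counts[k] * C[k] for k in counts)
accuracy = (counts[("Yes", "Yes")] + counts[("No", "No")]) / sum(counts.values())
print(f"accuracy={accuracy:.2f} total cost={total_cost}")
```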

  6. Cost vs Accuracy • One model achieves Accuracy = 80% at Cost = 3910; another achieves Accuracy = 90% at Cost = 4255. The more accurate model can still incur the higher total cost, so the two metrics can rank models differently.

  7. Cost-Sensitive Measures • Precision p = a / (a + c); biased towards C(Yes|Yes) & C(Yes|No) • Recall r = a / (a + b); biased towards C(Yes|Yes) & C(No|Yes) • F-measure F = 2rp / (r + p) = 2a / (2a + b + c); biased towards all except C(No|No)
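
A quick sketch computing these measures from confusion-matrix counts (the counts are illustrative):

```python
# Sketch: precision, recall, and F-measure from confusion-matrix cells.
a, b, c, d = 30, 10, 5, 55   # TP, FN, FP, TN (illustrative values)

precision = a / (a + c)                                    # p = TP / (TP + FP)
recall    = a / (a + b)                                    # r = TP / (TP + FN)
f_measure = 2 * recall * precision / (recall + precision)  # = 2a / (2a + b + c)
print(f"precision={precision:.3f} recall={recall:.3f} F={f_measure:.3f}")
```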

  8. Methods for Performance Evaluation • How to obtain a reliable estimate of performance? • Performance of a model may depend on other factors besides the learning algorithm: • Class distribution • Cost of misclassification • Size of training and test sets

  9. Learning Curve • Learning curve shows how accuracy changes with varying sample size • Requires a sampling schedule for creating the learning curve: • Arithmetic sampling (Langley et al.) • Geometric sampling (Provost et al.) • Effect of small sample size: • Bias in the estimate • Variance of the estimate

  10. Methods of Estimation • Holdout • Reserve 2/3 for training and 1/3 for testing • Random subsampling • Repeated holdout • Cross validation • Partition data into k disjoint subsets • k-fold: train on k-1 partitions, test on the remaining one • Leave-one-out: k=n • Stratified sampling • oversampling vs undersampling • Bootstrap • Sampling with replacement
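
As an illustrative sketch of the k-fold idea in plain Python (the fold-generation helper is my own, not from the slides):

```python
import random

# Sketch of k-fold cross-validation index generation: partition the data
# into k disjoint subsets, then train on k-1 of them and test on the rest.
def k_fold_indices(n, k, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # k disjoint partitions
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i+1:] for j in f]
        yield train, test                          # train on k-1, test on 1

for train, test in k_fold_indices(n=10, k=5):
    print(f"train={sorted(train)} test={sorted(test)}")
```

Leave-one-out is the special case k = n, where each fold holds a single instance.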

  11. ROC (Receiver Operating Characteristic) • Developed in the 1950s for signal detection theory to analyze noisy signals • Characterizes the trade-off between positive hits and false alarms • ROC curve plots TP (on the y-axis) against FP (on the x-axis) • Performance of each classifier is represented as a point on the ROC curve • Changing the algorithm's threshold, the sample distribution, or the cost matrix changes the location of the point

  12. ROC Curve • 1-dimensional data set containing 2 classes (positive and negative) • Any point located at x > t is classified as positive • At threshold t: TP = 0.5, FN = 0.5, FP = 0.12, TN = 0.88

  13. ROC Curve (TP,FP): • (0,0): declare everything to be negative class • (1,1): declare everything to be positive class • (1,0): ideal • Diagonal line: • Random guessing • Below diagonal line: • prediction is opposite of the true class

  14. Using ROC for Model Comparison • No model consistently outperforms the other • M1 is better for small FPR • M2 is better for large FPR • Area Under the ROC Curve (AUC) • Ideal: Area = 1 • Random guess: Area = 0.5

  15. How to Construct an ROC curve • Use classifier that produces posterior probability for each test instance P(+|A) • Sort the instances according to P(+|A) in decreasing order • Apply threshold at each unique value of P(+|A) • Count the number of TP, FP, TN, FN at each threshold • TP rate, TPR = TP/(TP+FN) • FP rate, FPR = FP/(FP + TN)
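
A minimal sketch of this procedure (the scores and labels are illustrative, not the slide's table):

```python
# Sketch: ROC points from posterior probabilities P(+|A).
scores = [0.95, 0.93, 0.87, 0.85, 0.76, 0.53, 0.43, 0.25]  # illustrative
labels = ["+",  "+",  "-",  "+",  "-",  "+",  "-",  "-"]

pairs = sorted(zip(scores, labels), reverse=True)  # decreasing P(+|A)
P = labels.count("+")                              # total positives
N = labels.count("-")                              # total negatives
for t in sorted(set(scores), reverse=True):        # threshold at each value
    tp = sum(1 for s, y in pairs if s >= t and y == "+")
    fp = sum(1 for s, y in pairs if s >= t and y == "-")
    print(f"threshold={t:.2f} TPR={tp / P:.2f} FPR={fp / N:.2f}")
```

Plotting the (FPR, TPR) pairs in threshold order traces out the ROC curve.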

  16. How to Construct an ROC Curve (Table: test instances sorted by P(+|A), with the TP, FP, TN, FN counts at each threshold ≥ cutoff, followed by the resulting ROC curve.)

  17. Test of Significance • Given two models: • Model M1: accuracy = 85%, tested on 30 instances • Model M2: accuracy = 75%, tested on 5000 instances • Can we say M1 is better than M2? • How much confidence can we place on accuracy of M1 and M2? • Can the difference in performance measure be explained as a result of random fluctuations in the test set?

  18. Confidence Interval for Accuracy • Prediction can be regarded as a Bernoulli trial • A Bernoulli trial has 2 possible outcomes • Possible outcomes for prediction: correct or wrong • Collection of Bernoulli trials has a Binomial distribution: • x ~ Bin(N, p), where x is the number of correct predictions • e.g.: Toss a fair coin 50 times, how many heads would turn up? Expected number of heads = Np = 50 × 0.5 = 25 • Given x (# of correct predictions) or, equivalently, acc = x/N, and N (# of test instances), can we predict p (the true accuracy of the model)?

  19. Confidence Interval for Accuracy • For large test sets (N > 30), acc has a normal distribution with mean p and variance p(1 − p)/N: P( Z_{α/2} ≤ (acc − p) / √(p(1 − p)/N) ≤ Z_{1−α/2} ) = 1 − α (the area under the normal curve between Z_{α/2} and Z_{1−α/2} is 1 − α) • Confidence interval for p: p = ( 2·N·acc + Z²_{α/2} ± Z_{α/2}·√(Z²_{α/2} + 4·N·acc − 4·N·acc²) ) / ( 2·(N + Z²_{α/2}) )

  20. Confidence Interval for Accuracy • Consider a model that produces an accuracy of 80% when evaluated on 100 test instances: • N = 100, acc = 0.8 • Let 1 − α = 0.95 (95% confidence) • From the probability table, Z_{α/2} = 1.96 • Substituting into the interval above gives p ∈ (0.711, 0.867)
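
A short sketch that evaluates the interval formula from the previous slide (the helper name is mine):

```python
from math import sqrt

# Sketch: confidence interval for the true accuracy p, given observed
# accuracy acc on N test instances, using the interval from slide 19.
def accuracy_interval(acc, n, z=1.96):          # z = Z_{alpha/2} for 95%
    center = 2 * n * acc + z**2
    spread = z * sqrt(z**2 + 4 * n * acc - 4 * n * acc**2)
    denom = 2 * (n + z**2)
    return (center - spread) / denom, (center + spread) / denom

lo, hi = accuracy_interval(acc=0.8, n=100)
print(f"p is in ({lo:.3f}, {hi:.3f})")          # roughly (0.711, 0.867)
```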

  21. Comparing Performance of 2 Models • Given two models, say M1 and M2, which is better? • M1 is tested on D1 (size = n1), found error rate = e1 • M2 is tested on D2 (size = n2), found error rate = e2 • Assume D1 and D2 are independent • If n1 and n2 are sufficiently large, then e1 ~ N(μ1, σ1) and e2 ~ N(μ2, σ2) • Approximate each variance by: σ̂_i² = e_i(1 − e_i)/n_i

  22. Comparing Performance of 2 Models • To test if the performance difference is statistically significant: d = e1 − e2 • d ~ N(d_t, σ_t), where d_t is the true difference • Since D1 and D2 are independent, their variances add: σ_t² ≅ σ1² + σ2² ≅ e1(1 − e1)/n1 + e2(1 − e2)/n2 • At the (1 − α) confidence level, d_t = d ± Z_{α/2}·σ̂_t

  23. An Illustrative Example • Given: M1: n1 = 30, e1 = 0.15; M2: n2 = 5000, e2 = 0.25 • d = |e2 − e1| = 0.1 (2-sided test) • σ̂_d = √(0.15(0.85)/30 + 0.25(0.75)/5000) ≈ 0.0655 • At 95% confidence level, Z_{α/2} = 1.96, so d_t = 0.100 ± 1.96 × 0.0655 = 0.100 ± 0.128 => Interval contains 0 => difference may not be statistically significant
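
The same arithmetic in a few lines of Python, reproducing the example above:

```python
from math import sqrt

# Sketch: is the difference between two observed error rates significant?
n1, e1 = 30, 0.15        # model M1, from the example above
n2, e2 = 5000, 0.25      # model M2

d = abs(e2 - e1)
var_d = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2   # independent test sets
margin = 1.96 * sqrt(var_d)                        # 95% confidence
print(f"d = {d:.3f} +/- {margin:.3f}")             # 0.100 +/- 0.128, contains 0
```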

  24. Comparing Performance of 2 Algorithms • Each learning algorithm may produce k models: • L1 may produce M11, M12, …, M1k • L2 may produce M21, M22, …, M2k • If models are generated on the same test sets D1, D2, …, Dk (e.g., via cross-validation) • For each set: compute d_j = e1j − e2j • d_j has mean d_t and variance σ_t² • Estimate: σ̂_t² = Σ_{j=1}^{k} (d_j − d̄)² / (k(k − 1)) and d_t = d̄ ± t_{1−α, k−1}·σ̂_t
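
A sketch of this paired comparison (the per-fold error rates are illustrative, and the t critical value is a standard table entry):

```python
from math import sqrt

# Sketch: paired comparison of two algorithms over the same k folds.
e1 = [0.20, 0.25, 0.22, 0.18, 0.24]   # algorithm L1 errors, folds 1..k
e2 = [0.23, 0.28, 0.22, 0.21, 0.27]   # algorithm L2 errors, same folds

k = len(e1)
dj = [a - b for a, b in zip(e1, e2)]               # per-fold differences
d_bar = sum(dj) / k
var_t = sum((x - d_bar) ** 2 for x in dj) / (k * (k - 1))
t_crit = 2.776                                     # t_{0.975, k-1} for k = 5
print(f"d = {d_bar:.3f} +/- {t_crit * sqrt(var_t):.3f}")
```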

  25. What is Cluster Analysis? • Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. • Based on information found in the data that describes the objects and their relationships. • Also known as unsupervised classification. • Many applications • Understanding: group related documents for browsing, or find genes and proteins that have similar functionality. • Summarization: reduce the size of large data sets.

  26. What is not Cluster Analysis? • Supervised classification. • Have class label information. • Simple segmentation. • Dividing students into different registration groups alphabetically, by last name. • Results of a query. • Groupings are a result of an external specification. • Graph partitioning • Some mutual relevance and synergy, but areas are not identical.

  27. Notion of a Cluster is Ambiguous (Figure: the same set of initial points grouped into two, four, or six clusters.)

  28. Types of Clusterings • A clustering is a set of clusters. • One important distinction is between hierarchical and partitional sets of clusters. • Partitional Clustering • A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. • Hierarchical Clustering • A set of nested clusters organized as a hierarchical tree.

  29. Partitional Clustering (Figure: the original points and a partitional clustering of them.)

  30. Hierarchical Clustering (Figure: a traditional hierarchical clustering with its dendrogram, and a non-traditional hierarchical clustering with its dendrogram.)

  31. Other Distinctions Between Sets of Clusters • Exclusive versus non-exclusive • In non-exclusive clusterings, points may belong to multiple clusters. • Can represent multiple classes or ‘border’ points • Fuzzy versus non-fuzzy • In fuzzy clusterings, a point belongs to every cluster with some weight between 0 and 1. • Weights must sum to 1. • Probabilistic clustering has similar characteristics. • Partial versus complete. • In some cases, we only want to cluster some of the data.

  32. Types of Clusters: Well-Separated • Well-Separated Clusters: • A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.

  33. Types of Clusters: Center-Based • Center-based • A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of its cluster than to the center of any other cluster. • The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster.

  34. Types of Clusters: Contiguity-Based • Contiguous Cluster (Nearest Neighbor or Transitive) • A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.

  35. Types of Clusters: Density-Based • Density-based • A cluster is a dense region of points, separated from other high-density regions by regions of low density. • Used when the clusters are irregular or intertwined, and when noise and outliers are present. • The three curves don’t form clusters since they fade into the noise, as does the bridge between the two small circular clusters.

  36. Similarity and Dissimilarity • Similarity • Numerical measure of how alike two data objects are. • Is higher when objects are more alike. • Often falls in the range [0,1] • Dissimilarity • Numerical measure of how different two data objects are. • Is lower when objects are more alike. • Minimum dissimilarity is often 0. • Upper limit varies • Proximity refers to a similarity or dissimilarity

  37. Summary of Similarity/Dissimilarity for Simple Attributes (Table: similarity and dissimilarity definitions for simple attribute types; p and q are the attribute values for two data objects.)

  38. Euclidean Distance • Euclidean Distance: dist(p, q) = √( Σ_{k=1}^{n} (p_k − q_k)² ), where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q. • Standardization is necessary if scales differ.

  39. Euclidean Distance (Figure: example points and their pairwise distance matrix.)

  40. Minkowski Distance • Minkowski Distance is a generalization of Euclidean Distance: dist(p, q) = ( Σ_{k=1}^{n} |p_k − q_k|^r )^{1/r}, where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.

  41. Minkowski Distance: Examples • r = 1. City block (Manhattan, taxicab, L1 norm) distance. • A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors. • r = 2. Euclidean distance. • r → ∞. “Supremum” (L_max norm, L_∞ norm) distance. • This is the maximum difference between any component of the vectors. • Do not confuse r with n; all these distances are defined for all numbers of dimensions.
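
A small sketch of the three cases, using two illustrative 2-D points:

```python
# Sketch: Minkowski distance for r = 1, r = 2, and the supremum limit.
def minkowski(p, q, r):
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

def supremum(p, q):                         # limit as r goes to infinity
    return max(abs(a - b) for a, b in zip(p, q))

p, q = (0, 2), (3, 1)                       # illustrative points
print(minkowski(p, q, 1))                   # L1 (city block) = 4
print(minkowski(p, q, 2))                   # L2 (Euclidean) ~ 3.162
print(supremum(p, q))                       # L_inf = 3
```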

  42. Minkowski Distance (Figure: distance matrices for the example points under different values of r.)

  43. Common Properties of a Distance • Distances, such as the Euclidean distance, have some well-known properties. • d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness) • d(p, q) = d(q, p) for all p and q. (Symmetry) • d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality) where d(p, q) is the distance (dissimilarity) between points (data objects) p and q. • A distance that satisfies these properties is a metric.

  44. Common Properties of a Similarity • Similarities also have some well-known properties. • s(p, q) = 1 (or maximum similarity) only if p = q. • s(p, q) = s(q, p) for all p and q. (Symmetry) where s(p, q) is the similarity between points (data objects) p and q.

  45. Similarity Between Binary Vectors • A common situation is that objects, p and q, have only binary attributes. • Compute similarities using the following quantities: M01 = the number of attributes where p was 0 and q was 1 M10 = the number of attributes where p was 1 and q was 0 M00 = the number of attributes where p was 0 and q was 0 M11 = the number of attributes where p was 1 and q was 1 • Simple Matching and Jaccard Coefficients: SMC = number of matches / number of attributes = (M11 + M00) / (M01 + M10 + M11 + M00) J = number of 11 matches / number of not-both-zero attribute values = (M11) / (M01 + M10 + M11)

  46. SMC versus Jaccard: Example p = 1 0 0 0 0 0 0 0 0 0 q = 0 0 0 0 0 0 1 0 0 1 M01= 2 (the number of attributes where p was 0 and q was 1) M10= 1 (the number of attributes where p was 1 and q was 0) M00= 7 (the number of attributes where p was 0 and q was 0) M11= 0 (the number of attributes where p was 1 and q was 1) SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7 J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
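
The example above can be checked with a few lines of Python:

```python
# Sketch reproducing the SMC/Jaccard example above.
p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)

smc = (m11 + m00) / (m01 + m10 + m11 + m00)
jaccard = m11 / (m01 + m10 + m11) if (m01 + m10 + m11) else 0.0
print(f"SMC={smc:.1f} Jaccard={jaccard:.1f}")   # SMC=0.7 Jaccard=0.0
```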

  47. Cosine Similarity • If d1 and d2 are two document vectors, then cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||), where • indicates the vector dot product and ||d|| is the length of vector d. • Example: d1 = 3 2 0 5 0 0 0 2 0 0 d2 = 1 0 0 0 0 0 0 1 0 2 d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5 ||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481 ||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449 cos(d1, d2) = 5 / (6.481 × 2.449) = 0.3150
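
The worked example translates directly into Python:

```python
from math import sqrt

# Sketch reproducing the cosine-similarity example above.
d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

dot = sum(a * b for a, b in zip(d1, d2))        # 5
norm1 = sqrt(sum(a * a for a in d1))            # sqrt(42) = 6.481
norm2 = sqrt(sum(b * b for b in d2))            # sqrt(6)  = 2.449
print(f"cos = {dot / (norm1 * norm2):.4f}")     # 0.3150
```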

  48. Extended Jaccard Coefficient (Tanimoto) • Variation of Jaccard for continuous or count attributes: T(p, q) = (p • q) / (||p||² + ||q||² − p • q) • Reduces to Jaccard for binary attributes.

  49. Correlation • Correlation measures the linear relationship between objects. • To compute correlation, we standardize the data objects p and q and then take the dot product: p′_k = (p_k − mean(p)) / std(p), q′_k = (q_k − mean(q)) / std(q), correlation(p, q) = (p′ • q′) / (n − 1).
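
A sketch of this standardize-then-dot-product computation (the data vectors are illustrative; std uses the sample, n − 1, convention):

```python
from math import sqrt

# Sketch: correlation as the dot product of standardized vectors.
def correlation(p, q):
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    sp = sqrt(sum((x - mp) ** 2 for x in p) / (n - 1))   # sample std of p
    sq = sqrt(sum((x - mq) ** 2 for x in q) / (n - 1))   # sample std of q
    ps = [(x - mp) / sp for x in p]                       # standardized p
    qs = [(x - mq) / sq for x in q]                       # standardized q
    return sum(a * b for a, b in zip(ps, qs)) / (n - 1)

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))   # perfectly linear: 1.0
```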

  50. Visually Evaluating Correlation (Figure: scatter plots showing pairs of variables with correlation values ranging from –1 to 1.)
