Evaluating Performance for Data Mining Techniques


Evaluating Numeric Output • Mean absolute error (MAE) • Mean square error (MSE) • Root mean square error (RMSE)

Mean Absolute Error (MAE) The average absolute difference between the classifier's predicted output and the actual output.

Mean Square Error (MSE) The average of the squared differences between the classifier's predicted output and the actual output.

Root Mean Square Error (RMSE) The square root of the mean square error.
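The three measures above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names and the sample values are assumptions chosen only for demonstration.

```python
import math

def mae(predicted, actual):
    """Mean absolute error: average of |predicted - actual|."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def mse(predicted, actual):
    """Mean square error: average of (predicted - actual)^2."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def rmse(predicted, actual):
    """Root mean square error: square root of the MSE."""
    return math.sqrt(mse(predicted, actual))

# Hypothetical predictions and targets, for illustration only.
predicted = [2.5, 0.0, 2.0, 8.0]
actual = [3.0, -0.5, 2.0, 7.0]

print(mae(predicted, actual))   # 0.5
print(mse(predicted, actual))   # 0.375
print(rmse(predicted, actual))  # square root of 0.375
```

Because the differences are squared, MSE and RMSE penalize a single large error more heavily than MAE does; RMSE is expressed in the same units as the output.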

Clustering Techniques

Clustering Techniques • Clustering techniques apply some measure of similarity to divide the instances of the data to be analyzed into disjoint partitions. • The partitions are generalized by computing a group mean for each cluster or by listing the most typical subset of instances from each cluster.

Clustering Techniques • 1st approach: unsupervised clustering. • 2nd approach: partition the data hierarchically, so that each level of the hierarchy is a generalization of the data at some level of abstraction.


The K-Means Algorithm • The K-means algorithm is a simple but widely used statistical technique for unsupervised clustering. • It divides the instances of the data to be analyzed into K disjoint partitions (clusters). • Proposed by S.P. Lloyd in 1957, first published in 1982.

The K-Means Algorithm 1. Choose a value for K, the total number of clusters. 2. Randomly choose K points as cluster centers. 3. Assign the remaining instances to their closest cluster center (for example, using Euclidean distance as a criterion). 4. Calculate a new cluster center for each cluster. 5. Repeat steps 3-5 until the cluster centers do not change.
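The steps of the algorithm can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the names `k_means`, `euclidean`, `mean_point`, the `seed` and `max_iter` parameters, the handling of empty clusters, and the sample points are all hypothetical choices made for illustration.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(instances, k, seed=0, max_iter=100):
    """Return (centers, clusters) for a list of point tuples."""
    rng = random.Random(seed)
    centers = rng.sample(instances, k)           # step 2: random initial centers
    clusters = []
    for _ in range(max_iter):                    # max_iter is a safety cap (assumption)
        # Step 3: assign every instance to its closest center. The instances
        # chosen as initial centers are at distance 0 from themselves.
        clusters = [[] for _ in range(k)]
        for p in instances:
            nearest = min(range(k), key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)
        # Step 4: recompute each center as the mean of its members.
        # Assumption: an empty cluster keeps its previous center.
        new_centers = [mean_point(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:               # step 5: stop when centers are stable
            break
        centers = new_centers
    return centers, clusters

# Hypothetical data: two visually separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, clusters = k_means(points, k=2)
for center, members in zip(centers, clusters):
    print(center, members)
```

On this small, well-separated data set the two groups are recovered regardless of which points are drawn as initial centers; on less clearly separated data the result can vary with the initial draw.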

The K-Means Algorithm: Analysis • Choose a value for K, the total number of clusters – this step requires an initial judgment about how many clusters can be distinguished within the data set.

The K-Means Algorithm: Analysis • Randomly choose K points as cluster centers – the initial cluster centers are selected randomly. If K was chosen properly, the resulting clustering should ideally not depend on this selection; in practice, different initial centers can lead to different final clusters, so results from several runs are usually compared.

The K-Means Algorithm: Analysis • Calculate a new cluster center for each cluster – the new cluster centers are the means of the cluster members that were assigned to their clusters in the previous step.

The K-Means Algorithm: Analysis • Repeat steps 3-5 until the cluster centers do not change – the process of instance assignment and cluster center computation continues until an iteration of the algorithm produces no change in the cluster centers. • The algorithm terminates after j iterations if, for each cluster Ci, all instances found in Ci after iteration j-1 remain in Ci upon completion of iteration j.

Euclidean Distance The Euclidean distance between two n-dimensional vectors x = (x_1, ..., x_n) and y = (y_1, ..., y_n) is determined as $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
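The Euclidean distance is the square root of the sum of squared coordinate differences between the two vectors. A minimal sketch in Python follows; the function name `euclidean_distance` and the sample vectors are assumptions made for illustration.

```python
import math

def euclidean_distance(x, y):
    """Square root of the sum of squared coordinate differences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Coordinate differences are 3, 4 and 0, so the distance is sqrt(9 + 16) = 5.
print(euclidean_distance((1.0, 2.0, 3.0), (4.0, 6.0, 3.0)))  # 5.0
```

In Python 3.8 and later, the standard library function `math.dist` computes the same quantity.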

Cluster Quality • How can we evaluate the quality and reliability of a clustering? • One evaluation method, which is more suitable for clusters of about equal size, is to calculate the sum of squared differences (errors) between the instances of each cluster and their cluster center. Smaller values indicate clusters of higher quality.

Cluster Quality • Another evaluation method is to calculate the mean of the squared differences (errors) between the instances of each cluster and their cluster center. Smaller values indicate clusters of higher quality.

Optimal Clustering Criterion • A clustering is considered optimal when the mean square deviation of the cluster members from their center, averaged over all clusters, is either: • minimal over several (s) experiments • or less than some predetermined acceptable value
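Both quality measures can be computed directly from a cluster's members. The sketch below assumes the cluster center is the mean of the members, as in K-means; the function names and the two sample clusters are hypothetical.

```python
def centroid(cluster):
    """Coordinate-wise mean of the cluster members."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def sum_squared_error(cluster):
    """Sum of squared distances from each member to the cluster center."""
    center = centroid(cluster)
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in cluster)

def mean_squared_error(cluster):
    """Sum of squared error divided by the number of members."""
    return sum_squared_error(cluster) / len(cluster)

# Hypothetical clusters: same shape, different spread.
tight = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
loose = [(0.0, 0.0), (0.0, 4.0), (4.0, 0.0), (4.0, 4.0)]

print(sum_squared_error(tight), mean_squared_error(tight))  # 2.0 0.5
print(sum_squared_error(loose), mean_squared_error(loose))  # 32.0 8.0
```

The sum grows with the number of members, which is why it is better suited to clusters of similar size; dividing by the member count removes that dependence.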
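The criterion can be sketched as a selection over the results of several clustering experiments. This is an illustrative sketch only: `select_clustering`, its `acceptable` parameter, and the two hand-written candidate clusterings are assumptions, and each experiment is represented simply as a list of clusters (lists of points).

```python
def cluster_mse(cluster):
    """Mean squared deviation of the members from their cluster mean."""
    n = len(cluster)
    center = [sum(coords) / n for coords in zip(*cluster)]
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in cluster) / n

def average_mse(clustering):
    """Average of the per-cluster mean squared deviations."""
    return sum(cluster_mse(c) for c in clustering) / len(clustering)

def select_clustering(runs, acceptable=None):
    """Return (index, score) of the chosen experiment.

    If `acceptable` is given, the first run scoring below it is chosen;
    otherwise (or if none qualifies) the run with the minimal score is chosen.
    """
    scores = [average_mse(run) for run in runs]
    if acceptable is not None:
        for i, score in enumerate(scores):
            if score < acceptable:
                return i, score
    best = min(range(len(scores)), key=lambda i: scores[i])
    return best, scores[best]

# Two hypothetical experiments on the same four one-dimensional instances.
run_a = [[(1.0,), (2.0,)], [(9.0,), (10.0,)]]
run_b = [[(1.0,)], [(2.0,), (9.0,), (10.0,)]]

print(select_clustering([run_b, run_a]))                   # (1, 0.25)
print(select_clustering([run_b, run_a], acceptable=10.0))  # picks run_b, index 0
```

The two variants trade thoroughness for cost: taking the minimum requires all s experiments to finish, while the threshold allows stopping at the first acceptable result.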

An Example Using the K-Means Algorithm

Unsupervised Model Evaluation

The K-Means Algorithm: General Considerations • Requires real-valued data. • We must select the number of clusters present in the data. • Works best when the clusters that exist in the data are of approximately equal size. If an optimal solution is represented by clusters of unequal size, the K-Means algorithm is not likely to find it. • Attribute significance cannot be determined. • A supervised data mining tool must be used to gain insight into the nature of the clusters formed by a clustering tool.

Supervised Learning for Unsupervised Model Evaluation • Designate each formed cluster as a class and assign each class an arbitrary name. • Choose a random sample of instances from each class for supervised learning. • Build a supervised model from the chosen instances. Employ the remaining instances to test the correctness of the model.
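The procedure above can be sketched as follows. The slides do not name a particular supervised model, so this sketch assumes a simple nearest-centroid classifier purely for illustration; `evaluate_clusters`, `train_fraction`, `seed`, and the sample clusters are likewise hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance (sufficient for nearest-center comparison)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def evaluate_clusters(clusters, train_fraction=0.5, seed=0):
    """Treat each cluster as a class and return test-set accuracy."""
    rng = random.Random(seed)
    train, test = {}, []
    for i, members in enumerate(clusters):
        label = f"class_{i}"                  # arbitrary class name per cluster
        shuffled = list(members)
        rng.shuffle(shuffled)                 # random sample from each class
        cut = max(1, int(len(shuffled) * train_fraction))
        train[label] = shuffled[:cut]
        test.extend((p, label) for p in shuffled[cut:])
    # Supervised model (assumption): one mean vector per class.
    model = {label: tuple(sum(c) / len(pts) for c in zip(*pts))
             for label, pts in train.items()}
    if not test:
        return None                           # nothing left to test on
    correct = sum(
        1 for p, label in test
        if min(model, key=lambda name: sq_dist(p, model[name])) == label
    )
    return correct / len(test)

# Hypothetical clusters produced by some clustering tool.
cluster_a = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (2.0, 1.0)]
cluster_b = [(8.0, 8.0), (9.0, 8.5), (8.5, 9.0), (9.0, 9.0)]
print(evaluate_clusters([cluster_a, cluster_b]))  # 1.0
```

High test accuracy suggests the clusters are well separated in the attribute space; low accuracy suggests the cluster boundaries are not easily learned and the clustering may be poorly defined.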
