Evaluating Performance for Data Mining Techniques
Evaluating Numeric Output • Mean absolute error (MAE) • Mean square error (MSE) • Root mean square error (RMSE)
Mean Absolute Error (MAE) The average absolute difference between the classifier's predicted output and the actual output.
Mean Square Error (MSE) The average of the squared differences between the classifier's predicted output and the actual output.
Root Mean Square Error (RMSE) The square root of the mean square error.
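The three measures defined above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names and the sample values in `actual` and `predicted` are made up for the example.

```python
import math

def mean_absolute_error(actual, predicted):
    # Average of |actual - predicted| over all instances.
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_square_error(actual, predicted):
    # Average of (actual - predicted)^2 over all instances.
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def root_mean_square_error(actual, predicted):
    # Square root of the mean square error.
    return math.sqrt(mean_square_error(actual, predicted))

# Hypothetical sample data (differences: 1, 0, -2, -1).
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]

print(mean_absolute_error(actual, predicted))     # 1.0
print(mean_square_error(actual, predicted))       # 1.5
print(root_mean_square_error(actual, predicted))  # about 1.2247
```

Because the differences are squared, MSE and RMSE penalize the single error of size 2 more heavily than MAE does.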
Clustering Techniques • Clustering Techniques apply some measure of similarity to divide instances of the data to be analyzed into disjoint partitions • The partitions are generalized by computing a group mean for each cluster or by listing a most typical subset of instances from each cluster
Clustering Techniques • First approach: unsupervised clustering • Second approach: partitioning the data in a hierarchical fashion, where each level of the hierarchy is a generalization of the data at some level of abstraction.
The K-Means Algorithm • The K-means algorithm is a simple but widely used statistical technique for unsupervised clustering • The K-means algorithm divides the instances of the data to be analyzed into K disjoint partitions (clusters). • Proposed by S.P. Lloyd in 1957, first published in 1982.
The K-Means Algorithm 1. Choose a value for K, the total number of clusters. 2. Randomly choose K points as cluster centers. 3. Assign the remaining instances to their closest cluster center (for example, using Euclidean distance as a criterion). 4. Calculate a new cluster center for each cluster. 5. Repeat steps 3-5 until the cluster centers do not change.
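The steps above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `k_means`, `seed`, and `max_iter` are hypothetical, the initial centers are drawn from the data points themselves, every instance (including the chosen centers) is assigned in each pass, and an empty cluster simply keeps its previous center.

```python
import math
import random

def euclidean(a, b):
    # Euclidean distance between two equal-length numeric tuples.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 2: randomly choose K data points as the initial cluster centers.
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 3: assign each instance to its closest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)
        # Step 4: the new center of each cluster is the mean of its members.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[i])  # assumption: keep the old center
        # Step 5: stop once the cluster centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, clusters = k_means(points, k=2)
print(centers)
print(clusters)
```

The `max_iter` cap is only a safety limit; on small, well-separated data such as this the loop normally stops after a few passes.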
The K-Means Algorithm: Analysis • Choose a value for K, the total number of clusters – this step requires an initial discussion about how many clusters can be distinguished within a data set
The K-Means Algorithm: Analysis • Randomly choose K points as cluster centers – the initial cluster centers are selected randomly; if K was chosen properly, the resulting clustering should ideally depend little on this selection, although in practice different initial centers can lead to different final clusterings
The K-Means Algorithm: Analysis • Calculate a new cluster center for each cluster – the new cluster centers are the means of the cluster members that were assigned to their clusters in the previous step
The K-Means Algorithm: Analysis • Repeat steps 3-5 until the cluster centers do not change – the process of instance classification and cluster center computation continues until an iteration of the algorithm shows no change in the cluster centers. • The algorithm terminates after j iterations if, for each cluster Ci, all instances found in Ci after iteration j-1 remain in cluster Ci upon the completion of iteration j
Euclidean Distance The Euclidean distance between two n-dimensional vectors x = (x_1, ..., x_n) and y = (y_1, ..., y_n) is determined as $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
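As a small illustration of the distance used when assigning instances to cluster centers, the standard Euclidean distance (the square root of the sum of squared coordinate differences) can be written as below. The function name `euclidean_distance` is chosen for this example only.

```python
import math

def euclidean_distance(x, y):
    # Square root of the sum of squared coordinate differences.
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))  # 5.0
```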
Cluster Quality • How can we evaluate the quality and reliability of the clusters? • One evaluation method, which is more suitable for clusters of about equal size, is to calculate the sum of squared differences (errors) between the instances of each cluster and their cluster center. Smaller values indicate clusters of higher quality.
Cluster Quality • Another evaluation method is to calculate the mean squared difference (error) between the instances of each cluster and their cluster center. Smaller values indicate clusters of higher quality.
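Both quality measures can be sketched for a single cluster as follows. This is a minimal example, assuming the "difference" between an instance and its center is the Euclidean distance, so each term is a squared Euclidean distance; the function names and the sample cluster are hypothetical.

```python
def cluster_center(cluster):
    # Mean of the cluster members, computed per dimension.
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def sum_squared_error(cluster):
    # Sum of squared Euclidean distances from each member to the center.
    center = cluster_center(cluster)
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in cluster)

def mean_squared_error(cluster):
    # The same sum, divided by the number of members in the cluster.
    return sum_squared_error(cluster) / len(cluster)

# Hypothetical cluster with center (2.0, 2.0).
cluster = [(1.0, 1.0), (3.0, 1.0), (2.0, 4.0)]
print(cluster_center(cluster))      # (2.0, 2.0)
print(sum_squared_error(cluster))   # 8.0
print(mean_squared_error(cluster))  # about 2.667
```

Dividing by the cluster size is what makes the mean version less sensitive to differences in cluster size than the raw sum.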
Optimal Clustering Criterion • A clustering is considered optimal when the mean square deviation of the cluster members from their center, averaged over all clusters, is either: • minimal over several (s) experiments • or less than some predetermined acceptable value
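One way to apply this criterion is sketched below, assuming each experiment has already produced a clustering (a list of clusters, each a list of points). The names `average_mse`, `select_clustering`, and `acceptable` are hypothetical, and the two candidate clusterings are made-up data.

```python
def cluster_mse(cluster):
    # Mean squared Euclidean distance of the members from their center.
    center = [sum(coords) / len(cluster) for coords in zip(*cluster)]
    total = sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in cluster)
    return total / len(cluster)

def average_mse(clustering):
    # Average of the per-cluster mean square deviations.
    return sum(cluster_mse(c) for c in clustering) / len(clustering)

def select_clustering(candidates, acceptable=None):
    # Return the first candidate scoring below `acceptable` (if given),
    # otherwise the candidate with the minimal score over all experiments.
    best = None
    for clustering in candidates:
        score = average_mse(clustering)
        if acceptable is not None and score < acceptable:
            return clustering, score
        if best is None or score < best[1]:
            best = (clustering, score)
    return best

# Two hypothetical experiments on the same four points.
poor = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]
good = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]

chosen, score = select_clustering([poor, good])
print(chosen is good, score)  # True 1.0
```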
An Example Using the K-Means Algorithm
The K-Means Algorithm: General Considerations • Requires real-valued data. • We must select the number of clusters present in the data. • Works best when the clusters that exist in the data are of approximately equal size. If an optimal solution is represented by clusters of unequal size, the K-Means algorithm is not likely to find it. • Attribute significance cannot be determined. • A supervised data mining tool must be used to gain insight into the nature of the clusters formed by a clustering tool.
Supervised Learning for Unsupervised Model Evaluation • Designate each formed cluster as a class and assign each class an arbitrary name. • Choose a random sample of instances from each class for supervised learning. • Build a supervised model from the chosen instances. Employ the remaining instances to test the correctness of the model.
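The procedure above might look like the following sketch. The slides do not name a particular supervised model, so a 1-nearest-neighbor classifier is used here purely as an assumption; `evaluate_clusters`, `train_fraction`, and `seed` are hypothetical names, and the sample clusters are made up.

```python
import random

def evaluate_clusters(clusters, train_fraction=0.5, seed=0):
    rng = random.Random(seed)
    train, test = [], []
    # Designate each cluster as a class; the index serves as an arbitrary name.
    for label, members in enumerate(clusters):
        shuffled = list(members)
        rng.shuffle(shuffled)
        # Choose a random sample of each class for supervised learning.
        cut = max(1, int(len(shuffled) * train_fraction))
        train += [(p, label) for p in shuffled[:cut]]
        test += [(p, label) for p in shuffled[cut:]]

    def predict(p):
        # Assumed supervised model: label of the nearest training instance.
        nearest = min(train, key=lambda t: sum((a - b) ** 2 for a, b in zip(p, t[0])))
        return nearest[1]

    if not test:
        return None  # no instances left for testing
    # Employ the remaining instances to test the correctness of the model.
    correct = sum(predict(p) == label for p, label in test)
    return correct / len(test)

# Hypothetical clusters, for example the output of a K-means run.
clusters = [
    [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (2.0, 1.0)],
    [(8.0, 8.0), (9.0, 8.5), (8.5, 9.0), (9.0, 9.0)],
]
print(evaluate_clusters(clusters))  # 1.0
```

A high test accuracy suggests the clusters are well separated and can be described by a supervised model; a low accuracy suggests the cluster boundaries are poorly defined.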