Using Representative-Based Clustering For Nearest Neighbour Dataset Editing
Christoph F. Eick, Nidal Zeidat, Ricardo Vilalta
Department of Computer Science, University of Houston, Texas, USA
Organization of the Talk
• Dataset Editing and Condensing
• Representative-based Supervised Clustering
• Experimental Results
• Applications of Supervised Clustering
• Summary and Conclusion
1. Introduction
Nearest Neighbour Editing
Consider a two-class problem where each sample consists of two measurements (x, y).
• k = 1: for a given query point q, assign the class of the nearest neighbour.
• k = 3: compute the k nearest neighbours and assign the class by majority vote.
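The nearest-neighbour rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `knn_classify`, the Euclidean distance, and the toy samples are all assumptions made for the example.

```python
from collections import Counter
import math

def knn_classify(train, query, k=1):
    """Classify `query` by majority vote among its k nearest training samples.

    train: list of ((x, y), label) pairs; query: an (x, y) tuple.
    Euclidean distance is an assumption; the slide does not name a metric.
    """
    neighbours = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical two-class toy data.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((3.0, 3.0), "B"),
         ((3.2, 2.9), "B"), ((2.0, 2.0), "B")]
q = (1.8, 1.7)
print(knn_classify(train, q, k=1))  # prints B (the single closest point is a B)
print(knn_classify(train, q, k=3))  # prints A (two of the three closest are A)
```

The same query point receives a different class for k = 1 and k = 3, which is the effect the two panels of the slide illustrate.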
Dataset Reduction: Editing
• Training data may contain noise and overlapping classes.
• Editing seeks to remove noisy points and produce smooth decision boundaries, often by retaining points far from the decision boundaries.
• Main goal of editing: enhance the accuracy of the classifier (% of “unseen” examples classified correctly).
• Secondary goal of editing: enhance the speed of a k-NN classifier.
Wilson Editing
• Remove points that do not agree with the majority of their k nearest neighbours.
• Therefore, only points that are classified incorrectly are removed.
[Figure provided by David Claus: two examples (the earlier example and overlapping classes), each showing the original data and the result of Wilson editing with k=7.]
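A minimal Python sketch of the Wilson editing rule described above. The function name `wilson_edit`, the Euclidean distance, and the toy data are assumptions for illustration; each point's neighbours are taken from the original, unedited data.

```python
from collections import Counter
import math

def wilson_edit(samples, k=3):
    """Keep only samples whose label matches the majority of their k nearest neighbours.

    samples: list of (point, label) pairs. Neighbours are always searched in the
    original data, excluding the sample itself.
    """
    kept = []
    for i, (point, label) in enumerate(samples):
        others = samples[:i] + samples[i + 1:]
        neighbours = sorted(others, key=lambda s: math.dist(s[0], point))[:k]
        majority = Counter(l for _, l in neighbours).most_common(1)[0][0]
        if majority == label:
            kept.append((point, label))
    return kept

# Hypothetical data: two clean clusters plus one "B" sample inside the "A" cluster.
samples = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"), ((1, 1), "A"),
           ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B"), ((6, 6), "B"),
           ((0.5, 0.5), "B")]
edited = wilson_edit(samples, k=3)
print(len(samples), len(edited))  # prints 9 8 (only the stray B sample is removed)
```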
Dataset Reduction: Condensing
• The aim is to reduce the number of training samples and thereby gain speed.
• Retain only the samples that are needed to define the decision boundary.
• Tends to remove examples that are classified correctly by a k-NN classifier.
• Decision Boundary Consistent: a subset whose nearest neighbour decision boundary is identical to the boundary of the entire training set.
• Minimum Consistent Set: the smallest subset of the training data that correctly classifies all of the original training data.
[Figure provided by David Claus: original data and its minimum consistent set.]
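The consistency property behind the minimum consistent set can be checked directly. The sketch below only tests whether a given subset classifies all original training data correctly with 1-NN; it does not search for the smallest such subset. The helper names and the toy data are assumptions.

```python
import math

def nearest_label(subset, point):
    """Label of the sample in `subset` closest to `point` (1-NN, Euclidean)."""
    return min(subset, key=lambda s: math.dist(s[0], point))[1]

def is_consistent(subset, training):
    """True if 1-NN on `subset` classifies every original training sample correctly."""
    return all(nearest_label(subset, p) == label for p, label in training)

# Hypothetical, well-separated two-class data.
training = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
            ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(is_consistent([((0, 0), "A"), ((5, 5), "B")], training))  # prints True
print(is_consistent([((0, 0), "A")], training))                 # prints False
```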
Objectives Supervised Clustering: Minimize cluster impurity while keeping the number of clusters low (expressed by a fitness function q(X)).
2. Representative-Based Supervised Clustering (RSC)
• Aims at finding a set of objects in the data set (called representatives) that best represent all objects in the data set. Each representative corresponds to a cluster.
• The remaining objects in the data set are then clustered around these representatives by assigning each object to the cluster of its closest representative.
Remark: The popular k-medoids algorithm, also called PAM, is a representative-based clustering algorithm.
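The assignment step described above (each object joins the cluster of its closest representative) can be sketched as follows. The function name, the use of Euclidean distance, and the example objects are assumptions; how the representatives are chosen is left open here.

```python
import math

def cluster_by_representatives(objects, rep_indices):
    """Assign every object to the cluster of its closest representative.

    objects: list of coordinate tuples; rep_indices: indices into `objects`
    that act as representatives. Returns {representative index: [member indices]}.
    """
    clusters = {r: [] for r in rep_indices}
    for i, obj in enumerate(objects):
        closest = min(rep_indices, key=lambda r: math.dist(objects[r], obj))
        clusters[closest].append(i)
    return clusters

# Hypothetical objects; objects 0 and 3 are taken as representatives.
objects = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
print(cluster_by_representatives(objects, [0, 3]))
# prints {0: [0, 1, 2], 3: [3, 4, 5]}
```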
Representative-Based Supervised Clustering … (Continued)
[Figure: objects plotted over Attribute1 and Attribute2, grouped into clusters 1 to 4.]
Objective of RSC: Find a subset O_R of O such that the clustering X obtained by using the objects in O_R as representatives minimizes q(X).
RSC Dataset Editing
[Figure: two panels over Attribute1 and Attribute2, with clusters labelled A to F.]
a. Dataset clustered using supervised clustering.
b. Dataset edited using cluster representatives.
A Fitness Function for Supervised Clustering
q(X) := Impurity(X) + β*Penalty(k)
• k: number of clusters used
• n: number of examples in the dataset
• c: number of classes in the dataset
• β: weight for Penalty(k), 0 < β ≤ 2.0
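The slide does not spell out Impurity(X) or Penalty(k). The sketch below therefore rests on two assumptions, chosen because they use only the symbols n, c and k listed above: Impurity(X) is the fraction of examples that do not belong to the majority class of their cluster, and Penalty(k) is sqrt((k - c)/n) for k ≥ c and 0 otherwise. Treat both definitions, and the function name `fitness`, as illustrative.

```python
import math
from collections import Counter

def fitness(cluster_labels, num_classes, beta=0.4):
    """q(X) = Impurity(X) + beta * Penalty(k) under the assumed definitions.

    cluster_labels: one list of class labels per cluster.
    Assumed: Impurity = minority examples / n,
             Penalty(k) = sqrt((k - c) / n) if k >= c else 0.
    """
    n = sum(len(c) for c in cluster_labels)
    k = len(cluster_labels)
    # Minority examples: members that disagree with their cluster's majority class.
    minority = sum(len(c) - max(Counter(c).values()) for c in cluster_labels if c)
    impurity = minority / n
    penalty = math.sqrt((k - num_classes) / n) if k >= num_classes else 0.0
    return impurity + beta * penalty

# Hypothetical clustering: 8 examples, 2 classes, 3 clusters, 1 minority example.
clusters = [["A", "A", "B"], ["B", "B", "B"], ["A", "A"]]
print(fitness(clusters, num_classes=2, beta=0.4))
```

With these assumptions, a larger β makes extra clusters more expensive, so minimizing q(X) trades purity against the number of representatives.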
SC Algorithms Currently Investigated
• Supervised Partitioning Around Medoids (SPAM).
• Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart (SRIDHCR).
• Top Down Splitting Algorithm (TDS).
• Supervised Clustering using Evolutionary Computing (SCEC).
• Agglomerative Hierarchical Supervised Clustering (AHSC).
3. Experimental Evaluation
• We compared a traditional 1-NN, 1-NN using Wilson Editing, Supervised Clustering Editing (SCE), and C4.5 (run with its default parameter settings).
• A benchmark consisting of 8 UCI datasets was used for this purpose.
• Accuracies were computed using 10-fold cross validation.
• SRIDHCR was used for supervised clustering.
• SCE was tested using different compression rates by associating different penalties with the number of clusters found (by setting parameter β to 0.1, 0.4 and 1.0).
• Compression rates of SCE and Wilson Editing were computed using 1-(k/n), with n being the size of the original dataset and k being the size of the edited dataset.
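The compression-rate formula above is simple enough to state as code. The function name and the numbers in the usage line are made up for illustration and are not results from the talk.

```python
def compression_rate(n_original, k_edited):
    """Compression rate 1 - (k/n): the fraction of the original dataset removed."""
    if n_original <= 0 or not 0 <= k_edited <= n_original:
        raise ValueError("expected 0 <= k_edited <= n_original and n_original > 0")
    return 1 - k_edited / n_original

# Hypothetical example: a 150-example dataset edited down to 3 representatives.
print(f"{compression_rate(150, 3):.1%}")
```

A rate of 0 means nothing was removed; a rate close to 1 means only a handful of examples were kept.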
Table 3: Dataset Compression Rates for SCE and Wilson Editing.
4. Applications of Supervised Clustering
• Enhance classification algorithms.
• Use SC for Dataset Editing to enhance NN-classifiers
• Improve Simple Classifiers
• Learning Sub-classes
• Distance Function Learning
• Dataset Compression/Reduction
• Redistricting
• Meta Learning / Creating Signatures for Datasets
5. Summary
• Wilson editing enhanced the accuracy of a traditional 1-NN classifier for 6 of the 8 datasets tested. It achieved compression rates of approx. 25%, but much lower compression rates for “easy” datasets.
• SCE achieved very high compression rates without loss in accuracy for 6 of the 8 datasets tested.
• SCE accomplished a significant improvement in accuracy for 3 of the 8 datasets tested.
• Surprisingly, many UCI datasets can be compressed by using just a single representative per class without a significant loss in accuracy.
• SCE tends to pick representatives that are in the center of a region dominated by a single class; it removes correctly classified examples as well as incorrectly classified ones from the dataset. This explains its much higher compression rates.
Current Direction of this Research
[Diagram: a preprocessing step p transforms Data Set into Data Set’. An inductive learning algorithm (IDLA) learns Classifier C from Data Set and Classifier C’ from Data Set’.]
Goal: Find p such that C’ is more accurate than C, or C and C’ have approximately the same accuracy but C’ can be learnt more quickly and/or C’ classifies new examples more quickly.
Currently Investigated: Different editing techniques and techniques that originate from high-performance clustering algorithms (e.g. CURE).
Links to 4 Related Papers
• [VAE03] R. Vilalta, M. Achari, C. Eick, Class Decomposition via Clustering: A New Framework for Low-Variance Classifiers, in Proc. IEEE International Conference on Data Mining (ICDM), Melbourne, Florida, November 2003. http://www.cs.uh.edu/~ceick/kdd/VAE03.pdf
• [EZZ04] C. Eick, N. Zeidat, Z. Zhao, Supervised Clustering --- Algorithms and Benefits, short version of this paper to appear in Proc. International Conference on Tools with AI (ICTAI), Boca Raton, Florida, November 2004. http://www.cs.uh.edu/~ceick/kdd/EZZ04.pdf
• [ERBV04] C. Eick, A. Rouhana, A. Bagherjeiran, R. Vilalta, Using Clustering to Learn Distance Functions for Supervised Similarity Assessment, in revision, to be submitted to MLDM'05, Leipzig, Germany, July 2005. http://www.cs.uh.edu/~ceick/kdd/ERBV04.pdf
• [EZV04] C. Eick, N. Zeidat, R. Vilalta, Using Representative-Based Clustering for Nearest Neighbor Dataset Editing, to appear in Proc. IEEE International Conference on Data Mining (ICDM), Brighton, England, November 2004. http://www.cs.uh.edu/~ceick/kdd/EZV04.pdf
Multi-edit [Devijver & Kittler ’79]
(1) Diffusion: divide the data into N ≥ 3 random subsets.
(2) Classification: classify Si using 1-NN with S((i+1) mod N) as the training set (i = 1..N).
(3) Editing: discard all samples incorrectly classified in (2).
(4) Confusion: pool all remaining samples into a new set.
(5) Termination: if the last I iterations produced no editing, then end; otherwise go to (1).
• Repeatedly applies Wilson editing to random partitions.
• Classifies with the 1-NN rule.
• Approximates the error rate of the Bayes decision rule.
[Figure provided by David Claus: multi-edit after 8 iterations, the last 3 of which gave the same result.]
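The five multi-edit steps can be sketched as a short loop. This is an illustrative reading of the steps, not a reference implementation: the function and parameter names (`multi_edit`, `n_subsets` for N, `patience` for I, `seed`) are assumptions, subsets are indexed from 0, and the loop also stops if fewer than N samples remain, which the slide does not address.

```python
import math
import random

def nearest_label(train, point):
    """Label of the sample in `train` closest to `point` (1-NN, Euclidean)."""
    return min(train, key=lambda s: math.dist(s[0], point))[1]

def multi_edit(samples, n_subsets=3, patience=3, seed=0):
    """Repeat diffusion, classification, editing and confusion until
    `patience` consecutive iterations remove nothing."""
    rng = random.Random(seed)
    data = list(samples)
    unchanged = 0
    while unchanged < patience and len(data) >= n_subsets:
        # (1) Diffusion: split the shuffled data into N random subsets.
        rng.shuffle(data)
        subsets = [data[i::n_subsets] for i in range(n_subsets)]
        # (2) + (3) Classification and editing: classify each subset with 1-NN
        # using the next subset as training set; keep only correct samples.
        kept = []
        for i, subset in enumerate(subsets):
            train = subsets[(i + 1) % n_subsets]
            kept.extend(s for s in subset if nearest_label(train, s[0]) == s[1])
        # (5) Termination bookkeeping: count iterations without any editing.
        unchanged = unchanged + 1 if len(kept) == len(data) else 0
        # (4) Confusion: pool the survivors for the next iteration.
        data = kept
    return data

# Hypothetical data: two clusters plus one "B" sample inside the "A" cluster.
samples = [((x, y), "A") for x in range(3) for y in range(3)]
samples += [((x + 10, y + 10), "B") for x in range(3) for y in range(3)]
samples.append(((1.5, 1.5), "B"))
edited = multi_edit(samples)
print(len(samples), len(edited))
```

Because the partitions are random, the number of samples removed depends on the seed; correctly labelled points can also be discarded when their subset's training partner happens to lack nearby points of the same class.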