Using Clustering to Enhance Classifiers
Christoph F. Eick
Organization of the Talk
• Brief Introduction to KDD
• Using Clustering
  • for Nearest Neighbour Editing
  • for Distance Function Learning
  • for Class Decomposition
• Representative-Based Supervised Clustering Algorithms
• Summary and Conclusion
Objectives of Today’s Presentation
• Goal: give you a flavor of the kinds of questions and techniques investigated by my/our current research
• Brief introduction to KDD
• Not discussed:
  • Why is KDD/classification/clustering important?
  • Example applications for KDD/classification/clustering
  • Evaluation of the presented techniques (if you are interested in how the techniques presented here compare with other approaches, see [VAE03], [EZZ04], [ERBV04], [EZV04], [RE05])
  • Literature survey
1. Knowledge Discovery in Data [and Data Mining] (KDD) • Definition := “KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad) • Frequently, the term data mining is used to refer to KDD. • Many commercial and experimental tools and tool suites are available (see http://www.kdnuggets.com/siftware.html) Let us find something interesting!
KDD: Confluence of Multiple Disciplines
[Diagram: KDD at the intersection of Database Technology, Statistics, Machine Learning, Visualization, Information Science, and other disciplines.]
Popular KDD-Tasks • Classification (learn how to classify) • Clustering (finding groups of similar objects) • Estimation and Prediction (try to learn a function that predicts the value of a continuous output variable based on a set of input variables) • Deviation and Fraud Detection • Concept description: Characterization and Discrimination • Trend and Evolution Analysis • Mining for Associations and Correlations • Text Mining • Web Mining • Visualization • Data Transformation and Data Cleaning • Data Integration and Data Warehousing
Important KDD Conferences • KDD (has 500-900 participants, strong industrial presence, KDD-Cup, controlled by ACM) • ICDM (receives approx. 500 papers each year, controlled by IEEE) • PKDD (European KDD Conference)
2. Clustering for Classification
Assumption: we have a data set containing classified examples.
Goal: we want to learn a function (a classifier) that classifies an example based on its characteristics (attributes).
Example: http://www2.cs.uh.edu/~wxstrong/AI/nba.data http://www2.cs.uh.edu/~wxstrong/AI/nba.names
Topic for the next 40 minutes: presentation of three different approaches that use clustering to obtain better classifiers.
List of Persons that Contributed to the Work Presented in Today’s Presentation • Tae-Wan Ryu • Ricardo Vilalta • Murali Achari • Alain Rouhana • Abraham Bagherjeiran • Chunshen Chen • Nidal Zeidat • Zhenghong Zhao
Nearest Neighbour Rule
Consider a two-class problem where each sample consists of two measurements (x, y).
• k = 1: for a given query point q, assign the class of the nearest neighbour.
• k = 3: compute the k nearest neighbours and assign the class by majority vote.
Problem: requires a “good” distance function.
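For concreteness, here is a minimal k-NN sketch in Python; the function names (knn_classify, euclidean) and the Euclidean default are illustrative assumptions, not the implementation used in the talk. Note that the distance function is a plug-in parameter, which is exactly the part the later sections try to learn.

```python
from collections import Counter
import math

def euclidean(a, b):
    """Default plug-in distance; the talk's point is that this choice matters."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, examples, k=1, dist=euclidean):
    """examples: list of (attribute_vector, class_label) pairs.
    Return the majority class among the k nearest neighbours of query."""
    neighbours = sorted(examples, key=lambda ex: dist(query, ex[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```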
2a. Dataset Reduction: Editing
• Training data may contain noise and overlapping classes.
• Editing seeks to remove noisy points and produce smooth decision boundaries, often by retaining points far from the decision boundaries.
• Main goal of editing: enhance the accuracy of the classifier (the percentage of “unseen” examples classified correctly).
• Secondary goal of editing: enhance the speed of a k-NN classifier.
Wilson Editing
• Wilson 1972
• Remove points that do not agree with the majority of their k nearest neighbours.
[Figures: original data vs. Wilson editing with k = 7, shown for the earlier example and for overlapping classes.]
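A minimal sketch of Wilson editing, reusing the hypothetical knn_classify from above; k = 7 matches the figure, and the leave-one-out detail and names are assumptions of this sketch.

```python
def wilson_edit(examples, k=7, dist=euclidean):
    """Keep only examples whose class agrees with the majority vote of
    their k nearest neighbours (neighbours taken from the original set,
    excluding the example itself)."""
    edited = []
    for i, (x, label) in enumerate(examples):
        others = examples[:i] + examples[i + 1:]
        if knn_classify(x, others, k=k, dist=dist) == label:
            edited.append((x, label))
    return edited
```

The edited set then serves as the (smaller, smoother) training set of the k-NN classifier.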
Traditional Clustering
• Partition a set of objects into groups of similar objects; each group is called a cluster.
• Clustering is used to “detect classes” in a data set (“unsupervised learning”).
• Clustering is based on a fitness function that relies on a distance measure and usually tries to create “tight” clusters.
Objective of Supervised Clustering: minimize cluster impurity while keeping the number of clusters low (expressed by a fitness function q(X)).
Representative-Based Supervised Clustering (RSC)
• Aims at finding a set of objects (called representatives) among all objects in the data set that best represent the objects in the data set. Each representative corresponds to a cluster.
• The remaining objects in the data set are then clustered around these representatives by assigning each object to the cluster of the closest representative.
Remark: the popular k-medoids algorithm, also called PAM, is a representative-based clustering algorithm.
Representative-Based Supervised Clustering (continued)
[Figure: a dataset plotted over Attribute1 and Attribute2 with four representatives, labelled 1-4.]
Representative-Based Supervised Clustering (continued)
[Figure: the same dataset, clustered around the four representatives.]
Objective of RSC: find a subset O_R of O such that the clustering X obtained by using the objects in O_R as representatives minimizes q(X).
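A sketch of how a candidate set of representatives induces a clustering; the names and the brute-force assignment are assumptions of this sketch. Searching over candidate subsets O_R so that the induced clustering minimizes q(X) is what the supervised clustering algorithms in Section 3 do.

```python
def cluster_around(representatives, examples, dist=euclidean):
    """Assign every example to the cluster of its closest representative.
    representatives: list of attribute vectors drawn from the dataset;
    examples: list of (attribute_vector, class_label) pairs."""
    clusters = [[] for _ in representatives]
    for x, label in examples:
        nearest = min(range(len(representatives)),
                      key=lambda j: dist(x, representatives[j]))
        clusters[nearest].append((x, label))
    return clusters
```

For RSC dataset editing (next slide), the representatives themselves, with their class labels, become the reduced training set of the 1-NN classifier.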
RSC Dataset Editing
[Figures over Attribute1/Attribute2, with representatives labelled A-F:
a. Dataset clustered using supervised clustering.
b. Dataset edited using the cluster representatives.]
General Direction of this Research
Data Set --IDLA--> Classifier C
Data Set --p--> Data Set’ --IDLA--> Classifier C’
Goal: find a transformation p such that C’ is more accurate than C, or C and C’ have approximately the same accuracy but C’ can be learnt more quickly and/or C’ classifies new examples more quickly.
2b. Using Clustering in Distance Function Learning
Example: how to find similar patients? The following relation is given (with 10000 tuples):
Patient(ssn, weight, height, cancer-sev, eye-color, age, …)
Attribute domains:
• ssn: 9 digits
• weight: between 30 and 650; m_weight = 158, s_weight = 24.20
• height: between 0.30 and 2.20 in meters; m_height = 1.52, s_height = 19.2
• cancer-sev: 4 = serious, 3 = quite_serious, 2 = medium, 1 = minor
• eye-color: {brown, blue, green, grey}
• age: between 3 and 100; m_age = 45, s_age = 13.2
Task: Define Patient Similarity
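As a hedged illustration of what such an object distance function might look like before its weights are learned: numeric attributes are scaled by their standard deviations, categorical attributes compared by 0/1 mismatch. The scaling choice, the weights, and all names are assumptions of this sketch, not the learned values.

```python
def patient_distance(p, q, weights, stddev):
    """p, q: dicts mapping attribute names to values.
    weights: per-attribute weights (to be learned later);
    stddev: standard deviations of the numeric attributes,
            e.g. {'weight': 24.20, 'height': 19.2, 'age': 13.2}."""
    d = 0.0
    for attr, w in weights.items():
        if attr in stddev:                       # numeric attribute, scaled
            d += w * abs(p[attr] - q[attr]) / stddev[attr]
        else:                                    # categorical attribute, 0/1 mismatch
            d += w * (0.0 if p[attr] == q[attr] else 1.0)
    return d
```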
CAL-FULL/UH Database Clustering & Similarity Assessment Environments
[Architecture diagram, components: DBMS, Data Extraction Tool, Object View, Training Data, Learning Tool (today’s topic), Clustering Tool with a library of clustering algorithms, Similarity Measure Tool with a library of similarity measures, and a User Interface supplying type and weight information, default choices, and domain information; outputs: a set of clusters and a similarity measure.]
For more details: see [RE05]
Similarity Assessment Framework and Objectives
• Objective: learn a good distance function q for classification tasks.
• Our approach: apply a clustering algorithm with the distance function q to be evaluated; it returns a number of clusters k. The purer the obtained clusters are, the better the quality of q.
• Our goal is to learn the weights of an object distance function q such that all the clusters are pure (or as pure as possible); for more details see the [ERBV04] paper.
Idea: Coevolving Clusters and Distance Functions
[Diagram: a loop in which a weight-updating scheme / search strategy proposes a distance function Q, the data is clustered with Q into a clustering X, the clustering is evaluated via q(X), and the resulting goodness of the distance function Q feeds back into the weight-updating scheme. A “bad” distance function Q1 yields clusters with mixed class labels; a “good” distance function Q2 yields pure clusters.]
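A sketch of the coevolution loop itself, with the clusterer and the per-cluster weight adjustment passed in as callables (a possible adjustment step is sketched after the weight-adjustment slide near the end); the weighted Manhattan-style distance built from the current weights is an assumption of this sketch.

```python
def coevolve(examples, weights, clusterer, adjust, iterations=20):
    """Alternate between (1) clustering the data with the current weighted
    distance function Q and (2) letting `adjust` update the weights inside
    each cluster, so clusters and distance function coevolve.
    clusterer(examples, dist) -> list of clusters (lists of (vector, label));
    adjust(weights, cluster)  -> updated weights (e.g. inside/outside updating)."""
    for _ in range(iterations):
        dist = lambda a, b: sum(w * abs(x - y)
                                for w, x, y in zip(weights, a, b))
        clusters = clusterer(examples, dist)
        for cluster in clusters:
            weights = adjust(weights, cluster)
    return weights
```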
Idea: Inside/Outside Weight Updating
o := examples belonging to the majority class; x := non-majority-class examples.
Idea: move examples of the majority class closer to each other.
• Cluster 1, distances with respect to Att1: the majority-class examples already lie close to each other → action: increase the weight of Att1.
• Cluster 1, distances with respect to Att2: the majority-class examples are spread apart → action: decrease the weight of Att2.
Sample Run of IOWU for the Diabetes Dataset
[Graph produced by Abraham Bagherjeiran.]
Research Framework: Distance Function Learning
• Distance function evaluation: K-Means, Supervised Clustering, NN-Classifier, other work, …
• Weight-updating scheme / search strategy: Random Search, Randomized Hill Climbing, Inside/Outside Weight Updating, …
2c. Using Clustering for Class Decomposition
[Figure: examples of the Ford (:Ford) and GMC (:GMC) classes plotted over Attribute1 and Attribute2; each class decomposes into sub-clusters such as Ford Trucks, Ford Vans, Ford SUVs and GMC Trucks, GMC Vans, GMC SUVs.]
RSC: Enhancing Simple Classifiers
[Figure: a dataset plotted over Attribute1 and Attribute2 with clusters labelled A, B, C, D.]
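A hedged sketch of class decomposition along the lines of [VAE03]: each class is clustered separately and its examples are relabelled with (class, cluster) sub-class labels, so that a simple classifier trained on the sub-classes can map its predictions straight back to the parent class. The helper cluster_per_class and all names are assumptions of this sketch.

```python
def decompose_classes(examples, cluster_per_class):
    """examples: list of (attribute_vector, class_label) pairs.
    cluster_per_class(points) -> one cluster id per point (any clusterer).
    Returns the dataset relabelled with (class_label, cluster_id) sub-classes;
    predicting a sub-class recovers its parent class directly."""
    by_class = {}
    for x, label in examples:
        by_class.setdefault(label, []).append(x)
    relabelled = []
    for label, points in by_class.items():
        for x, cid in zip(points, cluster_per_class(points)):
            relabelled.append((x, (label, cid)))
    return relabelled
```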
3. SC Algorithms Currently Investigated
• Supervised Partitioning Around Medoids (SPAM)
• Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart (SRIDHCR)
• Top Down Splitting Algorithm (TDS)
• Supervised Clustering using Evolutionary Computing (SCEC)
• Agglomerative Hierarchical Supervised Clustering (AHSC)
A Fitness Function for Supervised Clustering
q(X) := Impurity(X) + β·Penalty(k)
• k: number of clusters used
• n: number of examples in the dataset
• c: number of classes in the dataset
• β: weight for Penalty(k), 0 < β ≤ 2.0
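A sketch of this fitness function in code. The concrete forms assumed here, Impurity(X) as the fraction of examples not in the majority class of their cluster and Penalty(k) = sqrt((k − c)/n) once k exceeds c, follow the supervised clustering papers cited in this talk, but treat them as assumptions of this sketch.

```python
from collections import Counter
import math

def fitness_q(clusters, n, c, beta=1.0):
    """clusters: list of clusters, each a list of (attribute_vector, label).
    n: number of examples in the dataset, c: number of classes,
    beta: weight for the penalty term, 0 < beta <= 2.0."""
    minority = 0
    for cluster in clusters:
        if not cluster:
            continue
        counts = Counter(label for _, label in cluster)
        minority += len(cluster) - max(counts.values())   # non-majority examples
    impurity = minority / n
    k = len(clusters)
    penalty = math.sqrt((k - c) / n) if k > c else 0.0    # discourages using many clusters
    return impurity + beta * penalty
```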
Applications of Supervised Clustering
• Enhance classification algorithms.
• Use SC for Dataset Editing to enhance NN-classifiers [EZV04]
• Improve Simple Classifiers [VAE03]
• Learning Sub-classes
• Distance Function Learning [ERBV04]
• Dataset Compression/Reduction
• Redistricting
• Meta Learning / Creating Signatures for Datasets
4. Summary • We gave a brief introduction to KDD • We demonstrated how clustering can be used to obtain “better” classifiers • We introduced a new form of clustering, called supervised clustering, for this purpose.
Research Topics 2004-2005 • Inductive Learning/Data Mining • Decision trees, nearest neighbor classifiers • Using clustering to enhance classification algorithms • Making sense of data • Supervised Clustering • Learning subclasses • Supervised clustering algorithms that learn clusters with arbitrary shape • Redistricting algorithms • Tools for Similarity Assessment and Distance Function Learning • Data Set Compression and Creating Meta Knowledge for Local Learning Techniques • Comparative study involving traditional editing and condensing and unusual techniques • Creating maps and other data set signatures for datasets based on editing, SC, and other techniques • Traditional Clustering • Data Mining and Information Retrieval for Structured Data • Other: Evolutionary Computing, File Prediction, Ontologies, Heuristic Search, Reinforcement Learning, Data Models. Remark: Topics that were “covered” in this talk are in blue
Where to Find References? • Data mining and KDD (SIGKDD member CDROM): • Conference proceedings: KDD, ICDM, PKDD etc. • Journal: Data Mining and Knowledge Discovery • Database field (SIGMOD member CD ROM): • Conference proceedings: ACM-SIGMOD, VLDB, ICDE, EDBT, DASFAA • Journals: ACM-TODS, J. ACM, IEEE-TKDE, JIIS, etc. • AI and Machine Learning: • Conference proceedings: ICML, AAAI, IJCAI, etc. • Journals: Machine Learning, Artificial Intelligence, etc. • Statistics: • Conference proceedings: Joint Stat. Meeting, etc. • Journals: Annals of statistics, etc. • Visualization: • Conference proceedings: CHI, etc. • Journals: IEEE Trans. visualization and computer graphics, etc.
Links to 5 Papers
• [VAE03] R. Vilalta, M. Achari, C. Eick, Class Decomposition via Clustering: A New Framework for Low-Variance Classifiers, in Proc. IEEE International Conference on Data Mining (ICDM), Melbourne, Florida, November 2003. http://www.cs.uh.edu/~ceick/kdd/VAE03.pdf
• [EZZ04] C. Eick, N. Zeidat, Z. Zhao, Supervised Clustering --- Algorithms and Benefits, short version of this paper to appear in Proc. International Conference on Tools with AI (ICTAI), Boca Raton, Florida, November 2004. http://www.cs.uh.edu/~ceick/kdd/EZZ04.pdf
• [ERBV04] C. Eick, A. Rouhana, A. Bagherjeiran, R. Vilalta, Using Clustering to Learn Distance Functions for Supervised Similarity Assessment, in revision, to be submitted to MLDM'05, Leipzig, Germany, July 2005. http://www.cs.uh.edu/~ceick/kdd/ERBV04.pdf
• [EZV04] C. Eick, N. Zeidat, R. Vilalta, Using Representative-Based Clustering for Nearest Neighbor Dataset Editing, to appear in Proc. IEEE International Conference on Data Mining (ICDM), Brighton, England, November 2004. http://www.cs.uh.edu/~ceick/kdd/EZV04.pdf
• [RE05] T.-W. Ryu, C. Eick, A Database Clustering Methodology and Tool, to appear in Information Sciences, Spring 2005. http://www.cs.uh.edu/~ceick/kdd/RE05.doc
Work at UH: Weight Adjustment within a Cluster
Let w_i be the current weight of the i-th attribute.
Let s_i be the average distance of the examples that belong to the cluster with respect to attribute f_i.
Let m_i be the average distance of the examples that belong to the majority class of the cluster with respect to attribute f_i.
Learning: the weights are then adjusted as follows with respect to a particular cluster:
w_i' = w_i + (s_i − m_i)·a
or, better,
w_i' = w_i + w_i·min(max(−b, (s_i − m_i)·a), b)
with a being the learning rate and b the maximal adjustment per weight per cluster (e.g., if b = 0.2, a weight can be increased or decreased by at most 20%).
Remark: if the cluster is “pure” or does not contain 2 or more elements of a particular class, no weight adjustment takes place.
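A sketch of this per-cluster update, using the clipped (“or better”) rule above; measuring s_i and m_i as attribute-wise average pairwise distances, and the default values of a and b, are assumptions of this sketch.

```python
from collections import Counter
from itertools import combinations

def avg_pairwise_dist(points, i):
    """Average absolute difference of attribute i over all pairs of points."""
    pairs = list(combinations(points, 2))
    if not pairs:
        return 0.0
    return sum(abs(p[i] - q[i]) for p, q in pairs) / len(pairs)

def adjust_weights(weights, cluster, a=0.3, b=0.2):
    """cluster: list of (attribute_vector, class_label) pairs.
    Implements w_i' = w_i + w_i * clip((s_i - m_i) * a, -b, +b):
    if the majority-class members are closer to each other than the cluster
    as a whole w.r.t. attribute i (m_i < s_i), the weight of attribute i grows."""
    counts = Counter(label for _, label in cluster)
    if len(counts) < 2:
        return list(weights)              # pure cluster: no adjustment
    majority_label = counts.most_common(1)[0][0]
    majority = [x for x, label in cluster if label == majority_label]
    if len(majority) < 2:
        return list(weights)              # too few majority members to measure m_i
    all_points = [x for x, _ in cluster]
    new_weights = []
    for i, w in enumerate(weights):
        s_i = avg_pairwise_dist(all_points, i)
        m_i = avg_pairwise_dist(majority, i)
        delta = min(max((s_i - m_i) * a, -b), b)
        new_weights.append(w + w * delta)
    return new_weights
```

This function has the adjust(weights, cluster) signature assumed by the coevolution loop sketched earlier, so the two sketches can be plugged together.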