Project Presentation: CPSC 695 Prepared By: Priyadarshi Bhattacharya
Outline of Talk • Introduction to clustering and its relevance to my research interests. • Discussion on existing clustering techniques and their shortcomings. • Introduction to a new Delaunay-based clustering algorithm. • Experimental results and comparison with other methods. • Direction of future research.
Clustering – Definition • Automatic identification of groups of similar objects. • A method of grouping data such that intracluster similarity is maximized and intercluster similarity is minimized.
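To make the definition concrete, here is a small toy illustration (not part of the presentation) that uses Euclidean distance as the dissimilarity measure; the point coordinates and the simple mean-pairwise-distance measure are made up for the example.

```python
import numpy as np

# Two toy clusters; the coordinates are invented purely for illustration.
cluster_a = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1]])
cluster_b = np.array([[5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

def mean_pairwise_dist(x, y):
    # Average Euclidean distance between every point of x and every point of y.
    return np.mean(np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1))

# Intra-cluster distance (small = high intra-cluster similarity); the zero
# self-distances on the diagonal are ignored for this rough sketch.
intra = 0.5 * (mean_pairwise_dist(cluster_a, cluster_a) +
               mean_pairwise_dist(cluster_b, cluster_b))
# Inter-cluster distance (large = low inter-cluster similarity).
inter = mean_pairwise_dist(cluster_a, cluster_b)
print(f"intra ~ {intra:.2f}, inter ~ {inter:.2f}")  # good grouping: intra << inter
```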
Properties of a good clustering algorithm • Scalability: running time should grow at most linearly as the data size increases • Ability to detect clusters of different shapes • Minimal input parameters • Robustness with regard to noise • Insensitivity to data input order • Scalability to higher dimensions (properties referred from "On Data Clustering Analysis: Scalability, Constraints and Validation" with minor modifications)
Relevance to my research • Identification of high-risk areas in the sea based on incident data from the Maritime Activity and Risk Investigation System (MARIS), maintained primarily by the University of Halifax. [Workflow diagram: Incident Data (ESRI Shape File) → Clustering Algorithm → High-risk areas; together with Location of SAR Bases → Marine Route Planning]
Existing clustering algorithms • Partitioning: K-Means, K-Medoid • Hierarchical: BIRCH, CURE, ROCK, CHAMELEON • Density-based: DBSCAN, TURN* • Grid-based: WaveCluster¹, CLIQUE ¹WaveCluster: a novel clustering approach based on wavelet transforms; applies a multi-resolution grid structure on the data space. For more details, refer to "WaveCluster: a multi-resolution clustering approach for very large spatial databases", Proc. 24th Conf. on Very Large Databases.
Shortcomings of existing methods • Require a large number of user-supplied parameters, for example the number of clusters, a threshold to quantify "similarity", a stopping condition, the number of nearest neighbors, etc. • Sensitive to these user-supplied parameters. • Capability of identifying clusters degrades as noise increases. • Inability to identify clusters of widely varying shapes and sizes; most detect only spherical clusters. • Identification of dense clusters in the presence of sparse ones, of clusters connected by multiple bridges, and of closely lying dense clusters remains elusive.
CRYSTAL: A new Delaunay-based clustering algorithm The algorithm has three stages: • Triangulation phase: forms the Delaunay triangulation of the data points and sorts the vertices in order of decreasing average length of adjacent edges. • Grow-cluster phase: scans the sorted vertex list and grows clusters from the vertices in that order, first encompassing first-order neighbors, then second-order neighbors, and so on. The growth stops when the boundary of the cluster is determined. • Noise-removal phase: the algorithm identifies noise as sparse clusters, which are easily eliminated by removing clusters that are very small in size or have a very low density.
Description of Stage I • Triangulation phase: • Triangulation is done in O(n log n) time using the incremental algorithm. • An auxiliary grid structure (O(n) in size) is used to speed up point location in the Delaunay triangulation. This considerably reduces the length of the walk in the graph needed to locate the triangle containing a data point. • The well-known Winged-Edge data structure is used to represent the Delaunay triangulation because of its efficiency in answering proximity queries.
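As a rough, non-authoritative sketch of this phase, the snippet below uses SciPy's Delaunay (standing in for the incremental construction, the auxiliary grid, and the Winged-Edge structure described above) to build the triangulation and then orders the vertices by decreasing average adjacent-edge length; the function name and the NumPy/SciPy choices are mine, not the original implementation's.

```python
import numpy as np
from scipy.spatial import Delaunay

def triangulation_phase(points):
    """Stage I sketch: Delaunay triangulation plus the vertex ordering."""
    tri = Delaunay(points)  # Qhull-based; a proxy for the incremental algorithm
    # Collect the undirected Delaunay edges from the triangle list.
    edges = set()
    for a, b, c in tri.simplices:
        edges.update({tuple(sorted(e)) for e in ((a, b), (b, c), (a, c))})
    # Average length of the edges adjacent to each vertex.
    total = np.zeros(len(points))
    degree = np.zeros(len(points))
    for u, v in edges:
        d = np.linalg.norm(points[u] - points[v])
        total[[u, v]] += d
        degree[[u, v]] += 1
    avg_len = total / np.maximum(degree, 1)
    # Vertices sorted by decreasing average adjacent-edge length, as on the slide.
    order = np.argsort(-avg_len)
    return tri, edges, avg_len, order
```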
Description of Stage II • Grow-cluster phase: A queue maintains, in order, the list of vertices from which the cluster is grown. Only vertices that are not boundary points are inserted into the queue. To decide whether a point belongs to the cluster, the length of the connecting edge is compared with the average edge length of the cluster. To decide whether a point is on the boundary of the cluster, the average adjacent-edge length of the point is compared with the average edge length of the cluster.
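One possible reading of this phase is sketched below; the `neighbors` adjacency map and `assigned` flags are assumed to come from Stage I, and the threshold `factor` together with the exact membership and boundary tests are my own simplifications, not the slide's actual criteria.

```python
from collections import deque
import numpy as np

def grow_cluster(seed, points, neighbors, assigned, factor=2.0):
    """Stage II sketch: grow one cluster outward from `seed`, order by order."""
    cluster = [seed]
    assigned[seed] = True
    edge_sum, edge_count = 0.0, 0   # running average edge length of the cluster
    queue = deque([seed])           # only non-boundary vertices are enqueued
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if assigned[v]:
                continue
            d = np.linalg.norm(points[u] - points[v])
            cluster_avg = edge_sum / edge_count if edge_count else d
            # Membership test: compare the connecting edge with the cluster's
            # average edge length.
            if d > factor * cluster_avg:
                continue
            cluster.append(v)
            assigned[v] = True
            edge_sum += d
            edge_count += 1
            # Boundary test: compare the vertex's own average adjacent-edge length
            # with the cluster's average; boundary vertices are kept but not expanded.
            v_avg = np.mean([np.linalg.norm(points[v] - points[w]) for w in neighbors[v]])
            if v_avg <= factor * cluster_avg:
                queue.append(v)
    return cluster
```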
Description of Stage III • Noise-removal phase: Noise in the data may take the form of isolated data points or be scattered throughout the data. In the former case, clusters seeded at these data points will not be able to grow. If, however, the noise is scattered uniformly throughout the data, the algorithm identifies it as a single sparse cluster. This phase simply gets rid of noise by eliminating the cluster with the highest average edge length. Any trivial clusters (size less than an acceptable number) are also removed in this phase.
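A minimal sketch of this phase under the same assumptions: `clusters` is the list of clusters produced by Stage II, `avg_edge_len` returns a cluster's average edge length (e.g. the running averages kept while growing), and `min_size` is an arbitrary placeholder for the "acceptable number".

```python
def remove_noise(clusters, avg_edge_len, min_size=10):
    """Stage III sketch: drop trivial clusters and the single sparsest cluster."""
    # Trivial clusters: fewer points than the acceptable minimum size.
    kept = [c for c in clusters if len(c) >= min_size]
    # Uniformly scattered noise shows up as one sparse cluster, i.e. the one
    # with the highest average edge length; remove it if more than one remains.
    if len(kept) > 1:
        kept.remove(max(kept, key=avg_edge_len))
    return kept
```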
Complexity Analysis • The algorithm operates in O(n log n) time: the Delaunay triangulation is generated in O(n log n) time, and since a vertex once assigned to a cluster is not considered again, the clustering itself is done in O(n) time. [Chart: cluster size (in thousands) vs. time consumed (ms)]
Experimental Results • Comparison with K-Means-based approaches
Experimental Results (contd.) [Figures: 1. Clusters of different shapes; 2. Closely lying dense clusters]
Experimental Results (contd.) [Figures: 1. Clusters connected by multiple bridges; 2. Clusters of widely varying density]
Experimental Results (contd.) [Figures: Data set, K-Means, GEM, CRYSTAL]
Experimental Results (contd.) Results on t7.10k.dat (originally used in “CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling”)
Conclusion & Future Work CRYSTAL is a fast O(n log n) clustering algorithm that automatically identifies clusters of widely varying shapes, sizes and densities without requiring any user input. Future work will involve: • Application of the clustering algorithm to the identification of high-risk areas in the sea using the MARIS database. • Extension of the algorithm to 3D. • Consideration of physical constraints in clustering. In GIS, physical constraints such as rivers, highways and mountain ranges can hinder or alter the clustering result.
References • G. Papari, N. Petkov: Algorithm That Mimics Human Perceptual Grouping of Dot Patterns. Lecture Notes in Computer Science (2005) 497-506 • Vladimir Estivill-Castro, Ickjai Lee: AUTOCLUST: Automatic Clustering via Boundary Extraction for Mining Massive Point-Data Sets. Fifth International Conference on Geocomputation (2000) • Osmar R. Zaiane, Andrew Foss, Chi-Hoon Lee, Weinan Wang: On Data Clustering Analysis: Scalability, Constraints and Validation. Advances in Knowledge Discovery and Data Mining, Springer-Verlag (2002) • Z.S.H. Chan, N. Kasabov: Efficient global clustering using the Greedy Elimination Method. Electronics Letters 40(25) (2004) • Aristidis Likas, Nikos Vlassis, Jakob J. Verbeek: The global k-means clustering algorithm. Pattern Recognition 36(2) (2003) 451-461 • Ying Xu, Victor Olman, Dong Xu: Minimum Spanning Trees for Gene Expression Data Clustering. Computational Protein Structure Group, Life Sciences Division, Oak Ridge National Laboratory, USA • C. Eldershaw, M. Hegland: Cluster Analysis using Triangulation. Computational Techniques and Applications: CTAC97, World Scientific, Singapore (1997) 201-208 • Mir Abolfazl Mostafavi, Christopher Gold, Maciej Dakowicz: Delete and insert operations in Voronoi/Delaunay methods and applications. Computers & Geosciences 29(4) (2003) 523-530 • Atsuyuki Okabe, Barry Boots, Kokichi Sugihara: Spatial Tessellations: Concepts and Applications of Voronoi Diagrams.
Thank You! All 11 clusters identified by CRYSTAL! Questions?