CLOPE: a Fast and Effective Clustering Algorithm for Transactional Data

CLOPE: a Fast and Effective Clustering Algorithm for Transactional Data Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Yiling Yang Xudong Guan Jinyuan You

Outline • Motivation • Objective • Introduction • Clustering With sLOPE • Implementation • Experiments • Conclusions • Personal opinion

Motivation • This paper studies the problem of categorical data clustering, especially for transactional data characterized by high dimensionality and large volume.

Objective • To present a fast and efficient clustering algorithm CLOPE for transactional data.

Introduction • Clustering is an important data mining technique. • Objective : to group data into sets • Intra-cluster similarity is maximized • Inter-cluster similarity is minimized • The basic idea behind CLOPE • Uses global criterion function that tries to increase the intra-cluster overlapping of transaction items by increasing the height-to-width ratio of the cluster histogram. • A parameter to control the tightness of the cluster.

Introduction

Clustering With SLOPE • The Criterion function can be defined locally or globally. • Locally • The criterion function is built on the pair-wise similarity between transactions. • Globally • Clustering quality is measured in the cluster level,utilizing information like the sets of large and small items in the clustering.

Clustering With SLOPE • We define the size S(C) and width W(C) of a cluster C below: • The height of a cluster is defined as • We use gradient instead of H(C) as the quality measure for cluster C.

Clustering With SLOPE

Clustering With SLOPE Here, r is a positive real number called repulsion, used to control the level of intra-cluster similarity.

Clustering With SLOPE • Problem definition • Given D and r, find a clustering C that maximize . • The sketch of the CLOPE algorithm.

Implementation • RAM data structure • We keep only the current transaction and • A small amount of information for each cluster.The information, called cluster features. • Remark • The total memory required for item occurrences is approximately M*K*4 bytes using array of 4-byte integers. • The computation of profit • Computing the delta value of adding t to C.

Implementation

Implementation • The time and space complexity • Suppose the average length of a transaction is A. • The total number of transactions is N. • The maximum number of clusters is K. • The time complexity for one iteration is O( ). • The space requirement for CLOPE is approximately the memory size of the cluster features.

Experiments • We analyze the effectiveness and execution speed of CLOPE with two real-life datasets. • For effectiveness, we compare the clustering quality of CLOPE on a labeled dataset with those of LargeItem and ROCK. • For execution speed, we compare CLOPE with LargeItem on a large web log dataset.

Mushroom • Mushroom dataset (real-life) • The mushroom dataset from the UCI machine learning repository has been used by both ROCK and LargeItem for effectiveness tests. • It contains 8124 records with two classes,4208 edible mushrooms and 3916 poisonous mushrooms.

Mushroom

Mushroom • The result of CLOPE on mushroom is better than that of LargeItem and close to that of ROCK. • Sensitivity to data order • The results are different but very close to the original ones. • It shows that CLOPE is not very sensitive to the order of input data.

Berkeley web logs • Web log data is another typical category of transactional databases. • The web log files from http://www.cs.berkeley.edu/log/ as the dataset for our second experiment and test the scalability as well as performance of CLOPE. • There are about 7 million entries in the raw log file and 2 million of them are kept after non-html entries removed.

Conclusions • The CLOPE algorithm is proposed based on the intuitive idea of increasing the height-to-width ratio of the cluster histogram. • The idea is generalized with a repulsion parameter that controls tightness of transactions in a cluster. • Experiments show that CLOPE is quite effective in finding interesting clusterings. • Moreover,CLOPE is not very sensitive to data order and requires little domain knowledge in controlling the number of clusters.

Personal Opinion • The idea behind CLOPE is very simple, can help beginner easy to learning.

CLOPE: a Fast and Effective Clustering Algorithm for Transactional Data

CLOPE: a Fast and Effective Clustering Algorithm for Transactional Data

Presentation Transcript

Clustering Algorithms

FAST Exam

Fast Trie Data Structures

Fuzzy C-Means Clustering

Clustering Techniques

Best practices to ensure efficient data models, fast data activation, and performance of your SAP NetWeaver BW 7.3 data

Graph P artitioning a nd Clustering for Community Detection

DATA MINING LECTURE 8

Take your midterm today!

Clustering Documents

CPSC 3200 Algorithm Analysis and Advanced Data Structure

Data Mining Tools

Data Mining-Knowledge Presentation—ID3 algorithm

BIOINFORMATICS Datamining #1

Clustering and NLP

A Brief Overview of Trajectory Data Mining

Software Clustering Using Bunch

Fast Algorithm for String Matching with k Mismatches

Fast Propositional Algorithms for Planning

Hardware Transactional Memory

Dr. Marina Gavrilova