SCALE: a scalable framework for efficiently clustering transactional data

Hua Yan · Keke Chen · Ling Liu · Zhang Yi, DMKD 2010. Reported by Wen-Chung Liao, 2010/03/02

Outline • Motivation • Objectives • WCD clustering • Evaluating clustering results • Experiments • Conclusions • Comments

Motivation • Existing transactional data clustering algorithms require users to manually tune at least one or two parameters. • Cluster validation methods for evaluating the quality of transactional clustering results are lacking.

Objectives • Present a fast, memory-saving, and scalable clustering algorithm that can efficiently handle large transactional datasets without resorting to manual parameter settings. • SCALE framework

WCD clustering • transactional dataset • {abcd, bcd, ac, de, def}
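As a rough illustration of how a WCD-style score can compare two candidate clusterings of the five transactions above, here is a minimal Python sketch. It assumes the weighted coverage density of a cluster is the sum of squared item occurrences divided by (total item occurrences × number of transactions), and that a clustering is scored by the size-weighted average of its clusters' WCD. Those formulas, the helper names `wcd` and `ewcd`, and the two candidate splits are assumptions made for illustration and are not taken from these slides.

```python
from collections import Counter

def wcd(cluster):
    """Weighted coverage density of one cluster (a list of item sets).

    Assumed definition: sum of squared item occurrences
    divided by (total item occurrences * number of transactions).
    """
    occur = Counter(item for t in cluster for item in t)
    s = sum(occur.values())   # total item occurrences in the cluster
    n = len(cluster)          # number of transactions in the cluster
    return sum(c * c for c in occur.values()) / (s * n)

def ewcd(clusters):
    """Size-weighted average WCD over all clusters of a clustering."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * wcd(c) for c in clusters)

# The toy dataset from the slide: {abcd, bcd, ac, de, def}
data = [set("abcd"), set("bcd"), set("ac"), set("de"), set("def")]

# Two hypothetical ways to split it into two clusters
split_a = [data[:3], data[3:]]   # {abcd, bcd, ac} | {de, def}
split_b = [data[:2], data[2:]]   # {abcd, bcd} | {ac, de, def}

print(round(ewcd(split_a), 4), round(ewcd(split_b), 4))
```

Under these assumptions the first split scores higher, because it groups transactions that share more items; a WCD-based algorithm would prefer it.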

Evaluating clustering results

Experiments • Two synthetic datasets: • Tc30a6r1000_2L • TxI4Dx series (T10I4Dx, TxI4D100k) • Three real datasets: • Zoo • Mushroom • Retail

[Figure panels: Tc30a6r1000_2L, Zoo]

Conclusions • Two unique features of SCALE • The WCD clustering algorithm: a fast, memory-saving, and scalable method for clustering transactional data • Two transactional-data-specific cluster evaluation measures: LISR and AMI • Some promising directions • Perform experimental comparisons between the WCD measure and the entropy measure • Design a better algorithm for determining the best K for transactional data clustering • Extend our work to handle transactional data streams

Comments • Advantages • No parameter setting required • Shortcomings • Without BKPlot, K must be determined manually for WCD. • No description of how BKPlot generates K in the categorical case. • Applications • Transaction clustering • Web log clustering • …
