Scalable Automatic k-Determination for Large-Scale Data Clustering

Multiobjective Clustering with Automatic k-determination for Large-scale Data
Presenter: Shao-Wei Cheng
Authors: Nobukazu Matake, Tomoyuki Hiroyasu, Mitsunori Miki, Tomoharu Senda
GECCO 2007

Outline
• Motivation
• Objective
• Methodology
  • Original MOCK
  • New scalable k-determination scheme
• Experiments and Results
• Conclusion
• Personal Comments

Motivation
• Web behavior mining has attracted a great deal of attention.
• MOCK is powerful and rigorous, but its computational cost is too high for clustering very large data sets.
Too much data!

Objectives
• Apply MOCK to web data clustering with a scalable automatic k-determination scheme.
• Determine the appropriate k at low cost.
• The task has two complementary objectives:
  • Determine the appropriate k.
  • Find the partition of the data into k clusters.

Methodology
• Original MOCK
[Figure: four-step flow of the original MOCK (First Step, Second Step, Third Step, Fourth Step), with the Gap statistic labeled.]

Methodology
• New scalable k-determination scheme
[Figure: two-step flow (First Step, Second Step). First scheme: calculate adjacent angles (plot labels x, y). Second scheme (plot labels x, x).]
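The slide names the first scheme only as "calculate adjacent angles" and does not give the exact rule. The sketch below is one plausible reading, not the authors' procedure: it assumes each solution is a point (x, y) in a two-objective space, sorts the points by x, and measures the angle formed at each interior point by the segments to its two neighbours. The function names `adjacent_angles` and `sharpest_bend`, and the choice of the smallest angle as the candidate, are assumptions made for illustration.

```python
import math


def adjacent_angles(front):
    """Return [(point, angle_in_degrees)] for each interior point of a 2-D front.

    Assumption: `front` is a list of (x, y) objective pairs. The points are
    sorted by x, and the angle at a point is the one between the segments
    joining it to its left and right neighbours.
    """
    pts = sorted(front)
    result = []
    for prev, cur, nxt in zip(pts, pts[1:], pts[2:]):
        a = (prev[0] - cur[0], prev[1] - cur[1])
        b = (nxt[0] - cur[0], nxt[1] - cur[1])
        norm = math.hypot(*a) * math.hypot(*b)
        if norm == 0:
            continue  # duplicate points define no angle
        cos_theta = (a[0] * b[0] + a[1] * b[1]) / norm
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding
        result.append((cur, math.degrees(math.acos(cos_theta))))
    return result


def sharpest_bend(front):
    """Return the interior point with the smallest adjacent angle.

    Assumption: the sharpest bend marks the candidate solution. The slide
    does not state which angle the scheme actually selects.
    """
    angles = adjacent_angles(front)
    if not angles:
        raise ValueError("need at least three distinct points")
    return min(angles, key=lambda item: item[1])[0]


if __name__ == "__main__":
    # Hypothetical front, made up for illustration only.
    demo = [(0, 10), (1, 9), (2, 1), (10, 0)]
    for point, angle in adjacent_angles(demo):
        print(point, round(angle, 1))
    print("candidate:", sharpest_bend(demo))
```

In practice the two objectives would probably need to be normalized to a common range first, since angles computed on raw values depend on the scale of each axis.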

Experiments

Conclusion
• The new scheme determines an appropriate k at low cost, although its performance is poorer than that of the original algorithm.
• It reduces the Pareto size by about 50-70%.
• It does not need random data clustering.

Personal Comments
• Advantage
  • MOCK can be applied to large-scale data.
• Drawback
• Application
  • Web data.
