Very Large-Scale Incremental Clustering

Very Large-Scale Incremental Clustering. Berk Berker, Mumin Cebe, Ismet Zeki Yalniz. 27 March 2007

Table of Contents • Why Clustering? • Why Incremental Clustering? • Related Work • Incremental C3M (C2ICM) • A Former Implementation of C2ICM for very large datasets • Conclusion

Why Clustering? • It is an effective tool for managing information overload • It lets users browse large document collections quickly • It makes the distinct topics and subtopics (concept hierarchies) easy to grasp • It allows search engines to query large document collections efficiently

Types of Clustering • Hierarchical vs. Non-hierarchical • Partitional vs. Agglomerative • Deterministic vs. Probabilistic algorithms • Incremental vs. Batch algorithms

Why Incremental Clustering? • The current information explosion • Popular sources of informational text documents, such as newswire and blogs • Delay would be unacceptable in several important areas

Related Work • The cluster-splitting approach • Adaptive clustering based on user queries • The Cobweb algorithm • Hierarchical clustering in an incremental manner

C2ICM Algorithm • C3M (the cover coefficient-based clustering methodology) is known as an efficient, effective and robust algorithm for clustering documents • C3M is well developed for initial clustering, but the clusters must also be maintained as the collection changes

C2ICM Algorithm • Like C3M, the C2ICM algorithm is based on the cover coefficient concept • C2ICM is suitable for dynamic environments where documents are added and deleted • With C2ICM, reclustering the whole collection after each update is avoided

C2ICM Algorithm Details • First we compute the number of clusters and cluster seed powers in the updated database • Then we determine the newly added documents and falsified documents
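The slide does not spell out how the number of clusters and the seed powers are obtained. The following is a minimal sketch based on the cover coefficient definitions in the Can and Ozkarahan (1990) paper listed in the references, assuming a binary document-term matrix in which every document contains at least one term. The function names and the example matrix `D` are hypothetical, and tie handling between documents with equal seed power is left out.

```python
from typing import List


def decoupling_coefficients(D: List[List[int]]) -> List[float]:
    """delta_i = c_ii, the extent to which document i is covered by itself."""
    n_terms = len(D[0])
    col_sums = [sum(row[k] for row in D) for k in range(n_terms)]
    deltas = []
    for row in D:
        alpha = 1.0 / sum(row)  # reciprocal of the row sum
        # beta_k is 1 / col_sums[k]; terms absent from the document add nothing
        deltas.append(alpha * sum(d * d / col_sums[k]
                                  for k, d in enumerate(row) if d))
    return deltas


def number_of_clusters(deltas: List[float]) -> float:
    """The estimated number of clusters is the sum of the decoupling coefficients."""
    return sum(deltas)


def seed_powers(D: List[List[int]], deltas: List[float]) -> List[float]:
    """Binary case: p_i = delta_i * (1 - delta_i) * (number of terms in document i)."""
    return [dl * (1.0 - dl) * sum(row) for dl, row in zip(deltas, D)]


def select_seeds(D: List[List[int]]) -> List[int]:
    """Pick the n_c documents with the highest seed power (ties not treated specially)."""
    deltas = decoupling_coefficients(D)
    n_c = max(1, round(number_of_clusters(deltas)))
    powers = seed_powers(D, deltas)
    ranked = sorted(range(len(D)), key=lambda i: powers[i], reverse=True)
    return sorted(ranked[:n_c])


# Hypothetical collection: 4 documents, 6 terms, two obvious topics.
D = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
]
seeds = select_seeds(D)  # indices of the selected seed documents
```

In an incremental setting, the same computation would be rerun on the updated matrix, and the resulting seed list compared with the old one.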

C2ICM Algorithm Details • How do the clusters become false? • When the seed document of a cluster becomes a non-seed or is deleted • When one or more non-seed documents of a cluster become seeds
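The two falsification conditions can be sketched as a simple check over the old clusters and the new seed list. This is an illustrative sketch, not the original implementation: the function names and the `deleted` parameter are hypothetical, and clusters are assumed to be stored as a mapping from seed to member list. The sample data is taken from the Case 1 example in this presentation.

```python
from typing import Dict, FrozenSet, Iterable, List, Set


def falsified_clusters(clusters: Dict[str, List[str]],
                       new_seeds: Set[str]) -> List[str]:
    """Return the old seeds whose clusters are no longer valid."""
    false_seeds = []
    for seed, members in clusters.items():
        # Condition 1: the seed became a non-seed or was deleted.
        seed_lost = seed not in new_seeds
        # Condition 2: a non-seed member of the cluster became a seed.
        member_promoted = any(m in new_seeds for m in members if m != seed)
        if seed_lost or member_promoted:
            false_seeds.append(seed)
    return false_seeds


def documents_to_cluster(clusters: Dict[str, List[str]],
                         ragbag: Iterable[str],
                         new_docs: Iterable[str],
                         new_seeds: Set[str],
                         deleted: FrozenSet[str] = frozenset()) -> Set[str]:
    """Members of falsified clusters, ragbag documents and new documents."""
    docs = set(ragbag) | set(new_docs)
    for seed in falsified_clusters(clusters, new_seeds):
        docs.update(clusters[seed])
    return docs - set(deleted)


# State of the clusters before the update (from the example slides).
old_clusters = {
    "d1": ["d5", "d4", "d3", "d1", "d7", "d2"],
    "d6": ["d8", "d9", "d15", "d6", "d10", "d11"],
    "d12": ["d18", "d16", "d17", "d12", "d13", "d14"],
}
case1_seeds = {"d1", "d6", "d13", "d19"}
case1_docs = documents_to_cluster(
    old_clusters, ragbag=["d19"], new_docs=["d20", "d21", "d22"],
    new_seeds=case1_seeds)
```

Here only the cluster of d12 is falsified, so the clusters of d1 and d6 are left untouched and only ten documents need to be reassigned.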

C2ICM Algorithm Details • We cluster these documents by assigning each one to the cluster of the seed that covers it most • Documents that do not belong to any cluster are grouped into the ragbag cluster
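The assignment step can be sketched as follows, again assuming binary term vectors with at least one term per document and using the cover coefficient c_ij from the Can and Ozkarahan (1990) paper as the measure of how much a seed covers a document. The names `cover_coefficient`, `assign_documents` and the sample vectors are hypothetical. A document that shares no term with any seed goes to the ragbag cluster, and ties are resolved in favor of the seed listed first, which is a simplification.

```python
from typing import Dict, List, Optional, Tuple


def cover_coefficient(d_i: List[int], d_j: List[int],
                      col_sums: List[int]) -> float:
    """c_ij = alpha_i * sum_k(d_ik * beta_k * d_jk), with beta_k = 1 / col_sums[k]."""
    alpha = 1.0 / sum(d_i)
    return alpha * sum(a * b / col_sums[k]
                       for k, (a, b) in enumerate(zip(d_i, d_j)) if a and b)


def assign_documents(vectors: Dict[str, List[int]], seeds: List[str],
                     docs: List[str]) -> Tuple[Dict[str, str], List[str]]:
    """Map each document to the seed that covers it most, else put it in the ragbag."""
    n_terms = len(next(iter(vectors.values())))
    # Column sums are taken over the whole collection, not only over `docs`.
    col_sums = [sum(v[k] for v in vectors.values()) for k in range(n_terms)]
    assignment: Dict[str, str] = {}
    ragbag: List[str] = []
    for doc in docs:
        if doc in seeds:
            assignment[doc] = doc  # a seed heads its own cluster
            continue
        best_seed: Optional[str] = None
        best_cover = 0.0
        for seed in seeds:
            cover = cover_coefficient(vectors[doc], vectors[seed], col_sums)
            if cover > best_cover:
                best_seed, best_cover = seed, cover
        if best_seed is None:
            ragbag.append(doc)  # no seed shares a term with this document
        else:
            assignment[doc] = best_seed
    return assignment, ragbag


# Hypothetical vectors over 7 terms; s1 and s2 are the seeds.
vectors = {
    "s1": [1, 1, 1, 0, 0, 0, 0],
    "a":  [1, 1, 0, 0, 0, 0, 0],
    "s2": [0, 0, 0, 1, 1, 1, 0],
    "b":  [0, 0, 0, 1, 1, 0, 0],
    "c":  [0, 0, 0, 0, 0, 0, 1],
}
assignment, ragbag = assign_documents(vectors, ["s1", "s2"], ["a", "b", "c"])
```

In this sample, document c shares no term with either seed, so it is the only one left for the ragbag cluster.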

C2ICM: An example • Current state of the clusters • Seed list: d1, d6, d12 • Cluster of seed d1: d5, d4, d3, d1, d7, d2 • Cluster of seed d6: d8, d9, d15, d6, d10, d11 • Cluster of seed d12: d18, d16, d17, d12, d13, d14 • Ragbag cluster: d19

C2ICM: CASE 1 • When a seed document becomes nonseed • Old seed list: d1, d6, d12 • New seed list: d1, d6, d13, d19 • Cluster of seed d1: d5, d4, d3, d1, d7, d2 • Cluster of seed d6: d8, d9, d15, d6, d10, d11 • New documents arrived: d20, d21, d22 • The set of documents to be clustered: d18, d16, d17, d12, d13, d14, d19, d20, d21, d22

C2ICM: CASE 1 • Seed document d12 becomes nonseed • New seed list: d1, d6, d13, d19 • Cluster of seed d1: d5, d4, d3, d1, d7, d2 • Cluster of seed d6: d8, d9, d15, d6, d10, d11 • The set of documents to be clustered: d12, d13, d14, d16, d17, d18, d19, d20, d21, d22

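C2ICM: CASE 1 • Final clusters • New seed list: d1, d6, d13, d19 • Cluster of seed d1: d5, d4, d3, d1, d7, d2 • Cluster of seed d6: d8, d9, d15, d6, d10, d11 • Documents placed in the clusters of the new seeds d13 and d19: d20, d16, d12, d13, d18, d21, d14, d17, d19, d22 • No elements remaining in the ragbag cluster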

C2ICM: CASE 2 • When a nonseed document in a cluster becomes seed • Old seed list: d1, d6, d12 • New seed list: d1, d6, d12, d14 • Cluster of seed d1: d5, d4, d3, d1, d7, d2 • Cluster of seed d6: d8, d9, d15, d6, d10, d11 • New documents arrived: d20, d21, d22 • The set of documents to be clustered: d18, d16, d17, d12, d13, d14, d19, d20, d21, d22

C2ICM: CASE 2 • Nonseed document d14 becomes the new seed • New seed list: d1, d6, d12, d14 • Cluster of seed d1: d5, d4, d3, d1, d7, d2 • Cluster of seed d6: d8, d9, d15, d6, d10, d11 • The set of documents to be clustered: d12, d13, d14, d16, d17, d18, d19, d20, d21, d22

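C2ICM: CASE 2 • Final clusters • New seed list: d1, d6, d12, d14 (d14 is the new seed) • Cluster of seed d1: d5, d4, d3, d1, d7, d2 • Cluster of seed d6: d8, d9, d15, d6, d10, d11 • Documents placed in the clusters of seeds d12 and d14: d20, d16, d13, d12, d22, d18, d21, d19, d17, d14 • No elements remaining in the ragbag cluster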

A Former Implementation of C2ICM for Very Large Datasets • C2ICM was implemented as two programs (in VS Pascal) • Program I selects the seeds • Program II clusters the documents using the C2ICM algorithm • The two programs communicate by exchanging files • Diagram labels: documents, text files, Program I (Seed Selector), Program II (C2ICM), clusters

Former Experiments • C2ICM was tested with a subset of the MARIAN database (~43K documents) in 1995 • Six experiments were done. Each incremental update added ~6K documents to an initially clustered collection, and the size of the initial collection differed between experiments

Results for the Former Experiments • C2ICM provides time savings • The clusters formed with C2ICM were very similar to the clusters formed with C3M

Conclusion • The cluster maintenance problem is challenging • Our aim is to conduct experiments for C2ICM with a very large number of documents (i.e., millions of documents) • The HARD dataset will be used for evaluation. Information retrieval performance will be measured. • The implementation of C2ICM must be time and memory efficient.

References • Can, F., Ozkarahan, E. A. "Concepts and effectiveness of the cover coefficient-based clustering methodology for text databases." ACM Transactions on Database Systems. Vol. 15, No. 4 (December, 1990), pp. 483-517. • Can, F. "Incremental clustering for dynamic information processing." ACM Transactions on Information Systems. Vol. 11, No. 2 (April, 1993), pp. 143-164. • Can, F., Fox, E. A., Snavely, C. D., France, R. K. "Incremental clustering for very large document databases: initial MARIAN experience." Information Sciences. Vol. 84 (1995), pp. 101-114. • Jain, A. K., Murty, M. N., Flynn, P. J. "Data clustering: a review." ACM Computing Surveys. Vol. 31, No. 3 (September, 1999), pp. 264-323.

Questions?
