Very Large-Scale Incremental Clustering Berk Berker Mumin Cebe Ismet Zeki Yalniz 27 March 2007
Table of Contents • Why Clustering? • Why Incremental Clustering? • Related Work • Incremental C3M (C2ICM) • A Former Implementation of C2ICM for very large datasets • Conclusion
Why Clustering? • It is an effective tool for managing information overload • To browse large document collections quickly • To easily grasp the distinct topics and subtopics (concept hierarchies) • To allow search engines to query large document collections efficiently
Types of Clustering • Hierarchical vs. Non-hierarchical • Partitional vs. Agglomerative • Deterministic vs. Probabilistic algorithms • Incremental vs. Batch algorithms
Why Incremental Clustering? • The current information explosion • Popular sources of text documents, such as newswire and blogs, grow continuously • The delay of reclustering from scratch would be unacceptable in several important areas
Related Work • The cluster-splitting approach • Adaptive clustering based on user queries • The Cobweb algorithm • Incremental hierarchical clustering
C2ICM Algorithm • C3M is known as an efficient, effective and robust algorithm for clustering documents • C3M is well suited to initial clustering, but clusters must also be maintained as the collection changes
C2ICM Algorithm • Like C3M, the C2ICM algorithm is based on the cover coefficient concept • C2ICM is suitable for dynamic environments where documents are added and deleted • With C2ICM, reclustering the whole collection on each update is avoided
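The cover coefficient concept mentioned above can be illustrated with a small sketch. It follows the definition in the cited C3M paper (Can and Ozkarahan, 1990) as understood here: c_ij = alpha_i * sum_k(d_ik * beta_k * d_jk), where alpha_i and beta_k are the reciprocals of the row and column sums of the document-term matrix D. The function name and the toy matrix are assumptions made for illustration and do not come from the slides.

```python
def cover_coefficients(D):
    """Return the m x m cover coefficient matrix for document-term matrix D.

    c[i][j] = alpha_i * sum_k(D[i][k] * beta_k * D[j][k]), with alpha_i the
    reciprocal of row i's sum and beta_k the reciprocal of column k's sum.
    Assumes every document has at least one term and every term occurs in
    at least one document (otherwise a division by zero occurs).
    """
    m, n = len(D), len(D[0])
    alpha = [1.0 / sum(row) for row in D]
    beta = [1.0 / sum(D[i][k] for i in range(m)) for k in range(n)]
    return [
        [alpha[i] * sum(D[i][k] * beta[k] * D[j][k] for k in range(n))
         for j in range(m)]
        for i in range(m)
    ]


# Hypothetical toy collection: 4 documents (rows) x 4 terms (columns).
D = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
C = cover_coefficients(D)

# The diagonal entry c[i][i] is the decoupling coefficient of document i:
# how much the document covers itself rather than other documents.
decoupling = [C[i][i] for i in range(len(D))]
```

Each row of C sums to 1, so row i can be read as how document i's content is distributed over the documents of the collection.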
C2ICM Algorithm Details • First, compute the number of clusters and the seed power of each document in the updated database, and select the seeds • Then determine the documents to be clustered: the newly added documents and the documents of falsified clusters
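A minimal sketch of the first step, assuming the binary-matrix definitions from the cited C3M paper as understood here: the number of clusters is the sum of the decoupling coefficients delta_i, and the seed power of document i is delta_i * (1 - delta_i) * (number of terms in document i). The function name and toy data are hypothetical, and the original method's handling of ties and near-identical seed candidates is left out.

```python
def select_seeds(D):
    """Return (n_c, seed_powers, seeds) for a binary document-term matrix D.

    Simplified sketch: rounds the sum of decoupling coefficients to get n_c
    and takes the n_c documents with the highest seed power as seeds.
    Assumes no empty rows or columns in D.
    """
    m, n = len(D), len(D[0])
    col_sums = [sum(D[i][k] for i in range(m)) for k in range(n)]
    # delta_i is the diagonal cover coefficient c_ii of document i.
    delta = [
        sum(D[i][k] ** 2 / col_sums[k] for k in range(n)) / sum(D[i])
        for i in range(m)
    ]
    n_c = max(1, round(sum(delta)))
    powers = [delta[i] * (1.0 - delta[i]) * sum(D[i]) for i in range(m)]
    ranked = sorted(range(m), key=lambda i: powers[i], reverse=True)
    return n_c, powers, sorted(ranked[:n_c])


# Hypothetical toy collection: 5 documents x 5 terms.
D = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 1, 1, 1],
]
n_c, powers, seeds = select_seeds(D)
print(n_c, seeds)
```

In an incremental setting this step would be rerun on the updated matrix after each batch of additions and deletions, and the resulting seed list compared with the previous one.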
C2ICM Algorithm Details • How does a cluster become false (falsified)? • Its seed document becomes a non-seed or is deleted • One or more non-seed documents of the cluster become seeds
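The two falsification conditions can be sketched as a comparison of the old clusters with the new seed list. The function name, the data layout, and the cluster memberships below are assumptions for illustration (the document IDs are modeled on the deck's example, but the slides do not state which document belongs to which cluster). Treating ragbag documents as candidates for reclustering is also an assumption.

```python
def documents_to_cluster(clusters, new_seeds, new_docs, ragbag=(), deleted=()):
    """Return (falsified_seeds, pending_documents).

    clusters maps each old seed to the members of its cluster (seed included).
    A cluster is falsified when its seed was deleted or is no longer a seed,
    or when one of its non-seed members appears in the new seed list.
    """
    new_seeds, deleted = set(new_seeds), set(deleted)
    pending = set(new_docs)
    falsified = []
    for seed, members in clusters.items():
        seed_lost = seed in deleted or seed not in new_seeds
        member_promoted = any(d != seed and d in new_seeds for d in members)
        if seed_lost or member_promoted:
            falsified.append(seed)
            pending.update(d for d in members if d not in deleted)
    # Assumption: ragbag documents get another chance to join a cluster.
    pending.update(d for d in ragbag if d not in deleted)
    return falsified, pending


# Hypothetical memberships; only the document IDs echo the deck's example.
clusters = {
    "d1": ["d1", "d2", "d3", "d4", "d5", "d7"],
    "d6": ["d6", "d8", "d9", "d10", "d11", "d15"],
    "d12": ["d12", "d13", "d14"],
}
ragbag = ["d16", "d17", "d18", "d19"]
new_docs = ["d20", "d21", "d22"]

# Case 1: seed d12 is no longer a seed.
case1 = documents_to_cluster(clusters, ["d1", "d6", "d13", "d19"], new_docs, ragbag)
# Case 2: non-seed d14 becomes a seed while d12 stays a seed.
case2 = documents_to_cluster(clusters, ["d1", "d6", "d12", "d14"], new_docs, ragbag)
```

In both cases only the cluster of d12 is falsified, so the clusters of d1 and d6 are left untouched.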
C2ICM Algorithm Details • These documents are clustered by assigning each one to the cluster of the seed that covers it most • Documents that do not belong to any cluster are grouped into a ragbag cluster
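The assignment step can be sketched as follows, assuming a precomputed cover coefficient matrix C in which C[i][s] measures how much seed s covers document i. The function name and the matrix values are made up for illustration, and ties are broken by seed order, which is a simplification.

```python
def assign_to_seeds(C, seeds, pending):
    """Assign each pending document to the seed that covers it most.

    Returns (clusters, ragbag). Documents covered by no seed (all relevant
    coefficients are zero) go to the ragbag cluster. Clusters start with
    their seed only; in an incremental run, unaffected clusters would keep
    their existing members.
    """
    clusters = {s: [s] for s in seeds}
    ragbag = []
    for doc in pending:
        if doc in clusters:
            continue  # a seed already heads its own cluster
        best = max(seeds, key=lambda s: C[doc][s])
        if C[doc][best] > 0:
            clusters[best].append(doc)
        else:
            ragbag.append(doc)
    return clusters, ragbag


# Hypothetical cover coefficients for 5 documents (each row sums to 1).
C = [
    [0.5, 0.3, 0.2, 0.0, 0.0],
    [0.4, 0.5, 0.1, 0.0, 0.0],
    [0.1, 0.1, 0.5, 0.3, 0.0],
    [0.0, 0.0, 0.4, 0.6, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
clusters, ragbag = assign_to_seeds(C, seeds=[0, 2], pending=[1, 3, 4])
print(clusters, ragbag)
```

Here document 1 joins the cluster of seed 0, document 3 joins the cluster of seed 2, and document 4, which shares nothing with either seed, ends up in the ragbag.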
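C2ICM: An example • Current state of the clusters • Seed list: d1, d6, d12 • [Figure: cluster diagram of documents d1 to d19, including a ragbag cluster. Document labels in the order shown: d5 d4 d3 d1 d7 d2 d8 d9 d15 d6 d10 d11 d18 d16 d17 d12 d13 d14 d19]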
C2ICM: CASE 1 • When a seed document becomes nonseed • Old seed list: d1, d6, d12 • New seed list: d1, d6, d13, d19 • [Figure: cluster diagram marking the newly arrived documents and the set of documents to be clustered. Document labels in the order shown: d5 d4 d3 d1 d7 d2 d8 d9 d15 d6 d10 d11 d18 d16 d17 d12 d13 d14 d19 d20 d21 d22]
C2ICM: CASE 1 • Seed document d12 becomes nonseed • New seed list: d1, d6, d13, d19 • [Figure: cluster diagram marking the set of documents to be clustered. Document labels in the order shown: d5 d4 d3 d1 d7 d2 d8 d9 d15 d6 d10 d11 d22 d13 d14 d12 d16 d17 d18 d19 d20 d21]
C2ICM: CASE 1 • Final clusters • New seed list: d1, d6, d13, d19 • No elements remain in the ragbag cluster • [Figure: diagram of the final clusters. Document labels in the order shown: d5 d4 d3 d1 d7 d2 d8 d9 d15 d6 d10 d11 d20 d16 d12 d13 d18 d21 d14 d17 d19 d22]
C2ICM: CASE 2 • When a nonseed document in a cluster becomes seed • Old seed list: d1, d6, d12 • New seed list: d1, d6, d12, d14 • [Figure: cluster diagram marking the newly arrived documents and the set of documents to be clustered. Document labels in the order shown: d5 d4 d3 d1 d7 d2 d8 d9 d15 d6 d10 d11 d18 d16 d17 d12 d13 d14 d19 d20 d21 d22]
C2ICM: CASE 2 • Nonseed document d14 becomes seed • New seed list: d1, d6, d12, d14 • [Figure: cluster diagram marking the new seed and the set of documents to be clustered. Document labels in the order shown: d5 d4 d3 d1 d7 d2 d8 d9 d15 d6 d10 d11 d12 d13 d14 d16 d17 d18 d19 d20 d21 d22]
C2ICM: CASE 2 • Final clusters • New seed list: d1, d6, d12, d14 • No elements remain in the ragbag cluster • [Figure: diagram of the final clusters, with the new seed marked. Document labels in the order shown: d5 d4 d3 d1 d7 d2 d8 d9 d15 d6 d10 d11 d20 d16 d13 d12 d22 d18 d21 d19 d17 d14]
A Former Implementation of C2ICM for Very Large Datasets • C2ICM was implemented as two programs (VS Pascal) • Program I selects the seeds • Program II clusters documents using the C2ICM algorithm • The programs communicate by exchanging files • [Figure: data-flow diagram linking documents, text files, Program I (Seed Selector), Program II (C2ICM), and clusters]
Former Experiments • C2ICM was tested with a subset of the MARIAN database (~43K documents) in 1995 • Six experiments were run. Each incremental update added ~6K documents to an initially clustered collection of a different size
Results for the Former Experiments • C2ICM provided time savings compared with reclustering • The clusters formed with C2ICM were very similar to the clusters formed with C3M
Conclusion • The cluster maintenance problem is challenging • Our aim is to conduct experiments for C2ICM with a very large number of documents (i.e. millions of documents) • The HARD dataset will be used for evaluation, and information retrieval performance will be measured • The implementation of C2ICM must be time and memory efficient
References • Can, F., Ozkarahan, E. A. "Concepts and effectiveness of the cover coefficient-based clustering methodology for text databases." ACM Transactions on Database Systems. Vol. 15, No. 4 (December, 1990), pp. 483-517. • Can, F. "Incremental clustering for dynamic information processing." ACM Transactions on Information Systems. Vol. 11, No. 2 (April, 1993), pp. 143-164. • Can, F., Fox, E. A., Snavely, C. D., France, R. K. "Incremental clustering for very large document databases: initial MARIAN experience." Information Sciences. Vol. 84 (1995), pp. 101-114. • Jain, A. K., Murty, M. N., Flynn, P. J. "Data clustering: a review." ACM Computing Surveys. Vol. 31, No. 3 (September, 1999), pp. 264-323.