Parallel C3M
Aylin Tokuç, Erkan Okuyan, Özlem Gür

Outline
• Basics of Parallel Computing
• Sequential C3M
• Parallel C3M

Parallel Computation
Decomposition: The process of dividing a computation into smaller parts.
Task: A programmer-defined unit of computation into which the main computation is subdivided by means of decomposition.

Parallel Computation: Primary Considerations
• Load Balancing
• Minimizing Communication
• Task Dependency Optimization

Parallel Computation: Load Balancing
[figure not recovered]

Parallel Computation: Minimizing Communication
[figure not recovered]

Parallel Computation: Task Dependency Optimization
[figure not recovered]

C3M Algorithm
1- Determine the cluster seeds of the database.
2- If d_i is not a cluster seed, find the cluster seed (if any) that maximally covers d_i.
3- If there remain unclustered documents, group them into a ragbag cluster.

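The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: it assumes the cover-coefficient definitions from Can and Ozkarahan's paper (inverse row sums α, inverse column sums β, seed power δ(1−δ)·row sum for a binary matrix), rounds the number of clusters to the nearest integer, and ignores tie-breaking between equally powerful seeds. The function name `c3m` and the toy matrix are hypothetical.

```python
def c3m(D):
    """Cluster the rows (documents) of a document-by-term matrix D.

    Assumes every row and every column of D has at least one non-zero entry.
    Returns (clusters, ragbag): clusters maps seed index -> member indices.
    """
    m, n = len(D), len(D[0])
    alpha = [1.0 / sum(row) for row in D]                             # inverse row sums
    beta = [1.0 / sum(D[i][k] for i in range(m)) for k in range(n)]   # inverse column sums

    def cover(i, j):
        # c_ij: extent to which document i is covered by document j
        return alpha[i] * sum(D[i][k] * beta[k] * D[j][k] for k in range(n))

    delta = [cover(i, i) for i in range(m)]        # decoupling coefficients c_ii
    nc = max(1, round(sum(delta)))                 # number of clusters (rounding is an assumption)
    power = [delta[i] * (1.0 - delta[i]) * sum(D[i]) for i in range(m)]

    # Step 1: the nc documents with the largest seed power become seeds.
    seeds = sorted(range(m), key=lambda i: power[i], reverse=True)[:nc]
    clusters = {s: [s] for s in seeds}
    ragbag = []

    for i in range(m):
        if i in clusters:
            continue
        # Step 2: attach document i to the seed that covers it most.
        best = max(seeds, key=lambda s: cover(i, s))
        if cover(i, best) > 0:
            clusters[best].append(i)
        else:
            # Step 3: no seed covers this document at all.
            ragbag.append(i)
    return clusters, ragbag


# Hypothetical toy matrix: documents 0-1 share terms 0-2, documents 2-3 share terms 3-5.
D = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters, ragbag = c3m(D)
```

On this toy input the two longer documents end up as seeds and each shorter document joins the seed it shares terms with, so the ragbag stays empty.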
C3M Formulas
[formulas not recovered]

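For reference, the standard cover-coefficient definitions from Can and Ozkarahan's paper, which this slide presumably summarised, can be written as follows. Here D is the m × n document-by-term matrix with entries d_ij; the primed quantities are the term-side counterparts of δ and ψ. The exact notation used on the original slide is an assumption.

```latex
\begin{align*}
\alpha_i &= \Bigl[\sum_{j=1}^{n} d_{ij}\Bigr]^{-1}, \qquad
\beta_k = \Bigl[\sum_{i=1}^{m} d_{ik}\Bigr]^{-1} \\
c_{ij} &= \alpha_i \sum_{k=1}^{n} d_{ik}\,\beta_k\,d_{jk} \\
\delta_i &= c_{ii}, \qquad \psi_i = 1 - \delta_i \\
n_c &= \sum_{i=1}^{m} \delta_i \\
P_i &= \delta_i\,\psi_i \sum_{j=1}^{n} d_{ij} && \text{(binary } D\text{)} \\
P_i &= \delta_i\,\psi_i \sum_{j=1}^{n} d_{ij}\,\delta'_j\,\psi'_j && \text{(weighted } D\text{)}
\end{align*}
```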
C3M – Sample Matrices
[matrices not recovered]

Parallel C3M - Distribution
Distribute rows among processors
• Load balancing by cyclic block distribution

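One common reading of "cyclic block distribution" is the block-cyclic scheme: rows are cut into fixed-size blocks and the blocks are dealt to processors round-robin. The sketch below assumes that reading; the function names and the block size are hypothetical, since the slides do not give them.

```python
def owner(row, num_procs, block_size):
    """Processor that owns a given row under block-cyclic distribution."""
    return (row // block_size) % num_procs


def local_rows(rank, num_rows, num_procs, block_size):
    """All row indices assigned to processor `rank`."""
    return [r for r in range(num_rows) if owner(r, num_procs, block_size) == rank]


# Hypothetical setup: 10 document rows, 3 processors, blocks of 2 rows.
layout = {rank: local_rows(rank, 10, 3, 2) for rank in range(3)}
```

Dealing blocks cyclically spreads long and short documents across processors even when the collection is ordered, which is the load-balancing motivation stated on the slide.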
Local Calculations
All processors calculate α, partial β and Pi
Current method for the weighted matrix: too costly
Needs column vectors (but the matrix is partitioned row-wise)

Seed Powers Pi
• Seed power Pi should be small for a document whose terms appear in too many or too few documents.
• Seed power Pi should be larger for a document whose terms appear in a moderate number of documents.

Minimize Communication - Proposed Heuristic
All processors calculate α, partial β and β’
# of non-zeros

Effectiveness of Heuristic
• A MATLAB script was written to evaluate the effectiveness of the proposed heuristic.
• Correlation coefficient = 0.95

Communication between Processors
• Partial β and β’ vectors are exchanged between processors to calculate the final β and β’ vectors.
• Then, all processors calculate cii = δi

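The exchange described above amounts to an all-reduce over per-processor column sums. The sketch below simulates it in plain Python, with lists standing in for processors, rather than using a real message-passing library. It covers β only; β’ would presumably be combined in the same way, but its definition is not given in the slides. All function names and the toy matrix are hypothetical.

```python
def partial_column_sums(local_rows, num_terms):
    """Column sums over the rows owned by one processor."""
    sums = [0.0] * num_terms
    for row in local_rows:
        for k, value in enumerate(row):
            sums[k] += value
    return sums


def all_reduce_sum(partials):
    """Stand-in for an all-reduce: element-wise total of every processor's vector."""
    return [sum(column) for column in zip(*partials)]


def local_decoupling(local_rows, beta):
    """delta_i = alpha_i * sum_k d_ik^2 * beta_k for each locally owned row."""
    return [
        sum(v * v * b for v, b in zip(row, beta)) / sum(row)
        for row in local_rows
    ]


# Hypothetical 4 x 6 matrix split row-wise over two processors.
proc_rows = [
    [[1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0]],
    [[0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 0]],
]
partials = [partial_column_sums(rows, 6) for rows in proc_rows]
totals = all_reduce_sum(partials)          # every processor ends up with this vector
beta = [1.0 / t for t in totals]           # assumes no term column is empty
deltas = [local_decoupling(rows, beta) for rows in proc_rows]
```

Once the final β is known everywhere, each δi depends only on row i, so this step needs no further communication.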
# of Clusters
• Processors exchange local δ
• All processors calculate nc

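Because nc is the sum of all δi, each processor only has to contribute the sum of its own δ values. A minimal sketch follows, again simulating the exchange with plain lists; rounding to the nearest integer and the lower bound of one cluster are assumptions, as is the function name.

```python
def number_of_clusters(local_deltas):
    """Combine per-processor delta values into the global cluster count."""
    local_sums = [sum(d) for d in local_deltas]   # one scalar per processor
    total = sum(local_sums)                       # stand-in for the exchange
    return max(1, round(total))                   # rounding is an assumption


# Hypothetical delta values held by two processors.
nc = number_of_clusters([[2.0 / 3.0, 0.5], [2.0 / 3.0, 0.5]])
```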
Cluster-head Selection
• Calculate the seed powers of local documents.
• Exchange the largest nc seed powers.
• Find the largest nc seed powers among all Pi and select the corresponding documents as cluster heads.

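The selection above is a distributed top-k: no document outside a processor's local top nc can be in the global top nc, so exchanging nc candidates per processor is enough. The sketch below illustrates this with `heapq.nlargest`; the function names and the seed-power values are hypothetical.

```python
import heapq


def local_top(powers_by_doc, nc):
    """Largest nc (seed power, document id) pairs held by one processor."""
    return heapq.nlargest(nc, ((p, d) for d, p in powers_by_doc.items()))


def select_seeds(candidates_per_proc, nc):
    """Merge every processor's candidates and keep the global top nc."""
    merged = heapq.nlargest(
        nc, (pair for cand in candidates_per_proc for pair in cand)
    )
    return sorted(doc for _, doc in merged)


# Hypothetical seed powers on two processors, with nc = 3.
proc0 = {0: 0.9, 1: 0.2, 2: 0.5}
proc1 = {3: 0.7, 4: 0.6, 5: 0.1}
candidates = [local_top(proc0, 3), local_top(proc1, 3)]
seeds = select_seeds(candidates, 3)
```

Every processor can run `select_seeds` on the same exchanged candidates and arrive at the same cluster heads without a coordinator.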
Clustering Non-seed Docs
• Exchange seed documents
• Cluster non-seed documents (as in sequential C3M) in each processor.

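Once every processor holds the seed rows and the final β, assignment is purely local. The following sketch assumes the cover coefficient c_ij = α_i Σ_k d_ik β_k d_jk from Can and Ozkarahan's paper; the function name, the dictionary layout, and the toy data are hypothetical. Documents that no seed covers map to `None`, which stands for the ragbag cluster.

```python
def assign_local(local_docs, seed_docs, beta):
    """Map each local non-seed document id to the seed that covers it most.

    local_docs, seed_docs: dict of document id -> row of term weights.
    """
    assignment = {}
    for doc_id, row in local_docs.items():
        if doc_id in seed_docs:
            continue                       # seeds head their own clusters
        alpha = 1.0 / sum(row)             # assumes the row is not all zeros
        best_seed, best_cover = None, 0.0
        for seed_id, seed_row in sorted(seed_docs.items()):
            c = alpha * sum(d * b * s for d, b, s in zip(row, beta, seed_row))
            if c > best_cover:
                best_seed, best_cover = seed_id, c
        assignment[doc_id] = best_seed     # None means ragbag
    return assignment


# Hypothetical data: seeds 0 and 2 were exchanged; beta is already final.
beta = [0.5, 0.5, 1.0, 0.5, 0.5, 1.0]
seed_docs = {0: [1, 1, 1, 0, 0, 0], 2: [0, 0, 0, 1, 1, 1]}
proc0_docs = {0: [1, 1, 1, 0, 0, 0], 1: [1, 1, 0, 0, 0, 0]}
proc1_docs = {2: [0, 0, 0, 1, 1, 1], 3: [0, 0, 0, 1, 1, 0]}
result0 = assign_local(proc0_docs, seed_docs, beta)
result1 = assign_local(proc1_docs, seed_docs, beta)
```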
Future Work
• Term-Based Clustering
• Overlapping Clusters

C3M Summary
• Load balancing with cyclic block distribution
• Communication minimization by a new heuristic
• Task dependency minimized with block distribution & heuristic.

References
• Concepts and the effectiveness of the cover coefficient-based clustering methodology, F. Can, E. A. Ozkarahan
• Parallelizing the Buckshot Algorithm for Efficient Document Clustering, Eric C. Jensen, Steven M. Beitzel, Angelo J. Pilotto, Nazli Goharian, Ophir Frieder
• Clustering and Classification of Large Document Bases in a Parallel Environment, Anthony S. Ruocco, Ophir Frieder
• Efficient Clustering of Very Large Document Collections, I.S. Dhillon, J. Fan, Y. Guan

Questions?

The End
Thank you for your patience