Scalable Clustering using Multiple GPUs K Wasif Mohiuddin, P J Narayanan Center for Visual Information Technology, International Institute of Information Technology (IIIT), Hyderabad
Introduction • Classification of data is desired for a meaningful representation. • Data within a subset ideally shares common traits. • Unsupervised learning finds hidden structure in unlabeled data. • Applications in data mining and computer vision, including • Image classification • Document retrieval • The simple K-Means algorithm
Need for High-Performance Clustering • Exact K-Means has time complexity O(n^(dk+1) log n), where n is the number of input vectors, d the dimensionality, and k the number of centers. • A fast, efficient clustering implementation is needed to handle large data, high dimensionality, and large numbers of centers. • In computer vision, 128-dim SIFT and 512-dim GIST descriptors are common; features can run into several millions. • Bag of Words vocabulary generation uses SIFT vectors [Lowe, IJCV 2004]
Challenges and Contributions • Data: a storage format that allows quick and repeated access. • Computational: O(n^(dk+1) log n) complexity. • Contributions: a complete GPU-based implementation with • Exploitation of intra-vector parallelism • Efficient mean evaluation • Data organization • A multi-GPU framework
Related Work • General improvements • KD-trees [Moore et al., SIGKDD 1999] • Triangle inequality [Elkan, ICML 2003] • Pre-CUDA GPU efforts • Fragment shader [Hart et al., SIGGRAPH 2004]
Related Work (cont.) • Recent GPU efforts • Mean on CPU [Che et al., JPDC 2008] • Mean on CPU + GPU [Hong et al., WCCSIE 2009] • GPU Miner [Ren et al., HP Labs 2009] • HPK-Means [Wu et al., UCHPC 2009] • Divide & Rule [Li et al., ICCIT 2010] • Shortcomings: parallelism is not exploited within a data object; mean evaluation on the GPU is inefficient; the proposed techniques are parameter dependent.
K-Means • Objective function: ∑_{j=1}^{k} ∑_{i=1}^{n} ‖x_i^(j) − c_j‖², where x_i^(j) denotes an input vector assigned to cluster j and c_j is that cluster's center. • Distance measure: Euclidean (L2 norm). • Steps: • Membership evaluation • New mean evaluation • Convergence check
Algorithm • k random centers are initially chosen from the input. • The data is partitioned into k clusters: each observation belongs to the cluster with the nearest mean. • New centers are re-evaluated and the process continues until convergence is attained (a minimal reference sketch follows).
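For concreteness, here is a minimal single-threaded C++ reference sketch of the loop above (membership, mean, convergence). It illustrates the algorithm only; it is not the paper's GPU implementation, and all names are illustrative.

#include <vector>
#include <cfloat>

// One run of Lloyd's K-Means: x holds n input vectors of dimension d,
// c holds k centers seeded from the input (as on the slide above).
void kmeans(const std::vector<std::vector<float>>& x, int k, int iters,
            std::vector<std::vector<float>>& c)
{
    const int n = x.size(), d = x[0].size();
    std::vector<int> label(n, 0);
    for (int it = 0; it < iters; ++it) {
        // Membership: assign each vector to the nearest center (squared L2).
        for (int i = 0; i < n; ++i) {
            float best = FLT_MAX;
            for (int j = 0; j < k; ++j) {
                float dist = 0.0f;
                for (int t = 0; t < d; ++t) {
                    const float diff = x[i][t] - c[j][t];
                    dist += diff * diff;
                }
                if (dist < best) { best = dist; label[i] = j; }
            }
        }
        // New mean: recompute each center as the mean of its members.
        std::vector<std::vector<float>> sum(k, std::vector<float>(d, 0.0f));
        std::vector<int> cnt(k, 0);
        for (int i = 0; i < n; ++i) {
            ++cnt[label[i]];
            for (int t = 0; t < d; ++t) sum[label[i]][t] += x[i][t];
        }
        for (int j = 0; j < k; ++j)
            if (cnt[j])
                for (int t = 0; t < d; ++t) c[j][t] = sum[j][t] / cnt[j];
        // Convergence: in practice, stop early once labels stop changing.
    }
}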
K-Means on GPU: Membership Evaluation • Involves distance and minimum evaluation. • A single thread per component of a vector. • Parallel computation across the d components of the input and center vectors, stored in row-major format. • Log-step summation for distance evaluation. • For each input vector, we traverse all centers, which are served from the L2 cache.
K-Means on GPU (cont.): Membership Evaluation • Data objects stored in row-major format, which provides coalesced access. • Distance evaluation using shared memory. • Square-root computation avoided, since it does not change the nearest center (see the kernel sketch below).
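A sketch of such a membership kernel is below, assuming one input vector per block (the paper packs two per block and batches centers), d == blockDim.x with d a power of two (e.g. 128 for SIFT), and centers read from global memory through the L2 cache as on Fermi. This is a simplified illustration, not the authors' kernel.

#include <cfloat>

__global__ void assignLabels(const float* input,   // n x d, row major
                             const float* centers, // k x d, row major
                             int* labels, int n, int d, int k)
{
    extern __shared__ float partial[]; // d floats of shared memory
    const int vec = blockIdx.x;        // input vector handled by this block
    const int t   = threadIdx.x;       // vector component handled by this thread
    if (vec >= n) return;

    const float x = input[vec * d + t]; // coalesced: consecutive threads, consecutive addresses
    float best = FLT_MAX;
    int bestCenter = 0;
    for (int c = 0; c < k; ++c) {
        const float diff = x - centers[c * d + t]; // center component via L2
        partial[t] = diff * diff;
        __syncthreads();
        // Log-step tree summation of the d squared differences.
        for (int s = d >> 1; s > 0; s >>= 1) {
            if (t < s) partial[t] += partial[t + s];
            __syncthreads();
        }
        // Squared distance suffices: taking the root cannot change the argmin.
        if (partial[0] < best) { best = partial[0]; bestCenter = c; }
        __syncthreads();
    }
    if (t == 0) labels[vec] = bestCenter;
}

// Launch sketch: assignLabels<<<n, d, d * sizeof(float)>>>(dIn, dCenters, dLabels, n, d, k);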
K-Means on GPU (cont.): Mean Evaluation Issues • Rearranging data on the CPU according to membership is time consuming. • Concurrent writes. • Random reads and writes. • Non-uniform distribution of labels across data objects.
Mean Evaluation on GPU • Labels and indices are packed into 64-bit records. • Data objects with the same membership are grouped using the Splitsort operation, splitting with labels as the key. • The gather primitive rearranges the input in label order. • A sorted global index of the input vectors is generated (see the sketch below). Splitsort: Suryakant & Narayanan, IIIT-H TR 2009
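Thrust's sort_by_key can stand in for the Splitsort primitive to show the grouping step; the calls below are real Thrust API, but the pipeline is a simplified assumption rather than the authors' 64-bit record implementation.

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/iterator/constant_iterator.h>

// Sort global indices by label so that vectors of the same cluster become
// contiguous, then count members per cluster. clusterIds and counts must
// have room for k entries.
void groupByLabel(thrust::device_vector<int>& labels,     // in: memberships; out: sorted
                  thrust::device_vector<int>& indices,    // in: 0..n-1; out: sorted global index
                  thrust::device_vector<int>& clusterIds, // out: distinct labels
                  thrust::device_vector<int>& counts)     // out: per-cluster histogram
{
    thrust::sort_by_key(labels.begin(), labels.end(), indices.begin());
    thrust::reduce_by_key(labels.begin(), labels.end(),
                          thrust::constant_iterator<int>(1),
                          clusterIds.begin(), counts.begin());
}

The permuted indices then drive a gather of the input rows, producing the sorted global index the slide refers to.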
Splitsort & Transpose Operation
Mean Evaluation on GPU (cont.) • Row-major storage of vectors enables coalesced access. • CUDPP segmented scan followed by a compact operation yields the histogram count. • A transpose operation is applied before rearranging the input vectors. • Using segmented scan again, we evaluate the mean of the rearranged vectors per label (continued in the sketch below).
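Continuing the sketch above, and assuming the gathered vectors have been transposed to a d × n layout (one dimension per row, columns in sorted-label order), the per-cluster sums reduce to one segmented reduction per dimension. Thrust's reduce_by_key stands in for the CUDPP segmented scan here, and the sketch assumes every cluster is non-empty so the k output slots line up.

#include <thrust/device_vector.h>
#include <thrust/reduce.h>

// data:         d x n, columns already in sorted-label order (after gather + transpose)
// sortedLabels: the n labels in sorted order, from groupByLabel above
// sums:         d x k; dividing each entry by the cluster's histogram count
//               yields the new mean component.
void clusterSums(const thrust::device_vector<float>& data,
                 const thrust::device_vector<int>& sortedLabels,
                 thrust::device_vector<int>& ids,   // scratch: distinct labels
                 thrust::device_vector<float>& sums,
                 int n, int d, int k)
{
    for (int dim = 0; dim < d; ++dim)
        thrust::reduce_by_key(sortedLabels.begin(), sortedLabels.end(),
                              data.begin() + dim * n,
                              ids.begin(),
                              sums.begin() + dim * k);
}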
Implementation Details • Tesla: 2 vectors per block, 2 centers at a time; centers accessed via shared memory. • Fermi: 2 vectors per block, 4 centers at a time; centers accessed via global memory through the L2 cache; more shared memory available for distance evaluation. • Occupancy of 83% using 5136 bytes of shared memory in the Fermi case.
Issues • Too many distance evaluations. • Convergence depends strongly on the initial cluster centers. • Prior seeding using K-Means++ can reduce the number of iterations. • Parameters such as dimensionality and the number of centers affect performance, apart from the input size.
Limitations of the GPU Device • The algorithm is highly computational and memory consuming. • Global and shared memory on a GPU device are limited. • The computational load must be divided if more than one device is available. • Every available resource should be utilized. • The algorithm must scale.
Multi-GPU Approach • Partition the input data into chunks proportional to the number of cores. • Broadcast the k centers to all nodes. • Each node performs membership evaluation and partial mean computation on its GPU for the chunk sent to it. • Nodes send their partial sums to the master node. • The master node evaluates the new means for the next iteration (a minimal sketch follows).
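A minimal MPI sketch of one such iteration is below. MPI_Bcast and MPI_Reduce are real MPI calls; gpuPartialKMeans is a hypothetical placeholder for the per-GPU kernels sketched earlier, and the paper's actual framework may differ.

#include <mpi.h>
#include <vector>

// Hypothetical placeholder: computes memberships for the local chunk on this
// node's GPU and accumulates per-cluster partial sums and member counts.
void gpuPartialKMeans(const float* chunk, int localN, const float* centers,
                      int k, int d, float* partialSums, int* partialCounts);

// One iteration of the multi-GPU scheme: broadcast centers, compute partial
// results per node, reduce to the master, which forms the new means.
void multiGpuIteration(std::vector<float>& centers, // k*d, valid on rank 0
                       const float* chunk, int localN, int k, int d)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Master broadcasts the current k centers to every node.
    MPI_Bcast(centers.data(), k * d, MPI_FLOAT, 0, MPI_COMM_WORLD);

    // Each node evaluates membership and partial means for its chunk.
    std::vector<float> sums(k * d, 0.0f);
    std::vector<int> counts(k, 0);
    gpuPartialKMeans(chunk, localN, centers.data(), k, d,
                     sums.data(), counts.data());

    // Partial sums flow back to the master node.
    std::vector<float> totalSums(k * d, 0.0f);
    std::vector<int> totalCounts(k, 0);
    MPI_Reduce(sums.data(), totalSums.data(), k * d, MPI_FLOAT, MPI_SUM,
               0, MPI_COMM_WORLD);
    MPI_Reduce(counts.data(), totalCounts.data(), k, MPI_INT, MPI_SUM,
               0, MPI_COMM_WORLD);

    // Master evaluates the new means for the next iteration.
    if (rank == 0)
        for (int j = 0; j < k; ++j)
            if (totalCounts[j])
                for (int t = 0; t < d; ++t)
                    centers[j * d + t] = totalSums[j * d + t] / totalCounts[j];
}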
Results • Generated Gaussian SIFT vectors. • Variation in the parameters n, d, k. • Performance on CPU (32-bit, 2.7 GHz), Tesla T10, and GTX 480 tested up to n_max = 4 million, k_max = 8,000, d_max = 256. • Multi-GPU (4×T10 + GTX 480): n_max = 32 million, k_max = 8,000, d_max = 256. • Comparison with previous GPU implementations.
Overall Results Running times of K-Means on the CPU and GPUs, in seconds, for d=128.
Overall Performance • Mean evaluation reduced to 6% of the total time for large inputs of high-dimensional data. • Multi-GPU provided linear speedup. • Speedup of up to 170× on the GTX 480. • 6 million vectors of 128 dimensions clustered in just 136 seconds per iteration. • Achieved up to a 2× improvement over the best previous GPU implementation.
Performance vs ‘n’ Linear performance for variation in n, with d=128 and k=4,000.
Performance vs ‘d’ Performance for variation in d, with n=1M and k=8,000.
Performance vs ‘k’ Linear performance for variation in k, with n=50K and d=128.
Comparison Running time of K-Means in seconds on the GTX 280.
Performance on GPUs Performance of the 8600, Tesla, and GTX 480 for d=128 and k=1,000.
Conclusions • Achieved a speedup of over 170× on a single NVIDIA Fermi GPU. • A complete GPU-based implementation. • High performance for large d due to processing each vector in parallel. • Scalable in problem size (n, d, k) and in the number of cores. • Operations such as Splitsort and transpose used for coalesced memory access. • Memory limitations overcome using the multi-GPU framework. • Code will be available soon at http://cvit.iiit.ac.in
Thank You Questions?