Scalable Clustering using Multiple GPUs K Wasif Mohiuddin, P J Narayanan Center for Visual Information Technology, International Institute of Information Technology (IIIT), Hyderabad
Introduction • Classification of data is desired for a meaningful representation. • Data within a subset ideally shares common traits. • Unsupervised learning finds hidden structure in unlabeled data. • Applications in data mining and computer vision, including • Image classification • Document retrieval • The simple K-Means algorithm
Need for High-Performance Clustering • Exact K-Means has time complexity O(n^(dk+1) log n), where n is the number of input vectors, d the dimensionality, and k the number of centers. • A fast, efficient clustering implementation is needed to handle large data, high dimensionality, and large numbers of centers. • In computer vision, 128-dim SIFT and 512-dim GIST descriptors are common; features can run into several millions. • Bag of Words vocabulary generation uses SIFT vectors [Lowe, IJCV 2004]
Challenges and Contributions • Data: a storage format that allows quick and repeated access. • Computational: O(n^(dk+1) log n) complexity. • Contributions: a complete GPU-based implementation with • Exploitation of intra-vector parallelism • Efficient mean evaluation • Data organization • A multi-GPU framework
Related Work • General improvements • KD-trees [Moore et al., SIGKDD 1999] • Triangle inequality [Elkan, ICML 2003] • Pre-CUDA GPU efforts • Fragment shader [Hart et al., SIGGRAPH 2004]
Related Work (cont.) • Recent GPU efforts • Mean on CPU [Che et al., JPDC 2008] • Mean on CPU + GPU [Hong et al., WCCSIE 2009] • GPU Miner [Ren et al., HP Labs 2009] • HPK-Means [Wu et al., UCHPC 2009] • Divide & Rule [Li et al., ICCIT 2010] • Shortcomings: parallelism is not exploited within a data object; mean evaluation on the GPU is inefficient; the proposed techniques are parameter dependent.
K-Means • Objective function: ∑_{j=1}^{k} ∑_{i=1}^{n} ‖x_i^(j) − c_j‖², where x_i^(j) denotes an input vector assigned to cluster j and c_j is that cluster's center. • Distance measure: Euclidean (L2 norm). • Steps: • Membership evaluation • New mean evaluation • Convergence check
Algorithm • k random centers are initially chosen from the input. • The data is partitioned into k clusters: each observation belongs to the cluster with the nearest mean. • New centers are re-evaluated and the process continues until convergence is attained (a minimal reference sketch follows).
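For concreteness, here is a minimal single-threaded C++ reference sketch of the loop above (membership, mean, convergence). It illustrates the algorithm only; it is not the paper's GPU implementation, and all names are illustrative.

#include <vector>
#include <cfloat>

// One run of Lloyd's K-Means: x holds n input vectors of dimension d,
// c holds k centers seeded from the input (as on the slide above).
void kmeans(const std::vector<std::vector<float>>& x, int k, int iters,
            std::vector<std::vector<float>>& c)
{
    const int n = x.size(), d = x[0].size();
    std::vector<int> label(n, 0);
    for (int it = 0; it < iters; ++it) {
        // Membership: assign each vector to the nearest center (squared L2).
        for (int i = 0; i < n; ++i) {
            float best = FLT_MAX;
            for (int j = 0; j < k; ++j) {
                float dist = 0.0f;
                for (int t = 0; t < d; ++t) {
                    const float diff = x[i][t] - c[j][t];
                    dist += diff * diff;
                }
                if (dist < best) { best = dist; label[i] = j; }
            }
        }
        // New mean: recompute each center as the mean of its members.
        std::vector<std::vector<float>> sum(k, std::vector<float>(d, 0.0f));
        std::vector<int> cnt(k, 0);
        for (int i = 0; i < n; ++i) {
            ++cnt[label[i]];
            for (int t = 0; t < d; ++t) sum[label[i]][t] += x[i][t];
        }
        for (int j = 0; j < k; ++j)
            if (cnt[j])
                for (int t = 0; t < d; ++t) c[j][t] = sum[j][t] / cnt[j];
        // Convergence: in practice, stop early once labels stop changing.
    }
}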
K-Means on GPU: Membership Evaluation • Involves distance and minimum evaluation. • A single thread per component of a vector. • Parallel computation across the d components of the input and center vectors, stored in row-major format. • Log-step summation for distance evaluation. • For each input vector, we traverse all centers, which are served from the L2 cache.
K-Means on GPU (cont.): Membership Evaluation • Data objects stored in row-major format, which provides coalesced access. • Distance evaluation using shared memory. • Square-root computation avoided, since it does not change the nearest center (see the kernel sketch below).
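A sketch of such a membership kernel is below, assuming one input vector per block (the paper packs two per block and batches centers), d == blockDim.x with d a power of two (e.g. 128 for SIFT), and centers read from global memory through the L2 cache as on Fermi. This is a simplified illustration, not the authors' kernel.

#include <cfloat>

__global__ void assignLabels(const float* input,   // n x d, row major
                             const float* centers, // k x d, row major
                             int* labels, int n, int d, int k)
{
    extern __shared__ float partial[]; // d floats of shared memory
    const int vec = blockIdx.x;        // input vector handled by this block
    const int t   = threadIdx.x;       // vector component handled by this thread
    if (vec >= n) return;

    const float x = input[vec * d + t]; // coalesced: consecutive threads, consecutive addresses
    float best = FLT_MAX;
    int bestCenter = 0;
    for (int c = 0; c < k; ++c) {
        const float diff = x - centers[c * d + t]; // center component via L2
        partial[t] = diff * diff;
        __syncthreads();
        // Log-step tree summation of the d squared differences.
        for (int s = d >> 1; s > 0; s >>= 1) {
            if (t < s) partial[t] += partial[t + s];
            __syncthreads();
        }
        // Squared distance suffices: taking the root cannot change the argmin.
        if (partial[0] < best) { best = partial[0]; bestCenter = c; }
        __syncthreads();
    }
    if (t == 0) labels[vec] = bestCenter;
}

// Launch sketch: assignLabels<<<n, d, d * sizeof(float)>>>(dIn, dCenters, dLabels, n, d, k);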
K-Means on GPU (cont.): Mean Evaluation Issues • Rearranging data on the CPU according to membership is time consuming. • Concurrent writes. • Random reads and writes. • Non-uniform distribution of labels across data objects.
Mean Evaluation on GPU • Labels and indices are packed into 64-bit records. • Data objects with the same membership are grouped using the Splitsort operation, splitting with labels as the key. • The gather primitive rearranges the input in label order. • A sorted global index of the input vectors is generated (see the sketch below). Splitsort: Suryakant & Narayanan, IIIT-H TR 2009
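Thrust's sort_by_key can stand in for the Splitsort primitive to show the grouping step; the calls below are real Thrust API, but the pipeline is a simplified assumption rather than the authors' 64-bit record implementation.

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/iterator/constant_iterator.h>

// Sort global indices by label so that vectors of the same cluster become
// contiguous, then count members per cluster. clusterIds and counts must
// have room for k entries.
void groupByLabel(thrust::device_vector<int>& labels,     // in: memberships; out: sorted
                  thrust::device_vector<int>& indices,    // in: 0..n-1; out: sorted global index
                  thrust::device_vector<int>& clusterIds, // out: distinct labels
                  thrust::device_vector<int>& counts)     // out: per-cluster histogram
{
    thrust::sort_by_key(labels.begin(), labels.end(), indices.begin());
    thrust::reduce_by_key(labels.begin(), labels.end(),
                          thrust::constant_iterator<int>(1),
                          clusterIds.begin(), counts.begin());
}

The permuted indices then drive a gather of the input rows, producing the sorted global index the slide refers to.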
Splitsort & Transpose Operation
Mean Evaluation on GPU (cont.) • Row-major storage of vectors enables coalesced access. • CUDPP segmented scan followed by a compact operation yields the histogram count. • A transpose operation is applied before rearranging the input vectors. • Using segmented scan again, we evaluate the mean of the rearranged vectors per label (continued in the sketch below).
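Continuing the sketch above, and assuming the gathered vectors have been transposed to a d × n layout (one dimension per row, columns in sorted-label order), the per-cluster sums reduce to one segmented reduction per dimension. Thrust's reduce_by_key stands in for the CUDPP segmented scan here, and the sketch assumes every cluster is non-empty so the k output slots line up.

#include <thrust/device_vector.h>
#include <thrust/reduce.h>

// data:         d x n, columns already in sorted-label order (after gather + transpose)
// sortedLabels: the n labels in sorted order, from groupByLabel above
// sums:         d x k; dividing each entry by the cluster's histogram count
//               yields the new mean component.
void clusterSums(const thrust::device_vector<float>& data,
                 const thrust::device_vector<int>& sortedLabels,
                 thrust::device_vector<int>& ids,   // scratch: distinct labels
                 thrust::device_vector<float>& sums,
                 int n, int d, int k)
{
    for (int dim = 0; dim < d; ++dim)
        thrust::reduce_by_key(sortedLabels.begin(), sortedLabels.end(),
                              data.begin() + dim * n,
                              ids.begin(),
                              sums.begin() + dim * k);
}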
Implementation Details • Tesla: 2 vectors per block, 2 centers at a time; centers accessed via shared memory. • Fermi: 2 vectors per block, 4 centers at a time; centers accessed via global memory through the L2 cache; more shared memory available for distance evaluation. • Occupancy of 83% using 5136 bytes of shared memory in the Fermi case.
Issues • Too many distance evaluations. • Convergence depends strongly on the initial cluster centers. • Prior seeding using K-Means++ can reduce the number of iterations. • Parameters such as dimensionality and the number of centers affect performance, apart from the input size.
Limitations of the GPU Device • The algorithm is highly computational and memory consuming. • Global and shared memory on a GPU device are limited. • The computational load must be divided if more than one device is available. • Every available resource should be utilized. • The algorithm must scale.
Multi-GPU Approach • Partition the input data into chunks proportional to the number of cores. • Broadcast the k centers to all nodes. • Each node performs membership evaluation and partial mean computation on its GPU for the chunk sent to it. • Nodes send their partial sums to the master node. • The master node evaluates the new means for the next iteration (a minimal sketch follows).
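A minimal MPI sketch of one such iteration is below. MPI_Bcast and MPI_Reduce are real MPI calls; gpuPartialKMeans is a hypothetical placeholder for the per-GPU kernels sketched earlier, and the paper's actual framework may differ.

#include <mpi.h>
#include <vector>

// Hypothetical placeholder: computes memberships for the local chunk on this
// node's GPU and accumulates per-cluster partial sums and member counts.
void gpuPartialKMeans(const float* chunk, int localN, const float* centers,
                      int k, int d, float* partialSums, int* partialCounts);

// One iteration of the multi-GPU scheme: broadcast centers, compute partial
// results per node, reduce to the master, which forms the new means.
void multiGpuIteration(std::vector<float>& centers, // k*d, valid on rank 0
                       const float* chunk, int localN, int k, int d)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Master broadcasts the current k centers to every node.
    MPI_Bcast(centers.data(), k * d, MPI_FLOAT, 0, MPI_COMM_WORLD);

    // Each node evaluates membership and partial means for its chunk.
    std::vector<float> sums(k * d, 0.0f);
    std::vector<int> counts(k, 0);
    gpuPartialKMeans(chunk, localN, centers.data(), k, d,
                     sums.data(), counts.data());

    // Partial sums flow back to the master node.
    std::vector<float> totalSums(k * d, 0.0f);
    std::vector<int> totalCounts(k, 0);
    MPI_Reduce(sums.data(), totalSums.data(), k * d, MPI_FLOAT, MPI_SUM,
               0, MPI_COMM_WORLD);
    MPI_Reduce(counts.data(), totalCounts.data(), k, MPI_INT, MPI_SUM,
               0, MPI_COMM_WORLD);

    // Master evaluates the new means for the next iteration.
    if (rank == 0)
        for (int j = 0; j < k; ++j)
            if (totalCounts[j])
                for (int t = 0; t < d; ++t)
                    centers[j * d + t] = totalSums[j * d + t] / totalCounts[j];
}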
Results • Generated Gaussian SIFT vectors. • Variation in the parameters n, d, k. • Performance on CPU (32-bit, 2.7 GHz), Tesla T10, and GTX 480 tested up to n_max = 4 million, k_max = 8,000, d_max = 256. • Multi-GPU (4×T10 + GTX 480): n_max = 32 million, k_max = 8,000, d_max = 256. • Comparison with previous GPU implementations.
Overall Results Running times of K-Means on the CPU and GPUs, in seconds, for d=128.
Overall Performance • Mean evaluation reduced to 6% of the total time for large inputs of high-dimensional data. • Multi-GPU provided linear speedup. • Speedup of up to 170× on the GTX 480. • 6 million vectors of 128 dimensions clustered in just 136 seconds per iteration. • Achieved up to a 2× improvement over the best previous GPU implementation.
Performance vs ‘n’ Linear performance for variation in n, with d=128 and k=4,000.
Performance vs ‘d’ Performance for variation in d, with n=1M and k=8,000.
Performance vs ‘k’ Linear performance for variation in k, with n=50K and d=128.
Comparison Running time of K-Means in seconds on the GTX 280.
Performance on GPUs Performance of the 8600, Tesla, and GTX 480 for d=128 and k=1,000.
Conclusions • Achieved a speedup of over 170× on a single NVIDIA Fermi GPU. • A complete GPU-based implementation. • High performance for large d due to processing each vector in parallel. • Scalable in problem size (n, d, k) and in the number of cores. • Operations such as Splitsort and transpose used for coalesced memory access. • Memory limitations overcome using the multi-GPU framework. • Code will be available soon at http://cvit.iiit.ac.in
Thank You Questions?