Scalable Clustering for Vision using GPUs
K Wasif Mohiuddin, P J Narayanan
Center for Visual Information Technology
International Institute of Information Technology (IIIT), Hyderabad
Publications
1) K Wasif Mohiuddin and P J Narayanan. Scalable Clustering using Multiple GPUs. HiPC '11 (Conference on High Performance Computing), Bangalore, India.
2) K Wasif Mohiuddin and P J Narayanan. GPU Assisted Video Organizing Application. ICCV '11 Workshop on GPU in Computer Vision Applications, Barcelona, Spain.
Presentation Flow • Scalable Clustering on Multiple GPUs • GPU assisted Personal Video Organizer
Introduction • Classification of data is desired for meaningful representation. • Unsupervised learning for finding hidden structure. • Applications in computer vision and data mining, with • Image Classification • Document Retrieval • K-Means algorithm
Clustering [Figure: iterative clustering loop — Select Centers, Labeling, Mean Evaluation, Relabeling]
Need for High Performance Clustering • Clustering 125K vectors of 128 dimensions with 2K clusters took nearly 8 minutes per iteration on the CPU. • A fast, efficient clustering implementation is needed to deal with large data, high dimensionality and many centers. • In computer vision, SIFT (128-dim) and GIST are common; feature sets can run into several millions. • Bag of Words for vocabulary generation using SIFT vectors
Challenges and Contributions • Computational: O(n^(dk+1) log n) • Growing n, k, d for large scale applications. • Contributions: A complete GPU based implementation with • Exploitation of intra-vector parallelism • Efficient mean evaluation • Data organization for coalesced access • Multi-GPU framework
Related Work • General Improvements • KD-trees [Moore et al, SIGKDD-1999] • Triangle Inequality [Elkan, ICML-2003] • Distributed Systems [Dhillon et al, LSPDM-2000] • Pre-CUDA GPU Efforts • Fragment Shader [Hall et al, SIGGRAPH-2004]
Related Work (cont) • Recent GPU efforts • Mean on CPU [Che et al, JPDC-2008] • Mean on CPU + GPU [Hong et al, WCCSIE-2009] • GPU Miner [Wenbin et al, HKUSTCS-2008] • HPK-Means [Wu et al, UCHPC-2009] • Divide & Rule [Li et al, ICCIT-2010] • One thread assigned per vector; parallelism not exploited within a data object. • Lacking efficiency in mean evaluation. • Proposed techniques are parameter dependent.
K-Means • Objective Function: ∑_{j=1..k} ∑_{i=1..n} ‖x_i^(j) − c_j‖², where x_i^(j) are the input vectors assigned to cluster j • k random centers are initially chosen from the input data objects. • Steps: • Membership Evaluation • New Mean Evaluation • Convergence
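As a reference point for the GPU discussion that follows, the two alternating steps can be sketched in plain NumPy (an illustrative CPU version only; function and variable names are ours, not from the implementation):

```python
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    """Plain k-means: membership evaluation, then new mean evaluation."""
    rng = np.random.default_rng(seed)
    # k random centers chosen from the input data objects
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        # Membership: squared Euclidean distance of every vector to every center
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # New mean: average of the members of each cluster
        for j in range(k):
            members = x[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels, centers
```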
GPU Architecture • The Fermi architecture has 16 Streaming Multiprocessors (SMs) • Each SM has 32 cores, for 512 CUDA cores overall. • Kernels launch multiple threads to perform a task in a Single Instruction Multiple Data (SIMD) fashion. • Each SM has registers divided equally amongst its threads. Each thread has a private local memory. • Single unified memory request path for loads and stores, using the per-SM L1 cache and an L2 cache that services all operations. • Double precision, faster context switching, faster atomic operations and multiple kernel execution
K-Means on GPU: Membership Evaluation • Involves distance and minima evaluation. • Single thread per component of a vector. • Parallel computation done on the 'd' components of input and center vectors, stored in row major format. • Log-step summation (parallel reduction) for distance evaluation. • For each input vector we traverse across all centers.
Membership on GPU [Figure: each input vector is compared against all k center vectors, p components at a time, to produce its label]
Membership on GPU (cont) • Data objects stored in row major format • Provides coalesced access • Distance evaluation using shared memory. • Square-root computation avoided: comparing squared distances gives the same nearest center.
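The membership step can be illustrated in NumPy. This sketch uses the expanded form ‖x − c‖² = ‖x‖² − 2x·c + ‖c‖² and, as on the GPU, skips the square root since argmin over squared distances yields the same label (names are illustrative, not from the implementation):

```python
import numpy as np

def membership(x, centers):
    """Label each input vector with its nearest center.
    sqrt is skipped: argmin over squared distances gives the same label."""
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, all terms batched
    d2 = ((x ** 2).sum(1)[:, None]
          - 2.0 * x @ centers.T
          + (centers ** 2).sum(1)[None, :])
    return d2.argmin(axis=1)
```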
K-Means on GPU (cont) • Mean Evaluation Issues • Random reads and writes • Concurrent writes • Non-uniform distribution of data objects per label. [Figure: many threads reading and writing the same per-label accumulator concurrently]
Mean Evaluation on GPU • Store labels and index in 64-bit records • Group data objects with the same membership using a split-sort operation. • We split using labels as the key. • Gather primitive used to rearrange the input in order of labels. • A sorted global index of input vectors is generated. Split-sort: Suryakant & Narayanan, IIIT-H TR 2009
Mean Evaluation on GPU (cont) • Row major storage of vectors enables coalesced access. • Segmented scan followed by a compact operation for the histogram count. • Transpose operation before rearranging the input vectors. • Segmented scan is then used to evaluate the mean of the rearranged vectors per label.
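A CPU-side sketch of this sort-then-reduce mean evaluation, with NumPy primitives standing in for the GPU ones (stable argsort for the split-sort, bincount for the histogram count, reduceat for the segmented reduction); names are ours:

```python
import numpy as np

def mean_by_label(x, labels, k):
    """Group vectors by label (split-sort style), then reduce each
    contiguous segment to a sum and divide by the member count."""
    order = np.argsort(labels, kind="stable")   # split-sort: labels as key
    xs = x[order]                               # gather: rearrange by label
    counts = np.bincount(labels, minlength=k)   # histogram of members per label
    starts = np.concatenate(([0], np.cumsum(counts)[:-1]))
    present = counts > 0
    means = np.zeros((k, x.shape[1]))
    if present.any():
        # segmented reduction over each label's contiguous block
        sums = np.add.reduceat(xs, starts[present], axis=0)
        means[present] = sums / counts[present][:, None]
    return means
```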
Implementation Details • Tesla • 2 vectors per block, 2 centers at a time • Centers accessed via texture memory • Fermi • 2 vectors per block, 4 centers at a time • Centers accessed via global memory using the L2 cache • More shared memory for distance evaluation • Occupancy of 83% achieved on both Fermi and Tesla.
Limitations of a GPU device • The algorithm is highly compute- and memory-intensive. • A single GPU device can be overloaded. • Limited global and shared memory on a GPU device. • Handling of large data vectors • Scalability of the algorithm
Multi GPU Approach • Partition input data into chunks proportional to the number of cores. • Broadcast 'k' centers to all the nodes. • Perform membership evaluation and partial mean computation on the GPU at each node for its chunk.
Multi GPU Approach (cont) • Nodes send partial sums to the master node. • New means evaluated by the master node for the next iteration. [Figure: nodes A…Z send partial sums Sa, Sb, …, Sz to the master node, which computes S = Sa + Sb + … + Sz and broadcasts the new centers]
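The node/master division of work can be sketched as follows, with plain Python functions standing in for the MPI communication (assumed names, not the actual framework code):

```python
import numpy as np

def cluster_chunk(chunk, centers):
    """Per-node work: membership evaluation, then partial sums and counts."""
    d2 = ((chunk[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    k, d = centers.shape
    sums = np.zeros((k, d))
    np.add.at(sums, labels, chunk)              # partial sum per label
    counts = np.bincount(labels, minlength=k)   # partial count per label
    return sums, counts

def master_combine(partials):
    """Master node: S = Sa + Sb + ... + Sz; new centers = S / total counts."""
    sums = sum(p[0] for p in partials)
    counts = sum(p[1] for p in partials)
    return sums / np.maximum(counts, 1)[:, None]
```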
Results • Generated Gaussian SIFT vectors • Variation in parameters n, d, k • Performance on CPU (1 GB RAM, 2.7 GHz), Tesla T10, GTX 480, 8600, tested up to nmax: 4 million, kmax: 8000, dmax: 256 • Multi-GPU (4×T10 + GTX 480) using MPI: nmax: 32 million, kmax: 8000, dmax: 256 • Comparison with previous GPU implementations.
Overall Results Times of K-Means on CPU, GPUs in seconds for d=128.
Performance on GPUs Performance of 8600 (32 cores), Tesla (240 cores), GTX 480 (480 cores) for d=128 and k=1,000.
Performance vs ‘n’ Linear in n, with d=128 and k=4,000.
Overall Performance • Multi GPU provided linear speedup • Speedup of up to 170× on GTX 480 • 6 million vectors of 128 dimensions clustered in just 136 sec per iteration. • Low end GPUs provide nearly 10-20× speedup.
Comparison Up to 2× speedup over the best previous GPU implementation on GTX 280
Multi GPU Results Scalable with the number of cores in a multi-GPU setup. Results on Tesla, GTX 480 in seconds for d=128, k=4000
Time Division Time on GTX 480 device. Mean evaluation reduced to 6% of the total time for large input of high dimensional data.
Presentation Flow • Scalable Clustering on Multiple GPUs • GPU assisted Personal Video Organizer
Motivation • Many and varied videos in everyone's collection, growing every day • Sports, TV shows, movies, home events, etc. • Categorizing them based on content is useful • No effective tools for video (or images) • Existing efforts are very category specific • Shouldn't need heavy training or large clusters of computers • Goal: Personal categorization performed on personal machines • Training and testing on a personal scale
Challenges and Contributions • Algorithmic: Extend image classification to videos. • Data: Use a small amount of personal videos spanning a wide class of categories. • Computational: Need to do it on laptops or personal workstations. • Contributions: A video organization scheme with • Learning categories from user-labelled data • Fast category assignment for the collection • Exploiting the GPU for computation • Good performance even on personal machines
Related Work • Image Categorization • ACDSee, Dbgallery, Flickr, Picasa, etc • Image Representation • SIFT[Lowe IJCV04], GIST[Torralba IJCV01], HOG [Dalal & Triggs CVPR05] etc. • Key Frame extraction • Difference of Histograms [Gianluigi SPIE05]
Related Work…contd • Genre Classification • SVM [Ekenel et al, AIEMPro 2010] • HMM [Haoran et al, ICICS 2003] • GMM [Truong et al, ICPR 2000] • Motion and color [Chen et al, JVCIR 2011] • Spatio-temporal behavior [Rea et al, ICIP 2000] • Involved extensive learning of categories for a specific type of videos • Not suitable for personal collections that vary greatly.
Video Classification: Steps • Category Determination • User tags videos separately for each class • Learning done using these videos • Cluster centers derived for each class • Category Assignment • Use the trained categories on remaining videos • Final assigning done based on scoring • Ambiguities resolved by user
Category Determination • Segmentation & Thresholding • Keyframe extraction & PHOG Features • K-Means [Pipeline: Tagged Videos → Segment & Threshold → Keyframes & PHOG → K-Means Clustering → Category Representation]
Work Division • Less intensive steps processed on CPU. • Computationally expensive steps moved onto GPU. • Steps like key frame extraction, feature extraction and clustering are time consuming.
Key frame Extraction Segmentation • Compute color histogram for all the frames. • Divide video into shots using the score of difference of histograms across consecutive frames. Thresholding • Shots having more than 60 frames selected. • Four equidistant frames chosen as key frames from every shot.
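A minimal sketch of this segmentation-and-thresholding scheme, using per-frame gray-level histograms as a stand-in for the color histograms; the 60-frame minimum shot length and 4 key frames per shot follow the slide, while the histogram-difference threshold of 0.5 is an assumed value:

```python
import numpy as np

def shot_boundaries(frames, bins=16, thresh=0.5):
    """Divide a video into shots where the difference of consecutive
    normalized frame histograms exceeds a threshold."""
    hists = []
    for f in frames:
        h, _ = np.histogram(f, bins=bins, range=(0, 256))
        hists.append(h / h.sum())
    cuts = [0]
    for i in range(1, len(hists)):
        if np.abs(hists[i] - hists[i - 1]).sum() > thresh:
            cuts.append(i)
    return cuts

def keyframes(frames, cuts, min_len=60, per_shot=4):
    """Keep shots longer than min_len frames; pick equidistant key frames."""
    keys = []
    for s, e in zip(cuts, cuts[1:] + [len(frames)]):
        if e - s >= min_len:
            keys.extend(np.linspace(s, e - 1, per_shot).astype(int).tolist())
    return keys
```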
PHOG • Edge contours extracted using the Canny edge detector. • Orientation gradients computed with a 3×3 Sobel mask without Gaussian smoothing. • HOG descriptor discretized into K orientation bins. • A HOG vector is computed for each grid cell at each pyramid resolution level [Bosch et al, CIVR 2007]
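A simplified PHOG-style sketch in NumPy: central differences stand in for the 3×3 Sobel mask, orientations are binned, and per-cell histograms over a spatial pyramid are concatenated. Parameters and names are illustrative, not those of [Bosch et al, CIVR 2007]:

```python
import numpy as np

def phog(gray, levels=2, bins=8):
    """PHOG-style descriptor: gradient orientation histograms per grid
    cell at each pyramid level, concatenated and L1-normalized."""
    # central differences (a simplification of the 3x3 Sobel mask)
    gx = gray[1:-1, 2:] - gray[1:-1, :-2]
    gy = gray[2:, 1:-1] - gray[:-2, 1:-1]
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)      # unsigned orientation
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    desc = []
    for lvl in range(levels + 1):
        cells = 2 ** lvl                          # cells per side at this level
        rows = np.array_split(np.arange(mag.shape[0]), cells)
        cols = np.array_split(np.arange(mag.shape[1]), cells)
        for rs in rows:
            for cs in cols:
                h = np.bincount(bin_idx[np.ix_(rs, cs)].ravel(),
                                weights=mag[np.ix_(rs, cs)].ravel(),
                                minlength=bins)
                desc.extend(h)
    desc = np.asarray(desc)
    n = desc.sum()
    return desc / n if n else desc
```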
Final Representation • Cluster the accumulated key frames separately for every category. • Grouping of similar frames into single cluster. • Meaningful representation of key frames for each category is achieved. • Reduced search space for the test videos.
K-Means • Partitions 'n' data objects into 'k' clusters • Clustering of extracted training key frames • Separately for each of the categories • Represent each category with meaningful cluster centers • For instance, grouping frames showing the pitch, goal post, etc. • 30 clusters per category generated.
PHOG on GPU • HOG computed using previous code [Prisacariu et al. 2009] • Gradients evaluated using convolution kernels from the NVIDIA CUDA SDK. • One thread per pixel; thread block size is 16×16. • Each thread computes its own histogram • PHOG descriptors computed by applying HOG at different scales and merging them. • Downsample the image and send it to HOG.
Category Assignment • Segmentation, Thresholding, keyframes • Extract keyframes from untagged videos. • Compute PHOG for each keyframe • Classify each keyframe independently • K-Nearest Neighbor classifier • Allot each keyframe to the nearest k clusters • Final scoring for category assignment
K-Nearest Neighbor • Classification done based on closest training samples. • K nearest centers evaluated for each frame. • Euclidean distance used as distance metric.
KNN on GPU • Each block handles 'L' new key frames at a time and loops over all key frames. • Find distances for each key frame against all centers sequentially • Deal with each dimension in parallel using a thread • Find the vector distance using a log-step summation • Write back to global memory • Sort using the distance as key for each key frame. • Keep the top k values
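The per-frame classification can be sketched as a brute-force nearest-center search (a serial NumPy stand-in for the GPU kernel; names are assumptions):

```python
import numpy as np

def knn_labels(frame_desc, centers, center_labels, k=3):
    """Return the categories and distances of the k nearest cluster
    centers to one key frame descriptor (Euclidean metric)."""
    d2 = ((centers - frame_desc) ** 2).sum(axis=1)  # squared distances
    nearest = np.argsort(d2)[:k]                    # sort by distance, keep top k
    return center_labels[nearest], np.sqrt(d2[nearest])
```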
Scoring • Use the distance ratio r = d1 / d2 of the distances d1 and d2 to the two nearest neighbors. • If r < threshold, allot a single membership to the keyframe. Threshold used: 0.6 • Assign multiple memberships otherwise; we assign to the top c/2 categories. • Final category: • Count the votes for each category for the video • If the top category is a clear winner, assign to it (20% more score than the next) • Seek manual assignment otherwise.
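The ratio test and voting described above can be sketched as follows (the c/2 split, the 0.6 ratio threshold and the 20% winning margin follow the slide; function names are ours):

```python
def frame_votes(dists, cats, thresh=0.6):
    """One key frame's votes. dists: sorted neighbor distances,
    cats: categories of those neighbors (nearest first)."""
    if dists[1] > 0 and dists[0] / dists[1] < thresh:
        return {cats[0]: 1.0}                 # unambiguous: single membership
    top = cats[: max(1, len(cats) // 2)]      # ambiguous: top c/2 categories
    return {c: 1.0 / len(top) for c in top}

def assign_video(per_frame_votes, margin=0.2):
    """Sum votes over the video; assign if the winner leads by 20%."""
    total = {}
    for votes in per_frame_votes:
        for c, s in votes.items():
            total[c] = total.get(c, 0.0) + s
    ranked = sorted(total.items(), key=lambda kv: -kv[1])
    if len(ranked) == 1 or ranked[0][1] >= (1 + margin) * ranked[1][1]:
        return ranked[0][0]                   # clear winner
    return None                               # ambiguous: seek manual assignment
```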
Results • Selected four popular sport categories • Cricket, Football, Tennis, Table Tennis • Collected a dataset of about 100 videos of 10 to 15 minutes each. • The user tags 3 videos per category. • The rest of the videos are used for testing. • 4 frames considered to represent a shot. • Roughly 200 key frames per category.