Evaluating FERMI features for Data Mining Applications
Masters Thesis Presentation
Sinduja Muralidharan
Advised by: Dr. Gagan Agrawal
Outline
• Motivation and Background
• The FERMI series and the TESLA series GPUs
• Reduction based Data Mining Algorithms
• Parallelization Methods for GPUs
• Experimental Evaluation
• Conclusion
Background
• GPUs have recently emerged as a major player in high performance computing.
• GPUs provide an excellent price to performance ratio.
• CUDA is well suited to, and popular for, programming a variety of high performance applications.
• GPU hardware and software have evolved rapidly: new GPU products and successive versions of CUDA have added new functionality and better performance.
The FERMI GPU
• The Fermi series of cards:
  • includes the C2050 and C2070 cards
  • is also referred to as the 20-series family of NVIDIA Tesla GPUs
• Support for double precision atomic operations.
• A much larger shared memory/L1 cache, which can be configured as:
  • 48 KB shared memory and 16 KB L1 cache, or
  • 16 KB shared memory and 48 KB L1 cache
• Presence of an L2 cache.
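The shared memory/L1 split is selected per kernel through the CUDA runtime API. A minimal sketch (the kernel name is a placeholder, not from the thesis):

```cuda
#include <cuda_runtime.h>

// Placeholder kernel: any kernel that keeps reduction objects in
// shared memory benefits from the 48 KB shared memory configuration.
__global__ void reductionKernel(const float* data, float* reduc, int n) { }

int main() {
    // cudaFuncCachePreferShared -> 48 KB shared memory / 16 KB L1 cache
    // cudaFuncCachePreferL1     -> 16 KB shared memory / 48 KB L1 cache
    cudaFuncSetCacheConfig(reductionKernel, cudaFuncCachePreferShared);
    return 0;
}
```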
Thesis Objective
• Optimizing and evaluating the new features of the Fermi series GPUs:
  • increased shared memory
  • support for atomic operations on floating point data
• Using three parallelization approaches on reduction based mining algorithms:
  • full replication in shared memory
  • improved locking with inbuilt atomic operations
  • creation of several hybrid versions for optimal performance
Generalized Reductions
• op is a function that is both commutative and associative, and Reduc is a data structure referred to as the reduction object.
• The specific elements of the reduction object that are updated depend on the results of previous processing.
• Divide the data instances (or records or transactions) among the processing threads.
• The reduction object element updated in iteration i of the loop is determined as a result of previous processing.
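A minimal sequential sketch of this structure, using a k-means-flavored accumulation as the concrete instance (the helper name and the 1-D data are illustrative assumptions, not the thesis code):

```cuda
#include <math.h>

// "process": maps a data instance to an element of the reduction object.
// Illustrative stand-in: index of the nearest 1-D centroid.
static int nearestCentroid(float x, const float* centroids, int k) {
    int best = 0;
    for (int j = 1; j < k; j++)
        if (fabsf(x - centroids[j]) < fabsf(x - centroids[best])) best = j;
    return best;
}

// Generalized reduction: the update (op) is commutative and associative,
// so the loop over data instances can be divided among threads.
void generalizedReduction(const float* data, int n,
                          const float* centroids, int k,
                          float* sums, int* counts) {
    for (int i = 0; i < n; i++) {
        int key = nearestCentroid(data[i], centroids, k); // data-dependent
        sums[key] += data[i];   // update the reduction object
        counts[key] += 1;
    }
}
```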
Parallelizing Generalized Reductions
• It is not possible to statically partition the reduction object, so different processors update different, disjoint portions at runtime:
  • this can lead to race conditions.
• The execution time of the process function can take up a major chunk of the total execution time of a loop iteration, so runtime preprocessing and static scheduling techniques cannot be applied.
• The reduction object may be too large to fit replicas of it in memory without significant overhead.
Earlier Parallelization Techniques
• Attempts to parallelize the Map-Reduce class of applications were limited by:
  • the lack of support for atomic operations on floating point numbers
  • the large number of threads required for effective parallelization.
• Fermi's larger shared memory allows total replication of the reduction object for some thread configurations:
  • this largely avoids the possibility of race conditions and thread contention.
Full Replication
• In any shared memory system, the best way to avoid race conditions is to:
  • have each thread keep its own copy of the reduction object in device memory and process each data instance separately;
  • at the end of each iteration, perform a global combination, either with a single thread or using a tree structure;
  • copy the final object back to host memory.
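A CUDA sketch of this scheme, under the same illustrative k-means-style assumptions as above (the names and the simple one-pass combination are simplifications, not the thesis implementation):

```cuda
// Device-side analogue of the nearestCentroid helper above.
__device__ int nearestCentroidDev(float x, const float* c, int k) {
    int best = 0;
    for (int j = 1; j < k; j++)
        if (fabsf(x - c[j]) < fabsf(x - c[best])) best = j;
    return best;
}

// Full replication in device memory: reducCopies holds one k-element
// copy of the reduction object per thread (assumed zero-initialized
// with cudaMemset before launch), so updates never race.
__global__ void replicatedReduce(const float* data, int n,
                                 const float* centroids, int k,
                                 float* reducCopies) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float* myCopy = reducCopies + tid * k;   // this thread's private copy
    for (int i = tid; i < n; i += stride)
        myCopy[nearestCentroidDev(data[i], centroids, k)] += data[i];
}

// Global combination: accumulate all private copies into copy 0.
// (A tree combination would do this in log2(numThreads) rounds.)
__global__ void combineCopies(float* reducCopies, int numThreads, int k) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= k) return;
    for (int t = 1; t < numThreads; t++)
        reducCopies[j] += reducCopies[t * k + j];
}
```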
Full Replication in Shared Memory
• The factors that affect the performance of the full replication mode of reduction:
  • the size of the reduction object (which depends on the number of threads per multiprocessor);
  • the amount of computation relative to the amount of data copied between host and device;
  • whether or not global data can be copied into shared memory.
• On Tesla it was not possible to fit all copies of the reduction object within the 16 KB of available shared memory:
  • the higher latency device memory had to be used instead.
Full Replication in Shared Memory (continued)
• The larger shared memory available on Fermi can hold all copies of the reduction object for smaller thread configurations:
  • no race conditions and no contention among threads, because each thread updates its own copy of the object;
  • global memory accesses are replaced by low latency shared memory accesses.
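A sketch of the shared memory variant (same illustrative assumptions; it reuses nearestCentroidDev from the previous sketch and must be launched with blockDim.x * k * sizeof(float) bytes of dynamic shared memory):

```cuda
// Full replication in shared memory: each thread's private copy of the
// k-element reduction object lives in low latency shared memory, which
// is feasible on Fermi when blockDim.x * k floats fit in 48 KB.
__global__ void sharedReplicatedReduce(const float* data, int n,
                                       const float* centroids, int k,
                                       float* blockResults) {
    extern __shared__ float copies[];             // blockDim.x * k floats
    float* myCopy = copies + threadIdx.x * k;
    for (int j = 0; j < k; j++) myCopy[j] = 0.0f;

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = tid; i < n; i += stride)
        myCopy[nearestCentroidDev(data[i], centroids, k)] += data[i];
    __syncthreads();

    // Per-block combination; the per-block partial results are then
    // combined on the host or by a small follow-up kernel.
    if (threadIdx.x == 0)
        for (int j = 0; j < k; j++) {
            float s = 0.0f;
            for (int t = 0; t < blockDim.x; t++) s += copies[t * k + j];
            blockResults[blockIdx.x * k + j] = s;
        }
}
```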
Locking Scheme
• The shared memories of different multiprocessors have no synchronization mechanism between them, so a separate copy of the reduction object is placed in the shared memory of each multiprocessor.
• While performing updates on the reduction object, all threads of a thread block use locking to avoid race conditions.
• Finally, a global combination is performed over the updates accumulated on the different multiprocessors.
Locking: TESLA vs. FERMI
• Fine Grained Locking:
  • TESLA: wrapper based atomic floating point operations
  • FERMI: inbuilt atomic floating point operations
[The code snippets shown on this slide did not survive extraction.]
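The original snippets are not recoverable, but the contrast is well known: pre-Fermi cards emulate floating point atomicAdd with a compare-and-swap loop (the standard wrapper idiom from the CUDA C Programming Guide), while Fermi (compute capability 2.x) provides single precision atomicAdd in hardware, in both global and shared memory. A sketch:

```cuda
// TESLA: no hardware atomicAdd on float, so it is emulated with an
// atomicCAS loop over the value's bit pattern (standard CUDA idiom).
__device__ float atomicAddWrapper(float* address, float val) {
    int* address_as_int = (int*)address;
    int old = *address_as_int, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(val + __int_as_float(assumed)));
    } while (assumed != old);
    return __int_as_float(old);
}

// FERMI: a single hardware instruction replaces the whole loop.
__device__ void lockedUpdate(float* reduc, int key, float val) {
    atomicAdd(&reduc[key], val);
}
```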
The Hybrid Scheme
• Full replication:
  • a private copy of the reduction object is needed for each thread in a block;
  • larger reduction objects must be stored in the high latency global device memory;
  • the cost of combination can be very high.
• Locking:
  • a single copy of the reduction object is stored in shared memory, eliminating the per-thread combination step;
  • contention among the threads in a block is very high.
• Configuring an application with a larger number of threads per multiprocessor typically leads to better performance, since latencies can be masked by context switching between warps.
The Hybrid Scheme (continued)
• The threads of a block are divided into M groups, and each group shares one copy of the reduction object (see the sketch below).
• While choosing the number of groups M:
  • M copies of the reduction object should still fit into the shared memory;
  • if the reduction object is big, the overhead of combination is higher than the overhead of contention;
  • when the object is smaller, the contention overhead dominates the combination overhead;
  • since it is desirable to keep the contention overhead small, a larger number of groups is preferable.
• Several hybrid versions were created and evaluated on Fermi to study the optimal balance between contention and combination overheads.
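A sketch of one hybrid block under the same illustrative assumptions (the group assignment and combination layout are guesses at a reasonable design, not the thesis code); it reuses nearestCentroidDev and needs M * k * sizeof(float) bytes of dynamic shared memory:

```cuda
// Hybrid scheme: M shared memory copies per block, each shared by
// blockDim.x / M threads. Small M -> more contention on each copy;
// large M -> more copies to combine at the end.
__global__ void hybridReduce(const float* data, int n,
                             const float* centroids, int k, int M,
                             float* blockResults) {
    extern __shared__ float copies[];              // M * k floats
    float* groupCopy = copies + (threadIdx.x % M) * k;

    for (int j = threadIdx.x; j < M * k; j += blockDim.x) copies[j] = 0.0f;
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = tid; i < n; i += stride) {
        int key = nearestCentroidDev(data[i], centroids, k);
        atomicAdd(&groupCopy[key], data[i]);   // contended only within group
    }
    __syncthreads();

    // Combine the M group copies: cheaper than combining one copy per
    // thread, and less contended than a single copy per block.
    for (int j = threadIdx.x; j < k; j += blockDim.x) {
        float s = 0.0f;
        for (int g = 0; g < M; g++) s += copies[g * k + j];
        blockResults[blockIdx.x * k + j] = s;
    }
}
```

Note that M = 1 recovers the locking scheme and M = blockDim.x recovers full replication in shared memory, which is what makes this a tunable middle ground.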
Experimental Evaluation
• Environment:
  • TESLA: NVIDIA Tesla C1060 GPU with 240 cores, a clock frequency of 1.296 GHz, and 4 GB of device memory.
  • FERMI: NVIDIA Tesla C2050 GPU with 448 processor cores, a clock frequency of 1.15 GHz, and 3 GB of device memory.
Observations
• For larger reduction objects, the hybrid approach generally outperforms the replication and locking approaches:
  • combination overhead dominates.
• For smaller reduction objects, full replication in shared memory yields the best performance:
  • contention overhead dominates.
• Inbuilt support for atomic floating point operations outperforms the previously used wrapper based implementation.
K-Means Results
[Charts: wrapper based implementation of atomic floating point operations (k = 10) vs. inbuilt support for atomic floating point operations (k = 10).]
K-Means Results
[Charts: wrapper based implementation of atomic floating point operations (k = 100) vs. inbuilt support for atomic floating point operations (k = 100).]
K-Means Results
[Charts: hybrid versions for k = 10 and for k = 100.]
PCA Results
[Charts: comparison of parallelization schemes with the wrapper based implementation (16 columns) and with inbuilt atomic floating point support (32 columns).]
PCA Results
[Charts: hybrid versions for 16 columns and for 32 columns.]
Conclusions
• The new features of the Fermi series GPU cards:
  • support for inbuilt atomic double precision operations
  • an increase in the amount of available shared memory
• These features were evaluated with three reduction based data mining algorithms.
• Performance hinges on the balance between the overheads of thread contention and global combination:
  • for smaller clusters, contention is the dominant factor;
  • for larger clusters, combination overhead dominates.