Evaluating FERMI features for Data Mining Applications. Masters Thesis Presentation Sinduja Muralidharan Advised by: Dr. Gagan Agrawal. Outline. Motivation and Background FERMI series and the TESLA series GPUs Reduction based Data Mining Algorithms Parallelization Methods for GPUs
Evaluating FERMI features for Data Mining Applications • Masters Thesis Presentation • Sinduja Muralidharan • Advised by: Dr. Gagan Agrawal
Outline • Motivation and Background • FERMI series and the TESLA series GPUs • Reduction based Data Mining Algorithms • Parallelization Methods for GPUs • Experimental Evaluation • Conclusion
Background • GPUs have recently emerged as a major player in high performance computing. • GPUs offer an excellent price to performance ratio. • CUDA is well suited to, and widely used for, programming a variety of high performance applications. • GPU hardware and software have evolved rapidly: new GPU products and successive versions of CUDA have added functionality and improved performance.
The FERMI GPU • The Fermi series of cards • includes the C2050 and the C2070 cards. • is also referred to as the 20-series family of NVIDIA Tesla GPUs. • Support for atomic operations on floating point data. • Much larger shared memory/L1 cache, which can be configured as • 48 kB shared memory, 16 kB L1 cache or • 16 kB shared memory, 48 kB L1 cache • Presence of an L2 cache.
Thesis Objective • Evaluating the new features of the Fermi series GPUs and optimizing applications for them: • Increased shared memory • Support for atomic operations on floating point data • Using three parallelization approaches on reduction based mining algorithms: • Full replication in shared memory • Improved locking with built-in atomic operations • Several hybrid versions created for optimal performance
Generalized Reductions • op is a function that is both commutative and associative, and Reduc is a data structure referred to as the reduction object. • The specific elements of the reduction object that are updated depend on the results of previous processing. • The data instances (or records or transactions) are divided among the processing threads. • The element of the reduction object updated in iteration i of the loop is determined as a result of previous processing.
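The loop structure described on this slide can be sketched on the host side as follows. This is a minimal sequential Python illustration, not GPU code and not the thesis implementation; the helper names (`generalized_reduction`, `make_nearest_center`) and the one-dimensional nearest-center example are assumptions chosen for illustration.

```python
from operator import add


def generalized_reduction(data, process, op, reduc):
    """Apply reduc[i] = op(reduc[i], val) for every data instance."""
    for d in data:
        # The element i to update is only known after processing d.
        i, val = process(d)
        reduc[i] = op(reduc[i], val)
    return reduc


def make_nearest_center(centers):
    """Hypothetical process function: count points per nearest 1-D center."""
    def process(point):
        i = min(range(len(centers)), key=lambda c: abs(point - centers[c]))
        return i, 1
    return process


points = [1.0, 2.0, 9.0, 11.0, 4.0]
counts = generalized_reduction(points, make_nearest_center([0.0, 10.0]), add, [0, 0])
```

Because op is commutative and associative, the data instances can be processed in any order, which is what makes it possible to split them among threads.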
Parallelizing Generalized Reductions • The reduction object cannot be statically partitioned so that different processors update disjoint portions, because the element to update is only known at runtime: • This can lead to race conditions. • The execution time of the process function can take up a major part of the total execution time of a loop iteration, so runtime preprocessing and static scheduling techniques cannot be applied. • Sometimes the reduction object is too large for replicas to fit in memory without significant overheads.
Earlier Parallelization Techniques • Attempts to parallelize the Map-Reduce class of applications faced: • a lack of support for atomic operations on floating point numbers • the large number of threads required for effective parallelization. • The larger shared memory allows total replication of the reduction object for some thread configurations, • which largely avoids race conditions and thread contention.
Full Replication • In any shared memory system, the best way to avoid race conditions would be to: • Have each thread keep its own copy of the reduction object in device memory and update it separately. • At the end of each iteration, perform a global combination, either by a single thread or using a tree structure. • Copy the final object back to host memory.
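The replication and combination steps above can be sketched as follows. This is a host-side Python illustration with ordinary threads standing in for GPU threads; the function name `full_replication`, the strided split of the data, and the additive reduction are assumptions, not details taken from the thesis.

```python
import threading


def full_replication(data, num_threads, size, process):
    """Each thread updates a private copy; copies are merged in a tree."""
    copies = [[0.0] * size for _ in range(num_threads)]

    def worker(t):
        # Thread t only ever touches copies[t], so no locking is needed.
        for d in data[t::num_threads]:
            i, val = process(d)
            copies[t][i] += val

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()

    # Tree-structured combination: at each level, copy t absorbs copy t + stride.
    stride = 1
    while stride < num_threads:
        for t in range(0, num_threads - stride, 2 * stride):
            for j in range(size):
                copies[t][j] += copies[t + stride][j]
        stride *= 2
    return copies[0]
```

For example, `full_replication(list(range(100)), 8, 4, lambda d: (d % 4, 1.0))` counts how many values fall into each residue class modulo 4. The combination takes about log2(num_threads) levels here, but its cost still grows with the size of the reduction object.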
Full Replication in Shared Memory • Factors that affect the performance of the full replication mode of reduction: • the size of the reduction object and of its replicas (which depends on the number of threads per multiprocessor), • the amount of computation relative to the amount of data copied between devices, and • whether or not global data can be copied into shared memory. • On Tesla, not all copies of the reduction object could fit within the 16 kB of shared memory available, • so higher latency device memory had to be used.
Full Replication in Shared Memory (continued) • For smaller configurations, the larger shared memory in Fermi can hold all copies of the reduction object: • There are no race conditions and no contention among threads, because each thread updates its own copy of the object. • Global memory accesses are replaced by low latency shared memory accesses.
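Whether full replication fits in shared memory is a matter of simple arithmetic. The sketch below works through one hypothetical configuration; the reduction object of 40 single precision values and the thread counts are assumptions for illustration, and other uses of shared memory are ignored.

```python
TESLA_SHARED_BYTES = 16 * 1024   # 16 kB of shared memory per multiprocessor
FERMI_SHARED_BYTES = 48 * 1024   # Fermi configured as 48 kB shared, 16 kB L1


def replicated_bytes(num_threads, elements, bytes_per_element=4):
    """Shared memory needed when every thread holds a private copy."""
    return num_threads * elements * bytes_per_element


def fits(num_threads, elements, shared_bytes, bytes_per_element=4):
    return replicated_bytes(num_threads, elements, bytes_per_element) <= shared_bytes


def max_copies(elements, shared_bytes, bytes_per_element=4):
    """Largest number of private copies that fit in shared memory."""
    return shared_bytes // (elements * bytes_per_element)


# Hypothetical object: 40 floats (160 bytes), 128 threads per block.
needed = replicated_bytes(128, 40)            # 20480 bytes, i.e. 20 kB
on_tesla = fits(128, 40, TESLA_SHARED_BYTES)  # False: exceeds 16 kB
on_fermi = fits(128, 40, FERMI_SHARED_BYTES)  # True: within 48 kB
```

Under these assumptions at most 102 copies fit on Tesla and 307 on Fermi, so a configuration of 128 threads per block can be fully replicated in shared memory only on Fermi.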
Locking Scheme • The shared memories of different multiprocessors have no synchronization mechanism between them, so a separate copy of the reduction object is placed in the shared memory of each multiprocessor. • While updating the reduction object, all threads of a thread block use locking to avoid race conditions. • Finally, a global combination is performed over the updates accumulated on the different multiprocessors.
Locking: TESLA vs FERMI • Fine grained locking: • TESLA: [code listing] • FERMI: [code listing]
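The difference between a wrapper based atomic floating point add and a built-in one can be illustrated with a small host-side emulation. A common way to build such a wrapper is a compare-and-swap retry loop over the 32-bit pattern of the float; whether the implementation in this work used exactly this pattern is an assumption. The `Word32` class and all function names below are hypothetical, and a Python lock stands in for hardware atomicity.

```python
import struct
import threading


def float_to_bits(x):
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(b):
    return struct.unpack("<f", struct.pack("<I", b))[0]


class Word32:
    """Emulated 32-bit memory word; the lock stands in for hardware atomicity."""

    def __init__(self, value=0.0):
        self._bits = float_to_bits(value)
        self._guard = threading.Lock()

    def load(self):
        with self._guard:
            return self._bits

    def compare_and_swap(self, expected, new):
        """Store new only if the word still holds expected; return the old bits."""
        with self._guard:
            old = self._bits
            if old == expected:
                self._bits = new
            return old

    def atomic_add(self, val):
        """Emulates a built-in atomic floating point add (single step)."""
        with self._guard:
            old = bits_to_float(self._bits)
            self._bits = float_to_bits(old + val)
            return old


def wrapper_atomic_add(word, val):
    """Wrapper style: retry until no other thread changed the word in between."""
    old = word.load()
    while True:
        assumed = old
        new = float_to_bits(bits_to_float(assumed) + val)
        old = word.compare_and_swap(assumed, new)
        if old == assumed:          # compare bit patterns, not float values
            return bits_to_float(old)


def builtin_atomic_add(word, val):
    return word.atomic_add(val)


def hammer(add_fn, num_threads, adds_per_thread):
    """Let several threads add 1.0 to one word and return the final value."""
    word = Word32(0.0)

    def worker():
        for _ in range(adds_per_thread):
            add_fn(word, 1.0)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return bits_to_float(word.load())
```

Both variants produce the same result; the wrapper simply repeats its read, add, and swap whenever another thread got in first, which is where its extra cost under contention comes from.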
The Hybrid Scheme • Full replication: • A private copy of the reduction object is needed for each thread in a block. • Larger reduction objects must be stored in the high latency global device memory. • The cost of combination can be very high. • Locking: • A single copy of the reduction object is stored in the shared memory. • Eliminates the need for global combination. • Contention among threads in a block is very high. • Configuring an application with a larger number of threads per multiprocessor typically leads to better performance, • because latencies can be masked by context switching between warps.
The Hybrid Scheme (continued) • When choosing the number of groups, M: • M copies of the reduction object should still fit into the shared memory. • If the reduction object is big, the overhead of combination is higher than the overhead of contention. • When the object is smaller, the contention overhead dominates the combination overhead. • Since it is desirable to keep the contention overhead small, a larger number of groups is preferable. • Several hybrid versions were created and evaluated on Fermi • to study the optimal balance between contention and combination overheads.
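The trade-off controlled by M can be sketched as follows, assuming the threads of a block are divided into M groups and each group shares one copy of the reduction object that is updated under locking. This is a host-side Python illustration with hypothetical names (`hybrid_reduction`, a round-robin assignment of threads to groups), not the thesis implementation.

```python
import threading


def hybrid_reduction(data, num_threads, num_groups, size, process):
    """M = num_groups copies; threads in a group share one copy via locks."""
    copies = [[0.0] * size for _ in range(num_groups)]
    # One lock per element of each copy (fine grained locking).
    locks = [[threading.Lock() for _ in range(size)] for _ in range(num_groups)]

    def worker(t):
        g = t % num_groups               # group of thread t (assumed mapping)
        for d in data[t::num_threads]:
            i, val = process(d)
            with locks[g][i]:            # contention only within group g
                copies[g][i] += val

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()

    # Combination cost grows with the number of copies M.
    result = [0.0] * size
    for copy in copies:
        for j in range(size):
            result[j] += copy[j]
    return result
```

With `num_groups=1` this degenerates to the pure locking scheme, and with `num_groups=num_threads` to full replication; values in between trade contention within a group against the cost of combining M copies.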
Experimental Evaluation • Environment: • TESLA: NVIDIA Tesla C1060 GPU with 240 cores, a clock frequency of 1.296 GHz and 4 GB of device memory. • FERMI: NVIDIA Tesla C2050 GPU with 448 processor cores, a clock frequency of 1.15 GHz and 3 GB of device memory.
Observations • For larger reduction objects, the hybrid approach generally outperforms the replication and the locking approaches. • Combination overhead dominates. • For smaller reduction objects, full replication in shared memory yields the best performance. • Contention overhead dominates. • Inbuilt support for atomic floating point operations outperforms the previously used wrapper based implementation.
K-Means Results • Wrapper based implementation of atomic floating point operations, k=10 • Inbuilt support for atomic floating point operations, k=10
K-Means Results • Wrapper based implementation of atomic floating point operations, k=100 • Inbuilt support for atomic floating point operations, k=100
K-Means Results • Hybrid versions for k=10 • Hybrid versions for k=100
PCA Results • Comparison of parallelization schemes with wrapper based implementation, 16 columns • Comparison of parallelization schemes with inbuilt atomic floating point, 32 columns
PCA Results • Hybrid versions for 16 columns • Hybrid versions for 32 columns
Conclusions • The new features of the Fermi series GPU cards: • inbuilt support for atomic floating point operations • increase in the amount of available shared memory • These were evaluated on three reduction based data mining algorithms. • Performance depends on the balance between the overheads of thread contention and global combination. • For smaller numbers of clusters, contention is the dominant factor. • For larger numbers of clusters, combination overhead dominates.