
Evaluating FERMI features for Data Mining Applications



Presentation Transcript


  1. Evaluating FERMI features for Data Mining Applications Master's Thesis Presentation Sinduja Muralidharan Advised by: Dr. Gagan Agrawal

  2. Outline • Motivation and Background • FERMI series and the TESLA series GPUs • Reduction based Data Mining Algorithms • Parallelization Methods for GPUs • Experimental Evaluation • Conclusion

  3. Background • GPUs have recently emerged as a major player in high performance computing. • GPUs provide an excellent price-to-performance ratio. • CUDA is well suited to, and popular for, programming a variety of high performance applications. • GPU hardware and software have evolved rapidly: new GPU products and successive versions of CUDA have added new functionality and better performance.

  4. The FERMI GPU • The Fermi series of cards • includes the C2050 and the C2070 cards • is also referred to as the 20-series family of NVIDIA Tesla GPUs. • Support for double precision atomic operations. • Much larger shared memory/L1 cache, which can be configured as • 48 KB shared memory and 16 KB L1 cache, or • 16 KB shared memory and 48 KB L1 cache. • Presence of an L2 cache.
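
The shared memory/L1 split is selected per kernel through the CUDA runtime API. Below is a minimal sketch of requesting the 48 KB shared memory configuration; the kernel name mining_kernel is a hypothetical placeholder, not code from the thesis.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder for any of the reduction kernels discussed later.
__global__ void mining_kernel(float* data) { }

int main() {
    // On Fermi, request the 48 KB shared memory / 16 KB L1 split for this kernel.
    // cudaFuncCachePreferL1 would request the 16 KB shared / 48 KB L1 split instead.
    cudaFuncSetCacheConfig(mining_kernel, cudaFuncCachePreferShared);
    // ... allocate data and launch mining_kernel as usual ...
    return 0;
}
```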

  5. TESLA vs FERMI

  6. Thesis Objective • Optimizing and evaluating the new features of the Fermi series GPUs: • increased shared memory • support for atomic operations on floating point data • Using three parallelization approaches on reduction-based mining algorithms: • full replication in shared memory • improving locking with inbuilt atomic operations • creation of several hybrid versions for optimal performance

  7. Generalized Reductions • op is a function that is both commutative and associative, and Reduc is a data structure referred to as the reduction object • Specific elements of the reduction object updated depend on the results of previous processing • Divide the data instances (or records or transactions) among the processing threads • The reduction object updated in iteration i of the loop is determined as a result of previous processing
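
As a concrete illustration, here is a minimal sequential sketch of this reduction structure. The data, process, and op definitions are hypothetical stand-ins: in k-means, for example, process would return the index of the closest centroid and op would accumulate point coordinates.

```cuda
#include <cstdio>

#define NUM_BINS 4   // size of the reduction object Reduc (hypothetical)

// op: a commutative and associative combination function (here, addition).
float op(float a, float b) { return a + b; }

// process: maps a data instance to an element of the reduction object;
// in general it may depend on the results of previous processing.
int process(float x) { return (int)x % NUM_BINS; }

int main() {
    float data[8]         = {1, 2, 3, 4, 5, 6, 7, 8};  // data instances
    float Reduc[NUM_BINS] = {0, 0, 0, 0};              // the reduction object

    // Generalized reduction loop: which element is updated in iteration i
    // is only known after processing data[i].
    for (int i = 0; i < 8; i++) {
        int key    = process(data[i]);
        Reduc[key] = op(Reduc[key], data[i]);
    }

    for (int k = 0; k < NUM_BINS; k++)
        printf("Reduc[%d] = %.1f\n", k, Reduc[k]);
    return 0;
}
```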

  8. Parallelizing Generalized Reductions • It is not possible to statically partition the reduction object, since different processors update different, disjoint portions at runtime: • this can lead to race conditions. • The execution time of the process function can take up a major chunk of the total execution time of an iteration of the loop, so runtime preprocessing and static scheduling techniques cannot be applied. • Sometimes the reduction object may be too large for its replicas to fit in memory without significant overhead.

  9. Earlier Parallelization Techniques • Earlier attempts to parallelize the Map-Reduce class of applications were limited by • the lack of support for atomic operations on floating point numbers and • the large number of threads required for effective parallelization. • The larger shared memory on Fermi allows total replication of the reduction object for some thread configurations, • which largely avoids the possibility of race conditions and thread contention.

  10. Full Replication • In any shared memory system, the best way to avoid race conditions is to • have each thread keep its own copy of the reduction object in device memory and process each object separately. • At the end of each iteration, a global combination can be performed either by a single thread or by using a tree structure. • The final object is copied back to host memory.
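
A minimal CUDA sketch of full replication, under simplifying assumptions (a small reduction object of NUM_BINS accumulators and a trivial process() stand-in, neither taken from the thesis): every thread accumulates into its own copy in device memory, and a second kernel performs the global combination before the result is copied back to the host.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

#define NUM_BINS 4    // reduction-object size (hypothetical)
#define THREADS  64
#define BLOCKS   2

// Stand-in for process(): maps a data instance to a reduction-object element.
__device__ int process(float x) { return (int)x % NUM_BINS; }

// Each thread owns a private copy of the reduction object in device memory,
// so the main loop needs no locking at all.
__global__ void reduce_replicated(const float* data, int n, float* copies) {
    int tid      = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x * blockDim.x;
    float* my    = copies + tid * NUM_BINS;          // this thread's copy
    for (int i = tid; i < n; i += nthreads)
        my[process(data[i])] += data[i];             // race-free update
}

// Global combination: one thread per element sums across all private copies.
__global__ void combine(const float* copies, int ncopies, float* result) {
    int k = threadIdx.x;
    if (k < NUM_BINS) {
        float sum = 0.0f;
        for (int c = 0; c < ncopies; c++) sum += copies[c * NUM_BINS + k];
        result[k] = sum;
    }
}

int main() {
    const int n = 1 << 16;
    const int ncopies = THREADS * BLOCKS;
    float* h_data = new float[n];
    for (int i = 0; i < n; i++) h_data[i] = (float)(i % 10);

    float *d_data, *d_copies, *d_result;
    cudaMalloc(&d_data,   n * sizeof(float));
    cudaMalloc(&d_copies, ncopies * NUM_BINS * sizeof(float));
    cudaMalloc(&d_result, NUM_BINS * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_copies, 0, ncopies * NUM_BINS * sizeof(float));

    reduce_replicated<<<BLOCKS, THREADS>>>(d_data, n, d_copies);
    combine<<<1, NUM_BINS>>>(d_copies, ncopies, d_result);

    float h_result[NUM_BINS];
    cudaMemcpy(h_result, d_result, sizeof(h_result), cudaMemcpyDeviceToHost);
    for (int k = 0; k < NUM_BINS; k++) printf("bin %d: %.0f\n", k, h_result[k]);
    delete[] h_data;
    return 0;
}
```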

  11. Full Replication in Shared Memory • The factors that affect the performance of the full replication mode of reduction: • the size of the reduction object (together with the number of threads per multiprocessor), • the amount of computation in comparison to the amount of data copied between devices, and • whether or not global data can be copied into shared memory. • On Tesla it was not possible to fit all copies of the reduction object within the 16 KB of available shared memory, • so higher-latency device memory had to be used.
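
A sketch of how the per-thread copies can instead be placed in shared memory on Fermi (names and sizes are hypothetical). The dynamic shared memory size passed at launch must be blockDim.x * NUM_BINS * sizeof(float), which for small reduction objects fits in the 48 KB configuration.

```cuda
#define NUM_BINS 4   // small reduction object (hypothetical), e.g. small k in k-means

__device__ int process(float x) { return (int)x % NUM_BINS; }

// Each thread keeps its private copy of the reduction object in shared memory;
// updates are race-free and low latency. A per-block combination is done by
// thread 0 here (a tree combination is also possible), and a later pass merges
// the per-block results.
__global__ void reduce_shared_replicated(const float* data, int n, float* block_out) {
    extern __shared__ float copies[];                 // blockDim.x * NUM_BINS floats
    float* my = copies + threadIdx.x * NUM_BINS;
    for (int k = 0; k < NUM_BINS; k++) my[k] = 0.0f;  // initialize private copy

    int tid      = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x * blockDim.x;
    for (int i = tid; i < n; i += nthreads)
        my[process(data[i])] += data[i];

    __syncthreads();
    if (threadIdx.x == 0) {
        for (int k = 0; k < NUM_BINS; k++) {
            float sum = 0.0f;
            for (int t = 0; t < blockDim.x; t++) sum += copies[t * NUM_BINS + k];
            block_out[blockIdx.x * NUM_BINS + k] = sum;
        }
    }
}
// Launch: reduce_shared_replicated<<<blocks, threads, threads * NUM_BINS * sizeof(float)>>>(d_data, n, d_block_out);
```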

  12. Full Replication in Shared Memory (continued) • The larger shared memory available on Fermi can hold all copies of the reduction object entirely within shared memory for smaller configurations: • no race conditions and no contention among threads, because each thread updates its own copy of the object. • Global memory accesses are replaced by low latency shared memory accesses.

  13. Locking Scheme • The shared memories of different multiprocessors have no synchronization mechanism, so a separate copy of the reduction object is placed in the shared memory of each multiprocessor. • While performing updates on the reduction object, all threads of a thread block use locking to avoid race conditions. • Finally, a global combination is performed over the updates accumulated on the different multiprocessors.
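
A sketch of the locking scheme under the same simplified assumptions (hypothetical names and sizes): one copy of the reduction object per thread block in shared memory, updated with atomic operations. The atomicAdd on float shown here is Fermi's inbuilt instruction; on Tesla the wrapper shown on the next slide would have to be used instead.

```cuda
#define NUM_BINS 256   // a larger reduction object (hypothetical)

__device__ int process(float x) { return (int)x % NUM_BINS; }

// One copy per block (i.e. per multiprocessor); all threads of the block update
// it atomically, and the per-block results are combined globally afterwards.
__global__ void reduce_locked(const float* data, int n, float* block_out) {
    __shared__ float shared_copy[NUM_BINS];
    for (int k = threadIdx.x; k < NUM_BINS; k += blockDim.x)
        shared_copy[k] = 0.0f;
    __syncthreads();

    int tid      = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x * blockDim.x;
    for (int i = tid; i < n; i += nthreads)
        atomicAdd(&shared_copy[process(data[i])], data[i]);   // contended update

    __syncthreads();
    // Write the block's copy to global memory; a separate kernel (or atomicAdd
    // on global memory) performs the combination across multiprocessors.
    for (int k = threadIdx.x; k < NUM_BINS; k += blockDim.x)
        block_out[blockIdx.x * NUM_BINS + k] = shared_copy[k];
}
```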

  14. Locking: TESLA vs FERMI • Fine-grained locking: TESLA (wrapper-based) vs. FERMI (inbuilt atomics) [code comparison shown as figures in the original slides]
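
The slide contrasted the two code paths as figures; the sketch below reconstructs the general idea rather than the thesis code itself. On Tesla, an atomic floating point add has to be emulated with a wrapper around atomicCAS operating on the value's bit pattern, whereas Fermi (compute capability 2.x) provides atomicAdd on float directly.

```cuda
// TESLA-style wrapper: emulate an atomic float add by retrying a compare-and-swap
// on the value's bit pattern until no other thread has intervened.
__device__ float atomicAddWrapper(float* address, float val) {
    int* address_as_int = (int*)address;
    int old = *address_as_int, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(val + __int_as_float(assumed)));
    } while (assumed != old);
    return __int_as_float(old);
}

// FERMI: the inbuilt hardware atomic suffices for the same update.
__device__ void fermi_update(float* reduc, int key, float val) {
    atomicAdd(&reduc[key], val);
}
```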

  15. The Hybrid Scheme • Full replication: • a private copy of the reduction object is needed for each thread in a block, • larger reduction objects are stored in the high latency global device memory, • and the cost of combination can be very high. • Locking: • a single copy of the reduction object is stored in shared memory, • which greatly reduces the global combination (only one copy per multiprocessor needs to be combined), • but contention among threads in a block is very high. • Configuring an application with a larger number of threads per multiprocessor typically leads to better performance, • since latencies can be masked by context switching between warps.
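
A sketch of one possible hybrid kernel, assuming M groups per thread block and hypothetical sizes. Each group shares one copy of the reduction object in shared memory and updates it atomically, so contention is spread over M copies while only M copies (rather than one per thread) have to fit in shared memory and be combined.

```cuda
#define NUM_BINS 64   // reduction-object size (hypothetical)
#define M        4    // number of groups (copies) per block (hypothetical)

__device__ int process(float x) { return (int)x % NUM_BINS; }

__global__ void reduce_hybrid(const float* data, int n, float* block_out) {
    __shared__ float copies[M][NUM_BINS];
    float* flat = &copies[0][0];
    int group   = threadIdx.x % M;                    // this thread's group
    for (int k = threadIdx.x; k < M * NUM_BINS; k += blockDim.x)
        flat[k] = 0.0f;
    __syncthreads();

    int tid      = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x * blockDim.x;
    for (int i = tid; i < n; i += nthreads)
        atomicAdd(&copies[group][process(data[i])], data[i]);

    __syncthreads();
    // Combine the M group copies, then write the block's result for the final
    // global combination across multiprocessors.
    for (int k = threadIdx.x; k < NUM_BINS; k += blockDim.x) {
        float sum = 0.0f;
        for (int g = 0; g < M; g++) sum += copies[g][k];
        block_out[blockIdx.x * NUM_BINS + k] = sum;
    }
}
```

Varying M interpolates between pure locking (M = 1) and per-thread replication (M = blockDim.x), which is presumably how the different hybrid versions evaluated below differ.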

  16. The Hybrid Scheme (continued) • When choosing the number of groups, M: • M copies of the reduction object should still fit into the shared memory. • If the reduction object is big, the overhead of combination is higher than the overhead of contention. • When the object is smaller, the contention overhead dominates over the combination overhead. • Since it is desirable to keep the contention overhead small, a larger number of groups is preferable. • Several hybrid versions were created and evaluated on Fermi • to study the optimal balance between contention and combination overheads.

  17. Experimental Evaluation • Environment: • TESLA: NVIDIA Tesla C1060 GPU with 240 cores, a clock frequency of 1.296 GHz, and 4 GB of device memory. • FERMI: NVIDIA Tesla C2050 GPU with 448 processor cores, a clock frequency of 1.15 GHz, and 3 GB of device memory.

  18. Observations • For larger reduction objects, the hybrid approach generally outperforms the replication and the locking approaches: • the combination overhead of full replication dominates. • For smaller reduction objects, full replication in shared memory yields the best performance: • the contention overhead of locking dominates. • Inbuilt support for atomic floating point operations outperforms the previously used wrapper-based implementation.

  19. K-Means Results • Wrapper-based implementation of atomic floating point operations, k = 10 • Inbuilt support for atomic floating point operations, k = 10

  20. K-Means Results • Wrapper-based implementation of atomic floating point operations, k = 100 • Inbuilt support for atomic floating point operations, k = 100

  21. K-Means Results • Hybrid versions for k = 10 • Hybrid versions for k = 100

  22. PCA Results • Comparison of parallelization schemes with the wrapper-based implementation, 16 columns • Comparison of parallelization schemes with inbuilt atomic floating point operations, 32 columns

  23. PCA Results • Hybrid versions for 16 columns • Hybrid versions for 32 columns

  24. kNN Results

  25. kNN Results

  26. Conclusions • The new features of the Fermi series GPU cards: • support for inbuilt atomic double precision operations and • an increase in the amount of available shared memory • were evaluated using three reduction-based data mining algorithms. • Performance depends on the balance between the overheads of thread contention and global combination: • for smaller clusters, contention is the dominant factor; • for larger clusters, combination overhead dominates.

  27. Thank You! Questions?
