
Profiling Heterogeneous Multi-GPU Systems to Accelerate Cortically Inspired Learning Algorithms


Presentation Transcript


  1. Profiling Heterogeneous Multi-GPU Systems to Accelerate Cortically Inspired Learning Algorithms Andrew Nere, Atif Hashmi, and Mikko Lipasti University of Wisconsin-Madison IPDPS 2011 Review student: Fan Bai Instructor: Dr. Sushil Prasad 2012.03.21

  2. Outline • Purpose • Background • Why it can be parallelized • Mapping to CUDA • Optimization methods • Experiment results

  3. The purpose of this paper Utilize Nvidia GPUs to accelerate a neocortex-inspired learning algorithm

  4. What is the Neocortex? • The neocortex is the part of the brain that is unique to mammals and is mostly responsible for executive processing skills such as mathematics, music, language, vision, and perception. • The neocortex comprises around 77% of the entire human brain. • For a typical adult, it is estimated that the neocortex has around 11.5 billion neurons and 360 trillion synapses (connections between neurons).

  5. Neocortex • Hierarchical and regular structure • Composed of cortical columns. Neuroscientist Vernon Mountcastle was the first to observe the structural uniformity of the neocortex. He proposed that the neocortex is composed of millions of nearly identical functional units, which he termed cortical columns because of their seemingly column-shaped organization.

  6. Neocortex • Neuroscientists Hubel and Mountcastle further classified cortical columns into hypercolumns and minicolumns.

  7. Cortical Columns • Minicolumns: 80-100 neurons; represent unique features; share a common receptive field • Hypercolumns: 50-100 minicolumns; the functional unit of the neocortex • Connectivity: lateral, feedforward (bottom-up), and feedback (top-down)
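To make the hierarchy concrete, here is a minimal sketch of how these units might be laid out as data structures. All names and array sizes are illustrative assumptions, not the authors' actual implementation:

// Hypothetical data layout for the cortical model; field names and
// sizes are assumptions for illustration, not the paper's code.
struct Minicolumn {
    float weights[128];    // synaptic weights over the shared receptive field
    float activation;      // this minicolumn's output (its unique feature)
};

struct Hypercolumn {
    Minicolumn minicolumns[64];  // roughly 50-100 minicolumns in practice
    int firstInput;              // offset of this unit's inputs in the
    int numInputs;               // lower level's activation array
};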

  8. Cortical Network Model

  9. Highly Parallel

  10. Nvidia CUDA • "Compute Unified Device Architecture" • Hardware: Streaming Multiprocessors (SMs); shared memory (16-48 KB); DRAM (1-6 GB) • Programming framework: threads (thousands of them); Cooperative Thread Arrays (CTAs), i.e. groups of threads; kernels, i.e. groups of CTAs
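To make these terms concrete, the following minimal CUDA sketch (my own example, not from the paper) launches a kernel as a grid of CTAs, each holding 256 threads that cooperate through per-CTA shared memory:

#include <cuda_runtime.h>

// Each CTA (thread block) stages its tile of the input through shared
// memory; each thread then produces one output element.
__global__ void scale(const float* in, float* out, float factor, int n) {
    __shared__ float tile[256];               // per-CTA shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];     // stage through shared memory
    __syncthreads();                          // all threads reach the barrier
    if (i < n) out[i] = tile[threadIdx.x] * factor;
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    // A kernel is a grid of CTAs: here 4096 CTAs of 256 threads each.
    scale<<<n / 256, 256>>>(in, out, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}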

  11. Mapping to CUDA

  12. Experimental Setup

  13. GPGPU Performance

  14. Limitations of Multiple Kernels • Problem: multiple kernel launch overhead (1-2.5% of execution time; no CTA-to-CTA communication) • Problem: GPGPU resources underutilized (convergence is a key part of the model/algorithm, so performance benefits diminish: 50x speedup for large layers, but >10x SLOWDOWN for small layers)

  15. Execution time • We can see that 1-2.5% of the total execution time for a hierarchy is spent on the additional kernel launch overhead, with smaller cortical networks suffering from larger overhead.

  16. GPGPU resources underutilized (1) 50x speedup for large layers (2) > 10x SLOWDOWN for small layers
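For context, the baseline pays these penalties because it launches one kernel per hierarchy level. A hypothetical host-side loop of that shape (an illustration, not the authors' code) shows why deep networks pay the launch cost repeatedly and why small upper levels leave most SMs idle:

// One launch per level: levels are serialized through the same stream,
// so each launch both pays overhead and waits on its predecessor.
__global__ void evaluateLevel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];  // stand-in for the real hypercolumn update
}

void evaluateHierarchy(float** act, const int* unitsPerLevel, int levels) {
    for (int lv = 0; lv < levels; ++lv) {
        int n = unitsPerLevel[lv + 1];  // outputs produced at the next level
        evaluateLevel<<<(n + 127) / 128, 128>>>(act[lv], act[lv + 1], n);
        // Small upper levels launch only a handful of CTAs, leaving
        // most streaming multiprocessors idle.
    }
    cudaDeviceSynchronize();
}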

  17. Algorithmic Optimizations: Pipelining to Increase Resource Utilization Solution 1: pipeline cortical network execution. (1) A single kernel with one hypercolumn per CTA. (2) A double buffer maintains dependencies: the double buffer between hierarchy levels guarantees that producer-consumer relationships are enforced. (3) This improves resource utilization, but still requires multiple kernel launches to fully propagate an input, and it increases storage overhead.
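A minimal sketch of the double-buffering idea, assuming one hypercolumn per CTA and a flat activation array per buffer (the names and layout are my assumptions, not the paper's code):

// All hypercolumns run in one kernel, one hypercolumn per CTA. Writes go
// to a second buffer so producers never clobber values their consumers
// still need; swapping buffers between launches advances activations one
// level per launch.
__global__ void stepHypercolumns(const float* cur, float* next,
                                 int actPerHC, int numHC) {
    int hc = blockIdx.x;                          // one hypercolumn per CTA
    for (int j = threadIdx.x; j < actPerHC; j += blockDim.x) {
        int idx = hc * actPerHC + j;
        next[idx] = cur[idx];                     // stand-in for the update
    }
}

void propagate(float* bufA, float* bufB, int actPerHC, int numHC, int levels) {
    for (int step = 0; step < levels; ++step) {   // still one launch per step
        stepHypercolumns<<<numHC, 128>>>(bufA, bufB, actPerHC, numHC);
        float* t = bufA; bufA = bufB; bufB = t;   // swap the double buffers
    }
    cudaDeviceSynchronize();
}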

  18. Software work-queue Ideally we would like to be able to execute the entire cortical architecture on the GPU concurrently, reducing the overhead to a single kernel launch. However, a limitation of the CUDA architecture is that there is no guarantee as to the order in which CTAs are scheduled. We instead create a software work-queue to explicitly orchestrate the order in which hypercolumns are executed. The work-queue is managed directly in the GPU’s global memory space, as in Figure 9.

  19. Algorithmic Optimizations, Solution 2: Work-queue (1) Single kernel launch: a single CUDA kernel is launched with only as many CTAs as can concurrently fit across all of the SMs in the GPGPU, as determined by the occupancy calculator (Figure 9 shows 2 concurrent CTAs per streaming multiprocessor (SM)). (2) Each CTA uses an atomic primitive to gain a unique index into the work-queue (solid blue arrows 'A' and 'C'). The work-queue contains each hypercolumn's ID in the cortical network and is organized to execute hypercolumns in order from the bottom of the hierarchy to the top. If all input activations are available, the hypercolumn can calculate its output activations (in Figure 9, HC0's inputs are ready, while HC9 must wait for its inputs to be produced by HC0).

  20. Once a hypercolumn has calculated its output activations, they are written back to global memory. The dashed red arrow (B) in the figure depicts how HC0 indicates to HC9 that all input activations are available via an atomic increment of the flag. Finally, the CTA atomically indexes again into the work-queue to execute another hypercolumn, until the work-queue is empty.

  21. (3) Concurrent CTAs execute the entire cortical network. This doesn't rely on the CTA scheduler, though the usual CUDA disclaimer applies: the programming model makes no guarantees about CTA-to-CTA communication.
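The following persistent-kernel sketch shows the shape of such a work-queue. It is reconstructed from the description above, so the ready-flag layout, the parentOf() mapping, and the single-parent assumption are mine, not the paper's:

// Hypothetical mapping from a hypercolumn to its consumer in the level
// above (illustrative only; real connectivity lives in the network model).
__device__ int parentOf(int hc) { return hc / 4; }

// Launched with only as many CTAs as fit concurrently on the GPU. Each
// CTA atomically claims the next hypercolumn ID from the queue (arrows A
// and C in Figure 9), spin-waits on a ready flag until all inputs exist,
// computes, then signals its parent (arrow B) and claims more work.
__global__ void workQueueKernel(const int* queue, int queueLen,
                                volatile int* readyCount,
                                const int* inputsNeeded, int* nextItem) {
    __shared__ int item;
    while (true) {
        if (threadIdx.x == 0)
            item = atomicAdd(nextItem, 1);     // claim a unique queue slot
        __syncthreads();
        if (item >= queueLen) return;          // queue drained: CTA exits
        int hc = queue[item];                  // hypercolumn to execute

        if (threadIdx.x == 0)                  // wait for producers below
            while (readyCount[hc] < inputsNeeded[hc]) { /* spin */ }
        __syncthreads();

        // ... compute hypercolumn hc's output activations here ...

        __threadfence();                       // publish outputs globally
        if (threadIdx.x == 0)                  // tell the parent one more
            atomicAdd((int*)&readyCount[parentOf(hc)], 1);  // input is ready
        __syncthreads();
    }
}

Because the queue is ordered bottom-to-top and only resident CTAs participate, every spin-wait is eventually satisfied by a CTA that is actually running, which is why the scheme does not depend on the CTA scheduler's ordering.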

  22. Work Queue -Example

  23. Single GPU Optimization Results

  24. GT200 Architecture Performance

  25. Multi-GPU Systems

  26. Multi-GPU Systems

  27. Multi-GPU Results

  28. My summary • Problem: synchronization and workload imbalance • Solution: key algorithmic optimizations, plus profiling and distributing cortical networks on multi-GPU systems • The work also provides insight into Nvidia GPU architectures

  29. Conclusion • The cortical network algorithm is well suited to GPGPUs: 34x speedup for the baseline, 39x with optimizations; synchronization overhead and workload imbalance are combated with algorithmic changes • Fermi vs. GTX 280 architecture: performance is application sensitive (32 vs. 128 threads); Fermi adds an improved GigaThread CTA scheduler • Multi-GPU implementation: online profiling and deployment; 60x speedup vs. serial
