CS 395 Last Lecture: Summary, Anti-summary, and Final Thoughts
Summary (1) Architecture
• Modern architecture designs are driven by energy constraints
• Shortening latencies is too costly, so we use parallelism in hardware to increase potential throughput
• Some parallelism is implicit (out-of-order superscalar processing), but it has limits
• Other parallelism is explicit (vectorization and multithreading) and relies on software to unlock it (see the sketch below)
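A minimal sketch of unlocking explicit parallelism in host code (not from the slides; the saxpy routine and build flag are illustrative). The OpenMP pragma spreads iterations across threads, and the simd clause asks the compiler to vectorize each thread's chunk:

```cuda
// saxpy with explicit multithreading and vectorization. Without the
// pragma, the compiler may auto-vectorize but will not multithread;
// build with OpenMP enabled (e.g. -fopenmp) for the pragma to apply.
void saxpy(float a, const float* x, float* y, int n) {
    #pragma omp parallel for simd
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```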
Summary (2) Memory
• Memory technologies trade off energy and cost for capacity, with SRAM registers on one end and spinning-platter hard disks on the other
• Locality (relationships between memory accesses) can help us get the best of all cases (see the sketch below)
• Caching is the hardware-only solution to capturing locality, but software-driven solutions exist too (memcache for files, etc.)
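Here is a hedged host-side illustration of locality (the matrix-sum functions are made up for this example). Both loops do identical work; only the access order differs, so only the second lets the cache capture the locality:

```cuda
// Summing a row-major n x n matrix two ways.
float sum_column_order(const float* a, int n) {
    float s = 0.0f;
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
            s += a[i * n + j];   // stride-n accesses: poor locality
    return s;
}

float sum_row_order(const float* a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            s += a[i * n + j];   // unit-stride accesses: cache-friendly
    return s;
}
```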
Summary (3) Software
• Want to fully occupy your hardware?
• Express locality (tiling)
• Vectorize (compiler or manual)
• Multithread (e.g. OpenMP)
• Accelerate (e.g. CUDA, OpenCL); see the kernel sketch below
• Take the cost into consideration: unless you're optimizing in your free time, your time isn't free.
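And a minimal sketch of the "accelerate" option: the same saxpy loop offloaded as a CUDA kernel (names and launch parameters are illustrative, and the arrays are assumed to already live in device memory):

```cuda
#include <cuda_runtime.h>

// One thread per element: parallelism is fully explicit.
__global__ void saxpy_kernel(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

void saxpy_gpu(float a, const float* d_x, float* d_y, int n) {
    int threads = 256;                         // threads per block
    int blocks = (n + threads - 1) / threads;  // cover all n elements
    saxpy_kernel<<<blocks, threads>>>(a, d_x, d_y, n);
}
```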
Research Perspective (2010)
• Can we generalize and categorize the most important, generally applicable GPU Computing software optimizations?
• Across multiple architectures
• Across many applications
• What kinds of performance trends are we seeing from successive GPU generations?
• Conclusion: GPUs aren't special, and parallel programming is getting easier
Application Survey
• Surveyed the GPU Computing Gems chapters
• Studied the Parboil benchmarks in detail
Results:
• Eight (for now) major categories of optimization transformations
• Performance impact of individual optimizations on certain Parboil benchmarks included in the paper
1. (Input) Data Access Tiling
[Diagram: input data in DRAM reaches local accesses along two paths: an implicit copy into a cache, or an explicit copy into a scratchpad]
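A hedged CUDA sketch of the explicit-copy path, using a 1D 3-point stencil as the example (not from the slides; for brevity it assumes blockDim.x == TILE and n is a multiple of TILE). Each block stages its tile of the input in scratchpad (shared) memory once, then reads it at on-chip speed:

```cuda
#define TILE 256

__global__ void stencil_tiled(const float* in, float* out, int n) {
    __shared__ float tile[TILE + 2];                // +2 halo cells
    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + 1;                        // local index past halo

    tile[l] = in[g];                       // explicit copy to scratchpad
    if (threadIdx.x == 0)                  // left halo
        tile[0] = (g > 0) ? in[g - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)     // right halo
        tile[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
    __syncthreads();                       // tile fully staged

    out[g] = 0.25f * tile[l - 1] + 0.5f * tile[l] + 0.25f * tile[l + 1];
}
```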
2. (Output) Privatization
• Avoid contention by aggregating updates locally
• Requires storage resources to keep copies of data structures
[Diagram: private results feed into local results, which are merged into global results]
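A hedged sketch of privatization using the classic histogram example (the code is illustrative, not from the paper). Each block accumulates into a private shared-memory histogram, then merges it into the global one once, so most atomic traffic stays within a block:

```cuda
#define BINS 256

__global__ void histogram_privatized(const unsigned char* data, int n,
                                     unsigned int* global_hist) {
    __shared__ unsigned int local_hist[BINS];   // per-block private copy

    for (int b = threadIdx.x; b < BINS; b += blockDim.x)
        local_hist[b] = 0;                      // zero cooperatively
    __syncthreads();

    // Accumulate with cheap shared-memory atomics.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local_hist[data[i]], 1u);
    __syncthreads();

    // One merge per block into the contended global structure.
    for (int b = threadIdx.x; b < BINS; b += blockDim.x)
        atomicAdd(&global_hist[b], local_hist[b]);
}
```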
Running Example: SpMV
[Diagram: sparse matrix-vector multiplication Ax = v, with A stored as Row, Col, and Data arrays]
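To ground the example, here is a hedged CUDA baseline over that Row/Col/Data (COO) layout, one thread per nonzero, with v assumed zero-initialized. Each thread scatters its product into v[row], which is exactly the contention the next transformation removes:

```cuda
// COO SpMV: thread i handles nonzero i and scatters into its row's
// output, so threads sharing a row contend and need atomics.
__global__ void spmv_coo_scatter(const int* row, const int* col,
                                 const float* data, const float* x,
                                 float* v, int nnz) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nnz)
        atomicAdd(&v[row[i]], data[i] * x[col[i]]);
}
```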
3. “Scatter to Gather” Transformation
[Diagram: the same Ax = v computation, with the Row, Col, and Data arrays reorganized so each thread gathers the inputs of one output row instead of scattering updates]
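A hedged sketch of the gathered form, assuming the nonzeros are regrouped by row into CSR (a row_ptr array replacing the per-nonzero Row entries). Each thread now owns one output row, gathers its inputs, and writes once, so the atomics disappear:

```cuda
__global__ void spmv_csr_gather(const int* row_ptr, const int* col,
                                const float* data, const float* x,
                                float* v, int num_rows) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < num_rows) {
        float sum = 0.0f;
        for (int i = row_ptr[r]; i < row_ptr[r + 1]; ++i)
            sum += data[i] * x[col[i]];  // gather this row's nonzeros
        v[r] = sum;                      // single uncontended write
    }
}
```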
8. Granularity Coarsening
• Parallel execution often requires redundant work and coordination work
• Merging multiple threads into one allows reuse of results, reducing redundancy (see the sketch below)
[Diagram: over time, a 4-way parallel version repeats the redundant work in every thread, while a coarsened 2-way parallel version performs it once per merged thread and keeps only the essential work parallel]
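A hedged before/after sketch of coarsening (all names are made up for illustration). In the fine-grained version every thread in a group recomputes a value its neighbors also need; the coarsened version computes it once and reuses it across FACTOR outputs:

```cuda
#define FACTOR 4

// Stand-in for work that FACTOR neighboring outputs share (hypothetical).
__device__ float expensive_common(int group) {
    float v = 0.0f;
    for (int k = 0; k < 100; ++k) v += sinf((float)(group + k));
    return v;
}

// Fine-grained: one output per thread; the shared value is recomputed
// redundantly by every thread in the group.
__global__ void fine(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = expensive_common(i / FACTOR) * in[i];
}

// Coarsened: each thread produces FACTOR outputs, computing the shared
// value once. Less parallelism, less redundancy.
__global__ void coarse(const float* in, float* out, int n) {
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * FACTOR;
    if (base < n) {
        float c = expensive_common(base / FACTOR);   // once per group
        for (int k = 0; k < FACTOR && base + k < n; ++k)
            out[base + k] = c * in[base + k];
    }
}
```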
How much faster do applications really get each hardware generation?
Unoptimized Code Has Improved Drastically
• Orders of magnitude speedup in many cases
• Hardware does not solve all problems:
• Coalescing (lbm)
• Highly contended atomics (bfs)
Optimized Code Is Improving Faster than “Peak Performance”
• Caches capture locality that scratchpad memory can't capture efficiently (spmv, stencil)
• Increased local storage capacity enables extra optimization (sad)
• Some benchmarks need atomic throughput more than flops (bfs, histo)
Optimization Still Matters
• Hardware never changes algorithmic complexity (cutcp)
• Caches do not solve layout problems for big data (lbm)
• Coarsening still makes a big difference (cutcp, sgemm)
• Many artificial performance cliffs are gone (sgemm, tpacf, mri-q)
Stuff we haven’t covered
• Good tools exist for profiling code beyond just timing it (cache misses, etc.). If you can’t figure out why a particular piece of code is taking so long, look into hardware performance counters.
• Patterns and practice: we covered some of the major optimization patterns, but only the basic ones. Many optimization patterns are algorithmic.