
Efficient Complex Operators for Irregular Codes

Jack Sampson, Ganesh Venkatesh, Nathan Goulding-Hotta, Saturnino Garcia, Steven Swanson, Michael Bedford Taylor. Department of Computer Science and Engineering, University of California, San Diego.


Presentation Transcript


  1. Efficient Complex Operators for Irregular Codes Jack Sampson, Ganesh Venkatesh, Nathan Goulding-Hotta, Saturnino Garcia, Steven Swanson, Michael Bedford Taylor Department of Computer Science and Engineering University of California, San Diego

  2. The Utilization Wall With each successive process generation, the percentage of a chip that can actively switch drops exponentially due to power constraints. [Venkatesh, Chakraborty]

  3. The Utilization Wall [Venkatesh, Chakraborty] • Scaling theory • Transistor and power budgets no longer balanced • Exponentially increasing problem! • Observed impact • Experimental results • Flat frequency curve • Increasing cache/processor ratio • “Turbo Boost”

     Quantity              Classical scaling   Leakage-limited scaling
     Device count          S^2                 S^2
     Device frequency      S                   S
     Device power (cap)    1/S                 1/S
     Device power (Vdd)    1/S^2               ~1
     Utilization           1                   1/S^2
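The utilization row of the scaling table can be reproduced by multiplying the other rows together. The following is a minimal sketch, assuming full-chip switching power is the product of the four listed factors and that the power budget stays fixed across generations; the function names and the normalization to 1.0 at S = 1 are illustrative assumptions, not part of the slides.

```c
#include <assert.h>

/* Relative full-chip switching power after scaling by factor s,
 * normalized to 1.0 at s = 1. Uses the factors from the slide:
 * device count s^2, frequency s, capacitance term 1/s, and a
 * Vdd term of 1/s^2 (classical) or ~1 (leakage limited). */
double full_chip_power(double s, int leakage_limited)
{
    assert(s >= 1.0);
    double device_count = s * s;
    double frequency = s;
    double cap_term = 1.0 / s;
    double vdd_term = leakage_limited ? 1.0 : 1.0 / (s * s);
    return device_count * frequency * cap_term * vdd_term;
}

/* Fraction of the chip that can switch under a fixed power budget
 * (assumed equal to the full-chip power at s = 1). */
double utilization(double s, int leakage_limited)
{
    double u = 1.0 / full_chip_power(s, leakage_limited);
    return u > 1.0 ? 1.0 : u;
}
```

Under these assumptions, `utilization(2.0, 0)` stays at 1 while `utilization(2.0, 1)` falls to 1/4, matching the 1 and 1/S^2 entries in the table.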

  4. The Utilization Wall [Venkatesh, Chakraborty] • Scaling theory • Transistor and power budgets no longer balanced • Exponentially increasing problem! • Observed impact • Experimental results • Flat frequency curve • Increasing cache/processor ratio • “Turbo Boost” [Figure labels: 2x, 2x, 2x]

  5. The Utilization Wall [Venkatesh, Chakraborty] • Scaling theory • Transistor and power budgets no longer balanced • Exponentially increasing problem! • Observed impact • Experimental results • Flat frequency curve • Increasing cache/processor ratio • “Turbo Boost” [Figure labels: 3x, 2x]

  6. Dealing with the Utilization Wall • Insights: • Power is now more expensive than area • Specialized logic has been shown as an effective way to improve energy efficiency (10-1000x) • Our Approach: • Use area for specialized cores to save energy on common apps • Can apply power savings to other programs, increasing throughput • Specialized coprocessors provide an architectural way to trade area for an effective increase in power budget • Challenge: coprocessors for all types of applications

  7. Specializing Irregular Codes • Effectiveness of specialization dependent on coverage • Need to cover many types of code • Both regular and irregular • What is irregular code? • Lacks easily exploited structure / parallelism • Found broadly across desktop workloads • How can we make it efficient? • Reduce per-op overheads with complex operators • Improve latency for serial portions

  8. Candidates for Irregular Codes • Microprocessors • Handle all codes • Poor scaling of performance vs. energy • Utilization wall aggravates scaling problems • Accelerators • Require parallelizable, highly structured code • Memory system challenging to integrate with conventional memory • Target performance over energy • Conservation Cores (C-Cores) [Venkatesh, et al. ASPLOS 2010] • Handle arbitrary code • Share L1 cache with host processor • Target energy over performance

  9. Conservation Cores (C-Cores) [Venkatesh, et al. ASPLOS 2010] • Automatically generated from hot regions of program source • Hot code implemented by C-Core, cold code runs on host CPU • Profiler selects regions • C-to-Verilog compiler converts source regions to C-Cores • Drop-in replacements for code • No algorithmic changes required • Software compatible in absence of available C-Core • Toolchain handles HW generation/SW integration [Figure labels: Hot code, Cold code, C-Core, Host CPU (general purpose), D cache, I cache]

  10. This Paper: Two Techniques for Efficient Irregular Code Coprocessors • Selective De-Pipelining (SDP) • Form long combinational paths for non-pipeline parallel codes • Run logic at slow frequency while improving throughput! • Challenge: handling memory operations • Cachelets • L1 access is a large fraction of critical path for irregular codes • Can we make a cache hit only 0.5 cycles? • Specialize individual loads and stores • Apply both to the C-Core platform • Up to 2.5x speedup vs. an efficient in-order processor • Up to 22x EDP improvement

  11. Outline • Efficiency through specialization • Baseline C-Core Microarchitecture • Selective De-Pipelining • Cachelets • Conclusion

  12. Constructing a C-Core • C-Cores start with source code • Parallelism agnostic • Function call interface • Code supported • Arbitrary memory access patterns • Data structures • Complex control flow • No parallelizing compiler required • Example code: for (i = 0; i < N; i++) { x = A[i]; y = B[i]; C[x] = D[y] + x + y + x*y; }
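The example loop can be wrapped in a function to make it compilable, which also fits the function call interface the slide mentions. This is a sketch under assumptions: the slide gives only the loop, so the function name, the `int` element type, and the explicit length parameters with bounds checks are all hypothetical additions.

```c
#include <assert.h>
#include <stddef.h>

/* The slide's loop as a standalone function. Only the loop body comes
 * from the slide; the name, types, and length parameters are assumed. */
void irregular_kernel(const int *A, const int *B, int *C, const int *D,
                      size_t n, size_t c_len, size_t d_len)
{
    for (size_t i = 0; i < n; i++) {
        int x = A[i];              /* data-dependent index into C */
        int y = B[i];              /* data-dependent index into D */
        assert(x >= 0 && (size_t)x < c_len);
        assert(y >= 0 && (size_t)y < d_len);
        C[x] = D[y] + x + y + x * y;
    }
}
```

The indirect accesses `C[x]` and `D[y]` are what make the access pattern arbitrary: the addresses are known only after the loads of `A[i]` and `B[i]` complete.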

  13. Constructing a C-Core • C-Cores start with source code • Parallelism agnostic • Function call interface • Code supported • Arbitrary memory access patterns • Data structures • Complex control flow • No parallelizing compiler required [Figure: CFG with basic blocks BB0, BB1, BB2]

  14. Constructing a C-Core • C-Cores start with source code • Parallelism agnostic • Function call interface • Code supported • Arbitrary memory access patterns • Data structures • Complex control flow • No parallelizing compiler required [Figure: DFG of BB1 with LD, +, *, ST, +1 and <N? nodes]

  15. Constructing a C-Core (cont.) • Schedule memory operations on L1 • Add pipeline registers to match host processor frequency [Figure: DFG with LD, +, *, ST, +1 and <N? nodes]

  16. Observation • Pipeline registers just for timing • No actual overlap in execution between pipeline stages [Figure: DFG with LD, +, *, ST, +1 and <N? nodes]

  17. Outline • Efficiency through specialization • Baseline C-Core Microarchitecture • Selective De-Pipelining • Cachelets • Conclusion

  18. Meeting the Needs of Datapath and Memory • Datapath • Easy to replicate operators in space • Energy-efficient when operators feed directly to operators • Memory • Interface is inherently centralized • Performance-efficient when the interface can be rapidly multiplexed • Can we serve both at once?

  19. Constructing Efficient Complex Operators • Direct mapping from CFG, DFG • Produces large, complex operators (one per CFG node) [Figure: CFG with BB0, BB1, BB2 and DFG nodes (LD, +, *, ST, +1, <N?), labeled "Complex Operator"]

  20. Selective De-Pipelining (SDP) • SDP addresses the needs of datapath and memory • Fast, pipelined memory • Slow, aperiodic datapath clock [Figure: DFG operators with labels "clock" and "Memory mux"]

  21. Selective De-Pipelining (SDP) • Intra-basic-block registers on fast clock for memory • Registers between basic blocks clocked on slow clock [Figure: DFG operators with labels "clock" and "Memory mux"]

  22. Selective De-Pipelining (SDP) • Constructs large, efficient operators • Combinational paths spanning entire basic block • In-order memory pipelining, handles dependence [Figure: DFG operators with labels "clock" and "Memory mux"]
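A toy latency model can illustrate why removing timing-only pipeline registers helps when stages never overlap. This is a simplified sketch, not the paper's design: it assumes the baseline places a register after every operator on the critical path, ignores memory operations (which SDP keeps on the fast clock), and uses made-up delays in picoseconds. All names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Ceiling division for unsigned integers. */
static unsigned ceil_div(unsigned a, unsigned b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

/* Baseline model: a register after every operator, so each operator on
 * the critical path occupies a whole number of fast-clock cycles. */
unsigned pipelined_cycles(const unsigned *delay_ps, size_t n,
                          unsigned period_ps)
{
    unsigned cycles = 0;
    for (size_t i = 0; i < n; i++)
        cycles += ceil_div(delay_ps[i], period_ps);
    return cycles;
}

/* De-pipelined model: operators are chained combinationally and the
 * result is captured at the first fast-clock edge after the whole
 * path has settled. */
unsigned depipelined_cycles(const unsigned *delay_ps, size_t n,
                            unsigned period_ps)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += delay_ps[i];
    return ceil_div(total, period_ps);
}
```

For four chained operators with delays of 300, 300, 700 and 300 ps and a 1000 ps fast clock, this model gives 4 cycles with per-operator registers and 2 cycles when the path is left combinational, because the slack at each register boundary is no longer wasted.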

  23. SDP Benefits • Reduced clock power • Reduced area • Improved inter-operator optimization • Easier to meet timing

  24. SDP Results (EDP improvement) • SDP creates more energy-efficient coprocessors

  25. SDP Results (speedup) • New design faster than C-Cores, host processor • SDP most effective for apps with larger BBs

  26. Outline • Efficiency through specialization • Baseline C-Core Microarchitecture • Selective De-Pipelining • Cachelets • Conclusion

  27. Motivation for Cachelets • Relative to a processor • ALU operations ~3x faster • Many more ALU operations executing in parallel • L1 cache latency has not improved • L1 cache latency more critical for C-Cores • L1 access is 9x longer than an ALU op! • Can we make L1 accesses faster?

  28. Cache Access Latency • Limiting factor for performance • 50% of scheduling latency for last op on critical path • Closer caches could reduce latency • But must be very small

  29. Cachelets • Integrate into datapath for low-latency access • Several 1-4 line fully-associative arrays • Built with latches • Each services subset of loads/stores • Coherent • MEI states only (no shared lines) • Checkout/shootdown via L1 offloads coherence complexity
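The cachelet description above can be sketched as a small data structure. This is an illustrative software model only, assuming a tag-only array of up to four lines, a placeholder victim policy, and hypothetical function names; the slides do not specify replacement, writeback, or the checkout protocol in this detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHELET_MAX_LINES 4   /* slide: 1-4 lines per cachelet */

/* MEI states only: there is no Shared state. */
typedef enum { LINE_INVALID, LINE_EXCLUSIVE, LINE_MODIFIED } line_state;

typedef struct {
    uint32_t   tag[CACHELET_MAX_LINES];
    line_state state[CACHELET_MAX_LINES];
    int        num_lines;
} cachelet;

void cachelet_init(cachelet *c, int num_lines)
{
    assert(num_lines >= 1 && num_lines <= CACHELET_MAX_LINES);
    c->num_lines = num_lines;
    for (int i = 0; i < CACHELET_MAX_LINES; i++) {
        c->tag[i] = 0;
        c->state[i] = LINE_INVALID;
    }
}

/* Fully associative lookup: compare against every valid line.
 * Returns the line index on a hit, -1 on a miss. */
int cachelet_lookup(const cachelet *c, uint32_t tag)
{
    for (int i = 0; i < c->num_lines; i++)
        if (c->state[i] != LINE_INVALID && c->tag[i] == tag)
            return i;
    return -1;
}

/* Check a line out from L1 in Exclusive state. Victim choice (first
 * invalid line, else line 0) is a placeholder; writeback of a
 * modified victim is not modeled. */
int cachelet_checkout(cachelet *c, uint32_t tag)
{
    int idx = cachelet_lookup(c, tag);
    if (idx >= 0)
        return idx;
    idx = 0;
    for (int i = 0; i < c->num_lines; i++)
        if (c->state[i] == LINE_INVALID) { idx = i; break; }
    c->tag[idx] = tag;
    c->state[idx] = LINE_EXCLUSIVE;
    return idx;
}

/* Store hit moves the line to Modified; a miss returns false and the
 * caller would fall back to L1. */
bool cachelet_store(cachelet *c, uint32_t tag)
{
    int idx = cachelet_lookup(c, tag);
    if (idx < 0)
        return false;
    c->state[idx] = LINE_MODIFIED;
    return true;
}

/* Shootdown requested via L1: drop the line if present. */
void cachelet_shootdown(cachelet *c, uint32_t tag)
{
    int idx = cachelet_lookup(c, tag);
    if (idx >= 0)
        c->state[idx] = LINE_INVALID;
}
```

Because a line is only ever held by one cachelet after checkout, the model needs no Shared state, which is consistent with the MEI-only bullet.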

  30. Cachelet Insertion Policies • Each memory operation mapped to cachelet or L1 • Profile-based assignment • Two policies: Private and Shared • Fewer than 16 lines per C-Core, on average • Private: One operation per cachelet • Average of 8.4 cachelets per C-Core • Area overhead of 13.4% • Shared: Several operations per cachelet • 6.2 cachelets per C-Core • Average sharing factor of 10.3 • Area overhead of 16.8%

  31. Cachelet Impact on Critical Path • Provide majority of utility of full sized L1 at cachelet latency • Improve EDP – reduction in latency worth the energy

  32. Cachelet Speedup over SDP • Benefits of cachelets depend on application • Best when there are several disjoint memory access streams • Usually deployed for spatial rather than temporal locality

  33. C-Cores with SDP and Cachelets vs. Host Processor • Average speedup of 1.61x over in-order host processor

  34. C-Cores with SDP and Cachelets vs. Host Processor • 10.3x EDP improvement over in-order host processor

  35. Conclusion • Achieving high coverage with specialization requires handling both irregular and regular codes • Selective De-Pipelining addresses the divergent needs of memory and datapath • Cachelets reduce cache access time by a factor of 6 for subset of memory operations • Using SDP and cachelets, we provide both 10.3x EDP and 1.6x performance improvements for irregular code
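The EDP and speedup figures can be related with a short calculation. Energy-delay product is energy multiplied by delay, so for a single run the EDP improvement equals the energy improvement times the speedup. The sketch below applies that identity to the reported numbers; since the slides report averages across benchmarks, the result is only a rough estimate, and the function names are hypothetical.

```c
#include <assert.h>

/* EDP = energy * delay, so for one run:
 * (EDP improvement) = (energy improvement) * (speedup). */
double implied_energy_improvement(double edp_improvement, double speedup)
{
    assert(edp_improvement > 0.0 && speedup > 0.0);
    return edp_improvement / speedup;
}

/* Convert a fractional reduction (e.g. 0.57 for "57% EDP reduction")
 * into an improvement factor. */
double reduction_to_factor(double reduction_fraction)
{
    assert(reduction_fraction >= 0.0 && reduction_fraction < 1.0);
    return 1.0 / (1.0 - reduction_fraction);
}
```

With the deck's 10.3x EDP improvement and 1.61x average speedup, `implied_energy_improvement(10.3, 1.61)` comes to roughly 6.4x. Similarly, the 57% application-level EDP reduction in the backup slides corresponds to a factor of about 2.3x via `reduction_to_factor(0.57)`.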

  36. Backup Slides

  37. Application Level, with both SDP and Cachelets • 57% EDP reduction over in-order host processor

  38. Application Level, with both SDP and Cachelets • Average application speedup of 1.33x over host processor

  39. SDP Results (Application EDP improvement) • SDP creates more energy-efficient coprocessors

  40. SDP Results (Application Speedup) • New design faster than C-Cores, host processor • SDP most effective for apps with larger BBs

  41. Cachelet Speedup over SDP (Application Level) • Benefits of cachelets depend on application • Best when there are several disjoint memory access streams • Usually deployed for spatial rather than temporal locality
