Efficient Complex Operators for Irregular Codes
Jack Sampson, Ganesh Venkatesh, Nathan Goulding-Hotta, Saturnino Garcia, Steven Swanson, Michael Bedford Taylor
Department of Computer Science and Engineering, University of California, San Diego
The Utilization Wall
With each successive process generation, the percentage of a chip that can actively switch drops exponentially due to power constraints. [Venkatesh, Chakraborty]
The Utilization Wall [Venkatesh, Chakraborty]
Scaling theory: transistor and power budgets are no longer balanced, and the imbalance grows exponentially.

                     Classical scaling   Leakage-limited scaling
Device count         S^2                 S^2
Device frequency     S                   S
Device power (cap)   1/S                 1/S
Device power (Vdd)   1/S^2               ~1
Utilization          1                   1/S^2

Observed impact (experimental results): flat frequency curves, increasing cache-to-processor area ratio, "Turbo Boost".
Dealing with the Utilization Wall • Insights: • Power is now more expensive than area • Specialized logic has been shown to be an effective way to improve energy efficiency (10-1000x) • Our approach: • Use area for specialized cores to save energy on common apps • Apply the power savings to other programs, increasing throughput • Specialized coprocessors provide an architectural way to trade area for an effective increase in power budget • Challenge: coprocessors for all types of applications
Specializing Irregular Codes • Effectiveness of specialization dependent on coverage • Need to cover many types of code • Both regular and irregular • What is irregular code? • Lacks easily exploited structure / parallelism • Found broadly across desktop workloads • How can we make it efficient? • Reduce per-op overheads with complex operators • Improve latency for serial portions
Candidates for Irregular Codes • Microprocessors • Handle all codes • Poor scaling of performance vs. energy • Utilization wall aggravates scaling problems • Accelerators • Require parallelizable, highly structured code • Memory system challenging to integrate with conventional memory • Target performance over energy • Conservation Cores (C-Cores) [Venkatesh, et al. ASPLOS 2010] • Handle arbitrary code • Share L1 cache with host processor • Target energy over performance
Conservation Cores (C-Cores) [Venkatesh, et al. ASPLOS 2010]
• Automatically generated from hot regions of program source
• Profiler selects regions; a C-to-Verilog compiler converts source regions to C-Cores
• Hot code is implemented by the C-Core; cold code runs on the host CPU
• Drop-in replacements for code: no algorithmic changes required
• Software compatible in the absence of an available C-Core
• Toolchain handles HW generation and SW integration
[Figure: C-Core and general-purpose host CPU sharing the D-cache; hot code runs on the C-Core, cold code on the host.]
This Paper: Two Techniques for Efficient Irregular Code Coprocessors • Selective De-Pipelining (SDP) • Form long combinational paths for non-pipeline-parallel codes • Run logic at a slow frequency while improving throughput! • Challenge: handling memory operations • Cachelets • L1 access is a large fraction of the critical path for irregular codes • Can we make a cache hit take only 0.5 cycles? • Specialize individual loads and stores • Apply both to the C-Core platform • Up to 2.5x speedup vs. an efficient in-order processor • Up to 22x EDP improvement
Outline • Efficiency through specialization • Baseline C-Core Microarchitecture • Selective De-Pipelining • Cachelets • Conclusion
Constructing a C-Core
• C-Cores start with source code
• Parallelism agnostic; function call interface
• Code supported: arbitrary memory access patterns, data structures, complex control flow
• No parallelizing compiler required
Example code:
for (i = 0; i < N; i++) {
  x = A[i];
  y = B[i];
  C[x] = D[y] + x + y + x*y;
}
[Figures: the example's control-flow graph (BB0, BB1, BB2) and the data-flow graph of BB1's loads, stores, and ALU operators.]
Constructing a C-Core (cont.)
• Schedule memory operations on the L1
• Add pipeline registers to match the host processor frequency
[Figure: BB1's data-flow graph with pipeline registers inserted between stages.]
Observation
• Pipeline registers exist just for timing
• No actual overlap in execution between pipeline stages
[Figure: the same pipelined data-flow graph; only one stage is active at a time.]
Meeting the Needs of Datapath and Memory • Datapath • Easy to replicate operators in space • Energy-efficient when operators feed directly to operators • Memory • Interface is inherently centralized • Performance-efficient when the interface can be rapidly multiplexed • Can we serve both at once?
Constructing Efficient Complex Operators
• Direct mapping from the CFG and DFG
• Produces large, complex operators (one per CFG node)
[Figure: each basic block (BB0, BB1, BB2) becomes one complex operator containing its loads, stores, and ALU operations.]
Selective De-Pipelining (SDP)
• SDP addresses the needs of datapath and memory at once
• Fast, pipelined memory interface; slow, aperiodic datapath clock
• Intra-basic-block registers run on the fast clock for memory; registers between basic blocks are clocked on the slow clock
• Constructs large, efficient operators: combinational paths span an entire basic block
• In-order memory pipelining handles dependences
[Figure: a basic block's datapath with its memory operations multiplexed onto the L1 port on the fast clock.]
SDP Benefits • Reduced clock power • Reduced area • Improved inter-operator optimization • Easier to meet timing
SDP Results (EDP improvement)
• SDP creates more energy-efficient coprocessors
SDP Results (speedup) • New design faster than C-Cores, host processor • SDP most effective for apps with larger BBs
Motivation for Cachelets • Relative to a processor, C-Cores have: • ALU operations ~3x faster • Many more ALU operations executing in parallel • L1 cache latency that has not improved • L1 cache latency is therefore more critical for C-Cores • An L1 access is 9x longer than an ALU op! • Can we make L1 accesses faster?
Cache Access Latency • Limiting factor for performance • 50% of scheduling latency for last op on critical path • Closer caches could reduce latency • But must be very small
Cachelets • Integrate into datapath for low-latency access • Several 1-4 line fully-associative arrays • Built with latches • Each services subset of loads/stores • Coherent • MEI states only (no shared lines) • Checkout/shootdown via L1 offloads coherence complexity
Cachelet Insertion Policies • Each memory operation is mapped to a cachelet or to the L1 • Profile-based assignment • Two policies: Private and Shared • Fewer than 16 cachelet lines per C-Core, on average • Private: one operation per cachelet • Average of 8.4 cachelets per C-Core • Area overhead of 13.4% • Shared: several operations per cachelet • 6.2 cachelets per C-Core, with an average sharing factor of 10.3 • Area overhead of 16.8%
Cachelet Impact on Critical Path • Provide the majority of the utility of a full-sized L1 at cachelet latency • Improve EDP: the reduction in latency is worth the extra energy
Cachelet Speedup over SDP • Benefits of cachelets depend on application • Best when there are several disjoint memory access streams • Usually deployed for spatial rather than temporal locality
C-Cores with SDP and Cachelets vs. Host Processor • Average speedup of 1.61x over in-order host processor
C-Cores with SDP and Cachelets vs. Host Processor • 10.3x EDP improvement over in-order host processor
Conclusion • Achieving high coverage with specialization requires handling both irregular and regular codes • Selective De-Pipelining addresses the divergent needs of memory and datapath • Cachelets reduce cache access time by a factor of 6 for a subset of memory operations • Using SDP and cachelets, we provide a 10.3x EDP improvement and a 1.6x performance improvement for irregular code
Application Level, with both SDP and Cachelets • 57% EDP reduction over in-order host processor
Application Level, with both SDP and Cachelets • Average application speedup of 1.33x over host processor
SDP Results (Application EDP improvement)
• SDP creates more energy-efficient coprocessors
SDP Results (Application Speedup) • New design faster than C-Cores, host processor • SDP most effective for apps with larger BBs
Cachelet Speedup over SDP (Application Level) • Benefits of cachelets depend on application • Best when there are several disjoint memory access streams • Usually deployed for spatial rather than temporal locality