
Efficient Execution of Single-thread Programs across Multiple Cores




Presentation Transcript


  1. Efficient Execution of Single-thread Programs across Multiple Cores

    Behnam Robatmili. Supervisor: Doug Burger. UT Austin, July 21, 2011.
  2. Need for Efficiency and Flexibility
    - Single-thread efficiency and scalability with multicore: Amdahl's law
    - The power wall limits frequency scaling and efficiency
    - Parallel efficiency is only possible with efficiency of each thread at each execution point
    - Need more efficient methods to span the performance/energy space
    - Heterogeneous multicores and DVFS offer efficiency and flexibility, but carry various ISAs and design overheads, and are not as flexible as we want
    - Need more innovative solutions!
    [Die photos: AMD Llano (Fusion), Intel Sandy Bridge]
  3. One Alternative: Dynamic (Composable) Multicore
    - For each thread, multiple simple cores share resources and form a more powerful core
    - Spans a wide range of energy/performance operating points
    - Can potentially achieve high performance on a low energy budget and also operate in a low-power regime
    [Diagram: composed cores, each with an L1, register file (RF), and branch predictor (BP), linked by inter-core control and data communication]
  4. Handling Distributed Dependences
    - CoreFusion [Micro07] and WiDGET [ISCA11] dynamically distribute execution across multiple cores
    - They need power-hungry central units to maintain control sequencing and register renaming across distributed instructions
    - With ISA support, the compiler can reduce these overheads (EDGE)
    [Diagram: the same composed-core organization as the previous slide]
  5. EDGE ISAs
    - Block-atomic execution (predicated blocks): instruction groups fetch, execute, and commit atomically
    - Direct instruction communication (dataflow): the dataflow graph is encoded explicitly by specifying targets, not registers (sketched below)
    - Enables efficient execution and low-overhead distribution
    [Diagram: the same loop in RISC vs. EDGE form; RISC instructions read/write the register file, while EDGE instructions form atomic units whose instructions target each other directly]
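A minimal Python sketch of the block-atomic dataflow idea above, not the actual TRIPS/T3 encoding: instructions name consumer operand slots instead of registers, fire when all operands arrive, and the block's outputs commit together at the end. All names and the two-opcode ISA are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Inst:
    op: str                                  # "add", "muli", ...
    needs: int                               # operands required before firing
    targets: list                            # (consumer index, operand slot) pairs
    imm: int = 0                             # immediate for *i forms
    operands: dict = field(default_factory=dict)

def run_block(insts, inputs):
    """inputs maps (inst index, slot) -> value read from registers/memory."""
    for (i, slot), v in inputs.items():
        insts[i].operands[slot] = v
    ready = [i for i, x in enumerate(insts) if len(x.operands) == x.needs]
    outputs = {}
    while ready:
        i = ready.pop()
        x = insts[i]
        if x.op == "add":
            val = x.operands[0] + x.operands[1]
        elif x.op == "muli":
            val = x.operands[0] * x.imm
        else:
            raise ValueError(x.op)
        for tgt, slot in x.targets:          # direct producer -> consumer delivery
            insts[tgt].operands[slot] = val
            if len(insts[tgt].operands) == insts[tgt].needs:
                ready.append(tgt)
        if not x.targets:                    # leaf: a block output (register write)
            outputs[i] = val
    return outputs                           # all effects would commit atomically

# (x * 5) + y: the muli targets slot 0 of the add; no register names in the block
block = [Inst("muli", 1, [(1, 0)], imm=5), Inst("add", 2, [])]
print(run_block(block, {(0, 0): 7, (1, 1): 3}))   # {1: 38}
```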
  6. Ultimate Goal
    - A grid of many homogeneous, thin, low-power, high-performance cores connected via an on-chip mesh network
    - Single-thread efficiency (energy-delay product): logical cores are composed from multiple physical cores in a scalable, low-overhead way
    - Multithread efficiency: logical cores can be composed/decomposed at runtime according to runtime policies, one of which is composition
    [Diagram: a mesh of cores and L2 banks; threads T0-T7 are composed and recomposed onto varying numbers of physical cores]
  7. Thesis Statement
    - Composable processors can potentially perform well in different power/energy regimes
    - Goal: design a power-efficient, scalable dynamic multicore called T3 that spans a wide energy/performance spectrum
    - Approach: evaluate the prior EDGE composable design (TFlex), then re-invent EDGE architectures to address inefficiencies and improve execution efficiency in different power/performance regimes
  8. Outline
    - Motivation
    - Background on TFlex and bottleneck analysis
    - Re-inventing the architecture for power-efficient EDGE composable execution
      - Mapping blocks across simple cores
      - Optimizing cross-core register communication
      - Reducing the fetch stall bottleneck
      - Optimizing prediction and predication
      - Optimizing single-core operand delivery
    - Evaluation and conclusions
  9. Baseline TFlex Composable System
    - TFlex is an EDGE composable processor: N cores are merged
    - Merged cores share registers, branch tables, caches, and more
    - N blocks run in parallel (one non-speculative)
    - Dataflow within blocks; shared distributed registers across blocks
    - 2-wide-issue execution per core
    [Diagram: a 4x4 TFlex array; each core has a register file (R), decode (D), branch predictor (BP), and ALU]
  10. T3 Composable Processor
    - Analysis uses systematic bottleneck analysis based on critical-path analysis [HPCA11]
    - At each step, redesign the architecture/ISA to reduce the most dominant execution bottleneck
    - Primarily aiming for performance, while using EDGE semantics to save energy
    - 2-wide-issue execution per core
    [Diagram: one T3 core, extending the TFlex core with a block control & reissue unit, register bypassing, an 8-Kbit block/predicate predictor, broadcast/token select logic, a block mapping unit, and speculative predicates]
  11. Systematic Bottleneck Analysis and Reduction
    - Analyzing complex distributed systems is complicated
    - Our methodology (see the sketch after this slide):
      - Use a system-level critical-path analysis to detect the top bottleneck component
      - At the component level, detect the scenario causing that bottleneck
      - Design the right optimization mechanism for the detected scenario, and repeat
    - Bottlenecks and mechanisms are presented in the order detected
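A rough sketch of the system-level step, assuming the simulator emits a DAG of microarchitectural events (fetch, execute, operand transfers, commit) with latencies; the event and component names are illustrative, not the T3 simulator's actual output. It finds the longest (critical) path and charges its cycles to components, exposing the top bottleneck.

```python
from collections import defaultdict

def critical_path_blame(events, deps):
    """events: {id: (component, latency)}; deps: (src, dst) edges (dst waits on src).
       Returns cycles charged to each component along the critical path."""
    succs, indeg = defaultdict(list), defaultdict(int)
    for s, d in deps:
        succs[s].append(d)
        indeg[d] += 1
    order = [e for e in events if indeg[e] == 0]
    finish = {e: events[e][1] for e in order}        # longest finish time so far
    pred = {}
    i = 0
    while i < len(order):                            # topological longest path
        e = order[i]; i += 1
        for d in succs[e]:
            t = finish[e] + events[d][1]
            if t > finish.get(d, 0):
                finish[d], pred[d] = t, e
            indeg[d] -= 1
            if indeg[d] == 0:
                order.append(d)
    blame, e = defaultdict(int), max(finish, key=finish.get)
    while e is not None:                             # walk the critical path back
        blame[events[e][0]] += events[e][1]
        e = pred.get(e)
    return dict(blame)

evts = {"f0": ("fetch", 3), "x0": ("execute", 1), "n0": ("operand_net", 5),
        "x1": ("execute", 1), "c0": ("commit", 2)}
edges = [("f0", "x0"), ("x0", "n0"), ("n0", "x1"), ("x1", "c0"), ("f0", "x1")]
print(critical_path_blame(evts, edges))   # operand_net dominates this toy path
```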
  12. Outline
    - Motivation
    - Background
    - Re-inventing the architecture for power-efficient EDGE composable execution
      - Mapping blocks across simple cores
      - Optimizing cross-core register communication
      - Reducing the fetch stall bottleneck
      - Optimizing prediction and predication
      - Optimizing single-core operand delivery
    - Full T3 system evaluation and conclusions
  13. Reducing Fine-Grain Dataflow Communication
    - Flat mapping (original): each core runs a portion of each running block; intra-block dataflow communication is a bottleneck [LCPC08, PACT08]
    - Deep mapping [MICRO08]: maps each block to one core; halves cross-core communication by limiting dataflow to single cores (see the toy comparison below)
    - The hardware dynamically selects the cores for mapping blocks
    [Diagram: flat mapping spreads one block across all cores; deep mapping assigns each block to a single core]
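A toy sketch contrasting the two mappings by counting mesh hops for intra-block operand edges. The mesh size, block size, and the random dataflow edges are illustrative assumptions, not measured data; the point is only that deep mapping drives intra-block hops to zero.

```python
import random

def hops(a, b, width=4):
    """Manhattan distance between core ids on a width x width mesh."""
    return abs(a % width - b % width) + abs(a // width - b // width)

def intra_block_hops(block_edges, mapping):
    """mapping(inst_id) -> core id; sums hops over producer -> consumer edges."""
    return sum(hops(mapping(p), mapping(c)) for p, c in block_edges)

random.seed(0)
edges = [(random.randrange(128), random.randrange(128)) for _ in range(200)]
flat = lambda i: i % 16            # flat: instructions striped across 16 cores
deep = lambda i: 3                 # deep: the whole block lives on one core
print("flat:", intra_block_hops(edges, flat))   # many cross-core hops
print("deep:", intra_block_hops(edges, deep))   # zero intra-block hops
```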
  14. Reducing Coarse-Grain Register Communication
    - Cross-core register communication between blocks becomes the next bottleneck
    - Distributed register forwarding units resolve register dependences
    - Selective register bypassing: late (critical) values are sent via very low-overhead direct register value bypassing, while the rest use register forwarding [HPCA11] (sketched below)
    [Diagram: block B0 communicates register R1 to B1 either indirectly via R1's home core (original forwarding) or directly via bypassing for critical values]
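A minimal sketch of the selectivity decision, assuming a simple counter-based criticality predictor; the threshold, training rule, and message model are illustrative assumptions, not the HPCA11 mechanism's exact design.

```python
from collections import defaultdict

class BypassPredictor:
    """Predict whether a register value tends to arrive late (critical)."""
    def __init__(self, threshold=2):
        self.late_count = defaultdict(int)
        self.threshold = threshold

    def train(self, reg, consumer_stalled):
        if consumer_stalled:                       # value arrived too late
            self.late_count[reg] += 1
        else:
            self.late_count[reg] = max(0, self.late_count[reg] - 1)

    def should_bypass(self, reg):
        return self.late_count[reg] >= self.threshold

def deliver(reg, value, producer, consumer, home, pred):
    """Return the network messages needed to move one register value."""
    if pred.should_bypass(reg):
        return [(producer, consumer, value)]       # direct: one network trip
    return [(producer, home, value),               # default forwarding:
            (home, consumer, value)]               # two trips via the home core
```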
  15. Reducing Fetch Criticality
    - Block fetches following flushes are critical
    - There is no control flow within a block, so blocks can be reissued if they are still in the window (sketched below)
    - Block reissue: reissue critical instructions following pipeline flushes to reduce the misspeculation penalty [HPCA11]
    - Saves energy and delay by reducing fetches and decodes by about 50%
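A minimal sketch of block reissue, assuming each core keeps recently executed decoded blocks resident in its window; the capacity, eviction policy, and block representation are illustrative assumptions.

```python
class BlockWindow:
    """Keep recently executed blocks resident so a flush can reissue them."""
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.resident = {}                        # block address -> decoded block

    def fetch_or_reissue(self, addr, fetch_and_decode):
        blk = self.resident.get(addr)
        if blk is not None:
            blk["operand_state"] = {}             # just reset dataflow state
            return blk, "reissued"                # no fetch, no decode
        blk = fetch_and_decode(addr)              # pay the full front-end cost
        if len(self.resident) >= self.capacity:
            self.resident.pop(next(iter(self.resident)))   # crude FIFO eviction
        self.resident[addr] = blk
        return blk, "fetched"

w = BlockWindow()
decode = lambda a: {"addr": a, "insts": [], "operand_state": {}}
print(w.fetch_or_reissue(0x400, decode)[1])       # "fetched"
print(w.fetch_or_reissue(0x400, decode)[1])       # "reissued" after a flush
```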
  16. Outline
    - Motivation
    - Background
    - Re-inventing the architecture for power-efficient EDGE composable execution
      - Mapping blocks across simple cores
      - Optimizing cross-core register communication
      - Reducing the fetch stall bottleneck
      - Optimizing prediction and predication
      - Optimizing single-core operand delivery
    - Full T3 system evaluation and conclusions
  17. Next Block Prediction
    - To scale performance, multiple speculative blocks must be in flight
    - Coarse-grain branch prediction: when a block is fetched, the block following it is predicted
    [Diagram: a block control-flow graph B0-B6 with the predicted block path B0, B1, B4, B5, B6]
  18. EDGE Speculation & Predication Overheads
    - Intra-block control points are converted to predicates
    - Multi-exit next-block prediction accuracy suffers (e.g., Exits 1..3): the branch history does not include predicates (e.g., i1 and i3)
    - The TFlex predictor predicts only the exit ID bits of the block
    - Predicates are executed, not predicted (e.g., the ST waits for R1)
    [Diagram: block B1 in which predicates i1 and i3 (TZ tests) guard a SUBI, a ST, and three exit branches to B1, B2, and B3]
  19. Iterative Path Prediction (IPP)
    - Solution [submitted to MICRO11]: predict the predicate path within each block
    - Use the path to better predict the exit and to speculate on predicates
    - Example for the i1/i3 path (see the sketch after this slide):
      - "11": take Exit 2, skip all instructions
      - "00": take Exit 1, only execute the ST
    [Diagram: the same block B1 as the previous slide]
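A toy sketch of how a predicted predicate path selects a block exit and the set of instructions worth executing speculatively. The "11" and "00" rows mirror the slide's example; the "10" row and the table-driven form are illustrative assumptions, not the T3 hardware's mechanism.

```python
# path string -> (exit id, instructions to execute speculatively)
PATH_ACTIONS = {
    "11": (2, set()),          # take Exit 2, skip all predicated work
    "00": (1, {"ST"}),         # take Exit 1, only the store runs
    "10": (3, {"SUBI"}),       # assumed mapping, for illustration only
}

def speculate(predicted_path):
    """Return (exit, insts) for a predicted path, or None to fall back
       to executing the predicates non-speculatively."""
    return PATH_ACTIONS.get(predicted_path)

print(speculate("11"))         # (2, set())
print(speculate("01"))         # None: unpredicted path, no speculation
```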
  20. IPP Advantages
    - Accurate next-block prediction
    - Speculative execution of predicates
    [Diagram: from the current block address, the predictor predicts the predicate path (the branches in the block), and uses it both for target prediction of the next block and for speculatively executing the dataflow predicate path in the block]
  21. IPP Predicate Predictor Component
    - Pipelined OGEHL [Seznec JILP04] predictor (a simplified model follows this slide)
    - Only one hashing stage for all instructions in the block
    - High accuracy using hazard elimination
    [Diagram: the block PC (40 bits) and a 200-bit GHR are hashed (H1-H4) into 7-bit indexes for tables T0-T4 of 4-bit counters over history lengths L(0)-L(4); a prediction sum yields a 1-bit prediction per predicate with speculative update; pipeline stages: initial index compute, table access, prediction sum]
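A minimal sketch of an OGEHL-style (geometric history length) predictor for one predicate bit: several tables of signed counters, indexed by hashes of progressively longer history slices, are summed and the sign gives the prediction. The history lengths, table sizes, and hash are simplified assumptions standing in for the pipelined T3 design.

```python
HIST_LENS = [0, 5, 15, 44, 130]        # geometric history lengths L(0)..L(4)
TABLE_BITS = 7                         # 7-bit indexes -> 128-entry tables
MASK = (1 << TABLE_BITS) - 1
tables = [[0] * (1 << TABLE_BITS) for _ in HIST_LENS]   # 4-bit signed counters

def index(pc, ghr, length):
    """Fold the PC and the last `length` history bits into a table index."""
    h = pc & MASK
    for bit in (ghr[-length:] if length else []):
        h = ((h << 1) ^ bit ^ (h >> TABLE_BITS)) & MASK
    return h

def predict(pc, ghr):
    s = sum(t[index(pc, ghr, L)] for t, L in zip(tables, HIST_LENS))
    return 1 if s >= 0 else 0

def update(pc, ghr, taken):
    for t, L in zip(tables, HIST_LENS):
        i = index(pc, ghr, L)
        t[i] = max(-8, min(7, t[i] + (1 if taken else -1)))  # saturate 4 bits
    ghr.append(taken)                  # speculative history update

ghr = []
for outcome in [1, 1, 0, 1, 1]:        # train on a toy predicate stream
    p = predict(0x40, ghr)
    update(0x40, ghr, outcome)
print(predict(0x40, ghr))              # mostly-taken predicate -> predicts 1
```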
  22. Tuning IPP
    - Design parameter: number of predicted predicates per block, trading off next-block prediction accuracy, predicate prediction accuracy, and speedup
    - Optimum point: predicting 4 predicates per block
    - 14% improved performance with 16 merged cores: 11% from predicate prediction + 3% from better next-block prediction
  23. Outline
    - Motivation and background
    - Rethinking the compiler and hardware to efficiently exploit thin EDGE cores
      - Distributing computation across simple cores
      - Optimizing cross-core communication
      - Reducing the fetch stall bottleneck
      - Optimizing prediction and predication
      - Optimizing single-core operand delivery
    - Full T3 system evaluation and conclusions
  24. EDGE Dataflow High-Fanout Issue
    - With EDGE dataflow, each instruction can encode up to 2 targets: efficient for low fanout
    - The compiler inserts trees of move instructions for high-fanout operands (20% of all instructions!)
    - Out-of-order machines instead use dynamically generated and matched broadcast tags and bypass networks: high power consumption, and not efficient for low-fanout operands
    [Diagram: a move tree fanning the value of R1 out to MULI, ADD, MUL, two more ADDs, a ST, and a BR to B2]
  25. Architecturally Exposed Operand Broadcast (EOB)
    - Joint work with Dong Li (lead author) [PESPMA09]
    - Low-fanout operands use dataflow; high-fanout operands use light-weight exposed operand broadcasts (sketched below)
    - Simple microarchitectural support: source and destination EOBs (tags) are explicitly assigned to instructions, statically assigned and dynamically resolved
    - Most moves are eliminated and 5% fewer blocks execute (10% less energy)
    - An interesting compiler problem: selecting instructions for the limited EOBs
    [Diagram: the move tree from the previous slide replaced by EOB tags 1 and 2 broadcast to the consuming instructions]
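A toy sketch of hybrid operand delivery: low-fanout results go point-to-point to encoded targets, while a high-fanout result carries a compiler-assigned broadcast tag (EOB) that waiting instructions match on, much like a small CAM. The field names and matching loop are illustrative assumptions; the real encoding is described in the PESPMA09 paper.

```python
from dataclasses import dataclass, field

@dataclass
class Inst:
    targets: list = field(default_factory=list)   # explicit (inst id, slot) targets
    send_eob: int | None = None                   # broadcast tag this result carries
    recv_eob: int | None = None                   # tag this instruction listens for
    recv_slot: int = 0
    operands: dict = field(default_factory=dict)

def deliver(result, producer, window):
    """Deliver one result to a window of in-flight instructions."""
    for inst_id, slot in producer.targets:        # dataflow path (<= 2 targets)
        window[inst_id].operands[slot] = result
    if producer.send_eob is not None:             # broadcast path: tag match
        for inst in window.values():
            if inst.recv_eob == producer.send_eob:
                inst.operands[inst.recv_slot] = result

win = {1: Inst(), 2: Inst(recv_eob=7, recv_slot=1), 3: Inst(recv_eob=7)}
prod = Inst(targets=[(1, 0)], send_eob=7)
deliver(42, prod, win)
print(win[1].operands, win[2].operands, win[3].operands)  # {0: 42} {1: 42} {0: 42}
```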
  26. Summary of Contributions
    - Speculation (front end):
      - Non-critical: next-block prediction [cf. trace prediction]; block reissue [cf. trace cache, instruction reissue]
      - Critical: predicate path prediction [cf. predicate prediction for out-of-order]
    - Communication (back end), multi-core:
      - Distributing execution using block mapping
      - Non-critical: cross-core register forwarding units [cf. distributed memory]
      - Critical: direct register bypassing [cf. TLS, LSQ bypassing]
    - Communication (back end), single-core:
      - Low fanout: dataflow [cf. Forwardflow, accelerators]
      - High fanout: exposed broadcasts [cf. Forwardflow, hybrid wakeup]
  27. Outline
    - Motivation
    - Background
    - Re-inventing the architecture for power-efficient EDGE composable execution
      - Mapping blocks across simple cores
      - Optimizing cross-core register communication
      - Reducing the fetch stall bottleneck
      - Optimizing prediction and predication
      - Optimizing single-core operand delivery
    - Full T3 system evaluation and conclusions
  28. Simulation Setup
    - Accurate delay and power comparison with TRIPS and TFlex
      - Cycle/power simulator validated against TRIPS hardware
      - Power models validated against real hardware and RTL [IEEE Computer, in revision]
      - CACTI used for memory models
      - Technology: 45nm, Vdd: 1.1 V, frequency: 2.4 GHz
    - Comparison with Core 2 and Atom platforms in different DVFS regions
      - Power and performance measured on real hardware, as reported by H. Esmaeilzadeh [ASPLOS11]
      - Core + L1 power estimated using McPAT [ISCA10]
  29. SPEC INT Performance/Energy Results
    - TFlex-8 is close to TRIPS, while T3-8 outperforms TRIPS by 1.43x with 25% less energy
    [Plots: speedup and normalized core + L1 energy vs. number of cores, relative to a single dual-issue core, with Pollack's rule shown for reference]
  30. Results Breakdown
    - Major delay savers: IPP, block mapping, and block reissue
    - Major energy savers: EOBs, block mapping, and block reissue
    [Plots: per-mechanism speedup and normalized core + L1 energy vs. number of cores, relative to a single dual-issue core]
  31. SPEC INT Cross-Platform Comparison
    - Few cores (1 to 2): energy efficient with high performance
    - More cores (4 to 8): increased performance at low energy cost
    - Efficiently covers a much larger operating spectrum than DVFS
    [Plots: performance (P) and energy (E) vs. number of cores, against Core 2 and Atom DVFS points; axes as on the previous slides]
  32. SPEC FP Performance/Energy Results
    [Plots: speedup and normalized core + L1 energy vs. number of cores, relative to a single dual-issue core, with Pollack's rule shown for reference]
  33. SPEC FP Cross-Platform Comparison
    - Significantly improved performance and energy efficiency compared to INT
    [Plots: performance (P) and energy (E) vs. number of cores, as in the INT comparison]
  34. Summary of Bottleneck Analysis
    - Fine-grain dataflow communication
    - Coarse-grain register communication
    - Fetch
    - Branches & predicates
  35. Related Work
    - Distributed uniprocessors
      - Dynamic: CoreFusion [ISCA08], Forwardflow & WiDGET [ISCA10]
      - Static: Instruction-Level Distributed Processing [ISCA02], WaveScalar [MICRO03], Multiscalar [ISCA95], TLS [IEEE Comp. 99]
    - Efficiency optimization
      - Instruction mapping: RAW [IEEE Comp. 97], clustered superscalar [ISCA03]
      - Instruction reuse: trace processors [MICRO96], instruction revitalization [MICRO03]
      - Register/memory bypassing: memory bypassing and cloaking [IJPP99], TLS synchronization scoreboard [IJPP03]
    - Critical path analysis: original paper [ISCA01], TRIPS criticality analysis [ISPASS06]
  36. Conclusions
    - Rethinking traditional execution models gives better efficiency and flexibility for future systems
    - This study: achieving an efficient EDGE composable system
    - Methodology:
      - Systematic bottleneck analysis
      - Balancing communication and execution by specializing communication and speculation at different levels of the hierarchy
      - Achieving the right division between EDGE hardware and software
    - Further optimizations are still possible (E2 system):
      - Better code quality
      - Instruction packing (variable block sizes and SIMD/vector instructions)
      - How can composability improve multithread efficiency?
  37. Publications
  38. Acknowledgements
    - My advisor: Doug Burger
    - My committee members: Kathryn McKinley, Steve Keckler, Calvin Lin, and Steve Reinhardt
    - My collaborators: Katie Coons, Bert Maher, Aaron Smith [compiler], Jeff Diamond [TRIPS BLAS], Dong Li [EOBs], Hadi Esmaeilzadeh [IPP], and Sibi Govindan [power]
    - Other colleagues for their significant comments and advice: Boris Grot, Mark Gebhart, and others
    - The CART Lab and Speedway Group, UTCS, and MSR
  39. Thank You
    [Diagram: Terminator/EDGE analogy: TRIPS (efficiency, future technology), TFlex (flexibility, little pieces merging), T3 (flexibility + efficiency)]
  40. Backup Slides

  41. List of Backup Slides
    - Compiler
    - Comparison platforms
    - Energy-delay product
    - Power breakdown
    - E2 uArch
    - Support and tuning for EOBs
    - TRIPS vs. TFlex
    - Iterative path predictor
    - Block mapping
    - EDGE background
    - Next block prediction issue
    - Block reissue
    - Limit study & research interests
    - Final criticality results
  42. Limit Study
    - Chances for additional uArch improvements under ideal speculative execution
    - Perfect predicate, dependence, and branch prediction
    - Only when all are applied does T3 observe significant speedup
  43. Research Interests
    - Redesigning computer systems for efficiency, power, security, and resiliency
      - Redesign the entire hw/sw stack with respect to the workload and the factor under optimization
      - Synchronization & communication according to the workload
    - More systematic and intelligent ways to redesign systems
      - Machine learning & criticality analysis for specializing important operational modes
    - New technologies: NVRAMs, nano, etc.
    - New workloads: data-centric, augmented reality, NUI, gaming, cloud, etc.
    [Diagram: stack layers (application, programming language, OS and runtime, architecture/ISA, uArchitecture, circuit) crossed with the factor under optimization (power, performance, security, resiliency) and the optimization method]
  44. Systematic Bottleneck Analysis and Reduction
    [Diagram: system-level analysis detects the bottleneck component; component-level analysis detects the scenario causing the bottleneck; the right optimization mechanism is chosen and applied]
  45. Block Reissue
    - Reduces the misspeculation penalty by reissuing critical instructions following pipeline flushes [HPCA11]
    - Saves energy & delay by reducing fetches by about 50%
    [Plot: percent of reissued blocks]
  46. Introduction to E2
    - 128-entry instruction window divided into 4 lanes
    - Each lane can execute an independent (32-instruction) hyperblock
    - Vector instructions only target instructions in the same lane
    - 64 general-purpose registers
    - 32KB L1 instruction and data caches
    [Diagram: four lanes, each with a 32 x 54b instruction window, a 16 x 64b register bank, two 32 x 64b operand buffers, and an ALU, plus control, a branch predictor, an unordered load/store queue, the 32KB L1 caches, and a memory interface controller]
  47. MSR E2 Dynamic Multicore System
    - Variable-size blocks and SIMD/vector operations in addition to the T3 optimizations
    - 4 lanes, each with an ALU and one bank of the instruction window, register file, and operand buffer
    - Supports up to 4 blocks per core and fine-grained SIMD/vector operations
  48. Generating Large Blocks
    - The compiler can generate larger blocks by converting control dependences to data dependences through predication (a sketch of if-conversion follows this slide)
    - RD, WR: inter-block register communication; dataflow inside the block
    [Diagram: a hyperblock flow graph HB1-HB4 in which each color represents a basic block; tests (tstlti, tstz) produce predicates guarding arithmetic, the ST, and the exit branches BR1/BR2]
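A toy sketch of if-conversion, the core of hyperblock formation: a branch diamond becomes one predicated block, turning the control dependence into a data dependence on the test. The tiny tuple IR here is an illustrative assumption, not the Scale/TRIPS compiler's representation.

```python
def if_convert(test, then_insts, else_insts, join_insts):
    """Merge `if (test) then ... else ...` into one predicated block.
       Each instruction becomes (op, args, predicate), where predicate is
       None (unpredicated) or ("p", required_value)."""
    block = [(test[0], test[1:], None)]                   # produces predicate p
    block += [(op, args, ("p", True)) for op, args in then_insts]
    block += [(op, args, ("p", False)) for op, args in else_insts]
    block += [(op, args, None) for op, args in join_insts]
    return block

# if (x < 10) y -= 1; else y *= 2;  then z = y + 3
for inst in if_convert(("tstlti", "x", 10),
                       [("sub", ("y", 1))],
                       [("mul", ("y", 2))],
                       [("add", ("y", 3))]):
    print(inst)
```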
  49. Scale Compiler for TRIPS
    - An optimizing compiler for TRIPS that compiles all SPEC benchmarks
    - Phases: hyperblock formation (if-conversion, loop peeling, while-loop unrolling, predicate optimizations), register allocation with splitting for spill code, scheduling (placement), and dataflow code generation
    [Diagram: the compiler phases alongside the example hyperblock from the previous slide]
  50. Challenges to be Addressed
    - Execution proceeds over a stream of blocks, oldest to youngest (commit order), along the predicted block path
    - Fundamental challenges:
      - How are instructions mapped to hardware?
      - How should instructions communicate?
      - How to support inter- and intra-block speculation?
      - What is the division between the compiler and hardware?
    [Diagram: a predicted block path mapped onto hardware, with register files between blocks]
  51. Hierarchical Communication Model
    [Diagram: communication (back end) split into inter-block vs. intra-block and low fanout vs. high fanout; speculation (front end) split into critical vs. non-critical]
  52. Compiling for EDGE
    - Phases: hyperblock formation (if-conversion, loop peeling, while-loop unrolling, predicate optimizations), register allocation with splitting for spill code, and dataflow code generation
    [Diagram: the hyperblock flow graph HB1-HB4 annotated with dataflow edges, predicates, block register inputs/outputs (data), and block exit branches (control)]
  53. Need for Efficiency and Flexibility
    - End of Dennard scaling
    - Over 5 technology generations (by 2024), only 7.9x speedup is possible (CPU or GPU); at 8nm, 50% of the chip will not be utilized [ISCA11]
    - Need more efficient cores using radical architectural innovations
      - Save delay and power together
      - Maximize efficiency for each thread and across threads
      - Support future workloads without heterogeneous-ISA overheads
    [Die photos: Nehalem, Llano, Tegra]
  54. Support for Exposed Operand Broadcast
    [Diagram: instruction-queue entries extended with 3-bit send/receive broadcast IDs (BCIDs) and an operand type (op1/op2); on issue, a small CAM compares the result's send BCID against waiting instructions' receive BCIDs to deliver broadcast operands, e.g., i2 and i3 with BCID 001 feeding the operands of i5]
  55. Tuning EOBs
    - Design parameter: number of EOBs
    - More available EOBs remove more moves but require wider EOB CAMs in the bypass network
    - Optimum point: 8 EOBs (3 bits wide) for minimum overhead
  56. Static Scheduling Overview
    - Static placement, dynamic issue
    - The scheduler places each dataflow graph onto the topology of registers, data caches, execution tiles, and control: 128! scheduling possibilities (a greedy sketch follows this slide)
    [Diagram: compiler phases feeding a dataflow graph (ld, mul, add, br) that is placed onto the register (R1/R2), data-cache (D0/D1), execution, and control topology]
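A toy sketch of one greedy heuristic for static placement: visit instructions in dependence order and put each on the tile that minimizes estimated operand latency to its already-placed producers, plus a contention penalty. The 4x4 topology and cost weights are illustrative assumptions, not the Scale scheduler's actual algorithm.

```python
import itertools

TILES = list(itertools.product(range(4), range(4)))      # 4x4 grid of ALU slots

def dist(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])           # Manhattan hops

def place(insts, producers):
    """insts: ids in topological order; producers: id -> list of producer ids."""
    placement, load = {}, {t: 0 for t in TILES}
    for i in insts:
        def cost(t):
            # operand hops to producers plus a small contention penalty
            return sum(dist(placement[p], t) for p in producers[i]) + load[t]
        best = min(TILES, key=cost)
        placement[i] = best
        load[best] += 1
    return placement

prods = {0: [], 1: [0], 2: [0], 3: [1, 2]}               # a tiny dataflow graph
print(place([0, 1, 2, 3], prods))                        # consumers land near producers
```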
  57. INT Performance/Energy Results
    [Plots: speedup and energy vs. number of cores over a single dual-issue core, with Pollack's rule and T3 marked]
  58. Results Breakdown
    [Plots: speedup and energy vs. number of cores over a single dual-issue core]
  59. FP Performance/Energy Results
    [Plots: speedup and energy vs. number of cores over a single dual-issue core, with Pollack's rule marked]
  60. Need for Efficiency and Flexibility
    - End of multicore scaling! Moore's law ends at most 5 generations out (8nm)
    - By then (8nm in 2024), only 4x to 7.9x speedup is possible using multicores (CPU or GPU), and 10% to 50% of the chip will not be utilized [Esmaeilzadeh ISCA11]
    - Need more efficient cores using radical architectural innovations: save delay and power together; maximize efficiency for each thread and across threads
    [Figure: projections from H. Esmaeilzadeh [ISCA11]]
  61. EDGE uArchitectures Timeline
    - TRIPS (UT): distributed EDGE uArch (initial implementation)
    - TFlex (UT): fully distributed registers, caches, and control (evaluation)
    - T3 (UT): reinvented uArch and ISA for power & performance efficiency (the scope of this talk)
    - E2 (MSR): support for variable block sizes and SIMD/vector mode
    [Timeline: 2000-2015, from initial implementation through evaluation to ISA/uArch reinvention]
  62. Static vs. Dynamic Dependences
    - Up to 128 instructions in each block
    - Intra-block dependences are statically detected; inter-block dependences are dynamically detected
    [Diagram: the hyperblock flow graph HB1-HB4 with dataflow links (data), predicated-on-true/false edges (control), block register inputs/outputs (data), and block exit branches (control), produced by the compiler phases listed earlier]
  63. Microarchitectural Topology
    - TRIPS: one register file (R0-R3) and data-cache tiles (D0-D3) beside a grid of execution tiles E0-E15; 1-wide issue per E tile
    - TFlex, T3, and E2: fully distributed registers, data caches, and branch predictors; 2-wide issue per core
    [Diagram: the TRIPS tiled layout vs. the TFlex/T3 composable array of cores]
  64. Original Next Block Prediction (Exit IDs)
  65. Iterative Path Predictor used by T3
  66. Basic OGEHL Predictor
    - Relatively long delay: for N predicate bits, the delay is 3N cycles
    [Diagram: the block PC (40 bits) and 200-bit GHR hashed (H1-H4) into 8-bit indexes for tables T0-T4 of 4-bit counters over history lengths L(0)-L(4); the prediction sum yields a 1-bit prediction with speculative update]
  67. Pipelined OGEHL Predictor
    - Improved delay: N + 2 cycles
    - Possible hazards between in-flight predictions and speculative updates
    [Diagram: the same predictor pipelined into initial index compute, table access, and prediction sum stages]
  68. Pipelined with Bypassing (2-wide Tables)
    [Diagram: the pipelined predictor with 2-wide tables and 7-bit indexes; bypassing eliminates the update hazards]
  69. Aggressive Pipeline (7-wide Tables)
    - Low delay: N/3 + 2 cycles
    - More complex logic; possible aliasing
    [Diagram: 7-wide tables of 4-bit counters with 5-bit indexes produce a 3-bit prediction per access; trees of adders select among candidate prediction sums]
  70. Iterative Path Predictor Results
    - Improves branch prediction accuracy: MPKI drops from 4.3 to 3.6, flushed blocks drop from 12% to 8%, and each core saves 5% energy
    - Speeds up execution by speculating on predicate paths: 12% speedup when using 16 cores
    - 98% accuracy for path prediction
  71. Next Block Prediction
    - Predicting the next block is hard: there are multiple branches per block, and the taken branch depends on the executed predicate path (target correlation)
    - Exit IDs vs. predicate path: predict the taken predicate path and use it to predict the next block
    - Speculatively execute the predicate path
    [Diagram: a hyperblock whose taken predicate path determines which of BR1-BR3 fires, and thus the predicted block path]
  72. Using EDGE for Efficiency and Composition
    - EDGE ISAs and uArchitectures promise composition
      - Efficiency: dataflow and block atomicity
      - Flexibility: distributed microarchitectures
    - Look back at early EDGE designs and revisit the basics using a systematic methodology
    - Propose a new design to fulfill these goals
      - Systematic bottleneck analysis and removal (not covered)
      - Design space exploration for power efficiency
      - Balancing computation and communication hierarchically: distribution, communication, speculation, operand delivery
      - Balanced division of compiler and hardware
      - Complete power, performance, and scalability evaluation
  73. Platform Comparison Parameters
    - Atom area: 8.58 mm²; Core 2 area: 22.4 mm²; one T3 core area: 2.5 mm²
  74. Platform Comparison Parameters (65nm)
  75. Energy Delay Efficiency
  76. Energy Delay Efficiency 2
  77. Power Breakdown
  78. Mapping Computation on Thin Cores
    - Flat mapping: used by early EDGE designs to map speculative dataflow blocks
    [Diagram: 1, 2, or 4 blocks striped across cores, each core holding instruction queue (IQ), data cache (D), and register (R) slices]
  79. Static Placement (Flat Mapping)
    - Aims to exploit intra-block parallelism using 1-wide cores; applied to one block at a time
    - 128! possible schedules, and more complicated once register locations and compiler phases (block generation and register allocation) are considered
    - Good heuristics are based on the estimated critical path within the block; machine learning can help [LCPC08, PACT08]
    - Is this the right solution? A global solution is hard to achieve!
    - Observation: intra-block communication is a bottleneck
  80. Reducing Communication Overheads
    - Deep mapping: maybe a better choice for slightly stronger cores
    [Diagram: 1, 2, or 4 blocks, each mapped whole onto a single core's IQ/D/R slice]
  81. Dynamic Placement (Deep Mapping)
    - Aims to maximize inter-block parallelism [MICRO08]; intra-block parallelism is restricted by the issue width of each core
    - Saves cross-core communication by restricting it to register and memory communication among blocks
      - Cross-core communication causes significant power overheads compared to computation [Bill Dally SC10]
    - Simpler iCache structures
    - The hardware dynamically selects the cores for mapping blocks
  82. Different Mapping Strategies
    - Flat mapping: distributes all execution across cores; both intra-block dataflow and inter-block register communication cross cores
    - Deep mapping: maps one block per core; intra-block dataflow stays within a core, while inter-block register communication crosses cores
  83. Traditional Out-of-Order Execution
    - Hardware generates a dynamic dataflow graph of the fetched instructions, executes instructions out of order, and commits in order to update the architectural state of the program
    - Complicated logic is needed to generate and maintain the graph dynamically: does that scale?
    [Diagram: in-order fetch feeding register renaming, scheduling logic, a reorder buffer, registers, and memory, with in-order commit]
  84. Scaling Challenges of Fat Cores
    - The components in charge of constructing/maintaining the instruction window grow quadratically in complexity (number of ports x logic)
    - With the slowdown in power scaling, these structures no longer scale!
    [Diagram: the same out-of-order pipeline, highlighting the renaming, scheduling, and reorder-buffer structures]
  85. Compiler Can Help
    - Most of the dependence graph is known at compile time; some long-term memory and register dependences are not known statically
    - The compiler can generate these graphs and give them to hardware, significantly reducing dependence detection, fetch, and prediction overheads
    [Diagram: block prediction & fetch feeding the register file and memory, with in-order block commit]
  86. EDGE ISAs
  87. Distributing Execution across Cores
    - Mapping blocks of dataflow instructions onto a grid of many wimpy cores [MICRO08]
    - Goals: maximize performance with small communication overhead
    - Different tradeoffs: types of parallelism and communication (among instructions in each block vs. among parallel blocks), and the characteristics of light-weight cores
    - Requires design space exploration
  88. Different Mapping Strategies
    - Flat mapping (traditional): exploits intra-block parallelism; the compiler can help with scheduling and register allocation [LCPC08, PACT08], but a global solution is hard to achieve and intra-block dataflow communication is a bottleneck
    - Deep mapping: saves cross-core communication by limiting dataflow to single cores; limited intra-block parallelism; simpler instruction-cache structures; the cores for mapping blocks are selected dynamically
  89. Space Exploration for Distributing Computation
    - Inter-block register communication is now the bottleneck
    - Deep mapping is better for 2-wide issue: it saves energy and delay; flat mapping works for 1-wide issue
    [Plots: percent of total hops (inter- vs. intra-block) under flat mapping, and SPEC speedups over a single dual-issue core]
  90. Next Block Prediction
    - Coarse-grain branch prediction, as in trace processors: multiple predictions per access
    - Dataflow blocks present a similar problem: predicting the next block with multiple predicate paths in each block
    [Diagram: a predicted block path]
  91. Next Block Prediction
    - Branches in blocks are converted to predicates
    - Predicting the next block is hard: multiple branches per block, and the taken branch depends on the executed predicate path
    - Solution: predict the taken predicate path and use it to predict the next block; speculatively execute the dataflow predicate path!
    [Diagram: the hyperblock from slide 71]
  92. Block Reissue
    - Instruction reuse precedents: trace caches and loop buffers in out-of-order processors
    - There is no control flow within a block, so blocks can be reissued if they are still in the window
    - Reissuing critical instructions following pipeline flushes reduces the misspeculation penalty [HPCA11]
    - Saves energy & delay by reducing fetches and decodes by about 50%
  93. Reducing Multi-Core Register Communication
    - Problem: cross-core communication via distributed registers, through distributed register forwarding [TRIPS] or network broadcasts [TLS]
    - Solution: late values use very low-overhead direct register value bypassing, while the rest use register forwarding [HPCA11]
    [Diagram: B0-to-B1 register delivery, either forwarded via the register home core or bypassed directly for critical values]
  94. SPEC FP Performance/Energy Results
    - 45x energy-delay² improvement with 16 cores!
    [Plots: speedup and normalized core + L1 energy vs. number of cores over a single dual-issue core, with Pollack's rule marked]
  95. Bottlenecks in the TFlex Composable EDGE Design
    - Intra-block operand communication due to fine-grain instruction distribution among cores
    - Inter-block register communication among cores
    - Expensive refills after pipeline flushes
    - Poor next-block prediction accuracy and a low speculation rate due to predicates
    - Compiler-generated fanout trees built for high-fanout operand delivery
  96. Previously Presented in Thesis Proposal
    - Deep block mapping [MICRO08, PACT08, LCPC08]: more coarse-grained parallelism and less cross-core operand traffic by mapping each block onto one core
    - Register bypassing [HPCA11]: reducing cross-core register communication delay by bypassing register values predicted to be critical directly from producing to consuming cores
    - Block reissue [HPCA11]: reducing pipeline flush penalties by allowing instructions in previously executed blocks to be reissued while they are still in the instruction queue
  97. SPEC INT Performance/Energy Results
    - TFlex-8 is close to TRIPS, while T3-8 outperforms TRIPS by 1.43x with 25% less energy
    [Plots: as on slide 29]
  98. Results Breakdown
    - Major delay savers: IPP, block mapping, and block reissue; major energy savers: EOBs, block mapping, and block reissue
    [Plots: as on slide 30]
  99. SPEC INT Cross-Platform Comparison
    - Few cores (1 to 2): energy efficient with high performance; more cores (4 to 8): increased performance at low energy cost
    - Efficiently covers a much larger operating spectrum than DVFS
    [Plots: as on slide 31]
  100. SPEC FP Performance/Energy Results
    [Plots: as on slide 32]
  101. SPEC FP Cross-Platform Comparison
    - Significantly improved performance and energy efficiency compared to INT
    [Plots: as on slide 33]
  102. Final Criticality Results