Exploiting Criticality to Reduce Bottlenecks in Distributed Uniprocessors. Behnam Robatmili and Sibi Govindan, University of Texas at Austin; Doug Burger, Microsoft Research; Stephen W. Keckler, Architecture Research Group, NVIDIA & University of Texas at Austin
Motivation Running each single thread faster and more power-efficiently by using multiple cores increases parallel system efficiency and lessens the need for heterogeneity and its software complexity. Do we still care about single-thread execution?
Summary Distributed uniprocessors: multiple cores share their resources to run a single thread across them. Complexity scales, but cross-core delays add overhead. Which overheads limit performance scalability: registers, memory, fetch, branches? We measure critical cross-core delays using profile-based critical path analysis and propose low-overhead distributed mechanisms to mitigate these bottlenecks.
Distributed Uniprocessors [Figure: linear-complexity array of cores, each with its own register file (RF), branch predictor (BP), and L1 cache, connected by inter-core control and data communication links] Partition the single-thread instruction stream across cores. The distributed resources (RF, BP, and L1) act like one large processor, at the cost of inter-core instruction, data, and control communication. Goal: reduce these overheads.
Example Distributed Uniprocessors Older designs: Multiscalar and TLS machines used a noncontiguous instruction window. Recent designs: Core Fusion, TFlex, WiDGET, and Forwardflow. This study uses TFlex as the underlying distributed uniprocessor.
TFlex Distributed Uniprocessor [Figure: 32 physical cores composed into 8 logical processors (threads T0-T7); blocks B0-B3 mapped to cores C0-C3 with registers R0-R3 homed on different cores; intra-block IQ-local communication, inter-block cross-core communication, and control dependences shown] • Maps one predicated dataflow block to each core • Blocks communicate through registers, via register home cores (see the sketch below) • Example: B2 on C2 communicates with B3 on C3 through R1 homed on C1 • Intra-block communication is all dataflow
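A minimal sketch of the mapping idea described on this slide: a composed logical processor interleaves architectural registers and in-flight blocks across its participating cores, so a cross-block register value travels through a "home" core. The modulo-based hashes, the block-PC shift, and all names below are illustrative assumptions, not the actual TFlex hardware mapping.

# Python sketch, assuming simple modulo interleaving (illustrative only).
class LogicalProcessor:
    def __init__(self, core_ids):
        self.core_ids = core_ids          # physical cores fused into one uniprocessor

    def register_home_core(self, reg_num):
        # Architectural registers are interleaved across the cores' register banks.
        return self.core_ids[reg_num % len(self.core_ids)]

    def block_owner_core(self, block_pc):
        # Each in-flight predicated dataflow block is assigned to one core.
        return self.core_ids[(block_pc >> 6) % len(self.core_ids)]

lp = LogicalProcessor(core_ids=[0, 1, 2, 3])
# B2's owner core writes R1; the value is routed via R1's home core to the
# consumer block B3 on its owner core (two cross-core hops in the worst case).
print(lp.register_home_core(1), lp.block_owner_core(0x4000))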
Profile-based Critical Path Bottleneck Analysis Using critical path analysis to quantify scalable resources and bottlenecks. [Chart: SPEC INT critical-cycle breakdown into real work vs. network; callouts mark the fetch bottleneck caused by mispredicted blocks, one of the scalable resources, and the register communication overhead]
Distributed Criticality Analyzer [Figure: pipeline stages Fetch, Decode, Issue, Execute, Commit, RegWrite on the executing core; coordinator components: a Criticality Predictor whose block criticality status table entries hold pred_input/i_counter and pred_output/o_counter fields and supply predicted communication-critical instructions for the requested block PC, and a Block Reissue Engine with an available_blocks_bitpattern that selects the core for a fetch-critical block; resulting actions: RegWriteBypass, DecodeMerge, fetch-critical block reissue] • A statically selected coordinator core is assigned to each region of code executing on a core • Each coordinator core holds and maintains criticality data for the regions assigned to it (entry format sketched below) • It sends the criticality data to the executing core when the region is fetched • Enables register bypassing, dynamic merging, block reissue, etc.
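A hedged sketch of one entry in the coordinator's block criticality status table, using the field names shown in the figure (pred_input/i_counter, pred_output/o_counter). Field widths, initial values, and exact semantics are assumptions for illustration, not the actual hardware design.

# Python sketch of a block criticality status table entry (illustrative).
from dataclasses import dataclass

@dataclass
class BlockCriticalityEntry:
    block_pc: int          # the block this entry describes
    pred_input: int = -1   # instruction predicted as the late (input-critical) one
    i_counter: int = 0     # confidence counter for the input prediction
    pred_output: int = -1  # instruction predicted as the late (output-critical) one
    o_counter: int = 0     # confidence counter for the output prediction

# The coordinator looks an entry up by the requested block PC when the block is
# fetched and ships pred_input/pred_output to the executing core; the counters
# are trained with the majority-vote scheme sketched after the
# "Communication Criticality Predictor" slide below.
table = {0x4000: BlockCriticalityEntry(block_pc=0x4000, pred_output=7, o_counter=2)}
print(table[0x4000].pred_output)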
Register Bypassing Sample execution: block B2 communicates with B3 through register paths 1 and 2 (path 2 is slow). Critical register values on the critical path are bypassed. Coordinator core C0 predicts the late communication instructions in B2 and B3 (only path 2 is predicted). [Figure: cores C0-C3 holding blocks B0-B3 and register banks R0-R3; path 2 takes the direct register bypass between the executing cores, skipping the register home core; legend: intra-block IQ-local communication, inter-block cross-core communication, coordination signals, last arriving (input-critical) and last departing (output-critical) instructions]
Optimization Mechanisms • Output criticality: register bypassing • Explained on the previous slide (saves delay) • Input criticality: dynamic merging • Decode-time dependence height reduction for critical input chains (saves delay) • Fetch criticality: block reissue • Reissues critical instructions after pipeline flushes (saves energy and delay by reducing fetches by about 40%)
Aggregate Performance [Chart: 16-core speedups for each optimization mechanism individually and for all mechanisms combined]
Final Critical Path Analysis [Chart: SPEC INT critical-cycle breakdown for 1-core base, 8-core base, 8-core optimized, 16-core base, and 16-core optimized configurations; the network component shows an improved distribution]
Performance Scalability Results [Charts: SPEC INT and SPEC FP speedup over a single dual-issue core vs. number of cores] 16-core INT: 22% speedup. Follows Pollack's rule up to 8 cores.
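For reference (not from the slides themselves), Pollack's rule says single-thread performance grows roughly as the square root of the resources devoted to a core, so for an n-core composed processor the expected scaling is:

\[ \mathrm{Speedup}(n) \approx \sqrt{n}, \qquad \text{e.g. } \sqrt{8} \approx 2.8\times \text{ over one core.} \]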
Energy-Delay-Squared Product (ED²) [Chart: ED² at 65 nm, 1.0 V, 1 GHz] 8-core INT: 50% increase in ED². The most energy-efficient configuration moves from 4 to 8 cores.
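For clarity, the metric plotted here is the standard energy-delay-squared product, which weights delay more heavily than energy (lower is better):

\[ ED^2 = E \cdot D^2 \]

where E is the total energy and D is the execution time of the benchmark.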
Conclusions and Future Work Goal: a power- and performance-scalable distributed uniprocessor. This work addressed several key performance scalability limitations. Next steps (toward a 4x speedup on SPEC INT):
Backup Slides • Setup and Benchmarks • CPA Example • Single Core IPCs • Communication Criticality Example • Fetch Criticality Example • Full Performance Results • Criticality Predictor • Motivation
Summary Distributed uniprocessors: multiple cores share their resources to run a single thread across them. Complexity scales, but cross-core delays add overhead. Do we still care about single-thread execution? Running each single thread effectively across multiple cores significantly increases parallel system efficiency and lessens the need for heterogeneity and its software complexity. What overheads limit performance scalability: registers, memory, fetch, branches? We measure critical cross-core delays using static critical path analysis and find ways to hide them. Major detected bottlenecks: cross-core register communication and refetches after flushes. We propose low-overhead distributed mechanisms to mitigate these bottlenecks.
Motivation • Need for scaling single-thread performance/power in multicore • Amdahl’s law • Optimized power/performance for each thread • Distributed Uniprocessors • Running single-thread code across distributed cores • Sharing resources but also partitioning overhead • Focus of this work • Static critical path analysis to quantify bottlenecks • Dynamic hardware to reduce critical cross-core latencies
Distributed Uniprocessors [Figure: array of cores, each with its own RF, BP, and L1] Partition the single-thread instruction stream across cores. The distributed resources (RF, BP, and L1) act like one large processor.
Exploiting Communication Criticality Sample execution: block B0 communicates with B1 through a register home core. Critical instructions in blocks B0 and B1 are predicted, critical register values are forwarded, and the software fanout for the critical input is replaced with broadcast messages. [Figure: blocks B0-B3 and register banks R0-R3 across cores; legend: register forwarded, broadcast message, fanout, intra-block IQ-local communication, inter-block cross-core communication, last arriving (input-critical) and last departing (output-critical) instructions]
Dynamic Merging Results [Chart: 16-core runs; speedup over no merging] 65% of the maximum benefit is obtained with a cfactor of 1. cfactor: number of predicted late inputs per block. Full merge: running the algorithm on all register inputs.
Block Reissue Results [Chart: 16-core runs; block hit rates vs. IQ; results affected by memory dependence prediction]
Critical Path Bottleneck Analysis Using critical path analysis to quantify scalable resources and bottlenecks. [Chart: SPEC INT critical-cycle breakdown; callouts mark the fetch bottleneck caused by mispredicted blocks, one of the scalable resources, and the register communication overhead]
Performance Scalability Results [Charts: SPEC INT and SPEC FP speedup over a single dual-issue core vs. number of cores] 16-core INT: 22% speedup. Follows Pollack's rule up to 8 cores.
Block Reissue • Each core maintains a table of available blocks and the status of their cores • Done by extending the allocate/commit protocols • Policies • Block lookup: find previously executed copies of the predicted block still resident in a core • Block replacement: refetch if the predicted block is not found in any core • Major power savings on fetch/decode (see the sketch below)
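A hedged sketch of the reissue decision this slide describes: after a flush, if a still-valid copy of the next predicted block is resident on a free core, reissue it there instead of refetching and redecoding it. The table layout, bit-pattern encoding, and method names are assumptions for illustration.

# Python sketch of a block-reissue lookup (illustrative, not the real protocol).
class BlockReissueEngine:
    def __init__(self, num_cores):
        # available_blocks[block_pc] -> bit pattern of cores holding a decoded copy
        self.available_blocks = {}
        self.free_cores = (1 << num_cores) - 1

    def on_block_commit(self, block_pc, core):
        # A committed block stays decoded in its core's instruction queue until
        # evicted, so it can be reissued later without refetching.
        self.available_blocks[block_pc] = self.available_blocks.get(block_pc, 0) | (1 << core)

    def next_block(self, predicted_pc):
        resident = self.available_blocks.get(predicted_pc, 0) & self.free_cores
        if resident:
            core = (resident & -resident).bit_length() - 1   # lowest free core with a copy
            return ("reissue", core)                          # skip fetch and decode
        return ("refetch", None)                              # fall back to the normal fetch path

engine = BlockReissueEngine(num_cores=4)
engine.on_block_commit(0x4000, core=2)
print(engine.next_block(0x4000))   # -> ('reissue', 2)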
TFlex Cores [Figure: array of TFlex cores and L2 banks with 1-cycle inter-core latency; each core contains an instruction queue, register file, RWQ, L1 cache, LSQ, and branch predictor. Figure courtesy of Katie Coons] • Each core has (shared when fused): a 1-ported cache bank (LSQ), 1-ported register banks (RWQ), a 128-entry RAM-based IQ, and a branch prediction table • When fused, registers, memory locations, and BP tables are striped across cores
TFlex Cores [Figure: array of TFlex cores and L2 banks with 1-cycle inter-core latency; each core contains an instruction queue, register file, RWQ, L1 cache, LSQ, and branch predictor. Figure courtesy of Katie Coons] • Each core has the minimum resources for one block: a 1-ported cache bank, a 1-ported register bank (128 registers), a 128-entry RAM-based IQ, and a branch prediction table • The RWQ and LSQ hold the transient architectural state during execution and commit it at block commit time • The LSQ supports memory dependence prediction
Critical Output Bypassing • Bypass late outputs directly to their destination instructions • Similar to memory bypassing and cloaking [Sohi '99], but no speculation is needed • Uses predicted late outputs • Restricted to subsequent blocks
Predicting Critical Instructions • State-of-the-art predictor [Fields '01] • High communication and power overheads • Large storage overhead • Complex token-passing hardware • Even harder to port to a dynamic CMP • Need a simple, low-overhead, yet effective predictor
Proposed Mechanisms • Cross-core register communication → register forwarding • Dataflow software fanout trees → dynamic instruction merging • Expensive refill after pipeline flushes (fixed block sizes, poor next-block prediction accuracy, predicates not being predicted) → block reissue
Critical Path Analysis [Figure: simulator feeds an event interface that drives the critical path analysis tool] • Processes the program dependence graph [Bodik '01] • Nodes: microarchitectural events • Edges: data and microarchitectural dependences • Measures the contribution of each microarchitectural resource • More effective than simulation or profile-based techniques • Built on top of [Nagarajan '06] (see the sketch below)
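A minimal sketch of the analysis described above, under simplified assumptions: microarchitectural events are graph nodes, dependences are latency-weighted edges tagged with the resource that caused them, and the edge weights along the longest path are attributed per resource. Event names, latencies, and the longest-path algorithm choice are illustrative, not the actual tool.

# Python sketch: attribute critical-path cycles to microarchitectural resources.
from collections import defaultdict

def critical_path_breakdown(nodes, edges):
    """nodes: list in topological order; edges: (src, dst, latency, resource)."""
    index = {n: i for i, n in enumerate(nodes)}
    dist, via = {n: 0 for n in nodes}, {}
    for src, dst, lat, res in sorted(edges, key=lambda e: index[e[0]]):
        if dist[src] + lat > dist[dst]:            # longest-path DP over the DAG
            dist[dst] = dist[src] + lat
            via[dst] = (src, lat, res)
    node = max(dist, key=dist.get)                 # end of the critical path
    breakdown = defaultdict(int)
    while node in via:                             # walk the path backwards
        src, lat, res = via[node]
        breakdown[res] += lat                      # charge these cycles to the resource
        node = src
    return dict(breakdown)

events = ["fetch_B0", "exec_B0", "regwrite_R1", "exec_B1", "commit_B1"]
deps = [("fetch_B0", "exec_B0", 2, "fetch"),
        ("exec_B0", "regwrite_R1", 1, "execute"),
        ("regwrite_R1", "exec_B1", 4, "register_network"),
        ("exec_B1", "commit_B1", 1, "execute")]
print(critical_path_breakdown(events, deps))  # {'execute': 2, 'register_network': 4, 'fetch': 2}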
IPC of a Single 2-Wide TFlex Core [Chart: per-benchmark IPC] SPEC INT: IPC = 0.8; SPEC FP: IPC = 0.9
[Chart: SPEC INT, speculation-aware configuration, cfactor = 1]
Critical Path Analysis • Critical path: the longest dependence path during program execution • Determines execution time • Critical path analysis [Bodik '01] • Measures the contribution of each microarchitectural resource to critical cycles • Built on top of the TRIPS CPA [Nagarajan '06]
Exploiting Fetch Criticality Predicted fetched blocks: B0, B1, B0, B0. Actual block order: B0, B0, B0, B0. With block reissue, the coordinator core (C0) detects the B0 instances already resident on C2 and C3 and reissues them; without block reissue, all three blocks would be flushed and refetched. [Figure: cores C0-C3 showing fetched, reissued, and refetched blocks, the cross-core block control order, coordination signals, and the CFG] 50% reduction in fetch and decode operations.
Communication Criticality Predictor • With block-atomic execution, late inputs and outputs are the critical ones • Late = the last outputs/inputs departing/arriving before block commit • 70% and 50% of late inputs/outputs are critical for SPEC INT and FP, respectively • Extends the next-block predictor protocol • Uses the MJRTY algorithm [Moore '82] to predict and train (see the sketch below) • A confidence counter is incremented/decremented on a correct/incorrect prediction of the current majority
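A hedged sketch of the MJRTY-style training this slide describes (Boyer-Moore majority vote [Moore '82]): one candidate plus a saturating confidence counter per block predicts which instruction is the late, communication-critical one. This illustrates the voting scheme only; counter widths and table organization are assumptions.

# Python sketch of a majority-vote criticality predictor (illustrative).
class MajorityCriticalityPredictor:
    def __init__(self, max_conf=3):
        self.candidate = {}   # block_pc -> instruction id predicted critical
        self.confidence = {}  # block_pc -> saturating confidence counter
        self.max_conf = max_conf

    def predict(self, block_pc):
        # Return the instruction currently believed to be the late one (or None).
        return self.candidate.get(block_pc)

    def train(self, block_pc, observed_critical_inst):
        if self.candidate.get(block_pc) == observed_critical_inst:
            # Correct prediction of the current majority: strengthen confidence.
            self.confidence[block_pc] = min(self.confidence.get(block_pc, 0) + 1, self.max_conf)
        elif self.confidence.get(block_pc, 0) > 0:
            # Incorrect prediction: weaken confidence.
            self.confidence[block_pc] -= 1
        else:
            # Confidence exhausted: adopt the newly observed instruction as the candidate.
            self.candidate[block_pc] = observed_critical_inst
            self.confidence[block_pc] = 1

p = MajorityCriticalityPredictor()
for inst in [7, 7, 3, 7]:          # observed late instruction over four executions of one block
    p.train(0x4000, inst)
print(p.predict(0x4000))           # -> 7, the majority candidate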
Exploiting Communication Criticality • Selective register forwarding • Critical register outputs are forwarded directly to subsequent cores • Other outputs use the original indirect register forwarding through the RWQs • Selective instruction merging • Specializes the decode of instructions that depend on a critical register input • Eliminates dataflow fanout moves in address computation networks (see the sketch below)
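A hedged illustration of why eliminating the dataflow fanout helps: in a block-atomic dataflow ISA, a value consumed by many instructions is normally distributed through a tree of move instructions, adding roughly log2(consumers) levels to the dependence height of a critical input, whereas merging/broadcasting delivers it in one step. The tree-depth model and numbers below are illustrative, not measured results from the paper.

# Python sketch comparing dependence-height overhead of fanout trees vs. broadcast.
import math

def extra_height_with_fanout_tree(num_consumers):
    # With a fixed number of targets per move, fanning out to N consumers
    # needs a tree of moves roughly log2(N) levels deep.
    return 0 if num_consumers <= 1 else math.ceil(math.log2(num_consumers))

def extra_height_with_broadcast(num_consumers):
    # A broadcast (merged decode) reaches all consumers in a single step.
    return 0 if num_consumers == 0 else 1

for n in (2, 4, 8, 16):
    print(n, extra_height_with_fanout_tree(n), extra_height_with_broadcast(n))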
Exploiting Fetch Criticality • Blocks after mispredictions are critical • Many flushed blocks may be re-fetched right after a misprediction • Blocks are predicated so old blocks can be reissued if their cores are free • Each owner core keeps track of its blocks • Extended allocate/commit protocols • Major power saving on fetch/decode
Exploiting Communication Criticality Sample execution: block B2 communicates with B3 through register paths 1 and 2 (path 2 is slow). Critical register values on the critical path are fast-forwarded. Coordinator core C0 predicts the late communication instructions in B2 and B3 (only path 2 is predicted). [Figure: cores C0-C3 with blocks B0-B3 and register banks R0-R3; path 2 uses the direct register bypass; legend: intra-block IQ-local communication, inter-block cross-core communication, coordination signals, last arriving (input-critical) and last departing (output-critical) instructions]
Summary Distributed uniprocessors: multiple cores share their resources to run a single thread across them. Running each single thread effectively across multiple cores significantly increases parallel system efficiency and lessens the need for heterogeneity and its software complexity. Complexity scales, but cross-core delays add overhead. What overheads limit performance scalability: registers, memory, fetch, branches? Do we still care about single-thread execution? We measure critical cross-core delays using static critical path analysis and find ways to hide them. We propose low-overhead distributed mechanisms to mitigate these bottlenecks.