This presentation explores the benefits of decoupled pipelines in computer engineering: the motivation for asynchronous design, the advantages a decoupled pipeline offers, its design and average-case optimizations, its synchronization mechanisms, and experimental results.
Decoupled Pipelines: Rationale, Analysis, and Evaluation
Frederick A. Koopmans, Sanjay J. Patel
Department of Computer Engineering, University of Illinois at Urbana-Champaign
Outline • Introduction & Motivation • Background • DSEP Design • Average Case Optimizations • Experimental Results
Motivation • Why Asynchronous? • No clock skew • No clock distribution circuitry • Lower power (potentially) • Increased modularity • But what about performance? • What is the architectural benefit of removing the clock? • Decoupled Pipelines!
Motivation • Advantages of a Decoupled Pipeline • Pipeline achieves average-case performance • Rarely taken critical paths no longer affect performance • New potential for average-case optimizations
Synchronous vs. Decoupled
[Figure: a synchronous pipeline, in which a global clock drives latches between Stage1, Stage2, and Stage3, contrasted with a decoupled pipeline, in which per-stage controllers exchange go/ack events and data flows through elastic buffers.]
• Synchronous: the synchronizing mechanism is the clock; data passes between stages through latches
• Decoupled: the synchronizing mechanism is an asynchronous communication protocol; self-timed stages pass data through elastic buffers
Outline • Introduction & Motivation • Background • DSEP Design • Average Case Optimizations • Experimental Results
Self-Timed Logic
[Figure: a self-timing circuit (start in, done out) paired with the computational circuit (input in, output out).]
• Bounded Delay Model
• Definition: event = signal transition
• A start event is provided when the inputs are available
• A done event is produced when the outputs are stable
• Fixed delay based on critical-path analysis
• The computational circuit is unchanged
Asynchronous Logic Gates
• C-gate (logical AND): waits for events to arrive on both inputs
• XOR-gate (logical OR): waits for an event to arrive on either input
• SEL-gate (logical DEMUX): routes the input event to one of the outputs
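To make the event semantics concrete, here is a minimal Python sketch of the three gate types; the class names and the reset-after-firing behavior are our own simplifications, not circuit-level models.

```python
# Event-level models of the three asynchronous gate types.
# Class and method names are illustrative, not from the paper.

class CGate:
    """Logical AND on events: fires only after an event arrives on BOTH inputs."""
    def __init__(self):
        self.seen = [False, False]

    def event(self, i):
        """Record an event on input i (0 or 1); return True if the output fires."""
        self.seen[i] = True
        if all(self.seen):
            self.seen = [False, False]  # reset for the next pair of events
            return True
        return False

class XorGate:
    """Logical OR on events: fires as soon as an event arrives on EITHER input."""
    def event(self, i):
        return True

class SelGate:
    """Logical DEMUX on events: routes the input event to one selected output."""
    def __init__(self, n_outputs):
        self.n_outputs = n_outputs

    def event(self, select):
        """Return the index of the output that receives the event."""
        assert 0 <= select < self.n_outputs
        return select

c = CGate()
assert c.event(0) is False  # first input event: still waiting on the other
assert c.event(1) is True   # second input event: output event fires
```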
Asynchronous Communication Protocol
[Figure: a sender stage and a receiver stage connected by go, ack, and data wires; a timing diagram shows two transactions, each one go transition plus one ack transition.]
• 2-step, event-triggered, level-insensitive protocol
• Transactions are encoded in go/ack events
• Asynchronously passes instructions between stages
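A rough Python sketch of such a two-phase (transition-signaling) handshake, with booleans standing in for wire levels so that one toggle equals one event; the `Channel` class and its method names are our own, not from the paper.

```python
# Two-phase go/ack handshake: each transaction is one go transition
# (sender) followed by one ack transition (receiver). Level-insensitive:
# only transitions matter, not whether the wires are high or low.

class Channel:
    def __init__(self):
        self.go = False    # toggled by the sender: "new data is valid"
        self.ack = False   # toggled by the receiver: "data consumed"
        self.data = None

    # --- sender side ---
    def sender_ready(self):
        return self.go == self.ack       # previous transaction acknowledged

    def send(self, value):
        assert self.sender_ready()
        self.data = value
        self.go = not self.go            # one transition = one go event

    # --- receiver side ---
    def data_available(self):
        return self.go != self.ack

    def receive(self):
        assert self.data_available()
        value = self.data
        self.ack = not self.ack          # one transition = one ack event
        return value

ch = Channel()
ch.send("inst_1"); print(ch.receive())   # Transaction 1
ch.send("inst_2"); print(ch.receive())   # Transaction 2
```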
Outline • Introduction & Motivation • Background • DSEP Design • Average Case Optimizations • Experimental Results
DSEP Microarchitecture
Decoupled, Self-Timed, Elastic Pipeline
[Figure: the DSEP pipeline from the I-cache through Fetch, Decode, Rename, Read/Reorder, Issue, Execute, Data Read, Writeback, and Retire/Commit, with flush and result paths.]
• At a high level:
• 9-stage dynamic pipeline
• Multiple instruction issue
• Multiple functional units
• Out-of-order execution
• Looks like the Intel P6 microarchitecture
• What's the difference?
DSEP Microarchitecture
Decoupled, Self-Timed, Elastic Pipeline
[Figure: the same DSEP pipeline diagram.]
• Decoupled:
• Each stage controls its own latency, based on its local critical path
• Stage balancing is not important
• Each stage can have several different latencies, selected based on its inputs
• The pipeline is operating at several different speeds simultaneously!
Pipeline Elasticity
[Figure: Fetch, Execute, and Retire connected through buffers.]
• Definition: a pipeline's ability to stretch with the latency of its instruction stream
• Global elasticity
• Provided by the reservation stations and reorder buffer
• Same for synchronous and asynchronous pipelines
• When Execute stalls, the buffers allow Fetch and Retire to keep operating
Pipeline Elasticity
[Figure: the DSEP pipeline with micropipelines between stages.]
• Local elasticity
• Needed for a completely decoupled pipeline
• Provided by micropipelines: variable-length queues between stages
• Efficient implementation, little overhead
• Behave like shock absorbers
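A toy Python model of a micropipeline as a small bounded FIFO between two stages illustrates the shock-absorber effect: the producer can keep running while the consumer is briefly slow. The class name and capacity are assumptions of ours.

```python
from collections import deque

class Micropipeline:
    """Bounded FIFO standing in for a variable-length queue between stages."""
    def __init__(self, capacity=2):
        self.q = deque()
        self.capacity = capacity

    def can_accept(self):               # producer stalls only when full
        return len(self.q) < self.capacity

    def push(self, item):
        assert self.can_accept()
        self.q.append(item)

    def has_item(self):                 # consumer stalls only when empty
        return len(self.q) > 0

    def pop(self):
        assert self.has_item()
        return self.q.popleft()

# With no queue the stages would be rigidly coupled; even a length-2
# micropipeline absorbs momentary latency differences between neighbors.
mp = Micropipeline(capacity=2)
mp.push("inst_1")
mp.push("inst_2")        # producer ran ahead while the consumer was busy
print(mp.pop(), mp.pop())
```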
Outline • Introduction & Motivation • Background • DSEP Design • Average Case Optimizations • Experimental Results
Analysis • Synchronous Processor • Each stage runs at the speed of the worst-case stage running its worst-case operation • Designer: Focus on critical paths, stage balancing • DSEP • Each stage runs at the speed of its own average operation • Designer: Optimize for most common operation • Fundamental advantage of Decoupled Pipeline
Average-Case Optimizations
[Figure: a generic stage in which select logic steers inputs to either a short or a long operation, with a MUX merging the outputs.]
• Consider a generic example: if the short operation is much more common, throughput is proportional to the select logic
• Designer's strategy:
• Implement fine-grained latency tuning
• Avoid the latency of untaken paths
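A minimal Python sketch of the generic stage; all three function bodies are invented placeholders, the point being only the control shape: cheap select logic steering each input down a short or long path.

```python
def is_short_op(x):
    return x < 16              # stand-in for the (cheap) select logic

def short_operation(x):
    return x + 1               # stand-in fast path

def long_operation(x):
    return (x * 37) % 1024     # stand-in rarely taken slow path

def generic_stage(x):
    # When short ops dominate, average latency tracks the select logic
    # plus the short path; the long path no longer sets the pace.
    if is_short_op(x):
        return short_operation(x)
    return long_operation(x)
```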
Average-Case ALU
[Figure: the ALU self-timing circuit, in which a SEL gate picks a matched delay per operation class (arithmetic, logic, shift, compare) and an XOR gate merges the done events, alongside the unchanged ALU computational circuit.]
• Tune the ALU latency to closely match the input operation
• ALU performance is proportional to the average operation
• The computational circuit is unchanged
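One way to picture the self-timing side is a matched-delay table keyed by operation class: the SEL gate picks one delay line, and done fires after that much time. The operation classes come from the figure; the numeric delays below are invented for illustration.

```python
# Matched delays (in arbitrary time units) per ALU operation class.
# The values are illustrative; the paper's Execute latencies span
# 20 to 600 time units.
ALU_DELAY = {
    "logic":   20,
    "shift":   40,
    "compare": 40,
    "arith":   80,
}

def alu_done_time(op_class, start_time):
    """Time at which the 'done' event fires for this operation."""
    return start_time + ALU_DELAY[op_class]
```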
Average-Case Decoder
[Figure: the decoder self-timing circuit, in which a SEL gate picks a matched delay per instruction format and an XOR gate merges the done events, alongside the unchanged decoder computational circuit.]
• Tune the decoder latency to match the input instruction
• Common instructions often have simple encodings
• Prioritize the most frequent instructions
Average-Case Fetch Alignment
[Figure: fetch path in which an aligned? check lets aligned fetch blocks bypass the align/mask logic via a MUX.]
• Optimize for aligned fetch blocks
• If the fetch block is aligned on a cache line, it can skip the alignment and masking overhead
• The optimization is effective when software/hardware alignment optimizations are effective
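A sketch of the fast path in Python; `read_line` and `align_and_mask` are hypothetical stand-ins for the real fetch hardware, and the 32-byte line size is our assumption.

```python
LINE_BYTES = 32  # assumed cache-line size (not specified in the slides)

def fetch_block(addr, read_line, align_and_mask):
    offset = addr % LINE_BYTES
    line = read_line(addr - offset)      # fetch the containing cache line
    if offset == 0:
        return line                      # fast path: aligned, skip align/mask
    return align_and_mask(line, offset)  # slow path: shift and mask the block
```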
Average-Case Cache Access
[Figure: cache path in which a to-same-line? check selects between the previous-line register and a fresh cache read via a MUX.]
• Optimize for consecutive reads to the same cache line
• Allows subsequent references to skip the cache access
• Effective for small-stride access patterns and tight loops in the I-cache
• Very little overhead for non-consecutive references
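A sketch of the same-line fast path as a small wrapper around a cache read; the class and parameter names are ours.

```python
class SameLineCache:
    """Remembers the previously read line; consecutive reads to the same
    line return it directly and skip the cache access."""
    def __init__(self, read_line, line_bytes=32):
        self.read_line = read_line       # the real (slower) cache access
        self.line_bytes = line_bytes
        self.prev_index = None
        self.prev_line = None

    def read(self, addr):
        index = addr // self.line_bytes
        if index == self.prev_index:     # fast path: same line as last read
            return self.prev_line
        self.prev_index = index          # slow path: access the cache
        self.prev_line = self.read_line(index * self.line_bytes)
        return self.prev_line
```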
Average-Case Comparator
[Figure: comparator in which a 4-bit compare and a 32-bit compare feed a MUX selecting the output.]
• Optimize for the case that a difference exists in the lower 4 bits of the inputs
• The 4-bit comparison is more than 50% faster than the 32-bit one
• Very effective for iterative loops
• Can be extended to tag comparisons
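For an equality-style comparison (branch conditions, tag checks) the optimization looks like the sketch below: if the low 4 bits already differ, the result is known without waiting on the full 32-bit compare. The function is our illustration, not the paper's circuit.

```python
def not_equal_32(a, b):
    """32-bit inequality with a 4-bit fast path."""
    if (a ^ b) & 0xF:        # low nibbles differ: decided immediately
        return True          # fast path, common for loop counters
    return (a ^ b) != 0      # slow path: full 32-bit comparison
```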
Outline • Introduction & Motivation • Background • DSEP Design • Average Case Optimizations • Experimental Results
Simulation Environment • VHDL Simulator using Renoir Design Suite • MIPS I Instruction set • Fetch and Retire Bandwidth = 1 • Execute Bandwidth ≤ 4 • 4-entry split Instruction Window • 64-entry Reorder Buffer • Benchmarks • BS: 50-element bubble sort • MM: 10x10 integer matrix multiply
Two Pipeline Configurations ("Synchronous" clock period = 120 time units)

Operation               DSEP Latencies                  Fixed Latencies
Fetch                   100                             120
Decode                  50/80/120                       120
Rename                  80/120/150                      120
Read                    120                             120
Execute                 20/40/80/100/130/150/360/600    120/360/600
Retire                  5/100/150                       120
Caches                  100                             120
Main Memory             960                             960
Micropipeline Register  5                               5
DSEP Performance
• Compared the Fixed and DSEP configurations
• DSEP increased performance by 28% and 21% for BS and MM respectively
[Figure: execution time for the Fixed and DSEP configurations on each benchmark.]
Micropipeline Performance
• Goals:
• Determine the need for local elasticity
• Determine appropriate lengths for the queues
• Method: evaluate DSEP configurations of the form AxBxC
• A: micropipelines in Decode, Rename, and Retire
• B: micropipelines in Read
• C: micropipelines in Execute
• All configurations include a fixed-length instruction window and reorder buffer
Micropipeline Performance
• Measured percent speedup over the 1x1x1 configuration
• 2x2x1 was best for both benchmarks: a 2.4% performance improvement for BS and 1.7% for MM
• Stalls in Fetch were reduced by 60% for 2x2x1
[Figure: percent speedup for bubble sort and matrix multiply.]
OOO Engine Utilization
• Measured out-of-order engine utilization for the Instruction Window (IW) and Reorder Buffer (RB)
• Utilization = average number of instructions in the buffer
• IW utilization up 75%, RB utilization up 40%
[Figure: instruction window and reorder buffer utilization.]
Total Performance
• Compared the Fixed and DSEP configurations
• DSEP 2x2x1 increased performance by 29% and 22% for BS and MM respectively
[Figure: execution time for the Fixed and DSEP 2x2x1 configurations on each benchmark.]
Conclusions
• Decoupled self-timing
• Average-case optimizations significantly increase performance
• Rarely taken critical paths no longer matter
• Elasticity
• Removes pipeline jitter arising from decoupled operation
• Increases utilization of existing resources
• Not as important as the average-case optimizations (at least in our experiments)