
Decoupled Pipelines: Rationale, Analysis, and Evaluation

Frederick A. Koopmans, Sanjay J. Patel. Department of Computer Engineering, University of Illinois at Urbana-Champaign.


Presentation Transcript


  1. Decoupled Pipelines: Rationale, Analysis, and Evaluation • Frederick A. Koopmans, Sanjay J. Patel • Department of Computer Engineering, University of Illinois at Urbana-Champaign

  2. Outline • Introduction & Motivation • Background • DSEP Design • Average Case Optimizations • Experimental Results

  3. Motivation • Why Asynchronous? • No clock skew • No clock distribution circuitry • Lower power (potentially) • Increased modularity • But what about performance? • What is the architectural benefit of removing the clock? • Decoupled Pipelines!

  4. Motivation • Advantages of a Decoupled Pipeline • Pipeline achieves average-case performance • Rarely taken critical paths no longer affect performance • New potential for average-case optimizations

  5. Synchronous vs. Decoupled [Figure: two three-stage pipelines (Stage1, Stage2, Stage3) passing data. Synchronous: the synchronizing mechanism is a latch driven by a common clock. Decoupled: each stage has its own self-timing logic (Control1, Control2, Control3), and stages are linked by elastic buffers using a go/ack asynchronous communication protocol.]

  6. Outline • Introduction & Motivation • Background • DSEP Design • Average Case Optimizations • Experimental Results

  7. Self-Timed Logic [Figure: a self-timing circuit (Start in, Done out) alongside the computational circuit (Input in, Output out)] • Bounded Delay Model • Definition: event = signal transition • start event provided when inputs are available • done event produced when outputs are stable • Fixed delay based on critical path analysis • Computational circuit is unchanged
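The bounded delay model on this slide can be sketched in a few lines of Python. This is only an illustrative model, not the authors' VHDL: the class name `SelfTimedBlock` and the 120-unit delay are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelfTimedBlock:
    """A computational circuit paired with a fixed-delay self-timing circuit."""

    # Worst-case delay of the computational circuit in time units.
    # Assumed value for illustration; in practice it comes from critical path analysis.
    critical_path_delay: int

    def done_time(self, start_time: int) -> int:
        # The done event is the start event delayed by the critical path,
        # so the outputs are guaranteed stable when done fires.
        return start_time + self.critical_path_delay


adder = SelfTimedBlock(critical_path_delay=120)
```

The computational circuit itself never appears in the model, which mirrors the slide's point that it is left unchanged; only the timing of the done event is added.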

  8. Asynchronous Logic Gates [Figure: symbols for the C, XOR, and SEL gates; the SEL gate has outputs labeled 1 and 0] • C-gate: logical AND • Waits for events to arrive on both inputs • XOR-gate: logical OR • Waits for an event to arrive on either input • SEL-gate: logical DEMUX • Routes input event to one of the outputs
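A minimal sketch of the three gates, treating each event as the time at which a transition arrives. The function names and the use of `None` for "no event on this wire" are assumptions made for this example.

```python
def c_gate(t_a, t_b):
    """Logical AND of events: fires once events have arrived on both inputs."""
    return max(t_a, t_b)


def xor_gate(t_a=None, t_b=None):
    """Logical OR of events: fires when an event arrives on either input."""
    arrivals = [t for t in (t_a, t_b) if t is not None]
    if not arrivals:
        raise ValueError("no input event arrived")
    return min(arrivals)


def sel_gate(t_in, select):
    """Logical DEMUX: routes the input event to exactly one of two outputs.

    Returns (output_0, output_1); the output not selected carries no event.
    """
    return (t_in, None) if select == 0 else (None, t_in)
```

For example, `c_gate(3, 7)` fires at time 7 because it must wait for the later input, while `xor_gate(None, 5)` fires at time 5 as soon as its only active input arrives.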

  9. Asynchronous Communication Protocol [Figure: a Sender Stage and a Receiver Stage connected by go, ack, and data wires; the timing diagram shows go and ack each making one transition per transaction, with data_1 sent in Transaction 1 and data_2 in Transaction 2] • 2-Step, Event Triggered, Level Insensitive Protocol • Transactions are encoded in go / ack events • Asynchronously passes instructions between stages
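The two-step go/ack exchange can be sketched as follows. This is an illustrative model under one assumption that the slide does not state explicitly: the channel is idle whenever the go and ack levels agree, as in conventional two-phase transition signalling. The class and method names are hypothetical.

```python
class Channel:
    """Two-step, event-triggered handshake between a sender and a receiver stage."""

    def __init__(self):
        self.go = 0       # toggled by the sender: one transition per transaction
        self.ack = 0      # toggled by the receiver: one transition per transaction
        self.data = None

    def can_send(self):
        # Assumption: equal levels mean the last transaction was acknowledged.
        return self.go == self.ack

    def send(self, value):
        if not self.can_send():
            raise RuntimeError("previous transaction not yet acknowledged")
        self.data = value
        self.go ^= 1      # go event: a transition in either direction

    def can_receive(self):
        return self.go != self.ack

    def receive(self):
        if not self.can_receive():
            raise RuntimeError("no pending transaction")
        value = self.data
        self.ack ^= 1     # ack event completes the transaction
        return value
```

Because only transitions carry meaning, the first transaction leaves both wires at 1 and the second returns them to 0, which is what makes the protocol level insensitive.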

  10. Outline • Introduction & Motivation • Background • DSEP Design • Average Case Optimizations • Experimental Results

  11. DSEP Microarchitecture: Decoupled, Self-Timed, Elastic Pipeline [Figure: DSEP pipeline diagram with labels From I-Cache, Flush, Fetch, Decode, Retire, Write back, Rename, Read/Reorder, Issue, Results, Execute, Data Read, Commit] • At a high level: • 9-stage dynamic pipeline • Multiple instruction issue • Multiple functional units • Out-of-order execution • Looks like Intel P6 µarch • What’s the difference?

  12. DSEP Microarchitecture: Decoupled, Self-Timed, Elastic Pipeline [Figure: DSEP pipeline diagram] • Decoupled: • Each stage controls its own latency • Based on local critical path • Stage balancing not important • Each stage can have several different latencies • Selection based on inputs • Pipeline is operating at several different speeds simultaneously!

  13. Pipeline Elasticity [Figure: Fetch, Execute, and Retire stages] • Definition: • Pipeline’s ability to stretch with the latency of its instruction stream • Global Elasticity • Provided by reservation stations and reorder buffer • Same for synchronous and asynchronous pipelines • When Execute stalls, the buffers allow Fetch and Retire to keep operating

  14. Pipeline Elasticity [Figure: DSEP pipeline diagram] • Local Elasticity • Needed for a completely decoupled pipeline • Provided by micropipelines • Variable length queues between stages • Efficient implementation, little overhead • Behave like shock absorbers
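The "shock absorber" behaviour of a queue between two stages can be illustrated with a small behavioural model. This is a sketch under stated assumptions, not a model of the actual micropipeline circuits: it assumes two stages with per-instruction latencies, a queue of `queue_slots` entries between them, and a producer that blocks when the queue is full. The function name and all latency values are made up for the example.

```python
def run_two_stages(producer_lat, consumer_lat, queue_slots):
    """Return the time at which the consumer finishes the last item."""
    n = len(producer_lat)
    consumer_start = [0] * n
    producer_free = 0   # time at which the producer may start its next item
    consumer_free = 0   # time at which the consumer may start its next item
    for i in range(n):
        produced = producer_free + producer_lat[i]
        # A slot is free once the item queue_slots positions earlier has left the queue.
        slot_free = consumer_start[i - queue_slots] if i >= queue_slots else 0
        entered = max(produced, slot_free)      # the producer blocks until a slot opens
        producer_free = entered
        consumer_start[i] = max(entered, consumer_free)
        consumer_free = consumer_start[i] + consumer_lat[i]
    return consumer_free


# The consumer is slow on the first item and the producer is slow on the last one.
producer = [1, 1, 1, 5]
consumer = [5, 1, 1, 1]
one_slot = run_two_stages(producer, consumer, queue_slots=1)
two_slots = run_two_stages(producer, consumer, queue_slots=2)
```

With these made-up latencies the run takes 12 time units with one slot and 9 with two: the extra slot lets the producer start its slow item while the consumer is still working through its backlog.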

  15. Outline • Introduction & Motivation • Background • DSEP Design • Average Case Optimizations • Experimental Results

  16. Analysis • Synchronous Processor • Each stage runs at the speed of the worst-case stage running its worst-case operation • Designer: Focus on critical paths, stage balancing • DSEP • Each stage runs at the speed of its own average operation • Designer: Optimize for most common operation • Fundamental advantage of Decoupled Pipeline

  17. Average-Case Optimizations [Figure: a generic stage in which select logic on the inputs steers a MUX between a short operation and a long operation to produce the outputs] • Consider a generic example: • If short op is much more common, throughput is proportional to the select logic • Designer’s Strategy: • Implement fine grain latency tuning • Avoid latency of untaken paths
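The generic example can be worked numerically. The sketch below assumes the select logic delay is paid on every operation and that the short and long paths are mutually exclusive; the function name and parameters are hypothetical.

```python
def stage_latencies(select, short, long, p_short):
    """Return (average-case latency, worst-case latency) of the generic stage.

    select  : delay of the select logic, paid by every operation
    short   : delay of the short operation
    long    : delay of the long operation
    p_short : fraction of operations that take the short path
    """
    average = select + p_short * short + (1 - p_short) * long
    worst = select + long
    return average, worst


average, worst = stage_latencies(select=10, short=20, long=100, p_short=0.5)
```

A stage timed for the worst case always pays `worst`, while a decoupled stage pays `average` over a long run. As `p_short` approaches 1 and the short operation shrinks, the select logic becomes the dominant term, which is the slide's point about fine grain latency tuning.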

  18. Average-Case ALU [Figure: the ALU self-timing circuit, in which a SEL gate routes the Start event to one of four delay paths (Arithmetic, Logic, Shift, Compare) and an XOR gate merges them into Done, alongside the ALU computational circuit (Inputs in, Output out)] • Tune ALU latency to closely match the input operation • ALU performance is proportional to the average op • Computational Circuit is unchanged
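The ALU self-timing circuit can be sketched as a SEL gate feeding one delay line per operation class, merged by an XOR gate. The per-class delays below are hypothetical; the slides do not say which latency belongs to which class.

```python
# Hypothetical delays in time units, one delay line per operation class.
ALU_DELAYS = {"logic": 20, "shift": 40, "compare": 80, "arithmetic": 100}


def alu_done_time(start_time, op_class):
    """Return the time of the Done event for an operation of the given class."""
    # SEL gate: route the Start event to the delay line for this class only.
    delay = ALU_DELAYS[op_class]
    # XOR gate: exactly one delay line fires, and its event becomes Done.
    return start_time + delay
```

Only the timing path depends on the operation class; the computational circuit is the same in every case, as the slide notes.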

  19. Average-Case Decoder [Figure: the decoder self-timing circuit, in which a SEL gate routes the Start event to one of three delay paths (Format 1, Format 2, Format 3) and an XOR gate merges them into Done, alongside the decoder computational circuit (Input in, Output out)] • Tune Decoder latency to match the input instruction • Common instructions often have simple encodings • Prioritize most frequent instructions

  20. Average-Case Fetch Alignment [Figure: optimized fetch alignment; an "Aligned?" check on the fetch block address steers a MUX that either passes the fetched instruction block through directly or sends it through Align/Mask] • Optimize for aligned fetch blocks • If the fetch block is aligned on a cache line, it can skip alignment and masking overhead • Optimization is effective when software/hardware alignment optimizations are effective

  21. Average-Case Cache Access [Figure: optimized cache access; a "To Same Line?" check on the address steers a MUX between the line read from the cache and the previous line] • Optimize for consecutive reads to the same cache line • Allows subsequent references to skip cache access • Effective for small stride access patterns, tight loops in I-Cache • Very little overhead for non-consecutive references
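A sketch of the same-line optimization follows. The 100-unit cache latency matches the DSEP cache latency in the latency table; the 32-byte line size and the 5-unit cost of the same-line check are assumptions, as are the class and method names.

```python
class SameLineCache:
    """Tracks the previously read line so repeat reads can skip the cache access."""

    def __init__(self, line_bytes=32, cache_latency=100, compare_latency=5):
        self.line_bytes = line_bytes            # assumed line size
        self.cache_latency = cache_latency
        self.compare_latency = compare_latency  # assumed cost of the same-line check
        self.previous_line = None

    def read_latency(self, address):
        """Return the latency of reading the given byte address."""
        line = address // self.line_bytes
        same_line = line == self.previous_line
        self.previous_line = line
        if same_line:
            # Reuse the previous line: only the comparison is paid.
            return self.compare_latency
        # Different line: pay the comparison plus the full cache access.
        return self.compare_latency + self.cache_latency


cache = SameLineCache()
latencies = [cache.read_latency(a) for a in (0, 4, 8, 64, 0)]
```

In this model a small-stride sequence pays the full access once per line, and a non-consecutive reference pays only the extra comparison on top of a normal access.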

  22. Average-Case Comparator [Figure: optimized comparator; a check on the inputs steers a MUX between the output of a 4-bit compare and the output of a 32-bit compare] • Optimize for the case that a difference exists in the lower 4 bits of the inputs • 4-bit comparison is > 50% faster than 32-bit • Very effective for iterative loops • Can be extended for tag comparisons
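The comparator optimization can be sketched as an equality check that settles early when the low four bits differ. The latencies are hypothetical, chosen only so that the 4-bit path is more than 50% faster than the 32-bit path; the function name is also an assumption.

```python
FAST_LATENCY = 40    # hypothetical 4-bit compare delay, in time units
FULL_LATENCY = 100   # hypothetical 32-bit compare delay, in time units


def compare_equal(a, b):
    """Return (a == b, latency) for two 32-bit values."""
    a &= 0xFFFFFFFF
    b &= 0xFFFFFFFF
    if (a ^ b) & 0xF:
        # A difference in the lower 4 bits already proves inequality.
        return False, FAST_LATENCY
    # The low bits match, so the full 32-bit comparison decides.
    return a == b, FULL_LATENCY
```

A loop counter compared against its bound differs in the low bits on most iterations, which is one reading of why the slide calls this very effective for iterative loops.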

  23. Outline • Introduction & Motivation • Background • DSEP Design • Average Case Optimizations • Experimental Evaluation

  24. Simulation Environment • VHDL Simulator using Renoir Design Suite • MIPS I Instruction set • Fetch and Retire Bandwidth = 1 • Execute Bandwidth ≤ 4 • 4-entry split Instruction Window • 64-entry Reorder Buffer • Benchmarks • BS 50-element bubble sort • MM 10x10 integer matrix multiply

  25. Two Pipeline Configurations • “Synchronous” Clock Period = 120 time units

      Operation                DSEP Latencies                      Fixed Latencies
      Fetch                    100                                 120
      Decode                   50/80/120                           120
      Rename                   80/120/150                          120
      Read                     120                                 120
      Execute                  20/40/80/100/130/150/360/600        120/360/600
      Retire                   5/100/150                           120
      Caches                   100                                 120
      Main Memory              960                                 960
      Micropipeline Register   5                                   5
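The pipeline-stage rows of the latency table can be encoded directly to see which stages are able to finish faster than the 120-unit clock period. The numbers are taken from the table; the dictionary layout and function names are assumptions for illustration.

```python
FIXED_PERIOD = 120  # "synchronous" clock period, in time units

# DSEP latencies per pipeline stage, copied from the table.
DSEP_LATENCIES = {
    "Fetch": [100],
    "Decode": [50, 80, 120],
    "Rename": [80, 120, 150],
    "Read": [120],
    "Execute": [20, 40, 80, 100, 130, 150, 360, 600],
    "Retire": [5, 100, 150],
}


def latency_range(stage):
    """Return (fastest, slowest) DSEP latency of a stage."""
    values = DSEP_LATENCIES[stage]
    return min(values), max(values)


def stages_that_can_beat_clock():
    """Stages whose fastest DSEP latency is below the fixed clock period."""
    return [s for s, v in DSEP_LATENCIES.items() if min(v) < FIXED_PERIOD]
```

Every stage except Read has at least one latency below 120 time units, while Rename, Execute, and Retire also have slower cases, so the overall gain depends on how often each case occurs.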

  26. DSEP Performance • Compared Fixed and DSEP configurations • DSEP increased performance 28% and 21% for BS and MM respectively [Figure: execution time]

  27. Micropipeline Performance • Goals: • Determine the need for local elasticity • Determine appropriate lengths of the queues • Method: • Evaluate DSEP configurations of form AxBxC • A: Micropipelines in Decode, Rename and Retire • B: Micropipelines in Read • C: Micropipelines in Execute • All configurations include fixed length instruction window and reorder buffer

  28. Micropipeline Performance • Measured percent speedup over 1x1x1 • 2x2x1 best for both benchmarks • 2.4% performance improvement for BS, 1.7% for MM • Stalls in Fetch reduced by 60% for 2x2x1 [Figure: percent speedup for Matrix-Multiply and Bubble-Sort]

  29. OOO Engine Utilization • Measured OOO Engine utilization • Instruction Window (IW) and Reorder Buffer (RB) • Utilization = Avg # of instructions in the buffer • IW-Utilization up 75%, RB-Utilization up 40% [Figure: utilization of the Instruction Window and the Reorder Buffer]
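The utilization metric can be computed from an occupancy trace. The slide does not say how the average is taken; the sketch below assumes a time-weighted average, and the function name and trace format are hypothetical.

```python
def average_occupancy(events, end_time):
    """Time-weighted average number of instructions in a buffer.

    events   : list of (time, occupancy) pairs sorted by time; each occupancy
               holds until the next event
    end_time : time at which the measurement ends
    """
    total = 0
    # Pair each event with the time of the next one (or with end_time for the last).
    for (t, occupancy), (t_next, _) in zip(events, events[1:] + [(end_time, 0)]):
        total += occupancy * (t_next - t)
    return total / (end_time - events[0][0])


# Made-up trace: empty for 10 units, 2 entries for 20 units, 4 entries for 10 units.
trace = [(0, 0), (10, 2), (30, 4)]
utilization = average_occupancy(trace, end_time=40)
```

Applying the same function to instruction window and reorder buffer traces from two configurations would give the kind of relative change the slide reports.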

  30. Total Performance • Compared Fixed and DSEP configurations • DSEP 2x2x1 increased performance 29% and 22% for BS and MM respectively [Figure: execution time]

  31. Conclusions • Decoupled, Self-Timing • Average-Case optimizations significantly increase performance • Rarely taken critical paths no longer matter • Elasticity • Removes pipeline jitter from decoupled operation • Increases utilization of existing resources • Not as important as Average-Case Optimizations (At least for our experiments)

  32. Questions?
