
Trace Cache


Presentation Transcript


  1. Trace Cache Ka-Ming Keung Swamy D Ponpandi

  2. Lecture Topics • Trace Cache • Block based trace cache • Optimizations CPRE 585 Fall 2004

  3. Motivation • Complex superscalar processor designs that extract higher ILP yield diminishing returns • Key issue: fetch bandwidth

  4. Producer-Consumer View • [Diagram: instruction fetch & decode (producer) fills instruction buffers, which feed instruction execution (consumer); branch outcomes and jump addresses feed back from execution to fetch]

  5. Fetch Bandwidth • Branch Outcome • Cache Hit rate • Branch Throughput • Instruction Alignment

  6. The Trace Cache • Captures dynamic instruction sequences as a snapshot • Cache line holds up to N instructions and M basic blocks • [Diagram: on the first encounter of trace {A: taken, taken}, a new trace is filled from the instruction cache (1st, 2nd, and 3rd basic blocks, the last still filling); on a later encounter, the existing trace is accessed using address A and the predictions (t, t) and sent whole to the decoder]
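The fill/access behavior on this slide can be sketched in a few lines of Python. This is a simplified model, not the paper's exact hardware structures: a trace is keyed by its starting fetch address plus the predicted branch outcomes along it.

```python
# Minimal sketch of a trace cache: traces of the dynamic instruction
# stream are stored keyed by (starting fetch address, branch outcomes).
class TraceCache:
    def __init__(self):
        self.lines = {}  # (fetch_addr, outcomes) -> list of instructions

    def lookup(self, fetch_addr, predicted_outcomes):
        """Return the cached trace on a hit, else None (trace cache miss)."""
        return self.lines.get((fetch_addr, tuple(predicted_outcomes)))

    def fill(self, fetch_addr, outcomes, instructions):
        """Install a trace captured from the dynamic instruction stream."""
        self.lines[(fetch_addr, tuple(outcomes))] = list(instructions)

tc = TraceCache()
# First visit to address A with both branches taken: miss, then line fill.
assert tc.lookup(0xA0, ["taken", "taken"]) is None
tc.fill(0xA0, ["taken", "taken"], ["i1", "i2", "i3", "i4"])
# Later visit with the same address and predictions: hit, whole trace in one fetch.
assert tc.lookup(0xA0, ["taken", "taken"]) == ["i1", "i2", "i3", "i4"]
```

The key point the slide makes is visible here: after the fill, one lookup delivers instructions spanning several basic blocks in a single step.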

  7. Pros and Cons • Delivers multiple basic blocks of instructions in the same cycle without compiler support or instruction-set changes • Not on the critical path: the complexity is moved out of the fetch-issue pipeline, where additional latency would hurt performance

  8. Pros and Cons • No need to rotate or shift basic blocks to create the dynamic instruction sequence • Con: stores redundant information (the same instructions may appear in several traces and in the instruction cache)

  9. Trace Cache – Dissection • [Diagram: the fetch address indexes the trace cache and the core fetch unit in parallel; each path can supply n instructions; a 2:1 mux selects one path into the instruction latch feeding the decoder; a line-fill path writes new traces into the trace cache]

  10. Trace Cache – More Details • [Diagram: detailed datapath showing the trace cache line (tag, branch flags, branch mask, target and fall-through addresses), the line-fill buffer with merge logic and fill control, the core fetch unit (instruction cache, branch target buffer, return address stack, multiple-branch predictor with BTB logic for the 1st, 2nd, and 3rd branches), hit logic comparing predictions against branch flags, mask/shift/interchange logic, and selection of the next fetch address and the n instructions sent to the instruction latch for decoding]

  11. Trace Cache Components • Valid bit • Tag • Branch flags • Branch mask • Trace fall-through address • Trace target address • Line-fill buffer
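The per-line fields above can be collected into a small record. This is an illustrative sketch only: the field types and encodings here are assumptions, not the paper's exact bit layout.

```python
from dataclasses import dataclass, field

# Hypothetical layout of one trace cache line, mirroring the slide's fields.
@dataclass
class TraceCacheLine:
    valid: bool = False           # valid bit
    tag: int = 0                  # starting fetch address of the trace
    branch_flags: int = 0         # predicted direction of each interior branch
    branch_mask: int = 0          # how many branches the trace contains
    fall_through: int = 0         # next fetch address if the last branch falls through
    target: int = 0               # next fetch address if the last branch is taken
    insts: list = field(default_factory=list)  # up to n instructions

# Example line: a trace starting at 0xA0 with two taken interior branches.
line = TraceCacheLine(valid=True, tag=0xA0, branch_flags=0b11, branch_mask=2)
```

The line-fill buffer on the slide is a separate structure that assembles these fields as basic blocks retire, then writes the completed line into the trace cache.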

  12. Trace Cache Hit • Conditions: 1. The fetch address matches the tag 2. The branch predictions match the branch flags • On a hit, an entire trace of instructions is fed into the instruction latch
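The two hit conditions can be written down directly. A minimal sketch, with field names assumed for illustration:

```python
def trace_cache_hit(line, fetch_addr, predictions):
    """A trace cache hit requires both conditions from the slide:
    the fetch address matches the tag, AND the branch predictions
    match the branch flags stored with the trace."""
    return (line["valid"]
            and line["tag"] == fetch_addr
            and line["branch_flags"] == tuple(predictions))

line = {"valid": True, "tag": 0xA0, "branch_flags": ("taken", "taken")}
assert trace_cache_hit(line, 0xA0, ["taken", "taken"])          # hit
assert not trace_cache_hit(line, 0xA0, ["taken", "not_taken"])  # prediction mismatch: miss
assert not trace_cache_hit(line, 0xB0, ["taken", "taken"])      # tag mismatch: miss
```

Note that a prediction mismatch counts as a miss even when the address matches, which is why partial matching appears later as a design issue.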

  13. Trace Cache Miss • On a miss, fetching proceeds normally from the instruction cache

  14. Design Issues • Associativity • Multiple paths • Partial matches • Indexing methods • Cache line size • Filling • Trace selection • Victim trace cache

  15. Results

  16. [Results figure]

  17. Trace cache effectiveness

  18. Pentium 4 Trace Cache • 6 uops per line (what is the benefit of storing uops instead of x86 instructions?) • Up to 12K uops (about 21 KB) • Indexed with virtual addresses, so no address translation is needed

  19. Block-Based Trace Cache • Fetch-address renaming: a unique renamed pointer is assigned to each fetch address • Advantage: fewer bits, no tag comparison, faster instruction fetch • The rename table is maintained at completion time • Moves the complexity and latency from instruction fetch time to completion time

  20. Block • Instructions are stored in blocks • Each block contains a trace of instructions • A block is updated at the end of instruction execution • Each block has a block id

  21. How is the trace identified? • Original trace cache: by fetch address • Now: by trace id • The trace id is determined by the branch history and past block id predictions • The trace table stores the block id predictions
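The lookup on this slide can be sketched as a table indexed by a trace id built from branch history and past block ids. The structure names and encodings below are assumptions for illustration, not the paper's exact design:

```python
# Sketch of block-based trace lookup: instead of a fetch address, a trace id
# formed from branch history and past block ids indexes the trace table,
# which returns the predicted block ids for the next trace.
def make_trace_id(branch_history, past_block_ids):
    return (tuple(branch_history), tuple(past_block_ids))

trace_table = {
    # history (1, 0) after blocks 3 and 7 -> predict blocks 7, 2, 5 next
    make_trace_id([1, 0], [3, 7]): [7, 2, 5],
}

preds = trace_table.get(make_trace_id([1, 0], [3, 7]))
assert preds == [7, 2, 5]  # block ids are small pointers: fewer bits than addresses
```

Because block ids are short renamed pointers rather than full addresses, the lookup needs no wide tag comparison, which is the fetch-time speed advantage the slide cites.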

  22. Rename Table • Checks whether the predicted block exists • Assigns a block id to a new trace

  23. Dynamic Optimizations • Register moves • A register move requires no calculation • It can be performed in the ROB and renaming unit • The trace cache can do it in a better way

  24. Register Moves • Advantages • By placing the detection logic in the fill unit, the decode and rename logic can execute move instructions without paying the latency of detecting them • Moves require no execution resources, so they incur no resource-contention delays
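A hedged sketch of what the fill unit's detection logic accomplishes: when it sees a register move while building a trace, it records the alias so renaming can satisfy later readers without issuing the move for execution. The instruction encoding here is an assumption for illustration:

```python
# Sketch of register-move elimination in the fill unit: moves are folded
# into a rename map instead of occupying execution resources.
def eliminate_moves(trace):
    rename = {}   # architectural register -> register it aliases
    out = []
    for op, dst, src in trace:          # each instruction: (opcode, dest, source)
        src = rename.get(src, src)      # chase any earlier recorded alias
        if op == "mov":
            rename[dst] = src           # record the alias; emit nothing
        else:
            out.append((op, dst, src))  # consumers now read the original source
    return out, rename

trace = [("mov", "r2", "r1"), ("add", "r3", "r2")]
out, rename = eliminate_moves(trace)
assert out == [("add", "r3", "r1")]    # the move vanished; the add reads r1 directly
```

This matches the slide's two claims: the move costs no execution slot, and downstream logic never has to detect it at decode time.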

  25. Re-association • Before: addi Rx, Ry, 4 / addi Rz, Rx, 4 • After: addi Rx, Ry, 4 / addi Rz, Ry, 8 • Decreases the dependency between the two adds
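The rewrite above can be expressed as a small peephole pass: when the second add's source is the first add's destination, fold the constants so both adds depend only on Ry. Tuple encoding of instructions is an assumption for illustration:

```python
# Sketch of re-association: fold a dependent addi chain so the second
# add reads the original source with a combined constant.
def reassociate(i1, i2):
    op1, rx, ry, c1 = i1                 # ("addi", dest, src, const)
    op2, rz, src2, c2 = i2
    if op1 == op2 == "addi" and src2 == rx:
        # Rz = Rx + c2 = (Ry + c1) + c2, so Rz = Ry + (c1 + c2)
        return (op1, rx, ry, c1), ("addi", rz, ry, c1 + c2)
    return i1, i2

a, b = reassociate(("addi", "Rx", "Ry", 4), ("addi", "Rz", "Rx", 4))
assert b == ("addi", "Rz", "Ry", 8)      # both adds can now issue in parallel
```

The payoff is latency, not instruction count: the two adds no longer form a serial chain.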

  26. Scaled Adds • Before: shifti Rw, Rx << 1 / add Ry, Rw, Rz • After: scaled add Ry, (Rx << 1), Rz • Decreases the dependency between the shift and the add
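The same peephole idea applies here: a shift feeding an add collapses into one scaled-add operation, removing the shift-to-add dependency. Encoding is again an assumption for illustration:

```python
# Sketch of scaled-add fusion: shifti Rw, Rx << sh ; add Ry, Rw, Rz
# becomes one operation computing Ry = (Rx << sh) + Rz.
def fuse_scaled_add(shift, add):
    _, rw, rx, sh = shift            # ("shifti", dest, src, shift_amount)
    _, ry, a, b = add                # ("add", dest, srcA, srcB)
    if a == rw:
        return ("scaled_add", ry, rx, sh, b)
    return None                      # add does not consume the shift result

fused = fuse_scaled_add(("shifti", "Rw", "Rx", 1), ("add", "Ry", "Rw", "Rz"))
assert fused == ("scaled_add", "Ry", "Rx", 1, "Rz")
```

This assumes the execution units can actually perform a scaled add in one step, which is what makes the fusion profitable.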

  27. Instruction Placement • Many instructions are unable to execute in the cycle right after their source operands are produced • Reordering the instructions within the trace cache line hides this latency
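The idea can be sketched as a greedy reordering pass. This is a simplified model, not the paper's algorithm: we assume a result is only usable `latency` steps after its producer, so an independent instruction gets hoisted into the gap between a producer and its consumer:

```python
# Sketch of instruction placement: reorder a trace so dependent
# instructions do not sit immediately after their producers.
def reorder(trace, latency=2):
    remaining = list(trace)
    ready_at = {}            # register -> step its value was produced
    order, step = [], 0
    while remaining:
        for i, (name, dst, srcs) in enumerate(remaining):
            # eligible once every source was produced >= `latency` steps ago
            if all(ready_at.get(s, -latency) + latency <= step for s in srcs):
                order.append(remaining.pop(i))
                ready_at[dst] = step
                break
        else:
            step += 1        # nothing eligible this step: stall
            continue
        step += 1
    return [name for name, _, _ in order]

trace = [("add_r1", "r1", ["r0"]),   # produces r1
         ("add_r2", "r2", ["r1"]),   # consumes r1 immediately after
         ("mul_r3", "r3", ["r0"])]   # independent of r1
# The independent mul is hoisted between producer and consumer:
assert reorder(trace) == ["add_r1", "mul_r3", "add_r2"]
```

Because the trace cache already holds the instructions as one line, this reordering can happen once at fill time rather than every fetch.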

  28. Conclusion • The trace cache increases fetch bandwidth and IPC • It opens a new area for dynamic code optimization
