
XBC - eXtended Block Cache



Presentation Transcript


  1. XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation

  2. The Frontend • The processor: [Diagram: Frontend and Execution, with Memory supplying Instructions and Data] • Frontend goal: supply instructions to execution • Predict which instructions to fetch • Fetch the instructions from cache / memory • Decode the instructions • Deliver the decoded instructions to execution

  3. Requirements from the Frontend • High bandwidth • Low latency

  4. The Traditional Solution: Instruction Cache • Basic unit: cache line • A sequence of consecutive instructions in memory • Deficiencies: • Low bandwidth [Diagram: a jmp may jump into or out of the line] • High latency • Instructions need decoding

  5. Trace Cache • TC goals: high bandwidth & low latency • Basic unit: trace • A sequence of dynamically executed instructions [Diagram: a trace crossing several jmps, up to the trace end condition] • Instructions are decoded into uops • Fixed-length, RISC-like instructions • Traces have a single entry and multiple exits • Trace tag/index is derived from the starting IP

  6. Redundancy in the TC • Code: if (cond) A; B • Possible traces: (i) AB (ii) B • Block B is stored in both traces → space inefficiency → low hit rate

  7. XBC Goals • High bandwidth • Low latency • High hit rate

  8. XBC - eXtended Block Cache • Basic unit: XB - eXtended Block • XB end conditions: • Conditional or indirect branches • Call/Return • Quota (16 uops) • XB features: • Multiple entries, single exit • Tag / index derived from the ending instruction's IP • Instructions are decoded
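
A minimal sketch of the XB end conditions listed above, assuming a toy uop representation; the `Uop` class, its `kind` labels and the helper name are illustrative, not from the paper, while the end conditions and the 16-uop quota come from the slide:

```python
from dataclasses import dataclass

XB_QUOTA = 16  # maximum uops per XB (quota end condition from the slide)

@dataclass
class Uop:
    ip: int
    kind: str  # hypothetical labels: 'alu', 'jmp', 'cond_br', 'ind_br', 'call', 'ret'

def ends_xb(uop: Uop, uops_in_xb: int) -> bool:
    """Return True if appending this uop terminates the current eXtended Block."""
    if uop.kind in ('cond_br', 'ind_br', 'call', 'ret'):
        return True                      # branch-type end conditions
    return uops_in_xb + 1 >= XB_QUOTA    # quota end condition

# Note: an unconditional direct jump ('jmp') does NOT end an XB,
# which is what lets XBs grow longer than basic blocks.
```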

  9. XBC Fetch Bandwidth • Fetch multiple XBs per cycle • A conditional branch ends an XB • Need to predict only 1 branch per XB • Predicting 2 branches/cycle → fetch 2 XBs/cycle • Promote 99% biased conditional branches* • Build longer XBs • Maximize XBC bandwidth for a given #predictions/cycle *[Patel 98]
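
A sketch of how branch promotion could be tracked, in the spirit of [Patel 98]: a conditional branch that keeps resolving the same way for long enough stops ending XBs and stops consuming a prediction. The counter threshold and table layout here are assumptions for illustration, not values from the paper.

```python
class PromotionFilter:
    """Tracks per-branch direction streaks and promotes heavily biased branches."""

    def __init__(self, promote_after: int = 64):   # threshold is an assumption
        self.promote_after = promote_after
        self.streak = {}        # branch IP -> (last direction, consecutive count)
        self.promoted = set()   # IPs of promoted (about 99% biased) branches

    def update(self, ip: int, taken: bool) -> None:
        last, count = self.streak.get(ip, (taken, 0))
        if taken == last:
            count += 1
        else:
            count = 1
            self.promoted.discard(ip)       # direction flipped: demote the branch
        self.streak[ip] = (taken, count)
        if count >= self.promote_after:
            self.promoted.add(ip)           # promoted: no longer ends an XB

    def is_promoted(self, ip: int) -> bool:
        return ip in self.promoted
```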

  10. XB Length • [Chart: distribution of block lengths (1 to 16 uops) for BB, XB, XBp and DBL] • Block types and average lengths: • BB (basic block): 7.7 uops • XB (don't break on unconditional jumps): 8.0 uops • XBp (XB + promotion): 10.0 uops • DBL (group of 2 XBp): 12.7 uops

  11. XBC Structure • [Diagram: 4 banks of 4 uops each, feeding a Reorder & Align stage] • A banked structure which supports: • Variable length XBs (minimize fragmentation) • Fetching multiple XBs/cycle

  12. Support Variable Length XBs • An XB may spread over several banks of the same set • [Diagram: one XB occupying two banks of a set, feeding the Reorder & Align stage]

  13. Support Fetching 2 XBs/cycle • Data may be received from all banks in the same cycle • [Diagram: two XBs occupying disjoint banks, so all 4 banks are read in one cycle]

  14. Support Fetching 2 XBs/cycle • Actual bandwidth may sometimes be less than 4 banks per cycle (see the sketch below) • [Diagram: two XBs whose bank masks conflict, so fewer than 4 banks are read]
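
A small sketch of the bank-conflict effect described in slides 13 and 14, assuming each XB's location is summarized by a set of bank indices (its mask) and that the first XB of the pair has priority; the function and field names are illustrative:

```python
NUM_BANKS = 4   # 4 banks per set, 4 uops per bank line (slide 11)

def banks_read_this_cycle(mask_first, mask_second):
    """mask_first / mask_second: sets of bank indices occupied by the two XBs
    scheduled this cycle.  Each bank can serve only one XB per cycle, so any
    overlap means the second XB loses those banks and the cycle delivers
    fewer than NUM_BANKS lines."""
    granted_first = set(mask_first)
    granted_second = set(mask_second) - granted_first   # conflict-free banks only
    return granted_first, granted_second

# Example: XB0 occupies banks {0, 1}, XB1 occupies banks {1, 2}.
# Bank 1 conflicts, so only 3 of the 4 banks are read this cycle.
first, second = banks_read_this_cycle({0, 1}, {1, 2})
assert len(first) + len(second) == 3
```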

  15. Reordering and Aligning Uops • [Diagram: Mux 1 reorders the banks; Mux 2 aligns the uops, squeezing out empty uop slots]

  16. XBC Structure • The average XB length is >8 uops • With 16 uops per line, a line holds fewer than 2 XBs (less than 2-XB set associativity)

  17. XBC Structure • The average XB length is >8 uops • Make each bank set-associative

  18. The XBTB • The XBTB provides the next XB for each XB • XBs are indexed according to their ending IP • The next IP cannot be looked up directly in the XBC • The XBC can only be accessed using the XBTB • The XBTB provides the info needed to access the next XB: • The IP of the next XB • defines the set in which the XB resides • A masking vector, indicating the banks in which the XB resides • The #uops counted backward from the end of the XB • defines where to enter the XB • The XBTB provides the next 2 XBs (see the sketch below)
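
A sketch of what an XBTB entry might carry, directly mirroring the fields listed above (next XB IP, bank mask, entry offset from the XB end); the class layout and names are assumptions:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class XBTBEntry:
    next_xb_ip: int               # IP of the next XB's ending instruction, selects the set
    bank_mask: Tuple[int, ...]    # banks (within that set) holding the next XB
    entry_offset: int             # #uops counted backward from the XB end = entry point

class XBTB:
    """Maps the current XB to the information needed to fetch the next XBs."""

    def __init__(self):
        # current XB ending IP -> entries for the next two XBs
        self.table = {}

    def lookup(self, current_xb_ip: int) -> Optional[Tuple[XBTBEntry, XBTBEntry]]:
        # A miss here means the XBC cannot be accessed: switch to build mode
        return self.table.get(current_xb_ip)
```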

  19. XBC Structure: the whole picture • [Diagram: build mode path (BTB, Memory / Cache, Decoder, Fill Unit); delivery mode path (XBTB, XBC, Priority Encode, XBQ)]

  20. XBnewXBexist XBnew XBexist XBnew XBexist XBexist XBexist IP1 IP1 XBexist IP1 XBnew IP1 XBnew XBnew IP1 IP1 Complex XB, Update XBTB Extend XBexist Update XBTB Update XBTB XB Build Algorithm • XBTB lookup fails  build a new XB into the fill buffer • End-of-XB condition reached  lookup XBC for the new XB • No match  store new XB in the XBC, and update XBTB • Match  there are three cases: The XBC has NO Redundancy

  21. Complex XBs • XBnew and XBexist have the same suffix but a different prefix • Possible solution, complying with no-redundancy: split the XBs where the paths merge (Wrong Way) • Drawback: we get 2 short XBs instead of a single long XB

  22. Complex XBs • XBnew and XBexist have the same suffix but a different prefix • Second solution: a single "complex XB" that stores the shared suffix once, together with both Prefixcur and Prefixnew (Right Way) • Complex XBs: no redundancy, but still high bandwidth

  23. Extending an Existing XB • An XB can only be extended at its beginning • If we store the XB's uops in the usual (program) order, extending the XB means moving all of its existing uops • [Diagram: the XB's uops relocated across the banks to make room] • Since the existing uops move, the pointers in the XBTB become stale

  24. Storing Uops in Reverse Order • The solution is to store the uops of an XB in reverse order • When an XB is extended, there is no need to move uops • The XB IP is the IP of the ending instruction → extending the XB does not change the XB IP (see the sketch below)
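
A tiny sketch of the reverse-order trick: because position i always holds the uop that is i uops before the ending instruction, extending the XB at its logical beginning becomes a physical append, so existing uops never move and XBTB offsets counted back from the end stay valid. The class and method names are illustrative.

```python
class ReversedXB:
    """Stores an XB's uops last-to-first."""

    def __init__(self, uops):
        self.storage = list(reversed(uops))   # physical order: ending uop first

    def extend_at_beginning(self, new_prefix):
        # Logically these uops come *before* the current first uop;
        # physically they are simply appended, nothing is moved.
        self.storage.extend(reversed(new_prefix))

    def logical_order(self):
        return list(reversed(self.storage))

xb = ReversedXB(['u4', 'u5', 'u6'])
xb.extend_at_beginning(['u1', 'u2', 'u3'])
assert xb.logical_order() == ['u1', 'u2', 'u3', 'u4', 'u5', 'u6']
# u4..u6 kept their positions (offsets from the end), so the XBTB pointers
# and the XB IP (the ending instruction's IP) are unchanged.
```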

  25. Set Search • An XB may be replaced and then placed again • Not in the same set → a different XB • Same set, same banks → no problem • Same set but not the same banks → XBTB entries which point to the old location of the XB are erroneous • Solution - Set Search • On an XBTB hit & XBC miss, try to locate the XB in other banks of the same set (see the sketch below) • Calculate a new mask according to the offset • Only a small penalty: a cycle is lost, but there is no switch to build mode
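
A sketch of the set-search repair: on an XBTB hit that misses in the expected banks, scan every line of the indexed set for the XB's ending IP and rebuild the bank mask from where it is actually found. The data layout (one dict per line) is an assumption.

```python
def set_search(set_lines, expected_ending_ip):
    """set_lines: all lines of the indexed set, each either None or a dict like
    {'ip': ending IP of the XB stored here, 'banks': its bank mask}.
    Returns the fresh bank mask if the XB moved within the set, or None if it
    is really gone (in which case the frontend falls back to build mode)."""
    for line in set_lines:
        if line is not None and line['ip'] == expected_ending_ip:
            return line['banks']   # recompute the mask; costs a cycle, no rebuild
    return None
```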

  26. XB Replacement • Use LRU among all the lines in a given set • LRU also makes sure that we do not evict a non-head line (a line other than the first line of an XB) before its head line • There is no point in retaining the head line while evicting another line: • if we enter the XB at the head line, we will get a miss when we reach the evicted line • if the head line is evicted but we enter the XB in its middle, we may still avoid a miss • A non-head line is always accessed after its head line is accessed • its LRU position is more recent • so it will not be evicted before the head line

  27. XB Placement • Build-mode placement algorithm • A new XB is placed in banks such that it does not have a bank conflict with the previous XB (if possible) • LRU ordering is maintained by switching the LRU line with the non-conflicting line before the new XB is placed • Set-search repairs the XBTB • Delivery-mode placement algorithm (see the sketch below) • Repeated bandwidth losses due to bank conflicts are detected • Conflicting lines are moved to non-conflicting banks • Each XB is augmented with a counter • incremented when the XB has a bank conflict • when the counter reaches a threshold, the conflicting lines are switched with other lines in non-conflicting banks • A line can be switched with another line only if its LRU is higher, or if both gain from the switch
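
A sketch of the delivery-mode repair described above: each XB carries a small conflict counter, and once it crosses a threshold the conflicting line is moved to a non-conflicting bank, but only when the LRU condition on the slide allows it. The threshold value, the data layout and the field names are assumptions for illustration.

```python
CONFLICT_THRESHOLD = 4   # illustrative value; the slide does not give the threshold

def on_bank_conflict(xb_line, non_conflicting_lines):
    """xb_line: dict for the conflicting line, e.g. {'conflicts': 0, 'lru_age': 2}.
    non_conflicting_lines: candidate lines in banks that would not conflict.
    Returns the line to swap with, or None if no move happens this time."""
    xb_line['conflicts'] += 1
    if xb_line['conflicts'] < CONFLICT_THRESHOLD:
        return None
    xb_line['conflicts'] = 0
    for other in non_conflicting_lines:
        # swap only if the other line is older (higher LRU age) or both lines
        # gain from the switch (e.g. the move also resolves a conflict of `other`)
        if other['lru_age'] > xb_line['lru_age'] or other.get('both_gain', False):
            return other
    return None
```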

  28. XBC vs. TC Delivery Bandwidth • [Chart: uops per cycle delivered by TC vs. XBC on Games, SpecINT, SysmarkNT and on average; y-axis 0 to 7 uops/cycle]

  29. Miss Rate as a Function of Size • [Chart: uop miss rate for TC vs. XBC at 16K, 32K and 64K uops; annotated improvements of 29% and >50% for the XBC]

  30. Miss Rate as a Function of Associativity • [Chart: uop miss rate for TC vs. XBC at associativity 1, 2 and 4]

  31. XBC Features Summary • Basic unit - XB • Ends with a conditional branch • Multiple entries, single exit • Indexed according to the ending IP • Branch promotion → longer XBs • The XBC uses a banked structure • Supports fetching multiple XBs/cycle • Supports variable length XBs • Uops within XBs are stored in reverse order

  32. Conclusions • Instruction Cache has high hit rate, but … • Low bandwidth, high latency • TC has high bandwidth, low latency, but … • Low hit rate • XBC combines the best of both worlds • High bandwidth, low latency and high hit rate
