
EECS 470 Lecture 13 Memory Speculation Winter 2014 http://www.eecs.umich.edu/courses/eecs470


Presentation Transcript


  1. EECS 470 • Lecture 13 • Memory Speculation • Winter 2014 • http://www.eecs.umich.edu/courses/eecs470 Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, and Wenisch of Carnegie Mellon University, Purdue University, University of Michigan, University of Pennsylvania, and University of Wisconsin.

  2. Tomasulo-Style Scheduler Implementation • Synchronization managed by scheduler logic • Communication through input/output networks • Infrastructure geared towards register communication [Figure: reservation stations (Op, V/Tag/Value operand fields, flags, valid bits) with scheduler logic and control, connected to the tag network and the input/result networks]

  3. Out-of-Order Memory Operations • Scheduling is straightforward in out-of-order… • Register inputs only • Register renaming captures all true dependences • Tags tell you exactly when you can execute • … except loads • Register and memory inputs (older stores) • Register renaming does not tell you all dependences • How do loads find older in-flight stores to same address (if any)? • Issue of finding if addresses match is called “memory disambiguation”
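A minimal sketch of why renaming cannot expose memory dependences. The register names, register values, and the helper `effective_address` are illustrative assumptions, not taken from the slides. Two memory instructions that name different base registers can still compute the same effective address, so the dependence is visible only once both addresses exist.

```python
# Illustrative only: register names and values are made up for this sketch.
def effective_address(regs, base, offset):
    """Effective address = contents of the base register + immediate offset."""
    return regs[base] + offset


regs = {"r1": 0x1000, "r2": 0x0FF8}

# st r5 -> 0(r1)   (older store)
store_addr = effective_address(regs, "r1", 0)
# ld r6 <- 8(r2)   (younger load)
load_addr = effective_address(regs, "r2", 8)

shares_register_name = "r1" == "r2"   # all that register tags can tell us
aliases = store_addr == load_addr     # known only after address calculation
print(shares_register_name, aliases)
```

The store and the load share no register name, yet they alias, which is exactly the case memory disambiguation has to catch.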

  4. The Good: Register Communication • Directly specified dependencies (contained in inst) • Accurate description of communication • no false or missing dependency edges • permits realization of dataflow schedule • Early description of communication • know dependencies upon decode • allows scheduler logic to be pipelined without impacting speed of communication • Small communication name space (32-64 usually) • Fast access to communication storage • possible to map entire communication space (no tags) • possible to bypass communication storage

  5. The Bad (and the ugly): Memory Scheduling • Loads/stores also have dependencies through memory • Described by effective addresses • Cannot directly leverage existing (register) infrastructure • Indirectly specified memory dependencies • Memory dependencies are a function of program computation, which prevents an early, accurate description of communication • Pipelined scheduler slow to react to addresses • Large communication space (2^32 to 2^64 bytes!) • Cannot fully map communication space; requires a more complicated cache and/or store forwarding network • Memory latency is variable • Complicates scheduling

  6. Requirements for a Solution • Accurate description of memory dependencies • No (or few) missing or false dependencies • Permit realization of dataflow schedule • Early presentation of dependencies • Permit pipelining of scheduler logic • Fast access to communication space • Preferably as fast as register communication (zero cycles)

  7. Memory Scheduling Techniques

  8. Some trivial ways to handle Loads • Allow only one load or store in OoO core • Stall other operations at dispatch – very slow • No need for LSQ • Load/store only issues when at LSQ/ROB head (in-order scheduling) • Stall other operations at dispatch • Loads always get value from cache, only 1 outstanding load • More aggressive options for forwarding: • Conservative store-to-load forwarding, when addresses known • Optimistic store-to-load forwarding – requires rewind mechanism • Aggressive options for memory: • Aggressive fetching from memory if no store-to-load forwarding.

  9. Implementation • Several hardware realizations: • Unified LSQ (easier to understand, but nasty hardware) • Separate LQ* and SQ (more complicated, but fairly elegant) • We’ll start with a unified LSQ and move to separate LB and SQ. *Might end up with a load buffer rather than a queue…

  10. In-order Load/Store Scheduling • Schedule all loads and stores in program order • Cannot violate true data dependencies (non-speculative) • Capabilities/limitations: • Overly restrictive – likely to add many false dependencies • Early presentation of dependencies (no addresses) • Not fast, all communication through memory structures • Found in in-order issue pipelines [Figure: true vs. realized dependencies for the program-order sequence st X, ld Y, st Z, ld X, ld Z]
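The cost of strict program order can be sketched with a tiny issue-time calculation. This is a hedged illustration: the function name `in_order_issue_cycles`, the one-memory-op-per-cycle limit, and the ready times are assumptions chosen for the example, not values from the lecture.

```python
def in_order_issue_cycles(ready_cycles):
    """Issue memory ops strictly in program order, at most one per cycle.

    ready_cycles[i] is the cycle at which op i has its register inputs
    (hypothetical values supplied by the caller).
    """
    issue = []
    previous = -1
    for ready in ready_cycles:
        # An op waits for its own inputs AND for every older memory op.
        cycle = max(ready, previous + 1)
        issue.append(cycle)
        previous = cycle
    return issue


# Program order: st X, ld Y, st Z, ld X, ld Z
ready = [0, 0, 3, 0, 0]   # assumed: st Z receives its operands late
cycles = in_order_issue_cycles(ready)
print(cycles)   # [0, 1, 3, 4, 5]
```

Here ld X depends only on st X, but it still waits behind st Z. That wait is a false dependency introduced by the in-order policy.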

  11. In-order Load/Store Scheduling Example [Figure: timeline of true vs. realized dependencies for st X, ld Y, st Z, ld X, ld Z in program order]

  12. Consider the LSQ cases • Which of the loads are we sure we will fulfill via D$/Memory? • Which of the loads will we fulfill via store-to-load forwarding? • Which aren’t we sure of? • Identify what to do with each load.

  13. Unified Load/Store Queue • Operates as a circular FIFO • Allocate on dispatch • De-allocate on retirement • Calc address in register dataflow order • An NxN comparator matrix detects memory address dependence (also considers relative age of entries) • Store ops are held until retirement • Load ops are issued when no dependency exists & all older store addresses known [Figure: NxN matrix of address comparators (=) fed by address calculation + translation]
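The circular-FIFO behaviour and the "relative age" comparison can be sketched as follows. This is a toy model under stated assumptions: the class `CircularLSQ`, its method names, and the queue size are hypothetical, and real hardware tracks age with pointers and wrap bits rather than modular arithmetic in software.

```python
class CircularLSQ:
    """Toy unified load/store queue: circular FIFO with age comparison."""

    def __init__(self, size):
        self.size = size
        self.entries = [None] * size
        self.head = 0      # slot of the oldest in-flight memory op
        self.count = 0     # number of occupied slots

    def allocate(self, kind):
        """Allocate at dispatch, in program order. Returns the slot index."""
        if self.count == self.size:
            raise RuntimeError("LSQ full: dispatch must stall")
        slot = (self.head + self.count) % self.size
        self.entries[slot] = {"kind": kind, "addr": None}
        self.count += 1
        return slot

    def retire(self):
        """De-allocate the oldest entry at retirement."""
        if self.count == 0:
            raise RuntimeError("LSQ empty")
        entry = self.entries[self.head]
        self.entries[self.head] = None
        self.head = (self.head + 1) % self.size
        self.count -= 1
        return entry

    def age(self, slot):
        """Distance from head; smaller means older, even after wrap-around."""
        return (slot - self.head) % self.size

    def is_older(self, a, b):
        return self.age(a) < self.age(b)


lsq = CircularLSQ(4)
slots = [lsq.allocate(k) for k in ("st", "ld", "st", "ld")]  # slots 0..3
lsq.retire()
lsq.retire()                       # head now points at slot 2
wrapped = lsq.allocate("st")       # reuses slot 0, but it is the youngest op
print(wrapped, lsq.is_older(3, wrapped))
```

The point of `age` is that a raw slot index says nothing about program order once the queue has wrapped, so the comparator matrix needs head-relative age as well as address equality.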

  14. Unified Load/Store Queue Questions • When do we search for store-to-load forwarding? • As soon as we have the load address • What could happen once we have the load address? • There is a store whose data we’ll use • There is no store whose data we’ll use • We aren’t sure which store’s data, if any, we’ll use. • What should we do for each of those three cases?

  15. Split LQ and SQ • D$/TLB + structures to handle in-flight loads/stores • Performs four functions • In-order store retirement • Writes stores to D$ in order • Basic, implemented by store queue (SQ) • Store-load forwarding • Allows loads to read values from older un-retired stores • Data provided to LQ from SQ • Memory ordering violation detection • Checks load speculation (more later) • Advanced, implemented by load queue (LQ) • Memory ordering violation avoidance • Advanced, implemented by dependence predictors

  16. Simple Data Memory FU: D$/TLB + SQ • Just like any other FU • 2 register inputs (addr, data in) • 1 register output (data out) • 1 non-register input (load pos)? • Store queue (SQ) • In-flight store address/value • In program order (like ROB) • Addresses associatively searchable • Size heuristic: 15-20% of ROB • But what does it do? [Figure: store queue (address and value columns, head and tail pointers, per-entry == comparators, age logic) beside the D$/TLB; inputs address, data in, load position; output data out]

  17. “Out-of-Order” Load Execution • In parallel with D$ access • Send address to SQ • Compare with all store addresses • CAM: like FA$, or RS tag match • Select all matching addresses • Age logic selects youngest store that is older than load • Uses load position input • Three possibilities • There is a store we can forward from • Do so! • There is no store we can forward from • Get from D$ • We don’t know. • ????? [Figure: store queue (address and value columns, head and tail pointers, per-entry == comparators, age logic) beside the D$/TLB; inputs address, data in, load position; output data out]
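The three possibilities above can be sketched as a software stand-in for the CAM search plus age logic. The names `StoreEntry` and `search_store_queue` are hypothetical. The sketch assumes same-size, fully overlapping accesses, and that a store with a known address also has its value available.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoreEntry:
    position: int                # program-order age; smaller is older
    addr: Optional[int]          # None until the address has been computed
    value: Optional[int] = None


def search_store_queue(sq, load_addr, load_position):
    """Return ("forward", value), ("cache", None) or ("unknown", None)."""
    older = [s for s in sq if s.position < load_position]
    # Walk from the youngest older store toward the oldest one.
    for store in sorted(older, key=lambda s: s.position, reverse=True):
        if store.addr is None:
            # An unresolved store sits between the load and any older match.
            return ("unknown", None)
        if store.addr == load_addr:
            return ("forward", store.value)
    return ("cache", None)


sq = [StoreEntry(0, 0x100, 11), StoreEntry(2, 0x200, 22), StoreEntry(4, None)]
print(search_store_queue(sq, 0x100, 3))   # forwards from the store at position 0
print(search_store_queue(sq, 0x300, 3))   # no older match: read the D$
print(search_store_queue(sq, 0x100, 5))   # store at position 4 is still ambiguous
```

The third case is the one the slide leaves open ("?????"). What to do with an "unknown" result is a policy choice that this sketch deliberately leaves to the caller.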

  18. D$/TLB + SQ + LQ • Load queue (LQ) • In-flight load addresses • In program order (like ROB, SQ) • Associatively searchable • Size heuristic: 20-30% of ROB [Figure: load queue (LQ) and SQ side by side, each with head and tail pointers, per-entry == comparators and age logic, beside the D$/TLB; store position input, flush? output]
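One way the LQ could produce the "flush?" signal is sketched below. When a store's address resolves, it searches the load queue for younger loads that already executed to the same address. The names `LoadEntry` and `find_violation` are hypothetical. The check is deliberately conservative: it ignores the case where the load correctly forwarded from an intervening store.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoadEntry:
    position: int            # program-order age; smaller is older
    addr: Optional[int]      # None until the load has executed


def find_violation(lq, store_addr, store_position):
    """Called when a store's address resolves.

    Returns the position of the oldest younger load that already executed
    to the same address (the point to flush from), or None.
    """
    victims = [ld.position for ld in lq
               if ld.position > store_position and ld.addr == store_addr]
    return min(victims) if victims else None


lq = [LoadEntry(1, 0x40), LoadEntry(3, 0x80), LoadEntry(5, 0x80), LoadEntry(7, None)]
print(find_violation(lq, 0x80, 2))   # 3: flush from the load at position 3
print(find_violation(lq, 0x40, 2))   # None: the matching load is older than the store
```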

  19. Conservative Memory Scheduling • Schedule loads and stores when all dependencies are known to be satisfied • Conservative - won’t violate true dependencies (non-speculative) • Capabilities/limitations: • Accurate only if addresses arrive early • Late presentation of dependencies (verified with addresses) • Not fast, all communication through memory and/or complex store forwarding network • Better for small windows [Figure: true vs. realized dependencies for the program-order sequence st X, ld Y, st ?Z, ld X, ld Z]
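The conservative rule can be sketched as a predicate over the older stores. Assumptions are labeled: `load_may_issue` is a hypothetical name, each op is a `(kind, addr, done)` tuple, `addr` is `None` while unresolved (this sketch reads "st ?Z" in the figure that way), and `done` means the store's data is available to the load.

```python
def load_may_issue(load_index, ops):
    """Conservative rule: issue only when every older store is resolved
    and any matching older store has already produced its data.

    ops is a program-order list of (kind, addr, done) tuples.
    """
    _, load_addr, _ = ops[load_index]
    for kind, addr, done in ops[:load_index]:
        if kind != "st":
            continue
        if addr is None:
            return False          # ambiguous older store: stall
        if addr == load_addr and not done:
            return False          # true dependence not yet satisfied
    return True


X, Y, Z = 0x10, 0x20, 0x30
ops = [("st", X, True), ("ld", Y, False), ("st", None, False),
       ("ld", X, False), ("ld", Z, False)]
before = [load_may_issue(i, ops) for i in (1, 3, 4)]

ops[2] = ("st", Z, True)          # st ?Z resolves to Z and executes
after = [load_may_issue(i, ops) for i in (1, 3, 4)]
print(before, after)
```

Before st ?Z resolves, only ld Y may issue. ld X is stalled even though it turns out not to depend on st Z, which is the performance cost of never speculating.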

  20. Conservative Dataflow Scheduling [Figure: timeline of true vs. realized dependencies for st X, ld Y, st ?Z, ld X, ld Z in program order; st ?Z becomes st Z partway through, with a stall cycle marked]

  21. Opportunistic Memory Scheduling • Observe: on average, < 10% of loads forward from SQ • Even if older store address is unknown, chances are it won’t match • Let loads execute in presence of older “ambiguous stores” • Increases performance • But what if ambiguous store does match? • Memory ordering violation: load executed too early • Must detect… • And fix (e.g., by flushing/refetching instructions starting at the load)

  22. Opportunistic Memory Speculation • Schedule loads and stores when register dependencies satisfied • May violate true data dependencies (speculative) • Capabilities/limitations: • Accurate - if little in-flight communication through memory • Early presentation of dependencies (no dependencies!) • Not fast, all communication through memory structures • Most common with small windows [Figure: true vs. realized dependencies for the program-order sequence st X, ld Y, st Z, ld X, ld Z]

  23. Less aggressive speculation • Once you have a load address, check if you will forward for certain (from somewhere, even if you aren’t sure where) • If not, get from memory/D$ • Don’t send to CDB until certain.

  24. Yet another idea… • Let’s use a predictor • Predict if a load is likely to be forwarding from a store • If not, get it from D$ and send it to the CDB • If so, wait. • Thoughts on cost/benefit of speculation? • How should that impact our predictor? “Memory dependence prediction”
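The slide does not specify a predictor design. As one possible illustration, here is a table of 2-bit saturating counters indexed by load PC. Everything in it is an assumption made for the sketch: the class name `ForwardingPredictor`, the table size, the 4-byte instruction alignment, and the threshold.

```python
class ForwardingPredictor:
    """Per-PC 2-bit saturating counters (assumed design, not from the slides)."""

    def __init__(self, entries=64):
        self.entries = entries
        self.counters = [0] * entries    # 0..3; start at "no forwarding"

    def _index(self, pc):
        return (pc >> 2) % self.entries  # assumes 4-byte instructions

    def predicts_forwarding(self, pc):
        """True: wait for older stores. False: read the D$ and broadcast."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, did_forward):
        """Train with the actual outcome once the load's source is known."""
        i = self._index(pc)
        if did_forward:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


predictor = ForwardingPredictor()
pc = 0x400
history = []
for outcome in (True, True, False):
    predictor.update(pc, outcome)
    history.append(predictor.predicts_forwarding(pc))
print(history)   # [False, True, False]
```

On the cost/benefit question: a wrong "no forwarding" prediction can force a flush, while a wrong "wait" prediction only delays one load. That asymmetry is one argument for biasing such a predictor toward waiting.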
