Lecture 11: Memory Data Flow Techniques

reba
Presentation Transcript


  1. Lecture 11: Memory Data Flow Techniques. Load/store buffer design, memory-level parallelism, consistency model, memory disambiguation.

  2. Load/Store Execution Steps. Load (LW R2, 0(R1)): generate the virtual address, which may wait on the base register; translate the virtual address into a physical address; read the data cache. Store (SW R2, 0(R1)): generate the virtual address, which may wait on the base register and the data register; translate the virtual address into a physical address; write the data cache. Unlike register accesses, memory addresses are not known prior to execution.

  3. Load/Store Buffer in Tomasulo. Supports memory-level parallelism. Loads wait in the load buffer until their address is ready; memory reads are then processed. Stores wait in the store buffer until their address and data are ready; memory writes wait further until the stores are committed. (Diagram: IM, fetch unit, decode, rename, regfile, reorder buffer, S-buf, L-buf, two RS, FU1, FU2, DM.)

  4. Load/Store Unit with Centralized RS. The centralized RS includes part of the load/store buffer in Tomasulo. Loads and stores wait in the RS until they are ready. (Diagram: IM, fetch unit, decode, rename, regfile, reorder buffer, RS, S-unit, L-unit, FU1, FU2, store buffer with addr and data fields, cache.)

  5. Memory-level Parallelism. for (i=0;i<100;i++) A[i] = A[i]*2; Loop: L.S F2, 0(R1); MULT F2, F2, F4; SW F2, 0(R1); ADD R1, R1, 4; BNE R1, R3, Loop. F4 stores 2.0. Significant improvement over sequential reads/writes. (Diagram: timeline of LW1, LW2, LW3, SW1, SW2, SW3.)

  6. Memory Consistency. Memory contents must be the same as under sequential execution, so RAW, WAW, and WAR dependences must be respected. Practical implementations: • Reads may proceed out of order • Writes proceed to memory in program order • Reads may bypass earlier writes only if their addresses are different

  7. Store Stages in Dynamic Execution. (1) Wait in the RS until the base address and store data are available (ready). (2) Move to the store unit for address calculation and address translation. (3) Move to the store buffer (finished). (4) Wait for ROB commit (completed). (5) Write to the data cache (retired). Stores always retire in program order to respect WAW and WAR dependences. (Diagram: RS, store unit, load unit, store buffer entries marked finished and completed, D-cache.) Source: Shen and Lipasti, page 197
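The finished, completed, and retired stages of a store can be sketched as a small FIFO model. This is only an illustrative sketch, not the design from the slides: the `StoreBuffer` class, its method names, and the dictionary standing in for the data cache are all hypothetical.

```python
from collections import deque


class StoreBuffer:
    """Toy store buffer (hypothetical model): stores retire strictly in program order."""

    def __init__(self):
        self.entries = deque()  # oldest store at the left
        self.dcache = {}        # assumption: a dict stands in for the data cache

    def finish(self, addr, data):
        """Address and data are known: the store enters the buffer (finished)."""
        entry = {"addr": addr, "data": data, "completed": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry):
        """The ROB has committed the store (completed)."""
        entry["completed"] = True

    def retire(self):
        """Write completed stores at the head to the cache; stop at the first one not completed."""
        retired = 0
        while self.entries and self.entries[0]["completed"]:
            head = self.entries.popleft()
            self.dcache[head["addr"]] = head["data"]
            retired += 1
        return retired
```

Because only the head of the queue may retire, two stores to the same address always reach the cache in program order, which is what preserves the WAW ordering.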

  8. Load Bypassing and Memory Disambiguation. To exploit memory parallelism, loads have to bypass writes, but this may violate RAW dependences. Dynamic memory disambiguation: dynamic detection of memory dependences • Compare the load address with every older store address

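  9. Load Bypassing Implementation. Assume in-order execution of loads/stores. Steps: 1. address calculation; 2. address translation; 3. if no match, update the destination register. The load address is matched against the store buffer by associative search. (Diagram: in-order RS, load unit and store unit with steps 1 and 2, match at step 3, store buffer with addr and data fields, D-cache.)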
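The bypass check amounts to comparing the load address against every older store still in the store buffer. A minimal sketch, assuming the store buffer is a list of `(addr, data)` tuples with all addresses already known (in-order load/store execution) and that every access has the same size and alignment; the function name `load_may_bypass` is made up for illustration.

```python
def load_may_bypass(load_addr, store_buffer):
    """Return True when no older, not-yet-retired store writes load_addr.

    store_buffer: list of (addr, data) tuples, oldest first (assumed layout).
    Hardware does this comparison associatively, all entries at once.
    """
    return all(store_addr != load_addr for store_addr, _ in store_buffer)
```

A load for which this returns True can read the D-cache ahead of the buffered stores; a load with a match must wait (or be forwarded the store data).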

  10. Load Forwarding. If a load address matches an older write address, the data can be forwarded. If a match is found, forward the related data to the destination register (in the ROB). Multiple matches may exist; the last one wins. (Diagram: in-order RS, load unit and store unit with steps 1 and 2, match at step 3 feeding the destination register, store buffer with addr and data fields, D-cache.)
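The "last one wins" rule can be sketched by scanning the store buffer from youngest to oldest. As before, this is a simplified model under assumptions not in the slides: the buffer is a list of `(addr, data)` tuples ordered oldest first, the D-cache is a dictionary, and the name `execute_load` is hypothetical.

```python
def execute_load(load_addr, store_buffer, dcache):
    """Return (value, source) for a load, forwarding from the youngest matching older store.

    store_buffer: list of (addr, data), oldest first (assumed layout).
    dcache: dict mapping address to value (stand-in for the data cache).
    """
    for store_addr, store_data in reversed(store_buffer):
        if store_addr == load_addr:
            return store_data, "forwarded"
    return dcache[load_addr], "cache"
```

Scanning in reverse matters: with two buffered stores to the same address, the load must see the later one, since that is the value sequential execution would have left in memory.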

  11. In-order Issue Limitation. Any store in the RS may block all following loads. for (i=0;i<100;i++) A[i] = A[i]/2; Loop: L.S F2, 0(R1); DIV F2, F2, F4; SW F2, 0(R1); ADD R1, R1, 4; BNE R1, R3, Loop. When is F2 of SW available? When is the next L.S ready? Assume reasonable FU latency and pipeline length.

  12. Speculative Load Execution. Forwarding does not always work if some addresses are unknown. No match: predict that a load has no RAW dependence on older stores. Flush the pipeline at commit if the prediction was wrong. (Diagram: out-of-order RS, load unit and store unit with steps 1 and 2, match at step 3, finished load buffer with addr and data fields matched at store completion, D-cache. If match: flush pipeline.)
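The check against the finished load buffer can be sketched as follows. This is a deliberately conservative toy model, not the mechanism of any particular processor: the class name, the use of program-order sequence numbers, and the `check_store` interface are assumptions for illustration.

```python
class FinishedLoadBuffer:
    """Toy record of loads that executed speculatively (hypothetical model)."""

    def __init__(self):
        self.loads = []  # (seq, addr): program-order sequence number and address

    def record_load(self, seq, addr):
        """A load finished execution before all older store addresses were known."""
        self.loads.append((seq, addr))

    def check_store(self, store_seq, store_addr):
        """Return sequence numbers of younger loads that already read store_addr.

        A non-empty result means the no-RAW prediction was wrong for those loads,
        so the pipeline must be flushed from the offending load.
        """
        return [seq for seq, addr in self.loads
                if seq > store_seq and addr == store_addr]
```

The sketch flags every younger load to the same address, even one that was correctly forwarded from an intervening store, so it may report violations that a more precise design would filter out.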

  13. Alpha 21264 Pipeline

  14. Alpha 21264 Load/Store Queues. 32-entry load queue, 32-entry store queue. (Diagram: int issue queue, fp issue queue; Addr ALU, Int ALU, Int ALU, Addr ALU, FP ALU, FP ALU; Int RF (80), Int RF (80), FP RF (72); D-TLB, L-Q, S-Q, AF; dual D-cache.)

  15. Load Bypassing, Forwarding, and RAW Detection. At commit, the ROB checks whether the instruction is a load or a store. Load: wait if the LQ head is not completed, then move the LQ head. Store: mark the SQ head as completed, then move the SQ head. (Diagram: ROB, LQ, SQ, two paths from the IQ to the D-cache, each with an address match. If match: forwarding. If match: mark store-load trap to flush the pipeline (at commit).)

  16. Speculative Memory Disambiguation. • When a load is trapped at commit, set the stWait bit in the table, indexed by the load’s PC • When the load is fetched, get its stWait bit from the table • The load waits in the issue queue until older stores are issued • The stWait table is cleared periodically (Diagram: fetch PC indexes a 1024-entry, 1-bit table; load forwarding; renamed instruction; int issue queue.)
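The stWait table behaves like a small, untagged, PC-indexed bit vector. A minimal sketch follows; the index function (dropping the two low PC bits, then taking the result modulo the table size) is an assumption for illustration, as are the class and method names.

```python
class StWaitTable:
    """Toy 1-bit-per-entry store-wait predictor (hypothetical model)."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.bits = [False] * entries

    def _index(self, pc):
        # Assumption: 4-byte instructions, so the low two PC bits carry no information.
        return (pc >> 2) % self.entries

    def on_trap(self, pc):
        """A load at this PC caused a store-load trap at commit: set its stWait bit."""
        self.bits[self._index(pc)] = True

    def must_wait(self, pc):
        """At fetch: should this load wait in the issue queue for older stores?"""
        return self.bits[self._index(pc)]

    def clear(self):
        """Periodic reset so stale predictions do not serialize loads forever."""
        self.bits = [False] * self.entries
```

Because the table has no tags, two loads whose PCs map to the same entry share one bit; the periodic clear limits how long such aliasing, or a dependence that no longer occurs, keeps holding loads back.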

  17. Architectural Memory States. Memory request: search the hierarchy from top to bottom. (Diagram labels: LQ, SQ, completed entries, committed states, L1-Cache, L2-Cache, L3-Cache (optional), Memory, Disk, Tape, etc.)
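A top-to-bottom search can be sketched as a walk over an ordered list of levels, returning the first hit. This is a simplified illustration: each level is modeled as a dictionary, and the ordering used in the usage example (completed store-queue entries above the caches) is an assumption about the diagram, not something the transcript states explicitly.

```python
def read_memory(addr, levels):
    """Return (value, level_name) from the first level that holds addr.

    levels: list of (name, contents) pairs ordered top to bottom, where
    contents is a dict from address to value (assumed representation).
    """
    for name, contents in levels:
        if addr in contents:
            return contents[addr], name
    raise KeyError(addr)


# Hypothetical hierarchy: the newest value of 0x10 sits in the store queue.
hierarchy = [
    ("SQ", {0x10: 7}),
    ("L1", {0x10: 5, 0x20: 1}),
    ("L2", {0x30: 2}),
    ("Memory", {0x10: 0, 0x20: 0, 0x30: 0, 0x40: 4}),
]
```

With this ordering, `read_memory(0x10, hierarchy)` finds the store-queue copy before the older copies further down, which is the point of searching from the top.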

  18. Summary of Superscalar Execution. • Instruction flow techniques: branch prediction, branch target prediction, and instruction prefetch • Register data flow techniques: register renaming, instruction scheduling, in-order commit, misprediction recovery • Memory data flow techniques: load/store units, memory consistency. Source: Shen & Lipasti
