OOO Execution of Memory Operations

P6 Caches
• Blocking caches severely hurt OOO execution
  • A cache miss prevents other cache requests (which could be hits) from being served
  • This defeats one of the main gains of OOO: hiding cache misses
• Both the L1 and L2 caches in the P6 are non-blocking
  • They initiate the actions needed to return data for a cache miss while still responding to subsequent requests for cached data
  • They support up to 4 outstanding misses
• Misses translate into outstanding requests on the P6 bus
  • The bus can support up to 8 outstanding requests
• Subsequent requests for the same missed cache line are squashed
  • Squashed requests are not counted in the number of outstanding requests
• Once the engine has executed beyond the 4 outstanding requests, subsequent load requests are placed in the load buffer
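The miss-handling rules on this slide can be sketched as a small bookkeeping model. This is only an illustration under stated assumptions: the class name `MissTracker`, its method names, and the 32-byte line size are not taken from the slides; only the limit of 4 outstanding misses, the squashing of same-line requests, and the fallback to the load buffer are.

```python
class MissTracker:
    """Illustrative model of a non-blocking cache's outstanding-miss tracking."""

    def __init__(self, max_outstanding=4, line_bytes=32):
        self.max_outstanding = max_outstanding  # limit taken from the slide
        self.line_bytes = line_bytes            # assumed line size
        self.outstanding = set()                # cache lines with a fill in flight
        self.load_buffer = []                   # loads waiting for a free miss slot

    def request_miss(self, addr):
        line = addr // self.line_bytes
        if line in self.outstanding:
            return "squashed"   # same line already requested; not counted again
        if len(self.outstanding) < self.max_outstanding:
            self.outstanding.add(line)
            return "issued"     # becomes an outstanding bus request
        self.load_buffer.append(addr)
        return "buffered"       # beyond the limit: wait in the load buffer

    def fill(self, addr):
        """Called when the missed line returns; frees its slot."""
        self.outstanding.discard(addr // self.line_bytes)
```

With the assumed 32-byte lines, misses to addresses 0 and 8 fall in the same line, so the second one is squashed and does not consume one of the four slots.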

OOO Execution of Memory Operations
• The RS operates based on register dependencies
  • The RS cannot detect memory dependencies
      movl %ebx, -4(%ebp)    # MEM[ebp-4] ← ebx
      movl -4(%ebp), %eax    # eax ← MEM[ebp-4]
• The RS dispatches memory uops when the data for address calculation is ready and the MOB and Address Generation Unit (AGU) are free
• The AGU computes the linear address: Segment-Base + Base-Address + (Scale*Index) + Displacement
  • It sends the linear address to the MOB, to be stored in the Load Buffer or Store Buffer
• The MOB resolves memory dependencies and enforces memory ordering
• Some memory dependencies can be resolved statically
      store r1,a
      load r2,b        → the load can advance before the store
• Problem: some cannot
      store r1,[r3]
      load r2,b        → the load must wait until r3 is known
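The AGU formula above can be written out directly. This is a minimal sketch: the function name `linear_address` and the wrap to 32 bits are assumptions made for illustration, while the restriction of the scale to 1, 2, 4 or 8 follows from x86 addressing modes.

```python
def linear_address(segment_base, base, index=0, scale=1, disp=0):
    """Segment-Base + Base-Address + (Scale*Index) + Displacement, wrapped to 32 bits."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 scale must be 1, 2, 4 or 8")
    return (segment_base + base + scale * index + disp) & 0xFFFFFFFF


# Address of -4(%ebp) with a flat segment (base 0) and ebp = 0x1000
addr = linear_address(segment_base=0, base=0x1000, disp=-4)
```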

Load and Store Ordering
• x86 has a small register set → it uses memory often
• Preventing Stores from passing Stores/Loads: 3%~5% perf. loss
  • P6 chooses not to allow Stores to pass Stores/Loads
• Preventing Loads from passing Loads/Stores: big perf. loss
  • P6 allows Loads to pass Stores, and Loads to pass Loads
• Stores are not executed OOO
  • Stores are never performed speculatively: there is no transparent way to undo them
  • Stores are also never re-ordered among themselves
  • The Store Buffer dispatches a store only when
    • the store has both its address and its data, and
    • there are no older stores awaiting dispatch
• A Store commits its write to memory (DCU) at retirement
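The Store Buffer dispatch rule can be sketched as a scan from the oldest entry. This is a simplified illustration: the dictionary layout and the function name `dispatchable_stores` are assumptions, and the retirement condition is left out so that only the two rules listed above are modelled.

```python
def dispatchable_stores(store_buffer):
    """Return the ids of stores that may dispatch now, oldest first.

    store_buffer is ordered oldest to youngest; 'addr' or 'data' is None
    until the corresponding STA or STD has executed.
    """
    ready = []
    for store in store_buffer:
        if store["addr"] is None or store["data"] is None:
            break  # an incomplete older store holds back every younger store
        ready.append(store["id"])
    return ready


buffer = [
    {"id": 0, "addr": 0x10, "data": 1},
    {"id": 1, "addr": None, "data": 2},   # address not yet computed
    {"id": 2, "addr": 0x20, "data": 3},   # complete, but behind store 1
]
```

Here only store 0 can dispatch; store 2 has both its address and its data but must wait because store 1 is older and still awaiting dispatch.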

Store Implemented as 2 Uops
• A Store is decoded as two independent uops
  • STA (store-address): calculates the address of the store
  • STD (store-data): stores the data into the Store Data Buffer (SDB)
• The actual write to memory is done when the store retires
• Separating STA & STD is important for memory OOO
  • It allows the STA to dispatch earlier, even before the data is known
  • Address conflicts are resolved earlier → this opens the memory pipeline for other loads
• STA and STD can be issued to execution units in parallel
  • The STA is dispatched to the AGU when its sources (base+index) are ready
  • The STD is dispatched to the SDB when its source operand is available

Memory Order Buffer (MOB)
• Store Coloring
  • Each Store is allocated in-order in the Store Buffer, and gets an SBID
  • Each Load is allocated in-order in the Load Buffer, and gets an LBID + the current SBID
• A Load is checked against all previous stores
  • Stores with SBID ≤ the load's SBID
• A Load is blocked if
  • The address of a relevant STA is unresolved
  • An STA has the same address, but its data is not ready
  • Resources are missing (DTLB miss, DCU miss)
• The MOB writes the blocking info into the Load Buffer
  • It re-dispatches the load when a wake-up signal is received
• If the Load is not blocked → it is executed (bypassed)
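The store-coloring check can be sketched as follows. This is an illustration under simplifying assumptions, not the actual MOB logic: the function name `check_load` and the record layout are hypothetical, addresses are compared for exact equality (no partial overlap), SBIDs never wrap around, and the resource-related blocking conditions (DTLB miss, DCU miss) are ignored.

```python
def check_load(load_addr, load_sbid, store_buffer):
    """Check a load against all older stores (SBID <= the load's SBID).

    Each store record has 'sbid', 'addr' (None until the STA executes)
    and 'data' (None until the STD executes).
    Returns (decision, reason, forwarded_value).
    """
    older = [s for s in store_buffer if s["sbid"] <= load_sbid]
    if any(s["addr"] is None for s in older):
        return ("blocked", "unresolved STA", None)
    matches = [s for s in older if s["addr"] == load_addr]
    if not matches:
        return ("execute", "read DCU", None)
    youngest = max(matches, key=lambda s: s["sbid"])  # most recent older store wins
    if youngest["data"] is None:
        return ("blocked", "store data not ready", None)
    return ("execute", "forward from SB", youngest["data"])


sb = [
    {"sbid": 1, "addr": 0x40, "data": 7},
    {"sbid": 2, "addr": None, "data": 9},
    {"sbid": 3, "addr": 0x40, "data": None},
]
```

A load colored with SBID 1 only sees the first store, so it can forward the value 7, while a load colored with SBID 2 is blocked by the unresolved address of the second store.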

MOB (Cont.)
• If a Load misses in the DCU
  • The DCU marks the write-back data as invalid
  • It assigns a fill buffer to the load, and issues an L2 request
  • When the critical chunk is returned, the load is woken up and re-dispatched
• Store → Load Forwarding
  • An older STA has the same address as the load, and its data is ready → the Load gets its data directly from the SB (no DCU access)
• Memory Disambiguation
  • The MOB predicts whether a load can proceed despite unknown STAs
  • Predict colliding → block the Load if there is an unknown STA (as usual)
  • Predict non-colliding → execute even if there are unknown STAs
  • In case of a wrong prediction
    • The entire pipeline is flushed when the load retires
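The slide does not say how the collision prediction is made. Purely as an illustration, the sketch below assumes a per-load-PC counter that allows speculation only after a few consecutive non-colliding executions; the class `DisambiguationPredictor`, the threshold, and the function `load_action` are all hypothetical.

```python
class DisambiguationPredictor:
    """Hypothetical predictor: one saturating counter per load PC."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.counters = {}  # load PC -> consecutive non-colliding executions

    def predict_non_colliding(self, pc):
        return self.counters.get(pc, 0) >= self.threshold

    def update(self, pc, collided):
        if collided:
            self.counters[pc] = 0  # a collision resets the confidence
        else:
            self.counters[pc] = min(self.counters.get(pc, 0) + 1, self.threshold)


def load_action(predictor, pc, has_unknown_older_sta):
    """Decide what to do with a load, following the two prediction cases."""
    if not has_unknown_older_sta:
        return "execute"
    if predictor.predict_non_colliding(pc):
        return "execute speculatively"  # verified later; flush on a wrong prediction
    return "block"
```

Starting from a cold predictor, a load with unknown older STAs is blocked as usual, and it is allowed to run ahead only once it has built up a history of not colliding.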

Pipeline: Load: Allocate
• Allocate ROB/RS, MOB entries
• Assign Store ID (SBID) to enable ordering
[Pipeline diagram. Stages: Alloc, Schedule, AGU, LB Write, Retire. Units: IDQ, RS, ROB, LB, MOB, DTLB, DCU, WB.]

Pipeline: Bypassed Load: EXE
• RS checks when the data used for address calculation is ready
• AGU calculates the linear address: DS-Base + Base + (Scale*Index) + Disp.
• Write the load into the Load Buffer
• DTLB: Virtual → Physical, plus DCU set access
• MOB checks blocking and forwarding
• DCU read / Store Data Buffer read (Store → Load forwarding)
• Write back data / write block code
[Pipeline diagram. Stages: Alloc, Schedule, AGU, LB Write, Retire. Units: IDQ, RS, ROB, LB, MOB, DTLB, DCU, WB.]

Pipeline: Blocked Load Re-dispatch
• MOB determines which loads are ready, and schedules one
• The load arbitrates for the MEU
• DTLB: Virtual → Physical, plus DCU set access
• MOB checks blocking/forwarding
• DCU way select / Store Data Buffer read
• Write back data / write block code
[Pipeline diagram. Stages: Alloc, Schedule, AGU, LB Write, Retire. Units: IDQ, RS, ROB, LB, MOB, DTLB, DCU, WB.]

Pipeline: Load: Retire
• Reclaim ROB, LB entries
• Commit results to RRF
[Pipeline diagram. Stages: Alloc, Schedule, AGU, LB Write, Retire. Units: IDQ, RS, ROB, LB, MOB, DTLB, DCU, WB.]

Pipeline: Store: Allocate
• Allocate ROB/RS
• Allocate Store Buffer entry
[Pipeline diagram. Stages: Alloc, Schedule, AGU, SB, Retire. Units: IDQ, RS, ROB, DTLB, SB.]

Pipeline: Store: STA EXE
• RS checks when the data used for address calculation is ready, and dispatches the STA to the AGU
• AGU calculates the linear address
• Write the linear address to the Store Buffer
• DTLB: Virtual → Physical
• Load Buffer Memory Disambiguation verification
• Write the physical address to the Store Buffer
[Pipeline diagram. Stages: Alloc, Schedule, AGU, SB V.A., SB P.A., Retire. Units: IDQ, RS, ROB, DTLB, SB.]

Pipeline: Store: STD EXE
• RS checks when the data for the STD is ready, and dispatches the STD
• Write the data to the Store Buffer
[Pipeline diagram. Stages: Alloc, Schedule, SB data, Retire. Units: IDQ, RS, ROB, SB.]

Pipeline: Senior Store Retirement
• When the STA (and thus the STD) retires
  • The Store Buffer entry is marked as senior
• When the DCU is idle → the MOB dispatches a senior store
  • Read the senior entry
  • The Store Buffer sends the data and physical address
  • The DCU writes the data
  • Reclaim the SB entry
[Pipeline diagram. Stages: Alloc, Schedule, MOB, SB, DCU, Retire. Units: IDQ, RS, ROB, SB.]

The life of a Load: R3 ← MEM(R2+50)
[Diagram: Instruction Q; RAT mapping architectural register R3 to RF0; RS holding RF0 ← MEM(R2+50); ROB entry for Ld1 (valid bit, data, destination R3); MOB with Load Buffer (Addr., BC, V) and Store Buffer; EXE with ALU1, AGU (R2+50) and dTLB; Data Cache; Retire.]
• 1 entry in the ROB, RS and Load Buffer + rename in the RAT
• Dispatch the Load address calculation to the AGU when the source is ready, then release the RS entry
• The AGU updates the address in the Load Buffer. The pipeline proceeds to the dTLB
• The Load Buffer checks for blocking conditions and dispatches the Load to the DCU
• The DCU sends the result to the RS and updates the ROB with the load result
• The Load retires like any other instruction (when all previous instructions have retired), and the RAT is updated
• The LB and ROB entries are released

The life of a Store: MEM(R2+50) ← R3
[Diagram: Instruction Q; RAT mapping architectural register R3 to RF0; RS holding STA: R2+50 and STD: RF0; ROB entry for St; MOB with Load Buffer (Addr., BC, V) and Store Buffer (V, Addr., Data, Snr); EXE with ALU1, AGU (R2+50) and dTLB; Data Cache; Retire.]
• 1 entry in the ROB, 2 in the RS and 1 in the Store Buffer
• Dispatch the Store address calculation to the AGU when the source is ready, then release the RS entry
• The AGU computes the address → update the Store Buffer & provide the address to dependent loads
• The Store pipeline proceeds to the dTLB. The physical address is updated in the SB
• Dispatch the Store Data when the data is ready → update the Store Buffer & provide the data to dependent loads
• The Store Buffer updates the ROB entry
• The Store retires from the ROB like any other instruction (when all previous instructions have retired)
• After this, the Store is marked as a Senior Store in the Store Buffer
• The Store Buffer initiates a DCU write. When the write is done, the SB reclaims the entry

Question
• This question considers a processor with OOOE and Speculative Execution
• Consider the following code:
      1000 load R2,R1,30      ; R2=m[R1+30]
      1004 store R2,20,R1     ; m[R2+20]=R1
      1008 load R3,R1,100     ; R3=m[R1+100]
      100C store R1,40,R3     ; m[R1+40]=R3
      1010 add R1,R1,10       ; R1=R1+10
      1014 blt R1,100,1000    ; if (R1<100) PC=1000
• Assumptions
  • The branch is predicted taken
  • At the start of execution, every memory address N holds the value N, and R1=R2=R3=10
  • For simplicity, assume that the addresses in the program are physical and no translation is needed
  • The L1 data cache returns data within one clock cycle, but it is empty at the start of execution
  • The L2 data cache returns data within 7 clock cycles, and it already holds all the requested addresses at the start of execution

Instruction allocation
• Up to four instructions can be allocated in each cycle (and at least 4 instructions are always ready for allocation)
• The ROB, the MOB and the RS are large and never fill up

Instruction execution
• There are infinitely many execution units
• An instruction can enter execution in the cycle after its allocation, provided that all the data it needs is already ready. An instruction that is waiting for data can enter execution in the cycle right after the data becomes ready
• Executing an ALU instruction takes one clock cycle
• Executing a branch instruction takes one cycle
• If the prediction turns out to be wrong, a flush is performed in the next cycle (at time t+1)
• Instructions from the correct path are allocated 5 cycles after the flush (at time t+6)
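The allocation and misprediction timing rules can be sketched as two small helpers. The function names are hypothetical; the start at T=1 comes from the table instructions, and the sketch assumes an uninterrupted instruction stream (no flush between instruction 0 and instruction i).

```python
ALLOC_WIDTH = 4       # instructions allocated per cycle
REALLOC_DELAY = 6     # mispredict found at t: flush at t+1, allocation resumes at t+6


def alloc_time(i, first_cycle=1):
    """Allocation cycle of the i-th instruction (0-based) in an uninterrupted stream."""
    return first_cycle + i // ALLOC_WIDTH


def correct_path_alloc_time(t_mispredict):
    """First allocation cycle on the correct path after a mispredict detected at time t."""
    return t_mispredict + REALLOC_DELAY


# The six instructions of the first loop iteration
first_iteration = [alloc_time(i) for i in range(6)]
```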

Instruction execution (cont.)
• A load is sent to execution when the data for its address calculation is ready
  • In the first cycle, the address is computed
  • In the second cycle, the following condition is checked: for every store that precedes the load, the store's address is known, and either the load's address differs from the store's address, or the two addresses are equal and the store's data is already known
  • In the third cycle, if the check succeeds, the data is obtained from the L1 cache (on a hit), or directly from the MOB by store-to-load forwarding
  • If the check succeeds but there is an L1 cache miss and no store-to-load forwarding, the data is obtained in the tenth cycle from the L2 cache
  • If the check fails, the load is blocked. When the blocking condition is removed, the load is sent to execution again, and the first cycle is skipped (it starts with the condition check)
• A store is sent to execution when the data for its address calculation is ready
  • The address calculation takes one clock cycle, at the end of which the address is written to the MOB
  • Independently, when the data to be written to memory is ready, it is written to the MOB in the next cycle
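The load timing rules above reduce to simple arithmetic on the cycle in which the memory check is performed. The sketch below is an illustration with hypothetical function names; the cycle in which a blocked load repeats its check is passed in by the caller, since it depends on when the blocking condition is removed.

```python
L2_LATENCY = 7  # cycles, from the question's assumptions


def load_data_ready(t_check, l1_hit=False, forwarded=False):
    """Time the load's data is available, given the cycle of a successful memory check."""
    t_access = t_check + 1            # the cycle after the check: L1 or MOB access
    if l1_hit or forwarded:
        return t_access
    return t_access + L2_LATENCY      # L1 miss: data arrives from L2


def unblocked_load_data_ready(t_exe, **kwargs):
    """Same, for a load that is never blocked: address at t_exe, check at t_exe + 1."""
    return load_data_ready(t_exe + 1, **kwargs)


# First load of the program: sent to execution at T=2 and misses the empty L1
t_first_load = unblocked_load_data_ready(2)
```

For the first load this gives T=11, which is the tenth cycle counted from T=2. A blocked load that repeats its check at T=13 and misses in L1 gets its data at `load_data_ready(13)`, that is T=21.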

Instruction commit
• An instruction can commit starting from the cycle after it finishes execution, provided that the instruction before it has committed or is committing in the same cycle. There is no limit on the number of instructions that commit in each cycle
• A store performs its write to the cache post-commit
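The in-order commit rule can be sketched as a running maximum over the program order. The function name `commit_times` is hypothetical, and the input is assumed to be the cycle in which each instruction finishes execution, listed in program order.

```python
def commit_times(t_done):
    """Commit cycle of each instruction, given finish-of-execution cycles in program order."""
    times = []
    previous_commit = 0
    for t in t_done:
        # Earliest own commit is the cycle after finishing; never before the previous commit
        previous_commit = max(t + 1, previous_commit)
        times.append(previous_commit)
    return times


example = commit_times([11, 13, 4, 5])
```

In the example, the third and fourth instructions finish early but must wait for the second one, so all three commit together in cycle 14.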

Summary…
• 4-wide machine
• L1: 1 cycle; L2: 7 cycles; ALU, Branch: 1 cycle
• L1 empty / L2 always hits…
• Mispredict @ T:
  • T+1: Flush pipeline
  • T+6: Alloc on the good path
• Load (cycles counted from dispatch):
  • 1: Addr. calculation
  • 2: Memory checks. For all previous stores: ≠ addr., or same addr. & data ready. Retry after block
  • 3: L1 hit, or forwarding (from MOB)
  • 10: L2 hit (7 cycles)
• Store:
  • Addr. calculation → MOB update
  • Store data ready → MOB update

Fill this table…
[Table to fill in. Column annotations:]
• Arch. reg value after commit
• Addr. for LD & ST
• Data for LD & ST
• Alloc time: 4 / cycle
• Src reg: Pi / Ri. For a Store: Src1: addr, Src2: data
• Time Src ready
• Time exe
• Load block code: 0: ready, 1: addr blocking, 2: data not ready

Instructions for filling in the table
• R1, R2, R3: the values of the architectural registers after commit. Circle the value of the architectural register that the instruction writes to. If the instruction does not reach commit, leave these fields empty
• addr: the memory access address, for load and store instructions only
• data: the memory value that is read or written, for load and store instructions only
• T alloc: the time at which the instruction is allocated (four instructions per cycle, starting at T=1)
• src1, src2: the numbers of the registers used as sources of the instruction: Pi for a physical register, and Ri if the architectural register is read directly
  • For a store: src1 is the register used for the address calculation, and src2 is the register that holds the data
• Imm: if the instruction has an Imm, the value of the Imm
• T src1 ready, T src2 ready: the time at which each of the instruction's sources is ready. If the src is ready at allocation time, this time equals the allocation time. If the instruction that computes the value of the src finishes execution at time T, the src is ready at time T

Instructions for filling in the table (cont.)
• T exe: the time at which the instruction is sent to execution. If all the srcs of an instruction are ready at time T, the instruction can be sent to execution at time T+1
• Load block code (relevant for load instructions only): the blocking code of the load
  • 0: no blocking
  • 1: blocked due to an unresolved store address
  • 2: blocked due to waiting for store data
  • If the load is blocked more than once, list all the block codes
• T data ready:
  • For a store: the time at which the data to be written to memory is ready
  • For a load: the time at which the data is obtained (from the cache or directly from the MOB)
• T commit: the time at which the instruction commits

[Solution table with a timing annotation for the blocked load: 11: R2 (PB0) is known; 12: Store addr. calc: PB0+20; 13: Load memory checks; 14: L1 miss; 21: L2 hit.]
