Software-Hardware Cooperative Memory Disambiguation Ruke Huang, Alok Garg, and Michael Huang Department of Electrical & Computer Engineering University of Rochester "Software-Hardware Cooperative Memory Disambiguation", Alok Garg, HPCA 2006
Motivation
• Hiding long latencies
  • Scaling up of many structures
    • Complex, hard to design
    • Consumes more energy
    • Slower
• Inefficiency in hardware
  • Meticulously keep track of all instructions
  • No prior knowledge of out-of-order execution
  • Simply cross-compare all loads and stores
(Figure callouts: 16%, LQ size. Configuration: ROB size 320, SQ size 48, LQ size 48.)

Software Assistance
• Global information
  • Statically identify non-conflicting memory accesses
• Advantages
  • Reduced resource pressure
  • Energy savings
• Loads not requiring memory disambiguation
  • On average, 43% of dynamic loads in SPEC FP applications

Recent Research
• Chrysos and Emer (ISCA’98)
• Sethumadhavan et al. (MICRO’03)
• Park et al. (MICRO’03)
• Baugh and Zilles (PACC’04)
• Akkary et al. (MICRO’03)
• Gandhi et al. (ISCA’05), etc.
Hardware-only: provisioning, recurring overhead
Cooperative: consumption, one-time overhead

Outline
• Cooperative Memory Disambiguation
• Framework
• Evaluation
• Conclusion

Cooperative Memory Disambiguation: Resource-Effective Approach
• 90% of dynamic loads do not communicate with in-flight stores
  • Many loads do not require memory disambiguation resources
• Safe loads: a software analyzer can identify them
  • Can exploit hardware-specific information
• Hardware resources are used only for non-safe loads
Example:
  int A[1000], B[1000];
  void VecAdd() {
      for (int i = 0; i < 1000; i++)
          A[i] = A[i] + B[i];
  }

Cooperative Memory Disambiguation Framework
• Software-hardware interface
  • Decoupled ISA (no compatibility obligations)
• Software support
  • Binary-to-binary translator: alto (Muth et al.)
  • Binary analyzer
    • Identify read-only data loads
    • Identify other general safe loads
• Architectural support
  • Light-weight
(Diagram labels: Source; Compilation (compiler); Original binary; ISA Translator (hardware-specific translator); Hardware-specific internal binary; Hardware; Extended instruction set.)

General Safe Loads
• Scope of parser analysis
  • Steady state loop
  • No internal control flow
• Limited in-flight instructions
  • ROB size, store queue size
(Diagram: a simple loop body of loads and stores ending in a branch, and the instruction window during steady state loop execution, holding iterations i-2, i-1, and i.)

General Safe Loads (Cont.): Real example from a SPEC FP application
One loop from galgel:
  0x120033140: ldl r31, 256(r3)      ; prefetch
  0x120033144: ldt f21, 0(r3)        ; Ld1
  0x120033148: lda r27, -2(r27)      ; r27 = r27-2
  0x12003314c: lda r3, 16(r3)        ; r3 = r3+16
  0x120033150: ldt f22, -8(r3)       ; Ld2
  0x120033154: ldt f23, 0(r11)       ; Ld3
  0x120033158: cmple r27, 0x1, r1    ;
  0x12003315c: lda r11, 16(r11)      ; r11 = r11+16
  0x120033160: ldt f24, -8(r11)      ; Ld4
  0x120033164: lds f31, 240(r11)     ; prefetch
  0x120033168: mult f20, f21, f21    ;
  0x12003316c: mult f20, f22, f22    ;
  0x120033170: addt f23, f21, f21    ;
  0x120033174: addt f24, f22, f22    ;
  0x120033178: stt f21, -16(r11)     ; St1
  0x12003317c: stt f22, -8(r11)      ; St2
  0x120033180: beq r1, 0x120033140   ;
Address analysis (here Ld1 is the load from 0(r3) and Ld2 is the load from 0(r11), which the listing above labels Ld3):
  AddrLd1 = _R3 + 16*i
  AddrLd2 = _R11 + 16*i
  AddrSt1 = _R11 + 16*i
  AddrSt2 = _R11 + 16*i + 8
Analysis window: 16 iterations
Store address range = _R11 + (i-16)*16 to _R11 + (i-1)*16 + 8
• Ld2 is statically determined to be safe
• Ld1 needs run-time evaluation

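The window analysis on this slide amounts to an interval-overlap test, which can be sketched in C. This is an illustration under stated assumptions, not the analyzer from the talk: accesses are assumed to be 8 bytes wide, addresses are byte offsets from the loop-entry value of the store base register (_R11), and the names `range_t`, `store_window`, and `load_is_safe` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define ACCESS_BYTES 8   /* assumption: ldt/stt move 8-byte values */

/* Closed byte range [lo, hi], as offsets from one symbolic base (_R11). */
typedef struct { int64_t lo, hi; } range_t;

/* Bytes written by the stores of iterations i-window .. i-1, when the
 * first store of iteration k is at stride*k and the last one is at
 * stride*k + last_off (St1 and St2 on the slide: stride 16, last_off 8). */
range_t store_window(int64_t i, int64_t window, int64_t stride, int64_t last_off)
{
    range_t r;
    r.lo = (i - window) * stride;
    r.hi = (i - 1) * stride + last_off + ACCESS_BYTES - 1;
    return r;
}

/* A load at load_off + stride*i is safe when none of its bytes fall
 * inside the range covered by the in-flight stores. */
bool load_is_safe(int64_t i, int64_t load_off, int64_t stride, range_t stores)
{
    int64_t lo = load_off + stride * i;
    int64_t hi = lo + ACCESS_BYTES - 1;
    return hi < stores.lo || lo > stores.hi;
}
```

For a load that shares the store base register (offset 0, as for the slide's Ld2), the test holds for every i, so the load can be marked safe statically. For a load based on a different register (the slide's Ld1), `load_off` stands for the unknown difference between the two registers, which is why the decision has to be deferred to run time.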
General Safe Loads (Cont.): Real example from a SPEC FP application
Modified code:
  New_entry:
    mark_sq
    if (r3-r11+8 > 0) or (r3-r11+264 < 0) then cset CR0, 1
  0x120033144: sldt f21, 0(r3), [CR0]       ; Ld1 (safe)
  0x12003314c: lda r3, 16(r3)               ; r3 = r3+16
  0x120033154: sldt f23, 0(r11), [CR_TRUE]  ; Ld2 (safe)
  0x120033158: cmple r27, 0x1, r1           ;
  0x12003315c: lda r11, 16(r11)             ; r11 = r11+16
  0x120033174: addt f24, f22, f22           ;
  0x120033178: stt f21, -16(r11)            ; St1
  0x12003317c: stt f22, -8(r11)             ; St2

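The run-time guard in the New_entry block can be written as a small C predicate. This is only a sketch: the constants 8 and 264 are copied verbatim from the slide, the function name `ld1_guard` is hypothetical, and it assumes r3 and r11 hold their loop-entry values and that their difference fits in a signed 64-bit integer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the slide's condition:
 *   if (r3-r11+8 > 0) or (r3-r11+264 < 0) then cset CR0, 1
 * Returns the value that would be written into CR0. */
bool ld1_guard(int64_t r3, int64_t r11)
{
    int64_t d = r3 - r11;          /* distance between the two base registers */
    return (d + 8 > 0) || (d + 264 < 0);
}
```

When the predicate is true, the load stream based on r3 lies entirely on one side of the window of in-flight stores based on r11, so the loads tagged with [CR0] can be treated as safe for the whole loop. The check is evaluated once at loop entry, which matches the "one-time overhead" contrasted with hardware-only schemes earlier in the deck.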
Safe stores
• Safe stores
  • A store is safe if it does not communicate with future loads
  • Used to indirectly discover safe loads
• Un-analyzable store
  • A load is safe if all stores in the SQ are safe
• Summary of safe load detection
  • Simple loop body
  • All stores must be analyzable
  • Address range calculation
(Diagram: a loop body of Load (A), Store1 (UA), Store2 (A), Branch, repeated across the in-flight instructions; A and UA appear to denote analyzable and un-analyzable accesses.)

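The rule "a load is safe if all stores in the SQ are safe" can be sketched as a scan over the store queue. The entry layout and the names `sq_entry_t` and `all_inflight_stores_safe` are assumptions made for illustration; the slides do not describe how the hardware tracks this.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical store-queue entry: only the two bits this rule needs. */
typedef struct {
    bool valid;  /* entry holds an in-flight store */
    bool safe;   /* store was marked safe by the software analyzer */
} sq_entry_t;

/* True when no in-flight store is unsafe, i.e. a load issued now
 * cannot conflict with anything currently in the store queue. */
bool all_inflight_stores_safe(const sq_entry_t *sq, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        if (sq[k].valid && !sq[k].safe)
            return false;
    }
    return true;
}
```

A linear scan is used here only for clarity. A hardware design would more plausibly keep a running count of unsafe in-flight stores, but that is a guess rather than something stated in the deck.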
Architectural Support
• Safe loads
  • Boolean condition registers
  • cset (instruction)
• Safe stores
  • Scope marker
• Indirect jumps
  • Flash-reset all condition registers

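A minimal software model of the condition-register support listed above might look as follows. The number of registers (`NUM_CR`), the type `cr_file_t`, and the function names are all assumptions; the slides name only the cset instruction and the flash-reset behavior, and the always-true CR_TRUE register from the modified code is left out of this sketch.

```c
#include <stdbool.h>
#include <string.h>

#define NUM_CR 8   /* assumption: the slides do not give the register count */

/* Hypothetical file of Boolean condition registers. */
typedef struct { bool cr[NUM_CR]; } cr_file_t;

/* Models "cset CRn, value". Out-of-range indices are ignored. */
void cr_cset(cr_file_t *f, int idx, bool value)
{
    if (idx >= 0 && idx < NUM_CR)
        f->cr[idx] = value;
}

/* Models the flash reset of all condition registers on an indirect jump. */
void cr_flash_reset(cr_file_t *f)
{
    memset(f->cr, 0, sizeof f->cr);
}

/* A load tagged with CRn is treated as safe only while CRn is set. */
bool load_is_marked_safe(const cr_file_t *f, int idx)
{
    return idx >= 0 && idx < NUM_CR && f->cr[idx];
}
```

Clearing every register at once is the conservative choice: after the reset, every conditionally safe load falls back to normal disambiguation until a later cset marks it safe again.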
Outline
• Cooperative Memory Disambiguation
• Framework
• Evaluation
• Conclusion

Experimental Setup
• Modified SimpleScalar 3.0b simulator
• Wattch to estimate dynamic energy consumption
• SPEC CPU2000 benchmark suite

Breakdown of Safe Loads (FP)
(Chart; callouts: 97%, 43%.)

Performance Improvement (FP)
(Chart; callout: 40/48%.)

Breakdown of Safe Loads (INT)
(Chart.)

Performance Improvement (INT)
(Chart.)

Energy Savings
(Charts: floating-point applications; integer applications.)

Conclusions
• Software assistance improves LSQ efficiency
  • Detects an average of 43% of loads as safe
  • Average 10% performance gain
• Compiler techniques for optimization of micro-architecture resources
• Future work
  • More powerful static analyzer
  • Manage other micro-architecture resources
    • E.g., register file

Thank you! Questions?

Support for Coherency
• Hash table: 2-bit
• Total entries: 512
• Details: http://www.ece.rochester.edu/~mihuang/PAPERS/hpca06tr.pdf
(Diagram labels: Table 1, Table 2, access bit, invalidation bit.)

Read-Only Data Loads
• Alpha COFF binary header
  • Global pointer (GP)
  • Read-only sections
• Access address calculation
  • Algorithm: extended constant propagation
Example values: gp = 0x120022000; read-only section start: 0x120023000, end: 0x120024000
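
Once constant propagation has resolved a load's address to the global pointer plus a known displacement, the read-only test reduces to a range check against the section bounds from the binary header. The sketch below illustrates only that final check. It assumes the section end is exclusive, and `section_t` and `is_read_only_load` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical descriptor for a read-only section, [start, end). */
typedef struct { uint64_t start, end; } section_t;

/* True when a load of `size` bytes at gp + disp lies entirely
 * inside the read-only section. */
bool is_read_only_load(uint64_t gp, int64_t disp, uint64_t size, section_t ro)
{
    uint64_t addr = gp + (uint64_t)disp;
    return addr >= ro.start && addr + size <= ro.end;
}
```

With the values on the slide (gp = 0x120022000, section from 0x120023000 to 0x120024000), an 8-byte load at displacement 0x1000 from gp falls in the read-only section, while a load at displacement 0 does not.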