180 likes | 305 Views
Characterization and Evaluation of Hardware Loop Unrolling. Marcos R. de Alba and David R. Kaeli. BARC 2003 Cambridge, MA January 30, 2003. Motivation.
E N D
Characterization and Evaluation of Hardware Loop Unrolling Marcos R. de Alba and David R. Kaeli BARC 2003 Cambridge, MA January 30, 2003
Motivation • High temporal locality available in loops suggests applying more aggressive fetch techniques to provide a larger number of instructions for dispatch and issue • Current aggressive fetch techniques (e.g., trace caches) are not tuned to exploit loop behavior • We propose a mechanism specifically tailored to fetching loop bodies DE ALBA, KAELI BARC 2003
Outline • Introduction • Loop characteristics • Loop prediction hardware • Loop caching and unrolling hardware • Experimental approach • Results • Conclusions and current work DE ALBA, KAELI BARC 2003
Introduction • To exploit instruction level parallelism, it is essential to have a large window of candidate instructions available to issue from • The temporal locality present in loops provides a good opportunity for loop caching • In general-purpose applications, 50% of the loops have variable-dependent trip counts and/or contain conditional branches in their bodies • These characteristics suggest that a hardware-based loop caching approach should be investigated DE ALBA, KAELI BARC 2003
Loop characteristics • internal control flow • number of loop visits • number of iterations per loop visit • dynamic loop body size • patterns leading up to the loop visit DE ALBA, KAELI BARC 2003
Loop prediction hardware • A path-to-loop register to detect loops in advance • A stack to maintain nested per-iteration loop information • A table to maintain per-visit loop information and update loop prediction state • Path-in-iteration table to maintain history of branches visited within individual iterations DE ALBA, KAELI BARC 2003
Loop characteristics and hardware components * Based on a study of SPECint2000, MiBench and MediaBench DE ALBA, KAELI BARC 2003
Loop stack Path-in-loop table Head address Tail address path-to-loop *pilt path-in-loop itns next 101 0x2d24 0x2d68 001 2 2 . . 000 2 3 1__ 0 0 Loop prediction table Path-in-loop prediction table Head address Tail address Predicted path-in-loop preditns conf ctr path-to-loop *pilt next 001 0x2d24 0x2d68 101 2 3 2 000 . . 2 3 3 1__ 0 3 0 DE ALBA, KAELI BARC 2003
Loop caching and unrolling hardware • A loop cache to hold instructions that belong to loop bodies • A loop cache control mechanism for indexing into the loop cache and for maintaining loop cache state (number of allocated loops, their indices and offsets) DE ALBA, KAELI BARC 2003
Path-to-loop Loop prediction table Path-in-loop table 001 2 2 bn-1 bn-2 bn-3 ... b1 tag head tail preditns *pilt 000 2 3 50 2d24 2d68 4 1__ 0 0 index Gshare Y N tag match last branch ? b0 address N Y preditns > 1 ? The information is used used by the loop cache control to interrogate the loop cache for a hit or to build dynamic traces in the case of a miss There is no information for this loop, proceed with normal fetching
2d24 2d68 Loop cache control mechanism Loop cache tag instructions ... index 50 ... ... ... ... . . . . . . from loop prediction table N Y match? store loop pattern in the loop cache issue instructions from loop cache DE ALBA, KAELI BARC 2003
Loop cache Loop cache control (2d24, 2d68, 4, 001, 001, 000, 000) Loop body 2d24: ldl t1, 16(sp) 2d28: lda t1, -31(t1) 2d2c: bge t1, 2d6c 2d30: ldl v0, 16(sp) 2d34: lda v0, -15(v0) 2d38: bge v0, 2d50 2d3c: ldl t2, 0(sp) 2d40: ldl t0, 32(sp) 2d44: subl t0, t2, t0 2d48: br zero, 2d5c 2d4c: ldl v0, 32(sp) 2d50: subl v0, 0x1, v0 2d54: stl v0, 32(sp) 2d58: ldl t0, 16(sp) 2d5c: addl t0, 0x1, t0 2d60: stl t0, 16(sp) 2d68: br zero, 2d24 Unrolled loop according to information from loop predictor (assumed 4 instructions/line)
Experimental approach • Modified Simplescalar 3.0c Alpha EV6 pipeline to model the following features: • loop detection • loop prediction • loop cache filling • loop cache/I-cache multiplexing • loop termination detection • loop stack operations • loop table operations DE ALBA, KAELI BARC 2003
Modifications to SimpleScalar Loop predictor Register scheduler Memory scheduler Exec Mem Write back Fetch Dispatch Commit Loop cache D-Cache (DL1) I-Cache (IL1) ITLB DTLB D-Cache (DL2) I-Cache (IL2) Virtual memory
path-to-loop: prediction rate of entering the loop using the path-to-loop iterations: prediction rate of number of iterations per entered loop visit path-in-itn: prediction rate of paths-in-iteration per loop iteration speedup: relative CPI gain compared to no loop prediction DE ALBA, KAELI BARC 2003
Conclusions and current work • Above 50 % of loops have properties that make them highly predictable and attractive for aggressive fetching • Compare efficiency of loop cache against trace cache • Propose a hybrid fetch approach utilizing the loop cache for loop bodies and the trace cache for all non-in-loop instructions DE ALBA, KAELI BARC 2003