Survey of Low-Complexity, Low Power Instruction Scheduling Alex Li Lewen Lo Sara Sadeghi Baghsorkhi
Motivation • Scalability of instruction window size • Extract greater ILP • Power consumption • CAM logic is power hungry • Complexity • Wire delay of associative logic dominates gate delay in scheduler
Outline • Wakeup logic optimizations • Distributed instruction queues • Waiting Instruction Buffer • Preschedulers • Cyclone • Wakeup-Free
Wakeup Logic [Figure: one issue-queue entry holding opcode, FU type, destination register tag, and two source register tags with valid bits V1/V2. Comparators match each broadcast result register tag against the source tags; when both operands are valid, the ready bit R is raised to the select logic.]
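The baseline the later slides optimize can be sketched in a few lines. This is an illustrative model of conventional CAM-style wakeup (our own naming, not any paper's RTL): every completing instruction broadcasts its destination tag to every entry, and every entry compares it against both source tags.

```python
# Sketch of conventional CAM-style wakeup: each entry holds two source
# tags with valid bits; a broadcast result tag is compared everywhere.
class Entry:
    def __init__(self, src1, src2, ready1=False, ready2=False):
        self.src = [src1, src2]          # source register tags
        self.ready = [ready1, ready2]    # operand valid bits (V1/V2)

    @property
    def request(self):                   # R bit sent to select logic
        return all(self.ready)

def broadcast(entries, result_tag):
    """One wakeup cycle: the tag is compared in every entry (the
    power-hungry part the following slides attack)."""
    for e in entries:
        for i, tag in enumerate(e.src):
            if tag == result_tag:
                e.ready[i] = True

entries = [Entry("r1", "r2", ready2=True), Entry("r1", "r4")]
broadcast(entries, "r1")
print([e.request for e in entries])      # → [True, False]
```

Note that the comparison count scales with window size times issue width, which is why the schemes below gate, eliminate, or replace these comparators.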
Gated Tag Matching • Rationale • Parts of the IQ waste energy • Energy-wasting sources • Empty entries (between head and tail) • Already-ready operands • Already-issued instructions • Solution • Gate the comparators! Folegnani et al. ISCA2001
Gated Tag Matching • Furthermore… • the youngest instructions contribute little to performance • Solution: Dynamic resizing • Use a limit pointer & performance counters • Periodically reduce the effective size while IPC loss stays below a threshold • Increase the size if IPC loss exceeds the threshold for a set period • Cost: Additional logic for • Gated comparators • Performance counters • Claims • 128-entry queue shrinks to an effective size of ~43 • 4% performance loss • 90.7% wakeup-logic energy savings • 14.9% chip energy savings • Significant energy savings • Based on a conventional design; no performance benefit Folegnani et al. ISCA2001
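The resizing policy above can be sketched as a small feedback loop. This is a hedged approximation of Folegnani et al.'s scheme; the thresholds, step size, and function names here are illustrative, not the paper's parameters.

```python
# Illustrative sketch of dynamic IQ resizing: shrink the effective queue
# (move the limit pointer) while measured IPC stays near full-size IPC,
# and grow it back when the IPC loss exceeds a threshold.
def resize(limit, ipc_now, ipc_full, min_limit=16, max_limit=128,
           step=8, loss_threshold=0.02):
    """Return the limit pointer for the next measurement interval."""
    if ipc_now < ipc_full * (1 - loss_threshold):
        return min(limit + step, max_limit)   # losing IPC: grow the window
    return max(limit - step, min_limit)       # IPC fine: keep shrinking

limit = 128
limit = resize(limit, ipc_now=1.97, ipc_full=2.0)
print(limit)   # IPC within threshold → shrink to 120
```

Entries beyond the limit pointer have their comparators gated, which is where the wakeup-energy savings come from.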
Tag Elimination • Rationale • Most instructions (80-96%) have at least 1 operand already ready • The last-arriving operand wakes up the instruction • Base Approach • Issue window with 2-, 1-, and 0-comparator entries • Insert instructions based on operand readiness at dispatch • Advanced Approach • Eliminate 2-comparator entries • Predict the last-arriving operand • Re-issue on misprediction • Results (32 1-comp / 32 0-comp entries) • Slight IPC loss (1-3%) • Accounting for the reduced cycle time, good speedup (25-45%) • 65-75% lower energy-delay product • Drastically reduces associative logic (to 1/4) • reduces energy • no performance impact (even speedup) Ernst et al. ISCA2002
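The base approach's dispatch step can be sketched as follows. The function names and fallback policy here are our own illustration, assuming (plausibly, though the paper may differ) that an instruction can be placed in an entry with more comparators than it needs when its own class is full.

```python
# Sketch of dispatch for a tag-eliminated window: count unready operands
# and place the instruction in an entry with that many comparators.
def slot_class(ready1, ready2):
    """Comparators the entry needs: 0, 1, or 2."""
    return 2 - (int(ready1) + int(ready2))

def pick_entry(ready1, ready2, free):
    """free: {comparator_count: entries_left}; prefer the exact class,
    fall back to a larger one (a 2-comp entry can hold anything)."""
    for c in range(slot_class(ready1, ready2), 3):
        if free.get(c, 0) > 0:
            free[c] -= 1
            return c
    return None                     # stall dispatch: no suitable entry

free = {0: 1, 1: 1, 2: 0}
print(pick_entry(True, True, free))    # → 0 (0-comp entry)
print(pick_entry(True, True, free))    # → 1 (0-comp pool empty)
```

The advanced approach goes further: with 2-comparator entries removed entirely, a last-arriving-operand predictor decides which single tag to watch, and a misprediction forces re-issue.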
N-use Issue Logic • Rationale • Most instructions (75-78%) have only 1 (or few) dependent instructions • Approach • More SRAM (N-use table) • Less CAM (I-buffer) • Wake up recorded dependents only • Claims • 2-use table + 2-entry I-buffer comparable to a 64-entry CAM (~4% slowdown) • 96 regs → 192 entries in the 2-use table! • Justifications • DOES reduce CAM (64 to 2 cells) • Energy of the 2-use table offset by gated entries • Less complex, but maybe more area • Cycle time may be reduced • Drastically different design Canal et al. ICS2001
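The core trade can be sketched as an SRAM-indexed dependence table replacing the broadcast. This is a hedged simplification of Canal et al.'s N-use scheme; the data structures and names are ours.

```python
# Sketch of N-use wakeup: a completing instruction indexes a table by
# its destination register and wakes at most N recorded dependents
# directly; extra dependents spill to a small CAM-based I-buffer.
N = 2
use_table = {}       # dest register -> up to N dependent instruction ids
overflow = []        # dependents that did not fit (handled by I-buffer)

def record_use(dest_reg, consumer):
    deps = use_table.setdefault(dest_reg, [])
    if len(deps) < N:
        deps.append(consumer)
    else:
        overflow.append(consumer)   # falls back to the small CAM

def complete(dest_reg):
    """Wake only the recorded dependents: no global tag broadcast."""
    return use_table.pop(dest_reg, [])

record_use("r1", "i2")
record_use("r1", "i3")
record_use("r1", "i4")             # third use overflows
print(complete("r1"), overflow)    # → ['i2', 'i3'] ['i4']
```

The point of the "96 regs → 192 entries" claim is visible here: the table needs N slots per physical register, so SRAM capacity grows even as CAM cells shrink from 64 to 2.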
Distributed Instruction Queue (FIFO) • Instructions in a queue form a dependence chain. • Only instructions at the heads of the queues can be ready. • Works well for INT codes, but poorly for FP codes. • A large number of FIFOs increases complexity Palacharla et al 97
Distributed Instruction Queue (Buffer) • Multiple dependence chains share a queue. • Queues are not FIFOs, but they do not require wakeup logic. • Dispatch order and issue order can differ • Latencies known at issue time decide which instruction is selected next. • Selection logic stays simple • Same performance with less power consumption Abella and Gonzalez 04
Selection Logic [figure omitted] Abella and Gonzalez 04
Waiting Instruction Buffer (example walkthrough) • LD r1, 1024(r0) misses in the data cache. • Its dependents (ADD r3, r1, r2 and ADD r4, r1, r4) cannot issue; they move from the issue queue into the WIB. • Transitively dependent instructions (SLL r3, 0x4, r3; SUB r4, r4, r2; ADD r5, r3, r4; LD r6, 256(r5); ADD r6, r6, r0) follow them into the WIB, freeing issue-queue entries for independent work. • When the miss resolves, the waiting instructions are reinserted into the issue queue for normal scheduling. Lebeck et al 02
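The drain-and-reinsert mechanism can be modeled in a few lines. This is an illustrative simplification of Lebeck et al.'s WIB that only tracks direct dependences; the real design propagates wait bits so transitive dependents drain as well.

```python
# Simplified WIB model: instructions waiting on a missed load drain out
# of the issue queue into a large buffer, then re-enter on resolution.
issue_queue, wib = [], {}

def dispatch(name, srcs):
    issue_queue.append((name, set(srcs)))

def load_miss(tag):
    """Drain instructions reading tag from the issue queue to the WIB."""
    global issue_queue
    waiting = [i for i in issue_queue if tag in i[1]]
    issue_queue = [i for i in issue_queue if tag not in i[1]]
    wib.setdefault(tag, []).extend(waiting)

def miss_resolved(tag):
    """Reinsert the waiting instructions into the issue queue."""
    issue_queue.extend(wib.pop(tag, []))

dispatch("ADD r3, r1, r2", ["r1", "r2"])
dispatch("SUB r4, r4, r2", ["r4", "r2"])
load_miss("r1")                      # the ADD drains into the WIB
print([i[0] for i in issue_queue])   # → ['SUB r4, r4, r2']
miss_resolved("r1")                  # the ADD returns for scheduling
```

The win is capacity: the WIB is cheap SRAM, so miss-dependent instructions stop occupying expensive CAM entries in the issue queue.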
Waiting Instruction Buffer • Drawbacks • No support for back-to-back execution with a parent load that misses in the cache • Power consumption • Instructions move repeatedly between the issue queue and the WIB • The WIB itself is large
Motivation behind Preschedulers • Compiler-heavy scheduling • “Dumber” hardware scheduler • More conservative (on branches, load/store addresses, other run-time information) • Hardware-intensive scheduling • Takes advantage of run-time knowledge • Much more complex
Motivation behind Preschedulers • Some dead instructions sit in scheduler slots • Reduce dead slots by only sending fireable instructions • Increases effective instruction window • Eliminates associative logic, decreasing: • Complexity • Delay (allowing for a possible clock speed increase) • Power consumption
Dataflow-based Prescheduler • Register Use Line Table (RULT), width W • Active line = ready instructions • An instruction's line = max(active line, lines on which its two operands become available), offset by the producer's latency • Circular setup • Each cycle, increment the active line
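The line computation can be sketched as follows. This is an approximation of Michaud et al.'s scheme with our own names; it records, per register, the line on which the value becomes available and places each instruction on the latest relevant line.

```python
# Sketch of dataflow prescheduling: place an instruction on the line
# where its last operand is available; record when its result is ready.
avail_line = {}          # register -> line on which it becomes available

def schedule(dest, srcs, latency, active_line):
    """Return the line an instruction is placed on."""
    line = max([active_line] + [avail_line.get(s, active_line) for s in srcs])
    avail_line[dest] = line + latency
    return line

print(schedule("r1", [], 3, active_line=0))     # load: line 0, r1 on line 3
print(schedule("r3", ["r1"], 1, active_line=0)) # dependent: line 3
```

Because instructions reach the small issue buffer only around the cycle they become fireable, the buffer sees far fewer dead slots than a conventional window of the same size.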
Dataflow Prescheduler Performance 8-entry issue buffer, 12 lines, 8 FIFOs 16-entry issue buffer, 12 lines, 16 FIFOs • Avg. 54% performance increase for 8-entry buffer • Avg. 33% performance increase for 16-entry buffer Michaud et al. HPCA2001
Cyclone • Re-vamp the scheduler (take advantage of higher performance) • Instructions from the prescheduler enter the countdown queue • When the countdown reaches N/2, the instruction moves to the main queue • Main-queue entries promote to the right each cycle • Column 0 is issued each cycle Ernst et al. ISCA2003
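The timing of one instruction's trip through Cyclone can be sketched as a rough model. This is illustrative only: it abstracts the queues to a single position counter and ignores collisions and replay, which the next slides discuss.

```python
# Rough model of one instruction's path through Cyclone: enter the
# countdown queue with predicted delay d, switch to the main queue at
# the halfway point, promote one column per cycle, issue at column 0.
def cyclone_path(delay):
    """Yield (queue, position) each cycle until issue."""
    pos, queue = delay, "countdown"
    while pos > 0:
        if queue == "countdown" and pos <= delay // 2:
            queue = "main"          # the switch at the N/2 point
        yield (queue, pos)
        pos -= 1
    yield ("issue", 0)

print(list(cyclone_path(4)))
# → [('countdown', 4), ('countdown', 3), ('main', 2), ('main', 1), ('issue', 0)]
```

The appeal is that nothing here is associative: position, not a tag comparison, determines when an instruction issues.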
Cyclone (cont’d) • Replay mechanism • Register File Ready Bits for a final operand check • Store set predictor • A conservative method for avoiding load/store dependence violations
Cyclone Performance • Decrease in latency • An 8-decode, 8-issue Cyclone occupies ~12% of the area of a 64-instruction, 8-issue CAM scheduler Ernst et al. ISCA2003
Cyclone Analysis • Eliminates both wakeup and selection logic • Competition for issue ports • Congestion • Collisions during promotion (modifying promotion paths only shifts the pressure) • Replay-decode collisions
Wakeup-Free (WF) schemes: WF-Replay • Latency counters + selection logic • Uses the entire scheduler • For a 32-entry queue at issue width 4, a 9% performance hit (vs. 25.5% for Cyclone) • At issue width 6, a 0.2% performance hit; at issue width 8, none Hu et al. HPCA2004
WF-Precheck • Do a precheck instead of replay • Check Reg Ready Bits before issuing • If not ready, recalculate timing • Increases complexity of selection logic Hu et al. HPCA2004
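The precheck step can be sketched as a guard in front of issue. This is an illustrative simplification of Hu et al.'s WF-Precheck; the function names and the recompute policy are our assumptions.

```python
# Sketch of WF-Precheck: before issuing a selected instruction, consult
# the register-file ready bits; if an operand is not ready, recompute
# its wakeup time instead of issuing and replaying afterwards.
def precheck(srcs, ready_bits, now, recompute):
    """Return True to issue now, or the new wakeup cycle to wait for."""
    if all(ready_bits.get(s, False) for s in srcs):
        return True
    return recompute(srcs, now)     # reschedule; no replay traffic

ready = {"r1": True, "r2": False}
print(precheck(["r1", "r2"], ready, now=10,
               recompute=lambda s, t: t + 2))   # → 12 (retry at cycle 12)
```

The trade-off the slide notes follows directly: the ready-bit check and recompute path sit on the selection critical path, making selection logic more complex than in the replay variant.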
Segmented Issue Queue [figure omitted] Hu et al. HPCA2004
Segmented Issue Queue Commentary • Rows represent different classes of latencies • Only select on lowest row (latency 0) • Sinking/Collapsing structure to prevent pileups
WF-Segment Performance • 5.8% performance loss (vs. 3.5% for Precheck) Hu et al. HPCA2004
Conclusions • Low-power optimizations tend to target control logic • Don’t change the underlying structure • Low-complexity optimizations • More creative designs • Low power • No appreciable performance loss (possibly even speedup)