Survey of Low-Complexity, Low Power Instruction Scheduling Alex Li Lewen Lo Sara Sadeghi Baghsorkhi
Motivation • Scalability of instruction window size • Extract greater ILP • Power consumption • CAM logic is power hungry • Complexity • Wire delay of associative logic dominates gate delay in scheduler
Outline • Wakeup logic optimizations • Distributed instruction queues • Waiting Instruction Buffer • Preschedulers • Cyclone • Wakeup-Free
Wakeup Logic [Figure: one issue-queue entry holding opcode, FU type, destination register tag, and two source register tags with valid bits V1/V2. Comparators match each broadcast result register tag against the source tags; when both operands are valid, the ready bit R is raised to the select logic.]
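The baseline the later slides optimize can be sketched in a few lines. This is an illustrative model of conventional CAM-style wakeup (our own naming, not any paper's RTL): every completing instruction broadcasts its destination tag to every entry, and every entry compares it against both source tags.

```python
# Sketch of conventional CAM-style wakeup: each entry holds two source
# tags with valid bits; a broadcast result tag is compared everywhere.
class Entry:
    def __init__(self, src1, src2, ready1=False, ready2=False):
        self.src = [src1, src2]          # source register tags
        self.ready = [ready1, ready2]    # operand valid bits (V1/V2)

    @property
    def request(self):                   # R bit sent to select logic
        return all(self.ready)

def broadcast(entries, result_tag):
    """One wakeup cycle: the tag is compared in every entry (the
    power-hungry part the following slides attack)."""
    for e in entries:
        for i, tag in enumerate(e.src):
            if tag == result_tag:
                e.ready[i] = True

entries = [Entry("r1", "r2", ready2=True), Entry("r1", "r4")]
broadcast(entries, "r1")
print([e.request for e in entries])      # → [True, False]
```

Note that the comparison count scales with window size times issue width, which is why the schemes below gate, eliminate, or replace these comparators.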
Gated Tag Matching • Rationale • Parts of the IQ waste energy • Energy-wasting sources • Empty entries (between head and tail) • Already-ready operands • Already-issued instructions • Solution • Gate the comparators! Folegnani et al. ISCA2001
Gated Tag Matching • Furthermore… • the youngest instructions contribute little to performance • Solution: Dynamic resizing • Use a limit pointer & performance counters • Periodically reduce the effective size while IPC loss stays below a threshold • Increase the size if IPC loss exceeds the threshold for a set period • Cost: Additional logic for • Gated comparators • Performance counters • Claims • 128-entry queue shrinks to an effective size of ~43 • 4% performance loss • 90.7% wakeup-logic energy savings • 14.9% chip energy savings • Significant energy savings • Based on a conventional design; no performance benefit Folegnani et al. ISCA2001
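The resizing policy above can be sketched as a small feedback loop. This is a hedged approximation of Folegnani et al.'s scheme; the thresholds, step size, and function names here are illustrative, not the paper's parameters.

```python
# Illustrative sketch of dynamic IQ resizing: shrink the effective queue
# (move the limit pointer) while measured IPC stays near full-size IPC,
# and grow it back when the IPC loss exceeds a threshold.
def resize(limit, ipc_now, ipc_full, min_limit=16, max_limit=128,
           step=8, loss_threshold=0.02):
    """Return the limit pointer for the next measurement interval."""
    if ipc_now < ipc_full * (1 - loss_threshold):
        return min(limit + step, max_limit)   # losing IPC: grow the window
    return max(limit - step, min_limit)       # IPC fine: keep shrinking

limit = 128
limit = resize(limit, ipc_now=1.97, ipc_full=2.0)
print(limit)   # IPC within threshold → shrink to 120
```

Entries beyond the limit pointer have their comparators gated, which is where the wakeup-energy savings come from.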
Tag Elimination • Rationale • Most instructions (80-96%) have at least 1 operand already ready • The last-arriving operand wakes up the instruction • Base Approach • Issue window with 2-, 1-, and 0-comparator entries • Insert instructions based on operand readiness at dispatch • Advanced Approach • Eliminate 2-comparator entries • Predict the last-arriving operand • Re-issue on misprediction • Results (32 1-comp / 32 0-comp entries) • Slight IPC loss (1-3%) • Accounting for the reduced cycle time, good speedup (25-45%) • 65-75% lower energy-delay product • Drastically reduces associative logic (to 1/4) • reduces energy • no performance impact (even speedup) Ernst et al. ISCA2002
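The base approach's dispatch step can be sketched as follows. The function names and fallback policy here are our own illustration, assuming (plausibly, though the paper may differ) that an instruction can be placed in an entry with more comparators than it needs when its own class is full.

```python
# Sketch of dispatch for a tag-eliminated window: count unready operands
# and place the instruction in an entry with that many comparators.
def slot_class(ready1, ready2):
    """Comparators the entry needs: 0, 1, or 2."""
    return 2 - (int(ready1) + int(ready2))

def pick_entry(ready1, ready2, free):
    """free: {comparator_count: entries_left}; prefer the exact class,
    fall back to a larger one (a 2-comp entry can hold anything)."""
    for c in range(slot_class(ready1, ready2), 3):
        if free.get(c, 0) > 0:
            free[c] -= 1
            return c
    return None                     # stall dispatch: no suitable entry

free = {0: 1, 1: 1, 2: 0}
print(pick_entry(True, True, free))    # → 0 (0-comp entry)
print(pick_entry(True, True, free))    # → 1 (0-comp pool empty)
```

The advanced approach goes further: with 2-comparator entries removed entirely, a last-arriving-operand predictor decides which single tag to watch, and a misprediction forces re-issue.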
N-use Issue Logic • Rationale • Most instructions (75-78%) have only 1 (or few) dependent instructions • Approach • More SRAM (N-use table) • Less CAM (I-buffer) • Wake up recorded dependents only • Claims • 2-use table + 2-entry I-buffer comparable to a 64-entry CAM (~4% slowdown) • 96 regs → 192 entries in the 2-use table! • Justifications • DOES reduce CAM (64 to 2 cells) • Energy of the 2-use table offset by gated entries • Less complex, but maybe more area • Cycle time may be reduced • Drastically different design Canal et al. ICS2001
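The core trade can be sketched as an SRAM-indexed dependence table replacing the broadcast. This is a hedged simplification of Canal et al.'s N-use scheme; the data structures and names are ours.

```python
# Sketch of N-use wakeup: a completing instruction indexes a table by
# its destination register and wakes at most N recorded dependents
# directly; extra dependents spill to a small CAM-based I-buffer.
N = 2
use_table = {}       # dest register -> up to N dependent instruction ids
overflow = []        # dependents that did not fit (handled by I-buffer)

def record_use(dest_reg, consumer):
    deps = use_table.setdefault(dest_reg, [])
    if len(deps) < N:
        deps.append(consumer)
    else:
        overflow.append(consumer)   # falls back to the small CAM

def complete(dest_reg):
    """Wake only the recorded dependents: no global tag broadcast."""
    return use_table.pop(dest_reg, [])

record_use("r1", "i2")
record_use("r1", "i3")
record_use("r1", "i4")             # third use overflows
print(complete("r1"), overflow)    # → ['i2', 'i3'] ['i4']
```

The point of the "96 regs → 192 entries" claim is visible here: the table needs N slots per physical register, so SRAM capacity grows even as CAM cells shrink from 64 to 2.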
Distributed Instruction Queue (FIFO) • Instructions in a queue form a dependence chain. • Only instructions at the heads of the queues can be ready. • Works well for INT codes, but poorly for FP codes. • A large number of FIFOs increases complexity Palacharla et al 97
Distributed Instruction Queue (Buffer) • Multiple dependence chains share a queue. • Queues are not FIFOs, but they do not require wakeup logic. • Dispatch order and issue order can differ • Latencies known at issue time decide which instruction is selected next. • Selection logic stays simple • Same performance with less power consumption Abella and Gonzalez 04
Selection Logic [figure omitted] Abella and Gonzalez 04
Waiting Instruction Buffer (example walkthrough) • LD r1, 1024(r0) misses in the data cache. • Its dependents (ADD r3, r1, r2 and ADD r4, r1, r4) cannot issue; they move from the issue queue into the WIB. • Transitively dependent instructions (SLL r3, 0x4, r3; SUB r4, r4, r2; ADD r5, r3, r4; LD r6, 256(r5); ADD r6, r6, r0) follow them into the WIB, freeing issue-queue entries for independent work. • When the miss resolves, the waiting instructions are reinserted into the issue queue for normal scheduling. Lebeck et al 02
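The drain-and-reinsert mechanism can be modeled in a few lines. This is an illustrative simplification of Lebeck et al.'s WIB that only tracks direct dependences; the real design propagates wait bits so transitive dependents drain as well.

```python
# Simplified WIB model: instructions waiting on a missed load drain out
# of the issue queue into a large buffer, then re-enter on resolution.
issue_queue, wib = [], {}

def dispatch(name, srcs):
    issue_queue.append((name, set(srcs)))

def load_miss(tag):
    """Drain instructions reading tag from the issue queue to the WIB."""
    global issue_queue
    waiting = [i for i in issue_queue if tag in i[1]]
    issue_queue = [i for i in issue_queue if tag not in i[1]]
    wib.setdefault(tag, []).extend(waiting)

def miss_resolved(tag):
    """Reinsert the waiting instructions into the issue queue."""
    issue_queue.extend(wib.pop(tag, []))

dispatch("ADD r3, r1, r2", ["r1", "r2"])
dispatch("SUB r4, r4, r2", ["r4", "r2"])
load_miss("r1")                      # the ADD drains into the WIB
print([i[0] for i in issue_queue])   # → ['SUB r4, r4, r2']
miss_resolved("r1")                  # the ADD returns for scheduling
```

The win is capacity: the WIB is cheap SRAM, so miss-dependent instructions stop occupying expensive CAM entries in the issue queue.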
Waiting Instruction Buffer • Drawbacks • No support for back-to-back execution with a parent load that misses in the cache • Power consumption • Instructions move repeatedly between the issue queue and the WIB • The WIB itself is large
Motivation behind Preschedulers • Compiler-heavy scheduling • “Dumber” hardware scheduler • More conservative (on branches, load/store addresses, other run-time information) • Hardware-intensive scheduling • Takes advantage of run-time knowledge • Much more complex
Motivation behind Preschedulers • Some dead instructions sit in scheduler slots • Reduce dead slots by only sending fireable instructions • Increases effective instruction window • Eliminates associative logic, decreasing: • Complexity • Delay (allowing for a possible clock speed increase) • Power consumption
Dataflow-based Prescheduler • Register Use Line Table (RULT), width W • Active line = ready instructions • An instruction's line = max(active line, lines on which its two operands become available), offset by the producer's latency • Circular setup • Each cycle, increment the active line
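The line computation can be sketched as follows. This is an approximation of Michaud et al.'s scheme with our own names; it records, per register, the line on which the value becomes available and places each instruction on the latest relevant line.

```python
# Sketch of dataflow prescheduling: place an instruction on the line
# where its last operand is available; record when its result is ready.
avail_line = {}          # register -> line on which it becomes available

def schedule(dest, srcs, latency, active_line):
    """Return the line an instruction is placed on."""
    line = max([active_line] + [avail_line.get(s, active_line) for s in srcs])
    avail_line[dest] = line + latency
    return line

print(schedule("r1", [], 3, active_line=0))     # load: line 0, r1 on line 3
print(schedule("r3", ["r1"], 1, active_line=0)) # dependent: line 3
```

Because instructions reach the small issue buffer only around the cycle they become fireable, the buffer sees far fewer dead slots than a conventional window of the same size.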
Dataflow Prescheduler Performance 8-entry issue buffer, 12 lines, 8 FIFOs 16-entry issue buffer, 12 lines, 16 FIFOs • Avg. 54% performance increase for 8-entry buffer • Avg. 33% performance increase for 16-entry buffer Michaud et al. HPCA2001
Cyclone • Re-vamp the scheduler (take advantage of higher performance) • Instructions from the prescheduler enter the countdown queue • When the countdown reaches N/2, the instruction moves to the main queue • Main-queue entries promote to the right each cycle • Column 0 is issued each cycle Ernst et al. ISCA2003
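The timing of one instruction's trip through Cyclone can be sketched as a rough model. This is illustrative only: it abstracts the queues to a single position counter and ignores collisions and replay, which the next slides discuss.

```python
# Rough model of one instruction's path through Cyclone: enter the
# countdown queue with predicted delay d, switch to the main queue at
# the halfway point, promote one column per cycle, issue at column 0.
def cyclone_path(delay):
    """Yield (queue, position) each cycle until issue."""
    pos, queue = delay, "countdown"
    while pos > 0:
        if queue == "countdown" and pos <= delay // 2:
            queue = "main"          # the switch at the N/2 point
        yield (queue, pos)
        pos -= 1
    yield ("issue", 0)

print(list(cyclone_path(4)))
# → [('countdown', 4), ('countdown', 3), ('main', 2), ('main', 1), ('issue', 0)]
```

The appeal is that nothing here is associative: position, not a tag comparison, determines when an instruction issues.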
Cyclone (cont’d) • Replay mechanism • Register File Ready Bits for a final operand check • Store set predictor • A conservative method for avoiding load/store dependence violations
Cyclone Performance • Decrease in latency • An 8-decode, 8-issue Cyclone occupies ~12% of the area of a 64-instruction, 8-issue CAM scheduler Ernst et al. ISCA2003
Cyclone Analysis • Eliminates both wakeup and selection logic • Competition for issue ports • Congestion • Collisions during promotion (modifying promotion paths only shifts the pressure) • Replay-decode collisions
Wakeup-Free (WF) schemes: WF-Replay • Latency counters + selection logic • Uses the entire scheduler • For a 32-entry queue at issue width 4, a 9% performance hit (vs. 25.5% for Cyclone) • At issue width 6, a 0.2% performance hit; at issue width 8, none Hu et al. HPCA2004
WF-Precheck • Do a precheck instead of replay • Check Reg Ready Bits before issuing • If not ready, recalculate timing • Increases complexity of selection logic Hu et al. HPCA2004
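The precheck step can be sketched as a guard in front of issue. This is an illustrative simplification of Hu et al.'s WF-Precheck; the function names and the recompute policy are our assumptions.

```python
# Sketch of WF-Precheck: before issuing a selected instruction, consult
# the register-file ready bits; if an operand is not ready, recompute
# its wakeup time instead of issuing and replaying afterwards.
def precheck(srcs, ready_bits, now, recompute):
    """Return True to issue now, or the new wakeup cycle to wait for."""
    if all(ready_bits.get(s, False) for s in srcs):
        return True
    return recompute(srcs, now)     # reschedule; no replay traffic

ready = {"r1": True, "r2": False}
print(precheck(["r1", "r2"], ready, now=10,
               recompute=lambda s, t: t + 2))   # → 12 (retry at cycle 12)
```

The trade-off the slide notes follows directly: the ready-bit check and recompute path sit on the selection critical path, making selection logic more complex than in the replay variant.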
Segmented Issue Queue [figure omitted] Hu et al. HPCA2004
Segmented Issue Queue Commentary • Rows represent different classes of latencies • Only select on lowest row (latency 0) • Sinking/Collapsing structure to prevent pileups
WF-Segment Performance • 5.8% performance loss (vs. 3.5% for Precheck) Hu et al. HPCA2004
Conclusions • Low-power optimizations tend to target control logic • Don’t change the underlying structure • Low-complexity optimizations • More creative designs • Low power • No appreciable performance loss (possibly even speedup)