Spatial Computation. Mihai Budiu, CMU CS CALCM Seminar, Oct 21, 2003.
CPU Problems • Design Complexity • Power • Global Signals • Limited issue window ⇒ limited ILP
Communication vs. Computation [Figure: a wire and a gate; delay labels 5ps and 20ps] Power consumption on wires is also dominant
Global Communication [Figure labels: Instruction unit, Reg, Network]
1) Unroll Pipeline [Figure labels: Instruction unit; Reg, Network (repeated three times); original processor]
Resource Binding Time [Figure: two-step sequences (1., 2.) for CPU and for ASH; "Programs" is the only step label that survives in the text]
2) Specialize Pipeline [Figure labels: Fixed program, Instruction unit; Reg, Network (repeated three times)]
2) Specialize Pipeline: Functional Units [Figure labels: Fixed program, Instruction unit; Reg, Network (repeated three times)]
2) Specialize Pipeline: Interconnection Network [Figure labels: Fixed program, Instruction unit; Reg (repeated three times)]
2) Specialize Pipeline: Register Files [Figure labels: Fixed program, Instruction unit, 0, 1]
2) Specialize Pipeline: Shrink Wires [Figure labels: Fixed program, Instruction unit, 0, 1]
2) Specialize Pipeline: No Instruction Fetch, Decode, Issue [Figure labels: 0, 1]
Loops [Figure labels: 0, 1]
Memory [Figure labels: Spatial Computation, LSQ, To memory, 0, 1]
Outline • Introduction • CASH: Compiling for ASH • ASH vs CPU • Analyzing the Results • Conclusions
Application-Specific Hardware [Figure labels: C program, Compiler, Dataflow IR, dataflow machine, Reconfigurable/custom hw]
Asynchronous Computation [Figure labels: +, latch, data, ack, data valid]
Distributed Control Logic [Figure labels: FSM, ack, rdy, +, -]
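The data, valid/rdy, and ack labels on these two slides suggest a handshake between neighboring operators. A minimal software sketch of one latch stage follows. It assumes a simple rule that the slides do not spell out: a stage accepts a value only while it is empty, and it becomes empty again when the consumer takes the value. The `stage_t` type and the function names are hypothetical, not taken from the talk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one latch between two operators. */
typedef struct {
    int  data;   /* latched value */
    bool valid;  /* true while the latch holds an unconsumed value */
} stage_t;

/* Producer side: offer a value. Returns true (the "ack") if the latch
 * accepted it, false if the latch is still full and the producer must wait. */
bool stage_offer(stage_t *s, int value) {
    assert(s != NULL);
    if (s->valid)
        return false;
    s->data = value;
    s->valid = true;
    return true;
}

/* Consumer side: take the latched value if one is valid. Taking the value
 * empties the latch so that the producer's next offer can succeed. */
bool stage_take(stage_t *s, int *out) {
    assert(s != NULL && out != NULL);
    if (!s->valid)
        return false;
    *out = s->data;
    s->valid = false;
    return true;
}
```

Starting from an empty stage, a first `stage_offer` succeeds, a second one is refused until `stage_take` has run, and no global clock or central controller is involved in that decision.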
Forward Branches: if (x > 0) y = -x; else y = b*x; [Figure: dataflow graph; labels b, x, 0, *, -, >, !, y] Conditionals ⇒ Speculation
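One way to read "conditionals become speculation" is that both arms of the branch are computed unconditionally and a predicate then selects one result. The sketch below shows that reading for the slide's example. The function names are hypothetical, and the selection via `?:` is an assumption about how the select step might look in software, not the compiler's actual output.

```c
#include <assert.h>
#include <limits.h>

/* Control-flow version, as written on the slide. */
int y_branch(int b, int x) {
    int y;
    if (x > 0)
        y = -x;
    else
        y = b * x;
    return y;
}

/* Speculative version: both arms are evaluated regardless of the
 * predicate, then a multiplexer-style select picks one result.
 * Assumes b * x does not overflow; in C the unused arm is still
 * evaluated, so it must be safe to compute. */
int y_speculative(int b, int x) {
    assert(x != INT_MIN);   /* -x would overflow otherwise */
    int p   = (x > 0);      /* predicate */
    int neg = -x;           /* "then" arm, computed regardless of p */
    int mul = b * x;        /* "else" arm, computed regardless of p */
    return p ? neg : mul;
}
```

Both functions return the same value for any input that satisfies the stated assumptions. The speculative form has no control dependence between the comparison and the two arithmetic operations, so all three can proceed in parallel.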
Control Flow ⇒ Data Flow [Figure: three operators, Merge, Gateway, and Split (branch); wire labels: data, predicate, p, !]
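The three operators named on this slide can be modeled in software as functions over optional values. This is only a sketch of plausible semantics: the `token_t` type, the function signatures, and the rule that Merge sees at most one token at a time are assumptions made for illustration, not definitions from the talk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical token: a value that may or may not be present on a wire. */
typedef struct {
    bool present;
    int  value;
} token_t;

/* Gateway: forwards the data token only when the predicate is true. */
token_t gateway(token_t data, bool predicate) {
    token_t none = { false, 0 };
    if (data.present && predicate)
        return data;
    return none;
}

/* Split (branch): routes the data token to exactly one of two outputs,
 * chosen by p. Modeled here as two gateways with opposite predicates. */
void split(token_t data, bool p, token_t *if_true, token_t *if_false) {
    assert(if_true != NULL && if_false != NULL);
    *if_true  = gateway(data, p);
    *if_false = gateway(data, !p);
}

/* Merge: forwards whichever input carries a token. Assumes that at most
 * one input is present at a time. */
token_t merge(token_t a, token_t b) {
    assert(!(a.present && b.present));
    return a.present ? a : b;
}
```

In this model, splitting a token on a predicate and merging the two outputs again yields the original token, which mirrors how a branch and its join point bracket a region of control flow.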
Loops
    int sum=0, i;
    for (i=0; i < 100; i++)
        sum += i*i;
    return sum;
[Figure: dataflow graph of the loop; labels 0, i, *, +1, < 100, sum, +, !, ret]
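The loop on this slide can be rewritten so that each statement corresponds to one operator in a dataflow graph. The version below is a software approximation only: the function name, the `limit` parameter, and the exact placement of the predicate are assumptions, and real hardware would evaluate the operators concurrently rather than in statement order.

```c
#include <assert.h>
#include <stdbool.h>

/* Software model of the loop as values circulating around a graph.
 * Each pass through the body stands for one trip of the (i, sum)
 * pair around the loop. limit is capped so the int sum cannot overflow. */
int sum_of_squares_dataflow(int limit) {
    assert(limit >= 0 && limit <= 1000);
    int i = 0, sum = 0;          /* initial values entering the loop */
    for (;;) {
        bool p = (i < limit);    /* loop predicate */
        if (!p)
            return sum;          /* exit path: taken only when p is false */
        int sq = i * i;          /* multiplier */
        sum = sum + sq;          /* adder on the sum path */
        i = i + 1;               /* incrementer on the i path */
        /* back edge: the new i and sum return to the top of the loop */
    }
}
```

Calling `sum_of_squares_dataflow(100)` computes the same value as the slide's loop, 328350.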
Outline • Introduction • Compiling for ASH • ASH vs CPU • Analyzing the Results • Conclusions
ASH vs: • 4- & 8-wide VLIWs • Superscalar, media kernels • Superscalar, SpecInt95
OpenDIVX IDCT, Sustained IPC [Chart; annotations: includes speculative ops; no data]
This Is Obvious! ASH runs at full dataflow speed, so the CPU cannot do any better (if the compilers are equally good). [Overlay: wrong!]
Outline • Introduction: spatial computation • CASH: Compiling for ASH • ASH vs CPU • Dissection • Conclusions
The (Loop) Body
    for (i = 0; i < 64; i++) {
        for (j = 0; X[j].r != 0xF; j++)
            if (X[j].r == i)
                break;
        Y[i] = X[j].q;
    }
SpecINT95: 124.m88ksim: init_processor, stylized
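To experiment with this loop, it helps to wrap it in a compilable function. The sketch below does that under stated assumptions: the `entry_t` layout is hypothetical (the slide shows only the fields `r` and `q`, and the real record is larger, given the 20-byte stride in the MIPS listing), and `fill_y` is an illustrative name, not the original function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record layout: only r and q appear on the slide. */
typedef struct {
    int r;   /* key compared against i; 0xF marks the end of the table */
    int q;   /* payload copied into Y */
} entry_t;

/* Stylized loop from the slide. X must contain an entry with r == 0xF,
 * otherwise the inner search runs past the end of the array.
 * Y must have room for 64 elements. */
void fill_y(const entry_t *X, int *Y) {
    assert(X != NULL && Y != NULL);
    for (int i = 0; i < 64; i++) {
        int j;
        for (j = 0; X[j].r != 0xF; j++)
            if (X[j].r == i)
                break;
        Y[i] = X[j].q;   /* the match, or the sentinel's q if none */
    }
}
```

The inner loop has two exits, the sentinel test and the match test, and each iteration's load address depends on the previous one. That pair of exits plus the address chain is what the following slides analyze.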
Dynamic Critical Path [Figure: dataflow graph of the inner loop; labels: sizeof(X[j]), load predicate, loop predicate]
    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;
MIPS gcc Code
    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;

    LOOP:
    L1: beq   $v0,$a1,EXIT   ; X[j].r == i
    L2: addiu $v1,$v1,20     ; &X[j+1].r
    L3: lw    $v0,0($v1)     ; X[j+1].r
    L4: addiu $a0,$a0,1      ; j++
    L5: bne   $v0,$a3,LOOP   ; X[j+1].r != 0xF
    EXIT:
L1 → L2 → L3 → L5 → L1: 4-instruction loop-carried dependence
If Branch Prediction Correct
    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;

    LOOP:
    L1: beq   $v0,$a1,EXIT   ; X[j].r == i
    L2: addiu $v1,$v1,20     ; &X[j+1].r
    L3: lw    $v0,0($v1)     ; X[j+1].r
    L4: addiu $a0,$a0,1      ; j++
    L5: bne   $v0,$a3,LOOP   ; X[j+1].r != 0xF
    EXIT:
L1 → L2 → L3 → L5 → L1. Superscalar is issue-limited! 2 cycles/iteration sustained
Critical Path with Prediction [Figure annotation: Loads are not speculative]
    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;
Prediction + Load Speculation [Figure annotations: ack edge; ~4 cycles!; Load not pipelined (self-anti-dependence)]
    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;
OOO Pipe Snapshot
    LOOP:
    L1: beq   $v0,$a1,EXIT   ; X[j].r == i
    L2: addiu $v1,$v1,20     ; &X[j+1].r
    L3: lw    $v0,0($v1)     ; X[j+1].r
    L4: addiu $a0,$a0,1      ; j++
    L5: bne   $v0,$a3,LOOP   ; X[j+1].r != 0xF
    EXIT:
[Figure: pipeline snapshot; stage columns IF, DA, EX, WB, CT; instruction labels as extracted: L1 L2 L3 L4 L1 L3 L5 L3 L2 L1 L3 L3 L5 L1 L2; annotation: register renaming]
Unrolling?
    for (i = 0; i < 64; i++) {
        for (j = 0; X[j].r != 0xF; j += 2) {
            if (X[j].r == i) break;
            if (X[j+1].r == 0xF) break;
            if (X[j+1].r == i) break;
        }
        Y[i] = X[j].q;
    }
[Annotation: when 1 iteration]
ASH Problems • Both branch and join not free • Static dataflow (no re-issue of same instr) • Memory is “far” • Fully static • No branch prediction • No dynamic unrolling • No register renaming • Calls/returns not lenient • No virtualization • No dynamic optimization
Outline • Introduction: spatial computation • CASH: Compiling for ASH • ASH vs CPU • Result Analysis • Conclusions
Conclusions • ASH is promising for media processing; still to evaluate: power, performance, cost • Prediction does much more than avoid issue stalls • The von Neumann model of computation is very powerful • Hardware resources are not everything
Backup Slides • Evaluation model • Control logic • Pipeline balancing • Lenient execution • Dynamic Critical Path
How Performance Is Evaluated [Figure labels: C, Unlimited ILP, LSQ, limited BW (2 words/c), L1 8K, L2 1/4M, Mem; numeric labels: 2, 8, 72]
Simulation Parameters • Compared to 4-wide OOO SimpleScalar • Same operation latencies • Same cache hierarchy • No measurements in library functions/OS • 3-cycle multiply, 20-cycle divide
Control Logic [Figure labels: rdyin, rdyout, ackin, ackout, datain, dataout, C, C, D, D, Reg]
Outline • Introduction • Compiling for ASH • ASH at run-time • ASH vs CPU • Conclusions
Critical Paths: if (x > 0) y = -x; else y = b*x; [Figure: dataflow graph; labels b, x, 0, *, -, >, !, y]