Fine Grain Incremental Rescheduling Via Architectural Retiming
Soha Hassoun, Tufts University, Medford, MA
Thanks to: Carl Ebeling, University of Washington, Seattle, WA
Problem: the clock period is too large
Example: [Figure: RAM with Write Address, Read Address, and Offset]
Pipelining
Problems with consecutive dependent operations
[Figure: RAM with Write Address, Read Address, and Offset]
Performance Bottleneck
[Figure: path annotated Latency = n]
• Latency-constrained paths
• Approach: apply architectural retiming at the register-transfer (RT) level
Architectural Retiming
Problem: too much work, too little time
[Figure: output y_k; pipeline register; negative register N; signals C and D]
• The negative register is implemented by precomputation or by prediction
Outline
• Precomputation
  • incremental rescheduling without resource constraints
• Prediction
  • incremental rescheduling with resource constraints
• Results
Precomputation Function
[Figure: y_k, g, x´_i, x_i, f, C, negative register N, D, h; after the transformation, f´]
• $D^t = C^{t+1} = f(\ldots, x_i^{t+1}, \ldots)$
• $x_i^{t+1} = {x'_i}^{\,t} = g(\ldots, y_k^t, \ldots)$
• $D^t = f(\ldots, g(\ldots, y_k^t, \ldots), \ldots) = f'(\ldots, y_k^t, \ldots)$
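The derivation can be checked with a small cycle-level simulation. This is a minimal sketch, assuming a register between x´_i and x_i (so $x_i^{t+1} = {x'_i}^{\,t}$), single-input integer functions as stand-ins for f and g, and a random stream for y_k; the names `f`, `g`, `f_prime`, and `simulate` are hypothetical. It checks that f´ = f(g(·)) evaluated on y_k^t gives the value C takes one cycle later.

```python
import random

def g(y: int) -> int:
    # Hypothetical logic driving the register input x'_i
    return (3 * y + 1) & 0xFF

def f(x: int) -> int:
    # Hypothetical logic computing C from the register output x_i
    return (x ^ 0x5A) + 7

def f_prime(y: int) -> int:
    # Precomputed function: f' = f composed with g
    return f(g(y))

def simulate(ys: list[int], x_init: int = 0) -> tuple[list[int], list[int]]:
    """Return (C trace, D trace) for a stream ys of y_k values."""
    x = x_init                      # register output x_i in the current cycle
    c_trace, d_trace = [], []
    for y in ys:
        c_trace.append(f(x))        # C^t = f(x_i^t)
        d_trace.append(f_prime(y))  # D^t = f'(y_k^t), the negative-register output
        x = g(y)                    # clock edge: x_i^{t+1} = x'_i^t = g(y_k^t)
    return c_trace, d_trace

random.seed(0)
ys = [random.randrange(256) for _ in range(20)]
c, d = simulate(ys)
# D^t equals C^{t+1} in every cycle where C^{t+1} exists
assert all(d[t] == c[t + 1] for t in range(len(ys) - 1))
print("D^t == C^(t+1) holds for", len(ys) - 1, "cycles")
```

The identity holds for any choice of f and g, because the register only delays g's output by one cycle.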
Incremental Rescheduling
[Figure: y_k, g, f, h, negative register N; after the transformation, f´]
• Before: Time n: g; Time n+1: f, h
• After: Time n: f´; Time n+1: h
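To illustrate why moving f one cycle earlier can shorten the clock period, the sketch below compares the two schedules using hypothetical combinational delays. It assumes that operations listed in the same cycle are chained in series (a cycle's delay is the sum of its operations' delays) and that f´ costs d_g + d_f as a direct composition; logic optimization of f´ may make it cheaper. The delay values are illustrative, not results from the slides.

```python
# Hypothetical combinational delays in ns (illustrative only)
DELAY = {"g": 2.0, "f": 4.0, "h": 3.0}
DELAY["f'"] = DELAY["g"] + DELAY["f"]  # f' = f(g(...)), composed without further optimization

def clock_period(schedule: dict[str, list[str]]) -> float:
    """Longest cycle delay, assuming ops within a cycle are chained in series."""
    return max(sum(DELAY[op] for op in ops) for ops in schedule.values())

before = {"n": ["g"], "n+1": ["f", "h"]}
after = {"n": ["f'"], "n+1": ["h"]}

print(clock_period(before))  # 7.0, i.e. max(2.0, 4.0 + 3.0)
print(clock_period(after))   # 6.0, i.e. max(6.0, 3.0)
```

With these numbers the critical cycle moves from n+1 to n and shrinks; with other delays the rescheduling could just as well lengthen the period, which is why the choice is made incrementally.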
Precomputing With Register Arrays
[Figure: register array with Write Data, Write Address, Read Address, Read Data, and output Out; negative register N; F]
• $F^t = \mathit{Out}^{t+1} = \mathit{Array}^{t+1}[\mathit{ReadAddress}^{t+1}]$
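Expanding $\mathit{Array}^{t+1}$ makes the need for a bypass explicit. Assuming the array is written every cycle at $\mathit{WriteAddress}^t$ with $\mathit{WriteData}^t$ (a write enable would add one more term to the condition), the value read at t+1 is either the data written at t or the old contents:

$$
F^t = \mathit{Array}^{t+1}[\mathit{ReadAddress}^{t+1}] =
\begin{cases}
\mathit{WriteData}^t & \text{if } \mathit{ReadAddress}^{t+1} = \mathit{WriteAddress}^t \\
\mathit{Array}^t[\mathit{ReadAddress}^{t+1}] & \text{otherwise}
\end{cases}
$$

Every term on the right is available at time t, provided the read address for cycle t+1 is itself known (precomputed) at time t.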
Synthesizing Bypass Paths
[Figure: register array with Write Data, Write Address, Precomputed Read Address, and Read Data; an equality comparator (?=) on the addresses selects the bypass path]
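A cycle-level sketch of the bypass path, under the same assumptions: an asynchronously read array written every cycle, and a read address for cycle t+1 that is known at cycle t. The helper names `precomputed_read` and `check` are hypothetical. The address comparison plays the role of the ?= block: on a match it forwards the write data, otherwise it reads the array's current contents.

```python
import random

def precomputed_read(array: list[int], write_addr: int, write_data: int,
                     next_read_addr: int) -> int:
    """Bypass logic: the value the array will return at t+1, computed at t."""
    if next_read_addr == write_addr:  # the ?= comparator
        return write_data             # bypass: forward the data being written this cycle
    return array[next_read_addr]      # no conflict: the current contents are still valid

def check(cycles: int = 1000, size: int = 8, seed: int = 1) -> bool:
    rng = random.Random(seed)
    array = [0] * size
    read_addrs = [rng.randrange(size) for _ in range(cycles + 1)]
    for t in range(cycles):
        wa, wd = rng.randrange(size), rng.randrange(256)
        f_t = precomputed_read(array, wa, wd, read_addrs[t + 1])  # F^t
        array[wa] = wd                                             # clock edge: the write commits
        out_next = array[read_addrs[t + 1]]                        # Out^{t+1}
        assert f_t == out_next
    return True

print(check())  # True
```

Without the comparator, the precomputed read would return stale data whenever the next read hits the address being written.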
Precomputing RAM Output
[Figure: RAM with negative register N]
Prediction
[Figure: C, D, g_i, f, Z, negative register N]
• What if we
  • can't precompute,
  • would need too many additional resources, or
  • get unsatisfactory performance?
• Predict C one cycle before its arrival
Schedule with Mispredictions
[Figure: registers R1 and R2; signals C and H over cycles t-1, t, t+1; actual values c1, c2, h1, h2; the negative register supplies predictions c1*, c2*; Verify checks c1* =? c1 and c2* =? c2]
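A behavioral sketch of predict-and-verify. It assumes a hypothetical last-value predictor as the negative register and a one-cycle penalty per misprediction; the slides do not specify either. Each predicted value c* is used speculatively, and when the actual c arrives the Verify step compares c* =? c.

```python
def run_with_prediction(c_values: list[int], initial_guess: int = 0) -> tuple[int, int]:
    """Return (mispredictions, total_cycles) for a stream of actual C values.

    Assumptions (hypothetical): the predictor guesses that C repeats its last
    value, and each misprediction costs one extra cycle to nullify and redo.
    """
    guess = initial_guess
    mispredictions = 0
    cycles = 0
    for actual in c_values:
        cycles += 1              # speculative cycle computed with c* = guess
        if guess != actual:      # Verify: c* =? c
            mispredictions += 1
            cycles += 1          # assumed penalty: recompute with the actual value
        guess = actual           # last-value prediction for the next C
    return mispredictions, cycles

print(run_with_prediction([0, 0, 1, 1, 1, 0]))  # (2, 8)
```

The throughput gain from the shorter clock period must outweigh these penalty cycles, so prediction pays off only when C is predictable most of the time.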
Synthesis Issues in Prediction
• Negative register as a predicting FSM
  • use signal transition probabilities
  • incorporate don't-care conditions
• Nullifying mispredictions
  • two correction strategies
    • As-Soon-As-Possible restoration
    • As-Late-As-Possible correction
  • add handshaking signals to coordinate with the interface
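One way to read "use signal transition probabilities": for each current value of C, the predicting FSM outputs the most probable next value. The sketch below derives such a table from a hypothetical profiling trace; the function names and the trace are assumptions, and the slides do not say how the probabilities are obtained. Normalizing the counts into probabilities would not change the argmax, so raw counts suffice.

```python
from collections import Counter, defaultdict

def build_predictor(trace: list[int]) -> dict[int, int]:
    """Map each observed value of C to its most frequent successor in the trace."""
    successors: dict[int, Counter] = defaultdict(Counter)
    for cur, nxt in zip(trace, trace[1:]):
        successors[cur][nxt] += 1
    # most_common(1) returns [(value, count)]; ties go to the value seen first
    return {cur: counts.most_common(1)[0][0] for cur, counts in successors.items()}

def mispredictions(table: dict[int, int], trace: list[int], default: int = 0) -> int:
    """Count cycles where the FSM's prediction of the next value is wrong."""
    return sum(table.get(cur, default) != nxt for cur, nxt in zip(trace, trace[1:]))

profile = [0, 0, 0, 1, 0, 0, 1, 1, 0, 0]
table = build_predictor(profile)
print(table)                           # {0: 0, 1: 0}
print(mispredictions(table, profile))  # 3
```

The resulting table is the FSM's output function; values never seen in profiling fall back to `default`, a natural place to exploit don't-care conditions.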
Related Work
• Precomputation
  • Bypass synthesis
  • Lookahead [Kogge '81, …..]
• Prediction / Speculative Execution
  • Most likely path, arbitrarily deep [Holtmann & Ernst '93, '95]
  • Pre-execution [Radivojevic & Brewer '94]
  • Possibly multiple paths, arbitrarily deep [Lakshminarayana et al. '98]
  • Percolation scheduling [Potasman et al. '90]
Architectural Retiming
• Improves throughput while preserving functionality and, in some cases, latency
• Bridges the gap between high-level synthesis (HLS) and logic optimizations
• Unifies several sequential optimizations
  • bypass synthesis
  • lookahead transformation
  • branch prediction
  • fine-grain cross-register optimizations
Ph.D. Forum at DAC '99
• Goal
  • increase interaction between academia and industry
• Format
  • students present their work at a poster session at DAC
  • researchers give feedback
• Who's eligible?
  • students within 1 or 2 years of finishing their Ph.D. thesis
www.cs.washington.edu/homes/soha/forum
Precomputing in Single-Register Cycles
[Figure: original circuit with A and B; transformed circuit with A, B, A', B', and negative register N]
• Lookahead: A(n) is a function of B(n-2) [Kogge '81], [Parhi & Messerschmitt '89]
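Lookahead is easiest to see on a first-order linear recurrence, used here as a stand-in because the slide does not give the logic between A and B. Unrolling y[n] = a·y[n-1] + x[n] once gives y[n] = a²·y[n-2] + a·x[n-1] + x[n], so each output depends on the state two steps back and the feedback loop now spans two registers, which allows it to be pipelined. The hypothetical functions below check that the two forms agree; integers avoid floating-point rounding.

```python
def direct(a: int, xs: list[int]) -> list[int]:
    """y[n] = a*y[n-1] + x[n], with y[-1] = 0."""
    ys, prev = [], 0
    for x in xs:
        prev = a * prev + x
        ys.append(prev)
    return ys

def lookahead(a: int, xs: list[int]) -> list[int]:
    """y[n] = a^2*y[n-2] + a*x[n-1] + x[n], with y[-2] = y[-1] = 0 and x[-1] = 0."""
    ys = []
    for n, x in enumerate(xs):
        y_nm2 = ys[n - 2] if n >= 2 else 0  # state two steps back
        x_nm1 = xs[n - 1] if n >= 1 else 0  # previous input
        ys.append(a * a * y_nm2 + a * x_nm1 + x)
    return ys

xs = [3, 1, 4, 1, 5, 9, 2, 6]
assert direct(2, xs) == lookahead(2, xs)
print(direct(2, xs))  # [3, 7, 18, 37, 79, 167, 336, 678]
```

The price of lookahead is the extra a·x[n-1] term and the constant a², which is the "additional resources" trade-off mentioned for precomputation.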
Precomputing RAM Output
[Figure: RAM]
Speculative Execution: Scope and Depth
[Figure: nodes c1 through c6]