Fine Grain Incremental Rescheduling Via Architectural Retiming. Soha Hassoun, Tufts University, Medford, MA. Thanks to: Carl Ebeling, University of Washington, Seattle, WA.
Problem -- Clock period is too large. Example (figure): a RAM with Write Address, Read Address, and Offset inputs.
Pipelining. Problems with consecutive dependent operations. (Figure: the same RAM with Write Address, Read Address, and Offset.)
Performance Bottleneck. • Latency-constrained paths (latency = n) • Approach: apply architectural retiming at the RT level
Architectural Retiming. Problem: too much work, too little time. (Figure: a pipeline register D is inserted on signal yk at point C, together with a negative register N; the negative register is realized by precomputation or prediction.)
Outline • Precomputation • incremental rescheduling without resource constraints • Prediction • incremental rescheduling with resource constraints • Results
Precomputation Function. (Figure: signals x_i, x'_i, y_k; blocks f, g, h; pipeline register D at point C; negative register N; precomputed block f'.)
• D^t = C^(t+1) = f( ..., x_i^(t+1), ... )
• x_i^(t+1) = x'_i^t = g( ..., y_k^t, ... )
• D^t = f( ..., g( ..., y_k^t, ... ), ... ) = f'( ..., y_k^t, ... )
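To make the derivation concrete, here is a minimal Python sketch (not from the slides; f, g, and the y_k trace are made-up placeholders) showing that composing f with g into f' yields the value the pipeline register D would latch, but computed one cycle earlier from y_k^t alone.

```python
# Minimal sketch, assuming arbitrary combinational functions f and g:
# the precomputed f' = f o g produces D's value from y_k^t, so no
# "future" value x_i^{t+1} is needed.

def f(x):            # combinational logic feeding the pipeline register D
    return x + 1

def g(y):            # combinational logic producing x'_i (registered into x_i)
    return 2 * y

def f_prime(y):      # precomputed function: f composed with g
    return f(g(y))

y_trace = [3, 7, 1, 4, 9]                 # hypothetical stream of y_k values

# Original: x_i^{t+1} = g(y_k^t), D^t = f(x_i^{t+1})  (needs next cycle's x_i)
original_D = [f(g(y)) for y in y_trace]

# Retimed: D^t = f'(y_k^t), available from signals already present at time t
retimed_D  = [f_prime(y) for y in y_trace]

assert original_D == retimed_D            # identical values, one cycle earlier
```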
Incremental Rescheduling. (Figure: blocks g, h, f on signal yk; the negative register N is replaced by the precomputed block f'.) Original schedule: time n: g; time n+1: f, h. Rescheduled: time n: f'; time n+1: h.
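A small Python sketch of the rescheduling (f, g, h and the input value are made-up placeholders): the result latched after cycle n+1 is the same whether g runs in cycle n with f and h in cycle n+1, or f' = f o g runs in cycle n with only h left in cycle n+1.

```python
# Minimal sketch, assuming arbitrary combinational blocks f, g, h.

def g(y): return y + 3
def f(x): return 2 * x
def h(c): return c - 1

y_k = 5

# Original schedule: cycle n computes g, cycle n+1 computes f then h.
x_reg = g(y_k)                 # latched at the end of cycle n
out_original = h(f(x_reg))     # cycle n+1: f and h back to back

# Rescheduled: cycle n computes f' (f absorbed across the register),
# cycle n+1 only computes h.
d_reg = f(g(y_k))              # f'(y_k), latched at the end of cycle n
out_rescheduled = h(d_reg)     # cycle n+1: just h

assert out_original == out_rescheduled
```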
Precomputing With Register Arrays. (Figure: a register array with Write Data, Write Address, Read Address, and Read Data ports; a negative register N produces F ahead of the array output Out.)
• F^t = Out^(t+1) = Array^(t+1)[Read Address^(t+1)]
Synthesizing Bypass Paths. (Figure: the precomputed Read Address is compared (?=) against the Write Address; on a match, Write Data is forwarded to Read Data, bypassing the array.)
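A minimal Python sketch of the bypass idea (not the authors' netlist; the array size, the traces, and the write timing are assumptions): the read of cycle t+1 is precomputed at cycle t, and a comparator forwards the pending write data whenever the write address hits the precomputed read address.

```python
# Minimal sketch of a synthesized bypass around a register array.

ram = [0] * 8

# (write_addr, write_data, next_read_addr) per cycle; next_read_addr is the
# "precomputed" read address, i.e. the address that will be read in cycle t+1.
trace = [(1, 10, 1), (2, 20, 5), (5, 30, 5), (3, 40, 2)]

precomputed = []   # F^t = Array^{t+1}[Read Address^{t+1}], built with the bypass
reference   = []   # Out^{t+1}, what a plain read would return one cycle later

for waddr, wdata, next_raddr in trace:
    # Bypass: if the pending write hits the precomputed read address,
    # forward the write data instead of reading the (stale) array.
    f = wdata if waddr == next_raddr else ram[next_raddr]
    precomputed.append(f)

    ram[waddr] = wdata                   # the write updates the array
    reference.append(ram[next_raddr])    # value a normal read would see at t+1

assert precomputed == reference
```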
Precomputing RAM Output. (Figure: a RAM with a negative register N on its output.)
Prediction. (Figure: signal Z, blocks g_i and f, pipeline register D at point C, negative register N.) What if we cannot precompute, precomputation requires too many additional resources, or its performance is unsatisfactory? Then predict C one cycle before its arrival.
Schedule with Mispredictions. (Timing diagram: signals C and H pass through registers R1 and R2 over cycles t-1, t, t+1; C takes values c1, c2 and H takes values h1, h2. The negative register supplies predicted values c1*, c2* one cycle early, and a Verify stage checks c1* =? c1 and c2* =? c2.)
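A minimal Python sketch of the predict-then-verify loop (the predictor and the value trace are made-up; the slides do not prescribe a particular predictor): the negative register guesses C one cycle early, and Verify flags the cycles whose speculative results must be nullified.

```python
# Minimal sketch of prediction with verification and nullification.

def predict(prev_c):
    # Toy predictor: assume C tends to hold its previous value.
    return prev_c

c_trace = [0, 0, 1, 1, 1, 0, 0]      # actual values of signal C over time

prev_c = 0
for t, actual in enumerate(c_trace):
    guess = predict(prev_c)          # c* produced one cycle early
    if guess != actual:              # Verify: c* =? c
        # work computed from c* (e.g. h) must be squashed and redone
        print(f"cycle {t}: predicted {guess}, actual {actual} -> nullify")
    prev_c = actual
```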
Synthesis Issues in Prediction • Negative register as predicting FSM • use signal transition probabilities • incorporate don’t care conditions • Nullifying mispredictions • Two correction strategies • As-Soon-As-Possible restoration • As-Late-As-Possible correction • Add handshaking signals to coordinate with interface
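As an illustration of "negative register as predicting FSM, using signal transition probabilities", here is a hypothetical profiling sketch in Python (not the authors' synthesis flow): transition counts gathered from a simulation trace drive a most-likely-next-value predictor.

```python
# Minimal sketch: derive a predictor from observed transition probabilities.

from collections import Counter, defaultdict

history = [0, 0, 1, 1, 1, 0, 0, 0, 1, 1]     # made-up profile of the signal

# Count transitions value(t) -> value(t+1)
transitions = defaultdict(Counter)
for cur, nxt in zip(history, history[1:]):
    transitions[cur][nxt] += 1

def predict(cur):
    # Pick the most frequent successor seen during profiling.
    return transitions[cur].most_common(1)[0][0]

print(predict(0), predict(1))   # most likely next value after a 0 / after a 1
```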
Related Work • Precomputation • Bypass Synthesis • lookahead [Kogge ‘81, …..] • Prediction / Speculative Execution • Most likely path, arbitrarily deep [Holtmann & Ernst ‘93,’95] • Pre-execution [Radivojevic & Brewer ‘94] • Possible multiple paths & arbitrarily deep [Lakshminarayana et al. ‘98] • Percolation scheduling [Potasman et al. ‘90]
Architectural Retiming • Improves throughput while preserving functionality and sometimes latency • Bridges the gap between HLS and logic optimizations • Unifies several sequential optimizations • bypass synthesis • lookahead transformation • branch prediction • fine-grain cross-register optimizations
Ph.D. Forum at DAC ‘99 • Goal • increase interaction between academia and industry • Format • students present work at poster session at DAC • researchers give feedback • Who’s eligible? • Students within 1 or 2 years of finishing Ph.D. thesis www.cs.washington.edu/homes/soha/forum
Precomputing in Single-Register Cycles. (Figure: original circuit with blocks A and B in a single-register cycle; after inserting the negative register N, precomputed blocks A' and B' are added.) Lookahead -- A(n) becomes a function of B(n-2) [Kogge '81], [Parhi & Messerschmitt '89]
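The slide's A and B blocks are not specified, so as an illustrative stand-in here is a minimal Python sketch of the lookahead transformation on a first-order linear recurrence: substituting the recurrence into itself makes y(n) depend on y(n-2), mirroring how A(n) becomes a function of B(n-2).

```python
# Minimal sketch of lookahead on y(n) = a*y(n-1) + x(n) (made-up a and x).

a = 0.5
x = [1.0, 2.0, 3.0, 4.0, 5.0]

# Original single-register cycle: y(n) = a*y(n-1) + x(n)
y = [0.0]
for n in range(1, len(x)):
    y.append(a * y[n - 1] + x[n])

# Lookahead form: y(n) = a^2*y(n-2) + a*x(n-1) + x(n)
y2 = [0.0, a * 0.0 + x[1]]           # same first two values as above
for n in range(2, len(x)):
    y2.append(a * a * y2[n - 2] + a * x[n - 1] + x[n])

assert all(abs(p - q) < 1e-12 for p, q in zip(y, y2))
```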
Precomputing RAM Output. (Figure: RAM blocks.)
Speculative Execution Scope and Depth. (Figure: a graph of condition nodes c1 through c6.)