Floorplan Assisted Data Rate Enhancement through Wire Pipelining: A Real Assessment. ISPD 2005, San Francisco, CA, May 5th, 2005. Mario R. Casu - Politecnico di Torino and Luca Macchiarulo - University of Hawaii at Manoa.
Outline • Communication concerns at the physical layer • Great Expectations of “Wire Pipelining” • No block Delay • Block delay limitation • Computation locality • Adaptive Communications • Floorplanning strategy for adaptive systems • Experimental results
Wire pipelining - concept • Wire delay is a substantial share of the overall delay • Global wires are difficult to deal with • Global wire scaling does not follow the scaling of transistors or of local wiring [Figure: a wire with delay Del]
Wire pipelining - concept • Introducing a latch/FF relaxes the timing constraints • Similar to classical pipelining [Figure: the same wire with segment delays Del’ and Del’’]
Critical Length • Maximal length for which the wire can be driven at a given frequency • Optimum number of buffers • Optimum buffer dimensions • Optimum wire sizing • At the critical length: Del=1/f
Wire Pipelining • Above the critical length (Del>1/f), clocked elements (pipeline stages) are needed
“Wire Pipelining” techniques • Problem: maintaining functionality with a minimum loss in performance. • Solutions: • Globally Asynchronous Locally Synchronous – GALS • Retiming • Regular Distributed Register (J. Cong) • c-slowing (S. Sapatnekar) • Latency Insensitive Protocols (L. Carloni)
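LIPs: Concept [Figure: a pearl wrapped in a shell, connected through a relay station]
Shell – Relay Station Interaction [Figure: shell and relay station exchanging the valid and stop signals]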
Feedback Topology [Figure: six animation frames of a feedback topology, showing valid data (0, 1, 2) and void data (τ) circulating through the loops]
Feedback Topology: Performance • Void data circulate in the loops: initially as many as there are relay stations (r) • The “period” of the void-stop pattern equals the number of shells (s) plus the number of relay stations (r) in the loop • The worst loop fixes the throughput • T=s/(s+r) • Example with loops a and b: Ta=2/4, Tb=2/5, so T=2/5
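The loop formula above can be checked with a few lines of Python. This is a minimal sketch, not the authors' tool: the function names are hypothetical, and the (s, r) pairs for loops a and b are read off the fractions 2/4 and 2/5 quoted on the slide.

```python
from fractions import Fraction

def loop_throughput(shells: int, relay_stations: int) -> Fraction:
    """Throughput of a single feedback loop: T = s / (s + r)."""
    if shells <= 0 or relay_stations < 0:
        raise ValueError("need s > 0 and r >= 0")
    return Fraction(shells, shells + relay_stations)

def system_throughput(loops) -> Fraction:
    """The slowest loop fixes the throughput of the whole system."""
    return min(loop_throughput(s, r) for s, r in loops)

# (shells, relay stations), assumed from the slide's T_a = 2/4 and T_b = 2/5
loop_a = (2, 2)
loop_b = (2, 3)
print(system_throughput([loop_a, loop_b]))  # 2/5
```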
Classical Floorplanning • Problem: find a placement of (soft or hard) blocks that optimally fits a floorplan • Optimality is measured by whitespace, overall wirelength, critical path, or a combination
Floorplanning for Throughput [ISPD2004] • The optimal floorplan in our case is that which guarantees the maximum throughput compatible with given blocks’ dimensions • Maximum throughput is equivalent to the worst cost-to-time ratio loop
New Heuristic Throughput Computation • Heuristic: • Statically compute, for every edge e, the shortest loop l(e) in which it appears • For every optimization iteration: • $Cost(e) = \frac{1}{l(e)} \left\lfloor \frac{length}{Clength} \right\rfloor$ • $TotCost = \sum_e Cost(e)$
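The heuristic can be sketched in Python as follows. This is an illustration under stated assumptions, not the floorplanner's code: loop length is counted in edges, l(e) is found by breadth-first search, edges that lie on no loop are assumed to contribute zero cost, and all names and numbers are hypothetical.

```python
from collections import deque
from math import floor

def shortest_loop_lengths(edges):
    """For each directed edge (u, v), the number of edges in the shortest
    loop containing it: 1 + shortest path from v back to u.
    Edges that lie on no loop are omitted from the result."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
    result = {}
    for u, v in edges:
        dist = {v: 0}
        queue = deque([v])
        while queue and u not in dist:   # stop as soon as u is reached
            node = queue.popleft()
            for nxt in succ.get(node, []):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        if u in dist:
            result[(u, v)] = dist[u] + 1
    return result

def total_cost(edges, wire_length, critical_length):
    """TotCost = sum over e of (1 / l(e)) * floor(length / Clength)."""
    total = 0.0
    for edge, loop_len in shortest_loop_lengths(edges).items():
        stages = floor(wire_length[edge] / critical_length)
        total += stages / loop_len
    return total

# Hypothetical three-block example with two loops: A-B-A and A-B-C-A
edges = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "A")]
lengths = {("A", "B"): 2.5, ("B", "A"): 0.5, ("B", "C"): 1.2, ("C", "A"): 3.0}
print(shortest_loop_lengths(edges))
print(total_cost(edges, lengths, critical_length=1.0))
```

The loop lengths are computed once, since they depend only on the netlist; only the wire lengths change between optimization iterations.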
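Throughput-frequency trade-off • Without pipelining: $f = 1/L$, $T = 1$ • $DR_0 = T \cdot f = 1 \cdot \frac{1}{L} = \frac{1}{L}$
Throughput-frequency trade-off • With pipelining: $f = 2/L$, $T = \frac{2}{2+2} = \frac{1}{2}$ • $DR = T \cdot f = \frac{1}{2} \cdot \frac{2}{L} = \frac{1}{L}$ • No advantage!
Throughput-frequency trade-off • Without pipelining: $f = 1/L$, $T = 1$ • $DR_0 = \frac{1}{L} \cdot 1 = \frac{1}{L}$ [Figure: branches of length L/2, L and L]
Throughput-frequency trade-off • With pipelining: $f = 2/L$, $T = \frac{3}{3+2} = \frac{3}{5}$ • $DR = \frac{2}{L} \cdot \frac{3}{5} = \frac{6}{5L}$ [Figure: branches split into segments of length L/2]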
Data Rate as the basic performance metric – Speed-up • Wire pipelining allows an increased frequency • But it decreases the throughput, according to the previous considerations • Real performance is given by the data rate: DR=Thr*f • The advantage w.r.t. non-pipelined systems must be assessed through DR measures • Speed-up: SU=DR/DR0 • L/(lm+lmax)<SU<L/lm • Floorplanning can be extremely beneficial if it can reduce the average branch length lm
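The two trade-off examples from the preceding slides can be reproduced numerically. The sketch below assumes L is normalised to 1 and uses a hypothetical helper name; it only restates DR = Thr * f and SU = DR/DR0.

```python
from fractions import Fraction

def data_rate(throughput: Fraction, frequency: Fraction) -> Fraction:
    """Data rate DR = Thr * f."""
    return throughput * frequency

L = Fraction(1)  # assumed normalisation of the wire length

# Non-pipelined reference: T = 1, f = 1/L
dr0 = data_rate(Fraction(1), 1 / L)

# First example: f = 2/L, T = 2/(2+2)
dr_first = data_rate(Fraction(2, 2 + 2), 2 / L)

# Second example: f = 2/L, T = 3/(3+2)
dr_second = data_rate(Fraction(3, 3 + 2), 2 / L)

print(dr_first / dr0)   # 1   (no advantage)
print(dr_second / dr0)  # 6/5
```

Doubling the frequency pays off only in the second case, where the throughput loss is smaller than the frequency gain.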
Block delay effect • Blocks cap the maximum frequency • $f_{max} < 1 / \max_i(d_i)$ • We can measure delay in “length”, by using a proportionality factor • Block delay enters the picture if signals are latched at the input or output side only [Figure: labels L and ld]
Block delay models • We used two different models • Delay proportional to block edge • Rationale: complexity of logic is related to block size • Minimum constant of proportionality=1: the delay is the same as that needed for the fastest signal to traverse the entire block • Optimistic assumption • Delay constant, related to technology and equal to 13FO4 • Derived from assumptions in the roadmap • More realistic for high performance design • More pessimistic (see below) • Reality probably lies somewhere between the two cases
Speed-up with block delay • Taking the block delay into account modifies the previous considerations • max(Li+di)/(lm+dm+dmax)<SU<max(Li+di)/(lm+dm) • In general, much worse than in the previous case
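To see how the block delay tightens the speed-up bounds, the two inequalities can be evaluated side by side. This is only a sketch: the slides do not define every symbol, so the code assumes that lm is the average branch length, lmax the longest branch, dm and dmax the average and largest block delay expressed as lengths, and (Li, di) the wire length and block delay of each connection. All numbers are made up for illustration.

```python
def su_bounds_no_block_delay(L, lm, lmax):
    """Bounds L/(lm+lmax) < SU < L/lm, returned as (lower, upper)."""
    return L / (lm + lmax), L / lm

def su_bounds_with_block_delay(paths, lm, dm, dmax):
    """Bounds max(Li+di)/(lm+dm+dmax) < SU < max(Li+di)/(lm+dm)."""
    worst = max(Li + di for Li, di in paths)
    return worst / (lm + dm + dmax), worst / (lm + dm)

# Hypothetical values in arbitrary length units
print(su_bounds_no_block_delay(L=10, lm=2, lmax=3))
print(su_bounds_with_block_delay(paths=[(10, 4), (6, 5)], lm=2, dm=4, dmax=5))
```

With these illustrative numbers, both the lower and the upper bound drop once block delay is included, which is the qualitative point of the slide.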
Throughput driven floorplan experiments • We used the floorplanner described in ISPD’04 to evaluate the optimal frequency (maximum DR) • On GSRC and MCNC benchmarks with input-output information • No block delay: • SU varies between 0.8% and 36% • Better on benchmarks with greater complexity • Block delay: • Proportional to blocks’ edges: -7% to 44% • Equal to 13FO4: -11% to 12% • The MCNC suite shows the worst behavior • High speed systems with highly optimized blocks lead to negligible SU, despite a large increase of clock frequency.
Space for better performance? • Not all point-to-point connections are actually used at every clock cycle. • Ex.: CPU to cache communication. [Figure: read cycle and write cycle on the Addr, Data-out and Data-in channels]
Space for better performance? • Unused communication channels effectively break throughput-limiting loops • Pipelining without limitation can become possible [Figure: stream write cycle in three frames; Addr 1 to 3 and Data-out 1 to 3 advance through the pipelined channels]
Adaptive Latency Insensitive Protocol • Blocks need a mechanism to discard useless “packets”: adaptive communication • Details are beyond the scope of the paper, but: • It is possible through a simple modification of the original protocol • It requires the introduction of “oracles” predicting unused inputs for each block • We designed a functional implementation in synthesizable VHDL • We proved the correctness of the implementation (absence of deadlocks and correct signal sequencing)
ALIP performance evaluation • The adaptiveness of the approach prevents a static prediction of performance • However, a few conclusions can be reached: • The performance is bounded above by static LIP • Performance in long sequences of input independence is equivalent to that of the simplified network with the channel removed • If the system experiences infrequent “context switching” on its channels, such that at any given time the performance is a static Thi, the average performance can be approximated as: • $Th = \sum_i a_i \, Th_i$ • ai: fraction of time with performance Thi
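The time-weighted average can be written as a one-line sum. The sketch below uses a hypothetical function name and takes its sample values (Th1=1, Th2=1/2, a1=a2=1/2) from the read/write example in these slides.

```python
from fractions import Fraction

def average_throughput(phases) -> Fraction:
    """Approximate Th as the sum of a_i * Th_i over all phases.

    phases: list of (a_i, Th_i) pairs, where a_i is the fraction of time
    spent at static throughput Th_i. The a_i must sum to 1.
    """
    if sum(a for a, _ in phases) != 1:
        raise ValueError("time fractions a_i must sum to 1")
    return sum(a * th for a, th in phases)

phases = [
    (Fraction(1, 2), Fraction(1)),     # a1 = 1/2, Th1 = 1
    (Fraction(1, 2), Fraction(1, 2)),  # a2 = 1/2, Th2 = 1/2
]
print(average_throughput(phases))  # 3/4
```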
ALIP performance evaluation - Example • Ck=1: stream write cycle, valid data=1 • Ck=2: stream write cycle, valid data=2 • Ck=3: stream write cycle, valid data=3 • Ck=4: read cycle, valid data=4 • Ck=5: read cycle, valid data=5 • Ck=6: read cycle, valid data=5 • Ck=7: read cycle, valid data=5 • Ck=8: read cycle, valid data=6 • Throughput=3/4, with Th1=1, Th2=1/2, a1=1/2, a2=1/2 [Figure: contents of the Addr, Data-out and Data-in channels at each clock cycle; τ marks void data]
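The example can be replayed from the cumulative valid-data counts shown for Ck=1 to Ck=8. This is a sketch with a hypothetical helper; splitting the trace after cycle 4 is an assumption, chosen because it reproduces a1=a2=1/2.

```python
from fractions import Fraction

# Cumulative valid data observed at Ck = 1..8 in the example
valid_data = [1, 2, 3, 4, 5, 5, 5, 6]

def throughput(cumulative, start, end) -> Fraction:
    """Valid data per clock cycle over cycles start+1 .. end (1-based)."""
    before = cumulative[start - 1] if start > 0 else 0
    return Fraction(cumulative[end - 1] - before, end - start)

th1 = throughput(valid_data, 0, 4)      # first half of the trace
th2 = throughput(valid_data, 4, 8)      # second half of the trace
overall = throughput(valid_data, 0, 8)  # whole trace

print(th1, th2, overall)  # 1 1/2 3/4
# The weighted sum of the two phases matches the measured overall value
print(Fraction(1, 2) * th1 + Fraction(1, 2) * th2 == overall)  # True
```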
Adaptive communication performance evaluation - assumptions • Assumption 1: No time lost in “context switching” • Unrealistic, but acceptable for burst communication, and consistent with experiments • Assumption 2: Channels behave in a statistically independent fashion • Only single clock cycle independence is important for our purposes • Under 1 and 2, we can compute channel activities and use them to weight the connections
Floorplanning for Throughput – adaptive case • The optimal floorplan in our case is that which guarantees the maximum throughput compatible with given blocks’ dimensions • Maximum throughput is equivalent to the worst cost-to-time ratio loop, weighted by the loop activation ratio • It can be approximated by taking into account the channel activation ratio
New Heuristic Throughput Computation • Heuristic: • Statically compute, for every edge e, the shortest loop l(e) in which it appears • For every optimization iteration: • $Cost(e) = \frac{1}{l(e)} \left\lfloor \frac{length}{Clength} \right\rfloor a(e)$ • $TotCost = \sum_e Cost(e)$ • The only change consists in the inclusion of the term a(e)
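A small sketch shows how the activity term a(e) scales the cost. It assumes that the loop lengths l(e) have already been computed and that a(e) is a channel activity between 0 and 1; the function name and the numbers are hypothetical.

```python
from math import floor

def adaptive_total_cost(edges, critical_length):
    """TotCost = sum over e of (1/l(e)) * floor(length/Clength) * a(e).

    edges: iterable of (loop_length, wire_length, activity) tuples.
    """
    return sum(
        floor(wire / critical_length) * activity / loop_len
        for loop_len, wire, activity in edges
    )

# Two hypothetical edges: (l(e), wire length, a(e))
always_active = [(2, 2.5, 1.0), (3, 3.0, 1.0)]
half_active = [(2, 2.5, 1.0), (3, 3.0, 0.5)]

print(adaptive_total_cost(always_active, critical_length=1.0))  # 2.0
print(adaptive_total_cost(half_active, critical_length=1.0))    # 1.5
```

With a(e)=1 on every edge the expression reduces to the non-adaptive cost, so a rarely used channel is penalised less for being long.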
Experiments • GSRC/MCNC benchmarks • Burst mode • Uniformly distributed phases and activation times • Comparison between the non-pipelined solution and the adaptively pipelined one (13FO4 case) • After optimization, a VHDL netlist is automatically generated and simulated to measure the real performance of the system (as opposed to the approximation from the floorplanner) • Results: • SU between 16% and 44% • Monotonic behavior in the legal interval • Limitations due mainly to FO4 delays
Experiments • MPEG decoder • Strict data dependency • Optimization as in other cases • Simulation as before and with real channel utilization profiles • Results: • SU of 42% with block delay, 76% without • Real SU of 31% (effect of non-random correlation)
Conclusions and future work • Pure “blind” pipelining fails to achieve the available optimization, because it neglects common information • Adaptive protocols can take advantage of the information available to the blocks • We will concentrate on • Automated extraction of information from the blocks • Power optimization (power/timing trade-offs) • Routing constraints effects