Floorplan Assisted Data Rate Enhancement through Wire Pipelining: A Real Assessment

ISPD 2005, San Francisco, CA, May 5th, 2005 • Mario R. Casu (Politecnico di Torino) and Luca Macchiarulo (University of Hawaii at Manoa)

Outline • Communication concerns at the physical layer • Great Expectations of “Wire Pipelining” • No block Delay • Block delay limitation • Computation locality • Adaptive Communications • Floorplanning strategy for adaptive systems • Experimental results

Wire pipelining - concept • Wire delay is a substantial share of overall delay • Global wires are difficult to deal with • Global wire scaling does not follow the scaling of transistors and local wiring (figure: wire with delay Del)

Wire pipelining - concept • Introducing a latch/FF relaxes the timing constraints • Similar to classical pipelining (figure: wire split into two segments with delays Del’ and Del’’)

Critical Length • Maximal length for which the wire can be driven at a given frequency • Optimum number of buffers • Optimum buffer dimensions • Optimum wire sizing (figure: wire with Del=1/f)

Wire Pipelining • Above the critical length, clocked elements (pipeline stages) are needed (figure: wire with Del>1/f)

“Wire Pipelining” techniques • Problem: maintaining functionality with a minimum loss in performance. • Solutions: • Globally Asynchronous Locally Synchronous – GALS • Retiming • Regular Distributed Register (J. Cong) • c-slowing (S. Sapatnekar) • Latency Insensitive Protocols (L. Carloni)

LIPs: Concept (figure: pearl, shell, relay station)

Shell – Relay Station Interaction (figure: valid and stop signals)

Feedback Topology (animation, six frames: valid data 0, 1, 2 and void data τ advancing through the shells and relay stations of a feedback topology)

Feedback Topology: Performance • Void data circulate in the loops: initially as many as there are relay stations (r) • “Period” of void-stop equals the number of shells (s) plus relay stations (r) in the loop • The worst loop fixes the throughput • T=s/(s+r) • Ta=2/4, Tb=2/5, so T=2/5 (figure: topology with loops a and b)
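The loop formula above can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the shell and relay station counts of loops a and b are inferred from the fractions Ta=2/4 and Tb=2/5 rather than read off the figure.

```python
from fractions import Fraction


def loop_throughput(shells: int, relay_stations: int) -> Fraction:
    """Throughput of one feedback loop: T = s / (s + r)."""
    return Fraction(shells, shells + relay_stations)


def system_throughput(loops) -> Fraction:
    """The worst (slowest) loop fixes the throughput of the whole system."""
    return min(loop_throughput(s, r) for s, r in loops)


# Assumed counts (s, r): loop a has 2 shells and 2 relay stations,
# loop b has 2 shells and 3 relay stations.
loops = {"a": (2, 2), "b": (2, 3)}
for name, (s, r) in loops.items():
    print(name, loop_throughput(s, r))
print("system", system_throughput(loops.values()))
```

Exact fractions are used so that 2/4 and 2/5 compare without rounding.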

Classical Floorplanning • Problem: find a placement of (soft or hard) blocks that optimally fits a floorplan • Optimality is measured by whitespace, overall wirelength, critical path, or a combination

Floorplanning for Throughput [ISPD2004] • The optimal floorplan in our case is the one that guarantees the maximum throughput compatible with the given blocks’ dimensions • Maximum throughput is determined by the loop with the worst cost-to-time ratio

New Heuristic Throughput Computation • Heuristic: • Statically compute the shortest loop l(e) in which every edge e appears • For every optimization iteration: • Cost(e) = (1/l(e)) · floor(length(e)/Clength) • TotCost = Σ Cost(e)
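A minimal Python sketch of this heuristic follows, under stated assumptions: the block graph is a directed adjacency dictionary, length(e) is the wire length of edge e in the current floorplan, Clength is the critical length, and an edge that lies on no loop contributes zero cost. The function names and the toy graph are illustrative and are not taken from the floorplanner.

```python
from collections import deque
from math import floor


def shortest_loop_through(edge, adjacency):
    """Length (in edges) of the shortest directed loop containing `edge`.

    A loop through (u, v) is the edge itself plus a shortest path v -> u,
    found with breadth-first search. Returns None if no such loop exists.
    """
    u, v = edge
    dist = {v: 0}
    queue = deque([v])
    while queue:
        node = queue.popleft()
        if node == u:
            return dist[node] + 1
        for nxt in adjacency.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None


def total_cost(wire_length, loop_len, critical_length):
    """TotCost = sum over edges of (1 / l(e)) * floor(length(e) / Clength)."""
    total = 0.0
    for edge, length in wire_length.items():
        l_e = loop_len[edge]
        if l_e is None:  # assumption: an edge on no loop cannot limit throughput
            continue
        total += floor(length / critical_length) / l_e
    return total


# Toy graph, for illustration only.
adjacency = {"A": ["B"], "B": ["C", "A"], "C": ["A"]}
wire_length = {("A", "B"): 2.5, ("B", "A"): 0.8, ("B", "C"): 1.2, ("C", "A"): 3.0}

# Static step: l(e) is computed once, before the optimization loop.
loop_len = {e: shortest_loop_through(e, adjacency) for e in wire_length}

# Per-iteration step: only the wire lengths change between floorplans.
print(total_cost(wire_length, loop_len, critical_length=1.0))
```

Splitting the work this way mirrors the slide: the loop lengths depend only on the netlist, so each optimization iteration pays only for the sum over edges.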

Throughput-frequency trade-off • f=1/L • T=1 • DR0=1·(1/L)=1/L

Throughput-frequency trade-off • f=2/L • T=2/(2+2)=1/2 • DR=(1/2)·(2/L)=1/L • No advantage!

Throughput-frequency trade-off (figure: wires of length L/2, L and L) • f=1/L • T=1 • DR0=(1/L)·1=1/L

Throughput-frequency trade-off (figure: wires split into segments of length L/2) • f=2/L • T=3/(3+2)=3/5 • DR=(2/L)·(3/5)=6/(5L)

Data Rate as the basic performance metric – Speed-up • Wire pipelining allows an increased frequency • But it decreases the throughput, according to the previous considerations • Real performance is given by the data rate: DR=Thr·f • The advantage w.r.t. non-pipelined systems is to be assessed through DR measures • Speed-up: SU=DR/DR0 • L/(lm+lmax)<SU<L/lm • Floorplanning can be extremely beneficial if it can reduce the average branch length lm
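The data rate and speed-up definitions can be tried on the numbers from the throughput-frequency trade-off slides. The sketch below normalises L to 1; the function names are illustrative, and reading lmax as the longest branch length is an assumption, since the slide names only lm explicitly.

```python
from fractions import Fraction


def data_rate(throughput, frequency):
    """DR = Thr * f."""
    return throughput * frequency


def speed_up(dr, dr0):
    """SU = DR / DR0, pipelined data rate over non-pipelined data rate."""
    return dr / dr0


def speed_up_bounds(L, lm, lmax):
    """Bounds from the slide: L / (lm + lmax) < SU < L / lm.

    Assumption: lm is the average branch length, lmax the longest one.
    """
    return Fraction(L) / (lm + lmax), Fraction(L) / lm


L = 1  # normalised wire length

# Non-pipelined reference: f = 1/L, T = 1.
dr0 = data_rate(Fraction(1), Fraction(1, L))

# First pipelined example: f = 2/L, T = 1/2.
dr_a = data_rate(Fraction(1, 2), Fraction(2, L))

# Second pipelined example: f = 2/L, T = 3/5.
dr_b = data_rate(Fraction(3, 5), Fraction(2, L))

print(speed_up(dr_a, dr0))  # no advantage
print(speed_up(dr_b, dr0))  # 20% gain
```

Doubling the frequency pays off only in the second case, where the throughput drops by less than half.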

Block delay effect • Blocks put a cap on the maximum frequency • fmax < 1/max_i(d_i) • We can measure delay in “length” by using a proportionality factor • Block delay enters the picture if signals are latched at the input or output side only (figure: wire of length L and block delay expressed as length ld)

Block delay models • We used two different models • Delay proportional to block edge • Rationale: complexity of logic is related to block size • Minimum constant of proportionality=1: the delay is the same as that needed for the fastest signal to traverse the entire block • Optimistic assumption • Delay constant, related to technology and equal to 13FO4 • Derived from assumptions in the roadmap • More realistic for high performance design • More pessimistic (see below) • Reality probably lies somewhere between the two cases

Speed-up with block delay • Taking the block delay into account modifies the previous considerations • max(Li+di)/(lm+dm+dmax) < SU < max(Li+di)/(lm+dm) • In general, much worse than the previous case

Throughput driven floorplan experiments • We used the floorplanner described in ISPD’04 to evaluate the optimal frequency (maximum DR) • On GSRC and MCNC benchmarks with input-output information • No block delay: • SU varies between 0.8% and 36% • Better on benchmarks with greater complexity • Block delay • Proportional to blocks’ edges: -7% to 44% • Equal to 13FO4: -11% to 12% • The MCNC suite shows the worst behavior • High speed systems with highly optimized blocks lead to negligible or irrelevant SU, despite a large increase in clock frequency.

Space for better performance? • Not all point to point connections are actually used at every clock cycle. • Ex. CPU to Cache communication. (figures: read cycle and write cycle over the Addr, Data-out and Data-in channels)

Space for better performance? • Unused communication channels effectively break throughput-limiting loops • Pipelining without limitation can become possible (animation, three frames of a stream write cycle: Addr 1 and Data-out 1, then Addr 2 and Data-out 2, then Addr 3 and Data-out 3 advance along the pipelined channels)

Adaptive Latency Insensitive Protocol • Need a mechanism to allow blocks to discard useless “packets”: adaptive communication • Details are out of the scope of the paper, but • It is possible through a simple modification of the original protocol • Requires the introduction of “oracles” predicting unused inputs for each block • We designed a functional implementation in synthesizable VHDL • We proved the correctness of the implementation (absence of deadlocks and correct signal sequencing)

ALIP performance evaluation • The adaptiveness of the approach prevents a static prediction of performance • However, a few conclusions can be reached: • The performance is bounded above by static LIP • Performance in long sequences of input independence is equivalent to that of the simplified network with the channel removed • If the system experiences infrequent “context switching” on its channels, such that at any given time the performance is a static Thi, the average performance can be approximated as: • Th = Σ ai·Thi • ai: fraction of time with performance Thi

ALIP performance evaluation - Example (animation, one frame per clock cycle; channel contents as shown on the slides, τ = void data) • Ck=1, Valid Data=1, stream write cycle: Addr 1, τ, Data-out 1 • Ck=2, Valid Data=2, stream write cycle: Addr 2, Addr 1, Data-out 2, Data-out 1 • Ck=3, Valid Data=3, stream write cycle: Addr 3, Addr 2, Data-out 3, Data-out 2 • Ck=4, Valid Data=4, read cycle: Addr 4, Addr 3, Data-out 3 • Ck=5, Valid Data=5, read cycle: -----, Addr 4, τ, τ • Ck=6, Valid Data=5, read cycle: τ, -----, τ, Data-in4 • Ck=7, Valid Data=5, read cycle: τ, τ, -----, Data-in4 • Ck=8, Valid Data=6, read cycle: τ, Addr 5, τ, ----- • Throughput=3/4 • Th1=1, Th2=1/2 • a1=1/2, a2=1/2
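The figures in this example can be recomputed from the per-cycle valid data counts. The sketch below assumes the two phases are cycles 1-4 and cycles 5-8, which is what a1=a2=1/2 suggests; the function name is illustrative.

```python
from fractions import Fraction

# Cumulative number of valid data after clock cycles 1..8, from the frames.
valid_after_cycle = [1, 2, 3, 4, 5, 5, 5, 6]


def throughput(valid_counts, start, end):
    """Valid data per cycle over cycles start+1 .. end (start=0: from reset)."""
    before = valid_counts[start - 1] if start > 0 else 0
    return Fraction(valid_counts[end - 1] - before, end - start)


# Assumed phase split: first half and second half of the trace.
th1 = throughput(valid_after_cycle, 0, 4)
th2 = throughput(valid_after_cycle, 4, 8)
a1 = a2 = Fraction(1, 2)

measured = throughput(valid_after_cycle, 0, 8)
approximated = a1 * th1 + a2 * th2  # Th = sum of a_i * Th_i

print(th1, th2, measured, approximated)
```

On this short trace the weighted average and the directly measured throughput coincide at 3/4.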

Adaptive communication performance evaluation - assumptions • Assumption 1: No time lost in “context switching” • Unrealistic, but acceptable for burst communication, and consistent with experiments • Assumption 2: Channels behave in a statistically independent fashion • Only single clock cycle independence is important for our purposes • Under 1 and 2, we can compute channel activities and use them to weight the connections

Floorplanning for Throughput – adaptive case • The optimal floorplan in our case is the one that guarantees the maximum throughput compatible with the given blocks’ dimensions • Maximum throughput is determined by the loop with the worst cost-to-time ratio, weighted by the loop activation ratio • It can be approximated by taking into account the channel activation ratio

New Heuristic Throughput Computation • Heuristic: • Statically compute the shortest loop l(e) in which every edge e appears • For every optimization iteration: • Cost(e) = (1/l(e)) · floor(length(e)/Clength) · a(e) • TotCost = Σ Cost(e) • The only change consists in the inclusion of the term a(e)
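The effect of the activity term can be sketched in Python as follows. The tuple layout, the function name and the sample numbers are illustrative assumptions; l(e) is taken as already computed, length(e) is the wire length in the current floorplan, and a(e) is the channel activation ratio.

```python
from math import floor


def adaptive_total_cost(edges, critical_length):
    """TotCost = sum of (1 / l(e)) * floor(length(e) / Clength) * a(e).

    `edges` is an iterable of (shortest_loop_len, wire_length, activity).
    """
    return sum(
        activity * floor(length / critical_length) / loop_len
        for loop_len, length, activity in edges
    )


# Hypothetical edges: (l(e), length(e), a(e)).
edges = [(2, 2.5, 1.0), (3, 3.0, 0.5), (3, 1.2, 0.25)]

adaptive = adaptive_total_cost(edges, critical_length=1.0)

# Setting every a(e) to 1 gives back the non-adaptive cost.
static = adaptive_total_cost([(l, length, 1.0) for l, length, _ in edges], 1.0)

print(adaptive, static)
```

Rarely used channels weigh less in the total, so the floorplanner can tolerate long wires on them.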

Experiments • GSRC/MCNC benchmarks • Burst mode • Uniformly distributed phases and activation times • Comparison between non-pipelined solution and adaptively pipelined (13FO4 case) • After optimization, a VHDL netlist is automatically generated and simulated to measure the real performance of the system (as opposed to the approximation from the floorplanner) • Results: • SU between 16% and 44% • Monotonic behavior in the legal interval • Limitations due mainly to FO4 delays

Experiments • MPEG decoder • Strict data dependency • Optimization as in other cases • Simulation as before and with real channel utilization profiles • Results: • SU of 42% with block delay, 76% without • Real SU of 31% (effect of non-random correlation)

Conclusions and future work • Pure “blind” pipelining fails to achieve the available optimization, due to neglect of common information • Adaptive protocols can take advantage of the information available to the blocks • We will concentrate on • Automated extraction of information from the blocks • Power optimization (power/timing trade-offs) • Routing constraints effects
