Handset architectures: ASICs and programmable
Sridhar Rajagopal
sridhar@rice.edu
http://www.ece.rice.edu/~sridhar
Support for this work, in part by Nokia, TI and NSF, is gratefully acknowledged.
2G handsets
• DSP for most of the baseband
• ASIC for compute-intensive operations (spreading etc.)
• Microcontroller for higher layers
Source: M. L. McMahan, "Evolving Cellular Handset Architectures but a Continuing, Insatiable Desire for DSP MIPs," TI Report SPRA650, March 2000
Proposed 3G handsets (TI, VIA)
• Increased number of co-processors, as DSPs are unable to do most of the baseband
Sources:
• U. Ko, M. McMahan and E. Auslander, "DSP for the third generation wireless communications," International Conference on Computer Design, 1999, pp. 516–520
• H. Chen, "Introduction to W-CDMA SoC design approach," VIA Technologies, August 2002, www.itpilot.org.tw/provisional/910802/INTRODUCTION%20TO%20WCDMA%20SOC%20.PDF
Motivation
• How does this scale?
• Do we need a DSP or should we build ASICs?
• If ASICs, how to build better ASICs?
• If programmable, how to build better DSPs?
• If both, how do we mix them better?
Answers depend on
• level of programmability needed
• area-time-power architecture tradeoffs
Rice innovations for ASICs and DSPs
• ASICs: On-line arithmetic for dynamic truncation
• Programmable: Scalable Wireless Application-specific Processors (SWAPs)
• Mix and match: Hybrid SWAPs (H-SWAPs)
Outline
• On-line arithmetic for dynamic truncation
• SWAPs
• H-SWAPs
ASIC designs
• Finite precision arithmetic
  • Faster
  • Low power
  • Low area
• How to keep finite precision bounded:
  • Saturation
  • Truncation
Keeping precision bounded
• Example of truncation
  • Multiplication by the step size in gradient descent
  • Sign detection
• Example of saturation
  • Avoiding overflows
  • When the probability of useful MSBs is low
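The two ways of bounding precision named above can be sketched on plain integers. This is a minimal illustration, not the hardware design from the slides; the function names `saturate` and `truncate_lsbs` and the 8-bit word size in the usage note are assumptions made for the example.

```python
def saturate(value: int, bits: int) -> int:
    """Clamp `value` to the range of a signed two's-complement word of `bits` bits.

    Saturation bounds the MSB side: out-of-range results stick at the
    largest or smallest representable value instead of wrapping around.
    """
    hi = (1 << (bits - 1)) - 1
    lo = -(1 << (bits - 1))
    return max(lo, min(hi, value))


def truncate_lsbs(value: int, drop: int) -> int:
    """Discard the `drop` least significant bits of `value`.

    Truncation bounds the LSB side. Python's >> is an arithmetic shift,
    so negative values round toward minus infinity.
    """
    return value >> drop
```

For example, `saturate(200, 8)` clamps to 127 (an overflow avoided), while `truncate_lsbs(91, 3)` keeps only the upper bits and yields 11, at the cost of a truncation error.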
Dynamic precision requirements
• Precision needs change with algorithms, SNR
• Adapt hardware dynamically to save power
  • 25-35% power reduction possible
• Dynamic saturation vs. dynamic truncation
  • Saturation: easy, as LSBs are computed first; truncation: difficult
  • Saturation: no error; truncation: significant error
  • Saturation: throughput benefits; truncation: no benefits
On-line arithmetic for dynamic truncation
• Works Most Significant Digit First
• Natural way of truncation
• Digit-serial dynamic truncation
• Redundant number system: error only in LSD
• Throughput benefits, as digit-serial
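Why most-significant-digit-first operation makes truncation natural can be sketched with a digit stream. The sketch below models only the output of an on-line unit, not the on-line multiplier or adder itself, and it assumes a radix-2 signed-digit (redundant) fraction with digits in {-1, 0, 1}; the function names are hypothetical. With a finite digit string of this form, the digits after the first nonzero digit can never outweigh it, so the sign is known as soon as that digit arrives.

```python
from fractions import Fraction
from itertools import islice
from typing import Iterable, List, Tuple


def sd_value(digits: Iterable[int]) -> Fraction:
    """Exact value of a radix-2 signed-digit fraction, MSD first: sum of d_i * 2^-(i+1)."""
    return sum((Fraction(d, 2 ** (i + 1)) for i, d in enumerate(digits)), Fraction(0))


def truncate_msd_first(digit_stream: Iterable[int], keep: int) -> List[int]:
    """Dynamic truncation: consume only the first `keep` digits and ignore the rest."""
    return list(islice(digit_stream, keep))


def detect_sign(digit_stream: Iterable[int]) -> Tuple[int, int]:
    """Return (sign, digits_consumed), stopping at the first nonzero digit.

    For a finite signed-digit string the remaining digits sum to less than
    the weight of the first nonzero digit, so its sign is the final sign.
    """
    consumed = 0
    for d in digit_stream:
        consumed += 1
        if d != 0:
            return (1 if d > 0 else -1), consumed
    return 0, consumed
```

With the digit string `[0, 1, -1, 0, -1, 1, 0, 1]`, `detect_sign` stops after two digits, and keeping only the first four digits changes the value by less than 2^-4, which is the kind of bounded error that makes MSD-first truncation attractive.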
[Figure: Example for sign detection. Four panels of tree-addition pipelines: (a) Truncated conventional arithmetic; (b) On-line arithmetic with full precision; (c) Dynamically truncated on-line arithmetic (2 MSDs), with idle stages (pipeline bubbles); (d) Dynamically truncated on-line arithmetic (without truncation error), where computation stops at the point the sign is determined.]
ASIC design conclusion
Using on-line arithmetic for dynamic truncation and conventional arithmetic for dynamic saturation, one can design efficient ASICs for handsets.
Details: Predrag
Outline
• On-line arithmetic for dynamic truncation
• SWAPs
• H-SWAPs
Programmable architectures
• Current DSPs
  • Not enough functional units (FUs)
  • Cannot extend to more FUs
    • Limited Instruction Level Parallelism (ILP)
    • Cannot support more registers (register area increases quadratically with FUs)
    • Compilers: difficult to find ILP as FUs increase
Solution
• Exploit data parallelism (DP)
• Lots available in wireless algorithms
• Example (DP across loop iterations, ILP between the statements of the loop body):
    for (i = 1: 1024) {
        a[i] = b[i] + c[i];
        d[i] = b[i] * c[i];
    }
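The loop in this slide can be used to sketch how data parallelism maps onto clusters. The following is a software simulation under stated assumptions, not the SWAP hardware or its programming model: `run_cluster` and `run_data_parallel` are hypothetical names, and the clusters are visited one after another here although in hardware they would run at the same time.

```python
from typing import List, Tuple


def run_cluster(b_slice: List[int], c_slice: List[int]) -> Tuple[List[int], List[int]]:
    """One cluster runs the loop body on its own slice of the data.

    The add and the multiply are independent of each other, which is the
    ILP a single cluster (or a conventional DSP) can exploit.
    """
    a_out = [x + y for x, y in zip(b_slice, c_slice)]
    d_out = [x * y for x, y in zip(b_slice, c_slice)]
    return a_out, d_out


def run_data_parallel(b: List[int], c: List[int], num_clusters: int) -> Tuple[List[int], List[int]]:
    """Split the iterations evenly over `num_clusters` clusters (the DP).

    Every cluster executes the same instructions on different data, so the
    iterations need no communication between clusters.
    """
    n = len(b)
    chunk = (n + num_clusters - 1) // num_clusters  # iterations per cluster
    a: List[int] = []
    d: List[int] = []
    for k in range(num_clusters):  # simulated sequentially
        lo, hi = k * chunk, min((k + 1) * chunk, n)
        a_k, d_k = run_cluster(b[lo:hi], c[lo:hi])
        a.extend(a_k)
        d.extend(d_k)
    return a, d
```

Because the iterations are independent, the result is the same for any cluster count; only the number of iterations each cluster must execute changes, which is what allows a lower clock frequency per cluster.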
[Figure: DSP vs. SWAPs. DSP (1 cluster): internal memory feeding a single cluster of adders (+) and multipliers (*), exploiting ILP. SWAPs (max. clusters): internal memory feeding many such clusters, exploiting ILP within each cluster and DP across clusters.]
SWAPs trade-offs
• Same internal memory size as DSPs
  • Dependent on application, not architecture
• Needs more area to support more functional units
  • Area is not a constraint (power is)
• Varying levels of DP in applications
  • Needs reconfiguration!!
  • Need to turn off unused clusters
• More parallelism → lower clock frequency → lower voltage → low power (CV²f + leakage) in spite of larger area
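The last trade-off follows from the dynamic power term CV²f. A rough sketch of the argument, under simplifying assumptions that are not from the slides: switched capacitance grows in proportion to the number of clusters, the clock frequency needed for the same throughput falls in the same proportion, and leakage is ignored (it does grow with area). The supply voltages in the usage note are hypothetical.

```python
def dynamic_power(c_eff: float, vdd: float, freq_hz: float) -> float:
    """Dynamic switching power, P = C * V^2 * f."""
    return c_eff * vdd ** 2 * freq_hz


def parallel_power_ratio(num_clusters: int, vdd_single: float, vdd_parallel: float) -> float:
    """Dynamic power of an N-cluster design relative to a 1-cluster design.

    Assumes N times the switched capacitance at 1/N of the clock frequency,
    so the C*f product is unchanged and only the voltage term differs.
    """
    c_eff, freq = 1.0, 1.0  # normalized baseline values
    single = dynamic_power(c_eff, vdd_single, freq)
    parallel = dynamic_power(c_eff * num_clusters, vdd_parallel, freq / num_clusters)
    return parallel / single
```

For instance, if running 16 clusters at 1/16 of the clock allowed the supply to drop from 1.2 V to 0.8 V, `parallel_power_ratio(16, 1.2, 0.8)` gives (0.8/1.2)², a bit under half the dynamic power, despite 16 times the area. With no voltage reduction the ratio is 1, so the saving comes entirely from the lower voltage.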
Example: Viterbi Decoding
• Add-Compare-Select (ACS): trellis interconnect
  • Re-order for exploiting DP
• Traceback: sequential
  • Use Register Exchange (RE)
Exploiting DP in programmable architecture implies:
• Re-order ACS
• Re-order RE
[Figure: Re-ordering for parallel Viterbi, 16 states. (a) Trellis: states X(0) to X(15) in natural order on both sides. (b) Shuffled trellis: one side in natural order X(0) to X(15), the other side re-ordered as X(0), X(2), X(4), ..., X(14), X(1), X(3), X(5), ..., X(15).]
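One plausible reading of the shuffled trellis is the usual butterfly grouping of ACS operations, sketched below. This is an illustration and not necessarily the exact scheme used in the SWAP code. It assumes the common shift-register state numbering in which state s with input bit u moves to state (2s + u) mod N, takes the branch metric as a hypothetical callable `bm(prev_state, next_state)`, and omits the survivor decisions needed for register exchange.

```python
from typing import Callable, List

BranchMetric = Callable[[int, int], int]


def acs_reference(pm: List[int], bm: BranchMetric) -> List[int]:
    """State-by-state Add-Compare-Select, new metrics in natural state order."""
    n = len(pm)
    half = n // 2
    new_pm = []
    for nxt in range(n):
        p0 = nxt >> 1      # predecessor in the lower half of the states
        p1 = p0 + half     # predecessor in the upper half of the states
        new_pm.append(min(pm[p0] + bm(p0, nxt), pm[p1] + bm(p1, nxt)))
    return new_pm


def acs_butterflies_shuffled(pm: List[int], bm: BranchMetric) -> List[int]:
    """ACS grouped into N/2 independent butterflies, stored in shuffled order.

    Butterfly i reads metrics i and i + N/2 and writes the metrics of
    states 2i and 2i + 1 back to positions i and i + N/2, so the output
    is ordered X(0), X(2), ..., X(N-2), X(1), X(3), ..., X(N-1).
    """
    n = len(pm)
    half = n // 2
    out = [0] * n
    for i in range(half):  # iterations are independent: one per DP lane
        lo, hi = pm[i], pm[i + half]
        even, odd = 2 * i, 2 * i + 1
        out[i] = min(lo + bm(i, even), hi + bm(i + half, even))
        out[i + half] = min(lo + bm(i, odd), hi + bm(i + half, odd))
    return out


def shuffled_order(n: int) -> List[int]:
    """State index held at each position of the shuffled output."""
    return list(range(0, n, 2)) + list(range(1, n, 2))
```

Each butterfly touches the same two array positions for reading and writing, so the work divides evenly over clusters with no conflicts; the price is that the metrics come out in the even-then-odd order and later steps must account for that ordering.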
Viterbi reconfiguration
• Packet 1: constraint length 7 (16 clusters)
• Packet 2: constraint length 9 (64 clusters)
• Packet 3: constraint length 5 (4 clusters)
• Clusters not needed for the available DP can be turned OFF
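The cluster counts on this slide are consistent with a fixed number of trellis states per cluster. A small sketch of that relationship, assuming 2^(K-1) states for constraint length K and 4 states per cluster; the value 4 is inferred from the slide's numbers rather than stated in it, and the function names and the 64-cluster maximum are hypothetical.

```python
STATES_PER_CLUSTER = 4  # assumption: inferred from 64 states -> 16 clusters


def trellis_states(constraint_length: int) -> int:
    """Number of trellis states for a convolutional code of constraint length K."""
    return 2 ** (constraint_length - 1)


def active_clusters(constraint_length: int, max_clusters: int = 64) -> int:
    """Clusters to keep powered for this packet; the remainder can be turned off."""
    needed = max(1, trellis_states(constraint_length) // STATES_PER_CLUSTER)
    return min(needed, max_clusters)
```

Under these assumptions `active_clusters` returns 16, 64 and 4 for constraint lengths 7, 9 and 5, matching the three packets above, and the work per cluster stays the same for every constraint length.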
[Figure: Execution timeline for three 64-bit packets at rate ½: Packet 1 (constraint length 7), Packet 2 (constraint length 9), Packet 3 (constraint length 5), showing memory accesses and kernels (computation).]
[Figure: Viterbi decoding: rate 1/2 at 128 Kbps = 10 MHz. Log-log plot of the frequency needed to attain real-time (in MHz) against the number of clusters, with actual results for K = 9, K = 7 and K = 5, comparing regular code and reconfigurable code.]
[Figure: Viterbi decoding: Comparisons. Log-log plot with actual results for K = 9, K = 7 and K = 5, annotated with DSP (RE), DSP C64x (w/o co-proc), DP, SWAP, Task Pipelining, Dedicated interconnect, FPGA, Virtex II FPGA*, and 128 KHz (1 bit/cycle).]
*M. Vaya, J. R. Cavallaro, "VITURBO: A reconfigurable architecture for Viterbi and Turbo decoding," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2003, Hong Kong
Salient features of this solution
• Any constraint length: 10 MHz at 128 Kbps
• Same code for all constraint lengths
  • No need to re-compile or load another code
  • As long as parallelism/cluster ratio is constant
• Exploiting parallelism at 3 levels for real-time:
  • Instruction Level Parallelism (DSP)
  • Subword Parallelism (DSP)
  • Data Parallelism (SWAP)
Problems
• Suitable for handsets? Not yet!
• Still too general
  • Not low power enough!!!
• No special customization for the application
  • Except for a fixed-point architecture
  • Generic instruction set
  • Generic ALUs (though can be powered down)
  • Generic inter-cluster communication network
Outline
• On-line arithmetic for dynamic truncation
• SWAPs
• Hybrid SWAPs (H-SWAPs)
H-SWAPs
• Trade Data Parallelism for Task Pipelining
• Customize each mini-SWAP
[Figure: SWAPs (max. clusters and reconfigure), with full DP across clusters, compared with a mini-SWAP (limit clusters) and with H-SWAPs (collection of customized mini-SWAPs), each mini-SWAP having limited DP.]
Work in progress
• How to trade off task vs. data parallelism?
• Power estimation for SWAPs (actual numbers)
• Comparisons with ASIC solutions in terms of area-time-power
• Evaluation of specialized inter-cluster communication
• Specialized instructions (ACS) and arithmetic units (on-line)
I am looking for jobs!!!