Programmable processors for wireless base-stations

Sridhar Rajagopal (sridhar@rice.edu), December 9, 2003

Fact #1: Wireless data rates are outpacing clock rates
[Figure: clock frequency (MHz), W-LAN data rate (Mbps), and cellular data rate (Mbps) on a log scale versus year, 1996 to 2006]
• Clock frequency: 200 MHz to 4 GHz
• W-LAN data rate: 1 Mbps to 54-100 Mbps
• Cellular data rate: 9.6 Kbps to 2-10 Mbps
• Need to process 100X more bits per clock cycle today than in 1996
Source: Intel, IEEE 802.11x, 3GPP

Fact #2: Base-stations need horsepower
[Figure: base-station block diagram showing RF (LNA, RF RX, ADC, DDC), baseband processing (frequency offset compensation, power measurement and gain control (AGC), chip level demodulation and despreading, channel estimation, symbol detection, symbol decoding), network interface (packet/circuit switch control, E1/T1 or packet network to the BSC/RNC), and the power supply and control unit]
• Sophisticated signal processing for multiple users
• Need 100s to 1000s of arithmetic operations to process 1 bit
Source: Texas Instruments

Need on the order of 100 ALUs in base-stations
• Example: 1000 arithmetic operations/bit with 1 bit/10 cycles
  • 100 arithmetic operations/clock cycle
• Base-stations need on the order of 100 ALUs
  • irrespective of the type of (clocked) architecture
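The arithmetic in the example can be checked with a small C sketch. The function name `alus_needed` is hypothetical, and the sketch assumes that each ALU completes one arithmetic operation per clock cycle at full utilization, so the result is a lower bound rather than a design figure.

```c
/* Lower bound on the number of ALUs, assuming one arithmetic
 * operation per ALU per clock cycle (an assumption of this sketch).
 * ops_per_bit:    arithmetic operations needed to process one bit
 * cycles_per_bit: clock cycles available per received bit (must be > 0)
 * Returns ops_per_bit / cycles_per_bit, rounded up. */
static unsigned alus_needed(unsigned ops_per_bit, unsigned cycles_per_bit)
{
    return (ops_per_bit + cycles_per_bit - 1u) / cycles_per_bit;
}
```

With the slide's numbers, `alus_needed(1000, 10)` evaluates to 100.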

Fact #3: Base-stations need power-efficiency*
"Wireless gets blacked out too. Trying to use your cell phone during the blackout was nearly impossible. What went wrong?" (Paul R. La Monica, CNN/Money Senior Writer, August 16, 2003, 8:58 AM EDT)
• Wireless systems getting denser
  • More base-stations per unit area
  • Operational and maintenance costs
• Architectures first tested on base-stations
*Implies that it does not waste power; does not imply low power

Fact #4: Base-stations need flexibility*
• Wireless systems are continuously evolving
  • New algorithms designed and evaluated
  • Allow upgrading, co-existing, minimize design time, reuse
• Flexibility needed for power-efficiency
  • Base-stations rarely operate at full capacity
  • Varying users, data rates, spreading, modulation, coding
  • Adapt resources to needs
*How much flexibility? As flexible as possible

Fact #5: Current base-stations not flexible / not power-efficient
[Figure: current base-station architecture. RF (analog); 'chip rate' processing on ASIC(s) and/or ASSP(s) and/or FPGA(s); 'symbol rate' processing on DSP(s); decoding on co-processor(s) and/or ASIC(s); control and protocol on a DSP or RISC processor]
• Change implies re-partitioning algorithms, designing new hardware
• Design done for the worst case, with no adaptation to workload
Source: [Baines2003]

Thesis addresses the following problem
• Design a base-station that
  • supports 100s of ALUs
  • is power-efficient (adapts resources to needs)
  • is as flexible as possible
• How many ALUs at what clock frequency?
• HYPOTHESIS:
  • Programmable* processors for wireless base-stations
*How programmable? As programmable as possible

Programmable processors
• No processor optimization for a specific algorithm
  • As programmable as possible
  • Example: no instruction for Viterbi decoding
  • FPGAs, ASICs, ASIPs etc. not considered
• Use characteristics of wireless systems
  • Precision, parallelism, operations, ...
  • Compare: MMX extensions for multimedia

Single processors won’t do
(1) Find ways of increasing clock frequency
  • C64x DSP: 600 MHz, 720 MHz, 1 GHz, ... 100 GHz?
  • Easiest solution, but there are physical limits to scaling f
  • Not good for power, given the cubic dependence on f
(2) Increase the number of ALUs
  • Limited instruction level parallelism (ILP, MMX)
  • Register file area, ports explosion
  • Compiler issues in extracting more ILP
(3) Multiprocessors
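The cubic dependence of power on f mentioned under (1) can be seen from the standard CMOS dynamic power relation. The symbols C (switched capacitance) and V (supply voltage), and the assumption that V has to scale roughly linearly with f, are not stated on the slide and are introduced here only for illustration:

```latex
P_{\mathrm{dyn}} = C\,V^{2} f,
\qquad V \propto f
\;\Longrightarrow\;
P_{\mathrm{dyn}} \propto f^{3}
```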

Related work - Multiprocessors
[Figure: taxonomy of multiprocessors]
• Control
  • MIMD (Multiple Instructions : Multiple Data)
    • Multi-chip: TI TMS320C40 DSP, Sundance, Cm*
    • Single chip
      • Multi-threading (MT): Sandbridge SandBlaster DSP, Cray MTA, Sun MAJC, PowerPC RS64IV, Alpha 21464
      • Chip multiprocessor (CMP): TI TMS320C8x DSP, Hydra, IBM Power4
      • Clustered VLIW: TI TMS320C6x DSP, Multiflow TRACE, Alpha 21264
  • SIMD (Single Instruction : Multiple Data), data parallel
    • Array: ClearSpeed, MasPar, Illiac-IV, BSP
    • Vector: CODE, Vector IRAM, Cray 1
    • Stream: Imagine, Motorola RSVP
• Reconfigurable* processors: RAW, Chameleon, picoChip
• Cannot scale to support 100s of arithmetic units
*A reconfigurable processor uses reconfiguration for execution time benefits

Challenges in proving hypothesis
• Architecture choice for design exploration
  • SIMD generally more programmable* than reconfigurable
  • Compiler, simulators, tools and support play a major role
• Benchmark workloads need to be designed
  • Previously done as ASICs, so none available
  • Not easy: finite precision, algorithms changing
• Need detailed knowledge of wireless algorithms, architectures, mapping, compilers, design tools
*Programmable here refers to how easy the processor is to use and to write code for

Architecture choice: Stream processors
• State-of-the-art programmable media processors
  • Can scale to 1000s of arithmetic units [Khailany 2003]
  • Wireless algorithms have similar characteristics
• Cycle-accurate simulator with open-source code
  • Parameters such as ALUs, register files can be varied
  • Graphical tools to investigate FU utilization, bottlenecks, memory stalls, communication overhead, ...
  • Almost anything can be changed, some changes easier than others!

Thesis contributions
• Mapping algorithms on stream processors
  • designing data-parallel algorithm versions
  • tradeoffs between packing, ALU utilization and memory
  • reduced inter-cluster communication network
• Improve power efficiency in stream processors
  • adapting compute resources to workload variations
  • varying voltage and frequency to real-time requirements
• Design exploration between #ALUs and clock frequency to minimize power consumption
  • fast real-time performance prediction

Outline
• Background
  • Wireless systems
  • Stream processors
• Contribution #1 : Mapping
• Contribution #2 : Power-efficiency
• Contribution #3 : Design exploration
• Broader impact and limitations

Wireless workloads : 2G (Basic)
[Figure: 2G physical layer signal processing. The received signal after DDC feeds, for each of users 1 to K, a sliding correlator, a code matched filter, and a Viterbi decoder, followed by the MAC and network layers]
• 32 users, 16 Kbps/user
• Single-user algorithms (other users treated as noise)
• > 2 GOPs

3G Multiuser system
[Figure: 3G physical layer signal processing. The received signal after DDC feeds multiuser channel estimation and a code matched filter for each of users 1 to K; multiuser detection (parallel interference cancellation stages) is followed by a Viterbi decoder per user, then the MAC and network layers]
• 32 users, 128 Kbps/user
• Multi-user algorithms (cancel interference)
• > 20 GOPs

4G MIMO system
[Figure: 4G physical layer signal processing with M antennas. The received signal after DDC feeds, for each of users 1 to K and each of antennas 1 to T, channel estimation, chip level equalization, and a code matched filter; an LDPC decoder per user is followed by the MAC and network layers]
• 32 users, 1 Mbps/user
• Multiple antennas (higher spectral efficiency, higher data rates)
• > 200 GOPs

Programmable processors
    #define N 1024
    int i, a[N], b[N], sum[N];        // 32 bits
    short int c[N], d[N], diff[N];    // 16 bits, packed
    for (i = 0; i < N; ++i) {
        sum[i] = a[i] + b[i];
        diff[i] = c[i] - d[i];
    }
• Instruction Level Parallelism (ILP) - DSP
• Subword Parallelism (MMX) - DSP
• Data Parallelism (DP) - Vector Processor
• DP can be decreased by increasing ILP and MMX
  • Example: loop unrolling
[Figure: the three forms of parallelism (DP, ILP, MMX)]
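To make the subword (MMX-style) parallelism on this slide concrete, the following C sketch performs two 16-bit subtractions of the `diff[i] = c[i] - d[i]` kind inside one 32-bit word. The helper names `pack16` and `sub16x2` are hypothetical, and this is plain portable C emulating a packed instruction, not the instruction set of any particular DSP.

```c
#include <stdint.h>

/* Pack two 16-bit lanes into one 32-bit word:
 * hi goes to bits 31..16, lo to bits 15..0. */
static uint32_t pack16(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | (uint32_t)lo;
}

/* Lane-wise c - d on both 16-bit lanes, each wrapping modulo 2^16.
 * The lanes are computed separately so that a borrow in the low lane
 * cannot leak into the high lane. */
static uint32_t sub16x2(uint32_t c, uint32_t d)
{
    uint32_t lo = (c - d) & 0x0000FFFFu;          /* low lane only     */
    uint32_t hi = ((c >> 16) - (d >> 16)) << 16;  /* high lane, shifted */
    return hi | lo;
}
```

One call to `sub16x2` replaces two scalar loop iterations, which is how packing trades data parallelism for subword parallelism.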

Stream Processors : multi-cluster DSPs
[Figure: a VLIW DSP (1 cluster) with internal memory, a microcontroller, and adders (+) and multipliers (*) exploiting ILP and MMX, next to a stream processor with a Stream Register File (SRF) as memory, a microcontroller, and many such clusters exploiting DP]
• Identical clusters, same operations
• Adapt clusters to DP
• Power-down unused FUs, clusters

Outline
Contribution #1
• Mapping algorithms to stream processors (parallel, fixed point)
• Tradeoffs between packing, ALU utilization and memory
• Reduced inter-cluster communication network

Packing
• Packing introduced around 1996 for exploiting subword parallelism
  • Intel MMX
  • Subword parallelism never looked back
  • Integrated into all current microprocessors and DSPs
• SIMD + MMX : Stream processor / Vector IRAM : 2000+
  • Relatively new concept
  • Not necessarily useful in SIMD processors
  • May add to inter-cluster communication

Packing may not be useful
Algorithm:
    short a[8];
    int y[8];
    for (i = 0; i < 8; ++i)
        y[i] = a[i] * a[i];
[Figure: a = (1, 2, 3, 4, 5, 6, 7, 8) is packed as p = (1, 3, 5, 7) and q = (2, 4, 6, 8). The steps shown are: multiplication, re-ordering data, add, re-ordering data]
• Packing uses odd-even grouping

Data re-ordering in memory
• Matrix transpose
  • Common in wireless communication systems
  • Column access to data expensive
• Re-ordering data inside the ALUs
  • Faster
  • Lower power

Trade-offs during memory re-ordering
[Figure: timelines of ALU and memory activity (intervals $t_1$, $t_2$, $t_3$) for three cases]
(a) Transpose in memory: $t = t_2 + t_{stalls}$, with $0 < t_{stalls} < t_{mem}$
(b) Transpose in memory: $t = t_2$
(c) Transpose in ALUs: $t = t_2 + t_{alu}$

Transpose uses odd-even grouping
[Figure: odd-even grouping maps the N x M matrix IN (blocks A, B, C, D) to OUT, halving M to M/2]
Repeat LOG(M) times { IN = OUT; }
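The repeated odd-even grouping on this slide can be sketched in C as follows. The function names and the row-major layout are assumptions of this sketch, and it takes M (the number of columns) to be a power of two, with LOG meaning log base 2. Each pass moves the even-indexed elements of the flat array to the front and the odd-indexed ones to the back; after log2(M) passes the array holds the transpose.

```c
#include <stddef.h>
#include <string.h>

/* One odd-even grouping pass over a flat array of even length:
 * even-indexed elements first, then odd-indexed elements. */
static void odd_even_group(const int *in, int *out, size_t len)
{
    size_t half = len / 2;
    for (size_t i = 0; i < half; ++i) {
        out[i]        = in[2 * i];
        out[half + i] = in[2 * i + 1];
    }
}

/* Transpose a rows x cols row-major matrix held in buf into a
 * cols x rows row-major matrix, in place, using log2(cols) passes.
 * cols must be a power of two; scratch must hold rows * cols ints. */
static void transpose_by_grouping(int *buf, int *scratch,
                                  size_t rows, size_t cols)
{
    size_t len = rows * cols;
    for (size_t m = cols; m > 1; m /= 2) {
        odd_even_group(buf, scratch, len);       /* OUT = group(IN) */
        memcpy(buf, scratch, len * sizeof *buf); /* IN = OUT        */
    }
}
```

For a 2 x 4 matrix stored as 0 1 2 3 4 5 6 7, the first pass gives 0 2 4 6 1 3 5 7 and the second gives 0 4 1 5 2 6 3 7, which is the 4 x 2 transpose.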

ALU Bandwidth > Memory Bandwidth
[Figure: execution time (cycles, log scale) versus matrix size (32x32, 64x64, 128x128) for three cases: transpose in memory ($t_{mem}$) with DRAM at 8 cycles, transpose in memory ($t_{mem}$) with DRAM at 3 cycles, and transpose in ALU ($t_{alu}$)]

Viterbi needs odd-even grouping
[Figure: state ordering for add-compare-select (ACS). Regular ACS processes X(0), X(1), ..., X(15) in order; ACS in SWAPs uses the DP vector order X(0), X(2), ..., X(14), X(1), X(3), ..., X(15)]
Exploiting Viterbi DP in SWAPs:
• Use register exchange (RE) instead of regular traceback
• Re-order ACS, RE

Performance of Viterbi decoding
[Figure: frequency needed to attain real-time (in MHz, log scale) versus number of clusters (1 to 100, log scale) for constraint lengths K = 5, K = 7, and K = 9, with the DSP and Max DP points marked]
• Ideal C64x (w/o co-proc) needs ~200 MHz for real-time

Pattern in inter-cluster comm
• Broadcasting
  • Matrix-vector multiplication, matrix-matrix multiplication, outer product updates
• Odd-even grouping
  • Transpose, packing, Viterbi decoding

Odd-even grouping
[Figure: 4 clusters holding data 0/4, 1/5, 2/6, 3/7; odd-even grouping maps 0 1 2 3 4 5 6 7 to 0 2 4 6 1 3 5 7]
• Inter-cluster communication
  • Entire chip length
  • Limits clock frequency
  • Limits scaling
• O(C^2) wires, O(C^2) interconnections, 8 cycles

A reduced inter-cluster comm network
[Figure: 4 clusters holding data 0/4, 1/5, 2/6, 3/7, connected through multiplexers, registers (pipelining), and demultiplexers, with broadcasting support and odd-even grouping]
• O(C log(C)) wires, O(C) interconnections, 8 cycles
• Only nearest neighbor interconnections

Outline
Contribution #2 : Power-efficiency
"High performance is low power" - Mark Horowitz

Flexibility needed in workloads
[Figure: operation count (in GOPs, 0 to 25) versus (users, constraint length) for (4,7), (4,9), (8,7), (8,9), (16,7), (16,9), (32,7), (32,9), comparing a 2G base-station (16 Kbps/user) with a 3G base-station (128 Kbps/user)]
• Note: GOPs refer only to arithmetic computations
• Billions of computations per second needed
• Workload varies from ~1 GOPs for 4 users, constraint length 7 Viterbi to ~23 GOPs for 32 users, constraint length 9 Viterbi

Flexibility affects Data Parallelism*
*Data Parallelism is defined as the parallelism available after subword packing and loop unrolling
U - users, K - constraint length, N - spreading gain, R - decoding rate

Adapting #clusters to Data Parallelism
[Figure: four clusters (C) connected through an adaptive multiplexer network, shown in four configurations: no reconfiguration, 4:2 reconfiguration, 4:1 reconfiguration, and all clusters off]
• Clusters are turned off using voltage gating to eliminate static and dynamic power dissipation
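One way to read the configurations on this slide is as a policy that halves the number of active clusters until it no longer exceeds the available data parallelism. The C sketch below illustrates that reading only; the function name `active_clusters` and the halving rule are assumptions of this sketch, not the controller described in the thesis.

```c
/* Number of clusters to keep powered for a given amount of data
 * parallelism (dp) on a machine with `total` clusters.
 * Hypothetical policy: halve the active count (4 -> 2 -> 1, as in the
 * 4:2 and 4:1 reconfigurations) until it fits within dp; with no data
 * parallelism at all, every cluster is turned off. */
static unsigned active_clusters(unsigned dp, unsigned total)
{
    if (dp == 0u)
        return 0u;              /* all clusters off */
    unsigned c = total;
    while (c > dp)
        c /= 2u;                /* 4:2, then 4:1, ... */
    return c;
}
```

On a 4-cluster machine this gives 4, 2, 1, and 0 active clusters for a data parallelism of 4, 2, 1, and 0 respectively; the remaining clusters would be the candidates for voltage gating.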

Cluster utilization variation
[Figure: cluster utilization (0 to 100) versus cluster index (0 to 30) on a 32-cluster processor, in four panels: (4,7) and (4,9); (8,7) and (8,9); (16,7) and (16,9); (32,7) and (32,9)]
(32, 9) = 32 users, constraint length 9 Viterbi

Frequency variation
[Figure: real-time frequency (in MHz, 0 to 1200), broken down into busy, microcontroller (uC) stall, and memory stall components, for (4,7), (4,9), (8,7), (8,9), (16,7), (16,9), (32,7), (32,9)]

Operation
• Dynamic voltage-frequency scaling
  • when the system changes significantly (users, data rates, ...)
  • coarse time scale (every few seconds)
• Turn off clusters
  • when parallelism changes significantly
  • memory operations
  • exceed real-time requirements
  • finer time scales (100s of microseconds)

Power : Voltage Gating & Scaling
• Power can change from 12.38 W to 300 mW depending on workload changes

Outline
Contribution #3 : Design exploration
• How many adders, multipliers, clusters, and what clock frequency
• Quickly predict real-time performance

Deciding ALUs vs. clock frequency
• No independent variables
  • Clusters, ALUs, frequency, voltage (c,a,m,f)
  • Trade-offs exist
• How to find the right combination for lowest power?

Static design exploration
[Figure: execution time split into a static part (computations) and a dynamic part (memory stalls, microcontroller stalls)]
• Also helps in quickly predicting real-time performance

Sensitivity analysis important
• We have a capacitance model [Khailany2003]
• Not all equations are exact
• Need to see how variations affect solutions

Design exploration methodology
• 3 types of parallelism: ILP, MMX, DP
• For best performance (power)
  • Maximize the use of all
• Maximize ILP and MMX at the expense of DP
  • Loop unrolling, packing
  • Schedule on a sufficient number of adders/multipliers
• If DP remains, use clusters = DP
  • No other way to exploit that parallelism

Setting clusters, adders, multipliers
• If sufficient DP, linear decrease in frequency with clusters
  • Set clusters depending on DP and execution time estimate
• To find adders and multipliers:
  • Let the compiler schedule algorithm workloads across different numbers of adders and multipliers and let it find the execution time
• Put all numbers in the power equation
  • Compare the increase in capacitance due to added ALUs and clusters with the benefits in execution time
• Choose the solution that minimizes the power

Design exploration
For sufficiently large #adders, #multipliers per cluster:
• Explore Algorithm 1 : 32 clusters (t1)
• Explore Algorithm 2 : 64 clusters (t2)
• Explore Algorithm 3 : 64 clusters (t3)
• Explore Algorithm 4 : 16 clusters (t4)
[Figure axes: DP, ILP]

Clusters: frequency and power
[Figure: two plots for the 3G workload. One shows real-time frequency (MHz, log scale) versus number of clusters; the other shows normalized power (0 to 1) versus number of clusters for Power ∝ f, Power ∝ f^2, and Power ∝ f^3]
• 32 clusters at frequency = 836.692 MHz (p = 1)
• 64 clusters at frequency = 543.444 MHz (p = 2)
• 64 clusters at frequency = 543.444 MHz (p = 3)
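As a rough consistency check on the operating points quoted on this slide, the C sketch below ranks them under Power ∝ clusters × f^p. The linear growth of switched capacitance with cluster count and the function name `relative_power` are assumptions made here for illustration; this is not the capacitance model from [Khailany2003].

```c
/* Relative power of a configuration, in arbitrary units, under the
 * simplified assumption power ~ clusters * f^p.
 * clusters: number of clusters (stands in for switched capacitance)
 * freq_mhz: real-time clock frequency in MHz
 * p:        exponent of the power-frequency dependence (1, 2 or 3) */
static double relative_power(unsigned clusters, double freq_mhz, int p)
{
    double power = (double)clusters;
    for (int i = 0; i < p; ++i)
        power *= freq_mhz;
    return power;
}
```

Under this simplification, 32 clusters at 836.692 MHz comes out lower than 64 clusters at 543.444 MHz for p = 1, while the 64-cluster point comes out lower for p = 2 and p = 3, which matches the ordering quoted on the slide.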

ALU utilization with frequency
[Figure: real-time frequency (in MHz) versus #adders (1 to 5) and #multipliers (1 to 3) per cluster for the 3G workload, with each point labeled by its FU utilization (+, *)]

Power variations with f and 
