A programmable communications processor for future wireless systems

Sridhar Rajagopal, Scott Rixner, Joseph R. Cavallaro, Behnaam Aazhang
This work has been supported by Nokia, TI, TATP and NSF

Overview of research at Rice
• Center for Multimedia Communications (http://cmc.rice.edu)
  • Behnaam Aazhang (wireless communications)
  • Joseph R. Cavallaro (VLSI signal processing)
• Computer Architecture (http://www.cs.rice.edu/CS/Architecture)
  • Scott Rixner (Microprocessor architecture)
  • Vijay Pai (Simulators, Network Processors)

Motivation
[Figure: wireless mobile device with RF unit, A/D and D/A converters, and a programmable baseband communications processor]
• Mobile: switch between standards and between parameters
• Base-station: varying number of users with different parameters
• Programmability: flexibility is good

Motivation
[Figure: performance versus flexibility for GPP, DSP, FPGA and VLSI]

Lower bounds on + and * for a 500 MHz system
[Figure: estimation, detection and decoding in a W-CDMA multiuser system. Adders/multipliers required to meet real-time (log scale, 10^0 to 10^3) versus number of users (0 to 300). Add and Mul curves for fast fading (estimation every 10 bits), medium fading (estimation every 100 bits) and slow fading (estimation every 1000 bits), for different data rates]
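A lower bound of this kind can be reproduced with simple arithmetic: divide the operations needed per second by the clock rate. The sketch below assumes each functional unit completes one operation per cycle. The operation count in the usage line is a hypothetical placeholder, not a number taken from the slides.

```python
def min_functional_units(ops_per_bit, bit_rate_bps, clock_hz):
    """Lower bound on the number of units of one type (adders or multipliers).

    Assumes one operation per unit per cycle and perfect scheduling,
    so a real design needs at least this many units.
    """
    ops_per_second = ops_per_bit * bit_rate_bps
    # Integer ceiling division avoids floating-point rounding.
    return -(-ops_per_second // clock_hz)


# Hypothetical workload: 20,000 additions per decoded bit at 128 kbps, 500 MHz clock.
adders_needed = min_functional_units(20_000, 128_000, 500_000_000)
print(adders_needed)
```

The bound grows with the number of users and with the fading rate because both raise the operation count per bit.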

The Problem • Algorithms well understood at data-flow level • Can design real-time systems in VLSI. • Pushing implementation higher in the chain • Current DSPs not powerful enough for our application • Use an architecture simulator to design our own

Proposed solution
• Current solutions to meet real-time: racks of DSPs
• Future wireless architectures: programmable processor for 4G wireless systems, < x cm by < x cm
• x = 2.5 (W-CDMA BS), x = 2.0 (W-LAN BS), x = 1.5 (Mobile Handset)

Ph.D. Thesis Outline
[Figure: design flow chart. Boxes: Algorithm (in Matlab); New Algorithm; Characteristics? Complexity?; Operation Count; Parallelize?; Fixed point?; Real-Time (Area/Power) Requirements; Parameter-free Architecture Design; Processor Architecture Parameters (# functional units, # registers, # memory ...); New architecture design; Compiler; Architecture Synthesizer; Architecture Code. Part of the flow is labeled Future Work]

Advantages of this solution • Fast and smooth transition to future standards that simultaneously meets real-time and other constraints • Avoids re-designing the system from scratch • Joint algorithm–architecture hardware-software co-design • Matlab code can be re-used when new standards are being designed. • Tries to account for data rate increases and future algorithm changes

Past research contributions
[Figure: timeline of contributions]
• Distant past: Algorithms (multiuser channel estimation, multiuser detection)
• Recent past: VLSI and FPGA (task-partitioning, parallelism, pipelining, system design, conventional arithmetic, on-line arithmetic)
• Recent and near future: DSP and IMAGINE (architecture innovations, functional unit design and usage)

Contents • Motivation • Parallel algorithms for estimation/detection/decoding • The “Imagine” simulator • Performance comparisons and results

Typical workload representation (Base-station)
• Wireless LAN
  • Equalization?
  • FFT
  • Viterbi decoding
• W-CDMA
  • Multiuser channel estimation
  • Multiuser detection
  • Viterbi decoding
• Advanced receiver schemes
  • Turbo decoding
  • Multiple antenna systems (MIMO)

Parallel estimation/detection/decoding
• Multiuser estimation
  • Replaced matrix inversion by gradient descent
• Multiuser detection
  • Parallel Interference Cancellation (PIC)
  • Pipelined algorithm that avoids block-based detection
• Viterbi decoding
  • Trellis structures suited for decoding
  • Register exchange for survivor memory
  • No traceback latency
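Replacing a matrix inversion by gradient descent can be illustrated with a minimal sketch. It assumes a real-valued, symmetric positive definite system R x = b and a fixed step size. The function name, the example matrix and the step size are illustrative assumptions, not the estimator used in this work.

```python
def gradient_descent_solve(R, b, step, iterations):
    """Approximate the solution of R x = b without inverting R.

    R must be symmetric positive definite, and step must be smaller than
    2 / (largest eigenvalue of R) for the iteration to converge.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        # The residual b - R x is the negative gradient of 0.5 x'Rx - b'x.
        residual = [b[i] - sum(R[i][j] * x[j] for j in range(n)) for i in range(n)]
        # Each element update is independent, so it maps onto parallel units.
        x = [x[i] + step * residual[i] for i in range(n)]
    return x


# Small illustrative system (hypothetical values); exact solution is (1/11, 7/11).
R = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = gradient_descent_solve(R, b, step=0.2, iterations=200)
print(x)
```

Each iteration uses only multiply-accumulate operations, which suits an architecture built from adders and multipliers better than the divisions a direct inversion needs.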

Estimation/Detection (64,32 sizes)
• Multiuser estimation: Kernels 1, 2, 3
• Massaging matrices for detection: Kernels 4, 5
• Multiuser detection: Kernels 6, 7

Trellis for rate ½ code with K = 5
[Figure: three 16-state trellis layouts over states X(0) to X(15): a. Unsuitable trellis, b. Suitable trellis, c. Shuffled suitable trellis]
• Upper bound on parallel clusters for good FU utilization: N/2k
• Maximum 8 parallel units for rate ½ with 16 states

Survivor Management in Viterbi
• Two techniques
  • Traceback: commonly used
  • Register exchange
• Traceback is good for VLSI architectures
  • Drawback: sequential and additional latency
• Register exchange is good for programmable solutions
  • Parallel updates
  • Packing decoded bits in the register needs to access the entire register
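Register exchange can be sketched as follows: every state carries its own copy of the decoded bits so far, and the winning predecessor's copy is taken over at each step, so no traceback pass is needed at the end. This is a minimal hard-decision sketch for a rate ½, K = 5 code. The generator polynomials (octal 23 and 35) and the function names are assumptions for illustration and are not taken from the slides.

```python
def parity(x):
    return bin(x).count("1") & 1


def encode(bits, K, gens):
    """Rate 1/len(gens) convolutional encoder starting in state 0."""
    state, out = 0, []
    for bit in bits:
        reg = (bit << (K - 1)) | state
        out.extend(parity(reg & g) for g in gens)
        state = reg >> 1
    return out


def viterbi_register_exchange(received, K, gens):
    """Hard-decision Viterbi decoder with register-exchange survivors."""
    n_states = 1 << (K - 1)
    inf = float("inf")
    metrics = [0] + [inf] * (n_states - 1)      # encoder starts in state 0
    survivors = [[] for _ in range(n_states)]   # one survivor register per state
    n_out = len(gens)
    for t in range(0, len(received), n_out):
        symbol = received[t:t + n_out]
        new_metrics = [inf] * n_states
        new_survivors = [None] * n_states
        for s in range(n_states):               # states update independently
            bit = s >> (K - 2)                   # input bit that leads into state s
            low = (s & ((1 << (K - 2)) - 1)) << 1
            for prev in (low, low | 1):          # the two predecessor states
                reg = (bit << (K - 1)) | prev
                expected = [parity(reg & g) for g in gens]
                m = metrics[prev] + sum(e != r for e, r in zip(expected, symbol))
                if m < new_metrics[s]:
                    new_metrics[s] = m
                    # Register exchange: copy the winner's whole register.
                    new_survivors[s] = survivors[prev] + [bit]
        metrics, survivors = new_metrics, new_survivors
    best = min(range(n_states), key=lambda s: metrics[s])
    return survivors[best]


K = 5
GENS = (0b10011, 0b11101)                        # octal 23, 35 (assumed)
message = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0] + [0] * (K - 1)  # tail bits
coded = encode(message, K, GENS)
decoded = viterbi_register_exchange(coded, K, GENS)
```

The inner loop over states has no dependence between iterations, which is what makes the scheme attractive for parallel clusters. The price is that every step copies a full survivor register per state.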

The IMAGINE architecture
[Figure: Imagine stream processor. Host processor, stream controller, microcontroller, stream register file, ALU clusters 0 to 7, network interface to the network, and a streaming memory system with four SDRAM banks]

Why IMAGINE simulator? • RSIM, SimpleScalar: GPP simulators • Great for media processing algorithms • Has a VLIW-based cluster -- DSP comparisons • A good base architecture : 1024-pt FFT

Simulator knobs that we can turn • Cycle-accurate simulator • Varying number of Functional units and their design • Varying memory, register sizes • Graphical tools to investigate FU utilization, bottlenecks, memory stalls, communication overhead … • Almost anything can be changed, some changes easier than others!

Programming Imagine
• 2 level C++ programming
• StreamC:
  • transfers streams of data between main memory and stream register file (SRF)
• KernelC:
  • transfers streams from the SRF to the ALU clusters
• Code optimized to the number of ALU clusters and the size of the data
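The two-level split can be pictured with a toy model. The Python below is not StreamC or KernelC syntax. It only mimics the division of labor, with a stream-level routine that stages data in a stand-in for the SRF and a kernel-level routine that does the arithmetic on each cluster's share. All names are hypothetical, and the interleaved assignment of elements to clusters is an assumption.

```python
NUM_CLUSTERS = 8  # the architecture figure shows ALU clusters 0 to 7


def kernel_scale_add(a_chunk, b_chunk, gain):
    """Kernel level: the arithmetic one ALU cluster performs on its elements."""
    return [gain * a + b for a, b in zip(a_chunk, b_chunk)]


def run_stream_program(a, b, gain):
    """Stream level: stage input streams, run the kernel per cluster, store output."""
    srf = {"a": list(a), "b": list(b)}       # stand-in for the stream register file
    out = [0] * len(srf["a"])
    for cluster in range(NUM_CLUSTERS):
        # Assumed interleaving: cluster c gets elements c, c + 8, c + 16, ...
        idx = range(cluster, len(out), NUM_CLUSTERS)
        result = kernel_scale_add(
            [srf["a"][i] for i in idx], [srf["b"][i] for i in idx], gain
        )
        for i, r in zip(idx, result):
            out[i] = r
    return out


result = run_stream_program(range(16), [1] * 16, gain=2)
```

A stream length that is a multiple of the cluster count keeps every cluster equally busy, which is one reason code ends up tuned to both the number of clusters and the data size.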

Kernel 2 (mmult) for 3 +, 2 *
• Adders have limited FU utilization
• O(N^3) *, O(N^3) +
• Multipliers 100% in loop
• Divider not being utilized
• Replace / with *
[Figure: functional unit schedule over time with the loop marked. Legend: communication (waiting for input); FU unavailable (input ready but FU busy)]

Kernel 2 (mmult) for 3 +, 3 *
• Better adder utilization
• Needs sufficient registers for scaling [register allocation may fail]
• Code may also need slight tuning of variables for optimization
[Figure: functional unit schedule over time]
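The change in adder utilization between the two mmult configurations can be estimated with an idealized count. The sketch assumes a textbook N x N matrix product (N^3 multiplies and N^2 (N - 1) additions), one operation per unit per cycle, and perfect scheduling. It ignores communication stalls and register pressure, so measured utilization will be lower.

```python
def ideal_utilization(n, adders, multipliers):
    """Best-case (adder, multiplier) utilization for an n x n matrix product."""
    mults = n * n * n
    adds = n * n * (n - 1)
    # The busier unit type sets the minimum number of cycles.
    cycles = max(-(-mults // multipliers), -(-adds // adders))
    return adds / (adders * cycles), mults / (multipliers * cycles)


add_util_2mul, mul_util_2mul = ideal_utilization(32, adders=3, multipliers=2)
add_util_3mul, mul_util_3mul = ideal_utilization(32, adders=3, multipliers=3)
print(add_util_2mul, mul_util_2mul)
print(add_util_3mul, mul_util_3mul)
```

With two multipliers the multipliers are the bottleneck and the adders idle for part of the loop. A third multiplier balances the two unit types.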

Kernel computational time
Time available at 128 Kbps for each of 32 users at 500 MHz: about 4000 cycles
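The cycle budget follows from dividing the clock rate by the per-user bit rate. The short calculation below reads "time available" as clock cycles per bit period and takes 1 Kbps as 1000 bits per second. Both are assumptions about how the slide's figure was obtained.

```python
def cycles_per_bit(clock_hz, bit_rate_bps):
    """Clock cycles that elapse during one bit period."""
    return clock_hz / bit_rate_bps


budget = cycles_per_bit(500e6, 128e3)
print(budget)
```

All kernels for the 32 users must finish inside this window to keep up with the incoming data.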

[Figure: execution timeline. Legend: memory operations; kernels (micro-controller executing); initialization; idle time between kernels; communication overhead]

Comparisons with TI C6701 DSPs
[Figure: execution time in seconds (log scale, 10^-6 to 10^-2) versus number of users (0 to 35). Curves: single DSP implementation; 2 DSP implementation; our architecture based on Imagine, with increasing functional units; target data rate of 128 Kbps/user. Annotation: Efficiency = ?]

Future work • Real-time design possible with larger number of functional units but efficiency is the key • Eliminating communication stalls between kernels • Support for matrix transposes and bit-level operations • Power and area constraints • Scalability with data rates – Boundaries of architecture • Handset algorithms

Conclusions
• Various programmable architectures can be investigated and implemented quickly for future systems, depending on algorithms, time, area and power constraints
• The insights gained from the design can be applied to DSPs and other processors with constraints on time, area and power
http://www.ece.rice.edu/~sridhar/
sridhar@rice.edu
