A programmable communications processor for future wireless systems
Sridhar Rajagopal
Scott Rixner, Joseph R. Cavallaro, Behnaam Aazhang
This work has been supported by Nokia, TI, TATP and NSF
Overview of research at Rice
• Center for Multimedia Communications (http://cmc.rice.edu)
  • Behnaam Aazhang (wireless communications)
  • Joseph R. Cavallaro (VLSI signal processing)
• Computer Architecture (http://www.cs.rice.edu/CS/Architecture)
  • Scott Rixner (microprocessor architecture)
  • Vijay Pai (simulators, network processors)
Motivation
[Figure: block diagram of a wireless mobile device with an RF unit, A/D and D/A converters, and a baseband programmable communications processor]
• Mobile: switch between standards and between parameters
• Base-station: varying number of users with different parameters
• Programmability: flexibility is good
Motivation
[Figure: performance versus flexibility for GPP, DSP, FPGA and VLSI implementations]
Lower bounds on + and * for a 500 MHz system
[Figure: estimation, detection and decoding in a W-CDMA multiuser system. Vertical axis: adders/multipliers required to meet real-time (log scale, 10^0 to 10^3). Horizontal axis: number of users (0 to 300). Add and Mul curves are shown across data rates for fast fading (estimation every 10 bits), medium fading (estimation every 100 bits) and slow fading (estimation every 1000 bits).]
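The idea behind such a lower bound can be sketched as a short calculation. This is a minimal sketch, assuming each functional unit retires one operation per cycle and that the cost of one channel-estimation update is spread over the bits between updates. The function names and every operation count below are hypothetical placeholders, not figures from the deck; only the 500 MHz clock and the three update intervals come from the slide.

```python
import math

def ops_per_bit(detect_ops, estimate_ops, update_interval_bits):
    """Amortize one estimation update over the bits between updates."""
    return detect_ops + estimate_ops / update_interval_bits

def min_functional_units(ops_each_bit, bit_rate_bps, clock_hz):
    """Lower bound on units of one type, assuming one op per unit per cycle."""
    ops_per_second = ops_each_bit * bit_rate_bps
    return math.ceil(ops_per_second / clock_hz)

CLOCK_HZ = 500e6          # from the slide
BIT_RATE_BPS = 128e3      # assumption: the per-user target rate quoted later in the deck
DETECT_MULS = 2_000       # hypothetical multiplies per detected bit
ESTIMATE_MULS = 400_000   # hypothetical multiplies per estimation update

# Update intervals from the slide: fast, medium and slow fading.
for label, interval in [("fast", 10), ("medium", 100), ("slow", 1000)]:
    muls = ops_per_bit(DETECT_MULS, ESTIMATE_MULS, interval)
    print(label, min_functional_units(muls, BIT_RATE_BPS, CLOCK_HZ))
```

The bound ignores memory stalls, data dependencies and communication, so a real design needs more units than it reports.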
The Problem
• Algorithms well understood at data-flow level
• Can design real-time systems in VLSI
• Pushing implementation higher in the chain
• Current DSPs not powerful enough for our application
• Use an architecture simulator to design our own
Proposed solution
[Figure: current solutions to meet real-time (racks of DSPs) contrasted with future wireless architectures: a programmable processor for 4G wireless systems measuring less than x cm by x cm]
• x = 2.5 (W-CDMA BS)
• x = 2.0 (W-LAN BS)
• x = 1.5 (Mobile Handset)
Ph.D. Thesis Outline
[Figure: design-flow diagram for a new architecture design. Box labels: New Algorithm; Algorithm (in Matlab); Characteristics? (Complexity? Operation Count, Parallelize? Fixed point?); Real-Time (Area/Power) Requirements; Parameter-free Architecture Design; Processor Architecture Parameters (# Functional units, # registers, # memory ....); Compiler; Architecture Synthesizer; Architecture; Code; Future Work]
Advantages of this solution
• Fast and smooth transition to future standards that simultaneously meets real-time and other constraints
• Avoids re-designing the system from scratch
• Joint algorithm–architecture hardware-software co-design
• Matlab code can be re-used when new standards are being designed
• Tries to account for data rate increases and future algorithm changes
Past research contributions
[Figure: timeline of work on multiuser channel estimation and multiuser detection. Period labels: Distant Past; Recent Past; Recent and Near Future. Topic labels: Algorithms; VLSI; FPGA; DSP; IMAGINE; System Design; Task-partitioning; Parallelism; Pipelining; Conventional arithmetic; On-line arithmetic; Architecture innovations; Functional unit design and usage]
Contents
• Motivation
• Parallel algorithms for estimation/detection/decoding
• The “Imagine” simulator
• Performance comparisons and results
Typical workload representation (Base-station)
• Wireless LAN
  • Equalization?
  • FFT
  • Viterbi decoding
• W-CDMA
  • Multiuser channel estimation
  • Multiuser detection
  • Viterbi decoding
• Advanced receiver schemes
  • Turbo decoding
  • Multiple antenna systems (MIMO)
Parallel estimation/detection/decoding
• Multiuser estimation
  • Replaced matrix inversion by gradient descent
• Multiuser detection
  • Parallel Interference Cancellation (PIC)
  • Pipelined algorithm that avoids block-based detection
• Viterbi decoding
  • Trellis structures suited for decoding
  • Register exchange for survivor memory
  • No traceback latency
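Replacing a matrix inversion with gradient descent can be illustrated on a small linear system. The deck does not give the update rule it uses, so this is a minimal sketch of the textbook fixed-step iteration x ← x + μ(b − Ax) for a symmetric positive-definite A. The matrix, right-hand side, step size and function names are illustrative assumptions.

```python
def matvec(A, x):
    """Dense matrix-vector product using plain lists."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def gradient_descent_solve(A, b, step, iterations):
    """Approximate the solution of A x = b without forming the inverse of A.

    Converges for symmetric positive-definite A when
    0 < step < 2 / (largest eigenvalue of A).
    """
    x = [0.0] * len(b)
    for _ in range(iterations):
        residual = [bi - axi for bi, axi in zip(b, matvec(A, x))]
        x = [xi + step * ri for xi, ri in zip(x, residual)]
    return x

# Illustrative 2x2 system; its exact solution is [1/11, 7/11].
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = gradient_descent_solve(A, b, step=0.2, iterations=200)
print(x)
```

Each iteration needs only multiplies and adds, which suits an architecture whose divider would otherwise sit idle, and the iteration can keep tracking the solution as A and b are updated.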
Estimation/Detection (64,32 sizes)
• Multiuser estimation: Kernels 1, 2, 3
• Massaging matrices for detection: Kernels 4, 5
• Multiuser detection: Kernels 6, 7
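The multiuser detection kernels implement parallel interference cancellation (PIC). A minimal sketch of multistage hard-decision PIC follows, assuming a synchronous model in which the matched-filter outputs are y = Rb + n with a cross-correlation matrix R of unit diagonal. The pipelining that the deck uses to avoid block-based detection is not modeled, and the matrix, noise and names are illustrative.

```python
def sign(v):
    """Hard bit decision."""
    return 1 if v >= 0 else -1

def pic_detect(y, R, stages):
    """Multistage parallel interference cancellation with hard decisions."""
    users = len(y)
    bits = [sign(v) for v in y]  # stage 0: plain matched-filter decisions
    for _ in range(stages):
        # Every user subtracts the interference rebuilt from the previous
        # stage's decisions; the updates are independent and can run in parallel.
        bits = [
            sign(y[k] - sum(R[k][j] * bits[j] for j in range(users) if j != k))
            for k in range(users)
        ]
    return bits

# Three users with true bits [1, -1, 1]; noise of +0.5 on user 1 is an assumption.
R = [[1.0, 0.3, 0.3], [0.3, 1.0, 0.3], [0.3, 0.3, 1.0]]
y = [1.0, 0.1, 1.0]
print(pic_detect(y, R, stages=0))  # matched filter only
print(pic_detect(y, R, stages=2))  # after two cancellation stages
```

In this example the matched filter alone gets user 1 wrong, and the cancellation stages recover the transmitted bits.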
Trellis for rate ½ code with K = 5
[Figure: three trellis layouts over states X(0) to X(15): a. Unsuitable Trellis, b. Suitable Trellis, c. Shuffled Suitable Trellis]
• Upper bound on parallel clusters for good FU utilization: N/2k
• Maximum 8 parallel units for rate ½ with 16 states
Survivor Management in Viterbi
• Two techniques
  • Traceback: commonly used
  • Register exchange
• Traceback is good for VLSI architectures
  • Drawback: sequential, with additional latency
• Register exchange is good for programmable solutions
  • Parallel updates
  • Packing decoded bits in the register needs to access the entire register
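Register exchange can be sketched with a small hard-decision Viterbi decoder for a rate ½, K = 5 code. This is a sketch under stated assumptions: the generator polynomials (octal 23 and 35) are assumed because the deck does not name them, the encoder is terminated with K − 1 zero bits, and all function names are illustrative. Each state carries its whole survivor path packed into an integer, so the decoded bits are read directly from a register and no traceback pass is needed.

```python
K = 5                 # constraint length (from the deck)
N = 1 << (K - 1)      # 16 trellis states
G0, G1 = 0o23, 0o35   # ASSUMED generators; not given in the deck

def parity(x):
    return bin(x).count("1") & 1

def branch_output(u, state):
    """Two coded bits for input u leaving `state` (newest bit on top)."""
    reg = (u << (K - 1)) | state
    return parity(reg & G0), parity(reg & G1)

def encode(bits):
    state, out = 0, []
    for u in bits:
        out.append(branch_output(u, state))
        state = ((u << (K - 1)) | state) >> 1
    return out

def decode(received):
    """Hard-decision Viterbi decoding with register-exchange survivors."""
    INF = float("inf")
    metrics = [0] + [INF] * (N - 1)   # encoder starts in state 0
    registers = [0] * N               # packed survivor bits, one register per state
    for r0, r1 in received:
        new_metrics = [INF] * N
        new_registers = [0] * N
        for ns in range(N):           # independent per state: parallel updates
            u = ns >> (K - 2)         # the input bit is the top bit of the new state
            for lsb in (0, 1):
                prev = ((ns << 1) | lsb) & (N - 1)
                c0, c1 = branch_output(u, prev)
                cand = metrics[prev] + (c0 != r0) + (c1 != r1)
                if cand < new_metrics[ns]:
                    new_metrics[ns] = cand
                    # Exchange: copy the winner's whole register, append u.
                    new_registers[ns] = (registers[prev] << 1) | u
        metrics, registers = new_metrics, new_registers
    steps = len(received)
    best = registers[0]               # terminated code ends in state 0
    return [(best >> (steps - 1 - i)) & 1 for i in range(steps)]

message = [1, 0, 1, 1, 0, 0, 1, 0]
tail = [0] * (K - 1)
symbols = encode(message + tail)
symbols[3] = (symbols[3][0] ^ 1, symbols[3][1])  # inject one channel bit error
decoded = decode(symbols)
print(decoded[:len(message)] == message)
```

The copy of a full register on every update is the cost the last bullet refers to; a practical decoder would bound the register length and emit the oldest bits as it goes.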
The IMAGINE architecture
[Figure: Imagine Stream Processor block diagram with four SDRAM banks, the streaming memory system, stream controller, host processor, network and network interface, stream register file, microcontroller, and ALU clusters 0 to 7]
Why IMAGINE simulator?
• RSIM, SimpleScalar: GPP simulators
• Great for media processing algorithms
• Has a VLIW-based cluster, which allows DSP comparisons
• A good base architecture: 1024-pt FFT
Simulator knobs that we can turn
• Cycle-accurate simulator
• Varying number of functional units and their design
• Varying memory and register sizes
• Graphical tools to investigate FU utilization, bottlenecks, memory stalls, communication overhead …
• Almost anything can be changed; some changes are easier than others!
Programming Imagine
• Two-level C++ programming
• StreamC:
  • transfers streams of data between main memory and the stream register file (SRF)
• KernelC:
  • transfers streams from the SRF to the ALU clusters
• Code is optimized to the number of ALU clusters and the size of the data
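The two-level split can be mimicked in ordinary Python to show how the work is divided. This is only a conceptual analogy, not StreamC or KernelC syntax: the function names, the example kernel and the `srf_capacity` parameter are hypothetical, and the figure of 8 clusters is taken from the architecture diagram in this deck.

```python
NUM_CLUSTERS = 8  # ALU clusters 0 to 7 in the Imagine block diagram

def kernel_scale_add(srf_a, srf_b, gain):
    """Kernel level: all clusters run the same operation on their own record."""
    out = []
    for base in range(0, len(srf_a), NUM_CLUSTERS):
        # One SIMD step: each cluster takes one element of the current strip.
        lane_a = srf_a[base:base + NUM_CLUSTERS]
        lane_b = srf_b[base:base + NUM_CLUSTERS]
        out.extend(gain * a + b for a, b in zip(lane_a, lane_b))
    return out

def stream_program(memory_a, memory_b, gain, srf_capacity):
    """Stream level: move strips from 'main memory' into the 'SRF', run kernels."""
    result = []
    for start in range(0, len(memory_a), srf_capacity):
        srf_a = memory_a[start:start + srf_capacity]  # load a strip into the SRF
        srf_b = memory_b[start:start + srf_capacity]
        result.extend(kernel_scale_add(srf_a, srf_b, gain))  # store back to memory
    return result

data = stream_program(list(range(20)), [1] * 20, gain=2, srf_capacity=16)
print(data)
```

Stream lengths that are not a multiple of the cluster count leave some clusters idle in the last step, which is one reason code is tuned to the number of clusters and the data size.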
Kernel 2 (mmult) for 3 +, 2 *
[Figure: functional-unit schedule over time with the loop marked. Legend: communication (waiting for input); FU unavailable (input ready but FU busy)]
• Adders have limited FU utilization
• O(N^3) *, O(N^3) +
• Multipliers 100% in loop
• Divider not being utilized
• Replace / with *
Kernel 2 (mmult) for 3 +, 3 *
[Figure: functional-unit schedule over time]
• Better adder utilization
• Needs sufficient registers for scaling [register allocation may fail]
• Code may also need slight tuning of variables for optimization
Kernel computational time
• Time available at 128 Kbps for each of 32 users at 500 MHz: 4000 cycles
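The cycle budget follows from dividing the clock rate by the bit rate. A minimal sketch of the arithmetic, reading Kbps as 1000 bits per second and reading the slide as requiring one bit from each of the 32 users to be processed within a single bit period. The per-kernel cycle counts are hypothetical placeholders, not measurements from the deck.

```python
CLOCK_HZ = 500e6       # from the slide
BIT_RATE_BPS = 128e3   # from the slide, assuming 1 Kbps = 1000 bit/s
USERS = 32             # from the slide

# Cycles in one bit period; the slide rounds this to 4000.
cycles_per_bit = CLOCK_HZ / BIT_RATE_BPS
print(cycles_per_bit)

# Share per user if the 32 users were handled one after another.
print(cycles_per_bit / USERS)

def fits_budget(kernel_cycles, budget):
    """True if the kernels, run back to back, finish inside the budget."""
    return sum(kernel_cycles) <= budget

hypothetical_kernels = [900, 1200, 700, 600]  # placeholder cycle counts per kernel
print(fits_budget(hypothetical_kernels, cycles_per_bit))
```

Idle time between kernels and memory operations also consume this budget, so kernel time alone understates the real load.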
[Figure: execution timeline. Legend: memory operations; kernels (micro-controller executing); initialization; idle time between kernels; communication overhead]
Comparisons with TI C6701 DSPs
[Figure: execution time (in seconds, log scale from 10^-6 to 10^-2) versus number of users (0 to 35). Curves: single DSP implementation; 2 DSP implementation; our architecture based on Imagine, with increasing functional units; target data rate of 128 Kbps/user. Annotation: Efficiency = ?]
Future work
• Real-time design is possible with a larger number of functional units, but efficiency is the key
• Eliminating communication stalls between kernels
• Support for matrix transposes and bit-level operations
• Power and area constraints
• Scalability with data rates – boundaries of architecture
• Handset algorithms
Conclusions
• Various programmable architectures can be investigated and implemented QUICKLY for future systems, depending on algorithms, time, area and power constraints
• The insights gained from the design can be applied to DSPs and other processors with constraints on time, area and power
http://www.ece.rice.edu/~sridhar/
sridhar@rice.edu