Efficient VLSI architectures for baseband signal processing in wireless base-station receivers

Sridhar Rajagopal, Srikrishna Bhashyam, Joseph R. Cavallaro, and Behnaam Aazhang. This work is supported by Nokia, TI, TATP and NSF.

Introduction • A real-time VLSI architecture for channel estimation • Usually neglected, but high computational complexity • Current DSP solutions do not meet real-time • Iterative fixed point algorithm developed • Area-Time Tradeoffs discussed • Area-Constrained (Pico-cells) • Time-Constrained (Theoretical Data Rates) • Area-Time efficient (Real-Time Solution)

Outline • What is multiuser channel estimation? • Need for multiuser channel estimation • Implementation problems • Algorithm enhancements • VLSI architectures • Area-constrained, Time-constrained, Area-Time efficient • Comparisons with DSP solutions • Related Work and Conclusions

Evolution of mobile communications • First generation: Voice • Second/Current generation: Voice + Low-rate data (9.6 Kbps) • Third generation: Voice + High-rate data (2 Mbps/384 Kbps/128 Kbps) + multimedia

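Channel estimation [Figure: User 1 and User 2 transmit to the base station over a direct path and a reflected path; the received signal also contains noise and multiple access interference (MAI).]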

Need for channel estimation • To compensate for unknown fading amplitudes and asynchronous delays. • Detector performance depends on accuracy of channel estimator

Computing channel estimates • Computed by sending a training sequence of known bits to the receiver. • When absent, detected bits can be used to update estimates in a decision feedback mode for tracking. • Importance usually neglected • May exceed detector complexity

Baseband signal processing [Figure: block diagram of the base-station receiver. Labels: multiple users, antenna, channel estimation (training, tracking), detection, decoding, detected bits.]

Implementation complexity • Matrix inversions (size 32x32) per window • Unable to meet real-time on DSPs [Asilomar’99] • VLSI fixed-point architectures for matrix inversions • Precision problems • Typically, simpler single-user sliding correlator structures used.

Iterative scheme for channel estimation • Method of Gradient Descent • Stable convergence behavior • Same Performance • Simpler Bit-Streaming Hardware Implementation
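As a rough illustration of the iterative scheme, the sketch below solves Rbb·A = Rbr by gradient descent instead of inverting Rbb. The update rule A ← A − µ(Rbb·A − Rbr) is an assumption inferred from the Mult, Subtract and >> blocks in the architecture diagrams; the function names, the toy matrices and the step size are hypothetical and use floating point for clarity.

```python
def matmul(x, y):
    """Plain matrix product of two lists of lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def gradient_descent_estimate(rbb, rbr, mu, iterations):
    """Iterate A <- A - mu * (Rbb*A - Rbr), starting from A = 0."""
    rows, cols = len(rbr), len(rbr[0])
    a = [[0.0] * cols for _ in range(rows)]
    for _ in range(iterations):
        product = matmul(rbb, a)
        for i in range(rows):
            for j in range(cols):
                a[i][j] -= mu * (product[i][j] - rbr[i][j])
    return a


# Toy 2x2 system (illustrative values only, not from the slides).
rbb = [[4.0, 1.0], [1.0, 3.0]]
rbr = [[1.0], [2.0]]
estimate = gradient_descent_estimate(rbb, rbr, mu=0.1, iterations=200)
```

For this toy system the iteration approaches the exact solution of Rbb·A = Rbr (1/11 and 7/11) without forming an inverse. Convergence requires a sufficiently small µ relative to the largest eigenvalue of Rbb.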

Comparison of Bit Error Rates (BER) [Figure: BER (10^-3 to 10^-1) versus Signal to Noise Ratio (SNR, 4 to 12) for MF, ActMF, ML and ActML, annotated with complexities O(K^2 N) and O(K^3 + K^2 N).] Simulations: static multipath channel, SINR = 0, Paths = 3, Preamble L = 150, Spreading N = 31, Users K = 15

Fading channel with tracking [Figure: BER (10^-3 to 10^0) versus SNR (4 to 12) for MF - Static, MF - Tracking, ML - Static and ML - Tracking.] Doppler = 10 Kmph

Area-Time Tradeoffs • Design for 32 users (K) and spreading code length (N) 32 • Target Data Rate = 128 Kbps • Low Power Issues ignored! • Area-Constrained Architecture • Pico-cells; lower data rates • Time-Constrained Architecture • Maximum achievable data rates • Area-Time Efficient Architecture • Real-Time with minimum area overhead

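Task Decomposition [Figure: task timeline over a tracking window of length L. Labels: MUX selecting pilot bits or detected bits, giving b0 (2K,1) and b (2K,1); MUX selecting pilot or data, giving r0 (N,8) and r (N,8); correlation matrices updated per bit, Rbb O(2K^2, 8) and Rbr O(2KN, 8); iteration on A, O(4K^2 N, 8); channel estimate to detector.]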
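The per-bit correlation update over a tracking window can be sketched as follows. This assumes that b and r are the newest bit and received vectors entering the window and that b0 and r0 are the oldest ones leaving it, which is an interpretation of the diagram labels rather than something the slides state. All function names and the toy data are hypothetical.

```python
def outer(u, v):
    """Outer product u * v^T as a list of lists."""
    return [[ui * vj for vj in v] for ui in u]


def accumulate(bit_vectors, rx_vectors):
    """Build Rbb and Rbr from scratch over a whole window."""
    n_b, n_r = len(bit_vectors[0]), len(rx_vectors[0])
    rbb = [[0] * n_b for _ in range(n_b)]
    rbr = [[0] * n_r for _ in range(n_b)]
    for b, r in zip(bit_vectors, rx_vectors):
        bb, br = outer(b, b), outer(b, r)
        for i in range(n_b):
            for j in range(n_b):
                rbb[i][j] += bb[i][j]
            for j in range(n_r):
                rbr[i][j] += br[i][j]
    return rbb, rbr


def slide_window(rbb, rbr, b_new, r_new, b_old, r_old):
    """Add the newest contribution and remove the oldest one, in place."""
    add_bb, sub_bb = outer(b_new, b_new), outer(b_old, b_old)
    add_br, sub_br = outer(b_new, r_new), outer(b_old, r_old)
    for i in range(len(rbb)):
        for j in range(len(rbb[0])):
            rbb[i][j] += add_bb[i][j] - sub_bb[i][j]
        for j in range(len(rbr[0])):
            rbr[i][j] += add_br[i][j] - sub_br[i][j]


# Toy data: 4 bit times, 2 bit entries, 3 received chips (illustrative only).
bits = [[1, -1], [1, 1], [-1, 1], [-1, -1]]
rx = [[3, -2, 5], [1, 4, -1], [0, 2, 2], [-3, 1, 4]]
rbb, rbr = accumulate(bits[0:3], rx[0:3])            # window covers times 0..2
slide_window(rbb, rbr, bits[3], rx[3], bits[0], rx[0])  # window now covers 1..3
```

Sliding the window this way costs two rank-one updates per bit instead of a full recomputation over L bits.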

Architecture Design: Auto-correlation • b ∈ {+1,-1} • Multiplication is an XNOR operation • Entire matrix can be updated sequentially or in parallel using XNOR gates • Auto-correlation matrix implemented as UP/DOWN counters
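A minimal sketch of the XNOR idea, assuming +1 is encoded as bit 1 and -1 as bit 0 (the encoding and the function names are illustrative choices, not taken from the slides). The product of two such values is +1 exactly when the bits are equal, so an XNOR output can drive the up/down input of a counter.

```python
def xnor(x, y):
    """XNOR of two single bits: 1 when the bits are equal, else 0."""
    return 1 - (x ^ y)


def update_autocorr(counters, bits):
    """Count up where b[i]*b[j] = +1 and down where it is -1."""
    for i, bi in enumerate(bits):
        for j, bj in enumerate(bits):
            counters[i][j] += 1 if xnor(bi, bj) else -1


# One update with the bit vector (+1, -1, +1, +1) encoded as 1, 0, 1, 1.
bits = [1, 0, 1, 1]
counters = [[0] * len(bits) for _ in bits]
update_autocorr(counters, bits)
```

After one update each counter holds the corresponding entry of b·bT. Because the matrix is symmetric, hardware only needs the entries on one side of the diagonal.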

Architecture Design: Cross-Correlation • b ∈ {+1,-1}, r = 8-bit integer vector (complex) • Multiplications reduce to additions/subtractions • Entire matrix (complex) can be updated sequentially or in parallel using 8-bit adders • Cross-correlation matrix stored in RAM.
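The reduction of multiplications to additions and subtractions can be sketched as below, again assuming +1 is encoded as bit 1 and -1 as bit 0. Real and imaginary parts are kept in separate integer arrays, which is one possible layout rather than the one used in the design; overflow of the 8-bit storage is not modelled, and the names are hypothetical.

```python
def update_crosscorr(rbr_re, rbr_im, bits, r_re, r_im):
    """Rbr[i][j] += b[i] * r[j], using only add or subtract per entry."""
    for i, bit in enumerate(bits):
        for j in range(len(r_re)):
            if bit:  # b[i] = +1: add the received sample
                rbr_re[i][j] += r_re[j]
                rbr_im[i][j] += r_im[j]
            else:    # b[i] = -1: subtract the received sample
                rbr_re[i][j] -= r_re[j]
                rbr_im[i][j] -= r_im[j]


# Two bits (+1, -1) and two complex received samples: 3+1j and -2+4j.
bits = [1, 0]
r_re, r_im = [3, -2], [1, 4]
rbr_re = [[0, 0], [0, 0]]
rbr_im = [[0, 0], [0, 0]]
update_crosscorr(rbr_re, rbr_im, bits, r_re, r_im)
```

Each bit acts as the add/subtract control of an adder, so no multiplier is needed for this stage.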

Architecture Design: Channel Estimate • A = 8-bit integer matrix (complex) • µ << 1 : Truncated Multiplication [Schulte’93] • Matrix-matrix (real-complex) multiplication of integers • Forms the bottleneck • Can be done sequentially with a single multiplier or totally parallel or partially parallel • Concentrate on multiplication for area-time tradeoffs!
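One fixed-point update step might look like the sketch below. It assumes that the step size is a power of two, µ = 2^-shift, so that scaling by µ becomes the right shift (>>) shown in the block diagrams. This shift is a stand-in for illustration and does not model the truncated multiplier of [Schulte'93]. Real-valued integers are used for brevity, and the names and values are hypothetical.

```python
def fixed_point_step(a, rbb, rbr, shift):
    """Compute A_new = A - ((Rbb*A - Rbr) >> shift) on integers."""
    rows, cols = len(a), len(a[0])
    a_new = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # Multiply-accumulate: row i of Rbb times column j of A.
            acc = sum(rbb[i][k] * a[k][j] for k in range(rows))
            err = acc - rbr[i][j]
            # Python's >> on negative ints floors, like a two's-complement
            # arithmetic shift in hardware.
            a_new[i][j] = a[i][j] - (err >> shift)
    return a_new


a = [[8], [-4]]
rbb = [[2, 0], [0, 2]]
rbr = [[32], [0]]
a_next = fixed_point_step(a, rbb, rbr, shift=2)
```

The multiply-accumulate in the inner sum is the part the slide identifies as the bottleneck. Running it on one multiplier, on many, or on a partial array is what separates the three architectures.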

Area-Constrained Architecture [Figure: block diagram. Labels: 1-bit inputs b and b0; 8-bit inputs r and r0; up/down (U/D) counter for Rbb; Add/Sub units for Rbr; a single MAC; Subtract blocks; shifter (>>); MUX and DEMUX; Load/Store of A and Anew (8 bits); 16-bit intermediate results; indices i, j.]

Area-constrained Architecture: Hardware Requirements

Time-constrained Architecture [Figure: block diagram. Labels: b and b0 (2K*1) selected by a MUX; b*bT and b0*b0T (K(2K-1)*1); Rbb (2K^2*8); r and r0 (N*8) selected by a MUX; Rbr (2KN*8); channel estimate A (2KN*8); Mult, Subtract, >> and Subtract stages with 2KN*16 and 2KN*8 data paths.]

Auto-correlation Update in Parallel [Figure: b (2K) feeds an array of XNORs that produces bbT (K*{2K-1}*1), for example a·b, a·c, a·d, b·c, b·d, c·d for b = (a, b, c, d); each bbT(i,j) drives the U/D# input of a counter in an array of counters holding Rbb (2K^2*8), with entries Rbb(i,j) and Rbb(i,i).]

Cross-Correlation Update in Parallel [Figure: b (2K*1) and r (N*8); each b(i) drives the Add/Sub# control of an 8-bit adder that combines r(j) with Rbr(i,j); Rbr (2KN*8).]

Time-constrained Architecture: Hardware Requirements

Area-Time Efficient Architecture [Figure: block diagram. Labels: b and b0 (2K*1); b*bT and b0*b0T (2K*1) driving counters for Rbb; r and r0 (N*8); Adder for Rbr with Load/Store; A and Anew (2K*8); Mult, Subtract, >> and Subtract stages with 1*8 and 1*16 data paths; MUX and DEMUX.]

Area-Time Efficient Architecture: Hardware Requirements

DSP Comparisons • DSPs unable to exploit bit-level parallelism • Inefficient storage of bits • Replacing multiplications by additions/subtractions

Related Work: DSP Extensions (Cross-Correlation) • For i = 1..8, j = 1..8: D[i][j] = D[i][j] + b[i]*C[j] [Figure: a 64-bit register D[i][j] feeds eight 8-bit +/- units controlled by an 8-bit control register b[i]; the result is written to a 64-bit register D[i][j].]
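The update D[i][j] = D[i][j] + b[i]*C[j] with a 64-bit register and an 8-bit control register can be emulated in software as below. This is a sketch under stated assumptions: each 64-bit word holds eight signed 8-bit lanes, control bit k selects add (1) or subtract (0) for lane k, and results wrap modulo 256. How the indices i and j map onto lanes is not recoverable from the slide, and all names here are hypothetical.

```python
LANES = 8
LANE_BITS = 8
LANE_MASK = (1 << LANE_BITS) - 1


def pack(values):
    """Pack eight signed 8-bit values into one 64-bit word."""
    word = 0
    for k, v in enumerate(values):
        word |= (v & LANE_MASK) << (k * LANE_BITS)
    return word


def unpack(word):
    """Unpack a 64-bit word into eight signed 8-bit values."""
    out = []
    for k in range(LANES):
        v = (word >> (k * LANE_BITS)) & LANE_MASK
        out.append(v - 256 if v >= 128 else v)
    return out


def packed_addsub(d_word, c_word, ctrl):
    """Per lane: d + c if the control bit is 1, else d - c (wraps at 8 bits)."""
    result = 0
    for k in range(LANES):
        d = (d_word >> (k * LANE_BITS)) & LANE_MASK
        c = (c_word >> (k * LANE_BITS)) & LANE_MASK
        lane = d + c if (ctrl >> k) & 1 else d - c
        result |= (lane & LANE_MASK) << (k * LANE_BITS)
    return result


d = pack([1, 2, 3, 4, 5, 6, 7, 8])
c = pack([1] * 8)
lanes = unpack(packed_addsub(d, c, ctrl=0b00001111))
```

In hardware the eight lanes would operate in one instruction; the loop here only emulates that behaviour one lane at a time.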

Related Work: Online Arithmetic • Multiuser Detection • Need to compute only the Sign Bit (Most Significant Digit) • No back-conversion to conventional representation • Complex-number representation possible • Integration with channel estimation also.

Related Work: DSP-FPGA solutions • Multiple DSP-FPGA task partitioning • Bit-level parallelism on FPGAs • Multiplications on DSPs • Sundance Multi-DSP System • 2 TI C67 DSPs • 2 Xilinx Virtex FPGAs • http://www.sundance.com

Conclusions • Real-Time VLSI architecture for multiuser channel estimation • Iterative fixed-point algorithm developed to avoid matrix inversions • Area-Time Tradeoffs discussed • Area-Constrained (Pico-cells) • Time-Constrained (Data Rates) • Area-Time efficient (Real-Time) • VLSI architectures better exploit bit-level computations and parallelism to meet real-time constraints than DSPs.
