Algorithms and Architectures for Future Wireless Base-Stations

Algorithms and Architectures for Future Wireless Base-Stations
Sridhar Rajagopal and Joseph Cavallaro
ECE Department, Rice University
April 19, 2000
This work is supported by Texas Instruments, Nokia, the Texas Advanced Technology Program, and NSF.

Overview
• Future Base-Stations
• Current DSP Implementation
• Our Approach
• Make Algorithms Computationally Effective
• Task Partitioning for Pipelining and Parallelism
• Processor Design for Accelerating Wireless
TI Meeting

Evolution of Wireless Communications
• First Generation: Voice
• Second/Current Generation: Voice + Low-rate Data (9.6 Kbps)
• Third Generation (W-CDMA): Voice + High-rate Data (2 Mbps) + Multimedia

Communication System: Uplink
[Figure: User 1 and User 2 reach the Base Station over a direct path and reflected paths; the received signal also contains noise and multiple-access interference (MAI).]

Main Processing Blocks
[Figure: Baseband Layer of Base-Station Receiver, with blocks Channel Estimation, Detection, Decoding.]

Proposed Base-Station
[Figure: TI's Wireless Basestation (http://www.ti.com/sc/docs/psheets/diagrams/basestat.htm), annotated "No Multiuser Detection".]

Real-Time Requirements
• Multiple Data Rates by Varying Spreading Factors
• Detection needs to be done in real time
• 1953 cycles available in a C6x DSP at 250 MHz to detect 1 bit at 128 Kbps
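The 1953-cycle budget follows from dividing the DSP clock rate by the target bit rate. A minimal sketch of that arithmetic (the function name is hypothetical, chosen for illustration):

```python
def cycles_per_bit(clock_hz: float, bit_rate_bps: float) -> float:
    """Clock cycles available to process one bit in real time."""
    return clock_hz / bit_rate_bps

# C6x at 250 MHz, target data rate of 128 Kbps (figures from the slide)
budget = cycles_per_bit(250e6, 128e3)
print(int(budget))  # 1953
```

Every stage of detection for one bit has to fit inside this budget, or be spread over more processing elements.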

Current DSP Implementation
[Figure: Data Rate Comparisons for Matched Filter and Multiuser Detector. x-axis: Number of Users (9 to 15). y-axis: Data Rates Achieved (x 10^4, 0 to 18). Curves: Matched Filter (C64)*, Multiuser Detector (C64)*, Matched Filter (C67), Multiuser Detector (C67), Targeted Data Rate = 128 Kbps. Annotations: Projected (8x); C67 at 166 MHz.]

Complexity
• Algorithm Choice Limited by Complexity
• Multistage reduces data rate by half
• Main Features
• Matrix-based operations
• High levels of parallelism
• Bit-level computations
• 32x32 problem size for the Detector shown
• Estimation and Decoding assumed pipelined

Reasons
• Sophisticated, Compute-Intensive Algorithms
• Need more MIPS/FLOPS performance
• Unable to fully exploit pipelining or parallelism
• Bit-level computations / Storage

Our Approach
• Make algorithms computationally effective without sacrificing error-rate performance
• Task Partitioning on Multiple Processing Elements
• DSPs: Core
• FPGAs: Application-Specific / Bit-level Computations
• Processor with reconfigurable support and extensions for wireless

Algorithms
• Channel Estimation: avoid inversion by an iterative scheme
• Detection: avoid block-based detection by pipelining

Computations Involved
• Model
• Compute Correlation Matrices
[Figure: timeline showing the delay, bits b_i and b_{i+1}, and received vector r_i. Labels: bits of K asynchronous users aligned at times i and i-1; received bits of spreading length N for K users.]

Multishot Detection
• Solve for the channel estimate, A_i
• Multishot Detection

Differencing Multistage Detection
• Stage 0: Matched Filter
• Stage 1
• Successive Stages
S = diag(A^H A); y: soft decision; d: detected bits (hard decision)
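The stage equations are not reproduced in this transcript, so the sketch below assumes a standard parallel interference cancellation form: each stage removes the off-diagonal part of A^H A (everything except S) applied to the previous hard decisions. In the differencing form only the bits that changed since the previous stage contribute to the update. Real-valued data, the function names, and the two-user numbers are assumptions for illustration.

```python
def sign(v):
    return 1 if v >= 0 else -1

def multistage_detect(R, y0, stages):
    """R: K x K matrix A^H A (real-valued here); y0: matched-filter soft outputs.

    Returns (d, y): hard decisions and soft decisions after `stages` stages.
    """
    K = len(y0)
    y = list(y0)
    d_prev = [0] * K                  # nothing detected before stage 0
    d = [sign(v) for v in y]          # stage 0: matched-filter decisions
    for _ in range(stages):
        # differencing: x is non-zero only where a decision changed
        x = [d[k] - d_prev[k] for k in range(K)]
        for i in range(K):
            # cancel off-diagonal interference; the diagonal S is excluded
            y[i] -= sum(R[i][j] * x[j] for j in range(K) if j != i and x[j] != 0)
        d_prev, d = d, [sign(v) for v in y]
    return d, y

# Two users: a weak user 0 and a strong user 1 (hypothetical values)
R = [[1.0, 1.5], [1.5, 9.0]]
true_bits = [1, -1]
y0 = [sum(R[i][j] * true_bits[j] for j in range(2)) for i in range(2)]
print(multistage_detect(R, y0, 0)[0])  # matched filter only
print(multistage_detect(R, y0, 2)[0])  # after two cancellation stages
```

As decisions settle, most entries of x become zero, so later stages do less work than a full matrix-vector product.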

Iterative Scheme
• Tracking
• Method of Steepest Descent
• Stable convergence behavior
• Same Performance
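A minimal sketch of how steepest descent can replace a matrix inversion when solving a linear system of the form R A = P (the Task Decomposition slide writes the estimation system as Rbb A^H = Rbr). The update is A <- A - mu * (R A - P). Real-valued data, the step size mu, the iteration count, and all names are assumptions for illustration, not the authors' implementation.

```python
def matmul(X, Y):
    """Plain list-of-lists matrix product."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def steepest_descent_step(R, A, P, mu):
    """One gradient step toward the solution of R A = P (R symmetric, positive definite)."""
    RA = matmul(R, A)
    return [[A[i][j] - mu * (RA[i][j] - P[i][j])
             for j in range(len(A[0]))] for i in range(len(A))]

# Hypothetical 2x2 example: build P from a known solution, then recover it
R = [[2.0, 0.5], [0.5, 1.0]]
A_true = [[1.0, -1.0], [0.5, 2.0]]
P = matmul(R, A_true)

A = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(100):
    A = steepest_descent_step(R, A, P, mu=0.5)
print(A)
```

Each step costs only matrix products, and when R and P are refreshed as new bits arrive, continuing from the previous estimate is what gives the tracking behavior. The step size must stay below 2 divided by the largest eigenvalue of R for the iteration to converge.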

Simulations: AWGN Channel
[Figure: Comparison of Bit Error Rates (BER). x-axis: Signal to Noise Ratio (SNR), 4 to 12. y-axis: BER, 10^-3 to 10^-1. Curves: MF, ActMF, ML, ActML. Complexity annotations: O(K^2 N) and O(K^3 + K^2 N).]
Detection Window = 12, SINR = 0, Paths = 3, Preamble L = 150, Spreading N = 31, Users K = 15, 10000 bits/user
MF: Matched Filter; ML: Maximum Likelihood; Act: using inversion

Fading Channel with Tracking
[Figure: BER versus SNR. x-axis: SNR, 4 to 12. y-axis: BER, 10^-3 to 10^0. Curves: MF - Static, MF - Tracking, ML - Static, ML - Tracking.]
Doppler = 10 Hz, 1000 bits, 15 users, 3 paths

Block Based Detector
[Figure: each 12-bit window passes through the Matched Filter and Stages 1 to 3. The window over bits 1 to 12 yields bits 2-11; the window over bits 11 to 22 yields bits 12-21.]

Pipelined Detector
[Figure: bits 1 to 12 flow through the Matched Filter, Stage 1, Stage 2, and Stage 3 as successive pipeline stages.]

Task Decomposition [Asilomar99]
[Figure: block diagram with Blocks I to IV. Inputs: Pilot and Data, selected through MUXes. Outputs: b, d.]
Channel Estimation
• Correlation Matrices (Per Bit): Rbb O(K^2), Rbr[R] O(KN), Rbr[I] O(KN)
• Inverse: Rbb A^H = Rbr[R] O(K^2 N), Rbb A^H = Rbr[I] O(K^2 N)
• Matrix Products: A0^H A0, A0^H A1, A1^H A1, each O(K^2 N)
Matched Filter
• A^H r O(KND)
Multistage Detector
• Multistage Detection (Per Window): O(D K^2 Me)

Achieved Data Rates
[Figure: Data Rates for Different Levels of Pipelining and Parallelism. x-axis: Number of Users (9 to 15). y-axis: Data Rates (x 10^5, 0 to 3). Curves: Sequential A + B; A B; (Parallel A) B; (Parallel A) (Pipe B); (Parallel A) (Parallel+Pipe B). Data Rate Requirement = 128 Kbps.]

VLSI Implementation
• Channel Estimation as a Case Study
• Area-Time Efficient Architecture
• Real-Time Implementation
• Bit-Level Computations: FPGAs
• Core Operations: DSPs

Motivation for Architecture
• Wireless, the next wave after Multimedia
• Highly Compute-Intensive Algorithms
• Real-Time Requirements

Outline
• Processor Core with Reconfigurable Support
• Permutation Based Interleaved Memory
• Processor Architecture: EPIC
• Instruction Set Extensions
• Truncated Multipliers
• Software Support Needed

Characteristics of Wireless Algorithms
• Massive Parallelism
• Bit-level Computations
• Matrix-Based Operations
• Memory Intensive
• Complex-valued Data
• Approximate Computations

What’s wrong with Current Architectures for these applications?

Problems with Current Architectures
• UltraSPARC, C6x, MMX, IA-64
• Not enough MIPS/FLOPS
• Unable to fully exploit parallelism
• Bit-Level Computations
• Memory Bottlenecks
• Specialized Instructions for Wireless Communications

Why Reconfigurable
• Adapt algorithms to environment
• Seamless and Continuous Data Processing during Handoffs
[Figure: Home Area Wireless LAN, Outdoor CDMA Cellular Network, High Speed Office Wireless LAN.]

Reconfigurable Support
[Figure: OSI stack.]
OSI Layers 3-7: User Interface, Translation, Synchronization, Transport, Network
OSI Layer 2: Data Link Layer (converts frames to bits)
OSI Layer 1: Physical Layer (hardware; raw bit stream)

Different Protocols
• MPEG-4, H.723: Voice, Multimedia
• Convolutional, Turbo: Channel Coding
[Figure: blocks Source Coding, Channel Coding, Channel Estimation, Multiuser Detection, Channel Decoding, Source Decoding.]

A New Architecture
[Figure: Processor containing a Processor Core (GPP/DSP), Cache, prefetch queues (Q), Crossbar, and Reconfigurable Logic, connected to Main Memory. An RF Unit (Add-on PCMCIA Card) feeds the Reconfigurable Logic a real-time I/O bit stream.]

Why Reconfigurable
• Process initial bit-level computations
• Optimize for fast I/O transfer
[Figure: RF Unit sends a real-time I/O bit stream to the Reconfigurable Logic.]

Reconfigurable Support
[Figure: GARP Architecture at UC Berkeley. Labels: two 64-bit data buses, one 64-bit address bus, Control Blocks, Boolean values, Fast I/O, Configuration Caches, 64-bit Datapath, Sequencer.]

Reconfigurable Support
• Wide Path to Memory
• Data Transfer
• Minimize Load Times
• Configuration Caches
• Recently Displaced Configurations (5 cycles)
• Can hold 4 full-size Configurations
• Independent Execution

Reconfigurable Support
• Access to same Memory System as Processor
• Minimize overhead
• When idle
• Load Configurations
• Transfer Data

Memory Interface
• Access to Main Memory and L1 Data Cache
• Large, fast Memory Store
• Memory Prefetch Queues for Sequential Accesses
• Read-aheads and Write-behinds
[Figure: Instruction Cache, Processor Core (GPP/DSP), L1 Data Cache, Main Memory, prefetch queues (Q), Crossbar, FPGA.]

Permutation Based Interleaved Memory (PBI)
• High Memory Bandwidth Needed
• Stride-Insensitive Memory System for Matrices
• Multiple Banks
• Sustained Peak Throughput (95%)
[Figure: Main Memory and L1 Data Cache.]
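The slide does not give the permutation itself. As a toy illustration of why permuting the bank index helps with matrix strides, the sketch below compares plain modulo interleaving with a simple XOR of two address fields. The XOR mapping, the bank count, and the function names are assumptions, and this simple mapping still conflicts on some strides; practical schemes mix in more address bits.

```python
NUM_BANKS = 8  # assumed power of two

def bank_conventional(addr):
    """Plain low-order interleaving: bank = address mod number of banks."""
    return addr % NUM_BANKS

def bank_permuted(addr):
    """Toy permutation: XOR the low bank bits with the next address field."""
    low = addr % NUM_BANKS
    high = (addr // NUM_BANKS) % NUM_BANKS
    return low ^ high

def banks_touched(mapper, stride, count=NUM_BANKS):
    """Number of distinct banks hit by `count` accesses at a fixed stride."""
    return len({mapper(i * stride) for i in range(count)})

# Walking down a column of a row-major matrix with 8 elements per row
print(banks_touched(bank_conventional, stride=8))  # 1
print(banks_touched(bank_permuted, stride=8))      # 8
print(banks_touched(bank_permuted, stride=1))      # 8
```

With plain interleaving the column walk serializes on one bank, while the permuted mapping spreads the same accesses over all eight.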

Processor Core
• 64-bit EPIC Architecture with Extensions (IA-64/C6x)
• Statically determined Parallelism; exploit ILP
• Execution Time Predictability
[Figure: Processor Core (GPP/DSP), Cache, prefetch queues (Q), Crossbar, FPGA.]

EPIC Principle
• Explicitly Parallel Instruction Computing
• Evolution of VLIW Computing
• Compiler: Key role
• Architecture to assist Compiler
• Better cope with dynamic factors, which limited VLIW Parallelism

Instruction Set Extensions
• To accelerate bit-level computations in Wireless
• Real/Complex Integer-Bit Multiplications (used in Multiuser Detection, Decoding)
• Bit-Bit Multiplications (used in Outer Product Updates: Correlation, Channel Estimation)
• Complex Integer-Integer Multiplications (useful in other Signal Processing applications: Speech, Video, ...)

Architecture Support
• Support via Instruction Set Extensions
• Minimal ALU Modifications necessary
• Transparent to Register Files/Memory
• Additional 8-bit Special Purpose Registers

Integer-Bit Multiplications
D = D + b*C (e.g., Cross-Correlation)
[Figure: 64-bit Register C and 64-bit Register A feed a row of +/- units controlled by 8-bit Register b; the result is written to 64-bit Register D.]
Register Renaming?
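A minimal software model of the update D = D + b*C, where each bit of b stands for +1 or -1 and therefore selects an add or a subtract instead of a multiply. The lane layout (eight signed 8-bit lanes in a 64-bit word), the encoding (bit 1 means +1, bit 0 means -1), and the function names are assumptions for illustration; the slide specifies only the register widths.

```python
LANES = 8
LANE_BITS = 8
LANE_MASK = (1 << LANE_BITS) - 1

def pack(values):
    """Pack eight signed 8-bit integers into one 64-bit word, lane 0 lowest."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word

def unpack(word):
    """Inverse of pack: recover eight signed 8-bit lane values."""
    out = []
    for i in range(LANES):
        v = (word >> (i * LANE_BITS)) & LANE_MASK
        out.append(v - (1 << LANE_BITS) if v >= 1 << (LANE_BITS - 1) else v)
    return out

def int_bit_mac(D, C, b):
    """Per lane i: add C[i] to D[i] if bit i of b is set, else subtract it.

    Results wrap modulo 2**8 within each lane.
    """
    d, c = unpack(D), unpack(C)
    res = [d[i] + c[i] if (b >> i) & 1 else d[i] - c[i] for i in range(LANES)]
    return pack(res)

D = pack([0] * 8)
C = pack([1, 2, 3, 4, 5, 6, 7, 8])
print(unpack(int_bit_mac(D, C, 0b00001111)))  # [1, 2, 3, 4, -5, -6, -7, -8]
```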

8-bit to 64-bit Conversions
D = D + b*b^T (e.g., Auto-Correlation)
b1 = b(1:8), b(1:8), ..., b(1:8)
b2 = b(1)b(1)...b(8)b(8)
[Figure: 8-bit Register b expanded into 64-bit Register A in two ways: the whole byte b(1)..b(8) repeated, or each bit b(1), ..., b(8) repeated.]

Bit-Bit Multiplications
D = D + b*b^T (e.g., Auto-Correlation)
[Figure: 64-bit Register A = b1 and 64-bit Register B = b2 feed an Ex-NOR; the result is 64-bit Register C = b1*b2.]

Increment/Decrement
D = D + b*b^T (e.g., Auto-Correlation)
[Figure: 64-bit Register D and the constant 1 feed a row of +/- units controlled by 8-bit Register b1*b2; the result is 64-bit Register (D + b1*b2).]
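The three auto-correlation steps (8-bit to 64-bit conversion, Ex-NOR, increment/decrement) can be modeled in software as follows. The encoding (bit 1 means +1, bit 0 means -1), the bit ordering within the 64-bit words, the plain 8x8 integer accumulator, and the function names are assumptions for illustration.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def expand_repeat_vector(b):
    """b1: the whole byte b(1:8) repeated eight times."""
    word = 0
    for i in range(8):
        word |= (b & 0xFF) << (8 * i)
    return word

def expand_repeat_bits(b):
    """b2: each bit of b repeated eight times."""
    word = 0
    for i in range(8):
        if (b >> i) & 1:
            word |= 0xFF << (8 * i)
    return word

def outer_product_bits(b):
    """All 64 products b(i)*b(j) at once: XNOR is 1 exactly when the signs agree."""
    return ~(expand_repeat_vector(b) ^ expand_repeat_bits(b)) & MASK64

def accumulate(D, prod):
    """Increment D[i][j] where the product bit is 1, decrement it otherwise."""
    for i in range(8):
        for j in range(8):
            D[i][j] += 1 if (prod >> (8 * i + j)) & 1 else -1
    return D

D = [[0] * 8 for _ in range(8)]
accumulate(D, outer_product_bits(0b10110010))
print(D[1])  # row of the outer product for bit index 1
```

A single 64-bit XNOR stands in for 64 one-bit multiplies, which is the saving the extension is after.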

Complex-valued Data Processing
• Is it easy to add?
• Is this worth additional ALU support?
• Typically supported by software!

Truncated Multipliers
• Many applications need approximate computations
• Adaptive Algorithms: Y = Y + mu*(Y*C)
• Truncate lower bits
• Truncated Multipliers: half the area / half the delay
• Can do 2 truncated multiplies in parallel with a regular multiply
[Figure: ALU multipliers: Multiplier 1, Multiplier 2, Truncated Multiplier.]
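A bit-level sketch of one common truncated-multiplier model: partial products that fall in the low-order columns are never formed, and only the upper half of the result is returned. The unsigned 8-bit operands, the cut-off column, and the function names are assumptions for illustration; the slide does not specify the multiplier design.

```python
def full_product_high(a, b, n=8):
    """Upper n bits of the exact 2n-bit product of two n-bit unsigned operands."""
    return (a * b) >> n

def truncated_product_high(a, b, n=8):
    """Upper n bits when partial products in columns below n are never formed."""
    total = 0
    for i in range(n):
        for j in range(n):
            # keep partial product a_i * b_j only if its column i + j is >= n
            if i + j >= n and (a >> i) & 1 and (b >> j) & 1:
                total += 1 << (i + j)
    return total >> n

a, b = 200, 150
print(full_product_high(a, b), truncated_product_high(a, b))
```

With the cut at column n, only n(n-1)/2 of the n^2 partial products are generated, in line with the "half the area" figure above, and in this model the result falls short of the exact upper half by at most n-1 units. An adaptive update such as Y = Y + mu*(Y*C) typically tolerates that error because later iterations correct it.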

Software Support
• Greater Interaction between Compilers and Architectures
• EPIC
• Reconfigurable Logic
• Compiler needs to find and exploit bit-level computations
• Reconfigurable Logic Programming

Other Uses
• Reconfigurable Logic: for accelerating loops of general-purpose processors
• Bit-Level Support: for other voice, video, and multimedia applications
