Wireless Communication Extensions for DSPs and General Purpose Processors
Sridhar Rajagopal
COMP 625, April 17, 2000

Motivation
• Wireless, the next wave after Multimedia
• Highly Compute-Intensive Algorithms
• Real-Time Requirements
• Design based on Time-to-Market

Outline
• Processor Core with Reconfigurable Support
• Permutation Based Interleaved Memory
• Processor Architecture - EPIC
• Instruction Set Extensions
• Truncated Multipliers
• Software Support Needed

Characteristics of Wireless Algorithms
• Massive Parallelism
• Bit-level Computations
• Matrix Based Operations
• Memory Intensive
• Complex-valued Data
• Approximate Computations

What's wrong with Current Architectures for these applications?

Problems with Current Architectures
• UltraSPARC, C6x, MMX, IA-64
• Not enough MIPS/FLOPS
• Unable to fully exploit parallelism
• Bit Level Computations
• Memory Bottlenecks
• Specialized Instructions needed for Wireless Communications

Why Reconfigurable
[Figure labels: Home Area Wireless LAN; Outdoor CDMA Cellular Network; High Speed Office Wireless LAN]
• Adapt algorithms to environment
• Seamless and Continuous Data Processing during Handoffs

Reconfigurable Support
[Figure: OSI stack]
• OSI Layers 3-7: User Interface, Translation, Synchronization, Transport, Network
• OSI Layer 2: Data Link Layer (Converts Frames to Bits)
• OSI Layer 1: Physical Layer (hardware; raw bit stream)

Different Protocols
• MPEG-4, H.723 - Voice, Multimedia
• Convolutional, Turbo - Channel Coding
[Figure blocks: Source Coding; Channel Coding; Source Decoding; Channel Decoding; Multiuser Detection; Channel Estimation]

A New Architecture
[Figure blocks: Main Memory; Processor Core (GPP/DSP); Cache; Q; Q; Crossbar; Real-Time I/O Bit Stream; Reconfigurable Logic; RF Unit; Add-on PCMCIA Network Interface Card; Processor]

Why Reconfigurable
• Process initial bit level computations
• Optimize for fast I/O transfer
[Figure blocks: Real-Time I/O Bit Stream; Reconfigurable Logic; RF Unit]

Reconfigurable Support
[Figure labels: 2 64-bit data buses; 1 64-bit address bus; Control Blocks; Boolean values; Fast I/O; Configuration Caches; 64-bit Datapath; Sequencer]
• GARP Architecture at UC Berkeley

Reconfigurable Support
• Wide Path to Memory
  • Data Transfer
  • Minimize Load Times
• Configuration Caches
  • Recently Displaced Configurations (5 cycles)
  • Can hold 4 full-size Configurations
• Independent Execution

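The configuration-cache bullets above can be illustrated with a small cost model. This is only a sketch under stated assumptions: the 4 slots and the 5-cycle reload come from the slide, while the least-recently-used replacement policy, the `MEMORY_LOAD_CYCLES` value, and the `ConfigCache` class itself are hypothetical and not taken from the GARP design.

```python
from collections import OrderedDict

CACHE_SLOTS = 4           # from the slide: holds 4 full-size configurations
CACHE_HIT_CYCLES = 5      # from the slide: reload of a cached configuration
MEMORY_LOAD_CYCLES = 50   # hypothetical placeholder, NOT from the slides


class ConfigCache:
    """Toy cost model of a configuration cache (LRU policy is an assumption)."""

    def __init__(self):
        self.slots = OrderedDict()  # configuration name -> present

    def load(self, name):
        """Return the modeled number of cycles needed to load `name`."""
        if name in self.slots:
            self.slots.move_to_end(name)     # mark as most recently used
            return CACHE_HIT_CYCLES
        if len(self.slots) == CACHE_SLOTS:
            self.slots.popitem(last=False)   # evict least recently used
        self.slots[name] = True
        return MEMORY_LOAD_CYCLES


cache = ConfigCache()
costs = [cache.load(name) for name in ["a", "b", "c", "d", "a"]]
print(costs)  # first four loads miss, the repeated "a" hits
```

The point of the model is only that a kernel whose configuration is still cached restarts almost for free, which is why the slide stresses minimizing load times.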
Reconfigurable Support
• Access to same Memory System as Processor
  • Minimize overhead
• When idle
  • Load Configurations
  • Transfer Data

Operation
• Load Configuration
  • If in configuration cache, minimal time
• Copy initial data with coprocessor move instructions
• Start execution
• Issue wait that interlocks while active
• Copy registers back at kernel completion

Memory Interface
[Figure blocks: Instruction Cache; Processor Core (GPP/DSP); L1 Data Cache; Main Memory; Q; Q; Crossbar; FPGA]
• Access to Main Memory and L1 Data Cache
• Large, fast Memory Store
• Memory Prefetch Queues for Sequential Accesses
• Read-aheads and Write-behinds

Permutation Based Interleaved Memory (PBI)
• High Memory Bandwidth Needed
• Stride-Insensitive Memory System for Matrices
• Multiple Banks
• Sustained Peak Throughput (95%)
[Figure blocks: Main Memory; L1 Data Cache]

PBI Scheme
• N: address length (bits)
• M = 2^n Banks
• 2^(N-n) words in each bank
• To access a word:
  • n-bit bank number
  • (N-n)-bit address (high-order)
• Calculation of the n-bit Bank Number

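As a worked example of the sizes above, the sketch below splits an address into its high-order (N-n)-bit in-bank address and its low-order n bits. The values N = 8 and n = 2 and the helper name `split_address` are illustrative assumptions, not figures from the slides.

```python
def split_address(addr, N, n):
    """Split an N-bit address into (high-order N-n bits, low-order n bits)."""
    assert 0 <= addr < (1 << N)
    high = addr >> n               # (N-n)-bit address of the word inside a bank
    low = addr & ((1 << n) - 1)    # low-order n bits
    return high, low


N, n = 8, 2                        # illustrative sizes only
num_banks = 1 << n                 # M = 2^n
words_per_bank = 1 << (N - n)      # 2^(N-n)

print(num_banks, words_per_bank)            # 4 64
print(split_address(0b10110110, N, n))      # (45, 2), i.e. (0b101101, 0b10)
```

In PBI the low-order bits are not used directly as the bank number; the bank number is computed from all N bits, as the next slide describes.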
Calculate Bank Number
[Figure: N-bit address; Parity Ckt. for each of Row 0 to Row n-1 of A; n parity bit signals; Decoder; 2^n bank select signals]
• Use all N bits to get n-bit vector
• Y = A X, where A is an n x N matrix of 0's and 1's
• Y = A_h X_h + A_l X_l, with X split into N-n high-order and n low-order bits; A_l has rank n
• N-bit parity circuit with log_k N levels of XOR gates (fan-in k)

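The bank-number calculation Y = A X can be sketched in software as one parity computation per row of A, with all arithmetic modulo 2. The matrix below (`ROWS`, for N = 8 and n = 2) is a hypothetical choice made for illustration; its low-order 2 x 2 block A_l is the identity, so it has rank 2 as the slide requires. The slides do not give a concrete A.

```python
def parity(x):
    """XOR of all bits of x (what an N-bit parity circuit computes)."""
    return bin(x).count("1") & 1


def bank_number(addr, rows):
    """Compute Y = A X over GF(2); rows[i] is row i of A packed as a bit mask."""
    y = 0
    for i, row in enumerate(rows):
        y |= parity(addr & row) << i
    return y


# Hypothetical 2 x 8 matrix A (not from the slides):
# row 0 covers address bits 0, 2, 4, 6 and row 1 covers bits 1, 3, 5, 7.
ROWS = [0b01010101, 0b10101010]

stride4 = list(range(0, 16, 4))                   # addresses 0, 4, 8, 12
print([a & 0b11 for a in stride4])                # low-order interleaving: [0, 0, 0, 0]
print([bank_number(a, ROWS) for a in stride4])    # parity-based mapping:  [0, 1, 2, 3]
```

With plain low-order interleaving a stride equal to the number of banks hits one bank repeatedly, while the parity-based mapping spreads the same accesses across all four banks. Because A_l has full rank, each (high-order address, bank number) pair still identifies exactly one word.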
Interleaved Memory Model
[Figure blocks: Address Source; Input Buffers; Memory Banks M(0), M(1), ..., M(M-1); Output Buffers; Data Sequencer; Data Sink]

Processor Core
[Figure blocks: Processor Core (GPP/DSP); Cache; Q; Q; Crossbar; FPGA]
• 64-bit EPIC Architecture with Extensions (IA-64/C6x)
• Statically determined Parallelism; exploit ILP
• Execution Time Predictability

EPIC Principle
• Explicitly Parallel Instruction Computing
• Evolution of VLIW Computing
• Compiler - Key role
• Architecture to assist Compiler
• Better cope with dynamic factors which limited VLIW Parallelism

Aspects of EPIC
• Designing Plan of Execution (POE) at Compile Time
• Permitting Compiler to play Statistics
  • Conditional Branches, Memory references
• Communicating POE to the hardware
  • Static Scheduling
  • Branch information

Architecture Features in EPIC
• Static Scheduling
  • MultiOP
  • Non-Unit Assumed Latency (NUAL)
• The Branch Problem
  • Predicated Execution
  • Control Speculation
  • Predicated Code Motion
• The Memory Problem
  • Cache Specifiers
  • Data Speculation

Instruction Set Extensions
• To accelerate Bit level computations in Wireless
• Real/Complex Integer - Bit Multiplications
  • Used in Multiuser Detection, Decoding
• Bit - Bit Multiplications
  • Used in Outer Product Updates
  • Correlation, Channel Estimation
• Complex Integer - Integer Multiplications
  • Useful in other Signal Processing applications
  • Speech, Video, etc.

Architecture Support
• Support via Instruction Set Extensions
• Minimal ALU Modifications necessary
• Transparent to Register Files/Memory
• Additional 8-bit Special Purpose Registers

Integer - Bit Multiplications
• D[i] = D[i] + b[j]*C[j]
• E.g., Cross-Correlation
[Figure labels: 64-bit Register C; 64-bit Register A; +/- units; 8-bit Register b; 64-bit Register D]
• Register Renaming?

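The integer-bit multiply-accumulate above reduces to a conditional add or subtract per lane, which is what the row of +/- units suggests. The sketch below assumes, without confirmation from the slides, that a 64-bit register holds eight integer lanes and that a bit value of 1 in the 8-bit register b stands for +1 and 0 stands for -1. Lanes are modeled as a Python list and overflow is ignored.

```python
def int_bit_mac(d, c, b):
    """Return d[k] + s[k] * c[k] for 8 lanes, where s[k] = +1 if bit k of b is 1, else -1.

    d, c: lists of 8 integers (assumed lane contents of the 64-bit registers)
    b:    8-bit integer holding one sign bit per lane (assumed encoding)
    """
    assert len(d) == 8 and len(c) == 8 and 0 <= b < 256
    out = []
    for k in range(8):
        if (b >> k) & 1:
            out.append(d[k] + c[k])   # bit = 1: add
        else:
            out.append(d[k] - c[k])   # bit = 0: subtract
    return out


acc = int_bit_mac([0] * 8, [1, 2, 3, 4, 5, 6, 7, 8], 0b00000101)
print(acc)  # [1, -2, 3, -4, -5, -6, -7, -8]
```

No multiplier is involved: each lane needs only an adder with a selectable sign, which is consistent with the claim that minimal ALU modifications are necessary.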
8-bit to 64-bit conversions
• D = D + b*b^T
• E.g., Auto-Correlation
• b1 = b(1:8), b(1:8), ..., b(1:8)
• b2 = b(1)b(1)...b(8)b(8)
[Figure labels: 8-bit Register b holding b(1), b(2), ..., b(7), b(8); 64-bit Register A; steps 1.1, 1.2, 2.1]

Bit-Bit Multiplications
• D = D + b*b^T
• E.g., Auto-Correlation
• b1*b2 computed with Ex-NOR
[Figure labels: 64-bit Register A = b1; 64-bit Register B = b2; Ex-NOR; 64-bit Register C = b1*b2]

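The two 8-bit to 64-bit conversions and the Ex-NOR step can be sketched as follows. The sketch assumes that bit k of the Python integer corresponds to b(k+1) and that a bit value of 1 encodes +1 and 0 encodes -1; neither the bit ordering nor the encoding is spelled out in the slides. Under that encoding, two bits multiply to +1 exactly when they are equal, which is what Ex-NOR computes.

```python
MASK64 = (1 << 64) - 1


def replicate_byte(b):
    """b1: the whole 8-bit pattern repeated in each of the 8 bytes."""
    return b * 0x0101010101010101


def expand_bits(b):
    """b2: each bit of b repeated 8 times (bit i fills byte i)."""
    out = 0
    for i in range(8):
        if (b >> i) & 1:
            out |= 0xFF << (8 * i)
    return out


def outer_product_bits(b):
    """64-bit Ex-NOR of b1 and b2; bit 8*i + j encodes the product b(i)*b(j)."""
    return ~(replicate_byte(b) ^ expand_bits(b)) & MASK64


b = 0b10110010
signs = [1 if (b >> k) & 1 else -1 for k in range(8)]
c = outer_product_bits(b)
# Every bit of c agrees with the sign of the corresponding outer-product entry.
ok = all(((c >> (8 * i + j)) & 1) == (signs[i] * signs[j] == 1)
         for i in range(8) for j in range(8))
print(ok)  # True
```

One 64-bit logical operation therefore yields all 64 entries of b*b^T, leaving only the accumulation into D.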
Increment/Decrement
• D = D + b*b^T
• E.g., Auto-Correlation
[Figure labels: 64-bit Register D; 1; +/- units; 8-bit Register b1*b2; 64-bit Register (D + b1*b2)]

Complex-valued Data Processing
• Is it easy to add?
• Is this worth additional ALU Support?
• Typically supported by Software!?

Truncated Multipliers
• Many applications need approximate computations
• Adaptive Algorithms: Y = Y + mu*(Y*C)
• Truncate lower bits
• Truncated Multipliers - half the area/half the delay
• Can do 2 truncated multiplies in parallel with regular ALU Multipliers
[Figure labels: Multiplier; Truncated Multiplier; Multiplier 1; Multiplier 2]

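A bit-level model makes the "truncate lower bits" idea concrete. The sketch below is one common form of truncated multiplier, assumed here for illustration rather than taken from the slides: partial-product bits in the low-order columns are never generated, and only the upper n bits of the result are kept. The function names and the `guard` parameter are hypothetical.

```python
def full_multiply_high(a, b, n):
    """Upper n bits of the exact 2n-bit product of two unsigned n-bit values."""
    return (a * b) >> n


def truncated_multiply_high(a, b, n, guard=0):
    """Upper n bits of a product that omits low-order partial-product columns.

    Only partial-product bits a_i * b_j with i + j >= n - guard are summed,
    modeling the smaller adder array of a truncated multiplier.
    """
    cutoff = n - guard
    acc = 0
    for i in range(n):
        if not (a >> i) & 1:
            continue
        for j in range(n):
            if (b >> j) & 1 and i + j >= cutoff:
                acc += 1 << (i + j)
    return acc >> n


n = 8
worst = max(full_multiply_high(a, b, n) - truncated_multiply_high(a, b, n)
            for a in range(0, 256, 5) for b in range(0, 256, 5))
print(worst)  # largest error, in units of the kept result's least significant bit
```

With no guard columns, roughly half of the partial-product array is dropped and the result can fall below the exact value by at most n - 1 units in the last kept place. Raising `guard` trades area back for accuracy, and `guard = n` reproduces the exact upper half.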
Software Support
• Greater Interaction between Compilers and Architectures
  • EPIC
  • Reconfigurable Logic
• Compiler needs to find and exploit bit level computations
• Reconfigurable Logic Programming

Area Estimates
• Area increases by 20% over an IA-64 architecture due to Reconfigurable Support
• Instruction Set Extensions need minimal hardware support
• Parallel Interleaved Memory Banks will need larger area

Other Uses
• Reconfigurable Logic
  • For accelerating loops of general purpose processors
• Bit Level Support
  • For other voice, video and multimedia applications

Conclusions
• Processor Core with Reconfigurable Support developed for Wireless Applications
• Instruction Set Extensions added for accelerating performance of the algorithms
• Integration of Wireless Appliances with General Purpose Processors
• Great Impact on Performance of Wireless Algorithms

Future Work
• Simulations for finding performance improvements
• Other Processor Architectures
  • Bit Slice Architectures
  • Out-of-order

References
• T.C. Callahan, J.R. Hauser, J. Wawrzynek, "The GARP Architecture and C Compiler," IEEE Computer, April 2000, pp. 62-69. http://brass.cs.berkeley.edu
• M.S. Schlansker, B.R. Rau, "EPIC: Explicitly Parallel Instruction Computing," IEEE Computer, Feb 2000, pp. 37-45
• G.S. Sohi, "High-Bandwidth Interleaved Memories for Vector Processors - A Simulation Study," IEEE Transactions on Computers, Vol. 42, No. 1, Jan 1993, pp. 34-44

Acknowledgements
• Vijay Pai
• Partha Ranganathan
• Joseph Cavallaro