SWAP: Streaming Wireless Application-specific Processors • Sridhar Rajagopal, Scott Rixner, Joseph R. Cavallaro • {sridhar,rixner,cavallar}@rice.edu
New challenges in designing wireless systems • Flexibility • Fast evaluation • Adapting architectures for emerging systems
Implementation of Wireless Devices [Figure: implementation platforms over time: GPP, DSP, FPGA, ASIC] • Traditional challenge: primary constraints are minimum area, power and real-time; secondary constraints are flexibility, evaluation, adaptation • New challenge: primary constraints are flexibility, evaluation, adaptation; secondary constraints are minimum area, power and real-time
SWAP • Media processors: a recent trend in DSP architectures • Explore the space of stream processors with isim, a stream processor simulator • Based on the IMAGINE architecture from MIT/Stanford • Swap existing ASIC/FPGA/DSP baseband architectures with Streaming Wireless Application-specific Processors
Outline • Designing SWAP • Swapping SWAP
A proposed cellular receiver • High data rates in emerging wireless systems (Mbps/user) • Sophisticated algorithms for high spectral efficiency • Multiuser estimation, multiuser detection, Viterbi decoding • K = 1 => single-user system (handset, multipath)
[Figure: Estimation/Detection (64, 32 sizes). Stages shown: Multiuser Estimation, Prepare Matrices for Detection, Multiuser Detection]
Stream programming
• Kernels: computation
    KERNEL example1(istream<int> a, istream<int> b, ostream<int> c) {
        loop_stream(a) {
            int ai, bi, ci;
            a >> ai;
            b >> bi;
            ci = ai * 2 + bi * 3;
            c << ci;
        }
    }
• Streams: communication
    void main() {
        Stream<int> a(256);
        Stream<int> b(256);
        Stream<int> c(256);
        Stream<int> d(1024);
        ...
        example1(a, b, c);
        example2(c, d);
        ...
    }
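The kernel/stream split above can be mimicked in plain Python as a rough sketch. This is not Imagine's StreamC/KernelC toolchain: streams are modeled as ordinary iterables, and `example2` is a hypothetical stand-in, since the slide does not show its body. It is assumed here to emit four outputs per input so that a 256-element stream feeds a 1024-element one.

```python
from typing import Iterable, Iterator

def example1(a: Iterable[int], b: Iterable[int]) -> Iterator[int]:
    """Mirror of the example1 kernel: one output element per input pair."""
    for ai, bi in zip(a, b):
        yield ai * 2 + bi * 3

def example2(c: Iterable[int]) -> Iterator[int]:
    """Hypothetical stand-in for example2 (body not given in the slides).

    Assumption: expands each input into four outputs, 256 -> 1024 elements.
    """
    for ci in c:
        for k in range(4):
            yield ci + k

# Streams carry data between kernels; kernels only see their own streams.
a = list(range(256))
b = [1] * 256
c = list(example1(a, b))   # output stream of the first kernel
d = list(example2(c))      # input of the second kernel is the first's output
```

The point of the sketch is the structure: each kernel touches only its input and output streams, so the same kernel can run on every cluster in parallel.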
[Figure: stream data flow of the receiver. Estimation (multiuser channel estimation): correlation update kernel, matrix mult kernel, iteration update kernel. Data rearrangement: matrix transpose kernel, matrix mul C kernel, matrix mul L kernel. Detection (multiuser detection): matched filter kernel, PIC kernel. Then a buffer and the Viterbi decoding kernel, producing bits. Legend: computation kernel, communication, stream data flow]
Matrix multiplication kernel (Imagine) • 32 cycle loop • Executed on all 8 clusters [Figure: inner-loop instruction schedule across functional units ADD0, ADD1, ADD2, MUL0, MUL1, DIV0. Legend: communication (waiting for input); FU unavailable (input ready but FU busy)]
22 cycle loop [Figure: inner-loop instruction schedule across functional units ADD0, ADD1, ADD2, MUL0, MUL1, MUL2]
[Figure: kernel execution and memory transfers by cycle, showing periods where the kernel is stalled waiting for data from memory]
Current architecture designs • Memory stalls ignored: 16 Gbps memory systems expected in the future • Functional unit utilization ignored: not important for execution time • Both are important metrics for wireless systems because of POWER constraints
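As a minimal sketch of how these two metrics might be quantified from a simulator trace, the snippet below computes a memory-stall fraction and per-unit utilization. The `KernelProfile` record, its field names, and the sample cycle counts are all hypothetical and are not taken from isim or from the slides.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class KernelProfile:
    """Hypothetical per-run counters; not an isim output format."""
    total_cycles: int                 # kernel execution plus stall cycles
    memory_stall_cycles: int          # cycles spent waiting on memory
    fu_busy_cycles: Dict[str, int]    # busy cycles per functional unit

def memory_stall_fraction(p: KernelProfile) -> float:
    """Share of all cycles in which the clusters wait for memory."""
    return p.memory_stall_cycles / p.total_cycles

def fu_utilization(p: KernelProfile) -> Dict[str, float]:
    """Busy cycles of each functional unit divided by total cycles."""
    return {fu: busy / p.total_cycles for fu, busy in p.fu_busy_cycles.items()}

# Illustrative numbers only.
profile = KernelProfile(
    total_cycles=1000,
    memory_stall_cycles=200,
    fu_busy_cycles={"ADD0": 600, "MUL0": 400},
)
stalls = memory_stall_fraction(profile)
utilization = fu_utilization(profile)
```

Idle functional units and stalled clusters still cost area and power, which is why both numbers matter here even when they do not change execution time.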
The Imagine architecture [Figure: Imagine Stream Processor block diagram: four SDRAMs attached to a Streaming Memory System; Host Processor, Stream Controller, Network Interface (to the network), Microcontroller, Stream Register File, and ALU Clusters 0 to 7]
Exploring design space • Cluster limits: data parallelism (more parallelism in data => more clusters) • FU limits: dependencies/VLIW scheduling (at least 2 per type to pipeline dependencies; at most 5 per type, as more are difficult for the compiler to schedule) • Physical limits: 128 clusters with 10 ALUs
[Figure: SWAP for base-stations (32 ALU clusters) and SWAP for handsets (1 ALU cluster). Each design shows SDRAM, a Streaming Memory System (SMS), a Stream Register File (SRF), a microcontroller (MC), and ALU clusters drawn with three adders (+) and three multipliers (*)]
Outline • Designing SWAP • Swapping SWAP • Real-time for emerging systems
Adapting architectures for emerging systems [Flowchart]
• Test new algorithm on end-to-end system
• Compile on existing architecture
• Real-time/area/power satisfied? YES: done. NO: scale the architecture
• Architecture design scaling (# functional units, # clusters), driven by real-time (area/power) requirements and operations count
• New architecture parameters
• Fabrication feasible? YES: compile on new architecture. NO: design failed; re-design algorithms/architecture
Automated adaptation tool • If the new algorithms are similar and of reasonable complexity, real-time is met with no changes in the architecture • Otherwise, an automated tool scales the architecture while simultaneously targeting real-time and FU utilization
Algorithm constraints • 1 < #FU < 6 • #clusters = 1,2,4,8, ... 128 • Finite space • Exhaustive search for real-time with an “efficiency” metric
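A rough sketch of such an exhaustive search is given below. The constraints (2 to 5 FUs per type, power-of-two cluster counts up to 128) come from the slide. Everything else is an assumption for illustration: the workload is taken as perfectly data-parallel, the cycle model is a simple work-divided-by-units estimate, and the "efficiency" metric, which the slides do not define, is taken to be the smallest total ALU count that still meets real-time.

```python
from itertools import product
from math import ceil
from typing import Optional, Tuple

CLUSTERS = [2 ** k for k in range(8)]   # 1, 2, 4, ..., 128
FU_RANGE = range(2, 6)                  # 1 < #FU < 6, per FU type

def cycles_per_bit(adds: int, mults: int, clusters: int,
                   n_add: int, n_mul: int) -> int:
    """Assumed cost model: the slower of the two FU types sets the pace."""
    return max(ceil(adds / (clusters * n_add)),
               ceil(mults / (clusters * n_mul)))

def search(adds: int, mults: int, bit_rate_hz: int,
           clock_hz: int) -> Optional[Tuple[int, int, int, int]]:
    """Return (total ALUs, clusters, adders, multipliers) or None."""
    best = None
    for clusters, n_add, n_mul in product(CLUSTERS, FU_RANGE, FU_RANGE):
        cycles = cycles_per_bit(adds, mults, clusters, n_add, n_mul)
        if cycles * bit_rate_hz > clock_hz:
            continue                     # misses real-time at this clock
        alus = clusters * (n_add + n_mul)
        if best is None or alus < best[0]:
            best = (alus, clusters, n_add, n_mul)
    return best

# Hypothetical workload: 1000 adds and 2000 multiplies per bit at 1 Mbps.
result = search(1000, 2000, 1_000_000, 500_000_000)
```

With 8 cluster counts and 4 choices per FU type the space has only 128 points, so brute force is cheap; the expensive part in practice would be compiling and simulating each point.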
[Figure: ALUs required for real-time at 500 MHz (log scale, 10^0 to 10^3) versus number of W-CDMA cellular users (0 to 300). Add and multiply curves for fast fading (estimation every 10 bits), medium fading (estimation every 100 bits), and slow fading (estimation every 1000 bits)] Can we do smarter? #Adders = #Multipliers
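Algorithm outline • Exploit data parallelism as much as possible (AMAP): clusters are more energy-efficient • Look at FU utilizations (FUU) in the current architecture • Increment the FU type with the maximum utilization, (max %FUU)++, since it is bottlenecking the other units • Make the clock slower while still meeting real-time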
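The outline above might be sketched as the greedy loop below. This is an illustrative reading, not the authors' tool: the function name, the work-divided-by-units cycle model, and the starting point of 2 FUs per type are assumptions, and utilization is estimated from operation counts rather than measured in a simulator.

```python
from math import ceil
from typing import Dict, Optional, Tuple

MAX_CLUSTERS = 128
MIN_FU, MAX_FU = 2, 5   # 1 < #FU < 6, per FU type

def scale_architecture(work: Dict[str, int], data_parallelism: int,
                       bit_rate_hz: int, clock_hz: int
                       ) -> Optional[Tuple[int, Dict[str, int], int]]:
    """Return (clusters, FUs per type, slowest real-time clock) or None."""
    # Step 1: use as many clusters as the data parallelism allows.
    clusters = 1
    while clusters * 2 <= min(data_parallelism, MAX_CLUSTERS):
        clusters *= 2
    fus = {t: MIN_FU for t in work}
    while True:
        cycles = max(ceil(work[t] / (clusters * fus[t])) for t in fus)
        if cycles * bit_rate_hz <= clock_hz:
            # Step 4: run only as fast as real-time requires.
            return clusters, fus, cycles * bit_rate_hz
        # Step 2: estimated utilization of each FU type.
        util = {t: work[t] / (clusters * fus[t] * cycles) for t in fus}
        # Step 3: add one unit of the busiest type that can still grow.
        growable = [t for t in fus if fus[t] < MAX_FU]
        if not growable:
            return None   # real-time not reachable within the FU limits
        fus[max(growable, key=util.get)] += 1

# Hypothetical workload: per-bit operation counts at 4 Mbps, 500 MHz clock.
design = scale_architecture({"add": 1000, "mul": 4000}, 8,
                            4_000_000, 500_000_000)
```

Compared with exhaustive search, this visits only a handful of design points, at the cost of possibly missing a cheaper design with fewer clusters.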
Conclusions • New metrics for SWAP-ing: functional unit efficiency and memory stall minimization, which relate to area-time-power metrics in ASICs • Tradeoffs exist between attaining high functional unit efficiency and minimizing memory stalls • Tradeoffs exist between writing architecture-scalable code and attaining higher functional unit efficiency
Future work for thesis • Comparisons with DSPs and ASICs • Investigating new inter-cluster communication and support for data re-ordering on-chip • Automated tool for scaling architecture with algorithms and data rates. • Power optimizations for handset architectures