SWAP: Streaming Wireless Application-specific Processors

Presentation Transcript


  1. SWAP: Streaming Wireless Application-specific Processors
     Sridhar Rajagopal, Scott Rixner, Joseph R. Cavallaro
     {sridhar,rixner,cavallar}@rice.edu

  2. New challenges in designing wireless systems
     • Flexibility
     • Fast evaluation
     • Adapting architectures for emerging systems

  3. Implementation of Wireless Devices
     [Figure: GPP, DSP, FPGA, ASIC; axis label: Time]
     Traditional challenge:
     • Primary constraint: min. area, power and real-time
     • Secondary constraint: flexibility, evaluation, adaptation
     New challenge:
     • Primary constraint: flexibility, evaluation, adaptation
     • Secondary constraint: min. area, power and real-time

  4. SWAP
     • Media processors – recent trend in DSP architectures
     • Explore space of stream processors with isim, a stream processor simulator
     • Based on the IMAGINE architecture from MIT/Stanford
     • Swap existing ASIC/FPGA/DSP baseband architectures with Streaming Wireless Application-specific Processors

  5. Outline
     • Designing SWAP
     • Swapping SWAP

  6. A proposed cellular receiver
     • High data rates in emerging wireless systems (Mbps/user)
     • Sophisticated algorithms for high spectral efficiency
     • Multiuser estimation, multiuser detection, Viterbi
     K = 1 => single user system (handset) (multipath)

  7. Estimation/Detection (64,32 sizes)
     [Block diagram: Multiuser Estimation; Prepare Matrices for Detection; Multiuser Detection]

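  8. Stream programming
     Kernels
     • Computation

         KERNEL example1(istream<int> a, istream<int> b, ostream<int> c) {
           loop_stream(a) {            // one iteration per element of stream a
             int ai, bi, ci;
             a >> ai;                  // read the next element of each input stream
             b >> bi;
             ci = ai * 2 + bi * 3;
             c << ci;                  // append the result to the output stream
           }
         }

     Streams
     • Communication

         void main() {
           Stream<int> a(256);
           Stream<int> b(256);
           Stream<int> c(256);
           Stream<int> d(1024);
           ...
           example1(a, b, c);          // stream c carries data from example1 to example2
           example2(c, d);
           ...
         }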
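The kernel/stream split on this slide can be mimicked in ordinary Python as a rough sketch. This is not StreamC/KernelC and says nothing about how Imagine schedules the work; it only shows the data flow. The slide does not define `example2`, so the version below (each element repeated four times, which happens to turn 256 elements into 1024) is purely an assumption made for illustration.

```python
from typing import Iterable, List


def example1(a: Iterable[int], b: Iterable[int]) -> List[int]:
    """Kernel: read one element from each input stream per iteration."""
    return [ai * 2 + bi * 3 for ai, bi in zip(a, b)]


def example2(c: Iterable[int]) -> List[int]:
    """Hypothetical second kernel (not defined on the slide).

    Assumption: each input element is emitted four times, so a
    256-element stream becomes a 1024-element stream.
    """
    return [ci for ci in c for _ in range(4)]


def run_pipeline() -> List[int]:
    """Stream-level program: streams connect the kernels."""
    a = list(range(256))   # stand-in input data
    b = [1] * 256          # stand-in input data
    c = example1(a, b)     # c is produced by one kernel...
    d = example2(c)        # ...and consumed by the next
    return d
```

The point of the structure is that kernels only touch the streams handed to them, so the stream-level program fully describes the communication between kernels.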

  9. Stream data flow
     [Figure: kernels (computation) connected by streams (communication).
     Multiuser Channel Estimation: estimation bits; correlation update kernel, matrix mult kernel, iteration update kernel.
     Data rearrangement: matrix transpose.
     Multiuser Detection: detection bits; matrix mul C kernel, matrix mul L kernel, matched filter kernel, PIC kernel.
     Buffer; Viterbi Decoding kernel.]

  10. Matrix multiplication kernel (Imagine)
      • 32 cycle loop
      • Executed on all 8 clusters
      [Figure: inner-loop instruction schedule for functional units ADD0, ADD1, ADD2, MUL0, MUL1, DIV0. Legend: Communication (waiting for input); FU unavailable (input ready but FU busy).]

  11. 22 cycle loop
      [Figure: instruction schedule for functional units ADD0, ADD1, ADD2, MUL0, MUL1, MUL2]

  12. [Figure: timeline by cycle of Kernel Execution and Memory Transfers, marking periods stalled waiting for data from memory]

  13. Current architecture designs
      • Memory stalls ignored
        • 16 Gbps memory systems in the future
      • Functional unit utilization ignored
        • Not important for execution time
      • Important metrics for wireless systems
        • Because of POWER constraints

  14. The Imagine architecture
      [Block diagram of the Imagine Stream Processor: four SDRAM banks, Streaming Memory System, Stream Register File, Network Interface, Stream Controller, Host Processor, Microcontroller, Network, ALU Clusters 0 to 7]

  15. Exploring design space
      • Cluster limits: data-parallelism
        • More parallelism in data => more clusters
      • FU limits: dependencies/VLIW scheduling
        • ≥ 2/type to pipeline dependencies
        • ≤ 5/type, as more are difficult for the compiler to schedule
      • Physical limits: 128 clusters with 10 ALUs

  16. Architecture design criteria

  17. SWAP Base-stations and SWAP Handsets
      [Figure: two block diagrams. Base-station: SDRAM, Streaming Memory System, Stream Register File, Microcontroller, 32 ALU Clusters, each drawn with three adders (+) and three multipliers (*). Handset: SMS, SRF, MC, 1 ALU Cluster.]

  18. Outline
      • Designing SWAP
      • Swapping SWAP
      • Real-time for emerging systems

  19. Adapting architectures for emerging systems
      [Flowchart: Test new algorithm on end-to-end system → Compile on existing architecture → Real-time/area/power satisfied? YES: Done. NO: Re-design algorithms/architecture.
      Re-design path: Operations count and real-time/(area/power) requirements → Architecture design scaling (# functional units, # clusters) → New architecture parameters → Fabrication feasible? NO: Design failed. YES: Compile on new architecture.]

  20. Automated adaptation tool
      If the new algorithms are similar and of reasonable complexity, real-time performance is attained with no changes in architecture. Otherwise, an automated tool scales the architecture while simultaneously targeting real-time performance and FU utilization.

  21. Algorithm constraints
      • 1 < #FU < 6
      • #clusters = 1, 2, 4, 8, ..., 128
      • Finite space
      • Exhaustive search for real-time with an “efficiency” metric
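The finite space described above is small enough to enumerate directly. The sketch below is only an illustration under stated assumptions: it considers two FU types (adders and multipliers), and both the cycle model and the "efficiency" formula are toy stand-ins invented here, since the slides do not give the actual metric. All function and parameter names are hypothetical.

```python
from itertools import product
from math import ceil

FU_COUNTS = range(2, 6)                       # 1 < #FU < 6, per FU type
CLUSTER_COUNTS = [2 ** k for k in range(8)]   # 1, 2, 4, ..., 128


def design_space():
    """Every (adders, multipliers, clusters) point in the finite space."""
    return product(FU_COUNTS, FU_COUNTS, CLUSTER_COUNTS)


def cycles_needed(adds, muls, n_add, n_mul, clusters):
    """Toy cost model (assumption): work splits evenly over clusters,
    each FU retires one op per cycle, no dependencies, no memory stalls."""
    adds_per_cluster = ceil(adds / clusters)
    muls_per_cluster = ceil(muls / clusters)
    return max(1, ceil(adds_per_cluster / n_add), ceil(muls_per_cluster / n_mul))


def exhaustive_search(adds, muls, cycle_budget):
    """Return the feasible point with the best toy efficiency, or None.

    Efficiency here (assumption) is useful ops divided by available FU
    issue slots. Ties go to fewer total ALUs, then to more clusters.
    """
    best_key, best_cfg = None, None
    for n_add, n_mul, clusters in design_space():
        cycles = cycles_needed(adds, muls, n_add, n_mul, clusters)
        if cycles > cycle_budget:          # misses the real-time deadline
            continue
        total_alus = (n_add + n_mul) * clusters
        eff = (adds + muls) / (total_alus * cycles)
        key = (eff, -total_alus, clusters)
        if best_key is None or key > best_key:
            best_key, best_cfg = key, (n_add, n_mul, clusters)
    return best_cfg
```

With two FU types the space has 4 × 4 × 8 = 128 points, so brute force is cheap; a more realistic cost model would come from compiling and simulating each point rather than from a closed-form estimate.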

  22. Can we do smarter?
      [Figure: ALUs required for real-time at 500 MHz (log scale, 10^0 to 10^3) versus Number of W-CDMA Cellular Users (0 to 300). Curves: Add and Multiply, for FAST FADING (estimation every 10 bits), MEDIUM FADING (estimation every 100 bits) and SLOW FADING (estimation every 1000 bits). Annotation: #Adders = #Multipliers.]

  23. Algorithm outline
      • Exploit data-parallelism as much as possible (AMAP)
        • Clusters more energy-efficient
      • Look at FU utilizations (FUU) in current architecture
        • (max %FUU)++: add one more unit of the FU type with the highest utilization
        • Bottlenecking other units
      • Make clock slower for real-time
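One possible reading of this outline, written as a minimal sketch: pick as many clusters as the data parallelism allows, then repeatedly add one unit of the most utilized FU type until the deadline is met, and finally report how far the clock could be slowed. The utilization model, the starting point of two FUs per type, and all names below are assumptions for illustration; the slides do not specify them.

```python
from math import ceil

MAX_FU_PER_TYPE = 5    # from the constraint 1 < #FU < 6
MIN_FU_PER_TYPE = 2
MAX_CLUSTERS = 128


def loop_cycles(ops, fus):
    """Toy model (assumption): one op per FU per cycle, so the loop
    length is set by the busiest FU type."""
    return max(ceil(ops[t] / fus[t]) for t in ops)


def utilization(ops, fus):
    """Fraction of each FU type's issue slots that do useful work."""
    cycles = loop_cycles(ops, fus)
    return {t: ops[t] / (fus[t] * cycles) for t in ops}


def scale_architecture(ops, data_parallelism, cycle_budget):
    """Return (clusters, fus, clock_scale), or None if infeasible.

    ops: total operations per FU type, e.g. {"add": 640, "mul": 1280}.
    clock_scale: fraction of the original clock that still meets the budget.
    """
    # Step 1: exploit data parallelism as much as possible (power-of-two clusters).
    clusters = 1
    while clusters * 2 <= min(data_parallelism, MAX_CLUSTERS):
        clusters *= 2
    per_cluster = {t: ceil(n / clusters) for t, n in ops.items()}

    # Step 2: grow the most utilized (bottleneck) FU type, one unit at a time.
    fus = {t: MIN_FU_PER_TYPE for t in ops}
    while loop_cycles(per_cluster, fus) > cycle_budget:
        util = utilization(per_cluster, fus)
        growable = [t for t in fus if fus[t] < MAX_FU_PER_TYPE]
        if not growable:
            return None                    # real time not reachable in this space
        busiest = max(growable, key=lambda t: util[t])
        fus[busiest] += 1

    # Step 3: any remaining slack lets the clock run slower.
    clock_scale = loop_cycles(per_cluster, fus) / cycle_budget
    return clusters, fus, clock_scale
```

For example, under this toy model `scale_architecture({"add": 640, "mul": 1280}, 32, 10)` settles on 32 clusters with 2 adders and 4 multipliers each, because the multipliers are the bottleneck at every step.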

  24. Conclusions
      New metrics for SWAP-ing
      • functional unit efficiency and memory stall minimization
      • relate to area-time-power metrics in ASICs
      Tradeoffs exist between
      • attaining high functional unit efficiency and minimizing memory stalls
      • writing architecture-scalable code and attaining higher functional unit efficiency

  25. Future work for thesis
      • Comparisons with DSPs and ASICs
      • Investigating new inter-cluster communication and support for data re-ordering on-chip
      • Automated tool for scaling architecture with algorithms and data rates
      • Power optimizations for handset architectures

  26. Kernel computational time
