SWAP: Streaming Wireless Application-specific Processors

SWAP: Streaming Wireless Application-specific Processors. Sridhar Rajagopal, Scott Rixner, Joseph R. Cavallaro. {sridhar,rixner,cavallar}@rice.edu

New challenges in designing wireless systems • Flexibility • Fast evaluation • Adapting architectures for emerging systems

Implementation of Wireless Devices [Figure: implementation options GPP, DSP, FPGA, ASIC; axis label: Time]
Traditional challenge: • Primary constraint: min. area, power and real-time • Secondary constraint: flexibility, evaluation, adaptation
New challenge: • Primary constraint: flexibility, evaluation, adaptation • Secondary constraint: min. area, power and real-time

SWAP • Media processors – recent trend in DSP architectures • Explore space of stream processors with isim, a stream processor simulator • Based on the IMAGINE architecture from MIT/Stanford • Swap existing ASIC/FPGA/DSP baseband architectures with Streaming Wireless Application-specific Processors

Outline • Designing SWAP • Swapping SWAP

A proposed cellular receiver • High data rates in emerging wireless systems (Mbps/user) • Sophisticated algorithms for high spectral efficiency • Multiuser estimation, multiuser detection, Viterbi • K = 1 => single user system (handset) (multipath)

Estimation/Detection (64,32 sizes) [Figure: Multiuser Estimation, Prepare Matrices for Detection, Multiuser Detection]

Stream programming
Kernels • Computation

    KERNEL example1(istream<int> a, istream<int> b, ostream<int> c)
    {
        loop_stream(a) {
            int ai, bi, ci;
            a >> ai;
            b >> bi;
            ci = ai * 2 + bi * 3;
            c << ci;
        }
    }

Streams • Communication

    void main()
    {
        Stream<int> a(256);
        Stream<int> b(256);
        Stream<int> c(256);
        Stream<int> d(1024);
        ...
        example1(a, b, c);
        example2(c, d);
        ...
    }
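The kernel/stream split on this slide can be mimicked in ordinary Python as a rough sketch. This is not KernelC or StreamC: the generator `example1` below only mirrors the per-element loop of the slide's kernel, and modelling streams as plain lists is an assumption made for illustration. The slide does not show the body of `example2`, so it is left out.

```python
from typing import Iterable, Iterator


def example1(a: Iterable[int], b: Iterable[int]) -> Iterator[int]:
    """Python stand-in for the slide's kernel: c = a*2 + b*3, element by element."""
    for ai, bi in zip(a, b):       # plays the role of loop_stream(a) with a >> ai; b >> bi
        yield ai * 2 + bi * 3      # plays the role of c << ci


# Streams are modelled as plain lists; the length 256 is taken from the slide,
# the element values are made up for this sketch.
a = list(range(256))
b = [1] * 256
c = list(example1(a, b))
```

The kernel touches only the elements it pulls from its input streams, which is what lets a stream processor run the same loop body on every cluster in parallel.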

Stream data flow [Figure: computation kernels connected by stream communication. Estimation (Multiuser Channel Estimation): bits, Correlation update kernel, Matrix mult kernel, Iteration update kernel. Data rearrangement: Matrix transpose, Matrix mul C kernel, Matrix mul L kernel. Detection (Multiuser Detection): Matched filter kernel, PIC kernel, Buffer. Viterbi Decoding kernel, bits.]

Matrix multiplication kernel (Imagine) • 32 cycle loop • Executed on all 8 clusters [Figure: inner-loop instruction schedule across functional units ADD0, ADD1, ADD2, MUL0, MUL1, DIV0; legend: Communication (waiting for input), FU unavailable (input ready but FU busy)]

22 cycle loop [Figure: instruction schedule across functional units ADD0, ADD1, ADD2, MUL0, MUL1, MUL2]

[Figure: kernel execution and memory transfers per cycle, marking time stalled waiting for data from memory]

Current architecture designs • Memory stalls ignored • 16 Gbps memory systems in the future • Functional unit utilization ignored • Not important for execution time • Important metrics for wireless systems • Because of POWER constraints

The Imagine architecture [Figure: Imagine Stream Processor block diagram: four SDRAM banks, Streaming Memory System, Stream Register File, Stream Controller, Host Processor, Network Interface, Network, Microcontroller, ALU Clusters 0 to 7]

Exploring design space • Cluster limits: data-parallelism • More parallelism in data => more clusters • FU limits: dependencies/VLIW scheduling • ≥ 2/type to pipeline dependencies • ≤ 5/type, as more are difficult for the compiler to schedule • Physical limits: 128 clusters with 10 ALUs

Architecture design criteria

SWAP Base-stations and SWAP Handsets [Figure: two SWAP configurations, with 1 ALU cluster and with 32 ALU clusters. Each has SDRAM, a Streaming Memory System (SMS), a Stream Register File (SRF), a Microcontroller (MC), and clusters of three adders (+) and three multipliers (*)]

Outline • Designing SWAP • Swapping SWAP • Real-time for emerging systems

Adapting architectures for emerging systems [Flowchart: Test new algorithm on end-to-end system → Compile on existing architecture → Real-time/area/power satisfied? YES: Done. NO: Re-design algorithms/architecture → Real-time (area/power) requirements, operations count → Architecture design scaling (# functional units, # clusters) → New architecture parameters → Fabrication feasible? NO: Design failed. YES: Compile on new architecture.]

Automated adaptation tool • If the new algorithms are similar and of reasonable complexity, real-time is met with no changes in architecture • Otherwise, an automated tool scales the architecture while simultaneously targeting real-time and FU utilization

Algorithm constraints • 1 < #FU < 6 • #clusters = 1,2,4,8, ... 128 • Finite space • Exhaustive search for real-time with an “efficiency” metric
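A minimal sketch of such an exhaustive search, assuming a toy workload of adds and multiplies. The cost model (`toy_cycles`) and the "efficiency" metric (`toy_efficiency`, average FU utilization) are hypothetical stand-ins, since the slides do not define either; only the bounds on #FU and #clusters come from the slide.

```python
from itertools import product

FU_COUNTS = range(2, 6)                        # 1 < #FU < 6, per FU type
CLUSTER_COUNTS = [2 ** k for k in range(8)]    # 1, 2, 4, ..., 128


def toy_cycles(adds, muls, adders, multipliers, clusters):
    """Hypothetical cost model: work splits evenly over clusters and the
    busier FU type determines the cycle count."""
    return max(adds / clusters / adders, muls / clusters / multipliers)


def toy_efficiency(adds, muls, adders, multipliers, clusters):
    """Hypothetical efficiency metric: average FU utilization in [0, 1]."""
    cycles = toy_cycles(adds, muls, adders, multipliers, clusters)
    busy_slots = (adds + muls) / clusters
    return busy_slots / (cycles * (adders + multipliers))


def search(adds, muls, cycle_budget):
    """Try every configuration; keep the most efficient one that meets the
    cycle budget, breaking ties in favour of fewer total ALUs."""
    best, best_key = None, None
    for adders, multipliers, clusters in product(FU_COUNTS, FU_COUNTS, CLUSTER_COUNTS):
        if toy_cycles(adds, muls, adders, multipliers, clusters) > cycle_budget:
            continue                           # misses real-time
        eff = toy_efficiency(adds, muls, adders, multipliers, clusters)
        key = (eff, -clusters * (adders + multipliers))
        if best_key is None or key > best_key:
            best, best_key = (adders, multipliers, clusters), key
    return best                                # None if nothing meets real-time


result = search(adds=6000, muls=4000, cycle_budget=100)
```

Because the space holds only 4 × 4 × 8 = 128 configurations under these bounds, brute force is cheap; the interesting part in practice is the cost model, which here is a placeholder.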

ALUs required for real-time at 500 MHz [Plot: ALUs required (log scale, 10^0 to 10^3) vs. number of W-CDMA cellular users (0 to 300); add and multiply curves for FAST FADING (estimation every 10 bits), MEDIUM FADING (estimation every 100 bits) and SLOW FADING (estimation every 1000 bits); #Adders = #Multipliers] Can we do smarter?
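As a back-of-the-envelope illustration of how an "ALUs required for real-time" figure can be derived, the sketch below assumes each ALU completes one operation per cycle at full utilization and that estimation cost is amortized over the bits between estimates. The function names and all numeric inputs are made up for illustration; they are not the values behind the plot.

```python
from math import ceil


def ops_per_bit(detection_ops, estimation_ops, bits_per_estimate):
    """Per-bit work: detection on every bit plus estimation amortized over
    the bits between channel estimates (10, 100 or 1000 on the plot)."""
    return detection_ops + estimation_ops / bits_per_estimate


def alus_for_real_time(work_per_bit, bit_rate_bps, clock_hz=500e6):
    """ALUs needed if each ALU retires one operation per clock cycle."""
    return ceil(work_per_bit * bit_rate_bps / clock_hz)


# Hypothetical workload: 1000 detection ops/bit, 50000 ops per estimate, 1 Mbps.
fast = alus_for_real_time(ops_per_bit(1000, 50000, 10), 1e6)
slow = alus_for_real_time(ops_per_bit(1000, 50000, 1000), 1e6)
```

Under these assumptions, frequent estimation (fast fading) dominates the per-bit work, which matches the ordering of the curves on the plot.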

Algorithm outline • Exploit data-parallelism as much as possible (AMAP) • Clusters are more energy-efficient • Look at FU utilizations (FUU) in the current architecture • (max %FUU)++ : add a unit of the FU type with the highest utilization • It is bottlenecking the other units • Make the clock slower for real-time
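The outline above can be read as a greedy loop, sketched below under stated assumptions: the workload is a count of adds and multiplies, work divides evenly across clusters, and the FU type with the highest utilization is the bottleneck. The function `greedy_scale`, its parameters, and the cost model are hypothetical; the slides do not specify the tool's actual procedure.

```python
def greedy_scale(adds, muls, cycle_budget, parallelism, max_clusters=128, max_fu=5):
    """Return (adders, multipliers, clusters, clock_fraction) or None."""
    # Step 1: exploit data parallelism as much as possible (power-of-two clusters).
    clusters = 1
    while clusters * 2 <= min(parallelism, max_clusters):
        clusters *= 2

    adders = multipliers = 2                   # smallest count allowed by 1 < #FU
    while True:
        add_time = adds / clusters / adders
        mul_time = muls / clusters / multipliers
        cycles = max(add_time, mul_time)
        if cycles <= cycle_budget:
            # Step 3: real-time met with slack, so the clock can run slower.
            return adders, multipliers, clusters, cycles / cycle_budget
        # Step 2: (max %FUU)++ : grow the FU type with the highest utilization.
        if add_time >= mul_time:
            if adders == max_fu:
                return None                    # bottleneck cannot be widened
            adders += 1
        else:
            if multipliers == max_fu:
                return None
            multipliers += 1


result = greedy_scale(adds=6000, muls=4000, cycle_budget=70, parallelism=32)
```

Unlike an exhaustive search, this loop evaluates only a handful of configurations, at the cost of possibly missing a better design that needs fewer clusters.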

Conclusions New metrics for SWAP-ing • functional unit efficiency and memory stall minimization • relate to area-time-power metrics in ASICs. Tradeoffs exist between • attaining high functional unit efficiency and minimizing memory stalls • writing architecture-scalable code and attaining higher functional unit efficiency

Future work for thesis • Comparisons with DSPs and ASICs • Investigating new inter-cluster communication and support for data re-ordering on-chip • Automated tool for scaling architecture with algorithms and data rates. • Power optimizations for handset architectures

Kernel computational time
