A Programmable Coprocessor Architecture for Wireless Applications

Yuan Lin, Nadav Baron, Hyunseok Lee, Scott Mahlke, Trevor Mudge
Advanced Computer Architecture Lab, University of Michigan
Sept. 2004

Introduction
• Growing need to support multiple wireless protocols
• Software-defined radio: implementing DSP algorithms in software rather than hardware
• ASIC: high performance, low flexibility
• Processor: high flexibility, low performance
• Objective: achieve real-time performance with the flexibility and programmability of a processor

Performance Requirements
• UWB: 200 Mbps
• Hiperlan2: 36 Mbps
• 802.11b: 11 Mbps

DSP Algorithm Characteristics
• Streaming data
• Short variable liveness
• High data throughput
• High data-level parallelism
• Low control flow overhead
• Counted loops
• Few data-dependent branches

Proposed Coprocessor Architecture: MAPP
• Stream Data
  • Macro pipeline architecture
  • No cache structure
• High Data Level Parallelism
  • Vector architecture
• Low Control Flow Overhead
  • No branch predictors
• Programmability to support multiple protocols

MAPP Architectural Diagram
[Figure: block diagram showing an ARM core, instruction cache, data cache, VPP controller, and a Vector Processing Pipeline containing three PPUs.]

PPU Architectural Diagram
[Figure: Pipeline Processing Unit (PPU) with data in, in queue, vector register file, vector ALU, out queue, data out, an internal instruction buffer, and connections to the VPP controller.]
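The macro pipeline of PPUs, each fed by an in queue and draining into an out queue, can be pictured with a small software model. This is only a sketch: the `PPU` class, the `run_pipeline` helper, and the rule that one PPU's out queue feeds the next PPU's in queue are assumptions made for illustration, not details taken from the slides.

```python
from collections import deque


class PPU:
    """Hypothetical model of one pipeline stage: in queue -> op -> out queue."""

    def __init__(self, op):
        self.op = op                # function applied to each incoming vector
        self.in_queue = deque()
        self.out_queue = deque()

    def step(self):
        # Process at most one vector per step, as a single pipeline stage would.
        if self.in_queue:
            self.out_queue.append(self.op(self.in_queue.popleft()))


def run_pipeline(stages, vectors):
    """Stream vectors through the stages and collect the final results."""
    stages[0].in_queue.extend(vectors)
    results = []
    while any(s.in_queue or s.out_queue for s in stages):
        # Step the last stage first so a vector advances one stage per pass.
        for i in reversed(range(len(stages))):
            stage = stages[i]
            stage.step()
            while stage.out_queue:
                item = stage.out_queue.popleft()
                if i + 1 < len(stages):
                    stages[i + 1].in_queue.append(item)  # assumed stage-to-stage link
                else:
                    results.append(item)
    return results


# Two illustrative stages: scale every element, then add an offset.
stages = [PPU(lambda v: [2 * x for x in v]), PPU(lambda v: [x + 1 for x in v])]
print(run_pipeline(stages, [[1, 2], [3, 4]]))  # prints [[3, 5], [7, 9]]
```

Because data only moves forward through the queues, no stage needs a cache to revisit earlier data, which is the property the "no cache structure" bullet relies on.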

Mapping DSP Algorithms: Viterbi ACS
[Figure: state vectors s0 and s1 are added to branch-metric vectors bm0 and bm1 to form v0 and v1; a compare produces a per-element mask (l/e/g) that drives a mux selecting s'. Element values are not recoverable.]
    vadd v0, s0, bm0
    vadd v1, s1, bm1
    cmp v0, v1
    move{le} s', v1
    move{g} s', v2
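The add-compare-select step above can be sketched element-wise in Python. The function name `acs`, the sample values, and the choice to keep the smaller path metric as the survivor are assumptions for illustration; the slide does not make the selection direction clear.

```python
def acs(s0, s1, bm0, bm1):
    """Element-wise add-compare-select over equal-length vectors."""
    v0 = [s + b for s, b in zip(s0, bm0)]    # corresponds to: vadd v0, s0, bm0
    v1 = [s + b for s, b in zip(s1, bm1)]    # corresponds to: vadd v1, s1, bm1
    mask = [a <= b for a, b in zip(v0, v1)]  # corresponds to: cmp v0, v1
    # Select step (the predicated moves on the slide).
    # Assumption: the smaller candidate metric survives.
    s_new = [a if m else b for a, b, m in zip(v0, v1, mask)]
    return s_new, mask


# Illustrative values only; they are not taken from the slide's figure.
s_new, mask = acs([0, 4, 8, 2], [4, 0, 2, 8], [0, 2, 1, 1], [2, 0, 1, 1])
print(s_new)  # prints [0, 0, 3, 3]
print(mask)   # prints [True, False, False, True]
```

Every element is processed independently with no data-dependent branch, which is why the step maps naturally onto vector adds, a vector compare, and predicated moves.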

Increase Area/Power Efficiency
• Data slice architecture
  • Most DSP algorithms do not need 32-bit precision
  • Viterbi decoding operates on 8-bit data
  • Filters may need 16-bit precision
• Partial processor execution
  • Statically determined code
  • Turn off architecture units that are not used
  • Energy saving, no area saving
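As a software analogy for narrow data slices, the sketch below adds four independent 8-bit lanes packed into one 32-bit word, with no carry crossing a lane boundary. The helper names `pack4`, `unpack4`, and `add_4x8` are hypothetical, and this illustrates the idea of 8-bit slices only; it does not describe how the MAPP hardware is built.

```python
def pack4(lanes):
    """Pack four 8-bit values into one 32-bit word, most significant lane first."""
    word = 0
    for value in lanes:
        word = (word << 8) | (value & 0xFF)
    return word


def unpack4(word):
    """Split a 32-bit word back into four 8-bit lanes."""
    return [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]


def add_4x8(a, b):
    """Add four packed 8-bit lanes, wrapping modulo 256 within each lane."""
    # Add the low 7 bits of every lane; carries cannot leave a lane.
    low = (a & 0x7F7F7F7F) + (b & 0x7F7F7F7F)
    # Fold in each lane's top bit with XOR, dropping the carry out of the lane.
    return (low ^ ((a ^ b) & 0x80808080)) & 0xFFFFFFFF


a = pack4([0x01, 0xFF, 0x7F, 0x10])
b = pack4([0x01, 0x01, 0x01, 0x01])
print([hex(x) for x in unpack4(add_4x8(a, b))])  # prints ['0x2', '0x0', '0x80', '0x11']
```

The second lane wraps from 0xFF to 0x00 without disturbing its neighbour, which is the behaviour four separate 8-bit ALUs would give.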

Vector Cluster Diagram (4x8 bit data slice)
[Figure: four slices, each with an in queue, register file, ALU, and out queue, joined by a 4x4 local interconnect network.]

Performance Results

Simplistic Power Analysis
• Based on ARM9 data in 0.13 µm
• Viterbi Decoder (K=7): 0.75W ~ 1W
  • 64x4 8-bit ALU: ~240mW
  • 12KB Mem: ~310mW
  • Clock: ~200mW
  • Others: ~250mW
• ASIC implementations: 7.65mW ~ 0.7W (with different throughputs)
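The component estimates can be summed to check them against the quoted upper bound for the Viterbi decoder. The dictionary keys and the `total_power_mw` helper are hypothetical names; the numbers are the approximate figures listed above.

```python
# Approximate component estimates from the slide, in milliwatts.
components_mw = {
    "64x4 8-bit ALU": 240,
    "12KB memory": 310,
    "clock": 200,
    "others": 250,
}


def total_power_mw(components):
    """Sum the per-component power estimates."""
    return sum(components.values())


total_mw = total_power_mw(components_mw)
print(f"Total: {total_mw} mW ({total_mw / 1000:.2f} W)")  # prints Total: 1000 mW (1.00 W)
```

The sum lands at the top of the 0.75W ~ 1W range given for the decoder.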

Conclusion & Future Work
• Programmable coprocessor architecture
  • Can support multiple protocols
  • Achieves real-time computational requirements
  • Reasonable power consumption
• Future work
  • Realistic power model simulation
  • Implement complete protocols
  • Algorithm behavior studies
  • Shrink processor area
