A Programmable Coprocessor Architecture for Wireless Applications
Yuan Lin, Nadav Baron, Hyunseok Lee, Scott Mahlke, Trevor Mudge
Advanced Computer Architecture Lab, University of Michigan
Sept. 2004
Introduction
• Growing need to support multiple wireless protocols
• Software defined radio: implementing DSP algorithms in software rather than hardware
  • ASIC: high performance, low flexibility
  • Processor: high flexibility, low performance
• Objective: achieve real-time performance with processor flexibility and programmability
Performance Requirements
• UWB: 200 Mbps
• Hiperlan2: 36 Mbps
• 802.11b: 11 Mbps
DSP Algorithm Characteristics
• Streaming data
  • Short variable liveness
  • High data throughput
• High data level parallelism
• Low control flow overhead
  • Counted loops
  • Few data-dependent branches
Proposed Coprocessor Architecture: MAPP
• Streaming data
  • Macro pipeline architecture
  • No cache structure
• High data level parallelism
  • Vector architecture
• Low control flow overhead
  • No branch predictors
• Programmability to support multiple protocols
MAPP Architectural Diagram
[Figure: block diagram. Labeled blocks: ARM Core, Instruction Cache, Data Cache, VPP Controller, Vector Processing Pipeline, PPU (x3).]
PPU Architectural Diagram
[Figure: block diagram of a Pipeline Processing Unit. Labeled blocks: VPP Controller, Internal Instruction Buffer, In Queue (Data In), Vector Register File, Vector ALU, Out Queue (Data Out).]
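The macro pipeline idea, where each PPU reads vectors from an in queue and writes results to an out queue that feeds the next PPU, can be pictured with a small software model. This is only a sketch under stated assumptions: the `PPU` class, the `run_pipeline` helper, the one-vector-per-step timing, and the two example stage operations are all hypothetical and are not taken from the slides.

```python
from collections import deque


class PPU:
    """Hypothetical model of one pipeline processing unit:
    in queue -> vector operation -> out queue."""

    def __init__(self, name, vector_op):
        self.name = name
        self.vector_op = vector_op  # function applied to each whole vector
        self.in_queue = deque()
        self.out_queue = deque()

    def step(self):
        """Consume one vector from the in queue, if any, and emit the result."""
        if self.in_queue:
            self.out_queue.append(self.vector_op(self.in_queue.popleft()))


def run_pipeline(stages, stream):
    """Push every vector in `stream` through the chained stages, in order."""
    results = []
    pending = deque(stream)
    # Keep stepping until the input is exhausted and every queue has drained.
    while pending or any(s.in_queue or s.out_queue for s in stages):
        if pending:
            stages[0].in_queue.append(pending.popleft())
        # Step stages back to front so a vector advances one stage per iteration.
        for i in reversed(range(len(stages))):
            stage = stages[i]
            stage.step()
            if stage.out_queue:
                vec = stage.out_queue.popleft()
                if i + 1 < len(stages):
                    stages[i + 1].in_queue.append(vec)
                else:
                    results.append(vec)
    return results


# Two illustrative stages (assumed operations, not from the slides).
stages = [
    PPU("scale", lambda v: [2 * x for x in v]),
    PPU("offset", lambda v: [x + 1 for x in v]),
]
print(run_pipeline(stages, [[1, 2], [3, 4]]))  # prints [[3, 5], [7, 9]]
```

Because data only moves forward through queues, no stage needs a cache to find its operands, which is the property the "no cache structure" bullet relies on.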
Mapping DSP Algorithms: Viterbi ACS
[Figure: add-compare-select dataflow. Input vectors s0, s1, bm0, bm1; intermediate vectors v0, v1; a comparison mask; a mux producing the output vector s'. Numeric example values not recoverable.]
  vadd v0, s0, bm0
  vadd v1, s1, bm1
  cmp v0, v1
  move{le} s', v1
  move{g} s', v2
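The add-compare-select step can be sketched in plain Python as two vector adds, an element-wise compare that yields a mask, and a masked select. This is a minimal illustration, not the MAPP instruction semantics: it assumes the survivor is the smaller candidate metric (one common convention, which the slide does not state), and the function names and input values are made up for the example.

```python
def vadd(a, b):
    """Element-wise vector add, standing in for the vadd instruction."""
    return [x + y for x, y in zip(a, b)]


def acs(s0, bm0, s1, bm1):
    """One add-compare-select step over whole vectors.

    s0, s1   : path metrics of the two predecessor states (per lane)
    bm0, bm1 : branch metrics of the two incoming transitions (per lane)
    Returns the new path metrics and the per-lane decision mask.
    """
    v0 = vadd(s0, bm0)                        # add
    v1 = vadd(s1, bm1)                        # add
    mask = [x <= y for x, y in zip(v0, v1)]   # compare
    # Select: assumed convention, keep the smaller candidate in each lane.
    s_new = [x if m else y for x, y, m in zip(v0, v1, mask)]
    return s_new, mask


# Illustrative inputs (hypothetical values, not the ones in the figure).
s_new, mask = acs(s0=[0, 4, 8, 2], bm0=[2, 0, 1, 3],
                  s1=[3, 1, 8, 0], bm1=[0, 2, 1, 1])
print(s_new)  # prints [2, 3, 9, 1]
print(mask)   # prints [True, False, True, False]
```

The mask plays the role of the predicate on the two conditional moves: every lane executes the same instruction sequence, so no data-dependent branch is needed.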
Increase Area/Power Efficiency
• Data slice architecture
  • Most DSP algorithms do not need 32-bit precision
  • Viterbi decoding operates on 8-bit data
  • Filters may need 16-bit precision
• Partial processor execution
  • Statically determined code
  • Turn off architecture units not used
  • Energy saving, no area saving
Vector Cluster Diagram (4x8-bit data slice)
[Figure: four data slices, each with an In Queue, Register File, ALU, and Out Queue, connected by a 4x4 Local Interconnect Network.]
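One way to picture a 4x8-bit data slice is as a 32-bit datapath split into four independent 8-bit lanes, so that a carry out of one lane never reaches its neighbor. The sketch below shows only that lane independence; the constants, helper names, and wraparound behavior are assumptions for illustration and say nothing about how the actual hardware handles overflow or wider precision.

```python
LANES = 4                          # assumed: four slices per cluster
LANE_BITS = 8                      # assumed: 8 bits per slice
LANE_MASK = (1 << LANE_BITS) - 1   # 0xFF


def split_lanes(word):
    """Split a 32-bit word into four 8-bit lanes, lowest lane first."""
    return [(word >> (LANE_BITS * i)) & LANE_MASK for i in range(LANES)]


def join_lanes(lanes):
    """Pack four 8-bit lanes back into one 32-bit word."""
    word = 0
    for i, lane in enumerate(lanes):
        word |= (lane & LANE_MASK) << (LANE_BITS * i)
    return word


def sliced_add(a, b):
    """Add two words lane by lane; each lane wraps modulo 256 on its own."""
    sums = [(x + y) & LANE_MASK for x, y in zip(split_lanes(a), split_lanes(b))]
    return join_lanes(sums)


a, b = 0x01FF02FE, 0x01010101
print(hex(sliced_add(a, b)))           # prints 0x20003ff
print(hex((a + b) & 0xFFFFFFFF))       # prints 0x30003ff (carry crosses lanes)
```

The two results differ only in the top byte, where the ordinary 32-bit add lets the overflow of the third lane ripple upward.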
Simplistic Power Analysis
• Based on ARM9 data in 0.13 µm
• Viterbi decoder (K=7): 0.75 W to 1 W
  • 64x4 8-bit ALU: ~240 mW
  • 12 KB memory: ~310 mW
  • Clock: ~200 mW
  • Others: ~250 mW
• ASIC implementations: 7.65 mW to 0.7 W (with different throughputs)
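As a quick arithmetic check, the four component estimates on this slide add up to 1 W, the upper end of the range quoted for the Viterbi decoder. The short script below just sums the listed figures and reports each component's share; the `power_shares` helper is introduced here for illustration only.

```python
# Component estimates in milliwatts, copied from the slide.
components_mw = {
    "64x4 8-bit ALU": 240,
    "12KB Mem": 310,
    "Clock": 200,
    "Others": 250,
}


def power_shares(components):
    """Return the total in mW and each component's share of it in percent."""
    total = sum(components.values())
    shares = {name: 100.0 * mw / total for name, mw in components.items()}
    return total, shares


total_mw, shares = power_shares(components_mw)
print(f"total: {total_mw} mW")        # prints total: 1000 mW
for name, pct in shares.items():
    print(f"{name}: {pct:.0f}%")
```

On these numbers the memory is the largest single contributor at roughly 31%, ahead of the ALU array.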
Conclusion & Future Work
• Programmable coprocessor architecture
  • Can support multiple protocols
  • Achieves real-time computational requirements
  • Reasonable power consumption
• Future work
  • Realistic power model simulation
  • Implement complete protocols
  • Algorithm behavior studies
  • Shrink processor area