1 / 23

Hierarchical Coarse-grained Stream Compilation for Software Defined Radio

Hierarchical Coarse-grained Stream Compilation for Software Defined Radio. Yuan Lin, Manjunath Kudlur, Scott Mahlke, Trevor Mudge Advanced Computer Architecture Laboratory University of Michigan at Ann Arbor. Software Defined Radio.

felcia
Download Presentation

Hierarchical Coarse-grained Stream Compilation for Software Defined Radio

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Hierarchical Coarse-grained Stream Compilation for Software Defined Radio Yuan Lin, Manjunath Kudlur, Scott Mahlke, Trevor Mudge Advanced Computer Architecture Laboratory University of Michigan at Ann Arbor

  2. Software Defined Radio • Use software routines instead of ASICs for the physical layer operations of wireless communication system • Advantages: • Multi-mode operation • Lower costs • Faster time to market • Prototyping and bug fixes • Chip volumes • Longevity of platforms • Enables future wireless communication innovations • Complexity favors software-based solutions

  3. Case Study: W-CDMA • Key software characteristics • Multiple kernels connected together as a system • Streaming computation • Vector-based inter-kernel communications • Mostly static computation patterns

  4. SODA: A SDR DSP Architecture (ISCA 06) • Control-data decoupled multi-core architecture • 1 ARM general purpose control processor • Scalar algorithms and protocol controls • 4 data processing elements • SIMD+Scalar units • Used for high-throughput DSP algorithms

  5. SODA Execution Model • Software managed scratchpad memories • Each PE can only access its local memory • DMA operations • Access global memory • Inter-PE communications • Algorithms statically mapped onto PEs • RPCs from the ARM control processor

  6. Compilation Challenges for SDR • Compilation support for SDR is essential • Flexibility • Lower development cost • More complex protocols • Compilation support for SDR is challenging • Heterogeneous multiprocessor hardware • ARM + DSPs • Two level scratchpad memories • Multiple software constraints • Throughput + code & data size + real-time execution + others

  7. 2-Tier Compilation Process • This study is focused on system compilation • Kernel compilation is treated as a black box • Existing libraries • SIMD compilers • Objective • Kernel-to-PE assignments • Memory allocations • Subject to • Throughput constraints • Memory constraints DSP kernel compilation Multiprocessor system compilation

  8. System Compilation Outline • SPIR – Function level IR • Traditional IR is not adequate • Complex inter-function interactions • Backend compilation • Scheduling functions instead of instructions • Function-level modulo scheduling

  9. SPIR Overview • Dataflow programming model • Graph consists of nodes and edges • Two types of nodes • Kernel (yellow) nodes for modeling functions • Memory (blue) nodes for modeling vector buffers • Buffer stream description + vector stream description • Dataflow edges • Synchronous dataflow (in the scope of this paper)

  10. SPIR Overview • Problems with flat dataflow graph representations • Matched to the highest rate • SDR kernels have very different stream rates • Turbo decoder: input rate = 9600; output rate = 3200 • LPF: input rate = 1; output rate = 1

  11. SPIR Overview • Problems with flat dataflow graph representations • All must match to 9600 of the Turbo decoder • Minimum LPF rate: input = 38.4K, output = 38.4K • Stream rates translate to memory buffers • Unnecessarily large memory buffers

  12. SPIR Overview • Hierarchical dataflow graphs • Different hierarchy level with different streaming rates • Streaming vectors are modeled as hierarchical communications • Top level: buffer queue descriptions • Bottom level: vector streaming descriptions

  13. SPIR Overview • W-CDMA • Modeled with 3-level hierarchy in SPIR • Memory nodes are inserted between nodes with child graph • 4x decrease in memory buffer usage

  14. Coarse-grained System Compilation • Three major tasks • Resource allocation (processor, memory and DMA) • Kernel execution ordering • Kernel execution timing • Static or dynamic? • Static – compiler • Less flexible, more efficient • Dynamic – run-time scheduler or OS • More flexible, less efficient • For SDR applications • Resource allocation: static • Kernel execution ordering: static • Kernel execution timing: dynamic

  15. Software Pipelining Streaming Kernels • Problem with coarse-grained compilation • Requires kernel-level parallelism to utilize the PEs • SDR protocols do not have many data-independent kernels • Compiler optimization: coarse-grained software pipelining • Stream computation: pipeline parallelism • Modulo scheduling

  16. Coarse-grained System Compilation • Input • Hierarchical graph • Step 1 • Dataflow rate matching • Step 2 • Stream size selection • Step 3 • Modulo scheduling • Step 4 • Hierarchical compilation Dataflow rate matching Stream size selection Modulo compilation Hierarchical scheduling

  17. Coarse-grained System Compilation • Step 1: Dataflow rate matching • Producer and consumer pair must have the same rates • Edges are memory buffers • Well studied with many existing algorithms • Single appearance schedule Dataflow rate matching

  18. Coarse-grained System Compilation • Step 2: Stream size selection • Pick optimal input/output buffer size • Multiple of the base rate • Binary search algorithm • Modulo schedule each candidate buffer size • Rate = 1, Streaming N elements • Case 1: N iterations • Too much DMA overhead • Case 2: 1 iteration • Cannot software pipeline • Case 3: N/M iterations Stream size selection

  19. Coarse-grained System Compilation • Step 3: Function-level modulo scheduling • II selection (Initiation Interval) • Interval between the start of successive iterations • MinII = Max(ResMII, RecMII) • ResMII: total latency of all nodes divided by # of PEs • RecMII: maximum latency of feedback paths • Constraint-based modulo scheduling • SMT-based algorithm Modulo compilation

  20. SMT-based Modulo Scheduling • Using Satisfiability Modulo Theory (SMT) solver Yices • Input: a set of constraints expressed as equations • Output: a set of conditions where the constraints evaluate to true • Constraints • Throughput constraints • i.e. total execution time must be less than or equal to II • Memory constraints • i.e. buffer size less than PE’s scratchpad memories • Communication constraints • i.e. DMA added for communicating kernels on different PEs number of kernels status of kernel vi assigned to processor j (1 or 0)

  21. Coarse-grained System Compilation • Step 4: Hierarchical scheduling • Bottom up scheduling • Treat each child graph as a single node • Memory nodes assigned to global memory Hierarchical scheduling

  22. Conclusion • Compilation support for SDR is essential • 2-tiered compilation process • System compilation • DSP compilation • System compilation is function-level scheduling • Hierarchical dataflow IR • ~4x saving in memory buffer allocation • SMT-based modulo scheduling • Linear speedup up to 8 PEs • Resulting in ~23% faster schedules than greedy

  23. Questions

More Related