
Liwen Shih , Ph.D. Computer Engineering U of Houston – Clear Lake shih@uhcl.edu


Presentation Transcript


  1. Adaptive Latency-Aware Parallel Resource Mapping: Task Graph Scheduling on Heterogeneous Network Topology. Liwen Shih, Ph.D., Computer Engineering, U of Houston – Clear Lake, shih@uhcl.edu

  2. ADAPTIVE PARALLEL TASK TO NETWORK TOPOLOGY MAPPING
  Latency-adaptive:
  • Topology
  • Traffic
  • Bandwidth
  • Workload
  • System hierarchy
  Thread partition:
  • Coarse
  • Medium
  • Fine

  3. Fine-Grained Mapping System [Shih 1988]
  • Parallel mapping: compile-time vs. run-time
  • Task migration: vertical vs. horizontal
  • Domain decomposition: data vs. function
  • Execution order: eager data-driven vs. lazy demand-driven

  4. PRIORITIZE TASK DFG NODES
  Task priority factors:
  • Level depth
  • Critical paths
  • In/out degree
  Data-flow partial order: {(n7→n5), (n7→n4), (n6→n4), (n6→n3), (n5→n1), (n4→n2), (n3→n2), (n2→n1)} ⇒ total task priority order: {n1 > n2 > n4 > n3 > n5 > n6 > n7}
  • P2 thread: {n1 > n2 > n4 > n3 > n6}
  • P3 thread: {n5 > n7}
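The level-depth factor above can be sketched in a few lines. This is an illustrative reconstruction only: it treats the slide's partial-order pairs as producer→consumer edges, takes n1 as the output (sink), and defines a node's level as its longest edge-distance down to the sink. It computes level depth alone, not the critical-path and in/out-degree tie-breakers the slide also lists.

```python
from collections import defaultdict
from functools import lru_cache

# Producer -> consumer data-flow edges from the slide's partial order.
edges = [("n7", "n5"), ("n7", "n4"), ("n6", "n4"), ("n6", "n3"),
         ("n5", "n1"), ("n4", "n2"), ("n3", "n2"), ("n2", "n1")]

consumers = defaultdict(list)          # node -> nodes consuming its output
for src, dst in edges:
    consumers[src].append(dst)

@lru_cache(maxsize=None)
def level(node):
    """Level depth = longest path (in edges) down to the sink n1."""
    outs = consumers.get(node, [])
    return 0 if not outs else 1 + max(level(d) for d in outs)

levels = {n: level(n) for n in {x for e in edges for x in e}}
# Yields n1:0, n2:1, n5:1, n3:2, n4:2, n6:3, n7:3 for this DFG.
print(dict(sorted(levels.items())))
```

Note that level depth alone leaves ties (e.g. n3 vs. n4), which is exactly where the slide's critical-path and degree factors come in.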

  5. SHORTEST-PATH NETWORK ROUTING Shortest-path latencies and routes are updated after each task-to-processor allocation.
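One way to maintain such a latency table is single-source Dijkstra over the processor network, re-run for the processors affected by the latest allocation. The sketch below is a minimal assumption-laden example: the 4-processor ring and its link latencies are made up for illustration, and it returns only latencies, not the routing (predecessor) table.

```python
import heapq

# Processor network as {node: {neighbor: link latency}} - illustrative only.
network = {
    "P0": {"P1": 1, "P3": 1},
    "P1": {"P0": 1, "P2": 2},
    "P2": {"P1": 2, "P3": 1},
    "P3": {"P0": 1, "P2": 1},
}

def shortest_latencies(src):
    """Dijkstra: shortest latency from src to every reachable processor."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry, skip
        for v, w in network[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

print(shortest_latencies("P0"))
```

On this ring, P2 is reached from P0 in latency 2 via P3 rather than 3 via P1, which is the kind of route choice the scheduler's table must reflect after each update.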

  6. Adaptive A* Parallel Processor Scheduler
  • Given a directed acyclic task DFG G(V, E), with task vertex set V connected by data-flow edge set E, and a processor network topology N(P, C), with processor node set P connected by channel link set C,
  • Find a processor assignment and schedule S: V(G) → P(N) that minimizes the total parallel computation time of G.
  • A* heuristic mapping reduces scheduling complexity from NP to P.
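The objective being minimized can be made concrete with a small sketch: given an assignment S as a dict from tasks to processors, estimate the schedule length (makespan) by walking the DFG in topological order. The three-task DFG, task times, and hop-latency table below are invented for illustration, and the model deliberately ignores processor contention to keep the objective itself visible.

```python
# Illustrative inputs (not from the paper): task times, DFG parents,
# and inter-processor hop latencies for a 2-processor network.
compute = {"a": 2, "b": 3, "c": 1}             # task execution times
parents = {"a": [], "b": [], "c": ["a", "b"]}  # data-flow predecessors
hop_latency = {("P0", "P0"): 0, ("P0", "P1"): 1,
               ("P1", "P0"): 1, ("P1", "P1"): 0}

def makespan(S):
    """Schedule length for assignment S: V -> P (contention ignored)."""
    finish = {}
    for v in ("a", "b", "c"):                  # topological order of the DFG
        ready = max((finish[p] + hop_latency[(S[p], S[v])]
                     for p in parents[v]), default=0)
        finish[v] = ready + compute[v]
    return max(finish.values())

print(makespan({"a": "P0", "b": "P1", "c": "P0"}))  # -> 5
```

Here c waits on b's result crossing one hop (finish 3 + latency 1), so c finishes at 5; minimizing this quantity over all assignments S is the NP-hard problem the A* heuristic approximates.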

  7. Demand-Driven Task-Topology Mapping
  • STEP 1 – assign a level to each task node vertex in G.
  • STEP 2 – count the critical paths passing through each DFG edge and node with a two-pass, bottom-up then top-down graph traversal.
  • STEP 3 – initially load and prioritize all deepest-level task nodes that produce outputs onto the working task node list.
  • STEP 4 – WHILE the working task node list is not empty, schedule a best processor to the top-priority task, and replace it with its parent task nodes inserted onto the working task node priority list.

  8. Demand-Driven Processor Scheduling
  STEP 4 – WHILE the working task node list is not empty: BEGIN
  • STEP 4.1 – initialize if first time; otherwise, update the inter-processor shortest-path latency/routing table pair affected by the last task-processor allocation.
  • STEP 4.2 – assign a nearby capable processor to minimize thread computation time for the highest-priority task node at the top of the remaining prioritized working list.
  • STEP 4.3 – remove the newly scheduled task node and replace it with its parent nodes, inserted/appended onto the working list (demand-driven) by priority, based on tie-breaker rules which, along with node level depth, estimate the time cost of the entire computation thread involved.
  END {WHILE}
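The demand-driven loop above can be sketched end to end on a toy instance. Everything here is assumed for illustration: a three-task chain n3→n2→n1, a 2-processor network with made-up speeds and hop latencies, priority reduced to level depth alone (no critical-path or degree tie-breakers), and a greedy cost of compute time plus one hop to the already-placed consumer in place of the paper's full latency/routing-table update.

```python
import heapq

parents = {"n1": ["n2"], "n2": ["n3"], "n3": []}  # data-flow predecessors
level = {"n1": 0, "n2": 1, "n3": 2}               # depth from the output n1
speed = {"P0": 1.0, "P1": 2.0}   # per-processor cost factor (lower = faster)
hops = {("P0", "P0"): 0, ("P0", "P1"): 1,
        ("P1", "P0"): 1, ("P1", "P1"): 0}

assignment = {}
working = [(0, "n1")]            # STEP 3: start from the output task
while working:                   # STEP 4
    _, task = heapq.heappop(working)

    def cost(p):
        # STEP 4.2 stand-in: compute time plus hop latency to the
        # consumer's processor, if that consumer is already placed.
        consumer = next((t for t, ps in parents.items() if task in ps), None)
        comm = hops[(p, assignment[consumer])] if consumer in assignment else 0
        return speed[p] + comm

    assignment[task] = min(speed, key=cost)
    # STEP 4.3: replace the scheduled task with its parents, by level depth.
    for par in parents[task]:
        heapq.heappush(working, (-level[par], par))

print(assignment)
```

With these numbers every task lands on P0: the one-hop penalty outweighs any parallel gain, echoing slide 11's observation that the mapping drifts toward sequential processing as the inter/intra latency ratio grows.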

  9. QUANTIFY SW/HW MAPPING QUALITY
  • Example 1 – Latency-Adaptive Tree-Task to Tree-Machine Mapping
  • Example 2 – Scaling to Larger Tree-to-Tree Mapping
  • Example 3 – Select the Best Processor Topology Match for an Irregular Task Graph

  10. Example 1 – Latency-Adaptive Tree-Task to Tree-Machine Mapping K-th largest selection: will the tree algorithm [3] match the tree machine [4]?

  11. Example 1 – Latency-Adaptive Tree-Task to Tree-Machine Mapping Adaptive mapping moves toward sequential processing as the inter/intra communication latency ratio increases.

  12. Example 1 – Latency-Adaptive Tree-Task to Tree-Machine Mapping Adaptive Mapper allocates fewer processors and channels with fewer hops.

  13. Example 1 – Latency-Adaptive Tree-Task to Tree-Machine Mapping Adaptive Mapper achieves higher speedups consistently. (Bonus! A 25.7+ pipeline processing speedup can be extrapolated when the inter/intra communication latency ratio is < 1.)

  14. Example 1 – Latency-Adaptive Tree-Task to Tree-Machine Mapping Adaptive Mapper yields better efficiencies consistently. (Bonus! A 428.3+% pipeline processing efficiency can be extrapolated when the inter/intra communication latency ratio is < 1.)

  15. Example 2 – Scaling to Larger Tree-to-Tree Mapping Adaptive Mapper achieves sub-optimal speedups as tree sizes scale larger, still trailing fixed tree-to-tree mapping closely.

  16. Example 2 – Scaling to Larger Tree-to-Tree Mapping Adaptive Mapper is always more cost-efficient, using fewer resources, with sub-optimal speedups comparable to fixed tree-to-tree mapping as tree sizes scale.

  17. Example 3 – Select the Best Processor Topology Match for an Irregular Task Graph
  No matching-topology clues for the irregularly shaped Robot Elbow Manipulator [5]:
  • 105 task nodes
  • 161 data-flow edges
  • 29 node levels

  18. Example 3 – Select the Best Processor Topology Match for an Irregular Task Graph
  • Candidate topologies
  • Compare schedules for each topology
  • Farther processors may not be selected: linear array, tree

  19. Example 3 – Select the Best Processor Topology Match for an Irregular Task Graph
  Best network topology performers (# channels):
  • Complete (28)
  • Mesh (12)
  • Chordal ring (16)
  • Systolic array (16)
  • Cube (12)

  20. Example 3 – Select the Best Processor Topology Match for an Irregular Task Graph
  Fewer processors selected for higher-diameter networks:
  • Tree
  • Linear array

  21. Example 3 – Select the Best Processor Topology Match for an Irregular Task Graph
  Deducing network switch hops:
  • Low multi-hop data exchanges: < 10%
  • Moderate 0-hop: 30% to 50%
  • High near-neighbor direct 1-hop: 50% to 70%

  22. Future Speed/Memory/Power Optimization
  • Latency-adaptive: topology, traffic, bandwidth, workload, system hierarchy
  • Thread partition: coarse, mid, fine
  • Latency/routing tables: neighborhood, network hierarchy
  • Worm-hole, dynamic mobile network routing
  • Bandwidth
  • Heterogeneous system
  • Algorithm-specific network topology

  23. References

  24. Q & A? Liwen Shih, Ph.D. Professor in Computer Engineering University of Houston – Clear Lake shih@uhcl.edu

  25. xScale13 paper

  26. Thank You!
