Adaptive Latency-Aware Parallel Resource Mapping: Task Graph Scheduling Heterogeneous Network Topology
Liwen Shih, Ph.D.
Computer Engineering, U of Houston – Clear Lake
shih@uhcl.edu
ADAPTIVE PARALLEL TASK TO NETWORK TOPOLOGY MAPPING
Latency-adaptive:
• Topology
• Traffic
• Bandwidth
• Workload
• System hierarchy
Thread partition:
• Coarse
• Medium
• Fine
Fine-Grained Mapping System [Shih 1988]
• Parallel mapping: compiler- vs. run-time
• Task migration: vertical vs. horizontal
• Domain decomposition: data vs. function
• Execution order: eager data-driven vs. lazy demand-driven
PRIORITIZE TASK DFG NODES
Task priority factors:
• Level depth
• Critical paths
• In/out degree
Data-flow partial order: {(n7, n5), (n7, n4), (n6, n4), (n6, n3), (n5, n1), (n4, n2), (n3, n2), (n2, n1)}
Total task priority order: {n1 > n2 > n4 > n3 > n5 > n6 > n7}
• P2 thread: {n1 > n2 > n4 > n3 > n6}
• P3 thread: {n5 > n7}
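The ordering on this slide can be reproduced with a short sketch under three assumptions that are inferred from the listed priority factors rather than stated by the author: each pair is read as (producer, consumer), level depth is the longest path from an input node, and ties within a level are broken by the number of critical (longest) paths through a node. In/out degree is not used here, and the function and variable names are illustrative.

```python
from collections import defaultdict

# Data-flow pairs from the slide, read here as (producer, consumer) -- an assumption.
EDGES = [("n7", "n5"), ("n7", "n4"), ("n6", "n4"), ("n6", "n3"),
         ("n5", "n1"), ("n4", "n2"), ("n3", "n2"), ("n2", "n1")]


def prioritize(edges):
    """Rank task nodes by (level depth, critical-path count), highest first."""
    succ, pred, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
        nodes.update((u, v))

    # Topological order via Kahn's algorithm.
    indeg = {n: len(pred[n]) for n in nodes}
    ready = sorted(n for n in nodes if indeg[n] == 0)
    order = []
    while ready:
        u = ready.pop()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)

    # Pass 1 (inputs to outputs): level of each node and the number of
    # longest paths that reach it.
    level, down = {}, {}
    for n in order:
        level[n] = 1 + max((level[p] for p in pred[n]), default=0)
        down[n] = (sum(down[p] for p in pred[n] if level[p] == level[n] - 1)
                   if pred[n] else 1)

    # Pass 2 (outputs to inputs): height of each node and the number of
    # longest paths that leave it.
    height, up = {}, {}
    for n in reversed(order):
        height[n] = 1 + max((height[s] for s in succ[n]), default=0)
        up[n] = (sum(up[s] for s in succ[n] if height[s] == height[n] - 1)
                 if succ[n] else 1)

    # A node lies on a critical path when level + height - 1 equals the
    # length of the longest path in the graph.
    longest = max(level.values())
    critical = {n: down[n] * up[n] if level[n] + height[n] - 1 == longest else 0
                for n in nodes}
    return sorted(nodes, key=lambda n: (-level[n], -critical[n], n))


print(prioritize(EDGES))
```

Under these assumptions the sketch returns n1, n2, n4, n3, n5, n6, n7, which matches the total task priority order listed on the slide.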
SHORTEST-PATH NETWORK ROUTING
Shortest-path latencies and routes are updated after each task-processor allocation.
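A latency table and a routing table of this kind can be built with the Floyd-Warshall all-pairs algorithm. The sketch below is one possible realization, not the author's implementation: the four-processor network, the link latencies, and the idea that an allocation raises the latency of a link are all hypothetical, and the tables are simply recomputed in full instead of being updated incrementally.

```python
import math


def shortest_paths(n, links):
    """Build all-pairs latency and next-hop tables.

    links maps an undirected channel (a, b) between processors 0..n-1
    to its latency.
    """
    dist = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    nxt = [[j if i == j else None for j in range(n)] for i in range(n)]
    for (a, b), w in links.items():
        if w < dist[a][b]:
            dist[a][b] = dist[b][a] = w
            nxt[a][b], nxt[b][a] = b, a
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]
    return dist, nxt


def route(nxt, src, dst):
    """Follow the next-hop table from src to dst; empty list if unreachable."""
    if nxt[src][dst] is None:
        return []
    path = [src]
    while src != dst:
        src = nxt[src][dst]
        path.append(src)
    return path


# Hypothetical network: a 4-processor linear array plus one slow long link.
links = {(0, 1): 1, (1, 2): 1, (2, 3): 1, (0, 3): 5}
dist, nxt = shortest_paths(4, links)
print(dist[0][3], route(nxt, 0, 3))   # three 1-unit hops beat the slow link

# Pretend an allocation loaded channel (1, 2); recompute both tables.
links[(1, 2)] = 4
dist, nxt = shortest_paths(4, links)
print(dist[0][3], route(nxt, 0, 3))   # the direct link is now the shortest route
```

Full recomputation costs O(n^3) per allocation; updating only the affected table entries, as the slide describes, would be the natural refinement.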
Adaptive A* Parallel Processor Scheduler
• Given a directed acyclic task DFG G(V, E), with task vertex set V connected by data-flow edge set E, and a processor network topology N(P, C), with processor node set P connected by channel link set C
• Find a processor assignment and schedule S: V(G) → P(N) such that S minimizes the total parallel computation time of G.
• The A* heuristic mapping reduces scheduling complexity from that of an NP-hard exact search to polynomial time.
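The objective on this slide can be written compactly. The formulation below is a simplified sketch: the symbols t(v) (execution time of task v), ℓ(p, q) (shortest-path latency between processors p and q), and f_S(v) (finish time of v under schedule S) are introduced here for illustration and do not appear in the slides, and contention between tasks that share a processor is left out.

```latex
\begin{align*}
  S &: V(G) \to P(N) \\
  f_S(v) &= t(v) + \max_{(u,v) \in E} \bigl[\, f_S(u) + \ell\bigl(S(u), S(v)\bigr) \bigr]
           \qquad \text{(the max is } 0 \text{ for input nodes)} \\
  S^{*} &= \arg\min_{S} \; \max_{v \in V} f_S(v)
\end{align*}
```

With ℓ(p, p) = 0, placing a producer and its consumer on the same processor removes the communication term, which is the trade-off the mapper weighs against parallelism.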
Demand-Driven Task-Topology Mapping
• STEP 1 – Assign a level to each task node (vertex) in G.
• STEP 2 – Count the critical paths passing through each DFG edge and node with a two-pass graph traversal, bottom-up and then top-down.
• STEP 3 – Load all deepest-level task nodes that produce outputs onto the working task node list, in priority order.
• STEP 4 – WHILE the working task node list is not empty, schedule the best processor for the top-priority task, then replace that task with its parent task nodes, inserted into the working list by priority.
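Steps 3 and 4 amount to a priority-queue traversal that starts at the output nodes and pulls in parents on demand. The sketch below shows only the visiting order; processor selection is left as a comment. It assumes the example DFG and priority order from the earlier slide, treats any node without a consumer as an output node, and uses illustrative names throughout.

```python
import heapq

# Example DFG from the earlier slide, read as (producer, consumer) pairs.
EDGES = [("n7", "n5"), ("n7", "n4"), ("n6", "n4"), ("n6", "n3"),
         ("n5", "n1"), ("n4", "n2"), ("n3", "n2"), ("n2", "n1")]

# Priority rank taken from the slide's total order (0 = highest priority).
RANK = {n: i for i, n in enumerate(["n1", "n2", "n4", "n3", "n5", "n6", "n7"])}


def demand_driven_order(edges, rank):
    """Return the order in which tasks leave the working list."""
    parents, producers, nodes = {}, set(), set()
    for u, v in edges:
        parents.setdefault(v, []).append(u)
        producers.add(u)
        nodes.update((u, v))

    # STEP 3: start from the output nodes (nodes that feed no other task).
    outputs = nodes - producers
    work = [(rank[n], n) for n in outputs]
    heapq.heapify(work)
    seen = set(outputs)

    # STEP 4: pop the top-priority task, then demand its parents.
    order = []
    while work:
        _, task = heapq.heappop(work)
        order.append(task)            # a processor would be assigned here
        for p in parents.get(task, []):
            if p not in seen:         # insert each parent only once
                seen.add(p)
                heapq.heappush(work, (rank[p], p))
    return order


print(demand_driven_order(EDGES, RANK))
```

For this example the traversal visits the tasks in exactly the slide's total priority order, because every parent is demanded before any lower-priority task reaches the top of the list.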
Demand-Driven Processor Scheduling
STEP 4 – WHILE the working task node list is not empty:
BEGIN
• STEP 4.1 – On the first pass, initialize the inter-processor shortest-path latency/routing table pair; otherwise, update the entries affected by the last task-processor allocation.
• STEP 4.2 – Assign a nearby capable processor to the highest-priority task node at the top of the remaining working list, so as to minimize thread computation time.
• STEP 4.3 – Remove the newly scheduled task node and replace it with its parent nodes, which are inserted into the working list (demand-driven) by priority. Tie-breaker rules, together with node level depth, estimate the time cost of the entire computation thread involved.
END {WHILE}
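Step 4.2 can be illustrated with a deliberately simple cost model. Everything in this sketch is an assumption made for illustration, not the author's cost function: each task takes one time unit, a processor's cost is its current load plus the task time plus the worst latency to any already-placed consumer, ties go to the lowest processor index, and the latency table is passed in fixed, so the update in step 4.1 is omitted.

```python
# Example DFG and priority order from the earlier slide.
EDGES = [("n7", "n5"), ("n7", "n4"), ("n6", "n4"), ("n6", "n3"),
         ("n5", "n1"), ("n4", "n2"), ("n3", "n2"), ("n2", "n1")]
ORDER = ["n1", "n2", "n4", "n3", "n5", "n6", "n7"]


def assign_processors(order, edges, latency):
    """Greedy placement sketch.

    order   -- tasks in descending priority (consumers before producers)
    latency -- latency[p][q] is the shortest-path latency between p and q
    """
    consumers = {}
    for u, v in edges:
        consumers.setdefault(u, []).append(v)

    load = [0.0] * len(latency)   # hypothetical per-processor busy time
    placed = {}
    for task in order:
        best_p, best_cost = None, float("inf")
        for p in range(len(latency)):
            # Worst latency from p to any consumer that is already placed.
            comm = max((latency[p][placed[c]]
                        for c in consumers.get(task, []) if c in placed),
                       default=0.0)
            cost = load[p] + 1.0 + comm
            if cost < best_cost:   # strict '<' keeps the lowest index on ties
                best_p, best_cost = p, cost
        placed[task] = best_p
        load[best_p] += 1.0
    return placed


# Two processors; only the inter-processor latency differs between runs.
high = assign_processors(ORDER, EDGES, [[0, 10], [10, 0]])
low = assign_processors(ORDER, EDGES, [[0, 0.5], [0.5, 0]])
print(sorted(set(high.values())), sorted(set(low.values())))
```

With the high inter-processor latency every task lands on one processor, and with the low latency both processors are used. Even this toy model shows the mapper drifting toward sequential execution as communication becomes expensive.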
QUANTIFY SW/HW MAPPING QUALITY
• Example 1 – Latency-Adaptive Tree-Task to Tree-Machine Mapping
• Example 2 – Scaling to Larger Tree-to-Tree Mapping
• Example 3 – Select the Best Processor Topology Match for an Irregular Task Graph
Example 1 – Latency-Adaptive Tree-Task to Tree-Machine Mapping
K-th Largest Selection
Will the tree algorithm [3] match the tree machine [4]?
Example 1 – Latency-Adaptive Tree-Task to Tree-Machine Mapping
Adaptive mapping moves toward sequential processing as the inter/intra communication latency ratio increases.
Example 1 – Latency-Adaptive Tree-Task to Tree-Machine Mapping
Adaptive Mapper allocates fewer processors and channels, with fewer hops.
Example 1 – Latency-Adaptive Tree-Task to Tree-Machine Mapping
Adaptive Mapper consistently achieves higher speedups. (Bonus! A 25.7+ pipeline processing speedup can be extrapolated when the inter/intra communication latency ratio is < 1.)
Example 1 – Latency-Adaptive Tree-Task to Tree-Machine Mapping
Adaptive Mapper consistently achieves better efficiency. (Bonus! A 428.3+% pipeline processing efficiency can be extrapolated when the inter/intra communication latency ratio is < 1.)
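The speedup and efficiency figures can be related with the usual definitions: speedup is sequential time divided by parallel time, and efficiency is speedup divided by the number of processors. The slides do not state the processor count; the value of 6 used below is an inference from the two extrapolated figures (25.7 / 6 ≈ 4.283) and should be treated as an assumption.

```python
def speedup(t_sequential, t_parallel):
    """Speedup = sequential time / parallel time."""
    return t_sequential / t_parallel


def efficiency(speedup_value, n_processors):
    """Efficiency = speedup per processor (1.0 means 100%)."""
    return speedup_value / n_processors


# Extrapolated pipeline speedup from the slides; 6 processors is an assumption.
eff = efficiency(25.7, 6)
print(f"{eff * 100:.1f}%")
```

An efficiency above 100% is possible here only because the extrapolated figure counts pipelined throughput, as both slides note.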
Example 2 – Scaling to Larger Tree-to-Tree Mapping
Adaptive Mapper achieves sub-optimal speedups as tree sizes scale up, still closely trailing fixed tree-to-tree mapping.
Example 2 – Scaling to Larger Tree-to-Tree Mapping
Adaptive Mapper is always more cost-efficient, using fewer resources while delivering sub-optimal speedups comparable to fixed tree-to-tree mapping as tree sizes scale up.
Example 3 – Select the Best Processor Topology Match for an Irregular Task Graph
The irregularly shaped Robot Elbow Manipulator task graph [5] gives no clues to a matching topology:
• 105 task nodes
• 161 data-flow edges
• 29 node levels
Example 3 – Select the Best Processor Topology Match for an Irregular Task Graph
• Candidate topologies
• Compare schedules for each topology
• Farther processors may not be selected: Linear Array, Tree
Example 3 – Select the Best Processor Topology Match for an Irregular Task Graph
Best network topology performers (# channels):
• Complete (28)
• Mesh (12)
• Chordal ring (16)
• Systolic array (16)
• Cube (12)
Example 3 – Select the Best Processor Topology Match for an Irregular Task Graph
Fewer processors are selected for higher-diameter networks:
• Tree
• Linear Array
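Channel counts and diameters for some of these topologies can be checked with a few lines of code. The sketch assumes 8 processors, which is consistent with the slide's counts for Complete (28) and Cube (12) but is not stated in the slides; the mesh, chordal ring, and systolic array are left out because their exact wiring is not given. The tree is a heap-indexed binary tree, chosen here only for illustration.

```python
from collections import deque
from itertools import combinations


def complete(n):
    """Every processor pair is directly linked."""
    return set(combinations(range(n), 2))


def hypercube(d):
    """d-dimensional cube: link nodes whose indices differ in one bit."""
    return {(i, i ^ (1 << b))
            for i in range(1 << d) for b in range(d)
            if i < (i ^ (1 << b))}


def linear_array(n):
    return {(i, i + 1) for i in range(n - 1)}


def binary_tree(n):
    """Heap-indexed binary tree: node i hangs under node (i - 1) // 2."""
    return {((i - 1) // 2, i) for i in range(1, n)}


def diameter(n, links):
    """Largest hop count between any two processors (graph must be connected)."""
    adj = {i: [] for i in range(n)}
    for a, b in links:
        adj[a].append(b)
        adj[b].append(a)
    worst = 0
    for src in range(n):
        hops = {src: 0}
        queue = deque([src])
        while queue:                      # breadth-first search from src
            u = queue.popleft()
            for v in adj[u]:
                if v not in hops:
                    hops[v] = hops[u] + 1
                    queue.append(v)
        worst = max(worst, max(hops.values()))
    return worst


N = 8  # assumed processor count
for name, links in [("Complete", complete(N)), ("Cube", hypercube(3)),
                    ("Tree", binary_tree(N)), ("Linear Array", linear_array(N))]:
    print(f"{name}: {len(links)} channels, diameter {diameter(N, links)}")
```

Under these assumptions the complete network and the cube have diameters 1 and 3, while the tree and the linear array reach 5 and 7, which fits the observation that distant processors in high-diameter networks tend to go unused.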
Example 3 – Select the Best Processor Topology Match for an Irregular Task Graph
Deducing network switch hops:
• Low: multi-hop data exchanges, under 10%
• Moderate: 0-hop exchanges, 30% to 50%
• High: near-neighbor direct 1-hop exchanges, 50% to 70%
Future Speed/Memory/Power Optimization
• Latency-adaptive: topology, traffic, bandwidth, workload, system hierarchy
• Thread partition: coarse, mid, fine
• Latency/routing tables
• Neighborhood
• Network hierarchy
• Worm-hole
• Dynamic mobile network routing
• Bandwidth
• Heterogeneous system
• Algorithm-specific network topology
Q & A?
Liwen Shih, Ph.D.
Professor in Computer Engineering, University of Houston – Clear Lake
shih@uhcl.edu