Communication-Aware Stochastic Allocation and Scheduling Framework for Conditional Task Graphs in Multi-Processor Systems-on-Chip
Martino Ruggiero, Michele Lombardi, Michela Milano and Luca Benini
University of Bologna, DEIS - Italy
Outline
• Motivations
• Our approach
• Problem Model
• Methodology
• Experimental Results
• Conclusions
[Figure: a task graph (T1 to T8) is allocated onto processors Proc. 1 ... Proc. N, each with a private memory and connected by an interconnect, and then scheduled over time within a deadline. Panels: Task Graph, Allocation, Schedule.]
• Many realistic applications can only be specified as conditional task graphs.
• Allocating and scheduling conditional task graphs on the processors of a distributed real-time system is NP-hard.
• New tool flows for the efficient mapping of multi-task applications onto hardware platforms.
Abstraction gap
[Figure: design flow graph with an Optimization path and a Development path. Stages shown: Platform Modelling, Starting Implementation, Optimization, Analysis, Final Implementation, Optimal Solution, Platform Execution.]
• The abstraction gap between high-level optimization tools and standard application programming models can introduce unpredictable and undesired behaviours.
• Programmers must be aware of the simplifying assumptions made in optimization tools.
• New methodology for multi-task application development on MPSoCs.
Our approach
Our Focus:
• Statically scheduled Conditional Task Graph applications.
Our Objectives:
• Complete approach to allocation and scheduling:
  • High computational efficiency w.r.t. commercial solvers;
  • High accuracy of generated solutions.
• New methodology for multi-task application development:
  • To quickly develop multi-task applications;
  • To easily apply the optimal solution found by our optimizer.
Target architecture - 1
[Figure: activities Act. A, Act. B, ..., Act. N repeating with period T.]
• An architectural template for a message-oriented distributed memory MPSoC:
  • Support for message exchange between the computation tiles;
  • Single-token communication;
  • Availability of local memory devices at the computation tiles and of remote memories for program data.
• Several MPSoC platforms available on the market match this template:
  • The Silicon Hive Avispa-CH1 processor;
  • The Cradle CT3600 family of multiprocessors;
  • The Cell Processor;
  • The ARM MPCore platform.
• The throughput requirement is reflected in the maximum tolerable scheduling period T of each processor.
Target architecture - 2
• Homogeneous computation tiles:
  • ARM cores (including instruction and data caches);
  • Tightly coupled software-controlled scratch-pad memories (SPM);
  • AMBA AHB;
  • DMA engine;
  • RTEMS OS.
• Cores use non-cacheable shared memory to communicate.
• Semaphore and interrupt facilities are used for synchronization.
• Private on-chip memory to store data.
Target Application: Conditional Task Graph (CTG)
[Figure: example CTG fragments showing FORK, AND, BRANCH and OR nodes.]
• Target applications seldom behave the same way across executions: they contain cycles, conditional jumps or other elements of variability.
• A CTG is a triple <T,A,C>, where:
  • T is the set of nodes modelling generic tasks (e.g. elementary operations, subprograms, ...);
  • A is the set of arcs modelling precedence constraints (e.g. due to data communication);
  • C is a set of conditions, each one associated with an arc, modelling what must be true for that branch to be chosen during execution (e.g. the condition of an if-then-else construct).
• Extension of the generic task graph model with stochastic elements:
  • Conditional Branches;
  • Conditional Nodes;
  • Branch Nodes.
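The triple <T,A,C> can be sketched as a small C data structure. This is only an illustrative encoding, not the authors' implementation: the names `ctg`, `ctg_arc`, `ctg_arc_active`, `NO_CONDITION` and `MAX_ARCS` are hypothetical, tasks are identified by their index in T, and each condition is assumed to have an integer outcome.

```c
#include <assert.h>

/* Hypothetical encoding of a CTG <T,A,C>; all names are illustrative. */
#define MAX_ARCS 32
#define NO_CONDITION (-1)

typedef struct {
    int src;        /* index of the source task in T */
    int dst;        /* index of the destination task in T */
    int condition;  /* index of the condition in C, or NO_CONDITION */
    int outcome;    /* outcome of that condition which selects this arc */
} ctg_arc;

typedef struct {
    int n_tasks;             /* |T| */
    int n_arcs;              /* |A| */
    ctg_arc arcs[MAX_ARCS];  /* A, with the conditions of C attached */
} ctg;

/* Returns 1 if arc a is taken under the given condition outcomes,
 * assuming its source task executes; unconditional arcs are always taken. */
int ctg_arc_active(const ctg *g, int a, const int *outcomes)
{
    assert(a >= 0 && a < g->n_arcs);
    const ctg_arc *arc = &g->arcs[a];
    if (arc->condition == NO_CONDITION)
        return 1;
    return outcomes[arc->condition] == arc->outcome;
}
```

For a branch node with two outgoing arcs guarded by the same condition, exactly one of the two arcs is active for any given outcome, which is what makes the downstream nodes conditional.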
Task memory requirements
[Figure: two tiles (#1 and #2), each with an ARM core, SPM, private memory, semaphores and interrupt controller, connected by the system bus.]
• Each task has three kinds of memory requirements:
  • Program Data;
  • Internal State;
  • Communication queues.
• Program Data & Internal State can be allocated by the Optimizer:
  • On the local SPM;
  • On the remote Private Memory.
• The communicating task might run:
  • On the same processor → negligible communication cost;
  • On a remote processor → costly message exchange procedure.
• Optimizer constraint:
  • Communication queues only in SPM → more efficient message passing.
Logic Based Benders Decomposition
[Figure: the master problem, ALLOCATION: INTEGER PROGRAMMING (objective function: communication cost; memory constraints), passes a valid allocation to the sub-problem, SCHEDULING: CONSTRAINT PROGRAMMING (real-time constraint), which returns a no-good in the form of a linear constraint.]
• Decomposes a problem into 2 sub-problems:
  • Allocation → IP;
  • Scheduling → CP.
• The process continues until the master problem and the sub-problem converge, providing the same value.
• The methodology has been proven to converge to the optimal solution [J.N. Hooker and G. Ottosson].
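The iteration between master and sub-problem can be sketched on a toy instance. The sketch below makes strong simplifying assumptions that are not in the slides: the master is solved by brute-force enumeration instead of IP, the scheduling sub-problem is replaced by a per-processor load check against a deadline instead of CP, and each no-good forbids only the single rejected allocation (the weakest possible cut). All names and instance data (`benders_solve`, `duration`, `comm`) are hypothetical.

```c
#include <assert.h>
#include <limits.h>

#define N_TASKS 4
#define MAX_CUTS 16  /* 2^N_TASKS: one cut per possible allocation */

/* Hypothetical instance data, invented for illustration. */
static const int duration[N_TASKS] = {3, 2, 4, 1};
/* comm[i][k]: amount of data sent from task i to task k (0 = no arc). */
static const int comm[N_TASKS][N_TASKS] = {
    {0, 5, 0, 0}, {0, 0, 2, 0}, {0, 0, 0, 3}, {0, 0, 0, 0}
};

/* Bit i of alloc is the processor (0 or 1) that runs task i. */
static int proc_of(unsigned alloc, int i) { return (int)((alloc >> i) & 1u); }

/* Master objective: data exchanged between tasks on different processors. */
static int bus_cost(unsigned alloc)
{
    int cost = 0;
    for (int i = 0; i < N_TASKS; i++)
        for (int k = 0; k < N_TASKS; k++)
            if (proc_of(alloc, i) != proc_of(alloc, k))
                cost += comm[i][k];
    return cost;
}

/* Stand-in for the scheduling sub-problem: the allocation is accepted
 * if the total load of each processor fits within the deadline. */
static int schedulable(unsigned alloc, int deadline)
{
    int load[2] = {0, 0};
    for (int i = 0; i < N_TASKS; i++)
        load[proc_of(alloc, i)] += duration[i];
    return load[0] <= deadline && load[1] <= deadline;
}

/* Returns the optimal bus cost, or -1 if no allocation is schedulable. */
int benders_solve(int deadline, unsigned *best_alloc, int *iterations)
{
    unsigned cuts[MAX_CUTS];
    int n_cuts = 0;
    *iterations = 0;
    for (;;) {
        /* Master: cheapest allocation not excluded by a no-good. */
        int best = INT_MAX, found = 0;
        unsigned cand = 0;
        for (unsigned a = 0; a < (1u << N_TASKS); a++) {
            int excluded = 0;
            for (int c = 0; c < n_cuts; c++)
                if (cuts[c] == a) excluded = 1;
            if (!excluded && bus_cost(a) < best) {
                best = bus_cost(a);
                cand = a;
                found = 1;
            }
        }
        if (!found) return -1;
        (*iterations)++;
        /* Sub-problem: accept the allocation or return a no-good. */
        if (schedulable(cand, deadline)) {
            *best_alloc = cand;
            return best;
        }
        cuts[n_cuts++] = cand;
    }
}
```

Because the master always proposes the cheapest allocation that has not been ruled out, the first allocation the sub-problem accepts is optimal. With a deadline of 5 on this instance, the two zero-cost allocations (all tasks on one processor) are rejected before an allocation of cost 2 is accepted. A real implementation would generate stronger cuts that rule out whole families of allocations at once.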
Allocation problem model
Decision variables:
• Tij = 1 if task i executes on processor j;
• Mij = 1 if task i allocates its program data on the SPM of PE j;
• Sij = 1 if task i allocates its internal state on the SPM of PE j;
• Crj = 1 if arc r is allocated on the SPM of PE j.
Constraints:
• Each process can execute on only one processor.
• Program data and internal state can be allocated locally on a PE only if the task runs on it.
• The communication queue of arc r can be allocated locally only if both the source and the destination tasks run on PE j.
• The sum of locally allocated structures cannot exceed the SPM capacity.
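The constraints stated in words above can be written out as follows. This is a reconstruction from the textual description, not the formulas of the original slide: the size symbols m_i (program data of task i), s_i (internal state of task i), c_r (queue of arc r) and the capacity SPM_j are assumed names, and arc r is taken to connect source task i to destination task k.

```latex
\begin{align}
\sum_{j} T_{ij} &= 1 && \forall i \\
M_{ij} \le T_{ij}, \quad S_{ij} &\le T_{ij} && \forall i, j \\
C_{rj} \le T_{ij}, \quad C_{rj} &\le T_{kj} && \forall r = (i,k),\ \forall j \\
\sum_{i} \left( m_i M_{ij} + s_i S_{ij} \right) + \sum_{r} c_r C_{rj} &\le \mathit{SPM}_j && \forall j
\end{align}
```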
Allocation problem model
[Figure: two CPUs and a memory connected by a bus.]
• The objective function: the minimization of the amount of data transferred on the bus.
Bus Traffic modelling
Annotated terms of the bus traffic function:
• Activation function equal to 1 if task i executes;
• Term equal to 1 if the program data of task i is remotely allocated;
• Term equal to 1 if the internal state of task i is remotely allocated;
• Activation function equal to 1 if tasks i and k both execute;
• Term equal to 1 if the communication queue is remotely allocated.
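One plausible form of the bus traffic function, assembled from the annotated terms, is given below. It is a hedged reconstruction rather than the slide's original formula: f_i denotes the activation function of task i, m_i, s_i and c_r are assumed size symbols for program data, internal state and the queue of arc r = (i,k), and a structure counts as remotely allocated when it is on no SPM.

```latex
\mathit{BUS} = \sum_{i} f_i \left[ m_i \Bigl(1 - \sum_{j} M_{ij}\Bigr)
             + s_i \Bigl(1 - \sum_{j} S_{ij}\Bigr) \right]
             + \sum_{r=(i,k)} f_i \, f_k \, c_r \Bigl(1 - \sum_{j} C_{rj}\Bigr)
```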
Bus Traffic modelling
• Given an allocation, these two terms are constants.
• The minimization of a stochastic function is a very complex operation (even more than exponential).
Bus Traffic modelling
• Annotated terms of the expected value: existence and coexistence probabilities of tasks; constant terms.
• Every stochastic dependence is removed and the expected value is reduced to a deterministic expression.
• We developed two polynomial-cost algorithms to compute these probabilities.
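Once the existence probability of each task and the coexistence probability of each communicating pair are known, the expected bus traffic of a given allocation is a plain weighted sum. The sketch below assumes that form; the struct and function names are hypothetical, the probabilities are taken as precomputed inputs (the two polynomial algorithms are not reproduced here), and sizes are in arbitrary units.

```c
#include <assert.h>

/* Hypothetical per-task data for one fixed allocation. */
typedef struct {
    double p_exec;     /* existence probability of the task */
    double prog_data;  /* size of the program data */
    double state;      /* size of the internal state */
    int prog_remote;   /* 1 if program data is remotely allocated */
    int state_remote;  /* 1 if internal state is remotely allocated */
} task_alloc;

/* Hypothetical per-arc data for one fixed allocation. */
typedef struct {
    double p_both;     /* coexistence probability of source and destination */
    double queue;      /* size of the communication queue */
    int queue_remote;  /* 1 if the queue is remotely allocated */
} arc_alloc;

/* Expected amount of data transferred on the bus: each remotely
 * allocated structure contributes its size times the probability
 * that the task (or task pair) using it actually executes. */
double expected_bus_traffic(const task_alloc *t, int n_tasks,
                            const arc_alloc *a, int n_arcs)
{
    double total = 0.0;
    for (int i = 0; i < n_tasks; i++)
        total += t[i].p_exec * (t[i].prog_data * t[i].prog_remote
                                + t[i].state * t[i].state_remote);
    for (int r = 0; r < n_arcs; r++)
        total += a[r].p_both * a[r].queue * a[r].queue_remote;
    return total;
}
```

Because the probabilities do not depend on the allocation, this expression is linear in the allocation decisions, which is what allows it to serve as a deterministic objective.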
Scheduling problem model
[Figure: task phases in sequence: INPUT, RS, EXEC, WS, OUTPUT.]
• Five-phase behaviour:
  • INPUT = input data reading;
  • RS = internal state reading;
  • EXEC = computation activity;
  • WS = internal state writing;
  • OUTPUT = output data writing.
• Activities are not breakable.
• The adopted schema and precedence relations vary with the type of the corresponding node (or/and, branch/fork).
• Since the objective function depends only on the allocation, scheduling is just a feasibility problem.
• We decided to provide a unique worst-case schedule, forcing each task to execute after all its predecessors in any scenario.
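The five-phase task model and the worst-case precedence rule can be illustrated with a short sketch. It assumes integer phase durations and ignores resource contention, so it covers only the precedence part of the scheduling problem; the names `task_phases`, `task_duration` and `worst_case_start` are hypothetical.

```c
#include <assert.h>

/* Hypothetical durations of the five non-breakable phases of a task. */
typedef struct {
    int input;   /* INPUT: input data reading */
    int rs;      /* RS: internal state reading */
    int exec;    /* EXEC: computation activity */
    int ws;      /* WS: internal state writing */
    int output;  /* OUTPUT: output data writing */
} task_phases;

/* Total duration of a task: the five phases run back to back. */
int task_duration(const task_phases *p)
{
    return p->input + p->rs + p->exec + p->ws + p->output;
}

/* Worst-case start time: the task waits for the latest end time among
 * ALL its predecessors, whether or not they execute in a given scenario.
 * Returns 0 for a task without predecessors. */
int worst_case_start(const int *pred_end, int n_pred)
{
    int start = 0;
    for (int i = 0; i < n_pred; i++)
        if (pred_end[i] > start)
            start = pred_end[i];
    return start;
}
```

Waiting for every predecessor, including those on branches that are not taken in a particular run, is what yields a single schedule valid in every scenario.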
Efficient Application Development Support
• Optimization tools generally rely on many simplifying assumptions.
• Neglecting these assumptions in the software implementation can:
  • generate unpredictable and undesired system-level interactions;
  • make the overall system error-prone.
• We propose an entire framework to help programmers in software implementation:
  • a generic customizable application template (OFFLINE SUPPORT);
  • a set of high-level APIs (ONLINE SUPPORT).
• The main goals of our development framework are:
  • the exact and reliable execution of the application after the optimization step;
  • guarantees about high performance and constraint satisfaction.
Customizable Application Template
• Starting from a high-level task and data flow graph, software developers can easily and quickly build their application infrastructure.
• Programmers can intuitively translate the high-level representation into C code using our facilities and library.
• Users can specify:
  • the number of tasks included in the target application;
  • their nature (e.g. branch, fork, or-node, and-node);
  • their precedence constraints (e.g. due to data communication);
  thus quickly drawing the application's CTG schema.
• Programmers can focus on the functionalities of the tasks:
  • the main effort goes to the more specific and critical sections of the application.
OS-level and Task-level APIs
• Users can easily reproduce optimizer solutions, thus:
  • indirectly neglecting the optimizer's abstractions:
    • task model;
    • communication model;
    • OS overheads;
  • obtaining the needed application constraint satisfaction.
• Programmers can allocate to the right hardware resources:
  • tasks;
  • program data;
  • queues.
• Scheduling support APIs.
• Communication issues:
  • shared queues;
  • semaphores;
  • interrupts.
Example
[Figure: example CTG with 12 nodes (N1, B2, B3, C4 to C7, N8 to N11, T12) connected by arcs a1 to a14 through fork, branch, or and and nodes, mapped onto processors P1 and P2; below it, the resulting schedule (Resources vs. Time) within the deadline.]
• Number of nodes: 12
• Graph of activities
• Node type:
  • Normal, Branch, Conditional, Terminator
• Node behaviour:
  • Or, And, Fork, Branch
• Number of CPUs: 2
• Task Allocation
• Task Scheduling
• Arc priorities

#define TASK_NUMBER 12
#define N_CPU 2

//Node Type: 0 NORMAL; 1 BRANCH ; 2 STOCHASTIC
uint node_type[TASK_NUMBER] = {1,2,2,1,..};

//Node Behaviour: 0 AND ; 1 OR; 2 FORK; 3 BRANCH
uint node_behaviour[TASK_NUMBER] = {2,3,3,..};

uint queue_consumer [..] [..] = { {0,1,1,0,..}, {0,0,0,1,1,.}, {0,0,0,0,0,1,1..}, {0,0,0,0,..}..};

uint task_on_core[TASK_NUMBER] = {1,1,2,1};
int schedule_on_core[N_CPU][TASK_NUMBER] = {{1,2,4,8}..};
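When filling in these tables by hand, the allocation and the schedule can easily drift apart, so a small consistency check may be useful. The sketch below is not part of the presented framework. It assumes that `task_on_core[t]` holds the 1-based core of task t+1, that each row of `schedule_on_core` lists 1-based task ids in execution order, and that unused slots are 0. The function name is hypothetical, and `unsigned` stands in for the template's `uint`.

```c
#include <assert.h>

#define TASK_NUMBER 12
#define N_CPU 2

/* Returns 1 if every task listed in the schedule of a core is also
 * allocated to that core, 0 otherwise.
 * Assumed conventions (not confirmed by the slides): cores and task
 * ids are 1-based, and a 0 entry marks the end of a core's schedule. */
int schedule_matches_allocation(const unsigned task_on_core[TASK_NUMBER],
                                int schedule_on_core[N_CPU][TASK_NUMBER])
{
    for (int c = 0; c < N_CPU; c++) {
        for (int s = 0; s < TASK_NUMBER; s++) {
            int task = schedule_on_core[c][s];
            if (task == 0)
                break;                      /* end of this core's list */
            if (task < 1 || task > TASK_NUMBER)
                return 0;                   /* invalid task id */
            if (task_on_core[task - 1] != (unsigned)(c + 1))
                return 0;                   /* scheduled on the wrong core */
        }
    }
    return 1;
}
```

Under these conventions, the visible prefix of the example is consistent: tasks 1, 2 and 4 are allocated to core 1 and appear in the schedule of core 1.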
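Queue ordering optimization
[Figure: tasks T1 to T6 on CPU1 and CPU2 exchanging messages C1 to C5; depending on the order of the queues, a consumer task either waits ("Wait!") or proceeds ("RUN!").]
• Communication ordering affects system performance.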
Synchronization among tasks
[Figure: tasks T1 to T4 on Proc. 1 and Proc. 2 exchanging messages C1 to C3; the timeline shows T4 being suspended and later re-activated.]
• Non-blocking semaphores.
CTG Application Development Methodology
[Figure: methodology flow. Characterization Phase: Simulator, Application Profiles. Optimization Phase: Optimizer, Allocation, Scheduling. Then: Application Development Support, Optimal SW Application Implementation, Platform Execution.]
Computational Efficiency
• 2 groups of instances:
  • slightly structured:
    • very short tracks;
    • quite often contain singleton nodes;
  • completely structured:
    • one head, one tail, long tracks.
• The solution times are of the same order as in the deterministic case.
Validation of optimizer solutions
[Figure: the Optimizer produces the Optimal Allocation & Schedule, which is checked by Virtual Platform validation.]
• MAX error lower than 10%;
• AVG error equal to 4.8%, with standard deviation of 2.41.
Validation of optimizer solutions
• Differences are marginal;
• All the deadline constraints are satisfied.
Conclusions
• Cooperative framework to solve the allocation and scheduling problem to optimality for conditional task graphs on MPSoCs;
• Logic-Based Benders Decomposition;
• New development methodology;
• Solutions validated by means of a complete MPSoC virtual platform;
• Experimental results proved the accuracy of the problem model.