Martino Ruggiero, Michele Lombardi, Michela Milano and Luca Benini University of Bologna,

Communication-Aware Stochastic Allocation and Scheduling Framework for Conditional Task Graphs in Multi-Processor Systems-on-Chip Martino Ruggiero, Michele Lombardi, Michela Milano and Luca Benini University of Bologna, DEIS - Italy

Outline • Motivations • Our approach • Problem Model • Methodology • Experimental Results • Conclusions

[Figure: a task graph (T1 to T8) is allocated to processors Proc. 1 … Proc. N, each with a private memory and connected by an interconnect, and then scheduled on the resources over time against a deadline.] • Many realistic applications can only be specified as conditional task graphs. • Allocating and scheduling conditional task graphs on the processors of a distributed real-time system is NP-hard. • New tool flows are needed to map multi-task applications efficiently onto hardware platforms.

Abstraction gap • The abstraction gap between high-level optimization tools and standard application programming models can introduce unpredictable and undesired behaviours. • Programmers must be aware of the simplifying assumptions made by optimization tools. • New methodology for multi-task application development on MPSoCs. [Figure: design flow graph with an optimization flow and a development flow; labels: Platform Modelling, Starting Implementation, Optimization, Analysis, Final Implementation, Optimal Solution, Platform Execution.]

Our approach Our Focus: • Statically scheduled Conditional Task Graph applications; Our Objectives: • Complete approach to allocation and scheduling: • High computational efficiency w.r.t. commercial solvers; • High accuracy of generated solutions; • New methodology for multi-task application development: • To quickly develop multi-task applications; • To easily apply the optimal solution found by our optimizer.

Target architecture - 1 • An architectural template for a message-oriented distributed memory MPSoC: • Support for message exchange between the computation tiles; • Single-token communication; • Availability of local memory devices at the computation tiles and of remote memories for program data. • Several MPSoC platforms available on the market match this template: • The Silicon Hive Avispa-CH1 processor; • The Cradle CT3600 family of multiprocessors; • The Cell processor; • The ARM MPCore platform. • The throughput requirement is reflected in the maximum tolerable scheduling period T of each processor. [Figure: activities Act. A, Act. B, …, Act. N repeating with period T.]

Target architecture - 2 • Homogeneous computation tiles: • ARM cores (including instruction and data caches); • Tightly coupled software-controlled scratch-pad memories (SPM); • AMBA AHB; • DMA engine; • RTEMS OS; • Cores use non-cacheable shared memory to communicate; • Semaphore and interrupt facilities are used for synchronization; • Private on-chip memory to store data.

Target Application: Conditional Task Graph (CTG) • Target applications seldom behave the same way across executions: they contain cycles, conditional jumps or other elements of variability. • A CTG is a triple <T,A,C>, where: • T is the set of nodes modelling generic tasks (e.g. elementary operations, subprograms, ...); • A is the set of arcs modelling precedence constraints (e.g. due to data communication); • C is a set of conditions, each one associated with an arc, modelling what must be true for that branch to be taken during execution (e.g. the condition of an if-then-else construct). • Extension of the generic task graph model with stochastic elements: • Conditional Branches; • Conditional Nodes; • Branch Nodes. [Figure: example CTG with FORK, AND, BRANCH and OR nodes.]
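The triple <T,A,C> can be pictured as a small data structure. The following is only a minimal sketch under assumed names (`ctg`, `ctg_arc`, `ctg_add_arc`, `NO_CONDITION` are all hypothetical and not taken from the authors' library): tasks are indices, and each arc carries its optional condition and the probability of being taken.

```c
#include <assert.h>

#define MAX_TASKS 16
#define MAX_ARCS  32
#define NO_CONDITION (-1)   /* hypothetical marker for an unconditional arc */

/* One element of A, together with its entry of C (illustrative layout). */
typedef struct {
    int src;        /* index of the source task in T */
    int dst;        /* index of the destination task in T */
    int condition;  /* outcome of src that enables this arc, or NO_CONDITION */
    double prob;    /* probability that the arc is taken (1.0 if unconditional) */
} ctg_arc;

/* A CTG <T, A, C>: T is the index range 0..n_tasks-1, A and C live in arcs[]. */
typedef struct {
    int n_tasks;
    int n_arcs;
    ctg_arc arcs[MAX_ARCS];
} ctg;

void ctg_init(ctg *g, int n_tasks) {
    assert(n_tasks > 0 && n_tasks <= MAX_TASKS);
    g->n_tasks = n_tasks;
    g->n_arcs = 0;
}

/* Append an arc and return its index. */
int ctg_add_arc(ctg *g, int src, int dst, int condition, double prob) {
    assert(g->n_arcs < MAX_ARCS);
    assert(src >= 0 && src < g->n_tasks && dst >= 0 && dst < g->n_tasks);
    ctg_arc a = { src, dst, condition, prob };
    g->arcs[g->n_arcs] = a;
    return g->n_arcs++;
}

/* A task is treated as a branch node if any outgoing arc is conditional. */
int ctg_is_branch(const ctg *g, int task) {
    for (int r = 0; r < g->n_arcs; r++)
        if (g->arcs[r].src == task && g->arcs[r].condition != NO_CONDITION)
            return 1;
    return 0;
}
```

For instance, an if-then-else at task 0 would become two arcs with conditions 0 and 1 whose probabilities sum to one.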

Task memory requirements • Each task has three kinds of memory requirements: • Program Data; • Internal State; • Communication queues. • The optimizer can allocate Program Data and Internal State: • On the local SPM; • On the remote Private Memory. • A communicating task might run: • On the same processor → negligible communication cost; • On a remote processor → costly message exchange procedure. • Optimizer constraint: • Communication queues only in SPM → more efficient message passing. [Figure: two tiles (#1, #2), each with an ARM core, SPM, private memory, semaphores and interrupt controller, connected by the system bus.]

Logic Based Benders Decomposition • Decomposes the problem into 2 sub-problems: • Allocation → Integer Programming (IP); objective function: communication cost; subject to memory constraints; • Scheduling → Constraint Programming (CP); subject to the real-time constraint. • The allocation problem passes a valid allocation to the scheduling problem; the scheduling problem returns a no-good, expressed as a linear constraint. • The process continues until the master problem and sub-problem converge, providing the same value. • The methodology has been proven to converge to the optimal solution [J.N. Hooker and G. Ottosson].
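The interaction between the two sub-problems can be illustrated with a toy loop. This is a sketch under strong assumptions, not the authors' solver: the master is a brute-force search over allocations on two homogeneous processors instead of an IP model, the "scheduler" only checks that the summed task durations on each processor fit the deadline instead of running CP, and all names (`lbbd_allocate`, `cuts`, `N_TASKS`) are hypothetical. A no-good here forbids placing an overloading set of tasks together on one processor.

```c
#include <assert.h>
#include <limits.h>

#define N_TASKS  4
#define MAX_CUTS 64

static unsigned cuts[MAX_CUTS];  /* each cut: a task set that overloads a processor */
static int n_cuts;

/* Total duration of the tasks in the bitmask `tasks`. */
static int load(unsigned tasks, const int *dur) {
    int sum = 0;
    for (int i = 0; i < N_TASKS; i++)
        if (tasks & (1u << i)) sum += dur[i];
    return sum;
}

/* An allocation violates a cut if either processor holds the whole forbidden set. */
static int violates_cut(unsigned on0, unsigned on1) {
    for (int c = 0; c < n_cuts; c++)
        if ((on0 & cuts[c]) == cuts[c] || (on1 & cuts[c]) == cuts[c])
            return 1;
    return 0;
}

/* Returns the minimum cost of a schedulable allocation (bit i of *alloc is the
 * processor of task i), or -1 if no allocation meets the deadline. */
int lbbd_allocate(int cost[N_TASKS][2], const int *dur, int deadline,
                  unsigned *alloc)
{
    const unsigned full = (1u << N_TASKS) - 1u;
    assert(deadline >= 0);
    n_cuts = 0;
    for (;;) {
        /* Master problem: cheapest allocation not excluded by a no-good. */
        int best = INT_MAX, found = 0;
        unsigned best_a = 0;
        for (unsigned a = 0; a <= full; a++) {
            if (violates_cut(~a & full, a)) continue;
            int c = 0;
            for (int i = 0; i < N_TASKS; i++) c += cost[i][(a >> i) & 1u];
            if (c < best) { best = c; best_a = a; found = 1; }
        }
        if (!found) return -1;

        /* Sub-problem: feasibility check; on failure, add a no-good and iterate. */
        unsigned on0 = ~best_a & full, on1 = best_a;
        unsigned overloaded = 0;
        if (load(on0, dur) > deadline) overloaded = on0;
        else if (load(on1, dur) > deadline) overloaded = on1;
        else { *alloc = best_a; return best; }
        assert(n_cuts < MAX_CUTS);
        cuts[n_cuts++] = overloaded;
    }
}
```

Each iteration either accepts the master's allocation or removes at least that allocation from the search space, so the loop terminates; because the master always returns the cheapest remaining candidate, the first schedulable one is optimal for this toy model.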

Allocation problem model • Decision variables: Tij = 1 if task i executes on processor j; Mij = 1 if task i allocates the program data on SPM of PE j; Sij = 1 if task i allocates the internal state on SPM of PE j; Crj = 1 if arc r is allocated on SPM of PE j. • Constraints: • Each process can execute on only one processor; • Program data and internal state can be allocated locally on a PE only if the task runs on it; • The communication queue of arc r can be allocated locally only if both the source and the destination tasks run on PE j; • The sum of locally allocated structures cannot exceed the SPM capacity.
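The constraint formulas themselves are not preserved in this transcript. One plausible way to write the four stated constraints with the variables defined on the slide is sketched below; the size symbols m_i (program data), s_i (internal state), c_r (queue of arc r) and the capacity Cap_j are assumptions introduced here, as is the notation r = (i,k) for an arc from task i to task k.

```latex
\begin{align}
  \sum_{j} T_{ij} &= 1 && \forall i
    && \text{(each task on exactly one processor)} \\
  M_{ij} \le T_{ij}, \quad S_{ij} &\le T_{ij} && \forall i, j
    && \text{(local data only where the task runs)} \\
  C_{rj} \le T_{ij}, \quad C_{rj} &\le T_{kj} && \forall r = (i,k),\ \forall j
    && \text{(local queue only if both ends run on PE } j\text{)} \\
  \sum_{i} \left( m_i M_{ij} + s_i S_{ij} \right) + \sum_{r} c_r C_{rj}
    &\le \mathit{Cap}_j && \forall j
    && \text{(SPM capacity)}
\end{align}
```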

Allocation problem model • The objective function: minimization of the amount of data transferred on the bus. • Tij = 1 if task i executes on processor j; Mij = 1 if task i allocates the program data on SPM of PE j; Sij = 1 if task i allocates the internal state on SPM of PE j; Crj = 1 if arc r is allocated on SPM of PE j. [Figure: two CPUs and a memory connected by a bus.]

Bus Traffic modelling (annotations on the terms of the bus traffic expression) • Equal to 1 if task i internal state is remotely allocated; • Equal to 1 if task i program data is remotely allocated; • Activation function equal to 1 if task i executes; • Activation function equal to 1 if tasks i and k execute; • Equal to 1 if the communication queue is remotely allocated.

Bus Traffic modelling • Given an allocation, these two terms are constants. • The minimization of a stochastic function is a very complex operation (even more than exponential).

Bus Traffic modelling • Annotated terms: existence and coexistence probabilities of tasks; constant terms. • Every stochastic dependence is removed, and the expected value is reduced to a deterministic expression. • We developed two polynomial-cost algorithms to compute these probabilities.
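To make the two quantities concrete: the existence probability of task i is the probability that i runs, and the coexistence probability of i and k is the probability that both run in the same execution. The sketch below computes them by brute-force enumeration of all condition outcomes, which is exponential and therefore not the polynomial algorithms mentioned above. It assumes binary conditions, tasks numbered so that every arc goes from a lower to a higher index, task 0 as the always-executed root, and a task running as soon as one incoming arc is active; every name in it is hypothetical.

```c
#include <assert.h>

#define N_MAX 16

typedef struct {
    int src, dst;   /* arc from task src to task dst, with src < dst */
    int branch;     /* index of the governing binary condition, or -1 */
    int outcome;    /* outcome (0 or 1) of that condition enabling the arc */
} arc_t;

/* Fills exist[i] = P(task i runs) and coexist[i][k] = P(i and k both run).
 * p1[b] is the probability that condition b has outcome 1. */
void ctg_probabilities(int n_tasks, const arc_t *arcs, int n_arcs,
                       const double *p1, int n_branch,
                       double exist[N_MAX], double coexist[N_MAX][N_MAX])
{
    assert(n_tasks > 0 && n_tasks <= N_MAX && n_branch >= 0 && n_branch < 20);
    for (int i = 0; i < n_tasks; i++) {
        exist[i] = 0.0;
        for (int k = 0; k < n_tasks; k++) coexist[i][k] = 0.0;
    }
    /* Each bitmask s is one scenario: bit b holds the outcome of condition b. */
    for (unsigned s = 0; s < (1u << n_branch); s++) {
        double prob = 1.0;
        for (int b = 0; b < n_branch; b++)
            prob *= ((s >> b) & 1u) ? p1[b] : 1.0 - p1[b];

        int active[N_MAX] = {0};
        active[0] = 1;  /* the root always runs */
        /* Visit tasks in index order so predecessors are decided first. */
        for (int t = 1; t < n_tasks; t++)
            for (int r = 0; r < n_arcs; r++)
                if (arcs[r].dst == t && active[arcs[r].src] &&
                    (arcs[r].branch < 0 ||
                     (int)((s >> arcs[r].branch) & 1u) == arcs[r].outcome))
                    active[t] = 1;

        /* Accumulate the scenario probability on every task and task pair. */
        for (int i = 0; i < n_tasks; i++) {
            if (!active[i]) continue;
            exist[i] += prob;
            for (int k = 0; k < n_tasks; k++)
                if (active[k]) coexist[i][k] += prob;
        }
    }
}
```

With these probabilities fixed, an expected traffic figure becomes a weighted sum of constants, which is what makes the objective deterministic for a given allocation.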

Scheduling problem model • Five-phase behaviour: INPUT → RS → EXEC → WS → OUTPUT • INPUT = input data reading; • RS = internal state reading; • EXEC = computation activity; • WS = internal state writing; • OUTPUT = output data writing. • Activities are not breakable. • The adopted schema and precedence relations vary with the type of the corresponding node (or/and, branch/fork). • Since the objective function depends only on the allocation, scheduling is just a feasibility problem. • We decided to provide a unique worst-case schedule, forcing each task to execute after all its predecessors in any scenario.
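The worst-case view can be sketched as a simple feasibility check. The code below is an illustration under assumptions, not the CP model used by the authors: each task is one non-breakable block made of its five phases, tasks are numbered in a topological order that is also their order on each core, every task waits for all of its predecessors regardless of conditions, and the names (`task_t`, `worst_case_makespan`) are hypothetical.

```c
#include <assert.h>

#define MAX_TASKS 16
#define MAX_CORES 4

enum { INPUT, RS, EXEC, WS, OUTPUT, N_PHASES };

typedef struct { int phase[N_PHASES]; } task_t;  /* duration of each phase */
typedef struct { int src, dst; } prec_t;         /* src must finish before dst */

/* Total length of the non-breakable five-phase activity. */
static int task_duration(const task_t *t) {
    int sum = 0;
    for (int p = 0; p < N_PHASES; p++) sum += t->phase[p];
    return sum;
}

/* Finish time of the last task when every task starts after all its
 * predecessors and after the previous task on its own core. */
int worst_case_makespan(const task_t *tasks, int n_tasks, const int *core,
                        const prec_t *arcs, int n_arcs)
{
    int core_free[MAX_CORES] = {0};
    int finish[MAX_TASKS] = {0};
    int makespan = 0;
    assert(n_tasks <= MAX_TASKS);
    for (int i = 0; i < n_tasks; i++) {
        assert(core[i] >= 0 && core[i] < MAX_CORES);
        int start = core_free[core[i]];
        for (int r = 0; r < n_arcs; r++) {
            if (arcs[r].dst != i) continue;
            assert(arcs[r].src < i);  /* relies on topological numbering */
            if (finish[arcs[r].src] > start) start = finish[arcs[r].src];
        }
        finish[i] = start + task_duration(&tasks[i]);
        core_free[core[i]] = finish[i];
        if (finish[i] > makespan) makespan = finish[i];
    }
    return makespan;
}
```

An allocation would then be accepted when `worst_case_makespan(...)` does not exceed the deadline; a real scheduler would additionally search over task orderings rather than fix one.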

Efficient Application Development Support • Optimization tools generally rely on many simplifying assumptions. • Neglecting these assumptions in the software implementation can: • generate unpredictable and undesired system-level interactions; • make the overall system error-prone. • We propose an entire framework to help programmers in software implementation: • a generic customizable application template → OFFLINE SUPPORT; • a set of high-level APIs → ONLINE SUPPORT. • The main goals of our development framework are: • exact and reliable execution of the application after the optimization step; • guarantees about high performance and constraint satisfaction.

Customizable Application Template • Starting from a high-level task and data flow graph, software developers can easily and quickly build their application infrastructure. • Programmers can intuitively translate the high-level representation into C code using our facilities and library. • Users can specify: • the number of tasks included in the target application; • their nature (e.g. branch, fork, or-node, and-node); • their precedence constraints (e.g. due to data communication); thus quickly drawing the CTG schema of the application. • Programmers can focus on the functionality of the tasks: • the main effort goes into the more specific and critical sections of the application.

OS-level and Task-level APIs • Users can easily reproduce optimizer solutions, thus: • Not having to deal directly with the optimizer’s abstractions: • Task model; • Communication model; • OS overheads. • Obtaining the needed application constraint satisfaction. • Programmers can allocate to the right hardware resources: • Tasks; • Program data; • Queues. • Scheduling support APIs. • Communication issues: • Shared queues; • Semaphores; • Interrupts.

Example • Number of nodes: 12 • Graph of activities • Node type • Normal, Branch, Conditional, Terminator • Node behaviour • Or, And, Fork, Branch • Number of CPUs: 2 • Task Allocation • Task Scheduling • Arc priorities
[Figure: example CTG with tasks T1 to T12 and arcs a1 to a14, containing fork, branch, or and and nodes, allocated on processors P1 and P2; Gantt chart of the resulting schedule on the resources over time against the deadline.]
#define TASK_NUMBER 12
#define N_CPU 2
//Node Type: 0 NORMAL; 1 BRANCH ; 2 STOCHASTIC
uint node_type[TASK_NUMBER] = {1,2,2,1,..};
//Node Behaviour: 0 AND ; 1 OR; 2 FORK; 3 BRANCH
uint node_behaviour[TASK_NUMBER] = {2,3,3,..};
uint queue_consumer [..] [..] = { {0,1,1,0,..}, {0,0,0,1,1,..}, {0,0,0,0,0,1,1..}, {0,0,0,0,..}..};
uint task_on_core[TASK_NUMBER] = {1,1,2,1};
int schedule_on_core[N_CPU][TASK_NUMBER] = {{1,2,4,8}..};

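Queue ordering optimization • Communication ordering affects system performance. [Figure: tasks T1 to T6 on CPU1 and CPU2 exchanging messages C1 to C5; depending on the order of the queues, a receiving task either waits ("Wait!") or runs ("RUN!").]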

Synchronization among tasks • Non-blocking semaphores. [Figure: timeline of tasks T1 to T4 on Proc. 1 and Proc. 2 with messages C1 to C3; T4 is suspended and later re-activated.]

CTG Application Development Methodology • Characterization Phase: Simulator, Application Profiles; • Optimization Phase: Optimizer, Allocation, Scheduling; • Application Development Support; • Optimal SW Application Implementation; • Platform Execution.

Computational Efficiency • 2 groups of instances: • slightly structured: • very short tracks; • quite often contain singleton nodes; • completely structured: • one head, one tail, long tracks. • The solution times are of the same order as in the deterministic case.

Validation of optimizer solutions • The optimal allocation and schedule produced by the optimizer are validated on the virtual platform. • MAX error lower than 10%; • AVG error equal to 4.8%, with standard deviation of 2.41.

Validation of optimizer solutions • Differences are marginal; • All the deadline constraints are satisfied.

Conclusions • Cooperative framework to solve the allocation and scheduling problem to optimality for conditional task graphs onto MPSoCs; • Logic-Based Benders Decomposition; • New development methodology; • Solutions validated by means of a complete MPSoC virtual platform; • Experimental results proved accuracy of the problem model.
