This project by Peter Aronsson focuses on parallelizing simulation code from Modelica, an object-oriented, equation-based modeling language, to decrease simulation execution time and handle complex multi-domain systems. The research covers task graphs, scheduling algorithms, and task clustering/merging techniques to optimize parallel computation. The work aims to enable efficient simulation of large models and meet real-time demands in hardware-in-the-loop simulations.
Automatic Parallelization of Simulation Code from Equation Based Simulation Languages Peter Aronsson, Industrial PhD student, PELAB SaS IDA Linköping University, Sweden Based on Licentiate presentation & CPC’03 Presentation Peter Aronsson
Outline • Introduction • Task Graphs • Related work on Scheduling & Clustering • Parallelization Tool • Contributions • Results • Conclusion & Future Work
Introduction • Modelica • Object-oriented, equation-based modeling language • Modelica enables modeling and simulation of large and complex multi-domain systems • Large need for parallel computation • To decrease the execution time of simulations • To make large models possible to simulate at all • To meet hard real-time demands in hardware-in-the-loop simulations
Examples of large complex systems in Modelica
Modelica Example - DCmotor
Modelica example
model DCMotor
  import Modelica.Electrical.Analog.Basic.*;
  import Modelica.Electrical.Sources.StepVoltage;
  Resistor R1(R=10);
  Inductor I1(L=0.1);
  EMF emf(k=5.4);
  Ground ground;
  StepVoltage step(V=10);
  Modelica.Mechanics.Rotational.Inertia load(J=2.25);
equation
  connect(R1.n, I1.p);
  connect(I1.n, emf.p);
  connect(emf.n, ground.p);
  connect(emf.flange_b, load.flange_a);
  connect(step.p, R1.p);
  connect(step.n, ground.p);
end DCMotor;
Example – Flat set of Equations
  R1.v = -R1.n.v+R1.p.v
  0 = R1.n.i+R1.p.i
  R1.i = R1.p.i
  R1.i*R1.R = R1.v
  I1.v = -I1.n.v+I1.p.v
  0 = I1.n.i+I1.p.i
  I1.i = I1.p.i
  I1.L*I1.der(i) = I1.v
  emf.v = -emf.n.v+emf.p.v
  0 = emf.n.i+emf.p.i
  emf.i = emf.p.i
  emf.w = emf.flange_b.der(phi)
  emf.k*emf.w = emf.v
  emf.flange_b.tau = -emf.i*emf.k
  ground.p.v = 0
  step.v = -step.n.v+step.p.v
  0 = step.n.i+step.p.i
  step.i = step.p.i
  step.signalSource.outPort.signal[1] = (if time < step.signalSource.p_startTime[1] then 0 else step.signalSource.p_height[1])+step.signalSource.p_offset[1]
  step.v = step.signalSource.outPort.signal[1]
  load.flange_a.phi = load.phi
  load.flange_b.phi = load.phi
  load.w = load.der(phi)
  load.a = load.der(w)
  load.a*load.J = load.flange_a.tau+load.flange_b.tau
  R1.n.v = I1.p.v
  I1.p.i+R1.n.i = 0
  I1.n.v = emf.p.v
  emf.p.i+I1.n.i = 0
  emf.n.v = step.n.v
  step.n.v = ground.p.v
  emf.n.i+ground.p.i+step.n.i = 0
  emf.flange_b.phi = load.flange_a.phi
  emf.flange_b.tau+load.flange_a.tau = 0
  step.p.v = R1.p.v
  R1.p.i+step.p.i = 0
  load.flange_b.tau = 0
  step.signalSource.y = step.signalSource.outPort.signal
Plot of Simulation result • load.flange_a.tau • load.w
Task Graphs • Directed Acyclic Graph (DAG) G = (V, E, t, c) • V – set of nodes, representing computational tasks • E – set of edges, representing communication of data between tasks • t(v) – execution cost for node v • c(i,j) – communication cost for edge (i,j) • Referred to as the delay model (macro dataflow model)
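The definition G = (V, E, t, c) maps directly onto a small data structure. The following is a minimal sketch, not the tool's actual representation; the class name `TaskGraph`, its methods, and the example costs are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TaskGraph:
    """Delay-model task graph G = (V, E, t, c) (illustrative sketch)."""
    t: dict = field(default_factory=dict)  # node v -> execution cost t(v)
    c: dict = field(default_factory=dict)  # edge (i, j) -> communication cost c(i, j)

    def add_task(self, v, cost):
        self.t[v] = cost

    def add_edge(self, i, j, cost):
        if i not in self.t or j not in self.t:
            raise KeyError("both endpoints must be added as tasks first")
        self.c[(i, j)] = cost

    def successors(self, v):
        return [j for (i, j) in self.c if i == v]

    def predecessors(self, v):
        return [i for (i, j) in self.c if j == v]

    def topological_order(self):
        """Kahn's algorithm; raises ValueError if the graph is not a DAG."""
        indeg = {v: 0 for v in self.t}
        for (_, j) in self.c:
            indeg[j] += 1
        ready = [v for v in self.t if indeg[v] == 0]
        order = []
        while ready:
            v = ready.pop()
            order.append(v)
            for s in self.successors(v):
                indeg[s] -= 1
                if indeg[s] == 0:
                    ready.append(s)
        if len(order) != len(self.t):
            raise ValueError("graph is not acyclic")
        return order


# Hypothetical four-task example (costs are made up for illustration).
g = TaskGraph()
for name, cost in [("a", 1), ("b", 2), ("c", 2), ("d", 1)]:
    g.add_task(name, cost)
g.add_edge("a", "b", 5)
g.add_edge("a", "c", 5)
g.add_edge("b", "d", 10)
g.add_edge("c", "d", 10)
print(g.topological_order())
```

Storing t and c as dictionaries keeps the costs next to the graph structure, which is what scheduling and clustering algorithms need to look up.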
Small Task Graph Example [Figure: task graph with eight nodes, each labeled with its execution cost (1 or 2), and edges labeled with communication costs (5 or 10)]
Task Scheduling Algorithms • Multiprocessor Scheduling Problem • For each task, assign • Starting time • Processor assignment (P1, ..., PN) • Goal: minimize execution time, given • Precedence constraints • Execution cost • Communication cost • Algorithms in literature • List Scheduling approaches (ERT, FLB) • Critical Path scheduling approaches (TDS, MCP) • Categories: fixed number of processors, fixed c and/or t, ...
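To make the scheduling problem concrete, here is a generic greedy list scheduler under the delay model. It is a sketch only: it is not ERT, FLB, TDS or MCP, and the function name `list_schedule` and the example costs are assumptions. Each ready task is placed on the processor where it can start earliest, and communication cost is paid only when a predecessor ran on a different processor.

```python
def list_schedule(t, c, n_procs):
    """Return {task: (processor, start, finish)} for a DAG with costs t and c."""
    pred = {v: [] for v in t}
    succ = {v: [] for v in t}
    for (i, j) in c:
        pred[j].append(i)
        succ[i].append(j)
    indeg = {v: len(pred[v]) for v in t}
    ready = sorted(v for v in t if indeg[v] == 0)
    proc_free = [0] * n_procs      # time at which each processor becomes idle
    placed = {}
    while ready:
        v = ready.pop(0)
        best = None
        for p in range(n_procs):
            # Data from a predecessor on another processor arrives c(u, v) later.
            data_ready = max(
                (placed[u][2] + (0 if placed[u][0] == p else c[(u, v)])
                 for u in pred[v]),
                default=0,
            )
            start = max(proc_free[p], data_ready)
            if best is None or start < best[1]:
                best = (p, start)
        p, start = best
        placed[v] = (p, start, start + t[v])
        proc_free[p] = start + t[v]
        for s in succ[v]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return placed


# Hypothetical chain with expensive communication (values are made up).
t = {"a": 2, "b": 3, "d": 1}
c = {("a", "b"): 10, ("b", "d"): 5}
schedule = list_schedule(t, c, n_procs=2)
makespan = max(finish for (_, _, finish) in schedule.values())
print(schedule, makespan)
```

With communication this expensive relative to execution, the greedy choice keeps the whole chain on one processor, which is the effect that granularity describes.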
Granularity • Granularity g = min(t(v))/max(c(i,j)) • Affects scheduling result • E.g. TDS works best for high values of g, i.e. low communication cost relative to execution cost • Solutions: • Clustering algorithms • Idea: build clusters of nodes where nodes in the same cluster are executed on the same processor • Merging algorithms • Merge tasks to increase the computational cost of each task
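The granularity formula can be evaluated directly from the cost maps. A minimal sketch, assuming t and c are dictionaries as in the delay model; the function name and the example values are hypothetical.

```python
def granularity(t, c):
    """g = min(t(v)) / max(c(i, j)); infinite if there is no communication cost."""
    if not c or max(c.values()) == 0:
        return float("inf")
    return min(t.values()) / max(c.values())


# Hypothetical fine-grained graph: cheap tasks, expensive messages.
t = {"a": 1, "b": 2, "c": 2, "d": 1}
c = {("a", "b"): 5, ("a", "c"): 5, ("b", "d"): 10, ("c", "d"): 10}
g = granularity(t, c)
print(g)
```

A value far below 1, as here, indicates a fine-grained graph in which communication dominates computation.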
Task Clustering/Merging Algorithms • Task Clustering Problem: • Build clusters of nodes such that parallel time decreases • PT(n) = tlevel(n) + blevel(n) • By zeroing edges, i.e. putting several nodes into the same cluster => zero communication cost • Literature: • Sarkar's internalization algorithm, Yang's DSC algorithm • Task Merging Problem • Transform the task graph by merging nodes • Literature: e.g. the Grain Packing algorithm
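The effect of zeroing edges on parallel time can be sketched as follows. This assumes the usual definitions: tlevel(n) is the longest path from an entry node to n, excluding t(n), and blevel(n) is the longest path from n to an exit node, including t(n). The function names and example costs are assumptions, and the sketch does not model the sequential ordering of independent tasks that share a cluster.

```python
def topological_order(t, c):
    indeg = {v: 0 for v in t}
    succ = {v: [] for v in t}
    for (i, j) in c:
        succ[i].append(j)
        indeg[j] += 1
    ready = [v for v in t if indeg[v] == 0]
    order = []
    while ready:
        v = ready.pop()
        order.append(v)
        for s in succ[v]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return order


def levels(t, c, cluster_of=None):
    """Return (tlevel, blevel); edges inside one cluster cost zero."""
    cluster_of = cluster_of or {v: v for v in t}

    def comm(i, j):
        return 0 if cluster_of[i] == cluster_of[j] else c[(i, j)]

    order = topological_order(t, c)
    tlevel = {v: 0 for v in t}
    for v in order:
        for (i, j) in c:
            if i == v:
                tlevel[j] = max(tlevel[j], tlevel[v] + t[v] + comm(i, j))
    blevel = {v: t[v] for v in t}
    for v in reversed(order):
        for (i, j) in c:
            if i == v:
                blevel[v] = max(blevel[v], t[v] + comm(i, j) + blevel[j])
    return tlevel, blevel


def parallel_time(t, c, cluster_of=None):
    tlevel, blevel = levels(t, c, cluster_of)
    return max(tlevel[v] + blevel[v] for v in t)


# Hypothetical chain a -> b -> d (costs are made up).
t = {"a": 2, "b": 3, "d": 1}
c = {("a", "b"): 10, ("b", "d"): 5}
pt_unclustered = parallel_time(t, c)
pt_clustered = parallel_time(t, c, {"a": 0, "b": 0, "d": 1})  # zero edge (a, b)
print(pt_unclustered, pt_clustered)
```

Placing a and b in the same cluster removes the cost of edge (a, b) from the critical path, so the parallel time drops.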
Clustering vs. Merging [Figure: a task graph with node execution costs and edge communication costs, shown as a Clustered Task Graph (edges inside a cluster zeroed) and, after merging, as a Merged Task Graph]
DSC algorithm • Initially, put each node in a separate cluster • Traverse the task graph • Merge clusters as long as parallel time does not increase • Low complexity: O((n+e) log n) • Previously used by Andersson in ObjectMath (PELAB)
Modelica Compilation [Diagram of the compilation pipeline with the stages: Modelica model (.mo), Modelica semantics, Flat Modelica (.mof), Equation system (DAE), Opt., Rhs calculations, C code, Numerical solver]
Structure of simulation code:
for (t = 0; t < stopTime; t += stepSize) {
  x_dot[t+1] = f(x_dot[t], x[t], t);
  x[t+1] = ODESolver(x_dot[t+1]);
}
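The loop structure above can be sketched as a runnable fixed-step simulation. This is an illustration under simplifying assumptions: the right-hand side is written as an explicit ODE, x_dot = f(x, t), and forward Euler stands in for the unspecified ODESolver. The names `simulate` and `decay` are hypothetical.

```python
import math


def simulate(f, x0, stop_time, step_size):
    """Fixed-step loop: evaluate the right-hand side f, then advance the state."""
    x = list(x0)
    t = 0.0
    trajectory = [(t, list(x))]
    n_steps = int(round(stop_time / step_size))
    for k in range(n_steps):
        x_dot = f(x, t)  # right-hand side calculations
        x = [xi + step_size * di for xi, di in zip(x, x_dot)]  # forward Euler step
        t = (k + 1) * step_size
        trajectory.append((t, list(x)))
    return trajectory


def decay(x, t):
    """Hypothetical test system dx/dt = -x, exact solution exp(-t)."""
    return [-x[0]]


trajectory = simulate(decay, [1.0], stop_time=1.0, step_size=0.001)
final_time, final_state = trajectory[-1]
print(final_time, final_state[0], math.exp(-1.0))
```

The call to f is executed once per step, so it dominates the run time of a large model; that is the part of the generated code a task graph is built from.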
Optimizations on equations • Simplification of equations, e.g. a=b, b=c => eliminate b • BLT transformation, i.e. topological sorting of the strongly connected components (BLT = Block Lower Triangular form) • Index reduction: the index is how many times an equation needs to be differentiated in order to solve the equation system • Mixed Mode / Inline Integration: methods of optimizing equations by reducing the size of equation systems
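The BLT step can be sketched with Tarjan's strongly connected components algorithm. The sketch assumes each equation has already been matched to the variable it solves for, so the input is a map from an equation to the equations it depends on. The function name `blt_blocks` and the equation names are hypothetical.

```python
def blt_blocks(deps):
    """Return blocks of mutually dependent equations in a valid solve order.

    deps[e] lists the equations whose solved variables equation e needs.
    Tarjan's algorithm emits a component only after everything it depends
    on has been emitted, which is the block lower triangular order.
    """
    index, low = {}, {}
    stack, on_stack = [], set()
    blocks = []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in deps.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a component
            block = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                block.append(w)
                if w == v:
                    break
            blocks.append(sorted(block))

    for v in deps:
        if v not in index:
            visit(v)
    return blocks


# Hypothetical system: e2 and e3 depend on each other (an algebraic loop).
deps = {"e1": [], "e2": ["e1", "e3"], "e3": ["e2"], "e4": ["e3"]}
blocks = blt_blocks(deps)
print(blocks)
```

Blocks with a single equation become plain assignments, while larger blocks are subsystems that must be solved simultaneously.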
Generated C Code Content • Assignment statements • Arithmetic expressions (+, -, *, /), if-expressions • Function calls • Standard math functions • sin, cos, log • Modelica functions • User-defined, side-effect free • External Modelica functions • In external libraries, written in Fortran or C • Calls to functions for solving subsystems of equations • Linear or non-linear • Example application • Robot simulation has 27 000 lines of generated C code
Parallelization Tool Overview [Diagram: Model (.mo) → Modelica Compiler → C code. Sequential path: C code + Solver lib → C compiler → Seq exe. Parallel path: C code → Parallelizer → Parallel C code + Solver lib + MPI lib → C compiler → Parallel exe]
Parallelization Tool Internal Structure [Diagram with the components: Sequential C code, Parser, Symbol Table, Task Graph Builder, Scheduler, Debug & Statistics, Code Generator, Parallel C code]
Task Graph building • First graph: corresponds to individual arithmetic operations, assignments, function calls and variable definitions in the C code • Second graph: clusters of tasks from the first task graph [Figure: example with the nodes defs, a, b, c, d, the operators +, -, *, / and a call to foo, grouped into the clusters "+,-,*", "+,*", "foo" and "/,-"]
Investigated Scheduling Algorithms • Parallelization Tool • TDS (Task Duplication Scheduling algorithm) • Pre-Clustering Method • Full Task Duplication Method • Experimental Framework (Mathematica) • ERT • DSC • TDS • Full Task Duplication Method • Task Merging approaches (Graph Rewrite Systems)
Method 1: Pre-Clustering algorithm • buildCluster(n: node, l: list of nodes, size: Integer) • Adds n to a new cluster • Repeatedly adds nodes until size(cluster) = size • Children of n • Children of the cluster with in-degree one • Siblings of n • Parents of n • Arbitrary nodes
Managing cycles • When adding a node to a cluster, the resulting graph might have cycles • The graph that results from clustering a and b is cyclic, since {a,b} can be reached from c • The resulting graph is not a DAG • Standard scheduling algorithms cannot be used [Figure: task graph with the nodes a, b, c, d, e]
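Whether a clustering keeps the graph acyclic can be checked by building the cluster-level graph and attempting a topological sort. A minimal sketch; the edge list below is a hypothetical reconstruction chosen so that clustering a and b closes a cycle through c, and is not taken from the figure.

```python
def clustered_graph_is_acyclic(edges, cluster_of):
    """True if the graph between clusters (intra-cluster edges dropped) is a DAG."""
    cluster_edges = {
        (cluster_of[i], cluster_of[j])
        for (i, j) in edges
        if cluster_of[i] != cluster_of[j]
    }
    clusters = set(cluster_of.values())
    indeg = {k: 0 for k in clusters}
    succ = {k: [] for k in clusters}
    for (i, j) in cluster_edges:
        succ[i].append(j)
        indeg[j] += 1
    ready = [k for k in clusters if indeg[k] == 0]
    visited = 0
    while ready:
        k = ready.pop()
        visited += 1
        for s in succ[k]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return visited == len(clusters)  # leftover clusters lie on a cycle


# Hypothetical edges: a -> c -> b, so {a, b} both feeds c and is fed by c.
edges = [("a", "c"), ("c", "b"), ("a", "d"), ("b", "e")]
singletons = {v: v for v in "abcde"}
a_with_b = dict(singletons, a="ab", b="ab")
print(clustered_graph_is_acyclic(edges, singletons))
print(clustered_graph_is_acyclic(edges, a_with_b))
```

A clustering step can run this check before accepting a candidate node and reject the node if the result would not be a DAG.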
Pre-Clustering Results • Did not produce speedup • Introduced far too many dependencies in the resulting task graph • Sequentialized schedule • Conclusion: • For fine-grained task graphs: • Such an algorithm needs task duplication to succeed
Method 2: Full Task Duplication • For each node n with successor(n) = {} • Put all pred(n) in one cluster • Repeat for all nodes in the cluster • Rationale: if the depth of the graph is limited, task duplication will be kept at a reasonable level and cluster size reasonably small • Works well when communication cost >> execution cost
Full Task Duplication (2) • Merging clusters • (1) Merge clusters with a load balancing strategy, without increasing maximum cluster size • (2) Merge clusters with the greatest number of common nodes • Repeat (2) until the requirement on the number of processors is met
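The two slides on Full Task Duplication can be sketched as follows. Each exit node gets a cluster holding itself and all of its transitive predecessors, so shared predecessors are duplicated across clusters. Only the merge by greatest overlap is shown; the load balancing merge is omitted. The function names and the example graph are assumptions.

```python
from itertools import combinations


def ftd_clusters(nodes, edges):
    """One cluster per exit node, containing all of its (transitive) predecessors."""
    pred = {v: set() for v in nodes}
    has_successor = set()
    for i, j in edges:
        pred[j].add(i)
        has_successor.add(i)
    clusters = []
    for n in nodes:
        if n in has_successor:
            continue  # only nodes with successor(n) = {} start a cluster
        cluster, work = set(), [n]
        while work:
            v = work.pop()
            if v not in cluster:
                cluster.add(v)
                work.extend(pred[v])
        clusters.append(cluster)
    return clusters


def merge_by_overlap(clusters, n_procs):
    """Merge the pair with the most common nodes until n_procs clusters remain."""
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    clusters = [set(cl) for cl in clusters]
    while len(clusters) > n_procs:
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda ab: len(clusters[ab[0]] & clusters[ab[1]]),
        )
        clusters[a] |= clusters[b]
        del clusters[b]  # b > a, so index a stays valid
    return clusters


# Hypothetical graph: b is shared by exit nodes c and d, so it is duplicated.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "c"), ("b", "c"), ("b", "d"), ("e", "f")]
clusters = ftd_clusters(nodes, edges)
merged = merge_by_overlap(clusters, n_procs=2)
print(clusters, merged)
```

Merging the clusters that share the most nodes removes the most duplicated work per merge, which is one way to read the rationale for step (2).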
Full Task Duplication Results • Computed measurements • Execution cost of largest cluster + communication cost • Measured speedup • Executed on a PC Linux cluster with an SCI network interface, using SCAMPI
Robot Example Computed Speedup • Mixed Mode / Inline Integration [Figure: computed speedup with MM/II and without MM/II]
Thermofluid pipe executed on PC Cluster • Pressurewavedemo in Thermofluid package, 50 discretization points
Thermofluid pipe executed on PC Cluster • Pressurewavedemo in Thermofluid package, 100 discretization points
Task Merging using GRS (Graph Rewrite Systems) • Idea: a set of simple rules to transform a task graph to increase its granularity (and decrease Parallel Time) • Use top level (and bottom level) as metric: • Parallel Time = max tlevel + max blevel
Rule 1 • Merging a single child with only one parent • Motivation: the merge does not decrease the amount of parallelism in the task graph, and granularity can possibly increase [Figure: p and c are merged into p']
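Rule 1 can be sketched as a rewrite on the cost maps. This reads the rule as: merge p and c when c is the only child of p and p is the only parent of c. The merged node keeps the name of p, its cost is t(p) + t(c), and the edge between them disappears. The function name and the example graph are assumptions.

```python
def apply_rule1(t, c):
    """Repeatedly merge a parent with its single child when that child has one parent."""
    t, c = dict(t), dict(c)
    changed = True
    while changed:
        changed = False
        for (p, child) in list(c):
            children_of_p = [j for (i, j) in c if i == p]
            parents_of_child = [i for (i, j) in c if j == child]
            if children_of_p == [child] and parents_of_child == [p]:
                t[p] += t.pop(child)   # merged task does the work of both
                del c[(p, child)]      # communication between them vanishes
                for (i, j) in list(c):
                    if i == child:     # outgoing edges of the child move to p
                        c[(p, j)] = c.pop((i, j))
                changed = True
                break                  # restart the scan on the rewritten graph
    return t, c


# Hypothetical graph: a -> b, then b fans out to c and d (costs are made up).
t = {"a": 1, "b": 2, "c": 1, "d": 1}
c = {("a", "b"): 5, ("b", "c"): 5, ("b", "d"): 5}
merged_t, merged_c = apply_rule1(t, c)
print(merged_t, merged_c)
```

The fan-out from the merged node to c and d is left alone, because merging there would remove parallelism, which is what the motivation for the rule rules out.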
Rule 2 • Merge all parents of a node together with the node itself • Motivation: if the top level does not increase by the merge, the resulting task will increase in size, potentially increasing granularity [Figure: p1, p2, …, pn and c are merged into c']
Rule 3 • Duplicate the parent and merge it into each child node • Motivation: as long as each child’s tlevel does not increase, duplicating p into the child will reduce the number of nodes and increase granularity [Figure: p is duplicated into c1, c2, …, cn, giving c1', c2', …, cn']
Rule 4 • Merge siblings into a single node as long as a parameterized maximum execution cost is not exceeded • Motivation: this rule can be useful if several small predecessor nodes exist together with a larger predecessor node that prevents a complete merge. Does not guarantee a decrease of PT. [Figure: of the parents p1, p2, …, pn of c, some are merged into p', leaving p', pk+1, …, pn as parents of c]
Results – Example • Task graph from Modelica simulation code • Small example from the mechanical domain • About 100 nodes built on expression level, originating from 84 equations & variables
Result Task Merging example • B=1, L=1
Result Task Merging example • B=1, L=10 • B=1, L=100
Conclusions • The Pre-Clustering approach did not work well for the fine-grained task graphs produced by our parallelization tool • FTD Method • Works reasonably well for some examples • However, in general: • Need for better scheduling/clustering algorithms for fine-grained task graphs
Conclusions (2) • The simple delay model may not be enough • A more advanced model requires more complex scheduling and clustering algorithms • Simulation code from equation-based models • Hard to extract parallelism from • Need new optimization methods on DAEs or ODEs to increase parallelism
Conclusions Task Merging using GRS • A task merging algorithm using GRS has been proposed • Four rules with simple patterns => fast pattern matching • Can easily be integrated in existing scheduling tools • Successfully merges tasks considering • Bandwidth & latency • Task duplication • Merging criterion: decrease Parallel Time, by decreasing tlevel (PT) • Tested on examples from simulation code
Future Work • Designing and implementing better scheduling and clustering algorithms • Support for more advanced task graph models • Work better for high granularity values • Try larger examples • Test on different architectures • Shared memory machines • Dual processor machines
Future Work (2) • Heterogeneous multiprocessor systems • Mixed DSP processors, RISC, CISC, etc. • Enhancing the Modelica language with data parallelism • e.g. parallel loops, vector operations • Parallelize e.g. combined PDE and ODE problems in Modelica • Using e.g. ScaLAPACK for solving subsystems of linear equations. How to integrate this into scheduling algorithms?