Parallel Programming and Timing Analysis on Embedded Multicores
Eugene Yip, The University of Auckland
Supervisors: Dr. Partha Roop, Dr. Morteza Biglari-Abhari (UoA)
Advisor: Dr. Alain Girault (INRIA)
Outline • Introduction • ForeC Language • Timing Analysis • Results • Conclusions
Introduction • Safety-critical systems: • Perform specific real-time tasks. • Must comply with strict safety standards [IEC 61508, DO-178]. • Time-predictability is useful in real-time designs. [Figure: safety-critical embedded systems at the intersection of embedded systems, safety-critical concerns, and timing/functionality requirements.] [Paolieri et al 2011] Towards Functional-Safe Timing-Dependable Real-Time Architectures.
Introduction • Safety-critical systems: • Shift from single-core to multicore processors. • Cheaper, with a better power-to-performance trade-off. [Figure: cores 0 to n accessing shared resources over a shared system bus.] [Blake et al 2009] A Survey of Multicore Processors. [Cullmann et al 2010] Predictability Considerations in the Design of Multi-Core Embedded Systems.
Introduction • Parallel programming: • Has moved from supercomputers to mainstream computers. • Frameworks are designed for systems without resource constraints or safety concerns. • Optimised for average-case performance (FLOPS), not time-predictability. • Threaded programming model: Pthreads, OpenMP, Intel Cilk Plus, ParC, ... • Non-deterministic thread interleaving makes programs hard to understand and debug. [Lee 2006] The Problem with Threads.
Introduction • Parallel programming: • The programmer is responsible for managing shared resources. • Concurrency errors: • Deadlock, race conditions, atomicity violations, order violations. [McDowell et al 1989] Debugging Concurrent Programs. [Lu et al 2008] Learning from Mistakes: A Comprehensive Study on Real World Concurrency Bug Characteristics.
Introduction • Synchronous languages: • Deterministic concurrency (formal semantics). • Execution model similar to digital circuits. • Threads execute in lock-step to a global clock. • Threads communicate via instantaneous signals. • Concurrency is logical. Typically compiled away. [Figure: inputs are sampled and outputs emitted at each global tick (ticks 1 to 4).] [Benveniste et al 2003] The Synchronous Languages 12 Years Later.
Introduction • Synchronous languages: Must validate: max(Reaction time) < min(Time for each tick), where the time for each tick is specified by the system’s timing requirements. [Figure: reaction times must fit within the tick boundaries (1s, 2s, 3s, 4s) on the physical timeline.] [Benveniste et al 2003] The Synchronous Languages 12 Years Later.
Introduction • Synchronous languages • Esterel, Lustre, Signal • Synchronous extensions to C: • PRET-C • Reactive Shared Variables • Synchronous C • Esterel C Language Retain the essence of C and add deterministic concurrency and thread communication. [Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs. [Boussinot 1993] Reactive Shared Variables Based Systems. [Hanxleden et al 2009] SyncCharts in C - A Proposal for Light-Weight, Deterministic Concurrency. [Lavagno et al 1999] ECL: A Specification Environment for System-Level Design.
Introduction • Synchronous languages • Esterel, Lustre, Signal • Synchronous extensions to C: • PRET-C • Reactive Shared Variables • Synchronous C • Esterel C Language Concurrent threads are scheduled sequentially in a cooperative manner. This ensures thread-safe access to shared variables. Semantics designed to facilitate static analysis. [Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs. [Boussinot 1993] Reactive Shared Variables Based Systems. [Hanxleden et al 2009] SyncCharts in C - A Proposal for Light-Weight, Deterministic Concurrency. [Lavagno et al 1999] ECL: A Specification Environment for System-Level Design.
Introduction • Synchronous languages • Esterel, Lustre, Signal • Synchronous extensions to C: • PRET-C • Reactive Shared Variables • Synchronous C • Esterel C Language Read phase followed by write phase for shared variables. Multiple writes to the same shared variable are combined using an associative and commutative “combine function”. [Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs. [Boussinot 1993] Reactive Shared Variables Based Systems. [Hanxleden et al 2009] SyncCharts in C - A Proposal for Light-Weight, Deterministic Concurrency. [Lavagno et al 1999] ECL: A Specification Environment for System-Level Design.
Introduction • Synchronous languages • Esterel, Lustre, Signal • Synchronous extensions to C: • PRET-C • Reactive Shared Variables • Synchronous C • Esterel C Language More expressive than PRET-C, but static timing analysis hasn’t been formulated yet. [Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs. [Boussinot 1993] Reactive Shared Variables Based Systems. [Hanxleden et al 2009] SyncCharts in C - A Proposal for Light-Weight, Deterministic Concurrency. [Lavagno et al 1999] ECL: A Specification Environment for System-Level Design.
Introduction • Synchronous languages • Esterel, Lustre, Signal • Synchronous extensions to C: • PRET-C • Reactive Shared Variables • Synchronous C • Esterel C Language Sequential execution semantics. Unsuitable for parallel execution. [Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs. [Boussinot 1993] Reactive Shared Variables Based Systems. [Hanxleden et al 2009] SyncCharts in C - A Proposal for Light-Weight, Deterministic Concurrency. [Lavagno et al 1999] ECL: A Specification Environment for System-Level Design.
Introduction • Synchronous languages • Esterel, Lustre, Signal • Synchronous extensions to C: • PRET-C • Reactive Shared Variables • Synchronous C • Esterel C Language Compilation produces sequential programs. Unsuitable for parallel execution. [Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs. [Boussinot 1993] Reactive Shared Variables Based Systems. [Hanxleden et al 2009] SyncCharts in C - A Proposal for Light-Weight, Deterministic Concurrency. [Lavagno et al 1999] ECL: A Specification Environment for System-Level Design.
Outline • Introduction • ForeC Language • Timing Analysis • Results • Conclusions
ForeC Language ForeC (pronounced “foresee”) • C-based, multi-threaded, synchronous language, inspired by PRET-C and Esterel. • Deterministic parallel execution on embedded multicores. • Fork/join parallelism and shared-memory thread communication. • Program behaviour is independent of the chosen thread scheduling.
ForeC Language • Additional constructs to C: • pause: Synchronisation barrier. Pauses the thread’s execution until all threads have paused. • par(st1, ..., stn): Forks each statement to execute as a parallel thread. Each statement is implicitly scoped. • [weak] abort st when [immediate] exp: Preempts the statement st when exp evaluates to a non-zero value. exp is evaluated in each global tick before st is executed. A short sketch of how these constructs compose follows.
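A minimal sketch of how these constructs might compose (illustrative only: the sampler thread, the kill input, and the exact abort block syntax are assumptions, not taken from the ForeC reference):

input int kill;                       // sampled from the environment each global tick

void main(void) {
  weak abort {
    par(sampler(1), sampler(2));      // fork two parallel threads
  } when (kill);                      // preempted once kill is non-zero
}

void sampler(int id) {
  while (1) {
    // ... perform one tick’s worth of work ...
    pause;                            // wait at the global synchronisation barrier
  }
}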
ForeC Language • Additional variable type-qualifiers to C: • input and output: Declares a variable whose value is updated from, or emitted to, the environment at each global tick.
ForeC Language • Additional variable type-qualifiers to C: • shared: Declares a shared variable that can be accessed by multiple threads. • At the start of its local tick, each thread makes a local copy of the shared variables it may use. • Threads only modify their local copies during execution. • When a par statement terminates: • Modified copies from the child threads are combined (using a commutative & associative function) and assigned to the parent. • When the global tick ends: • The modified copies are combined and assigned to the actual shared variables.
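Putting the qualifiers together, a small hedged sketch of the declarations (the variable names are illustrative; the shared syntax follows the execution example below):

input  int speed;                        // updated from the environment each global tick
output int throttle;                     // emitted to the environment each global tick
shared int sum = 1 combine with plus;    // per-thread copies merged by plus()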
Execution Example

shared int sum = 1 combine with plus;   // shared variable

int plus(int copy1, int copy2) {        // commutative and associative combine function
  return (copy1 + copy2);
}

void main(void) {
  par(f(1), f(2));                      // fork-join
}

void f(int i) {
  sum = sum + i;
  pause;                                // synchronisation
  ...
}
Execution Example Tracing the program: at the start of the first global tick, main forks f(1) and f(2). Each thread takes a local copy of sum (value 1) and updates it, so f(1)’s copy becomes 2 and f(2)’s copy becomes 3. Both threads then reach pause, ending the global tick. The modified copies are combined with plus(2, 3), so sum holds 5 when the next global tick starts.
Execution Example Shared variables: • Threads modify local copies of shared variables. • Isolation of thread execution allows threads to truly execute in parallel. • Thread interleaving does not affect the program’s behaviour. • Prevents most concurrency errors: • Deadlock, race conditions: no locks. • Atomicity and order violations: local copies. • Copies of a shared variable can be split into groups and combined in parallel.
Execution Example Shared variables: • Programmer has to define a suitable combine function for each shared variable. • Must ensure the combine function is indeed commutative & associative. • Notion of “combine functions” is not entirely new: • Intel Cilk Plus, OpenMP, MPI, UPC, X10 • Esterel, Reactive Shared Variables [Intel Cilk Plus] http://software.intel.com/en-us/intel-cilk-plus [OpenMP] http://openmp.org [MPI] http://www.mcs.anl.gov/research/projects/mpi/ [Unified Parallel C] http://upc.lbl.gov/ [X10] http://x10-lang.org/ [Berry et al 1992] The Esterel Synchronous Programming Language: Design, Semantics and Implementation. [Boussinot 1993] Reactive Shared Variables Based Systems.
Execution Example Shared variables: • Programmer has to define a suitable combine function for each shared variable. • Must ensure the combine function is indeed commutative & associative. • Notion of “combine functions” is not entirely new: • Intel Cilk Plus: cilk::reducer_op, cilk::holder_op • OpenMP: reduction(operator: var) • MPI: MPI_Reduce, MPI_Gather • UPC: shared var, collectives • X10: aggregates [Intel Cilk Plus] http://software.intel.com/en-us/intel-cilk-plus [OpenMP] http://openmp.org [MPI] http://www.mcs.anl.gov/research/projects/mpi/ [Unified Parallel C] http://upc.lbl.gov/ [X10] http://x10-lang.org/ [Berry et al 1992] The Esterel Synchronous Programming Language: Design, Semantics and Implementation. [Boussinot 1993] Reactive Shared Variables Based Systems.
Execution Example Shared variables: • Programmer has to define a suitable combine function for each shared variable. • Must ensure the combine function is indeed commutative & associative. • Notion of “combine functions” is not entirely new: • Reactive Shared Variables: shared var with a combine operator • Esterel: valued signals with a combine operator [Intel Cilk Plus] http://software.intel.com/en-us/intel-cilk-plus [OpenMP] http://openmp.org [MPI] http://www.mcs.anl.gov/research/projects/mpi/ [Unified Parallel C] http://upc.lbl.gov/ [X10] http://x10-lang.org/ [Berry et al 1992] The Esterel Synchronous Programming Language: Design, Semantics and Implementation. [Boussinot 1993] Reactive Shared Variables Based Systems.
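To see what “commutative & associative” rules in and out, a plain C sketch (max_combine and bad_combine are illustrative names, not part of ForeC):

// A valid combine function: max is commutative and associative,
// so the combined result is independent of combine order.
int max_combine(int copy1, int copy2) {
  return (copy1 > copy2) ? copy1 : copy2;
}

// An invalid combine function: subtraction is neither commutative
// nor associative, so the result would depend on combine order.
int bad_combine(int copy1, int copy2) {
  return copy1 - copy2;
}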
Shared Variable Design Patterns • Point-to-point • Broadcast • Software pipelining • Divide and conquer • Scatter/Gather • Map/Reduce
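As a hedged illustration of the point-to-point pattern under the semantics above (read_sensor and act_on are hypothetical routines; plus is the combine function from the execution example):

shared int data = 0 combine with plus;

void main(void) {
  par(producer(), consumer());
}

void producer(void) {
  data = read_sensor();   // hypothetical; modifies the local copy only
  pause;                  // copy assigned to data at the global tick boundary
}

void consumer(void) {
  pause;                  // wait out the first global tick
  act_on(data);           // hypothetical; sees the value combined last tick
}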
Concurrent Control Flow Graph

shared int sum = 1 combine with plus;

int plus(int copy1, int copy2) {
  return (copy1 + copy2);
}

void main(void) {
  par(f(1), f(2));
}

void f(int i) {
  sum = sum + i;
  pause;
  ...
}
Scheduling • Light-Weight Static Scheduling: • Take advantage of multicore performance while delivering time-predictability. • Generate code to execute directly on hardware (bare metal/no OS). • Thread allocation and scheduling order on each core decided at compile time by the programmer. • Develop a WCRT-aware scheduling heuristic. • Thread isolation allows for scheduling flexibility. • Cooperative (non-preemptive) scheduling.
Scheduling • Cores synchronise to fork/join threads and to end each global tick. • One core performs housekeeping tasks at the end of the global tick: • Combining shared variables. • Emitting outputs. • Sampling inputs and triggering the next global tick.
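A hedged sketch of what the generated bare-metal loop on each core might look like (every name here is illustrative, not the actual generated code):

void core_loop(int core_id) {
  while (1) {
    // Run this core’s statically allocated threads in the
    // compile-time scheduling order, cooperatively (no preemption).
    run_local_threads(core_id);     // hypothetical helper

    barrier_wait();                 // synchronise all cores at tick end
    if (core_id == 0) {             // one core does the housekeeping:
      combine_shared_variables();   //   merge the modified copies
      emit_outputs();               //   emit outputs to the environment
      sample_inputs();              //   sample inputs for the next tick
    }
    barrier_wait();                 // release all cores into the next tick
  }
}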
Outline • Introduction • ForeC Language • Timing Analysis • Results • Conclusions
Timing Analysis Compute the program’s worst-case reaction time (WCRT). Must validate: max(Reaction time) < min(Time for each tick), where the time for each tick is specified by the system’s timing requirements. [Figure: reaction times must fit within the tick boundaries (1s, 2s, 3s, 4s) on the physical timeline.] [Benveniste et al 2003] The Synchronous Languages 12 Years Later.
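Written out, with $R_k$ the reaction time of global tick $k$ and $T_k$ the time allotted to that tick by the timing requirements:

$$\max_k R_k \;<\; \min_k T_k$$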
Timing Analysis Existing approaches for synchronous programs: • Integer Linear Programming (ILP) • “Coarse-grained” Reachability (Max-Plus) • Model Checking One existing approach for analysing the WCRT of synchronous programs on multicores: • [Ju et al 2010] Timing Analysis of Esterel Programs on General-Purpose Multiprocessors. • Uses ILP, provides no tightness result, and all experiments were performed on a 4-core processor.
Timing Analysis Existing approaches for synchronous programs. • Integer Linear Programming (ILP) • Execution time of the program described as a set of integer linear constraints. • Solving ILP is NP-complete. [Ju et al 2010] Timing Analysis of Esterel Programs on General-Purpose Multiprocessors.
Timing Analysis Existing approaches for synchronous programs. • “Coarse-grained” Reachability (Max-Plus) • Compute the WCRT of each thread. • Using the thread WCRTs, the WCRT of the program is computed. • Assumes there is a global tick where all threads execute their worst-case. [M. Boldt et al 2008] Worst Case Reaction Time Analysis of Concurrent Reactive Programs.
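A simplified sketch of why this can be pessimistic: if every thread is charged its worst case in the same global tick, a coarse-grained bound looks like

$$\mathrm{WCRT} \;\le\; \max_{c \,\in\, \mathrm{cores}} \sum_{t \,\in\, \mathrm{threads}(c)} \mathrm{WCRT}_t \;+\; \text{scheduling overheads}$$

(an illustrative form, not the exact Max-Plus formulation), which over-approximates whenever the threads never reach their worst cases together.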
Timing Analysis Existing approaches for synchronous programs. • Model Checking • Computes the execution time along all possible execution paths. • State-space explosion problem. • Binary search: check whether the WCRT is less than “x”. • Trades off analysis time for precision. • Counter-example: an execution trace exhibiting the WCRT. [P. S. Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs.
Timing Analysis Proposed “fine-grained” reachability approach: • Only consider local ticks that can execute together in the same global tick. • Produces a timed execution trace for the WCRT. • To handle the state-space explosion: • Reduce the program’s CCFG before analysis. Analysis flow: program binary (annotated) → reconstruct the program’s CCFG → find all global ticks (reachability) → WCRT.
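A hedged sketch of the reachability idea (the State type and every helper named here are illustrative, not the actual tool):

#define MAX_THREADS 8                            // illustrative bound

// A state records where each thread paused; its successors are the
// combinations of local ticks that can execute together in the next
// global tick.
typedef struct { int pc[MAX_THREADS]; } State;   // per-thread resume points

int wcrt = 0;

void explore(State s) {
  if (already_visited(s)) return;     // hypothetical state cache
  mark_visited(s);
  int n = num_successors(s);          // local ticks that can run together
  for (int i = 0; i < n; i++) {
    State next = successor(s, i);
    int t = tick_time(s, i);          // timed on the multicore model
    if (t > wcrt) wcrt = t;           // WCRT = longest global tick found
    explore(next);
  }
}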
Timing Analysis Programs executed on the following multicore architecture: [Figure: cores 0 to n, each with private instruction and data memories, connected to a global memory over a TDMA shared bus.]
Timing Analysis Computing the execution time: • Overlapping of thread execution time from parallelism and inter-core synchronisations. • Scheduling overheads. • Variable delay in accessing the shared bus.
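The bus delay is statically computable because the TDMA schedule is fixed: a request waits until the requesting core’s next slot. A hedged sketch of that computation (the slot length, the round-robin slot order, and the rule that an access must start at the beginning of the core’s slot are all assumptions):

int tdma_wait(int t, int core, int num_cores) {
  const int SLOT = 10;                  // illustrative slot length (cycles)
  int period = SLOT * num_cores;        // one full TDMA round
  int slot_start = core * SLOT;         // this core’s offset in the round
  int pos = t % period;                 // where time t falls in the round
  return (slot_start - pos + period) % period;  // 0 .. period-1 cycles of waiting
}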
Timing Analysis • Overlapping of thread execution time from parallelism and inter-core synchronisations. • An integer counter tracks each core’s execution time. • Synchronisation occurs when forking/joining and when ending the global tick. • Synchronisation advances the execution time of all participating cores. [Figure: threads main, f1, and f2 allocated across Core 1 and Core 2, with counters advancing together at each synchronisation point.]
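A hedged sketch of how the analysis might advance those counters at a synchronisation point (names illustrative): no participating core can proceed before the slowest one arrives, so all counters advance to the maximum.

#define NUM_CORES 4         // illustrative core count

int core_time[NUM_CORES];   // per-core execution-time counters (cycles)

void synchronise(const int participants[], int n, int sync_cost) {
  int latest = 0;
  for (int i = 0; i < n; i++)
    if (core_time[participants[i]] > latest)
      latest = core_time[participants[i]];     // slowest participant
  for (int i = 0; i < n; i++)
    core_time[participants[i]] = latest + sync_cost;
}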
Timing Analysis • Scheduling overheads: • Synchronisation: fork/join and global tick, via global memory. • Thread context-switching. • Copying of shared variables at the start of the thread’s local tick, via global memory. [Figure: synchronisation, thread context-switch, and global-tick overheads interleaved with main, f1, and f2 on Core 1 and Core 2.]
Timing Analysis • Scheduling overheads: • Required scheduling routines are statically known. • Analyse the scheduling control-flow. • Compute the execution time of each scheduling overhead. [Figure: the scheduling control-flow for main, f1, and f2 on Core 1 and Core 2.]