Programming and Timing Analysis of Parallel Programs on Multicores. Eugene Yip, Partha Roop, Morteza Biglari-Abhari, Alain Girault. ACSD 2013.
Introduction • Safety-critical systems: • Perform specific real-time tasks. • Strict safety standards (IEC 61508, DO-178). • Time-predictability useful in real-time designs. • Shift towards multicore designs. [Figure labels: Embedded Systems; Safety-critical concerns.] [Paolieri et al 2011] Towards Functional-Safe Timing-Dependable Real-Time Architectures. [Pellizzoni et al 2009] Handling Mixed-Criticality in SoC-Based Real-Time Embedded Systems. [Cullmann et al 2010] Predictability Considerations in the Design of Multi-Core Embedded Systems.
Introduction • Designing safety-critical systems: • Certified Real-Time Operating Systems (RTOS) • E.g., VxWorks, LynxOS, and SafeRTOS. • Programmer manages shared variables. • Hard to verify timing. [VxWorks] http://www.windriver.com/products/vxworks/ [LynxOS] http://www.lynuxworks.com/rtos/rtos-178.php [SafeRTOS] http://www.freertos.org/FreeRTOS-Plus/Safety_Critical_Certified/SafeRTOS.shtml [Sandell et al 2006] Static Timing Analysis of Real-Time Operating System Code
Introduction • Designing safety-critical systems: • Synchronous Languages • E.g., Esterel, Esterel C Language (ECL), and PRET-C. • Deterministic concurrency (Synchrony hypothesis). • Difficult to distribute: Instantaneous communication or sequential semantics. [Benveniste et al 2003] The Synchronous Languages 12 Years Later. [Lavagno et al 1999] ECL: A Specification Environment for System-Level Design. [Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs. [Girault 2005] A Survey of Automatic Distribution Method for Synchronous Programs
Research Objective • To design a C-based, parallel programming language that: • has deterministic execution behaviour, • can take advantage of multicore execution, and • is amenable to static timing analysis.
Outline • Introduction • ForeC Language • Timing Analysis • Results • Conclusions
ForeC (Foresee) Language • C-based, multi-threaded, synchronous language. Inspired by Esterel and PRET-C. • Minimal set of synchronous constructs. • Fork/join parallelism and shared memory thread communication. • Structured preemption.
Execution Example

    shared int sum = 1 combine with plus;   /* shared variable and its combine function */

    int plus(int copy1, int copy2) {
      return (copy1 + copy2);
    }

    void main(void) {
      par(f(1), f(2));   /* fork-join: blocking statement, arbitrary thread execution order */
    }

    void f(int i) {
      sum = sum + i;
      pause;             /* global synchronisation barrier */
      ...
    }
Execution Example (one global tick) • Global tick start: threads get a conceptual copy of the shared variables at the start of every global tick. • During the tick: threads modify their own copy. • Global tick end: the modified copies are combined and assigned to the actual shared variables. The combine function is defined by the programmer and must be commutative and associative. • The next global tick then starts.
Execution Example • Modifications are isolated. • Interleaving does not matter. • Locks and critical sections are not needed. • But the programmer has to specify the combine function and the placement of pauses.
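The copy-then-combine behaviour of a global tick can be sketched in plain C. This is only an illustration, not ForeC compiler output: `simulate_tick` and its parameters are hypothetical names, and the sketch assumes that the combine function is folded directly over the threads' modified copies at the end of the tick.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS 16

/* Combine function from the example: commutative and associative. */
int plus(int copy1, int copy2) {
    return copy1 + copy2;
}

/*
 * Hypothetical helper: simulates one global tick in which thread t
 * executes "sum = sum + increments[t]".
 * Assumption: at the end of the tick the combine function is folded over
 * the modified copies and the result is assigned to the shared variable.
 */
int simulate_tick(int shared_sum, const int *increments, size_t n_threads) {
    assert(n_threads > 0 && n_threads <= MAX_THREADS);
    int copies[MAX_THREADS];

    /* Global tick start: every thread gets its own copy. */
    for (size_t t = 0; t < n_threads; t++) {
        copies[t] = shared_sum;
    }

    /* During the tick: each thread modifies only its own copy. */
    for (size_t t = 0; t < n_threads; t++) {
        copies[t] = copies[t] + increments[t];
    }

    /* Global tick end: combine the copies into the shared value. */
    int combined = copies[0];
    for (size_t t = 1; t < n_threads; t++) {
        combined = plus(combined, copies[t]);
    }
    return combined;
}
```

Under that assumption, starting from sum = 1 with threads f(1) and f(2), the copies become 2 and 3 and the sketch returns 5, whichever thread is listed first.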
ForeC (Foresee) Language
Preemption construct:

    int x = 1;
    abort {
      x = 2;
      pause;
      x = 3;
    } when (x > 0);
    ...

• Initialise variable x. • Abort body starts executing. • Check the abort condition. • The abort body is preempted. • Execution continues.
ForeC (Foresee) Language Preemption construct: [weak] abort { st } when [immediate] (cond) • immediate: The abort condition is checked when execution first reaches the abort. • weak: Lets the abort body execute one last time before it is preempted.
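The effect of the two modifiers on the earlier example (x = 1; abort { x = 2; pause; x = 3; } when (x > 0)) can be sketched as a tick-by-tick simulation in plain C. The function `run_abort_example` is hypothetical. The sketch assumes that, without immediate, the condition is first checked at the start of the tick after the pause, and that weak lets the body run up to its next pause (or its end) in the tick where the condition holds.

```c
#include <stdbool.h>

/*
 * Hypothetical simulation (not ForeC compiler output) of:
 *   int x = 1;
 *   [weak] abort { x = 2; pause; x = 3; } when [immediate] (x > 0);
 * Returns the value of x once the abort statement has terminated.
 */
int run_abort_example(bool weak, bool immediate) {
    int x = 1;

    /* Tick 1: execution reaches the abort. */
    if (immediate && x > 0) {
        if (weak) {
            x = 2;      /* body runs one last time, up to the pause */
        }
        return x;       /* body is preempted */
    }
    x = 2;              /* body runs up to the pause */

    /* Tick 2: the condition is checked before the body resumes. */
    if (x > 0) {
        if (weak) {
            x = 3;      /* body runs one last time */
        }
        return x;       /* body is preempted */
    }

    x = 3;              /* condition false: body resumes normally */
    return x;
}
```

Under these assumptions the plain abort leaves x at 2, because x = 3 is never reached, while the weak variant lets x = 3 execute before the preemption takes effect.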
ForeC (Foresee) Language Variable type-qualifiers: input and output • Declare a variable whose value is updated from (input) or emitted to (output) the environment at each global tick. E.g., input int x;
Scheduling Light-weight static scheduling: • Take advantage of multicore performance while delivering time-predictability (eases static timing analysis). • Thread allocation and scheduling order on each core decided at compile time by the programmer. • Cooperative (non-preemptive) scheduling. • Fork/join semantics and the notion of a global tick are preserved via synchronisation.
Scheduling Light-weight static scheduling: • One core to perform housekeeping tasks at the end of the global tick. • Combining of shared variables. • Emitting outputs and sampling inputs. • Starting the next global tick.
Outline • Introduction • ForeC Language • Timing Analysis • Results • Conclusions
Timing Analysis • Compute the program’s Worst-Case Reaction Time (WCRT): WCRT = max(Reaction times). • Must validate: WCRT ≤ Maximum time allowed (design specification). [Figure: reaction times plotted against physical time (1s to 4s), with the maximum time allowed marked.] [Boldt et al 2008] Worst Case Reaction Time Analysis of Concurrent Reactive Programs.
Timing Analysis Construct a Concurrent CFG (CCFG) of the executable binary. shared int sum = 1 combine with plus; int plus(int copy1, int copy2) { return (copy1 + copy2); } void main(void) { par(f(1), f(2)); } void f(int i) { sum = sum + i; pause; ... }
Timing Analysis • One existing approach for multicores: [Ju et al 2010] Timing Analysis of Esterel Programs on General-Purpose Multiprocessors. • Uses ILP, which is NP-complete; gives no tightness result; analysis results are only for a 4-core processor. • Existing approaches for single-core: • Integer Linear Programming (ILP) • Model Checking/Reachability • Max-Plus [Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs. [Boldt et al 2008] Worst Case Reaction Time Analysis of Concurrent Reactive Programs.
Reachability • Traverse the CCFG to find all possible global ticks. • State-space explosion. • Precision vs. analysis time. • RT1 = reaction time of g1; WCRT = max(RT1, …, RT4c). [Figure: reachable global ticks g1, g2, g3a, g3b, g4a, g4b, g4c, labelled with their reaction times RT1, RT2, RT3a, RT3b, RT4a, RT4b, RT4c.]
Reachability • Identify the path leading to the WCRT. • Good for understanding the timing behaviour. • In the example, WCRT = RT4b. [Figure: reachable global ticks g1 to g4c with their reaction times.]
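A minimal C sketch of the idea: traverse the reachable global ticks, keep the largest reaction time, and remember which tick exhibits it. The `struct tick` layout, the `find_wcrt` name and any reaction-time values are hypothetical. The sketch also assumes the reachable ticks form a finite tree, so no visited-state check is needed; a real analysis would have to detect revisited states.

```c
#include <stddef.h>

#define MAX_SUCC 3

/* One reachable global tick: its reaction time and its successor ticks. */
struct tick {
    const char *name;
    unsigned reaction_time;             /* e.g. in clock cycles */
    const struct tick *next[MAX_SUCC];  /* unused entries are NULL */
};

/*
 * Depth-first traversal of the ticks reachable from t.
 * Returns the largest reaction time found and stores the tick that
 * exhibits it in *worst.
 */
unsigned find_wcrt(const struct tick *t, const struct tick **worst) {
    unsigned wcrt = t->reaction_time;
    *worst = t;

    for (size_t i = 0; i < MAX_SUCC && t->next[i] != NULL; i++) {
        const struct tick *sub_worst = NULL;
        unsigned sub = find_wcrt(t->next[i], &sub_worst);
        if (sub > wcrt) {
            wcrt = sub;
            *worst = sub_worst;
        }
    }
    return wcrt;
}
```

Storing a parent pointer in each tick would let the caller walk back from the worst tick to g1 and so recover the whole path leading to the WCRT.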
Max-Plus • Makes the safe assumption that the program’s WCRT occurs when all threads execute their longest reaction together. • Compute the WCRT of each thread separately. • Compute the program’s WCRT by using WCRT of the threads. • Fast analysis time but over-estimation could be large.
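One way to read the Max-Plus steps for statically allocated threads is sketched below in plain C. The function `maxplus_wcrt` and its parameters are hypothetical. The sketch assumes that threads allocated to the same core run back to back ("plus"), that cores run in parallel until the end of the global tick ("max"), and that scheduling and bus overheads are ignored.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CORES 8

/*
 * Hypothetical Max-Plus style bound.
 * thread_wcrt[i]: worst-case reaction time of thread i, computed separately.
 * thread_core[i]: core that thread i is statically allocated to.
 * Assumption: overheads (synchronisation, context switches, bus delay)
 * are left out of this sketch.
 */
unsigned maxplus_wcrt(const unsigned *thread_wcrt, const size_t *thread_core,
                      size_t n_threads, size_t n_cores) {
    assert(n_cores > 0 && n_cores <= MAX_CORES);
    unsigned core_time[MAX_CORES] = {0};

    /* "Plus": threads on the same core execute one after another. */
    for (size_t i = 0; i < n_threads; i++) {
        assert(thread_core[i] < n_cores);
        core_time[thread_core[i]] += thread_wcrt[i];
    }

    /* "Max": the tick ends when the slowest core has finished. */
    unsigned wcrt = 0;
    for (size_t c = 0; c < n_cores; c++) {
        if (core_time[c] > wcrt) {
            wcrt = core_time[c];
        }
    }
    return wcrt;
}
```

Because every thread contributes its longest reaction in the same tick, the bound is safe but can be well above any reaction time the program can actually exhibit.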
Timing Analysis • Propose the use of Reachability for multicore analysis: • Trade off analysis time for higher precision. • Analyse inter-core synchronisations in detail. • Handle state-space explosion by reducing the program’s CCFG before reachability analysis. [Figure: analysis flow with the program binary (annotated), the program’s reduced CCFG, "compute each global tick", and the WCRT.]
Timing Analysis CCFG optimisations: • merge: Reduces the number of CFG nodes that need to be traversed. • merge-b: Reduces the number of alternate paths in the CFG. (Reduces the number of global ticks.) [Figure: illustrations of merge and merge-b.]
Timing Analysis • Computing each global tick: • Parallel thread execution and inter-core synchronisations. • Scheduling overheads. • Variable delay in accessing the shared bus.
Timing Analysis • Parallel thread execution and inter-core synchronisations. • An integer counter to track each core’s execution time. • Static scheduling allows us to determine the thread execution order on each core. • Synchronisation at fork/join, and end of the global tick. [Figure: threads main, f1 and f2 allocated to Core 1 and Core 2, with the thread execution order on each core.]
Timing Analysis • Scheduling overheads. • Synchronisation: Fork/join and global tick. • Via global memory. • Thread context-switching. • Copying of shared variables at the start of the thread’s local tick via global memory. [Figure: main, f1 and f2 on Core 1 and Core 2, with synchronisation, thread context-switch and global tick overheads marked.]
Timing Analysis • Scheduling overheads. • Required scheduling routines statically known. • Analyse the control-flow of the routines. • Compute the execution time for each scheduling overhead. [Figure: main, f1 and f2 on Core 1 and Core 2, shown twice.]
Timing Analysis • Variable delay in accessing the shared bus. • Global memory accessed by scheduling routines. • TDMA bus delay has to be considered. [Figure: execution of main, f1 and f2 on Core 1 and Core 2, overlaid with the alternating TDMA bus slots of core 1 and core 2.]
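The variable bus delay can be illustrated with a small C sketch that computes how long a core waits for its TDMA slot. The helper `tdma_wait` is hypothetical. The sketch assumes that slots are assigned round-robin in core order starting at cycle 0, and that an access may only start on the first cycle of the core's own slot. Neither assumption is stated in the slides.

```c
#include <assert.h>

/*
 * Hypothetical helper: number of cycles a core waits, starting at cycle
 * "now", before its global memory access can begin.
 * Assumptions: round-robin slots in core order from cycle 0; an access
 * starts only at the first cycle of the core's own slot.
 */
unsigned tdma_wait(unsigned now, unsigned core, unsigned n_cores,
                   unsigned slot_len) {
    assert(n_cores > 0 && slot_len > 0 && core < n_cores);

    unsigned round = slot_len * n_cores;    /* length of one bus schedule round */
    unsigned slot_start = core * slot_len;  /* offset of this core's slot */
    unsigned pos = now % round;             /* position in the current round */

    /* Distance from the current position to the next start of the slot. */
    return (slot_start + round - pos) % round;
}
```

With two cores and 5-cycle slots (a 10-cycle round), core 0 waits 0 cycles at cycle 0 but 9 cycles at cycle 1. The delay therefore depends on when in the round the access is issued, which is why the analysis has to track each core's execution time.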
Outline • Introduction • ForeC Language • Timing Analysis • Results • Conclusions
Results • For the proposed reachability-based timing analysis, we demonstrate: • the precision of the computed WCRT. • the efficiency of the analysis, in terms of analysis time.
Results • Timing analysis tool: [Figure: tool flow with the program binary (annotated), the program CCFG (optimisations), the proposed Reachability and Max-Plus analyses, and the WCRT.]
Results • Multicore simulator (Xilinx MicroBlaze): • Based on http://www.jwhitham.org/c/smmu.html and extended to be cycle-accurate and support multiple cores and a TDMA bus. • TDMA shared bus: 5 cycles/core (bus schedule round = 5 × no. of cores). [Figure: Core 0 to Core n, each with instruction memory and data memory (labels: 16KB, 1 cycle), connected by the TDMA shared bus to a 32KB data memory (label: 5 cycles).]
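Results • Benchmark programs: mix of control/data computations, thread structure and computation load. [Table: benchmark programs; entries marked * are from [Pop et al 2011], entries marked # are from [Nemer et al 2006].] [Pop et al 2011] A Stream-Computing Extension to OpenMP. [Nemer et al 2006] A Free Real-Time Benchmark.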
Results • Each benchmark program was distributed over 1 to n cores, where n = maximum number of parallel threads. • Observed the WCRT: • Input vectors to elicit the worst-case execution path identified by Reachability analysis. • Computed the WCRT: • Reachability • Max-Plus
802.11a Results Observed: • WCRT decreases until 5 cores. • TDMA Bus is a bottleneck: Global memory becomes more expensive. • Synchronisation overheads.
802.11a Results Reachability: • ~2% over-estimation. • Benefit of explicit path exploration.
802.11a Results Max-Plus: • Assumes one global tick where all threads execute their worst-case. • Loss of thread execution context: Max execution time of the scheduling routines.
802.11a Results Both approaches: • Estimation of synchronisation cost is conservative. Assumed that the receive only starts after the last sender.
802.11a Results • Max-Plus takes less than 2 seconds. [Figure: analysis time plot, labelled Reachability.]
802.11a Results • merge: Reduction of ~9.34x. [Figure: analysis time plot, labelled Reachability.]