Time-Predictable Execution of Embedded Software on Multi-core Platforms

Time-Predictable Execution of Embedded Software on Multi-core Platforms, by Sudipta Chattopadhyay, under the guidance of A/P Abhik Roychoudhury

Embedded Systems

Real-time Constraints Hard real-time Embedded system Soft real-time

Timing Analysis
• Hard real-time systems require absolute timing guarantees
• System-level analysis
• Single-task analysis
• Worst-case execution time (WCET) analysis
• An upper bound on execution time for all possible inputs
• A sound over-approximation is obtained by static analysis

WCET Analysis WCET of basic blocks Infeasible path constraints Program WCET bound Micro-architectural modeling Loop bound Control flow graph constraints Path analysis

Architecture Core 1 Core n L1 cache L1 cache Shared bus Resource sharing Shared L2 cache Memory

Overview Instr. accesses Data accesses Shared cache + shared bus A multi-core WCET tool Shared cache Core 1 Core n L1 instruction cache L1 data cache Unified cache Processor L1 cache L1 cache L2 unified cache Dissertation work (Time-predictable execution in multi-core) Shared bus Resource sharing Bus Shared L2 cache Conflicts with different instruction and data memory blocks Main Memory Cache related preemption delay analysis Shared scratchpad allocation Coherence miss modeling Memory

Micro-architectural Modeling branch predictor shared cache cache pipeline shared bus Single Core Multi Core

Comparison

Imprecision in Abstract Interpretation p1 p2 young a b young b x Cache state = C2 Cache state = C1 Abstract cache set Abstract cache set Joined Cache state = C3 Joined cache state b Path p1 or path p2? Joined cache state loses information about path p1 and p2
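The information loss at the join can be sketched in a few lines. This is a minimal illustration, assuming a "must" analysis in which each abstract cache set maps a memory block to an upper bound on its LRU age, and assuming that path p1 leaves the set as (a, b) and path p2 leaves it as (b, x), youngest first; the function name `must_join` is hypothetical.

```python
def must_join(state_a, state_b):
    """Join two abstract 'must' cache sets (block -> upper bound on LRU age).

    A block survives the join only if it is guaranteed to be cached on
    both incoming paths; its age becomes the larger (older) of the two.
    """
    joined = {}
    for block in state_a.keys() & state_b.keys():
        joined[block] = max(state_a[block], state_b[block])
    return joined


c1 = {"a": 0, "b": 1}  # assumed state after path p1 (a youngest)
c2 = {"b": 0, "x": 1}  # assumed state after path p2 (b youngest)
c3 = must_join(c1, c2)
print(c3)  # {'b': 1}
```

Only b survives in the joined state, and with its older age, so the analysis can no longer tell whether a or x is cached: the path that was taken has been forgotten.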

Model Checking alone?
• A path-sensitive search
• Path-sensitive search is expensive – path explosion
• Worse, combined with possible cache states
[Figure: paths p1 and p2, leading to cache states C1 and C2]

Model Checking alone?
• A path-sensitive search
• Path-sensitive search is expensive – path explosion
• Worse, combined with possible cache states
[Figure: paths p1 and p2, each carrying its own abstract LRU cache set over blocks a, b and x, ordered from young to old; keeping one cache set per path leads to state explosion]

Cache analysis WCET of basic blocks All checked Cache analysis by abstract interpretation Pipeline analysis Analysis outcome Infeasible path constraints IPET Program Refine by model checker Branch predictor modeling Loop bound Timeout Micro architectural modeling constraints Refinement by model checker can be terminated at any point Model checker refinement steps are inherently parallel Path analysis Each model checker refinement step checks light assertion property

Refinement (Inter-core) m start Conflicting task Task x < y m1 m1 Infeasible x == y m2 m2 young ≠m m ≠m m exit cache Cache hit Cache miss Spurious

Refinement (Inter-core) start m Conflicting task Task x < y C_m++ m1 Increment conflict m1 Verified Infeasible x == y m2 C_m++ m2 Increment conflict young m m m exit cache assert (C_m <= 1) A Cache Hit

Refinement (Why it works?) m x < y Increment conflict C_m++ m’ Conflict to m m’ Path 2 x == y m Does not affect the value of C_m assert (C_m <= 0) m Cache miss Property
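The instrumented property can be mimicked with a small sketch. The tool chain in the slides uses a model checker on the instrumented program; here an exhaustive enumeration over a small input domain stands in for it, which is an assumption made purely for illustration. The function names and the input domain are hypothetical; the two branch conditions and the counter C_m follow the slide.

```python
from itertools import product


def conflicts_before_reuse(x, y):
    """Count conflicting accesses between two uses of block m (counter C_m)."""
    c_m = 0
    if x < y:
        c_m += 1  # conflicting block m1 is accessed
    if x == y:
        c_m += 1  # conflicting block m2 is accessed
    return c_m


def assertion_holds(bound, domain=range(-4, 5)):
    """Check assert(C_m <= bound) on every input pair in the small domain."""
    return all(conflicts_before_reuse(x, y) <= bound
               for x, y in product(domain, repeat=2))


print(assertion_holds(1))  # True
print(assertion_holds(0))  # False
```

Because x < y and x == y cannot hold together, no feasible path increments C_m twice, so assert (C_m <= 1) is verified. The stricter assert (C_m <= 0) fails, since a single conflict is feasible.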

Experimental Setup (Chronos Toolkit) GCC simplescalar C source Binary code CFG Micro architectural modeling Flow constraints cache pipeline Branch prediction ILP WCET CBMC Micro-architectural constraints C bounded model checking

Experimental Result

Experimental Result (WCET)
L1 cache (per core): direct-mapped, 256 bytes
Shared L2 cache: 4-way associative, 8 KB
Average time = 70 secs

Extension Using Symbolic Execution unknown x < y Conflicting task x < y x ≥ y x < y C_m++ x = y x = y m1 Increment conflict m1 NO x == y m2 constraint solver C_m++ Increment conflict m2 x < y ∧ x = y satisfied assert (C_m <= 1) assert (C_m <= 1) abort

Extension Using KLEE GCC simplescalar C source Binary code CFG Micro architectural modeling Flow constraints cache pipeline Branch prediction ILP WCET CBMC/KLEE Micro-architectural constraints

A Generic Framework
• Three different architectural/application settings
• Intra task (WCET in single core)
• Inter task (Cache Related Preemption Delay analysis)
• Inter core (WCET in multi-core)
[Figure: cache conflicts within one task; between a high-priority and a low-priority task sharing a cache; and between a task in Core 1 and a task in Core 2, each with a private L1 cache, in the shared L2 cache]

Micro-architectural Modeling branch predictor shared cache cache pipeline shared bus Single Core Multi Core

Task-level interference T1 T3 Tasks T2 T2 Core 1 Core n T1 L1 cache L1 cache Shared bus T2 Timeline Shared L2 cache T3 T1 T3 Task interference graph

Shared Cache + TDMA Shared Bus Task graphs Time Division Multiple Access (TDMA) T1 T3 T1 T3 Core 1 Core 2 Core 1 slot T2 T4 L1 cache L1 cache Shared bus Core 2 slot Bus access L2 miss due to T2 T4 Shared L2 cache T2 Disjoint lifetime Core 1 slot WAIT Bus access T4 T1 T2 Core 2 slot T3 T4

Overview of the framework L1 cache analysis L1 cache analysis Task interference monotonically decreases Filter Filter L2 cache analysis L2 cache analysis WCRT computation Bus aware analysis L2 conflict analysis Initial interference Yes Interference changes ? Estimated WCRT No
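The iterative structure of the framework can be sketched as a fixed-point loop. This is a heavily simplified illustration, not the analysis itself: the L2 conflict and bus-aware analyses are replaced by a hypothetical fixed `penalty` per interfering task, same-core scheduling is ignored, and the task set, field names and numbers are invented. Interference starts as every cross-core pair not ordered by the task graph, and pairs with disjoint lifetimes are filtered out until nothing changes.

```python
def estimate_wcrt(tasks, penalty):
    """tasks: name -> dict(core, bcet, wcet, preds), in topological order."""
    names = list(tasks)
    ancestors = {}
    for t in names:
        ancestors[t] = set(tasks[t]["preds"])
        for p in tasks[t]["preds"]:
            ancestors[t] |= ancestors[p]

    def related(t, u):
        # Tasks ordered by the task graph can never run concurrently.
        return t in ancestors[u] or u in ancestors[t]

    # Initial interference: every unrelated task on a different core.
    interference = {
        t: {u for u in names
            if tasks[u]["core"] != tasks[t]["core"] and not related(t, u)}
        for t in names
    }
    while True:
        earliest_start, latest_finish = {}, {}
        for t in names:
            preds = tasks[t]["preds"]
            earliest_start[t] = max(
                (earliest_start[p] + tasks[p]["bcet"] for p in preds), default=0)
            latest_start = max((latest_finish[p] for p in preds), default=0)
            latest_finish[t] = (latest_start + tasks[t]["wcet"]
                                + penalty * len(interference[t]))
        # Filter: keep only pairs whose lifetimes may still overlap.
        filtered = {
            t: {u for u in interference[t]
                if earliest_start[t] < latest_finish[u]
                and earliest_start[u] < latest_finish[t]}
            for t in names
        }
        if filtered == interference:  # interference no longer changes
            return max(latest_finish.values()), interference
        interference = filtered


tasks = {
    "T1": {"core": 0, "bcet": 10, "wcet": 20, "preds": []},
    "T2": {"core": 1, "bcet": 10, "wcet": 20, "preds": ["T1"]},
    "T3": {"core": 2, "bcet": 3, "wcet": 5, "preds": []},
    "T4": {"core": 0, "bcet": 10, "wcet": 15, "preds": ["T2"]},
}
wcrt, interference = estimate_wcrt(tasks, penalty=5)
print(wcrt, interference["T4"])  # 65 set()
```

In this invented example the first pass finds that T3 must finish before T4 can start, so that pair is dropped and the estimate tightens from 70 to 65. Since pairs are only ever removed, the loop terminates.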

Evaluation (2-core) One core runs statemate; the other core runs the program under evaluation

Evaluation (4-core) The four cores run either (edn, adpcm, compress, statemate) or (matmult, fir, jfdctint, statemate), one program per core

Micro-architectural Modeling branch predictor shared cache Interactions cache pipeline shared bus Single Core Multi Core

Timing Anomaly (shared Cache) hit miss miss miss hit hit miss hit miss hit miss hit miss hit miss hit May not be the worst case path

Baseline Abstraction – Timing Interval
• Representing each pipeline stage as a timing interval
• End = Start + cache miss latency interval (example: start [1,3], latency [3,7], finish [4,10])
[Figure: pipeline stages IF, ID, EX, WB, CM for the instructions R1 := R2 + 5, R5 := R1 * R7 and R3 := R5 * 5, with structural dependency and contention edges between stages]
A fixed-point analysis derives the timing of each stage as an interval
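The interval arithmetic behind "End = Start + latency" can be written out as a short sketch. The `Interval` class is hypothetical, and the reading of the slide's numbers as start [1,3], latency [3,7] and finish [4,10] is an assumption, chosen because it is the only assignment consistent with the equation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: int  # earliest possible time (or shortest latency)
    hi: int  # latest possible time (or longest latency)

    def __add__(self, other):
        # Sum of two intervals: add lower bounds and upper bounds separately.
        return Interval(self.lo + other.lo, self.hi + other.hi)


start = Interval(1, 3)
latency = Interval(3, 7)  # e.g. cache hit vs. cache miss latency
finish = start + latency
print(finish)  # Interval(lo=4, hi=10)
```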

TDMA Shared Bus Analysis
• Time Division Multiple Access (TDMA)
• Offset abstraction
[Figure: TDMA rounds of alternating Core 0 and Core 1 slots; requests T' (core 0) and T (core 1) are shown with their offsets within the round and the resulting bus delay (delay = 0 in one case)]
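How an offset within the TDMA round translates into a bus delay can be sketched as follows. This assumes equal-length slots assigned to cores in index order and an access that must complete inside the issuing core's own slot; the function name and parameters are hypothetical and not taken from the slides.

```python
def tdma_delay(offset, core, n_cores, slot_len, access_len):
    """Waiting time before `core` can start a bus access issued at `offset`.

    `offset` is the request time measured from the start of a TDMA round;
    core i is assumed to own the slot [i * slot_len, (i + 1) * slot_len).
    """
    if access_len > slot_len:
        raise ValueError("an access must fit inside one slot")
    round_len = n_cores * slot_len
    offset %= round_len
    slot_start = core * slot_len
    slot_end = slot_start + slot_len
    if slot_start <= offset and offset + access_len <= slot_end:
        return 0  # the access fits in the remainder of the core's own slot
    # Otherwise wait for the next start of the core's slot (possibly next round).
    return (slot_start - offset) % round_len


print(tdma_delay(offset=3, core=0, n_cores=2, slot_len=10, access_len=4))  # 0
print(tdma_delay(offset=3, core=1, n_cores=2, slot_len=10, access_len=4))  # 7
print(tdma_delay(offset=8, core=0, n_cores=2, slot_len=10, access_len=4))  # 12
```

Because the delay depends only on the offset within the round, an analysis can track offsets rather than absolute times.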

Loop Construct EX WB previous iteration IF ID CM CM IF ID EX WB EX WB CM current iteration IF ID IF ID EX WB CM How do we define bus context? Property: If the bus offsets of the cross-iteration edges do not change, WCET of the loop iteration cannot change

Loop Construct
Ci = bus context of the loop body at the i-th iteration
[Figure: bus context flow graph over contexts C1 to C5]
Property: If Ci = Cj, then Ci+k = Cj+k for any k > 0
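The property implies that only finitely many bus contexts need to be analysed: once a context repeats, the sequence cycles. A minimal sketch of building the context sequence follows, assuming the context of the next iteration is a deterministic function of the current one. The transition table, including the edge from C5 back to C3, is a hypothetical example, as are the function and parameter names.

```python
def explore_contexts(first, step, loop_bound):
    """Follow bus contexts iteration by iteration until one repeats.

    Returns the distinct contexts in order of first appearance, and the
    context that the next iteration would re-enter (None if the loop
    bound is reached before any repetition).
    """
    order = []
    ctx = first
    while len(order) < loop_bound and ctx not in order:
        order.append(ctx)
        ctx = step(ctx)
    repeats_at = ctx if ctx in order else None
    return order, repeats_at


# Hypothetical transitions between bus contexts of consecutive iterations.
transitions = {"C1": "C2", "C2": "C3", "C3": "C4", "C4": "C5", "C5": "C3"}
order, repeats_at = explore_contexts("C1", transitions.__getitem__, loop_bound=100)
print(order, repeats_at)  # ['C1', 'C2', 'C3', 'C4', 'C5'] C3
```

With 100 iterations, only five contexts have to be analysed here; every later iteration reuses the result of C3, C4 or C5.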

Loop Construct
Compute WCET for each bus context
E(C1) = number of times context C1 is executed
Generate linear constraints:
E(C1) + E(C2) + E(C3) + E(C4) ≤ loop bound
E(C1) ≥ E(C2)
ILP = Integer Linear Programming
[Figure: the bus context flow graph (C1, C2, C3, C4) and the program loop bound feed constraints to the ILP solver within the WCET analysis flow: micro-architectural modeling, WCET of basic blocks, infeasible path constraints, loop bound, control flow graph constraints, path analysis]
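The effect of these constraints can be illustrated on a toy instance. A real tool hands the constraints to an ILP solver together with the other flow constraints; the sketch below instead enumerates all execution counts exhaustively, which is only feasible for tiny loop bounds. The per-context WCET values, the loop bound and the function name are hypothetical.

```python
from itertools import product


def max_loop_wcet(wcet, loop_bound):
    """Maximise sum(wcet[i] * E[i]) subject to the two slide constraints:
    E(C1) + ... + E(Cn) <= loop bound, and E(C1) >= E(C2)."""
    best_cost, best_counts = -1, None
    for counts in product(range(loop_bound + 1), repeat=len(wcet)):
        if sum(counts) > loop_bound:
            continue  # violates the loop bound constraint
        if counts[0] < counts[1]:
            continue  # violates E(C1) >= E(C2)
        cost = sum(w * e for w, e in zip(wcet, counts))
        if cost > best_cost:
            best_cost, best_counts = cost, counts
    return best_cost, best_counts


# Hypothetical WCETs (in cycles) of one loop iteration under C1..C4.
cost, counts = max_loop_wcet([40, 70, 50, 45], loop_bound=6)
print(cost, counts)  # 330 (3, 3, 0, 0)
```

Without E(C1) ≥ E(C2), the bound would charge all six iterations to the most expensive context C2 (420 cycles); the flow constraint rules that out and tightens the estimate.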

Branch prediction + Cache Cache conflict Cache content m Branch location JOIN m Maximum number of speculated instructions m’ Cache content Unclear cache access

Experimental Setup (Chronos Toolkit) GCC simplescalar C source Binary code CFG Micro architectural modeling Flow constraints Private cache pipeline Branch prediction ILP WCET Shared cache Shared bus Micro-architectural constraints

Evaluation (cache + pipeline) Core 1 Imprecision of shared cache analysis Core 1 Core 2 Core 2 Horizontally partition Vertically partition jfdctint statemate

Evaluation (Cache + pipeline + Speculation) Imprecision of modeling speculation

Evaluation (Bus + pipeline) Imprecision of shared bus analysis Imprecision of path analysis

Recap PE-0 PE-1 PE-N …… c Shared cache + shared bus A multi-core WCET tool Shared cache Low priority task High priority task Task Core 1 Cache conflict Core n Unified cache SPM-0 SPM-1 SPM-N Core 1 Core n L1 data cache L1 data cache Fast on-chip communication media Coherence miss traffic Dissertation work (Time-predictable execution in multi-core) External Memory Interface Stale data items Shared bus L1 cache L1 cache Shared L2 cache Shared off-chip data bus Cache related preemption delay analysis Shared L2 cache Shared scratchpad allocation Coherence miss modeling Off-chip memory Memory

Perspective Time-predictable execution in single-core Resource sharing (cache and bus) Data sharing (cache coherence) Time-predictable execution in multi-core Testing Static analysis Customized hardware Shared cache Shared bus Cache coherence Shared scratchpad ARM Cortex A9 MPCore Samsung Exynos Nvidia Tegra II (smart phones) Time Division Multiple Access Aethereal Network-on-chip Sony PSP IBM Cell

Perspective Functionality Verification Quantitative Verification Concrete domain Concrete domain Abstract domain in abstract Interpretation (AI) Abstraction Anytime Verification of Quantitative properties SLAM (Microsoft) BLAST (UC Berkeley) Property AI May be spurious MAGIC (CMU) Verifier Spurious counter example Generate Quantitative property Refinement Path-sensitive Verification Verified Abstraction refinement

Future Work Static performance analysis + testing Symbolic Execution x < y Performance testing x < y x ≥ y x < y x = y x = y Mobile devices m1 x == y Energy analysis of software x < y ∧ x ≠ y m2 Input abort Battery life Energy-aware software testing (Quantitative property e.g. cache conflict) assert (C_m <= 1)

Thank You My sincere thanks to all the Examiners and especially the anonymous Examiner 1 for his comment on symbolic execution
