Time-Predictable Execution of Embedded Software on Multi-core Platforms. Sudipta Chattopadhyay under the guidance of A/P Abhik Roychoudhury. Embedded Systems. Real-time Constraints. Hard real-time. Embedded system. Soft real-time. Timing Analysis .
Time-Predictable Execution of Embedded Software on Multi-core Platforms. Sudipta Chattopadhyay, under the guidance of A/P Abhik Roychoudhury
Real-time Constraints [Figure: an embedded system under hard real-time or soft real-time constraints]
Timing Analysis • Hard real-time systems require absolute timing guarantees • System-level analysis • Single-task analysis • Worst-case execution time (WCET) analysis • An upper bound on execution time for all possible inputs • A sound over-approximation is obtained by static analysis
WCET Analysis [Figure: analysis flow with the labels Program, Control flow graph, Micro-architectural modeling, WCET of basic blocks, Loop bound, Infeasible path constraints, Constraints, Path analysis, and WCET bound]
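Architecture [Figure: Core 1 … Core n, each with a private L1 cache, connected through a shared bus to a shared L2 cache and memory; the shared bus and the shared L2 cache are marked as resource sharing]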
Overview [Figure: dissertation work (time-predictable execution in multi-core). Components shown: a multi-core WCET tool (shared cache + shared bus); unified cache analysis (instruction and data accesses, conflicts with different instruction and data memory blocks); cache related preemption delay analysis; shared scratchpad allocation; coherence miss modeling]
Micro-architectural Modeling [Figure: single core: cache, pipeline, branch predictor; multi-core: shared cache, shared bus]
Imprecision in Abstract Interpretation [Figure: paths p1 and p2 reach a join point with abstract cache sets C1 and C2, one holding a, b and the other holding b, x (youngest first); the joined cache state C3 keeps only b] Path p1 or path p2? The joined cache state loses information about paths p1 and p2
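The information loss at the join can be sketched in a few lines. This is a minimal illustration, assuming a must analysis over a single LRU cache set in which each block maps to an upper bound on its age (0 = youngest); the function name `join_must` and the dictionary encoding are hypothetical and are not taken from the slides.

```python
def join_must(state_a, state_b):
    """Join two abstract LRU cache sets for a must analysis.

    A block survives the join only if it is present in both states,
    and it keeps the larger (more pessimistic) of its two ages.
    """
    return {
        block: max(age, state_b[block])
        for block, age in state_a.items()
        if block in state_b
    }


# The two cache sets from the slide, youngest block first (assumed ages).
c1 = {"a": 0, "b": 1}
c2 = {"b": 0, "x": 1}

c3 = join_must(c1, c2)
print(c3)  # {'b': 1}
```

After the join, nothing records whether a or x was cached, so a later access to either block cannot be classified as a hit, whichever path was actually taken.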
Model Checking Alone? • A path-sensitive search • Path-sensitive search is expensive: path explosion • Worse when combined with the possible cache states: state explosion [Figure: paths p1 and p2, each carrying its own abstract LRU cache set (cache states C1 and C2) instead of a single joined state]
[Figure: analysis flow. Cache analysis by abstract interpretation is refined by a model checker until all assertions are checked or a timeout occurs; the analysis outcome feeds the rest of the micro-architectural modeling (pipeline analysis, branch predictor modeling), which gives the WCET of basic blocks; these, together with loop bounds and infeasible path constraints, go to path analysis (IPET)] • Refinement by the model checker can be terminated at any point • Model checker refinement steps are inherently parallel • Each model checker refinement step checks a light assertion property
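Refinement (Inter-core) [Figure: a task accesses memory block m between start and exit; a conflicting task on another core accesses m1 under x < y and m2 under x == y, and the path through both is infeasible; in the cache, the blocks younger than m are ≠ m; the access to m is a cache hit, and classifying it as a cache miss is spurious]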
Refinement (Inter-core) [Figure: the conflicting task is instrumented with C_m++ (increment conflict) at the accesses to m1 (under x < y) and m2 (under x == y); the path through both is infeasible, so assert (C_m <= 1) at exit is verified and the access to m is a cache hit]
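The check on the conflict counter can be mimicked with a small script. This is only a sketch: it replaces the model checker with exhaustive enumeration over a small, arbitrarily chosen input range, and the function names and the range are assumptions made for illustration.

```python
from itertools import product


def conflicts_with_m(x, y):
    """Instrumented conflicting task: count accesses that conflict with m."""
    c_m = 0
    if x < y:
        c_m += 1  # access to m1, a conflict with m
    if x == y:
        c_m += 1  # access to m2, a conflict with m
    return c_m


def assertion_holds(bound, lo=-4, hi=4):
    """Check assert (C_m <= bound) for every input pair in [lo, hi]."""
    return all(
        conflicts_with_m(x, y) <= bound
        for x, y in product(range(lo, hi + 1), repeat=2)
    )


print(assertion_holds(1))  # True: m1 and m2 never conflict on the same path
print(assertion_holds(0))  # False: a single conflict is possible
```

Because x < y and x == y cannot both hold, at most one conflict reaches m on any path, which is the fact the refinement step needs in order to keep the access to m classified as a hit.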
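Refinement (Why it works?) [Figure: an access to m' under x < y is a conflict to m and increments the conflict count (C_m++); on path 2, under x == y, the access to m does not affect the value of C_m; the property is assert (C_m <= 0), and the access to m is marked as a cache miss]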
Experimental Setup (Chronos Toolkit) [Figure: tool flow. C source → GCC (SimpleScalar) → binary code → CFG → micro-architectural modeling (cache, pipeline, branch prediction), with CBMC (C bounded model checking) used for refinement → flow constraints and micro-architectural constraints → ILP → WCET]
Experimental Result [Figure: WCET results] Configuration: L1 caches direct-mapped, 256 bytes; shared L2 cache 4-way associative, 8 KB. Average time = 70 secs
Extension Using Symbolic Execution [Figure: the conflicting task is executed symbolically with unknown inputs; the branches x < y / x ≥ y and x = y are explored, with C_m++ (increment conflict) at m1 and m2; the constraint solver answers NO to whether x < y ∧ x = y can be satisfied, so that path is aborted and assert (C_m <= 1) is checked on the remaining paths]
Extension Using KLEE [Figure: tool flow. C source → GCC (SimpleScalar) → binary code → CFG → micro-architectural modeling (cache, pipeline, branch prediction), with CBMC/KLEE used for refinement → flow constraints and micro-architectural constraints → ILP → WCET]
A Generic Framework • Three different architectural/application settings • Intra-task cache conflict (WCET in single core) • Inter-task cache conflict between a high-priority and a low-priority task (cache related preemption delay analysis) • Inter-core cache conflict between a task in Core 1 and a task in Core 2 with private L1 caches and a shared L2 cache (WCET in multi-core)
Task-level interference [Figure: tasks T1, T2, T3 mapped to Core 1 … Core n (private L1 caches, shared bus, shared L2 cache); a timeline of their execution; the resulting task interference graph over T1, T2, T3]
Shared Cache + TDMA Shared Bus [Figure: task graphs over T1, T2, T3, T4 running on Core 1 and Core 2 (private L1 caches, shared bus, shared L2 cache), with tasks of disjoint lifetime marked; a Time Division Multiple Access (TDMA) bus schedule alternating Core 1 and Core 2 slots, in which a bus access after an L2 miss must WAIT for the core's own slot]
Overview of the framework [Figure: per core, L1 cache analysis, a filter, and L2 cache analysis; then L2 conflict analysis starting from an initial interference, bus-aware analysis, and WCRT computation; if the interference changes, the analysis is repeated (Yes), otherwise the estimated WCRT is reported (No)] Task interference monotonically decreases
Evaluation (2-core): one core runs statemate; the other core runs the program under evaluation
Evaluation (4-core): the 4 cores run either (edn, adpcm, compress, statemate) or (matmult, fir, jfdctint, statemate), one program per core
Micro-architectural Modeling [Figure: single core: cache, pipeline, branch predictor; multi-core: shared cache, shared bus; the interactions between these components are highlighted]
Timing Anomaly (Shared Cache) [Figure: sequences of cache hits and misses along alternative execution paths; the marked path may not be the worst-case path]
Baseline Abstraction: Timing Interval • Each pipeline stage is represented as a timing interval • End = Start + latency, where the latency interval covers the cache miss latency; in the slide's example, start [1,3] plus latency [3,7] gives finish [4,10] [Figure: instructions R1 := R2 + 5, R5 := R1 * R7, and R3 := R5 * 5 passing through the pipeline stages IF, ID, EX, WB, CM, with structural dependency and contention edges between stages] A fixed-point analysis derives the timing of each stage as an interval
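The interval arithmetic behind "End = Start + latency" can be sketched as follows. The pairing of the slide's numbers as start [1,3], latency [3,7], finish [4,10] is an assumed reading of the flattened figure labels (it is the pairing that is arithmetically consistent), and the `Interval` type with its methods is hypothetical.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Closed interval [lo, hi] of possible times, in cycles."""
    lo: int
    hi: int

    def plus(self, latency):
        """Finish time of a stage: start interval plus latency interval."""
        return Interval(self.lo + latency.lo, self.hi + latency.hi)

    def latest(self, other):
        """Start of a stage that must wait for two predecessors to finish."""
        return Interval(max(self.lo, other.lo), max(self.hi, other.hi))


start = Interval(1, 3)
latency = Interval(3, 7)   # assumed: hit latency 3, miss latency 7
finish = start.plus(latency)
print(finish)  # Interval(lo=4, hi=10)
```

A fixed-point analysis would apply such updates along the dependency and contention edges until no interval changes; the sketch shows only a single update step.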
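TDMA Shared Bus Analysis • Time Division Multiple Access (TDMA) • Offset abstraction [Figure: a bus schedule of repeating rounds, each with a Core 0 slot followed by a Core 1 slot; an access T' from core 0 whose offset falls in its own slot has delay = 0, while an access T from core 1 at the same offset is delayed until its slot]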
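The offset abstraction can be made concrete with a small delay function. This is a sketch under stated assumptions: equal slot lengths, slots ordered by core index within a round, and an access that must complete inside a single slot. The function and parameter names are hypothetical.

```python
def tdma_delay(offset, core, num_cores, slot_len, access_len):
    """Cycles a bus request waits before it can start.

    offset is the request time measured from the start of a TDMA round.
    Core c is assumed to own the slot [c * slot_len, (c + 1) * slot_len).
    """
    if access_len > slot_len:
        raise ValueError("access does not fit into one slot")
    round_len = num_cores * slot_len
    offset %= round_len
    slot_start = core * slot_len
    slot_end = slot_start + slot_len
    if slot_start <= offset and offset + access_len <= slot_end:
        return 0  # request arrives in its own slot and fits
    # Otherwise wait for the next start of the core's own slot.
    return (slot_start - offset) % round_len


# Two cores, slots of 10 cycles, bus access of 4 cycles.
print(tdma_delay(3, core=0, num_cores=2, slot_len=10, access_len=4))  # 0
print(tdma_delay(3, core=1, num_cores=2, slot_len=10, access_len=4))  # 7
print(tdma_delay(7, core=0, num_cores=2, slot_len=10, access_len=4))  # 13
```

Only the offset within the round matters, not the absolute time, which is why the analysis can track offsets instead of concrete timestamps.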
Loop Construct [Figure: pipeline stages IF, ID, EX, WB, CM of the previous and the current loop iteration, linked by cross-iteration edges] How do we define bus context? Property: if the bus offsets of the cross-iteration edges do not change, the WCET of the loop iteration cannot change
Loop Construct • C_i = bus context of the loop body at the i-th iteration [Figure: bus context flow graph over C1, C2, C3, C4, C5, with an edge from C5 back to C3] Property: if C_i = C_j, then C_{i+k} = C_{j+k} for any k > 0
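The property means that the sequence of bus contexts is fixed once a context repeats, so the flow graph can be built by iterating until the first repeat. The sketch below assumes a context is a single bus offset and that the next context is given by a deterministic function `step`; both are simplifications for illustration, and all names are hypothetical.

```python
def bus_context_flow(initial, step, loop_bound):
    """Enumerate the distinct bus contexts of a loop.

    Returns (contexts, back_edge), where contexts lists the distinct
    contexts in order of first occurrence and back_edge is the index the
    last context loops back to (None if the loop bound is reached first).
    """
    first_seen = {}
    contexts = []
    ctx = initial
    for iteration in range(loop_bound):
        if ctx in first_seen:
            return contexts, first_seen[ctx]
        first_seen[ctx] = iteration
        contexts.append(ctx)
        ctx = step(ctx)
    return contexts, None


# Assumed example: TDMA round of 20 cycles, loop body taking 8 cycles.
def next_offset(offset):
    return (offset + 8) % 20


print(bus_context_flow(0, next_offset, loop_bound=100))
# ([0, 8, 16, 4, 12], 0)
```

With these assumed numbers, a loop with up to 100 iterations has only five distinct contexts, so only five WCET computations for the loop body are needed.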
Loop Construct [Figure: bus context flow graph over C1, C2, C3, C4, feeding the WCET analysis flow (program, control flow graph, micro-architectural modeling, WCET of basic blocks, loop bound, infeasible path constraints, path analysis, ILP solver)] • Compute the WCET for each bus context • E(C1) = number of times context C1 is executed • Generate linear constraints: E(C1) + E(C2) + E(C3) + E(C4) ≤ loop bound; E(C1) ≥ E(C2) • ILP = Integer Linear Programming
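To show how the two constraints shape the loop's WCET, the sketch below maximizes the total execution time over the context counts. It uses brute-force enumeration instead of an ILP solver, which is practical only for tiny loop bounds, and the per-context WCET values are made up for illustration.

```python
from itertools import product


def loop_wcet(wcet_per_context, loop_bound):
    """Maximize sum(WCET(Ci) * E(Ci)) subject to the slide's constraints.

    Constraints: E(C1) + E(C2) + E(C3) + E(C4) <= loop_bound
                 E(C1) >= E(C2)
    """
    best_value = 0
    best_counts = (0,) * len(wcet_per_context)
    for counts in product(range(loop_bound + 1), repeat=len(wcet_per_context)):
        if sum(counts) > loop_bound:
            continue  # violates the loop bound constraint
        if counts[0] < counts[1]:
            continue  # violates E(C1) >= E(C2)
        value = sum(w * e for w, e in zip(wcet_per_context, counts))
        if value > best_value:
            best_value, best_counts = value, counts
    return best_value, best_counts


# Hypothetical WCETs (cycles) for contexts C1..C4 and a loop bound of 6.
print(loop_wcet([10, 14, 9, 8], loop_bound=6))  # (72, (3, 3, 0, 0))
```

Without the constraint E(C1) ≥ E(C2), the maximum would charge all six iterations at the cost of C2 (84 cycles), so the flow-graph constraint is what tightens the bound here.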
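Branch prediction + Cache [Figure: at a branch location, up to the maximum number of speculated instructions may be fetched, including a block m' that causes a cache conflict with m; the cache contents of the two paths are merged at the JOIN, leaving the later cache access to m unclear]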
Experimental Setup (Chronos Toolkit) [Figure: tool flow. C source → GCC (SimpleScalar) → binary code → CFG → micro-architectural modeling (private cache, pipeline, branch prediction, shared cache, shared bus) → flow constraints and micro-architectural constraints → ILP → WCET]
Evaluation (Cache + pipeline) [Figure: results for jfdctint and statemate with the shared cache partitioned horizontally or vertically between Core 1 and Core 2; the remaining overestimation is labeled as the imprecision of shared cache analysis]
Evaluation (Cache + pipeline + Speculation) [Figure: results showing the imprecision of modeling speculation]
Evaluation (Bus + pipeline) [Figure: results separating the imprecision of shared bus analysis from the imprecision of path analysis]
Recap [Figure: dissertation work (time-predictable execution in multi-core). Components shown: a multi-core WCET tool (shared cache + shared bus); unified cache analysis; cache related preemption delay analysis (cache conflict between a low-priority and a high-priority task); shared scratchpad allocation (PE-0 … PE-N with SPM-0 … SPM-N, fast on-chip communication media, external memory interface, shared off-chip data bus, off-chip memory); coherence miss modeling (L1 data caches of Core 1 … Core n, coherence miss traffic, stale data items)]
Perspective [Figure: from time-predictable execution in single-core to time-predictable execution in multi-core, via resource sharing (cache and bus) and data sharing (cache coherence); approaches: testing, static analysis, customized hardware; topics: shared cache, shared bus, cache coherence, shared scratchpad; example platforms: ARM Cortex A9 MPCore, Samsung Exynos, Nvidia Tegra II (smart phones), Aethereal network-on-chip (Time Division Multiple Access), Sony PSP, IBM Cell]
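Perspective [Figure: functionality verification and quantitative verification side by side. Functionality verification: the concrete domain is abstracted, a verifier checks the property, and a spurious counterexample triggers abstraction refinement, as in SLAM (Microsoft), BLAST (UC Berkeley), and MAGIC (CMU). Quantitative verification: the concrete domain is mapped to an abstract domain by abstract interpretation (AI), whose result may be spurious; a quantitative property is generated and checked by path-sensitive verification, which either verifies it or leads to refinement] Anytime verification of quantitative properties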
Future Work • Static performance analysis + testing • Performance testing through symbolic execution of a quantitative property (e.g. cache conflict) • Energy analysis of software on mobile devices (battery life) • Energy-aware software testing [Figure: symbolic execution tree over an input with branches x < y / x ≥ y and x = y, accesses m1 and m2, the path condition x < y ∧ x ≠ y, an aborted path, and assert (C_m <= 1)]
Thank You. My sincere thanks to all the examiners, and especially to the anonymous Examiner 1 for his comment on symbolic execution