Predictable Integration of Safety-Critical Software on COTS-based Embedded Systems. Marco Caccamo, University of Illinois at Urbana-Champaign.
Outline • Motivation • PRedictable Execution Model (PREM) • Peripheral scheduler & real-time bridge • Memory-centric scheduling • MemGuard • Memory bandwidth Isolation • Colored Lockdown • Cache space management
Real-Time Applications • Resource-intensive real-time applications • Multimedia processing(*), real-time data analytics(**), object tracking • Requirements • Need more performance at lower cost: Commercial Off-The-Shelf (COTS) • Performance guarantee (i.e., temporal predictability and isolation) (*) ARM, QoS for High-Performance and Power-Efficient HD Multimedia, 2010 (**) Intel, The Growing Importance of Big Data and Real-Time Analytics, 2012
Modern System-on-Chip (SoC) • More cores • Freescale P4080 has 8 cores • More sharing • Shared memory hierarchy (LLC, MC, DRAM) • Shared I/O channels • More performance, less energy, less cost. But, isolation?
SoC: challenges for RT safety-critical systems • In a multicore chip, memory controllers, last level cache, memory, on-chip network and I/O channels are globally shared by cores. Unless a globally shared resource is over-provisioned, it must be partitioned/reserved/scheduled. Otherwise: • Complexity, cost and schedule: the schedulability analysis, testing and temporal certification of an IMA partition in a core will also depend on tasks running in other cores • Safety concerns: a software change in one core could cause the tasks in other cores' IMA partitions to miss their deadlines. This is unacceptable!
Problem: Shared Memory Hierarchy • Shared hardware resources • OS has little control [Figure: App 1 to App 4 running on Core1 to Core4; space sharing in the shared Last Level Cache (LLC); access contention at the Memory Controller (MC) and DRAM.]
Problem: Task-Peripheral conflict (1 core) • Task-peripheral conflict: • Master peripheral working for Task B. • Task A suffers cache miss. • Processor activity can be stalled due to interference at the FSB level. • How relevant is the problem? • Up to 49% increased WCET for memory-intensive tasks. • Contention for access to main memory can greatly increase a task's worst-case computation time! This effect MUST be considered in WCET computation! [Figure: CPU running Task A and Task B, DDRAM, Front Side Bus, Host PCI Bridge, and a PCI bus with a master peripheral and a slave peripheral.] Sebastian Schonberg, Impact of PCI-Bus Load on Applications in a PC Architecture, RTSS 03
Experiment: Task and Peripherals • Experiment on Intel platform, typical embedded system speed. • PCI-X 133 MHz, 64 bit, fully loaded by traffic generator peripheral. • Task suffers continuous cache misses. • Up to 44% WCET increase.
Experiment: 2 Cores Interference • Task A suffers max number of cache misses (92% stall time). • Task B has variable cache stall time. • Adding PCI-E peripheral interference -> 196% WCET increase! • WCET increase is proportional to cache stall time. • Max WCET increase ~= cache stall time of task A. Multicore interference is a serious problem!
Problem: Bus Contention • Two DMA peripherals transmitting at full speed on PCI-X bus. • Round-robin arbitration does not allow timing guarantees. [Figure: transfer timelines over t = 0 to 16 between the peripherals, CPU and RAM in three cases: no bus sharing; bus contention, 50% / 50%; bus contention, 33% / 66%.] Integration nightmare!
Cache Delay Analysis (contention-based access) • Compute worst-case increase on task computation time due to peripheral interference (single core system). • Main idea: treat the memory subsystem as a switch that multiplexes accesses between the CPU and peripherals. • The same analysis was later extended to multicore platforms. [Figure: task cache fetches and peripheral bandwidth over time t, showing the WCET with no interference and the WCET increase.] R. Pellizzoni and M. Caccamo, "Impact of Peripheral-Processor Interference on WCET Analysis of Real-Time Embedded Systems", IEEE Transactions on Computers (TC), Vol. 59, No. 3, March 2010.
Modeling I/O traffic: Peripheral Arrival Curve • Key idea: the maximum task delay depends on the amount of peripheral traffic (single core). • Peripheral arrival curve: maximum amount of time required by all peripherals to access main memory. • Can be obtained using: • Measurement • Distributed traffic analysis • Enforced through engineering solution (more on that later…)
The Need for Engineering Solutions • Analysis bounds are tight but depend on very peculiar arrival patterns. • Average case significantly lower than worst case. • Main issue: COTS arbiters are not designed for predictability. • We propose engineering solutions to: • schedule memory accesses at high level (coarse granularity): memory-centric real-time scheduling, • control cores' memory bandwidth usage, • manage cache space in a predictable manner.
Outline • Motivation • PRedictable Execution Model (PREM) • Peripheral scheduler & real-time bridge • Memory-centric scheduling • MemGuard • Memory bandwidth Isolation • Colored Lockdown • Cache space management
Peripheral Scheduling • Solution: enforce peripheral schedule (single resource scheduling). • No need to know low-level parameters! • COTS peripherals do not provide block functionality, so how do we do this? [Figure: implicit schedule enforcement between CPU and RAM, with BLOCK intervals on a timeline over t = 0 to 16.]
Real-Time I/O Management System • Real-Time Bridge interposed between peripheral and bus. • RT-Bridge buffers incoming/outgoing data and delivers it predictably. • Peripheral Scheduler enforces traffic isolation. [Figure: system diagram with CPU, RAM, North Bridge, South Bridge, Peripheral Scheduler, and four RT Bridges on the PCIe, ATA and PCI-X paths.] E. Betti, S. Bak, R. Pellizzoni, M. Caccamo and L. Sha, "Real-Time I/O Management System with COTS Peripherals", IEEE Transactions on Computers (TC), Vol. 62, No. 1, pp. 45-58, January 2013.
Peripheral Scheduler • Peripheral Scheduler receives data_rdy_i information from Real-Time Bridges and outputs block_i signals. • Server provides isolation by enforcing a timing reservation. • Fixed priority, cyclic executive etc. can be implemented in HW with very little area. [Figure: data_rdy_i, Server_i, READY_i, the fixed-priority (FP) scheduler, EXEC_i and block_i for each bridge i.] • EXEC_1 = READY_1 • EXEC_2 = READY_2 and not EXEC_1 • ... • EXEC_i = READY_i and not EXEC_1 … and not EXEC_{i-1}
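The fixed-priority selection equations on this slide can be sketched in software as follows. This is only an illustrative model, not the HW implementation: the function names are hypothetical, index 0 is taken as the highest priority, and the rule that a bridge is blocked whenever it is not selected (block_i = not EXEC_i) is an assumption, since the slide does not state how block_i is derived.

```python
from typing import List


def fixed_priority_exec(ready: List[bool]) -> List[bool]:
    """EXEC_i = READY_i and not EXEC_1 ... and not EXEC_{i-1}.

    ready[0] is the highest-priority server (assumed ordering).
    """
    exec_signals = []
    granted = False  # True once any higher-priority server holds EXEC
    for r in ready:
        e = r and not granted
        exec_signals.append(e)
        granted = granted or e
    return exec_signals


def block_signals(ready: List[bool]) -> List[bool]:
    # Assumption: a bridge is blocked whenever it is not the selected one.
    return [not e for e in fixed_priority_exec(ready)]


if __name__ == "__main__":
    # Servers 2 and 3 are ready; only server 2 (higher priority) executes.
    print(fixed_priority_exec([False, True, True]))  # [False, True, False]
    print(block_signals([False, True, True]))        # [True, False, True]
```

At most one EXEC signal is high at a time, which is what makes the bus a single scheduled resource.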
Real-Time Bridge • FPGA System-on-Chip design with CPU, external memory, and custom DMA Engine. • Connected to main system and peripheral through available PCI/PCIe bridge modules. [Figure: Host CPU with Main Memory; FPGA with FPGA CPU, Interrupt Controller, Memory Controller, Local RAM, DMA Engine, PLB and PCI bridges; Controlled Peripheral; block and data_rdy signals.]
Real-Time Bridge • The controlled peripheral reads/writes to/from Local RAM instead of Main Memory (completely transparent to the peripheral). • DMA Engine transfers data from/to Main Memory to/from Local RAM. [Figure: same Real-Time Bridge block diagram, with Host CPU, Main Memory, FPGA CPU, Local RAM, DMA Engine, PCI bridges, and Controlled Peripheral.]
Peripheral Virtualization • RT-Bridge supports peripheral virtualization. • A single peripheral (e.g., a Network Interface Card) can service different software partitions. • HW virtualization enforces strict timing isolation.
Implemented Prototype • Xilinx TEMAC 1 Gb/s Ethernet card (integrated on FPGA). • Optimized virtual driver implementation with no software packet copy (PowerPC running Linux). • Full VHDL HW code and SW implementation available.
Evaluation • 3 x Real-Time Bridges, 1 x Traffic Generator with synthetic traffic. • Rate Monotonic with Sporadic Servers. Utilization 1, harmonic periods. • Scheduling flows without peripheral scheduler (block always low) leads to deadline misses! [Figure: traces for the Generator and the three RT-Bridges.]
Evaluation • 3 x Real-Time Bridges, 1 x Traffic Generator with synthetic traffic. • Rate Monotonic with Sporadic Servers. • No deadline misses with peripheral scheduler. [Figure: traces for the Generator and the three RT-Bridges.]
Testbed (single core, distributed) • Embedded testbed used to prove the applicability of our techniques. • System objective: control a 3DOF Quanser helicopter. • Non-linear control. • 100 Hz sensing and actuation. • End-to-end delay control using: • I/O Management System. • Real-Time Bridge.
Testbed (single core, distributed) • Sensor Node performs sensing/actuation. • Control Node executes control algorithm. • Data exchanged on real-time network. [Figure: Quanser 3DOF helicopter, Sensor Node, RT Network, Control Node.]
Testbed [Figure: Control Node (CPU, memory logic, RAM, Peripheral Scheduler, RT Bridge, RT NIC card on PCI); sensing/actuation node (ADC/DAC card, RT NIC card); RT Switch carrying sensing data and actuation; Traffic Generator NIC producing disturbance traffic; GUI Node with NIC.]
Predictable Execution Model (PREM uni-core) • (The rule) Real-time embedded applications should be compiled according to a new set of rules to achieve predictability. • (The effect) The execution of a task can be divided into a memory-intensive phase (with cache prefetching) and a local computation phase (with cache hits). • (The benefit) High-level coscheduling can be enforced among all active components of a COTS system: contention for accessing shared resources is implicitly resolved by the high-level coscheduler without relying on low-level arbiters. R. Pellizzoni, E. Betti, S. Bak, G. Yao, J. Criswell, M. Caccamo, R. Kegley, "A Predictable Execution Model for COTS-based Embedded Systems", Proceedings of 17th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), Chicago, USA, April 2011.
Memory-centric scheduling (multicore) • It uses the PREM task model: each task is composed of a sequence of intervals, each including a memory phase followed by a computation phase. • It enforces a coarse-grain TDMA schedule for granting memory access to each core. • Each core can be analyzed in isolation as if tasks were running on a "single-core equivalent" platform. G. Yao, R. Pellizzoni, S. Bak, E. Betti, and M. Caccamo, "Memory-centric scheduling for multicore hard real-time systems", Real-Time Systems Journal, Vol. 48, No. 6, pp. 681-715, November 2012.
Two cores example: TDMA slot of core 1 • With a coarse-grained TDMA, tasks on one core can perform memory accesses only when the TDMA slot is granted. • Core isolation. [Figure: timeline of jobs J1, J2, J3 over t = 0 to 12, showing memory phases and computation phases.]
Memory-centric scheduling: three rules • Assumption: fixed priority, partitioned scheduling • Rule 1: enforce a coarse-grain TDMA schedule among the cores for granting access to main memory; • Rule 2: raise scheduling priority of memory phases over execution phases when TDMA memory slot is granted; • Rule 3: memory phases are non-preemptive.
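Rules 1 and 2 can be sketched as a small per-core dispatch decision. This is an illustrative sketch under stated assumptions, not the authors' implementation: `slot_owner`, `Phase` and `pick_next` are hypothetical names, the TDMA schedule is assumed to use equal-length slots granted to cores in round-robin order, and a lower `priority` number means a higher fixed priority. Rule 3 (non-preemptive memory phases) is not modeled here; a real dispatcher would additionally refuse to switch away from a running memory phase.

```python
from dataclasses import dataclass
from typing import List, Optional


def slot_owner(t: float, slot_len: float, num_cores: int) -> int:
    """Rule 1: core that holds the memory slot at time t (equal slots assumed)."""
    return int(t // slot_len) % num_cores


@dataclass
class Phase:
    task: str
    priority: int  # lower value = higher fixed priority (assumption)
    kind: str      # "memory" or "compute"


def pick_next(core: int, t: float, ready: List[Phase],
              slot_len: float, num_cores: int) -> Optional[Phase]:
    """Choose the phase to run on `core` at time t."""
    if slot_owner(t, slot_len, num_cores) == core:
        # Rule 2: while the slot is granted, memory phases outrank compute phases.
        memory = [p for p in ready if p.kind == "memory"]
        if memory:
            return min(memory, key=lambda p: p.priority)
    # Without the slot (or with no pending memory phase), only compute phases run.
    compute = [p for p in ready if p.kind == "compute"]
    return min(compute, key=lambda p: p.priority) if compute else None


if __name__ == "__main__":
    ready = [Phase("J1", 1, "compute"), Phase("J2", 2, "memory")]
    print(pick_next(0, 1.0, ready, slot_len=4.0, num_cores=2).task)  # J2
    print(pick_next(0, 5.0, ready, slot_len=4.0, num_cores=2).task)  # J1
```

In the usage example, core 0 owns the slot during [0, 4), so the lower-priority memory phase of J2 runs first; once the slot passes to core 1, core 0 falls back to J1's computation phase.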
Raise priority of mem. phases during TDMA slot [Figure: two timelines of jobs J1, J2, J3 over t = 0 to 12, showing memory phases and computation phases.]
Make memory phases non-preemptive [Figure: two timelines of jobs J1, J2, J3 over t = 0 to 12.]
Summary of two cores example • Rule 1: TDMA memory schedule • Rule 2: prioritize memory phases during a TDMA memory slot • Rule 3: memory phases are non-preemptive
Intuition of response time analysis • The linearized TDMA model: b is the memory bandwidth assigned to the core (b = TDMA_slot / TDMA_period); each memory phase is inflated by a factor 1/b; each execution phase is inflated by a factor 1/(1-b). • Interfering jobs that contribute to the worst-case response time can be separated into a memory chain followed by an execution chain. [Figure: memory chain and execution chain for jobs J1 to J5 over t = 0 to 40.]
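As a worked illustration of the linearized TDMA model, the sketch below computes b and the inflated phase lengths. The function name `linearized_phases` and the sample numbers are hypothetical, chosen only to make the arithmetic easy to follow; the formulas are the ones stated on the slide.

```python
from typing import Tuple


def linearized_phases(mem: float, exe: float,
                      tdma_slot: float, tdma_period: float) -> Tuple[float, float]:
    """Inflate one memory phase and one execution phase.

    b = tdma_slot / tdma_period is the memory bandwidth share of the core;
    memory phases scale by 1/b and execution phases by 1/(1 - b).
    """
    b = tdma_slot / tdma_period
    if not 0.0 < b < 1.0:
        raise ValueError("the slot must be a strict fraction of the period")
    return mem / b, exe / (1.0 - b)


if __name__ == "__main__":
    # Hypothetical numbers: slot 2 out of period 8 gives b = 0.25.
    inflated_mem, inflated_exe = linearized_phases(1.0, 3.0, 2.0, 8.0)
    print(inflated_mem, inflated_exe)  # 4.0 4.0
```

With b = 0.25, a memory phase of 1 time unit inflates to 4 and an execution phase of 3 time units inflates to 3 / 0.75 = 4.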
Pipelining memory and exec. phases • Key observations: • The inflated memory and execution phases can run in parallel. • Only ONE joint job contributes to both memory and execution chains (in this figure, J3 is the joint job). [Figure: memory chain and execution chain for jobs J1 to J5 over t = 0 to 40.]
Worst-case response time of Job Ji [Figure annotations: 1. Upper bound of the memory phase of the joint job; 2. Memory blocking from one lower priority job; 3. Either memory or computation from hp(i); 4. Computation of job under analysis.] • Both the memory and the computation of the joint job • Longest memory phase of one job with lower priority (due to non-preemptive memory) • The max of memory and computation phase for each higher priority job • The computation phase of the job under analysis
Schedulability of synthetic tasks • In an 8-core, 10-task system, the memory-centric scheduling bound is superior to the contention-based scheduling bound. [Figure: schedulability ratio as a function of memory utilization and core utilization.]
Schedulability of synthetic tasks • The contour line at the 50% schedulable level (ratio = 0.5). [Figure: schedulability ratio as a function of memory utilization and core utilization.]
Outline • Motivation • PRedictable Execution Model (PREM) • Peripheral scheduler & real-time bridge • Memory-centric scheduling • MemGuard • Memory bandwidth Isolation • Colored Lockdown • Cache space management
Memory Interference • Key observations: • Memory bandwidth (variable) != CPU bandwidth (constant) • Memory controller queuing/access delay is unpredictable [Figure: foreground slowdown ratio on an Intel Core2 with two cores, per-core L2 and shared memory; foreground benchmarks on the x-axis, 470.lbm as background; bandwidth labels 2.1GB/s, 1.6GB/s, 1.5GB/s, 1.5GB/s, 1.4GB/s.]
Memory Access Pattern • Memory access patterns vary over time • Static resource reservation is inefficient [Figure: two plots of LLC misses over time (ms).]