Toward the Predictable Integration of Real-Time COTS Based Systems Marco Caccamo Department of Computer Science University of Illinois at Urbana-Champaign
Acknowledgement • Part of this research is joint work with Prof. Lui Sha • This presentation is from selected research sponsored by • National Science Foundation • Lockheed Martin Corporation • Graduate students who led these research efforts: • Rodolfo Pellizzoni • Bach D. Bui
References • R. Pellizzoni, B. D. Bui, M. Caccamo and L. Sha, "Coscheduling of CPU and I/O Transactions in COTS-Based Embedded Systems," to appear at the IEEE Real-Time Systems Symposium, Barcelona, December 2008. • R. Pellizzoni and M. Caccamo, "Toward the Predictable Integration of Real-Time COTS Based Systems," Proceedings of the IEEE Real-Time Systems Symposium, Tucson, Arizona, December 2007.
COTS HW & RT Embedded Systems • Embedded systems are increasingly built using Commercial Off-The-Shelf (COTS) components to reduce costs and time-to-market • This trend holds even for companies in the safety-critical avionics market such as Lockheed Martin Aeronautics, Boeing and Airbus • COTS components usually provide better performance: • SAFEbus, used in the Boeing 777, transfers data at up to 60 Mbps, while a COTS interconnect such as PCI Express can reach transfer speeds over three orders of magnitude higher • COTS components are mainly optimized for average-case performance, not for the worst-case scenario.
I/O Bus Transactions & WCETs • Experiment based on an Intel platform at typical embedded-system speed. • PCI-X at 133 MHz, 64 bit, fully loaded. • The task suffers continuous cache misses. • Up to a 44% wcet increase. This is a big problem!
ARINC 653 and unpredictable COTS behaviors • According to the ARINC 653 avionics standard, different computational components should be placed into isolated partitions (cyclic time slices of the CPU). • ARINC 653 does not provide any isolation from the effects of I/O bus traffic: a peripheral is free to interfere with cache fetches while any partition (not requiring that peripheral) is executing on the CPU. • To provide true temporal partitioning, enforceable specifications must address the complex dependencies among all interacting resources. • See the Aeronautical Radio Inc. ARINC 653 Specification, which defines the Avionics Application Standard Software Interface.
Peripheral Integration: Problem Scenario
[Figure: Task A and Task B run on the CPU; a master peripheral and a slave peripheral on the PCI bus reach DDRAM through the Host-PCI bridge and the Front Side Bus. This effect MUST be considered in wcet computation!]
Sebastian Schonberg, "Impact of PCI-Bus Load on Applications in a PC Architecture," RTSS 2003.
• Cache-peripheral conflict: • Master peripheral working for Task B. • Task A suffers a cache miss. • Processor activity can be stalled due to interference at the FSB level. • How relevant is the problem? • Four high-performance network cards, saturated bus. • Up to 49% increased wcet for memory-intensive tasks.
Goal: End-to-End Temporal Isolation on COTS • To achieve end-to-end temporal isolation, shared resources (CPU, bus, cache, peripherals, etc.) should either support strong isolation or their temporal interference should be quantifiable. • Highly pessimistic assumptions are often made to compensate for the lack of end-to-end temporal isolation on COTS. • An example is to account for the effect of all peripheral traffic in the wcet of real-time tasks (up to a 44% increment in task wcet)! • The lack of end-to-end temporal isolation dramatically raises integration costs and is a source of serious concern during the development of safety-critical embedded systems. • At integration time (the last phase of the design cycle), testing can reveal unexpected deadline misses, causing expensive design rollbacks.
Goal: End-to-End Temporal Isolation on COTS • It is mandatory to take a closer look at HW behavior and its integration with the OS, middleware, and applications. • We aim at analyzing the temporal interference caused by COTS integration. • If the analyzed performance is not satisfactory, we search for alternative (non-intrusive) HW solutions (see the Peripheral Gate).
Main Contributions We introduced an analytical technique that computes safe bounds on the I/O-induced task delay (D). To control I/O interference over task execution, we introduced a coscheduling technique for the CPU & I/O peripherals. We designed a COTS-compatible peripheral gate and hardware server to enable/disable I/O peripherals (the hw server is in progress!).
The cache-peripheral interference problem
[Figure: example COTS avionics architecture — a PowerPC clocked at 1000 MHz with multi-level cache and 256 MB DDR SDRAM behind a system controller (64-bit memory bus at 125 MHz), plus 32/64-bit PCI and PCI-X buses (33–100 MHz) behind PCI-to-PCI and PCI-X-to-PCI bridges hosting network interfaces, IEEE 1394, Fibre Channel, digital video, graphics/MPEG, RS-485 and discrete I/O peripherals.]
• Modern COTS-based embedded architectures are multi-master platforms • COTS components are inherently unpredictable due to: • Pipelined, cached CPUs. • Master (DMA) peripherals. • Etc. • We assume a shared-memory architecture with single-port RAM • We will show safe bounds for cache-peripheral interference at the main memory level.
Peripheral Burstiness Bound • Similar to a network calculus approach. • E(t): maximum cumulative bus time required by peripherals in any interval of length t. • How to compute: • Measurement. • Knowledge of distributed traffic. • Assumptions: • Maximum non-preemptive transaction length: L' • No buffering in bridges (the analysis has also been extended to handle buffering).
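To make the burstiness bound concrete, here is a minimal sketch (not from the paper) of how E(t) could be estimated from a measured bus trace. It assumes the trace is a time-ordered list of transactions, each treated as a point demand of busy[i] bus time at timestamp ts[i]; E(t) is then the maximum cumulative bus time demanded in any window of length t.

#include <stddef.h>

/* Estimate E(t): the maximum cumulative bus time requested by peripherals
 * in any interval of length t, from a time-ordered trace of transactions.
 * Each transaction is treated as a point demand (a simplifying assumption). */
double burstiness_E(const double *ts, const double *busy, size_t n, double t)
{
    double best = 0.0, window = 0.0;
    size_t lo = 0;

    for (size_t hi = 0; hi < n; hi++) {
        window += busy[hi];                /* extend window to include transaction hi */
        while (ts[hi] - ts[lo] >= t) {     /* drop transactions outside the length-t window */
            window -= busy[lo];
            lo++;
        }
        if (window > best)
            best = window;
    }
    return best;
}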
Cache Miss Profile
[Figure: cumulative bus time vs. t — flat segments: CPU executing; increasing segments (slope 1): CPU stalled during a cache line fetch.]
• c(t): cumulative bus time required to fetch/replace cache lines in [0, t]. • Note: not an upper bound! • Assumptions: • The CPU is stalled while waiting for a level-2 cache line fetch (no hyperthreading). • How to compute: • Static analysis. • Profiling. • Profiling yields multiple traces; the delay analysis is run on all of them.
Cache Delay Analysis
[Figure: cache misses over time for the CPU with no I/O interference (wcet) and with PCI traffic, showing the wcet increment (D) and the PCI peripheral bound curve.]
• The proposed analysis computes the worst-case increase (D) in task computation time due to cache delays caused by FSB interference. • Main idea: treat the FSB + CPU cache logic as a switch that multiplexes accesses to system memory. • Inputs: cache line misses over time and the peripheral bandwidth. • Output: a curve representing the delayed cache misses. • Bus arbitration is assumed RR or FP; transactions are non-preemptive.
Analysis: Intuition (1/2)
[Figure: CPU cache misses (each of one cache line length) interleaved with PCI transactions of maximum length L' over time.]
• Worst-case situation: a PCI transaction is accepted just before each CPU cache miss. • Worst-case interference: min(CM, PT/L') · L' • CM: number of cache misses • PT: total peripheral traffic during task execution • Assuming RR bus arbitration
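As a purely hypothetical numeric illustration of the coarse bound above (the numbers are made up, not taken from the experiments): each cache miss can be blocked by at most one non-preemptive transaction of length at most L', and the peripherals cannot inject more than PT bus time in total, whichever is smaller.

\[
  CM = 3,\qquad L' = 2\,\mu\text{s},\qquad PT = 10\,\mu\text{s}
  \;\Rightarrow\;
  D \le \min\!\Big(CM,\ \tfrac{PT}{L'}\Big)\cdot L' = \min(3,\,5)\cdot 2\,\mu\text{s} = 6\,\mu\text{s}.
\]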
Analysis: Intuition (2/2)
[Figure: CPU memory accesses and peripheral transactions issued every T time units — some CPU memory accesses cannot be delayed, and some peripheral transactions cannot delay the CPU.]
• The analysis shown above is pessimistic: cache misses exhibit bursty behavior. • Example: assume one peripheral transaction every T time units. • Real analysis: compute the exact interference pattern based on the burstiness of cache misses and peripheral transactions.
Worst Case Interference Scenario
[Figure: cache and CPU activity over time; fetch start times in the cache access function c(t) are unmodified by peripheral activity.]
• Worst-case situation: a peripheral transaction of length L' is accepted just before each CPU cache miss.
Bound: Cache Misses
[Figure: cache, CPU and peripheral activity over time, with the resulting delay D.]
• Cache bound: the max number of interfering peripheral transactions equals the number of cache misses. • Let CM be the number of cache misses. • Then D ≤ CM · L'.
Bound: Peripheral Load
[Figure: cache, CPU and peripheral activity over time, with the resulting delay D.]
• Peripheral bound: the max interference D is bounded by the max bus time requested by peripherals in the interval between the first and last fetch, enlarged by D itself. • In general, given a set of fetches {f_i,…,f_j} with start times {t_i,…,t_j}: D ≤ E(t_j − t_i + D). • Equivalently, D is bounded by the largest δ satisfying δ ≤ E(t_j − t_i + δ).
Some Insights about Peripheral Bound There is a circular dependency between the amount of peripheral load that interferes with {f_i,…,f_j} and the delay D(f_i, f_j). When peripheral traffic is injected on the FSB, the start time of each fetch is delayed. In turn, this increases the time interval between f_i and f_j, and therefore more peripheral traffic can now interfere with those fetches. Our key idea is that we do not need to modify the start times {t_i,…,t_j} of the fetches when we take the injected I/O traffic into account. Instead, we account for it directly in the equation that defines D: D ≤ E(t_j − t_i + D).
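Because of this circular dependency, the peripheral bound can be evaluated by a simple fixed-point iteration. The sketch below is an assumption about one reasonable way to compute it (not the paper's exact procedure); it reuses burstiness_E() from the earlier sketch and widens the window until no additional traffic can interfere.

/* Evaluate the peripheral bound D <= E(delta + D), with delta = t_j - t_i,
 * by fixed-point iteration over the measured burstiness curve.  The
 * iteration is monotone and stops once the window cannot capture any more
 * peripheral bus time. */
double periph_bound(double delta, const double *ts, const double *busy, size_t n)
{
    double D = 0.0;
    for (;;) {
        double next = burstiness_E(ts, busy, n, delta + D);
        if (next <= D)          /* converged: D already covers E(delta + D) */
            return D;
        D = next;
    }
}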
Some Insights about Peripheral Bound • D represents both the maximum delay suffered by the fetches within [0, 36] and the enlargement of the time interval over which peripheral traffic can interfere. [Figure: fetches in the interval [0, 36] and the max interference D.]
The Intersection is not Tight!
[Figure: cache and peripheral activity over [0, 50]; one peripheral transaction cannot interfere; E(t5 − t1 + D) = 14.]
• The real worst-case delay is 13! • Reason: the cache is too bursty; the interference from one peripheral transaction is "lost" while the cache is not in use.
The Intersection is not Tight!
[Figure: the same cache and peripheral activity, with the fetch sequence split into two intervals that are bounded separately.]
• Solution: split the fetch sequence into multiple intervals and sum the per-interval bounds. • How many intervals do we need to consider?
Delay Algorithm
[Figure: nested intervals of cache misses, giving the max delay u1, u2, u3, u4 for misses 1–4.]
• The iterative algorithm evaluates N(N+1)/2 intervals. • Each interval is computed in O(1), for an overall complexity of O(N²). • The bound is tight (see RTSS'07).
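The sketch below is a simplified reconstruction of this idea, not the exact RTSS'07 algorithm: it bounds the cumulative delay of the first j misses by the best split of the miss sequence into consecutive intervals, bounding each interval by the smaller of its cache bound and its peripheral bound (periph_bound() from the earlier sketch). All N(N+1)/2 interval pairs (i, j) are enumerated.

#include <stdlib.h>

/* miss_start[k]: start time of the (k+1)-th cache miss in the undelayed
 * profile c(t); Lprime: max non-preemptive transaction length.  Returns an
 * upper bound on the total I/O-induced delay D of the N misses. */
double delay_bound(const double *miss_start, size_t N, double Lprime,
                   const double *ts, const double *busy, size_t n)
{
    double *u = malloc((N + 1) * sizeof *u);   /* u[j]: bound for misses 1..j */
    u[0] = 0.0;
    for (size_t j = 1; j <= N; j++) {
        u[j] = 1e300;                          /* effectively +infinity */
        for (size_t i = 1; i <= j; i++) {
            double cache    = (double)(j - i + 1) * Lprime;
            double periph   = periph_bound(miss_start[j - 1] - miss_start[i - 1],
                                           ts, busy, n);
            double interval = cache < periph ? cache : periph;
            if (u[i - 1] + interval < u[j])
                u[j] = u[i - 1] + interval;    /* best split ending before miss i */
        }
    }
    double D = u[N];
    free(u);
    return D;
}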
Multitasking analysis Multitasking analysis using a cyclic executive (it was also extended to EDF with a restricted-preemption model). Analyze the task Control Flow Graph. Build a set of sequential superblocks. The schedule is an interleaving of slots composed of superblocks. Algorithm: compute the number of superblocks in each slot. Account for additional cache misses due to inter-task cache interference.
Great! But c(t) is hard to get... and 44% is awful The proposed analysis makes a fairly restrictive assumption: it must know the exact time of each cache miss. I/O interference is significant: if it is added to the wcet of all tasks, the system suffers a huge waste of bandwidth! Key idea: coschedule the CPU & I/O peripherals. Goal: allow as much peripheral traffic as possible at run time while using CPU reservations that do NOT include the I/O interference (D).
Cache Miss Profile is Hard to Get
[Figure: task control flow split into sequential macroblocks, with a checkpoint inserted at each boundary.]
• Problem: obtaining an exact cache miss pattern is very hard. • CPU simulation requires simulating all peripherals. • Static analysis scales poorly. • In practice, testing is often the preferred approach. • Our solution (see the profiling sketch below): • Split each task into intervals. • Insert a checkpoint at the end of each interval. • Measure the wcet and the worst-case number of cache misses for each interval (with no peripheral traffic). • Checkpoints should not break loops or branches (sequential macroblock boundaries).
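A minimal sketch of the per-interval profiling instrumentation, assuming hypothetical helpers read_cycle_counter() and read_l2_miss_counter() for the platform performance counters (the names are illustrative, not the authors' code):

extern unsigned long long read_cycle_counter(void);     /* hypothetical helper */
extern unsigned long long read_l2_miss_counter(void);   /* hypothetical helper */

struct superblock_profile {
    unsigned long long wc_cycles;    /* worst-case observed cycles in this interval */
    unsigned long long wc_misses;    /* worst-case observed cache misses            */
};

static unsigned long long last_cycles, last_misses;

/* Called at the checkpoint closing each interval; run with all peripheral
 * traffic disabled, over many test runs, to collect wcet and miss counts. */
void checkpoint(struct superblock_profile *p)
{
    unsigned long long c = read_cycle_counter();
    unsigned long long m = read_l2_miss_counter();

    if (c - last_cycles > p->wc_cycles) p->wc_cycles = c - last_cycles;
    if (m - last_misses > p->wc_misses) p->wc_misses = m - last_misses;

    last_cycles = c;
    last_misses = m;
}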
CPU & I/O coscheduling: HOW TO A coscheduling technique for COTS peripherals: • Divide each task into a series of sequential superblocks; • Run off-line profiling for each task, collecting the wcet and the number of cache misses of each superblock (without I/O interference); • Compute a safe (wcet + D) bound (which includes I/O interference) for each superblock by assuming a "critical cache miss pattern"; • Design a peripheral gate (p-gate) to enable/disable I/O peripherals; • Design a new peripheral (on an FPGA board), the reservation controller, which executes the coscheduling algorithm and controls all p-gates; • Use the profiling information at run time to coschedule tasks and I/O transactions.
Analysis with Interval Information • Input: a set of intervals, each with its wcet and number of cache misses. • Since we do not know when each cache miss happens within an interval, we need to identify a worst-case pattern. [Figure: candidate miss patterns within intervals of wcet_1…wcet_5 and CM_1…CM_5 (e.g., CM_i = 4); one of them is actually the worst-case pattern.] • If the peripheral load curve is concave, we obtain a tight bound on the delay D (details are in a technical report). • If the peripheral load curve is not concave, the bound on D is not tight; simulations showed that the upper bound is within 0.2% of the real worst-case delay.
On-line Coscheduling Algorithm • The on-line algorithm: • Non-safety-critical tasks have CPU reservation = wcet (D NOT included!) • At the beginning of each job, the p-gates are closed. • At run time, at each checkpoint the OS sends the APMC count of CPU cycles (exec_i) to the reservation controller. • The reservation controller keeps track of the accumulated slack time. If the slack time Σ_i (wcet_i − exec_i) is greater than the delay D of the next interval, the p-gate is opened (see the sketch below). [Figure: task split into superblocks with wcet_1…wcet_5, whose sum is the task's total wcet.]
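A minimal sketch of this rule as it might run on the reservation controller (names and structure are illustrative, not the authors' implementation):

struct task_resv {
    const double *wcet;    /* per-superblock wcet, I/O delay D NOT included */
    const double *D;       /* per-superblock worst-case I/O-induced delay   */
    int nblocks;
    double slack;          /* accumulated slack; reset to 0 at each job start,
                              with the p-gate initially closed               */
};

/* Called at the checkpoint closing superblock i (0-based), with exec_i
 * reported by the OS from the APMC cycle counter.  Returns nonzero if the
 * p-gate may be opened during the next superblock. */
int pgate_may_open(struct task_resv *t, int i, double exec_i)
{
    t->slack += t->wcet[i] - exec_i;        /* slack gained by finishing early     */
    if (i + 1 >= t->nblocks)
        return 1;                           /* job done: no further delay to cover */
    return t->slack >= t->D[i + 1];         /* slack must cover the next delay     */
}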
Coscheduling algorithm: an example • Initially, slack = 0 ⇒ p-gate closed. • After superblock 1 completes in exec_1: slack += wcet_1 − exec_1; since slack < D_2, the p-gate stays closed during superblock 2 (whose budget with I/O would be wcet_2 + D_2). • After superblock 2 completes in exec_2: slack += wcet_2 − exec_2; now slack ≥ D_3, so the p-gate is opened during superblock 3 (budget wcet_3 + D_3).
System Integration: example for the avionics domain • Class A: safety critical (e.g., flight control) • Class B: mission critical (e.g., radar processing) • Class C: non-critical (e.g., display) • The system is composed of tasks/partitions with different criticalities; each task/partition uses different I/O peripherals. • The right action depends on the task/partition criticality (see the policy sketch below): • Class A: block all non-relevant peripheral traffic (reservation = wcet + D) • Class B: coschedule tasks and peripherals to maximize I/O traffic (reservation = wcet) • Class C: all I/O peripherals are enabled.
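A minimal sketch of the per-class decision, reusing pgate_may_open() from the sketch above; pgate_open_all() and pgate_close_non_relevant() are hypothetical helpers standing in for the reservation controller's commands to the p-gates:

extern void pgate_open_all(void);             /* hypothetical */
extern void pgate_close_non_relevant(void);   /* hypothetical */

enum crit_class { CLASS_A, CLASS_B, CLASS_C };

/* Invoked when the scheduled task/partition reaches a checkpoint. */
void apply_io_policy(enum crit_class c, struct task_resv *t, int block, double exec)
{
    switch (c) {
    case CLASS_A:                      /* safety critical: reservation = wcet + D */
        pgate_close_non_relevant();
        break;
    case CLASS_B:                      /* mission critical: coschedule I/O        */
        if (pgate_may_open(t, block, exec))
            pgate_open_all();
        else
            pgate_close_non_relevant();
        break;
    case CLASS_C:                      /* non critical: all peripherals enabled   */
        pgate_open_all();
        break;
    }
}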
Peripheral Gate • We designed the peripheral gate (or p-gate for short) for the PCI/PCI-X bus: it allows us to control peripheral access to the bus. • The peripheral gate is compatible with COTS devices: its use does not require any modifications to either the peripheral or the motherboard.
Peripheral Gate
[Figure: CPU schedule with processes #1, #2, #3 (all class A) executing in turn; the reservation controller, attached to the CPU/FSB logic and RAM, commands a peripheral gate on each peripheral bus.]
• The reservation controller commands the peripheral gate (p-gate). • The kernel sends scheduling information to the reservation controller. • Minimal kernel modification (send the PID and exec time of the executing process). • Class A task: block all non-relevant peripheral traffic. • Class B task: the reservation controller implements the coscheduling algorithm.
Current Prototype
[Photo: testbed with a logic analyzer for debugging and measurement, the p-gate, a Gigabit Ethernet NIC and the reservation controller (Xilinx FPGA).]
• The testbed uses a standard Intel platform. • The reservation controller is implemented on an FPGA; the p-gate uses a PCI extender card plus discrete logic.
Kernel Implementation • Getting this information requires support from the CPU and the OS. • We used the Architectural Performance Monitor Counters (APMCs) of the Intel Core 2 microarchitecture; other manufacturers (e.g., IBM) have similar support (the implementation is specific, but the lesson is general). • Two APMCs are configured to count cache misses and CPU cycles in user space. • The task descriptor is extended with execution time and cache miss fields. • At each context switch, the APMCs are saved/restored in the descriptors like any other task-specific CPU registers (see the sketch below). • Implemented under Linux/RK.
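A minimal sketch of the context-switch handling, with hypothetical apmc_read()/apmc_write() helpers and counter identifiers standing in for the actual Intel Core 2 counter programming (the real Linux/RK patch is platform specific):

enum apmc_id { APMC_CYCLES, APMC_CACHE_MISSES };

extern unsigned long long apmc_read(enum apmc_id id);                /* hypothetical */
extern void apmc_write(enum apmc_id id, unsigned long long value);   /* hypothetical */

struct task_counters {
    unsigned long long cycles;        /* user-space CPU cycles accumulated so far */
    unsigned long long cache_misses;  /* cache misses accumulated so far          */
};

/* At each context switch the two APMCs are saved into the outgoing task's
 * descriptor and restored from the incoming one, like any other
 * task-specific CPU register. */
void switch_apmc(struct task_counters *prev, struct task_counters *next)
{
    prev->cycles       = apmc_read(APMC_CYCLES);
    prev->cache_misses = apmc_read(APMC_CACHE_MISSES);

    apmc_write(APMC_CYCLES, next->cycles);
    apmc_write(APMC_CACHE_MISSES, next->cache_misses);
}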
Other Coscheduling Algorithms • We compared our adaptive heuristic with other algorithms. • Assumption: at the beginning of each interval, the algorithm chooses whether to open or close the switch for that interval. • Slack-only: baseline comparison; uses only the slack time remaining when the task has finished. • Predictive: • Also uses measured average execution times. • "Predicts" future slack time and optimizes the open intervals at each step. • Computing an optimal allocation is NP-hard, so it uses a fast greedy heuristic instead. • Optimal: • Clairvoyant (not implementable). • Provides an upper bound on the performance of any run-time, predictive algorithm.
The Test • All run-time algorithms were implemented on a Xilinx ML505 FPGA. • The optimal was computed using the Matlab optimization tool. • We used an MPEG decoder as benchmark. • As a trend, video processing is increasingly used in the avionics domain for mission control. • It simulates a Class B application subject to heavy I/O traffic. • The task misses its deadline by up to 30% if I/O traffic is always allowed! • The run-time algorithm is already close to the optimal; there is not much to gain from the improved heuristic. • Results are reported in terms of the percentage of time the p-gate is open.
Simulation Results • We performed synthetic simulations to better understand the performance of the run-time algorithm. • 20 superblocks per task. • α is the variation between wcet and average computation time. • β is the percentage of time the task is stalled due to cache misses. [Figure: results as a function of α and β.]
Improving the P-Gate: Hardware Server (in progress)
[Figure: FPGA-based SoC (CPU, DRAM, OPB interrupt controller, PCI interface) interposed between the peripheral and the PCI host bridge, memory bridge and DDRAM.]
• FPGA-based SoC design with Linux device drivers; currently in development. • Problem: blocking the peripheral reduces its maximum throughput. • Acceptable only if critical tasks/partitions run for a limited amount of time. • Better solution: implement a hardware server with buffering on the SoC. • Transactions are queued in the hw server's memory while non-relevant partitions execute. • Interrupts/DMA transfers are delivered only during the execution of the interested tasks/partitions. • Similar to real-time aperiodic servers: a hw server permits aperiodic I/O requests to be analyzed as if they followed a predictable (periodic) pattern.
Conclusions • A major issue in peripheral integration is the task delay due to cache-peripheral contention at the main memory level. • We proposed a framework to: 1) analyze the delay due to cache-peripheral contention; 2) control task execution times. • The proposed co-scheduling technique was tested with the PCI/PCI-X bus; the hw server will be ready soon. • Future work: extend to multi-processor and distributed systems.