Explore the features and benefits of EMERALDS, a small-memory real-time microkernel for embedded applications. This paper covers task scheduling, efficient semaphore implementation, intertask communication, memory protection, and system calls.
EMERALDS: A Small-Memory Real-Time Microkernel • Khawar M. Zuberi, Padmanabhan Pillai, and Kang G. Shin, University of Michigan • presented September 22, 2005 by Seo, Dongmahn
Details in the more recent journal paper • EMERALDS: A Small-Memory Real-Time Microkernel • Khawar M. Zuberi, Microsoft Corp.; Kang G. Shin, University of Michigan • IEEE Trans. on Software Engineering, vol. 27, no. 10, pp. 909-928, October 2001
Contents • Introduction • Embedded Application Requirements • Overview of EMERALDS • What Makes EMERALDS Different? • Combined Static/Dynamic Scheduler • Efficient Semaphore Implementation • State Messages for Intertask Communication • Memory Protection and System Calls • Performance Evaluation • Conclusions
Introduction • Real-time computing systems require predictability • a real-time operating system (RTOS) schedules real-time tasks so they meet their deadlines • higher-priority tasks must not be blocked by lower-priority tasks or by communication activities • RTOSs target a wide variety of real-time environments • real-time applications range from multimedia to industrial automation control • hardware ranges from single-board computers to distributed systems to multiprocessors
Introduction (cont) • RTOSs • Commercial RTOSs • pSOS • QNX • VxWorks • Research RTOSs • for multiprocessors • HARTOS • Spring Kernel • for distributed platforms • Harmony • RT-Mach
Introduction (cont) • real-time computing today is no longer limited to high-powered, expensive applications • slow processors with only tens of kilobytes of memory • slow fieldbus networks with 1~2 Mbit/s bandwidth • two main reasons for using such restricted hardware • to keep production costs down in mass-produced items such as home and portable electronics and automotive control (automotive engine and ABS controllers, cellular phones, camcorders) • to keep weight and power consumption low in avionics and space applications
Introduction (cont) • the RTOS kernel must fit in about 20 Kbytes • services: task scheduling, system calls, interrupt handling • all with minimal overheads • EMERALDS • an RTOS for small-memory embedded systems • achieves efficiency by relying not on carefully-crafted code but on new OS schemes and algorithms • focuses on key OS services • task scheduling, semaphores, intra-node message-passing, memory protection and system call overhead
Contents • Introduction • Embedded Application Requirements • Overview of EMERALDS • What Makes EMERALDS Different? • Combined Static/Dynamic Scheduler • Efficient Semaphore Implementation • State Messages for Intertask Communication • Memory Protection and System Calls • Performance Evaluation • Conclusions
Embedded Application Requirements • target embedded applications • use single-chip microcontrollers with slow processing cores running at 15~25 MHz (Motorola 68332, Intel i960) • all ROM and RAM is on-chip, limited to 32~128 Kbytes • uniprocessor or distributed over 5~10 nodes • the RTOS must provide • task scheduling • task synchronization (semaphores) • task communication (message-passing) • memory protection • interaction with the external environment (interrupt handling) • clock and timer services
Task Scheduling • periodic tasks • each OS scheduling operation takes 50~100 us on a slow processor • typically 10~20 tasks, with 3~5 tasks having periods less than 10 ms • scheduling can consume 10~15% of CPU time • problems with static, table-driven (cyclic) schedulers • schedules must be calculated by hand, which is difficult and makes them costly to modify • heuristics can be used, but give non-optimal solutions • cyclic schedulers give poor response times for high-priority aperiodic tasks • mixing short- and long-period tasks makes the schedule consume significant amounts of memory
Task Scheduling (cont) • priority-driven schedulers • rate-monotonic (RM) • earliest-deadline-first (EDF) • no off-line analysis needed; easy to handle changes and aperiodic tasks • but run-time overhead can still reach 10~15% of the CPU time
Task Synchronization • OOP is ideal for designing real-time software • objects model physical entities • internal data: the physical state of the entity • temperature, pressure, position, RPM, etc. • methods: read or modify the state • sensors, actuators, and controllers are modeled by objects • real-time software is a collection of threads of execution invoking the methods of various objects • mutual exclusion • semaphores with acquire and release operations • EMERALDS uses new and efficient schemes for implementing semaphore locking
Task Communication • traditional mechanism: mailboxes • two major disadvantages • 50~100 us overhead for each message, while several thousand messages per second may be needed • no way to send one message to multiple receivers • as a result, application designers fall back on global variables, which lead to subtle, hard-to-trace bugs in the software • EMERALDS provides new mechanisms for intertask communication • the state message paradigm: protected global variables • an optimized basic state message scheme reduces execution overhead and memory consumption
Memory Protection • providing memory protection requires • maintaining page tables • programming the memory management unit • problems • it increases the size of the kernel • it adds overhead to several kernel services • in embedded systems • all processes are cooperative and will never deliberately try to harm another process • BUT a bug in application code can still corrupt memory • with protection, the bug traps to the kernel and recovery action can be taken • this provides software fault-tolerance • in EMERALDS • the kernel is mapped into each user-level address space
Contents • Introduction • Embedded Application Requirements • Overview of EMERALDS • What Makes EMERALDS Different? • Combined Static/Dynamic Scheduler • Efficient Semaphore Implementation • State Messages for Intertask Communication • Memory Protection and System Calls • Performance Evaluation • Conclusions
Overview of EMERALDS • microkernel RTOS written in C++ • EMERALDS' salient features • Multi-threaded processes • Full memory protection between processes • Threads scheduled by the kernel • IPC based on message-passing and mailboxes; shared-memory support • Optimized local message passing • Semaphores and condition variables for synchronization; priority inheritance for semaphores • Support for communication protocol stacks • Highly optimized context switching and interrupt handling • Support for user-level device drivers
Overview of EMERALDS (cont) • small kernel: less than 20 Kbytes • no file system; data is kept only in memory • no naming services • nodes exchange short, simple messages over fieldbuses, talking directly to network device drivers • no built-in protocol stack • the result is just 13 Kbytes of code
Contents • Introduction • Embedded Application Requirements • Overview of EMERALDS • What Makes EMERALDS Different? • Combined Static/Dynamic Scheduler • Efficient Semaphore Implementation • State Messages for Intertask Communication • Memory Protection and System Calls • Performance Evaluation • Conclusions
What Makes EMERALDS Different? • general-purpose microkernels • Mach, L3, SPIN • focus on optimizing kernel services: thread management, IPC, virtual memory management • EMERALDS • has no virtual memory, so its sources of overhead differ from a general-purpose OS • thread management: mechanisms are the same as in a general-purpose OS • system calls enter protected kernel mode and call a kernel procedure, so EMERALDS needs a low-overhead transition between user and kernel modes • scheduling: provide efficient real-time scheduling of threads • IPC: inter-node networking is done at user level • task synchronization: the interest is in uniprocessor locking
Contents • Introduction • Embedded Application Requirements • Overview of EMERALDS • What Makes EMERALDS Different? • Combined Static/Dynamic Scheduler • Efficient Semaphore Implementation • State Messages for Intertask Communication • Memory Protection and System Calls • Performance Evaluation • Conclusions
Combined Static/Dynamic Scheduler • task scheduler overhead • run-time overhead: the time consumed by execution of scheduler code • schedulability overhead: 1 - U*, where U* is the ideal schedulable utilization • EDF: U* = 1, but high run-time overhead • RM: U* = 0.80 • comparison between static- and dynamic-priority schedulers • dynamic priorities are better for aperiodic tasks • static priorities are better for guaranteeing completion of critical tasks under processor overload
Run-time Overhead • run-time overhead comes from • parsing queues of tasks • adding/deleting tasks from queues • blocking overhead ∆tb • selection overhead ∆ts • unblocking overhead ∆tu • each task τi incurs ∆tb + ∆tu + 2∆ts of run-time overhead every period Pi • so the total run-time overhead, expressed as a CPU utilization, is the sum over all n tasks of (∆tb + ∆tu + 2∆ts)/Pi
Run-time Overhead (cont) • under EDF, ∆ts = O(n) and is incurred twice • under RM, ∆tb = O(n) and is incurred once • ∆ts is less for RM than it is for EDF • the difference matters especially when n is large (20 or more) • a small worked example follows below
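To get a feel for the 10~15% figure quoted earlier, here is a back-of-the-envelope calculation of the overhead-utilization sum from the previous slide. The task periods and the 80 us per-period scheduling cost are purely illustrative assumptions, not measurements from the paper.

```c
#include <stdio.h>

int main(void)
{
    /* hypothetical periods (in microseconds) of 15 periodic tasks */
    double period_us[] = {  5000,   5000,  10000,  10000,  10000,
                           20000,  20000,  20000,  50000,  50000,
                           50000, 100000, 100000, 100000, 100000 };
    /* assumed cost of dt_b + dt_u + 2*dt_s on a slow processor */
    double per_period_overhead_us = 80.0;
    double utilization = 0.0;

    /* total overhead utilization = sum over tasks of overhead / period */
    for (unsigned i = 0; i < sizeof period_us / sizeof period_us[0]; i++)
        utilization += per_period_overhead_us / period_us[i];

    /* prints about 0.076, i.e. roughly 8% of the CPU spent in the
     * scheduler; shorter periods or a slower CPU push this to 10~15% */
    printf("scheduler overhead utilization: %.3f\n", utilization);
    return 0;
}
```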
CSD: a Balance between EDF and RM • the Combined Static/Dynamic (CSD) scheduler combines EDF and RM • its run-time overhead is less than that of EDF and only a little more than that of RM • two queues of tasks • a dynamic-priority (DP) queue scheduled by EDF • a fixed-priority (FP) queue scheduled by RM • (a sketch of the next-task selection follows below)
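The following is a minimal sketch of CSD's two-queue structure and next-task selection, assuming the DP queue holds the shorter-period tasks and takes precedence over the FP queue. The type names and helper functions are illustrative assumptions, not EMERALDS' actual code.

```c
/* Minimal CSD-style scheduler sketch: a deadline-ordered DP queue (EDF)
 * and a priority-ordered FP queue (RM); the DP queue is checked first. */

typedef struct task {
    struct task  *next;
    unsigned long deadline;   /* absolute deadline, used in the DP queue  */
    unsigned      rm_prio;    /* fixed RM priority, used in the FP queue  */
    int           ready;      /* blocked tasks stay in the queue          */
} task_t;

typedef struct {
    task_t *dp_head;   /* r shorter-period tasks, kept sorted by deadline    */
    task_t *fp_head;   /* remaining n - r tasks, kept sorted by RM priority  */
} csd_sched_t;

/* Return the first ready task in a sorted queue, skipping blocked tasks. */
static task_t *first_ready(task_t *head)
{
    for (task_t *t = head; t != NULL; t = t->next)
        if (t->ready)
            return t;
    return NULL;
}

/* Pick the next task to run: the earliest-deadline ready DP task if any,
 * otherwise the highest-priority ready FP task.                          */
task_t *csd_pick_next(csd_sched_t *s)
{
    task_t *t = first_ready(s->dp_head);   /* short scan of the DP queue  */
    if (t != NULL)
        return t;
    return first_ready(s->fp_head);        /* FP tasks run only otherwise */
}
```

Keeping the DP queue short is what keeps this selection scan cheap, in line with the O(r) selection overhead listed on the next slide.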
Run-Time Overhead of CSD • zero schedulability overhead for CSD • four cases for run-time overhead • a DP task blocks: ∆ts = O(r), ∆tb = O(1) • a DP task unblocks: ∆ts = O(r), ∆tu = O(1) • an FP task blocks: ∆ts = O(1), ∆tb = O(n-r) • an FP task unblocks: ∆ts = O(r), ∆tu = O(1) • total scheduler overhead for CSD • ∆tb + ∆ts_block + ∆tu + ∆ts_unblock per task block/unblock operation • for DP tasks: O(1) + O(r) + O(1) + O(r) = 2O(r) • for FP tasks: O(n-r) + O(1) + O(1) + O(r) = O(n) • significantly less than that of EDF • slightly greater than that of RM
Schedulability Test • EDF • RM • CSD: start by assuming r = 0 and perform the schedulability test • if it succeeds, stop; otherwise keep increasing r • (a sketch of this search follows below)
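Below is a minimal sketch of the search just described: grow r from 0 until the task set passes a schedulability test. The csd_schedulable() function shown here is a deliberately simplified, utilization-only stand-in that ignores cross-queue interference and scheduler overheads; the paper's actual per-queue EDF and RM tests are more involved.

```c
/* Sketch of the CSD configuration step: find the smallest DP-queue size r
 * that makes the task set schedulable. Assumes tasks[] is sorted by
 * increasing period, so the r shortest-period tasks go to the DP queue. */

typedef struct {
    double wcet;     /* worst-case execution time C_i */
    double period;   /* period P_i                    */
} csd_task_t;

/* Placeholder test: tasks [0, r) form the DP (EDF) queue, tasks [r, n) the
 * FP (RM) queue; accept if each queue stays under a utilization bound.
 * This ignores interference between the queues, which the real tests
 * account for; it only illustrates the shape of the search.             */
static int csd_schedulable(const csd_task_t *tasks, int n, int r)
{
    double u_dp = 0.0, u_fp = 0.0;
    for (int i = 0; i < n; i++) {
        double u = tasks[i].wcet / tasks[i].period;
        if (i < r) u_dp += u; else u_fp += u;
    }
    return u_dp <= 1.0 && u_fp <= 0.80;   /* EDF-like and RM-like bounds */
}

/* Return the smallest r that passes the test, or -1 if none does. */
int csd_find_r(const csd_task_t *tasks, int n)
{
    for (int r = 0; r <= n; r++)
        if (csd_schedulable(tasks, n, r))
            return r;
    return -1;
}
```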
Reducing Run-Time Overhead of CSD • main advantage of CSD • EDF-like schedulable utilization, achieved by keeping the DP queue short • if the workload increases • the length of the DP queue also increases, which degrades the performance of CSD • modified CSD • keeps run-time overhead under control as the number of tasks n increases
Reducing Run-Time Overhead of CSD (cont) • Controlling DP Queue Run-Time Overhead • split the DP queue into two queues, DP1 and DP2 • the result is called CSD-3, since it uses three queues • Run-Time Overhead of CSD-3
Reducing Run-Time Overhead of CSD (cont) • Allocating Tasks to DP1 and DP2 • two factors • balancing the two queues • balancing the run-time overhead and scheduling overhead between queues • an exhaustive search finds the best possible allocation of tasks to DP1, DP2, and FP • it runs the schedulability test O(n²) times for three queues • 2~3 minutes on a 167 MHz Ultra-1 Sun workstation for a workload with 100 tasks
Schedulability Test for CSD-3 • EDF test for DP1 • EDF test for DP2 • fixed-priority test for FP
Beyond CSD-3 • CSD can be extended to 4, 5, …, n queues • finding the best number of queues and the best number of tasks per queue is a computationally-intensive task • this shows the usefulness of the general CSD scheduling framework • beneficial in real systems
Contents • Introduction • Embedded Application Requirements • Overview of EMERALDS • What Makes EMERALDS Different? • Combined Static/Dynamic Scheduler • Efficient Semaphore Implementation • State Messages for Intertask Communication • Memory Protection and System Calls • Performance Evaluation • Conclusions
Efficient Semaphore Implementation • EMERALDS provides full semaphore semantics with priority inheritance • and optimizes the implementation of these semaphores by exploiting certain features of embedded applications
Standard Semaphore Implementation • standard procedure to lock a semaphore: if (sem locked) { do priority inheritance; add caller thread to wait queue; block; /* wait for sem to be released */ } lock sem; (a concrete sketch follows below) • under EDF, the main cost is context switch overhead • the focus is on eliminating one or more context switches • under FP scheduling, the main cost is the priority inheritance (PI) operations • the optimization efforts focus on the PI operations
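Here is a minimal C sketch of the "standard" locking procedure shown above, with basic priority inheritance. The types and helper functions (thread_t, enqueue_by_priority, block_current, and so on) are assumptions made for illustration, not EMERALDS' API; the sketch assumes a uniprocessor with interrupts disabled inside the kernel and that a larger number means a higher priority.

```c
/* Sketch of a standard semaphore acquire with priority inheritance. */

typedef struct thread thread_t;

typedef struct {
    int        locked;
    thread_t  *holder;      /* thread currently holding the semaphore */
    thread_t  *wait_queue;  /* threads blocked on this semaphore      */
} sem_t;

struct thread {
    int        priority;       /* current (possibly inherited) priority */
    int        base_priority;  /* priority to restore on release        */
    thread_t  *next;
};

/* Assumed kernel primitives, not shown here. */
extern thread_t *current_thread(void);
extern void enqueue_by_priority(thread_t **q, thread_t *t);
extern void block_current(void);   /* yields the CPU; returns when woken */

void acquire_sem(sem_t *s)
{
    thread_t *self = current_thread();

    if (s->locked) {
        /* Priority inheritance: boost the holder so it cannot be
         * preempted by medium-priority threads while we wait.       */
        if (s->holder->priority < self->priority)
            s->holder->priority = self->priority;

        enqueue_by_priority(&s->wait_queue, self);
        block_current();           /* a context switch happens here   */
    }
    s->locked = 1;
    s->holder = self;
}
```

The costs the slide points at are visible here: when the semaphore is held, the caller pays for a priority-inheritance operation plus a context switch inside block_current, and another switch when it is later unblocked; EMERALDS' optimizations target exactly these costs.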
Implementation in EMERALDS • eliminate a context switch • a code parser adds an extra parameter to the semaphore call • optimize the first PI operation • based on an observation about parsing the FP queue • optimize the second PI operation • switch the thread's queue position when inheriting during the first PI operation
Applicability of the New Scheme • problems • a task may miss its deadline • the context switch is not always saved, in which case no benefit comes out of the semaphore scheme • these problems can be resolved • modification to the semaphore scheme • check whether the semaphore is available or not • a special queue is associated with the semaphore • block before acquire_sem() • unblock after release_sem()
Applicability of the New Scheme (cont) • Applicability under Various Blocking Situations • two types of blocking • blocking for internal events • blocking for external events, which can be periodic or aperiodic
Contents • Introduction • Embedded Application Requirements • Overview of EMERALDS • What Makes EMERALDS Different? • Combined Static/Dynamic Scheduler • Efficient Semaphore Implementation • State Messages for Intertask Communication • Memory Protection and System Calls • Performance Evaluation • Conclusions
State Messages for Intertask Communication • global variables • ideal for sharing information between tasks • but they can cause subtle bugs in the application code • state messages • use global variables to pass messages • managed by code generated automatically by a software tool • presented through a mailbox-based message-passing interface • not a replacement for traditional message-passing, but an efficient alternative to it
State-Message Semantics • state messages solve the single-writer, multiple-reader communication problem • the message areas are called SMmailboxes • how SMmailboxes differ from ordinary mailboxes • each is associated with its writer • only one writer, but multiple readers • a new message overwrites the previous message • reads do not consume messages • reads and writes are non-blocking, which reduces context switches
Usefulness • of any two messages, the later one is more recent and up-to-date • one SMmailbox is associated with the single task that writes it • a reader task always gets the most recent message, each time and without blocking • the value read is valid, up-to-date, and useful • this fits the single-writer, multiple-reader situation • blocking read operations are still necessary when a task must wait for an event to occur • for those, use traditional message-passing and/or semaphores
Previous Work • state messages were used in the MARS OS and ERCOS • the half-written message problem • solved by using an N-deep circular buffer for each state message • the writer posts each message to a new slot and updates the latest-message pointer • readers follow the latest-message pointer • a large N consumes too much memory • the goal is to reduce N to no more than 5~10 in all possible cases
Implementation of State Messages in EMERALDS • let B be the maximum number of bytes the CPU can read or write in one operation • here B = 4 bytes • for a message of length L • the case L ≤ B is simple: a single read or write suffices • for L > B • attach an N-deep circular buffer to each state message • each slot in the buffer is L bytes long • an index I identifies the most recently written slot • Calculating Buffer Depth: N = max(2, xmax + 1) • slow readers use a system call instead • (a sketch of the buffer follows below)
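The following is a minimal sketch of the L > B case: an N-deep circular buffer whose index the writer updates only after filling a fresh slot. The names (sm_t, sm_write, sm_read) and the exact update order are illustrative assumptions, not EMERALDS' actual code; the sketch assumes a single writer and that reads and writes of the index itself are atomic.

```c
/* Sketch of an N-deep state-message buffer for messages longer than the
 * CPU's atomic word size (L > B).                                       */

#include <string.h>

#define N  4      /* buffer depth, e.g. max(2, xmax + 1) for this workload */
#define L  16     /* message length in bytes (L > B)                       */

typedef struct {
    volatile unsigned idx;     /* index of the most recently written slot  */
    char slot[N][L];           /* N-deep circular buffer, L bytes per slot */
} sm_t;

/* Writer: fill the next slot, then publish it by updating the index. */
void sm_write(sm_t *sm, const char msg[L])
{
    unsigned next = (sm->idx + 1) % N;
    memcpy(sm->slot[next], msg, L);   /* never touches the published slot */
    sm->idx = next;                   /* single store publishes the slot  */
}

/* Reader: copy out the latest slot; never blocks and never sees a
 * half-written message as long as the writer cannot lap the reader.   */
void sm_read(const sm_t *sm, char out[L])
{
    unsigned i = sm->idx;
    memcpy(out, sm->slot[i], L);
}
```

A reader preempted long enough for the writer to wrap around could still read a torn message, which is why the buffer depth is chosen as N = max(2, xmax + 1) and why, per the slide above, slow readers fall back to a system call.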
Contents • Introduction • Embedded Application Requirements • Overview of EMERALDS • What Makes EMERALDS Different? • Combined Static/Dynamic Scheduler • Efficient Semaphore Implementation • State Messages for Intertask Communication • Memory Protection and System Calls • Performance Evaluation • Conclusions
Performance Evaluation • EMERALDS was implemented on the Motorola 68040 processor • 13 Kbytes of code • 25 MHz clock, 5 MHz on-chip timer • ported to • PowerPC 505 • Super Hitachi 2 (SH2) • Motorola 68332 microcontroller • evaluated by the Scientific Research Laboratory of Ford Motor Company • the evaluation focuses on basic OS overheads, comparing EMERALDS with 9 commercial RTOSs