Improving IPC by Kernel Design. Jochen Liedtke. Proceedings of the 14th ACM Symposium on Operating Systems Principles, Asheville, North Carolina, 1993. The Performance of µ-Kernel-Based Systems. H. Haertig, M. Hohmuth, J. Liedtke, S. Schoenberg, J. Wolter
Improving IPC by Kernel Design. Jochen Liedtke. Proceedings of the 14th ACM Symposium on Operating Systems Principles, Asheville, North Carolina, 1993.
The Performance of µ-Kernel-Based Systems. H. Haertig, M. Hohmuth, J. Liedtke, S. Schoenberg, J. Wolter. Proceedings of the 16th Symposium on Operating Systems Principles, October 1997, pp. 66-77.
Jochen Liedtke (1953 – 2001)
• 1977 – Diploma in Mathematics from the University of Bielefeld.
• 1984 – Moved to GMD (German National Research Center for Computer Science). Built L3. Known for overcoming ipc performance hurdles.
• 1996 – IBM T.J. Watson Research Center. Developed L4, a 12 KB second-generation microkernel.
The IPC Dilemma
• Inter-process communication (ipc) by message passing is one of the central paradigms of µ-kernel and client/server architectures.
• It increases modularity, flexibility, security and scalability.
• But most ipc implementations of the time performed poorly (1st-generation µ-kernels such as Mach or Chorus). Really fast message passing was needed to run device drivers and other performance-critical components at user level.
• So programmers started to circumvent ipc, for example by co-locating device drivers and other components back into the kernel.
• To gain acceptance, ipc has to become a very efficient basic mechanism.
What to Do?
• The author sets out to construct a µ-kernel that achieves a tenfold improvement in ipc performance over comparable systems.
• “ipc performance is the master” is a key design principle.
• The result is L3, a µ-kernel-based operating system built at GMD (German National Research Center for Computer Science), and finally L4.
• Use a synergistic approach; no single “silver bullet” exists.
Summary of Techniques: Seventeen Total
Measured Performance Gains
• Note the synergistic effect. For 8-byte ipc:
• 49% + 23% + 21% + 18% + 13% + 10% = 134%
• A figure of 49% means that removing that technique would increase ipc time by 49%.
Standard System Calls (Send, Receive)
• Client (Sender): L4_ipc_send(); system call, enter kernel, exit kernel. The client is not blocked.
• Server (Receiver): L4_ipc_receive(); system call, enter kernel, exit kernel.
• Server (Receiver): L4_ipc_send(); system call, enter kernel, exit kernel.
• Client (Sender): L4_ipc_receive(); system call, enter kernel, exit kernel.
• Kernel entered and exited four times, 107 cycles each time.
Add New System Calls
• Client (Sender): L4_ipc_call(); system call, enter kernel. The processor is allocated to the server and the client is suspended (the client IS blocked).
• Server (Receiver): L4_ipc_reply_and_wait(); resume from being suspended, return to user (exit kernel).
• Server (Receiver): inspect message.
• Server (Receiver): L4_ipc_reply_and_wait(); enter kernel, send reply, wait for next message.
• L4_ipc_receive(); system call. The processor is allocated to the client; exit kernel.
• Kernel entered and exited two times, half as many as with separate send and receive calls.
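The saving from the combined calls can be checked with a small counting model. This is only a sketch in Python, not L4 code: the `KernelCounter` class and both `round_trip_*` functions are hypothetical names, and the only figure taken from the slides is the 107 cycles per kernel entry and exit.

```python
# Toy model: count kernel entries for one request/reply round trip.
# Nothing here calls a real kernel; the names are illustrative assumptions.

KERNEL_ENTRY_EXIT_CYCLES = 107  # per entry/exit pair, as quoted on the slide


class KernelCounter:
    """Counts how often a modeled system call enters the kernel."""

    def __init__(self):
        self.entries = 0

    def syscall(self, description):
        # Every system call is one kernel entry plus one kernel exit.
        self.entries += 1


def round_trip_send_receive():
    """Round trip using separate send and receive calls."""
    k = KernelCounter()
    k.syscall("client: send request")
    k.syscall("server: receive request")
    k.syscall("server: send reply")
    k.syscall("client: receive reply")
    return k.entries


def round_trip_call_reply_and_wait():
    """Round trip using the combined call and reply-and-wait operations."""
    k = KernelCounter()
    k.syscall("client: call (send request, then wait for the reply)")
    k.syscall("server: reply_and_wait (send reply, then wait for next request)")
    return k.entries


for name, fn in [("send/receive", round_trip_send_receive),
                 ("call/reply_and_wait", round_trip_call_reply_and_wait)]:
    entries = fn()
    print(name, entries, "kernel entries,",
          entries * KERNEL_ENTRY_EXIT_CYCLES, "cycles of entry/exit overhead")
```

The model assumes a server in steady state, where the wait half of one reply-and-wait call doubles as the receive for the next request.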
Complex Message Structure
• Combine a sequence of send operations into a single operation by supporting complex messages.
• Benefit: reduces the number of sends.
Direct Transfer by Temporary Mapping
• LRPC and RPC share user-level memory between client and server to transfer messages, but this may affect security.
• Other µ-kernels transfer messages by a twofold copy: from process A's space into kernel space, then into process B's space.
• L4 provides single-copy transfers by temporarily sharing the target region with the sender.
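The difference between the twofold copy and the single-copy transfer can be illustrated in user-space Python. This is a loose analogy, not the kernel mechanism: a `memoryview` over the receiver's buffer stands in for the temporary mapping, and both function names are hypothetical.

```python
def two_copy_transfer(sender_buf, kernel_buf, receiver_buf):
    """Copy sender -> kernel buffer -> receiver; returns bytes copied."""
    n = len(sender_buf)
    kernel_buf[:n] = sender_buf        # copy 1: sender space into kernel space
    receiver_buf[:n] = kernel_buf[:n]  # copy 2: kernel space into receiver space
    return 2 * n


def single_copy_transfer(sender_buf, receiver_buf):
    """Copy sender -> receiver through a temporary window; returns bytes copied."""
    n = len(sender_buf)
    # The memoryview plays the role of the temporarily mapped target region:
    # writes through it land directly in the receiver's buffer.
    window = memoryview(receiver_buf)
    try:
        window[:n] = sender_buf        # the only copy
    finally:
        window.release()               # the "mapping" is dropped after the transfer
    return n


message = bytes(range(256)) * 16       # a 4 KB message
kernel = bytearray(len(message))
dest_a = bytearray(len(message))
dest_b = bytearray(len(message))

print(two_copy_transfer(message, kernel, dest_a), "bytes moved with two copies")
print(single_copy_transfer(message, dest_b), "bytes moved with one copy")
```

Both paths deliver the same data; the single-copy path moves half as many bytes and never stores the message in an intermediate buffer.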
Scheduling, Conventional
• Conventionally, the ipc operations call and reply & receive require scheduling actions:
• Delete the sending thread from the ready queue.
• Insert the sending thread into the waiting queue.
• Delete the receiving thread from the waiting queue.
• Insert the receiving thread into the ready queue.
• These operations, together with 4 expected TLB misses, take at least 1.2 µs (23%T).
Solution, Lazy Scheduling
• Conventional ipc requires updating the thread scheduler queues on every operation.
• Performance can be improved by delaying the movement of threads within and between queues until the queues are queried.
• This "lazy" scheduling is achieved by setting state flags (ready / waiting) in the thread control blocks (tcb, which holds basic information about a thread) and then scanning the queues at query time for threads that should be moved to different queues.
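The idea can be sketched as a toy scheduler in Python. This is a simplified model under stated assumptions, not L3 or L4 code: the `TCB` and `LazyScheduler` classes, the `queued` flag, and the single ready queue are all illustrative choices. The point it shows is that ipc only flips state flags, and stale queue entries are cleaned up when the scheduler is next queried.

```python
from collections import deque

READY, WAITING = "ready", "waiting"


class TCB:
    """Minimal thread control block: a name, a state flag, a queue marker."""

    def __init__(self, name, state=READY):
        self.name = name
        self.state = state
        self.queued = False  # is this tcb currently linked into the ready queue?


class LazyScheduler:
    def __init__(self):
        self.ready_queue = deque()
        self.current = None
        self.queue_ops = 0   # counts inserts and deletes on the ready queue

    def enqueue(self, tcb):
        if not tcb.queued:
            self.ready_queue.append(tcb)
            tcb.queued = True
            self.queue_ops += 1

    def ipc_call(self, sender, receiver):
        # Lazy part: only the flags change, no queue is touched.
        sender.state = WAITING
        receiver.state = READY
        self.current = receiver  # direct switch to the receiver

    def pick_next(self):
        # Query time: repair the queue, then scan it.
        if self.current is not None and self.current.state == READY:
            self.enqueue(self.current)
        while self.ready_queue:
            tcb = self.ready_queue.popleft()
            tcb.queued = False
            self.queue_ops += 1
            if tcb.state == READY:
                self.current = tcb
                return tcb
            # Otherwise the entry was stale (thread is waiting): drop it now.
        self.current = None
        return None


sched = LazyScheduler()
client, server = TCB("client"), TCB("server", WAITING)
sched.current = client
for _ in range(1000):                 # 1000 round trips
    sched.ipc_call(client, server)    # request
    sched.ipc_call(server, client)    # reply
print("queue operations during ipc:", sched.queue_ops)
print("next thread:", sched.pick_next().name)
```

In this model the thousand round trips cost no queue operations at all; the queue is only touched when `pick_next` runs, for instance at the end of a time slice.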
Pass Short Messages in Registers
• Typically, a high proportion of messages are very short: 8 bytes (plus 8 bytes of sender id). Examples are ack/error replies from device drivers or hardware-initiated interrupt messages.
• The 486 processor has enough registers to allow direct transfer of short messages via cpu registers.
• Performance gain of 2.4 µs or 48%T.
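As a rough illustration, an 8-byte payload fits exactly into two 32-bit register values on a 32-bit processor such as the 486. The Python sketch below only models that packing; the function names are hypothetical, and which registers the kernel actually uses is not stated in the slides.

```python
import struct

SHORT_MSG_BYTES = 8  # short-message size quoted on the slide


def pack_into_registers(payload):
    """Return two 32-bit values for a short payload, or None if it is too long."""
    if len(payload) > SHORT_MSG_BYTES:
        return None  # longer messages would have to go through memory
    padded = payload.ljust(SHORT_MSG_BYTES, b"\x00")
    # "<II": two little-endian unsigned 32-bit words, as on x86.
    return struct.unpack("<II", padded)


def unpack_from_registers(regs, length):
    """Rebuild the first `length` payload bytes from the two register values."""
    return struct.pack("<II", *regs)[:length]


ack = b"ACK\x00\x00\x00\x00\x01"           # an 8-byte driver reply (made-up content)
regs = pack_into_registers(ack)
print([hex(r) for r in regs])
print(unpack_from_registers(regs, len(ack)) == ack)
print(pack_into_registers(b"x" * 9))        # too long for the register path
```

A message that fits takes the register path and needs no memory copy; anything longer falls back to the memory-based transfer.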
IPC Performance
• For an 8-byte message, ipc time for L3 is 5.2 µs compared to 115 µs for Mach, a 22-fold improvement.
• For a large message (4 KB), a 3-fold improvement is seen.
Conclusion
• Use a synergistic approach to achieve greater ipc performance; a single “silver bullet” may not exist.
• A thorough understanding of the interaction between the hardware architecture and the operating system is key to many of the improvements. As a consequence, microkernels are not portable between hardware architectures.
• L4 demonstrated the viability of running applications on top of a µ-kernel.