1 / 26

Cruz: Application-Transparent Distributed Checkpoint-Restart on Standard Operating Systems

Cruz: Application-Transparent Distributed Checkpoint-Restart on Standard Operating Systems. G. (John) Janakiraman, Jose Renato Santos, Dinesh Subhraveti § , Yoshio Turner HP Labs §: Currently at Meiosys, Inc. Broad Opportunity for Checkpoint-Restart in Server Management.

cosima
Download Presentation

Cruz: Application-Transparent Distributed Checkpoint-Restart on Standard Operating Systems

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Cruz:Application-Transparent Distributed Checkpoint-Restart on Standard Operating Systems G. (John) Janakiraman, Jose Renato Santos, Dinesh Subhraveti§, Yoshio Turner HP Labs §: Currently at Meiosys, Inc.

  2. Broad Opportunity for Checkpoint-Restart in Server Management • Fault tolerance (minimize unplanned downtime) • Recover by restarting from checkpoint • Minimize planned downtime • Migrate application before hardware/OS maintenance • Resource management • Manage resource allocation in shared computing environments by migrating applications

  3. Need for General-Purpose Checkpoint-Restart • Existing checkpoint-restart methods are too limited: • No support for many OS resources that commercial applications use (e.g., sockets) • Limited to applications using specific libraries • Require application source and recompilation • Require use of specialized operating systems • Need a practical checkpoint-restart mechanism that is capable of supporting a broad class of applications

  4. Cruz: Our Solution for General-Purpose Checkpoint-Restart on Linux • Application-transparent: supports applications without modifications or recompilation • Supports a broad class of applications (e.g., databases, parallel MPI apps, desktop apps) • Comprehensive support for user-level state, kernel-level state, and distributed computation and communication state • Supported on unmodified Linux base kernel – checkpoint-restart integrated via a kernel module

  5. Cruz Overview Builds on Columbia Univ.’s Zap process migration Our Key Extensions • Support for migrating networked applications, transparent to communicating peers • Enables role in managing servers running commercial applications (e.g., databases) • General method for checkpoint-restart of TCP/IP-based distributed applications • Also enables efficiencies compared to library-specific approaches

  6. Outline • Zap (Background) • Migrating Networked Applications • Network Address Migration • Communication State Checkpoint and Restore • Checkpoint-Restart of Distributed Applications • Evaluation • Related Work • Future Work • Summary

  7. Applications Pods Zap Linux Linux System calls Zap (Background) • Process migration mechanism • Kernel module implementation • Virtualization layer groups processes into Pods with private virtual name space • Intercepts system calls to expose only virtual identifiers (e.g., vpid) • Preserves resource names and dependencies across migration • Mechanism to checkpoint and restart pods • User and kernel-level state • Primarily uses system call handlers • File system not saved or restored (assumes a network file system)

  8. Outline • Zap (Background) • Migrating Networked Applications • Network Address Migration • Communication State Checkpoint and Restore • Checkpoint-Restart of Distributed Applications • Evaluation • Related Work • Future Work • Summary

  9. Migrating Networked Applications • Migration must be transparent to remote peers to be useful in server management scenarios • Peers, including unmodified clients, must not perceive any change in the IP address of the application • Communication state of live connections must be preserved • No prior solution for these (including original Zap) • Our Solution: • Provide unique IP address to each pod that persists across migration • Checkpoint and restore the socket control state and socket data buffer state of all live sockets

  10. Network Address Migration 3. dhcprequest(MAC-p1) DHCP Server Pod DHCP Client • Pod attached to virtual interface with own IP & MAC addr. • Implemented by using Linux’s virtual interfaces (VIFs) • IP address assigned statically or through a DHCP client running inside the pod (using pod’s MAC address) • Intercept bind() & connect() to ensure pod processes use pod’s IP address • Migration: delete VIF on source host & create on new host • Migration limited to subnet 4. dhcpack(IP-p1) 2. MAC-p1 1. ioctl() Network eth0:1 eth0 [IP-1, MAC-h1]

  11. Communication State Checkpoint and Restore Communication state: • Control: Socket data structure, TCP connection state • Data: contents of send and receive socket buffers Challenges in communication state checkpoint and restore: • Network stack will continue to execute even after application processes are stopped • No system call interface to read or write control state • No system call interface to read send socket buffers • No system call interface to write receive socket buffers • Consistency of control state and socket buffer state

  12. Control Control direct access Timers, Options, etc. Timers, Options, etc. X X copied_seq write_seq Rh St+1 Rt+1 Sh Rh St+1 snd_una rcv_nxt Rt+1 Sh Rt+1 Sh direct access Rh St Rh St receive() . . . . . . Sh Rt Send buffers Recv buffers Rt Sh Recv buffers Send buffers Live Communication State Communication State Checkpoint • Acquire network stack locks to freeze TCP processing • Save receive buffers using socket receive system call in peek mode • Save send buffers by walking kernel structures • Copy control state from kernel structures • Modify two sequence numbers in saved state to reflect empty socket buffers • Indicate current send buffers not yet written by application • Indicate current receive buffers all consumed by application Checkpoint State State for one socket Note: Checkpoint does not change live communication state

  13. To App by intercepted receive system call Control direct update Control Timers, Options, etc. Timers, Options, etc. Rt+1 Sh St+1 Rt+1 Sh copied_seq write_seq Rt+1 snd_una rcv_nxt Rt+1 Sh write() Rh Rh St St . . . Sh Rt Rt Recv data Send buffers Recv buffers Sh Send buffers Checkpoint State Live Communication State Communication State Restore • Create a new socket • Copy control state in checkpoint to socket structure • Restore checkpointed send buffer data using the socket write call • Deliver checkpointed receive buffer data to application on demand • Copy checkpointed receive buffer data to a special buffer • Intercept receive system call to deliver data from special buffer until buffer is emptied Sh State for one socket

  14. Outline • Zap (Background) • Migrating Networked Applications • Network Address Migration • Communication State Checkpoint and Restore • Checkpoint-Restart of Distributed Applications • Evaluation • Related Work • Future Work • Summary

  15. Processes Processes Processes TCP/IP TCP/IP TCP/IP Node Node Node Checkpoint-Restart of Distributed Applications Checkpoint • State of processes and messages in channel must be checkpointed and restored consistently • Prior approaches specific to particular library – e.g., modify library to capture and restore messages in channel • Cruz preserves TCP connection state and IP addresses of each pod, implicitly preserving global communication state • Transparently supports TCP/IP-based distributed applications • Enables efficiencies compared to library-based implementations Library Library Library Communication Channel

  16. Pod (processes) Pod (processes) Pod (processes) TCP/IP TCP/IP TCP/IP Node Node Node Checkpoint-Restart of Distributed Applications in Cruz Checkpoint • Global communication state saved and restored by saving and restoring TCP communication state for each pod • Messages in flight need not be saved since the TCP state will trigger retransmission of these messages at restart • Eliminates O(N2) step to flush channel for capturing messages in flight • Eliminates need to re-establish connections at restart • Preserving pod’s IP address across restart eliminates need to re-discover process locations in library at restart Library Library Library Communication Channel

  17. Consistent Checkpoint Algorithm in Cruz (Illustrative) <checkpoint> <checkpoint> • Algorithm has O(N) complexity (blocking algorithm shown for simplicity) • Can be extended to improve robustness and performance, e.g.: • Tolerate Agent & Coordinator failures • Overlap computation and checkpointing using copy-on-write • Allow nodes to continue without blocking for all nodes to complete checkpoint • Reduce checkpoint size with incremental checkpoints • Disable pod comm§ • Save pod state • Disable pod comm • Save pod state Agent Coordinator Agent Pod Pod <done> Library Library Node Node Node <done> TCP/IP TCP/IP <continue> <continue> • Enable pod comm • Enable pod comm • Resume pod • Resume pod §: using netfilter rules in Linux <continue-done> <continue-done>

  18. Outline • Zap (Background) • Migrating Networked Applications • Network Address Migration • Communication State Checkpoint and Restore • Checkpoint-Restart of Distributed Applications • Evaluation • Related Work • Future Work • Summary

  19. Evaluation • Cruz implemented for Linux 2.4.x on x86 • Functionality verified on several applications, e.g., MySQL, K Desktop Environment, and a multi-node MPI benchmark • Cruz incurs negligible runtime overhead (less than 0.5%) • Initial study shows performance overhead of coordinating checkpoints is negligible, suggesting the scheme is scalable

  20. Performance Result – Negligible Coordination Overhead • Checkpoint behavior for Semi-Lagrangian atmospheric model benchmark in configurations from 2 to 8 nodes • Negligible latency in coordinating checkpoints (time spent in non-local operations) suggests scheme is scalable • Coordination latency of 400-500 microseconds is a small fraction of the overall checkpoint latency of about 1 second

  21. Related Work • MetaCluster product from Meiosys • Capabilities similar to Cruz (e.g., checkpoint and restart of unmodified distributed applications) • Berkeley Labs Checkpoint Restart (BLCR) • Kernel-module based checkpoint-restart for single node • No identifier virtualization – restart will fail in the event of an identifier (e.g., pid) conflict • No support for handling communication state – relies on application or library changes • MPVM, CoCheck, LAM-MPI • Library-specific implementations of parallel application checkpoint-restart with disadvantages described earlier

  22. Future Work Many areas for future work, e.g., • Improve portability across kernel versions by minimizing direct access to kernel structures • Recommend additional kernel interfaces when advantageous (e.g., accessing socket attributes) • Implement performance optimizations to the coordinated checkpoint-restart algorithm • Evaluate performance on a wide range of applications and cluster configurations • Support systems with newer interconnects and newer communication abstractions (e.g., InfiniBand, RDMA)

  23. Summary • Cruz, a practical checkpoint-restart system for Linux • No change to applications or to base OS kernel needed • Novel mechanisms to support checkpoint-restart of a broader class of applications • Migrating networked applications transparent to communicating peers • Consistent checkpoint-restart of general TCP/IP-based distributed applications • Cruz’s broad capabilities will drive its use in solutions for fault tolerance, online OS maintenance, and resource management

  24. http://www.hpl.hp.com/research/dca

  25. Outline • Zap (Background) • Migrating Networked Applications • Network Address Migration • Communication State Checkpoint and Restore • Checkpoint-Restart of Distributed Applications • Evaluation • Related Work • Future Work • Summary

  26. Outline • Zap (Background) • Migrating Networked Applications • Network Address Migration • Communication State Checkpoint and Restore • Checkpoint-Restart of Distributed Applications • Evaluation • Related Work • Future Work • Summary

More Related