Advanced I/O Techniques for Efficient and Highly Available Process Crash Recovery Protocols

Thesis Presentation
Jason Cornwell
03/15/2011

Agenda
• Introduction
• Challenges
• Pertinent Background
• Proposed Techniques
• Implementations
• Experimental Setup & Results
• Conclusions
• Future Work

Computing Intensive Applications

Network Centric Services

Recent Advances

Motivation & Goals
• Demand for more computing power and high-bandwidth network connections
• Advances in Microprocessors and Networks
• Parallel Computing
• Performance and Scalability
• Reliability and Availability
• Simplicity and Accessibility

Reliability Problems
• Large numbers of CPUs, Memory Modules, Hard Disk Drives, Network Interfaces, Network Switches
• Low Mean-Time-To-Failure (MTTF) and/or High Failure-In-Time (FIT)
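The link between component counts and a low system MTTF can be made concrete with a little arithmetic. The sketch below assumes independent components with constant failure rates (so the rates add up) and uses the standard definition of FIT as failures per 10^9 device-hours. The function names and the 1,000-disk cluster are hypothetical and are not taken from the thesis.

```python
def fit_from_mttf(mttf_hours: float) -> float:
    """Convert MTTF (hours) to FIT, i.e. expected failures per 10**9 device-hours."""
    return 1e9 / mttf_hours


def system_mttf(component_mttfs_hours):
    """MTTF of a system that fails as soon as any one component fails.

    Assumption: components fail independently with constant failure rates,
    so the system failure rate is the sum of the component rates (1 / MTTF).
    """
    total_rate = sum(1.0 / m for m in component_mttfs_hours)
    return 1.0 / total_rate


# Hypothetical cluster: 1,000 disks, each rated at a 1,000,000-hour MTTF.
disks = [1_000_000.0] * 1000
print("FIT per disk:", fit_from_mttf(1_000_000.0))
print("Hours until the first expected disk failure:", system_mttf(disks))
```

Under these assumptions the expected time to the first failure shrinks in proportion to the number of components, which is why large installations see frequent failures even when each part is individually reliable.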

Classification of Failure
• Transient Failure
  • Power glitch
  • System patch and reboot
  • ECC trap
• Partial “Permanent” Failure
  • Disk failure
  • Partial network failure
• Wholesale “Permanent” Failure
  • Total hardware failure
  • Natural disaster

Availability Problems
• Large numbers of Processes, Threads, Software Barriers, Busy Waiting
• Temporarily Unresponsive and/or Unavailable

Possible Solutions
• Transient Failure
  • Restart/replay/resume on the same node
  • Task migration is possible
• Permanent Partial Failure
  • Rebalance the workload on surviving nodes
  • Partial task migration is needed
• Permanent Wholesale Failure
  • Reconfigure the applications and services
  • Massive task migration to a new platform

Checkpointing
• Common feature in high-performance computing (HPC) platforms
• Saves the execution state
• Application or system-level
• Mechanism for task migration

Application vs System Level
• Application-level Recovery Point
  • Developed specifically for each application
  • Generally smaller footprint
  • Data accessibility restrictions
• Kernel-level Recovery Point
  • Snapshots of whole processes
  • Full resource restoration
  • Flexibility due to system-level preemption
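To illustrate the application-level side of this comparison, here is a minimal sketch of a program that saves only the state it needs (a loop counter and a running total) and resumes from it after a crash. Everything in it is hypothetical: the function names, the JSON file format, and the simulated crash are assumptions for illustration and are unrelated to BLCR, which works at the system level.

```python
import json
import os


def save_checkpoint(path, state):
    """Write the state to a temporary file, then atomically rename it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach the disk before renaming
    os.replace(tmp, path)     # a crash never leaves a half-written checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_i": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path, n, crash_at=None):
    """Sum 0..n-1, checkpointing after every step; optionally simulate a crash."""
    state = load_checkpoint(path)
    for i in range(state["next_i"], n):
        if crash_at is not None and i == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += i
        state["next_i"] = i + 1
        save_checkpoint(path, state)
    return state["total"]
```

The footprint is small because only two integers are saved, but the save and restore logic has to be written for each application, which is the trade-off the slide describes.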

Berkeley Lab Checkpoint/Restart (BLCR)
• System-level
• Kernel module
• Checkpoint creation implemented
• Process recovery implemented
• Linked to BLCR libraries at execution
• Stores checkpoint data locally (stack, heap, registers, signals, etc.)

Contribution
• Enhanced BLCR performance through a latency-tolerant technique
• Increased BLCR availability through a novel checkpoint creation technique

I/O Optimization
• Avoided extreme modification to BLCR
• Reduced the disk latency of checkpoint creation
• Implemented a caching technique
• Improved I/O performance 4-fold or more
• System overhead of less than 300 KB in the experimental tests

Checkpoint Caching
• Buffer used as temporary storage
• Buffered blocks are flushed to disk in large volumes
• Trade-off between resource consumption and improved I/O efficiency

cr_copy(chkptData, count)
    if (chkptBuf is NULL)
        kmalloc count bytes for chkptBuf;
        copy chkptData into chkptBuf;
    else
        kmalloc (chkptBuf size + count) bytes for tempBuf;
        copy chkptBuf into tempBuf;
        append chkptData to tempBuf;
        krealloc chkptBuf to its expanded size;
        memmove tempBuf into chkptBuf;
        kfree tempBuf;
    end if

Optimized Write Operation

Remote Checkpoint
• BLCR is limited to local disk storage
• Remote checkpoint offers an off-site storage option
• Uses sockets to transmit data
• Needs a predefined destination
• Outperforms BLCR in some experimental tests

Remote Checkpoint Server
• Single-threaded daemon
• Compiled with GCC
• Stores the recovery point external to the client node
• Could be ported to a Microsoft derivative

while (true)
    create socket;
    bind to address;
    listen for incoming connections;
    wait for client to connect;
    create file descriptor;
    while (buffered data is received)
        write checkpoint data;
    close file descriptor;
    close socket;
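The server loop above can be sketched in user space as follows. This is an illustration only, not the thesis implementation (which the slide says was compiled with GCC): the function names, the loopback address, and the chunk size are assumptions. Unlike the slide's pseudocode, the sketch creates the listening socket once and handles a single client per call, which keeps it short.

```python
import socket


def make_listener(host="127.0.0.1", port=0):
    """Create, bind, and listen on a TCP socket (port 0 lets the OS pick a free port)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(1)
    return listener


def serve_one_checkpoint(listener, out_path, chunk_size=65536):
    """Accept one client and stream everything it sends into out_path.

    Returns the number of bytes stored. The checkpoint is considered
    complete when the client closes its end of the connection.
    """
    conn, _addr = listener.accept()          # wait for a client to connect
    received = 0
    with conn, open(out_path, "wb") as f:    # file plays the role of the file descriptor
        while True:
            data = conn.recv(chunk_size)
            if not data:                     # empty read: the client closed the socket
                break
            f.write(data)                    # write checkpoint data
            received += len(data)
    return received
```

A daemon in the spirit of the slide would call `serve_one_checkpoint` inside a `while True` loop, choosing a new output path for each client.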

Modified Write Operation
• TCP packets
• MTU must be reached before delivery
• Only modification is to the write operation of BLCR

if (remote chkpt)
    if (socket is NULL)
        create socket;
        establish connection; if handshake fails, break and perform the original_chkpt;
    end if
    package checkpoint data;
    send data message;
end if
if (original_chkpt)
    original BLCR write operation;
end if
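The fallback logic in this write path can be sketched in user space as below. The class name, its parameters, and the return values are hypothetical; the real change lives inside BLCR's kernel-level write operation, which this sketch does not reproduce. It assumes, as the pseudocode suggests, that only a failed connection attempt triggers the fallback to the original local write; an error while sending on an established connection is simply raised.

```python
import socket


class CheckpointWriter:
    """Send checkpoint data to a remote server, or write locally if it is unreachable."""

    def __init__(self, local_path, remote_addr=None, timeout=2.0):
        self.local_path = local_path        # destination for the original local write
        self.remote_addr = remote_addr      # predefined (host, port), or None for local only
        self.timeout = timeout
        self.sock = None
        self.use_remote = remote_addr is not None

    def write(self, data: bytes) -> str:
        # Lazily connect on the first write, mirroring "if socket is NULL".
        if self.use_remote and self.sock is None:
            try:
                self.sock = socket.create_connection(self.remote_addr, timeout=self.timeout)
            except OSError:
                self.use_remote = False     # handshake failed: fall back to the local path
        if self.use_remote:
            self.sock.sendall(data)         # send data message
            return "remote"
        with open(self.local_path, "ab") as f:   # original (local) write operation
            f.write(data)
        return "local"

    def close(self):
        if self.sock is not None:
            self.sock.close()
            self.sock = None
```

Deciding the destination once, at connection time, keeps a single checkpoint from being split between the remote server and the local disk.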

Design
I/O Optimization Write
Remote Checkpoint Write

write(chkptData, count)
    if (chkptBuf has space for the incoming chkptData)
        cr_copy(chkptData, count);
    else
        vfs_write(chkptBuf);
        vfs_write(chkptData);
        kfree(chkptBuf);
    end if
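The buffered write policy above can be sketched in user space to show how it reduces the number of low-level writes. The class, its `capacity` parameter, and the `raw_writes` counter are hypothetical names; a Python `bytearray` stands in for the kernel buffer managed by `cr_copy`, and a file-like object stands in for `vfs_write`. The final `flush` is an assumption, since the slide does not show how the last buffered data reaches the disk.

```python
import io


class BufferedCheckpointFile:
    """Accumulate small checkpoint writes and push them to storage in large blocks."""

    def __init__(self, raw, capacity):
        self.raw = raw              # underlying file-like object (stand-in for vfs_write)
        self.capacity = capacity    # maximum number of bytes held in the buffer
        self.buf = bytearray()      # stand-in for chkptBuf
        self.raw_writes = 0         # number of low-level writes actually issued

    def _raw_write(self, data):
        self.raw.write(data)
        self.raw_writes += 1

    def write(self, data: bytes):
        if len(self.buf) + len(data) <= self.capacity:
            self.buf += data                    # cr_copy: append to the buffer
        else:
            if self.buf:
                self._raw_write(bytes(self.buf))    # flush what has been cached
                self.buf.clear()                    # corresponds to kfree(chkptBuf)
            self._raw_write(data)                   # then write the incoming data

    def flush(self):
        """Assumed final step: write out whatever is still buffered."""
        if self.buf:
            self._raw_write(bytes(self.buf))
            self.buf.clear()


raw = io.BytesIO()
f = BufferedCheckpointFile(raw, capacity=8)
for chunk in (b"abc", b"def", b"ghij"):
    f.write(chunk)
f.flush()
print(f.raw_writes, raw.getvalue())
```

A larger `capacity` means fewer low-level writes at the cost of more memory, which is the resource trade-off noted on the Checkpoint Caching slide.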

Experimental Setup
I/O Optimization
• Dell Workstation, 3.06 GHz Intel Pentium 4, 1 GB Memory, 5,400 RPM Hard Disk, Linux 2.6
• BLCR Implementation
• Optimized BLCR (O-BLCR) Implementation
Remote Checkpoint
• Dell PowerEdge 700, 2.80 GHz Dual-processor Intel Pentium 4, 3 GB Memory, 5,400 RPM Hard Disk, Linux 2.6
• Dell Workstation, 3.06 GHz Intel Pentium 4, 1 GB Memory, 5,400 RPM Hard Disk, Linux 2.6
• BLCR Implementation
• BLCR with NFS (BLCR+NFS)
• BLCR with our Remote Checkpoint Technique (BLCR+R)

Benchmarks
Program Resource Utilization
• NP-Complete
• Data Encryption
• Linear Equation Solver
• File Compression

I/O Optimization Results

Remote Checkpoint Results

Conclusion
• Minimal modification to BLCR
• The I/O optimization technique reduced the write latency of BLCR
• Remote checkpointing adds a new feature that increases BLCR availability
• These techniques should be integrated into the core BLCR source code

Future Work
• Server authentication protocol
• Data packet encryption
• Automated process load balancing

Questions
