Lampson Sturgis Fault Model

Jim Gray, Microsoft, Gray@Microsoft.com
Andreas Reuter, International University, Andreas.Reuter@i-u.de

Course schedule:

          Mon         Tue            Wed           Thur            Fri
  9:00    Overview    TP mons        Log           Files &Buffers  B-tree
 11:00    Faults      Lock Theory    ResMgr        COM+            Access Paths
  1:30    Tolerance   Lock Techniq   CICS & Inet   Corba           Groupware
  3:30    T Models    Queues         Adv TM        Replication     Benchmark
  7:00    Party       Workflow                     Cyberbrick      Party
Rationale: Fault Tolerance Needs a Fault Model

What do you tolerate?
• Fault tolerance needs a fault model.
• The model needs to be simple enough to understand.
• With a model, we can design hardware/software to tolerate the faults.
• With a model, we can make statements about the system behavior.
Byzantine Fault Model

• Some modules are fault free (during the period of interest); other modules may fail (in the worst possible way).
• Make statements about the behavior of the fault-free modules.
• Synchronous: all operations happen within a time limit.
• Asynchronous: no time limit on anything, no lost messages.
• Timed (used here): notion of timeout and retry.
• Key result: N modules can tolerate fewer than N/3 Byzantine faults (e.g., 4 modules tolerate 1 fault).
Lampson Sturgis Model

Processes:
  Correct: execute a program at a finite rate.
  Fault: reset to null state and "stop" for a finite time.
Messages:
  Correct: eventually arrive and are correct.
  Fault: lost, duplicated, or corrupted.
Storage:
  Correct: Read(x) returns the most recent value of x; Write(x, v) sets the value of x to v.
  Fault: all pages reset to null, a single page resets to null, or Read/Write operate on the wrong page.
Other faults (called disasters) are not dealt with. Assumption: disasters are rare.
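The code on the following slides manipulates stores, pages, values, and messages through a few shared declarations. As a point of reference, here is a minimal sketch of what those declarations might look like: the names (avalue, astore, amessagep, VSIZE, MAXSTORE, MAXSTORES) are the ones the later slides use, but the exact sizes and layouts here are assumptions, not the authors' definitions.

#define VSIZE     100                  /* bytes per stored value (assumed size) */
#define MAXSTORE  1000                 /* pages per store (assumed size) */
#define MAXSTORES 100                  /* number of stores in the system (assumed) */
#define TRUE  1
#define FALSE 0
typedef int           Boolean;         /* status flags used throughout */
typedef unsigned long Ulong;           /* addresses, times, indexes */
typedef Ulong         address;         /* page address within a store */
typedef char          avalue[VSIZE];   /* a page or message value */

typedef struct {                       /* a page: its value plus a validity flag */
    Boolean status;                    /* FALSE after a page or store decay */
    avalue  value;                     /* the stored bytes (version-stamped) */
} apage;

typedef struct {                       /* a store (disc): status plus its pages */
    Boolean status;                    /* FALSE after the whole store fails */
    apage   page[MAXSTORE];            /* the store's pages */
} astore;

astore stores[MAXSTORES];              /* the system's stores, grouped into n-plex sets */

typedef struct message {               /* a message: value, status, queue link */
    Boolean         status;            /* FALSE if corrupted in transit */
    avalue          value;             /* message body */
    struct message *next;              /* next message in a process input queue */
} amessage, *amessagep;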
Byzantine vs. Lampson-Sturgis Fault Models

Connections between the two are unclear.
• Byzantine focuses on bounded-time, bounded-fault behavior (real-time systems); asynchronous (mostly) or synchronous (real time).
• Lampson/Sturgis focuses on long-term behavior; no time or fault limits; time and timeout heavily used to detect faults.
Roadmap of What's Coming • Lampson-Sturgis Fault Model • Building highly available processes, messages, storage from faulty components. • Process pairs give quick repair • Kinds of process pairs: • Checkpoint / Restart based on storage • Checkpoint / Restart based on messages • Restart based on transactions (easy to program).
Model of Storage and its Faults

The system has several stores (discs); each has a set of pages. Stores fail independently.
• Probability a write has no effect: 1 in a million.
• Mean time to a page fail: a few days.
• Mean time to a disc fail: a few years.
• Wild reads/writes are modeled as page fails.
Storage verbs (each takes a store, a page address, and a value, and returns a status):
  store_write(store, address, value)   /* writes a page, returns status */
  store_read (store, address, value)   /* reads a page into value, returns status */
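To make the storage verbs concrete, here is a minimal sketch of fault-injecting store_write and store_read, assuming the declarations sketched earlier and the fault rates on this slide (a write silently lost about one time in a million; wild reads/writes folded into page fails). The constant name pwne is an assumption, and to keep the sketch compilable the verbs take a pointer to the store, whereas the surrounding slides pass the store directly; this is not the authors' exact implementation.

#include <stdlib.h>
#define pwne 1.0E-6                               /* probability a write has no effect */
double randf(void);                               /* uniform random in (0,1), defined elsewhere */
void   copy(void *to, const void *from, Ulong n); /* byte copy, defined elsewhere */

/* write a page of a store: returns TRUE if the store and address are good */
Boolean store_write(astore *store, address addr, avalue value)
{ if (addr >= MAXSTORE || !store->status) return FALSE;  /* bad address or dead store */
  if (randf() < pwne) return TRUE;                /* rare fault: write silently has no effect */
  copy(store->page[addr].value, value, VSIZE);    /* normal case: install the new value */
  store->page[addr].status = TRUE;                /* the page is now valid */
  return TRUE;
};

/* read a page of a store: returns TRUE and fills in value if the page is valid */
Boolean store_read(astore *store, address addr, avalue value)
{ if (addr >= MAXSTORE || !store->status) return FALSE;  /* bad address or dead store */
  if (!store->page[addr].status) return FALSE;    /* page decayed: report the fault */
  copy(value, store->page[addr].value, VSIZE);    /* return the most recent value */
  return TRUE;
};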
Page Decay / Store Failure: Storage Decay (the demon)

/* There is one store_decay process for each store in the system */
#define mttvf 7E5                                 /* mean time (sec) to a page fail, a few days */
#define mttsf 1E8                                 /* mean time (sec) to disc fail is a few years */
void store_decay(astore store)                    /* */
{ Ulong addr;                                     /* the random places that will decay */
  Ulong page_fail  = time() + mttvf*randf();      /* time to next page decay */
  Ulong store_fail = time() + mttsf*randf();      /* time to next store decay */
  while (TRUE)                                    /* repeat this loop forever */
  { wait(min(page_fail,store_fail) - time());     /* wait for next event */
    if (time() >= page_fail)                      /* if the event is a page decay */
    { addr = randf()*MAXSTORE;                    /* pick a random address */
      store.page[addr].status = FALSE;            /* set it invalid */
      page_fail = time() - log(randf())*mttvf;    /* pick next fault time */
    };                                            /* negative exp distributed, mean mttvf */
    if (time() >= store_fail)                     /* if the event is a storage fault */
    { store.status = FALSE;                       /* mark the store as broken */
      for (addr = 0; addr < MAXSTORE; addr++)     /* invalidate all pages */
        store.page[addr].status = FALSE;          /* */
      store_fail = time() - log(randf())*mttsf;   /* pick next fault time */
    };                                            /* negative exp distributed, mean mttsf */
  };                                              /* end of endless while loop */
};                                                /* */

Simulates (specifies) the system's storage fault behavior.
Reliable Write: write all members of an N-plex set.

#define nplex 2                                   /* code works for n > 2, but do duplex */
Boolean reliable_write(Ulong group, address addr, avalue value)   /* */
{ Ulong i;                                        /* index on elements of store group */
  Boolean status = FALSE;                         /* true if any write worked */
                                                  /* each group uses nplex stores */
  for (i = 0; i < nplex; i++)                     /* write each store in the group */
  { status = store_write(stores[group*nplex+i],addr,value)   /* write one store, then */
             || status;                           /* OR in the result (write first, so || cannot skip it) */
  }                                               /* loop to write all stores of group */
  return status;                                  /* return indicates if ANY write worked */
};                                                /* */
Reliable Read: read all members of an N-plex set.
Problems:
• All copies fail: disaster.
• Ambiguity (N different answers): take the majority, or take the "newest".
On a bad read, rewrite the page with the best value.

Ulong version(avalue);                            /* returns version of a value */
/* read an n-plex group to find the most recent version of a page */
Boolean reliable_read(Ulong group, address addr, avalue value)    /* */
{ Ulong i = 0;                                    /* index on store group */
  Boolean gotone = FALSE;                         /* flag says had a good read */
  Boolean bad = FALSE;                            /* bad says group needs repair */
  avalue next;                                    /* next value that is read */
  Boolean status;                                 /* read ok */
  for (i = 0; i < nplex; i++)                     /* for each page in the nplex set */
  { status = store_read(stores[group*nplex+i],addr,next);   /* read value */
    if (! status) bad = TRUE;                     /* if status bad, ignore value */
    else                                          /* have a good read */
      if (! gotone)                               /* if it is first good value */
        { copy(value,next,VSIZE); gotone = TRUE; }/* make it the best value */
      else if (version(next) != version(value))   /* if new value, compare */
      { bad = TRUE;                               /* if different, repair needed */
        if (version(next) > version(value))       /* if new is best version */
          copy(value, next, VSIZE);               /* copy it to best value */
      };
  };                                              /* end of read all copies */
  if (! gotone) return FALSE;                     /* disaster, no good pages */
  if (bad) reliable_write(group,addr,value);      /* repair any bad pages */
  return TRUE;                                    /* success */
};                                                /* */
Data Scrubber: Background Store Repair Process
A reliable read repairs a page: on a bad read it rewrites the page with the best value.

/* repair the broken pages in an n-plex group. */
/* Group is in 0,...,(MAXSTORES/nplex)-1 */
void store_repair(Ulong group)                    /* */
{ int i;                                          /* next address to be repaired */
  avalue value;                                   /* buffer holds value to be read */
  while (TRUE)                                    /* do forever */
  { for (i = 0; i < MAXSTORE; i++)                /* for each page in the store */
    { wait(1);                                    /* wait a second */
      reliable_read(group,i,value);               /* a reliable read repairs the page */
    };                                            /* if the copies do not match */
  };
};                                                /* */

Repair is important: it is needed to minimize the chance of N simultaneous failures.
Optimistic Reads
Most implementations do optimistic reads: read only one value.

Boolean optimistic_read(Ulong group, address addr, avalue value)  /* */
{ if (group >= MAXSTORES/nplex) return FALSE;     /* return false if bad address */
  if (store_read(stores[nplex*group],addr,value)) /* read one value */
    return TRUE;                                  /* and if that is ok return it as the true value */
  else                                            /* if reading one value returned bad then */
    return (reliable_read(group,addr,value));     /* n-plex read & repair */
};                                                /* */

This is dangerous (especially without repair).
Storage Fault Summary • Simple fault model. • Allows discussion/specification of fault tolerance. • Uncovers some problems in many implementations: • Ambiguous reads • Repair process. • Optimistic reads.
Process Fault Model
• A process executes a program and has state.
• The program causes state changes and can send/get messages.
• A process fails by stopping (for a while) and then resetting its data and message state.
Figure: a sender process (program + data) appends messages to the input message queue of a receiver process (program + data); each queued message carries a value and a status.
Process Fault Model: The Break/Fix Loop (execute about 4 months, fail, repair about 3 hours)

#define MAXPROCESS MANY                           /* the system will have many processes */
typedef Ulong processid;                          /* process id is an integer index into array */
typedef struct {char program[MANY/2]; char data[MANY/2];} state;  /* program + data */
struct
{ state initial;                                  /* process initial state */
  state current;                                  /* value of the process state */
  amessagep messages;                             /* queue of messages waiting for process */
} process [MAXPROCESS];                           /* */

/* Process decay: execute a process and occasionally inject faults into it */
#define mttpf 1E7                                 /* mean time to process failure, about 4 months */
#define mttpr 1E4                                 /* mean time to repair is about 3 hours */
void process_execution(processid pid)             /* */
{ Ulong proc_fail;                                /* time of next process fault */
  Ulong proc_repair;                              /* time to repair process */
  Boolean status;                                 /* status of discarded messages */
  amessagep msg, next;                            /* pointers to process messages */
  while (TRUE)                                    /* global execution loop */
  { proc_fail = time() - log(randf())*mttpf;      /* the time of next fail */
    proc_repair = -log(randf())*mttpr;            /* delay in next process repair */
    while (time() < proc_fail)                    /* */
    { execute(process[pid].current); };           /* execute for about 4 months (work) */
    (void) wait(proc_repair);                     /* wait about 3 hrs for repair (break) */
    copy(process[pid].current,process[pid].initial,MANY);  /* reset (fix) */
    while (message_get(msg, &status)) {};         /* read and discard all msgs in queue */
  };
};                                                /* bottom of work, break, fix loop */
Checkpoint/Restart Process (Storage Based)
At restart: get the ticket number from disc. Then loop: get request, bump ticket #, save to disc, send to client.

/* A checkpoint-restart process server generating unique sequence numbers */
checkpoint_restart_process()                      /* */
{ Ulong disc = 0;                                 /* a reliable storage group with state */
  Ulong address[2] = {0,1};                       /* page address of two states on disc */
  Ulong old;                                      /* index of the disc page with the old state */
  struct
  { Ulong ticketno;                               /* process reads its state from disc */
    char filler[VSIZE];                           /* newest state has max ticket number */
  } value [2];                                    /* current state kept in value[0] */
  struct msg
  { processid him;                                /* contains requesting process id */
    char filler[VSIZE];                           /* reply (ticket num) sent to process */
  } msg;                                          /* buffer to hold input message */
  /* Restart logic: recover ticket number from persistent storage */
  for (old = 0; old <= 1; old++)                  /* read the two states from disc */
  { if (!reliable_read(disc,address[old],value[old]))   /* if read fails */
      panic();                                    /* then failfast */
  };
  if (value[1].ticketno < value[0].ticketno) old = 1;   /* pick max seq number; */
  else { old = 0; copy(value[0], value[1],VSIZE); };    /* old indexes the older state */
  /* Processing logic: generate next number, checkpoint, and reply */
  while (TRUE)                                    /* do forever */
  { while (! get_msg(&msg)) {};                   /* get next request for a ticket number */
    value[0].ticketno = value[0].ticketno + 1;    /* increment ticket number */
    if (! reliable_write(disc,address[old],value[0])) panic();  /* checkpoint the state */
    old = (old + 1) % 2;                          /* use other disc page for state next time */
    message_send(msg.him, value[0]);              /* send the ticket number to the client */
  };
};                                                /* endless loop to get messages */
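For context, here is a minimal sketch of how a client might use this ticket server with the message verbs from the later slides (message_send, message_get, MyPID). The server's process id (TICKET_SERVER), the request/reply layouts, and the casts that pack the structs into message values are illustrative assumptions, not the authors' code.

#define TICKET_SERVER 1                           /* assumed process id of the ticket server */

/* ask the ticket server for the next ticket number (a client sketch) */
Ulong get_ticket(void)
{ struct { processid him;  char filler[VSIZE]; } request;  /* request carries the caller's id */
  struct { Ulong ticketno; char filler[VSIZE]; } reply;    /* reply starts with the ticket number */
  Boolean good;                                   /* status of the received message */
  request.him = MyPID();                          /* tell the server where to send the reply */
  message_send(TICKET_SERVER, (char *) &request); /* send the request to the server */
  while (! message_get((avalue *) &reply, &good)) {};  /* busy-wait until the reply arrives */
  return reply.ticketno;                          /* ignore 'good' here; a real client would */
};                                                /* also retry/resend as in the session slides */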
Process Pairs (message-based checkpoints)

Problem            Solution
Detect failure     "I'm alive" message timeout; no "real" solution.
Continuation       Checkpoint messages.
Startup            Backup waits for the primary.

Figure: client processes send "give me a ticket" requests to the primary server process and receive ticket numbers; the primary sends "I'm alive" and state checkpoint messages to the backup server process, which also holds the next ticket number.
Process Pairs (message-based checkpoints): Restart
• Primary is in a tight loop sending "I'm alive" or state-change messages to the backup.
• Backup thinks the primary is dead if it got no message in the previous second.
Figure (restart): am I the default primary? If yes, run the primary loop; otherwise run the backup loop.
Figure (primary loop): if there is any input, read it, compute the new state, send the new state to the backup, and reply; otherwise send "I'm alive" about once a second.
Figure (backup loop): wait a second; if a new state arrived, set my state to it; if nothing arrived from the primary in the last second, broadcast "I'm primary", reply to the last request, and take over as primary.
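Here is a minimal sketch of the two loops in the flowchart above, written against the message verbs used elsewhere in these slides. The helper names (i_am_default_primary, get_request, compute_new_state, reply_to, broadcast_i_am_primary) and backup_pid are hypothetical, and real takeover code (see the later session slides) also has to resend the most recent reply and track sequence numbers.

/* hypothetical helpers assumed by this sketch */
Boolean   i_am_default_primary(void);             /* configuration: am I the default primary? */
Boolean   get_request(avalue request);            /* TRUE if a client request is waiting */
void      compute_new_state(avalue request, avalue state);  /* apply the request to the state */
void      reply_to(avalue request, avalue state); /* send the reply for this request */
void      broadcast_i_am_primary(void);           /* tell clients about the takeover */
processid backup_pid;                             /* process id of the other pair member */

void process_pair_member(void)
{ avalue state, request, checkpoint;              /* my state, current request, checkpoint msg */
  Boolean good;                                   /* was the received message uncorrupted? */
  Ulong last_heard = time();                      /* when the primary last sent us anything */
  if (i_am_default_primary()) goto primary;       /* at restart: default primary runs primary loop */

backup:                                           /* backup loop */
  while (TRUE)
  { wait(1);                                      /* wait a second */
    while (message_get(&checkpoint, &good))       /* drain checkpoint / "I'm alive" messages */
    { if (! good) continue;                       /* ignore corrupted messages */
      copy(state, checkpoint, VSIZE);             /* adopt the newest checkpointed state */
      last_heard = time();                        /* remember when the primary was last heard */
    };
    if (time() - last_heard > 1)                  /* silent for more than a second: */
    { broadcast_i_am_primary();                   /* assume the primary died, take over */
      goto primary;                               /* and run the primary loop */
    };
  };

primary:                                          /* primary loop */
  while (TRUE)
  { if (get_request(request))                     /* any client input? */
    { compute_new_state(request, state);          /* compute the new state */
      message_send(backup_pid, state);            /* checkpoint the new state to the backup */
      reply_to(request, state);                   /* then reply to the client */
    }
    else                                          /* no input this second: */
    { message_send(backup_pid, state);            /* the checkpoint doubles as "I'm alive" */
      wait(1);                                    /* try again in a second */
    };
  };
};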
What We Have Done So Far
• Converted "faulty" processes to reliable ones: tolerate hardware and some software faults; can repair in seconds or milliseconds.
• Unlike checkpoint/restart: no process creation/setup time, no client reconnect time.
• Operating systems are beginning to provide process pairs.
• Stateless process pairs can use transactional servers to store their state and clean up the mess at takeover.
• Like storage-based checkpoint/restart, except process setup/connection is instant.
Persistent process pairs

persistent_process()                              /* prototypical persistent process */
{ wait_to_be_primary();                           /* wait to be told you are primary */
  while (TRUE)                                    /* when primary, do forever */
  { begin_work();                                 /* start transaction or subtransaction */
    read_request();                               /* read a request */
    doit();                                       /* perform the desired function */
    reply();                                      /* reply */
    commit_work();                                /* finish transaction or subtransaction */
  };                                              /* did a step, now get next request */
};                                                /* */
Persistent Process Pairs: the ticket server redone as a transactional server.
Wait to be primary. Then loop: begin transaction, get request, bump ticket # in the database, commit, and send to the client.

/* A transactional persistent server process generating unique tickets */
persistent_ticket_server()                        /* current state kept in SQL database */
{ int ticketno;                                   /* next ticket # (from the DB) */
  struct msg
  { processid him;                                /* contains requesting process id */
    char filler[VSIZE];                           /* reply (ticket num) sent to that addr */
  } msg;                                          /* buffer to hold input message */
  /* Restart logic: recover ticket number from persistent storage */
  wait_to_be_primary();                           /* wait to be told you are primary */
  /* Processing logic: generate next number, checkpoint, and reply */
  while (TRUE)                                    /* do forever */
  { begin_work();                                 /* begin a transaction */
    while (! get_msg(&msg)) {};                   /* get next request for a ticket */
    exec sql update ticket                        /* increment the next ticket number */
             set ticketno = ticketno + 1;         /* */
    exec sql select max(ticketno)                 /* fetch current ticket number */
             into :ticketno                       /* into program local variable */
             from ticket;                         /* from the SQL database */
    commit_work();                                /* commit the transaction */
    message_send(msg.him, ticketno);              /* send the ticket number to the client */
  };
};                                                /* endless loop to get messages */
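The server above assumes a ticket table that already exists and holds the current ticket number. A minimal sketch of that setup (one row, one column, initialized to zero) might look like the following; the table and column names are taken from the embedded SQL above, but the schema itself and the setup routine are assumptions.

/* one-time setup of the ticket server's database state (a sketch) */
create_ticket_table()                             /* run once, inside a transaction */
{ begin_work();                                   /* begin a transaction */
  exec sql create table ticket                    /* one-row table holding the counter */
           (ticketno integer not null);           /* the current ticket number */
  exec sql insert into ticket                     /* start the counter at zero */
           values (0);                            /* */
  commit_work();                                  /* commit the transaction */
};                                                /* */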
Messages: Fault Model
Each process has a queue of incoming messages. Messages can be:
• corrupted: a checksum detects it.
• duplicated: a sequence number detects it.
• delayed arbitrarily long: ack + retransmit.
• lost: ack + retransmit + sequence number.
The techniques here give messages fail-fast semantics.
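As one concrete illustration of the corruption check, here is a minimal sketch of attaching and verifying a checksum over a message value. The additive checksum and the idea of reserving the last bytes of the value for it are illustrative assumptions; the slides only say that a checksum detects corruption.

/* a simple additive checksum over the first VSIZE - sizeof(Ulong) bytes of a value */
Ulong value_checksum(avalue value)
{ Ulong sum = 0;                                  /* accumulate bytes into the checksum */
  Ulong i;
  for (i = 0; i < VSIZE - sizeof(Ulong); i++)
    sum = sum + (unsigned char) value[i];
  return sum;
};

/* store the checksum in the last bytes of the value before sending it */
void checksum_seal(avalue value)
{ Ulong sum = value_checksum(value);
  copy(value + VSIZE - sizeof(Ulong), &sum, sizeof(Ulong));
};

/* on receipt: TRUE if the stored checksum matches the recomputed one */
Boolean checksum_ok(avalue value)
{ Ulong sum;
  copy(&sum, value + VSIZE - sizeof(Ulong), sizeof(Ulong));
  return (sum == value_checksum(value));
};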
Message Verbs: SEND (build & queue the message; sometimes corrupt or duplicate it)

/* send a message to a process: returns true if the process exists */
Boolean message_send(processid him, avalue value) /* */
{ amessagep it;                                   /* pointer to message created by this call */
  amessagep queue;                                /* pointer to process message queue */
  if (him > MAXPROCESS) return FALSE;             /* test for valid process */
loop:
  it = malloc(sizeof(amessage));                  /* allocate space to hold message */
  it->status = TRUE; it->next = NULL;             /* and fill in the fields */
  copy(it->value,value,VSIZE);                    /* copy msg data to message body */
  queue = process[him].messages;                  /* look at process message queue */
  if (queue == NULL) process[him].messages = it;  /* if empty, the message becomes the queue head */
  else                                            /* else */
  { while (queue->next != NULL) queue = queue->next;  /* find the end of the queue */
    queue->next = it;                             /* and place the message at queue end */
  };
  if (randf() < pmf) it->status = FALSE;          /* sometimes the message is corrupted */
  if (randf() < pmd) goto loop;                   /* sometimes the message is duplicated */
  return TRUE;                                    /* */
};                                                /* */
Message Verbs: GET

/* get the next input message of this process: returns true if there is a message */
Boolean message_get(avalue * valuep, Boolean * msg_status)   /* */
{ processid me = MyPID();                         /* caller's process number */
  amessagep it;                                   /* pointer to input message */
  it = process[me].messages;                      /* find caller's message input queue */
  if (it == NULL) return FALSE;                   /* return false if queue is empty */
  process[me].messages = it->next;                /* take first message off the queue */
  *msg_status = it->status;                       /* record its status */
  copy(valuep,it->value,VSIZE);                   /* value = it->value */
  free(it);                                       /* deallocate its space */
  return TRUE;                                    /* return status to caller */
};                                                /* */
Sessions Make Messages FailFast
• CRC makes a corrupt message look like a lost message.
• Sequence numbers detect duplicates => treated as a lost message.
• So, the only failure is a lost message.
• Timeout/retransmit masks lost messages => the only failure is delay.
Figure: two processes connected by a session; each side keeps in/out sequence numbers (e.g., out 7, in 3) and a count of acknowledged messages; message 7 flows one way and "ack 7" flows back.
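Here is a minimal sketch of the session idea on this slide, layered on the message_send/message_get verbs: the sender stamps each message with a session sequence number and retransmits until an acknowledgment arrives; the receiver acks every message, delivers the next in-order one, and drops duplicates. It assumes a one-directional session (one side only sends data and receives acks, the other only receives data and sends acks); the asession layout, the one-second timeout, and packing the sequence number into the first bytes of the value are assumptions. The book has the full code, including the listener that separates data from acks.

typedef struct                                    /* one end of a session (a sketch) */
{ processid him;                                  /* the process at the other end */
  Ulong out;                                      /* sequence number of next outgoing message */
  Ulong in;                                       /* highest in-order sequence number received */
  Ulong acked;                                    /* highest outgoing sequence number acknowledged */
} asession;

/* send value on the session and retransmit until the matching ack arrives */
void session_send(asession *s, avalue value)
{ avalue msg, ack;                                /* outgoing message and incoming ack buffers */
  Boolean good;                                   /* ack message status */
  Ulong ackno;                                    /* sequence number carried by an ack */
  s->out = s->out + 1;                            /* stamp a new sequence number */
  copy(msg, &s->out, sizeof(Ulong));              /* sequence number in the first bytes */
  copy(msg + sizeof(Ulong), value, VSIZE - sizeof(Ulong));   /* then the data */
  while (s->acked < s->out)                       /* until this message is acknowledged */
  { message_send(s->him, msg);                    /* (re)transmit */
    wait(1);                                      /* wait about a second for the ack */
    while (message_get(&ack, &good))              /* drain incoming acks */
    { if (! good) continue;                       /* corrupt ack: ignore, retransmit covers it */
      copy(&ackno, ack, sizeof(Ulong));           /* ack carries the acked sequence number */
      if (ackno > s->acked) s->acked = ackno;     /* record the progress */
    };
  };
};

/* receive the next in-order message on the session; duplicates are acked and dropped */
void session_get(asession *s, avalue value)
{ avalue msg;                                     /* incoming message buffer */
  Boolean good;                                   /* incoming message status */
  Ulong seq;                                      /* its sequence number */
  while (TRUE)
  { while (! message_get(&msg, &good)) wait(1);   /* wait for some message */
    if (! good) continue;                         /* corrupt: treat as lost, sender retransmits */
    copy(&seq, msg, sizeof(Ulong));               /* pull out its sequence number */
    message_send(s->him, msg);                    /* ack it (the echo carries the sequence number) */
    if (seq == s->in + 1)                         /* next in-order message? */
    { s->in = seq;                                /* advance the window */
      copy(value, msg + sizeof(Ulong), VSIZE - sizeof(Ulong));  /* hand the data up */
      return;
    };                                            /* duplicate or old message: acked and dropped */
  };
};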
Sessions Plus Process Pairs Give Highly Available Messages
• Checkpoint messages and sequence numbers to the backup.
• The backup resumes the session if the primary fails.
• The backup broadcasts its new identity at takeover (see book for code).
Figure: the primary checkpoints each session's state (in/out/acked sequence numbers and pending messages) to the backup; after a failure the backup carries on the session, resending message 7 and processing "ack 7".
Highly Available Message Verbs
Application programs see only reliable_get_msg() and reliable_send_msg(); hidden underneath are:
• sequence numbers,
• ack/retransmit logic,
• checkpoints,
• process pair takeover,
• resend of the most recent reply.
A Listener process (thread) does all this asynchronous work.
Figure: application programs call reliable_send_msg()/reliable_get_msg(); the Listener process manages the input and output message sessions and the acknowledged input messages.
Summary
• Went from faulty storage, processes, and messages to fault-tolerant versions of each.
• A simple fault model explains many techniques used (and misused) in FT systems.