Lecture XIII: Replication-II CMPT 401 2008 Dr. Alexandra Fedorova
Outline • Harp • A replicated research file system • Google File System • A real replicated file system • Amazon Distributed Data Store • A distributed database
Questions about Harp • Does Harp use the two-phase commit protocol? If so, when and how? How does it differ from the 2PC protocol we studied in class? • How many replicas that keep copies of data do we need to survive n failures? How many total participants must we have to survive n failures? • Describe normal operation in Harp. Explain the following: • What the primary does • What the replica does • What the witness does • How does Harp survive failures without flushing updates to disk before responding to the client?
Overview of Harp • Uses primary copy replication for • Reliability • Availability • Single primary server, backups and witness • Accessed via NFS interface • Performance was a concern – operations log is kept in memory only: • To guard against machine failures: other replicas will have the log in memory • To guard against power failures: each machine has a UPS, upon power failure there is time to flush log to persistent storage
Access via NFS Interface • [Figure: user application and NFS client in the client OS; NFS server in the server OS] • Replicated FS: • Primary • Backup • Witness
Failover Transparent to Clients • [Figure: user application and NFS client in the client OS; NFS servers on the primary, backup, and witness machines; address 192.168.51.2] • Data is sent to a multicast address • Reaches all potential primaries • Discarded by hardware at all except the primary
Goals and Environment of Harp • Provide highly available file system service via replication • Assume failstop failures • Survive network partitions • Assume synchronous system (?) – probably, because they rely on timeouts when detecting node failure • In many systems, replication caused performance degradation – replica communication slowed down the sending of response to the client • Harp’s goal was to provide reliability and availability without performance loss
Harp’s Components • In the presence of network partitions, there must be 2n + 1 replicated components to survive n failures • A quorum (a majority, i.e., n + 1 servers) gets to form a new group and elect a new primary • Usually data is replicated on all 2n + 1 replicas • In Harp, data is replicated on only n + 1 servers • The other n servers are used only to form a quorum • They are called witnesses
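As a worked example of the arithmetic on this slide, the sketch below computes the group composition for a given n. The helper name `harp_group` and the dictionary keys are illustrative and do not come from Harp itself.

```python
def harp_group(n):
    """Group sizes needed to survive n failures (illustrative helper)."""
    total = 2 * n + 1                   # participants, so a majority survives n failures
    quorum = n + 1                      # majority that may form a new group
    data_replicas = n + 1               # servers that actually store file data
    witnesses = total - data_replicas   # remaining servers only help form a quorum
    return {"total": total, "quorum": quorum,
            "data_replicas": data_replicas, "witnesses": witnesses}


print(harp_group(1))
```

For n = 1 this gives the three-node configuration used in the rest of the lecture: one primary, one backup, one witness.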
Harp’s Witness • [Figure: two partition scenarios among primary, backup, and witness] • Backup and primary cannot communicate • Who should be the primary? • If the witness resolves the tie in favor of the primary, data survives at the primary • If the witness resolves the tie in favor of the backup, data survives at the backup
Harp: Normal Operation • [Figure: client, primary, backup, and witness] • 1. Client sends the request to the primary • 2. Primary records the operation in its in-memory log • 3. Primary forwards the request to the backup • 4. Backup records the operation in its in-memory log • 5. Backup responds to the primary • 6. Primary “commits” the operation – marks it as committed in memory • 7. Primary responds to the client • 8. Primary tells the backup to commit
Two-phase Protocol for Updates • Phase 1: • send updates to all backups • wait for backups to respond • send response to the client • Phase 2: • backups are informed about commit • backups commit the operation locally • Phase 1 is in the critical path • Phase 2 happens in the background • Phase 1 is quick, because updates do not have to be written to disk
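The two phases can be sketched with in-memory logs only. This is a simplified, single-threaded illustration, not Harp's implementation: the class and method names are made up, failures and timeouts are not modelled, and phase 2 is run inline even though in Harp it happens in the background after the reply.

```python
class Backup:
    def __init__(self):
        self.log = []   # in-memory log of operations
        self.cp = 0     # commit pointer: number of committed records

    def receive(self, op):
        self.log.append(op)   # phase 1: record in memory, no disk write
        return True           # acknowledge to the primary

    def commit(self, upto):
        self.cp = upto        # phase 2: records up to `upto` are committed


class Primary:
    def __init__(self, backups):
        self.backups = backups
        self.log = []
        self.cp = 0

    def handle(self, op):
        self.log.append(op)                            # record in memory
        acks = [b.receive(op) for b in self.backups]   # forward and wait for acks
        if not all(acks):
            raise RuntimeError("backup did not acknowledge")
        self.cp = len(self.log)                        # commit in memory
        reply = "ok"                                   # the client reply is ready here
        for b in self.backups:                         # phase 2: off the critical path in Harp,
            b.commit(self.cp)                          # run inline here to stay sequential
        return reply


backup = Backup()
primary = Primary([backup])
print(primary.handle("write /a"), backup.log, backup.cp)
```

No step in `handle` touches the disk, which is why phase 1 can stay short.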
In-Memory Logging • Client operations are recorded in the in-memory logs (at the primary and at the backup) by the time the response is sent to the client • Operations are applied to the file system later, in the background • This is done to remove disk access from the critical path when communicating with the client • What if the primary fails? • That’s okay, because the in-memory log survives at the backup • What if there is a power failure? • The machine will operate for a while on UPS – this time will be used to apply operations in the log to the file system
Write-Behind Logging • [Figure: a log of event records (Record n, Record n+1, …, Record n+6, …) with four pointers into it] • CP – commit pointer – most recently committed event record • AP – most recently applied event record • LB – most recent event record that has reached the local disk • GLB – most recent event record that has reached the local disk at both the primary and the backup • On failure the server restores the log and re-does all committed operations in the log
Log Updates: Commit Pointer • Primary receives the client request • A log record is created at the primary • Primary forwards request to the backups • Backups add records to their logs • Backups acknowledge receipt of records to the primary • Primary commits the operation • Advances commit pointer CP • Sends the commit decision to the backup • Backup advances its own CP
Log Updates: Application Pointer • The “Apply” process • Runs in the background • Applies committed records to disk • Advances the AP pointer • Can we discard records before the AP pointer? • No! Writes are asynchronous • A committed record may not necessarily be on disk
Log Updates: LB and GLB pointers • Another process that checks when writes associated with log records have been applied to the file system • When writes have finished, it advances the LB pointer • GLB: Global LB pointer: all records up to this pointer have been applied to disk at both the primary and the backup • Records below GLB pointer can be discarded • Log invariant: GLB <= LB <= AP <= CP
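The four pointers and the invariant GLB <= LB <= AP <= CP can be sketched as a small class. This is an illustration under simplifying assumptions: records are numbered from 1, the backup's LB is passed in as a plain number, and all names (`WriteBehindLog`, `set_glb`, and so on) are hypothetical.

```python
class WriteBehindLog:
    def __init__(self):
        self.records = {}   # record index -> operation
        self.next = 1       # index the next appended record will get
        self.cp = self.ap = self.lb = self.glb = 0

    def check(self):
        # The log invariant from the slide, plus "CP never passes the log end".
        assert self.glb <= self.lb <= self.ap <= self.cp < self.next

    def append(self, op):
        self.records[self.next] = op
        self.next += 1

    def commit(self, upto):
        self.cp = max(self.cp, min(upto, self.next - 1))
        self.check()

    def apply(self):
        self.ap = self.cp          # issue asynchronous writes for committed records
        self.check()

    def writes_finished(self, upto):
        self.lb = max(self.lb, min(upto, self.ap))   # writes are now on the local disk
        self.check()

    def set_glb(self, backup_lb):
        # On disk at both servers means the smaller of the two LB values.
        self.glb = max(self.glb, min(self.lb, backup_lb))
        for i in [i for i in self.records if i <= self.glb]:
            del self.records[i]    # records at or below GLB can be discarded
        self.check()


log = WriteBehindLog()
for op in ("a", "b", "c"):
    log.append(op)
log.commit(3)
log.apply()
log.writes_finished(2)
log.set_glb(1)
print(sorted(log.records), log.glb, log.lb, log.ap, log.cp)
```

Record 1 is discarded only after both servers have it on disk; records 2 and 3 stay in memory even though they are committed and applied.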
Non-modification Operations • Performed entirely at the primary • No communication with backups • Problem: what if the backup becomes disconnected from the primary and forms a new view? • Then the primary may respond to a read operation with old state (i.e., it may not know that a file has been updated) • How does Harp solve this problem? • The backup sends a promise to the primary not to change the view within time t + σ. Within that time, the primary can respond to read operations without talking to the backup. • After that, it must contact the backup before performing a non-modification operation, to get a new promise.
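The promise behaves like a time-limited lease on local reads. The sketch below is a minimal illustration that treats the promised interval as a single duration and ignores clock skew; the class `ReadLease` and its methods are hypothetical, and the current time is passed in explicitly so the behaviour is easy to follow.

```python
class ReadLease:
    """Primary-side view of the backup's promise not to change the view."""

    def __init__(self, duration):
        self.duration = duration   # length of the promised interval, in seconds
        self.expires = 0.0         # no promise held initially

    def renew(self, now):
        # Called when the backup sends (or renews) its promise.
        self.expires = now + self.duration

    def can_read_locally(self, now):
        # While the promise holds, reads need no communication with the backup.
        return now < self.expires


lease = ReadLease(duration=5.0)
lease.renew(now=100.0)
print(lease.can_read_locally(104.0), lease.can_read_locally(105.0))
```

Once `can_read_locally` returns False, the primary has to contact the backup for a new promise before answering a read.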
Handling Failures: View Changes • View – the composition of the group and the roles of its members • When some members fail, the view has to change • A view change selects the members of the new view and makes sure that the state of the new view reflects all committed operations from previous views • The designated primary and backup monitor other group members to detect changes in communication ability • If they cannot communicate with some of the members, a view change is needed • Either a primary or a backup can initiate a view change (not the witness)
View Change • [Figure: two scenarios among primary, backup, and witness] • Backup cannot reach the primary, but it can reach the witness • Backup initiates the view change • Primary cannot reach the backup, but can reach the witness • Primary initiates the view change
Causes and Outcomes of View Changes • A primary fails, so a new primary is needed • A backup will become the primary after a view change • A backup fails, someone else needs to replicate the state at the primary • Witness is configured to act as a backup – the witness is promoted • A primary that had failed comes back • It will bring itself up-to-date (using other servers’ logs) and will become the primary again • A backup that had failed comes back • It will bring itself up-to-date; the previously promoted witness will no longer act as backup – the witness is demoted
View Change: The Algorithm • The node that starts the view change acts as coordinator • Phase 1: • Coordinator tells others it wants to start a view change • Others stop processing any operations and send the coordinator their state, i.e., log records (that the coordinator does not already have) • The coordinator applies the log records to bring itself up-to-date
View Change: The Algorithm • Phase 2: • The coordinator writes the new view number to disk • Sends the view state to all participants • If both backup and witness responded, witness will be demoted • If only the witness responded, witness will be promoted • Other nodes write the view number to disk
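The two phases of the view change can be sketched as one function for a three-node group (n = 1). This is a rough illustration, not Harp's algorithm in full: logs are modelled as dictionaries from record index to operation, writing the view number to disk is reduced to returning it, and the name `view_change` and the shape of `responses` are assumptions.

```python
def view_change(coordinator_log, responses, old_view):
    """Run by the coordinator (a primary or a backup).

    responses maps the role of each node that answered ('primary',
    'backup' or 'witness') to the log records it sent back.
    Returns None if no quorum could be formed.
    """
    if len(responses) < 1:
        return None                       # coordinator alone is not a majority of 3
    # Phase 1: merge in the records the coordinator does not already have.
    merged = dict(coordinator_log)
    for log in responses.values():
        for index, op in log.items():
            merged.setdefault(index, op)
    # Phase 2: pick the new view; the witness is promoted only if no other
    # data-holding server responded.
    data_nodes = [role for role in responses if role != "witness"]
    witness_promoted = "witness" in responses and not data_nodes
    return {"view": old_view + 1, "log": merged,
            "witness_promoted": witness_promoted}


print(view_change({1: "a"}, {"witness": {1: "a", 2: "b"}}, old_view=7))
```

In the example only the witness answered, so the new view contains the coordinator and a promoted witness, and the coordinator picks up record 2 from the witness's log.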
A Promoted Witness • Does not have a copy of the file system state • Under normal operation, does not update the file system • A promoted witness begins logging filesystem state • Upon promotion receives all log records that have not made it to disk (everything later than the GLB pointer) • Promoted witness never discards log records • When the log becomes too large, it is stored on disk or tape
Simultaneous View Changes • Suppose primary and backup cannot communicate with each other • They both initiate a view change simultaneously • One view change will be redundant – don’t want to waste time/resources on a useless view change • Solution: delay the view change at the backup • This way the primary is most likely to “win the race” for the view change • What happens if simultaneous view changes are in place?
Optimizations for Fast View Changes • User operations are not processed during a view change, so view changes must be fast • A view change may be slow if the server that must bring itself up-to-date has to receive many log records from other servers • Therefore, the server that must bring itself up-to-date in a new view (e.g., the primary that comes back after a failure) brings itself up-to-date before initiating the view change • If the server’s disk is intact, it gets log records from the witness • If the disk is damaged, it gets the FS state from the backup and then gets log records from the witness
Other Optimizations • When the witness is promoted, it must receive all log entries beyond GLB • The number of entries is likely to be large, so the view change may be slow • To expedite the view change, the witness is kept in hot standby • The primary sends all updates to the witness. The witness logs them, but does not acknowledge them. It discards old entries from memory and does not log them to disk or tape
Guarding Against a “Killer Packet” • Many crashes are due to software bugs • Some bugs may cause simultaneous failure at the primary and backup – e.g., an OS bug is triggered by a certain FS operation • To guard against this, the backup delays applying changes to the FS until they have been applied at the primary: AP_backup ≤ AP_primary • If the primary fails after applying a certain change, the backup will likely initiate the view change and will send the log to the witness • So even if the backup fails after applying the same operation that crashed the primary, the record of that operation won’t be lost
A Potential Failure Scenario • [Figure: primary and backup] • 1. Primary receives the operation from the client • 2. Primary forwards it to the backup • 3. Backup records the operation in the log • 4. Backup responds to the primary • 5. Primary commits the operation • 6. Primary responds to the client • 7. Primary crashes • Backup does not know if the operation was committed • Does it assume it was not committed and discard log entries? • Does it assume it was committed and apply the results?
Let’s Play Harp! • Let’s go over all the steps • During normal operation • And with failures
Summary • Primary-copy file system • Unlike other replicated file systems, provides good performance, because disk writes are not in the critical path • Needs at least 2n+1 participants to handle n failures • Data is replicated on only n+1 servers, to save disk space • Points that would benefit from more evidence/discussion: • How the system works with view changes • What happens if a component crashes during a view change? • What happens with log records of uncommitted operations?
Google File System • A real massive distributed file system • Hundreds of servers and clients • The largest cluster has >1000 storage nodes, over 300 TB of disk storage, hundreds of clients • Metadata replication • Data replication • Design driven by application workload and technological environment • Avoided many of the difficulties traditionally associated with replication by designing for a specific use case
Specifics of the Google Environment • FS consists of hundreds of storage machines, built of inexpensive commodity parts • Component failures are the norm • Application and OS bugs • Human errors • Hardware failures: disks, memory, network, power supplies • Millions of files, each 100 MB or larger • Multi-GB files are common • Applications are written for GFS • Allows co-design of the file system and applications
Specifics of the Google Workload • Most files are mutated by appending new data – large sequential writes • Random writes are very uncommon • Files are written once, then they are only read • Reads are sequential • Large streaming reads and small random reads • High bandwidth is more important than low latency • Google applications: • Data analysis programs that scan through data repositories • Data streaming applications • Archiving • Applications producing (intermediate) search results
GFS Architecture (cont.) • Single master • Multiple chunk servers • Multiple clients • Each is a commodity Linux machine, a server is a user-level process • Files are divided into chunks • Each chunk has a handle (an ID assigned by the master) • Each chunk is replicated (on three machines by default) • Master stores metadata, manages chunks, does garbage collection, etc. • Clients communicate with master for metadata operations, but with chunkservers for data operations • No additional caching (besides the Linux in-memory buffer caching)
Client/GFS Interaction • Client: • Takes file and offset • Translates it into the chunk index within the file • Sends request to master, containing file name and chunk index • Master: • Replies with the corresponding chunk handle and location of the replicas (the master must know where the replicas are) • Client: • Caches this information • Contacts one of the replicas (i.e., a chunkserver) for data
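The client-side translation from a byte offset to a chunk index is plain integer division. The sketch below assumes the 64 MB chunk size reported for GFS, which is not stated on this slide; the function names are illustrative.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB; assumed here, not given on the slide


def chunk_index(offset):
    """Index of the chunk that contains byte `offset` of a file."""
    return offset // CHUNK_SIZE


def chunk_range(offset, length):
    """Chunk indices touched by reading `length` > 0 bytes from `offset`."""
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return list(range(first, last + 1))


print(chunk_index(200 * 1024 * 1024))    # byte 200 MB falls in chunk 3
print(chunk_range(CHUNK_SIZE - 1, 2))    # a read that straddles chunks 0 and 1
```

The client sends the file name and each chunk index to the master and caches the handle and replica locations that come back.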
Master • Stores metadata • The file and chunk namespaces • Mapping from files to chunks • Locations of each chunk’s replicas • Interacts with clients • Creates chunk replicas • Orchestrates chunk modifications across multiple replicas • Ensures atomic concurrent appends • Locks concurrent operations • Deletes old files (via garbage collection)
Metadata On Master • Metadata – data about the data: • File names • Mapping of file names to chunk IDs • Chunk locations • Metadata is kept in memory • File names and chunk mappings are also kept persistent in an operation log • Chunk locations are kept in memory only • They will be lost during the crash • The master asks chunk servers about their chunks at startup – builds a table of chunk locations
Why Keep Metadata In Memory? • To keep master operations fast • Master can periodically scan its internal state in the background, in order to implement: • Garbage collection • Re-replication (in case of chunk server failures) • Chunk migration (for load balancing) • But the file system size is limited by the amount of memory on the master? • This has not been a problem for GFS – metadata is compact
Why Not Keep Chunk Locations Persistent? • Chunk location – which chunk server has a replica of a given chunk • Master polls chunk servers for that information on startup • Thereafter, master keeps itself up-to-date: • It controls all initial chunk placement, migration and re-replication • It monitors chunkserver status with regular HeartBeat messages • Motivation: simplicity • Eliminates the need to keep master and chunkservers synchronized • Synchronization would be needed when chunkservers: • Join and leave the cluster • Change names • Fail and restart
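Because chunk locations are not persisted, the master rebuilds them from what the chunkservers report. A minimal sketch of that step follows; the function name and the shape of `reports` (chunkserver name mapped to the chunk handles it holds) are assumptions for illustration.

```python
from collections import defaultdict


def build_chunk_locations(reports):
    """Invert per-chunkserver reports into a chunk handle -> servers table."""
    locations = defaultdict(set)
    for server, handles in reports.items():
        for handle in handles:
            locations[handle].add(server)
    return dict(locations)


reports = {"cs1": ["h1", "h2"], "cs2": ["h2", "h3"], "cs3": ["h1", "h2", "h3"]}
table = build_chunk_locations(reports)
print(sorted(table["h2"]))
```

The same function could be rerun on later reports, so a chunkserver that joins, leaves, or restarts never leaves a stale persistent record behind.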
Operation Log • Historical record of metadata changes • Maintains logical order of concurrent operations • Log is used for recovery – the master replays it in the event of failures • Master periodically checkpoints the log • Checkpoint is a B-tree data structure • Can be loaded into memory • Used for namespace lookup without extra parsing • Checkpointing can be done in the background
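Recovery from a checkpoint plus the log written after it can be sketched as a replay loop. The record format used here (tuples such as `("create", name)`) and the dictionary standing in for the checkpoint are invented for illustration; the real checkpoint is a B-tree-like structure, as noted on the slide.

```python
def recover(checkpoint, log_tail):
    """Rebuild the file -> chunk handles mapping.

    checkpoint: mapping of file name to its list of chunk handles.
    log_tail: operations logged after the checkpoint, in log order.
    """
    state = {name: list(chunks) for name, chunks in checkpoint.items()}
    for op in log_tail:
        kind = op[0]
        if kind == "create":
            state[op[1]] = []
        elif kind == "add_chunk":
            state[op[1]].append(op[2])
        elif kind == "delete":
            state.pop(op[1], None)
    return state


tail = [("create", "/b"), ("add_chunk", "/b", "h2"), ("delete", "/a")]
print(recover({"/a": ["h1"]}, tail))
```

Replaying in log order matters: the log defines the logical order of concurrent metadata operations.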
Data Consistency in GFS • Loose data consistency – applications are designed for it • Applications may see inconsistent data – data is different on different replicas • Applications may see data from partially completed writes – undefined file region • On successful modification the file region is consistent • A write may leave the region undefined – if the client reads the file before another client’s write is complete • Replicas are not guaranteed to be bytewise identical (we’ll see why later, and how clients deal with this)
Data Consistency in GFS (cont.) • Failures: • A modification may fail at one or more replicas • On modification failure, file region is inconsistent • Successes: • Modifications are applied to a chunk in the same order on all replicas • After a number of successful modifications, the file region is guaranteed to be defined: • All replicas have the same data • All replicas contain all the data written by all the write operations
Implications of Loose Data Consistency For Applications • Applications are designed to handle loose data consistency • Example 1: a file is generated from beginning to end • An application creates a file with a temporary name • Atomically renames the file • May periodically checkpoint the file while it is written • File is written via appends – more resilient to failures than random writes • Example 2: producer-consumer file • Many writers concurrently append to one file (for merged results) • Each record is self-validating (contains a checksum) • Client filters out padding and duplicate records
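A reader for the producer-consumer case might look like the sketch below. The record layout (payload length, CRC32 of the payload, and a record id used to spot duplicates) is an assumption made for this example, not a GFS format; only `struct` and `zlib` from the standard library are used.

```python
import struct
import zlib

HEADER = struct.Struct(">IIQ")   # payload length, CRC32 of payload, record id


def encode(record_id, payload):
    """Build one self-validating record."""
    return HEADER.pack(len(payload), zlib.crc32(payload), record_id) + payload


def read_valid(data):
    """Return payloads of valid records, skipping padding and duplicate ids."""
    seen, out, pos = set(), [], 0
    while pos + HEADER.size <= len(data):
        length, crc, record_id = HEADER.unpack_from(data, pos)
        end = pos + HEADER.size + length
        payload = data[pos + HEADER.size:end]
        if length > 0 and end <= len(data) and zlib.crc32(payload) == crc:
            if record_id not in seen:    # drop duplicates left by retried appends
                seen.add(record_id)
                out.append(payload)
            pos = end
        else:
            pos += 1                     # not a valid record: treat as padding, resync
    return out


data = encode(1, b"alpha") + b"\x00" * 7 + encode(1, b"alpha") + encode(2, b"beta")
print(read_valid(data))
```

The `length > 0` check keeps a run of zero padding from being read as an empty record, since the CRC32 of an empty payload is also zero.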
Updates of Replicated Data • Each mutation (modification) is performed at all the replicas • Modifications are applied in the same order across all replicas • Master grants a chunk lease to one replica – i.e., the primary • The primary picks a serial order for all mutations to the chunk • The client pushes data to all replicas • The primary tells the replicas in which order they should apply modifications
Updates of Replicated Data (cont.) • Client asks master for replica locations • Master responds • Client pushes data to all replicas; replicas store it in a buffer cache • Client sends a write request to the primary (identifying the data that had been pushed) • Primary forwards request to the secondaries (identifies the order) • The secondaries respond to the primary • The primary responds to the client
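The separation between pushing data and ordering writes can be sketched as follows. This is a simplified model with no failures, leases, or networking; the class names and the `data_id` used to identify pushed data are illustrative.

```python
class Replica:
    def __init__(self):
        self.buffer = {}   # data id -> bytes pushed by the client, not yet applied
        self.chunk = []    # mutations applied so far, in serial order

    def push(self, data_id, data):
        self.buffer[data_id] = data            # step 3: data arrives, order undecided

    def apply(self, serial, data_id):
        assert serial == len(self.chunk)       # apply strictly in the primary's order
        self.chunk.append(self.buffer.pop(data_id))
        return True


class PrimaryReplica(Replica):
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries

    def write(self, data_id):
        serial = len(self.chunk)               # the primary picks the serial order
        self.apply(serial, data_id)
        # Forward the request with its serial number; succeed only if all reply.
        return all(s.apply(serial, data_id) for s in self.secondaries)


s1, s2 = Replica(), Replica()
primary = PrimaryReplica([s1, s2])
for replica in (primary, s1, s2):
    replica.push("d1", b"x")
    replica.push("d2", b"y")
print(primary.write("d2"), primary.write("d1"), s1.chunk)
```

Although "d1" was pushed first, every replica ends up with "d2" applied before "d1", because only the primary's write requests fix the order.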
Failure Handling During Updates • If a write fails at the primary: • The primary may report failure to the client – the client will retry • If the primary does not respond, the client retries from Step 1 by contacting the master • If a write succeeds at the primary, but fails at several replicas • The client retries several times (Steps 3-7)
Data Flow • Data flow is decoupled from control flow • Data is pushed linearly across all chunkservers in a pipelined fashion (not necessarily from client to primary and from primary to secondary) • Client forwards data to the closest replica; that replica forwards to the next closest replica, etc. • Pipelined fashion: while the data is incoming, the server begins forwarding it to the next replica • This design ensures good network utilization
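The forwarding order can be sketched as a greedy chain: each hop sends to the closest replica that has not yet received the data. How "closest" is measured is left to a caller-supplied `distance` function, which is an assumption of this example, as is the function name.

```python
def forwarding_chain(client, replicas, distance):
    """Order in which replicas receive the data, starting from the client."""
    chain, current, remaining = [], client, set(replicas)
    while remaining:
        # sorted() makes tie-breaking deterministic for equal distances.
        nearest = min(sorted(remaining), key=lambda r: distance(current, r))
        chain.append(nearest)
        remaining.remove(nearest)
        current = nearest
    return chain


# Toy topology: machines placed on a line, distance is the gap between them.
position = {"client": 0, "a": 5, "b": 1, "c": 3}
print(forwarding_chain("client", ["a", "b", "c"],
                       lambda x, y: abs(position[x] - position[y])))
```

Because each server forwards while it is still receiving, every link in the chain carries the data once, rather than the client's link carrying it once per replica.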
Atomic Record Appends • Atomic append is a write – but GFS (the primary replica) chooses the offset where the append happens; returns the offset to the client • This way GFS can decide on serial order of concurrent appends without client synchronization • If an append fails at some replicas – the client retries • As a result, the file may contain multiple copies of the same record, plus replicas may be bytewise different • But after a successful update all replicas will be defined – they will all have the data written by the client at the same offset
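How a failed and retried append produces duplicates and padding can be seen in a small simulation. Offsets are counted in record-sized slots rather than bytes, failures are injected by hand through `fail_on`, and all names are hypothetical.

```python
PAD = None   # marks a slot a replica had to skip (padding)


class ChunkReplica:
    def __init__(self):
        self.records = []                    # one slot per record-sized offset

    def write_at(self, offset, record):
        while len(self.records) < offset:    # fill any gap left by a missed append
            self.records.append(PAD)
        if len(self.records) == offset:
            self.records.append(record)
        else:
            self.records[offset] = record


def record_append(primary, secondaries, record, fail_on=()):
    """One append attempt; returns the offset, or None if a replica failed."""
    offset = len(primary.records)            # the primary chooses the offset
    primary.write_at(offset, record)
    ok = True
    for s in secondaries:
        if s in fail_on:
            ok = False                       # this replica missed the write
        else:
            s.write_at(offset, record)
    return offset if ok else None


p, s = ChunkReplica(), ChunkReplica()
first = record_append(p, [s], "r1", fail_on=(s,))   # attempt fails at the secondary
retry = record_append(p, [s], "r1")                 # client retries
print(first, retry, p.records, s.records)
```

After the retry the primary holds the record twice and the secondary holds padding followed by the record, yet both have "r1" at the returned offset, which is the guarantee the slide describes.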
Non-Identical Replicas • Because of failed and retried record appends, replicas may be non-identical bytewise • Some replicas may have duplicate records (because of failed and retried appends) • Some replicas may have padded file space (empty space filled with junk) – if the primary chooses a record offset higher than the first available offset at a replica • Clients must deal with it: they write self-identifying records so they can distinguish valid data from junk • If clients cannot tolerate duplicates, they must insert version numbers in records • GFS pushes complexity to the client; without this, a complex failure recovery scheme would need to be in place