MinCopysets: Derandomizing Replication in Cloud Storage Asaf Cidon, Ryan Stutsman, Stephen Rumble, Sachin Katti, John Ousterhout and Mendel Rosenblum Stanford University
Overview • Assumptions: no geo-replication; Azure uses much smaller clusters in practice
RAMCloud • Primary data stored on master (memory) • Divide each master's data into chunks • Chunks are replicated on backups (disk) • When a master crashes, recover from thousands of backups • [Diagram: masters (memory), backups (disk), and a crashed master]
Random Replication • [Diagram: chunks 1-3, each with one primary and two secondary replicas scattered at random across nodes 1-10]
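To make the scheme concrete, here is a minimal Python sketch of random replication as drawn above; the node ids, R = 3, and the place_chunk_randomly helper are illustrative assumptions, not code from any of the systems discussed.

```python
import random

R = 3                        # replication factor (assumed, matching the diagram)
NODES = list(range(1, 11))   # nodes 1..10 as in the diagram

def place_chunk_randomly(chunk_id):
    """Pick R distinct nodes uniformly at random for one chunk's replicas."""
    replicas = random.sample(NODES, R)
    return {"chunk": chunk_id, "primary": replicas[0], "secondaries": replicas[1:]}

# Example: place chunks 1-3, as in the diagram above
for c in (1, 2, 3):
    print(place_chunk_randomly(c))
```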
The Problem • Randomized replication loses data in power outages • 0.5-1% of the nodes fail to reboot • Power outages occur 1-2 times a year • Result: a handful of chunks (GBs of data) become unavailable (LinkedIn '12) • Sub-problem: managed power downs • Software upgrades • Reduced power consumption
Intuition • If we have one chunk, we are safe: • Replicate the chunk on three nodes • Data is lost only if the failed nodes contain all three copies of a chunk • 1% of the nodes fail: 0.0001% probability of data loss • If we have millions of chunks, we lose data: • A 1000-node HDFS cluster has 10 million chunks • 1% of the nodes fail: 99.93% probability of data loss
Mathematical Intuition • A copyset of nodes is a single unit of failure • Each chunk is replicated on a single copyset • For one chunk, the probability of data loss is C(F, R) / C(N, R), i.e. the chance that the chunk's copyset falls entirely within the failed nodes • F = number of failed nodes • R = replication factor • N = number of nodes • For all chunks, the probability is 1 - (1 - C(F, R) / C(N, R))^B • B = number of chunks
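As a sanity check (not part of the original slides), the two expressions can be evaluated directly; with the numbers from the previous slide (N = 1000, F = 10, R = 3, B = 10 million) they reproduce the quoted loss probabilities:

```python
from math import comb

def loss_prob_one_chunk(N, R, F):
    """Probability that one chunk's random copyset lies entirely inside the F failed nodes."""
    return comb(F, R) / comb(N, R)

def loss_prob_all_chunks(N, R, F, B):
    """Probability that at least one of B independently placed chunks is lost."""
    p = loss_prob_one_chunk(N, R, F)
    return 1 - (1 - p) ** B

N, R, F, B = 1000, 3, 10, 10_000_000
print(f"one chunk:  {loss_prob_one_chunk(N, R, F):.6%}")      # ~0.00007%, rounded to 0.0001% on the slide
print(f"all chunks: {loss_prob_all_chunks(N, R, F, B):.2%}")  # ~99.93%
```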
Changing R Doesn't Help
Changing the Chunk Size Doesn't Help
MinCopysets: Decouple Load Balancing and Durability • Split nodes into fixed replication groups • Random Distribution: place the primary replica on a random node • Deterministic Replication: place the secondary replicas deterministically in the same replication group as the primary
MinCopysets Architecture • [Diagram: three fixed replication groups of three nodes each (e.g. nodes {1, 2, 55}, {7, 8, 83}, {22, 24, 47}); chunks 1-4 are each replicated entirely within a single group]
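A minimal sketch of the placement rule, using the hypothetical group membership from the diagram above; picking a random group and then a random node inside it is equivalent to picking a random primary when all groups have the same size.

```python
import random

# Fixed replication groups, e.g. as assigned by a coordinator (hypothetical node ids)
GROUPS = [
    [1, 2, 55],     # replication group 1
    [7, 8, 83],     # replication group 2
    [22, 24, 47],   # replication group 3
]

def place_chunk_mincopysets(chunk_id):
    """Random distribution of the primary, deterministic placement of the secondaries."""
    group = random.choice(GROUPS)       # load balancing: pick a group at random
    primary = random.choice(group)      # primary on a random node of that group
    secondaries = [n for n in group if n != primary]
    return {"chunk": chunk_id, "primary": primary, "secondaries": secondaries}

for c in (1, 2, 3, 4):
    print(place_chunk_mincopysets(c))
```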
Extreme Failure Scenarios • Even in the extreme scenario where 3-4% of the cluster's nodes fail to reboot, MinCopysets provides low data loss probabilities • For example: • 4000-node HDFS cluster • 120 nodes fail to reboot after a power outage • Only 3.5% probability of data loss
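A back-of-the-envelope check of the 3.5% figure (an illustrative sketch, assuming groups of exactly R nodes and uniformly random reboot failures): data is lost only if some entire replication group lies inside the failed set.

```python
from math import comb

N, R, F = 4000, 3, 120      # cluster size, replication factor, nodes that fail to reboot
num_groups = N // R         # ~1333 fixed replication groups

# Probability that one specific replication group lies entirely within the F failed nodes
# (the same C(F, R) / C(N, R) expression as on the "Mathematical Intuition" slide)
p_group = comb(F, R) / comb(N, R)

# Data is lost only if at least one whole group failed; groups are disjoint,
# so treat them as (nearly) independent trials
p_loss = 1 - (1 - p_group) ** num_groups
print(f"probability of data loss: {p_loss:.1%}")   # ~3.5%
```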
Extreme Failure Scenarios: Normal Clusters
Extreme Failure Scenarios: Big Clusters
MinCopysets’ Trade-off • Trades the frequency of data loss against its magnitude • Expected data loss is the same • Data loss occurs very rarely • The magnitude of each loss event is greater
Frequency vs. Magnitude of Failures • Setup: 5000-node HDFS cluster, 3 TB per machine, R = 3, one power outage per year • Random replication: lose 5.5 GB every year • MinCopysets: lose data once every 625 years, but lose an entire node's data when a loss occurs
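A rough reproduction of these numbers (a sketch; the 1% failure fraction and the treatment of the 3 TB per machine as raw storage are assumptions, so it lands near, not exactly on, the slide's 5.5 GB and 625-year figures).

```python
from math import comb

N, R = 5000, 3
F = N // 100                        # assume 1% of the nodes fail in each annual outage
unique_data_gb = N * 3 * 1000 / R   # 3 TB of raw storage per machine => ~5 PB of unique data

# Probability that a specific set of R nodes is entirely contained in the F failed nodes
p_set = comb(F, R) / comb(N, R)

# Random replication: each chunk is an independent trial, so expected loss scales with data size
print(f"random replication: ~{unique_data_gb * p_set:.1f} GB lost per outage")

# MinCopysets: data is lost only when an entire fixed group fails; with one outage per year,
# the mean time between loss events is roughly 1 / P(some group is lost in an outage)
num_groups = N // R
p_loss = 1 - (1 - p_set) ** num_groups
print(f"MinCopysets: one loss event every ~{1 / p_loss:.0f} years")
```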
RAMCloud Implementation • The RAMCloud implementation was relatively straightforward • Two non-trivial issues: • Need to manage groups of nodes • Allocate chunks on entire groups • Manage nodes joining and leaving groups • Machine failures are more complex • Need to re-replicate the entire group, rather than individual nodes
RAMCloud Implementation • [Diagram: RPC flow between the RAMCloud Master, Backup, and Coordinator (with its Coordinator Server List): an "Assign Replication Group" request, a "Replication Group" reply, and an "Open New Chunk" request]
HDFS Implementation • Even simpler than RAMCloud • In HDFS, replication decisions are centralized at the NameNode; in RAMCloud they are distributed • The NameNode assigns DataNodes to replication groups • Prototyped in 200 LoC
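A toy sketch, not the actual 200-line prototype, of what the NameNode-side bookkeeping could look like: DataNodes are partitioned into fixed groups of R as they register, and block placement returns a whole group.

```python
import random

R = 3  # replication factor / group size (assumed)

class ReplicationGroupManager:
    """Toy stand-in for the NameNode-side bookkeeping described above."""

    def __init__(self):
        self.groups = []    # full groups, each a list of R DataNode ids
        self.pending = []   # DataNodes waiting until a full group can be formed

    def register_datanode(self, datanode_id):
        """Called when a DataNode joins; form a new group once R nodes are pending."""
        self.pending.append(datanode_id)
        if len(self.pending) == R:
            self.groups.append(self.pending)
            self.pending = []

    def choose_targets(self, block_id):
        """Place all replicas of a block on one randomly chosen replication group."""
        group = random.choice(self.groups)
        return list(group)

# Usage sketch
mgr = ReplicationGroupManager()
for dn in ["dn-%d" % i for i in range(9)]:
    mgr.register_datanode(dn)
print(mgr.choose_targets("blk_001"))
```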
HDFS Issues • Has the same issues as RAMCloud in managing groups of nodes • Issue: Repair bandwidth • Solution: Hybrid scheme • Issue: Network bottlenecks and load balancing • Solution: Kill the replication group and re-replicate its data elsewhere • Issue: A replication group's capacity is limited by the node with the smallest capacity • Solution: Form replication groups from nodes with similar capacities
Facebook’s HDFS Replication • Facebook constrains the placement of secondary replicas to a group of 10 nodes to prevent data loss • Facebook's Algorithm: • The primary replica is placed on node j in rack k • The secondary replicas are placed on randomly selected nodes among (j+1, …, j+5), on racks (k+1, k+2)
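A sketch of this placement rule; the cluster shape, the wrap-around at rack and node boundaries, and the function name are assumptions for illustration.

```python
import random

NODES_PER_RACK = 20   # assumed cluster shape
NUM_RACKS = 20

def facebook_placement(j, k):
    """Primary on node j of rack k; secondaries on random nodes from the window
    (j+1..j+5), spread over the next two racks (k+1, k+2): 10 candidates in total."""
    primary = (k, j)
    candidate_nodes = [(j + d) % NODES_PER_RACK for d in range(1, 6)]
    candidate_racks = [(k + 1) % NUM_RACKS, (k + 2) % NUM_RACKS]
    secondaries = set()
    while len(secondaries) < 2:
        secondaries.add((random.choice(candidate_racks), random.choice(candidate_nodes)))
    return primary, sorted(secondaries)

print(facebook_placement(j=4, k=7))
```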
Facebook’s Replication
Hybrid MinCopysets • Split nodes into replication groups of 2 and 15 • The first and second replicas are always placed on the group of 2 • The third replica is placed on a randomly chosen node in the group of 15
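A sketch of the hybrid rule; the node ids and the pairing of each 2-node group with one 15-node group are assumptions for illustration.

```python
import random

# Hypothetical layout: each entry pairs a 2-node group with an associated 15-node group
HYBRID_GROUPS = [
    {"pair": [1, 2], "wide": list(range(100, 115))},
    {"pair": [3, 4], "wide": list(range(115, 130))},
]

def place_chunk_hybrid(chunk_id):
    """First two replicas always on the 2-node group; third on a random node of the 15-node group."""
    g = random.choice(HYBRID_GROUPS)
    first, second = g["pair"]
    third = random.choice(g["wide"])
    return {"chunk": chunk_id, "replicas": [first, second, third]}

print(place_chunk_hybrid(1))
```

Constraining the first two replicas to a fixed pair caps the number of distinct copysets, while scattering the third replica over 15 nodes preserves some of random replication's repair bandwidth, which is the issue the hybrid scheme is meant to address.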
Thank You! Stanford University