MinCopysets: Derandomizing Replication in Cloud Storage Asaf Cidon, Ryan Stutsman, Stephen Rumble, Sachin Katti, John Ousterhout and Mendel Rosenblum Stanford University
Overview • Assumptions: no geo-replication; Azure uses much smaller clusters in practice
RAMCloud • Primary data stored on master (memory) • Divide each master's data into chunks • Chunks are replicated on backups (disk) • When a master crashes, recover from thousands of backups • [Diagram: masters (memory), backups (disk), and a crashed master]
Random Replication • [Diagram: chunks 1-3, each with one primary and two secondary replicas scattered at random across nodes 1-10]
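To make the scheme concrete, here is a minimal Python sketch of random replication as drawn above; the node ids, R = 3, and the place_chunk_randomly helper are illustrative assumptions, not code from any of the systems discussed.

```python
import random

R = 3                        # replication factor (assumed, matching the diagram)
NODES = list(range(1, 11))   # nodes 1..10 as in the diagram

def place_chunk_randomly(chunk_id):
    """Pick R distinct nodes uniformly at random for one chunk's replicas."""
    replicas = random.sample(NODES, R)
    return {"chunk": chunk_id, "primary": replicas[0], "secondaries": replicas[1:]}

# Example: place chunks 1-3, as in the diagram above
for c in (1, 2, 3):
    print(place_chunk_randomly(c))
```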
The Problem • Randomized replication loses data in power outages • 0.5-1% of the nodes fail to reboot • Power outages occur 1-2 times a year • Result: a handful of chunks (GBs of data) become unavailable (LinkedIn '12) • Sub-problem: managed power downs • Software upgrades • Reduced power consumption
Intuition • If we have one chunk, we are safe: • Replicate the chunk on three nodes • Data is lost only if the failed nodes contain all three copies of a chunk • 1% of the nodes fail: 0.0001% probability of data loss • If we have millions of chunks, we lose data: • A 1000-node HDFS cluster has 10 million chunks • 1% of the nodes fail: 99.93% probability of data loss
Mathematical Intuition • A copyset of nodes is a single unit of failure • Each chunk is replicated on a single copyset • For one chunk, the probability of data loss is C(F, R) / C(N, R), i.e. the chance that the chunk's copyset falls entirely within the failed nodes • F = number of failed nodes • R = replication factor • N = number of nodes • For all chunks, the probability is 1 - (1 - C(F, R) / C(N, R))^B • B = number of chunks
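As a sanity check (not part of the original slides), the two expressions can be evaluated directly; with the numbers from the previous slide (N = 1000, F = 10, R = 3, B = 10 million) they reproduce the quoted loss probabilities:

```python
from math import comb

def loss_prob_one_chunk(N, R, F):
    """Probability that one chunk's random copyset lies entirely inside the F failed nodes."""
    return comb(F, R) / comb(N, R)

def loss_prob_all_chunks(N, R, F, B):
    """Probability that at least one of B independently placed chunks is lost."""
    p = loss_prob_one_chunk(N, R, F)
    return 1 - (1 - p) ** B

N, R, F, B = 1000, 3, 10, 10_000_000
print(f"one chunk:  {loss_prob_one_chunk(N, R, F):.6%}")      # ~0.00007%, rounded to 0.0001% on the slide
print(f"all chunks: {loss_prob_all_chunks(N, R, F, B):.2%}")  # ~99.93%
```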
Changing R Doesn't Help
Changing the Chunk Size Doesn't Help
MinCopysets: Decouple Load Balancing and Durability • Split nodes into fixed replication groups • Random Distribution: place the primary replica on a random node • Deterministic Replication: place the secondary replicas deterministically in the same replication group as the primary
MinCopysets Architecture • [Diagram: three fixed replication groups of three nodes each (e.g. nodes {1, 2, 55}, {7, 8, 83}, {22, 24, 47}); chunks 1-4 are each replicated entirely within a single group]
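A minimal sketch of the placement rule, using the hypothetical group membership from the diagram above; picking a random group and then a random node inside it is equivalent to picking a random primary when all groups have the same size.

```python
import random

# Fixed replication groups, e.g. as assigned by a coordinator (hypothetical node ids)
GROUPS = [
    [1, 2, 55],     # replication group 1
    [7, 8, 83],     # replication group 2
    [22, 24, 47],   # replication group 3
]

def place_chunk_mincopysets(chunk_id):
    """Random distribution of the primary, deterministic placement of the secondaries."""
    group = random.choice(GROUPS)       # load balancing: pick a group at random
    primary = random.choice(group)      # primary on a random node of that group
    secondaries = [n for n in group if n != primary]
    return {"chunk": chunk_id, "primary": primary, "secondaries": secondaries}

for c in (1, 2, 3, 4):
    print(place_chunk_mincopysets(c))
```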
Extreme Failure Scenarios • Even in the extreme scenario where 3-4% of the cluster's nodes fail to reboot, MinCopysets provides low data loss probabilities • For example: • 4000-node HDFS cluster • 120 nodes fail to reboot after a power outage • Only 3.5% probability of data loss
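A back-of-the-envelope check of the 3.5% figure (an illustrative sketch, assuming groups of exactly R nodes and uniformly random reboot failures): data is lost only if some entire replication group lies inside the failed set.

```python
from math import comb

N, R, F = 4000, 3, 120      # cluster size, replication factor, nodes that fail to reboot
num_groups = N // R         # ~1333 fixed replication groups

# Probability that one specific replication group lies entirely within the F failed nodes
# (the same C(F, R) / C(N, R) expression as on the "Mathematical Intuition" slide)
p_group = comb(F, R) / comb(N, R)

# Data is lost only if at least one whole group failed; groups are disjoint,
# so treat them as (nearly) independent trials
p_loss = 1 - (1 - p_group) ** num_groups
print(f"probability of data loss: {p_loss:.1%}")   # ~3.5%
```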
Extreme Failure Scenarios: Normal Clusters
Extreme Failure Scenarios: Big Clusters
MinCopysets’ Trade-off • Trades the frequency of data loss against its magnitude • Expected data loss is the same • Data loss occurs very rarely • The magnitude of each loss event is greater
Frequency vs. Magnitude of Failures • Setup: 5000-node HDFS cluster, 3 TB per machine, R = 3, one power outage per year • Random replication: lose 5.5 GB every year • MinCopysets: lose data once every 625 years, but lose an entire node's data when a loss occurs
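A rough reproduction of these numbers (a sketch; the 1% failure fraction and the treatment of the 3 TB per machine as raw storage are assumptions, so it lands near, not exactly on, the slide's 5.5 GB and 625-year figures).

```python
from math import comb

N, R = 5000, 3
F = N // 100                        # assume 1% of the nodes fail in each annual outage
unique_data_gb = N * 3 * 1000 / R   # 3 TB of raw storage per machine => ~5 PB of unique data

# Probability that a specific set of R nodes is entirely contained in the F failed nodes
p_set = comb(F, R) / comb(N, R)

# Random replication: each chunk is an independent trial, so expected loss scales with data size
print(f"random replication: ~{unique_data_gb * p_set:.1f} GB lost per outage")

# MinCopysets: data is lost only when an entire fixed group fails; with one outage per year,
# the mean time between loss events is roughly 1 / P(some group is lost in an outage)
num_groups = N // R
p_loss = 1 - (1 - p_set) ** num_groups
print(f"MinCopysets: one loss event every ~{1 / p_loss:.0f} years")
```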
RAMCloud Implementation • The RAMCloud implementation was relatively straightforward • Two non-trivial issues: • Need to manage groups of nodes • Allocate chunks on entire groups • Manage nodes joining and leaving groups • Machine failures are more complex • Need to re-replicate the entire group, rather than individual nodes
RAMCloud Implementation • [Diagram: RPC flow between the RAMCloud Master, Backup, and Coordinator (with its Coordinator Server List): an "Assign Replication Group" request, a "Replication Group" reply, and an "Open New Chunk" request]
HDFS Implementation • Even simpler than RAMCloud • In HDFS, replication decisions are centralized at the NameNode; in RAMCloud they are distributed • The NameNode assigns DataNodes to replication groups • Prototyped in 200 LoC
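A toy sketch, not the actual 200-line prototype, of what the NameNode-side bookkeeping could look like: DataNodes are partitioned into fixed groups of R as they register, and block placement returns a whole group.

```python
import random

R = 3  # replication factor / group size (assumed)

class ReplicationGroupManager:
    """Toy stand-in for the NameNode-side bookkeeping described above."""

    def __init__(self):
        self.groups = []    # full groups, each a list of R DataNode ids
        self.pending = []   # DataNodes waiting until a full group can be formed

    def register_datanode(self, datanode_id):
        """Called when a DataNode joins; form a new group once R nodes are pending."""
        self.pending.append(datanode_id)
        if len(self.pending) == R:
            self.groups.append(self.pending)
            self.pending = []

    def choose_targets(self, block_id):
        """Place all replicas of a block on one randomly chosen replication group."""
        group = random.choice(self.groups)
        return list(group)

# Usage sketch
mgr = ReplicationGroupManager()
for dn in ["dn-%d" % i for i in range(9)]:
    mgr.register_datanode(dn)
print(mgr.choose_targets("blk_001"))
```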
HDFS Issues • Has the same issues as RAMCloud in managing groups of nodes • Issue: Repair bandwidth • Solution: Hybrid scheme • Issue: Network bottlenecks and load balancing • Solution: Kill the replication group and re-replicate its data elsewhere • Issue: A replication group's capacity is limited by the node with the smallest capacity • Solution: Form replication groups from nodes with similar capacities
Facebook’s HDFS Replication • Facebook constrains the placement of secondary replicas to a group of 10 nodes to prevent data loss • Facebook's Algorithm: • The primary replica is placed on node j in rack k • The secondary replicas are placed on randomly selected nodes among (j+1, …, j+5), on racks (k+1, k+2)
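A sketch of this placement rule; the cluster shape, the wrap-around at rack and node boundaries, and the function name are assumptions for illustration.

```python
import random

NODES_PER_RACK = 20   # assumed cluster shape
NUM_RACKS = 20

def facebook_placement(j, k):
    """Primary on node j of rack k; secondaries on random nodes from the window
    (j+1..j+5), spread over the next two racks (k+1, k+2): 10 candidates in total."""
    primary = (k, j)
    candidate_nodes = [(j + d) % NODES_PER_RACK for d in range(1, 6)]
    candidate_racks = [(k + 1) % NUM_RACKS, (k + 2) % NUM_RACKS]
    secondaries = set()
    while len(secondaries) < 2:
        secondaries.add((random.choice(candidate_racks), random.choice(candidate_nodes)))
    return primary, sorted(secondaries)

print(facebook_placement(j=4, k=7))
```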
Facebook’s Replication
Hybrid MinCopysets • Split nodes into replication groups of 2 and 15 • The first and second replicas are always placed on the group of 2 • The third replica is placed on a randomly chosen node in the group of 15
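A sketch of the hybrid rule; the node ids and the pairing of each 2-node group with one 15-node group are assumptions for illustration.

```python
import random

# Hypothetical layout: each entry pairs a 2-node group with an associated 15-node group
HYBRID_GROUPS = [
    {"pair": [1, 2], "wide": list(range(100, 115))},
    {"pair": [3, 4], "wide": list(range(115, 130))},
]

def place_chunk_hybrid(chunk_id):
    """First two replicas always on the 2-node group; third on a random node of the 15-node group."""
    g = random.choice(HYBRID_GROUPS)
    first, second = g["pair"]
    third = random.choice(g["wide"])
    return {"chunk": chunk_id, "replicas": [first, second, third]}

print(place_chunk_hybrid(1))
```

Constraining the first two replicas to a fixed pair caps the number of distinct copysets, while scattering the third replica over 15 nodes preserves some of random replication's repair bandwidth, which is the issue the hybrid scheme is meant to address.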
Thank You! Stanford University