Distributed Systems

Distributed Systems Tutorial 9 – Windows Azure Storage • Written by Alex Libov • Based on the SOSP 2011 presentation • Winter semester, 2011-2012

Windows Azure Storage (WAS) • A scalable cloud storage system • In production since November 2008 • Used inside Microsoft for applications such as social networking search; serving video, music, and game content; managing medical records; and more • Thousands of customers outside Microsoft • Anyone can sign up over the Internet to use the system

WAS Abstractions • Blobs – File system in the cloud • Tables – Massively scalable structured storage • Queues – Reliable storage and delivery of messages • A common usage pattern: incoming and outgoing data is shipped via Blobs, Queues provide the overall workflow for processing the Blobs, and intermediate service state and final results are kept in Tables or Blobs

Design goals • Highly Available with Strong Consistency • Provide access to data in the face of failures/partitioning • Durability • Replicate data several times within and across data centers • Scalability • Need to scale to exabytes and beyond • Provide a global namespace to access data around the world • Automatically load balance data to meet peak traffic demands

Global Partitioned Namespace • http(s)://AccountName.<service>.core.windows.net/PartitionName/ObjectName • <service> is one of blob, table, or queue • AccountName is the customer-selected account name for accessing storage • The AccountName specifies the data center where the data is stored • An application may use multiple AccountNames to store its data across different locations • PartitionName locates the data once a request reaches the storage cluster • When a PartitionName holds many objects, the ObjectName identifies individual objects within that partition • The system supports atomic transactions across objects with the same PartitionName value • The ObjectName is optional since, for some types of data, the PartitionName uniquely identifies the object within the account
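The naming scheme on this slide can be illustrated by splitting a URL into its parts. This is only a sketch: the names StorageName and parse_storage_url are hypothetical, and the account and object names in the usage note are made up.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit, unquote

HOST_SUFFIX = ".core.windows.net"
SERVICES = {"blob", "table", "queue"}


class StorageName(NamedTuple):
    account: str                 # AccountName
    service: str                 # blob, table, or queue
    partition: str               # PartitionName
    object_name: Optional[str]   # ObjectName, optional per the slide


def parse_storage_url(url: str) -> StorageName:
    """Split http(s)://AccountName.<service>.core.windows.net/PartitionName/ObjectName."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if not host.endswith(HOST_SUFFIX):
        raise ValueError(f"not a storage host: {host!r}")
    labels = host[: -len(HOST_SUFFIX)].split(".")
    if len(labels) != 2 or labels[1] not in SERVICES:
        raise ValueError(f"expected AccountName.<service>, got {host!r}")
    account, service = labels
    # Everything up to the first "/" is the PartitionName; the rest, if any,
    # is the ObjectName.
    partition, _, object_name = parts.path.lstrip("/").partition("/")
    if not partition:
        raise ValueError("missing PartitionName")
    return StorageName(account, service, unquote(partition),
                       unquote(object_name) or None)
```

For example, parse_storage_url("https://myaccount.blob.core.windows.net/photos/cat.jpg") yields account "myaccount", service "blob", partition "photos", and object name "cat.jpg".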

Storage Stamps • A storage stamp is a cluster of N racks of storage nodes. • Each rack is built out as a separate fault domain with redundant networking and power. • Clusters typically range from 10 to 20 racks with 18 disk-heavy storage nodes per rack. • The first generation storage stamps hold approximately 2PB of raw storage each. • The next generation stamps hold up to 30PB of raw storage each.

High Level Architecture • Access blob storage via the URL: http://<account>.blob.core.windows.net/ • [Figure: the Storage Location Service and data-access traffic reach two Storage Stamps, each behind a load balancer (LB); each stamp consists of Front-Ends, a Partition Layer, and a Stream Layer. Inter-stamp (geo) replication is shown at the Partition Layer; intra-stamp replication is shown at the Stream Layer.]

Storage Stamp Architecture – Stream Layer • Append-only distributed file system • All data from the Partition Layer is stored in files (extents) in the Stream Layer • An extent is replicated 3 times across different fault and upgrade domains • Replica locations are selected at random • All stored data is checksummed • Checksums are verified on every client read • Data is re-replicated on disk/node/rack failure or checksum mismatch • [Figure: Stream Layer (Distributed File System) with masters (M) coordinated via Paxos and Extent Nodes (EN).]

Storage Stamp Architecture – Partition Layer • Provides transaction semantics and strong consistency for Blobs, Tables and Queues • Stores and reads the objects to/from extents in the Stream Layer • Provides inter-stamp (geo) replication by shipping logs to other stamps • Scalable object index via partitioning • [Figure: Partition Layer with a Partition Master, a Lock Service, and several Partition Servers.]

Storage Stamp Architecture – Front End Layer • Stateless Servers • Authentication + authorization • Request routing

Storage Stamp Architecture • [Figure: an incoming write request enters at the Front End Layer (FE servers), passes through the Partition Layer (Partition Master, Lock Service, Partition Servers), and reaches the Stream Layer (masters M coordinated via Paxos, Extent Nodes EN); an Ack is returned for the request.]

Partition Layer – Scalable Object Index • 100s of Billions of blobs, entities, messages across all accounts can be stored in a single stamp • Need to efficiently enumerate, query, get, and update them • Traffic pattern can be highly dynamic • Hot objects, peak load, traffic bursts, etc • Need a scalable index for the objects that can • Spread the index across 100s of servers • Dynamically load balance • Dynamically change what servers are serving each part of the index based on load

Scalable Object Index via Partitioning • Partition Layer maintains an internal Object Index Table for each data abstraction • Blob Index: contains all blob objects for all accounts in a stamp • Table Entity Index: contains all table entities for all accounts in a stamp • Queue Message Index: contains all messages for all accounts in a stamp • Scalability is provided for each Object Index • Monitor load to each part of the index to determine hot spots • Index is dynamically split into thousands of Index RangePartitions based on load • Index RangePartitions are automatically load balanced across servers to quickly adapt to changes in load

Partition Layer – Index Range Partitioning • Split the index (e.g. the Blob Index) into RangePartitions based on load • Split at PartitionKey boundaries • The PartitionMap tracks Index RangePartition assignment to partition servers • Front-Ends cache the PartitionMap to route user requests • Each part of the index is assigned to only one Partition Server at a time • [Figure: within a Storage Stamp, the Partition Master holds the Partition Map (A-H: PS1, H’-R: PS2, R’-Z: PS3); the Front-End Server holds a cached copy of the same map; Partition Servers PS 1, PS 2, and PS 3 serve ranges A-H, H’-R, and R’-Z.]
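The routing step a Front-End performs with its cached PartitionMap can be sketched as a sorted-range lookup. This is an illustration only: the PartitionMap class below, its use of inclusive lower-bound keys, and the boundary values are assumptions, not the WAS implementation.

```python
import bisect


class PartitionMap:
    """Maps contiguous PartitionKey ranges to partition servers (sketch)."""

    def __init__(self, assignments):
        # assignments: (inclusive lower-bound key, server name) pairs.
        # Each range runs up to the next lower bound, so every key belongs
        # to exactly one server at a time.
        pairs = sorted(assignments)
        self._lows = [low for low, _ in pairs]
        self._servers = [server for _, server in pairs]

    def lookup(self, partition_key):
        # Find the last range whose lower bound is <= partition_key.
        i = bisect.bisect_right(self._lows, partition_key) - 1
        if i < 0:
            raise KeyError(partition_key)
        return self._servers[i]


# Hypothetical boundaries loosely mirroring the slide's three ranges.
pm = PartitionMap([("", "PS1"), ("h", "PS2"), ("r", "PS3")])
print(pm.lookup("apple"), pm.lookup("harbor"), pm.lookup("zebra"))
```

Splitting a hot range then amounts to adding one more lower bound to the map; the lookup code does not change.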

Partition Layer – RangePartition • A RangePartition uses a Log-Structured Merge-Tree to maintain its persistent data • A RangePartition consists of its own set of streams in the Stream Layer, and these streams belong solely to that RangePartition • Metadata Stream – The root stream for a RangePartition • The Partition Master (PM) assigns a partition to a Partition Server (PS) by providing the name of the RangePartition’s metadata stream • Commit Log Stream – A commit log that stores the recent insert, update, and delete operations applied to the RangePartition since its last checkpoint was generated • Row Data Stream – Stores the checkpoint row data and index for the RangePartition
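The relationship between the commit log and the checkpointed row data can be sketched with a toy log-structured store. Everything here is a simplification under stated assumptions: Python lists and dicts stand in for the streams, the in-memory table and the class name RangePartitionSketch are hypothetical, and merging of checkpoints is left out.

```python
_TOMBSTONE = object()  # marks a deleted row


class RangePartitionSketch:
    def __init__(self):
        self.commit_log = []    # stands in for the Commit Log Stream
        self.memory_table = {}  # changes since the last checkpoint (assumed)
        self.checkpoints = []   # stands in for the Row Data Stream, newest last

    def _apply(self, key, value):
        # Every change is logged first, then applied in memory.
        self.commit_log.append((key, value))
        self.memory_table[key] = value

    def upsert(self, key, value):
        self._apply(key, value)

    def delete(self, key):
        self._apply(key, _TOMBSTONE)

    def checkpoint(self):
        # Persist the in-memory rows; the log now only needs to cover
        # operations that arrive after this point.
        self.checkpoints.append(dict(self.memory_table))
        self.memory_table = {}
        self.commit_log = []

    def get(self, key):
        # Newest data wins: memory table first, then checkpoints newest to oldest.
        for table in (self.memory_table, *reversed(self.checkpoints)):
            if key in table:
                value = table[key]
                return None if value is _TOMBSTONE else value
        return None

    def reload(self):
        # Simulates a partition load: rebuild memory state by replaying the log.
        self.memory_table = {}
        for key, value in self.commit_log:
            self.memory_table[key] = value
```

After a simulated crash that loses memory_table, calling reload() restores exactly the operations logged since the last checkpoint, which is the role the slide assigns to the commit log.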

Stream Layer • Append-Only Distributed File System • Streams are very large files • Has file system like directory namespace • Stream Operations • Open, Close, Delete Streams • Rename Streams • Concatenate Streams together • Append for writing • Random reads

Stream Layer Concepts • Block: minimum unit of write/read; checksummed; up to N bytes (e.g. 4MB) • Extent: unit of replication; a sequence of blocks; has a size limit (e.g. 1GB); sealed or unsealed • Stream: hierarchical namespace; an ordered list of pointers to extents; supports Append/Concatenate • [Figure: stream //foo/myfile.data holds pointers to extents E1, E2, E3, and E4, each a sequence of blocks; E1, E2, and E3 are sealed and E4 is unsealed.]
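The block, extent, and stream concepts can be sketched as three small data structures. The class names, the use of CRC32 as the checksum, and the tiny size limit in the usage note are assumptions for illustration; the slide only states that blocks are checksummed and that extents have a size limit.

```python
import zlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Block:
    data: bytes
    checksum: int

    @staticmethod
    def of(data):
        return Block(data, zlib.crc32(data))

    def verify(self):
        # Recompute the checksum, as a reader would on every read.
        return zlib.crc32(self.data) == self.checksum


@dataclass
class Extent:
    max_bytes: int
    blocks: list = field(default_factory=list)
    sealed: bool = False

    @property
    def size(self):
        return sum(len(b.data) for b in self.blocks)

    def try_append(self, block):
        # Sealed extents are immutable; unsealed ones accept blocks up to the limit.
        if self.sealed or self.size + len(block.data) > self.max_bytes:
            return False
        self.blocks.append(block)
        return True


@dataclass
class Stream:
    name: str
    extent_max_bytes: int
    extents: list = field(default_factory=list)  # ordered list of extents

    def append(self, data):
        """Append one block; returns (extent index, block index)."""
        block = Block.of(data)
        if not self.extents or not self.extents[-1].try_append(block):
            if self.extents:
                self.extents[-1].sealed = True  # only the last extent stays unsealed
            fresh = Extent(self.extent_max_bytes)
            if not fresh.try_append(block):
                raise ValueError("block larger than the extent size limit")
            self.extents.append(fresh)
        return len(self.extents) - 1, len(self.extents[-1].blocks) - 1
```

With Stream("//foo/myfile.data", extent_max_bytes=8), appending b"abcd", b"efgh", and b"ij" fills the first extent with two blocks, seals it, and places the third block in a new unsealed extent.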

Creating an Extent • [Figure: the Partition Layer sends Create Stream/Extent to the Stream Master (SM), which is replicated via Paxos; the SM allocates the extent replica set across EN 1, EN 2, and EN 3 (EN1 Primary; EN2, EN3 Secondary), shown as Primary, Secondary A, and Secondary B.]

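Replication Flow • [Figure: the Partition Layer holds the replica set obtained from the SM (EN1 Primary; EN2, EN3 Secondary) and sends an Append to the Primary (EN 1); the data is replicated to Secondary A (EN 2) and Secondary B (EN 3), and an Ack is returned to the Partition Layer.]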

Providing Bit-wise Identical Replicas • Want all replicas for an extent to be bit-wise the same, up to a committed length • Want to store pointers from the partition layer index to an extent+offset • Want to be able to read from any replica • Replication flow • All appends to an extent go to the Primary • The Primary orders all incoming appends and picks the offset for each append in the extent • The Primary then forwards the offset and data to the secondaries • The Primary performs in-order acks back to clients for extent appends • The Primary returns the offset of the append in the extent • An append at a given offset can be acknowledged (committed) back to the client once all replicas have written that offset and all prior offsets have also been completely written • This represents the committed length of the extent
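The replication flow can be sketched in a few lines. This is a single-threaded simplification with hypothetical class names (Replica, PrimaryExtent): real appends are pipelined and acknowledged asynchronously, whereas here each append is written to every replica before the call returns.

```python
class Replica:
    """One copy of an extent; only accepts writes at its current end."""

    def __init__(self):
        self.data = bytearray()

    def write_at(self, offset, payload):
        if offset != len(self.data):
            raise ValueError("out-of-order write")
        self.data += payload


class PrimaryExtent:
    def __init__(self, secondaries):
        self.local = Replica()
        self.secondaries = list(secondaries)
        self.committed_length = 0

    def append(self, payload):
        # The primary orders appends by choosing the offset itself.
        offset = len(self.local.data)
        self.local.write_at(offset, payload)
        # It forwards the same offset and data to every secondary.
        for secondary in self.secondaries:
            secondary.write_at(offset, payload)
        # All replicas now hold this append and all earlier ones, so the
        # committed length advances and the offset can be returned.
        self.committed_length = offset + len(payload)
        return offset


a, b = Replica(), Replica()
primary = PrimaryExtent([a, b])
first = primary.append(b"hello")
second = primary.append(b"world!")
print(first, second, primary.committed_length)
```

Because every replica applies the same bytes at the same offsets, an extent+offset pointer stored by the partition layer is valid against any of the three copies.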

Dealing with Write Failures • Failure during append • Ack from primary lost when going back to partition layer • Retry from partition layer can cause multiple blocks to be appended (duplicate records) • Unresponsive/Unreachable Extent Node (EN) • Append will not be acked back to partition layer • Seal the failed extent • Allocate a new extent and append immediately • [Figure: stream //foo/myfile.dat holds pointers to extents E1 to E4; after a failed append to E4, a pointer to a new extent E5 is added to the stream.]
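The "seal the failed extent, allocate a new extent and append immediately" path can be sketched as follows. All names (FakeExtent, StreamWriter, ExtentUnavailable, the allocate callback) are hypothetical stand-ins, and an exception models the unacknowledged append.

```python
class ExtentUnavailable(Exception):
    """Raised when an append is not acknowledged (sketch only)."""


class FakeExtent:
    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail        # simulates an unreachable extent node
        self.sealed = False
        self.records = []

    def append(self, record):
        if self.sealed:
            raise RuntimeError("extent is sealed")
        if self.fail:
            raise ExtentUnavailable(self.name)
        self.records.append(record)
        return len(self.records) - 1


class StreamWriter:
    def __init__(self, first_extent, allocate):
        self.extents = [first_extent]
        self.allocate = allocate  # stands in for asking the stream master

    def append(self, record):
        current = self.extents[-1]
        try:
            return current.name, current.append(record)
        except ExtentUnavailable:
            # Seal the failed extent, then retry on a freshly allocated one.
            current.sealed = True
            fresh = self.allocate()
            self.extents.append(fresh)
            return fresh.name, fresh.append(record)


writer = StreamWriter(FakeExtent("E4", fail=True), lambda: FakeExtent("E5"))
print(writer.append(b"row-1"))
```

Note that the retry is what makes duplicate records possible in the lost-ack case: if the first append actually reached the replicas, the same record ends up in both the old and the new extent.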

Extent Sealing (Scenario 1) • [Figure, step 1: following an Append, the Partition Layer sends Seal Extent to the Stream Master (SM, replicated via Paxos); the SM asks the extent nodes for their current length, the lengths reported are 120 and 120, and the extent is sealed at 120.] • [Figure, step 2: an extent node that was not part of the seal later syncs with the SM to the sealed length of 120.]

Extent Sealing (Scenario 2) • [Figure, step 1: following an Append, the Partition Layer sends Seal Extent to the Stream Master (SM, replicated via Paxos); the SM asks the extent nodes for their current length, the lengths reported are 120 and 100, and the extent is sealed at 100.] • [Figure, step 2: an extent node that was not part of the seal later syncs with the SM to the sealed length of 100.]
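The numbers in both scenarios (sealed at 120 when the replies are 120 and 120; sealed at 100 when they are 120 and 100) are consistent with sealing at the smallest length reported by the extent nodes the SM can reach. A minimal sketch of that rule, with a hypothetical function name and None marking an unreachable node:

```python
from typing import Mapping, Optional


def choose_seal_length(reported: Mapping[str, Optional[int]]) -> int:
    """Pick the seal length from the lengths reported by reachable nodes."""
    reachable = [length for length in reported.values() if length is not None]
    if not reachable:
        raise RuntimeError("no replica reachable; cannot seal")
    # Anything acknowledged to the client was written on all replicas, so the
    # smallest reported length never cuts off acknowledged data.
    return min(reachable)


print(choose_seal_length({"EN1": 120, "EN2": 120, "EN3": None}))
print(choose_seal_length({"EN1": 120, "EN2": 100, "EN3": None}))
```

Which node is unreachable in each scenario is an assumption here; the rule itself only depends on the lengths that were actually reported.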

Providing Consistency for Data Streams • For Data Streams (Row and Blob Data Streams), the Partition Layer only reads from offsets returned from successful appends • Such offsets are committed on all replicas • The offset is therefore valid on any replica • Example: under a network partition where the PS can talk to EN3 but the SM cannot, it is still safe for the PS to read from EN3 • [Figure: Partition Server, Stream Masters (SM), and EN 1 (Primary), EN 2 (Secondary A), EN 3 (Secondary B).]

Providing Consistency for Log Streams • Logs (the Commit and Metadata log streams) are used on partition load • Check the commit length first • Only read from an unsealed replica if all replicas have the same commit length, or from a sealed replica • Example: under a network partition where the PS can talk to EN3 but the SM cannot, the PS checks the commit length, the SM seals the extent, and EN1 and EN2 are used for loading • [Figure: Partition Server, Stream Masters (SM), and EN 1 (Primary), EN 2 (Secondary A), EN 3 (Secondary B).]
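The read rule for log streams can be sketched as a small decision function. The names ReplicaState and readable_replicas are hypothetical, and treating an unreachable replica (None) as failing the "all replicas have the same commit length" test is an assumption of this sketch.

```python
from typing import Dict, List, NamedTuple, Optional


class ReplicaState(NamedTuple):
    sealed: bool
    commit_length: int


def readable_replicas(states: Dict[str, Optional[ReplicaState]]) -> List[str]:
    """Return the replicas a partition server may load a log extent from."""
    # A sealed replica is always safe to read.
    sealed = [name for name, s in states.items() if s is not None and s.sealed]
    if sealed:
        return sealed
    # Unsealed replicas are safe only if every replica answered and all
    # report the same commit length.
    if all(s is not None for s in states.values()):
        if len({s.commit_length for s in states.values()}) == 1:
            return list(states)
    # Otherwise the extent has to be sealed before it is read.
    return []


after_seal = {
    "EN1": ReplicaState(sealed=True, commit_length=100),
    "EN2": ReplicaState(sealed=True, commit_length=100),
    "EN3": ReplicaState(sealed=False, commit_length=120),
}
print(readable_replicas(after_seal))
```

An empty result tells the caller that the extent has to be sealed first, after which the sealed replicas become readable.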

Summary • Highly Available Cloud Storage with Strong Consistency • Scalable data abstractions to build your applications • Blobs – Files and large objects • Tables – Massively scalable structured storage • Queues – Reliable delivery of messages • More information at: • http://www.sigops.org/sosp/sosp11/current/2011-Cascais/11-calder-online.pdf
