Explore the requirements, architecture, and read/write operations of distributed file systems such as HDFS and GFS, along with research directions around block popularity, failure handling, and the network properties of a data center.
Outline
• Requirements for a Distributed File System
• HDFS
  • Architecture
  • Read/Write
• Research Directions
  • Popularity
  • Failures
  • Network
Properties of a Data Center
• Servers are built from commodity devices
  • Failure is extremely common
  • Servers only have a limited amount of HDD space
• The network is over-subscribed
  • Bandwidth between servers differs depending on where they sit in the topology
• Applications are demanding
  • High throughput, low latency
• Resources are grouped into failure zones
  • Independent units of failure
Data-Center Architecture (figure: oversubscribed network topology with 10, 25, and 100 Gbps links)
Data-Center Architecture (figure: servers grouped into Failure Domain 1 and Failure Domain 2)
Goals for a Data-Center File System
• Reliable
  • Overcome server failures
• High-performing
  • Provide good performance to applications
• Aware of network disparities
  • Keep data local to the applications
Common Design Principles
• For performance: partition the data
  • Split data into chunks (blocks) and distribute them across servers
  • Provides high throughput: many clients can read the chunks in parallel
  • Better than everyone reading the same single copy of the file
• For reliability: replicate the data
  • Overcome failures by making copies
  • At least one copy should always be online
• For network disparity: rack-aware allocation
  • Read from the closest replica
  • Write to the closest location
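A minimal sketch of these three principles, assuming hypothetical rack/node names and a simplified placement rule (the 128 MB block size and three replicas mirror HDFS defaults, but this is not HDFS code):

```java
import java.util.*;

// Sketch: split a file into fixed-size blocks, then give each block three
// replicas -- two in the writer's rack and one in a different rack.
// Rack/node names and chooseReplicas() are illustrative only.
public class PlacementSketch {
    static final long BLOCK_SIZE = 128L * 1024 * 1024;   // 128 MB, the HDFS default

    // rack name -> nodes in that rack
    static final Map<String, List<String>> RACKS = Map.of(
            "rack1", List.of("node1", "node2", "node3"),
            "rack2", List.of("node4", "node5", "node6"));

    // Number of blocks needed for a file of the given length.
    static long numBlocks(long fileLength) {
        return (fileLength + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Two replicas in the local rack, one in some other rack.
    static List<String> chooseReplicas(String localRack) {
        List<String> local = RACKS.get(localRack);
        String remoteRack = RACKS.keySet().stream()
                .filter(r -> !r.equals(localRack)).findAny().orElseThrow();
        List<String> remote = RACKS.get(remoteRack);
        return List.of(local.get(0), local.get(1),
                       remote.get(new Random().nextInt(remote.size())));
    }

    public static void main(String[] args) {
        long fileLength = 300L * 1024 * 1024;                  // a 300 MB file
        System.out.println(numBlocks(fileLength) + " blocks"); // -> 3
        System.out.println(chooseReplicas("rack1"));           // e.g. [node1, node2, node5]
    }
}
```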
Outline
• Requirements for a Distributed File System
• HDFS
  • Architecture
  • Read/Write
• Research Directions
  • Popularity
  • Failures
  • Network
HDFS Architecture
• Name Node – the master (only one per cluster)
  • All reads/writes go through the master
  • Manages the data nodes and tracks their status
  • Detects failures and triggers re-replication
  • Tracks performance
  • Tracks the location of blocks (block-to-node mapping)
  • Rebalances the cluster
  • Orchestrates reads/writes
• Data Node – a worker (one per server)
  • Stores the blocks
  • Tracks the status of its blocks
  • Ensures the integrity of its blocks
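A toy sketch of the core name-node state implied above, with invented class and field names (real HDFS metadata is far richer):

```java
import java.util.*;

// Toy name-node state: which data nodes hold each block, and when each data
// node last reported in. Names are illustrative, not actual HDFS classes.
class NameNodeState {
    Map<String, Set<String>> blockToNodes = new HashMap<>(); // blockId -> data nodes
    Map<String, Long> lastHeartbeat = new HashMap<>();       // dataNode -> timestamp (ms)

    void addReplica(String blockId, String dataNode) {
        blockToNodes.computeIfAbsent(blockId, k -> new HashSet<>()).add(dataNode);
    }

    // A client asks the name node where a block lives, then reads it directly
    // from one of the returned data nodes.
    Set<String> locate(String blockId) {
        return blockToNodes.getOrDefault(blockId, Set.of());
    }
}
```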
What is a Distributed FS Write?
• HDFS, for high performance:
  • Make N copies of the data being written
  • Default: N = 3
What is a Distributed FS Write?
• HDFS, for fault tolerance:
  • Place replicas in two different fault domains
  • 2 copies in the same rack, 1 in a different rack
What is a Distributed FS Write?
• HDFS, for network awareness:
  • Currently does nothing special – it picks two racks at random
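A minimal client-side write using the standard Hadoop FileSystem API (the path and data are illustrative). The single application-level write below fans out into N replica writes behind the scenes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");   // N = 3, the HDFS default

        FileSystem fs = FileSystem.get(conf);
        // One write from the application; HDFS pipelines it to 3 data nodes.
        try (FSDataOutputStream out = fs.create(new Path("/data/bball/scores.txt"))) {
            out.writeBytes("game results...\n");
        }
    }
}
```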
What is a Distributed FS Read?
• HDFS, for network awareness/performance:
  • Pick the closest copy to read from
• Nothing specific is done for reliability
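A matching read with the same Hadoop API; getFileBlockLocations shows which data nodes hold each block, which is the information used to pick the closest replica (the path is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/data/bball/scores.txt");

        // Where do the replicas of each block live?
        FileStatus status = fs.getFileStatus(p);
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block at offset " + loc.getOffset()
                    + " on " + String.join(",", loc.getHosts()));
        }

        // The client then reads each block from its nearest replica.
        try (FSDataInputStream in = fs.open(p)) {
            byte[] buf = new byte[64];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n));
        }
    }
}
```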
Implications of Read/Write Semantics
• One application write == 3 HDFS writes
  • Writes are costly!
• HDFS is optimized for write-once/read-many workloads
• What is an update/edit? Rewrite the blocks?
  • An update/edit = delete the old data + write the new data
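There is no in-place update; here is a sketch of an "edit" as delete-old-then-write-new, using the same Hadoop API as above (paths are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class HdfsUpdateExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path oldFile = new Path("/data/bball/scores.txt");
        Path newFile = new Path("/data/bball/scores.txt.tmp");

        // "Edit" = write a whole new file (3 more replica writes per block) ...
        try (FSDataOutputStream out = fs.create(newFile)) {
            out.writeBytes("updated game results...\n");
        }
        // ... then delete the old blocks and move the new file into place.
        fs.delete(oldFile, false);      // false = not recursive
        fs.rename(newFile, oldFile);
    }
}
```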
Interesting Challenges
• What happens with more popular blocks? Or less popular blocks?
• What happens during server failures? Can you lose data?
• What happens if you have a better network, with no oversubscription?
Outline
• Requirements for a Distributed File System
• HDFS
  • Architecture
  • Read/Write
• Research Directions
  • Popularity
  • Failures
  • Network
Popularity in HDFS
• Not all files are equivalent
  • E.g., more people search for bball than hockey
• More popular blocks will have more contention
  • This leads to slower performance: searches for bball will be slower
Popularity in HDFS
• # of copies of a block = function(popularity)
  • If 50 people search for bball, make 50 copies
  • If only 3 search for hockey, make 3
• You want as many copies of a block as there are readers
Popularity in HDFS
• As data becomes old, fewer people care about it
  • E.g., last year's weather versus today's weather
• When a block becomes old (older than a week)
  • Reduce the number of copies
  • In Facebook's data centers, old data is kept at only one copy
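A sketch of popularity- and age-driven replication built on the real FileSystem.setReplication call; the reader count, the one-week cutoff, and the chooseReplication policy are invented for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class PopularityAwareReplication {
    static final long ONE_WEEK_MS = 7L * 24 * 60 * 60 * 1000;

    // Illustrative policy: more readers -> more replicas; old data -> one replica.
    static short chooseReplication(int concurrentReaders, long ageMs) {
        if (ageMs > ONE_WEEK_MS) return 1;                           // cold data
        return (short) Math.max(3, Math.min(concurrentReaders, 50)); // hot data
    }

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/data/bball/scores.txt");

        long age = System.currentTimeMillis() - fs.getFileStatus(p).getModificationTime();
        int readers = 50;   // would come from access logs in a real system
        fs.setReplication(p, chooseReplication(readers, age));
    }
}
```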
Failures in the Data Center
• Do servers fail? Yes:
  • Facebook: 1% of servers fail after a reboot
  • Google: at least one server fails every day
• Recovery:
  • A failed node stops sending heartbeats
  • The name node determines which blocks were on the failed node
  • It starts re-replicating those blocks onto healthy data nodes
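A toy sketch of heartbeat-based failure detection on the name node; the timeout value and all class/method names are illustrative, not HDFS internals:

```java
import java.util.*;

// Toy failure detector: a data node that has not sent a heartbeat within
// TIMEOUT_MS is declared dead, and every block it held is re-replicated.
public class FailureDetector {
    static final long TIMEOUT_MS = 10 * 60 * 1000;   // illustrative 10-minute timeout

    Map<String, Long> lastHeartbeat = new HashMap<>();        // dataNode -> last heartbeat (ms)
    Map<String, Set<String>> nodeToBlocks = new HashMap<>();  // dataNode -> blocks it stores

    void onHeartbeat(String dataNode) {
        lastHeartbeat.put(dataNode, System.currentTimeMillis());
    }

    // Called periodically by the name node.
    void checkForFailures() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (now - e.getValue() > TIMEOUT_MS) {
                for (String block : nodeToBlocks.getOrDefault(e.getKey(), Set.of())) {
                    reReplicate(block);   // copy the block from a surviving replica
                }
            }
        }
    }

    void reReplicate(String blockId) {
        System.out.println("re-replicating " + blockId + " onto a healthy data node");
    }
}
```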
Problems with a Locality-Aware DFS
• Ignores contention on the servers
  • I/O contention greatly impacts performance
• Ignores contention in the network
  • Causes similar performance degradation
Types of Network Topologies
• Current networks: uneven bandwidth everywhere
• Future networks: even bandwidth everywhere
(figure: oversubscribed topology with 10/25/100 Gbps links vs. a uniform 100 Gbps topology)
Implications of Network Topologies
• Blocks can be more spread out!
  • No need to keep two replicas within the same rack
  • Same bandwidth everywhere, so no need for locality-aware placement
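A small sketch of how the replica-placement choice changes under a uniform-bandwidth network; the flag and node lists are illustrative:

```java
import java.util.*;

// Illustrative placement policy: with an oversubscribed network, keep two
// replicas in the writer's rack; with uniform bandwidth, any three servers do.
public class TopologyAwarePlacement {
    static List<String> place(boolean uniformBandwidth,
                              List<String> localRack, List<String> remoteRack) {
        if (uniformBandwidth) {
            List<String> all = new ArrayList<>(localRack);
            all.addAll(remoteRack);
            Collections.shuffle(all);
            return all.subList(0, 3);                 // spread anywhere: same bandwidth
        }
        return List.of(localRack.get(0), localRack.get(1),   // 2 local copies
                       remoteRack.get(0));                    // 1 remote copy
    }

    public static void main(String[] args) {
        System.out.println(place(true,
                List.of("node1", "node2", "node3"),
                List.of("node4", "node5", "node6")));
    }
}
```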
Summary
• Properties for a DFS
• Research challenges
  • Popularity
  • Failure
  • Data placement
Not Discussed
• Cluster rebalancing
  • Move blocks around based on utilization
• Data integrity
  • Use checksums to detect whether data has become corrupted
• Staging + the write pipeline
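For the data-integrity point, a minimal sketch of checksum verification using Java's built-in CRC32 (HDFS keeps per-chunk checksums; this stand-alone function is a simplified stand-in):

```java
import java.util.zip.CRC32;

public class ChecksumSketch {
    // Compute a CRC32 checksum over a block's bytes.
    static long checksum(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = "game results...".getBytes();
        long stored = checksum(block);   // recorded when the block is written

        // On read, recompute and compare; a mismatch means this replica is
        // corrupt and the client should fall back to another replica.
        boolean corrupt = checksum(block) != stored;
        System.out.println(corrupt ? "corrupt, read another replica" : "block OK");
    }
}
```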