15-440, Hadoop Distributed File System
Allison Naaktgeboren
• Ur doin' it rong kitteh
• Wut u mean? I iz loadin a HA-doop fileh

Announcements
• Go Vote!
• Interpretive Dances happen only after Lecture
• Office Hour Change
  • Mon: 6:30-9:30
  • Tues: 6-7:30
• Exams are graded

Back to the Map Reduce Model
• Recall that
  • map (in_key, in_value) -> (inter_key, inter_value) list
  • combine (inter_key, inter_value) -> (inter_key, inter_value)
  • reduce (inter_key, inter_value list) -> (out_key, out_value)
• What resource are we most constrained by?
  • “Oceans of Data, Skinny pipes”
• How many types of data will the file system care about?
• How long will we need each kind?
• What is the common case for each?

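As a reminder of how the three signatures fit together, here is a minimal single-process word-count sketch. It is not Hadoop's API: `run_job` and `group_by_key` are hypothetical helpers, and the combiner is assumed to see all values for a key produced by one mapper.

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    # in_key: document name (unused here); in_value: document text
    return [(word, 1) for word in in_value.split()]

def combine_fn(inter_key, inter_values):
    # Local pre-aggregation on the mapper's machine, so less data
    # has to cross the "skinny pipes" during the shuffle
    return (inter_key, sum(inter_values))

def reduce_fn(inter_key, inter_values):
    return (inter_key, sum(inter_values))

def group_by_key(pairs):
    # Hypothetical helper standing in for the shuffle/sort phase
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def run_job(documents):
    combined = []
    for name, text in documents.items():
        local = group_by_key(map_fn(name, text))
        combined.extend(combine_fn(k, vs) for k, vs in local.items())
    shuffled = group_by_key(combined)
    return dict(reduce_fn(k, vs) for k, vs in shuffled.items())

counts = run_job({"a.txt": "big data big pipes", "b.txt": "big data"})
```

Without the combiner, the mapper for `a.txt` would ship two `("big", 1)` pairs; with it, a single `("big", 2)` pair crosses the network.
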
What would a MR Filesystem need?
• General use case: large files
  • Mostly append to end, long sequential reads, few deletes
  • Appends might be concurrent
• Scalability
  • Adding (or losing) machines should be relatively painless
• Nodes work on nearby data
  • Minimize moving data between machines
  • Bandwidth is our limiting resource
  • Remember how much data there is
• Failure (handling) is Common
  • Yea, yea we know, we took 213, we know hardware sucks
  • No, really, failure (handling) is common (constant)
  • Disks, processors, whole nodes, racks, and datacenters

Addressing Those Concerns
• Sequential reads, appends need to be fast
  • Deletes can be painful
• “Hot plug” machines
  • Add or lose machines while system is running jobs
  • System should auto detect the change
• HDFS should distribute data somewhat evenly
  • So that all workers have a reasonable amount of data to chew on
  • And coordinate with the Jobtracker (job master)
• Data Replication
  • Should be spread out. Why?
  • What type of problems could arise?

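One way to see why replicas should be spread out: if all copies of a chunk sit on one rack, a single rack failure loses the chunk. The sketch below is a hypothetical placement policy, not the policy HDFS actually ships. It takes at most one node per rack on each pass and prefers the least-loaded node; `pick_replicas`, `nodes_by_rack`, and `chunks_held` are illustrative names.

```python
def pick_replicas(nodes_by_rack, chunks_held, replicas=3):
    """Choose DataNodes for one chunk, spreading across racks first."""
    chosen = []
    while len(chosen) < replicas:
        progress = False
        # Each pass takes at most one node per rack
        for rack in sorted(nodes_by_rack):
            if len(chosen) == replicas:
                break
            candidates = [n for n in nodes_by_rack[rack] if n not in chosen]
            if candidates:
                # Prefer the node holding the fewest chunks (ties by name)
                best = min(candidates, key=lambda n: (chunks_held.get(n, 0), n))
                chosen.append(best)
                progress = True
        if not progress:
            raise ValueError("not enough DataNodes for the requested replica count")
    return chosen

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3"]}
load = {"dn1": 5, "dn2": 1, "dn3": 2}
placement = pick_replicas(racks, load)
```

With only two racks, the third replica has to share a rack with the first, but losing either rack still leaves at least one copy.
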
Moving into the Details
• Nodes in HDFS
  • NameNode (master) (like GFS Master)
  • DataNodes (slaves) (like GFS chunkservers)
• NB – Hadoop and HDFS closely paired
  • “careful use of jargon defines the true expert”
  • “worker node A” and “data node 1” are frequently the same machine
• Two types of Masters
  • Jobtracker (Hadoop Job Master)
  • NameNode (file system Master)
    • What I mean by 'master' for the rest of the lecture

Your Data goes in ....
• Files are divided into Chunks
  • 64 MB
• The mapping between filename and chunks goes to the Master
• Each chunk is replicated and sent off to DataNodes
  • By default, 3 replicas
• The Master determines which DataNodes

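The arithmetic behind the chunking can be sketched as follows. The 64 MB chunk size and the replica count of 3 come from the slide; the chunk ID format and the `namespace` dictionary are hypothetical stand-ins for the Master's filename-to-chunks mapping.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB, per the slide
REPLICAS = 3                    # default replica count, per the slide

def chunk_ranges(file_size, chunk_size=CHUNK_SIZE):
    """Return (offset, length) for each chunk of a file of file_size bytes."""
    return [(off, min(chunk_size, file_size - off))
            for off in range(0, file_size, chunk_size)]

namespace = {}  # filename -> list of chunk IDs (illustrative ID format)

def register_file(name, file_size):
    ids = [f"{name}#chunk{i}" for i in range(len(chunk_ranges(file_size)))]
    namespace[name] = ids
    return ids

MB = 1024 * 1024
ranges = chunk_ranges(200 * MB)           # three full chunks plus an 8 MB tail
chunk_ids = register_file("logs.txt", 200 * MB)
raw_bytes_stored = 200 * MB * REPLICAS    # every byte is stored three times
```

A 200 MB file therefore becomes four chunks and occupies 600 MB of raw disk across the cluster.
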
What the Clients Do
• Where the data starts
• On file creation, creates a separate file w/ checksum
  • When data is fetched back from a DataNode, checksum is computed again
• Cache file data
  • Avoid bothering the Master too often
• When a Client has 1 chunk's worth of data
  • Contacts the Master
  • Master sends names of DataNodes to send it to
  • Client ONLY sends it to the 1st

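The checksum round trip can be illustrated with a small sketch. It assumes CRC32 (via Python's `zlib`) as the checksum and a dictionary as the storage layer; the `.crc` suffix and the function names are hypothetical, chosen only for this example.

```python
import zlib

def create_file(store, name, data):
    """Store the data plus a separate checksum entry alongside it."""
    store[name] = data
    store[name + ".crc"] = zlib.crc32(data)

def fetch_file(store, name):
    """Return (data, ok); ok is False when the recomputed checksum differs."""
    data = store[name]
    return data, zlib.crc32(data) == store[name + ".crc"]

store = {}
create_file(store, "part-0000", b"oceans of data")
data, ok = fetch_file(store, "part-0000")        # intact copy

store["part-0000"] = b"oceans of dat4"           # simulate a corrupted replica
_, ok_after_corruption = fetch_file(store, "part-0000")
```

When the check fails, a client would typically ask for the chunk from a different replica rather than use the damaged copy.
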
What the DataNodes Do
• Heartbeat to the Master
• Open, close, or replicate a chunk if requested by the Master
• During replication, send data to the next DataNode in the chain

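The replication chain can be sketched in a few lines: the client hands the chunk to the first DataNode only, and each node stores it and forwards it to the next. This is a toy in-memory model under that assumption; the `DataNode` class and its `receive` method are illustrative, not HDFS's wire protocol.

```python
class DataNode:
    def __init__(self, name):
        self.name = name
        self.chunks = {}  # chunk_id -> bytes held by this node

    def receive(self, chunk_id, data, downstream):
        # Store locally, then pass the chunk to the next node in the chain
        self.chunks[chunk_id] = data
        if downstream:
            downstream[0].receive(chunk_id, data, downstream[1:])

# The Master would supply this list; here it is hard-coded
pipeline = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]

# The client talks to the first node only
pipeline[0].receive("chunk-0", b"payload", pipeline[1:])

holders = [node.name for node in pipeline if "chunk-0" in node.chunks]
```

Chaining means the client uploads each chunk once, so its own outbound link carries one copy instead of three.
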
What the Namespace Node (NameNode) Does
• System metadata!
  • Holds Name->ID mapping
  • Chunk replica locations
• Transaction Logs
  • EditLog
  • FSImage
• It is responsible for coherency
  • Uses the logs atomically
  • Addresses the concurrent writes issue
• It is checkpointed
  • Similar to AFS volume snapshots
  • Will pull last consistent log upon restart

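A rough sketch of how a checkpoint image and an edit log work together: every change is logged before it is applied, a checkpoint folds the current state into the image and empties the log, and a restart reloads the image and replays whatever the log still holds. The class, method, and operation names are hypothetical, and everything lives in memory here rather than on disk.

```python
import copy

class NameNodeMetadata:
    def __init__(self):
        self.fsimage = {}   # last checkpoint: filename -> chunk IDs
        self.editlog = []   # operations recorded since that checkpoint
        self.live = {}      # in-memory namespace

    @staticmethod
    def _mutate(state, op, name, chunk_ids):
        if op == "create":
            state[name] = list(chunk_ids)
        elif op == "delete":
            state.pop(name, None)

    def apply(self, op, name, chunk_ids=()):
        self.editlog.append((op, name, tuple(chunk_ids)))  # log first
        self._mutate(self.live, op, name, chunk_ids)       # then apply

    def checkpoint(self):
        self.fsimage = copy.deepcopy(self.live)
        self.editlog.clear()

    def recover(self):
        # Rebuild the namespace from the image plus the logged edits
        state = copy.deepcopy(self.fsimage)
        for op, name, chunk_ids in self.editlog:
            self._mutate(state, op, name, chunk_ids)
        self.live = state
        return state

meta = NameNodeMetadata()
meta.apply("create", "a.txt", ["c1", "c2"])
meta.checkpoint()
meta.apply("create", "b.txt", ["c3"])
meta.apply("delete", "a.txt")
meta.live = {}                 # simulate losing in-memory state in a crash
recovered = meta.recover()
```
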
What the Namespace Node (NameNode) Does
• Listens for Heartbeats
• Listens for Client Requests
• If no heartbeat
  • Marks a node as dead
  • Its data is deregistered
• It selects DataNodes
  • Which nodes get which chunks
  • Signals creating, opening, closing
• Deletes
  • Orders move to /trash
  • Starts delete timer

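The heartbeat bookkeeping can be sketched as below. The 30-second timeout is an arbitrary value picked for the example (the slides give none), timestamps are passed in explicitly so the behavior is deterministic, and `HeartbeatMonitor` is a hypothetical name.

```python
class HeartbeatMonitor:
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}         # node name -> time of last heartbeat
        self.chunk_locations = {}   # chunk_id -> set of node names

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def add_replica(self, chunk_id, node):
        self.chunk_locations.setdefault(chunk_id, set()).add(node)

    def sweep(self, now):
        """Mark silent nodes dead and deregister the chunks they held."""
        dead = sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout_s)
        for node in dead:
            del self.last_seen[node]
            for holders in self.chunk_locations.values():
                holders.discard(node)
        return dead

    def under_replicated(self, target):
        return sorted(c for c, holders in self.chunk_locations.items()
                      if len(holders) < target)

mon = HeartbeatMonitor(timeout_s=30)   # assumed timeout
mon.heartbeat("dn1", now=0)
mon.heartbeat("dn2", now=0)
mon.add_replica("chunk-0", "dn1")
mon.add_replica("chunk-0", "dn2")
mon.heartbeat("dn1", now=20)           # dn2 stays silent
dead = mon.sweep(now=35)
needs_copy = mon.under_replicated(target=2)
```

Chunks that fall below the target count are the ones the master would ask a surviving DataNode to replicate.
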
Additional Resources
• Hadoop wiki
• YouTube → “Hadoop” → Google developer videos (1-3 will be helpful)
• Google University
  • Includes UW course, the other UW course, a couple others
  • Use at your own risk
• “The Google File System” paper is rather readable as research papers go