Decoupling Storage and Computation in Hadoop with SuperDataNodes. George Porter, UC San Diego, La Jolla, CA 92093, gmporter@cs.ucsd.edu. - Sandeep Shiva
Contents: • Introduction • Existing architecture • Advantages, Limitations • SuperDataNode • Advantages, Limitations • Evaluation Results
Introduction: • Rise in data-intensive computing • Data-parallel programming systems • Use of Hadoop is growing: Adobe, AOL, Amazon, Facebook, and others • Fact: Yahoo's Hadoop clusters sorted 1 terabyte of data in 209 seconds!
Advantages of Hadoop: • It processes a portion of the data in parallel on each node, which makes it highly scalable. • It relies on commodity server nodes and networking fabrics, which considerably reduces the cost of deployment.
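The first advantage can be illustrated with a minimal sketch in plain Python. This is not Hadoop code: the function names are hypothetical, and threads stand in for cluster nodes. Each "node" scans only its own slice of the data, and the partial counts are summed at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_nodes):
    """Split records into num_nodes slices, one per (hypothetical) node."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def count_matches(records, word):
    """Per-node work: scan only the local slice for the word."""
    return sum(1 for record in records if word in record)


def parallel_grep_count(records, word, num_nodes=4):
    """Run the per-node scan on every slice in parallel, then combine."""
    slices = partition(records, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(lambda s: count_matches(s, word), slices))
    return sum(partials)  # combine the partial results


lines = ["hadoop sort", "grep job", "hadoop grep", "idle"]
total = parallel_grep_count(lines, "hadoop", num_nodes=2)
print(total)
```

Because each slice is processed independently, adding nodes shrinks the slice each one has to scan, which is the scalability property the slide refers to.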
Limitations: • The ratio of computation to storage might change over time, or might not be known in advance. • When the workload varies, it might be desirable to power down or re-purpose some of the Hadoop nodes for other applications. This is not possible in the existing architecture because the data is spread across all the nodes, and migrating it takes time. • Note: Offloading a single terabyte from a typical disk over a gigabit link takes approximately three hours.
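The three-hour figure in the note can be sanity-checked with a short calculation. The sketch below is an illustration, not taken from the slides: the 90 MB/s sustained rate is an assumed value for a disk-limited transfer, and 1 terabyte is taken as 10^12 bytes.

```python
def transfer_hours(size_bytes, throughput_bytes_per_sec):
    """Time in hours to move size_bytes at a constant throughput."""
    return size_bytes / throughput_bytes_per_sec / 3600


TERABYTE = 10**12                      # decimal terabyte, in bytes
GIGABIT_LINE_RATE = 10**9 / 8          # 1 Gbit/sec expressed in bytes/sec
ASSUMED_SUSTAINED_RATE = 90 * 10**6    # assumption: ~90 MB/s disk-limited rate

best_case = transfer_hours(TERABYTE, GIGABIT_LINE_RATE)
sustained = transfer_hours(TERABYTE, ASSUMED_SUSTAINED_RATE)

print(f"at full line rate: {best_case:.1f} h")
print(f"at assumed sustained rate: {sustained:.1f} h")
```

Even at full line rate the transfer takes more than two hours, so a figure of roughly three hours is plausible once the disk, rather than the link, becomes the bottleneck.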
SuperDataNode: • It is a node with a large number of disks for storage compared to a traditional Hadoop node. • Each VM running on it is assigned its own network interface (if 1 Gbit/sec links are used) or a portion of a network interface (if 10 Gbit/sec links are used), and its own IP address. • Note: An experiment showed that a SuperDataNode reduced the total job execution time of a Sort workload by 17% and of a Grep workload by 54%.
Advantages: • Decouples the amount of storage from the number of worker nodes. • Support for “archival” data – a subset of data with low probability of access. • Increased uniformity for job scheduling and block placement. • Ease of management – workers become stateless; SuperDataNode (SDN) management is similar to that of a regular storage node.
Limitations: • Storage bandwidth between SuperDataNodes and TaskTrackers is a scarce resource. • Effect on fault tolerance • Cost of SuperDataNodes
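To see why the bandwidth point matters, consider a rough sketch of how the SuperDataNode's network capacity divides among the TaskTrackers reading from it. All numbers here are hypothetical, equal sharing is assumed, and disk and switch limits are ignored.

```python
def per_tracker_mbit(aggregate_gbit, active_trackers):
    """Equal share of the SuperDataNode's link capacity, in Mbit/sec."""
    if active_trackers <= 0:
        raise ValueError("need at least one active TaskTracker")
    return aggregate_gbit * 1000 / active_trackers


# Hypothetical example: 10 Gbit/sec of aggregate SuperDataNode bandwidth.
share_few = per_tracker_mbit(10, 5)     # few workers reading at once
share_many = per_tracker_mbit(10, 40)   # many workers reading at once

print(share_few, share_many)
```

Under these assumptions the share per worker falls in direct proportion to the number of workers, whereas in a traditional deployment each node reads mostly from its own local disks.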
Evaluation: • Baseline: 10 Sun Fire X4150 servers running OpenSolaris, each with 8 GB of memory and four 146 GB SAS disk drives. • SuperDataNode: one Sun Fire X4540 server configured with 64 GB of memory and 48 SATA drives of 500 GB each. • Results: Sort took 17% less time; Grep took 54% less time.
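The hardware figures above can be turned into a quick comparison. The sketch below computes raw disk capacity only (it is an assumption that all listed disks hold data, and replication and file-system overhead are ignored) and converts the reported time reductions into equivalent speedup factors.

```python
def raw_capacity_gb(servers, disks_per_server, disk_gb):
    """Raw disk capacity in GB, ignoring replication and overhead."""
    return servers * disks_per_server * disk_gb


def speedup_from_time_reduction(percent_less_time):
    """Convert 'N% less time' into a speedup factor (old time / new time)."""
    return 1 / (1 - percent_less_time / 100)


baseline_gb = raw_capacity_gb(servers=10, disks_per_server=4, disk_gb=146)
sdn_gb = raw_capacity_gb(servers=1, disks_per_server=48, disk_gb=500)

print(baseline_gb, sdn_gb, round(sdn_gb / baseline_gb, 1))
print(round(speedup_from_time_reduction(17), 2))  # Sort
print(round(speedup_from_time_reduction(54), 2))  # Grep
```

By this count the single SuperDataNode holds roughly four times the raw disk capacity of the ten baseline servers combined.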