Hadoop’s Adolescence: An analysis of Hadoop usage in scientific workloads • Authors: Kai Ren, YongChul Kwon, Magdalena Balazinska, Bill Howe • Source: Proceedings of the VLDB Endowment, Vol. 6, Issue 10, pp. 853-864, August 2013 • Citations: 3 (Google Scholar/VLDB) • Keywords: Hadoop, Workload Analysis, User Behavior, Storage, Load Balance • Reporter: Tien-Jing Wang • Date: 2014/3/29
Outline • Introduction • MapReduce • Hadoop Recap • Usage Analysis • Application • User Interaction • Storage • Workload Skew • Conclusion • Comment
Introduction • Growing data sizes • To fry a bigger fish => you need a bigger pan • Scale Up vs. Scale Out • Scale up • One machine • Better CPU, HDD, RAM • Price grows exponentially • Scale out • Cheap CPUs/HDDs/RAM • Many machines working together • How do we distribute a job as tasks across the machines?
MapReduce • MapReduce: Simplified Data Processing on Large Clusters (2004), Dean et al. • The user provides two functions • map (k1,v1) → list(k2,v2) • reduce (k2,list(v2)) → list(v2) • The framework handles the rest (see the sketch below) • Job flow • Job distribution • Fault tolerance
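A minimal sketch of those two user-provided functions, using the canonical word-count example on Hadoop's Java API (org.apache.hadoop.mapreduce); the class names are illustrative, and the two classes are shown in one listing for brevity:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map(k1,v1) -> list(k2,v2): emit (word, 1) for every word in a line
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            ctx.write(word, ONE);  // each call may emit many (k2,v2) pairs
        }
    }
}

// reduce(k2, list(v2)) -> list(v2): sum the counts per word
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        ctx.write(word, new IntWritable(sum));
    }
}
```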
What is Hadoop? • ‘Inspired’ by MapReduce, in development since 2005 • Under the Apache Software Foundation • Now a top-level (high-priority) project • Many projects build on top of Hadoop • Open source (does not cost real money) • Many distributions – because it’s very… primitive • Hortonworks (spun off from Yahoo!) • Cloudera
Hadoop(1.X) Architecture Source: http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/
Hadoop MapReduce Process • Input Split • Split the input for each map task • Map • The map function operates on each input split • Partition • Assign each key/value pair to a reduce node • Sort and Combine • Sort each partition / run the combiner for map-side pre-aggregation • Shuffle • Each reduce node fetches its partition from the map nodes • Merge/Sort • Each reduce node merges the partitions it fetched from the map nodes • Reduce • The reduce function operates on the merged input • Write back • Write the results to HDFS
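These stages map directly onto Hadoop's job-configuration API. A hypothetical driver wiring the word-count classes above into each stage (the Job constructor shown is the Hadoop 0.20/1.x-era one; shuffle, merge, and sort happen inside the framework):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // Input Split source
        job.setMapperClass(WordCountMapper.class);               // Map
        job.setPartitionerClass(HashPartitioner.class);          // Partition (this is the default)
        job.setCombinerClass(WordCountReducer.class);            // Combiner (map-side pre-aggregation)
        job.setReducerClass(WordCountReducer.class);             // Reduce
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // Write back to HDFS

        // Sort, shuffle, and merge all happen inside the framework between Map and Reduce.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```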
Hadoop Clusters in the Study • OpenCloud – the authors’ own Hadoop cluster • 64 nodes • 2.8 GHz quad-core, 16 GB RAM, 10 Gb network, 4×HDD • Hadoop 0.20.1 • M45 – Yahoo!’s cluster • 400 nodes • 2× 1.86 GHz quad-core, 6 GB RAM, 4×HDD • Hadoop 0.18 + Pig • Web Mining – anonymous • 9 nodes • 4× quad-core, 32 GB RAM, 4×HDD
Analysis Approach • Application • Tuning • Resource Usage
Application Workload • Types • What tools do people use with Hadoop? • Structures • How many MapReduce cycles are executed per operation • Management • Pipeline (batch) or interactive
Application Types • What kinds of tools people use with Hadoop • Low-level API • MapReduce – the native Java API • Pipes – C/C++ API • Streaming – any executable as map/reduce • High-level API • Scoobi – Scala on Hadoop • Scala is a functional language that runs on the JVM and can use Java libraries too! • Cascading – a Java abstraction layer on top of Hadoop • Pegasus – large (peta-scale) graph mining • High-level Language • Pig – an interpreted dataflow language on top of Hadoop • Hive – a data warehouse on top of Hadoop, supports queries • Canned MapReduce Tasks • Loadgen – load testing • Mahout – prepackaged data-mining algorithms on Hadoop • Examples – the examples that ship with Hadoop
Application Types by Jobs • Pig, Streaming, and MapReduce are the most popular • Pegasus is active due to class activity
Application Types by Users • (Lack of) education in new tools hampers the adoption of advanced tools in favor of the old, more tedious low-level APIs • Legacy utilities and language familiarity boost Streaming utilization.
Application Types by Distinct Jobs • One-time data checks or long-term analysis?
Jobs vs. Distinct Jobs • High jobs + low distinct jobs = repetitive use • MapReduce is common for both types • Streaming is difficult to use, yet it is used for exploratory analysis • Pig is a surprise: used for repetitive work instead of exploration • Either the tools are being misused or the tools are not optimal
Distinct Jobs: How Many? • Jobs run either once only, or usually 100 times or more • Guidance for optimization.
Application Structures • Chain: if the input of job B is the output of job A, we call it a chain (see the sketch below) • Directed Acyclic Graph (DAG): two or more jobs take their input from the output of another job
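A hypothetical two-stage chain in the Java API: job A writes to an intermediate HDFS directory, and job B reads that directory as its input. The path, job names, and omitted mapper/reducer classes are all illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path intermediate = new Path("/tmp/chain-intermediate");  // hypothetical scratch dir

        Job jobA = new Job(conf, "stage A");                  // e.g., extract/filter
        FileInputFormat.addInputPath(jobA, new Path(args[0]));
        FileOutputFormat.setOutputPath(jobA, intermediate);
        // ... set mapper/reducer classes for job A here ...
        if (!jobA.waitForCompletion(true)) System.exit(1);    // A must finish before B starts

        Job jobB = new Job(conf, "stage B");                  // e.g., aggregate
        FileInputFormat.addInputPath(jobB, intermediate);     // the chain: B reads what A wrote
        FileOutputFormat.setOutputPath(jobB, new Path(args[1]));
        // ... set mapper/reducer classes for job B here ...
        System.exit(jobB.waitForCompletion(true) ? 0 : 1);
    }
}
```

A DAG generalizes this: several downstream jobs add the same intermediate path as one of their inputs.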
Application Management: Pipeline • Roughly 20% of jobs are interactive • Despite Hadoop’s batch nature, many users run Hadoop as an interactive process, which is poorly supported
Tuning • How much do users use the customization features? • A very small percentage of users use them • Custom load balancing • Configuration tuning
Custom Load Balancing: Map • Mahout outperforms due to its custom-tuned, skew-resistant algorithm • BDBT = Balanced Data, Balanced Runtime • U = Unbalanced
Custom Load Balancing: Reduce • BDBT = Balanced Data, Balanced Runtime • U = Unbalanced
Configuration Tuning • Failure parameters – how the program reacts to errors • OpenCloud • Increased the retry threshold: 7 • Ignored bad input: 7 • Tolerated failed tasks: 1 • M45 • Tolerated failed tasks: 3 • Web Mining: all defaults • JVM options – heap/stack size • OpenCloud: 29, M45: 11, Web Mining: 3 • Speculative execution – spare-slot straggler handling • M45: 2 • Sort parameters – 4 parameters • Web Mining: 2 • M45: 1 • HDFS parameters – block size and replication factor • OpenCloud: 11 • M45: 2 • (A configuration sketch follows below)
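For concreteness, a sketch of how these knobs are set programmatically. The values are illustrative, not the paper's measurements, and the property names are the Hadoop 0.20/1.x-era ones (most were renamed under mapreduce.* in later versions):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.SkipBadRecords;

public class TuningExample {
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();

        // Failure parameters: how the program reacts to errors
        conf.setInt("mapred.map.max.attempts", 7);          // raise the per-task retry threshold
        conf.setInt("mapred.max.map.failures.percent", 5);  // tolerate a few failed tasks
        SkipBadRecords.setMapperMaxSkipRecords(conf, 1);    // ignore (skip) bad input records

        // JVM options: heap size for the child task JVMs
        conf.set("mapred.child.java.opts", "-Xmx2048m");

        // Speculative execution: spare-slot straggler handling
        conf.setBoolean("mapred.map.tasks.speculative.execution", true);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);

        // Sort parameters
        conf.setInt("io.sort.mb", 200);      // map-side sort buffer size (MB)
        conf.setInt("io.sort.factor", 50);   // number of streams merged at once

        // HDFS parameters: block size and replication factor
        conf.setLong("dfs.block.size", 128L * 1024 * 1024);  // 128 MB blocks
        conf.setInt("dfs.replication", 3);

        return conf;
    }
}
```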
User Job Customization: Observations • Most users do not tinker with the parameters, resulting in non-optimal performance • Most users only change a parameter when they have no choice (i.e., on error) • Enigmatic and complex parameters dissuade users from changing them • Or maybe the defaults are just good enough
Resource Usage • Do users leverage Hadoop for long-duration jobs over large datasets? • User usage profiles • Data reuse
User Resource Profile • Three metrics • Task time: total map/reduce time spent • Data: data size • Jobs: how many jobs? • 20% of users consume 80-90% of the resources
General Data Characteristics: Map Data Locality Ratio • Large clusters have trouble reading their data from the local machine
General Data Characteristics: Data Size • The dominance of small jobs results in low per-cluster data sizes (10 MB, 100 MB, 8 GB) • The default 64 MB block size is an obstacle for this type of operation
Access Patterns: Access and Re-access • In the large cluster, 10% of the paths receive 90% of the accesses • 90% of file re-accesses happen within one hour • A cache would help performance
Access Patterns: Overwrite Frequency • Within 5 minutes, 50% of OpenCloud jobs overwrite their output files; within one hour, 90% do • Most overwritten space is small, and many reduce outputs have a short life-span; grouping them with the shuffle data could yield better performance and less fragmentation
Access Patterns: Evaluating a Cache • LFU (Least Frequently Used) + a 1-hour sliding window • For a low-access cluster (Web Mining) it might do more harm than good
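The paper does not give the simulator's code; the following is my illustrative sketch of such a policy, a path cache whose eviction frequency counts only the accesses made within the last hour:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch (not the paper's code): LFU eviction where a path's
// "frequency" is the number of accesses inside a 1-hour sliding window.
public class SlidingWindowLfu {
    private static final long WINDOW_MS = 3_600_000L;  // 1-hour window
    private final int capacity;
    private final Map<String, Deque<Long>> accesses = new HashMap<>();

    public SlidingWindowLfu(int capacity) { this.capacity = capacity; }

    /** Records an access to path at time now (ms); returns true on a cache hit. */
    public boolean access(String path, long now) {
        expire(now);
        boolean hit = accesses.containsKey(path);
        accesses.computeIfAbsent(path, p -> new ArrayDeque<>()).addLast(now);
        if (!hit && accesses.size() > capacity) evictLeastFrequent();
        return hit;
    }

    // Drop accesses older than the window; drop paths with none left.
    private void expire(long now) {
        accesses.values().removeIf(d -> {
            while (!d.isEmpty() && now - d.peekFirst() > WINDOW_MS) d.pollFirst();
            return d.isEmpty();
        });
    }

    // Evict the path with the fewest in-window accesses.
    private void evictLeastFrequent() {
        String victim = null;
        int best = Integer.MAX_VALUE;
        for (Map.Entry<String, Deque<Long>> e : accesses.entrySet())
            if (e.getValue().size() < best) { best = e.getValue().size(); victim = e.getKey(); }
        accesses.remove(victim);
    }
}
```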
Managing Skew in Hadoop • Authors: YongChul Kwon, Kai Ren, Magdalena Balazinska, and Bill Howe • Source: IEEE Data Eng. Bull., Vol. 36, No. 1, pp. 24-33 (2013) • Citations: 3 (Google Scholar) • Keywords: Hadoop, Load Balance
The Straggler Problem • An improperly configured node or a hardware malfunction results in far slower performance than usual • A task is a straggler if it is 50% slower than the median task of the same phase and the same job – definition from Ananthanarayanan et al.
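That definition is easy to operationalize; a small sketch (the class and method names are mine) that flags stragglers given per-task durations for one phase of one job:

```java
import java.util.Arrays;

// Sketch of the quoted definition: a task is a straggler if it runs
// at least 50% slower than the median task of the same phase and job.
public class StragglerCheck {
    static boolean[] stragglers(double[] taskSeconds) {
        double[] sorted = taskSeconds.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        double median = (n % 2 == 1)
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
        boolean[] flagged = new boolean[n];
        for (int i = 0; i < n; i++)
            flagged[i] = taskSeconds[i] >= 1.5 * median;  // 50% slower than the median
        return flagged;
    }
}
```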
Prevalence of Stragglers • 25% of jobs have 15% of their tasks as stragglers
How Badly Do Stragglers Slow You Down? • Some maps at the 55-65th percentile have stragglers 2.5× slower • All clusters’ reduces at the 75th percentile have stragglers 2.5× slower • In extreme cases a straggler may be 10-50,000× slower!
Default Partitioning Is Not Effective • Most users rely on the default hash partition function (shown below) • 5% of reduce tasks are empty (no key is assigned to the reduce)
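Hadoop's default HashPartitioner (paraphrased from the Hadoop 1.x source) routes each key by its hash modulo the number of reducers, so every record with the same key lands on the same reducer; a skewed key distribution therefore translates directly into skewed, or even empty, reducers:

```java
// Paraphrase of Hadoop's default partitioner: partition = hash(key) mod #reducers.
// If no key hashes into a given bucket, that reducer receives no input at all.
public class HashPartitioner<K, V> extends org.apache.hadoop.mapreduce.Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```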
In summary: • Imbalanced data distribution results in unbalanced load • Even input does not mean even load • Most users do not balance load manually • The result: common and severe load-balancing issues
Speculative Execution • In other CS fields, speculative execution means predicting and executing an instruction before it is needed • In Hadoop, speculative execution means allocating a spare slot for a task that may be a straggler, just in case • A speculative execution is successful if it started after the original but completed first • A speculative execution is unsuccessful if the original completed first, with three sub-types • Scheduled late: the task started too late (<10%) • No improvement: the task does not improve much (10%) • Uncertain: insufficient data
Speculative Execution: Observation • Only 3-21% of speculative executions are successful
Speculative Execution: Successful Cases • 5%-1000% improvement • Usually due to I/O failure
SkewReduce • Reads information from the user’s input together with cluster data, applies a cost function, and estimates the cost • Even when the cost function is suboptimal, SkewReduce helps improve load balance • Its static nature and manual configuration are tiresome • An ‘intrusive’ load-balancing strategy • It complicates the configuration • More moving parts = more possibilities for bugs
SkewTune • Works as a wrapper around Hadoop • Dynamically monitors for straggler tasks, and actively stops a task and repartitions its remaining work to balance the load on the fly
Conclusion • Studied Hadoop usage in three academic clusters • From all the data we have seen, many features of Hadoop are still underused • Hadoop in academic research is still in its adolescence • Improving its usability will help
Comments • The paper’s clusters aren’t exactly ‘large’ by today’s standards • Hadoop 1.X topped out around 4,000 nodes, even more for 2.X • Facebook had 3,000+ nodes for Hive, and that was in 2011 • Let alone Google’s own implementation • Hadoop’s complex nature and poor interfaces hamper people’s ability to use it to its full potential • Poor documentation (over 9 years in) • Rapid change constantly breaks components • Hackish culture • When one thing breaks, everything breaks; this is why distributions exist • ‘It works’ is a blessing, not a given