Big Data - Computing Kalapriya Kannan IBM Research Labs July, 2013
What is MapReduce? • Simple data-parallel programming model and framework • Designed for scalability and fault tolerance • Pioneered by Google • Processes around 20 petabytes of data per day • Popularized by the open-source Hadoop project • Used at Yahoo!, Facebook, Amazon • MapReduce design goals • Scalability to large data volumes • 1000s of machines, 10,000s of disks • Cost efficiency • Commodity machines (cheap, but unreliable) • Commodity network • Automatic fault tolerance • Easy to use
MapReduce Programming Model • Input data type: file of key-value records • Map: (input key, value) → list of (intermediate key, value) pairs • Reduce: (intermediate key, list of values) → list of (output key, value) pairs
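The programming model can be sketched as a small Python simulation of word count (function names are hypothetical illustrations, not the Hadoop API):

```python
# A minimal simulation of the MapReduce programming model.
# map:    (key, value)           -> list of (intermediate key, value)
# reduce: (intermediate key, [values]) -> list of (output key, value)

def map_wordcount(filename, text):
    """Emit a (word, 1) pair for every word in the record's value."""
    return [(word, 1) for word in text.split()]

def reduce_wordcount(word, counts):
    """Sum all the counts emitted for one word."""
    return [(word, sum(counts))]
```

The user supplies only these two functions; the framework handles splitting the input, routing intermediate keys, and re-running failed tasks.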
Simple Map/Reduce Workflow • Push: input is split into large chunks and placed on the local disks of cluster nodes • Map: chunks are served as map tasks to “mappers” • Prefer a mapper that has the data locally • Mappers save outputs to local disk before serving them to reducers; this allows recovery • Reduce: “reducers” execute reduce tasks once the map phase is complete
Partitioning/Shuffling • Divide the intermediate key space across reducers • k reduce tasks => k partitions (simple hash function) • E.g. k = 3, keys {1, 2}, {3, 4}, {5, 6} • Shuffle/exchange • Since every mapper typically produces pairs for all partitions, this is all-to-all communication • Serial workflow by default • Research groups have explored pipelining
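The hash partitioning and shuffle step above can be simulated in a few lines of Python (integer keys are used so the hash is deterministic; real partitioners hash arbitrary key types):

```python
from collections import defaultdict

def partition(key, k):
    """Simple hash partitioner: maps a key to a reduce task index in [0, k)."""
    return hash(key) % k

def shuffle(map_outputs, k):
    """Route every (key, value) pair from the mappers to its reducer's partition,
    grouping values by key within each partition."""
    partitions = [defaultdict(list) for _ in range(k)]
    for key, value in map_outputs:
        partitions[partition(key, k)][key].append(value)
    return partitions
```

All pairs with the same key land in the same partition, so each reducer sees every value for the keys it owns.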
Combiners • A combiner is a local aggregation function for repeated keys produced by the same map task.
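For word count, a combiner could sum repeated (word, 1) pairs on the mapper's node before they cross the network, a sketch being (hypothetical function name, not the Hadoop API):

```python
from collections import Counter

def combine(map_output):
    """Locally sum repeated (word, count) pairs emitted by one map task,
    shrinking the data sent to reducers."""
    combined = Counter()
    for word, count in map_output:
        combined[word] += count
    return list(combined.items())
```

Since addition is associative and commutative, applying the same function again at the reducer gives the correct global count.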
Fault Tolerance in MapReduce • If a Task crashes: • Retry on another node: • OK for a map because it has no dependencies • OK for a reduce because map outputs are on disk • If a node crashes: • Re-launch its current task on other nodes • Re-run any maps the node previously ran to get output data • If a task is going slowly (straggler): • Launch second copy of task on another node (“speculative execution”)
Example: Inverted Index • Input : (filename, text) records • Output: list of files containing each word
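A Python sketch of this inverted-index job (a local simulation with hypothetical names, following the (filename, text) input format above):

```python
def map_index(filename, text):
    """Emit (word, filename) for each distinct word in one file."""
    return [(word, filename) for word in set(text.split())]

def reduce_index(word, filenames):
    """Collect the sorted, de-duplicated list of files containing the word."""
    return (word, sorted(set(filenames)))
```

The shuffle brings all filenames for one word to the same reducer, which then emits the file list for that word.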
What applications may perform well? • Modest computation relative to data • Data-independent processing of maps • Embarrassingly data-parallel • Data-independent processing of keys • Map output not much larger than the input
What is Hadoop? • MapReduce implementation • Open-source Apache project • Implemented in Java • Primary data analysis platform at Yahoo! • 40,000+ machines running Hadoop
Typical Hadoop Cluster • 40 nodes/rack, 1000–4000 nodes per cluster • 1 Gbps bandwidth within a rack, 8 Gbps out of the rack • Node spec: 8 × 2 GHz cores, 8 GB RAM, 4 disks (≈ 4 TB)
Hadoop Components • HDFS • MapReduce • Batch computation framework • Tasks re-executed on failure • Optimizes for data locality of input
Key MapReduce Terminology and Concepts • A user runs a client program on a client computer • The client program submits a job to Hadoop • The job is sent to the JobTracker process on the master node • Each slave node runs a process called the TaskTracker • The JobTracker instructs TaskTrackers to run and monitor tasks • A task attempt is an instance of a task running on a slave node • There will be at least as many task attempts as there are tasks to be performed
Simple experiment: run a MapReduce program • Example: word count • Built in and easy to understand
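The word count experiment can also be written in the Hadoop Streaming style, where the mapper and reducer exchange tab-separated "word\tcount" lines (shown here as a local simulation; in a real Streaming job each function would be a separate script reading stdin and writing stdout):

```python
from itertools import groupby

def mapper(lines):
    """Streaming-style mapper: emit one 'word\t1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Streaming-style reducer: input is sorted by key, so lines for one word
    are adjacent; sum the counts per word."""
    for word, group in groupby(sorted_lines, key=lambda l: l.split("\t")[0]):
        total = sum(int(l.split("\t")[1]) for l in group)
        yield f"{word}\t{total}"
```

The `sorted(...)` call between mapper and reducer stands in for the framework's shuffle-and-sort phase.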