Big Data - Computing Kalapriya Kannan IBM Research Labs July, 2013
What is MapReduce? • Simple data-parallel programming model and framework • Designed for scalability and fault tolerance • Pioneered by Google • Processes around 20 petabytes of data per day • Popularized by the open-source Hadoop project • Used at Yahoo!, Facebook, Amazon • MapReduce design goals • Scalability to large data volumes • 1000s of machines, 10,000s of disks • Cost efficiency • Commodity machines (cheap, but unreliable) • Commodity network • Automatic fault tolerance • Easy to use
MapReduce Programming Model • Input data type: file of key-value records • Map: (input key, value) → list of (intermediate key, value) pairs • Reduce: (intermediate key, list of values) → list of (output key, value) pairs
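The programming model can be sketched as a small Python simulation of word count (function names are hypothetical illustrations, not the Hadoop API):

```python
# A minimal simulation of the MapReduce programming model.
# map:    (key, value)           -> list of (intermediate key, value)
# reduce: (intermediate key, [values]) -> list of (output key, value)

def map_wordcount(filename, text):
    """Emit a (word, 1) pair for every word in the record's value."""
    return [(word, 1) for word in text.split()]

def reduce_wordcount(word, counts):
    """Sum all the counts emitted for one word."""
    return [(word, sum(counts))]
```

The user supplies only these two functions; the framework handles splitting the input, routing intermediate keys, and re-running failed tasks.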
Simple Map/Reduce Workflow • Push: input is split into large chunks and placed on the local disks of cluster nodes • Map: chunks are served as map tasks to “mappers” • Prefer a mapper that has the data locally • Mappers save outputs to local disk before serving them to reducers; this allows recovery • Reduce: “reducers” execute reduce tasks once the map phase is complete
Partitioning/Shuffling • Divide the intermediate key space across reducers • k reduce tasks => k partitions (simple hash function) • E.g. k = 3, keys {1, 2}, {3, 4}, {5, 6} • Shuffle/exchange • Since every mapper typically produces pairs for all partitions, this is all-to-all communication • Serial workflow by default • Research groups have explored pipelining
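The hash partitioning and shuffle step above can be simulated in a few lines of Python (integer keys are used so the hash is deterministic; real partitioners hash arbitrary key types):

```python
from collections import defaultdict

def partition(key, k):
    """Simple hash partitioner: maps a key to a reduce task index in [0, k)."""
    return hash(key) % k

def shuffle(map_outputs, k):
    """Route every (key, value) pair from the mappers to its reducer's partition,
    grouping values by key within each partition."""
    partitions = [defaultdict(list) for _ in range(k)]
    for key, value in map_outputs:
        partitions[partition(key, k)][key].append(value)
    return partitions
```

All pairs with the same key land in the same partition, so each reducer sees every value for the keys it owns.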
Combiners • A combiner is a local aggregation function for repeated keys produced by the same map task.
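For word count, a combiner could sum repeated (word, 1) pairs on the mapper's node before they cross the network, a sketch being (hypothetical function name, not the Hadoop API):

```python
from collections import Counter

def combine(map_output):
    """Locally sum repeated (word, count) pairs emitted by one map task,
    shrinking the data sent to reducers."""
    combined = Counter()
    for word, count in map_output:
        combined[word] += count
    return list(combined.items())
```

Since addition is associative and commutative, applying the same function again at the reducer gives the correct global count.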
Fault Tolerance in MapReduce • If a Task crashes: • Retry on another node: • OK for a map because it has no dependencies • OK for a reduce because map outputs are on disk • If a node crashes: • Re-launch its current task on other nodes • Re-run any maps the node previously ran to get output data • If a task is going slowly (straggler): • Launch second copy of task on another node (“speculative execution”)
Example: Inverted Index • Input : (filename, text) records • Output: list of files containing each word
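A Python sketch of this inverted-index job (a local simulation with hypothetical names, following the (filename, text) input format above):

```python
def map_index(filename, text):
    """Emit (word, filename) for each distinct word in one file."""
    return [(word, filename) for word in set(text.split())]

def reduce_index(word, filenames):
    """Collect the sorted, de-duplicated list of files containing the word."""
    return (word, sorted(set(filenames)))
```

The shuffle brings all filenames for one word to the same reducer, which then emits the file list for that word.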
What applications may perform well? • Modest computation relative to data • Data-independent processing of maps • Embarrassingly data-parallel • Data-independent processing of keys • Map output not much larger than the input
What is Hadoop? • MapReduce implementation • Open-source Apache project • Implemented in Java • Primary data analysis platform at Yahoo! • 40,000+ machines running Hadoop
Typical Hadoop Cluster • 40 nodes/rack, 1000–4000 nodes per cluster • 1 Gbps bandwidth within a rack, 8 Gbps out of the rack • Node spec: 8 × 2 GHz cores, 8 GB RAM, 4 disks (≈ 4 TB)
Hadoop Components • HDFS • MapReduce • Batch computation framework • Tasks re-executed on failure • Optimizes for data locality of input
Key MapReduce Terminology and Concepts • A user runs a client program on a client computer • The client program submits a job to Hadoop • The job is sent to the JobTracker process on the master node • Each slave node runs a process called the TaskTracker • The JobTracker instructs TaskTrackers to run and monitor tasks • A task attempt is an instance of a task running on a slave node • There will be at least as many task attempts as there are tasks to be performed
Simple experiment: run a MapReduce program • Example: word count • Built in and easy to understand
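The word count experiment can also be written in the Hadoop Streaming style, where the mapper and reducer exchange tab-separated "word\tcount" lines (shown here as a local simulation; in a real Streaming job each function would be a separate script reading stdin and writing stdout):

```python
from itertools import groupby

def mapper(lines):
    """Streaming-style mapper: emit one 'word\t1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Streaming-style reducer: input is sorted by key, so lines for one word
    are adjacent; sum the counts per word."""
    for word, group in groupby(sorted_lines, key=lambda l: l.split("\t")[0]):
        total = sum(int(l.split("\t")[1]) for l in group)
        yield f"{word}\t{total}"
```

The `sorted(...)` call between mapper and reducer stands in for the framework's shuffle-and-sort phase.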