
Big Data - Computing

Big Data - Computing. Kalapriya Kannan, IBM Research Labs, July 2013. Cloud. What is MapReduce? A simple data-parallel programming model and framework, designed for scalability and fault tolerance. Pioneered by Google, which processes around 20 petabytes of data per day.


Presentation Transcript


  1. Big Data - Computing
     Kalapriya Kannan, IBM Research Labs
     July 2013

  2. Cloud

  3. What is MapReduce?
     • Simple data-parallel programming model and framework
     • Designed for scalability and fault tolerance
     • Pioneered by Google
       • Processes around 20 petabytes of data per day
     • Popularized by the open-source Hadoop project
       • Used at Yahoo!, Facebook, Amazon
     • MapReduce design goals
       • Scalability to large data volumes
         • 1000s of machines, 10,000s of disks
       • Cost efficiency
         • Commodity machines (cheap, but unreliable)
         • Commodity network
       • Automatic fault tolerance
       • Ease of use

  4. MapReduce Programming Model
     • Input data type: file of key-value records
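The slide transcript lists only the input type. As a rough sketch of how the model is usually described, a map function turns one input record into zero or more intermediate key-value pairs, and a reduce function turns one intermediate key plus all of its values into output records. The function names, the in-memory `run_mapreduce` driver, and the sample records below are assumptions made for illustration; they are not taken from the slides or from the Hadoop API.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator, List, Tuple

def map_word_count(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """Map: one (filename, line) record -> zero or more (word, 1) pairs."""
    for word in value.lower().split():
        yield word, 1

def reduce_word_count(key: str, values: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """Reduce: one word plus all of its counts -> a single (word, total) pair."""
    yield key, sum(values)

def run_mapreduce(
    records: Iterable[Tuple[str, str]],
    map_fn: Callable,
    reduce_fn: Callable,
) -> List[Tuple[str, int]]:
    """Hypothetical single-process driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for inter_key, inter_value in map_fn(key, value):
            groups[inter_key].append(inter_value)
    output = []
    for inter_key in sorted(groups):  # sorted only to make the output stable
        output.extend(reduce_fn(inter_key, groups[inter_key]))
    return output

# Made-up sample input: (filename, line) records.
records = [("a.txt", "the quick brown fox"), ("b.txt", "the lazy dog the end")]
result = run_mapreduce(records, map_word_count, reduce_word_count)
print(result)
```

A real framework runs the map and reduce calls on many machines; this driver only shows the data flow between the two user-supplied functions.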

  5. Example

  6. Simple Map/Reduce Workflow
     • Push: input is split into large chunks and placed on the local disks of cluster nodes.
     • Map: chunks (map tasks) are served to mappers.
       • Prefer a mapper that has the data locally.
       • Mappers save their outputs to local disk before serving them to reducers; this allows recovery.
     • Reduce: reducers execute reduce tasks once the map phase is complete.

  7. Map Reduce Framework

  8. Partitioning/Shuffling
     • Divide the intermediate key space across reducers
       • k reduce tasks => k partitions (simple hash function)
       • E.g. k = 3, keys {1,2}, {3,4}, {5,6}
     • Shuffle/Exchange
       • Every mapper typically produces all intermediate keys, so the exchange is all-to-all communication
     • Serial workflow by default
       • Research groups have explored pipelining
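The partitioning step above can be sketched as a hash of the intermediate key taken modulo the number of reduce tasks. This is a minimal illustration, not Hadoop's partitioner: the helper names and sample data are hypothetical, and `zlib.crc32` is chosen here only because it gives the same result on every run (Python's built-in `hash` for strings does not). Which keys share a partition depends entirely on the hash function, so the grouping will not necessarily match the slide's {1,2}, {3,4}, {5,6} example.

```python
import zlib
from collections import defaultdict

def partition_for(key: str, num_reducers: int) -> int:
    """Assign a key to one of num_reducers partitions with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(mapper_outputs, num_reducers):
    """All-to-all exchange: every mapper's pairs are routed to the owning reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            partitions[partition_for(key, num_reducers)][key].append(value)
    return partitions

# Made-up outputs from two mappers; both emit the key "apple".
mapper_outputs = [
    [("apple", 1), ("pear", 1)],
    [("apple", 1), ("fig", 1)],
]
partitions = shuffle(mapper_outputs, num_reducers=3)
for index, partition in enumerate(partitions):
    print(index, dict(partition))
```

Because the partition depends only on the key, all values for "apple" end up at the same reducer regardless of which mapper produced them.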

  9. Combiners
     • A combiner is a local aggregation function for repeated keys produced by the same map task.

  10. Word Count with Combiner
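The slide's figure is not in the transcript. As a small stand-in, the sketch below shows the effect of a combiner on word count: pairs with the same key from one map task are summed locally before anything is sent to a reducer. The function names and the sample line are assumptions for illustration. Summing works as a combiner because addition is associative and commutative, so partial sums can be added again at the reducer without changing the result.

```python
from collections import Counter

def map_words(line):
    """Map: emit one (word, 1) pair per word in the line."""
    return [(word, 1) for word in line.lower().split()]

def combine(pairs):
    """Combiner: sum the counts of repeated keys from a single map task."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

line = "to be or not to be"          # made-up input for one map task
raw_pairs = map_words(line)          # one pair per word occurrence
combined_pairs = combine(raw_pairs)  # one pair per distinct word

print(len(raw_pairs), "pairs before combining")
print(len(combined_pairs), "pairs after combining:", combined_pairs)
```

Here the map task would ship four pairs instead of six; on real text with many repeated words the reduction in shuffled data is much larger.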

  11. Fault Tolerance in MapReduce
     • If a task crashes:
       • Retry it on another node
         • OK for a map because it has no dependencies
         • OK for a reduce because map outputs are on disk
     • If a node crashes:
       • Re-launch its current tasks on other nodes
       • Re-run any maps the node previously ran, to regenerate their output data
     • If a task is going slowly (straggler):
       • Launch a second copy of the task on another node (“speculative execution”)
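The "retry on another node" rule can be sketched as a loop that moves a failed task to the next available node. This is a toy, single-process simulation under stated assumptions: the node names, the `max_attempts` limit of 4, and the simulated crash are all hypothetical, and it does not model node-level recovery or speculative execution.

```python
def run_with_retry(task, nodes, max_attempts=4):
    """Try the task on successive nodes until one attempt succeeds.

    Returns (result, number_of_attempts). max_attempts is an illustrative limit.
    """
    attempts = 0
    for node in nodes[:max_attempts]:
        attempts += 1
        try:
            return task(node), attempts
        except RuntimeError:
            continue  # this attempt crashed; retry on the next node
    raise RuntimeError(f"task failed after {attempts} attempts")

def flaky_map_task(node):
    """Simulated map task that crashes on one particular node."""
    if node == "node-1":
        raise RuntimeError("simulated crash on node-1")
    return f"map output computed on {node}"

result, attempts = run_with_retry(flaky_map_task, ["node-1", "node-2", "node-3"])
print(result, "after", attempts, "attempts")
```

Retrying is safe here because the map task depends only on its input chunk, so a second attempt on a different node produces the same output.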

  12. Example: Inverted Index
     • Input: (filename, text) records
     • Output: list of files containing each word

  13. Inverted Index Example
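The slide's figure is missing from the transcript, so here is a minimal sketch of the inverted index described on the previous slide: map emits a (word, filename) pair for each distinct word in a file, and reduce collects the filenames for each word. The function names, the in-memory grouping step, and the sample files are assumptions for illustration.

```python
from collections import defaultdict

def map_inverted(filename, text):
    """Map: (filename, text) -> one (word, filename) pair per distinct word."""
    for word in set(text.lower().split()):
        yield word, filename

def reduce_inverted(word, filenames):
    """Reduce: word plus all filenames -> (word, sorted list of unique files)."""
    return word, sorted(set(filenames))

def build_index(records):
    """Hypothetical driver: group map output by word, then reduce each group."""
    groups = defaultdict(list)
    for filename, text in records:
        for word, name in map_inverted(filename, text):
            groups[word].append(name)
    return dict(reduce_inverted(word, names) for word, names in groups.items())

# Made-up input records.
records = [
    ("hamlet.txt", "to be or not to be"),
    ("12th.txt", "be not afraid of greatness"),
]
index = build_index(records)
for word in sorted(index):
    print(word, index[word])
```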

  14. What applications may perform well?
     • Modest computation relative to data
     • Data-independent processing of maps
     • Embarrassingly data-parallel
     • Data-independent processing of keys
     • Little ballooning of map output relative to input

  15. What is Hadoop?
     • MapReduce implementation
     • Open-source Apache project
     • Implemented in Java
     • Primary data analysis platform at Yahoo!
       • 40,000+ machines running Hadoop

  16. Typical Hadoop Cluster
     • 40 nodes/rack, 1000-4000 nodes in cluster
     • 1 Gbps bandwidth within rack, 8 Gbps out of rack
     • Node spec:
       • 8 × 2 GHz cores, 8 GB RAM, 4 disks (= 4 TB)
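A quick back-of-the-envelope calculation shows what these figures imply for rack count, raw storage, and network oversubscription. It rests on assumptions the slide does not state: that 1 Gbps is each node's link within the rack, that 8 Gbps is the shared uplink for the whole rack, and that 1 PB = 1000 TB. Replication is ignored, so usable capacity would be lower than the raw figure.

```python
NODES_PER_RACK = 40
CLUSTER_NODES = (1000, 4000)   # smallest and largest cluster from the slide
DISK_TB_PER_NODE = 4
IN_RACK_GBPS = 1               # assumed: per-node link speed
OUT_OF_RACK_GBPS = 8           # assumed: shared uplink per rack

# Number of racks at each end of the range.
racks = tuple(n // NODES_PER_RACK for n in CLUSTER_NODES)

# Raw disk capacity in petabytes (1 PB = 1000 TB), before any replication.
raw_storage_pb = tuple(n * DISK_TB_PER_NODE / 1000 for n in CLUSTER_NODES)

# Ratio of total in-rack node bandwidth to the rack's uplink bandwidth.
oversubscription = NODES_PER_RACK * IN_RACK_GBPS / OUT_OF_RACK_GBPS

print("racks:", racks)
print("raw storage (PB):", raw_storage_pb)
print("oversubscription ratio:", oversubscription)
```

Under these assumptions the nodes in a rack can together generate five times more traffic than the uplink carries, which is one reason the scheduler prefers to run map tasks where the data already resides.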

  17. Hadoop Components
     • HDFS
     • MapReduce
       • Batch computation framework
       • Tasks re-executed on failure
       • Optimizes for data locality of input

  18. Key MapReduce Terminology and Concepts
     • A user runs a client program on a client computer.
     • The client program submits a job to Hadoop.
     • The job is sent to the JobTracker process on the master node.
     • Each slave node runs a process called the TaskTracker.
     • The JobTracker instructs the TaskTrackers to run and monitor tasks.
     • A task attempt is an instance of a task running on a slave node.
     • There will be at least as many task attempts as there are tasks to be performed.

  19. Simple experiment to run a map/reduce program
     • Example: word count
       • Built in and easy to understand
