
MapReduce: Simplified Data Processing on Large Clusters


Presentation Transcript


  1. Jonathan Light MapReduce: Simplified Data Processing on Large Clusters

  2. Contents • Abstract • Introduction • Locality • Task Granularity • Backup Tasks • Questions

  3. Abstract: MapReduce • MapReduce processes and generates large data sets • Programs written for MapReduce are automatically parallelized and run on a cluster of computers • Programmers don’t need distributed-systems experience to use it
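
The programming model behind those claims is just a pair of user-supplied functions, map and reduce. Below is a minimal single-process sketch of the classic word-count example in Python rather than Google’s C++ library; the run_mapreduce driver is a hypothetical stand-in for the distributed runtime, shown only to illustrate the interface.

```python
from collections import defaultdict

# User-supplied map function: emit (word, 1) for every word in a document.
def word_count_map(key, value):
    # key: document name, value: document contents
    for word in value.split():
        yield word, 1

# User-supplied reduce function: sum all counts emitted for one word.
def word_count_reduce(key, values):
    # key: a word, values: iterable of partial counts
    yield key, sum(values)

# Hypothetical single-process driver standing in for the distributed runtime.
def run_mapreduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for key, value in inputs:                          # map phase
        for k, v in map_fn(key, value):
            intermediate[k].append(v)
    output = {}
    for key, values in sorted(intermediate.items()):   # reduce phase
        for k, v in reduce_fn(key, values):
            output[k] = v
    return output

docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog jumps over the fox")]
print(run_mapreduce(docs, word_count_map, word_count_reduce))
# {'brown': 1, 'dog': 1, 'fox': 2, 'jumps': 1, 'lazy': 1, 'over': 1, 'quick': 1, 'the': 3}
```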

  4. Abstract: Google’s Implementation • Google uses “a large cluster of commodity machines” • A single job can process many terabytes of data • Nearly 1,000 jobs are run every day

  5. Introduction • Computations have been written to process documents, web logs, and raw data • The computations themselves are simple, but the data sets are large • Google developed an abstraction that lets users run distributed computations without needing to understand distributed computing

  6. Locality • Because network bandwidth is scarce, input data is stored on the local disks of the cluster machines • Files are split into 64MB blocks, and typically three copies of each block are kept on different machines

  7. Locality continued • The master tries to schedule each task on a machine that holds a copy of its input data • Otherwise, it schedules the task on a machine near one that holds the data (e.g., on the same network switch) • As a result, many tasks read their input locally and use no network bandwidth
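
A rough sketch of that scheduling preference, under assumed simplifications: the master knows which machines hold a replica of each input split and which rack each machine is on. The function and data-structure names (pick_machine, replica_locations, rack_of) are illustrative, not taken from the paper.

```python
def pick_machine(split, idle_machines, replica_locations, rack_of):
    """Prefer a machine holding a replica of the split, then a machine in the
    same rack as a replica, then any idle machine (data read over the network)."""
    replicas = set(replica_locations[split])
    # 1. Data-local: an idle machine that already stores the input block.
    for m in idle_machines:
        if m in replicas:
            return m
    # 2. Rack-local: an idle machine behind the same switch as a replica.
    replica_racks = {rack_of[m] for m in replicas}
    for m in idle_machines:
        if rack_of[m] in replica_racks:
            return m
    # 3. Anywhere: fall back to any idle machine.
    return next(iter(idle_machines), None)

idle = ["w3", "w7"]
replicas = {"split-42": ["w1", "w7", "w9"]}
racks = {"w1": "rack1", "w3": "rack2", "w7": "rack3", "w9": "rack1"}
print(pick_machine("split-42", idle, replicas, racks))   # -> 'w7' (data-local)
```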

  8. Task Granularity • The map phase is split into M pieces and the reduce phase into R pieces • M and R should be much larger than the number of worker machines • This improves dynamic load balancing and speeds up recovery when a worker fails

  9. Task Granularity continued • R is usually constrained by users, since each reduce task produces a separate output file • M is chosen so that each map task reads roughly 16MB to 64MB of input data • Google often uses M = 200,000, R = 5,000, and 2,000 worker machines
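
To see how those numbers relate (the 10 TB input size below is an assumed example, not a figure from the paper), choosing M so that each piece covers roughly 64MB amounts to dividing the total input size by the split size:

```python
# Hypothetical job over ~10 TB of input, targeting ~64MB of input per map task.
total_input_bytes = 10 * 1024**4     # assumed job size, not from the paper
split_bytes = 64 * 1024**2           # 64MB per map piece

M = total_input_bytes // split_bytes
print(M)   # 163840 map tasks -- the same order of magnitude as M = 200,000 above
```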

  10. Backup Tasks • “Straggler” machines can greatly lengthen total computation time • Stragglers have many causes, such as a failing disk or competition for CPU, memory, or network bandwidth • Their impact can be alleviated with backup tasks

  11. Backup Tasks continued • When the overall operation is close to completion, the master schedules backup executions of the remaining in-progress tasks • A task is marked complete when either the primary or the backup execution finishes • The mechanism is tuned to add no more than a few percent of extra computation • An example job (sort) takes 44% longer when backup tasks are disabled
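
A minimal sketch of that backup-task policy, assuming a simplified master loop; the launch_backup helper and the task dictionaries are illustrative stand-ins for the master’s real bookkeeping.

```python
def maybe_launch_backups(tasks, completed_fraction, threshold=0.95):
    """Near the end of the job, schedule a backup copy of every task that is
    still in progress; whichever copy finishes first marks the task complete."""
    if completed_fraction < threshold:           # only act when close to done
        return []
    backups = []
    for task in tasks:
        if task["state"] == "in_progress" and not task["has_backup"]:
            task["has_backup"] = True
            backups.append(launch_backup(task))  # illustrative helper, see below
    return backups

def launch_backup(task):
    # Stand-in for re-dispatching the same task to another idle worker.
    return {"task_id": task["id"], "role": "backup"}

tasks = [
    {"id": 1, "state": "done",        "has_backup": False},
    {"id": 2, "state": "in_progress", "has_backup": False},   # a straggler
]
print(maybe_launch_backups(tasks, completed_fraction=0.96))
# [{'task_id': 2, 'role': 'backup'}]
```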

  12. Recap • Computations are easy, data is large • Small network utilization due to data locality and smart scheduling • Map and Reduce tasks are split into pieces • Straggler workers can arise, but these problems can be mitigated

  13. Questions?
