Teaching HDFS/MapReduce Systems Concepts to Undergraduates Linh B. Ngo*, Edward B. Duffy**, Amy W. Apon* * School of Computing, Clemson University ** Clemson Computing and Information Technology, Clemson University
Contents • Introduction and Learning Objectives • Challenges • Hadoop computing platform – Options and Solution • Module Content – Lectures, Assignments, Data • Student Feedback • Module Content – Project • Ongoing and Future Work
Introduction and Learning Objectives • Hadoop/MapReduce is an important current technology in the area of data-intensive computing • Learning objectives: • Understand the challenges of data-intensive computing • Become familiar with the Hadoop Distributed File System (HDFS), the storage layer underlying MapReduce • Understand the MapReduce (MR) programming model • Understand the scalability and performance of MR programs on HDFS
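To make the MR programming model concrete before diving into platform details, the three phases (map, shuffle, reduce) can be sketched as a plain-Python word count. This is an illustrative simulation of the model, not Hadoop code:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs independently for each input split."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does transparently between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the list of values for each key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

The point for students: only `map_phase` and `reduce_phase` are user code; the framework owns the shuffle, which is where the data parallelization happens.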
Challenges • Provide students with a high performance, stable, and robust Hadoop computing platform • Balance lecture and hands-on lab hours • Demonstrate the technical relationship between MapReduce and HDFS
Computing Platform Options • MapReduce parallel programming interface • WebMapReduce is an example • Enables study of the MR programming model at an introductory level • Does not enable the study of HDFS for advanced students • Dedicated shared Hadoop cluster with individual accounts • Multiple student programs compete for resources • One student's errors can affect other students • Dedicated cluster that supports multiple virtual Hadoop clusters • Not supported by Clemson’s supercomputer configuration
Computing Platform Solution • Modification of SDSC’s myHadoop • Individual Hadoop platform deployment for each student in the class • First setup: • Medium amount of editing needed to set up • Numerous errors due to typos or failed configuration • Second setup: • Minimal editing needed (a single line) • Only a few students encountered errors due to typos
Lecture and Hands-on Labs • Fall 2012: 5 class hours • 1 MR lecture, 1 lab, 1 HDFS lecture, 1 lab, 1 lecture on advanced MR optimization • Lab time not sufficient due to problems with the Hadoop computing platforms • Spring 2013: 5 class hours • Lab time still not sufficient, due to errors in modifying myHadoop scripts • Fall 2013: 7 class hours • 1 MR lecture, 2 labs, 1 HDFS lecture, 2 labs, 1 HBase/Hive lecture
Module Content: Lectures • Reused available online material with additional clarification • Slides from Jimmy Lin (University of Maryland) • Strong emphasis on the following points: • The MR programming paradigm is a programming model that handles data parallelization • The HDFS infrastructure provides a robust and resilient way to distribute big data evenly across a cluster • The MR library takes advantage of HDFS and the MR programming paradigm to enable programmers to write applications that conveniently and transparently handle big data • Data locality is the central theme in working with big data
Module Content: Lectures • [Architecture diagram: NameNode and JobTracker master daemons (possibly on the same machine) above a row of worker machines, each with CPU, RAM, and HDD, running both a DataNode and a TaskTracker daemon] • HDFS abstractions: directories and files are stored as HDFS blocks; block metadata lives in the NameNode’s memory • DataNode daemons control block location; at the physical level each block is a file (blk_xxx) in the local Linux file system • DataNodes report block information to the NameNode • The JobTracker provides the NameNode with file/directory paths and receives block-level information in return • The JobTracker assigns work and facilitates map/reduce on the TaskTrackers based on block location information from the NameNode; detailed job progress lives in the JobTracker’s memory • TaskTracker daemons execute tasks on local blocks and report progress to the JobTracker
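The NameNode bookkeeping shown in the diagram can be simulated in a few lines of Python. The block size, replication factor, and round-robin placement below are deliberate simplifications for illustration, not actual HDFS policy (real HDFS at the time used 64 MB blocks and rack-aware placement):

```python
BLOCK_SIZE = 4            # bytes per block; tiny on purpose for the demo
REPLICATION = 3           # HDFS default replication factor
DATANODES = ["dn1", "dn2", "dn3", "dn4"]

def put_file(name, data, namenode):
    """Split data into fixed-size blocks and record replica placement
    in the NameNode's in-memory metadata, mimicking how HDFS tracks
    a file as a list of (block id, replica locations) entries."""
    blocks = []
    for offset in range(0, len(data), BLOCK_SIZE):
        index = offset // BLOCK_SIZE
        block_id = f"blk_{name}_{index}"
        # place replicas on distinct datanodes (simplified round-robin)
        replicas = [DATANODES[(index + r) % len(DATANODES)]
                    for r in range(REPLICATION)]
        blocks.append((block_id, replicas))
    namenode[name] = blocks

namenode = {}                                  # the NameNode's metadata map
put_file("file01", b"hello hdfs world!", namenode)
for block_id, replicas in namenode["file01"]:
    print(block_id, replicas)
```

A scheduler with access to this map can send a task to any of the three machines holding a block's replica, which is exactly the data-locality information the JobTracker requests from the NameNode.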
Module Content: Assignments and Data • Assignments • One MR programming assignment based on existing code that familiarizes students with the MR API and program flow • One MR/HDFS programming assignment that requires students to write an MR program and deploy it on a Hadoop computing platform • Data • Strive to be realistic • Big enough, but not too big • Airline Traffic Data (12 GB), Google Trace (200 GB), Yahoo Music Ratings (10 GB), Movie Ratings (250 MB)
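As a sketch of the kind of program these assignments call for, here is a Hadoop Streaming-style mapper/reducer pair in Python that counts flights per carrier in the airline data. The CSV column index is an assumption made for illustration, not the actual assignment specification, and the demo rows are fabricated:

```python
from itertools import groupby

def mapper(lines, carrier_col=8):
    """Map: emit one tab-separated (carrier, 1) record per flight.
    carrier_col=8 is an illustrative assumption about the CSV layout."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if len(fields) > carrier_col:
            yield f"{fields[carrier_col]}\t1"

def reducer(lines):
    """Reduce: sum counts per carrier. Hadoop Streaming delivers the
    mapper output to the reducer sorted by key, so groupby suffices."""
    parsed = (line.split("\t") for line in lines)
    for carrier, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{carrier}\t{sum(int(v) for _, v in group)}"

# toy rows: eight placeholder fields followed by the carrier code
rows = [",".join(["-"] * 8 + [c]) for c in ["WN", "AA", "WN"]]
print(list(reducer(sorted(mapper(rows)))))  # ['AA\t1', 'WN\t2']
```

The `sorted()` call stands in for the framework's shuffle/sort; in a real Streaming job each script would simply read stdin and write stdout.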
Student Feedback • In-class voluntary surveys encourage all students to participate (compared to out-of-class online surveys) • Surveys conducted with IRB approval • Questions addressing: • Improvements in technical skills • Improvements in understanding of Hadoop/MR • Time taken to complete Hadoop/MR assignments • Time taken to set up Hadoop on Palmetto • Usefulness of guides/lectures/labs • Relevance of Hadoop/MR topics • Appropriate level at which to begin teaching Hadoop/MR
Student Feedback Primary student requests: • Fall 2013 • More labs • More details in the HDFS guide • Spring 2014 • An FAQ addressing common configuration errors and the interpretation of MR compilation errors • More time for projects • Reduced dependency between the two Hadoop/MR assignments
Module Content: Project • Added to the course in Spring 2014, in place of the assignments • Three categories: • Data Analytics • Big data set • Interesting analytic problem relating to the data • Performance Comparison • Big data set • Comparison between Hadoop MapReduce and MPI • System Implementation • Augmenting myHadoop with additional software modules: Spark, HBase, or Hadoop 2.0 • Reports required in IEEE two-column conference format
Module Content: Project • Data Sets: • Airline Traffic Data (12 GB) • NOAA Global Daily Weather Data (15–20 GB) • Amazon Food Reviews (354 MB, hundreds of thousands of entries) • Amazon Movie Reviews (8.7 GB, millions of entries) • MemeTracker (53 GB, text) • Million Song Dataset (196 GB, compressed HDF5) • Google Trace Data (~171 GB)
Module Content: Project • Comparing performance between Hadoop and MapReduce-MPI (Sandia) using Amazon Movie Reviews • Configuration and installation of Hadoop 2.0 on myHadoop • Amazon crawler using an iterative implementation of Hadoop MR • Performance comparison between Hadoop, MPI, and MPI-IO on NOAA data • Performance comparison between Hadoop, MPI, and MPI-IO on Google Trace data
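The iterative-MR pattern behind the crawler project can be sketched as a driver loop in which each round's reduce output seeds the next round's map input, the usual workaround for MapReduce's lack of built-in iteration. The link graph below is a toy stand-in for crawled pages:

```python
# toy link graph standing in for pages discovered by the crawler
LINKS = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

def map_frontier(pages):
    """Map: for each page in the current frontier, emit the pages it links to."""
    for page in pages:
        for linked in LINKS.get(page, []):
            yield linked

def crawl(seed):
    """Driver loop: each MR round expands the frontier by one hop;
    the 'reduce' step deduplicates and drops already-visited pages.
    Iteration stops when a round discovers nothing new."""
    visited, frontier = set(), {seed}
    rounds = 0
    while frontier:
        visited |= frontier
        frontier = set(map_frontier(frontier)) - visited
        rounds += 1
    return visited, rounds

pages, rounds = crawl("A")
print(pages, rounds)
```

In the real project each round is a full Hadoop job, so the number of rounds (here, the graph's depth from the seed) directly determines job-submission overhead, which is one reason the projects compared Hadoop against alternatives.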
Module Content: Project • Positive evaluations • Appropriateness of scope: 8.17/10 • Appropriateness of difficulty: 7.74/10 • Applicability of Hadoop/MR: 8.94/10 • Student feedback • An integral element of the module/course • More time is needed • Start the project earlier in the semester • Fewer assignments, more project work
Ongoing Work • Transition to Hadoop 2.0 • Inclusion of other current distributed and data-intensive technologies: • Spark/Shark for in-memory computing • Cascade/Tez for workflow computing • Swift? • Inclusion of additional real-world data and problems in student projects
Questions? Fall 2012: https://sites.google.com/a/g.clemson.edu/cp-cs-362/ Spring 2013: https://sites.google.com/a/g.clemson.edu/cpsc362-sp2013/ Fall 2013: https://sites.google.com/a/g.clemson.edu/cpsc362-fa2013/ Spring 2014: https://sites.google.com/a/g.clemson.edu/cpsc3620-sp2014/