Poly Hadoop CSC 550 May 22, 2007 Scott Griffin Daniel Jackson Alexander Sideropoulos Anton Snisarenko
Accomplishments • ITS Grid Account • OpenPBS, Java, Subversion, Bash, Perl, Vim • Hadoop on ITS Grid Account • HDFS, Node Configurations • MapReduce Code • Hadoop Running Natively on ITS Grid • Hadoop on VMware Images • Fedora 6, Image & Hadoop Configuration
Grid Properties • All Jobs Queued Through Management Node • qsub <resource_list> script.bsh • The resource list can include physical node assignment, number of processors, allowed execution time, etc. • Script Executes on Only One Physical Node • User Environment Replicated on All Nodes • Shared File System
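The qsub submission above can be sketched as follows. The node count, processors-per-node, and walltime values here are illustrative, not the project's actual settings:

```shell
#!/bin/sh
# Hypothetical sketch of queuing a job through the management node with OpenPBS.
NODES=4
WALLTIME=01:00:00

# -l passes the resource list: node/processor counts and allowed execution time.
QSUB_CMD="qsub -l nodes=${NODES}:ppn=2,walltime=${WALLTIME} script.bsh"

# On the grid this command would queue script.bsh; here we just show the
# assembled command line.
echo "$QSUB_CMD"
```

Note that even with multiple nodes requested, the script body itself executes on only one physical node, as the slide states.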
Hadoop on Grid: Issues & Solutions • Shared File System vs. Local File System • Issues • Single Configuration File Shared by All Hadoop Nodes • Hadoop DataNodes Need “Local” Directories • The File System is Shared • Solution • Create Separate Directories Using Node’s HostName • Supply HostName via Java System Properties • Use Java System Property Expansion in Hadoop Configuration File
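The property-expansion trick might look like the fragment below in hadoop-site.xml. The property name `node.hostname` and the paths are illustrative assumptions; Hadoop expands `${...}` references against Java system properties when it reads the configuration:

```xml
<!-- Sketch only: exact property names and paths are illustrative. -->
<property>
  <name>dfs.data.dir</name>
  <!-- ${node.hostname} is a Java system property supplied at JVM start-up,
       e.g. -Dnode.hostname=`hostname`, so each node gets its own directory
       inside the shared file system. -->
  <value>/shared/hadoop/${node.hostname}/dfs/data</value>
</property>
```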
Hadoop on Grid: Issues & Solutions (cont.) • Pseudo-Dynamic namenode Selection • Issues • Physical Node Assignments Not Guaranteed • Hadoop Configuration File Specifies Nodes to Use • Solution • On-the-Fly Modification of Hadoop Configuration File • Yay for XML! • On-the-Fly Modification of Hadoop masters and slaves Files
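One way the on-the-fly rewrite could work is a simple placeholder substitution once PBS reveals which node was actually assigned. The placeholder name, port, and file names below are illustrative, not the project's actual ones:

```shell
#!/bin/sh
# Hypothetical sketch: substitute the assigned node's hostname into the
# Hadoop configuration at job start-up.
MASTER=${MASTER:-$(hostname)}
CONF=${CONF:-/tmp/hadoop-site-demo.xml}

# A template kept alongside the scripts, with a placeholder for the namenode:
printf '<value>hdfs://__NAMENODE__:9000/</value>\n' > "$CONF.template"

# Fill in the node PBS actually gave us:
sed "s|__NAMENODE__|$MASTER|g" "$CONF.template" > "$CONF"
```

The project's update_sitexml.pl presumably does a richer version of this in Perl against the full XML file.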
Hadoop on Grid Scripts • run_createdirs.sh • Creates dirs for each physical node • update_sitexml.pl • Dynamically updates hadoop-site.xml • run_real_test.sh • Formats HDFS • Starts job management and DFS • Puts dataset on DFS • Runs MapReduce jobs • Exports output • Stops MapReduce and DFS
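The run_createdirs.sh idea can be sketched as below: carve out a per-node "local" directory tree inside the shared file system, keyed on hostname. The base path and subdirectory layout are illustrative assumptions:

```shell
#!/bin/sh
# Hypothetical sketch of run_createdirs.sh: give every physical node its own
# "local" directories inside the shared file system, named after the node.
BASE=${1:-/tmp/hadoop-grid-demo}   # shared base directory (illustrative path)
NODE=$(hostname)

mkdir -p "$BASE/$NODE/dfs/data" "$BASE/$NODE/mapred/local"
echo "$BASE/$NODE"
```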
MapReduce Progress • Pushing Dataset Onto Hadoop FS • Simple Command Done in qsub Script • MapReduce Java Code • Selecting Number Of Jobs • Map Jobs = 10 Per Node • Reduce Jobs = 2 Per Node
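The qsub-script steps above might look like this sketch. The 10-maps and 2-reduces-per-node figures come from the slide; the paths and node count are illustrative:

```shell
#!/bin/sh
# Hypothetical sketch: push the dataset onto HDFS, then size the job from the
# number of nodes using the per-node figures on the slide.
NODES=${1:-4}
MAPS=$((NODES * 10))
REDUCES=$((NODES * 2))

# On the grid, this line in the qsub script would copy the dataset into HDFS:
#   bin/hadoop dfs -put /shared/netflix-data input
echo "maps=$MAPS reduces=$REDUCES"
```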
Map Code

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.log4j.Logger;

public class UserRatingMapper extends MapReduceBase implements Mapper {
    // Netflix rating lines look like "userId,rating,yyyy-mm-dd"
    private static Pattern userRatingDate =
        Pattern.compile("^(\\d+),(\\d+),\\d{4}-\\d{2}-\\d{2}$");
    private Logger log = Logger.getLogger(this.getClass());

    public void map(WritableComparable key, Writable values,
                    OutputCollector output, Reporter reporter) throws IOException {
        String line = ((Text) values).toString();
        Matcher userRating = userRatingDate.matcher(line);
        IntWritable userId = new IntWritable();
        IntWritable rating = new IntWritable();
        if (line.matches("^\\d+:$")) {
            // movie-id header line (e.g. "12345:"); nothing to emit
        } else if (userRating.matches()) {
            userId.set(Integer.parseInt(userRating.group(1)));
            rating.set(Integer.parseInt(userRating.group(2)));
            output.collect(userId, rating);
        } else {
            log.error("Unexpected input: " + line);
        }
    }
}
Reduce Code

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class AverageValueReducer extends MapReduceBase implements Reducer {
    public void reduce(WritableComparable key, Iterator values,
                       OutputCollector output, Reporter reporter) throws IOException {
        int sum = 0, count = 0;
        while (values.hasNext()) {
            sum += ((IntWritable) values.next()).get();
            ++count;
        }
        // emit the user's average rating
        output.collect(key, new FloatWritable(((float) sum) / count));
    }
}
VMware Image Progress • Set up a Fedora Core 6 VM image • Configured the image to always create a new key when moved • Turned the firewall off on the image • Installed Hadoop and configured it • Master, slaves, HDFS namespace, output directories, format HDFS • Successfully started the HDFS and MapReduce with a master and 1 slave • Ran a test job with 99 input files
VMware Setup on Grid • Need multiple copies of images on the grid • Namenode/JobTracker image (1 copy) • Datanode/TaskTracker images (many copies) • Different MAC address for each copy • Starting up Hadoop • Start each image copy on separate blades • Obtain image IPs from the DHCP server and place them in the config files for each image • Start the HDFS and MapReduce from the master
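The "place image IPs in config files" step could be sketched as below, with the first IP becoming the master and the rest the slaves. The IP addresses and config path here are examples only, not values from the project:

```shell
#!/bin/sh
# Hypothetical sketch: turn the DHCP-assigned image IPs into Hadoop's
# masters and slaves files.
IPS="192.168.10.1 192.168.10.2 192.168.10.3"
CONF=${1:-/tmp/hadoop-conf-demo}
mkdir -p "$CONF"

set -- $IPS
echo "$1" > "$CONF/masters"     # first IP: namenode/JobTracker image
shift
printf '%s\n' "$@" > "$CONF/slaves"   # remaining IPs: datanode/TaskTracker images
```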
VMware Issues • Issues • Slaves would not connect to the master • Master would not start after formatting the HDFS • Need root access to install VMPlayer on the grid • Images too big / not enough HD space • Solutions • Turn off the firewall • Delete all the files from the namespace dir and then format the HDFS • E-mail the admin • Reduce the virtual hard drive size on the image
Evaluation Techniques • Processing time between the different configurations • Optimizations that can be made • Number of Map tasks vs. Reduce tasks per node • Explanation of preliminary data • Overhead with redundancy on grid • We’re all set up and ready to start our experiments • as soon as jkempena gives us our nodes back
Timeline • Week 5-6 • Install/Configure Environment • Develop Code • Week 7-8 • Run Experiments • Week 9-10 • Analyze Data • Write Paper • Present results