Poly Hadoop CSC 550 May 22, 2007 Scott Griffin Daniel Jackson Alexander Sideropoulos Anton Snisarenko
Accomplishments • ITS Grid Account • OpenPBS, Java, Subversion, Bash, Perl, Vim • Hadoop on ITS Grid Account • HDFS, Node Configurations • MapReduce Code • Hadoop Running Natively on ITS Grid • Hadoop on VMware Images • Fedora 6, Image & Hadoop Configuration
Grid Properties • All Jobs Queued Through Management Node • qsub <resource_list> script.bsh • The resource list can include physical node assignment, number of processors, allowed execution time, etc. • Script Executes on Only One Physical Node • User Environment Replicated on All Nodes • Shared File System
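The qsub submission above can be sketched as follows. The node count, processors-per-node, and walltime values here are illustrative, not the project's actual settings:

```shell
#!/bin/sh
# Hypothetical sketch of queuing a job through the management node with OpenPBS.
NODES=4
WALLTIME=01:00:00

# -l passes the resource list: node/processor counts and allowed execution time.
QSUB_CMD="qsub -l nodes=${NODES}:ppn=2,walltime=${WALLTIME} script.bsh"

# On the grid this command would queue script.bsh; here we just show the
# assembled command line.
echo "$QSUB_CMD"
```

Note that even with multiple nodes requested, the script body itself executes on only one physical node, as the slide states.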
Hadoop on Grid: Issues & Solutions • Shared File System vs. Local File System • Issues • Single Configuration File Shared by All Hadoop Nodes • Hadoop DataNodes Need “Local” Directories • The File System is Shared • Solution • Create Separate Directories Using Node’s HostName • Supply HostName via Java System Properties • Use Java System Property Expansion in Hadoop Configuration File
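The property-expansion trick might look like the fragment below in hadoop-site.xml. The property name `node.hostname` and the paths are illustrative assumptions; Hadoop expands `${...}` references against Java system properties when it reads the configuration:

```xml
<!-- Sketch only: exact property names and paths are illustrative. -->
<property>
  <name>dfs.data.dir</name>
  <!-- ${node.hostname} is a Java system property supplied at JVM start-up,
       e.g. -Dnode.hostname=`hostname`, so each node gets its own directory
       inside the shared file system. -->
  <value>/shared/hadoop/${node.hostname}/dfs/data</value>
</property>
```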
Hadoop on Grid: Issues & Solutions (cont.) • Pseudo-Dynamic namenode Selection • Issues • Physical Node Assignments Not Guaranteed • Hadoop Configuration File Specifies Nodes to Use • Solution • On-the-Fly Modification of Hadoop Configuration File • Yay for XML! • On-the-Fly Modification of Hadoop masters and slaves Files
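One way the on-the-fly rewrite could work is a simple placeholder substitution once PBS reveals which node was actually assigned. The placeholder name, port, and file names below are illustrative, not the project's actual ones:

```shell
#!/bin/sh
# Hypothetical sketch: substitute the assigned node's hostname into the
# Hadoop configuration at job start-up.
MASTER=${MASTER:-$(hostname)}
CONF=${CONF:-/tmp/hadoop-site-demo.xml}

# A template kept alongside the scripts, with a placeholder for the namenode:
printf '<value>hdfs://__NAMENODE__:9000/</value>\n' > "$CONF.template"

# Fill in the node PBS actually gave us:
sed "s|__NAMENODE__|$MASTER|g" "$CONF.template" > "$CONF"
```

The project's update_sitexml.pl presumably does a richer version of this in Perl against the full XML file.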
Hadoop on Grid Scripts • run_createdirs.sh • Creates dirs for each physical node • update_sitexml.pl • Dynamically updates hadoop-site.xml • run_real_test.sh • Formats HDFS • Starts job management and DFS • Puts dataset on DFS • Runs MapReduce jobs • Exports output • Stops MapReduce and DFS
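The run_createdirs.sh idea can be sketched as below: carve out a per-node "local" directory tree inside the shared file system, keyed on hostname. The base path and subdirectory layout are illustrative assumptions:

```shell
#!/bin/sh
# Hypothetical sketch of run_createdirs.sh: give every physical node its own
# "local" directories inside the shared file system, named after the node.
BASE=${1:-/tmp/hadoop-grid-demo}   # shared base directory (illustrative path)
NODE=$(hostname)

mkdir -p "$BASE/$NODE/dfs/data" "$BASE/$NODE/mapred/local"
echo "$BASE/$NODE"
```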
MapReduce Progress • Pushing Dataset Onto Hadoop FS • Simple Command Done in qsub Script • MapReduce Java Code • Selecting Number Of Jobs • Map Jobs = 10 Per Node • Reduce Jobs = 2 Per Node
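The qsub-script steps above might look like this sketch. The 10-maps and 2-reduces-per-node figures come from the slide; the paths and node count are illustrative:

```shell
#!/bin/sh
# Hypothetical sketch: push the dataset onto HDFS, then size the job from the
# number of nodes using the per-node figures on the slide.
NODES=${1:-4}
MAPS=$((NODES * 10))
REDUCES=$((NODES * 2))

# On the grid, this line in the qsub script would copy the dataset into HDFS:
#   bin/hadoop dfs -put /shared/netflix-data input
echo "maps=$MAPS reduces=$REDUCES"
```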
Map Code

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.log4j.Logger;

public class UserRatingMapper extends MapReduceBase implements Mapper {
    // Netflix rating lines look like "userId,rating,yyyy-mm-dd"
    private static Pattern userRatingDate =
        Pattern.compile("^(\\d+),(\\d+),\\d{4}-\\d{2}-\\d{2}$");
    private Logger log = Logger.getLogger(this.getClass());

    public void map(WritableComparable key, Writable values,
                    OutputCollector output, Reporter reporter) throws IOException {
        String line = ((Text) values).toString();
        Matcher userRating = userRatingDate.matcher(line);
        IntWritable userId = new IntWritable();
        IntWritable rating = new IntWritable();
        if (line.matches("^\\d+:$")) {
            // movie-id header line (e.g. "12345:"); nothing to emit
        } else if (userRating.matches()) {
            userId.set(Integer.parseInt(userRating.group(1)));
            rating.set(Integer.parseInt(userRating.group(2)));
            output.collect(userId, rating);
        } else {
            log.error("Unexpected input: " + line);
        }
    }
}
Reduce Code

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class AverageValueReducer extends MapReduceBase implements Reducer {
    public void reduce(WritableComparable key, Iterator values,
                       OutputCollector output, Reporter reporter) throws IOException {
        int sum = 0, count = 0;
        while (values.hasNext()) {
            sum += ((IntWritable) values.next()).get();
            ++count;
        }
        // emit the user's average rating
        output.collect(key, new FloatWritable(((float) sum) / count));
    }
}
VMware Image Progress • Set up a Fedora Core 6 VM image • Configured the image to always create a new key when moved • Turned the firewall off on the image • Installed Hadoop and configured it • Master, slaves, HDFS namespace, output directories, format HDFS • Successfully started the HDFS and MapReduce with a master and 1 slave • Ran a test job with 99 input files
VMware Setup on Grid • Need multiple copies of images on the grid • Namenode/JobTracker image (1 copy) • Datanode/TaskTracker images (many copies) • Different MAC address for each copy • Starting up Hadoop • Start each image copy on separate blades • Obtain image IPs from the DHCP server and place them in the config files for each image • Start the HDFS and MapReduce from the master
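The "place image IPs in config files" step could be sketched as below, with the first IP becoming the master and the rest the slaves. The IP addresses and config path here are examples only, not values from the project:

```shell
#!/bin/sh
# Hypothetical sketch: turn the DHCP-assigned image IPs into Hadoop's
# masters and slaves files.
IPS="192.168.10.1 192.168.10.2 192.168.10.3"
CONF=${1:-/tmp/hadoop-conf-demo}
mkdir -p "$CONF"

set -- $IPS
echo "$1" > "$CONF/masters"     # first IP: namenode/JobTracker image
shift
printf '%s\n' "$@" > "$CONF/slaves"   # remaining IPs: datanode/TaskTracker images
```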
VMware Issues • Issues • Slaves would not connect to the master • Master would not start after formatting the HDFS • Need root access to install VMPlayer on the grid • Images too big / not enough HD space • Solutions • Turn off the firewall • Delete all the files from the namespace dir and then format the HDFS • E-mail the admin • Reduce the virtual hard drive size on the image
Evaluation Techniques • Processing time between the different configurations • Optimizations that can be made • Number of Map tasks vs. Reduce tasks per node • Explanation of preliminary data • Overhead with redundancy on grid • We’re all set up and ready to start our experiments • as soon as jkempena gives us our nodes back
Timeline • Week 5-6 • Install/Configure Environment • Develop Code • Week 7-8 • Run Experiments • Week 9-10 • Analyze Data • Write Paper • Present results