Coursework II: Google MapReduce in GridSAM

Coursework II: Google MapReduce in GridSAM Steve Crouch s.crouch@software.ac.uk, stc@ecs School of Electronics and Computer Science

Contents • Introduction to Google’s MapReduce • Applications of MapReduce • The coursework • Extending a basic MapReduce framework provided in pseudocode • Coursework deadline: 27th March 4pm • Handin via ECS Coursework Handin System

Google MapReduce MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, Google Inc., OSDI 2004. http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/mapreduce-osdi04.pdf

Google’s Need for a Distributed Programming Model and Infrastructure • Google implement many computations over a lot of data • Input: e.g. crawled documents, web request logs, etc. • Output: e.g. inverted indices, web document graphs, pages crawled per host, frequent per-day queries, etc. • Input usually very large (> 1TB) • Computations need to be distributed for timeliness of results • Want to do this in an easy, but scalable and robust way; provide a programming model (with a suitable abstraction) for the distributed processing aspects • Realised many computations follow a map / reduce approach • map operation applied to a set of logical input “records” to generate intermediate key/value pairs • reduce operation applied to all intermediate values sharing same key to combine data in a useful way • Used as basis for rewrite of their production indexing system!

History of MapReduce – Inspired by Functional Programming! • Functional operations only create new data structures and do not alter existing ones • Order of operations does not matter • Emphasis on data flow • e.g. higher-order functions in Lisp (the definitions below are written in ML syntax) • map() – applies a function to each value in a sequence:
    fun map f [ ] = [ ]
      | map f (x::xs) = (f x) :: (map f xs)
• reduce() – combines all elements of a sequence using a binary operator:
    fun reduce f c [ ] = c
      | reduce f c (x::xs) = f x (reduce f c xs)
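For readers more at home in the Java used for the coursework stubs, the map and reduce definitions above can be sketched as follows. This is only an illustration: the class name HigherOrder is hypothetical, and the code assumes Java 8 or later (for lambdas), which is newer than the Java 7 the slides recommend.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical illustration class; not part of the coursework materials.
public class HigherOrder {
    // map f [x1, ..., xn] = [f x1, ..., f xn]
    // Builds a new list and leaves the input list untouched.
    public static <A, B> List<B> map(Function<A, B> f, List<A> xs) {
        List<B> out = new ArrayList<>();
        for (A x : xs) {
            out.add(f.apply(x));
        }
        return out;
    }

    // reduce f c [x1, ..., xn] = f x1 (f x2 (... (f xn c)))
    // A right fold, matching the recursive definition: start from c and
    // combine elements from the end of the list towards the front.
    public static <A, B> B reduce(BiFunction<A, B, B> f, B c, List<A> xs) {
        B acc = c;
        for (int i = xs.size() - 1; i >= 0; i--) {
            acc = f.apply(xs.get(i), acc);
        }
        return acc;
    }

    public static void main(String[] args) {
        List<Integer> squares = map(x -> x * x, Arrays.asList(1, 2, 3, 4));
        int sum = reduce((x, acc) -> x + acc, 0, squares);
        System.out.println(squares + " sums to " + sum); // prints "[1, 4, 9, 16] sums to 30"
    }
}
```

Each call to f inside map depends only on its own element, which is the property the following slides exploit for distribution.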

Looking at map and reduce Another Way… • map(): • Delegates or distributes the computation for each piece of data to a given function, creating a new set of data • Each computation cannot see the effects of the other computations • The order of computation is irrelevant • reduce() takes this created data and reduces it to something we want • map() moves left to right over the list, applying the given function… can this be exploited in distributed computing?

Applying the Programming Model to the Data • Source: Distributed Computing Seminar: Lecture 2: MapReduce Theory and Implementation, Christophe Bisciglia, Aaron Kimball & Sierra Michels-Slettvet, Summer 2007.

For Example…
• Counting the number of occurrences of each word in a large collection of documents:
    map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v);
        Emit(AsString(result));
• map outputs each word plus occurrence count
• reduce sums together all counts emitted for each word
• Worked example:
    doc1, "Hello world" → map() → (Hello, 1), (world, 1)
    doc2, "Hello there" → map() → (Hello, 1), (there, 1)
    reduce() → 2 (Hello), 1 (world), 1 (there)
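The word-count pseudocode above can be tried out on a single machine with a small Java sketch. This is not the coursework framework: the class WordCountSketch and its run() driver are hypothetical stand-ins for the parts a real MapReduce framework would distribute, and the code assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration of the word-count example.
public class WordCountSketch {
    // map: emit an intermediate (word, "1") pair for every word in the document.
    public static List<String[]> map(String docName, String contents) {
        List<String[]> pairs = new ArrayList<>();
        for (String w : contents.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                pairs.add(new String[] {w, "1"});
            }
        }
        return pairs;
    }

    // reduce: sum all the counts emitted for one word.
    public static String reduce(String word, List<String> counts) {
        int result = 0;
        for (String v : counts) {
            result += Integer.parseInt(v);
        }
        return Integer.toString(result);
    }

    // Stand-in for the framework: run map over every document, group the
    // intermediate values by key, then run reduce once per key.
    public static Map<String, String> run(Map<String, String> docs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String[] kv : map(doc.getKey(), doc.getValue())) {
                grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
            }
        }
        Map<String, String> output = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            output.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("doc1", "Hello world");
        docs.put("doc2", "Hello there");
        System.out.println(run(docs)); // prints "{Hello=2, there=1, world=1}"
    }
}
```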

How it Works in Practice
1. User program: splits work into M 64MB pieces; program starts up across compute nodes as either Master or Worker (with exactly 1 Master)
2. Master assigns M map tasks and R reduce tasks to idle workers (either one map or one reduce task each)
3. A map Worker: parses key/value pairs out of its input; passes each key/value to the map function; buffers intermediate keys/values in memory
4. Periodically, the map Worker writes intermediate key/value pairs to disk, informing the Master of their locations, who forwards these to the reduce Workers
5/6. When notified of locations by the Master, a reduce Worker remotely reads in the data, sorts and groups the data by key, and passes it to the reduce function; results are appended to an output file
7. When all maps and reduces are done, the Master wakes up the user program, which resumes
Source: "MapReduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat, OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.
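Steps 4 to 6 hinge on getting every intermediate pair for a given key to the same reduce Worker, then sorting and grouping by key. A minimal Java sketch of that idea follows. It assumes the hash(key) mod R partitioning that the Dean and Ghemawat paper gives as its example partitioning function (the slide itself does not say how keys are assigned); the class and method names are hypothetical, everything runs in one process, and Java 8 or later is assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration of partitioning and grouping.
public class ShuffleSketch {
    // Pick which of the r reduce tasks handles a key: hash(key) mod r.
    // floorMod keeps the result in [0, r) even for negative hash codes.
    public static int partition(String key, int r) {
        return Math.floorMod(key.hashCode(), r);
    }

    // Map side (step 4): split the intermediate pairs into r buckets,
    // one bucket per reduce task.
    public static List<List<String[]>> partitionPairs(List<String[]> pairs, int r) {
        List<List<String[]>> buckets = new ArrayList<>();
        for (int i = 0; i < r; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String[] kv : pairs) {
            buckets.get(partition(kv[0], r)).add(kv);
        }
        return buckets;
    }

    // Reduce side (steps 5/6): sort by key and group all values per key.
    // A TreeMap keeps the keys in sorted order.
    public static Map<String, List<String>> sortAndGroup(List<String[]> bucket) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] kv : bucket) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String[]> intermediate = Arrays.asList(
                new String[] {"Hello", "1"}, new String[] {"world", "1"},
                new String[] {"Hello", "1"}, new String[] {"there", "1"});
        int r = 2; // number of reduce tasks, chosen arbitrarily here
        List<List<String[]>> buckets = partitionPairs(intermediate, r);
        for (int i = 0; i < r; i++) {
            System.out.println("reduce task " + i + " receives " + sortAndGroup(buckets.get(i)));
        }
    }
}
```

Because the bucket depends only on the key, both (Hello, 1) pairs land at the same reduce task regardless of which map Worker produced them.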

Coursework: Part II

Learning Objectives: • To develop a general architectural and operational understanding of typical production-level grid software. • To develop the programming skills required to drive typical services on a production-level grid.

Tasks • Download and install the GridSAM server and client • (a) Extend some Java code stubs (which use the GridSAM Java API) to submit and monitor jobs to GridSAM • (b) Extend some pseudocode that describes a basic MapReduce framework for performing word counting on a number of files

Coursework: Part II – Installing GridSAM

Pre-Requisites • Client and Server: Linux only (e.g. SuSE 9.0, RedHat, Debian, Ubuntu) • May work on other Linux distributions, but these have not been exhaustively tested • Tested on undergrad Linux boxes • Requires Java JDK 6 (not JRE) or above • Beware: • Firewalls blocking port 8080 and your FTP port between client and server – add exceptions • VPNs can cause problems with staging data to/from GridSAM

Preparation/Installation • Java 7 recommended • Note: you may need to upgrade your Java • Ensure JAVA_HOME is set • Install client… • Download gridsam-2.3.0-client.zip from coursework page • unzip gridsam-2.3.0-client.zip (into a file path that contains no spaces) • cd gridsam-2.3.0-client • java SetupGridSAM • Install server (Linux only)… • Can just reuse your Apache Tomcat 5.5.28/6.0.32 from mgrid (see mgrid install slides) • Download gridsam.war from coursework page • Shut down Tomcat, copy gridsam.war into apache-tomcat-6.0.32/webapps, and restart Tomcat • If any problems occur, check the log files in apache-tomcat-6.0.32/webapps/gridsam/WEB-INF/logs

Coursework Materials • Download COMP3019-materials.tgz from coursework page • Copy to gridsam-2.3.0-client directory • Unpack; you’ll find some GridSAMExample* files • Run ./GridSAMExampleCompile to check compilation • Code not complete; that’s the coursework! • GridSAMExampleRun won’t work until you have done the coursework • Note server.domain and port in script – you need to change these to point at your server (use HTTP not HTTPS!) • Use the scripts and Java code as a basis • Refer to API docs on coursework page as required • To obtain job status, use e.g.: jobStage = jobManager.findJobInstance(jobID).getLastKnownStage().getState().toString(); • Calling job.getLastKnownStage().getState().toString() directly won’t work

The Coursework • See the coursework handout on the COMP3019 page: • http://www.ecs.soton.ac.uk/~stc/COMP3019 • Notes for Part 1: • Your m-grid applet accepts only a single string as its argument • Consider how you pass the two necessary arguments (i.e. a character and a text file) as a single argument into the applet • To load the text file into your applet, package it into the jar file along with the code, and use the following in the applet: • InputStream in = getClass().getResourceAsStream("textfile.txt"); • Notes for Part 2 (GridSAM): • If you encounter problems using the GridSAM FTP server, some students have found success using a StupidFTP server (available under Ubuntu) • When you want to check the status of a job use e.g. jobStage = jobManager.findJobInstance(jobID).getLastKnownStage().getState().toString(); • Calling job.getLastKnownStage().getState().toString() directly won’t work

Coursework: Part II – Running a Command Line Example

Example using File Staging • Objectives: submit a simple job with data input and output requirements and monitor progress • Diagram: the OMII GridSAM Client submits JSDL to the OMII GridSAM Server and monitors the job; the OMII GridSAM FTP Server provides 2 input files and receives 1 output file

JSDL Example • Gridsam-2.3.0/examples/remotecat-staging.jsdl • Change ftp URLs to match your ftp server (e.g. ftp://anonymous:anonymous@localhost:55521/concat.sh):
<JobDescription>
  <JobIdentification> … </JobIdentification>
  <Application>
    <POSIXApplication xmlns="http://schemas.ggf.org/jsdl/2005/06/jsdl-posix">
      <Executable>bin/concat</Executable>
      <Argument>dir2/subdir1/file2.txt</Argument>
      <Output>stdout.txt</Output>
      <Error>stderr.txt</Error>
      <Environment name="FIRST_INPUT">dir1/file1.txt</Environment>
    </POSIXApplication>
  </Application>
  …

JSDL Example
  <DataStaging>
    <FileName>bin/concat</FileName>
    <CreationFlag>overwrite</CreationFlag>
    <Source> <URI>ftp://ftp.do:55521/concat.sh</URI> </Source>
  </DataStaging>
  <DataStaging>
    <FileName>dir1/file1.txt</FileName>
    <CreationFlag>overwrite</CreationFlag>
    <Source> <URI>ftp://ftp.do:55521/input1.txt</URI> </Source>
  </DataStaging>
  <DataStaging>
    <FileName>dir2/subdir1/file2.txt</FileName>
    <CreationFlag>overwrite</CreationFlag>
    <Source> <URI>ftp://ftp.do:55521/input2.txt</URI> </Source>
  </DataStaging>
  <DataStaging>
    <FileName>stdout.txt</FileName>
    <CreationFlag>overwrite</CreationFlag>
    <DeleteOnTermination>true</DeleteOnTermination>
    <Target> <URI>ftp://ftp.do:55521/stdout.txt</URI> </Target>
  </DataStaging>
</JobDescription>
</JobDefinition>

Set up the GridSAM Client’s FTP Server • To allow GridSAM to retrieve input and store output • In gridsam-2.3.0-client directory:
> ./gridsam.sh GridSAMFTPServer -p 55521 -d examples/
2010-04-29 08:20:59,250 WARN [GridSAMFTPServer] (main:) ../data/examples/ is exposed through FTP at ftp://anonymous@152.78.237.90:55521/
2010-04-29 08:20:59,268 WARN [GridSAMFTPServer] (main:) Please make sure you understand the security implication of using anonymous FTP for file staging.
FtpServer.server.config.root.dir = ../data/examples/
FtpServer.server.config.data = /home/omii/COMP3019/omii-uk-client/gridsam/ftp/ftp1215306750
FtpServer.server.config.port = 55521
FtpServer.server.config.self.host = 152.78.237.90
Started FTP
• Exposes the examples directory through FTP on port 55521 (anonymous access!) • Create input1.txt and input2.txt in this directory with some text in them

CLI Example: Submit to GridSAM Server • Ensure Java is on your path • In gridsam-2.3.0-client directory, submit to the GridSAM server:
> ./gridsam.sh GridSAMSubmit -s "http://localhost:8080/gridsam/services/gridsam?wsdl" -j examples/remotecat-staging.jsdl
• A unique job ID is returned, i.e. UID is urn:gridsam:<characters>

CLI Example: Monitoring the Job • Monitor job until completion:
> ./gridsam.sh GridSAMStatus -s "http://localhost:8080/gridsam/services/gridsam?wsdl" -j <unique_job_id>
• <unique_job_id> is the entire urn:gridsam:<characters> string • Job progress is indicated by the current state: • Pending, Staging-in, Staged-in, Active, Executed, Staging-out, Staged-out, Done • When complete, output resides in the stdout.txt file in the examples/ directory
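The same monitor-until-completion idea carries over to the Java part of the coursework. The sketch below shows only the polling loop and deliberately avoids the GridSAM API: the class JobPollSketch and the Supplier<String> status source are hypothetical, and in real code the supplier would wrap a status lookup such as the findJobInstance(...) call given in the coursework notes. Only the Done state from the list above is treated as terminal; a real client would also need to stop on any failure states, which the slide does not list. Java 8 or later is assumed.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Supplier;

// Hypothetical polling loop; does not use the GridSAM API.
public class JobPollSketch {
    // Poll the status source until it reports "Done" (case-insensitive)
    // or maxPolls is reached. Returns the last state seen.
    public static String waitForDone(Supplier<String> statusSource, long pollMillis, int maxPolls) {
        String state = null;
        for (int i = 0; i < maxPolls; i++) {
            state = statusSource.get();
            System.out.println("Job state: " + state);
            if ("done".equalsIgnoreCase(state)) {
                return state;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop polling early.
                Thread.currentThread().interrupt();
                return state;
            }
        }
        return state;
    }

    public static void main(String[] args) {
        // Fake status source that replays the states listed on the slide.
        Iterator<String> fakeStates = Arrays.asList(
                "Pending", "Staging-in", "Staged-in", "Active",
                "Executed", "Staging-out", "Staged-out", "Done").iterator();
        String last = waitForDone(fakeStates::next, 10L, 20);
        System.out.println("Finished in state: " + last);
    }
}
```

Bounding the loop with maxPolls is a design choice for the sketch: it stops a stuck job from keeping the client polling forever.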

What to Hand In • Submit: source code, results files, parameter files and output • Other parts that require written answers should form a separate document: • In text, Microsoft Word or PDF • Up to 800 words in length, not including any source or trace output • Submission via ECS Coursework Handin system: Single Zip file: source, results, parameter files, output & written answers
