Pattern Programming
Barry Wilkinson, University of North Carolina Charlotte
Computer Science Colloquium, University of North Carolina at Greensboro
September 11, 2012
Acknowledgment This work was initiated by Jeremy Villalobos and described in his PhD thesis: “Running Parallel Applications on a Heterogeneous Environment with Accessible Development Practices and Automatic Scalability,” UNC-Charlotte, 2011. Jeremy developed the so-called “Seeds” pattern programming software described here.
Problem Addressed • To make parallel programming more usable and scalable. • Parallel programming -- writing programs that use multiple computers and processors collectively to solve problems -- has a long history but remains a challenge. • Traditional approach: • Explicitly specified message passing (MPI), and • Low-level threads APIs (Pthreads, Java threads, OpenMP, …). • A better-structured approach is needed.
Pattern Programming Concept The programmer begins by constructing the program from established computational or algorithmic “patterns” that provide a structure. What patterns are we talking about? • Low-level algorithmic patterns that might be embedded into a program, such as fork-join and broadcast/scatter/gather. • Higher-level algorithmic patterns for forming a complete program, such as workpool, pipeline, stencil, and map-reduce. • We concentrate on the higher-level “computational/algorithmic” patterns rather than the lower-level patterns.
Some Patterns: Workpool [Diagram: a master connected by two-way connections to a set of workers; legend shows compute nodes and a source/sink. Derived from Jeremy Villalobos’s PhD thesis defense.]
Pipeline [Diagram: workers arranged as Stage 1 → Stage 2 → Stage 3 with one-way connections between stages and a two-way connection to the master; legend shows compute nodes and a source/sink.]
Divide and Conquer [Diagram: a tree of compute nodes with two-way connections, dividing the problem on the way down and merging results on the way up; legend shows compute nodes and a source/sink.]
All-to-All [Diagram: compute nodes with two-way connections between every pair; legend shows compute nodes and a source/sink.]
Stencil [Diagram: a grid of compute nodes with two-way connections to their neighbors; legend shows compute nodes and a source/sink.] On each iteration, each node communicates with its neighbors to get their stored computed values. Usually a synchronous computation: it performs a number of iterations to converge on the solution, e.g. for solving Laplace’s/the heat equation.
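To make the per-node computation concrete, here is a minimal sequential Java sketch (not Seeds code; the grid size, iteration count, and zero boundary are illustrative assumptions) of the Laplace-equation update, in which each interior point is repeatedly replaced by the average of its four neighbors:

public class StencilSketch {
    public static void main(String[] args) {
        final int N = 8, ITERATIONS = 100;   // illustrative sizes
        double[][] grid = new double[N][N];  // boundary values assumed pre-set
        double[][] next = new double[N][N];
        for (int iter = 0; iter < ITERATIONS; iter++) {
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < N - 1; j++)
                    next[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j]
                                       + grid[i][j-1] + grid[i][j+1]);
            double[][] tmp = grid; grid = next; next = tmp;  // swap buffers
        }
        System.out.println(grid[N/2][N/2]);
    }
}

In the parallel stencil pattern, each compute node holds part of the grid and exchanges its edge values with neighboring nodes on every iteration.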
Note on Terminology: “Skeletons” The term “skeleton” is sometimes used instead of “pattern”, especially for directed acyclic graphs with a source, a computation, and a sink, a distinction that is made elsewhere. We do not make that distinction and use the term “pattern” whether the graph is directed or undirected, and whether it is acyclic or cyclic.
Design Patterns • “Design patterns” have been part of software engineering for many years • Reusable solutions to commonly occurring problems * • Patterns provide a guide to “best practices”, not a final implementation • Provide a good, scalable design structure for parallel programs • Make it easier to reason about programs Pattern programming takes this concept further and applies it to parallel programming. * http://en.wikipedia.org/wiki/Design_pattern_(computer_science)
Using Parallel Patterns • Advantages • Abstracts/hides the underlying computing environment • Generally avoids deadlocks and race conditions • Reduces source code size (lines of code) • Allows hierarchical designs, with patterns embedded into patterns and pattern operators to combine patterns • Leads to automated conversion into parallel programs without the need to write low-level message-passing routines such as MPI • Disadvantages • A new approach to learn • Takes away some freedom from the programmer • Performance is reduced slightly (cf. using high-level languages instead of assembly language)
Previous/Existing Work • Patterns/skeletons have been explored in several projects. • Universities: • University of Illinois at Urbana-Champaign and University of California, Berkeley • University of Torino / Università di Pisa, Italy • Industrial efforts: • Intel • Microsoft
Universal Parallel Computing Research Centers (UPCRC) University of Illinois at Urbana-Champaign and University of California, Berkeley, with Microsoft and Intel, in 2008 (with combined funding of at least $35 million). Co-developed OPL (Our Pattern Language). A group of twelve computational patterns was identified, in seven general application areas, including: • Finite State Machines • Circuits • Graph Algorithms • Structured Grid • Dense Matrix • Sparse Matrix
Intel Focuses on very low-level patterns such as fork-join, and provides constructs for them in: • Intel Threading Building Blocks (TBB) • Template library for C++ to support parallelism • Intel Cilk Plus • Compiler extensions for C/C++ to support parallelism • Intel Array Building Blocks (ArBB) • Pure C++ library-based solution for vector parallelism These are somewhat competing tools, obtained through takeovers of small companies, and each is implemented differently.
New book (2012) from Intel authors: “Structured Parallel Programming: Patterns for Efficient Computation,” Michael McCool, James Reinders, Arch Robison, Morgan Kaufmann, 2012. Focuses entirely on Intel tools.
Using patterns with Microsoft C# http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=19222 Again very low-level with patterns such as parallel for loops.
University of Torino / Università di Pisa, Italy Closest to our work: http://calvados.di.unipi.it/dokuwiki/doku.php?id=ffnamespace:about
Our approach (Jeremy Villalobos’ UNC-C PhD thesis) Focuses on a few patterns of wide applicability: • Workpool • Synchronous all-to-all • Pipeline • Stencil and a few others, but Jeremy took it much further: he developed a higher-level framework called “Seeds” that automatically distributes code across processor cores, computers, or geographically distributed computers and executes the parallel code according to the pattern.
“Seeds” Parallel Grid Application Framework • Some Key Features • Pattern-programming (Java) user interface • Self-deploys on computers, clusters, and geographically distributed computers • Load balances • Three levels of user interface http://coit-grid01.uncc.edu/seeds/
Seeds Development Layers • Basic • Intended for programmers who have a basic parallel computing background • Based on skeletons and patterns • Advanced: used to add or extend functionality, such as: • Create new patterns • Optimize existing patterns, or • Adapt an existing pattern to non-functional requirements specific to the application • Expert: used to provide basic services: • Deployment • Security • Communication/connectivity • Changes in the environment Derived from Jeremy Villalobos’s PhD thesis defense
Deployment • Several different approaches were implemented during the PhD work, including using the Globus grid computing software • Deployment with SSH is now preferred
Basic User Programmer Interface • The programmer selects a pattern and implements three principal Java methods: • Diffuse method – to distribute pieces of data • Compute method – the actual computation • Gather method – to gather the results • The programmer also fills in details in a “bootstrap” class to deploy and start the framework. The framework then self-deploys on a geographically distributed or local platform and executes the pattern.
Example: Deploy a workpool pattern to compute π using the Monte Carlo method Monte Carlo π calculation • The basis of Monte Carlo calculations is the use of random selections • In this case, a circle is formed within a square • Points within the square are chosen randomly • The fraction of points within the circle = π/4 • Only one quadrant is used in the code
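Before the Seeds version, the underlying estimate as a minimal sequential Java sketch (not Seeds code; the sample count is an arbitrary choice): points are chosen at random in the unit quadrant, and the fraction landing inside the quarter circle approaches π/4.

import java.util.Random;

public class SequentialPiSketch {
    public static void main(String[] args) {
        final long SAMPLES = 3000000L;       // arbitrary sample count
        Random r = new Random();
        long inside = 0;
        for (long i = 0; i < SAMPLES; i++) {
            double x = r.nextDouble();       // random point in the unit quadrant
            double y = r.nextDouble();
            if (x * x + y * y <= 1.0)        // inside quarter circle of radius 1
                inside++;
        }
        System.out.println("pi ~ " + 4.0 * inside / SAMPLES);
    }
}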
Complete code for the Monte Carlo π computation (note: no message passing, MPI etc.):

package edu.uncc.grid.example.workpool;

import java.util.Random;
import java.util.logging.Level;
import edu.uncc.grid.pgaf.datamodules.Data;
import edu.uncc.grid.pgaf.datamodules.DataMap;
import edu.uncc.grid.pgaf.interfaces.basic.Workpool;
import edu.uncc.grid.pgaf.p2p.Node;

public class MonteCarloPiModule extends Workpool {
    private static final long serialVersionUID = 1L;
    private static final int DoubleDataSize = 1000;
    double total;
    int random_samples;
    Random R;

    public MonteCarloPiModule() {
        R = new Random();
    }

    public void initializeModule(String[] args) {
        total = 0;
        Node.getLog().setLevel(Level.WARNING);
        random_samples = 3000;
    }

    public Data DiffuseData(int segment) {
        DataMap<String, Object> d = new DataMap<String, Object>();
        d.put("seed", R.nextLong());
        return d;
    }

    public Data Compute(Data data) {
        DataMap<String, Object> input = (DataMap<String, Object>) data;
        DataMap<String, Object> output = new DataMap<String, Object>();
        Long seed = (Long) input.get("seed");
        Random r = new Random();
        r.setSeed(seed);
        Long inside = 0L;
        for (int i = 0; i < DoubleDataSize; i++) {
            double x = r.nextDouble();
            double y = r.nextDouble();
            double dist = x * x + y * y;
            if (dist <= 1.0) {
                ++inside;
            }
        }
        output.put("inside", inside);
        return output;
    }

    public void GatherData(int segment, Data dat) {
        DataMap<String, Object> out = (DataMap<String, Object>) dat;
        Long inside = (Long) out.get("inside");
        total += inside;
    }

    public double getPi() {
        double pi = (total / (random_samples * DoubleDataSize)) * 4;
        return pi;
    }

    public int getDataCount() {
        return random_samples;
    }
}
Seeds Workpool: DiffuseData, Compute, and GatherData Methods [Diagram: the master’s DiffuseData creates a DataMap d and returns d for each slave; each slave’s Compute receives the Data argument data, creates a DataMap output, and returns it; the master’s GatherData receives the Data argument dat and accumulates the answer in the private variable total.] Note: DiffuseData, Compute, and GatherData start with a capital letter, although Java method names conventionally should not.
Data and DataMap classes • For implementation convenience there are two classes: • The Data class is used to pass data between the master and slaves. It uses a “segment” number to keep track of packets as they go from one method to another. • The DataMap class is used inside the compute method. DataMap is a subclass of Data, which allows the casting. DataMap methods: • put(String, data) – puts data into the DataMap, identified by the string • get(String) – gets the stored data identified by the string In the π code, the data is a Long. DataMap extends Java’s HashMap, which implements Map; see http://doc.java.sun.com/DocWeb/api/java.util.HashMap
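A minimal sketch of the put/get usage (using only the DataMap calls that appear in the examples; the key name and value here are arbitrary):

import edu.uncc.grid.pgaf.datamodules.DataMap;

public class DataMapSketch {
    public static void main(String[] args) {
        DataMap<String, Object> d = new DataMap<String, Object>();
        d.put("seed", 12345L);            // store a Long under a String key
        Long seed = (Long) d.get("seed"); // retrieve it by the same key (cast needed)
        System.out.println(seed);
    }
}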
Generic structure of the three methods (the segment number is used by the framework to keep track of where to put results, and the framework casts the Data objects into DataMaps):

public Data DiffuseData(int segment) {
    DataMap<String, Object> d = new DataMap<String, Object>();
    inputData = ...;                          // prepare this segment's input
    d.put("name_of_inputdata", inputData);
    return d;
}

public Data Compute(Data data) {
    DataMap<String, Object> input = (DataMap<String, Object>) data;  // data produced by DiffuseData()
    DataMap<String, Object> output = new DataMap<String, Object>();  // output returned to GatherData
    inputData = input.get("name_of_inputdata");
    ...                                        // computation
    output.put("name_of_results", results);    // to return to GatherData()
    return output;
}

public void GatherData(int segment, Data dat) {   // framework gives back the Data object with its segment number
    DataMap<String, Object> out = (DataMap<String, Object>) dat;
    outdata = out.get("name_of_results");
    result ...                                 // aggregate outdata from all the worker nodes;
                                               // result is a private variable
}
Bootstrap class This code deploys the framework and starts execution of the pattern (different patterns have similar code):

package edu.uncc.grid.example.workpool;

import java.io.IOException;
import net.jxta.pipe.PipeID;
import edu.uncc.grid.pgaf.Anchor;
import edu.uncc.grid.pgaf.Operand;
import edu.uncc.grid.pgaf.Seeds;
import edu.uncc.grid.pgaf.p2p.Types;

public class RunMonteCarloPiModule {
    public static void main(String[] args) {
        try {
            MonteCarloPiModule pi = new MonteCarloPiModule();
            Seeds.start("/path/to/seeds/seed/folder", false);
            PipeID id = Seeds.startPattern(new Operand(
                (String[]) null,
                new Anchor("hostname", Types.DataFlowRoll.SINK_SOURCE),
                pi));
            System.out.println(id.toString());
            Seeds.waitOnPattern(id);
            System.out.println("The result is: " + pi.getPi());
            Seeds.stop();
        } catch (SecurityException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Compiling/executing • Can be done on the command line (an ant script is provided) or through an IDE (Eclipse).
Another workpool example: matrix addition MatrixAddModule.java (continues over several slides):

package edu.uncc.grid.example.workpool;

import ...

public class MatrixAddModule extends Workpool {
    private static final long serialVersionUID = 1L;
    int[][] matrixA, matrixB, matrixC;

    public MatrixAddModule() {
        matrixC = new int[3][3];
    }

    public void initMatrices() {
        matrixA = new int[][]{{2,5,8},{3,4,9},{1,5,2}};
        matrixB = new int[][]{{2,5,8},{3,4,9},{1,5,2}};
    }

    public void initializeModule(String[] args) {
        Node.getLog().setLevel(Level.WARNING);
    }
    public Data DiffuseData(int segment) {
        int[] rowA = new int[3];
        int[] rowB = new int[3];
        DataMap<String, int[]> d = new DataMap<String, int[]>();
        int k = segment;                 // segment variable identifies the slave task
        for (int i = 0; i < 3; i++) {    // copy one row of A and one row of B into d
            rowA[i] = matrixA[k][i];
            rowB[i] = matrixB[k][i];
        }
        d.put("rowA", rowA);
        d.put("rowB", rowB);
        return d;
    }
    public Data Compute(Data data) {
        int[] rowC = new int[3];
        DataMap<String, int[]> input = (DataMap<String, int[]>) data;
        DataMap<String, int[]> output = new DataMap<String, int[]>();
        int[] rowA = (int[]) input.get("rowA");
        int[] rowB = (int[]) input.get("rowB");
        for (int i = 0; i < 3; i++) {    // computation
            rowC[i] = rowA[i] + rowB[i];
        }
        output.put("rowC", rowC);
        return output;
    }
    public void GatherData(int segment, Data dat) {
        DataMap<String, int[]> out = (DataMap<String, int[]>) dat;
        int[] rowC = (int[]) out.get("rowC");
        for (int i = 0; i < 3; i++) {
            matrixC[segment][i] = rowC[i];
        }
    }

    public void printResult() {
        for (int i = 0; i < 3; i++) {
            System.out.println("");
            for (int j = 0; j < 3; j++) {
                System.out.print(matrixC[i][j] + ",");
            }
        }
    }

    public int getDataCount() {
        return 3;
    }
}
Bootstrap class to run the framework, RunMatrixAddModule.java:

package edu.uncc.grid.example.workpool;

import ...

public class RunMatrixAddModule {
    public static String localhost = "T5400";
    public static String seedslocation = "C:\\seeds_2.0\\pgaf";

    public static void main(String[] args) {
        try {
            Seeds.start(seedslocation, false);
            MatrixAddModule m = new MatrixAddModule();
            m.initMatrices();
            PipeID id = Seeds.startPattern(new Operand(
                (String[]) null,
                new Anchor(localhost, Types.DataFlowRoll.SINK_SOURCE),
                m));
            Seeds.waitOnPattern(id);
            m.printResult();
            Seeds.stop();
        }
        ... // exceptions
    }
}
Multicore version of Seeds (just completed by Jeremy, September 2012) • Designed for a multicore shared-memory platform. • Much faster. • Thread-based: does not use the JXTA P2P network to run cluster nodes. • The bootstrap class does not need to start and stop JXTA P2P; otherwise the user code is similar:

public class RunMonteCarloPiModule {
    public static void main(String[] args) {
        try {
            MonteCarloPiModule pi = new MonteCarloPiModule();
            Thread id = Seeds.startPatternMulticore(new Operand(
                (String[]) null,
                new Anchor(args[0], Types.DataFlowRole.SINK_SOURCE),
                pi), 4);
            id.join();
            System.out.println("The result is: " + pi.getPi());
        } catch (SecurityException e) {
            ...
        }
    }
}
Matrix multiplication [Slide courtesy of Ben Barbour, Research Assistant/graduate student, Dept. of Computer Science, UNC-Wilmington.]
Seeds “CompleteSyncGraph” pattern • A pattern that combines all-to-all with synchronous iteration. • Slave processes can exchange data with each other at each iteration (synchronization point) without stopping the pattern, i.e. without returning from Seeds. • Some problems of this type require a number of iterations to converge on the solution. • Examples: • The N-body problem • Solving a general system of linear equations by iteration
Synchronous all-to-all pattern (Seeds CompleteSyncGraph pattern) Example with termination after converging on the solution: solving a general system of linear equations by iteration. Suppose the equations are of a general form with n equations and n unknowns x_0, x_1, x_2, …, x_{n-1}, the ith equation (0 <= i < n) being

a_{i,0} x_0 + a_{i,1} x_1 + … + a_{i,n-1} x_{n-1} = b_i
By rearranging the ith equation, process P_i computes

x_i = (1 / a_{i,i}) [ b_i − Σ_{j≠i} a_{i,j} x_j ]

[Diagram: processes P_0 … P_{n-1}; after each iteration, each P_i broadcasts its result to every other process (excluding itself), repeating until convergence.]
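A minimal sequential Java sketch of what each P_i computes (not Seeds code; the small diagonally dominant system is an assumed example so the iteration converges, and a fixed iteration count stands in for the convergence test):

public class JacobiSketch {
    public static void main(String[] args) {
        double[][] a = {{4, 1, 1}, {1, 5, 2}, {1, 2, 6}}; // assumed diagonally dominant example
        double[] b = {6, 8, 9};
        int n = b.length;
        double[] x = new double[n];        // initial guess: all zeros
        for (int iter = 0; iter < 50; iter++) {
            double[] xNew = new double[n];
            for (int i = 0; i < n; i++) {  // in the pattern, each P_i does this in parallel
                double sum = 0;
                for (int j = 0; j < n; j++)
                    if (j != i) sum += a[i][j] * x[j];
                xNew[i] = (b[i] - sum) / a[i][i];
            }
            x = xNew;                       // corresponds to the all-to-all exchange
        }
        for (double xi : x) System.out.println(xi);
    }
}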
Pattern operators You can create your own combined patterns with a pattern operator. Example: adding the Stencil and All-to-All synchronous patterns. Example use: heat-distribution simulation (Laplace’s equation). • Multiple cells in a stencil pattern work in a loop-parallel fashion, computing and synchronizing on each iteration. • However, every x iterations they must perform an all-to-all communication pattern to run an algorithm that detects termination (see the sketch below). Directly from Jeremy Villalobos’s PhD thesis
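A sketch of the combined control flow just described (plain Java with hypothetical helper names, not the Seeds operator API):

public class CombinedPatternSketch {
    static final int X = 10;                 // termination check interval (assumed)

    public static void main(String[] args) {
        boolean done = false;
        for (int iter = 0; !done; iter++) {
            stencilStep();                   // stencil pattern: exchange with neighbors, compute
            if (iter % X == X - 1) {
                done = terminationCheck(iter); // all-to-all pattern every X iterations
            }
        }
    }

    static void stencilStep() { /* per-cell computation elided */ }

    static boolean terminationCheck(int iter) {
        return iter > 100;                   // placeholder convergence test
    }
}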
Tutorial page: http://coit-grid01.uncc.edu/seeds/
Introducing pattern programming into the undergraduate curriculum • Pattern programming is introduced into our regular undergraduate parallel programming course before lower-level tools such as MPI, OpenMP, and CUDA. • Prototype course on NCREN between UNC-Charlotte and UNC-Wilmington. • Future offerings will include other sites.
Acknowledgments Introducing pattern programming into the undergraduate curriculum is supported by the National Science Foundation under grants #1141005 and #1141006, “Collaborative Research: Teaching Multicore and Many-Core Programming at a Higher Level of Abstraction,” a collaborative project with Dr. C. Ferner, co-PI at UNC-Wilmington. Note: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Pattern Programming Research Group • 2011 • Jeremy Villalobos (PhD awarded, continuing involvement) • Saurav Bhattara (MS thesis, graduated) • Spring 2012 • Yawo Adibolo (ITCS 6880 Individual Study) • Ayay Ramesh (ITCS 6880 Individual Study) • Fall 2012 • Haoqi Zhao (MS thesis, implementing a C++ version of Seeds) • Pohua Lee (BS senior project)
Research group home page http://coitweb.uncc.edu/~abw/PatternProgGroup Currently needs a password
Some publications • Jeremy F. Villalobos and Barry Wilkinson, “Skeleton/Pattern Programming with an Adder Operator for Grid and Cloud Platforms,” The 2010 International Conference on Grid Computing and Applications (GCA’10), July 12-15, 2010, Las Vegas, Nevada, USA. • Jeremy F. Villalobos and Barry Wilkinson, “Using Hierarchical Dependency Data Flows to Enable Dynamic Scalability on Parallel Patterns,” High-Performance Grid and Cloud Computing Workshop, 25th IEEE International Parallel & Distributed Processing Symposium, Anchorage (Alaska) USA, May 16-20, 2011. Also presented by B. Wilkinson as Session 4 in “Short Course on Grid Computing” Jornadas Chilenas de Computación, INFONOR-CHILE 2010, Nov. 18th - 19th, 2010, Antofagasta, Chile. http://coitweb.uncc.edu/~abw/Infornor-Chile2010/GridWorkshop/index.html