Programming with Parallel Design Patterns
ITCS 4/5145 Parallel Programming, UNC-Charlotte, B. Wilkinson, 2013. Jan 14, 2014. PatternProg-1
Problem Addressed
• To make parallel programming more usable and scalable.
• Parallel programming (writing programs that use multiple computers and processors collectively to solve problems) has a very long history but is still a challenge.
Issues
Whereas in regular sequential programming the programmer has just one program to construct (and even that can be complex, with multiple paths), in parallel/distributed programming the programmer has multiple programs/sequences that execute at the same time. It can be very difficult to get this right for all possible interleavings of statements in time.
Depending upon the programming model, messages need to be sent between concurrent processes/threads to share information (message-passing model), or access to shared data must be controlled very carefully (shared memory model). The programmer must understand the complex interactions that can occur.
Traditional approach
Use low-level library routines within C programs that create parallel sequences and, in the case of distributed memory systems, send messages between separate processes. With these tools, the programmer constructs a plan for how cooperating processes/threads interact and explicitly implements every part of this plan. Each programmer may construct an entirely different way of doing this.
Generally OK for very small sample programs, but as applications get larger, are maintained by others, or interact with programs written by others, this ad hoc approach is unlikely to be successful in a professional environment.
What is needed is a structured approach using methods that are known to lead to good program designs.
Pattern Programming Concept
Programmer begins by constructing the program using established computational or algorithmic “patterns” that provide a structure.
“Design patterns” have been part of software engineering for many years:
• Reusable solutions to commonly occurring problems *
• Provide a guide to “best practices”, not a final implementation
• Provide a good scalable design structure
• Can reason more easily about programs
• Potential for automatic conversion into executable code, avoiding low-level programming. We do that here.
• Particularly useful for the complexities of parallel/distributed computing
* http://en.wikipedia.org/wiki/Design_pattern_(computer_science)
In Parallel/Distributed Computing
What patterns are we talking about?
• Higher-level algorithm patterns for forming a complete program, such as workpool, pipeline, stencil, map-reduce.
• Low-level algorithmic patterns that might be embedded into a program, such as fork-join, broadcast/scatter/gather.
• Lower-level patterns used to create higher-level patterns, i.e. a pattern hierarchy in which the implementation is at the level of the lower-level patterns.
• We will start with higher-level message-passing patterns.
Some Higher Level Message-Passing Patterns
Workpool: a very widely applicable pattern.
[Figure: workpool. Master (source/sink) and worker compute nodes joined by two-way connections.]
In a full workpool implementation, the master holds a queue of work (task queue). Each slave is given a task to do and returns its results to the master. Once a slave completes a task, it can be given another task from this queue, which gives the pattern its load-balancing quality.
[Figure: master with task queue, connected to slaves.]
Workpool embodies the basic idea of parallel/distributed computing: dividing a problem into parts that can be done together and arranging that each part is executed on different computing resources at the same time.
With the process-based workpool pattern, there is no access to shared data, as processes are completely separate entities without direct access to common shared data.
Later we will look at a slightly lower-level pattern called the master-slave pattern, which does not have a load-balancing task queue, and at the thread-based thread pool.
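The task-queue idea can be sketched on a single machine with threads. This is a minimal illustration only, not the process-based workpool described above and not Seeds code: it uses the standard java.util.concurrent thread pool, and the class name WorkpoolSketch, the task sumSquares, and the chunking scheme are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class WorkpoolSketch {
    // Hypothetical task: sum the squares of the integers in [lo, hi).
    static long sumSquares(int lo, int hi) {
        long s = 0;
        for (int i = lo; i < hi; i++) s += (long) i * i;
        return s;
    }

    // "Master": splits [0, n) into chunks, queues them, and gathers the results.
    // The pool's internal queue plays the role of the task queue; an idle
    // worker thread takes the next queued chunk, which balances the load.
    static long run(int n, int chunk, int workers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (int lo = 0; lo < n; lo += chunk) {
                final int a = lo;
                final int b = Math.min(lo + chunk, n);
                results.add(pool.submit(() -> sumSquares(a, b)));
            }
            long total = 0;
            for (Future<Long> f : results) total += f.get(); // gather partial sums
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(1000, 64, 4));
    }
}
```

There are more chunks than workers here, so a worker that finishes early simply picks up the next chunk.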
More Specialized High Level Patterns
Pipeline
[Figure: pipeline. Workers arranged as Stage 1, Stage 2, Stage 3, with a master. Legend: one-way connection, two-way connection, compute node, source/sink.]
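A pipeline can be sketched with one thread per stage and a queue between stages. This is a thread-based illustration under assumptions, not the message-passing implementation used later: the class name PipelineSketch, the END marker, and the three arithmetic stages are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.IntUnaryOperator;

public class PipelineSketch {
    // Hypothetical end-of-stream marker; real inputs must not use this value.
    static final int END = Integer.MIN_VALUE;

    // One stage: repeatedly take a value from 'in', apply 'op', put the result on 'out'.
    static Thread stage(BlockingQueue<Integer> in, BlockingQueue<Integer> out,
                        IntUnaryOperator op) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    int v = in.take();
                    if (v == END) { out.put(END); return; } // pass the marker on and stop
                    out.put(op.applyAsInt(v));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        return t;
    }

    // Source/sink: feeds the inputs into stage 1 and collects the output of stage 3.
    static List<Integer> run(int[] input) throws InterruptedException {
        BlockingQueue<Integer> q0 = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> q1 = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> q2 = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> q3 = new LinkedBlockingQueue<>();
        Thread s1 = stage(q0, q1, x -> x + 1);
        Thread s2 = stage(q1, q2, x -> x * 2);
        Thread s3 = stage(q2, q3, x -> x - 3);
        for (int v : input) q0.put(v);
        q0.put(END);
        List<Integer> out = new ArrayList<>();
        while (true) {
            int v = q3.take();
            if (v == END) break;
            out.add(v);
        }
        s1.join(); s2.join(); s3.join();
        return out;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(new int[] {1, 2, 3}));
    }
}
```

While stage 3 works on the first item, stages 2 and 1 can already be working on the second and third, which is where the parallelism comes from.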
Divide and Conquer
[Figure: divide and conquer. Compute nodes with a divide phase and a merge phase. Legend: two-way connection, compute node, source/sink.]
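Divide and conquer can be sketched with Java's standard fork/join framework. This is a shared-memory illustration under assumptions, not the message-passing form in the figure: the class DivideConquerSketch, the array-sum task, and the THRESHOLD value are hypothetical choices for the example.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class DivideConquerSketch {
    static class SumTask extends RecursiveTask<Long> {
        private static final int THRESHOLD = 1000; // below this size, solve directly
        private final int[] data;
        private final int lo, hi;

        SumTask(int[] data, int lo, int hi) {
            this.data = data;
            this.lo = lo;
            this.hi = hi;
        }

        @Override
        protected Long compute() {
            if (hi - lo <= THRESHOLD) {           // base case: small enough to sum directly
                long s = 0;
                for (int i = lo; i < hi; i++) s += data[i];
                return s;
            }
            int mid = lo + (hi - lo) / 2;         // divide
            SumTask left = new SumTask(data, lo, mid);
            SumTask right = new SumTask(data, mid, hi);
            left.fork();                          // left half may run on another thread
            long r = right.compute();             // right half runs on this thread
            return left.join() + r;               // merge
        }
    }

    static long sum(int[] data) {
        return ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
    }

    public static void main(String[] args) {
        int[] data = new int[10000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        System.out.println(sum(data));
    }
}
```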
All-to-All
All compute nodes can communicate with all the other nodes.
Usually a synchronous computation: performs a number of iterations to obtain a solution, e.g. the N-body problem.
[Figure: all-to-all. Legend: two-way connection, compute node, source/sink, master.]
Stencil
All compute nodes can communicate only with neighboring nodes.
On each iteration, each node communicates with its neighbors to get their stored computed values.
Usually a synchronous computation: performs a number of iterations to converge on a solution, e.g. solving Laplace’s/heat equation.
[Figure: stencil. Legend: two-way connection, compute node, source/sink.]
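The stencil computation for Laplace's equation can be sketched as a Jacobi iteration, where each interior point is replaced by the average of its four neighbors. The sketch below is sequential and only shows the per-iteration update that each node would perform; the class StencilSketch, the helper grid, and the boundary value are assumptions for illustration.

```java
public class StencilSketch {
    // One synchronous sweep: every interior point becomes the average of its
    // four neighbours, all computed from the values of the previous iteration.
    static double[][] sweep(double[][] g) {
        int rows = g.length, cols = g[0].length;
        double[][] next = new double[rows][];
        for (int i = 0; i < rows; i++) next[i] = g[i].clone(); // boundary carried over
        for (int i = 1; i < rows - 1; i++) {
            for (int j = 1; j < cols - 1; j++) {
                next[i][j] = 0.25 * (g[i - 1][j] + g[i + 1][j] + g[i][j - 1] + g[i][j + 1]);
            }
        }
        return next;
    }

    // Repeat the sweep a fixed number of times.
    static double[][] solve(double[][] g, int iterations) {
        for (int k = 0; k < iterations; k++) g = sweep(g);
        return g;
    }

    // Hypothetical test grid: n x n, boundary fixed at 'edge', interior 0.
    static double[][] grid(int n, double edge) {
        double[][] g = new double[n][n];
        for (int i = 0; i < n; i++) {
            g[0][i] = edge;
            g[n - 1][i] = edge;
            g[i][0] = edge;
            g[i][n - 1] = edge;
        }
        return g;
    }

    public static void main(String[] args) {
        double[][] g = solve(grid(5, 100.0), 500);
        System.out.println(g[2][2]); // approaches the boundary value
    }
}
```

In a parallel version each node would own part of the grid and exchange only its edge values with neighboring nodes before each sweep.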
Parallel Patterns
Advantages
• Abstracts/hides underlying computing environment
• Generally avoids deadlocks and race conditions
• Reduces source code size (lines of code)
• Leads to automated conversion into parallel programs without need to write with low-level message-passing routines such as MPI
• Hierarchical designs with patterns embedded into patterns, and pattern operators to combine patterns
Disadvantages
• New approach to learn
• Takes away some of the freedom from programmer
• Performance reduced (cf. using high-level languages instead of assembly language)
Previous/Existing Work
Patterns explored in several projects.
• Industrial efforts
  • Intel Threading Building Blocks (TBB), Intel Cilk Plus, Intel Array Building Blocks (ArBB). Focus on very low-level patterns such as fork-join.
• Universities
  • University of Illinois at Urbana-Champaign and University of California, Berkeley
  • University of Torino/Università di Pisa, Italy
Book by Intel authors
“Structured Parallel Programming: Patterns for Efficient Computation,” Michael McCool, James Reinders, Arch Robison, Morgan Kaufmann, 2012. Focuses on Intel tools.
Note on Terminology: “Skeletons”
Sometimes the term “skeleton” is used to describe “patterns”, especially directed acyclic graphs with a source, a computation, and a sink. We do not make that distinction and use the term “pattern” whether directed or undirected and whether acyclic or cyclic. Other work does the same.
Our approach (Jeremy Villalobos’ UNC-C PhD thesis)
Focuses on a few patterns of wide applicability (e.g. workpool, synchronous all-to-all, pipelined, stencil), but Jeremy took it much further than UPCRC and Intel. He developed a higher-level framework called “Seeds”.
Seeds uses the pattern approach to automatically distribute code across processor cores, computers, or geographically distributed computers, and to execute the parallel code.
“Seeds” Parallel Grid Application Framework
Some Key Features
• Pattern-programming
• Java user interface (C++ version in development)
• Self-deploys on computers, clusters, and geographically distributed computers
• Three development layers, exposing increasing detail. We will use the basic level.
http://coit-grid01.uncc.edu/seeds/
Seeds programming
Several patterns implemented, including Workpool, Pipeline, All-to-all, Stencil, …
Generally three phases:
• Master diffuses data to slaves
• Slaves perform computations
• Master gathers results from slaves
Programmer only has to specify what the master and slaves do, and what is transferred between them (objects), without implementing the low-level message-passing routines.
[Figure: Master (diffuse) → Slaves (compute) → Master (gather), connected by message passing.]
Basic User Programmer Interface
Two classes:
• Module class: diffuse, compute and gather methods, and any other methods associated with the application.
• Bootstrap (“run module”) class: creates an instance of the module class and starts the framework.
[Figure: “Module” class containing Diffuse, Compute and Gather; “Run module” bootstrap class.]
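The three phases can be pictured with a plain Java sketch. This is not the Seeds API: the interface Module, the driver runSequentially, and the example SquareSum are hypothetical names invented here, and the driver runs every segment in turn on one thread, whereas a framework would send each segment to a different slave.

```java
public class DiffuseComputeGatherSketch {
    // Hypothetical stand-in for a pattern module; NOT the Seeds Workpool class.
    interface Module<I, O> {
        int dataCount();                     // how many segments of work there are
        I diffuse(int segment);              // master: produce the input for one segment
        O compute(I input);                  // slave: do the work on that input
        void gather(int segment, O output);  // master: fold one result into the answer
    }

    // Hypothetical driver showing the order of the three phases for each segment.
    static <I, O> void runSequentially(Module<I, O> m) {
        for (int s = 0; s < m.dataCount(); s++) {
            I in = m.diffuse(s);
            O out = m.compute(in);
            m.gather(s, out);
        }
    }

    // Example module: sums the squares of 1..4, one number per segment.
    static class SquareSum implements Module<Integer, Long> {
        long total = 0;
        public int dataCount() { return 4; }
        public Integer diffuse(int segment) { return segment + 1; }
        public Long compute(Integer x) { return (long) x * x; }
        public void gather(int segment, Long out) { total += out; }
    }

    public static void main(String[] args) {
        SquareSum m = new SquareSum();
        runSequentially(m);
        System.out.println(m.total);
    }
}
```

The point of the split is that only compute runs on the slaves, so the module author never writes the send and receive calls that move the inputs and outputs.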
Example module class
Complete code (Monte Carlo pi in Assignment 1, see later for more details). The computation is in the Compute method. Note: no explicit message passing.

    package edu.uncc.grid.example.workpool;

    import java.util.Random;
    import java.util.logging.Level;
    import edu.uncc.grid.pgaf.datamodules.Data;
    import edu.uncc.grid.pgaf.datamodules.DataMap;
    import edu.uncc.grid.pgaf.interfaces.basic.Workpool;
    import edu.uncc.grid.pgaf.p2p.Node;

    public class MonteCarloPiModule extends Workpool {
        private static final long serialVersionUID = 1L;
        private static final int DoubleDataSize = 1000;
        double total;
        int random_samples;
        Random R;

        public MonteCarloPiModule() {
            R = new Random();
        }

        public void initializeModule(String[] args) {
            total = 0;
            Node.getLog().setLevel(Level.WARNING); // reduce verbosity for logging
            random_samples = 3000;                 // set number of random samples
        }

        public Data Compute (Data data) {
            // input gets the data produced by DiffuseData()
            DataMap<String, Object> input = (DataMap<String,Object>)data;
            DataMap<String, Object> output = new DataMap<String, Object>();
            Long seed = (Long) input.get("seed"); // get random seed
            Random r = new Random();
            r.setSeed(seed);
            Long inside = 0L;
            for (int i = 0; i < DoubleDataSize ; i++) {
                double x = r.nextDouble();
                double y = r.nextDouble();
                double dist = x * x + y * y;
                if (dist <= 1.0) {
                    ++inside;
                }
            }
            output.put("inside", inside); // store partial answer to return to GatherData()
            return output;                // output will emit the partial answers done by this method
        }

        public Data DiffuseData (int segment) {
            DataMap<String, Object> d = new DataMap<String, Object>();
            d.put("seed", R.nextLong());
            return d; // returns a random seed for each job unit
        }

        public void GatherData (int segment, Data dat) {
            DataMap<String,Object> out = (DataMap<String,Object>) dat;
            Long inside = (Long) out.get("inside");
            total += inside; // aggregate answer from all the worker nodes.
        }

        public double getPi() {
            // returns value of pi based on the job done by all the workers
            double pi = (total / (random_samples * DoubleDataSize)) * 4;
            return pi;
        }

        public int getDataCount() {
            return random_samples;
        }
    }
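The formula in getPi() follows from an area argument, sketched here under the assumption that the (x, y) pairs are uniform over the unit square, as nextDouble() is intended to provide. The fraction of points with x² + y² ≤ 1 estimates the area of a quarter circle of radius 1, where "inside" is the count accumulated in total and N is the number of points generated across all job units:

```latex
\[
\frac{\text{inside}}{N} \approx \frac{\text{area of quarter circle}}{\text{area of unit square}} = \frac{\pi/4}{1}
\quad\Longrightarrow\quad
\pi \approx 4\,\frac{\text{inside}}{N},
\qquad
N = \texttt{random\_samples} \times \texttt{DoubleDataSize} = 3000 \times 1000 = 3 \times 10^{6}.
\]
```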
Seeds Implementations
Three Java versions available (2013):
• Full JXTA P2P version requiring an Internet connection
• JXTA P2P version not needing an external network, suitable for a single computer
• Multicore (thread-based) version for operation on a single computer
The multicore version executes much faster on a single computer. The only difference is a minor change in the bootstrap class.
Bootstrap class: JXTA P2P version
This code deploys the framework and starts execution of the pattern. Different patterns have similar code.

    package edu.uncc.grid.example.workpool;

    import java.io.IOException;
    import net.jxta.pipe.PipeID;
    import edu.uncc.grid.pgaf.Anchor;
    import edu.uncc.grid.pgaf.Operand;
    import edu.uncc.grid.pgaf.Seeds;
    import edu.uncc.grid.pgaf.p2p.Types;

    public class RunMonteCarloPiModule {
        public static void main(String[] args) {
            try {
                MonteCarloPiModule pi = new MonteCarloPiModule();
                Seeds.start( "/path/to/seeds/seed/folder" , false);
                PipeID id = Seeds.startPattern(new Operand(
                    (String[])null,
                    new Anchor("hostname", Types.DataFlowRoll.SINK_SOURCE),
                    pi ));
                System.out.println(id.toString() );
                Seeds.waitOnPattern(id);
                Seeds.stop();
                System.out.println( "The result is: " + pi.getPi() ) ;
            } catch (SecurityException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

Bootstrap class: Multicore version

    public class RunMonteCarloPiModule {
        public static void main(String[] args) {
            try {
                MonteCarloPiModule pi = new MonteCarloPiModule();
                Thread id = Seeds.startPatternMulticore(
                    new Operand(
                        (String[])null,
                        new Anchor( args[0], Types.DataFlowRole.SINK_SOURCE),
                        pi ),
                    4);
                id.join();
                System.out.println( "The result is: " + pi.getPi() ) ;
            } catch (SecurityException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

Multicore version:
• Much faster on a multicore platform
• Thread based
• Bootstrap class does not need to start and stop JXTA P2P; Seeds.start() and Seeds.stop() are not needed. Otherwise user code is similar.
Measuring Time
Can instrument code in the bootstrap class:

    public class RunMyModule {
        public static void main (String [] args ) {
            try {
                long start = System.currentTimeMillis();
                MyModule m = new MyModule();
                Seeds.start( … );
                PipeID id = ( … );
                Seeds.waitOnPattern(id);
                Seeds.stop();
                long stop = System.currentTimeMillis();
                double time = (double) (stop - start) / 1000.0; // elapsed time in seconds
                System.out.println("Execution time = " + time);
            } catch (SecurityException e) {
                …
    …
Compiling/executing • Can be done on the command line (ant script provided) or through an IDE (Eclipse)
Acknowledgements Work initiated by Jeremy Villalobos in his PhD thesis “Running Parallel Applications on a Heterogeneous Environment with Accessible Development Practices and Automatic Scalability,” UNC-Charlotte, 2011. Jeremy developed “Seeds” pattern programming software. Extending work to teaching environment supported by the National Science Foundation under grant "Collaborative Research: Teaching Multicore and Many-Core Programming at a Higher Level of Abstraction" #1141005/1141006 (2012-2015). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
UNC-Charlotte Pattern Programming Group
http://coitweb.uncc.edu/~abw/PatternProgGroup/
Please contact B. Wilkinson if you would like to be involved in this work for academic credit.
Next step • Assignment 1 – using the Seeds framework