Applications and Runtime for Multicore/Manycore
March 21, 2007
Geoffrey Fox
Community Grids Laboratory, Indiana University
505 N Morton, Suite 224, Bloomington IN
gcf@indiana.edu
http://grids.ucs.indiana.edu/ptliupages/presentations/
RMS: Recognition Mining Synthesis (Intel's view of today's and tomorrow's workloads)
• Recognition ("What is …?"): build a model
• Mining ("Is it …?"): find a model instance
• Synthesis ("What if …?"): create a model instance
• Today: model-less recognition; real-time streaming and transactions on static, structured datasets; very limited realism
• Tomorrow: model-based multimodal recognition; real-time analytics on dynamic, unstructured, multimodal datasets; photo-realism and physics-based animation
Intel's Application Stack: discussed in seminars at http://grids.ucs.indiana.edu/ptliupages/presentations/PC2007/; the rest is mainly classic parallel computing.
Some Bioinformatics Datamining
1. Multiple Sequence Alignment (MSA)
• Kernel algorithms: HMM (Hidden Markov Model); pairwise alignments (dynamic programming) with heuristics (e.g. progressive, iterative methods); a sketch of the dynamic-programming kernel follows this list
2. Motif Discovery
• Kernel algorithms: MEME (Multiple Expectation Maximization for Motif Elicitation); Gibbs sampler
3. Gene Finding (Prediction)
• Hidden Markov Methods
4. Sequence Database Search
• Kernel algorithms: BLAST (Basic Local Alignment Search Tool); PatternHunter; FASTA
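To make the dynamic-programming alignment kernel above concrete, here is a minimal global-alignment sketch in Python (Needleman-Wunsch style). It is illustrative only: the scoring values and the two example sequences are my own assumptions, not parameters from any of the tools named above.

```python
# Minimal Needleman-Wunsch global alignment (dynamic programming sketch).
# Scoring scheme (match/mismatch/gap) is an illustrative assumption.
def align_score(a, b, match=1, mismatch=-1, gap=-2):
    n, m = len(a), len(b)
    # dp[i][j] = best score aligning a[:i] with b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]

if __name__ == "__main__":
    print(align_score("GATTACA", "GCATGCU"))  # example sequences are hypothetical
```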
Berkeley Dwarfs
• Dense Linear Algebra
• Sparse Linear Algebra
• Spectral Methods
• N-Body Methods
• Structured Grids
• Unstructured Grids
• Pleasingly Parallel
• Combinatorial Logic
• Graph Traversal
• Dynamic Programming
• Branch & Bound
• Graphical Models (HMM)
• Finite State Machine
Consistent in spirit with the Intel analysis. I prefer to develop a few key applications rather than debate their classification!
Client-side Multicore Applications
• "Lots of not very parallel applications"
• Gaming; graphics; codec conversion for multiple-user conferencing; …
• Complex data querying and data manipulation/optimization/regression; database and datamining (including computer vision) (Recognition and Mining in the Intel analysis)
• Statistical packages as in Excel and R
• Scenario and model simulations (Synthesis for Intel)
• Multiple users give rise to several server-side multicore applications
• There are important architecture issues, including memory bandwidth, not discussed here!
Approach I
• Integrate Intel, Berkeley and other sources, including databases (successful on current parallel machines, like scientific applications), and define parallel approaches in a "white paper"
• Develop some key examples testing 3 parallel programming paradigms:
• Coarse-grain functional parallelism (as in workflow), including pleasingly parallel instances with different data (see the sketch after this list)
• Fine-grain functional parallelism (as in integer programming)
• Data parallel (loosely synchronous, as in science)
• Construct them so they can use different runtimes, including perhaps CCR/DSS, MPI, Data Parallel .NET
• Maybe these will become libraries, used as in MapReduce, workflow coordination languages, …
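As a concrete illustration of the first paradigm, coarse-grain pleasingly parallel execution over different data, here is a minimal sketch using Python's multiprocessing; the function and data are placeholders of my own, not code from the project.

```python
# Pleasingly parallel coarse-grain functional parallelism:
# the same analysis runs independently on different data chunks.
from multiprocessing import Pool

def analyze(chunk):
    # Stand-in for an expensive, independent task (e.g. one sequence-database query).
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    chunks = [range(i * 1000, (i + 1) * 1000) for i in range(8)]
    with Pool(processes=4) as pool:
        results = pool.map(analyze, chunks)  # one task per chunk, no inter-task messages
    print(results)
```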
Approach II
• Have looked at CCR in MPI-style applications
• Seems to work quite well and supports more general messaging models
• NAS Benchmark using CCR to confirm its utility
• Developing 4 exemplar multicore parallel applications:
• Support Vector Machines (linear algebra): data parallel
• Deterministic Annealing (statistical physics): data parallel (see the sketch after this list)
• Computer Chess or Mixed Integer Programming: fine-grain parallelism
• Hidden Markov Method (Genetic Algorithms): loosely coupled functional parallelism
• Test high-level coordination of such parallel applications in libraries
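To show why the deterministic annealing exemplar is naturally data parallel, here is a minimal single-threaded sketch of deterministic annealing clustering; the temperature schedule, data, and parameter values are illustrative assumptions. The sums over data points in the inner loop are what would be split across cores.

```python
# Deterministic annealing clustering sketch (data-parallel structure).
# The loops over data points are the part that would be split across cores.
import numpy as np

def da_cluster(points, n_centers=3, t_start=10.0, t_stop=0.01, cooling=0.9):
    rng = np.random.default_rng(0)
    centers = points[rng.choice(len(points), n_centers, replace=False)].copy()
    temperature = t_start
    while temperature > t_stop:
        # Soft assignment: p(k|x) proportional to exp(-||x - y_k||^2 / T)
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        weights = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / temperature)
        weights /= weights.sum(axis=1, keepdims=True)
        # Re-estimate centers as weighted means (a data-parallel reduction)
        centers = (weights.T @ points) / weights.sum(axis=0)[:, None]
        temperature *= cooling
    return centers

if __name__ == "__main__":
    data = np.random.default_rng(1).normal(size=(300, 2))
    print(da_cluster(data))
```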
CCR for Data Parallel (Loosely Synchronous) Applications
• CCR supports general coordination of messages queued in ports, in Handler or Rendezvous mode
• DSS builds a service model on CCR and supports coarse-grain functional parallelism
• Basic CCR supports fine-grain parallelism as in computer chess (and use of STM-enabled primitives?)
• MPI has well-known collective communication operations which supply scalable global synchronization etc.
• Look at performance of MPI_Sendrecv (a sketch of the exchange pattern follows this list)
• What is the model that encompasses the best shared- and distributed-memory approaches for "data parallel" problems?
• This could be put on top of CCR?
• Much faster internal versions of CCR
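For reference, MPI_Sendrecv pairs a send and a receive in one call so that neighbors can exchange messages without deadlock. A minimal ring-exchange sketch using mpi4py follows; it is only an analogy for the pattern being measured, since the talk's measurements use CCR, not this code.

```python
# Ring exchange with MPI_Sendrecv: every rank sends to its right neighbor
# and receives from its left neighbor in a single, deadlock-free call.
# Requires mpi4py; run with e.g. `mpiexec -n 4 python ring_sendrecv.py`.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
right = (rank + 1) % size
left = (rank - 1) % size

received = comm.sendrecv(f"payload from rank {rank}", dest=right, source=left)
print(f"rank {rank} received: {received}")
```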
Four Communication Patterns used in CCR Tests: (a) Pipeline, (b) Shift, (c) Two Shifts, (d) Exchange. Each pattern connects Thread0–Thread3 through Port0–Port3; (a) and (b) use CCR Receive, while (c) and (d) use CCR Multiple Item Receive. Used on AMD 4-core, Xeon 4-core and Xeon 8-core machines; the latter run up to 8-way parallelism.
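To convey what the pipeline/shift patterns exercise, here is a minimal stand-in in Python that uses one queue per thread in place of a CCR port: each thread does a small unit of computation per stage, writes a message to the next thread's port, and reads its own port before continuing. It mimics only the structure of the CCR tests, not their performance.

```python
# Pipeline/shift pattern sketch: thread i does a unit of work per stage, then
# writes a message to the next thread's port (a Queue standing in for a CCR Port)
# and reads its own port before starting the next stage.
import threading
import queue

N_THREADS, N_STAGES = 4, 1000
ports = [queue.Queue() for _ in range(N_THREADS)]

def worker(tid):
    for stage in range(N_STAGES):
        _ = sum(i * i for i in range(200))        # stand-in for the per-stage computation
        ports[(tid + 1) % N_THREADS].put(stage)   # write to the next thread's port
        ports[tid].get()                          # read own port before the next stage

threads = [threading.Thread(target=worker, args=(t,)) for t in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("finished", N_STAGES, "stages on", N_THREADS, "threads")
```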
Exchanging Messages with a 1D Torus Exchange topology for loosely synchronous execution in CCR: at each stage every thread writes its exchanged messages to its neighbors' ports and reads the messages arriving on its own port. A single computation is broken into different numbers of stages, with per-stage computation varying from 1.4 microseconds to 14 seconds on the AMD machine (1.6 microseconds to 16 seconds on the quad-core Xeon).
4-way Pipeline Pattern, 4 Dispatcher Threads, HP Opteron
A fixed amount of computation (4×10^7 units) is divided among 4 cores and split into from 1 to 10^7 stages on the HP Opteron multicore, with each stage separated by reading and writing CCR ports in Pipeline mode. The plot shows run time (seconds) against number of stages (millions), together with the computation component expected if there were no overhead and an annotation marking where overhead equals computation; per-stage computation runs from 14 microseconds down to 1.4 microseconds, and the overhead averages 8.04 microseconds per stage from 1 to 10 million stages.
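The bookkeeping behind these overhead numbers is simple: total run time minus the ideal computation time, divided by the number of stages, gives the per-stage overhead. A toy restatement in Python follows, with a barrier standing in for the CCR port read/write at each stage boundary; because of CPython's global interpreter lock the absolute numbers are meaningless, and only the structure of the measurement is the point.

```python
# Per-stage overhead measurement sketch: run a fixed total computation split into
# n_stages stages on n_threads threads, with a barrier standing in for the CCR
# port read/write at each stage boundary; report (elapsed - ideal) / n_stages.
import threading
import time

def measure(n_threads=4, n_stages=1000, work_per_stage=200):
    barrier = threading.Barrier(n_threads)

    def worker():
        for _ in range(n_stages):
            _ = sum(i * i for i in range(work_per_stage))  # fixed per-stage computation
            barrier.wait()                                  # stage-boundary synchronization

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    t0 = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - t0

    # Ideal time: the same per-stage loop run with no synchronization at all.
    t0 = time.perf_counter()
    for _ in range(n_stages):
        _ = sum(i * i for i in range(work_per_stage))
    ideal = time.perf_counter() - t0

    print(f"overhead per stage ~ {(elapsed - ideal) / n_stages * 1e6:.2f} microseconds")

if __name__ == "__main__":
    measure()
```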
Stage Overhead versus Thread Computation Time
• Overhead per stage is roughly constant up to about a million stages (stage computation ranging from 14 seconds down to 14 microseconds) and then increases
4-way Pipeline Pattern, 4 Dispatcher Threads, Dell Xeon
A fixed amount of computation (4×10^7 units) is divided among 4 cores and split into from 1 to 10^7 stages on the Dell 2-processor, 2-core-per-processor Xeon multicore, with each stage separated by reading and writing CCR ports in Pipeline mode. The plot shows run time (seconds) against number of stages (millions), together with the computation component expected if there were no overhead; the overhead averages 12.40 microseconds per stage from 1 to 10 million stages.
Summary of Stage Overheads for the AMD 2-core, 2-processor Machine: stage-switching overheads for a set of runs with different levels of parallelism and different message patterns; each stage takes about 28 microseconds (500,000 stages).
Summary of Stage Overheads for the Intel 2-core, 2-processor Machine: stage-switching overheads for a set of runs with different levels of parallelism and different message patterns; each stage takes about 30 microseconds. AMD overheads are shown in parentheses. These measurements are equivalent to MPI latencies.
Summary of Stage Overheads for the Intel 4-core, 2-processor Machine: stage-switching overheads for a set of runs with different levels of parallelism and different message patterns; each stage takes about 30 microseconds. 2-core, 2-processor Xeon overheads are shown in parentheses. These measurements are equivalent to MPI latencies.
8-way Parallel Pipeline on two 4-core Xeons (XP Pro)
• Histogram of 100 runs; each run has 500,000 synchronizations following a thread execution that takes 33.92 microseconds, so the overhead of 6.1 microseconds is modest
• Message size is just one integer
• Choose a computation unit appropriate for a few microseconds of stage overhead
• For comparison: AMD 4-way, 27.94 microsecond computation unit, XP Pro
8-way Parallel Shift on two 4-core Xeons (XP Pro and Vista)
• Histogram of 100 runs; each run has 500,000 synchronizations following a thread execution that takes 33.92 microseconds, so the overhead of 8.2 microseconds is modest
• Shift versus pipeline adds about a microsecond to the cost
• Unclear what causes the second peak
• For comparison: AMD 4-way, 27.94 microsecond computation unit, XP Pro
8-way Parallel Double Shift on two 4-core Xeons (XP Pro)
• Histogram of 100 runs; each run has 500,000 synchronizations following a thread execution that takes 33.92 microseconds, so the overhead of 22.3 microseconds is significant
• Unclear why the double shift is slow compared to the shift
• Exchange performance partly reflects the number of messages
• Opteron overheads are significantly lower than Intel's
• For comparison: AMD 4-way, 27.94 microsecond computation unit, XP Pro
AMD 2-core, 2-processor Bandwidth Measurements
• Previously we measured latency, since those measurements used small messages. We did a further set of bandwidth measurements by exchanging larger messages of different sizes between threads (a sketch of such a measurement follows this list)
• We used three types of data structures for receiving data:
• Array in thread, equal to message size
• Array outside thread, equal to message size
• Data stored sequentially in a large array ("stepped" array)
• For both AMD and Intel, total bandwidth is 1 to 2 gigabytes/second
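As an illustration of this kind of bandwidth measurement, here is a small Python/NumPy sketch that times copying a message-sized array of double words into stepped locations inside one large destination array; the sizes, repeat count, and single-threaded setup are my own assumptions, not the CCR benchmark.

```python
# Bandwidth measurement sketch: time copying a message-sized array of double
# words into "stepped" locations inside one large destination array.
import time
import numpy as np

def copy_bandwidth(n_doubles, repeats=20, steps=2):
    src = np.ones(n_doubles, dtype=np.float64)             # message payload (double words)
    big = np.zeros(n_doubles * steps, dtype=np.float64)    # large "stepped" destination
    t0 = time.perf_counter()
    for r in range(repeats):
        s = (r % steps) * n_doubles
        big[s:s + n_doubles] = src                          # copy into a stepped location
    elapsed = time.perf_counter() - t0
    return src.nbytes * repeats / elapsed / 1e9             # gigabytes/second

if __name__ == "__main__":
    for n in (10**5, 10**6, 10**7):   # the small / large / largest sizes used in the talk
        print(f"{n:>9} double words: {copy_bandwidth(n):.2f} GB/s")
```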
Intel 2-core, 2-processor Bandwidth Measurements
• For bandwidth, the Intel did better than AMD, especially when one exploited the on-chip cache with small transfers
• For both AMD and Intel, each stage executed a computational task after copying data arrays of size 10^5 (labeled small), 10^6 (labeled large) or 10^7 double words. The last column is an approximate value in microseconds of the compute time for each stage. Note that copying 100,000 double-precision words per core at a gigabyte/second of total bandwidth takes 3200 µs (4 cores × 100,000 words × 8 bytes = 3.2 MB). The data to be copied (the message payload in CCR) is fixed, and its creation time is outside the timed process
4-way Pipeline Pattern, 4 Dispatcher Threads, Dell Xeon
Typical bandwidth measurements showing the effect of cache as a slope change: 5,000 stages, with run time (seconds) plotted against the size of the double array (millions of double words) copied in each stage from a thread to stepped locations in a large array on the Dell Xeon multicore. Total bandwidth is 1.0 gigabytes/sec up to one million double words and 1.75 gigabytes/sec up to 100,000 double words.
DSS Service Measurements: timing of the HP Opteron multicore as a function of the number of simultaneous two-way service messages processed (November 2006 DSS release).
• CGL measurements of Axis 2 show about 500 microseconds; DSS is 10 times better