Parallel Computing Approaches & Applications Arthur Asuncion April 15, 2008
Roadmap
• Brief Overview of Parallel Computing
• U. Maryland work:
  • PRAM prototype
  • XMT programming model
• Current Standards:
  • MPI
  • OpenMP
• Parallel Algorithms for Bayesian Networks, Gibbs Sampling
Why Parallel Computing?
• Moore's law will eventually end.
• Processors are becoming cheaper.
• Parallel computing provides significant time and memory savings!
Parallel Computing
• Goal is to maximize efficiency / speedup:
  • Efficiency = Tseq / (P * Tpar) ≤ 1
  • Speedup = Tseq / Tpar ≤ P
• In practice, time savings are substantial, assuming communication costs are low and processor idle time is minimized.
• Orthogonal to:
  • Advancements in processor speeds
  • Code optimization and data structure techniques
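A quick worked example with hypothetical numbers: if a job takes Tseq = 100 seconds on one processor and Tpar = 20 seconds on P = 8 processors, then Speedup = 100 / 20 = 5 and Efficiency = 5 / 8 ≈ 0.63, i.e., on average each processor does useful work about 63% of the time.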
Some issues to consider
• Implicit vs. Explicit Parallelization
• Distributed vs. Shared Memory
• Homogeneous vs. Heterogeneous Machines
• Static vs. Dynamic Load Balancing
• Other Issues:
  • Communication Costs
  • Fault-Tolerance
  • Scalability
Main Questions
• How can we design parallel algorithms?
  • Need to identify the places in the algorithm that can be made concurrent
  • Need to understand data dependencies (the "critical path" is the longest chain of dependent calculations; e.g., summing n numbers is O(n) total work but has only an O(log n) critical path when organized as a balanced reduction tree)
• How do we implement these algorithms?
  • An engineering issue with many different options
U. Maryland Work (Vishkin)
• "FPGA-Based Prototype of a PRAM-On-Chip Processor," Xingzhi Wen and Uzi Vishkin, ACM Computing Frontiers, 2008
• Video: http://videos.webpronews.com/2007/06/28/supercomputer-arrives/
Goals
• Find a parallel computing framework that:
  • is easy to program
  • gives good performance with any amount of parallelism provided by the algorithm; namely, up- and down-scalability, including backwards compatibility on serial code
  • supports application programming (VHDL/Verilog, OpenGL, MATLAB) and performance programming
  • fits current chip technology and scales with it
• They claim that PRAM/XMT can meet these goals.
What is PRAM?
• "Parallel Random Access Machine"
• A virtual model of computation with some simplifying assumptions:
  • No limit on the number of processors.
  • No limit on the amount of shared memory.
  • Any number of concurrent accesses to shared memory take the same time as a single access.
• A simple model that can be analyzed theoretically
  • Eliminates focus on details like synchronization and communication
• Different types:
  • EREW: Exclusive read, exclusive write.
  • CREW: Concurrent read, exclusive write.
  • CRCW: Concurrent read, concurrent write.
XMT Programming Model
• XMT = "Explicit Multi-Threading"
• Assumes CRCW PRAM
• A multithreaded extension of C with 3 commands:
  • Spawn: starts parallel execution mode
  • Join: resumes serial mode
  • Prefix-sum: atomic command for incrementing a variable
Simple Example
• Task: copy the nonzero elements of array A into array B (a sketch of the kernel follows below)
• $ is the thread ID
• PS is the prefix-sum command
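The slide's original code listing is not reproduced here; the following is a minimal XMTC-style sketch of the array-compaction kernel, under the assumption that ps(e, base) atomically adds e to the shared base counter and returns the counter's previous value in e. The variable names, the psBaseReg declaration, and the array size N are illustrative, not taken from the slide.

int A[N], B[N];          /* input array A, compacted output B           */
psBaseReg base = 0;      /* shared prefix-sum base: next free slot in B */

spawn(0, N - 1) {        /* start one virtual thread per element of A   */
    int e = 1;
    if (A[$] != 0) {     /* $ is this thread's ID                       */
        ps(e, base);     /* e <- old value of base; base <- base + 1    */
        B[e] = A[$];     /* write the nonzero element into its slot     */
    }
}                        /* implicit join: execution returns to serial  */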
Architecture of PRAM prototype
• MTCU ("Master Thread Control Unit"): handles the sequential portions
• TCU clusters: handle the parallel portions
• Shared PS unit: the only way to communicate!
• Shared cache: 8 shared cache modules, 32 KB each
• 64 separate processors, each at 75 MHz; 1 GB RAM
Performance Results
• Measured using the 64-processor prototype
• Projected results scale the clock from 75 MHz to 800 MHz
Human Results
• "As PRAM algorithms are based on first principles that require relatively little background, a full day (300-minute) PRAM/XMT tutorial was offered to a dozen high-school students in September 2007. Followed up with only a weekly office-hour by an undergraduate assistant, some strong students have been able to complete 5 of 6 assignments given in a graduate course on parallel algorithms."
• In other words: XMT is an easy way to program in parallel.
Main Claims
• "First commitment to silicon for XMT"
  • An actual attempt to implement a PRAM
• "Timely case for the education enterprise"
  • XMT can be learned easily, even by high schoolers.
• "XMT is a candidate for the Processor of the Future"
My Thoughts
• Making parallel programming as pain-free as possible is desirable, and XMT makes a good attempt at this.
  • Performance is a secondary goal.
• Their technology does not seem to be ready for prime time yet:
  • 75 MHz processors
  • No floating-point operations, no OS
MPI Overview
• MPI ("Message Passing Interface") is the standard for distributed-memory computing
• Essentially a library for C/Fortran programs that lets processes send messages to each other
• A tutorial: http://www.cs.gsu.edu/~cscyip/csc4310/MPI1.ppt
• (A minimal example is sketched below.)
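Not from the original slides: a minimal, hedged sketch of an MPI program in C that sends one integer from rank 0 to rank 1, just to show the Init / Send / Recv / Finalize pattern.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;

    MPI_Init(&argc, &argv);                   /* start up the MPI runtime   */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* which process (rank) am I? */

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1   */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                          /* from rank 0 */
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();                           /* shut down MPI              */
    return 0;
}

Such a program is typically built with mpicc and launched with something like mpirun -np 2 ./a.out.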
OpenMP Overview
• OpenMP is the standard for shared-memory computing
• Extends C with compiler directives (pragmas) that mark parallel sections
• Normally used to parallelize "for" loops
• Tutorial: http://vergil.chemistry.gatech.edu/resources/programming/OpenMP.pdf
• (A minimal example is sketched below.)
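Also not from the slides: a hedged sketch of the typical OpenMP use case, parallelizing an independent-iteration "for" loop in C; the array size and loop body are arbitrary placeholders.

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];   /* static so the arrays are not on the stack */
    double sum = 0.0;

    /* Each iteration is independent, so the loop can be split across
       threads with a single compiler directive; the reduction clause
       combines the per-thread partial sums safely.                     */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * b[i] + 1.0;
        sum += a[i];
    }

    printf("sum = %f (using up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}

With GCC this would be compiled with the -fopenmp flag, e.g., gcc -fopenmp example.c.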
Parallel Computing in AI/ML
• Parallel Inference in Bayesian networks
• Parallel Gibbs Sampling
• Parallel Constraint Satisfaction
• Parallel Search
• Parallel Neural Networks
• Parallel Expectation Maximization, etc.
Finding Marginals in Parallel through "Pointer Jumping" (Pennock, UAI 1998)
• Each variable is assigned to a separate processor
• Processors repeatedly rewrite their conditional probabilities in terms of their grandparent, so the longest chain of dependencies roughly halves at each step (a generic sketch of the pointer-jumping primitive follows below)
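Pennock's algorithm itself manipulates conditional probability tables; as a simpler illustration of the underlying primitive, here is a hedged sketch of generic pointer jumping (list ranking) in C, where every node learns its distance to the end of a chain in O(log n) rounds. The arrays, the chain, and the sequential emulation of the "parallel" step are illustrative only.

#include <stdio.h>

#define N 8

/* next[i] = successor of node i (i itself if it is the last node);
   dist[i] = current known distance from node i to the end of the chain. */
int next[N], dist[N];

int main(void) {
    /* Build a simple chain 0 -> 1 -> ... -> N-1 for illustration. */
    for (int i = 0; i < N; i++) {
        next[i] = (i < N - 1) ? i + 1 : i;
        dist[i] = (i < N - 1) ? 1 : 0;
    }

    /* Pointer jumping: in each round, every node links to its
       grandparent (next[next[i]]) and accumulates the distance.
       On a PRAM the inner loop runs on N processors in parallel;
       here it is emulated sequentially using temporary copies.    */
    int newNext[N], newDist[N];
    for (int round = 0; round < 3 /* ~log2(N) */; round++) {
        for (int i = 0; i < N; i++) {          /* the "parallel" step */
            newDist[i] = dist[i] + dist[next[i]];
            newNext[i] = next[next[i]];
        }
        for (int i = 0; i < N; i++) { dist[i] = newDist[i]; next[i] = newNext[i]; }
    }

    for (int i = 0; i < N; i++) printf("node %d: distance %d\n", i, dist[i]);
    return 0;
}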
Evidence Propagation
• "Arc Reversal" + "Evidence Absorption"
• Step 1: Make the evidence variable the root node and create a preorder walk (can be done in parallel)
• Step 2: Reverse arcs not consistent with that preorder walk (can be done in parallel), and absorb evidence
• Step 3: Run the "Parallel Marginals" algorithm
Generalizing to Polytrees
• Note: converting Bayesian networks to junction trees can also be done in parallel.
• Namasivayam et al., "Scalable Parallel Implementation of Bayesian Network to Junction Tree Conversion for Exact Inference," 18th Int. Symp. on Computer Architecture and High Performance Computing, 2006.
Complexity
• Time complexity:
  • O(log n) for polytree networks!
    • Assuming 1 processor per variable
    • n = # of processors/variables
  • O(r^(3w) log n) for arbitrary networks
    • r = domain size, w = largest cluster size
Parallel Gibbs Sampling
• Running multiple parallel chains is trivial (see the sketch below).
• Parallelizing a single chain can be difficult:
  • A Metropolis-Hastings step can be used to sample from the joint distribution correctly.
• Related ideas: Metropolis-coupled MCMC, Parallel Tempering, Population MCMC
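Not from the slides: a hedged sketch of the trivially parallel case, running independent Gibbs chains with OpenMP in C. The target distribution is a toy bivariate normal (correlation RHO), and the Box-Muller sampler with the POSIX rand_r function is only a placeholder for a real model and random number generator.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define NCHAINS  4
#define NSAMPLES 10000
#define RHO      0.8        /* correlation of the toy bivariate normal */
#define PI       3.14159265358979323846

/* Thread-safe standard normal draw via Box-Muller (placeholder RNG). */
static double randn(unsigned int *seed) {
    double u1 = (rand_r(seed) + 1.0) / ((double)RAND_MAX + 2.0);
    double u2 = (rand_r(seed) + 1.0) / ((double)RAND_MAX + 2.0);
    return sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
}

int main(void) {
    double mean_x[NCHAINS];

    /* Chains are completely independent, so run one chain per thread. */
    #pragma omp parallel for
    for (int c = 0; c < NCHAINS; c++) {
        unsigned int seed = 1234u + 99u * (unsigned int)c;  /* per-chain seed */
        double x = 0.0, y = 0.0, sum = 0.0;
        double sd = sqrt(1.0 - RHO * RHO);

        for (int t = 0; t < NSAMPLES; t++) {
            x = RHO * y + sd * randn(&seed);   /* Gibbs update: x | y */
            y = RHO * x + sd * randn(&seed);   /* Gibbs update: y | x */
            sum += x;
        }
        mean_x[c] = sum / NSAMPLES;            /* per-chain summary   */
    }

    for (int c = 0; c < NCHAINS; c++)
        printf("chain %d: mean of x = %f\n", c, mean_x[c]);
    return 0;
}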
Recap
• Many different ways to implement parallel algorithms (XMT, MPI, OpenMP)
• In my opinion, designing efficient parallel algorithms is the harder part.
• Parallel computing in the context of AI/ML is still not fully explored!