A Few Thoughts on Programming Models for Massively Parallel Systems
Bill Gropp and Rusty Lusk
Mathematics and Computer Science Division
www.mcs.anl.gov/~{gropp,lusk}
Application Realities
• The applications for massively parallel systems already exist
  • Because they take years to write
• They are in a variety of models
  • MPI
  • Shared memory
  • Vector
  • Other
• Challenges include expressing massive parallelism and giving natural expression to spatial and temporal locality.
What is the hardest problem?
• (Overly simplistic statement): Program difficulty is directly related to the relative gap in latency and overhead
• The biggest relative gap is the remote (MPI) gap, right?
Short Term
• Transition existing applications
  • Compiler does it all
    • Model: vectorizing compilers (with feedback to retrain the user)
  • Libraries (component software does it all)
    • Model: BLAS, CCA, “PETSc in PIM”
  • Take MPI or MPI/OpenMP codes only
• Challenges
  • Remember history: Cray vs. STAR-100 vs. attached processors
Mid Term
• Use variations or extensions of familiar languages
  • E.g., CoArray Fortran, UPC, OpenMP, HPF, Brook
• Issues:
  • Local vs. global. Where is the middle (for hierarchical algorithms)?
  • Dynamic software (see libraries, CCA above); adaptive algorithms
  • Support for modular or component-oriented software
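To make the "extensions of familiar languages" idea concrete, here is a minimal sketch (not from the slides) using OpenMP in C, one of the extensions listed above. The loop reads as ordinary C; the pragma exposes the parallelism and the reduction clause handles the shared accumulator.

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double x[N], y[N];
    double dot = 0.0;

    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* The annotation is the only change from the serial code: the loop is
       split across threads and the partial sums are combined safely. */
    #pragma omp parallel for reduction(+:dot)
    for (int i = 0; i < N; i++)
        dot += x[i] * y[i];

    printf("dot = %g\n", dot);
    return 0;
}
```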
Long Term
• Performance
  • How much can we shield the user from managing memory?
• Fault Tolerance
  • Particularly the impact on data distribution strategies
• Debugging for performance and correctness
  • Intel lessons: lock-out makes it difficult to perform post-mortems on parallel systems
Danger! Danger! Danger!
• Massively parallel systems are needed for hard, not easy, problems
• Programming models must make difficult problems possible; the focus must not be on making simple problems trivial.
  • E.g., fast dense matrix-matrix multiply isn’t a good measure of the suitability of a programming model.
Don’t Forget the 90/10 Rule
• 90% of the execution time is in 10% of the code
  • The performance focus emphasizes this 10%
• The other 90% of the effort goes into the other 90% of the code
  • Modularity, expressivity, and maintainability are important here
Supporting the Writing of Correct Programs
• Deterministic algorithms should have an expression that is easy to prove deterministic
  • This doesn’t mean enforcing a particular execution order or preventing the use of non-deterministic algorithms
• Races are just too hard to avoid
  • Only “hero” programmers may be able to avoid them reliably
• Will we have “structured parallel programming”?
  • Undisciplined access to shared objects is very risky
  • Like goto, access to shared objects is both powerful and (as was pointed out about goto) able to simplify programs
  • The challenge, repeated: what are the structured parallel programming constructs?
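As a small illustration of the point about undisciplined shared access, the sketch below (using OpenMP in C; not from the slides) contrasts a racy update of a shared counter with a structured reduction whose determinism is easy to see.

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    long count_racy = 0, count_safe = 0;

    /* Undisciplined access to a shared object: the increment is a
       read-modify-write, so updates can be lost (a data race). */
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        count_racy++;                 /* result is nondeterministic */

    /* A "structured" construct: the reduction clause states the intent
       (a deterministic sum) and is race-free by construction. */
    #pragma omp parallel for reduction(+:count_safe)
    for (long i = 0; i < N; i++)
        count_safe++;

    printf("racy=%ld safe=%ld (expected %d)\n", count_racy, count_safe, N);
    return 0;
}
```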
Concrete Challenges for Programming Models for Massively Parallel Systems
• Completeness of Expression
  • How many advanced and emerging algorithms do we exclude?
  • How many legacy applications do we abandon?
• Fault Tolerance
• Expressing (or avoiding) problem decomposition
• Correctness Debugging
• Performance Debugging
• I/O
• Networking
Completeness of Expression
• Can you efficiently implement MPI?
  • No, MPI is not the best or even a great model for WIMPS. But …
  • It is well defined
  • The individual operations are relatively simple
  • Parallel implementation issues are relatively well understood
  • MPI is designed for scalability (applications are already running on thousands of processors)
• Thus, any programming model should be able to implement MPI with a reasonable amount of effort. Consider MPI a “null test” of the power of a programming model.
  • Side effect: gives insight into how to transition existing MPI applications onto massively parallel systems
  • Gives some insight into the performance of many applications, because it factors the problem into local and non-local performance issues.
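For reference, a minimal MPI program in C sketching the kinds of operations the "null test" asks a candidate model to express: point-to-point messages and a collective. This is illustrative only, not a benchmark.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, token = 0, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Point-to-point: pass a token around a ring of processes. */
    if (rank == 0) {
        token = 42;
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
    }

    /* Collective: every process learns the global sum of the ranks. */
    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("token=%d, sum of ranks=%d\n", token, sum);
    MPI_Finalize();
    return 0;
}
```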
Fault Tolerance
• Do we require fault tolerance on every operation, or just on the application?
  • Checkpoints vs. “reliable computing”
  • Cost of fine- vs. coarse-grain guarantees
    • Software and performance costs!
• What is the support for fault-tolerant algorithms?
  • Coarse-grain (checkpoint) vs. fine-grain (transactions)
• Interaction with data decomposition
  • Regular decompositions vs. turning off dead processors
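A minimal sketch of the coarse-grain end of this spectrum: an iterative code that pays the checkpoint cost only every CKPT_EVERY steps. The work loop, the rank variable, and the per-rank file naming here are placeholders for illustration, not part of any proposed interface.

```c
#include <stdio.h>

#define NSTEPS     1000
#define CKPT_EVERY  100     /* coarse grain: pay the checkpoint cost rarely */
#define NLOCAL     4096

/* Write this task's piece of the state; "rank" would come from MPI in a
   real code, here it is just a parameter of the sketch. */
static void checkpoint(const double *u, int n, int step, int rank)
{
    char name[64];
    snprintf(name, sizeof name, "ckpt_r%d_s%06d.dat", rank, step);
    FILE *f = fopen(name, "wb");
    if (f) { fwrite(u, sizeof *u, (size_t)n, f); fclose(f); }
}

int main(void)
{
    double u[NLOCAL] = {0};
    int rank = 0;                          /* placeholder for an MPI rank */

    for (int step = 1; step <= NSTEPS; step++) {
        for (int i = 0; i < NLOCAL; i++)   /* stand-in for real work */
            u[i] += 1.0;

        if (step % CKPT_EVERY == 0)        /* coarse-grain guarantee only */
            checkpoint(u, NLOCAL, step, rank);
    }
    return 0;
}
```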
Problem Decomposition
• Decomposition-centric (e.g., data-centric) programming models
  • Vectors and streams are examples
• Divide-and-conquer or recursive generation (Mou, Leiserson, many others)
• More freedom in storage association (e.g., blocking to natural memory sizes; padding to eliminate false sharing)
Problem Decomposition Approaches
• Very fine grain (i.e., ignore decomposition)
  • Individual words. Many think that this is the most general way.
  • You build a fast UMA-PRAM and I’ll believe it.
  • Low overhead and latency tolerance require the discovery of significant independent work
• Special aggregates
  • Vectors, streams, tasks (object-based decompositions)
• Implicit, by user-visible specification
  • E.g., recursive subdivision
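The sketch below (illustrative, plain C, not from the slides) shows the recursive-subdivision idea: the decomposition is implied by the recursion tree, with a cutoff that can be matched to natural memory sizes, rather than by an explicit data-to-processor map. Each leaf is an independent unit of work a runtime could schedule in parallel.

```c
#include <stdio.h>

#define CUTOFF 1024   /* stop subdividing at a "natural memory size" block */

/* Recursive subdivision of a 1-D range: the decomposition is implicit in
   the recursion rather than stated as an explicit distribution. */
static double block_sum(const double *x, long lo, long hi)
{
    if (hi - lo <= CUTOFF) {
        double s = 0.0;
        for (long i = lo; i < hi; i++) s += x[i];
        return s;
    }
    long mid = lo + (hi - lo) / 2;
    return block_sum(x, lo, mid) + block_sum(x, mid, hi);
}

int main(void)
{
    static double x[1 << 20];
    for (long i = 0; i < (1 << 20); i++) x[i] = 1.0;
    printf("sum = %g\n", block_sum(x, 0, 1 << 20));
    return 0;
}
```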
Application Kernels
• Are needed to understand and evaluate candidates
• Risks
  • Not representative
  • Over-simplified
    • Implicit information exploited in the solution
    • (give example)
  • Under-simplified
    • Too hard to work with
  • Wrong evaluation metric
  • Results are “fragile”: small changes in the specification cause large changes in the results
    • Called “ill-posed” in numerical analysis
• Widely recognized: “the only real benchmark is your own application”
Example Application Kernels
• Bad:
  • Dense matrix-matrix multiply
    • Rarely a good algorithmic choice in practice
    • Too easy
    • (Even if most compilers don’t do a good job with this)
  • Fixed-length FFT
  • Jacobi sweeps
• Getting better:
  • Sparse matrix-vector multiply
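For concreteness, a minimal CSR sparse matrix-vector multiply in C (illustrative; a real kernel specification would also fix problem sizes and metrics). The indirect access through the column-index array is part of what makes this kernel more representative than dense matrix-matrix multiply.

```c
#include <stdio.h>

/* y = A*x with A in compressed sparse row (CSR) form: the gather through
   col[] gives the irregular memory access that dense kernels lack. */
static void spmv_csr(int n, const int *rowptr, const int *col,
                     const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[col[k]];
        y[i] = sum;
    }
}

int main(void)
{
    /* A tiny 3x3 example: [2 1 0; 0 3 0; 1 0 4]. */
    int    rowptr[] = {0, 2, 3, 5};
    int    col[]    = {0, 1, 1, 0, 2};
    double val[]    = {2, 1, 3, 1, 4};
    double x[]      = {1, 1, 1}, y[3];

    spmv_csr(3, rowptr, col, val, x, y);
    printf("y = %g %g %g\n", y[0], y[1], y[2]);   /* expect 3 3 5 */
    return 0;
}
```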
Reality Check
[Chart: hand-tuned vs. compiler-generated matrix-multiply performance, from ATLAS]
• Enormous effort is required to get good performance
Better Application Kernels
• Even better:
  • Sparse matrix assembly followed by matrix-vector multiply, on q of p processing elements, where the matrix elements are r x r blocks
    • Assembly: often a disproportionate amount of coding; stresses expressivity
    • q < p: supports hierarchical algorithms
    • Sparse matrix: captures many aspects of PDE simulation (explicit variable-coefficient problems, Krylov methods and some preconditioners, multigrid); r x r blocks are typical of real multi-component problems.
    • Freedoms: the data structure for the sparse matrix representation (but with bounded spatial overhead)
• Best:
  • Your description here (please!)
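A sketch of the r x r block aspect of this kernel, using a block-CSR layout in C. The block size, data layout, and test matrix here are arbitrary choices for illustration, not part of the proposed kernel specification.

```c
#include <stdio.h>

#define R 2   /* r x r blocks, typical of multi-component PDE problems */

/* y = A*x with A stored in block-CSR form: each stored entry is an RxR
   dense block, so small dense work sits inside a sparse, irregular
   outer structure. */
static void spmv_bsr(int nblockrows, const int *rowptr, const int *bcol,
                     const double *val, const double *x, double *y)
{
    for (int ib = 0; ib < nblockrows; ib++) {
        double acc[R] = {0.0};
        for (int k = rowptr[ib]; k < rowptr[ib + 1]; k++) {
            const double *blk = &val[(size_t)k * R * R];
            const double *xb  = &x[bcol[k] * R];
            for (int i = 0; i < R; i++)
                for (int j = 0; j < R; j++)
                    acc[i] += blk[i * R + j] * xb[j];
        }
        for (int i = 0; i < R; i++) y[ib * R + i] = acc[i];
    }
}

int main(void)
{
    /* One block row holding a single 2x2 identity block: y should equal x. */
    int    rowptr[] = {0, 1}, bcol[] = {0};
    double val[]    = {1, 0, 0, 1};
    double x[]      = {3, 4}, y[2];
    spmv_bsr(1, rowptr, bcol, val, x, y);
    printf("y = %g %g\n", y[0], y[1]);
    return 0;
}
```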
Some Other Comments
• Is a general-purpose programming model needed?
• Domain-specific environments
  • Combine languages, libraries, static, and dynamic tools
  • JIT optimization
  • Tools to construct efficient special-purpose systems
  • First steps in this direction
    • OpenMP (warts like “lastprivate” and all)
• Name the newest widely accepted, non-derivative programming language
  • Not T, Java, Visual Basic, Python
Challenges
• The Processor in Memory (PIM)
  • Ignore the M(assive). How can we program the PIM?
  • Implicitly adopts the hybrid model; pragmatic if ugly
• Supporting legacy applications
  • Implementing MPI efficiently at large scale
  • Reconsider SMP and DSM-style implementations (many current implementations are immature)
• Supporting important classes of applications
  • Don’t pick a single model
  • Recall Dan Reed’s comment about losing half the users with each new architecture
• Explicitly make tradeoffs between features
  • Massive virtualization vs. ruthless exploitation of compile-time knowledge
• Interacting with the OS
  • Is the OS interface intrinsically nonscalable?
  • Is the OS interface scalable, but only with heroic levels of implementation effort?
Scalable System Services
• 100,000 independent tasks
  • Are they truly independent? One property of related tasks is that the probability that a significant number will make the same (or any!) nonlocal system call (e.g., an I/O request) in the same time interval is far greater than random chance
• What is the programming model’s role in
  • Aggregating nonlocal operations?
  • Providing a framework in which it is natural to write programs that make scalable calls to system services?
Cautionary Tales
• Timers. An application programmer uses gettimeofday to time the program. Each thread uses this to generate profiling data.
• File systems. Some applications write one file per task (or one file per task per timestep), leading to zillions of files. How long does ls take? ls -lt? Don’t forget, all of the names are almost identical (worst-case sorting?)
• Job startup. 100,000 tasks start from their local executable, then all access a shared object (e.g., in MPI_Init). What happens to the file system?
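One way around the file-per-task tale is a single shared file written collectively. The sketch below assumes MPI-IO, with an arbitrary file name and a simple rank-contiguous layout; the point is that a collective write lets the I/O layer aggregate the requests instead of creating zillions of files.

```c
#include <mpi.h>

#define NLOCAL 1024

int main(int argc, char **argv)
{
    int rank;
    double u[NLOCAL];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < NLOCAL; i++) u[i] = (double)rank;

    /* One shared file for all tasks instead of one file per task: each
       rank writes its piece at a rank-determined offset, collectively. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "state.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)rank * NLOCAL * (MPI_Offset)sizeof(double);
    MPI_File_write_at_all(fh, offset, u, NLOCAL, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}
```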
New OS Semantics?
• Define value-return calls (e.g., file stat, gettimeofday) to allow on-the-fly aggregation
• A defensive move for the OS
  • You can always write a nonscalable program
• Define state updates with scalable semantics
  • Collective operations
  • Thread safe
  • Avoid seek; provide write_at and read_at
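A sketch, at the application level, of what on-the-fly aggregation of a value-return call could look like: one task makes the gettimeofday call and the result is distributed collectively. MPI is used here purely for illustration; the slide's point is that the OS or runtime could provide such semantics directly beneath a gettimeofday-like interface.

```c
#include <stdio.h>
#include <sys/time.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double now = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Instead of every task hitting the OS timer at once, one task makes
       the call and the value is distributed by a collective operation. */
    if (rank == 0) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        now = tv.tv_sec + 1e-6 * tv.tv_usec;
    }
    MPI_Bcast(&now, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("shared timestamp: %.6f\n", now);
    MPI_Finalize();
    return 0;
}
```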