A Few Thoughts on Programming Models for Massively Parallel Systems
Bill Gropp and Rusty Lusk
Mathematics and Computer Science Division
www.mcs.anl.gov/~{gropp,lusk}
Application Realities
• The applications for massively parallel systems already exist
  • Because they take years to write
• They are in a variety of models
  • MPI
  • Shared memory
  • Vector
  • Other
• Challenges include expressing massive parallelism and giving natural expression to spatial and temporal locality.
What is the hardest problem?
• (Overly simplistic statement): Program difficulty is directly related to the relative gap in latency and overhead
• The biggest relative gap is the remote (MPI) gap, right?
Short Term
• Transition existing applications
  • Compiler does it all
    • Model: vectorizing compilers (with feedback to retrain the user)
  • Libraries (component software does it all)
    • Model: BLAS, CCA, “PETSc in PIM”
  • Take MPI or MPI/OpenMP codes only
• Challenges
  • Remember history: Cray vs. STAR-100 vs. attached processors
Mid Term
• Use variations or extensions of familiar languages
  • E.g., CoArray Fortran, UPC, OpenMP, HPF, Brook
• Issues:
  • Local vs. global. Where is the middle (for hierarchical algorithms)?
  • Dynamic software (see libraries, CCA above); adaptive algorithms
  • Support for modular or component-oriented software
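To make the "extensions of familiar languages" idea concrete, here is a minimal sketch (not from the slides) using OpenMP in C, one of the extensions listed above. The loop reads as ordinary C; the pragma exposes the parallelism and the reduction clause handles the shared accumulator.

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double x[N], y[N];
    double dot = 0.0;

    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* The annotation is the only change from the serial code: the loop is
       split across threads and the partial sums are combined safely. */
    #pragma omp parallel for reduction(+:dot)
    for (int i = 0; i < N; i++)
        dot += x[i] * y[i];

    printf("dot = %g\n", dot);
    return 0;
}
```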
Long Term
• Performance
  • How much can we shield the user from managing memory?
• Fault Tolerance
  • Particularly the impact on data distribution strategies
• Debugging for performance and correctness
  • Intel lessons: lock-out makes it difficult to perform post-mortems on parallel systems
Danger! Danger! Danger!
• Massively parallel systems are needed for hard, not easy, problems
• Programming models must make difficult problems possible; the focus must not be on making simple problems trivial.
  • E.g., fast dense matrix-matrix multiply isn’t a good measure of the suitability of a programming model.
Don’t Forget the 90/10 Rule
• 90% of the execution time is in 10% of the code
  • The performance focus emphasizes this 10%
• The other 90% of the effort goes into the other 90% of the code
  • Modularity, expressivity, and maintainability are important here
Supporting the Writing of Correct Programs
• Deterministic algorithms should have an expression that is easy to prove deterministic
  • This doesn’t mean enforcing a particular execution order or preventing the use of non-deterministic algorithms
• Races are just too hard to avoid
  • Only “hero” programmers may be able to avoid them reliably
• Will we have “structured parallel programming”?
  • Undisciplined access to shared objects is very risky
  • Like goto, access to shared objects is both powerful and (as was pointed out about goto) able to simplify programs
  • The challenge, repeated: what are the structured parallel programming constructs?
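As a small illustration of the point about undisciplined shared access, the sketch below (using OpenMP in C; not from the slides) contrasts a racy update of a shared counter with a structured reduction whose determinism is easy to see.

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    long count_racy = 0, count_safe = 0;

    /* Undisciplined access to a shared object: the increment is a
       read-modify-write, so updates can be lost (a data race). */
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        count_racy++;                 /* result is nondeterministic */

    /* A "structured" construct: the reduction clause states the intent
       (a deterministic sum) and is race-free by construction. */
    #pragma omp parallel for reduction(+:count_safe)
    for (long i = 0; i < N; i++)
        count_safe++;

    printf("racy=%ld safe=%ld (expected %d)\n", count_racy, count_safe, N);
    return 0;
}
```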
Concrete Challenges for Programming Models for Massively Parallel Systems
• Completeness of Expression
  • How many advanced and emerging algorithms do we exclude?
  • How many legacy applications do we abandon?
• Fault Tolerance
• Expressing (or avoiding) problem decomposition
• Correctness Debugging
• Performance Debugging
• I/O
• Networking
Completeness of Expression
• Can you efficiently implement MPI?
  • No, MPI is not the best or even a great model for WIMPS. But …
  • It is well defined
  • The individual operations are relatively simple
  • Parallel implementation issues are relatively well understood
  • MPI is designed for scalability (applications are already running on thousands of processors)
• Thus, any programming model should be able to implement MPI with a reasonable amount of effort. Consider MPI a “null test” of the power of a programming model.
  • Side effect: gives insight into how to transition existing MPI applications onto massively parallel systems
  • Gives some insight into the performance of many applications, because it factors the problem into local and non-local performance issues.
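For reference, a minimal MPI program in C sketching the kinds of operations the "null test" asks a candidate model to express: point-to-point messages and a collective. This is illustrative only, not a benchmark.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, token = 0, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Point-to-point: pass a token around a ring of processes. */
    if (rank == 0) {
        token = 42;
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
    }

    /* Collective: every process learns the global sum of the ranks. */
    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("token=%d, sum of ranks=%d\n", token, sum);
    MPI_Finalize();
    return 0;
}
```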
Fault Tolerance
• Do we require fault tolerance on every operation, or just on the application?
  • Checkpoints vs. “reliable computing”
  • Cost of fine- vs. coarse-grain guarantees
    • Software and performance costs!
• What is the support for fault-tolerant algorithms?
  • Coarse-grain (checkpoint) vs. fine-grain (transactions)
• Interaction with data decomposition
  • Regular decompositions vs. turning off dead processors
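A minimal sketch of the coarse-grain end of this spectrum: an iterative code that pays the checkpoint cost only every CKPT_EVERY steps. The work loop, the rank variable, and the per-rank file naming here are placeholders for illustration, not part of any proposed interface.

```c
#include <stdio.h>

#define NSTEPS     1000
#define CKPT_EVERY  100     /* coarse grain: pay the checkpoint cost rarely */
#define NLOCAL     4096

/* Write this task's piece of the state; "rank" would come from MPI in a
   real code, here it is just a parameter of the sketch. */
static void checkpoint(const double *u, int n, int step, int rank)
{
    char name[64];
    snprintf(name, sizeof name, "ckpt_r%d_s%06d.dat", rank, step);
    FILE *f = fopen(name, "wb");
    if (f) { fwrite(u, sizeof *u, (size_t)n, f); fclose(f); }
}

int main(void)
{
    double u[NLOCAL] = {0};
    int rank = 0;                          /* placeholder for an MPI rank */

    for (int step = 1; step <= NSTEPS; step++) {
        for (int i = 0; i < NLOCAL; i++)   /* stand-in for real work */
            u[i] += 1.0;

        if (step % CKPT_EVERY == 0)        /* coarse-grain guarantee only */
            checkpoint(u, NLOCAL, step, rank);
    }
    return 0;
}
```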
Problem Decomposition
• Decomposition-centric (e.g., data-centric) programming models
  • Vectors and streams are examples
• Divide-and-conquer or recursive generation (Mou, Leiserson, many others)
• More freedom in storage association (e.g., blocking to natural memory sizes; padding to eliminate false sharing)
Problem Decomposition Approaches
• Very fine grain (i.e., ignore decomposition)
  • Individual words. Many think that this is the most general way.
  • You build a fast UMA-PRAM and I’ll believe it.
  • Low overhead and latency tolerance require the discovery of significant independent work
• Special aggregates
  • Vectors, streams, tasks (object-based decompositions)
• Implicit, by user-visible specification
  • E.g., recursive subdivision
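The sketch below (illustrative, plain C, not from the slides) shows the recursive-subdivision idea: the decomposition is implied by the recursion tree, with a cutoff that can be matched to natural memory sizes, rather than by an explicit data-to-processor map. Each leaf is an independent unit of work a runtime could schedule in parallel.

```c
#include <stdio.h>

#define CUTOFF 1024   /* stop subdividing at a "natural memory size" block */

/* Recursive subdivision of a 1-D range: the decomposition is implicit in
   the recursion rather than stated as an explicit distribution. */
static double block_sum(const double *x, long lo, long hi)
{
    if (hi - lo <= CUTOFF) {
        double s = 0.0;
        for (long i = lo; i < hi; i++) s += x[i];
        return s;
    }
    long mid = lo + (hi - lo) / 2;
    return block_sum(x, lo, mid) + block_sum(x, mid, hi);
}

int main(void)
{
    static double x[1 << 20];
    for (long i = 0; i < (1 << 20); i++) x[i] = 1.0;
    printf("sum = %g\n", block_sum(x, 0, 1 << 20));
    return 0;
}
```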
Application Kernels
• Are needed to understand and evaluate candidates
• Risks
  • Not representative
  • Over-simplified
    • Implicit information exploited in the solution
    • (give example)
  • Under-simplified
    • Too hard to work with
  • Wrong evaluation metric
  • Results are “fragile”: small changes in the specification cause large changes in the results
    • Called “ill-posed” in numerical analysis
• Widely recognized: “the only real benchmark is your own application”
Example Application Kernels
• Bad:
  • Dense matrix-matrix multiply
    • Rarely a good algorithmic choice in practice
    • Too easy
    • (Even if most compilers don’t do a good job with this)
  • Fixed-length FFT
  • Jacobi sweeps
• Getting better:
  • Sparse matrix-vector multiply
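For concreteness, a minimal CSR sparse matrix-vector multiply in C (illustrative; a real kernel specification would also fix problem sizes and metrics). The indirect access through the column-index array is part of what makes this kernel more representative than dense matrix-matrix multiply.

```c
#include <stdio.h>

/* y = A*x with A in compressed sparse row (CSR) form: the gather through
   col[] gives the irregular memory access that dense kernels lack. */
static void spmv_csr(int n, const int *rowptr, const int *col,
                     const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[col[k]];
        y[i] = sum;
    }
}

int main(void)
{
    /* A tiny 3x3 example: [2 1 0; 0 3 0; 1 0 4]. */
    int    rowptr[] = {0, 2, 3, 5};
    int    col[]    = {0, 1, 1, 0, 2};
    double val[]    = {2, 1, 3, 1, 4};
    double x[]      = {1, 1, 1}, y[3];

    spmv_csr(3, rowptr, col, val, x, y);
    printf("y = %g %g %g\n", y[0], y[1], y[2]);   /* expect 3 3 5 */
    return 0;
}
```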
Reality Check
[Chart: hand-tuned vs. compiler-generated matrix-multiply performance, from ATLAS]
• Enormous effort is required to get good performance
Better Application Kernels
• Even better:
  • Sparse matrix assembly followed by matrix-vector multiply, on q of p processing elements, where the matrix elements are r x r blocks
    • Assembly: often a disproportionate amount of coding; stresses expressivity
    • q < p: supports hierarchical algorithms
    • Sparse matrix: captures many aspects of PDE simulation (explicit variable-coefficient problems, Krylov methods and some preconditioners, multigrid); r x r blocks are typical of real multi-component problems.
    • Freedoms: the data structure for the sparse matrix representation (but with bounded spatial overhead)
• Best:
  • Your description here (please!)
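A sketch of the r x r block aspect of this kernel, using a block-CSR layout in C. The block size, data layout, and test matrix here are arbitrary choices for illustration, not part of the proposed kernel specification.

```c
#include <stdio.h>

#define R 2   /* r x r blocks, typical of multi-component PDE problems */

/* y = A*x with A stored in block-CSR form: each stored entry is an RxR
   dense block, so small dense work sits inside a sparse, irregular
   outer structure. */
static void spmv_bsr(int nblockrows, const int *rowptr, const int *bcol,
                     const double *val, const double *x, double *y)
{
    for (int ib = 0; ib < nblockrows; ib++) {
        double acc[R] = {0.0};
        for (int k = rowptr[ib]; k < rowptr[ib + 1]; k++) {
            const double *blk = &val[(size_t)k * R * R];
            const double *xb  = &x[bcol[k] * R];
            for (int i = 0; i < R; i++)
                for (int j = 0; j < R; j++)
                    acc[i] += blk[i * R + j] * xb[j];
        }
        for (int i = 0; i < R; i++) y[ib * R + i] = acc[i];
    }
}

int main(void)
{
    /* One block row holding a single 2x2 identity block: y should equal x. */
    int    rowptr[] = {0, 1}, bcol[] = {0};
    double val[]    = {1, 0, 0, 1};
    double x[]      = {3, 4}, y[2];
    spmv_bsr(1, rowptr, bcol, val, x, y);
    printf("y = %g %g\n", y[0], y[1]);
    return 0;
}
```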
Some Other Comments
• Is a general-purpose programming model needed?
• Domain-specific environments
  • Combine languages, libraries, static, and dynamic tools
  • JIT optimization
  • Tools to construct efficient special-purpose systems
  • First steps in this direction
    • OpenMP (warts like “lastprivate” and all)
• Name the newest widely accepted, non-derivative programming language
  • Not T, Java, Visual Basic, Python
Challenges
• The Processor in Memory (PIM)
  • Ignore the M(assive). How can we program the PIM?
  • Implicitly adopts the hybrid model; pragmatic if ugly
• Supporting legacy applications
  • Implementing MPI efficiently at large scale
  • Reconsider SMP and DSM-style implementations (many current implementations are immature)
• Supporting important classes of applications
  • Don’t pick a single model
  • Recall Dan Reed’s comment about losing half the users with each new architecture
• Explicitly make tradeoffs between features
  • Massive virtualization vs. ruthless exploitation of compile-time knowledge
• Interacting with the OS
  • Is the OS interface intrinsically nonscalable?
  • Is the OS interface scalable, but only with heroic levels of implementation effort?
Scalable System Services
• 100,000 independent tasks
  • Are they truly independent? One property of related tasks is that the probability that a significant number will make the same (or any!) nonlocal system call (e.g., an I/O request) in the same time interval is far greater than random chance
• What is the programming model’s role in
  • Aggregating nonlocal operations?
  • Providing a framework in which it is natural to write programs that make scalable calls to system services?
Cautionary Tales
• Timers. An application programmer uses gettimeofday to time the program. Each thread uses this to generate profiling data.
• File systems. Some applications write one file per task (or one file per task per timestep), leading to zillions of files. How long does ls take? ls -lt? Don’t forget, all of the names are almost identical (worst-case sorting?)
• Job startup. 100,000 tasks start from their local executable, then all access a shared object (e.g., in MPI_Init). What happens to the file system?
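One way around the file-per-task tale is a single shared file written collectively. The sketch below assumes MPI-IO, with an arbitrary file name and a simple rank-contiguous layout; the point is that a collective write lets the I/O layer aggregate the requests instead of creating zillions of files.

```c
#include <mpi.h>

#define NLOCAL 1024

int main(int argc, char **argv)
{
    int rank;
    double u[NLOCAL];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < NLOCAL; i++) u[i] = (double)rank;

    /* One shared file for all tasks instead of one file per task: each
       rank writes its piece at a rank-determined offset, collectively. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "state.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)rank * NLOCAL * (MPI_Offset)sizeof(double);
    MPI_File_write_at_all(fh, offset, u, NLOCAL, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}
```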
New OS Semantics?
• Define value-return calls (e.g., file stat, gettimeofday) to allow on-the-fly aggregation
• A defensive move for the OS
  • You can always write a nonscalable program
• Define state updates with scalable semantics
  • Collective operations
  • Thread safe
  • Avoid seek; provide write_at and read_at
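A sketch, at the application level, of what on-the-fly aggregation of a value-return call could look like: one task makes the gettimeofday call and the result is distributed collectively. MPI is used here purely for illustration; the slide's point is that the OS or runtime could provide such semantics directly beneath a gettimeofday-like interface.

```c
#include <stdio.h>
#include <sys/time.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double now = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Instead of every task hitting the OS timer at once, one task makes
       the call and the value is distributed by a collective operation. */
    if (rank == 0) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        now = tv.tv_sec + 1e-6 * tv.tv_usec;
    }
    MPI_Bcast(&now, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("shared timestamp: %.6f\n", now);
    MPI_Finalize();
    return 0;
}
```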