
Memory Models: A Case for Rethinking Parallel Languages and Hardware

This paper explores the challenges in writing parallel programs and the shortcomings of current memory models in achieving safe and structured parallel programming. It discusses the convergence in the industry and the need for disciplined models in both hardware and programming languages.




  1. Memory Models: A Case for Rethinking Parallel Languages and Hardware COE 502/CSE 661 Fall 2011 Parallel Processing Architectures Professor: Muhamed Mudawar, KFUPM Presented by: Ahmad Hammad

  2. Outline • Abstract • Introduction • Sequential Consistency • Data-Race-Free • Industry Practice and Evolution • Lessons Learned • Implications for Languages and Hardware • Conclusion

  3. Abstract • Writing correct parallel programs remains far more difficult than writing sequential programs. • Most parallel programs are written using a shared-memory approach. • The memory model specifies the meaning of shared variables. • Specifying it has involved a tradeoff between programmability and performance. • It is one of the most challenging areas in both hardware architecture and programming language specification.

  4. Abstract • Recent broad community-scale efforts have finally led to a convergence in this debate, • with popular languages such as Java and C++ and most hardware vendors publishing compatible memory model specifications. • Although this convergence is a dramatic improvement, it has exposed fundamental shortcomings that prevent achieving the vision of structured and safe parallel programming.

  5. Abstract • This paper discusses the path to the above convergence, the hard lessons learned, and their implications. • A cornerstone of this convergence has been the view that the memory model should be a contract between the programmer and the system – • if the programmer writes disciplined (data-race-free) programs, the system will provide high programmability (sequential consistency) and performance.

  6. Abstract • We discuss why this view is the best we can do with current popular languages, and why it is inadequate moving forward. • In particular, we argue that parallel languages should not only promote high-level disciplined models, but should also enforce the discipline. • For scalable and efficient performance, hardware should be co-designed to take advantage of and support such disciplined models.

  7. Introduction • Most parallel programs today are written using threads and shared variables. • There are a number of reasons why threads remain popular. • Even before the dominance of multicore, threads were already widely supported by most operating systems because they are also useful for other purposes.

  8. Introduction • Direct hardware support for shared memory provides a performance advantage. • The ability to pass memory references among threads makes it easier to share complex data structures. • Shared memory makes it easier to selectively parallelize application hot spots without a complete redesign of data structures.

  9. Memory Model • At the heart of the concurrency semantics of a shared-memory program. • It defines the set of values a read in a program can return. • It answers questions such as: • Is there enough synchronization to ensure a thread's write will occur before another's read? • Can two threads write to adjacent fields in a memory location at the same time?

  10. Memory Model • We should have an unambiguous memory model. • It defines an interface between a program and any hardware or software that may transform that program: • the compiler, the virtual machine, or any dynamic optimizer. • A complex memory model makes • parallel programs difficult to write • and parallel programming difficult to teach.

  11. Memory Model • Since it is an interface property, it • has a long-lasting impact, affecting the portability and maintainability of programs. • An overly constraining one may limit hardware and compiler optimization, reducing performance.

  12. Memory Model • The central role of the memory model has often been downplayed • Specifying a model that balances all desirable properties has proven surprisingly complex • programmability • performance • portability.

  13. Memory Model • In the 1980s and 1990s, the area received attention primarily in the hardware community, • which explored many approaches, with little consensus. • The challenge for hardware architects was the lack of a clear memory model at the programming language level: • it was unclear what programmers expected hardware to do.

  14. Memory Model • Hardware researchers proposed approaches to bridge this gap, but there was no wide consensus from the software community. • Before 2000: • few programming environments addressed the issue clearly; • widely used environments had unclear and questionable specifications. • Even when specifications were relatively clear, they were often violated to obtain sufficient performance, • tended to be misunderstood even by experts, and were difficult to teach.

  15. Memory Model • Since 2000, there have been efforts to cleanly specify programming-language-level memory models • for Java and C++, with efforts now underway to adopt similar models for C and other languages. • In the process, these efforts had to deal with issues created by hardware that had evolved in ways that made it difficult to reconcile the need for a simple and usable programming model with that for adequate performance on existing hardware.

  16. Memory Model • Today, the above languages and most hardware vendors have published (or plan to publish) compatible memory model specifications. • Although this convergence is a dramatic improvement over the past, it has exposed fundamental shortcomings in our parallel languages and their interplay with hardware.

  17. Memory Model • After decades of research, it is still unacceptably difficult to describe what value a load can return without compromising modern safety guarantees or implementation methodologies. • To us, this experience has made it clear that solving the memory model problem will require a significantly new and cross-disciplinary research direction for parallel computing languages, hardware, and environments as a whole.

  18. Sequential Consistency • The execution of a multithreaded program with shared variables is as follows. • Each step in the execution consists of choosing one of the threads to execute, • then performing the next step in that thread’s execution according to program order. • This process is repeated until the program as a whole terminates.

  19. Sequential Consistency • The execution can be viewed as taking all steps executed by each thread and interleaving them. • Possible results are determined by trying all possible interleavings of the instructions. • Instructions in a single thread must occur in the same order in the interleaving as in the thread. • Lamport called such an execution sequentially consistent.

  20. Sequential Consistency • Initially X = Y = 0. • Fig. 1: Core of Dekker's algorithm. Thread 1: X = 1; r1 = Y. Thread 2: Y = 1; r2 = X. Can r1 = r2 = 0? • Fig. 2: Some executions for Fig. 1, giving three possible interleavings that together illustrate all possible final values of the non-shared variables r1 and r2.

  21. Sequential Consistency • Sequential consistency gives the simplest possible meaning for shared variables, but suffers from several related flaws. • Sequential consistency can be expensive to implement. • Almost all hardware and compilers violate SC today: • compilers reorder the two independent assignments, since scheduling loads early tends to hide the load latency; • modern processors use a store buffer to avoid waiting for stores to complete. • Both the compiler and hardware optimizations make the outcome r1 = r2 = 0 possible: a non-sequentially consistent execution.

  22. Data-Race-Free • Reordering accesses to unrelated variables in a thread • preserves the meaning of single-threaded programs • but can affect multithreaded programs. • Two memory accesses form a race if • they are from different threads, access the same location, and at least one is a write, • and they occur one right after the other. • Data-race-free program: no data race in any SC execution. • Requires that races be explicitly identified as synchronization: • locks, volatile variables in Java, atomics in C++.

  23. data-race-free • A program that does not allow a data race is said to be data-race-free. • The data-race-free model guarantees sequential consistency only for data-race-free programs. • For programs that allow data races, the model does not provide any guarantees.

  24. Data-Race-Free Approach • Programmability: • simply identify the shared variables as synchronization variables. • Performance: • specifies minimal constraints. • Portability: • the language must provide a way to identify races; • the hardware must provide a way to preserve ordering on races; • the compiler must translate correctly. • Synchronization accesses are performed indivisibly • to ensure sequential consistency in spite of simultaneous accesses.

  25. Data-Race-Free Approach • The most important hardware and compiler optimizations are allowed for ordinary accesses. • Care must be taken at the explicitly identified synchronization accesses. • Data-race-free was formally proposed in 1990, • but it did not see widespread adoption as a formal model in industry until recently.

  26. Industry Practice and Evolution • Hardware Memory Models • High-Level Language Memory Models • The Java Memory Model • The C++ Memory Model

  27. Hardware Memory Models • Most hardware supports relaxed models that are weaker than sequential consistency. • These models take an implementation- or performance-centric view: • desirable hardware optimizations drive the model specification. • Different vendors had different models: • Alpha, Sun, x86, Itanium, IBM, AMD, HP, Cray. • Various ordering guarantees + fences to impose other orders. • Many ambiguities, due to complexity by design and unclear documentation.

  28. High-Level Language Memory Models • Ada was the first widely used high-level programming language • to provide first-class support for shared-memory parallel programming. • It was advanced in its treatment of memory semantics: • it used a style similar to data-race-free, requiring legal programs to be well-synchronized, • but it did not fully formalize the notion of well-synchronized.

  29. Pthreads and OpenMP • Most shared-memory programming is enabled through libraries and APIs such as POSIX threads and OpenMP. • Incomplete, ambiguous model specifications: • a memory model is a property of the language, not of a library, so • it is unclear what compiler transformations are legal • and what the programmer is allowed to assume.

  30. Pthreads and OpenMP • The POSIX threads specification • indicates a model similar to data-race-free, • with several inconsistent aspects • and varying interpretations even among experts. • The OpenMP model is • unclear and largely based on a flush operation that is similar to fence instructions in hardware models.

  31. Java • Java provided first-class support for threads, • with a chapter of its specification devoted to its memory model. • Bill Pugh publicized fatal flaws in the Java model: • the model was hard to interpret • and badly broken; • common compiler optimizations were prohibited, • and in many cases the model gave ambiguous or unexpected behavior.

  32. Java Memory Model Highlights • In 2000, Sun appointed an expert group to revise the model through the Java community process, • with coordination through an open mailing list. • It attracted a variety of participants representing • hardware, software, researchers, and practitioners. • It took 5 years of debates • over many competing models; • the final consensus model was approved in 2005 for Java 5.0. • The group quickly decided that the JMM must provide SC for data-race-free programs. • Unfortunately, this model is very complex, • has some surprising behaviors, • and has recently been shown to have a bug.

  33. Java Memory Model Highlights • Missing piece: semantics for programs with data races. • Java is meant to be a safe and secure language; • it cannot allow arbitrary behavior for data races. • Java cannot have undefined semantics for ANY program: • it must ensure safety/security guarantees to limit damage from data races in untrusted code. • Problem: "safety/security, limited damage" in a multithreaded context were not clearly defined.

  34. Java Memory Model Highlights • Final model based on consensus, but complex • Programmers (must) use SC for data-race-free • But system designers must deal with complexity • Recent discovery of bugs in 2008

  35. The C++ Memory Model • The situation in C++ was significantly different from Java. • The language itself provided no support for threads; • most concurrency used Pthreads • or the corresponding Microsoft Windows facilities. • The combination of the C++ standard with the Pthreads standard • made it unclear to compiler writers precisely what they needed to implement.

  36. The C++ Memory Model • In 2005, an effort to define a concurrency model for C++ was initiated. • The resulting effort expanded to include the definition of atomic operations and the threads API itself. • Microsoft concurrently started its own internal effort. • C++ was easier than Java because it is unsafe: • data-race-free is a possible model. • BUT multicore brought new hardware optimizations and more scrutiny; • the mismatch between hardware and programming views became painful, • with debate over whether SC for data-race-free is inefficient on new hardware models.

  37. Lessons Learned • Data-race-free provides a simple and consistent model for threads and shared variables. • Current hardware exposes three significant weaknesses: • Debugging: accidental introduction of a data race results in "undefined behavior." • Synchronization variable performance: the measures needed to ensure SC in Java or C++ on current hardware complicate the model for experts. • Untrusted code: there is no way to ensure data-race-freedom in untrusted code (Java).

  38. Lessons Learned • There is a need for: • Higher-level disciplined models that enforce discipline • Deterministic Parallel Java • Hardware co-designed with high-level models • DeNovo hardware

  39. Implications for Languages • Modern software needs to be both parallel and secure; • a choice between the two should not be acceptable. • Disciplined shared-memory models should provide: • Simplicity • Enforceability • Expressivity • Performance

  40. Implications for Languages • Simple enough to be easily teachable to undergraduates: • minimally, provide sequential consistency to programs that obey the required discipline. • Enable the enforcement of the discipline: • violations of the discipline should not have undefined or complex semantics; • they should be caught and reported back to the programmer as illegal. • General-purpose enough to express important parallel algorithms and patterns. • Enable high and scalable performance.

  41. Data-race-free • A near-term discipline: continue with data-race-free • and focus research on its enforcement. • Ideally, the language prohibits and eliminates data races by design; • else, the runtime catches them as exceptions.

  42. Deterministic-by-Default Parallel Programming • Many algorithms are deterministic: • fixed input gives fixed output. • This is the standard model for sequential programs • and also holds for many transformative parallel programs: • parallelism is not part of the problem specification, only there for performance. • Language-based approaches with such goals include Deterministic Parallel Java (DPJ).

  43. Implications for Hardware • Current hardware memory models are an imperfect match for even current software memory models. • Instruction set architecture changes • to identify individual loads and stores as synchronization • can alleviate some short-term problems. • An established ISA, however, is hard to change • when existing code works mostly adequately • and there is not enough experience to document the benefits of the change.

  44. Implications for Hardware • From a longer-term perspective, a more fundamental solution to the problem will emerge with a co-designed approach, • where future multicore hardware research evolves in concert with the software models research. • The DeNovo project at Illinois, in concert with DPJ: • design hardware to exploit disciplined parallelism for • simpler hardware, • scalable performance, • power/energy efficiency; • working with DPJ as an example disciplined model.

  45. Conclusion • This paper gives a perspective based on work collectively spanning about thirty years. • Current way to specify concurrency semantics fundamentally broken • Best we can do is SC for data-race-free • Mismatched hardware-software • Simple optimizations give unintended consequences • Need • High-level disciplined models that enforce discipline • Hardware co-designed with high-level model

  46. Authors • Sarita V. Adve (sadve@illinois.edu) is a professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign. • Hans-J. Boehm (Hans.Boehm@hp.com) is a member of Hewlett-Packard's Exascale Computing Lab, Palo Alto, CA.
