Memory Models: A Case for Rethinking Parallel Languages and Hardware COE 502/CSE 661 Fall 2011 Parallel Processing Architectures Professor: Muhamed Mudawar, KFUPM Presented by: Ahmad Hammad
Outline • Abstract • Introduction • Sequential Consistency • Data-Race-Free • Industry Practice and Evolution • Lessons Learned • Implications for Languages and Hardware • Conclusion
Abstract • Writing correct parallel programs remains far more difficult than writing sequential programs. • Most parallel programs are written using a shared-memory approach. • The memory model, which specifies the meaning of shared variables, is at the heart of this approach. • Its specification has historically involved a tradeoff between programmability and performance. • It remains one of the most challenging areas in both hardware architecture and programming language specification.
Abstract • Recent broad community-scale efforts have finally led to a convergence in this debate, • with popular languages such as Java and C++ and most hardware vendors publishing compatible memory model specifications. • Although this convergence is a dramatic improvement, it has exposed fundamental shortcomings that prevent achieving the vision of structured and safe parallel programming.
Abstract • This paper discusses the path to the above convergence, the hard lessons learned, and their implications. • A cornerstone of this convergence has been the view that the memory model should be a contract between the programmer and the system: • if the programmer writes disciplined (data-race-free) programs, the system will provide high programmability (sequential consistency) and performance.
Abstract • We discuss why this view is the best we can do with current popular languages, and why it is inadequate moving forward. • In particular, we argue that parallel languages should not only promote high-level disciplined models, but should also enforce the discipline. • For scalable and efficient performance, hardware should be co-designed to take advantage of and support such disciplined models.
Introduction • Most parallel programs today are written using threads and shared variables. • There are a number of reasons why threads remain popular. • Even before the dominance of multicore, threads were already widely supported by most operating systems because they are also useful for other purposes.
Introduction • Direct hardware support for shared memory provides a performance advantage. • The ability to pass memory references among threads makes it easier to share complex data structures. • Shared memory also makes it easier to selectively parallelize application hot spots without a complete redesign of data structures.
Memory Model • At the heart of the concurrency semantics of a shared-memory program. • It defines the set of values a read in a program can return. • It answers questions such as: • Is there enough synchronization to ensure a thread's write will occur before another's read? • Can two threads write to adjacent fields in a memory object at the same time?
Memory Model • We need an unambiguous memory model. • It defines an interface between a program and any hardware or software that may transform that program: • the compiler, the virtual machine, or any dynamic optimizer. • A complex memory model makes parallel programs difficult to write and parallel programming difficult to teach.
Memory Model • Since it is an interface property, the memory model has a long-lasting impact, affecting the portability and maintainability of programs. • An overly constraining model may limit hardware and compiler optimization, reducing performance.
Memory Model • The central role of the memory model has often been downplayed. • Specifying a model that balances all desirable properties has proven surprisingly complex: • programmability • performance • portability.
Memory Model • In the 1980s and 1990s, the area received attention primarily in the hardware community, • which explored many approaches with little consensus. • The challenge for hardware architects was the lack of a clear memory model at the programming language level: • it was unclear what programmers expected hardware to do.
Memory Model • Hardware researchers proposed approaches to bridge this gap, but there was no wide consensus from the software community. • Before 2000: • few programming environments addressed the issue clearly; • widely used environments had unclear and questionable specifications. • Even when specifications were relatively clear, they were often violated to obtain sufficient performance, • tended to be misunderstood even by experts, and were difficult to teach.
Memory Model • Since 2000, there have been efforts to cleanly specify programming-language-level memory models, • first for Java and C++, with efforts now underway to adopt similar models for C and other languages. • In the process, these efforts ran into subtle issues created by hardware that had evolved without clear memory-model guidance, making it difficult to reconcile the need for a simple and usable programming model with that for adequate performance on existing hardware.
Memory Model • Today, the above languages and most hardware vendors have published (or plan to publish) compatible memory model specifications. • Although this convergence is a dramatic improvement over the past, it has exposed fundamental shortcomings in our parallel languages and their interplay with hardware.
Memory Model • After decades of research, it is still unacceptably difficult to describe what value a load can return without compromising modern safety guarantees or implementation methodologies. • To us, this experience has made it clear that solving the memory model problem will require a significantly new and cross-disciplinary research direction for parallel computing languages, hardware, and environments as a whole.
Sequential Consistency • The execution of a multithreaded program with shared variables proceeds as follows: • each step in the execution consists of choosing one of the threads to execute, • then performing the next step in that thread's execution according to program order. • This process is repeated until the program as a whole terminates.
Sequential Consistency • An execution can be viewed as taking all steps executed by each thread and interleaving them. • The possible results are determined by trying all possible interleavings of the instructions. • Instructions from a single thread must occur in the interleaving in the same order as in the thread. • Lamport called an execution with these properties sequentially consistent.
Sequential Consistency • Fig. 1. Core of Dekker's algorithm: initially X = Y = 0; Thread 1 executes X = 1; r1 = Y; and Thread 2 executes Y = 1; r2 = X. Can r1 = r2 = 0? • Fig. 2. Some executions for Fig. 1: three possible interleavings that together illustrate all possible final values of the non-shared variables r1 and r2.
Sequential Consistency • Sequential consistency gives the simplest possible meaning for shared variables, but suffers from several related flaws. • Sequential consistency can be expensive to implement; almost all hardware and compilers violate it today. • A compiler may reorder the two independent assignments in each thread of Fig. 1, since scheduling loads early tends to hide the load latency. • Modern processors use a store buffer to avoid waiting for stores to complete. • Both the compiler and hardware optimizations make the outcome r1 = r2 = 0 possible: a non-sequentially-consistent execution.
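To make this concrete, here is a minimal Java sketch of the Dekker core of Fig. 1 (class and variable names are ours, for illustration only). Because x and y are plain, non-volatile fields, the two threads race, and on real JVMs and processors a run may print r1=0 r2=0:

    public class DekkerCore {
        static int x = 0, y = 0;  // shared, plain (non-volatile) fields
        static int r1, r2;        // results observed by each thread

        public static void main(String[] args) throws InterruptedException {
            Thread t1 = new Thread(() -> { x = 1; r1 = y; });
            Thread t2 = new Thread(() -> { y = 1; r2 = x; });
            t1.start(); t2.start();
            t1.join(); t2.join();
            // Under sequential consistency at least one of r1, r2 must be 1;
            // compiler and hardware reordering also allow r1 == r2 == 0.
            System.out.println("r1=" + r1 + " r2=" + r2);
        }
    }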
Data-Race-Free • Reordering accesses to unrelated variables in a thread • preserves the meaning of single-threaded programs • but can affect multithreaded programs. • Two memory accesses form a race if • they are from different threads, access the same location, and at least one is a write, and • they occur one right after the other in an execution. • A data-race-free program has no data race in any SC execution. • This requires that races be explicitly identified as synchronization: • locks, volatile variables in Java, atomics in C++.
Data-Race-Free • A program that does not allow a data race is said to be data-race-free. • The data-race-free model guarantees sequential consistency only for data-race-free programs. • For programs that allow data races, the model does not provide any guarantees.
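As a minimal sketch of how a racy program is made data-race-free (our own example, not from the paper): declaring the flag volatile in Java identifies it as a synchronization variable, which removes the race and restores sequentially consistent behavior:

    public class Publication {
        static volatile boolean ready = false;  // synchronization variable
        static int data = 0;                    // ordinary variable, now race-free

        public static void main(String[] args) {
            Thread writer = new Thread(() -> {
                data = 42;     // ordinary write...
                ready = true;  // ...published by the volatile write
            });
            Thread reader = new Thread(() -> {
                while (!ready) { }         // spin until the volatile flag is set
                System.out.println(data);  // guaranteed to print 42
            });
            writer.start(); reader.start();
        }
    }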
Data-Race-Free Approach • Programmability: simply identify the shared variables used for synchronization as synchronization variables. • Performance: specifies minimal constraints on ordinary accesses. • Portability: • the language must provide a way to identify races; • the hardware must provide a way to preserve ordering on races; • the compiler must translate correctly. • Synchronization accesses are performed indivisibly, • ensuring sequential consistency in spite of simultaneous accesses.
Data-Race-Free Approach • The most important hardware and compiler optimizations remain allowed for ordinary accesses. • Care must be taken only at the explicitly identified synchronization accesses. • Although data-race-free was formally proposed in 1990, • it did not see widespread adoption as a formal model in industry until recently.
Industry Practice and Evolution • Hardware Memory Models • High-Level Language Memory Models • The Java Memory Model • The C++ Memory Model
Hardware Memory Models • Most hardware supports relaxed models that are weaker than sequential consistency. • These models take an implementation- or performance-centric view: • desirable hardware optimizations drive the model specification. • Different vendors had different models: • Alpha, Sun, x86, Itanium, IBM, AMD, HP, Cray. • Each provides various ordering guarantees plus fences to impose other orders. • Many ambiguities remain, due to complexity by design and unclear documentation.
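For illustration only (our own sketch, not from the paper or any vendor manual): a fence imposes an ordering that a relaxed hardware model would not otherwise guarantee. Java exposes an analogous full fence via VarHandle.fullFence(); placing it between the store and the load in each thread of the Dekker core should, informally, rule out the r1 = r2 = 0 outcome:

    import java.lang.invoke.VarHandle;

    public class DekkerWithFence {
        static int x = 0, y = 0;
        static int r1, r2;

        public static void main(String[] args) throws InterruptedException {
            Thread t1 = new Thread(() -> {
                x = 1;
                VarHandle.fullFence();  // order the store to x before the load of y
                r1 = y;
            });
            Thread t2 = new Thread(() -> {
                y = 1;
                VarHandle.fullFence();  // order the store to y before the load of x
                r2 = x;
            });
            t1.start(); t2.start();
            t1.join(); t2.join();
            System.out.println("r1=" + r1 + " r2=" + r2);  // 0,0 should no longer occur
        }
    }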
High-Level Language Memory Models • Ada was the first widely used high-level programming language • to provide first-class support for shared-memory parallel programming. • It was advanced in its treatment of memory semantics: • it used a style similar to data-race-free, requiring legal programs to be well-synchronized, • but it did not fully formalize the notion of well-synchronized.
Pthreads and OpenMP • Most shared-memory programming has been enabled through libraries and APIs such as Posix threads and OpenMP. • These have incomplete, ambiguous model specifications. • A memory model is a property of the language, not of a library, so • it is unclear what compiler transformations are legal • and what the programmer is allowed to assume.
Pthreads and OpenMP • The Posix threads specification indicates a model similar to data-race-free, • but it has several inconsistent aspects, • with varying interpretations even among experts. • The OpenMP model is unclear and largely based on a flush operation that is similar to the fence instructions in hardware models.
Java • Java provided first-class support for threads, • with a chapter of the language specification devoted to its memory model. • Bill Pugh publicized fatal flaws in the Java model: • it was hard to interpret and badly broken; • common compiler optimizations were prohibited; • in many cases the model gave ambiguous or unexpected behavior.
Java Memory Model Highlights • In 2000, Sun appointed an expert group to revise the model through the Java community process, • with coordination through an open mailing list. • It attracted a variety of participants representing • hardware, software, researchers, and practitioners. • It took 5 years of debates over many competing models; • the final consensus model was approved in 2005 for Java 5.0. • The group quickly decided that the JMM must provide SC for data-race-free programs. • Unfortunately, the resulting model is very complex, • has some surprising behaviors, • and has recently been shown to have a bug.
Java Memory Model Highlights • The missing piece: semantics for programs with data races. • Java is meant to be a safe and secure language, • so it cannot allow arbitrary behavior for data races: • Java cannot have undefined semantics for ANY program. • It must provide safety/security guarantees to limit the damage from data races in untrusted code. • Problem: "safety/security" and "limited damage" in a multithreaded context were not clearly defined.
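A small sketch of why racy Java programs still need some semantics (our own example): the read of x below races with the write, yet Java must guarantee that it returns a value some thread actually wrote, 0 or 42, never an arbitrary bit pattern; in C++, by contrast, such a race is undefined behavior:

    public class RacyButBounded {
        static int x = 0;  // plain field: the accesses below form a data race

        public static void main(String[] args) throws InterruptedException {
            Thread writer = new Thread(() -> { x = 42; });
            Thread reader = new Thread(() -> {
                int r = x;  // racy read: Java still limits the damage,
                            // so r is 0 or 42, never an out-of-thin-air value
                System.out.println(r);
            });
            writer.start(); reader.start();
            writer.join(); reader.join();
        }
    }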
Java Memory Model Highlights • The final model was based on consensus, but it is complex. • Programmers can (and must) rely on SC for data-race-free programs, • but system designers must deal with the full complexity, • and bugs were discovered in the model as recently as 2008.
The C++ Memory Model • The situation in C++ was significantly different from Java. • The language itself provided no support for threads; • most concurrency used Pthreads • or the corresponding Microsoft Windows facilities. • Threads were defined by the combination of the C++ standard with the Pthreads standard, • which made it unclear to compiler writers precisely what they needed to implement.
The C++ Memory Model • In 2005, an effort to define a concurrency model for C++ was initiated. • The resulting effort expanded to include the definition of atomic operations and the threads API itself. • Microsoft concurrently started its own internal effort. • C++ is easier than Java because it is unsafe: • data-race-free is a possible model, since races can be left undefined. • BUT multicore brought new hardware optimizations and required more research: • the mismatch between hardware and programming views became painful, • with debate over whether SC for data-race-free is inefficient with the new hardware models.
Lessons Learned • Data-race-free provides a simple and consistent model for threads and shared variables. • Current hardware implies three significant weaknesses: • Debugging: accidental introduction of a data race results in "undefined behavior". • Synchronization variable performance: ensuring SC for synchronization accesses in Java or C++ on current hardware is expensive, and the workarounds complicate the model for experts. • Untrusted code: there is no way to ensure data-race-freedom in untrusted code (a problem for Java).
Lessons Learned • There is a need for: • higher-level disciplined models that enforce the discipline, • e.g., Deterministic Parallel Java; • hardware co-designed with high-level models, • e.g., DeNovo hardware.
Implications for Languages • Modern software needs to be both parallel and secure; • a choice between the two should not be acceptable. • We need disciplined shared-memory models that are: • Simple • Enforceable • Expressive • Performant
Implications for Languages • Simple enough to be easily teachable to undergraduates: • minimally, provide sequential consistency to programs that obey the required discipline. • Enforceable: enable the enforcement of the discipline; • violations of the discipline should not have undefined or complex semantics, • but should be caught and reported back to the programmer as illegal. • Expressive: general-purpose enough to express important parallel algorithms and patterns. • Performant: enable high and scalable performance.
Data-Race-Free • A near-term discipline: continue with data-race-free • and focus research on its enforcement. • Ideally, the language prohibits and eliminates data races by design; • otherwise, the runtime catches them as exceptions.
Deterministic-by-Default Parallel Programming • Many algorithms are deterministic: • a fixed input gives a fixed output. • This is the standard model for sequential programs, • and it also holds for many transformative parallel programs: • parallelism is not part of the problem specification, only a means to performance. • Deterministic Parallel Java (DPJ) is a language-based approach with these goals; a sketch of its style follows.
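As a rough illustration only, the DPJ-style sketch below approximates the region and effect annotations described in the DPJ literature; the exact syntax may differ, and this is not standard Java. The annotations let the compiler prove that the two parallel tasks write disjoint regions, so the program is deterministic by construction:

    // DPJ-style sketch (approximate syntax): regions partition the heap,
    // and effect annotations declare which regions each method reads or writes.
    class Pair {
        region Left, Right;            // two disjoint heap regions
        int left in Left;
        int right in Right;

        void setLeft(int v) writes Left { left = v; }
        void setRight(int v) writes Right { right = v; }

        void update() {
            cobegin {                  // parallel tasks with provably disjoint effects
                setLeft(1);            // writes region Left only
                setRight(2);           // writes region Right only
            }
        }
    }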
Implications for Hardware • Current hardware memory models are an imperfect match for even current software memory models. • Instruction set architecture changes • that identify individual loads and stores as synchronization could alleviate some short-term problems. • An established ISA, however, is hard to change • when existing code works mostly adequately • and there is not enough experience to document the benefits of the change.
Implications for Hardware • From a longer-term perspective, a more fundamental solution to the problem will emerge from a co-designed approach, • where future multicore hardware research evolves in concert with the software models research. • The DeNovo project at Illinois pursues this in concert with DPJ: • design hardware to exploit disciplined parallelism • for simpler hardware, • scalable performance, • and power/energy efficiency, • working with DPJ as an example disciplined model.
Conclusion • This paper gives a perspective based on work collectively spanning about thirty years. • The current way to specify concurrency semantics is fundamentally broken: • the best we can do is SC for data-race-free. • Hardware and software are mismatched, • and simple optimizations give unintended consequences. • We need: • high-level disciplined models that enforce the discipline • and hardware co-designed with high-level models.
Authors • Sarita V. Adve (sadve@illinois.edu) is a professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign. • Hans-J. Boehm (Hans.Boehm@hp.com) is a member of Hewlett-Packard's Exascale Computing Lab, Palo Alto, CA.