Parallelism Can we make it faster?
The RAM model • The RAM (Random Access Machine) model of computation assumes: • There is a single processing unit • There is an arbitrarily large amount of memory • Accessing any arbitrarily chosen (i.e. random) memory location takes unit time • This simple model is a very useful guide for algorithm design • For maximum efficiency, “tuning” to the particular hardware is required • The RAM model breaks down when its assumptions are violated • If an array is so large that only a portion of it fits in memory (the rest is on disk), very different sorting algorithms should be used
Approaches to parallelism • The basic question is, do the processing units share memory, or do they send messages to one another? • A thread consists of a single flow of control, a program counter, a call stack, and a small amount of thread-specific data • Threads share memory, and communicate by reading and writing to that memory • This is thread-based or shared-memory parallel processing • Java “out of the box” is thread-based • A process is like a thread, but with its own private memory • Processes (sometimes called actors) send messages to one another • This is message-passing parallel processing
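As a concrete illustration of the thread-based approach, here is a minimal Java sketch (the class and names are invented for this example): two threads share a single counter in memory, and synchronized is what keeps their updates from clobbering each other.

```java
public class SharedCounter {
    private int count = 0;                       // shared, mutable state

    // synchronized so concurrent increments do not interleave and lose updates
    public synchronized void increment() { count++; }

    public static void main(String[] args) throws InterruptedException {
        final SharedCounter counter = new SharedCounter();
        Runnable work = new Runnable() {
            public void run() {
                for (int i = 0; i < 100000; i++) counter.increment();
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();                  // both threads share counter's memory
        t1.join();  t2.join();                   // wait for both to finish
        System.out.println(counter.count);       // 200000, thanks to synchronized
    }
}
```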
The PRAM model • An obvious extension to the RAM model is the Parallel Random Access Machine (PRAM) model, which assumes: • There are multiple processing units • There is an arbitrarily large amount of memory • Accessing any memory location takes unit time • The third assumption is “good enough” for many in-memory sequential programs, but not good enough for parallel programs • If the processing units share memory, then complicated and expensive synchronization mechanisms must be used • If the processing units do not share memory, then each has its own (fast) local memory, and communicates with other processes by sending messages to them (much slower--especially if over a network!) • Bottom line: Because there seems to be no way to meet the unit time assumption, the PRAM model is seriously broken!
The CTA model • The Candidate Type Architecture model makes these assumptions: • There are P standard sequential processors, each with its own local memory • One of the processors may be acting as “controller,” doing things like initialization and synchronization • Processors can access non-local memory over a communication network • Non-local memory is between 100 times and 10000 times slower to access than local memory (based on common architectures) • A processor can make only a very small number (maybe 1 or 2) of simultaneous non-local memory accesses
Consequences of CTA • The CTA model does not specify how many processors are available • The programmer does not need to plan for some specific number of processors • More processors may cause the code to execute somewhat more quickly • The CTA model does specify a huge discrepancy between local and non-local memory access • The programmer should minimize the number of non-local memory accesses
Costs of parallelism • It would be great if having N processors meant our programs would run N times as fast, but... • There is overhead involved in setting up the parallelism, which we don’t need to pay for a sequential program • There are parts of any program that cannot be parallelized • Some processors will be idle because there is nothing for them to do • Processors have to contend for the same resources, such as memory, and may have to wait for one another
Overhead • Overhead is any cost incurred by the parallel algorithm but not by the corresponding sequential algorithm • Communication among threads and processes (a single thread has no other threads with which to communicate) • Synchronization is when one thread or process has to wait for results or events from another thread or process • Contention for a shared resource, such as memory • Java’s synchronized is used to wait for a lock to become free • Extra computation to combine the results of the various threads or processes • Extra memory may be needed to give each thread or process the memory required to do its job
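A rough sketch of where some of this overhead appears in Java (the array size and thread count are arbitrary choices for illustration): each worker computes a partial sum, the main thread must wait for every worker (synchronization), and the partial results take extra memory and extra computation to combine.

```java
public class PartialSums {
    public static void main(String[] args) throws InterruptedException {
        final int[] data = new int[1000000];
        java.util.Arrays.fill(data, 1);

        final int nThreads = 4;                        // arbitrary for illustration
        final long[] partial = new long[nThreads];     // extra memory: one slot per thread
        Thread[] workers = new Thread[nThreads];

        for (int t = 0; t < nThreads; t++) {
            final int id = t;
            workers[t] = new Thread(new Runnable() {
                public void run() {
                    int chunk = data.length / nThreads;
                    int from = id * chunk;
                    int to = (id == nThreads - 1) ? data.length : from + chunk;
                    long sum = 0;
                    for (int i = from; i < to; i++) sum += data[i];
                    partial[id] = sum;                 // each worker writes only its own slot
                }
            });
            workers[t].start();                        // overhead: creating and starting threads
        }

        long total = 0;
        for (int t = 0; t < nThreads; t++) {
            workers[t].join();                         // synchronization: wait for each worker
            total += partial[t];                       // extra computation: combine the results
        }
        System.out.println(total);                     // 1000000
    }
}
```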
Amdahl’s law • Some proportion P of a program can be made to run in parallel, while the remaining (1 - P) must remain sequential • If there are N processors, then the computation can be done in (1 - P) + P/N time • The maximum speedup is then 1 / ((1 - P) + P/N) • As N goes to infinity, the maximum speedup is 1/(1 - P) • For example, if P = 0.75, the maximum speedup is 1/0.25, or four times
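The formula is easy to check numerically; this small sketch (the values are chosen to match the next slide) simply evaluates it:

```java
public class Amdahl {
    // speedup predicted by Amdahl's law for parallel fraction p and n processors
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        System.out.println(speedup(0.75, 4));    // about 2.286
        System.out.println(speedup(0.75, 40));   // about 3.721
        System.out.println(1.0 / (1.0 - 0.75));  // limit as n goes to infinity: 4.0
    }
}
```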
Consequences of Amdahl’s law • If 75% of a program can be parallelized, and there are four processors, then the possible speedup is 1 / ((1 - 0.75) + 0.75/4) = 2.286 • But with 40 processors--ten times as many--the speedup is only 1 / ((1 - 0.75) + 0.75/40) = 3.721 • This has led many people (including Amdahl) to conclude that having lots of processors won’t help very much • However.... • For many problems, as the data set gets larger, • The inherently sequential part of the program remains (fairly) constant • Thus, the sequential proportion (1 - P) becomes smaller and the parallel proportion P grows • So: The greater the volume of data, the more speedup we can get
Idle time • Idle time results when • There is a load imbalance--one process may have much less work to do than another • A process must wait for access to memory or some other shared resource • Data in registers is most quickly accessed • Data in a cache is next most quickly accessed • A level 1 cache is the fastest, but also the smallest • A level 2 cache is larger, but slower • Memory--RAM--is much slower • Disk access is very much slower
Dependencies • A dependency is when one thread or process requires the result of another thread or process • Example: (a + b) * (c + d) • The additions can be done in parallel • The multiplication must wait for the results of the additions • Of course, at this level, the hardware itself handles the parallelism • Threads or processes that depend on results from other threads or processes must wait for those results
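A sketch of the same dependency expressed in Java with java.util.concurrent futures: the two additions run as independent tasks, and the multiplication blocks in get() until both results are ready.

```java
import java.util.concurrent.*;

public class Dependencies {
    public static void main(String[] args) throws Exception {
        final int a = 1, b = 2, c = 3, d = 4;
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // the two additions are independent, so they can run in parallel
        Future<Integer> left = pool.submit(new Callable<Integer>() {
            public Integer call() { return a + b; }
        });
        Future<Integer> right = pool.submit(new Callable<Integer>() {
            public Integer call() { return c + d; }
        });

        // the multiplication depends on both results, so it must wait for them
        int product = left.get() * right.get();
        System.out.println(product);             // 21
        pool.shutdown();
    }
}
```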
Parallelism in Java • Java uses the shared memory model • There are various competing Java packages (such as Akka and Kilim) to support message passing, but nothing yet in the official Java release • The programming language Erlang popularized the message-passing approach • Scala is a Java competitor that supports both approaches • Scala’s message passing is based on Erlang’s
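Standard Java has no actor library, but the flavor of message passing can be imitated with a java.util.concurrent.BlockingQueue; this is only a rough illustration of the idea, not how Akka, Kilim, or Erlang actually work.

```java
import java.util.concurrent.*;

public class MessagePassing {
    public static void main(String[] args) throws InterruptedException {
        // the queue is the only thing the two threads share: a mailbox for messages
        final BlockingQueue<String> mailbox = new LinkedBlockingQueue<String>();

        Thread receiver = new Thread(new Runnable() {
            public void run() {
                try {
                    String msg = mailbox.take();        // blocks until a message arrives
                    System.out.println("received: " + msg);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        receiver.start();

        mailbox.put("hello from the main thread");      // send a message
        receiver.join();
    }
}
```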
Concurrency in Java, I • Java Concurrency in Practice, by Brian Goetz, is the book to have if you need to do much concurrent programming in Java • The following 11 points are from his summary of basic principles • It’s the mutable state, stupid! • Make fields final unless they need to be mutable. • Immutable objects are automatically thread-safe. • Encapsulation makes it practical to manage the complexity. • Guard each mutable variable with a lock.
Concurrency in Java, II • Guard all variables in an invariant with the same lock. • Hold locks for the duration of compound actions. • A program that accesses a mutable variable from multiple threads without synchronization is a broken program. • Don’t rely on clever reasoning about why you don’t need to synchronize. • Include thread safety in the design process—or explicitly document that your class is not thread-safe. • Document your synchronization policy.
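A minimal sketch of several of these rules (the Range class is invented for this example): both fields participate in one invariant (lower <= upper), so both are guarded by the same lock, and each compound check-then-write holds that lock for its full duration.

```java
public class Range {
    // Invariant: lower <= upper. Both fields are guarded by the object's lock.
    private int lower = 0;
    private int upper = 10;

    // compound action (check, then write) performed while holding the lock
    public synchronized void setLower(int value) {
        if (value > upper) throw new IllegalArgumentException("lower > upper");
        lower = value;
    }

    public synchronized void setUpper(int value) {
        if (value < lower) throw new IllegalArgumentException("upper < lower");
        upper = value;
    }

    public synchronized int size() {
        return upper - lower;                 // reads are guarded by the same lock
    }
}
```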
Functional programming • In functional programming (FP): • A function is a value • It can be assigned to variables • It can be passed as an argument to another function • It can be returned as the result of a function call • There are much briefer ways of writing a literal function • Scala example: a => 101 * a • A function acts like a function in mathematics • If you call it with the same arguments, you will get the same result. Every time. Guaranteed. • Functions have no side effects • Immutable values are strongly emphasized over mutable values • Some languages, such as Haskell, don’t allow mutable values at all • Computation proceeds by the application of functions, not by changing the state of mutable variables
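Today’s Java has no literal function syntax like Scala’s, but the idea can be approximated with a small hand-written interface (IntToInt below is invented for this sketch, not a standard library type): the function becomes a value that can be stored in a variable, passed as an argument, and returned.

```java
public class Functional {
    // a function as a value: this interface lets us pass functions around
    interface IntToInt {
        int apply(int a);
    }

    // a pure function: same argument, same result, no side effects
    static final IntToInt timesOneHundredOne = new IntToInt() {
        public int apply(int a) { return 101 * a; }
    };

    // a function passed in as an argument, applied twice
    static int twice(IntToInt f, int x) {
        return f.apply(f.apply(x));
    }

    public static void main(String[] args) {
        System.out.println(timesOneHundredOne.apply(3));   // 303
        System.out.println(twice(timesOneHundredOne, 1));  // 101 * 101 = 10201
    }
}
```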
Why functional programming? • Here are the three most important reasons that functional programming is better for concurrency than imperative programming: • Immutable values are automatically thread safe • Immutable values are automatically thread safe • Immutable values are automatically thread safe
Why not functional programming? • Functional languages—Lisp, Haskell, ML, OCaml—have long been regarded as only for ivory-tower academics • Functional languages are “weird” (meaning: unfamiliar)
What’s happening now? • Moore’s law has ended • Instead of getting faster processors, we’re now getting more of them • Consequently, parallelism and concurrency have become much more important • After about ten years, CIS 120 is once again starting with OCaml • Python has gotten more functional • Other languages are getting more functional • Microsoft is starting to promote F# (an ML-family language) • Java 8 will have some functional features • Scala is a hybrid object/functional language that runs on the JVM and interoperates with Java, and it is freely available now
The End …for now