Parallel Computer Architecture and Interconnect
Types of Parallel Computer Architecture Two principal types: • Shared memory multiprocessor: From a strictly hardware point of view, a computer architecture in which all processors have direct (usually bus-based) access to common physical memory. In a programming sense, a model in which parallel tasks all have the same "picture" of memory and can directly address and access the same logical address space. • Distributed memory multicomputer: In hardware, an architecture in which processors reach physical memory that is not common to them over a network. As a programming model, tasks can logically "see" only local machine memory and must use communications to access memory on other machines. Ref: slides from B. Wilkinson at UNC-Charlotte, 2006, and Kumar, Introduction to Parallel Computing.
Conventional Computer • Virtually all computers have followed a common machine model known as the von Neumann computer, named after the Hungarian mathematician John von Neumann. • A von Neumann computer uses the stored-program concept. The CPU executes a stored program that specifies a sequence of read and write operations on the memory. Each main memory location is located by its address. Addresses start at 0 and extend to 2^b - 1 when there are b bits (binary digits) in the address.
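As a quick check of the address arithmetic above, the following sketch computes the address range for a given width. The function name `address_range` is made up for this illustration.

```python
def address_range(b):
    """Return (lowest, highest) address for a b-bit address."""
    if b <= 0:
        raise ValueError("address width must be a positive number of bits")
    # b bits can encode 2**b distinct values, numbered 0 .. 2**b - 1.
    return 0, 2**b - 1


print(address_range(16))  # (0, 65535)
print(address_range(32))  # (0, 4294967295)
```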
Shared Memory Multiprocessor System Natural way to extend single processor model - have multiple processors connected to multiple memory modules, such that each processor can access any memory module : • Multiple processors can operate independently but share the same memory resources. • Changes in a memory location effected by one processor are visible to all other processors. • Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA.
UMA and NUMA. Uniform Memory Access (UMA): • Most commonly represented today by Symmetric Multiprocessor (SMP) machines • Equal access and access times to memory • Sometimes called CC-UMA - Cache Coherent UMA. Cache coherent means if one processor updates a location in shared memory, all the other processors know about the update. Non-Uniform Memory Access (NUMA): • Often made by physically linking two or more SMPs • One SMP can directly access memory of another SMP • Not all processors have equal access time to all memories • Memory access across link is slower • If cache coherency is maintained, then may also be called CC-NUMA - Cache Coherent NUMA
Shared Memory Computers Advantages: • Global address space provides a user-friendly programming interface to memory • Data sharing between tasks is both fast and uniform Disadvantages: • Primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs increases traffic on the shared memory-CPU path • The programmer is responsible for synchronization constructs that ensure "correct" access to global memory and consistent results. • Expense: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever increasing numbers of processors.
Distributed Memory Computer • Because each processor has its own local memory, it operates independently. Changes it makes to its local memory have no effect on the memory of other processors. Hence, the concept of cache coherency does not apply. • When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. Synchronization between tasks is likewise the programmer's responsibility.
Distributed Memory Computer Advantages: • Memory is scalable with number of processors. Increase the number of processors and the size of memory increases proportionately. • Each processor can rapidly access its own memory without interference and without the overhead incurred with trying to maintain cache coherency. • Cost effectiveness: can use commodity, off-the-shelf processors and networking such as Ethernet. Disadvantages: • The programmer is responsible for many of the details associated with data communication between processors. • Non-uniform memory access (NUMA) times
Hybrid Computer • The largest and fastest computers in the world today employ both shared and distributed memory architectures. • The shared memory component is usually a cache coherent SMP machine. Processors on a given SMP can address that machine's memory as global. • The distributed memory component is the networking of multiple SMPs. SMPs know only about their own memory - not the memory on another SMP. Therefore, network communications are required to move data from one SMP to another.
Real computer systems have cache memory between the main memory and the processors: Level 1 (L1) cache and Level 2 (L2) cache. [Figure: Example quad shared memory multiprocessor. Each of the four processors has its own L1 cache, L2 cache, and bus interface on a shared processor/memory bus. A memory controller connects the bus to the shared memory, and an I/O interface connects it to the I/O bus.]
Programming Shared Memory Computers Several possible ways: • Use threads: the programmer decomposes the program into individual parallel sequences (threads), each able to access the shared and global variables declared. • Each thread has local data, but also shares the entire resources of a.out. This saves the overhead associated with replicating a program's resources for each thread. Any thread can execute any subroutine at the same time as other threads. • Threads communicate with each other through global memory (updating address locations). This requires synchronization constructs to ensure that no more than one thread is updating the same global address at any time. Example: Pthreads
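The thread model above can be sketched in a few lines. This is only an illustration and it uses Python's standard `threading` module rather than Pthreads; the function name `parallel_count` and its parameters are invented for the example. Each thread updates one shared counter, and a lock plays the role of the synchronization construct that keeps two threads from updating the same location at once.

```python
import threading


def parallel_count(num_threads=4, increments=10_000):
    """Have several threads increment one shared counter under a lock."""
    counter = 0               # shared data, visible to every thread
    lock = threading.Lock()   # synchronization construct guarding counter

    def worker():
        nonlocal counter
        for _ in range(increments):
            # Only one thread at a time may perform the read-modify-write.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


print(parallel_count())  # 40000
```

Without the lock the read-modify-write on `counter` is not guaranteed to be atomic, so updates from different threads could be lost.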
Use library functions and preprocessor compiler directives with a sequential programming language to declare shared variables and specify parallelism. • Portable / multi-platform, including Unix and Windows NT platforms • Available in C/C++ and Fortran implementations • Can be very easy and simple to use • Example: OpenMP, an industry standard. Consists of library functions, compiler directives, and environment variables; needs an OpenMP compiler
Programming Distributed Memory Computers • Message passing model • Tasks exchange data through communications by sending and receiving messages. • Data transfer usually requires cooperative operations to be performed by each process. For example, a send operation must have a matching receive operation. • In 1992, the MPI Forum was formed with the primary goal of establishing a standard interface for message passing implementations.
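The matching send/receive pattern can be imitated on a single machine. The sketch below is not MPI: it simulates two tasks with Python threads and gives each one an inbox queue, and the helper names `send`, `recv`, and `run_message_passing` are assumptions made for this illustration. Each task touches only its own local data and obtains anything else through a message.

```python
import queue
import threading


def run_message_passing():
    """Simulate two tasks that cooperate only by sending messages."""
    inbox = [queue.Queue(), queue.Queue()]  # one inbox per task (rank 0, 1)
    results = {}

    def send(dest, message):
        inbox[dest].put(message)

    def recv(rank):
        # Blocks until a matching send has been performed.
        return inbox[rank].get(timeout=5)

    def task0():
        local_data = [1, 2, 3, 4]        # lives only in task 0
        send(1, list(local_data))        # send a copy to task 1
        results["total"] = recv(0)       # wait for the reply

    def task1():
        received = recv(1)               # matches the send in task 0
        send(0, sum(received))           # reply with the computed sum

    tasks = [threading.Thread(target=task0), threading.Thread(target=task1)]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()
    return results["total"]


print(run_message_passing())  # 10
```

If either receive had no matching send, the task would block (here until the 5-second timeout), which is why the operations must be paired.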
Interconnection Networks • Provide mechanisms for data transfer between processors or between processors and memory • A typical network is built from links (physical media such as wires and fibers) and switches (which provide a mapping from inputs to outputs). • Static network: point-to-point links • Dynamic network: switches and links. Communications are established dynamically among processors and memory.
Interconnection Networks • 2- and 3-dimensional meshes • Hypercube (not now common) • Using Switches: • Crossbar • Trees • Multistage interconnection networks
Bus-Based Networks Ideal for broadcasting. Distance between any two nodes is constant. However, the bounded bandwidth of a bus places limitations on performance as the number of nodes increases. Cache is used to improve access time. Scalable in cost but not in performance
Crossbar Networks A p x b grid of switches is employed, with b >= p; the network is non-blocking. The lower bound on the total number of switches is Ω(p^2). Not scalable in terms of cost. Scalable in terms of performance
Multistage Networks An intermediate class of networks that lies between the two extremes above. An Omega network consists of log p stages (logarithm base 2), where p is the number of inputs (nodes) and also the number of outputs (memory).
A link exists between input i and output j if: j = 2i for 0 <= i <= p/2 - 1, or j = 2i + 1 - p for p/2 <= i <= p - 1. This is a left rotation by one bit of the binary representation of i.
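The two forms of this connection pattern can be compared directly. The sketch below assumes p is a power of two; the function names `shuffle` and `rotate_left` are made up for the illustration.

```python
def shuffle(i, p):
    """Output j linked to input i, using the piecewise formula."""
    if p < 2 or p & (p - 1):
        raise ValueError("p must be a power of two, at least 2")
    if not 0 <= i < p:
        raise ValueError("input index out of range")
    return 2 * i if i < p // 2 else 2 * i + 1 - p


def rotate_left(i, p):
    """Rotate the log2(p)-bit binary representation of i left by one bit."""
    bits = p.bit_length() - 1                 # log2(p) for a power of two
    return ((i << 1) | (i >> (bits - 1))) & (p - 1)


p = 8
for i in range(p):
    j = shuffle(i, p)
    assert j == rotate_left(i, p)             # both definitions agree
    print(f"{i:03b} -> {j:03b}")
```

For p = 8 the inputs 0..7 map to outputs 0, 2, 4, 6, 1, 3, 5, 7.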
At each stage, the p inputs are fed into a set of p/2 switches. Each switch is in one of two connection modes. 1) Pass-through: inputs are sent straight through to the outputs. 2) Cross-over: inputs are crossed over and then sent out.
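To see why a multistage network sits between the bus and the crossbar in cost, the switch counts can be tabulated. This sketch assumes a square crossbar (b = p) and an Omega network with log2(p) stages of p/2 two-input switches each; the function names are invented for the illustration.

```python
def crossbar_switches(p, b=None):
    """Switch count of a p x b crossbar (b defaults to p)."""
    if b is None:
        b = p
    return p * b


def omega_switches(p):
    """Switch count of an Omega network: log2(p) stages of p/2 switches."""
    if p < 2 or p & (p - 1):
        raise ValueError("p must be a power of two, at least 2")
    stages = p.bit_length() - 1
    return (p // 2) * stages


for p in (8, 64, 1024):
    print(p, crossbar_switches(p), omega_switches(p))
# 8 64 12
# 64 4096 192
# 1024 1048576 5120
```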
Link AB may be needed by more than one processor-to-memory route at the same time. When the link is already in use, the other communication is blocked.
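Blocking can be reproduced with a small routing model. The sketch below assumes the usual destination-tag scheme for an Omega network: before each stage the position is perfect-shuffled, and the switch then sets the lowest bit to the next bit of the destination (most significant bit first), using pass-through if the bit already matches and cross-over otherwise. The function names and the example pair of routes are chosen for illustration and are not taken from the slide's figure.

```python
def omega_route(src, dst, p):
    """Return [(stage, output_position, switch_mode), ...] for src -> dst."""
    if p < 2 or p & (p - 1):
        raise ValueError("p must be a power of two, at least 2")
    stages = p.bit_length() - 1
    pos = src
    hops = []
    for k in range(stages):
        # Perfect shuffle: rotate the position left by one bit.
        pos = ((pos << 1) | (pos >> (stages - 1))) & (p - 1)
        want = (dst >> (stages - 1 - k)) & 1      # k-th destination bit
        mode = "pass-through" if (pos & 1) == want else "cross-over"
        pos = (pos & ~1) | want                   # leave on the wanted port
        hops.append((k, pos, mode))
    return hops


def shared_links(route_a, route_b):
    """Links (stage, position) that both routes need at the same time."""
    links_a = {(k, pos) for k, pos, _ in route_a}
    links_b = {(k, pos) for k, pos, _ in route_b}
    return sorted(links_a & links_b)


a = omega_route(0b010, 0b111, 8)
b = omega_route(0b110, 0b100, 8)
print(a)
print(b)
print(shared_links(a, b))  # [(0, 5)]
```

Both routes leave the first stage on the same link (position 101), so only one of the two transfers can proceed; the other is blocked.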
A completely-connected network is good in the sense that any two nodes can exchange messages in a single step. It is similar to a crossbar network because of its non-blocking property. A star-connected network is similar to a bus-based network: communication between any pair of nodes is routed through the central processor. The central node is the bottleneck, just like the bus.
A d-dimensional hypercube has 2^d nodes in total. In general, a d-dimensional hypercube is constructed by connecting corresponding nodes of two (d-1)-dimensional hypercubes.
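The recursive construction can be written out directly. This sketch labels nodes with the integers 0 .. 2^d - 1, an assumption made for the example, and the name `hypercube_edges` is invented. Two copies of the (d-1)-dimensional cube are built and node i of the first copy is joined to node i of the second.

```python
def hypercube_edges(d):
    """Edges of a d-dimensional hypercube on nodes 0 .. 2**d - 1."""
    if d == 0:
        return set()                       # a single node, no links
    lower = hypercube_edges(d - 1)         # first (d-1)-dimensional cube
    offset = 1 << (d - 1)
    # Second copy: the same cube with the new top bit set on every node.
    upper = {(u + offset, v + offset) for u, v in lower}
    # Connect corresponding nodes of the two copies.
    bridge = {(i, i + offset) for i in range(offset)}
    return lower | upper | bridge


edges = hypercube_edges(3)
print(len(edges))  # 12
# Every link joins two nodes whose labels differ in exactly one bit.
print(all(bin(u ^ v).count("1") == 1 for u, v in edges))  # True
```

With this labelling each of the 2^d nodes has d neighbours, giving d * 2^(d-1) links in total.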
Tree-based network • A static tree network has a processing node at each node. • A dynamic tree network has switching nodes at intermediate levels and processing nodes at the leaf level. • To route a message, the source node sends the message up the tree until it reaches the node that is the root of the smallest subtree containing both sender and receiver. The message is then routed down the tree to the receiver.
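The up-then-down routing rule can be sketched for a dynamic tree. The example assumes a complete binary tree with p leaves (p a power of two) numbered heap-style: the root is node 1, node n has parent n // 2, and processor k sits at leaf p + k. That numbering and the name `tree_route` are choices made for this illustration.

```python
def tree_route(src, dst, p):
    """Nodes visited when processor src sends to processor dst."""
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of two")
    a, b = src + p, dst + p          # leaf nodes of sender and receiver
    up, down = [a], [b]
    # Climb from both leaves until the paths meet at the common subtree root.
    while a != b:
        a //= 2
        b //= 2
        up.append(a)
        down.append(b)
    # Go up from the sender, then down (reversed climb) to the receiver.
    return up + down[-2::-1]


print(tree_route(0, 3, 8))  # [8, 4, 2, 5, 11]
print(tree_route(0, 7, 8))  # [8, 4, 2, 1, 3, 7, 15]
```

The second route passes through the root (node 1), which is why links near the root of a plain tree tend to become the bottleneck.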
Cache Coherence • In the case of shared-address-space computers, additional hardware is required to keep multiple copies of data consistent with each other. • In particular, how do we ensure that multiple processors all use the same up-to-date values? • If a processor changes the value of its copy, one of two things must happen: • The other copies must be invalidated • The other copies must be updated
Solid lines represent processor actions and dashed lines represent coherence actions. • A read on invalid data causes a transition to shared, after accessing the remote value. • A write on shared data causes a transition to dirty and issues a c_write that labels the other copies invalid.
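The two transitions listed above can be traced with a toy model of one shared variable. This is only a sketch of an invalidate-style protocol with the three states invalid, shared, and dirty. The transitions not spelled out above (a dirty copy being written back and downgraded to shared when another processor reads, and a write from the invalid state going straight to dirty) are assumptions based on the usual three-state scheme, and the class and method names are invented for the example.

```python
INVALID, SHARED, DIRTY = "invalid", "shared", "dirty"


class CoherentVariable:
    """One memory word cached by several processors (toy model)."""

    def __init__(self, num_procs, value=0):
        self.memory = value
        self.state = [INVALID] * num_procs
        self.cache = [None] * num_procs

    def read(self, p):
        if self.state[p] == INVALID:
            # Coherence action (assumed): a dirty copy elsewhere is written
            # back to memory and downgraded to shared.
            for q, s in enumerate(self.state):
                if q != p and s == DIRTY:
                    self.memory = self.cache[q]
                    self.state[q] = SHARED
            self.cache[p] = self.memory   # fetch the up-to-date value
            self.state[p] = SHARED        # invalid -> shared
        return self.cache[p]

    def write(self, p, value):
        # c_write: every other valid copy is labelled invalid.
        for q in range(len(self.state)):
            if q != p and self.state[q] != INVALID:
                self.state[q] = INVALID
                self.cache[q] = None
        self.cache[p] = value
        self.state[p] = DIRTY             # shared (or invalid) -> dirty


x = CoherentVariable(num_procs=2, value=1)
x.read(0)
x.read(1)
print(x.state)      # ['shared', 'shared']
x.write(0, 5)
print(x.state)      # ['dirty', 'invalid']
print(x.read(1))    # 5
print(x.state)      # ['shared', 'shared']
```

The final read shows the point of the protocol: processor 1 cannot keep using its stale copy of the value 1, because the write by processor 0 invalidated it.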