Parallel and Distributed Computing

References
- Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar. Introduction to Parallel Computing, Second Edition. Addison Wesley, January 16, 2003. ISBN: 0-201-64865-2
- Tutorials about cluster, grid, and cloud computing
- Internet sites

Chapter 1: Introduction to Parallel Computing
The main objective of this chapter is to survey innovation in parallel computing with respect to architectures.

von Neumann Architecture
- Common machine model for over 40 years
- Stored-program concept
- The CPU executes a stored program
- A program is a sequence of read and write operations on the memory (RAM)
- The order of operations is sequential
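The stored-program idea can be illustrated with a toy machine. This is only a sketch: the four-instruction set (LOAD, ADD, STORE, HALT) and the memory layout are invented for the example and do not correspond to any real processor.

```python
# Toy stored-program machine: instructions and data share one memory (RAM).
# The instruction set below is hypothetical, chosen only for illustration.

def run(memory):
    """Execute the program stored in `memory`, one instruction at a time."""
    pc = 0    # program counter: address of the next instruction
    acc = 0   # accumulator register
    while True:
        op, arg = memory[pc]    # fetch from the same memory that holds data
        pc += 1                 # operations proceed in sequential order
        if op == "LOAD":
            acc = memory[arg]   # read operation on memory
        elif op == "ADD":
            acc += memory[arg]  # read operation on memory
        elif op == "STORE":
            memory[arg] = acc   # write operation on memory
        elif op == "HALT":
            return memory

# Addresses 0-3 hold the program, addresses 4-6 hold the data.
ram = [("LOAD", 4), ("ADD", 5), ("STORE", 6), ("HALT", None), 2, 3, 0]
run(ram)
print(ram[6])  # prints 5
```

Because there is a single program counter and a single memory, only one instruction touches one data item at a time.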

A More Detailed Architecture based on von Neumann Model

Old von Neumann Computer

CISC von Neumann Computer
- CISC (Complex Instruction Set Computer) with a single bus system
- The Harvard (RISC) architecture uses two buses: a separate instruction bus and data bus
- RISC (Reduced Instruction Set Computer)
- Both are SISD machines: Single Instruction stream, Single Data stream

John von Neumann
- December 28, 1903 – February 8, 1957
- Hungarian mathematician
- Got his Ph.D. at 23

Motivations for Parallel Computing
- Fundamental limits on single-processor speed
- Disparity between CPU and memory speed (the performance mismatch problem)
- Distributed data communications
- Need for very large-scale computing platforms

Fundamental Limits – Cycle Speed
- Cray 1: 12 ns (1975)
- Cray 2: 6 ns (1986)
- Cray T-90: 2 ns (1997)
- Intel PC: 1 ns (2000)
- Today’s PC: 0.3 ns (2006, P4)
- Speed of light: 30 cm in 1 ns
- A signal travels about 10 times slower
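To see why cycle time is a physical limit, the figures above can be turned into a distance: how far a signal can travel in one clock cycle. The sketch below assumes, as the slide does, that light covers 30 cm per nanosecond and that a signal is about 10 times slower; the function name is hypothetical.

```python
SPEED_OF_LIGHT_CM_PER_NS = 30.0  # from the slide: 30 cm in 1 ns
SLOWDOWN = 10.0                  # slide's estimate: signals are ~10x slower

def signal_reach_cm(cycle_ns):
    """Distance a signal can cover within one clock cycle, in centimeters."""
    return SPEED_OF_LIGHT_CM_PER_NS / SLOWDOWN * cycle_ns

# Cycle times as listed on the slide.
cycle_times_ns = {
    "Cray 1": 12.0,
    "Cray 2": 6.0,
    "Cray T-90": 2.0,
    "Intel PC (2000)": 1.0,
    "P4 PC (2006)": 0.3,
}

for machine, cycle in cycle_times_ns.items():
    print(f"{machine}: {signal_reach_cm(cycle):.1f} cm per cycle")
```

Under these assumptions a 0.3 ns cycle leaves a signal less than 1 cm of travel, which is on the order of the size of a chip.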

Moore’s Law
- Moore’s observation in 1965: the number of transistors per square inch on integrated circuits had doubled every year
- Moore’s revised observation in 1975: the pace had slowed down a bit, but data density had doubled approximately every 18 months
- How about the future? (Does the price of computing power fall by half every 18 months?)
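The difference between the two doubling periods is easy to underestimate. A small sketch of the compound growth, assuming ideal exponential doubling (the function name is hypothetical):

```python
def growth_factor(years, doubling_period_years):
    """Multiplicative growth after `years` with a fixed doubling period."""
    return 2 ** (years / doubling_period_years)

# Doubling every year (1965 observation) over one decade.
print(growth_factor(10, 1.0))   # prints 1024.0

# Doubling every 18 months (1975 revision) over one decade.
print(round(growth_factor(10, 1.5)))
```

Over ten years, yearly doubling gives roughly a thousandfold increase, while an 18-month period gives roughly a hundredfold increase.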

Moore’s Law – Held for Now

CPU and Memory Speeds
- In 20 years, CPU speed (clock rate) has increased by a factor of 1000
- DRAM speed has increased by a factor of less than 4
- How can data be fed fast enough to keep the CPU busy?
- CPU speed: 1-2 ns
- DRAM speed: 50-60 ns
- Cache: 10 ns

Memory Access and CPU Speed

CPU, Memory, and Disk Speed

Possible Solutions
- A hierarchy of successively faster memory devices (multilevel caches)
- Locality of data reference (data locality)
- Efficient programming can be an issue
- Parallel systems may provide
  - larger aggregate cache
  - higher aggregate bandwidth to the memory system
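The benefit of a cache depends on how often a reference is found there, which is why data locality matters. The sketch below uses a deliberately simplified single-level model (an assumption, not a formula from the slides) with the 10 ns cache and 60 ns DRAM figures quoted earlier.

```python
CACHE_NS = 10.0  # cache access time from the slides
DRAM_NS = 60.0   # upper DRAM access time from the slides

def average_access_ns(hit_rate, cache_ns, dram_ns):
    """Average access time for one cache level in front of DRAM.

    Simplified model: a hit costs cache_ns, a miss costs dram_ns.
    """
    return hit_rate * cache_ns + (1.0 - hit_rate) * dram_ns

for hit_rate in (0.50, 0.90, 0.99):
    t = average_access_ns(hit_rate, CACHE_NS, DRAM_NS)
    print(f"hit rate {hit_rate:.0%}: {t:.1f} ns on average")
```

With poor locality (50% hits) the average stays close to DRAM speed; with good locality (99% hits) it approaches cache speed.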

Distributed Data Communications
- Data may be collected and stored at different locations
- It is expensive to bring the data to a central location for processing
- Many computing assignments may be inherently parallel
- Privacy issues arise in data mining and other large-scale commercial database manipulations

Distributed Data Communications

Why Use Parallel Computing
- Save time (wall-clock time): many processors work together
- Solve larger problems: larger than one processor’s CPU and memory can handle
- Provide concurrency: do multiple things at the same time, such as online access to databases or a search engine
- Google’s 4,000 PC servers form one of the largest clusters in the world

Other Reasons for Parallel Computing
- Taking advantage of non-local resources: using computing resources on a wide area network, or even the Internet (grid and cloud computing), through remote access
- Cost savings: using multiple “cheap” computing resources instead of a high-end CPU
- Overcoming memory constraints: for large problems, using the memories of multiple computers may overcome the memory constraint obstacle

Need for Large Scale Modeling
- Weather forecasting
- Ocean modeling
- Oil reservoir simulations
- Car and airplane manufacture
- Semiconductor simulation
- Pollution tracking
- Large commercial databases
- Aerospace (NASA microgravity modeling)

Issues in Parallel Computing (main issues of the course)
- Design of parallel computers
- Design of efficient parallel algorithms
- Methods for evaluating parallel algorithms
- Parallel computer languages
- Parallel programming tools
- Portable parallel programs
- Automatic programming of parallel computers
- Education in the philosophy of parallel computing

PARALLEL ARCHITECTURES
A parallel computer can be characterized as a system where multiple processing elements cooperate in executing one or more tasks. The numerous existing parallel architectures and their different approaches require some kind of classification.
Flynn’s Taxonomy
- The design of a computer is characterized by its flow (or stream) of instructions and its flow (or stream) of data.
- Flynn’s taxonomy classifies architectures according to the multiplicity of the instruction and data flows.

PARALLEL ARCHITECTURES
SISD (Single Instruction Single Data) architecture
- Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
- Single data: only one data stream is being used as input during any one clock cycle
- Deterministic execution
- Corresponds to the conventional sequential computer
- The oldest and, even today, the most common type of computer
- Examples: older-generation mainframes, minicomputers, and workstations; most modern-day PCs

PARALLEL ARCHITECTURES
MISD (Multiple Instruction Single Data)
- A single data stream is fed into multiple processing units.
- Each processing unit operates on the data independently via independent instruction streams.
- Few actual examples of this class of parallel computer have ever existed; one is the experimental Carnegie-Mellon C.mmp computer (1971).
- Some uses might be:
  - multiple frequency filters operating on a single signal stream
  - multiple cryptography algorithms attempting to crack a single coded message
- Although MISD does not seem to be meaningful, pipeline architectures, as found in all modern processors, can be considered MISD.
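The "multiple filters on one signal" use case can be sketched in a few lines. The three filter functions and the sample values are invented for illustration, and the units run one after another here, whereas a real MISD machine would run them at the same time.

```python
# One data stream, several independent "instruction streams" (functions).
signal = [1, 4, 2, 8, 5, 7]

def low_pass(samples):
    """Two-point moving average: smooths the signal."""
    return [(a + b) / 2 for a, b in zip(samples, samples[1:])]

def high_pass(samples):
    """First difference: emphasizes rapid changes."""
    return [b - a for a, b in zip(samples, samples[1:])]

def peak(samples):
    """Largest sample in the stream."""
    return max(samples)

# Each processing unit receives the same input and works independently.
units = [low_pass, high_pass, peak]
results = [unit(signal) for unit in units]

for unit, result in zip(units, results):
    print(unit.__name__, result)
```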

HARDWARE PLATFORMS
Three principal classes of parallel computers are used today:
- SIMD machines
- MIMD machines: shared-memory multiprocessors and distributed-memory multicomputers
- Networked workstations
[Figure: Taxonomy of Parallel and Distributed Computers. Node labels: Hardware Platforms; Parallel Computers; Distributed Computers; SIMD Machines; Shared Memory (Multiprocessors); Distributed Memory (Multi-computers); Networked Workstations]

SIMD Machines
- Main characteristic: all processors must execute the same instruction (on a different data element) at any instant in the program's execution.
- These machines execute in "lock-step", synchronous to a global clock.
- All processors must complete execution of the current instruction before any is allowed to proceed to the next instruction.

SIMD Machines
- SIMD machines typically contain more, and simpler, processors (Processing Elements, PEs).
- Best suited for specialized problems characterized by a high degree of regularity, such as graphics and image processing.
- Two varieties: processor arrays and vector pipelines
  - Processor arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV
  - Vector pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10
- Most modern computers, particularly those with graphics processing units (GPUs), employ SIMD instructions and execution units.
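Lock-step execution can be modeled as an outer loop over instructions and an inner pass over all data elements. This is a sequential simulation under that assumption, not real SIMD hardware; `simd_execute` and the two-instruction program are hypothetical names for the example.

```python
def simd_execute(program, lanes):
    """Apply each instruction to every lane before moving to the next one."""
    for instruction in program:                  # single instruction stream
        # Every processing element runs the same instruction on its own
        # data element; the next instruction starts only after all finish.
        lanes = [instruction(x) for x in lanes]
    return lanes

# A two-instruction program: double each element, then add one.
program = [lambda x: x * 2, lambda x: x + 1]

print(simd_execute(program, [1, 2, 3, 4]))  # prints [3, 5, 7, 9]
```

The regular structure (same operation, different elements) is what makes image and graphics workloads a good fit.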

Processor Arrays- ILLIAC IV

VECTOR PROCESSOR vs. ARRAY PROCESSOR
- A vector processor is a computer with built-in instructions that perform multiple calculations on a vector of data (a one-dimensional array).
- A vector processor is used to solve the same or similar problems as an array processor.
- A vector processor passes a vector of data to its functional units.
- An array processor passes each element of a vector to a different arithmetic unit.

MIMD Machines
- The most common type of parallel computer; most modern computers fall into this category.
- Every Processing Element (PE) has its own Control Unit (CU).
- Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs.

MIMD Machines
- PEs operate independently of each other and execute independent instructions on different data streams.
- Execution can be deterministic or non-deterministic.
- Non-deterministic: there are multiple ways of processing the same input, without any specification of which one will be taken to arrive at the outcome.

MIMD Machines
- Parallel execution of a global task (i.e., the collaboration of PEs) is achieved through synchronization and data exchange, synchronous or asynchronous, between the PEs via the interconnection network.
- An MIMD architecture can simulate an SIMD architecture by executing the same program on all the processors, which is called SPMD (Single Program Multiple Data) mode.
- Question: is SPMD execution deterministic or non-deterministic?
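SPMD mode can be sketched with threads standing in for processing elements. This is an assumption made for illustration: real SPMD programs typically run as separate processes on separate processors. Every worker runs the same function, and only its rank decides which part of the data it handles.

```python
import threading

def spmd_program(rank, size, data, partial):
    """The same program runs on every PE; the rank selects the data slice."""
    chunk = data[rank::size]      # cyclic distribution of the data
    partial[rank] = sum(chunk)    # each PE writes only its own slot

data = list(range(1, 101))
size = 4                          # number of simulated PEs
partial = [0] * size

workers = [
    threading.Thread(target=spmd_program, args=(rank, size, data, partial))
    for rank in range(size)
]
for w in workers:
    w.start()
for w in workers:
    w.join()                      # synchronization point before combining

total = sum(partial)
print(total)  # prints 5050
```

The result here is deterministic because each worker writes to a distinct slot and the partial sums are combined only after all workers have finished; the order in which the workers actually run is not specified.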

Memory Architectures
- For both the design and the programming model of a parallel system, the memory organization is a very important issue that Flynn’s taxonomy does not consider.
- The memory organization of a parallel system can be divided into two aspects: the location and the access policy of the memory.
- Regarding location, memory is either centralized or distributed with the processors.
- Systems with a common memory, distributed or not, are called shared-memory machines: all processors have full access to the memory.
- Systems where there is no such shared memory are called distributed-memory machines: processors have to use explicit means of communication, such as message passing.

Memory Architectures
Combining these two aspects of memory organization gives the common memory organizations:
- Centralized memory, shared (a common memory): Symmetric Multiprocessor (SMP)
- Distributed memory, shared (distributed shared memory): Distributed Shared Memory multiprocessor (DSM)
- Distributed memory, message passing: multicomputer

Shared Memory Multi-Processor Systems
- Two different types of shared-memory multiprocessor systems:
  - Symmetric Multi-Processors (SMPs)
  - Distributed Shared Memory (DSM)
- In both cases, every processor can read and write to any portion of the system’s memory (shared variables).
- Cache coherence protocols are needed in both cases.
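The shared-variable programming style can be sketched with threads that update one common counter. Threads stand in for processors here, and the lock is an assumption of the example: it is the usual way to keep concurrent read-modify-write updates from being lost.

```python
import threading

shared = {"counter": 0}      # a variable visible to every thread
lock = threading.Lock()      # protects the read-modify-write update

def add_many(n):
    """Increment the shared counter n times."""
    for _ in range(n):
        with lock:           # only one thread updates at a time
            shared["counter"] += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared["counter"])  # prints 40000
```

No data is sent between the threads explicitly; they cooperate purely by reading and writing the same memory location.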

Shared Memory Multi-Processor Systems
- SMP system: a single pool of memory that is equally fast for all processors, referred to as Uniform Memory Access (UMA).
- DSM system: multiple pools of memory, where the latency to access memory depends on the relative position of the processor and the memory, referred to as cache-coherent Non-Uniform Memory Access (NUMA).
- In a DSM system, each processor has local memory (the lowest latency), while everything else is classified as remote memory and is slower to access.

Symmetric Multi-Processors (SMPs)
- A shared bus provides a single point of arbitration.
- Cache coherency is relatively straightforward: when a processor needs to broadcast a cache coherency message, it simply sends that information over the bus, and all other processors can receive it.
- This is the model that Intel has pursued since the advent of the Pentium Pro, and continues today with the entire Xeon line.
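Broadcasting a coherency message over a shared bus can be sketched as a write-invalidate scheme. This is a deliberately minimal model with write-through caches and hypothetical class names; it is not the protocol of any particular Intel processor.

```python
class Bus:
    """Shared bus: every attached cache sees every broadcast."""
    def __init__(self):
        self.caches = []

    def broadcast_invalidate(self, sender, address):
        for cache in self.caches:
            if cache is not sender:
                cache.lines.pop(address, None)   # drop the stale copy

class Cache:
    """Private cache of one processor (write-through, write-invalidate)."""
    def __init__(self, bus, memory):
        self.bus = bus
        self.memory = memory
        self.lines = {}
        bus.caches.append(self)

    def read(self, address):
        if address not in self.lines:            # miss: fetch from memory
            self.lines[address] = self.memory[address]
        return self.lines[address]

    def write(self, address, value):
        self.bus.broadcast_invalidate(self, address)
        self.lines[address] = value
        self.memory[address] = value             # write through to memory

memory = {0: 1}
bus = Bus()
cpu0, cpu1 = Cache(bus, memory), Cache(bus, memory)

cpu0.read(0)
cpu1.read(0)          # both caches now hold a copy of address 0
cpu0.write(0, 42)     # invalidates the copy held by cpu1
print(cpu1.read(0))   # prints 42
```

Without the broadcast, the second processor would keep returning its stale cached value of 1.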

Distributed Shared Memory (DSM)
- DSM systems scale more effectively because local memory can be accessed rapidly.
- Point-to-point interconnects are a natural fit, because the bandwidth grows proportionally to the number of processors and the amount of memory in the system.
- The biggest downside of DSM systems is that they only work well if the operating system is “NUMA-aware” and can efficiently place memory and processes.
- An intelligent OS would place data and schedule processes such that each processor accesses only local memory and never needs remote data.
- Within DSM, a variety of different topologies can be used: crossbar, fat tree, torus, ring, hypercube, etc.
- Moreover, these topologies can be combined into hybrid topologies: a tree of rings, a crossbar of meshes, etc.
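As one concrete topology, a hypercube connects nodes whose binary labels differ in exactly one bit. The sketch below assumes nodes are numbered from 0 to 2^d - 1; the function names are hypothetical.

```python
def hypercube_neighbors(node, dimension):
    """Nodes directly linked to `node` in a hypercube of given dimension."""
    return [node ^ (1 << bit) for bit in range(dimension)]

def hop_distance(a, b):
    """Minimum number of links between a and b: the differing label bits."""
    return bin(a ^ b).count("1")

# 3-dimensional hypercube: 8 nodes, 3 links per node.
print(hypercube_neighbors(0, 3))  # prints [1, 2, 4]
print(hypercube_neighbors(5, 3))  # prints [4, 7, 1]
print(hop_distance(0, 7))         # prints 3
```

Each node has only d links, yet any two of the 2^d nodes are at most d hops apart, which is why the number of links and the latency both grow slowly with system size.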

Memory Hierarchies of Shared Memory Multi-Processor Systems

Distributed-Memory Multicomputers
- Multicomputers do not support shared variables. Rather, all communication between processors must occur via message passing.
- Message-passing libraries are provided to send and receive messages between processors.
- Examples: the Cray T3D, NCube/Ten, and Intel Paragon.
- Large multicomputers may contain hundreds of processors.
- There is no need for a cache coherence protocol.
- Network of Workstations (NOW), or cluster:
  - Each processor is a stand-alone computer
  - A LAN connects the computers
  - Homogeneous or heterogeneous
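The message-passing style can be sketched with two queues acting as mailboxes. This is only a stand-in: threads and in-process queues replace the separate computers and the network, and no real message-passing library is used.

```python
import queue
import threading

def worker(inbox, outbox):
    """Receive a list of numbers, send back their sum."""
    numbers = inbox.get()        # blocking receive
    outbox.put(sum(numbers))     # send the result to the master

to_worker = queue.Queue()        # mailbox: master -> worker
to_master = queue.Queue()        # mailbox: worker -> master

t = threading.Thread(target=worker, args=(to_worker, to_master))
t.start()

to_worker.put([1, 2, 3, 4])      # send the work as a message
result = to_master.get()         # wait for the reply
t.join()

print(result)  # prints 10
```

The worker never touches the master's variables directly; all data it needs arrives in a message and all results leave in a message.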

Distinction between Shared-Memory Multiprocessors and Multicomputers
- The common address space provided by shared-memory machines allows global data structures to be referenced by more than one processor.
- The distributed address space provided by distributed-memory machines supports local data structures.
- Many memory management techniques may be used in shared-memory machines.

“Parallel” Computing
- Traditional supercomputers: SIMD, MIMD, pipelines
- Tightly coupled shared memory (SMP, DSM)
- Loosely coupled distributed memory (multicomputer)
- Bus-level connections
- Expensive to buy and to maintain
  - Very high starting cost
  - Expensive hardware
  - Expensive software
  - High maintenance
  - Expensive to upgrade
- Cooperating networks of computers
