Architecture Classifications • A taxonomy of parallel architectures: in 1972, Flynn categorised HPC architectures into four classes, based on how many instruction and data streams can be observed in the architecture. • They are: • SISD - Single Instruction, Single Data • Operates sequentially on a single stream of instructions and a single stream of data in a single memory. • Classic “Von Neumann” architecture. • Machines may still consist of multiple processors operating on independent data; these can be considered as multiple SISD systems. • SIMD - Single Instruction, Multiple Data • A single instruction stream (broadcast to all PEs*), acting on multiple data. • The most common form of this architecture class is the vector processor. • On vectorizable code, these can deliver results several times faster than scalar processors. * PE = Processing Element
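The difference between the SISD and SIMD classes above can be sketched in a few lines. This is only an illustration: the function names `sisd_add` and `simd_apply` are hypothetical, and Python executes the "lanes" one after another, whereas real SIMD hardware would run them in lockstep.

```python
import operator
from typing import Callable, List


def sisd_add(a: List[float], b: List[float]) -> List[float]:
    """SISD view: each step issues one instruction on one pair of data items."""
    out = []
    for i in range(len(a)):
        out.append(a[i] + b[i])
    return out


def simd_apply(instruction: Callable[[float, float], float],
               lanes_a: List[float],
               lanes_b: List[float]) -> List[float]:
    """SIMD view: one instruction is broadcast to every lane (one lane per PE).

    Assumption: lanes are evaluated sequentially here; on vector hardware
    they would all execute the same instruction at the same time.
    """
    if len(lanes_a) != len(lanes_b):
        raise ValueError("operand vectors must have the same length")
    return [instruction(x, y) for x, y in zip(lanes_a, lanes_b)]


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0]
    b = [10.0, 20.0, 30.0, 40.0]
    print(sisd_add(a, b))
    print(simd_apply(operator.add, a, b))
```

Both functions produce the same result; what differs is the number of instructions issued, which is the basis of the classification.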
Architecture Classifications • MISD - Multiple Instruction, Single Data • There is a debate about whether an architecture with uniformly shared memory and separate caches is MISD or MIMD (MIMD is favoured). • No other practical implementations of this architecture. • MIMD - Multiple Instruction, Multiple Data • Independent instruction streams, acting on different (but related) data. • Note the difference between multiple SISD and MIMD: multiple SISD systems operate on independent data, whereas MIMD streams operate on related data.
Architecture Classifications • MIMD: SMP, NUMA, MPP, Cluster • SISD: Machine with a single scalar processor • SIMD: Machine with vector processors
Architecture Classifications • Shared memory (uniform memory access) • Processors share access to a common memory space. • Implemented over a shared memory bus or communication network. • Memory locks are required. • Local cache is critical: • Without it, bus contention (or network traffic) reduces the system's efficiency. • For this reason, pure shared memory systems do not scale (scalability measures how close the performance gain stays to linear as processing elements are added). • Naturally, cache introduces problems of coherency (ensuring that stale cache lines are invalidated when other processors alter shared memory). • Support for critical sections is required. [Figure: PE 0 ... PE n connected through an interconnect to a single shared memory]
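The need for memory locks and critical sections can be illustrated with a small sketch. It assumes Python threads as stand-ins for PEs sharing one memory space; the function name `count_with_lock` is hypothetical.

```python
import threading


def count_with_lock(num_threads: int, increments: int) -> int:
    """Have several threads increment one shared counter inside a critical section."""
    counter = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal counter
        for _ in range(increments):
            # Critical section: only one thread at a time may read,
            # modify and write back the shared value.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print(count_with_lock(num_threads=4, increments=1000))
```

Without the lock, the read-modify-write on `counter` is not guaranteed to be atomic, so concurrent updates could be lost.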
Architecture Classifications • Shared memory (non-uniform memory access, NUMA) • A PE may be fetching from local or remote memory - hence non-uniform access times. • cc-NUMA (cache-coherent Non-Uniform Memory Access) • Processors are grouped into SMP nodes, each connected internally by a fast interconnect. • These nodes are then connected together by a high-speed interconnect. • Global address space. [Figure: an interconnect joining shared memories 1 ... m; shared memory 1 serves PE 1 ... PE n, and shared memory m serves PE (m-1)n+1 ... PE m.n]
Architecture Classifications • Distributed Memory • Each processor has its own local memory. • When processors need to exchange (or share) data, they must do this through explicit communication. • Message passing (e.g. using MPI, the Message Passing Interface) • Typically larger latencies between PEs (especially if they communicate over a network interconnect). • Scalability, however, is good if the problems can be sufficiently contained within PEs. • Typically, coarse-grained work units are distributed. [Figure: PE 0 ... PE n, each with its own local memory M 0 ... M n, connected by an interconnect]
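The explicit send/receive style of distributed memory can be sketched as follows. This is not MPI: it is a minimal illustration that assumes Python threads and `queue.Queue` objects as stand-ins for PEs and their communication links, and the names `pe_worker` and `distributed_sum` are hypothetical. Each worker only touches the chunk it receives, mimicking private local memory.

```python
import queue
import threading
from typing import List


def pe_worker(rank: int, inbox: queue.Queue, outbox: queue.Queue) -> None:
    """One PE: receive a coarse-grained work unit, compute locally, send the result."""
    local_chunk = inbox.get()                # explicit (blocking) receive
    outbox.put((rank, sum(local_chunk)))     # explicit send of the partial result


def distributed_sum(data: List[int], num_pes: int) -> int:
    """Sum `data` by distributing chunks to PEs and gathering partial sums."""
    inboxes = [queue.Queue() for _ in range(num_pes)]
    results: queue.Queue = queue.Queue()
    pes = [threading.Thread(target=pe_worker, args=(r, inboxes[r], results))
           for r in range(num_pes)]
    for pe in pes:
        pe.start()
    for r in range(num_pes):
        # Slicing copies the data, so each PE gets its own private chunk.
        inboxes[r].put(data[r::num_pes])
    partials = [results.get() for _ in range(num_pes)]
    for pe in pes:
        pe.join()
    return sum(value for _, value in partials)


if __name__ == "__main__":
    print(distributed_sum(list(range(100)), num_pes=4))
```

Only two messages per PE are exchanged regardless of chunk size, which is why coarse-grained work units suit this model.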
In-processor Parallelism • Pipelines • Instruction pipelines • Reduce the idle time of hardware components. • Good performance with independent instructions. • Perform more operations per clock cycle. • The discrepancy between peak and actual performance is often caused by pipeline effects. • It is difficult to keep pipelines full. • Branch prediction helps.
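The gap between peak and actual pipeline performance can be made concrete with the usual idealised cycle count. The sketch below assumes a textbook model that is not taken from the slides: every stage takes one cycle, an unpipelined processor needs `stages` cycles per instruction, and all hazards are lumped into a single hypothetical `stall_cycles` parameter.

```python
def unpipelined_cycles(instructions: int, stages: int) -> int:
    """Cycles needed when each instruction passes through all stages alone."""
    return instructions * stages


def pipelined_cycles(instructions: int, stages: int, stall_cycles: int = 0) -> int:
    """Cycles for an ideal pipeline: fill it once, then retire one instruction per cycle.

    `stall_cycles` is an assumed total of bubbles caused by dependencies
    or mispredicted branches.
    """
    if instructions == 0:
        return 0
    return stages + (instructions - 1) + stall_cycles


def pipeline_speedup(instructions: int, stages: int, stall_cycles: int = 0) -> float:
    """Speedup of the pipelined over the unpipelined execution."""
    return (unpipelined_cycles(instructions, stages)
            / pipelined_cycles(instructions, stages, stall_cycles))


if __name__ == "__main__":
    # Full pipeline versus the same instruction stream with many stalls.
    print(pipeline_speedup(100, 5))
    print(pipeline_speedup(100, 5, stall_cycles=96))
```

With no stalls the speedup approaches the number of stages as the instruction count grows; stalls pull it well below that peak.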
In-processor Parallelism • Vector architectures • Fast I/O - powerful busses and interconnections. • Large memory bandwidth and low-latency access. • Typically no cache, because the memory system is already fast enough. • Perform operations involving large matrices, commonly encountered in engineering applications.
In-processor Parallelism • Commodity processors increasingly provide performance as good as dedicated vector processors. • Price/performance is also far better. • Commodity processors now offer good performance for vectorizable code. • Explicit support for vectorization with SIMD instructions on COTS processors: • AltiVec on PowerPC • SSE (Streaming SIMD Extensions) on x86
Multiprocessor Parallelism • Use multiple processors on the same program: • Divide the workload up between processors. • Often achieved by dividing up a data structure. • Each processor works on its own data. • Typically processors need to communicate. • Communicating through shared or distributed memory is one approach. • Explicit messaging is increasingly common. • Load balancing is critical for maintaining good performance.
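Dividing a data structure between processors while keeping the load balanced can be sketched with a simple block partition. The function name `block_partition` is hypothetical, and the sketch assumes every item costs roughly the same to process; uneven work per item would need a different scheme.

```python
from typing import List, Tuple


def block_partition(n_items: int, n_procs: int) -> List[Tuple[int, int]]:
    """Split indices 0..n_items-1 into contiguous half-open (start, stop) ranges.

    Range sizes differ by at most one item, so no processor gets
    noticeably more work than another.
    """
    base, extra = divmod(n_items, n_procs)
    ranges = []
    start = 0
    for rank in range(n_procs):
        # The first `extra` processors take one additional item each.
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    for rank, (start, stop) in enumerate(block_partition(10, 4)):
        print(f"processor {rank} works on items [{start}, {stop})")
```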
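Multiprocessor Parallelism • [Figure: three configurations. Single Processor: one CPU with its memory. Symmetric Multiprocessor with Shared Memory: several CPUs sharing one memory. MPP System: several CPUs, each with its own memory, connected by a network.]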
Clusters • Built using mass-produced commodity off-the-shelf (COTS) hardware, rather than expensive proprietary hardware built solely for supercomputers. • Brought about by improved processor speed as well as networking and switching technology.
Clusters • Clusters are simpler to manage: • Single image, single identity • Often run familiar operating systems. • Linux is probably the most popular • Commodity compilers and support • Node for node swap-out on failure. • Can run multi-processor parallel tasks. • Or run sequential tasks for multiple users (job-level parallelism).
Clusters • Clustering of SMPs • An attractive method of achieving high performance. • SMPs reduce the network overhead, since processors within a node communicate through shared memory rather than over the network.
Parallel Efficiency • The main issues that affect parallel efficiency are: • Ratio of computation to communication • A higher ratio of computation usually yields better performance. • Communication bandwidth & latency • Latency has the biggest impact. • Scalability • How do the bandwidth & latency scale with the number of processors?
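The effect of the computation-to-communication ratio can be sketched with a simple cost model. The model is an assumption made for illustration, not something stated in the slides: computation divides perfectly across `p` processors, communication is a fixed added cost, and one message costs latency plus size over bandwidth. All function names are hypothetical.

```python
def message_time(latency_s: float, bandwidth_bytes_per_s: float,
                 message_bytes: float) -> float:
    """Assumed cost of one message: start-up latency plus transfer time."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def parallel_time(t_serial: float, p: int, t_comm: float) -> float:
    """Assumed run time on p processors: perfectly divided work plus communication."""
    return t_serial / p + t_comm


def speedup(t_serial: float, t_parallel: float) -> float:
    """How many times faster the parallel run is than the serial run."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, p: int) -> float:
    """Fraction of the ideal p-fold speedup that is actually achieved."""
    return speedup(t_serial, t_parallel) / p


if __name__ == "__main__":
    t_serial, p = 100.0, 4
    for t_comm in (0.0, 6.25, 25.0):
        t_par = parallel_time(t_serial, p, t_comm)
        print(t_comm, speedup(t_serial, t_par), efficiency(t_serial, t_par, p))
```

For small messages the latency term dominates `message_time`, which is consistent with the point that latency has the biggest impact.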
Dependency and Parallelism • Granularity of parallelism: the size of the computations that are being performed in parallel • Four types of parallelism (in increasing order of granularity) • Instruction-level parallelism (e.g. pipeline) • Thread-level parallelism (e.g. run a multi-threaded Java program) • Process-level parallelism (e.g. run an MPI job in a cluster) • Job-level parallelism (e.g. run a batch of independent single-processor jobs in a cluster)
Dependency and Parallelism • Dependency: if event A must occur before event B, then B is dependent on A • Two types of dependency • Control dependency: waiting for the instruction which controls the execution flow to be completed • IF (X!=0) Then Y=1.0/X: the assignment to Y has a control dependency on X!=0 • Data dependency: dependency because of calculations or memory access • Flow dependency (read after write): A=X+Y; B=A+C; • Anti-dependency (write after read): B=A+C; A=X+Y; • Output dependency (write after write): A=2; X=A+1; A=5;
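The three data dependencies can be detected mechanically from what each statement reads and writes. The sketch below assumes a hypothetical representation in which a statement is a pair of the variable it writes and the set of variables it reads; `classify` is an illustrative name.

```python
from typing import List, Set, Tuple

# Hypothetical statement representation: (variable written, variables read).
Statement = Tuple[str, Set[str]]


def classify(first: Statement, second: Statement) -> List[str]:
    """Return the data dependencies of `second` on the earlier statement `first`."""
    written_1, read_1 = first
    written_2, read_2 = second
    kinds = []
    if written_1 in read_2:
        kinds.append("flow")      # second reads what first wrote
    if written_2 in read_1:
        kinds.append("anti")      # second overwrites what first read
    if written_1 == written_2:
        kinds.append("output")    # both write the same variable
    return kinds


if __name__ == "__main__":
    # The examples from the slide, statement by statement.
    print(classify(("A", {"X", "Y"}), ("B", {"A", "C"})))   # A=X+Y; B=A+C;
    print(classify(("B", {"A", "C"}), ("A", {"X", "Y"})))   # B=A+C; A=X+Y;
    print(classify(("A", set()), ("A", set())))             # A=2; ... A=5;
```

In the slide's output-dependency example, the middle statement X=A+1 additionally has a flow dependency on A=2 and is the source of an anti-dependency for A=5.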
Identifying Dependency • Draw a Directed Acyclic Graph (DAG) to identify the dependency among a sequence of instructions (in each calculation, the operands are the parents and the result is the child) • Anti-dependency: a variable appears as a parent in a calculation and then as a child in a later calculation • Output dependency: a variable appears as a child in a calculation and then as a child again in a later calculation
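Once the dependency DAG is known, it also shows which instructions could run in parallel. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The instruction labels S1 to S4 and the function name `parallel_levels` are hypothetical; the graph here is over whole instructions, and the dependency edges are assumed to have been identified already.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def parallel_levels(predecessors: Dict[str, Set[str]]) -> List[List[str]]:
    """Group instructions into levels; instructions in one level are independent.

    `predecessors` maps each instruction to the instructions it depends on.
    A cyclic graph raises graphlib.CycleError in prepare().
    """
    sorter = TopologicalSorter(predecessors)
    sorter.prepare()
    levels = []
    while sorter.is_active():
        # Everything whose dependencies are all satisfied is ready now.
        ready = sorted(sorter.get_ready())
        levels.append(ready)
        sorter.done(*ready)
    return levels


if __name__ == "__main__":
    # S1: A=X+Y   S2: B=A+C   S3: D=X*2   S4: E=B+D
    deps = {
        "S1": set(),
        "S2": {"S1"},          # flow dependency on A
        "S3": set(),
        "S4": {"S2", "S3"},    # flow dependencies on B and D
    }
    print(parallel_levels(deps))
```

The number of levels is the length of the longest dependency chain, which bounds how far the sequence can be parallelised.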