Large Computer Systems CE 140 A1/A2 27 August 2003
Rationale • Although computers are getting faster, the demands placed on them are increasing at least as fast • High-performance applications: simulations and modeling • Circuit speed cannot be increased indefinitely; eventually, physical limits will be reached and quantum mechanical effects will become a problem
Rationale • To handle larger problems, parallel computers are used • Machine level parallelism • Replicates entire CPUs or portions of them
Design Issues • What are the nature, size, and number of the processing elements? • What are the nature, size, and number of the memory modules? • How are the processing and memory elements interconnected? • What applications are to be run in parallel?
Grain Size • Coarse-grained parallelism • Unit of parallelism is larger • Running large pieces of software in parallel with little or no communication between the pieces • Example: large time-sharing systems • Fine-grained parallelism • Unit of parallelism is smaller • Parallel programs whose pieces communicate with each other heavily
Tightly Coupled versus Loosely Coupled • Loosely coupled • Small number of large, independent CPUs that have relatively low-speed connections to each other • Tightly coupled • Smaller processing units that work closely together over high-bandwidth connections
Design Issues • In most cases • Coarse-grained is well suited for loosely coupled • Fine-grained is well suited for tightly coupled
Communication Models • In a parallel computer system, CPUs communicate with each other to exchange information • Two general types • Multiprocessors • Multicomputers
Multiprocessors • Shared Memory System • All processors may share a single virtual address space • Easy model for programmers • Global memory • any processor can access any memory module without intervention by another processor
Uniform Memory Access (UMA) Multiprocessor [diagram: processors P1 … Pn connected to memory modules M1 … Mk through an interconnection network]
Non-Uniform Memory Access (NUMA) Multiprocessor [diagram: processor/memory pairs (P1, M1) … (Pn, Mn) connected by an interconnection network]
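Below is a minimal sketch, not part of the original slides, of how the shared-memory model above is typically programmed: POSIX threads in C, where all threads read and write the same global data directly and a mutex guards the shared accumulator. The array, thread count, and variable names are illustrative assumptions.

```c
/* Minimal shared-memory sketch (POSIX threads, C).
 * Hypothetical names: data[], NTHREADS, total. */
#include <pthread.h>
#include <stdio.h>

#define N        1000
#define NTHREADS 4

static double data[N];          /* single address space: visible to all threads */
static double total = 0.0;      /* shared accumulator                            */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    long id = (long)arg;
    double local = 0.0;
    /* each thread sums a contiguous slice of the shared array */
    for (long i = id * (N / NTHREADS); i < (id + 1) * (N / NTHREADS); i++)
        local += data[i];
    pthread_mutex_lock(&lock);  /* serialize updates to the shared value */
    total += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < N; i++) data[i] = 1.0;
    for (long i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, (void *)i);
    for (long i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    printf("total = %f\n", total);  /* no explicit messages: memory is shared */
    return 0;
}
```

The key point is that the threads never exchange messages; they simply touch the same memory, which is what the shared-memory model promises.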
Multicomputers • Distributed Memory System • Each CPU has its own private memory • Local/private memory – a processor cannot access a remote memory without the cooperation of the remote processor • Cooperation takes place in the form of a message passing protocol • Programming for a multicomputer is much more difficult than programming a multiprocessor
Distributed Memory System [diagram: private memories M1 … Mn attached to processors P1 … Pn, which communicate through an interconnection network]
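As a contrasting sketch, the fragment below uses message passing with standard MPI calls in C: each rank holds its value in private memory, and data moves only through explicit sends and receives. The tag and variable names are illustrative.

```c
/* Minimal message-passing sketch (MPI, C): each node owns private memory,
 * so data moves only via explicit send/receive. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = rank + 1.0;          /* value in this node's private memory */
    if (rank != 0) {
        /* remote data cannot be read directly; it must be sent as a message */
        MPI_Send(&local, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else {
        double sum = local, incoming;
        for (int src = 1; src < size; src++) {
            MPI_Recv(&incoming, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            sum += incoming;
        }
        printf("sum across %d nodes = %f\n", size, sum);
    }
    MPI_Finalize();
    return 0;
}
```

With a typical MPI installation this would be launched with something like mpirun -np 4 ./a.out, one process per node.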
Multiprocessors versus Multicomputers • Multiprocessors are easier to program • But multicomputers are much simpler and cheaper to build • Goal: large computer systems that combine the best of both worlds
Symmetric MultiProcessors (SMP) • Multiprocessor architecture where all processors can access all memory locations uniformly • Processors also share I/O • SMP is classified as a UMA architecture • SMP is the simplest multiprocessor system • Any processor can execute either the OS kernel or user programs
SMP • Performance improves if programs can be run in parallel • Increased availability: if one processor breaks down, the system does not stop running • Performance can also be improved incrementally by adding processors • Does not scale well beyond 16 processors
Clusters • A group of whole computers connected together to function as a parallel computer • Popular implementation: Linux computers using Beowulf clustering software
Clusters • High availability – redundant resources • Scalability • Affordable – off-the-shelf parts
Clusters [photo: Cyborg Cluster, Drexel University; 32 nodes, dual P3 per node]
Memory Organization • Shared Memory System (Multiprocessors) • each processor may also have a cache • convenient to have a global address space • For NUMA, access to remote parts of the global address space is slower than access to local memory • Distributed Memory System (Multicomputers) • Private address space for each processor • Easiest way to connect computers into a large system • Data sharing is implemented through message passing
Issues • When processors share data, different processors must see the same value for a given data item • When a processor updates its cache, it must also update the caches of other processors, or invalidate other processors’ copies • shared data must be coherent
Cache Coherence • All cached copies of shared data must have the same value at all times
Snooping Caches • So-called because individual caches “snoop” on the bus
Write-Through Protocol • Write-Through with Update (Write Update) • Update cache and memory, and update the caches of the other processors • Write-Through without Update (Write Invalidate) • Update cache and memory, and invalidate the copies in the caches of the other processors
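A simplified software model of what a snooping cache does when it sees another processor's write on the bus may help contrast the two policies. The structures and names below are assumptions for illustration, not real cache hardware.

```c
/* Simplified model of a snooped bus write under the two write-through
 * policies; line/struct names are illustrative only. */
#include <stdbool.h>

typedef struct {
    unsigned tag;
    unsigned data;
    bool     valid;
} cache_line_t;

typedef enum { WRITE_UPDATE, WRITE_INVALIDATE } policy_t;

/* Called in every *other* cache when a write to 'tag' appears on the bus. */
void snoop_bus_write(cache_line_t *line, unsigned tag, unsigned new_data,
                     policy_t policy)
{
    if (!line->valid || line->tag != tag)
        return;                      /* this cache holds no copy: nothing to do   */

    if (policy == WRITE_UPDATE)
        line->data = new_data;       /* refresh the local copy in place            */
    else
        line->valid = false;         /* drop the copy; the next read misses to memory */
}
```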
Write-Back Protocol • When a processor wants to write to a block, it must acquire exclusive control/ownership of the block • All other copies are invalidated • Block’s contents may be changed at any time • When another processor requests to read the block, owner processor sends block to requesting processor, and returns control of block to the memory module which updates block to contain the latest value
MESI Protocol • Popular write-back cache coherence protocol named after the initials of the four possible states of each cache line • Modified – entry is valid; the copy in memory is out of date; no other copies exist • Exclusive – no other cache holds the line; memory is up to date • Shared – multiple caches may hold the line; memory is up to date • Invalid – cache entry does not contain valid data
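The four states can be tied together with a small state-transition sketch. The function below is a simplified model covering only a few common events; its names and simplifications are assumptions, not a full MESI implementation.

```c
/* Simplified MESI transitions for one cache line; only a few common
 * events are modeled, and names are illustrative. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

typedef enum {
    LOCAL_READ,     /* this processor reads the line                    */
    LOCAL_WRITE,    /* this processor writes the line                   */
    BUS_READ,       /* another processor's read is snooped on the bus   */
    BUS_WRITE       /* another processor's write/ownership request seen */
} event_t;

/* 'others_have_copy' tells whether any other cache answered the snoop. */
mesi_t mesi_next(mesi_t state, event_t ev, int others_have_copy)
{
    switch (ev) {
    case LOCAL_READ:
        if (state == INVALID)
            return others_have_copy ? SHARED : EXCLUSIVE;
        return state;                  /* M, E, S: the read hits locally            */
    case LOCAL_WRITE:
        return MODIFIED;               /* gain ownership; other copies are invalidated */
    case BUS_READ:
        if (state == MODIFIED || state == EXCLUSIVE)
            return SHARED;             /* supply the data, keep a shared copy        */
        return state;
    case BUS_WRITE:
        return INVALID;                /* another cache takes ownership              */
    }
    return state;
}
```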
Snoopy Cache Issues • Snoopy caches require broadcasting information over the bus, leading to increased bus traffic as the system grows in size
Directory Protocols • Uses a directory that keeps track of the locations where copies of a given data item are present • Eliminates the need for broadcasts • If the directory is centralized, it becomes a bottleneck
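One common full-map organization keeps, per memory block, a state plus one presence bit per processor. The sketch below uses illustrative names to show how a write request would then send invalidations only to the caches that actually hold a copy, instead of broadcasting on the bus.

```c
/* Sketch of a full-map directory entry: a presence bit per processor
 * plus a block state. Names and the callback are illustrative. */
#include <stdint.h>

#define NPROC 32

typedef enum { UNCACHED, SHARED_CLEAN, DIRTY } dir_state_t;

typedef struct {
    dir_state_t state;
    uint32_t    presence;   /* bit i set => processor i caches this block */
} dir_entry_t;

/* On a write request from 'writer', invalidate only the actual sharers. */
void handle_write(dir_entry_t *e, int writer,
                  void (*send_invalidate)(int proc))
{
    for (int p = 0; p < NPROC; p++)
        if (p != writer && (e->presence & (1u << p)))
            send_invalidate(p);        /* point-to-point message, no broadcast */

    e->presence = 1u << writer;        /* the writer is now the only holder    */
    e->state    = DIRTY;
}
```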
Performance • According to Amdahl’s law, introducing machine parallelism will not have a significant effect on performance if the program cannot take advantage of the parallel architecture • Not all programs parallelize well
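The law can be stated compactly (standard formulation, not spelled out on the slide): if a fraction f of a program's execution is inherently sequential, the speedup on p processors can never exceed 1/f, as the worked example in the comments shows.

```latex
% Amdahl's law: f = sequential (non-parallelizable) fraction, p = processor count
\[
  \mathrm{Speedup}(p) \;=\; \frac{1}{\,f + \frac{1-f}{p}\,} \;\le\; \frac{1}{f}
\]
% Worked example: f = 0.10 and p = 16 give
%   Speedup = 1 / (0.10 + 0.90/16) = 1 / 0.15625 = 6.4,
% and even as p grows without bound the speedup never exceeds 1/0.10 = 10.
```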
Scalability Issues • Bandwidth • Latency • Both depend on the interconnection topology
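To make the dependence on topology concrete, the sketch below (illustrative C using textbook formulas) prints two standard scalability metrics for a ring, a 2-D mesh, and a hypercube: diameter, a proxy for worst-case latency, and bisection width, a proxy for aggregate bandwidth.

```c
/* Diameter and bisection width for common topologies with n nodes
 * (textbook formulas; the helper itself is illustrative). */
#include <math.h>
#include <stdio.h>

static void metrics(int n)
{
    int k = (int)sqrt((double)n);           /* assume n = k*k for the 2-D mesh   */
    int d = (int)round(log2((double)n));    /* assume n = 2^d for the hypercube  */

    printf("ring      : diameter %d, bisection width %d\n", n / 2, 2);
    printf("2-D mesh  : diameter %d, bisection width %d\n", 2 * (k - 1), k);
    printf("hypercube : diameter %d, bisection width %d\n", d, n / 2);
}

int main(void)
{
    metrics(16);   /* 16 nodes: ring 8/2, mesh 6/4, hypercube 4/8 */
    return 0;
}
```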