Principles of High Performance Computing (ICS 632) Concurrent Computers
Concurrency and Computers • Concurrency occurs at many levels in computer systems • Within a CPU • Within a “Box” • Across Boxes
Concurrency within a “Box” • Two main techniques • SMP • Multi-core • Let’s look at both of them
Multiple CPUs • We have seen that there are many ways in which a single-threaded program can in fact achieve some amount of true concurrency in a modern processor • ILP, vector instructions • On a hyper-threaded processor, a single-threaded program can also achieve some amount of true concurrency • But there are limits to these techniques, and many systems provide increased true concurrency by using multiple CPUs
SMPs • Symmetric Multi-Processors • often mislabeled as “Shared-Memory Processors”, a misnomer that has by now become tolerated • Processors are all connected to a single memory • Symmetric: each memory cell is equally close to all processors • Many dual-proc and quad-proc systems • e.g., for servers • [Figure: processors P1 … Pn all attached to a single main memory]
Multi-core processors • We’re about to enter an era in which all computers will be SMPs • This is because soon all processors will be multi-core • Let’s look at why we have multi-core processors
Moore’s Law • Many people interpret Moore’s law as “computers get twice as fast every 18/24 months” • which is not true • The law is about transistor density • This wrong interpretation is no longer true • We should have 20GHz processors right now • And we don’t!
No more Moore? • We are used to getting faster CPUs all the time • We are used to them keeping up with ever more demanding software • Known as “Andy giveth, and Bill taketh away” • Andy Grove • Bill Gates • It’s a nice way to force people to buy computers often • But basically, our computers get better, do more things, and it just happens automatically • Some people call this the “performance free lunch” • Conventional wisdom: “Not to worry, tomorrow’s processors will have even more throughput, and anyway today’s applications are increasingly throttled by factors other than CPU throughput and memory speed (e.g., they’re often I/O-bound, network-bound, database-bound).”
Commodity improvements • There are three main ways in which commodity processors keep improving: • Higher clock rates • More aggressive instruction reordering and more concurrent functional units • Bigger/faster caches • All applications can easily benefit from these improvements • at the cost of perhaps a recompilation • Unfortunately, the first two are hitting their limits • Higher clock rates lead to high heat and power consumption • No more instruction reordering without compromising correctness
Is Moore’s law not true? • Ironically, Moore’s law is still true • The density indeed still doubles • But its wrong interpretation is not • Clock rates do not double any more • But we can’t let this happen: computers have to get more powerful • Therefore, the industry has thought of a new way to improve them: multi-core • Multiple CPUs on a single chip • Multi-core adds another level of concurrency • But unlike, say, multiple functional units, it is hard for the compiler to exploit it automatically • Therefore, programmers need to be trained to develop code for multi-core platforms • See ICS432 (and the sketch below)
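As a taste of what explicit multi-core programming looks like, here is a minimal sketch using OpenMP (one common shared-memory programming model, treated in depth in ICS432); the runtime distributes the loop iterations below across the available cores.

```c
/* Minimal sketch: parallelizing a loop across cores with OpenMP.
 * Compile with: gcc -fopenmp saxpy.c -o saxpy */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static float x[N], y[N];
    float a = 2.0f;

    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* Each core executes a chunk of the iteration space. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f, threads available = %d\n", y[0], omp_get_max_threads());
    return 0;
}
```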
Shared Memory and Caches? • When building a shared memory system with multiple processors / cores, one key question is: where does one put the cache? • Two options • Shared Cache: processors P1 … Pn connect through a switch to a single shared cache in front of main memory • Private Caches: each processor P1 … Pn has its own cache ($), and the caches connect to main memory through an interconnection network
Shared Caches • Advantages • Cache placement identical to single cache • Only one copy of any cached block • Can’t have different values for the same memory location • Good interference • One processor may prefetch data for another • Two processors can each access data within the same cache block, enabling fine-grain sharing • Disadvantages • Bandwidth limitation • Difficult to scale to a large number of processors • Keeping all processors working in cache requires a lot of bandwidth • Size limitation • Building a fast large cache is expensive • Bad interference • One processor may flush another processor’s data
Shared Caches • Shared caches have had a strange history • Early 1980s • Alliant FX-8 • 8 processors with a crossbar to an interleaved 512KB cache • Encore & Sequent • first 32-bit microprocessors • two procs per board with a shared cache • Then they disappeared • Only to reappear in recent MPPs • Cray X1: shared L3 cache • IBM Power 4 and Power 5: shared L2 cache • Typical multi-proc systems do not use shared caches • But they are common in multi-core systems
Caches and multi-core • Typical multi-core architectures use distributed (per-core) L1 caches • But lower levels of caches are shared • [Figure: left, Core #1 and Core #2 each with their own L1 cache; right, the same two cores with private L1 caches on top of a shared L2 cache]
Multi-proc & multi-core systems Processor #1 Processor #2 Core #1 Core #2 Core #1 Core #2 L1 Cache L1 Cache L1 Cache L1 Cache L2 Cache L2 Cache RAM
Private caches • The main problem with private caches is that of memory consistency • Memory consistency is jeopardized by having multiple caches • P1 and P2 both have a cached copy of a data item • P1 writes to it, possibly writing through to memory • At this point P2 owns a stale copy • When designing a multi-processor system, one must ensure that this cannot happen • By defining protocols for cache coherence
Snoopy Cache-Coherence • The memory bus is a broadcast medium • Caches contain information on which addresses they store (state, address, data) • The cache controller “snoops” all transactions on the bus • A transaction is relevant if it involves a cache block currently contained in this cache • Take action to ensure coherence • invalidate, update, or supply the value • [Figure: processors P0 … Pn, each with its cache ($), snooping memory operations issued on the shared memory bus]
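A highly simplified sketch of the decision such a controller makes on each snooped transaction, assuming a basic three-state invalidation protocol (the state names and bus operations below are illustrative; real protocols such as MESI have more states and transitions):

```c
/* Simplified sketch of a snooping cache controller's reaction to a
 * bus transaction, assuming a basic invalidation-based protocol. */
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } line_state_t;
typedef enum { BUS_READ, BUS_WRITE } bus_op_t;

typedef struct {
    unsigned long tag;
    line_state_t  state;
} cache_line_t;

/* Called for every transaction observed ("snooped") on the bus. */
void snoop(cache_line_t *line, bus_op_t op, unsigned long addr_tag) {
    if (line->state == INVALID || line->tag != addr_tag)
        return;                        /* not a relevant transaction */

    if (op == BUS_WRITE) {
        line->state = INVALID;         /* another cache is writing: invalidate */
    } else if (op == BUS_READ && line->state == MODIFIED) {
        line->state = SHARED;          /* supply the up-to-date value, keep a shared copy */
    }
}

int main(void) {
    cache_line_t line = { 0x42, SHARED };
    snoop(&line, BUS_WRITE, 0x42);     /* remote write to a block we hold */
    printf("state after remote write: %s\n",
           line.state == INVALID ? "INVALID" : "not invalid");
    return 0;
}
```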
Limits of Snoopy Coherence • Assume: a 4 GHz processor issuing 32-bit instructions => 16 GB/s instruction bandwidth per processor • 30% of instructions are loads/stores of 8-byte elements => 9.6 GB/s data bandwidth per processor • Suppose a 98% instruction hit rate and a 90% data hit rate => 320 MB/s instruction traffic per processor => 960 MB/s data traffic per processor => 1.28 GB/s combined memory traffic per processor • Assuming 10 GB/s of bus bandwidth, 8 processors will saturate the bus • [Figure: processors and memories on a shared bus; 25.6 GB/s between each processor and its cache, 1.28 GB/s per processor onto the bus]
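The back-of-the-envelope arithmetic above can be reproduced directly; all numbers below are the slide's assumptions, not measurements:

```c
/* Reproducing the bus saturation estimate from the slide. */
#include <stdio.h>

int main(void) {
    double clock_hz    = 4e9;    /* 4 GHz processor            */
    double inst_bytes  = 4.0;    /* 32-bit instructions        */
    double ls_fraction = 0.30;   /* 30% loads/stores           */
    double elem_bytes  = 8.0;    /* 8-byte data elements       */
    double inst_miss   = 0.02;   /* 98% instruction hit rate   */
    double data_miss   = 0.10;   /* 90% data hit rate          */
    double bus_bw      = 10e9;   /* 10 GB/s shared bus         */

    double inst_bw = clock_hz * inst_bytes;                     /* 16 GB/s   */
    double data_bw = clock_hz * ls_fraction * elem_bytes;       /* 9.6 GB/s  */
    double mem_bw  = inst_bw * inst_miss + data_bw * data_miss; /* 1.28 GB/s */

    printf("Per-processor memory traffic: %.2f GB/s\n", mem_bw / 1e9);
    printf("Processors to saturate the bus: %.1f\n", bus_bw / mem_bw);
    return 0;
}
```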
Sample Machines • Intel Pentium Pro Quad • Coherent • 4 processors • Sun Enterprise server • Coherent • Up to 16 processor and/or memory-I/O cards
Directory-based Coherence • Idea: Implement a “directory” that keeps track of where each copy of a data item is stored • The directory acts as a filter • processors must ask permission before loading data from memory into their cache • when an entry is changed, the directory either updates or invalidates the cached copies • Eliminates the overhead of broadcasting/snooping, and thus bandwidth consumption • But is slower in terms of latency • Used to scale up to numbers of processors that would saturate the memory bus
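A sketch of what a directory entry might look like in a full bit-vector scheme (the field names and the 64-sharer limit are illustrative assumptions, not a description of any particular machine):

```c
/* Sketch of a directory entry in a full bit-vector directory protocol. */
#include <stdio.h>
#include <stdint.h>

typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

typedef struct {
    dir_state_t state;     /* who may have the block, and in what mode */
    uint64_t    sharers;   /* bit i set => processor i holds a copy    */
    int         owner;     /* valid when state == DIR_EXCLUSIVE        */
} dir_entry_t;

/* On a write request, the home node consults the entry and sends
 * invalidations only to the processors whose bits are set, instead
 * of broadcasting on a bus. */
int main(void) {
    dir_entry_t e = { DIR_SHARED, 0, -1 };
    e.sharers |= (1ULL << 3) | (1ULL << 7);   /* procs 3 and 7 hold copies */
    printf("sharers bitmap: 0x%llx\n", (unsigned long long)e.sharers);
    return 0;
}
```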
Example machine • SGI Altix 3000 • A node contains up to 4 Itanium 2 processors and 32GB of memory • Uses a mixture of snoopy and directory-based coherence • Up to 512 processors that are cache coherent (global address space is possible for larger machines)
Sequential Consistency? • A lot of hardware and technology to ensure cache coherence • But the sequential consistency model may be broken anyway • The compiler reorders/removes code • Prefetch instructions cause reordering • The network may reorder two write messages • Basically, a bunch of things can happen • Virtually all commercial systems give up on the idea of maintaining strong sequential consistency
Weaker models • The programmer must program with memory models weaker than Sequential Consistency • This is done by following some rules • Avoid race conditions • Use system-provided synchronization primitives • We will see how to program shared-memory machines • ICS432 is “all” about this • We’ll just do a brief “review” in 632
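A minimal sketch of the second rule, using a system-provided synchronization primitive (a POSIX mutex is assumed here) to protect a shared counter; without the lock, the final value would be unpredictable under a weak memory model:

```c
/* Sketch: avoiding a race condition with a system-provided primitive.
 * Compile with: gcc counter.c -lpthread */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS   4
#define ITERATIONS 1000000

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *work(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&lock);     /* enforces ordering and atomicity */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, work, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    printf("counter = %ld (expected %d)\n", counter, NTHREADS * ITERATIONS);
    return 0;
}
```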
Concurrency and Computers • We will see computer systems designed to allow concurrency (for performance benefits) • Concurrency occurs at many levels in computer systems • Within a CPU • Within a “Box” • Across Boxes
Multiple boxes together • Example • Take four “boxes” • e.g., four Intel Itaniums bought at Dell • Hook them up to a network • e.g., a switch bought at CISCO, Myricom, etc. • Install software that allows you to write/run applications that can utilize these four boxes concurrently • This is a simple way to achieve concurrency across computer systems • Everybody has heard of “clusters” by now • They are basically like the above example and can be purchased already built from vendors • We will talk about this kind of concurrent platform at length during this class
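The software layer mentioned above is typically a message-passing library such as MPI, which we will cover later in the class; a minimal sketch of a program that runs one process per box (or per core) looks like this:

```c
/* Minimal sketch of a program that runs across several "boxes" using MPI.
 * Build/run (names may vary by MPI distribution):
 *   mpicc hello.c -o hello && mpirun -np 4 ./hello */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many in total?  */
    printf("Hello from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```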
Multiple Boxes Together • Why do we use multiple boxes? • Every programmer would rather have an SMP/multi-core architecture that provides all the power/memory she/he needs • The problem is that single boxes do not scale to meet the needs of many scientific applications • Can’t have enough processors, or powerful enough cores • Can’t have enough memory • But if you can live with a single box, do it! • We will see that single-box programming is much easier than multi-box programming
Where does this leave us? • So far we have seen many ways in which concurrency can be achieved/implemented in computer systems • Within a box • Across boxes • So we could look at a system and just list all the ways in which it does concurrency • It would be nice to have a great taxonomy of parallel platforms in which we can pigeon-hole all (past and present) systems • Provides simple names that everybody can use and understand quickly
Taxonomy of parallel machines? • It’s not going to happen • Just last year, Gordon Bell and Jim Gray published an article in Comm. of the ACM discussing what the taxonomy should be • Dongarra, Sterling, etc. answered, telling them they were wrong, saying what the taxonomy should be, and proposing a new multi-dimensional scheme! • Both papers agree that most terms (e.g., “MPP”) are conflated, misused, etc. • Complicated by the fact that concurrency appears at so many levels • Example: a 16-node cluster, where each node is a 4-way multi-processor, where each processor is hyperthreaded, has vector units, and is fully pipelined with multiple, pipelined functional units
Taxonomy of platforms? • We’ll look at one traditional taxonomy • We’ll look at current categorizations from Top500 • We’ll look at examples of platforms • We’ll look at interesting/noteworthy architectural features that one should know as part of one’s parallel computing culture
The Flynn taxonomy • Proposed in 1966!!! • Functional taxonomy based on the notion of streams of information: data and instructions • Platforms are classified according to whether they have a single (S) or multiple (M) stream of each of the above • Four possibilities • SISD (sequential machine) • SIMD • MIMD • MISD (rare, no commercial system... systolic arrays)
SIMD • [Figure: a single stream of instructions is fetched, decoded, and broadcast by a Control Unit to many Processing Elements (PEs)] • PEs can be deactivated and activated on-the-fly • Vector processing (e.g., vector add) is easy to implement on SIMD • Debate: is a vector processor an SIMD machine? • often confused • strictly not true according to the taxonomy (it’s really SISD with pipelined operations) • but it’s convenient to think of the two as equivalent
MIMD • Most general category • Pretty much every supercomputer in existence today is a MIMD machine at some level • This limits the usefulness of the taxonomy • But you had to have heard of it at least once because people keep referring to it, somehow... • Other taxonomies have been proposed, none very satisfying • Shared- vs. Distributed- memory is a common distinction among machines, but these days many are hybrid anyway
A host of parallel machines • There are (have been) many kinds of parallel machines • For the last 12 years their performance has been measured and recorded with the LINPACK benchmark, as part of Top500 • It is a good source of information about what machines are (were) and how they have evolved • Note that it’s really about “supercomputers” http://www.top500.org
LINPACK Benchmark? • LINPACK: “LINear algebra PACKage” • A FORTRAN library • Matrix multiply, LU/QR/Cholesky factorizations, eigensolvers, SVD, etc. • LINPACK Benchmark • Dense linear system solve with LU factorization • 2/3 n³ + O(n²) floating-point operations • Measure: MFlops • The problem size n can be chosen • You have to report the best performance for the best n, and the n that achieves half of the best performance
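Turning a measured solve time into the reported MFlops figure is a one-line calculation; the sketch below uses the 2n² lower-order term conventionally added by the HPL implementation (treat that constant, and the sample inputs, as assumptions — the slide only specifies 2/3 n³ + O(n²)):

```c
/* Sketch: turning a measured solve time into the benchmark's MFlops figure. */
#include <stdio.h>

double linpack_mflops(double n, double seconds) {
    double flops = (2.0 / 3.0) * n * n * n + 2.0 * n * n;
    return flops / seconds / 1e6;
}

int main(void) {
    /* hypothetical inputs: n = 10000 unknowns solved in 45 seconds */
    printf("%.1f MFlops\n", linpack_mflops(10000.0, 45.0));
    return 0;
}
```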
SIMD Machines • ILLIAC-IV, TMC CM-1, MasPar MP-1 • Expensive logic for the CU, but there is only one • Cheap logic for the PEs, and there can be a lot of them • 32 procs on 1 chip for the MasPar: a 1024-proc system with 32 chips that fit on a single board! • 65,536 processors for the CM-1 • Thinking Machines’ gimmick was that the human brain consists of many simple neurons that are turned on and off, and so was their machine • CM-5 • hybrid SIMD and MIMD • Death • The machines are no longer popular, but the programming model is • Vector processors are often labeled SIMD because that’s in effect what they do, but they are not SIMD machines • Led to the MPP terminology (Massively Parallel Processor) • Ironic because none of today’s “MPPs” are SIMD
Clusters, Constellations, MPPs • [Figure: nodes P0, P1, …, Pn, each with its own memory and network interface (NI), connected by an interconnect] • These are the only 3 categories today in the Top500 • They all belong to the Distributed Memory model (MIMD) (with many twists) • Each processor/node has its own memory and cache but cannot directly access another processor’s memory • nodes may be SMPs • Each “node” has a network interface (NI) for all communication and synchronization • So what are these 3 categories?
Clusters • 80% of the Top500 machines are labeled as “clusters” • Definition: a parallel computer system comprising an integrated collection of independent “nodes”, each of which is a system in its own right, capable of independent operation, and derived from products developed and marketed for other standalone purposes • A commodity cluster is one in which both the network and the compute nodes are available in the market • In the Top500, “cluster” means “commodity cluster” • A well-known type of commodity cluster is the “Beowulf-class PC cluster”, or “Beowulf”
What is Beowulf? • An experiment in parallel computing systems • Established vision of low cost, high end computing, with public domain software (and led to software development) • Tutorials and book for best practice on how to build such platforms • Today by Beowulf cluster one means a commodity cluster that runs Linux and GNU-type software • Project initiated by T. Sterling and D. Becker at NASA in 1994
Constellations??? • Commodity clusters that differ from the previous ones by the dominant level of parallelism • Clusters consist of nodes, and nodes are typically SMPs • If there are more procs in a node than nodes in the cluster, then we have a constellation • Typically, constellations are space-shared among users, with each user running OpenMP on a node, although an app could run on the whole machine using MPI/OpenMP • To be honest, this term is not very useful and not widely used
MPP???????? • Probably the most imprecise term for describing a machine (isn’t a 256-node cluster of 4-way SMPs massively parallel?) • May use proprietary networks and vector processors, as opposed to commodity components • The Cray T3E, Cray X1, and Earth Simulator are distributed-memory machines, but the nodes are SMPs • Basically, everything that’s fast and not commodity is an MPP, in terms of today’s Top500 • Let’s look at these “non-commodity” things • People’s definition of “commodity” varies
Vector Processors • Vector architectures were based on a single processor • Multiple functional units • All performing the same operation • Instructions may specify large amounts of parallelism (e.g., 64-way) but hardware executes only a subset in parallel • Historically important • Overtaken by MPPs in the 90s as seen in Top500 • Re-emerging in recent years • At a large scale in the Earth Simulator (NEC SX6) and Cray X1 • At a small scale in SIMD media extensions to microprocessors • SSE, SSE2 (Intel: Pentium/IA64) • Altivec (IBM/Motorola/Apple: PowerPC) • VIS (Sun: Sparc) • Key idea: Compiler does some of the difficult work of finding parallelism, so the hardware doesn’t have to
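As an illustration of the small-scale SIMD media extensions listed above, here is a minimal sketch of a 4-wide vector add written with SSE intrinsics (x86-specific; in practice one would usually let the compiler auto-vectorize the plain loop):

```c
/* Sketch: a 4-wide single-precision vector add using SSE intrinsics.
 * x86 only; compile with: gcc -msse vecadd.c -o vecadd */
#include <stdio.h>
#include <xmmintrin.h>

void vec_add(float *c, const float *a, const float *b, int n) {
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);          /* load 4 floats        */
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb)); /* 4 adds in one op     */
    }
    for (; i < n; i++)                            /* scalar remainder     */
        c[i] = a[i] + b[i];
}

int main(void) {
    float a[8] = {1,2,3,4,5,6,7,8}, b[8] = {8,7,6,5,4,3,2,1}, c[8];
    vec_add(c, a, b, 8);
    printf("c[0] = %.1f, c[7] = %.1f\n", c[0], c[7]);
    return 0;
}
```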
Vector Processors • Advantages • quick fetch and decode of a single instruction for multiple operations • the instruction provides the processor with a regular source of data, which can arrive at each cycle and be processed in a pipelined fashion • The compiler does the work for you of course • Memory-to-memory • no registers • can process very long vectors, but startup time is large • appeared in the 70s and died in the 80s • Cray, Fujitsu, Hitachi, NEC