Parallel platforms, etc.

  1. Parallel platforms, etc. Dr. Marco Antonio Ramos Corchado

  2. Where parallel computing is used

  3. Where parallel computing is used

  4. Taxonomy of platforms?
  • It would be nice to have a great taxonomy of parallel platforms in which we could pigeon-hole all past and present systems
  • But it's not going to happen
  • Recently, Gordon Bell and Jim Gray published an article in Comm. of the ACM discussing what the taxonomy should be
  • Dongarra, Sterling, et al. answered, telling them they were wrong, saying what the taxonomy should be, and proposing a new multi-dimensional scheme!
  • Both papers agree that terms are conflated, misused, etc. (e.g., MPP)
  • We'll look at one traditional taxonomy
  • We'll look at current categorizations from the Top500
  • We'll look at examples of platforms
  • We'll look at interesting/noteworthy architectural features that one should know as part of one's parallel computing culture
  • What about conceptual models of parallel machines?

  5. The Flynn taxonomy
  • Proposed in 1966!!!
  • Functional taxonomy based on the notion of streams of information: data and instructions
  • Platforms are classified according to whether they have a single (S) or multiple (M) stream of each of the above
  • Four possibilities
    • SISD (sequential machine)
    • SIMD
    • MIMD
    • MISD (rare, no commercial system... systolic arrays)

  6. SIMD
  [Diagram: a control unit fetches, decodes, and broadcasts a single stream of instructions to many processing elements (PEs)]
  • PEs can be deactivated and activated on-the-fly
  • Vector processing (e.g., vector add) is easy to implement on SIMD
  • Debate: is a vector processor an SIMD machine?
    • often confused
    • strictly not true according to the taxonomy (it's really SISD with pipelined operations)
    • more later on vector processors

  7. MIMD
  • Most general category
  • Pretty much everything in existence today is a MIMD machine at some level
    • This limits the usefulness of the taxonomy
    • But you had to have heard of it at least once, because people keep referring to it, somehow...
  • Other taxonomies have been proposed, none very satisfying
  • Shared- vs. distributed-memory is a common distinction among machines, but these days many are hybrid anyway

  8. A host of parallel machines
  • There are (have been) many kinds of parallel machines
  • For the last 11 years their performance has been measured and recorded with the LINPACK benchmark, as part of the Top500
  • It is a good source of information about what machines are (were) and how they have evolved: http://www.top500.org

  9. What is the LINPACK Benchmark?
  • LINPACK: "LINear algebra PACKage"
    • A FORTRAN library
    • Matrix multiply, LU/QR/Cholesky factorizations, eigensolvers, SVD, etc.
  • LINPACK Benchmark
    • Dense linear system solve with LU factorization
    • 2/3 n^3 + O(n^2) floating-point operations (see the sketch below)
    • Measure: MFlops
    • The problem size n can be chosen
    • You have to report the best performance for the best n, and the n that achieves half of the best performance
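As an aside on how the reported number is obtained, here is a minimal sketch (not part of the benchmark itself): it plugs the slide's operation count into a rate calculation, assuming a hypothetical measured solve time. The exact 2n^2 lower-order term is an assumption, since the slide only says O(n^2).

```c
/* Sketch: deriving a LINPACK-style rate from the slide's operation count
 * (2/3 n^3 + O(n^2) flops for an LU-based dense solve).
 * The solver itself is omitted; elapsed_seconds is a placeholder you would
 * measure around your own factorization + solve. */
#include <stdio.h>

int main(void) {
    double n = 10000.0;             /* chosen problem size (user-selectable) */
    double elapsed_seconds = 120.0; /* hypothetical measured wall-clock time */

    /* 2n^2 is an assumed lower-order term standing in for the O(n^2) part */
    double flops = (2.0 / 3.0) * n * n * n + 2.0 * n * n;
    double mflops = flops / elapsed_seconds / 1e6;

    printf("n = %.0f: %.1f MFlops\n", n, mflops);
    return 0;
}
```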

  10. What can we find on the Top500?

  11. Pie charts (from the Top500)

  12. Platform Architectures: SIMD, Cluster, Vector, Constellation, SMP, MPP

  13. SIMD
  • ILLIAC-IV, TMC CM-1, MasPar MP-1
    • Expensive logic for the CU, but there is only one
    • Cheap logic for the PEs, and there can be a lot of them
    • 32 procs on 1 chip of the MasPar, 1024-proc system with 32 chips that fit on a single board!
    • 65,536 processors for the CM-1
  • Thinking Machines' gimmick was that the human brain consists of many simple neurons that are turned on and off, and so was their machine
    • CM-5: hybrid SIMD and MIMD
  • Death
    • Machines not popular, but the programming model is
    • Vector processors are often labeled SIMD because that's in effect what they do, but they are not SIMD machines
  • Led to the MPP terminology (Massively Parallel Processor)
    • Ironic because none of today's "MPPs" are SIMD

  14. SMPs
  [Diagram: processors P1..Pn, each with a cache ($), connected by a network/bus to a shared memory]
  • "Symmetric MultiProcessors" (often mislabeled as "Shared-Memory Processors", which has now become tolerated)
  • Processors all connected to a (large) memory
  • UMA: Uniform Memory Access, which makes it easy to program (see the sketch below)
  • Symmetric: all memory is equally close to all processors
  • Difficult to scale to many processors (<32 typically)
  • Cache coherence via "snoopy caches"
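Since the deck later programs such machines with OpenMP, here is a minimal hedged sketch of what UMA shared-memory programming looks like: every thread touches the same array directly, and the only parallel construct is a pragma. The array size and contents are made up for the example.

```c
/* Minimal OpenMP sketch: on an SMP every thread reads/writes the same
 * shared array directly; the runtime splits the loop across processors. */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * i;   /* shared memory: no message passing required */
        sum += a[i];
    }

    printf("threads=%d sum=%g\n", omp_get_max_threads(), sum);
    return 0;
}
```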

  15. Distributed Shared Memory
  [Diagram: processors P1..Pn, each with a cache ($), connected by a network to distributed memory banks]
  • Memory is logically shared, but physically distributed in banks
  • Any processor can access any address in memory
  • Cache lines (or pages) are passed around the machine
  • Cache coherence: distributed directories
  • NUMA: Non-Uniform Memory Access (some processors may be closer to some banks)
  • SGI Origin2000 is a canonical example
    • Scales to 100s of processors
    • Hypercube topology for the memory (later)

  16. Clusters, Constellations, MPPs
  [Diagram: nodes P0..Pn, each with its own memory and a network interface (NI), connected by an interconnect]
  • These are the only 3 categories today in the Top500
  • They all belong to the distributed-memory model (MIMD) (with many twists)
  • Each processor/node has its own memory and cache but cannot directly access another processor's memory
    • nodes may be SMPs
  • Each "node" has a network interface (NI) for all communication and synchronization
  • So what are these 3 categories?

  17. Clusters
  • 58.2% of the Top500 machines are labeled as "clusters"
  • Definition: a parallel computer system comprising an integrated collection of independent "nodes", each of which is a system in its own right, capable of independent operation and derived from products developed and marketed for other standalone purposes
  • A commodity cluster is one in which both the network and the compute nodes are available on the market
  • In the Top500, "cluster" means "commodity cluster"
  • A well-known type of commodity cluster is the "Beowulf-class PC cluster", or "Beowulf"

  18. What is Beowulf?
  • An experiment in parallel computing systems
  • Established a vision of low-cost, high-end computing with public-domain software (and led to software development)
  • Tutorials and a book on best practice for how to build such platforms
  • Today, "Beowulf cluster" means a commodity cluster that runs Linux and GNU-type software
  • Project initiated by T. Sterling and D. Becker at NASA in 1994

  19. Constellations???
  • Commodity clusters that differ from the previous ones by the dominant level of parallelism
  • Clusters consist of nodes, and nodes are typically SMPs
  • If there are more procs in a node than nodes in the cluster, then we have a constellation
  • Typically, constellations are space-shared among users, with each user running OpenMP on a node, although an app could run on the whole machine using MPI/OpenMP
  • To be honest, this term is not very useful and not widely used

  20. MPP????????
  • Probably the most imprecise term for describing a machine (isn't a 256-node cluster of 4-way SMPs massively parallel?)
  • May use proprietary networks, vector processors, etc., as opposed to commodity components
  • IBM SP2, Cray T3E, IBM SP-4 (DataStar), Cray X1, and the Earth Simulator are distributed-memory machines, but the nodes are SMPs
  • Basically, everything that's fast and not commodity is an MPP, in terms of today's Top500
  • Let's look at these "non-commodity" things

  21. Vector Processors
  • Vector architectures were based on a single processor
    • Multiple functional units
    • All performing the same operation
    • Instructions may specify large amounts of parallelism (e.g., 64-way) but hardware executes only a subset in parallel
  • Historically important
    • Overtaken by MPPs in the 90s, as seen in the Top500
  • Re-emerging in recent years
    • At a large scale in the Earth Simulator (NEC SX6) and Cray X1
    • At a small scale in SIMD media extensions to microprocessors (see the sketch below)
      • SSE, SSE2 (Intel: Pentium/IA64)
      • Altivec (IBM/Motorola/Apple: PowerPC)
      • VIS (Sun: Sparc)
  • Key idea: the compiler does some of the difficult work of finding parallelism, so the hardware doesn't have to
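As a small-scale illustration of the SIMD media extensions listed above, here is a hedged sketch using Intel's SSE intrinsics: a single _mm_add_ps instruction adds four packed floats at a time. The arrays and their length are invented for the example.

```c
/* Sketch: one SSE instruction adds four packed floats at a time,
 * i.e., a tiny vector operation inside an ordinary microprocessor. */
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    for (int i = 0; i < 8; i += 4) {          /* 4 elements per instruction */
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);                /* prints eight 9s */
    printf("\n");
    return 0;
}
```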

  22. Vector Processors
  [Diagram: element-wise addition of vector registers, vr3 = vr1 + vr2, with several adders operating in parallel]
  • Definition: a processor that can do element-wise operations on entire vectors with a single instruction, called a vector instruction
    • These are specified as operations on vector registers
    • A processor comes with some number of such registers
    • A vector register holds ~32-64 elements
  • The number of elements is larger than the amount of parallel hardware, called vector pipes or lanes, say 2-4
  • The hardware performs a full vector operation in #elements-per-vector-register / #pipes steps (e.g., a 64-element register on 2 pipes takes 32 steps; see the sketch below)
  • Logically, a vector add r3 = r1 + r2 performs #elements additions in parallel; physically, it performs #pipes additions in parallel
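A toy calculation of the timing rule above, using the slide's example values (64 elements per register, 2 pipes); this is only an illustration of the formula, not a model of any particular processor.

```c
/* Toy model of the slide's timing rule: a vector instruction over a full
 * register completes in (elements per register) / (number of pipes) steps,
 * because only #pipes element-wise operations run truly in parallel. */
#include <stdio.h>

int main(void) {
    int vector_length = 64;   /* elements held in one vector register      */
    int pipes = 2;            /* parallel lanes actually present in hardware */

    int steps = (vector_length + pipes - 1) / pipes;  /* ceiling division */
    printf("one vector add = %d element ops in %d pipelined steps\n",
           vector_length, steps);
    return 0;
}
```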

  23. Vector Processors
  • Advantages
    • quick fetch and decode of a single instruction for multiple operations
    • the instruction provides the processor with a regular source of data, which can arrive at each cycle and be processed in a pipelined fashion
    • the compiler does the work for you, of course
  • Memory-to-memory machines
    • no registers
    • can process very long vectors, but startup time is large
    • appeared in the 70s and died in the 80s
  • Cray, Fujitsu, Hitachi, NEC

  24. Global Address Space
  [Diagram: nodes P0..Pn, each with its own memory and a network interface (NI), connected by an interconnect]
  • Cray T3D, T3E, X1, and HP AlphaServer cluster
  • Network interface supports "Remote Direct Memory Access"
    • NI can directly access memory without interrupting the CPU
    • One processor can read/write memory with one-sided operations (put/get)
    • Not just a load/store as on a shared-memory machine
    • Remote data is typically not cached locally
  • (remember the MPI-2 extension; see the sketch below)
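The parenthetical about MPI-2 refers to its one-sided (RMA) operations. Here is a minimal hedged sketch in which rank 0 puts a value directly into rank 1's exposed memory window with no matching receive; run it with at least two ranks, and note that error handling is omitted.

```c
/* Minimal MPI-2 one-sided sketch: rank 0 writes directly into rank 1's
 * exposed window with MPI_Put; rank 1 never posts a matching receive.
 * Run with at least two ranks, e.g. mpirun -np 2 ./a.out */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int buf = -1;                      /* memory exposed for remote access */
    MPI_Win win;
    MPI_Win_create(&buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);             /* open the access epoch */
    if (rank == 0) {
        int value = 42;
        /* one-sided: no receive is needed on rank 1 */
        MPI_Put(&value, 1, MPI_INT, /*target rank*/ 1,
                /*target displacement*/ 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);             /* complete the RMA epoch */

    if (rank == 1)
        printf("rank 1 sees buf = %d\n", buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```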

  25. Cray X1: Parallel Vector Architecture
  Cray combines several technologies in the X1:
  • 12.8 Gflop/s vector processors (MSP)
  • Shared caches (unusual on earlier vector machines)
  • 4-processor nodes sharing up to 64 GB of memory
  • Single System Image to 4096 processors
  • Remote put/get between nodes (faster than MPI)

  26. Cray X1: the MSP
  • The Cray X1 building block is the MSP
    • Multi-Streaming vector Processor
    • 4 SSPs (each a 2-pipe vector processor)
    • Compiler will (try to) vectorize/parallelize across the MSP, achieving "streaming"
  [Diagram (source: J. Levesque, Cray): custom blocks with 4 scalar units (S) and 8 vector pipes (V) at 400/800 MHz, delivering 12.8 Gflops (64-bit) or 25.6 Gflops (32-bit); four 0.5 MB shared caches form a 2 MB Ecache; labeled bandwidths: 51 GB/s, 25-41 GB/s, and 25.6 GB/s / 12.8-20.5 GB/s to local memory and network]

  27. Cray X1: a node
  [Diagram: 16 processors (P), each with a cache ($), connected to 16 memory banks (M/mem) and two I/O blocks (IO)]
  • Shared memory
  • 32 network links and four I/O links per node

  28. Cray X1: 32 nodes
  [Diagram: 32 nodes connected through routers (R) and a fast switch]

  29. Cray X1: 128 nodes

  30. Cray X1: Parallelism
  • Many levels of parallelism
    • Within a processor: vectorization
    • Within an MSP: streaming
    • Within a node: shared memory
    • Across nodes: message passing
  • Some are automated by the compiler, some require work by the programmer
  • Hard to fit the machine into a simple taxonomy
  • Similar story for the Earth Simulator

  31. The Earth Simulator (NEC)
  • Each node:
    • Shared memory (16 GB)
    • 8 vector processors + an I/O processor
  • 640 nodes fully connected by a 640x640 crossbar switch
  • Total: 5120 processors at 8 GFlops each -> ~40 TFlops peak

  32. DataStar
  • 8-way or 32-way Power4 SMP nodes
  • Connected via IBM's Federation (formerly Colony) interconnect
    • 8-ary fat-tree topology
  • 1,632 processors
  • 10.4 TeraFlops
  • Each node is directly connected via fiber to IBM's GPFS (parallel file system)
  • Similar to the SP-x series, but higher bandwidth and higher arity of the fat-tree

  33. Blue Gene/L
  • 65,536 processors (still being assembled)
  • Relatively modest clock rates, so that power consumption is low, cooling is easy, and space is small (1024 nodes in the same rack)
    • Besides, processor speed is on par with the memory speed, so faster does not help
  • 2-way SMP nodes!
  • Several networks
    • 64x32x32 3-D torus for point-to-point (see the sketch below)
    • tree for collective operations and for I/O
    • plus Ethernet, etc.
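To make the 3-D torus concrete, here is a hedged sketch of neighbor computation with wraparound; the 64x32x32 shape is from the slide, but the linear rank numbering is an assumption made up for the example.

```c
/* Sketch: in a 3-D torus each node has six neighbors, one in each
 * direction along x, y, z, with wraparound at the edges. Dimensions
 * follow the slide (64 x 32 x 32 = 65,536 nodes); the linear rank
 * numbering used here is only an assumption for the example. */
#include <stdio.h>

#define DX 64
#define DY 32
#define DZ 32

/* Map (x, y, z), with wraparound, to an assumed linear rank. */
static int rank_of(int x, int y, int z) {
    x = (x + DX) % DX;
    y = (y + DY) % DY;
    z = (z + DZ) % DZ;
    return (x * DY + y) * DZ + z;
}

int main(void) {
    int x = 0, y = 0, z = 0;   /* a corner node, so wraparound is visible */
    printf("neighbors of (0,0,0): %d %d %d %d %d %d\n",
           rank_of(x - 1, y, z), rank_of(x + 1, y, z),
           rank_of(x, y - 1, z), rank_of(x, y + 1, z),
           rank_of(x, y, z - 1), rank_of(x, y, z + 1));
    return 0;
}
```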

  34. If you like dead supercomputers
  • Lots of old supercomputers with pictures: http://www.geocities.com/Athens/6270/superp.html
  • Dead Supercomputers: http://www.paralogos.com/DeadSuper/Projects.html
  • e-Bay: Cray Y-MP/C90, 1993
    • Sold for $45,100.70
    • From the Pittsburgh Supercomputing Center, who wanted to get rid of it to make space in their machine room
    • Original cost: $35,000,000
    • Weight: 30 tons
    • Cost $400,000 to make it work at the buyer's ranch in Northern California

  35. Network Topologies
  • People have experimented with different topologies for distributed-memory machines, or to arrange memory banks in NUMA shared-memory machines
  • Examples include:
    • Ring: KSR (1991)
    • 2-D grid: Intel Paragon (1992)
    • Torus
    • Hypercube: nCube, Intel iPSC/860; used in the SGI Origin 2000 for memory
    • Fat-tree: IBM Colony and Federation interconnects (SP-x)
    • Arrangement of switches: pioneered with "butterfly networks" like in the BBN TC2000 in the early 1990s
      • 200 MHz processors in a multi-stage network of switches
      • Virtually shared distributed memory (NUMA)
      • I actually worked with that one!

  36. Hypercube
  • Defined by its dimension, d
  [Diagram: hypercubes of dimension 1D, 2D, 3D, and 4D]

  37. Hypercube
  • Properties
    • Has 2^d nodes
    • The number of hops between two nodes is at most d
    • The diameter of the network grows logarithmically with the number of nodes, which was the key reason for interest in hypercubes
    • But each node needs d neighbors, which is a problem
  • Routing and addressing
    • d-bit addresses (the 4-D example labels nodes 0000 through 1111)
    • routing from xxxx to yyyy: just keep going to a neighbor that has a smaller Hamming distance to the destination (see the sketch below)
    • reminiscent of some p2p things
  • TONS of hypercube research (even today!!)
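A hedged sketch of the routing rule above: at each hop, flip one bit in which the current address differs from the destination; this reduces the Hamming distance by one per hop, so at most d hops are needed.

```c
/* Sketch of the slide's routing rule on a d-dimensional hypercube:
 * node addresses are d-bit strings, and each hop flips the lowest bit
 * in which the current address still differs from the destination,
 * so the Hamming distance shrinks by one per hop (at most d hops). */
#include <stdio.h>

static void route(unsigned src, unsigned dst) {
    unsigned cur = src;
    printf("route %u -> %u:", src, dst);
    while (cur != dst) {
        unsigned diff = cur ^ dst;    /* bits still to be corrected */
        unsigned bit = diff & -diff;  /* lowest differing bit       */
        cur ^= bit;                   /* hop to that neighbor       */
        printf(" %u", cur);
    }
    printf("\n");
}

int main(void) {
    route(0x0, 0xF);   /* 0000 -> 1111 in a 4-D cube: 4 hops          */
    route(0x5, 0x6);   /* 0101 -> 0110: Hamming distance 2, so 2 hops */
    return 0;
}
```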

  38. Systolic Array?
  • Array of processors in some topology, with each processor having a few neighbors
    • typically a 1-D linear array or a 2-D grid
  • Processors perform regular sequences of operations on data that flow between them
    • e.g., receive from my left and top neighbors, compute, pass to my right and bottom neighbors
  • Like SIMD machines, everything happens in lockstep (see the toy simulation below)
  • Example: CMU's iWarp, built by Intel (1988 or so)
  • Allows for convenient algorithms for some problems
  • Today: used in FPGA systems that implement systolic arrays to run a few algorithms
    • regular computations (matrix multiply)
    • genetic algorithms
  • Impact: allows us to reason about algorithms
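A toy, purely illustrative simulation of the lockstep idea (not modeled on iWarp or any machine from the slides): a 1-D array of PEs through which partial sums flow one hop per clock tick, so the total of the local values emerges from the last PE after N ticks.

```c
/* Toy, step-by-step simulation of a 1-D systolic array (illustrative only).
 * Each PE holds one local value; on every clock tick a partial sum enters
 * from the left, each PE adds its value and passes the result right, so the
 * total sum emerges from the rightmost PE after N ticks. */
#include <stdio.h>

#define N 5

int main(void) {
    int local[N] = {3, 1, 4, 1, 5};   /* one value per processing element */
    int in[N + 1] = {0};              /* in[i] = value flowing into PE i  */

    for (int tick = 0; tick < N; tick++) {
        /* One lockstep tick: every PE reads its left input, adds its local
         * value, and forwards the result one position to the right. */
        for (int i = N; i > 0; i--)
            in[i] = in[i - 1] + local[i - 1];
        in[0] = 0;                    /* nothing new enters from the left */
        printf("tick %d: output of last PE = %d\n", tick + 1, in[N]);
    }
    return 0;   /* after N ticks the last output equals 3+1+4+1+5 = 14 */
}
```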

  39. Models for Parallel Computation
  • We have seen broad taxonomies of machines, examples of machines, and techniques to program them (OpenMP, MPI, etc.)
  • At this point, how does one reason about parallel algorithms, about their complexity, about their design, etc.?
  • What one needs is abstract models of parallel platforms
    • Some are really abstract
    • Some are directly inspired by actual machines
  • Although these machines may no longer exist or be viable, the algorithms can be implemented on more relevant architectures, or at least give us clues
    • e.g., matrix multiply on a systolic array helps with doing matrix multiply on a logical 2-D grid topology that sits on top of a cluster of workstations
  • PRAM, sorting networks, systolic arrays, etc.
