
CS 491 Parallel & Distributed Computing


Presentation Transcript


  1. CS 491 Parallel & Distributed Computing

  2. Welcome to CS 491 • Instructor • Dan Stevenson • Office: P 136 • stevende@uwec.edu • Course Web Site: • http://www.cs.uwec.edu/~stevende/cs491/

  3. Getting Help: When you have questions • Regarding HELP with course materials and assignments • Come to office hours – Phillips 136 • TIME TBD (Check website) • OR by appointment (just e-mail or call my office) • Send me an e-mail: stevende@uwec.edu

  4. Textbooks • Required: • Michael J. Quinn • Parallel Programming in C with MPI and OpenMP • Suggested: • Web tutorials during the semester

  5. Overall Course Grading

  6. CS 491: Overview • Slide content for this course collected from various sources, including: • Dr. Henry Neeman, University of Oklahoma • Dr. Libby Shoop, Macalester College • Dr. Charlie Peck, Earlham College • Tom Murphy, Contra Costa College • Dr. Robert Panoff, Shodor Foundation • Others, who will be credited…

  7. “Parallel & Distributed Computing” • What does it mean to you? • Coordinating Threads • Supercomputing • Multi-core Processors • Beowulf Clusters • Cloud Computing • Grid Computing • Client-Server • Scientific Computing • All contexts for “splitting up work” in an explicit way

  8. CS 491 • In this course, we will draw mostly on the context of “Supercomputing” • This is the field with the longest record of parallel computing expertise. • It also has a long record of being a source for “trickle-down” technology.

  9. What is Supercomputing? • Supercomputing is the biggest, fastest computing - right this minute. • Likewise, a supercomputer is one of the biggest, fastest computers right this minute. • The definition of supercomputing is, therefore, constantly changing. • A Rule of Thumb: A supercomputer is typically at least 100 times as powerful as a PC. • Jargon: Supercomputing is also known as High Performance Computing (HPC) or High End Computing (HEC) or Cyberinfrastructure (CI).

  10. Fastest Supercomputer vs. Moore • GFLOPs: billions of calculations per second • Over recent years, supercomputers have benefited directly from microprocessor performance gains, and have also gotten better at coordinating their efforts.

  11. Recent Champion • Jaguar – Oak Ridge National Laboratory (TN) • 224,162 processor cores – 1.76 PetaFLOP/second

  12. Current Champ • 2008 IBM Roadrunner: 1.1 Petaflops • 2009 Cray Jaguar: 1.76 • 2010 Tianhe-1A (China): 2.6 • 2011 Fujitsu K (Japan): 10.5 • 88,128 8-core processors -> 705,024 cores • Needs power equivalent to 10,000 homes • Linpack numbers • Core i7 – 2.3 Gflops • Galaxy Nexus – 97 Mflops
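A back-of-the-envelope check on the K computer’s figure (assuming, from its published specifications, a 2 GHz clock and 8 floating-point operations per core per cycle): 705,024 cores × 2 GHz × 8 flops per cycle ≈ 11.3 PetaFLOPs of theoretical peak, so the 10.5 PetaFLOPs Linpack result corresponds to sustaining roughly 93% of peak.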

  13. Hold the Phone • Why should we care? • What useful thing actually takes a long time to run anymore? (especially long enough to warrant investing 7/8/9 figures on a supercomputer) • Important: It’s usually not about getting something done faster, but about getting a harder thing done in the same amount of time • This is often referred to as capability computing

  14. What Is HPC Used For? (slide image: tornadic storm) • Simulation of physical phenomena, such as: • Weather forecasting • Galaxy formation • Oil reservoir management • Data mining: finding needles of information in a haystack of data, such as: • Gene sequencing • Signal processing • Detecting storms that might produce tornadoes (want forecasting, not retrocasting…) • Visualization: turning a vast sea of data into pictures that a scientist can understand • Oak Ridge National Lab has a 512-core cluster devoted entirely to visualization runs

  15. CS 491 – Parallel and Distributed Computing

  16. What is Supercomputing About? • Size • Speed

  17. What is Supercomputing About? • Size: Many problems that are interesting™ can’t fit on a PC – usually because they need more than a few GB of RAM, or more than a few hundred GB of disk. • Speed: Many problems that are interesting™ would take a very, very long time to run on a PC: months or even years. But a problem that would take a month on a PC might take only a few hours on a supercomputer.

  18. Supercomputing Issues • Parallelism: doing multiple things at the same time • finding and coordinating this can be challenging • The tyranny of the storage hierarchy • The hardware you’re running on matters • Moving data around is often more expensive than actually computing something
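A minimal C sketch of the storage-hierarchy point (the 4096×4096 matrix size and the two loop orderings are illustrative choices, not anything from the slides): both functions do exactly the same arithmetic, but the first walks memory in the order C lays it out and tends to stay in cache, while the second strides across rows and tends to miss. Timing the two (e.g. with clock() or a profiler) typically shows a large gap even though the “computation” is identical – moving data dominates.

    #include <stdio.h>
    #include <stdlib.h>

    #define N 4096

    /* Row-major traversal: consecutive addresses, cache-friendly. */
    double sum_rows(const double *a) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i * N + j];
        return s;
    }

    /* Column-major traversal: stride of N doubles, cache-unfriendly. */
    double sum_cols(const double *a) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i * N + j];
        return s;
    }

    int main(void) {
        double *a = malloc((size_t)N * N * sizeof(double));
        for (long k = 0; k < (long)N * N; k++) a[k] = 1.0;
        printf("row-major sum:    %.0f\n", sum_rows(a));
        printf("column-major sum: %.0f\n", sum_cols(a));
        free(a);
        return 0;
    }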

  19. Parallel Computing Hardware

  20. Parallel Processing • The term parallel processing is usually reserved for the situation in which a single task is executed on multiple processors • This discounts the idea of simply running separate tasks on separate processors – a common thing to do to get high throughput, but not really parallel processing • Key questions in hardware design: • How do parallel processors share data and communicate? • shared memory vs. distributed memory • How are the processors connected? • single bus vs. network • The number of processors is determined by a combination of the answers to these two questions

  21. How is Data Shared? • Shared Memory Systems • All processors share one memory address space and can access it • Information sharing is often implicit • Distributed Memory Systems (AKA “Message Passing Systems”) • Each processor has its own memory space • All data sharing is done via programming primitives to pass messages • i.e. “Send data value to processor 3” • Information sharing is always explicit

  22. Message Passing • Processors communicate via messages that they send to each other: send and receive • This form is required for multiprocessors that have separate private memories for each processor • Cray T3E • “Beowulf Cluster” • SETI@HOME • Note: shared memory multiprocessors can also have separate memories – they just aren’t “private” to each processor
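A minimal sketch of those primitives in C with MPI (the library the course uses later); the value 42 and the message tag 0 are arbitrary, and it assumes the program is launched with at least two processes. Rank 0’s memory is private, so the only way rank 1 can see the value is to receive it as an explicit message.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;   /* exists only in rank 0's private memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Built and run in the usual way, e.g. mpicc send.c -o send followed by mpirun -np 2 ./send.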

  23. Shared Memory Systems • Processors all operate independently, but operate out of the same logical memory. • Data structures can be read by any of the processors • To properly maintain ordering in our programs, synchronization primitives are needed! (locks/semaphores)
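The shared-memory counterpart, sketched with OpenMP (the other model in the course textbook): every thread updates the same counter in the same address space, so a synchronization primitive – here a critical section – is needed to keep the updates correct. The loop bound is arbitrary; without the critical section the increments race and the final value varies from run to run.

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        long counter = 0;   /* one logical memory: every thread sees this variable */

        #pragma omp parallel for
        for (long i = 0; i < 1000000; i++) {
            #pragma omp critical   /* only one thread at a time may do the update */
            counter++;
        }

        printf("counter = %ld\n", counter);   /* 1000000 every time, thanks to the lock */
        return 0;
    }

Compiled with OpenMP enabled, e.g. gcc -fopenmp counter.c.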

  24. Connecting Multiprocessors

  25. Single Bus Multiprocessor • Connect several processors via a single shared bus • bus bandwidth limits the number of processors • local cache lowers bus traffic • single memory module attached to the bus • Limited to very small systems! • Intel processors support this mode by default

  26. The Cache Coherence Problem

  27. Cache Coherence Solutions • Two most common variations: • “snoopy” schemes • rely on broadcast to observe all coherence traffic • well suited for buses and small-scale systems • example: SGI Challenge or Intel x86 • directory schemes • uses centralized information to avoid broadcast • scales well to large numbers of processors • example: SGI Origin/Altix

  28. Snoopy Cache Coherence Schemes • Basic Idea: • all coherence-related activity is broadcast to all processors • e.g., on a global bus • each processor monitors (aka “snoops”) these actions and reacts to any which are relevant to the current contents of its cache • examples: • if another processor wishes to write to a line, you may need to “invalidate” (i.e. discard) the copy in your own cache • if another processor wishes to read a line for which you have a dirty copy, you may need to supply it • Most common approach in commercial shared-memory multiprocessors. • Protocol is a distributed algorithm: cooperating state machines • Set of states, state transition diagram, actions
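As a rough illustration of the “cooperating state machines” idea, here is a toy invalidate-based protocol in C with the classic Modified/Shared/Invalid states. It is a simplification for this course, not the protocol of any particular machine, and only the state transitions are shown – the bus actions (broadcasting the invalidate, supplying dirty data) appear only as comments.

    /* Toy MSI-style snooping protocol: state machine for one cache line. */
    typedef enum { INVALID, SHARED, MODIFIED } line_state;

    typedef enum {
        LOCAL_READ, LOCAL_WRITE,   /* requests from this processor           */
        BUS_READ,   BUS_WRITE      /* traffic snooped from other processors  */
    } event;

    line_state next_state(line_state s, event e) {
        switch (e) {
        case LOCAL_READ:  return (s == INVALID) ? SHARED : s;  /* miss: fetch a shared copy          */
        case LOCAL_WRITE: return MODIFIED;                      /* broadcast invalidate, own dirty copy */
        case BUS_READ:    return (s == MODIFIED) ? SHARED : s;  /* supply dirty data, drop to shared    */
        case BUS_WRITE:   return INVALID;                       /* another writer: discard our copy     */
        }
        return s;
    }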

  29. Network Connected Multiprocessors • In the single bus case, the bus is used for every main memory access • In the network connected model, the network is used only for inter-process communication • There are multiple “memories”, BUT that doesn’t mean that there are separate memory spaces

  30. Directory Coherence • Network-based machines do not want to use a snooping coherence protocol! • Means that every memory transaction would need to be sent everywhere! • Directory-based systems use a global “Directory” to arbitrate who owns data • Point-to-point communication with the directory instead of bus broadcasts • The directory keeps a list of what caches have the data in question • When a write to that data occurs, all of the affected caches can be notified directly
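A hedged sketch of what a directory entry might look like in C: one bit per cache records who holds a copy, and on a write the directory contacts only those caches. The MAX_CACHES constant and the send_invalidate() stub are illustrative stand-ins – in a real machine the “send” is a point-to-point network message handled by hardware.

    #include <stdint.h>

    #define MAX_CACHES 64

    /* One directory entry per memory block: which caches currently hold it. */
    typedef struct {
        uint64_t sharers;   /* bit i set => cache i has a copy of this block      */
        int      owner;     /* cache with a dirty copy, or -1 if memory is clean  */
    } dir_entry;

    /* Illustrative stub: real hardware sends a network message here. */
    static void send_invalidate(int cache_id) { (void)cache_id; }

    /* A write by cache 'writer': notify exactly the listed sharers, no broadcast. */
    void directory_handle_write(dir_entry *e, int writer) {
        for (int i = 0; i < MAX_CACHES; i++)
            if (((e->sharers >> i) & 1) && i != writer)
                send_invalidate(i);
        e->sharers = 1ULL << writer;   /* writer is now the only holder */
        e->owner   = writer;
    }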

  31. Network Topologies: Ring • Each node (processor) contains its own local memory • Each node is connected to the network via a switch • Messages hop along the ring from node to node until they reach the proper destination

  32. Network Topologies: 2D Mesh • 2D grid, or mesh, of nodes • Each “inside” node has 4 neighbors • “outside” nodes only have 2 • If all nodes have four neighbors, then this is a 2D torus

  33. Network Topologies: Hypercube • Also called an n-cube • For n=2 → 2D cube (4 nodes – a square) • For n=3 → 3D cube (8 nodes) • For n=4 → 4D cube (16 nodes) • In an n-cube, all nodes have n neighbors (figures: 3-cube, 4-cube)
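With the usual node labeling (each of the 2^n nodes gets an n-bit number), two hypercube nodes are neighbors exactly when their labels differ in one bit, which is where the “n neighbors” property comes from. A small C sketch of that, with n=3 chosen just for the demonstration:

    #include <stdio.h>

    /* Print the n neighbors of 'node' in an n-cube: flip one bit at a time. */
    void hypercube_neighbors(unsigned node, int n) {
        printf("node %u:", node);
        for (int bit = 0; bit < n; bit++)
            printf(" %u", node ^ (1u << bit));
        printf("\n");
    }

    int main(void) {
        int n = 3;   /* 3-cube: 8 nodes, each with 3 neighbors */
        for (unsigned node = 0; node < (1u << n); node++)
            hypercube_neighbors(node, n);
        return 0;
    }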

  34. Network Topologies: Full Crossbar • Every node can communicate directly with every other node in only one pass → fully connected network • n nodes → n² switches • Therefore, extremely expensive to implement!

  35. Network Topologies: Butterfly Network (figure labels: Omega network, switch box) • Fully connected, but requires passes through multiple switch boxes • Less hardware required than a crossbar, but contention can occur

  36. Flynn’s Taxonomy of Computer Systems (1966) A simple model for categorizing computers: 4 categories: • SISD – Single Instruction Single Data • the standard uniprocessor model • SIMD – Single Instruction Multiple Data • Full systems that are “true” SIMD are no longer in use • Many of the concepts live on in vector processing and, to some extent, in graphics cards • MISD – Multiple Instruction Single Data • doesn’t really make sense • MIMD – Multiple Instruction Multiple Data • the most common model in use

  37. “True” SIMD • A single instruction is applied to multiple data elements in parallel – same operation on all elements at the same time • Most well known examples are: • Thinking Machines CM-1 and CM-2 • MasPar MP-1 and MP-2 • others • All are out of existence now • SIMD requires massive data parallelism • Usually have LOTS of very very simple processors (e.g. 8-bit CPUs)

  38. Vector Processors • Closely related to SIMD • Cray J90, Cray T90, Cray SV1, NEC SX-6 • Starting to “merge” with MIMD systems • Cray X1E and upcoming systems (“Cascade”) • Use a single instruction to operate on an entire vector of data • Difference from “True” SIMD is that data in a vector processor is not operated on in true parallel, but rather in a pipeline • Uses “vector registers” to feed a pipeline for the vector operation • Generally have memory systems optimized for “streaming” of large amounts of consecutive or strided data • (Because of this, they typically didn’t have caches until the late 90s)
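The kind of code a vector machine streams through its pipeline is a simple element-wise loop over long arrays, such as the classic DAXPY below (generic C, not code for any particular vector machine); restrict is a hint that the arrays don’t overlap, which is what lets the whole loop be issued as vector operations out of vector registers.

    /* DAXPY: y = a*x + y applied across an entire vector of data. */
    void daxpy(long n, double a, const double *restrict x, double *restrict y) {
        for (long i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }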

  39. MIMD • Multiple instructions are applied to multiple data • The multiple instructions can come from the same program, or from different programs • Generally “parallel processing” implies the first • Most modern multiprocessors are of this form • IBM Blue Gene, Cray T3D/T3E/XT3/4/5, SGI Origin/Altix • Clusters

  40. Parallel Computing Hardware “Supercomputer Edition”

  41. The Most Common Supercomputer: Clustering • A parallel computer built out of commodity hardware components • PCs or server racks • Commodity network (like Ethernet) • Often running a free-software OS like Linux with a low-level software library to facilitate multiprocessing • Use software to send messages between machines • Standard is to use MPI (Message Passing Interface)
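For reference, the smallest useful MPI program in C: each process on the cluster learns its rank and the total number of processes, which is the usual starting point for dividing up work. The file name and launch commands afterwards are just the conventional ones.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);                 /* start the MPI runtime        */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?          */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes in total? */
        printf("hello from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

Typically built and launched with mpicc hello.c -o hello and mpirun -np 4 ./hello (or through the cluster’s batch scheduler).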

  42. What is a Cluster? “… [W]hat a ship is … It's not just a keel and hull and a deck and sails. That's what a ship needs. But what a ship is ... is freedom.” – Captain Jack Sparrow “Pirates of the Caribbean”

  43. What a Cluster is … • A cluster needs a collection of small computers, called nodes, hooked together by an interconnection network • It also needs software that allows the nodes to communicate over the interconnect. • But what a cluster is … is all of these components working together as if they’re one big computer (a supercomputer)

  44. What a Cluster is …. • nodes • PCs • Server rack nodes • interconnection network • Ethernet (“GigE”) • Myrinet (“10GigE”) • Infiniband (low latency) • The Internet (not really – typically called “Grid”) • software • OS • Generally Linux • Redhat / CentOS / SuSE • Windows HPC Server • Libraries (MPICH, PBLAS, MKL, NAG) • Tools (Torque/Maui, Ganglia, GridEngine)

  45. An Actual (Production) Cluster (photo labels: interconnect, nodes)

  46. Other Actual Clusters…

  47. What a Cluster is NOT… • At the high end, many supercomputers are made with custom parts • Custom backplane/network • Custom/Reconfigurable processors • Extreme custom cooling • Custom memory system • Examples: • IBM Blue Gene • Cray XT4/5/6 • SGI Altix

  48. Moore’s Law

  49. Moore’s Law • In 1965, Gordon Moore was an engineer at Fairchild Semiconductor. • He noticed that the number of transistors that could be squeezed onto a chip was doubling about every 18 months. • It turns out that computer speed was roughly proportional to the number of transistors per unit area. • Moore wrote a paper about this concept, which became known as “Moore’s Law.”

  50. Fastest Supercomputer vs. Moore • GFLOPs: billions of calculations per second
