Learn about constructing a system with multiple computers or processors and the basics of parallel programming. Explore shared memory multiprocessors and distributed memory multicomputers, as well as programming alternatives and software tools for clusters.
Constructing a system with multiple computers or processors
ITCS 4/5145 Parallel Programming, UNC-Charlotte, B. Wilkinson, 2014. slides1-b.ppt, Jan 13, 2014
Conventional Computer
Consists of a processor executing a program stored in a (main) memory. Each main memory location is located by its address. Addresses start at 0 and extend to 2^b - 1 when there are b bits (binary digits) in the address.
[Figure: main memory connected to a processor; instructions flow to the processor, data flows to or from the processor]
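With b address bits there are 2^b distinct addresses, numbered 0 to 2^b - 1. A minimal sketch of that arithmetic follows; the function name `address_range` is made up for illustration and does not come from the slides.

```python
def address_range(b):
    """Return (lowest, highest) address for a memory with b address bits."""
    if b < 1:
        raise ValueError("need at least one address bit")
    # b bits can encode 2**b distinct values, numbered from 0.
    return 0, 2**b - 1


for bits in (16, 32, 64):
    low, high = address_range(bits)
    print(f"{bits}-bit addresses: {low} to {high} ({high + 1} locations)")
```

For example, 16 address bits give addresses 0 to 65535.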
Types of Parallel Computers
Two principal approaches:
• Shared memory multiprocessor
• Distributed memory multicomputer
1. Shared Memory Multiprocessor System
Natural way to extend the single-processor model: have multiple processors connected to multiple memory modules, such that each processor can access any memory module.
[Figure: processors connected through processor-memory interconnections to memory modules that form one address space]
Using a processor-memory bus as the interconnection network
Example: dual- and quad-processor shared memory multiprocessors
[Figure: four processors, each with its own L1 cache, L2 cache, and bus interface, attached to a processor/memory bus (annotated "Set of lines 100+"); a memory controller connects the bus to the shared memory]
coit-grid01 through coit-grid04 are of this form (each is a dual-processor server with 8 GB of shared memory).
Dual-core and multi-core processors
"Recent" innovation (since 2005)
• Two or more independent processors in one integrated circuit package (chip)
• Actually an old idea, but not put into wide practice until recently. The change came because of limits on making single processors faster, principally caused by:
  • Power dissipation (power wall) and clock frequency limitations
  • Memory speed limitations (memory wall)
  • Limits on parallelism within a single instruction stream (instruction parallelism wall)
Single quad-core shared memory multiprocessor
[Figure: one chip containing four processors ("cores"), each with its own L1 cache, above an L2 cache; a memory controller connects the chip to the shared memory]
Multiple quad-core multiprocessors
[Figure: several multi-core chips; each core has its own L1 cache, with L2 caches and a possible L3 cache, all connected through a memory controller to shared memory]
Examples
• cci-grid05.uncc.edu: four processors, each quad-core, 16 cores total. All 16 cores have access to 64 GB of shared main memory (through multilevel caches).
• cci-grid09.uncc.edu: two processors, each 16-core, 32 cores total. Three levels of caches.
Programming Shared Memory Multiprocessors
Several possible ways. The usual approach is to use threads.
Threads: individual parallel sequences of execution, each thread having its own local variables but able to access shared variables declared outside the threads.
1. Low-level thread libraries: the programmer calls thread routines to create and control the threads. Examples: Pthreads, Java threads.
2. Higher-level library functions and preprocessor compiler directives. Example: OpenMP, an industry standard consisting of library functions, compiler directives, and environment variables.
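The thread model described here can be sketched with Python's standard `threading` module, used as a stand-in for a low-level thread library such as Pthreads. The function name `parallel_sum` and the striped way of splitting the data are assumptions made for illustration. The sketch shows the programming model only: in standard CPython builds the global interpreter lock usually prevents CPU-bound threads from running faster than sequential code.

```python
import threading


def parallel_sum(numbers, num_threads=4):
    """Sum a list using several threads that update one shared total."""
    total = 0                    # shared variable, declared outside the threads
    lock = threading.Lock()      # protects updates to the shared total

    def worker(chunk):
        nonlocal total
        partial = 0              # local variable, private to this thread
        for x in chunk:
            partial += x
        with lock:               # only one thread at a time updates total
            total += partial

    # Thread i takes elements i, i + num_threads, i + 2 * num_threads, ...
    threads = [
        threading.Thread(target=worker, args=(numbers[i::num_threads],))
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()                # create and launch the threads
    for t in threads:
        t.join()                 # wait for every thread to finish
    return total


print(parallel_sum(list(range(1, 101))))  # prints 5050
```

Each thread accumulates into its own local `partial` and touches the shared `total` only once, under the lock, which keeps contention on the shared variable low.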
Other programming alternatives
• Parallelizing compilers, which compile regular sequential programs and make them parallel programs
• Special parallel languages
(Neither is common now.)
Tasks
Rather than program with threads, which are closely linked to the physical hardware, one can program with parallel "tasks." Promoted by Intel with their TBB (Threading Building Blocks) tools.
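The task idea can be sketched with Python's standard `concurrent.futures` module. This is not TBB; it only illustrates the same style, in which the program describes independent tasks and a runtime decides which thread runs each one. The prime-counting workload and the names `count_primes` and `count_primes_with_tasks` are assumptions chosen for illustration.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def count_primes(low, high):
    """One task: count the primes in the half-open range [low, high)."""
    count = 0
    for n in range(max(low, 2), high):
        # n is prime if no d in 2..isqrt(n) divides it.
        if all(n % d for d in range(2, math.isqrt(n) + 1)):
            count += 1
    return count


def count_primes_with_tasks(limit, num_tasks=8, max_workers=4):
    """Split [0, limit) into tasks and let the executor schedule them."""
    if limit <= 0:
        return 0
    step = -(-limit // num_tasks)   # ceiling division: size of each task's range
    ranges = [(lo, min(lo + step, limit)) for lo in range(0, limit, step)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # The program submits tasks; the pool maps them onto its threads.
        futures = [pool.submit(count_primes, lo, hi) for lo, hi in ranges]
        return sum(f.result() for f in futures)


print(count_primes_with_tasks(100))  # prints 25
```

The number of tasks (8 here) is independent of the number of worker threads (4 here), which is the point of the approach: the decomposition of the problem is not tied to the hardware.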
2. Distributed Memory Multicomputer
Complete computers connected through an interconnection network.
Many interconnection networks were explored in the 1970s and 1980s, including 2- and 3-dimensional meshes, hypercubes, and multistage interconnection networks.
[Figure: computers, each with a processor and local memory, exchanging messages through an interconnection network]
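As a rough illustration of two of the topologies named above: in a d-dimensional hypercube the 2^d nodes carry d-bit labels and each node links to the d nodes whose labels differ from its own in exactly one bit, while in a 2-dimensional mesh without wraparound each node links to its up, down, left, and right neighbors. The function names below are made up for this sketch.

```python
def hypercube_neighbors(node, dimensions):
    """Neighbors of a node in a hypercube: flip each label bit in turn."""
    if not 0 <= node < 2**dimensions:
        raise ValueError("node label out of range")
    return [node ^ (1 << bit) for bit in range(dimensions)]


def mesh_neighbors(row, col, rows, cols):
    """Neighbors of (row, col) in a rows x cols mesh with no wraparound."""
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    # Nodes on an edge or corner of the mesh have fewer than four neighbors.
    return [(r, c) for r, c in candidates if 0 <= r < rows and 0 <= c < cols]


print(hypercube_neighbors(5, 3))   # prints [4, 7, 1]
print(mesh_neighbors(0, 0, 3, 3))  # prints [(1, 0), (0, 1)]
```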
Networked Computers as a Computing Platform
• Became a very attractive alternative to expensive supercomputers and parallel computer systems for high-performance computing in the early 1990s.
• Several early projects. Notable: the NASA Beowulf project.
• "Beowulf" cluster: a group of interconnected "commodity" computers achieving high performance at low cost, typically using commodity interconnects (high-speed Ethernet) and the Linux OS.
Key advantages of using commodity networked computers:
• Very high performance workstations and PCs are readily available at low cost.
• The latest processors can easily be incorporated into the system as they become available.
• Existing software can be used or modified.
Cluster Interconnects
• Originally Fast Ethernet on low-cost clusters
• Gigabit Ethernet: easy upgrade path
• More specialized, higher-performance interconnects are available, including Myrinet and InfiniBand.
Dedicated cluster with a master node and compute nodes
[Figure: user computers reach the dedicated cluster over an external network; the master node attaches to that network through an Ethernet interface and to the compute nodes through a switch on a local network]
Software Tools for Clusters
• Based upon the message-passing programming model
• User-level libraries are provided for explicitly specifying messages to be sent between executing processes on each computer.
• Used with regular programming languages (C, C++, ...).
• Can be quite difficult to program correctly, as we shall see.
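The message-passing model can be sketched on a single machine with Python's standard `subprocess` module. This is a stand-in, not a cluster library: each worker is a separate process with its own private memory, and the only way data moves is as explicit messages (JSON text over pipes). The names `WORKER_SOURCE` and `distributed_sum` are assumptions made for illustration.

```python
import json
import subprocess
import sys

# Code run by each worker process. A worker shares no variables with the
# master; it sees only the message that arrives on its standard input.
WORKER_SOURCE = """
import json, sys
numbers = json.loads(sys.stdin.read())   # receive one message
print(json.dumps(sum(numbers)))          # send one message back
"""


def distributed_sum(numbers, num_workers=4):
    """Sum a list by sending one chunk to each worker process."""
    chunks = [numbers[i::num_workers] for i in range(num_workers)]
    workers = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in chunks
    ]
    # Send phase: the master explicitly sends each worker its chunk.
    for worker, chunk in zip(workers, chunks):
        worker.stdin.write(json.dumps(chunk))
        worker.stdin.close()             # end of message
    # Receive phase: the master collects one reply from each worker.
    total = 0
    for worker in workers:
        total += json.loads(worker.stdout.read())
        worker.stdout.close()
        worker.wait()
    return total


if __name__ == "__main__":
    print(distributed_sum(list(range(100))))  # prints 4950
```

Even in this small case the master has to get the order of sends and receives right, for instance by closing each pipe so the worker knows the message is complete. That bookkeeping is one reason message-passing programs are harder to write correctly than they first appear.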
Using GPUs for High Performance Computing
• GPUs (graphics processing units) were originally designed to speed up and support graphics operations.
• Now also used for high performance computing.
• GPUs now have hundreds or thousands of processing cores and can provide orders-of-magnitude increases in execution speed.
We will look at GPU devices and how to program them in the last few weeks of the course.
GPU clusters
• Recent trend for clusters: incorporating GPUs for high performance.
• Many of the fastest computers in the world are GPU clusters (see http://www.top500.org/).
The UNC-C cluster used in the course has three GPU servers:
• coit-grid06.uncc.edu (C2050 GPU)
• coit-grid07.uncc.edu (C2050 GPU)
• coit-grid08.uncc.edu (K20 GPU)
[Figure: K20x GPUs; C2050 GPUs]
Next step
• Learn how to program multiprocessor systems.
• We will start with a new pattern programming approach and later consider lower-level tools.