Computer Architecture II: Introduction
Today’s overview • Why parallel computing? • Technology trends • Processors • Storage • Architectural • Application trends • Challenging computational problems • What is a parallel computer? • Classical parallel computer classifications • Architecture • Memory access • Cluster and grid computing (definitions) • Top 500 • Parallel architectures and their convergence
Units of Measure in HPC • High Performance Computing (HPC) units are: • Flops: floating point operations • Flop/s: floating point operations per second • Bytes: size of data (a double-precision floating point number is 8 bytes long) • Typical sizes are millions, billions, trillions… • Mega: Mflop/s = 10^6 flop/sec, Mbyte = 10^6 bytes (also 2^20 = 1,048,576) • Giga: Gflop/s = 10^9 flop/sec, Gbyte = 10^9 bytes (also 2^30 = 1,073,741,824) • Tera: Tflop/s = 10^12 flop/sec, Tbyte = 10^12 bytes (also 2^40 = 1,099,511,627,776) • Peta: Pflop/s = 10^15 flop/sec, Pbyte = 10^15 bytes (also 2^50 = 1,125,899,906,842,624) • Exa: Eflop/s = 10^18 flop/sec, Ebyte = 10^18 bytes
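The decimal and binary readings of the byte prefixes above drift apart as the prefix grows, which matters when quoting memory sizes. A minimal C sketch using only the prefix values from the table (no assumptions beyond standard C):

/* Decimal vs. binary interpretation of the HPC size prefixes above. */
#include <stdio.h>

int main(void) {
    /* flop/s prefixes are always decimal; byte prefixes are often quoted
       either way, so the gap below matters for capacity claims. */
    double mega_dec = 1e6,  mega_bin = (double)(1ULL << 20);
    double giga_dec = 1e9,  giga_bin = (double)(1ULL << 30);
    double tera_dec = 1e12, tera_bin = (double)(1ULL << 40);

    printf("Mega: binary is %.2f%% larger\n", 100.0 * (mega_bin / mega_dec - 1.0));
    printf("Giga: binary is %.2f%% larger\n", 100.0 * (giga_bin / giga_dec - 1.0));
    printf("Tera: binary is %.2f%% larger\n", 100.0 * (tera_bin / tera_dec - 1.0));
    return 0;
}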
Why parallel computing? • Sequential computer • von Neumann model • One processor • One memory • One instruction executed at a time • Fastest sequential machines: a few billion operations per second (GFLOPS) (Figure: von Neumann machine: a processor containing a control unit and an arithmetic logic unit, plus connecting logic, an I/O system, and memory.)
Tunnel Vision by Experts • “I think there is a world market for maybe five computers.” • Thomas Watson, chairman of IBM, 1943. • “There is no reason for any individual to have a computer in their home.” • Ken Olson, president and founder of Digital Equipment Corporation, 1977. • “640K [of memory] ought to be enough for anybody.” • Bill Gates, chairman of Microsoft, 1981. Slide source: Warfield et al.
Technology Trends: Microprocessor Capacity • Moore’s Law: Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months (2x transistors per chip every ~1.5 years). • Microprocessors have become smaller, denser, and more powerful. Slide source: Jack Dongarra
Impact of Device Shrinkage • What happens when the transistor size shrinks by a factor of x? • Clock rate goes up by x because wires are shorter • actually less than x, because of power consumption • Transistors per unit area go up by x^2 • Die size also tends to increase • typically another factor of ~x • Raw computing power of the chip goes up by ~x^4! • of which x^3 is devoted either to parallelism or locality
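A quick worked instance of these scaling rules, sketched in C for a hypothetical shrink factor x = 2 (the numbers are illustrative, not measurements):

/* Scaling rules from the slide above, evaluated for x = 2. */
#include <stdio.h>

int main(void) {
    double x = 2.0;                          /* transistor size shrinks by 2x    */
    double clock   = x;                      /* shorter wires -> clock up by ~x  */
    double density = x * x;                  /* transistors per area up by x^2   */
    double die     = x;                      /* die size tends to grow by ~x     */
    double raw     = clock * density * die;  /* ~x^4 raw capability              */

    printf("clock x%.0f, transistors/area x%.0f, raw power x%.0f\n",
           clock, density, raw);
    printf("of that, x%.0f must go into parallelism or locality\n",
           density * die);                   /* the x^3 that is not clock rate   */
    return 0;
}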
Microprocessor Transistors per Chip (Figure: growth in transistors per chip, and the accompanying increase in clock rate, over time.)
Limiting forces: Increased cost and difficulty of manufacturing
How fast can a serial computer be? (James Demmel) • Consider a 1 Tflop/s, 1 Tbyte sequential machine: • Data must travel some distance, r, to get from memory to CPU. • To get 1 data element per cycle, data must make the trip 10^12 times per second at the speed of light, c = 3x10^8 m/s. Thus r < c/10^12 = 0.3 mm. • Now put 1 Tbyte of storage in a 0.3 mm x 0.3 mm area: • Each bit occupies about 1 square Angstrom, roughly the size of a small atom. • No choice but parallelism
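The same back-of-the-envelope argument, sketched in C so the two numbers above can be reproduced (only the stated speed of light and machine size are used):

/* Back-of-the-envelope limit for a 1 Tflop/s, 1 Tbyte serial machine. */
#include <stdio.h>

int main(void) {
    double c    = 3.0e8;     /* speed of light, m/s                          */
    double rate = 1.0e12;    /* 1 Tflop/s -> one operand fetched per cycle   */
    double r    = c / rate;  /* max memory-to-CPU distance per cycle, m      */

    double bits = 8.0e12;                          /* 1 Tbyte = 8e12 bits    */
    double area_per_bit = (r * r) / bits;          /* pack 1 Tbyte into r x r */
    double angstrom2    = area_per_bit / 1.0e-20;  /* 1 A^2 = 1e-20 m^2      */

    printf("r = %.1e m (%.1f mm)\n", r, r * 1e3);
    printf("area per bit = %.2f square Angstroms\n", angstrom2);
    return 0;
}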
Storage: Locality and Parallelism • Large memories are slow, fast memories are small • Storage hierarchies are large and fast on average • Parallel processors, collectively, have large, fast cache ($) • the slow accesses to “remote” data we call “communication” • Algorithm should do most work on local data (Figure: conventional storage hierarchy: each processor has its own cache, L2 cache, and L3 cache in front of memory, with potential interconnects between the nodes.)
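To make the locality point concrete, here is a minimal C micro-benchmark (my own illustration, not from the slides): the same sum over a matrix, walked once in cache-friendly order and once in cache-hostile order.

/* Same work, different locality: unit-stride vs. strided traversal. */
#include <stdio.h>
#include <time.h>

#define N 2048

static double a[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0;

    clock_t t0 = clock();
    double s1 = 0.0;
    for (int i = 0; i < N; i++)          /* row-major walk: consecutive   */
        for (int j = 0; j < N; j++)      /* words share cache lines,      */
            s1 += a[i][j];               /* so few misses                 */
    clock_t t1 = clock();

    double s2 = 0.0;
    for (int j = 0; j < N; j++)          /* column-major walk: stride N,  */
        for (int i = 0; i < N; i++)      /* nearly every access misses    */
            s2 += a[i][j];               /* in the cache                  */
    clock_t t2 = clock();

    printf("row-major: %.3f s (sum %.0f)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, s1);
    printf("col-major: %.3f s (sum %.0f)\n", (double)(t2 - t1) / CLOCKS_PER_SEC, s2);
    return 0;
}

On most machines the strided walk is noticeably slower; in the parallel setting the penalty for non-local ("remote") data is larger still, which is the communication cost the slide refers to.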
Processor-DRAM Gap (latency) (Figure: performance vs. time, 1980-2000, log scale: processor (µProc) performance improves ~60%/year ("Moore's Law") while DRAM latency improves only ~7%/year, so the processor-memory performance gap grows about 50% per year.)
Storage Trends • Divergence between memory capacity and speed even more pronounced • Capacity increased by 1000x from 1980-95, speed only 2x • Gigabit DRAM by c. 2000, but gap with processor speed much greater • Larger memories are slower, while processors get faster • Need to transfer more data in parallel • Need deeper cache hierarchies • How to organize caches? • Parallelism increases effective size of each level of hierarchy, without increasing access time • Disks: parallel disks plus caching
Architectural Trends • Resolve the tradeoff between parallelism and locality • Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect • Tradeoffs may change with scale and technology advances • Understanding microprocessor architectural trends => helps build intuition about design issues of parallel machines => shows the fundamental role of parallelism even in "sequential" computers
Phases in “VLSI” Generation
Architectural Trends • Greatest trend across VLSI generations is the increase in parallelism • Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit • slows after 32-bit • adoption of 64-bit now under way (Opteron, Itanium); 128-bit is far off (not a performance issue) • Mid 80s to mid 90s: instruction-level parallelism (ILP) • pipelining and simple instruction sets, plus compiler advances (RISC) • on-chip caches and functional units => superscalar execution • greater sophistication: out-of-order execution, speculation, prediction • Current step: • thread-level parallelism • multicore
Pipeline of a superscalar processor (Figure: the front-end fetch/decode stages and the final commit stage are in-order; the execution core in between is out-of-order.)
How far will ILP go? • Simulation study of the maximum available ILP, assuming: • Infinite fetch bandwidth • Infinite functional units • Perfect branch prediction • Cache misses: 0 cycles
Multithreaded architectures
Multithreaded architectures • Examples: Pentium 4 Xeon, UltraSPARC T1 (32 and 64 threads), Itanium Montecito (also dual-core)
Multi-core • Intel: • Dual-core Pentium Extreme Edition 840 (first) • Quad-Core Xeon 5300 • 80-core research chip capable of 1.28 TFLOPS • AMD: Dual-Core Opteron, Quad-Core FX (3 GHz) • Sun: Rock, 16 cores (due 2008) • IBM: dual-core POWER6 at 5 GHz
Alternative: Cell • A general-purpose Power Architecture core of modest performance • plus coprocessing elements that accelerate multimedia and vector processing applications • PowerPC core • controls 8 SPEs (Synergistic Processing Elements): SIMD • Cache coherent • 25.6 GB/s XDR memory controller
Alternative: Cell • SPE storage hierarchy: • 128 x 128-bit registers, single-cycle access • 16K x 128-bit local store (256 KB), 6-cycle access • DMA in parallel with SIMD processing
Overview of Cell processor
Application Trends • Demand for cycles fuels advances in hardware, and vice versa (new applications demand more performance; more performance enables new applications) • This cycle drives the exponential increase in microprocessor performance • Drives parallel architecture harder: the most demanding applications • Goal of applications in using parallel machines: speedup • Speedup(p processors) = Performance(p processors) / Performance(1 processor) • For a fixed problem size (input data set), performance = 1/time, so Speedup_fixed_problem(p processors) = Time(1 processor) / Time(p processors)
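A tiny C sketch of the fixed-problem-size speedup definition above; the timings are hypothetical placeholders, not measurements from the slides.

/* Fixed-problem-size speedup = time(1 processor) / time(p processors). */
#include <stdio.h>

int main(void) {
    double t1  = 120.0;   /* time on 1 processor, seconds (hypothetical)  */
    double t4  = 35.0;    /* time on 4 processors (hypothetical)          */
    double t16 = 12.0;    /* time on 16 processors (hypothetical)         */

    printf("speedup(4)  = %.2f  (ideal 4,  efficiency %.0f%%)\n",
           t1 / t4,  100.0 * t1 / t4 / 4.0);
    printf("speedup(16) = %.2f  (ideal 16, efficiency %.0f%%)\n",
           t1 / t16, 100.0 * t1 / t16 / 16.0);
    return 0;
}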
Improving the speedup of Parallel Applications • AMBER molecular dynamics simulation program • Motion of large biological molecules (proteins, DNA) • 145 MFLOPS on a Cray C90; 406 MFLOPS for the final version on a 128-processor Paragon; 891 MFLOPS on a 128-processor Cray T3D • 8/94 version: optimized the communication • 9/94 version: optimized the load balance
Particularly Challenging Computations • Science • Global climate modeling • Astrophysical modeling • Biology: genomics, protein folding, drug design • Computational chemistry • Computational material sciences and nanosciences • Engineering • Crash simulation • Semiconductor design • Earthquake and structural modeling • Computational fluid dynamics (airplane design) • Combustion (engine design) • Business • Financial and economic modeling • Transaction processing, web services and search engines • Defense • Nuclear weapons (testing by simulation) • Cryptography
$5B Market in Technical Computing (Source: IDC 2004, from the US National Research Council report on the Future of Supercomputing)
Scientific Computing Demand
NRC report on Future of Supercomputing • “In climate modeling or plasma physics, there is a broad consensus that up to seven orders of magnitude of performance improvements will be needed to achieve well-defined computational goals.”
What is Parallel Architecture? • A parallel computer is a collection of processing elements that cooperate to solve large problems fast • Some broad issues: • Resource Allocation: • how large a collection? • how powerful are the elements? • how much memory? • Data access, Communication and Synchronization • how do the elements cooperate and communicate? • how are data transmitted between processors? • what are the abstractions and primitives for cooperation? • Performance and Scalability • how does it all translate into performance? • how does it scale?
Why Study Parallel Architecture? • Role of a computer architect: • To design and engineer the various levels of a computer system to maximize performance and programmability within limits of technology and cost. • Parallelism: • Provides alternative to faster clock for performance • Applies at all levels of system design • Is a fascinating perspective from which to view architecture • Is increasingly central in information processing
1st classification: Architecture • There are several different methods used to classify computers • No single taxonomy fits all designs • Flynn's taxonomy uses the relationship of program instructions to program data: • SISD - Single Instruction, Single Data Stream • SIMD - Single Instruction, Multiple Data Stream • MISD - Multiple Instruction, Single Data Stream (no practical examples) • MIMD - Multiple Instruction, Multiple Data Stream
SISD • One instruction stream • One data stream • One instruction issued on each clock cycle • One instruction executed on one element of data (scalar) at a time • Traditional von Neumann architecture
SIMD • Also a von Neumann architecture, but with more powerful instructions • Each instruction may operate on more than one data element • Usually an intermediate host executes the program logic and broadcasts instructions to the other processors • Synchronous (lockstep) • Rating how fast these machines can issue instructions is not a good measure of their performance • Two major types: • Vector SIMD • Parallel SIMD
Vector SIMD • A single instruction results in multiple operands being updated • Scalar processing operates on single data elements; vector processing operates on whole vectors (groups) of data at a time • Examples: • Cell • Cray-1 • NEC SX-2 • Fujitsu VP • Hitachi S820
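A minimal C sketch of the scalar/vector distinction (an illustrative loop, not code from the slides): on a SISD machine the loop body issues one multiply per element, while a vector SIMD machine, or an auto-vectorizing compiler targeting vector hardware, applies a single instruction to a whole group of elements.

/* Element-wise multiply: one instruction per element on a scalar machine,
   one instruction per vector-register-full of elements on a vector machine. */
#include <stdio.h>

#define N 1024

int main(void) {
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f; }

    for (int i = 0; i < N; i++)   /* vector hardware can process many i's */
        c[i] = a[i] * b[i];       /* per instruction, in lockstep         */

    printf("c[10] = %.1f\n", c[10]);
    return 0;
}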
Parallel SIMD • Several processors execute the same instruction in lockstep • Each processor modifies a different element of data • Drawback: idle processors • Advantage: no explicit synchronization required • Examples: • Connection Machine CM-2 • MasPar MP-1, MP-2
MIMD • Several processors executing different instructions on different data (see the sketch below) • Advantages: • different jobs can be performed at the same time • better utilization can be achieved • Drawbacks: • explicit synchronization needed • difficult to program • Examples: • MIMD accomplished via parallel SISD machines: Sequent, nCUBE, Intel iPSC/2, IBM RS6000 cluster, all clusters • MIMD accomplished via parallel SIMD machines: Cray C90, Cray 2, NEC SX-3, Fujitsu VP 2000, Convex C-2, Intel Paragon, CM-5, KSR-1, IBM SP1, IBM SP2
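A minimal MIMD sketch in C with POSIX threads (my own illustration, not from the slides): two threads run genuinely different instruction streams on different data within one program.

/* Two independent instruction streams on different data (compile with -pthread). */
#include <pthread.h>
#include <stdio.h>

static void *sum_task(void *arg) {            /* stream 1: arithmetic on a number */
    long n = *(long *)arg, s = 0;
    for (long i = 1; i <= n; i++) s += i;
    printf("sum(1..%ld) = %ld\n", n, s);
    return NULL;
}

static void *count_task(void *arg) {          /* stream 2: scanning a string */
    const char *text = arg;
    int count = 0;
    for (const char *p = text; *p; p++)
        if (*p == 'a' || *p == 'e' || *p == 'i' || *p == 'o' || *p == 'u') count++;
    printf("vowels: %d\n", count);
    return NULL;
}

int main(void) {
    long n = 1000;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, sum_task, &n);
    pthread_create(&t2, NULL, count_task, (void *)"multiple instruction, multiple data");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}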
2nd Classification: Memory architectures • Shared memory • UMA • NUMA • CC-NUMA • Distributed memory • COMA
UMA (Uniform Memory Access) (Figure: processors P1…Pn connected through an interconnect to memory modules M1…Mk; every processor has the same access time to all of memory.)
NUMA (Non-Uniform Memory Access) (Figure: processing elements PE1…PEn, each pairing a processor Pi with a local memory Mi, joined by an interconnect; access to local memory is faster than access to remote memory.)
CC-NUMA (Cache-Coherent NUMA) (Figure: processing elements PE1…PEn, each with a processor Pi, a cache Ci, and a memory Mi, joined by an interconnect; the caches are kept coherent across nodes.)
Distributed memory (Figure: processing elements PE1…PEn, each with a processor Pi and a private memory Mi, joined by an interconnect; there is no shared address space, so data is exchanged explicitly over the network.)
COMA (Cache-Only Memory Architecture) (Figure: processing elements PE1…PEn, each with a processor Pi and a cache Ci, joined by an interconnect; all of a node’s memory acts as a cache, and data migrates to the nodes that use it.)
Memory architecture (very important!) • Classify by logical view (shared vs. distributed address space) against physical view (shared vs. distributed memory): • physically shared, logically shared: UMA • physically distributed, logically shared: NUMA • physically distributed, logically distributed: distributed memory (message passing) • Scalability improves toward the physically distributed designs; a shared logical view gives “easy” programming, and the logically shared / physically distributed combination is marked as the future.
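To contrast the two logical views in code: with a shared address space a neighbour's value is simply loaded, whereas on a logically distributed machine it must be moved by an explicit message. A minimal MPI sketch of the latter (assuming an MPI installation, e.g. mpicc/mpirun; the ring exchange is only an illustration):

/* Distributed-memory view: data moves only via explicit messages. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = (double)rank;      /* lives only in this node's memory */
    double recv  = 0.0;
    int next = (rank + 1) % size;
    int prev = (rank - 1 + size) % size;

    /* On a (CC-)NUMA machine a plain load could read the neighbour's value;
       here it has to be sent and received explicitly. */
    MPI_Sendrecv(&local, 1, MPI_DOUBLE, next, 0,
                 &recv,  1, MPI_DOUBLE, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d received %.0f from rank %d\n", rank, recv, prev);
    MPI_Finalize();
    return 0;
}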
Generic Parallel Architecture • Node: processor(s), memory system, plus communication assist • Network interface and communication controller • Scalable network
Clusters and Cluster Computing • Definition of a cluster: “A cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected stand-alone/complete computers cooperatively working together as a single, integrated computing resource.” [Buyya98] • Communication infrastructure: • High-performance networks, faster than traditional LANs (Myrinet, Infiniband, Gbit Ethernet) • Low-latency communication protocols • Loosely coupled compared to traditional proprietary supercomputers (e.g. IBM SP, Intel Paragon)
Cluster architecture
Clusters and Cluster Computing • Cluster networks: • Ethernet (10 Mbps), Fast Ethernet (100 Mbps), Gigabit Ethernet (1 Gbps), ATM, Myrinet (1.2 Gbps), Fiber Channel, FDDI, Infiniband, etc. • Cluster projects: • Beowulf (CalTech and NASA) - USA • Condor - Wisconsin State University, USA • DQS (Distributed Queuing System) - Florida State University, USA • HPVM (High Performance Virtual Machine) - UIUC, now UCSB, USA • far - University of Liverpool, UK • Gardens - Queensland University of Technology, Australia • Kerrighed - INRIA, France • MOSIX - Hebrew University of Jerusalem, Israel • NOW (Network of Workstations) - Berkeley, USA