Computer Architecture II: Introduction
Today’s overview • Why parallel computing? • Technology trends • Processors • Storage • Architectural • Application trends • Challenging computational problems • What is a parallel computer? • Classical parallel computer classifications • Architecture • Memory access • Cluster and grid computing (definitions) • Top 500 • Parallel architectures and their convergence
Units of Measure in HPC • High Performance Computing (HPC) units are: • Flops: floating point operations • Flop/s: floating point operations per second • Bytes: size of data (a double-precision floating point number is 8 bytes long) • Typical sizes are millions, billions, trillions… • Mega: Mflop/s = 10^6 flop/sec, Mbyte = 10^6 bytes (also 2^20 = 1,048,576) • Giga: Gflop/s = 10^9 flop/sec, Gbyte = 10^9 bytes (also 2^30 = 1,073,741,824) • Tera: Tflop/s = 10^12 flop/sec, Tbyte = 10^12 bytes (also 2^40 = 1,099,511,627,776) • Peta: Pflop/s = 10^15 flop/sec, Pbyte = 10^15 bytes (also 2^50 = 1,125,899,906,842,624) • Exa: Eflop/s = 10^18 flop/sec, Ebyte = 10^18 bytes
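The decimal and binary readings of the byte prefixes above drift apart as the prefix grows, which matters when quoting memory sizes. A minimal C sketch using only the prefix values from the table (no assumptions beyond standard C):

/* Decimal vs. binary interpretation of the HPC size prefixes above. */
#include <stdio.h>

int main(void) {
    /* flop/s prefixes are always decimal; byte prefixes are often quoted
       either way, so the gap below matters for capacity claims. */
    double mega_dec = 1e6,  mega_bin = (double)(1ULL << 20);
    double giga_dec = 1e9,  giga_bin = (double)(1ULL << 30);
    double tera_dec = 1e12, tera_bin = (double)(1ULL << 40);

    printf("Mega: binary is %.2f%% larger\n", 100.0 * (mega_bin / mega_dec - 1.0));
    printf("Giga: binary is %.2f%% larger\n", 100.0 * (giga_bin / giga_dec - 1.0));
    printf("Tera: binary is %.2f%% larger\n", 100.0 * (tera_bin / tera_dec - 1.0));
    return 0;
}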
Why parallel computing? • Sequential computer • von Neumann model • One processor • One memory • One instruction executed at a time • Fastest sequential machines: a few billion operations per second (GFLOPS) (Figure: von Neumann machine: a processor containing a control unit and an arithmetic logic unit, plus connecting logic, an I/O system, and memory.)
Tunnel Vision by Experts • “I think there is a world market for maybe five computers.” • Thomas Watson, chairman of IBM, 1943. • “There is no reason for any individual to have a computer in their home.” • Ken Olson, president and founder of Digital Equipment Corporation, 1977. • “640K [of memory] ought to be enough for anybody.” • Bill Gates, chairman of Microsoft, 1981. Slide source: Warfield et al.
Technology Trends: Microprocessor Capacity • Moore’s Law: Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months (2x transistors per chip every ~1.5 years). • Microprocessors have become smaller, denser, and more powerful. Slide source: Jack Dongarra
Impact of Device Shrinkage • What happens when the transistor size shrinks by a factor of x? • Clock rate goes up by x because wires are shorter • actually less than x, because of power consumption • Transistors per unit area go up by x^2 • Die size also tends to increase • typically another factor of ~x • Raw computing power of the chip goes up by ~x^4! • of which x^3 is devoted either to parallelism or locality
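A quick worked instance of these scaling rules, sketched in C for a hypothetical shrink factor x = 2 (the numbers are illustrative, not measurements):

/* Scaling rules from the slide above, evaluated for x = 2. */
#include <stdio.h>

int main(void) {
    double x = 2.0;                          /* transistor size shrinks by 2x    */
    double clock   = x;                      /* shorter wires -> clock up by ~x  */
    double density = x * x;                  /* transistors per area up by x^2   */
    double die     = x;                      /* die size tends to grow by ~x     */
    double raw     = clock * density * die;  /* ~x^4 raw capability              */

    printf("clock x%.0f, transistors/area x%.0f, raw power x%.0f\n",
           clock, density, raw);
    printf("of that, x%.0f must go into parallelism or locality\n",
           density * die);                   /* the x^3 that is not clock rate   */
    return 0;
}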
Microprocessor Transistors per Chip (Figure: growth in transistors per chip, and the accompanying increase in clock rate, over time.)
Limiting forces: Increased cost and difficulty of manufacturing
How fast can a serial computer be? (James Demmel) • Consider a 1 Tflop/s, 1 Tbyte sequential machine: • Data must travel some distance, r, to get from memory to CPU. • To get 1 data element per cycle, data must make the trip 10^12 times per second at the speed of light, c = 3x10^8 m/s. Thus r < c/10^12 = 0.3 mm. • Now put 1 Tbyte of storage in a 0.3 mm x 0.3 mm area: • Each bit occupies about 1 square Angstrom, roughly the size of a small atom. • No choice but parallelism
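The same back-of-the-envelope argument, sketched in C so the two numbers above can be reproduced (only the stated speed of light and machine size are used):

/* Back-of-the-envelope limit for a 1 Tflop/s, 1 Tbyte serial machine. */
#include <stdio.h>

int main(void) {
    double c    = 3.0e8;     /* speed of light, m/s                          */
    double rate = 1.0e12;    /* 1 Tflop/s -> one operand fetched per cycle   */
    double r    = c / rate;  /* max memory-to-CPU distance per cycle, m      */

    double bits = 8.0e12;                          /* 1 Tbyte = 8e12 bits    */
    double area_per_bit = (r * r) / bits;          /* pack 1 Tbyte into r x r */
    double angstrom2    = area_per_bit / 1.0e-20;  /* 1 A^2 = 1e-20 m^2      */

    printf("r = %.1e m (%.1f mm)\n", r, r * 1e3);
    printf("area per bit = %.2f square Angstroms\n", angstrom2);
    return 0;
}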
Storage: Locality and Parallelism • Large memories are slow, fast memories are small • Storage hierarchies are large and fast on average • Parallel processors, collectively, have large, fast cache ($) • the slow accesses to “remote” data we call “communication” • Algorithm should do most work on local data (Figure: conventional storage hierarchy: each processor has its own cache, L2 cache, and L3 cache in front of memory, with potential interconnects between the nodes.)
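To make the locality point concrete, here is a minimal C micro-benchmark (my own illustration, not from the slides): the same sum over a matrix, walked once in cache-friendly order and once in cache-hostile order.

/* Same work, different locality: unit-stride vs. strided traversal. */
#include <stdio.h>
#include <time.h>

#define N 2048

static double a[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0;

    clock_t t0 = clock();
    double s1 = 0.0;
    for (int i = 0; i < N; i++)          /* row-major walk: consecutive   */
        for (int j = 0; j < N; j++)      /* words share cache lines,      */
            s1 += a[i][j];               /* so few misses                 */
    clock_t t1 = clock();

    double s2 = 0.0;
    for (int j = 0; j < N; j++)          /* column-major walk: stride N,  */
        for (int i = 0; i < N; i++)      /* nearly every access misses    */
            s2 += a[i][j];               /* in the cache                  */
    clock_t t2 = clock();

    printf("row-major: %.3f s (sum %.0f)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, s1);
    printf("col-major: %.3f s (sum %.0f)\n", (double)(t2 - t1) / CLOCKS_PER_SEC, s2);
    return 0;
}

On most machines the strided walk is noticeably slower; in the parallel setting the penalty for non-local ("remote") data is larger still, which is the communication cost the slide refers to.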
Processor-DRAM Gap (latency) (Figure: performance vs. time, 1980-2000, log scale: processor (µProc) performance improves ~60%/year ("Moore's Law") while DRAM latency improves only ~7%/year, so the processor-memory performance gap grows about 50% per year.)
Storage Trends • Divergence between memory capacity and speed even more pronounced • Capacity increased by 1000x from 1980-95, speed only 2x • Gigabit DRAM by c. 2000, but gap with processor speed much greater • Larger memories are slower, while processors get faster • Need to transfer more data in parallel • Need deeper cache hierarchies • How to organize caches? • Parallelism increases effective size of each level of hierarchy, without increasing access time • Disks: parallel disks plus caching
Architectural Trends • Resolve the tradeoff between parallelism and locality • Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect • Tradeoffs may change with scale and technology advances • Understanding microprocessor architectural trends => helps build intuition about design issues of parallel machines => shows the fundamental role of parallelism even in "sequential" computers
Phases in “VLSI” Generation
Architectural Trends • Greatest trend across VLSI generations is the increase in parallelism • Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit • slows after 32-bit • adoption of 64-bit now under way (Opteron, Itanium); 128-bit is far off (not a performance issue) • Mid 80s to mid 90s: instruction-level parallelism (ILP) • pipelining and simple instruction sets, plus compiler advances (RISC) • on-chip caches and functional units => superscalar execution • greater sophistication: out-of-order execution, speculation, prediction • Current step: • thread-level parallelism • multicore
Pipeline of a superscalar processor (Figure: the front-end fetch/decode stages and the final commit stage are in-order; the execution core in between is out-of-order.)
How far will ILP go? • Simulation study of the maximum available ILP, assuming: • Infinite fetch bandwidth • Infinite functional units • Perfect branch prediction • Cache misses: 0 cycles
Multithreaded architectures
Multithreaded architectures • Examples: Pentium 4 Xeon, UltraSPARC T1 (32 and 64 threads), Itanium Montecito (also dual-core)
Multi-core • Intel: • Dual-core Pentium Extreme Edition 840 (first) • Quad-Core Xeon 5300 • 80-core research chip capable of 1.28 TFLOPS • AMD: Dual-Core Opteron, Quad-Core FX (3 GHz) • Sun: Rock, 16 cores (due 2008) • IBM: dual-core POWER6 at 5 GHz
Alternative: Cell • A general-purpose Power Architecture core of modest performance • plus coprocessing elements that accelerate multimedia and vector processing applications • PowerPC core • controls 8 SPEs (Synergistic Processing Elements): SIMD • Cache coherent • 25.6 GB/s XDR memory controller
Alternative: Cell • SPE storage hierarchy: • 128 x 128-bit registers, single-cycle access • 16K x 128-bit local store (256 KB), 6-cycle access • DMA in parallel with SIMD processing
Overview of Cell processor
Application Trends • Demand for cycles fuels advances in hardware, and vice versa (new applications demand more performance; more performance enables new applications) • This cycle drives the exponential increase in microprocessor performance • Drives parallel architecture harder: the most demanding applications • Goal of applications in using parallel machines: speedup • Speedup(p processors) = Performance(p processors) / Performance(1 processor) • For a fixed problem size (input data set), performance = 1/time, so Speedup_fixed_problem(p processors) = Time(1 processor) / Time(p processors)
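A tiny C sketch of the fixed-problem-size speedup definition above; the timings are hypothetical placeholders, not measurements from the slides.

/* Fixed-problem-size speedup = time(1 processor) / time(p processors). */
#include <stdio.h>

int main(void) {
    double t1  = 120.0;   /* time on 1 processor, seconds (hypothetical)  */
    double t4  = 35.0;    /* time on 4 processors (hypothetical)          */
    double t16 = 12.0;    /* time on 16 processors (hypothetical)         */

    printf("speedup(4)  = %.2f  (ideal 4,  efficiency %.0f%%)\n",
           t1 / t4,  100.0 * t1 / t4 / 4.0);
    printf("speedup(16) = %.2f  (ideal 16, efficiency %.0f%%)\n",
           t1 / t16, 100.0 * t1 / t16 / 16.0);
    return 0;
}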
Improving the speedup of Parallel Applications • AMBER molecular dynamics simulation program • Motion of large biological molecules (proteins, DNA) • 145 MFLOPS on a Cray C90; 406 MFLOPS for the final version on a 128-processor Paragon; 891 MFLOPS on a 128-processor Cray T3D • 8/94 version: optimized the communication • 9/94 version: optimized the load balance
Particularly Challenging Computations • Science • Global climate modeling • Astrophysical modeling • Biology: genomics, protein folding, drug design • Computational chemistry • Computational material sciences and nanosciences • Engineering • Crash simulation • Semiconductor design • Earthquake and structural modeling • Computational fluid dynamics (airplane design) • Combustion (engine design) • Business • Financial and economic modeling • Transaction processing, web services and search engines • Defense • Nuclear weapons (testing by simulation) • Cryptography
$5B Market in Technical Computing (Source: IDC 2004, from the US National Research Council report on the Future of Supercomputing)
Scientific Computing Demand
NRC report on Future of Supercomputing • “In climate modeling or plasma physics, there is a broad consensus that up to seven orders of magnitude of performance improvements will be needed to achieve well-defined computational goals.”
What is Parallel Architecture? • A parallel computer is a collection of processing elements that cooperate to solve large problems fast • Some broad issues: • Resource Allocation: • how large a collection? • how powerful are the elements? • how much memory? • Data access, Communication and Synchronization • how do the elements cooperate and communicate? • how are data transmitted between processors? • what are the abstractions and primitives for cooperation? • Performance and Scalability • how does it all translate into performance? • how does it scale?
Why Study Parallel Architecture? • Role of a computer architect: • To design and engineer the various levels of a computer system to maximize performance and programmability within limits of technology and cost. • Parallelism: • Provides alternative to faster clock for performance • Applies at all levels of system design • Is a fascinating perspective from which to view architecture • Is increasingly central in information processing
1st classification: Architecture • There are several different methods used to classify computers • No single taxonomy fits all designs • Flynn's taxonomy uses the relationship of program instructions to program data: • SISD - Single Instruction, Single Data Stream • SIMD - Single Instruction, Multiple Data Stream • MISD - Multiple Instruction, Single Data Stream (no practical examples) • MIMD - Multiple Instruction, Multiple Data Stream
SISD • One instruction stream • One data stream • One instruction issued on each clock cycle • One instruction executed on one element of data (scalar) at a time • Traditional von Neumann architecture
SIMD • Also a von Neumann architecture, but with more powerful instructions • Each instruction may operate on more than one data element • Usually an intermediate host executes the program logic and broadcasts instructions to the other processors • Synchronous (lockstep) • Rating how fast these machines can issue instructions is not a good measure of their performance • Two major types: • Vector SIMD • Parallel SIMD
Vector SIMD • A single instruction results in multiple operands being updated • Scalar processing operates on single data elements; vector processing operates on whole vectors (groups) of data at a time • Examples: • Cell • Cray-1 • NEC SX-2 • Fujitsu VP • Hitachi S820
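A minimal C sketch of the scalar/vector distinction (an illustrative loop, not code from the slides): on a SISD machine the loop body issues one multiply per element, while a vector SIMD machine, or an auto-vectorizing compiler targeting vector hardware, applies a single instruction to a whole group of elements.

/* Element-wise multiply: one instruction per element on a scalar machine,
   one instruction per vector-register-full of elements on a vector machine. */
#include <stdio.h>

#define N 1024

int main(void) {
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f; }

    for (int i = 0; i < N; i++)   /* vector hardware can process many i's */
        c[i] = a[i] * b[i];       /* per instruction, in lockstep         */

    printf("c[10] = %.1f\n", c[10]);
    return 0;
}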
Parallel SIMD • Several processors execute the same instruction in lockstep • Each processor modifies a different element of data • Drawback: idle processors • Advantage: no explicit synchronization required • Examples: • Connection Machine CM-2 • MasPar MP-1, MP-2
MIMD • Several processors executing different instructions on different data (see the sketch below) • Advantages: • different jobs can be performed at the same time • better utilization can be achieved • Drawbacks: • explicit synchronization needed • difficult to program • Examples: • MIMD accomplished via parallel SISD machines: Sequent, nCUBE, Intel iPSC/2, IBM RS6000 cluster, all clusters • MIMD accomplished via parallel SIMD machines: Cray C90, Cray 2, NEC SX-3, Fujitsu VP 2000, Convex C-2, Intel Paragon, CM-5, KSR-1, IBM SP1, IBM SP2
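A minimal MIMD sketch in C with POSIX threads (my own illustration, not from the slides): two threads run genuinely different instruction streams on different data within one program.

/* Two independent instruction streams on different data (compile with -pthread). */
#include <pthread.h>
#include <stdio.h>

static void *sum_task(void *arg) {            /* stream 1: arithmetic on a number */
    long n = *(long *)arg, s = 0;
    for (long i = 1; i <= n; i++) s += i;
    printf("sum(1..%ld) = %ld\n", n, s);
    return NULL;
}

static void *count_task(void *arg) {          /* stream 2: scanning a string */
    const char *text = arg;
    int count = 0;
    for (const char *p = text; *p; p++)
        if (*p == 'a' || *p == 'e' || *p == 'i' || *p == 'o' || *p == 'u') count++;
    printf("vowels: %d\n", count);
    return NULL;
}

int main(void) {
    long n = 1000;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, sum_task, &n);
    pthread_create(&t2, NULL, count_task, (void *)"multiple instruction, multiple data");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}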
2nd Classification: Memory architectures • Shared memory • UMA • NUMA • CC-NUMA • Distributed memory • COMA
UMA (Uniform Memory Access) (Figure: processors P1…Pn connected through an interconnect to memory modules M1…Mk; every processor has the same access time to all of memory.)
NUMA (Non-Uniform Memory Access) (Figure: processing elements PE1…PEn, each pairing a processor Pi with a local memory Mi, joined by an interconnect; access to local memory is faster than access to remote memory.)
CC-NUMA (Cache-Coherent NUMA) (Figure: processing elements PE1…PEn, each with a processor Pi, a cache Ci, and a memory Mi, joined by an interconnect; the caches are kept coherent across nodes.)
Distributed memory (Figure: processing elements PE1…PEn, each with a processor Pi and a private memory Mi, joined by an interconnect; there is no shared address space, so data is exchanged explicitly over the network.)
COMA (Cache-Only Memory Architecture) (Figure: processing elements PE1…PEn, each with a processor Pi and a cache Ci, joined by an interconnect; all of a node’s memory acts as a cache, and data migrates to the nodes that use it.)
Memory architecture (very important!) • Classify by logical view (shared vs. distributed address space) against physical view (shared vs. distributed memory): • physically shared, logically shared: UMA • physically distributed, logically shared: NUMA • physically distributed, logically distributed: distributed memory (message passing) • Scalability improves toward the physically distributed designs; a shared logical view gives “easy” programming, and the logically shared / physically distributed combination is marked as the future.
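To contrast the two logical views in code: with a shared address space a neighbour's value is simply loaded, whereas on a logically distributed machine it must be moved by an explicit message. A minimal MPI sketch of the latter (assuming an MPI installation, e.g. mpicc/mpirun; the ring exchange is only an illustration):

/* Distributed-memory view: data moves only via explicit messages. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = (double)rank;      /* lives only in this node's memory */
    double recv  = 0.0;
    int next = (rank + 1) % size;
    int prev = (rank - 1 + size) % size;

    /* On a (CC-)NUMA machine a plain load could read the neighbour's value;
       here it has to be sent and received explicitly. */
    MPI_Sendrecv(&local, 1, MPI_DOUBLE, next, 0,
                 &recv,  1, MPI_DOUBLE, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d received %.0f from rank %d\n", rank, recv, prev);
    MPI_Finalize();
    return 0;
}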
Generic Parallel Architecture • Node: processor(s), memory system, plus communication assist • Network interface and communication controller • Scalable network
Clusters and Cluster Computing • Definition of a cluster: “A cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected stand-alone/complete computers cooperatively working together as a single, integrated computing resource.” [Buyya98] • Communication infrastructure: • High-performance networks, faster than traditional LANs (Myrinet, Infiniband, Gbit Ethernet) • Low-latency communication protocols • Loosely coupled compared to traditional proprietary supercomputers (e.g. IBM SP, Intel Paragon)
Cluster architecture
Clusters and Cluster Computing • Cluster networks: • Ethernet (10 Mbps), Fast Ethernet (100 Mbps), Gigabit Ethernet (1 Gbps), ATM, Myrinet (1.2 Gbps), Fiber Channel, FDDI, Infiniband, etc. • Cluster projects: • Beowulf (CalTech and NASA) - USA • Condor - Wisconsin State University, USA • DQS (Distributed Queuing System) - Florida State University, USA • HPVM (High Performance Virtual Machine) - UIUC, now UCSB, USA • far - University of Liverpool, UK • Gardens - Queensland University of Technology, Australia • Kerrighed - INRIA, France • MOSIX - Hebrew University of Jerusalem, Israel • NOW (Network of Workstations) - Berkeley, USA