The Tim Allen View of Computing
• Faster!!!
• Bigger problems
  • I want 7 days of weather, not 2
  • I want 1024x1024x16-bit color
  • …
• Most modern applications, such as weather prediction, aerodynamics, and bioinformatics, are computationally intensive
Clock Speeds Have Been Increasing
Source: Culler, D., Singh, J.P., and Gupta, A., Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann Publishers.
Hardware Continues to Improve
Motherboard performance measured in MHz/$.
Source: http://www.zoology.ubc.ca/~rikblok/ComputingTrends/
Source: Moravec, Hans, "When will computer hardware match the human brain?", Journal of Evolution and Technology, 1998, Vol. 1.
Problem Size Is Growing Faster
Source: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Why High Performance?
• Calculating a 24-hour forecast for the UK requires about 10¹² operations
• That takes about 2.7 hours on a machine capable of 10⁸ operations per second
• What about a 7-day forecast?
• Okay, so buy a faster computer…
• The speed of light is 3×10⁸ m/s. Consider two electronic devices, each capable of performing 10¹² operations per second, placed 0.5 mm apart. It takes longer for a signal to travel between them than it takes either of them to perform a single operation (the numbers are checked in the sketch below)
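A quick back-of-the-envelope check of these figures, using only the values quoted on the slide:

```python
# Check the forecast arithmetic: 10^12 operations at 10^8 ops/s.
ops = 1e12            # operations for a 24-hour UK forecast
rate = 1e8            # operations per second
print(ops / rate / 3600, "hours")   # ~2.78 hours

# Check the speed-of-light argument.
c = 3e8               # speed of light, m/s
distance = 0.5e-3     # 0.5 mm between the two devices, in metres
signal_time = distance / c          # ~1.67e-12 s to cross the gap
op_time = 1 / 1e12                  # 1e-12 s for one operation
print(signal_time > op_time)        # True: the wire, not the device, is the bottleneck
```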
Why Are Things Getting Better?
• The huge increase in computing speed since the 1950s can be attributed to:
  • Faster components
  • More efficient algorithms
  • More sophisticated architectures
• Most “advanced” architectures attempt to eliminate the von Neumann bottleneck
Von Neumann Architecture
[Diagram: a CPU connected to memory; the memory is typically much slower than the CPU and may not have the bandwidth required to match it]
• Some improvements have included:
  • Interleaved memory
  • Caching
  • Pipelining
The Bottom Line
• It is getting harder to extract the performance modern applications require from a single-processor machine
• Given physical constraints such as the speed of light, using multiple processing elements to solve a problem is the only way to go
Concurrent Programming
• Operations that occur one after another, ordered in time, are said to be sequential
• Operations are concurrent if they could be, but need not be, executed in parallel
  • A web browser often loads images concurrently (see the sketch below)
• Concurrency in a program and parallelism in the underlying hardware are independent concepts
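A minimal sketch of the browser example using Python threads; the image names, and the sleep that stands in for network I/O, are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def load_image(name):
    time.sleep(0.1)                  # stand-in for network I/O
    return f"{name} loaded"

names = ["a.png", "b.png", "c.png"]  # hypothetical image names
with ThreadPoolExecutor() as pool:
    # the three loads are expressed concurrently: they may,
    # but need not, actually execute in parallel
    for msg in pool.map(load_image, names):
        print(msg)
```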
Parallel Computing
• A large collection of processing elements that can communicate and cooperate to solve large problems quickly
• A form of information processing that uses concurrent events during execution
• In other words, both the language and the hardware support concurrency
Parallelism
• If several operations can be performed simultaneously, the total computation time is reduced
• For example, three independent operations spread over three processors take roughly the time of one, so the parallel version has the potential of being 3 times faster
Measuring Performance
• How should the performance of a parallel computation be measured?
• Traditional measures like MIPS and MFLOPS really don’t cut it
• New ways to measure parallel performance are needed:
  • Speedup
  • Efficiency
Speedup
• Speedup is the most often used measure of parallel performance
• If
  • Ts is the best possible serial time
  • Tn is the time taken by a parallel algorithm on n processors
• then
  • Speedup = Ts / Tn
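Making the definition concrete; the timings below are made-up values, not measurements:

```python
def speedup(t_serial, t_parallel):
    """Speedup = Ts / Tn, with Ts the best possible serial time."""
    return t_serial / t_parallel

# hypothetical timings: 100 s serially, 30 s on 4 processors
print(speedup(100.0, 30.0))   # ~3.33
```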
Read Between the Lines
• Exactly what is meant by Ts (i.e., the time taken to run the fastest serial algorithm on one processor)?
  • One processor of the parallel computer?
  • The fastest serial machine available?
  • A parallel algorithm run on a single processor?
  • Is the serial algorithm the best one?
• To keep things fair, Ts should be the best possible time in the serial world
Speedup†
• A slightly different definition of speedup also exists:
  • the time taken by the parallel algorithm on one processor divided by the time taken by the parallel algorithm on N processors
• However, this is misleading, since many parallel algorithms contain extra operations to accommodate the parallelism (e.g., the communication)
• The result is that Ts is increased, thus exaggerating the speedup
Factors That Limit Speedup
• Software overhead
  • Even with a completely equivalent algorithm, software overhead arises in the concurrent implementation
• Load balancing
  • Speedup is generally limited by the speed of the slowest node, so an important consideration is to ensure that each node performs the same amount of work
• Communication overhead
  • Assuming that communication and calculation cannot be overlapped, any time spent communicating data between processors directly degrades the speedup
Linear Speedup
• Whichever definition is used, the ideal is to produce linear speedup:
  • a speedup of N using N processors
• In practice, however, the speedup is reduced from its ideal value of N
• Superlinear speedup results when:
  • unfair values are used for Ts
  • there are differences in the nature of the hardware used
Speedup Curves Super linear Speedup Linear Speedup Speedup Typical Speedup Number of Processors Overview
Maximum Speedup
• Every parallel program has a portion that cannot be parallelized
• Given:
  • F is the fraction of the computation that must be done serially
  • Ts is the best possible sequential time
• The time to perform the computation with p processors is
  • F·Ts + (1 − F)·Ts / p
• Therefore the speedup is
  • Ts / (F·Ts + (1 − F)·Ts / p) = p / (1 + (p − 1)·F)
• This seems to be bad news, since as p increases the speedup approaches 1/F
• This is known as Amdahl’s Law
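Plugging the formula into code shows the ceiling directly; the serial fraction F = 0.05 is an arbitrary example value:

```python
def amdahl_speedup(p, f):
    """Speedup on p processors when a fraction f of the work must run serially."""
    return p / (1 + (p - 1) * f)

F = 0.05   # assume 5% of the computation is inherently serial
for p in (1, 10, 100, 1000):
    print(p, round(amdahl_speedup(p, F), 2))
# 1 1.0, 10 6.9, 100 16.81, 1000 19.63:
# the speedup approaches 1/F = 20 no matter how many processors are added
```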
Efficiency
• Speedup does not measure how efficiently the processors are being used
  • Is it worth using 100 processors to get a speedup of 2?
• Efficiency is defined as the ratio of the speedup to the number of processors required to achieve it
• Efficiency is bounded from above by 1
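Applying the definition to the question above, with the numbers taken from the slide:

```python
def efficiency(spdup, processors):
    """Efficiency = speedup / number of processors; at most 1."""
    return spdup / processors

# 100 processors for a speedup of 2 means the machine is only 2% utilized
print(efficiency(2, 100))   # 0.02
```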
Scalability
• Scalability refers to the ability of a system to grow:
  • Hardware
  • Software
• For example:
  • How difficult is it to add another 10 processors to a system?
  • How difficult is it to increase the image being calculated from 5 megapixels to 10?
Parallel Architectures
• Unlike traditional von Neumann machines, there is no single standard architecture used on parallel machines
• In fact, dozens of different parallel architectures have been built and are being used
• Several people have tried to classify the different types of parallel machines
• The taxonomy proposed by Flynn is the most commonly used
Flynn’s Model of Computation
• Any computer, whether sequential or parallel, operates by executing instructions on data:
  • a stream of instructions (the algorithm) tells the computer what to do
  • a stream of data (the input) is affected by these instructions
• Depending on whether there is one or several of each stream, Flynn’s taxonomy defines four classes of computers
Flynn’s Taxonomy

                                 Single Data Stream    Multiple Data Streams
  Single Instruction Stream            SISD                   SIMD
  Multiple Instruction Streams         MISD                   MIMD
SISD Computers
• Standard sequential computers
• A single processing element receives a single stream of instructions that operate on a single stream of data
• No parallelism here
SIMD Computers
• All processors operate under the control of a single instruction stream
• Processors can be selected under program control
• There are N data streams, one per processor
• Variables can live either in the parallel machine or in the scalar host
• Often referred to as data-parallel computing
SIMD
[Diagram: a single controller broadcasting instructions to an array of PEs, each with its own memory, connected by an interconnection network]
SIMD Algorithm
• Calculate the number of heads in a series of coin tosses:
  • All processors flip a coin
  • If (coin is a head) raise your hand
• Note: there are better ways to count the hands
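A data-parallel sketch of this idea in plain Python; the list plays the role of the per-processor data, and the lock-step execution of real SIMD hardware is only simulated here:

```python
import random

N = 16                                                 # number of processing elements
coins = [random.choice(("H", "T")) for _ in range(N)]  # every PE flips a coin
hands = [1 if c == "H" else 0 for c in coins]          # same instruction, each PE's own datum
print(sum(hands), "heads")                             # the controller counts the raised hands
```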
MIMD Computers
• This is the most general and most powerful class in Flynn’s taxonomy
• N processors, N streams of instructions, and N streams of data
• Each processor operates under the control of an instruction stream issued by its own control unit
  • each processor is capable of executing its own program on different data
• Processors operate asynchronously and can be doing different things on different data at the same time
MIMD – Shared Memory
[Diagram: several PEs connected through an interconnection network to a common shared memory]
MIMD computers with shared memory are known as multiprocessors or tightly coupled machines.
MIMD – Message Passing
[Diagram: several PEs, each with its own private memory, communicating through an interconnection network]
• MIMD computers in which each processor has its own memory and communicates over an interconnection network are known as multicomputers or loosely coupled machines
• Multicomputers are sometimes referred to as distributed systems, which is incorrect: a distributed system is a network of computers, and even though the number of processing units can be large, the communication is typically slow
MIMD Algorithm
• Generate the prime numbers from 2 to 20 (a pipelined sieve; see the sketch below):
  • Receive a number
    • this is your prime
  • Repeat:
    • receive a number
    • if this number is not evenly divisible by your prime, pass it to the next processor
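One possible realization of this pipeline, sketched with one OS process per “processor” and pipes as the message channels; the END sentinel, the shared results pipe, and spawning stages on demand are implementation assumptions, not part of the slide:

```python
from multiprocessing import Process, Pipe

END = None  # sentinel: no more numbers are coming

def stage(inp, results):
    """One 'processor': keep the first number received as my prime,
    then forward every later number my prime does not divide."""
    prime = inp.recv()
    results.send(prime)                  # announce my prime
    nxt = None                           # pipe to the next stage, created on demand
    while True:
        n = inp.recv()
        if n is END:
            (nxt or results).send(END)   # pass the sentinel on, or finish the run
            return
        if n % prime != 0:
            if nxt is None:              # first survivor: spawn the next 'processor'
                r, s = Pipe(duplex=False)
                Process(target=stage, args=(r, results)).start()
                nxt = s
            nxt.send(n)

if __name__ == "__main__":
    first_r, first_s = Pipe(duplex=False)
    res_r, res_s = Pipe(duplex=False)
    Process(target=stage, args=(first_r, res_s)).start()
    for n in range(2, 21):               # feed 2..20 into the pipeline
        first_s.send(n)
    first_s.send(END)
    primes = []
    while (p := res_r.recv()) is not END:
        primes.append(p)
    print(primes)                        # [2, 3, 5, 7, 11, 13, 17, 19]
```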
Another MIMD Algorithm
• Each of 8 processors is given a list of 5 numbers (a merge tree; see the sketch below):
  • Sort your list of numbers
  • Processors 1, 3, 5, and 7: give your list to processor (n − 1)
  • Processors 0, 2, 4, and 6: merge your list with the new list
  • Processors 2 and 6: give your list to processor (n − 2)
  • Processors 0 and 4: merge your list with the new list
  • Processor 4: give your list to processor 0
  • Processor 0: merge your list with the new list
  • Processor 0: give your list to me
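A sequential simulation of the merge tree above (a sketch, not real message passing); the random lists stand in for each processor’s 5 numbers:

```python
from heapq import merge
import random

# eight "processors", each with a sorted list of five numbers
lists = [sorted(random.sample(range(100), 5)) for _ in range(8)]

step = 1
while step < 8:
    # each round, the senders hand their list to a neighbour, which merges
    # it into its own (round 1: 1->0, 3->2, 5->4, 7->6; round 2: 2->0, 6->4;
    # round 3: 4->0), matching the slide's schedule
    for dst in range(0, 8, 2 * step):
        src = dst + step
        lists[dst] = list(merge(lists[dst], lists[src]))
    step *= 2

print(lists[0])   # the fully merged, sorted 40 numbers end up on processor 0
```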
The 4 Classes
[Figure: schematic diagrams of the four classes: SISD, SIMD, MISD, and MIMD]
SPMD Computing
• SPMD stands for Single Program, Multiple Data
• The same program is run on the processors of an MIMD machine
• Occasionally the processors may synchronize
• Because an entire program is executed on separate data, different branches may be taken, leading to asynchronous parallelism
• SPMD came about as a desire to do SIMD-like calculations on MIMD machines
• SPMD is not a hardware paradigm; it is the software equivalent of SIMD (see the sketch below)
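A minimal SPMD sketch: every worker runs the identical program and branches on its rank; the pool of 4 workers and the rank-0 role are illustrative assumptions:

```python
from multiprocessing import Pool

def program(rank):
    """The same program runs everywhere; behaviour diverges on the rank."""
    if rank == 0:
        return "rank 0: collecting results"          # one branch of the program
    return f"rank {rank}: working on partition {rank}"

if __name__ == "__main__":
    with Pool(4) as pool:                            # 4 'processors', one program
        for line in pool.map(program, range(4)):
            print(line)
```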
Conclusions
• Parallel/distributed computing seems to be the only way to achieve the computing power required by the current generation of applications
• Speedup and efficiency are common measures of parallel performance
• Flynn proposed a taxonomy to classify parallel architectures:
  • SIMD
  • MIMD
• Software needs to be retooled to take advantage of the high-performance environment