Computational Methods in Astrophysics ASTR 5210 Dr Rob Thacker (AT319E) thacker@ap.smu.ca
Today’s Lecture • More Computer Architecture • Flynn’s Taxonomy • Improving CPU performance (instructions per clock) • Instruction Set Architecture classifications • Future of CPU design
Machine architecture classifications • Flynn’s taxonomy (see IEEE Trans. Comp., Vol. C-21, pp. 948-960, 1972) • A way of describing the information flow in computers: an architectural definition • Information is divided into instructions (I) and data (D) • There can be single (S) or multiple (M) instances of both • Four combinations: SISD, SIMD, MISD, MIMD
SISD • Single Instruction, Single Data • An absolutely serial execution model • Typically viewed as describing a serial computer, but today’s CPUs exploit parallelism internally • (Diagram: a single processor P operating on a single memory M, one data element at a time)
SIMD • Single Instruction, Multiple Data • In this case one instruction is applied to multiple data streams at the same time • (Diagram: a single instruction processor K broadcasts each instruction to an array of processing elements (PEs); each PE P typically has its own data memory Ma)
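Flynn’s SIMD category originally described array processors, but the closest everyday analogue is the vector (SIMD) unit inside a modern CPU. Below is a minimal, hypothetical sketch (not from the slides): a plain C loop that a vectorizing compiler (e.g. gcc or clang at -O3) can map onto SIMD instructions, so that one instruction adds several floats at once.

```c
#include <stddef.h>

/* One operation (an add) applied across many data elements.
 * A vectorizing compiler can turn this loop into SIMD instructions
 * that process 4 or 8 floats per instruction. */
void vector_add(float *out, const float *b, const float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = b[i] + c[i];
}
```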
MISD • Multiple Instruction, Single Data • Largely a useless definition in practice (no important machines fit it) • Closest relevant example would be a CPU that can ‘pipeline’ instructions • Example: a systolic array, a network of small elements connected in a regular grid operating under a global clock, reading and writing elements from/to their neighbours • (Diagram: each processor P has its own instruction memory Mi but all operate on the same data stream from memory Ma)
MIMD • Multiple Instruction, Multiple Data • Covers a host of modern architectures • Processors have independent data and instruction streams • Processors may communicate directly or via shared memory • (Diagram: multiple processors P, each with its own memory M, connected by an interconnect)
Instruction Set Architecture • ISA – the interface between hardware and software • ISAs are typically common to a CPU family, e.g. x86, MIPS (members of a family are more alike than different) • Assembly language is a realization of the ISA in a form that is easy to remember (and program)
Key Concept in ISA evolution and CPU design • Efficiency gains to be had by executing as many operations per clock cycle as possible • Instruction level parallelism (ILP) • Exploit parallelism within the instruction stream • Programmer does not see this parallelism explicitly • Goal of modern CPU design – maximize the number of instructions per clock cycle (IPC), equivalently reduce cycles per instruction (CPI)
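These quantities are tied together by the standard processor performance equation (a textbook relation, added here for reference):

$$
\text{CPU time} \;=\; N_{\text{instructions}} \times \text{CPI} \times t_{\text{clock}}
\;=\; \frac{N_{\text{instructions}} \times \text{CPI}}{f_{\text{clock}}},
\qquad \text{IPC} = \frac{1}{\text{CPI}}
$$

so for a fixed instruction count and clock frequency, performance improves exactly as CPI falls (IPC rises).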
ILP versus thread level parallelism • Many modern programs have more than one (parallel) “thread” of execution • Instruction level parallelism breaks down a single thread of execution to try and find parallelism at the instruction level • (Diagram: within one “thread”, independent instructions 1, 2, 3 are executed in parallel even though there is only one thread) • A minimal code sketch follows below
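A minimal sketch (hypothetical code, not from the slides) of what the hardware looks for within one thread: the first three statements below have no data dependences on each other, so an out-of-order, superscalar core can issue them in the same cycle, even though the program has a single thread of execution.

```c
/* Independent operations within a single thread: candidates for ILP. */
double ilp_example(double a, double b, double c,
                   double d, double e, double f)
{
    double x = a + b;   /* instruction "1" - independent             */
    double y = c + d;   /* instruction "2" - independent             */
    double z = e + f;   /* instruction "3" - independent             */
    return x + y + z;   /* depends on all three, so it must wait     */
}
```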
ILP techniques • The two main ILP techniques are • Pipelining – including additional techniques such as out-of-order execution • Superscalar execution
Pipelining • Multiple instructions overlapped in execution • A throughput optimization: it doesn’t reduce the time for an individual instruction • (Diagram: a 7-stage pipeline; instructions 1–7 occupy successive stages, so once the pipeline is full one instruction completes per step)
Design sweet spot • Pipeline stepping time is determined by the slowest operation in the pipeline • Best speed-up when all operations take the same amount of time • Net time per instruction = total instruction time / number of pipeline stages • Perfect speed-up factor = number of pipeline stages • Never achieved in practice: there are start-up (pipeline fill) overheads to consider – see the expression below
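A standard way to quantify this (a textbook result, not from the slides): for n instructions flowing through a k-stage pipeline with per-stage (stepping) time $\tau$, the serial and pipelined execution times and the resulting speed-up are

$$
T_{\text{serial}} = n\,k\,\tau, \qquad
T_{\text{pipe}} = (k + n - 1)\,\tau, \qquad
S = \frac{n\,k}{k + n - 1} \;\longrightarrow\; k \quad (n \gg k),
$$

so the ideal speed-up of k is only approached once the start-up cost of filling the pipeline has been amortized over many instructions.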
Pipeline compromises • (Diagram: a 7-stage pipeline with stage times 10, 10, 5, 10, 5, 10, 5 ns) • Executed as a single unbroken unit, the time to issue an instruction is the sum of the stage times = 55 ns • Pipelined, every stage must step at the slowest stage time (10 ns), so the 5 ns stages take longer than necessary and the per-instruction latency grows to 7 × 10 ns = 70 ns • The pay-off is throughput: once full, the pipeline completes one instruction every 10 ns
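A small sketch (hypothetical, simply reproducing the arithmetic above) comparing serial execution against the pipelined version for a batch of instructions passing through these seven stages:

```c
#include <stdio.h>

int main(void)
{
    const double stage_ns[] = {10, 10, 5, 10, 5, 10, 5};
    const int  k = 7;          /* pipeline stages                     */
    const long n = 1000;       /* instructions to execute             */

    double serial_per_instr = 0.0;   /* sum of the true stage times   */
    double step_ns = 0.0;            /* slowest stage sets the clock  */
    for (int i = 0; i < k; i++) {
        serial_per_instr += stage_ns[i];
        if (stage_ns[i] > step_ns) step_ns = stage_ns[i];
    }

    double t_serial = n * serial_per_instr;    /* 55 ns per instruction */
    double t_pipe   = (k + n - 1) * step_ns;   /* fill pipeline + drain */

    printf("serial: %.0f ns, pipelined: %.0f ns, speed-up: %.2f\n",
           t_serial, t_pipe, t_serial / t_pipe);
    return 0;
}
```

For 1000 instructions this gives roughly a 5.5x speed-up over true serial execution: the unbalanced stages cap the speed-up at 55/10 = 5.5, below the 7x that seven perfectly balanced stages would allow.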
Superscalar execution • Careful about definitions: superscalar execution is not simply about having multiple instructions in flight • Superscalar processors have more than one of a given functional unit (such as the arithmetic logic unit (ALU) or load/store)
Benefits of superscalar design • Having more than one functional unit of a given type can help schedule more instructions within the pipeline • The Pentium 4 pipeline was 20 stages deep! • Enormous throughput potential, but a big penalty when the pipeline stalls • Such very deep pipelines are sometimes described as superpipelined
Other ways of increasing ILP • Branch prediction • Predict which path will be taken by assigning certain probabilities • Out of order execution • Independent operations can be rescheduled in the instruction stream • Pipelined functional units • Floating point units can be pipelined to increase throughput
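As a concrete, hypothetical illustration of why branches matter (not from the slides): the two functions below compute the same result, but the first contains a data-dependent branch that the predictor will often get wrong on random input, while the second is branch-free and is typically lowered by compilers to a conditional move, keeping the pipeline full.

```c
/* Branchy version: a mispredicted 'if' flushes the pipeline. */
long sum_big_branchy(const int *a, long n, int threshold)
{
    long sum = 0;
    for (long i = 0; i < n; i++) {
        if (a[i] >= threshold)      /* hard to predict for random data */
            sum += a[i];
    }
    return sum;
}

/* Branch-free version: the condition becomes arithmetic, so there is
 * nothing for the branch predictor to get wrong. */
long sum_big_branchless(const int *a, long n, int threshold)
{
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += (a[i] >= threshold) ? a[i] : 0;  /* usually a cmov */
    return sum;
}
```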
Limits of ILP • See D. Wall, “Limits of Instruction-Level Parallelism” (1991) • Probability of hitting hazards (instructions that cannot be pipelined) increases with pipeline length • Instruction fetch and decode rate • Remember the “von Neumann” bottleneck? It would be nice to have a single instruction for multiple operations… • Branch prediction • Multiple or nested conditional statements increase the number of branches severely • Cache locality and memory limitations • Finite limits to the effectiveness of prefetch
Scalar Processor Architectures • (Diagram: ‘scalar’ designs subdivide into pipelined and superscalar) • Pipelined: functional unit parallelism, e.g. the load/store and arithmetic units can be used in parallel (multiple instructions in flight) • Superscalar: multiple functional units of the same type, e.g. 4 floating point units can operate at the same time • Modern processors exploit this parallelism, and can’t really be called SISD
Complex Instruction Set Computing • CISC – older design idea (the x86 instruction set is CISC) • Many (powerful) instructions supported within the ISA • Upside: makes assembly programming much easier (there was lots of assembly programming in the 1960s–70s) • Upside: reduced instruction memory usage • Downside: designing the CPU is much harder
Reduced Instruction Set Computing • RISC – newer concept than CISC (but still old) • MIPS, PowerPC, SPARC are all RISC designs • Small instruction set: a CISC-type operation becomes a chain of RISC operations • Upside: easier to design the CPU • Upside: smaller instruction set => higher clock speed • Downside: assembly language is typically longer (though this is largely a compiler design issue) • Most modern x86 processors are implemented internally using RISC techniques
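A rough, hypothetical illustration of the CISC versus load/store (RISC) distinction; the comments show plausible instruction sequences, and the exact code depends on the compiler and target.

```c
/* One C statement, two styles of machine code (shown as comments).
 *
 * CISC (x86-64-like, pointer in rdi, value in esi):
 *     add  dword ptr [rdi], esi    ; read-modify-write memory in one instruction
 *
 * RISC (MIPS-like load/store, pointer in $a0, value in $a1):
 *     lw   $t0, 0($a0)             ; load *a
 *     addu $t0, $t0, $a1           ; *a + b
 *     sw   $t0, 0($a0)             ; store the result back
 */
void add_in_place(int *a, int b)
{
    *a += b;
}
```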
Birth of RISC • Roots can be traced to three research projects • IBM 801 (late 1970s, J. Cocke) • Berkeley RISC processor (~1980, D. Patterson) • Stanford MIPS processor (~1981, J. Hennessy) • Stanford & Berkeley projects driven by interest in building a simple chip that could be made in a university environment • Commercialization benefitted from 3 independent projects • Berkeley Project -> begat Sun Microsystems • Stanford Project -> begat MIPS (used by SGI)
Modern RISC processors • Complexity has nonetheless increased significantly • Superscalar execution (where the CPU has multiple functional units of the same type, e.g. two add units) requires complex circuitry to control the scheduling of operations • What if we could remove the scheduling complexity by using a smart compiler…?
VLIW & EPIC • VLIW – very long instruction word • Idea: pack a number of noninterdependent operations into one long instruction • Strong emphasis on compilers to schedule instructions • When executed, words are easily broken up and allow operations to be dispatched to independent execution units Instr 1 Instr 2 Instr 3 3 instructions scheduled into one long instruction word
VLIW & EPIC II • Natural successor to RISC – designed to avoid the need for complex scheduling in RISC designs • VLIW processors should be faster and less expensive than RISC • EPIC – explicitly parallel instruction computing, Intel’s implementation (roughly) of VLIW • ISA is called IA-64
VLIW & EPIC III • Hey – it’s 2015, why aren’t we all using Intel Itanium processors? • AMD figured out an easy extension to make x86 support 64 bits & introduced multicore • Backwards compatibility + “good enough performance” + poor Itanium compiler performance killed IA-64
RISC vs CISC recap • (Comparison table not reproduced) • The CISC approach was driven by 1970s issues of memory size (small) and memory speed (faster than the CPU)
Who “won”? – Not VLIW! • Modern x86 processors are RISC-CISC hybrids • The x86 ISA is translated at the hardware level into simpler, RISC-like internal operations (micro-ops) • Very complicated designs though, with lots of scheduling hardware • MIPS, Sun SPARC and DEC Alpha were much truer implementations of the RISC ideal • Modern metric for determining the “RISCiness” of a design: is memory accessed only through explicit LOAD/STORE instructions?
Evolution of Instruction Sets (from Patterson’s lectures, UC Berkeley CS252) • Single accumulator (EDSAC, 1950) • Accumulator + index registers (Manchester Mark I, IBM 700 series, 1953) • Separation of the programming model from the implementation: high-level-language-based machines (B5000, 1963) and the concept of a family (IBM 360, 1964) • General purpose register machines: complex instruction sets (VAX, Intel 432, 1977-80) and load/store architectures (CDC 6600, Cray 1, 1963-76) • RISC (MIPS, SPARC, HP-PA, IBM RS6000, PowerPC…, 1987) • LIW/“EPIC”? (IA-64…, 1999)
Simultaneous multithreading • A completely different technology from ILP • NOT multi-core • Designed to overcome the lack of fine-grained parallelism in code • Idea is to fill any potential gaps in the processor pipeline by switching between threads of execution on very short time scales • Requires the programmer to have created a parallel program for this to work, though • One physical processor looks like two logical processors
Motivation for SMT • Strong motivation for SMT: memory latency is making load operations take longer and longer • Need some way to hide this bottleneck (the memory wall again!) • SMT: switch execution to threads whose data is available and execute those • The Tera MTA (Tera later acquired Cray Research and became Cray Inc.) was an attempt to design a computer entirely around this concept
SMT Example: IBM POWER5 • Dual-core chip, each core can support 2 SMT threads • “MCM” package: 4 dual-core processors (8 cores) plus 144 MB of cache • SMT gives ~40-60% improvement in performance • Not bad • Intel Hyper-Threading gives ~10% improvement
Multiple cores • Simply add more CPUs • Easiest way to increase throughput now • Why do this? • Response to problem of increasing power output on modern CPUs • We’ve essentially reached the limit on improving individual core speeds • Design involves compromise: n CPUs must now share memory bus – less bandwidth to each
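A minimal sketch (assuming a compiler with OpenMP support, e.g. built with -fopenmp) of how throughput comes from extra cores: the loop below is split across the cores, but every core pulls its data over the same shared memory bus, which is exactly the bandwidth compromise noted above.

```c
#include <stdio.h>
#include <omp.h>

/* Each core sums a chunk of the array; OpenMP combines the partial sums.
 * More cores raise throughput, but they all share the memory bandwidth. */
double parallel_sum(const double *a, long n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

int main(void)
{
    static double a[1000000];
    for (long i = 0; i < 1000000; i++) a[i] = 1.0;
    printf("threads=%d sum=%.0f\n", omp_get_max_threads(),
           parallel_sum(a, 1000000));
    return 0;
}
```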
Intel & AMD multi-core processors • Intel 18-core processors, codename “Haswell” • Design envelope 150 W, but divide by the number of cores => each core is very power efficient • AMD has 16-core processors, codename “Warsaw” • 115 W design envelope • Individual cores not as good as Intel’s though
Summary • Flynn’s taxonomy categorizes instruction and data flow in computers • Modern processors are MIMD • Pipelining and superscalar design improve CPU performance by increasing the instructions per clock • CISC/RISC design approaches appear to be reaching the limits of their applicability • VLIW didn’t make an impact – will it return? • In the absence of improved single core performance, designers are simply integrating more cores