
Computational Methods in Astrophysics ASTR 5210


Presentation Transcript


  1. Computational Methods in Astrophysics ASTR 5210 Dr Rob Thacker (AT319E) thacker@ap.smu.ca

  2. Today’s Lecture • More Computer Architecture • Flynn’s Taxonomy • Improving CPU performance (instructions per clock) • Instruction Set Architecture classifications • Future of CPU design

  3. Machine architecture classifications • Flynn’s taxonomy (see IEEE Trans. Comp. Vol C-21, 1972) • A way of describing the information flow in computers: an architectural definition • Information is divided into instructions (I) and data (D) • There can be single (S) or multiple (M) instances of both • Four combinations: SISD, SIMD, MISD, MIMD

  4. SISD • Single Instruction, Single Data • An absolutely serial execution model • Typically viewed as describing a serial computer, but today’s CPUs exploit parallelism internally • (Diagram: a single processor P connected to a single memory M, operating on a single data element)

  5. SIMD • Single Instruction, Multiple Data • One instruction is applied to multiple data streams at the same time • A single instruction processor K broadcasts each instruction to an array of processing elements (PEs) • Each processing element typically has its own data memory Ma • (Diagram: control processor K feeding an array of PE/memory pairs)
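
A compiler-level illustration of the SIMD idea (a minimal sketch, not from the slides; the function name and compiler flags are illustrative): the loop below applies the same operation to every array element, so a vectorizing compiler (e.g. gcc -O3) can map it onto SIMD instructions such as SSE/AVX.

```c
#include <stdio.h>

#define N 1024

/* The same instruction (multiply-add) applied to many data elements:
   an auto-vectorizer can turn this loop into SIMD instructions. */
void saxpy(float a, const float *x, const float *y, float *z, int n)
{
    for (int i = 0; i < n; i++)
        z[i] = a * x[i] + y[i];
}

int main(void)
{
    static float x[N], y[N], z[N];
    for (int i = 0; i < N; i++) { x[i] = (float)i; y[i] = 1.0f; }

    saxpy(2.0f, x, y, z, N);
    printf("z[10] = %f\n", z[10]);   /* expect 21.0 */
    return 0;
}
```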

  6. MISD • Multiple Instruction, Single Data • Largely a useless category in practice (not important) • Closest relevant example would be a CPU that can `pipeline’ instructions • Each processor has its own instruction stream (Mi) but operates on the same data stream (Ma) • Example sometimes cited: a systolic array, a network of small elements connected in a regular grid operating under a global clock, reading and writing data from/to their neighbours

  7. MIMD • Multiple Instruction, Multiple Data • Covers a host of modern architectures • Processors have independent data and instruction streams • Processors may communicate directly or via shared memory • (Diagram: multiple processors P, each with its own memory M, linked by an interconnect)

  8. Instruction Set Architecture • ISA – the interface between hardware and software • ISAs are typically common to a CPU family, e.g. x86, MIPS (members of a family are more alike than different) • Assembly language is a realization of the ISA in a form that is easy to remember (and program)

  9. Key Concept in ISA evolution and CPU design • Efficiency gains to be had by executing as many operations per clock cycle as possible • Instruction level parallelism (ILP) • Exploit parallelism within the instruction stream • Programmer does not see this parallelism explicitly • Goal of modern CPU design – maximize the number of instructions per clock cycle (IPC), equivalently reduce cycles per instruction (CPI)
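
In symbols (just restating the equivalence in the last bullet):

```latex
\[
\mathrm{IPC} \;=\; \frac{\text{instructions executed}}{\text{clock cycles}}
\;=\; \frac{1}{\mathrm{CPI}},
\qquad\text{so maximizing IPC is the same as minimizing CPI.}
\]
```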

  10. ILP versus thread level parallelism • Many modern programs have more than one (parallel) “thread” of execution • Instruction level parallelism breaks down a single thread of execution to try and find parallelism at the instruction level • (Diagram: one “thread” whose instructions 1, 2, 3 are executed in parallel even though there is only one thread)
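
A small C sketch (not from the slides) of what the hardware looks for: the three statements below have no data dependences on one another, so a superscalar or pipelined core can overlap them within a single thread, while the final statement depends on all three and must wait.

```c
#include <stdio.h>

int main(void)
{
    double x = 2.0, y = 3.0, z = 4.0;

    /* Independent operations: no result feeds another, so the CPU
       can overlap them (instruction level parallelism). */
    double a = x * x;
    double b = y + 1.0;
    double c = z * 0.5;

    /* Dependent operation: must wait for a, b and c. */
    double d = a + b + c;

    printf("d = %f\n", d);   /* 4 + 4 + 2 = 10 */
    return 0;
}
```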

  11. ILP techniques • The two main ILP techniques are • Pipelining – including additional techniques such as out-of-order execution • Superscalar execution

  12. Pipelining • Multiple instructions overlapped in execution • Throughput optimization: doesn’t reduce the time for individual instructions • (Diagram: instructions 1-7 moving through a 7-stage pipeline, with a new instruction entering Stage 1 at each step while earlier instructions advance through Stages 2-7)

  13. Design sweet spot • Pipeline stepping time is determined by the slowest operation in the pipeline • Best speed-up when all stages take the same amount of time • Net time per instruction = stepping time (equivalently, total instruction time / number of pipeline stages when the stages are balanced) • Perfect speed-up factor = number of pipeline stages • Never achieved in practice: start-up overheads to consider

  14. Pipeline compromises • If the seven stages take 10, 10, 5, 10, 5, 10, 5 ns, the time to issue an instruction through the pipeline would be 55 ns • But the pipeline must be stepped at the slowest stage (10 ns), so every stage effectively takes 10 ns and the instruction takes 7 × 10 = 70 ns • The 5 ns stages take longer than necessary
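
The arithmetic above, written compactly (a sketch using the stage times quoted on the slide, with $t_i$ the time for stage $i$ and $n$ the number of stages):

```latex
\[
t_{\mathrm{step}} = \max_i t_i = 10\,\mathrm{ns}, \qquad
\sum_{i=1}^{7} t_i = 55\,\mathrm{ns}, \qquad
T_{\mathrm{latency}} = n\, t_{\mathrm{step}} = 7 \times 10 = 70\,\mathrm{ns}.
\]
\[
\text{Steady-state throughput} = \frac{1\ \text{instruction}}{t_{\mathrm{step}}}, \qquad
\text{Speed-up over serial issue} \simeq \frac{\sum_i t_i}{t_{\mathrm{step}}} = \frac{55}{10} = 5.5
\;\; (\text{ideal: } n = 7 \text{ with balanced stages}).
\]
```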

  15. Superscalar execution • Careful about definitions: superscalar execution is not simply about having multiple instructions in flight • Superscalar processors have more than one of a given functional unit (such as the arithmetic logic unit (ALU) or load/store)

  16. Benefits of superscalar design • Having more than one functional unit of a given type can help schedule more instructions within the pipeline • The Pentium 4 pipeline was 20 stages deep! • Enormous throughput potential, but a big penalty when the pipeline stalls • Incorporating multiple units into a very deep pipeline like this is sometimes described as superpipelining

  17. Other ways of increasing ILP • Branch prediction • Predict which path will be taken by assigning certain probabilities • Out of order execution • Independent operations can be rescheduled in the instruction stream • Pipelined functional units • Floating point units can be pipelined to increase throughput
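
As an illustration of why branch prediction matters (a minimal sketch, not from the slides; the masking trick is just one common alternative): the data-dependent branch in the first loop is hard to predict when the data are random, so mispredictions stall the pipeline; the second loop computes the same sum branchlessly, leaving nothing to predict.

```c
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

int main(void)
{
    static int a[N];
    for (int i = 0; i < N; i++)
        a[i] = rand() % 256;               /* unpredictable values */

    /* Branchy version: the if() outcome depends on random data,
       so the branch predictor is frequently wrong. */
    long sum_branchy = 0;
    for (int i = 0; i < N; i++)
        if (a[i] >= 128)
            sum_branchy += a[i];

    /* Branchless version: the condition becomes an arithmetic mask,
       removing the unpredictable branch from the loop body. */
    long sum_branchless = 0;
    for (int i = 0; i < N; i++) {
        long keep = -(long)(a[i] >= 128);  /* 0 or an all-ones mask */
        sum_branchless += a[i] & keep;
    }

    printf("%ld %ld\n", sum_branchy, sum_branchless);  /* identical sums */
    return 0;
}
```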

  18. Limits of ILP • See D. Wall, “Limits of ILP”, 1991 • Probability of hitting hazards (situations that prevent instructions from being overlapped) increases with pipeline length • Instruction fetch and decode rate: remember the “von Neumann” bottleneck? It would be nice to have a single instruction for multiple operations… • Branch prediction: multiple conditional statements increase the number of branches severely • Cache locality and memory limitations: finite limits to the effectiveness of prefetch

  19. Scalar Processor Architectures • (Diagram: the family of ‘scalar’ designs, subdivided into pipelined and superscalar) • Pipelined: functional unit parallelism, e.g. load/store and arithmetic units can be used in parallel (instructions in parallel) • Superscalar: multiple functional units, e.g. 4 floating point units can operate at the same time • Modern processors exploit this parallelism, and can’t really be called SISD

  20. Complex Instruction Set Computing • CISC – the older design idea (the x86 instruction set is CISC) • Many (powerful) instructions supported within the ISA • Upside: makes assembly programming much easier (there was a lot of assembly programming in the 1960s-70s) • Upside: reduced instruction memory usage • Downside: designing the CPU is much harder

  21. Reduced Instruction Set Computing • RISC – a newer concept than CISC (but still old) • MIPS, PowerPC and SPARC are all RISC designs • Small instruction set; a CISC-type operation becomes a chain of RISC operations • Upside: easier to design the CPU • Upside: smaller instruction set => higher clock speed • Downside: assembly programs are typically longer (though this is largely a compiler design issue) • Most modern x86 processors are implemented internally using RISC techniques

  22. Birth of RISC • Roots can be traced to three research projects • IBM 801 (late 1970s, J. Cocke) • Berkeley RISC processor (~1980, D. Patterson) • Stanford MIPS processor (~1981, J. Hennessy) • Stanford & Berkeley projects driven by interest in building a simple chip that could be made in a university environment • Commercialization benefitted from 3 independent projects • Berkeley Project -> begat Sun Microsystems • Stanford Project -> begat MIPS (used by SGI)

  23. Modern RISC processors • Complexity has nonetheless increased significantly • Superscalar execution (where the CPU has multiple functional units of the same type, e.g. two add units) requires complex circuitry to control the scheduling of operations • What if we could remove the scheduling complexity by using a smart compiler…?

  24. VLIW & EPIC • VLIW – very long instruction word • Idea: pack a number of mutually independent operations into one long instruction • Strong emphasis on compilers to schedule instructions • When executed, the long words are easily broken up and the operations dispatched to independent execution units • (Diagram: 3 instructions scheduled into one long instruction word)

  25. VLIW & EPIC II • Natural successor to RISC – designed to avoid the need for complex scheduling in RISC designs • VLIW processors should be faster and less expensive than RISC • EPIC – explicitly parallel instruction computing, Intel’s implementation (roughly) of VLIW • ISA is called IA-64

  26. VLIW & EPIC III • Hey – it’s 2015, why aren’t we all using Intel Itanium processors? • AMD figured out an easy extension to make x86 support 64 bits & introduced multicore • Backwards compatibility + “good enough performance” + poor Itanium compiler performance killed IA-64

  27. RISC vs CISC recap • (Comparison table not reproduced in the transcript) • *The CISC approach was driven by 1970s issues of memory size (small) and memory speed (faster than the CPU)

  28. Who “won”? – Not VLIW! • Modern x86 processors are RISC-CISC hybrids • The ISA is translated at the hardware level into shorter, RISC-like internal instructions • Very complicated designs though, with lots of scheduling hardware • MIPS, Sun SPARC and DEC Alpha were much truer implementations of the RISC ideal • A modern metric for the “RISCiness” of a design: is it a load/store architecture, i.e. is memory accessed only through explicit LOAD and STORE instructions?

  29. Evolution of Instruction Sets (from Patterson’s lectures, UC Berkeley CS252) • Single accumulator (EDSAC, 1950) • Accumulator + index registers (Manchester Mark I, IBM 700 series, 1953) • Separation of programming model from implementation: high-level language based (B5000, 1963) and the concept of a family (IBM 360, 1964) • General purpose register machines: load/store architecture (CDC 6600, Cray 1, 1963-76) and complex instruction sets (VAX, Intel 432, 1977-80) • RISC (MIPS, SPARC, HP-PA, IBM RS6000, PowerPC, …, 1987) • LIW/“EPIC”? (IA-64, …, 1999)

  30. Simultaneous multithreading • A completely different technology from ILP – NOT multi-core • Designed to overcome the lack of fine-grained parallelism in code • The idea is to fill any potential gaps in the processor pipeline by switching between threads of execution on very short time scales • Requires the programmer to have created a parallel (multi-threaded) program for this to work, though • One physical processor looks like two logical processors
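
A minimal sketch (not from the slides) of the kind of multi-threaded program SMT can exploit, using POSIX threads: the two threads below provide independent instruction streams that a single physical core can interleave to fill pipeline gaps. Compile with -pthread.

```c
#include <pthread.h>
#include <stdio.h>

#define N 10000000L

/* Each thread computes a partial sum over its half of the range,
   giving the core two independent instruction streams to interleave. */
struct work { long lo, hi; double sum; };

static void *partial_sum(void *arg)
{
    struct work *w = arg;
    w->sum = 0.0;
    for (long i = w->lo; i < w->hi; i++)
        w->sum += 1.0 / (double)(i + 1);
    return NULL;
}

int main(void)
{
    struct work w0 = { 0, N / 2, 0.0 };
    struct work w1 = { N / 2, N, 0.0 };
    pthread_t t0, t1;

    pthread_create(&t0, NULL, partial_sum, &w0);
    pthread_create(&t1, NULL, partial_sum, &w1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);

    printf("harmonic sum = %f\n", w0.sum + w1.sum);
    return 0;
}
```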

  31. Motivation for SMT • Strong motivation for SMT: memory latency is making load operations take longer and longer • Need some way to hide this bottleneck (the memory wall again!) • SMT: switch execution over to threads that do have their data available and execute those • The Tera MTA was an attempt to design a computer entirely around this concept (Tera later acquired Cray Research and took the Cray name)

  32. SMT example: IBM POWER5 • Dual-core chip; each core can support 2 SMT threads • “MCM” package: 4 dual-core processors (8 cores) and 144 MB of cache • SMT gives ~40-60% improvement in performance. Not bad • Intel Hyper-Threading gives ~10% improvement

  33. Multiple cores • Simply add more CPUs – now the easiest way to increase throughput • Why do this? It is a response to the problem of increasing power dissipation in modern CPUs • We’ve essentially reached the limit on improving individual core speeds • The design involves a compromise: n CPUs must now share the memory bus, so there is less bandwidth for each
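
A minimal sketch (not from the slides) of how a program puts multiple cores to work, using OpenMP: the loop iterations are divided among the cores, and all cores share the same memory bus, which is the bandwidth compromise noted above. Compile with e.g. gcc -fopenmp.

```c
#include <omp.h>
#include <stdio.h>

#define N 10000000

int main(void)
{
    static double a[N];

    /* The iterations are split among the available cores;
       every core reads and writes the same shared array in memory. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = (double)i * 0.5;

    printf("ran with up to %d threads, a[N-1] = %f\n",
           omp_get_max_threads(), a[N - 1]);
    return 0;
}
```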

  34. Intel & AMD multi-core processors • Intel: 18-core processors, codename “Haswell” • Design envelope 150 W, but divided among the cores each core is very power efficient • AMD: 16-core processors, codename “Warsaw”, with a 115 W design envelope • AMD’s individual cores are not as good as Intel’s, though

  35. Summary • Flynn’s taxonomy categorizes instruction and data flow in computers • Modern processors are MIMD • Pipelining and superscalar design improve CPU performance by increasing the instructions per clock • CISC/RISC design approaches appear to be reaching the limits of their applicability • VLIW didn’t make an impact – will it return? • In the absence of improved single core performance, designers are simply integrating more cores
