
Embracing Parallelism in Modern Computer Architecture

Explore the shift from ILP toward explicit parallelism in computer architecture, covering multithreading, vector processing, the memory hierarchy, performance, power, and dependability.


Presentation Transcript


  1. EE (CE) 6304 Computer Architecture, Lecture #3 (8/29/17) • Yiorgos Makris, Professor, Department of Electrical Engineering, University of Texas at Dallas • Course Web-site: http://www.utdallas.edu/~gxm112130/EE6304FA17

  2. Have we reached the end of ILP? • Multiple processors easily fit on a chip • Every major microprocessor vendor has gone to multithreaded cores • Thread: locus of control, execution context • Fetch instructions from multiple threads at once, throw them all into the execution unit • Intel: Hyper-Threading • The concept has existed in high-performance computing for 20 years (or is it 40? CDC 6600) • Vector processing • Each instruction processes many distinct data elements • Ex: MMX • Raise the level of architecture – many processors per chip (e.g., the Tensilica configurable processor)

  3. Limiting Forces: Clock Speed and ILP • Chip density continues to increase ~2x every 2 years • Clock speed does not • # processors/chip (cores) may double instead • There is little or no more Instruction Level Parallelism (ILP) to be found • We can no longer allow the programmer to think in terms of a serial programming model • Conclusion: Parallelism must be exposed to software! Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

  4. Examples of MIMD Machines [diagrams: processors on a shared bus with memory; a grid of processor/memory (P/M) nodes on a scalable network with a host; a cluster of machines on a general network] • Symmetric Multiprocessor • Multiple processors in a box with shared-memory communication • Current multicore chips are like this • Every processor runs a copy of the OS • Non-uniform shared memory with separate I/O through a host • Multiple processors • Each with local memory • General scalable network • Extremely light “OS” on each node provides simple services • Scheduling/synchronization • Network-accessible host for I/O • Cluster • Many independent machines connected with a general network • Communication through messages

  5. Categories of Thread Execution [diagram: issue slots over time (processor cycles) for Superscalar, Fine-Grained multithreading, Coarse-Grained multithreading, Simultaneous Multithreading, and Multiprocessing, showing instructions from Threads 1–5 and idle slots]

  6. Processor-DRAM Memory Gap (latency) [plot: performance (log scale) vs. time, 1980–2000] • µProc performance improves 60%/yr. (2X/1.5 yr) • DRAM latency improves 9%/yr. (2X/10 yrs) • Processor-Memory Performance Gap grows 50%/year

  7. The Memory Abstraction • Association of <name, value> pairs • typically named as byte addresses • often values aligned on multiples of size • Sequence of Reads and Writes • Write binds a value to an address • Read of an address returns the most recently written value bound to that address • Interface: command (R/W), address (name), data (W), data (R), done (see the sketch below)
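
To make the read/write semantics concrete, here is a minimal Python sketch (not from the slides) that models memory as <address, value> pairs: a write binds a value to an address, and a read returns the most recently written value bound to that address.

  class Memory:
      """Toy model of the memory abstraction: a map from addresses to values."""

      def __init__(self):
          self.cells = {}  # address -> most recently written value

      def write(self, addr, value):
          self.cells[addr] = value  # bind value to address

      def read(self, addr):
          return self.cells[addr]  # most recently written value for this address

  mem = Memory()
  mem.write(0x1000, 42)
  mem.write(0x1000, 7)   # a later write overrides the earlier binding
  assert mem.read(0x1000) == 7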

  8. Memory Hierarchy • Take advantage of the principle of locality to: • Present as much memory as in the cheapest technology • Provide access at the speed offered by the fastest technology [diagram: hierarchy from Registers and On-Chip Cache in the datapath, through Second Level Cache (SRAM) and Main Memory (DRAM/FLASH/PCM), to Secondary Storage (Disk/FLASH/PCM) and Tertiary Storage (Tape/Cloud Storage); speed ranges from ~1 ns (registers) through 10,000,000s ns (10s ms, disk) to 10,000,000,000s ns (10s sec, tape); size ranges from 100s of bytes (registers) to terabytes (tape)]

  9. The Principle of Locality • The Principle of Locality: • Programs access a relatively small portion of the address space at any instant of time. • Two Different Types of Locality: • Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse) • Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access; see the example below) • For the last 30 years, HW has relied on locality for speed
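
As an illustration (not from the slides), the loop below shows both kinds of locality: the running total is reused on every iteration (temporal locality), and the array elements are read from consecutive addresses (spatial locality), which is exactly the access pattern that caches reward.

  def sum_array(a):
      total = 0                # 'total' is reused every iteration: temporal locality
      for i in range(len(a)):  # a[0], a[1], a[2], ... sit at adjacent addresses: spatial locality
          total += a[i]
      return total

  print(sum_array(list(range(1_000_000))))  # straight-line loop over a contiguous array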

  10. Example of a modern core: Nehalem • On-chip cache resources: • For each core: L1: 32KB instruction and 32KB data cache, L2: 256KB • L3: 8MB shared among all 4 cores • Integrated, on-chip memory controller (DDR3)

  11. Memory Abstraction and Parallelism [diagrams: processors P1…Pn, each with a cache ($), connected to memories through an interconnection network] • Maintaining the illusion of sequential access to memory across a distributed system • What happens when multiple processors access the same memory at once? • Do they see a consistent picture? • Processing and processors embedded in the memory?

  12. Is it all about communication? [diagram: Pentium IV chipset – processor, caches, busses, memory, controllers, adapters, and I/O devices: disks, displays, keyboards, networks]

  13. Breaking the HW/Software Boundary • Moore’s law (more and more transistors) is all about volume and regularity • What if you could pour nano-acres of unspecific digital logic “stuff” onto silicon? • Do anything with it. Very regular, large volume • Field Programmable Gate Arrays • The chip is covered with logic blocks with FFs, RAM blocks, and interconnect • All three are “programmable” by setting configuration bits • These are huge • Can each program have its own instruction set? • Do we compile the program entirely into hardware?

  14. “Bell’s Law” – a new computer class per decade [plot: log(people per computer) vs. year, with classes evolving from number crunching and data storage, through productivity and interactive computing, to streaming information to/from the physical world] • Enabled by technological opportunities • Smaller, more numerous and more intimately connected • Brings in a new kind of application • Used in many ways not previously imagined

  15. It’s not just about bigger and faster! • Complete computing systems can be tiny and cheap • System on a chip • Resource efficiency • Real-estate, power, pins, …

  16. Understanding & Quantifying Cost, Performance, Power, Dependability & Reliability

  17. Integrated Circuit Cost • Integrated circuit • Bose-Einstein formula: Die yield = Wafer yield x 1 / (1 + Defects per unit area x Die area)^N • Defects per unit area = 0.016-0.057 defects per square cm (2010) • N = process-complexity factor = 11.5-15.5 (40 nm, 2010) • (see the yield sketch below)
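
A minimal sketch of the yield calculation, assuming the formula above; the defect density (0.03/cm^2) and process-complexity factor (13.5) are picked from the 2010 ranges on the slide, and the 1.5 cm^2 die area is a hypothetical value for illustration.

  def die_yield(wafer_yield, defects_per_cm2, die_area_cm2, n):
      # Bose-Einstein model: yield drops as defects per die and process complexity grow
      return wafer_yield * (1 + defects_per_cm2 * die_area_cm2) ** (-n)

  # 0.03 defects/cm^2 and n = 13.5 lie inside the slide's 2010 ranges; 1.5 cm^2 is hypothetical
  print(f"Die yield: {die_yield(1.0, 0.03, 1.5, 13.5):.1%}")  # about 55%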

  18. Which is faster?
  Plane              DC to Paris   Speed      Passengers   Throughput (pmph)
  Boeing 747         6.5 hours     610 mph    470          286,700
  BAC/Sud Concorde   3 hours       1350 mph   132          178,200
  • Time to run the task (ExTime) • Execution time, response time, latency • Tasks per day, hour, week, sec, ns … (Performance) • Throughput, bandwidth

  19. Definitions • Performance is in units of things per sec • bigger is better • performance(X) = 1 / execution_time(X) • If we are primarily concerned with response time • “X is n times faster than Y” means n = Performance(X) / Performance(Y) = Execution_time(Y) / Execution_time(X) (see the quick check below)
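
A quick numeric check of the definition, with hypothetical execution times: if Y takes 15 s and X takes 10 s, then X is 1.5 times faster than Y.

  time_x, time_y = 10.0, 15.0   # hypothetical execution times in seconds
  n = time_y / time_x           # = performance(X) / performance(Y)
  print(f"X is {n:.1f} times faster than Y")   # 1.5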

  20. Processor performance equation • CPU time = Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle), i.e., instruction count x CPI x cycle time • What affects each factor:
  Factor         Inst Count   CPI   Clock Rate
  Program        X
  Compiler       X            (X)
  Inst. Set      X            X
  Organization                X     X
  Technology                        X
  (see the worked example below)
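
A small worked example of the equation with made-up numbers: 1 billion instructions at an average CPI of 1.5 on a 2 GHz clock.

  instruction_count = 1e9   # hypothetical dynamic instruction count
  cpi = 1.5                 # average cycles per instruction
  clock_rate = 2e9          # 2 GHz, so cycle time = 1 / clock_rate seconds

  cpu_time = instruction_count * cpi * (1 / clock_rate)
  print(f"CPU time = {cpu_time:.3f} s")   # 0.750 s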

  21. Cycles Per Instruction (Throughput) • “Average Cycles per Instruction”: CPI = (CPU Time x Clock Rate) / Instruction Count = Cycles / Instruction Count • CPI = Σ_i CPI_i x F_i, where F_i = I_i / Instruction Count is the “Instruction Frequency” of class i

  22. Example: Calculating CPI bottom up • Run a benchmark and collect a workload characterization (simulate, machine counters, or sampling) • Base Machine (Reg / Reg), typical mix of instruction types in a program:
  Op       Freq   Cycles   CPI(i)   (% Time)
  ALU      50%    1        .5       (33%)
  Load     20%    2        .4       (27%)
  Store    10%    2        .2       (13%)
  Branch   20%    2        .4       (27%)
  Total CPI = 1.5
  • Design guideline: make the common case fast • MIPS 1% rule: only consider adding an instruction if it is shown to add a 1% performance improvement on reasonable benchmarks. (See the script below.)
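
The same bottom-up calculation, written as a short script over the instruction mix in the table.

  # (frequency, cycles) per instruction class, taken from the table above
  mix = {"ALU": (0.50, 1), "Load": (0.20, 2), "Store": (0.10, 2), "Branch": (0.20, 2)}

  cpi = sum(freq * cycles for freq, cycles in mix.values())
  print(f"Average CPI = {cpi:.1f}")   # 1.5
  for op, (freq, cycles) in mix.items():
      share = freq * cycles / cpi
      print(f"{op:6s} contributes {freq * cycles:.1f} cycles/inst ({share:.0%} of time)")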

  23. Example: Branch Stall Impact • Assume CPI = 1.0 ignoring branches (ideal) • Assume each branch stalls for 3 cycles • If 30% of instructions are branches, stall 3 cycles on 30%:
  Op       Freq   Cycles   CPI(i)   (% Time)
  Other    70%    1        .7       (37%)
  Branch   30%    4        1.2      (63%)
  • => new CPI = 1.9 • The new machine runs at 1/1.9 = 0.52 times the ideal speed (i.e., slower!); see the script below
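
The same arithmetic as a script: a 3-cycle stall on 30% of the instructions raises the CPI from 1.0 to 1.9, so the machine runs at roughly half its ideal speed.

  ideal_cpi = 1.0
  branch_freq = 0.30
  branch_stall = 3   # extra cycles paid on every branch

  new_cpi = ideal_cpi + branch_freq * branch_stall   # 1.0 + 0.9 = 1.9
  relative_speed = ideal_cpi / new_cpi               # fraction of ideal performance
  print(f"New CPI = {new_cpi:.1f}, machine runs at {relative_speed:.2f}x ideal")  # ~0.5x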

  24. Speed Up Equation for Pipelining • Speedup = (CPI unpipelined x Clock cycle unpipelined) / (CPI pipelined x Clock cycle pipelined) • For a simple RISC pipeline with ideal CPI = 1 and equal cycle times: Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)
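
A sketch of that relationship under the stated assumptions (ideal CPI of 1, unchanged cycle time); the 5-stage depth and 0.5 stall cycles per instruction are made-up illustration values.

  def pipeline_speedup(depth, stall_cpi):
      # Ideal CPI = 1; stalls add 'stall_cpi' cycles per instruction on average
      return depth / (1 + stall_cpi)

  print(f"{pipeline_speedup(depth=5, stall_cpi=0.5):.2f}x")   # 5 / 1.5 = 3.33x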

  25. Making the common case fast • Often an architect spends tremendous effort and time optimizing some aspect of a system • only to realize later that the overall speedup is unrewarding • So it is better to measure how much that aspect of the system is actually used before attempting to optimize it • In making a design trade-off • Favor the frequent case over the infrequent case • In allocating additional resources • Allocate them to improve the frequent event rather than a rare event • So, what principle quantifies this scenario?

  26. Amdahl’s Law • Speedup_overall = ExTime_old / ExTime_new = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced) • Best you could ever hope to do (Speedup_enhanced → ∞): Speedup_maximum = 1 / (1 − Fraction_enhanced)

  27. Amdahl’s Law example • New CPU is 10X faster • I/O-bound server, so 60% of the time is spent waiting for I/O • Speedup_overall = 1 / ((1 − 0.4) + 0.4 / 10) = 1 / 0.64 = 1.56 • Apparently, it’s human nature to be attracted by “10X faster” rather than keeping in perspective that it’s just ~1.6X faster
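
Plugging the slide's numbers into Amdahl's Law: only 40% of the time benefits from the 10X-faster CPU, so the overall speedup is about 1.56X.

  fraction_enhanced = 0.40   # CPU time; the other 60% is spent waiting for I/O
  speedup_enhanced = 10.0    # the new CPU is 10X faster

  overall = 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)
  print(f"Overall speedup = {overall:.2f}X")   # 1.56X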

  28. Define and quantify power (1/2) • For CMOS chips, the traditionally dominant energy consumption has been in switching transistors, called dynamic power: Power_dynamic = 1/2 x Capacitive load x Voltage^2 x Frequency switched • For mobile devices, energy is the better metric: Energy_dynamic = Capacitive load x Voltage^2 • For a fixed task, slowing the clock rate (frequency switched) reduces power, but not energy • Capacitive load is a function of the number of transistors connected to an output and of the technology, which determines the capacitance of wires and transistors • Dropping voltage helps both, so supplies went from 5V to 1V • To save energy & dynamic power, most CPUs now turn off the clock of inactive modules (e.g., Fl. Pt. Unit)

  29. Example of quantifying power • Suppose a 15% reduction in voltage results in a 15% reduction in frequency. What is the impact on dynamic power? • Power_new / Power_old = (0.85 x Voltage)^2 x (0.85 x Frequency) / (Voltage^2 x Frequency) = 0.85^3 ≈ 0.61, i.e., about a 39% reduction (see the check below)
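
The check as a short script, using the proportionality that dynamic power scales with Voltage^2 x Frequency (capacitive load unchanged).

  voltage_scale, frequency_scale = 0.85, 0.85   # 15% reduction in each

  # dynamic power scales with voltage^2 x frequency; capacitive load is unchanged
  power_scale = voltage_scale ** 2 * frequency_scale
  print(f"New dynamic power = {power_scale:.2f} of the original")   # 0.61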

  30. Define and quantify power (2/2) • Because leakage current flows even when a transistor is off, static power is now important too: Power_static = Current_static x Voltage • Leakage current increases in processors with smaller transistor sizes • Increasing the number of transistors increases power even if they are turned off • In 2006, the goal for leakage was 25% of total power consumption; high-performance designs were at 40% • Very-low-power systems even gate the voltage to inactive modules to control loss due to leakage

  31. Power and Energy • Energy to complete operation (Joules) • Corresponds approximately to battery life • (Battery energy capacity actually depends on rate of discharge) • Peak power dissipation (Watts = Joules/second) • Affects packaging (power and ground pins, thermal design) • di/dt, peak change in supply current (Amps/second) • Affects power supply noise (power and ground pins, decoupling capacitors)

  32. Peak Power versus Lower Energy [plot: power vs. time for two systems; integrate the power curve to get energy] • System A has the higher peak power, but lower total energy • System B has the lower peak power, but higher total energy

  33. Define and quantify dependability (1/3) • How do we decide when a system is operating properly? • Infrastructure providers now offer Service Level Agreements (SLAs) to guarantee that their networking or power service will be dependable • Systems alternate between 2 states of service with respect to an SLA: • Service accomplishment, where the service is delivered as specified in the SLA • Service interruption, where the delivered service is different from the SLA • Failure = transition from state 1 to state 2 • Restoration = transition from state 2 to state 1

  34. Define and quantify dependability (2/3) • Module reliability = a measure of continuous service accomplishment (or time to failure). 2 metrics: • Mean Time To Failure (MTTF) measures Reliability • Failures In Time (FIT) = 1/MTTF, the rate of failures • Traditionally reported as failures per billion hours of operation • Mean Time To Repair (MTTR) measures Service Interruption • Mean Time Between Failures (MTBF) = MTTF + MTTR • Module availability measures service as it alternates between the 2 states of accomplishment and interruption (a number between 0 and 1, e.g., 0.9) • Module availability = MTTF / (MTTF + MTTR)

  35. Example calculating reliability • If modules have exponentially distributed lifetimes (the age of a module does not affect its probability of failure), the overall failure rate is the sum of the failure rates of the modules • Calculate FIT and MTTF for 10 disks (1M hour MTTF per disk), 1 disk controller (0.5M hour MTTF), and 1 power supply (0.2M hour MTTF), worked out below:
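
The calculation as a script: summing the per-module failure rates gives 17,000 FIT, i.e., a system MTTF of about 59,000 hours.

  # (count, MTTF in hours) for each module type, from the slide
  modules = [(10, 1_000_000),   # disks
             (1, 500_000),      # disk controller
             (1, 200_000)]      # power supply

  failure_rate = sum(count / mttf for count, mttf in modules)   # failures per hour
  fit = failure_rate * 1e9                                      # failures per 10^9 hours
  system_mttf = 1 / failure_rate
  print(f"FIT = {fit:,.0f}, system MTTF = {system_mttf:,.0f} hours")   # 17,000 FIT, ~58,824 h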
