CS 15-447: Computer Architecture • Lecture 26: Emerging Architectures • November 19, 2007 • Nael Abu-Ghazaleh • naelag@cmu.edu • http://www.qatar.cmu.edu/~msakr/15447-f08
Last Time: Buses and I/O • Buses: bunch of wires • Shared interconnect: multiple "devices" connect to the same bus • Versatile: new devices can connect (even ones we didn't know existed when the bus was designed) • Can become a bottleneck • Shorter -> faster; fewer devices -> faster • Have to: • Define the protocol that lets devices communicate • Come up with an arbitration mechanism • (Figure: bus drawn as shared control lines and data lines)
Types of Buses • System (processor-memory) bus • Connects processor and memory • Short, fast, synchronous, design specific • I/O bus • Usually longer and slower; industry standard • Needs to match a wide range of I/O devices • Connects to the processor-memory bus or backplane bus through a bus adaptor • (Figure: processor-memory bus linked via bus adaptors to a backplane bus and I/O buses)
Bus "Mechanics" • Master/slave roles • Have to define how devices handshake • Depends on whether the bus is synchronous or not • Bus arbitration protocol • Contention vs. reservation; centralized vs. distributed • I/O model • Programmed I/O; interrupt-driven I/O; DMA (see the sketch below) • Increasing performance (mainly bandwidth) • Shorter; closer; wider • Block transfers (instead of byte transfers) • Split-transaction buses • …
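To make the I/O-model bullet concrete, here is a minimal C sketch of programmed (polled) I/O against a hypothetical memory-mapped device; the register addresses, names, and status bit below are illustrative assumptions, not part of the lecture.

```c
/* Minimal sketch of programmed (polled) I/O, assuming a hypothetical
 * memory-mapped device with a status register and a data register.
 * Addresses and bit layout are made up for illustration. */
#include <stdint.h>

#define DEV_STATUS ((volatile uint32_t *)0x40000000)  /* hypothetical */
#define DEV_DATA   ((volatile uint32_t *)0x40000004)  /* hypothetical */
#define STATUS_READY 0x1u

uint32_t programmed_io_read(void)
{
    /* Programmed I/O: the CPU busy-waits on the status register,
     * burning cycles until the device signals it has data. */
    while ((*DEV_STATUS & STATUS_READY) == 0)
        ;                       /* spin: no useful work gets done */
    return *DEV_DATA;           /* then the CPU copies the data itself */
}
/* Interrupt-driven I/O replaces the spin loop with an interrupt handler,
 * and DMA lets the device copy whole blocks to memory without the CPU. */
```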
Today—Emerging Architectures • We are at an interesting point in computer architecture evolution • What is emerging and why is it emerging?
Uniprocessor Performance (SPECint) • (Figure: SPECint uniprocessor performance over time, annotated "3X" and "??%/year"; from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Sept. 15, 2006) • VAX: 25%/year, 1978 to 1986 • RISC + x86: 52%/year, 1986 to 2002 • RISC + x86: ??%/year, 2002 to present • Sea change in chip design: what is emerging?
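As a quick sanity check on what those growth rates mean, compound growth of (1 + rate) per year multiplies out dramatically over a decade or two; the tiny program below is just that arithmetic, not data from the figure.

```c
/* Compound annual growth: performance after n years at rate r
 * is (1 + r)^n. A quick comparison of the rates on the slide. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    printf("25%%/year over 8 years  (1978-1986): %.1fx\n", pow(1.25, 8));
    printf("52%%/year over 16 years (1986-2002): %.0fx\n", pow(1.52, 16));
    return 0;
}
```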
How did we get there? • First, what allowed the ridiculous 52% improvement per year to continue for around 20 years? • If cars had improved as much, we would have 1 million km/hr cars! • Is it just the number of transistors/clock rate? • No! It's also all the stuff that we've been learning about!
Walk down memory lane • What was the first processor organization we looked at? • Single cycle processors • How did multi-cycle processors improve those? • What did we do after that to improve performance? • Pipelining; why does that help? What are the limitations? • From there we discussed superscalar architectures • Out of order execution; multiple ALUs • This is basically state of the art in uniprocessors • What gave us problems there?
Detour: a couple of other design points • Very Large Instruction Word (VLIW) architectures: let the compiler do the work • Great for energy efficiency: the hardware no longer has to discover instruction-level parallelism at run time • Not binary compatible? The Transmeta Crusoe processor worked around this with dynamic binary translation
SIMD ISA Extensions—Parallelism from the Data? • Same instruction applied to multiple data elements at the same time • How can this help? (see the sketch below) • MMX (Intel) and 3DNow! (AMD) ISA extensions • Great for graphics; originally invented for scientific codes (vector processors) • Not a general solution • End of detour!
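A minimal sketch of the SIMD idea in C, using Intel's SSE intrinsics (a later extension in the same family as MMX); the function name and the assumption that n is a multiple of 4 are illustrative.

```c
/* One SIMD instruction adds four packed floats at once, versus four
 * scalar adds. Uses SSE intrinsics (the successor family to MMX). */
#include <xmmintrin.h>

/* Assumes n is a multiple of 4 and the arrays do not overlap. */
void add_arrays_simd(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);            /* load 4 floats from a */
        __m128 vb = _mm_loadu_ps(&b[i]);            /* load 4 floats from b */
        _mm_storeu_ps(&out[i], _mm_add_ps(va, vb)); /* 4 adds in one op   */
    }
}
```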
Back to Moore's law • Why are the "good times" over? • Three walls • "Instruction Level Parallelism" (ILP) Wall • Less and less parallelism available in programs (2 -> 4 -> 8 -> 16) • Tremendous increase in complexity to get more • Does VLIW help? • What can help? (see the example below) • Conclusion: standard architectures cannot continue to do their part in sustaining Moore's law
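A small illustrative C fragment (not from the lecture) of why the ILP well runs dry: the first loop is a serial dependence chain that extra issue width cannot speed up, while the second has fully independent iterations that out-of-order hardware can overlap.

```c
/* Illustrative only: how much ILP exists depends on data dependences,
 * not on how many ALUs the processor has. */

/* Almost no ILP: each iteration needs the previous result (a chain). */
double running_sum(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s = s + x[i];           /* the next add must wait for this one */
    return s;
}

/* Plenty of ILP: iterations are independent and can issue in parallel. */
void scale(const double *x, double *y, int n, double k)
{
    for (int i = 0; i < n; i++)
        y[i] = k * x[i];        /* no iteration depends on another */
}
```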
Wall 2: Memory Wall • (Figure: processor vs. DRAM performance, 1980-2000, log scale; µProc improves ~52%/yr (2X/1.5 yr), DRAM ~9%/yr (2X/10 yrs), so the processor-memory performance gap grows ~50% per year, "Moore's Law") • What did we do to help this? • Still very, very expensive to access memory • How do we see the impact in practice? (see the sketch below) • Very different from when I learned architecture!
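A small illustrative sketch (not from the lecture) of how the memory wall shows up in practice: the same amount of work is cheap when accesses stream through the caches and painful when every load is a dependent cache miss.

```c
/* Illustrative sketch: sequential access streams through the caches,
 * while a dependent pointer chase pays close to full memory latency
 * on every step. */

/* Sequential: the prefetcher and caches hide most of the latency. */
long sum_sequential(const long *a, long n)
{
    long s = 0;
    for (long i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Pointer chase: each load's address depends on the previous load,
 * so misses cannot be overlapped; this is the memory wall up close. */
long chase(const long *next, long start, long steps)
{
    long i = start;
    for (long k = 0; k < steps; k++)
        i = next[i];            /* serialized cache misses if next[]
                                   holds a random permutation */
    return i;
}
```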
Ways out? Multithreaded Processors • Can we switch to other threads if we need to access memory? • When do we need to access memory? • What support is needed? • Can I use it to help with the ILP wall as well?
Simultaneous Multithreaded (SMT) Processors • How do I switch between threads? • Hardware support for that • How does this help? (see the sketch below) • But, increased contention for everything (BW, TLB, caches…)
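A minimal pthreads sketch (illustrative, not from the lecture) of why hardware multithreading helps: each software thread spends much of its time stalled on memory, so a core that holds several thread contexts can issue instructions from one thread while another waits on a miss.

```c
/* Illustrative: several threads each do independent memory-bound work.
 * On a multithreaded (e.g., SMT) core, one thread's cache-miss stalls
 * can be filled with instructions from another thread context. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N (1 << 20)

static long data[NTHREADS][N];
static long result[NTHREADS];

static void *worker(void *arg)
{
    long id = (long)arg;
    long sum = 0;
    /* Strided accesses miss in the cache often; the resulting stalls
     * are exactly what another hardware thread context could fill. */
    for (long i = 0; i < N; i += 16)
        sum += data[id][i];
    result[id] = sum;
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    for (long i = 0; i < NTHREADS; i++)
        printf("thread %ld: sum = %ld\n", i, result[i]);
    return 0;
}
```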
Third Wall: Physics/Power Wall • We're down to the level of playing with a few atoms • More error-prone; lower yield • But also soft errors and wear-out • Logic that sometimes works! • Can we do something in architecture to recover?
So, what is our way out? Any ideas? • Maybe architecture becomes a commodity; this is the best we can do • This happens to a lot of technologies: why don't we have the million-km/hr car? • Do we actually need more processing power? • 8-bit embedded processors are good enough for calculators; 4-bit ones are probably good enough for elevators • Does it make sense to keep investing so much time and energy into this stuff? • Power Wall + Memory Wall + ILP Wall = Brick Wall
A lifeline? Multi-core architectures • How does this help? • Think of the three walls • The new Moore’s law: • the number of cores will double every 3 years! • Many-core architectures
Overcoming the three walls • ILP Wall? • Don't need to restrict myself to a single thread • Natural parallelism available across threads/programs (see the sketch below) • Memory wall? • Hmm, that is a tough one; on the surface, it seems like we made it worse • Maybe help is coming from industry • Physics/power wall? • Use less aggressive core technology • Simpler processors, shallower pipelines • But more processors • Throw-away cores to improve yield • Do you buy it?
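A minimal sketch (illustrative, not from the lecture) of the thread-level parallelism multi-core relies on: the work is split into independent chunks, one per core, instead of hoping the hardware finds more ILP inside a single instruction stream.

```c
/* Illustrative: thread-level parallelism for multi-core. Each thread
 * sums an independent chunk of the array; the cores work side by side. */
#include <pthread.h>
#include <stdio.h>

#define NCORES 4
#define N 1000000

static double a[N];

struct chunk { long lo, hi; double partial; };

static void *sum_chunk(void *arg)
{
    struct chunk *c = arg;
    double s = 0.0;
    for (long i = c->lo; i < c->hi; i++)
        s += a[i];
    c->partial = s;             /* no sharing: each thread writes its own */
    return NULL;
}

int main(void)
{
    for (long i = 0; i < N; i++)
        a[i] = 1.0;

    pthread_t t[NCORES];
    struct chunk c[NCORES];
    for (int k = 0; k < NCORES; k++) {
        c[k].lo = (long)N * k / NCORES;
        c[k].hi = (long)N * (k + 1) / NCORES;
        pthread_create(&t[k], NULL, sum_chunk, &c[k]);
    }
    double total = 0.0;
    for (int k = 0; k < NCORES; k++) {
        pthread_join(t[k], NULL);
        total += c[k].partial;
    }
    printf("total = %f\n", total);
    return 0;
}
```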
7 Questions for Parallelism • Applications: 1. What are the apps? 2. What are kernels of apps? • Hardware: 3. What are the HW building blocks? 4. How to connect them? • Programming Models: 5. How to describe apps and kernels? 6. How to program the HW? • Evaluation: 7. How to measure success? (Inspired by a view of the Golden Gate Bridge from Berkeley)
Sea Change in Chip Design • Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm2 chip • RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm2 chip • A 125 mm2 chip in 0.065 micron CMOS = 2312 copies of RISC II + FPU + Icache + Dcache • RISC II shrinks to 0.02 mm2 at 65 nm • The processor is the new transistor!
Architecture Design space • What should each core look like? • Should all cores look the same? • How should the chip interconnect between them look? • What level of the cache should they share? • And what are the implications of that? • Are there new security issues? • Side channel attacks; denial of service attacks • Many other questions… Brand new playground; exciting time to do architecture research
Hardware Building Blocks: Small is Beautiful • Given the difficulty of design/validation of large designs • Given power limits on what we can build, parallelism is an energy-efficient way to achieve performance • Lower threshold voltage means much lower power • Given that redundant processors can improve chip yield • Cisco Metro: 188 processors + 4 spares • Sun Niagara sells 6- or 8-processor versions • Expect modestly pipelined (5- to 9-stage) CPUs, FPUs, vector, SIMD PEs • One size fits all? • Amdahl's Law => a few fast cores + many small cores
Elephant in the room • We tried this parallel processing thing before • Very difficult • It failed, pretty much • A lot of academic progress and neat algorithms, but little commercial impact • We actually have to do new programming • A lot of effort to develop; error-prone; etc. • The La-Z-Boy programming era is over • Need new programming models • Amdahl's law (see the sketch below) • Applications: what will you use 1024 cores for? • These concerns are being voiced by a substantial segment of academia/industry • What do you think? • It's coming, no matter what
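To put a number on the Amdahl's-law worry, here is a small sketch (illustrative, not from the lecture) of the classic formula, speedup = 1 / ((1 - p) + p/n) for parallel fraction p on n cores: even highly parallel code leaves most of 1024 cores idle.

```c
/* Amdahl's law: with parallel fraction p of the work on n cores,
 * speedup = 1 / ((1 - p) + p / n). */
#include <stdio.h>

static double amdahl(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    const double cores = 1024.0;
    const double fractions[] = { 0.50, 0.90, 0.95, 0.99 };
    /* Even 95% parallel code tops out around 20x on 1024 cores;
     * at 99% parallel it is still under 100x. */
    for (int i = 0; i < 4; i++)
        printf("parallel fraction %.2f -> speedup %.1fx on %.0f cores\n",
               fractions[i], amdahl(fractions[i], cores), cores);
    return 0;
}
```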