Introduction. Spring 2005 Seungryoul Maeng. Course Introduction. Instructor: Seungryoul Maeng maeng@cs.kaist.ac.kr 042-869-3519 (office) Web: http://camars.kaist.ac.kr/~maeng Office hours: Mon 1:00-2:30, Wed 1:00-2:30 Office: Computer Science Building 4403
Introduction Spring 2005 Seungryoul Maeng
Course Introduction • Instructor: Seungryoul Maeng maeng@cs.kaist.ac.kr 042-869-3519 (office) • Web: http://camars.kaist.ac.kr/~maeng • Office hours: Mon 1:00-2:30, Wed 1:00-2:30 • Office: Computer Science Building 4403 • Teaching Assistants • Min Choi mchoi@camars.kaist.ac.kr • WWW address • http://calab.kaist.ac.kr/~maeng/cs610/index05.html
Objective of this course • In-depth understanding of the design and engineering of modern parallel computers • technology forces • experience of parallel programming • fundamental architectural issues • naming, replication, communication, synchronization • basic design techniques • cache coherence, protocols, networks, pipelining, … • methods of evaluation • underlying engineering trade-offs • from moderate to very large scale
Introduction • What is Parallel Architecture? • Why Parallel Architecture? • Read chapter 1 • Evolution and Convergence of Parallel Architectures • Fundamental Design Issues
What is Parallel Architecture? • Where is the parallelism in the computer system?
What is Parallel Architecture? • A parallel computer is a collection of processing elements that cooperate to solve large problems fast (Almasi and Gottlieb 1989) • Some broad issues: • Resource Allocation: • how large a collection? • how powerful are the elements? • how much memory? • Data access, Communication and Synchronization • how do the elements cooperate and communicate? • how are data transmitted between processors? • what are the abstractions and primitives for cooperation? • Performance and Scalability • how does it all translate into performance? • how does it scale?
Role of a computer architect • Maximum performance and programmability within limits of technology and cost
Is Parallel Processing Dead? (by Hank Dietz, 1996) • Thinking Machines Corporation • Integer-only computation using Lisp – CM1 • CM2, CM5 – addition of floating-point hardware • Rather expensive machines? • Multiflow • VLIW design • Smart compiler and dumb machines • Speed-up using fine-grain parallelism, no vector computing • Did not use the newest HW technology; marketing mistakes • Portions of their compiler technology went to Intel and HP
Is Parallel Processing Dead? – cont’d • Myrias • Canadian company • Shared-memory programming model implemented by page-fault mechanisms using conventional message-passing hardware • This technology is now becoming important • A lot of unpleasant performance surprises • Kendall Square Research • Bright architectural idea – custom cache coherence hardware • Custom processors vs. commodity microprocessors • Little cardboard models – cute, but didn’t really inspire confidence
Is Parallel Processing Dead? – cont’d • Cray • Cray Computer – died • Vector and shared memory high-end computing • Cray Research, Inc. – subsidiary of SGI • Attempt to branch into lower-end machines • Not optimized for that kind of market • nCUBE • Custom VLSI processors and hypercube interconnection • Teamed up with Oracle • Multimedia server • Larger markets, less dependent on floating-point speed
Is Parallel Processing Dead? – cont’d • MasPar • SIMD processing elements in a custom VLSI • Didn’t give the system the peak “macho MFLOPS” to capture the interest of many people • Canceled the MP-3 • NeoVista – data mining software company • DEC and HP – Compaq
Several lessons to be learned from …. • Parallel processing companies may have died, but their ideas largely prospered • Need for research on large-scale parallel processing systems • Too much custom “stuff” makes the product too late to market • Parallel commercial computing is a larger, more stable market than scientific computing • Parallel processing is NOT DEAD
Is Parallel Computing Inevitable? • Application demands • Technology Trends • Architecture Trends • Economics • Current trends: • Today’s microprocessors have multiprocessor support • Servers and workstations becoming MP: Sun, SGI, DEC, COMPAQ!... • Cluster computing, GRID computing • Tomorrow’s microprocessors are multiprocessors
www.top500.org Cray X1
Application Trends • Application demand for performance fuels advances in hardware, which enables new applications, which... • Range of performance demands • Need range of system performance with progressively increasing cost • [Figure: cycle between “New Applications” and “More Performance”]
Learning Curve for Parallel Applications • AMBER molecular dynamics simulation program on Intel Paragon • Starting point was vector code for Cray-1 • 145 MFLOPS on Cray90, 406 MFLOPS for final version on 128-processor Paragon • 891 MFLOPS on 128-processor Cray T3D
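The MFLOPS figures above can be compared with a few lines of arithmetic. The sketch below is only illustrative: the helper names are hypothetical, and the ratios compare sustained rates on different machines, so they are not speedups in the strict sense. The slide does not say how many Cray90 processors were used, so no per-processor figure is computed for it.

```python
# Sustained MFLOPS quoted on the slide for the AMBER code.
CRAY90_MFLOPS = 145.0
PARAGON_MFLOPS = 406.0   # final version, 128-processor Paragon
T3D_MFLOPS = 891.0       # 128-processor Cray T3D
PROCESSORS = 128

def relative_rate(rate, baseline):
    """Ratio of two sustained rates (hypothetical helper)."""
    return rate / baseline

def per_processor(rate, processors):
    """Average sustained rate contributed by each processor."""
    return rate / processors

for name, rate in (("Paragon", PARAGON_MFLOPS), ("T3D", T3D_MFLOPS)):
    print(f"{name}: {relative_rate(rate, CRAY90_MFLOPS):.2f}x the Cray90 rate, "
          f"{per_processor(rate, PROCESSORS):.2f} MFLOPS per processor")
```

The per-processor numbers are small next to the aggregate, which is the point of the slide: the gain comes from using many modest processors well.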
Technology Trends • The natural building block for multiprocessors is now also about the fastest!
General Technology Trends • Microprocessor performance increases 50% - 100% per year • Transistor count doubles every 3 years • DRAM size quadruples every 3 years • Huge investment per generation is carried by huge commodity market • Not that single-processor performance is plateauing, but that parallelism is a natural way to improve it. • [Figure: integer and FP performance of microprocessors, 1987-1992, scale 0-180; machines shown: Sun 4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, DEC Alpha]
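To put the "every 3 years" trends on the same footing as the yearly performance figure, they can be converted into equivalent annual rates. This is a minimal sketch assuming smooth exponential growth; the function name is hypothetical.

```python
def annual_rate(factor, years):
    """Constant yearly growth rate that yields `factor` after `years` years."""
    return factor ** (1.0 / years) - 1.0

transistor_rate = annual_rate(2, 3)  # "doubles every 3 years"
dram_rate = annual_rate(4, 3)        # "quadruples every 3 years"

print(f"transistor count: {transistor_rate:.1%} per year")
print(f"DRAM size:        {dram_rate:.1%} per year")
```

Under that assumption, doubling every 3 years is about 26% per year and quadrupling every 3 years is about 59% per year, which can be set beside the 50% - 100% per year quoted for microprocessor performance.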
Technology: A Closer Look • Basic advance is decreasing feature size (λ) • Circuits become either faster or lower in power • Die size is growing too • Clock rate improves roughly proportional to improvement in λ • Number of transistors improves like λ² (or faster) • Performance > 100x per decade; clock rate 10x, rest transistor count • How to use more transistors? • Parallelism in processing • multiple operations per cycle reduces CPI • Locality in data access • avoids latency and reduces CPI • also improves processor utilization • Both need resources, so tradeoff • Fundamental issue is resource distribution, as in uniprocessors • [Figure: Proc, $ (cache), Interconnect]
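The scaling rule on this slide can be sketched numerically. The code assumes the feature size is written as lambda, that clock rate scales linearly with the shrink factor, and that transistor count scales with its square on a fixed die area; the function name and the 1.0 and 0.5 micron values are hypothetical.

```python
def scaling_gains(old_feature, new_feature):
    """Idealised gains from shrinking the feature size (fixed die area assumed)."""
    shrink = old_feature / new_feature   # > 1 means smaller features
    return {"clock": shrink, "transistors": shrink ** 2}

# Hypothetical shrink from 1.0 micron to 0.5 micron.
gains = scaling_gains(1.0, 0.5)
print(gains)

# Splitting the slide's per-decade figure: > 100x performance, 10x from clock.
performance_gain = 100.0
clock_gain = 10.0
from_transistors = performance_gain / clock_gain
print(f"left to transistor count: {from_transistors:.0f}x")
```

The second half simply divides out the clock contribution, leaving roughly a factor of ten that has to come from how the extra transistors are used (parallelism and locality).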
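NMOS Inverter: NMOS Transistor (NMOS FET) • [Figure labels: P-type silicon substrate; SiO2 oxide layer (about 0.6 micron); gate oxide (about 0.05 micron); polysilicon gate (deposited by Low Pressure Chemical Vapor Deposition); As ion implantation forms the n+ source and drain regions]
NMOS Inverter • [Figure labels: oxide growth; contact regions etched, then aluminum deposited and patterned; n+ regions; dimensions 2λ and λ] • Length unit: λ (micron)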
Clock Frequency Growth Rate • 30% per year • Data points from the figure: Intel Pentium III 500 MHz; Intel P4 2.2 GHz (2002); Intel Xeon 3.2 GHz (2004); Alpha 21264 600 MHz; Alpha 21364 1.2 GHz; UltraSparc II 480 MHz, IV 1 GHz (2001); UltraSparc IV 1.2 GHz (2003)
Transistor Count Growth Rate • 100 million transistors on chip by early 2000s • Transistor count grows much faster than clock rate • 40% per year, order of magnitude more contribution in 2 decades • Data points from the figure: Alpha 21264 15.2M; Alpha 21364 100M (8M + 92M); UltraSparc II 5.4M; Pentium 4 55M (2002); Itanium 2 (1.5 GHz) 221M (2003); 1 billion transistors in 2010
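The two growth rates quoted on these slides (30% per year for clock frequency, 40% per year for transistor count) can be compounded over two decades to see how far they diverge. This is a rough sketch that assumes both rates hold exactly and constantly, which real products do not; the names are hypothetical.

```python
def compound(rate, years):
    """Total growth factor after `years` years at a constant yearly `rate`."""
    return (1.0 + rate) ** years

YEARS = 20
clock_gain = compound(0.30, YEARS)        # clock frequency, 30% per year
transistor_gain = compound(0.40, YEARS)   # transistor count, 40% per year
ratio = transistor_gain / clock_gain

print(f"clock:       {clock_gain:,.0f}x")
print(f"transistors: {transistor_gain:,.0f}x")
print(f"ratio:       {ratio:.1f}x")
```

Because the ratio itself grows like (1.4 / 1.3) raised to the number of years, the gap between the two curves keeps widening, and the exact figure is sensitive to the rates chosen.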
Similar Story for Storage • Divergence between memory capacity and speed more pronounced • Capacity increased by 1000x from 1980-95, speed only 2x • Gap with processor speed much greater • Larger memories are slower, while processors get faster • Need to transfer more data in parallel • Need deeper cache hierarchies • How to organize caches? • Parallelism increases effective size of each level of hierarchy, without increasing access time • Parallelism and locality within memory systems too • New designs fetch many bits within memory chip; follow with fast pipelined transfer across narrower interface • Buffer caches most recently accessed data • Disks too: Parallel disks plus caching
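One common way to see why deeper cache hierarchies help is the average memory access time model, where each level adds its hit time plus its miss rate times the cost of going one level further out. The sketch below uses that textbook model as an illustration; the hit times, miss rates and memory latency are hypothetical numbers, not measurements from the course.

```python
def average_access_time(levels, memory_latency):
    """Average access time for a cache hierarchy.

    levels: list of (hit_time, miss_rate) pairs, ordered from L1 outward.
    memory_latency: cost of a miss in the last cache level.
    """
    total = memory_latency
    # Work inward: each level sees the average cost of the levels behind it.
    for hit_time, miss_rate in reversed(levels):
        total = hit_time + miss_rate * total
    return total

MEMORY = 100.0                       # hypothetical cycles to main memory
one_level = [(1.0, 0.05)]            # hypothetical L1 only
two_level = [(1.0, 0.05), (10.0, 0.2)]  # hypothetical L1 + L2

print(average_access_time(one_level, MEMORY))
print(average_access_time(two_level, MEMORY))
```

With these made-up numbers the extra level cuts the average from 6 cycles to 2.5 cycles, without making the first level any larger or slower.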
Architectural Trends • Architecture translates technology’s gifts to performance and capability • Resolves the tradeoff between parallelism and locality • Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect • Tradeoffs may change with scale and technology advances • Understanding microprocessor architectural trends • Helps build intuition about design issues of parallel machines • Shows fundamental role of parallelism even in “sequential” computers • Four generations of architectural history: tube, transistor, IC, VLSI • Here focus only on VLSI generation
Architectural Trends • Greatest trend in VLSI generation is increase in parallelism • Up to 1985: bit level parallelism: 4-bit -> 8-bit -> 16-bit • slows after 32-bit • adoption of 64-bit now under way, 128-bit far (not performance issue) • great inflection point when 32-bit micro and cache fit on a chip • Mid 80s to mid 90s: instruction level parallelism • pipelining and simple instruction sets, + compiler advances (RISC) • on-chip caches and functional units => superscalar execution • greater sophistication: out of order execution, speculation, prediction • to deal with control transfer and latency problems • Next step: thread level parallelism
Phases in VLSI Generation • How good is instruction-level parallelism? • Thread-level needed in microprocessors?
Architectural Trends: ILP • Reported speedups for superscalar processors • Horst, Harris, and Jardine [1990]: 1.37 • Wang and Wu [1988]: 1.70 • Smith, Johnson, and Horowitz [1989]: 2.30 • Murakami et al. [1989]: 2.55 • Chang et al. [1991]: 2.90 • Jouppi and Wall [1989]: 3.20 • Lee, Kwok, and Briggs [1991]: 3.50 • Wall [1991]: 5 • Melvin and Patt [1991]: 8 • Butler et al. [1991]: 17+ • Large variance due to difference in • application domain investigated (numerical versus non-numerical) • capabilities of processor modeled
ILP Ideal Potential • Infinite resources and fetch bandwidth, perfect branch prediction and renaming • real caches and non-zero miss latencies
Results of ILP Studies • Concentrate on parallelism for 4-issue machines • Realistic studies show only 2-fold speedup • Recent studies show that more ILP needs to look across threads • “Billion-Transistor Architectures” IEEE Computer, September 1997
Thread-Level Parallelism “on board” • Micro on a chip makes it natural to connect many to shared memory • dominates server and enterprise market, moving down to desktop • Faster processors began to saturate bus, then bus technology advanced • today, range of sizes for bus-based systems, desktop to large servers • [Figure: four processors (Proc) sharing one memory (MEM)]
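The bus-saturation point mentioned above can be estimated with a back-of-the-envelope calculation: the shared bus supports about as many processors as its bandwidth divided by the traffic each processor generates. The sketch below is a simplification that ignores contention and arbitration overhead, and the bandwidth and demand values are hypothetical.

```python
def max_processors(bus_bandwidth_mb_s, demand_per_processor_mb_s):
    """Largest processor count whose combined traffic fits on the shared bus."""
    return int(bus_bandwidth_mb_s // demand_per_processor_mb_s)

BUS_MB_S = 1200.0      # hypothetical shared-bus bandwidth
DEMAND_MB_S = 80.0     # hypothetical bus traffic per processor

baseline = max_processors(BUS_MB_S, DEMAND_MB_S)
# A processor twice as fast is assumed to generate twice the bus traffic.
faster_cpu = max_processors(BUS_MB_S, 2 * DEMAND_MB_S)

print(f"processors supported: {baseline}")
print(f"with 2x faster processors: {faster_cpu}")
```

Under these assumptions, doubling processor speed halves the number of processors the same bus can feed, which is why bus technology had to advance along with the processors.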
Architectural Trends: Bus-based MPs • [Figure: number of processors in bus-based multiprocessors by year, 1984-1998, scale 0-70; machines shown range from SGI PowerSeries and Sequent B8000 to Sun E6000, Cray CS6400 and Sun E10000]
Bus Bandwidth • [Figure: shared bus bandwidth (MB/s) by year, 1984-1998, log scale from 10 to 100,000; machines shown range from Sequent B8000 to Sun E10000]
Interconnection Networks • Gigabit Ethernet • Myrinet : 1.2 Gbps • InfiniBand • 250 MB/sec to 3GB/sec for unidirectional bandwidth • 500 MB/sec to 6GB/sec for bi-directional bandwidth • What is the difference between the bus and the networks?
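The InfiniBand range quoted above matches what one gets from the original single-data-rate links: 2.5 Gbit/s of signalling per lane, 8b/10b encoding, and link widths of 1, 4 and 12 lanes. The sketch below works that out; attributing the slide's numbers to these link widths is an assumption, and the constant and function names are hypothetical.

```python
LANE_SIGNAL_MBIT_S = 2500   # SDR InfiniBand signalling rate per lane
DATA_BITS, CODED_BITS = 8, 10   # 8b/10b encoding: 8 data bits per 10 line bits

def unidirectional_mb_s(lanes):
    """Usable one-way bandwidth in MB/s for a link of the given width."""
    data_mbit_s = lanes * LANE_SIGNAL_MBIT_S * DATA_BITS // CODED_BITS
    return data_mbit_s // 8   # bits to bytes

for lanes in (1, 4, 12):
    one_way = unidirectional_mb_s(lanes)
    print(f"{lanes:>2}x link: {one_way} MB/s one way, {2 * one_way} MB/s both ways")
```

The 1x and 12x results line up with the 250 MB/sec and 3 GB/sec unidirectional endpoints on the slide, and doubling them gives the bi-directional range.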
Economics • Commodity microprocessors not only fast but cheap • Development cost is tens of millions of dollars (5-100 typical) • BUT, many more are sold compared to supercomputers • Crucial to take advantage of the investment, and use the commodity building block • Exotic parallel architectures no more than special-purpose • Multiprocessors being pushed by software vendors (e.g. database) as well as hardware vendors • Standardization by Intel makes small, bus-based SMPs commodity • Desktop: few smaller processors versus one larger one? • Multiprocessor on a chip • Cluster Computing
Consider Scientific Supercomputing • Proving ground and driver for innovative architecture and techniques • Market smaller relative to commercial as MPs become mainstream • Dominated by vector machines starting in 70s • Microprocessors have made huge gains in floating-point performance • high clock rates • pipelined floating point units (e.g., multiply-add every cycle) • instruction-level parallelism • effective use of caches (e.g., automatic blocking) • Plus economics • Large-scale multiprocessors replace vector supercomputers • Well under way already • Top-500 Supercomputers
Raw Uniprocessor Performance: LINPACK • [Figure: LINPACK (MFLOPS) by year, 1975-2000, log scale from 1 to 10,000; four series: CRAY n = 1,000, CRAY n = 100, Micro n = 1,000, Micro n = 100; Cray machines from CRAY 1s to T94, microprocessors from Sun 4/260 to DEC 8200]
Raw Parallel Performance: LINPACK • Even vector Crays became parallel: X-MP (2-4), Y-MP (8), C-90 (16), T932 (32) • Since 1993, Cray has produced MPPs too (T3D, T3E) • [Figure: LINPACK (GFLOPS) by year, 1985-1996, log scale from 0.1 to 10,000; two series: MPP peak and CRAY peak; MPPs from iPSC/860 and nCUBE/2 (1024) to Paragon XP/S MP (6768) and ASCI Red, Crays from Xmp/416 (4) to T932 (32)]
500 Fastest Computers • [Figure: number of systems in each class (MPP, PVP, SMP) on the list, 11/93 through 11/96, scale 0-350]
Summary: Why Parallel Architecture? • Increasingly attractive • Economics, technology, architecture, application demand • Increasingly central and mainstream • Parallelism exploited at many levels • Instruction-level parallelism • Multiprocessor servers • Large-scale multiprocessors (“MPPs”) • Focus of this class: multiprocessor level of parallelism • Same story from memory system perspective • Increase bandwidth, reduce average latency with many local memories • Wide range of parallel architectures make sense • Different cost, performance and scalability