Supercomputers Special Course of Computer Architecture H.Amano
Contents • What are supercomputers? • Architecture of Supercomputers • Representative supercomputers • Exa-Scale supercomputer project
Defining Supercomputers • High-performance computers mainly for scientific computation • Huge amounts of computation for biochemistry, physics, astronomy, meteorology, etc. • Very expensive: developed and operated with national funding • High-level techniques are required to develop and operate them • The USA, Japan, and China compete for the No. 1 supercomputer • Because a large amount of national funding is involved, supercomputers tend to make political news → in Japan, the supercomputer project became a target of the government's budget review in Dec. 2009 • K achieved 10 PFLOPS and became No. 1 last year, but Sequoia took the top spot back last month
FLOPS • Floating-point Operations Per Second • A floating-point number is (mantissa) × 2^(exponent) • Double precision: 64 bits; single precision: 32 bits • The IEEE 754 standard defines the format and rounding • Field widths: single = 1 sign bit + 8 exponent bits + 23 mantissa bits; double = 1 sign bit + 11 exponent bits + 52 mantissa bits
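To make the field layout concrete, here is a minimal C sketch (not from the original slides) that extracts the sign, exponent, and mantissa of a 64-bit double using the IEEE 754 widths given above; the example value -6.5 is an arbitrary choice.

```c
/* Minimal sketch: extracting the IEEE 754 fields of a double.
 * 64-bit layout: 1 sign bit, 11 exponent bits, 52 mantissa bits.
 * Value = (-1)^sign * 1.mantissa * 2^(exponent - 1023) for normal numbers. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    double x = -6.5;                      /* arbitrary example value */
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);       /* reinterpret the bit pattern safely */

    uint64_t sign     = bits >> 63;
    uint64_t exponent = (bits >> 52) & 0x7FF;        /* 11 bits, biased by 1023 */
    uint64_t mantissa = bits & 0xFFFFFFFFFFFFFULL;   /* low 52 bits */

    printf("sign=%llu exponent=%llu (unbiased %lld) mantissa=0x%llx\n",
           (unsigned long long)sign,
           (unsigned long long)exponent,
           (long long)exponent - 1023,
           (unsigned long long)mantissa);
    return 0;
}
```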
The range of performance • 10^6 = M (Mega, 1 million) • 10^9 = G (Giga, 1 billion) • 10^12 = T (Tera) • 10^15 = P (Peta) • 10^18 = E (Exa) • Supercomputers: 10 TFLOPS-16 PFLOPS • iPhone 4S: 140 MFLOPS • High-end PC: 50-80 GFLOPS • Powerful GPU: Tera-FLOPS class • Growth rate: 1.9× per year • 10 PFLOPS = 10^16 operations/s = 1 kei (京) in Japanese → the name "K" comes from it
How to select the No. 1? • Top500/Green500: performance of executing Linpack • Linpack is a kernel for matrix computation • Scale-free • Performance-centric • Gordon Bell Prize: Peak Performance, Price/Performance, Special Achievement • HPC Challenge • Global HPL: matrix computation → computation performance • Global RandomAccess: random memory access → communication performance • EP STREAM per system: heavy-load memory access → memory performance • Global FFT: a complicated problem requiring both memory and communication performance • Nov.: ACM/IEEE Supercomputing Conference — Top500, Gordon Bell Prize, HPC Challenge, Green500 • Jun.: International Supercomputing Conference — Top500, Green500
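For a feel of what a FLOPS-rated benchmark measures, here is an illustrative micro-benchmark in C. It is not Linpack itself, just a naive matrix multiply; but like Linpack it is a matrix kernel whose known operation count (about 2n³ floating-point operations) lets a GFLOPS rate be reported. The size N = 512 is an arbitrary choice.

```c
/* Illustrative sketch (not Linpack): measuring the FLOPS rate of a naive
 * matrix multiply. An n x n multiply performs about 2*n^3 floating-point
 * operations (one multiply and one add per inner step). */
#include <stdio.h>
#include <time.h>

#define N 512

static double A[N][N], B[N][N], C[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = 1.0; B[i][j] = 2.0; C[i][j] = 0.0;
        }

    clock_t t0 = clock();
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)      /* i-k-j order streams over rows */
            for (int j = 0; j < N; j++)
                C[i][j] += A[i][k] * B[k][j];
    double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

    double flops = 2.0 * N * N * N;
    printf("%.2f GFLOPS\n", flops / sec / 1e9);
    return 0;
}
```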
[Chart: Rmax (PFLOPS) of the Top 5 systems from June 2010 to Nov. 2011 — Sequoia (USA, 16 PFLOPS), K (Japan), Tianhe-1 (天河, China), Jaguar (USA), Nebulae (China), Roadrunner (USA), Tsubame (Japan), Kraken (USA), Jugene (Germany)]
Green500, Nov. 2011 • IBM Blue Gene/Q systems took places 1-5 • 10th place was Tsubame 2.0 (Tokyo Tech)
Why No. 1? • The Top500 rank is just a measure of matrix computation performance • The No. 1 of Green500, the Gordon Bell Prize winner, and the No. 1 of each HPC Challenge program are all valuable machines → TV and newspapers focus too much on the Top500 • However, the Top500 No. 1 machines have usually also won the Gordon Bell Prize and HPC Challenge No. 1 spots (e.g., K and Sequoia) • The impact of being No. 1 is great!
Why are supercomputers so fast? • × Not because they use a high clock frequency • Clock frequency grew about 40% per year (e.g., Alpha 21064: 150 MHz in 1992; Pentium 4: 3.2 GHz; Nehalem: 3.3 GHz), but the speed-up of the clock saturated around 2003 because of power and heat dissipation • K runs at 2 GHz and Sequoia at 1.6 GHz — lower clock frequencies than those of common high-end PCs [Chart: clock frequency vs. year, 1992-2008]
Major 3 methods of parallel processing in supercomputers • Supercomputer = massively parallel computer • SIMD (Single Instruction Stream, Multiple Data Streams): most accelerators • Pipelined processing: vector computers • MIMD (Multiple Instruction Streams, Multiple Data Streams): homogeneous (vs. accelerators), scalar (vs. vector machines) • Although every supercomputer uses all three methods at various levels, machines can be classified by which method dominates • Key issues other than the computational nodes: large high-bandwidth memory, large disks, and high-speed interconnection networks
SIMD (Single Instruction Stream, Multiple Data Streams) • All processing units execute the same instruction • Low degree of flexibility • Coarse grain: Illiac-IV, MMX instructions, ClearSpeed, IMAP, GP-GPU; fine grain: CM-2 [Diagram: one instruction memory broadcasts each instruction to many processing units, each with its own data memory]
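A minimal sketch of SIMD on a commodity CPU, in the spirit of the MMX-style instructions mentioned above: with x86 SSE intrinsics, the single instruction `_mm_add_ps` applies the same addition to four data elements at once.

```c
/* Minimal SIMD sketch using x86 SSE intrinsics: one instruction,
 * _mm_add_ps, adds four single-precision floats at once -- the same
 * operation applied to multiple data streams, as in the SIMD model. */
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);      /* load 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);   /* 4 additions in one instruction */
    _mm_storeu_ps(c, vc);

    printf("%.1f %.1f %.1f %.1f\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```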
GPGPU (General-Purpose computing on Graphics Processing Units) • TSUBAME 2.0 (Xeon + Tesla; Top500 Nov. 2010, 4th) • Tianhe-1 (天河一号) (Xeon + FireStream; Nov. 2009, 5th) • ※ Parenthesized items: the development environment
GeForce GTX280, 240 cores [Block diagram: Host → Input Assembler → Thread Execution Manager → arrays of Thread Processors, each group with per-block shared memory (PBSM); Load/Store path to Global Memory]
GPU (NVIDIA's GTX580) • 512 GPU cores (128 × 4) • 768 KB L2 cache • 40 nm CMOS, 550 mm² [Die diagram: four 128-core blocks around a shared L2 cache]
Cell Broadband Engine • Common platform for supercomputers and games (PS3, IBM Roadrunner) [Block diagram: one PPE (PXU with L1/L2 caches) and 8 SPEs (each an SXU with a local store (LS) and DMA engine), connected by 4 × 16B data rings at 1.6 GHz; MIC memory interface and BIF/IOIF0, IOIF1 I/O interfaces]
Peak performance vs. Linpack performance [Chart, PFLOPS: K (Japan, homogeneous), Tianhe-1 (天河, China, GPU), Nebulae (China, GPU), Tsubame (Japan, GPU), Jaguar (USA)] • The difference between peak and Linpack is large in machines with accelerators • The accelerator type is energy efficient
Pipelined processing [Diagram: stages 1-6 in sequence] • Each stage receives an input and sends its result every clock cycle • N stages ideally give N times the performance • Data dependencies cause RAW hazards and degrade performance • If a large array is processed, all the stages can work efficiently
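A short worked equation behind "N stages = N times performance": pushing n independent operations through a k-stage pipeline takes about k + (n - 1) cycles (k cycles to fill the pipeline, then one result per cycle), versus nk cycles unpipelined, so

$$\text{speedup} = \frac{nk}{k + (n-1)} \longrightarrow k \quad (n \to \infty),$$

which is why long arrays are needed to keep all the stages busy.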
Vector computers • Vector registers feed pipelined arithmetic units: elements a0, a1, a2, ... and b0, b1, b2, ... stream one per cycle through a pipelined multiplier computing X[i] = A[i] * B[i], and the products x0, x1, ... stream through a pipelined adder computing the reduction Y = Y + X[i] [Animation frames: successive elements advancing through the multiplier and adder pipelines] • The classic style of supercomputer since the Cray-1 • The Earth Simulator may be the last vector supercomputer
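The loops the vector pipeline above executes, written as a plain C sketch (sizes and values are arbitrary): the elementwise product has no loop-carried dependence and streams perfectly, while the reduction carries a dependence from one iteration to the next.

```c
/* Sketch of the loops a vector machine executes: an elementwise multiply
 * feeding a reduction. On a vector computer the products stream through
 * the pipelined multiplier one element per cycle; modern scalar compilers
 * auto-vectorize the same pattern with SIMD instructions. */
#include <stdio.h>

#define N 1024

int main(void) {
    static double a[N], b[N], x[N];
    double y = 0.0;

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0; }

    for (int i = 0; i < N; i++)
        x[i] = a[i] * b[i];   /* X[i] = A[i]*B[i]: independent, pipelines well */
    for (int i = 0; i < N; i++)
        y += x[i];            /* Y = Y + X[i]: loop-carried dependence */

    printf("y = %.1f\n", y);  /* 2 * (0+1+...+1023) = 1047552 */
    return 0;
}
```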
MIMD (Multiple Instruction Streams, Multiple Data Streams) • Multiple processors (cores) work independently • Requires a synchronization mechanism • Data communication through shared memory • All supercomputers are MIMD machines with multiple cores • However, K and Sequoia (Blue Gene/Q) are the typical massively parallel MIMD machines: homogeneous computers built from scalar processors
MIMD (Multiple Instruction Streams, Multiple Data Streams) [Diagram: Nodes 0-3 — processors which can work independently — connected through an interconnection network to a shared memory]
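As a minimal shared-memory MIMD sketch (not from the slides), the OpenMP fragment below runs one independent instruction stream per core over a shared array; the `reduction` clause supplies the synchronization mechanism mentioned above. Compile with `-fopenmp` (GCC) or the equivalent flag.

```c
/* Minimal shared-memory MIMD sketch with OpenMP: each thread (an
 * independent instruction stream) works on its own slice of a shared
 * array; the reduction clause synchronizes the partial sums. */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 1.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("threads=%d sum=%.0f\n", omp_get_max_threads(), sum);
    return 0;
}
```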
Multi-Core (Intel's Nehalem-EX) • 8 CPU cores • 24 MB L3 cache • 45 nm CMOS, 600 mm² [Die diagram: 8 CPU cores surrounding the shared L3 cache]
Intel 80-core chip [Vangal, ISSCC '07]
How to program them? • Can common PC programs be accelerated on supercomputers? • Yes, to a certain degree, by parallelizing compilers • However, to use many cores efficiently, specialists must optimize the programs • Message passing with MPI • OpenMP • OpenCL/CUDA → for the GPU accelerator type
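A minimal MPI sketch of the message-passing style listed above, assuming an MPI implementation such as Open MPI or MPICH is installed: each rank computes a partial sum and `MPI_Reduce` combines the results on rank 0.

```c
/* Minimal MPI sketch (message passing across distributed-memory nodes):
 * every rank sums its own slice of 1..1000 and MPI_Reduce combines the
 * partial sums on rank 0.
 * Build with mpicc; run with e.g. mpirun -np 4 ./a.out */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long local = 0;
    for (long i = rank + 1; i <= 1000; i += size)   /* strided slice per rank */
        local += i;

    long total = 0;
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %ld\n", total);   /* 500500 */
    MPI_Finalize();
    return 0;
}
```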
The fastest computer (Sequoia) • Also a simple NUMA organization [Photo from the IBM web site]
IBM's Blue Gene/Q • Successor of Blue Gene/L and Blue Gene/P • Sequoia consists of Blue Gene/Q nodes • Each chip provides 18 Power processor cores (16 computational, 1 control, and 1 redundant) plus the network interfaces • The on-chip interconnect is a crossbar switch • 5-dimensional mesh/torus network • 1.6 GHz clock
Japanese supercomputers • K supercomputer: a homogeneous, scalar-type massively parallel computer • Earth Simulator: a vector computer; the difference between peak and Linpack performance is small • TIT's Tsubame: uses a lot of GPUs; an energy-efficient supercomputer • Nagasaki University's DEGIMA: uses a lot of GPUs; a hand-made supercomputer with high cost-performance; won the Gordon Bell Prize in the cost-performance category • GRAPE projects: dedicated supercomputers for astronomy; SIMD; various versions won the Gordon Bell Prize
Supercomputer K [Diagram: a SPARC64 VIIIfx chip with 8 cores, L2 cache, and memory; the interconnect controller with an RDMA mechanism attaches to the Tofu interconnect, a 6-D torus/mesh] • NUMA, or UMA + NORMA • 4 nodes/board, 24 boards/rack, 96 nodes/rack
Tofu: the 6-dimensional torus
[Diagram: k-ary n-cubes — a 3-ary 1-cube (a ring of 3 nodes), a 3-ary 2-cube (a 3×3 torus), and a 3-ary 3-cube (a 3×3×3 three-dimensional mesh/torus); node labels 0-2 in each dimension]
[Diagram: a 4-dimensional mesh built from three 3-cubes labeled 0***, 1***, and 2***, connected along a fourth dimension]
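A small C sketch (illustrative, not from the slides) of how neighbors are found in a k-ary n-cube torus of the kind shown above: each of the n coordinates can move ±1 modulo k, giving 2n neighbors per node (12 for a 6-D torus like Tofu).

```c
/* Sketch: neighbors of a node in a k-ary n-cube (torus). Each node has
 * n coordinates in 0..k-1; its 2n neighbors differ by +/-1 (mod k) in
 * exactly one dimension. Here K=3, N=3: the 3-ary 3-cube. */
#include <stdio.h>

#define K 3   /* radix: nodes per dimension */
#define N 3   /* number of dimensions */

int main(void) {
    int node[N] = {0, 2, 1};   /* arbitrary example coordinates */

    for (int d = 0; d < N; d++) {
        int plus[N], minus[N];
        for (int i = 0; i < N; i++) { plus[i] = node[i]; minus[i] = node[i]; }
        plus[d]  = (node[d] + 1) % K;        /* wrap-around: the torus link */
        minus[d] = (node[d] + K - 1) % K;
        printf("dim %d: (%d,%d,%d) and (%d,%d,%d)\n", d,
               plus[0], plus[1], plus[2], minus[0], minus[1], minus[2]);
    }
    return 0;
}
```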
Why K could get the No. 1 spot • The delay of Blue Gene/Q (Sequoia) • The financial crisis in the USA • The withdrawal of NEC/Hitachi: at the start, a complex of a vector machine and a scalar machine was planned; after the withdrawal, the whole budget could be spent on the scalar machine • The budget review made the project famous • Enough funding was invested in a short period • The engineers at Fujitsu did a really good job
The Earth Simulator • Peak performance: 40 TFLOPS [Diagram: 640 nodes (Node 0 ... Node 639), each with 8 vector processors (0-7) sharing a 16 GB memory, connected by an interconnection network (16 GB/s × 2)]
TIT's Tsubame • A well-balanced supercomputer with GPUs
Nagasaki Univ’s DEGIMA
GRAPE-DR • Kei Hiraki, "GRAPE-DR", FPL 2007 (http://www.fpl.org)
Exa-scale computer • A Japanese national project for an exa-scale computer has started • Feasibility studies have started at U. Tokyo, Tsukuba Univ., Tohoku Univ., and Riken • It is becoming difficult to build supercomputers with Japanese original chips • In Japan, a vendor takes a loss on developing a supercomputer and may recover the development cost later by selling smaller systems • However, Japanese semiconductor companies will no longer be able to put up such large development budgets • If Intel's CPUs or NVIDIA's GPUs are used, a huge amount of national funding will flow to US companies • For exa-scale, about 70,000,000 cores are needed • The budget limitation is more severe than the technical limits
Amdahl's law • Serial part: 1%; parallel part: 99%, accelerated by parallel processing • With p cores, the normalized execution time is 0.01 + 0.99/p, so the speedup is 1/(0.01 + 0.99/p) • About 50 times with 100 cores, 91 times with 1000 cores • Even a small serial part limits the performance improvement
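The numbers on the slide follow directly from Amdahl's formula with serial fraction s = 0.01:

$$S(p) = \frac{1}{s + (1-s)/p}, \qquad S(100) = \frac{1}{0.01 + 0.0099} \approx 50, \qquad S(1000) = \frac{1}{0.01 + 0.00099} \approx 91.$$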