Case Study: Blue Gene P
• Successor of Blue Gene L
• Largest machine at Lawrence Livermore National Lab
• Available since end of 2008
Jaguar: A Cray XT4
• Installed at DOE Oak Ridge National Lab
• Position 5 on TOP500: http://www.top500.org/list/2008/06/100
• Peak 260 TFlops, Linpack 205 TFlops
• 7,832 quad-core 2.1 GHz AMD Opteron processors and 62 TB of memory (2 GB per core); see the back-of-the-envelope check below
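A quick sanity check on those numbers (assuming 4 double-precision FLOPs per core per cycle, i.e. 2 SSE adds plus 2 SSE multiplies, which is an assumption about these Opterons rather than something stated on the slide):

    peak   ≈ 7,832 sockets × 4 cores × 4 FLOP/cycle × 2.1 GHz ≈ 263 TFLOPS
    memory ≈ 7,832 sockets × 4 cores × 2 GB                   ≈ 62.7 TB

Both estimates are in line with the quoted 260 TFlops peak and 62 TB of memory.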
Cray XT4 Node
• 4-way SMP
• >35 Gflops per node
• Up to 8 GB per node
• OpenMP support within socket (see the sketch below)
(Node diagram: 2-8 GB memory (DDR 800) connected at 12.8 GB/s, 6.4 GB/s direct-connect HyperTransport, Cray SeaStar2+ interconnect with six 9.6 GB/s links)
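The "OpenMP support within socket" bullet means the four cores of the quad-core Opteron share one address space, so intra-node work can be threaded while MPI handles traffic between nodes. A minimal sketch; the problem size and explicit thread count are illustrative assumptions, not taken from the slide:

/* Minimal OpenMP sketch for a 4-core XT4 node: the four cores share
 * memory, so a loop can be split across them with a single pragma. */
#include <omp.h>
#include <stdio.h>

#define N 1000000                        /* hypothetical problem size */

int main(void)
{
    static double a[N], b[N];
    double sum = 0.0;

    omp_set_num_threads(4);              /* one thread per XT4 core */

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 0.5 * i;
        b[i] = 2.0 * i;
        sum += a[i] * b[i];              /* dot-product style work */
    }

    printf("threads=%d sum=%e\n", omp_get_max_threads(), sum);
    return 0;
}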
Cray XT5 Node
• 8-way SMP
• >70 Gflops per node
• Up to 32 GB of shared memory per node
• OpenMP support
(Node diagram: 2-32 GB memory connected at 25.6 GB/s, 6.4 GB/s direct-connect HyperTransport, Cray SeaStar2+ interconnect with six 9.6 GB/s links)
Cray XT6 Node
• 2 eight- or twelve-core AMD Opteron 6100 processors
• >200 Gflops per node
• Up to 64 GB of shared memory per node
• OpenMP support (hybrid MPI+OpenMP sketch below)
(Node diagram: memory connected at 83.5 GB/s, 6.4 GB/s direct-connect HyperTransport, Cray SeaStar2+ interconnect with six 9.6 GB/s links)
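On XT5/XT6-class nodes the common pattern is hybrid MPI+OpenMP: one MPI rank per socket (or per die) with OpenMP threads filling the remaining cores. A minimal sketch, assuming one rank per 12-core Opteron 6100 socket; the workload and placement are illustrative assumptions:

/* Hybrid MPI+OpenMP sketch for an XT6-style node: MPI between sockets
 * and nodes, OpenMP threads inside a socket. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, size;

    /* FUNNELED: only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = 0.0, global = 0.0;

    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; i++)    /* hypothetical workload */
        local += 1.0 / (1.0 + i + rank);

    /* Intra-node part ran in threads; the inter-node part stays in MPI. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d threads/rank=%d result=%f\n",
               size, omp_get_max_threads(), global);

    MPI_Finalize();
    return 0;
}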
XT6 Node Details: 24-core Magny-Cours
• 2 multi-chip modules, 4 Opteron dies
• 8 channels of DDR3 bandwidth to 8 DIMMs
• 24 (or 16) computational cores, 24 MB of L3 cache
• Dies are fully connected with HT3 (first-touch placement sketch below)
(Die diagram: four dies of six "Greyhound" cores, each die with 6 MB L3 cache and two DDR3 channels, fully connected by HT3 links; one HT1/HT3 link to the interconnect)
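Because each of the four dies has its own DDR3 channels, the node is NUMA: data should live in the memory attached to the die that uses it. A common trick is first-touch initialization, sketched below under the assumption that threads are pinned to cores (e.g. via aprun/OpenMP affinity settings, not shown here):

/* First-touch placement sketch for the 4-die Magny-Cours node: each
 * OpenMP thread initializes the pages it will later use, so the OS
 * places those pages in the memory attached to that thread's die. */
#include <omp.h>
#include <stdlib.h>

int main(void)
{
    size_t n = 1u << 24;                       /* hypothetical size */
    double *x = malloc(n * sizeof *x);

    #pragma omp parallel for schedule(static)  /* same schedule as later use */
    for (size_t i = 0; i < n; i++)
        x[i] = 0.0;                            /* first touch = page placement */

    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        x[i] = 2.0 * x[i] + 1.0;               /* now mostly local accesses */

    free(x);
    return 0;
}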
Cray SeaStar2+ Interconnect
• Cray XT6 systems ship with the SeaStar2+ interconnect
• Custom ASIC with integrated NIC/router
• MPI offload engine (overlap sketch below)
• Connectionless protocol
• Link-level reliability
• Proven scalability to 225,000 cores
(ASIC block diagram: HyperTransport interface, DMA engine, memory, PowerPC 440 processor, 6-port router, blade control processor interface)
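The MPI offload and DMA engines let the NIC progress messages while the Opterons compute; applications exploit this through non-blocking MPI with communication/computation overlap. A minimal sketch; the ring neighbors and buffer size are illustrative assumptions:

/* Overlap sketch: post non-blocking MPI calls, compute while the NIC's
 * offload/DMA engine moves the data, then wait for completion. */
#include <mpi.h>
#include <stdio.h>

#define N 4096

int main(int argc, char **argv)
{
    int rank, size;
    double sendbuf[N], recvbuf[N], work = 0.0;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < N; i++) sendbuf[i] = rank + 0.001 * i;

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    /* Post the exchange; the interconnect can progress it in hardware. */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Independent computation overlapped with the transfer. */
    for (int i = 0; i < N; i++) work += sendbuf[i] * sendbuf[i];

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    if (rank == 0) printf("work=%f recv[0]=%f\n", work, recvbuf[0]);
    MPI_Finalize();
    return 0;
}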
Cray Network Evolution
• SeaStar
  • Built for scalability to 250K+ cores
  • Very effective routing and low-contention switch
• Gemini
  • 100x improvement in message throughput
  • 3x improvement in latency
  • PGAS support, global address space (one-sided sketch below)
  • Scalability to 1M+ cores
• Aries
  • Ask me about it
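Gemini's PGAS support means one-sided remote reads and writes map directly onto the network, with no matching receive on the target. The sketch below uses OpenSHMEM as a stand-in for the PGAS models Gemini accelerates (Cray also supports UPC and Fortran coarrays); it is illustrative, not Cray-specific code:

/* One-sided PGAS-style sketch in OpenSHMEM: each PE writes a value
 * directly into its right neighbor's symmetric variable. */
#include <shmem.h>
#include <stdio.h>

long remote_val = -1;          /* symmetric: exists on every PE */

int main(void)
{
    shmem_init();
    int me    = shmem_my_pe();
    int npes  = shmem_n_pes();
    int right = (me + 1) % npes;

    long my_val = 100 + me;
    /* Put directly into the neighbor's memory: no receive call needed. */
    shmem_long_put(&remote_val, &my_val, 1, right);

    shmem_barrier_all();       /* make all puts visible */

    printf("PE %d received %ld from PE %d\n",
           me, remote_val, (me - 1 + npes) % npes);

    shmem_finalize();
    return 0;
}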
Scalability with Efficient Cooling
• Vertical cooling eliminates hot and cold aisles
• Graduated fin design delivers constant operating temperature
• Laminar flow unobstructed by board logic
(Airflow diagram: fin count graduated along the air flow - 42, 37, 24, and 18 fins)
System Layout
• Compute nodes
• Login nodes
• Network nodes
• Boot/database nodes
• I/O and metadata nodes
K-Computer
• Kobe, Japan
• RIKEN research center
• 10-petaflop computer
• Implemented by Fujitsu
Specification
• SPARC64 VIIIfx processors
• 256 FP registers
• SIMD extension with 2 SP or DP operations per instruction
• 80,000 nodes
• Entire system will be available in November 2012
• 640,000 cores
• Performance: 10 PFLOPS
• Memory: 1 PB (16 GB per node)
• Power consumption: 9.89 MW (~10,000 suburban homes), 824.6 GFLOPS/kW
K-Computer Network
• 6D torus network, called Tofu (torus fusion)
• It can be divided into arbitrary rectangular chunks, each with a 3D torus topology; this allows good utilization (no fragmentation) and good application performance
• It even provides a 3D torus in the presence of a faulty node
• 10 GB/s bidirectional links
• Entry bandwidth is 100 GB/s
• Flexible packet length from 32 to 2048 bytes to reduce protocol overhead
• Virtual cut-through routing, i.e. packets are buffered only if a link is blocked
• Hardware support for collectives: barriers and reductions (MPI sketch below)
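Application code reaches the Tofu barrier/reduction hardware through ordinary MPI collectives; the offload happens inside the MPI library and the network. A minimal sketch:

/* Collective sketch: a barrier and an allreduce, the two operations the
 * slide says Tofu can run in hardware. The code is plain MPI. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);              /* hardware-assisted barrier */

    double local = (double)rank, sum = 0.0;
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("sum of ranks = %f\n", sum);
    MPI_Finalize();
    return 0;
}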
European Exascale Projects (10^18 FLOP/s)
• Projects towards exascale
• DEEP
  • Started in December 2011, Research Centre Jülich
  • Collaboration with Intel, ParTec, and 8 other partners
  • Xeon cluster with a 512 Knights Corner booster
  • Knights Corner is the first processor in Intel's MIC series
• Mont-Blanc
  • Started in October 2011, Barcelona Supercomputing Center
  • Collaboration with ARM, Bull, and 6 other partners
• CRESTA
  • Led by EPCC (Edinburgh Parallel Computing Centre)
  • Collaboration with Cray
  • Focusing on applications and system software
Summary
• Scalable machines are distributed memory architectures
• Examples:
  • Blue Gene P
  • Roadrunner
  • Cray XT4/5/6
  • K-Computer