470 likes | 634 Views
HPC Achievement and Impact – 2012 a personal perspective. Thomas Sterling Professor of Informatics and Computing Chief Scientist and Associate Director Center for Research in Extreme Scale Technologies (CREST) Pervasive Technology Institute at Indiana University
E N D
HPC Achievement and Impact – 2012 a personal perspective Thomas Sterling Professor of Informatics and Computing Chief Scientist and Associate Director Center for Research in Extreme Scale Technologies (CREST) Pervasive Technology Institute at Indiana University Fellow, Sandia National Laboratories CSRI June 20, 2012
A continuing tradition at ISC (9thyear, and still going at it) As always, a personal perspective – how I’ve seen it Highlights – the big picture But not all the nitty details, sorry Necessarily biased, but not intentionally so Iron-oriented, but software too Trends and implications for the future And a continuing predictor: The Canonical HEC Computer HPC Year in Review • Previous Years’ Themes: • 2004: “Constructive Continuity” • 2005: “High Density Computing” • 2006: “Multicore to Petaflops” • 2007: “Multicore: the Next Moore’s Law” • 2008: “Run-up to Petaflops” • 2009: “Year 1 A. P. (After Petaflops)“ • 2010: “Igniting Exaflops” • 2011: “Petaflops: the New Norm” • This Year’s Theme: ?
International across the high end Reversal of power growth – green screams! GPU increases in HPC community Everybody is doing software (apps or systems) for them But … resurgence of multi-core IBM BG Intel MIC (oops I mean Xeon Phi) Exaflops takes hold of international planning Big data moves to a driving command Commodity Linux clusters still the norm among the common computer Trends in Highlight - 2012
A continuing tradition at ISC (9thyear, and still going at it) As always, a personal perspective – how I’ve seen it Highlights – the big picture But not all the nitty details, sorry Necessarily biased, but not intentionally so Iron-oriented, but software too Trends and implications for the future And a continuing predictor: The Canonical HEC Computer HPC Year in Review • Previous Years’ Themes: • 2004: “Constructive Continuity” • 2005: “High Density Computing” • 2006: “Multicore to Petaflops” • 2007: “Multicore: the Next Moore’s Law” • 2008: “Run-up to Petaflops” • 2009: “Year 1 A. P. (After Petaflops)“ • 2010: “Igniting Exaflops” • 2011: “Petaflops: the New Norm” • This Year’s Theme: • 2012: “The Million Core Computer”
Processors: Intel • Intel’s 22nm processors: Ivy Bridge die shrink, Sandy Bridge microarchitecture, based on tri-gate (3D) transistors. • Die size of 160 mm2 with 1.4 billion transistors • Intel Xeon Processor E7-8870: • 10 cores, 20 threads • 112 GFLOPS max turbo single core • 2.4 GHz • 30 MB cache • 130 W max TDP • 32 nm • Supports up to 4 TB of DDR3 memory • Next generation 14nm Intel processors are planned for 2013 http://download.intel.com/support/processors/xeon/sb/xeon_E7-8800.pdf http://download.intel.com/newsroom/kits/22nm/pdfs/22nm-Details_Presentation.pdf http://www.intel.com/content/www/us/en/processor-comparison/processor-specifications.html?proc=53580
Processors: AMD • AMD’s 32nm processors: Trinity • Die size of 246 mm2 with 1.3 billion transistors • AMD Opteron 6200 Series: • 16 cores • 2.7 GHz • 32 MB cache • 140 W max TDP • 32 nm • 239.1 GFLOPS 2 x AMD Opteron 6276 processors Linpack benchmark(2P) http://www.amd.com/us/Documents/Opteron_6000_QRG.pdf http://www.anandtech.com/show/5831/amd-trinity-review-a10-4600m-a-new-hope
Processors: IBM • IBM Power7 • Released in 2010 • The main processor in the PERCS • Power7 Processor • 45 nm SOI Process, 567 mm2, 1.2 billion transistors • 3.0-4.25 GHz clock speed, 4 chips per quad module, • 4,6,8 cores per chip, 64 kB L1, 256 kB L2, 32 MB L3 cache / core • Max 33.12 GFLOPS per core, 264.96 GFLOPS per chip • IBM Power8 • Successor to Power7 currently under development. Anticipated imporved SMT, reliability, larger cache size, more cores. Will be built on 22nm process. http://www-03.ibm.com/systems/power/hardware/710/specs.html http://arstechnica.com/gadgets/2009/09/ibms-8-core-power7-twice-the-muscle-half-the-transistors/
Processors: Fujitsu SPARC IXfx • Introduced in November 2011 • Used in the PRIMEHPC FX10 (an HPC system that is a follow up to Fujitsu’s K computer) • Architecture: • SPARC v9 ISA extended for HPC with increased amounts of registers • 16 cores • 236.5 GFLOPS peak performance • 12 MB shared L2 cache • 1.85 GHz • 115 W TDP http://img.jp.fujitsu.com/downloads/jp/jhpc/primehpc/primehpc-fx10-catalog-en.pdf
Many Integrated Core (MIC) Architecture – Xeon Phi • Multiprocessor architecture • Prototype products (named Knights Ferry) were announced and released in 2010 to European Organization for Nuclear Research, Korea Institute of Science and Technology Information, and Leibniz Supercomputing Center among others • Knights Ferry prototype: • 45 nm process • 32 in order cores with up to 1.2 GHz with 4 threads per core • 2 GB GDDR5 memory, 8MB L2 cache • Single board performance exceeds 750 GFLOPS • Commercial release (codenamed Knights Corner)to be built on a 22nm process is proposed for release in 2012-2013 • Texas Advanced Computing Center (TACC) will use Knights Corner cards in 10 PetaFLOPS “Stampede” supercomputer http://www.intel.com/content/www/us/en/architecture-and-technology/many-integrated-core/intel-many-integrated-core-architecture.html http://inside.hlrs.de/htm/Edition_02_10/article_14.html
Processors: ShenWei SW1600 • Third generation CPU by Jiāngnán Computing Research Lab • Specs: • Frequency -- 1.1 GHz • 140 GFLOPS • 16 core RISC architecture • Up to 8TB virtual memory and 1TB physical memory supported • L1 cache: 8KB, L2 cache: 96KB • 128-bit system bus http://laotsao.wordpress.com/2011/10/29/sunway-bluelight-mpp-%E7%A5%9E%E5%A8%81%E8%93%9D%E5%85%89/
Processors: ARM • Industry leading provider of 32-bit embedded microprocessors • Wide portfolio of more than 20 processors is available • Applications processors are capable of executing a wide range of OSs: Linux, Android/Chrome, Microsoft Windows CE, Symbian and others • NVIDIA: The CUDA on ARM Development Kit • A high performance energy efficient kit featuring • NVIDIA Tegra 3 Quad-Core ARM A9 CPU • CPU Memory: 2 GB • 270 GFLOPS Single Precision Performance
GPUs: NVIDIA • Current: Kepler Architecture – GK110 • 7.1 billion transistors • 192 CUDA cores • 32 Special Function Units • 32 Load/Store Units • More than 1 TFLOPS • Upcoming: Maxwell Architecture – 2013 • ARM instruction set • 20nm Fab process ??? (expected) • 14-16 GFLOPS in double precision per watt • HPC systems with NVIDIA GPUs: • Tianhe-1 (2nd on TOP 500): 7,168 Nvidia Tesla M2050 general purpose GPUs • Titan (currently 3rd on TOP500): 960 NVIDIA GPUs • Nebulae (4th on TOP500): 4640 NVIDIA Tesla C2050 GPUs • Tsubame 2.0 (5th on TOP500): 4224 NVIDIA Tesla M2050; 34 NVIDIA Tesla S1070; http://www.xbitlabs.com/news/cpu/display/ 20110119204601_Nvidia_Maxwell_Graphics_Processors_to_Have_Integrated_ARM_General_Purpose_Cores.html
GPUs: AMD • Current: AMD Radeon HD 7970 (Southern Islands HD7xxx series) • Released in January 2012 • 28 nm Fab process • 352 mm2 die size with 4.3 billion transistors • Up to 925 MHz engine clock • 947 GFLOPS double precision compute power • 230 W TDP • Latest AMD architecture – Graphic Core Next (GCN): • 28 nm GPU architecture • Designed both for graphics and general computing • 32 compute nodes (1,048 stream processors) • Handles workloads of the processor
GPU Programming Models • Current: Cuda 4.1 • Share GPUs across multiple threads • Unified Virtual Addressing • Use all GPUs from a single host thread • Peer-to-Peer communication • Coming in Cuda 5 • Direct communication between GPUs and other PCI devices • Easily acceleratable parallel nested loops starting with Tesla K20 Kepler GPU • Current: OpenCL 1.2 • Open royalty-free standard for cross-platform parallel computing • Latest version released in November 2011 • Host-thread safety, enabling OpenCL commands to be enqued from multiple host threads • Improved OpenGL interoperability by linking OpenCL event objects to OpenGL • OpenACC • Programming standard developed by Cray, NVIDIA, CAPS and PGI • Designed to simplify parallel programming of heterogeneous CPU/GPU systems • The programming is done through some pragmas and API functions • Planned supported compilers – Cray, PGI and CAPS http://developer.nvidia.com/cuda-toolkit
IBM Blue Gene/Q • IBM PowerPC A2 @ 1.6 GHz CPU • 16 cores per node • 16 GB SDRAM-DDR3 per node • 90% water cooling 10% air cooling • 209 TFLOPS peak performance • 80 KW per rack power consumption • HPC Systems powered by Blue Gene/Q: • Mira, Argonne National Lab • 786,432 cores • 8.162 PFLOPS Rmax, 10.06633 PFLOPS Rpeak • 3rd on TOP500 (June 2012) • SuperMUC, Leibniz Rechenzentrum • 147,456 cores • 2.897 PFLOPS Rmax, 3.185 PFLOPS Rpeak • 4th on TOP500 (June 2012) http://www-03.ibm.com/systems/technicalcomputing/solutions/bluegene/ http://www.alcf.anl.gov/mira
USA: Sequoia • First on the TOP500 as of June 2012 • Fully deployed in June 2012 • 16.32475 PFLOPS Rmax and 20.13266 PLOPS Rpeak • 7.89 MW Power consumption • IBM Blue Gene/Q • 98,304 compute nodes • 1.6 million processor cores • 1.6 PB of memory • 8 or 16 core Power Architecture processors built on a 45 nm • fabrication method • Lustre Parallel File System
Japan: K computer • First on the TOP500 as of November 2011 • Located at RIKEN Advanced Institute for Computational Science in Kobe, Japan • Distributed memory architecture • Manufactured by Fujitsu • Performance: 10.51 PFLOPS Rmax and 11.2804 PFLOPS Rpeak • 12.65989 MW Power Consumption, 824.6 GFLOPS/Kwatt • Annual running costs - $10 million • 864 cabinets: • 88,128 8-core SPARC64 VIIIfx processors @ 2.0 GHz • 705,024 cores • 96 computing nodes + 6 I/O nodes per cabinet • Water cooling system minimizes failure rate and power consumption http://www.fujitsu.com/global/news/pr/archives/month/2011/20111102-02.html
China: Tianhe-1A • Tianhe-1A: • #5 on Top-500 • 2.566 PFLOPS Rmax and 4.701 PFLOPS Rpeak • 112 computer cabinets, 12 storage cabinets, 6 communication cabinets and 8 I/O cabinets: • 14,336 Xeon X5670 processors • 7,168 Nvidia Tesla M2050 general purpose GPUs • 2,048 FreiTeng 1000 SPARC-based processors • 4.04 MW • Chinese custom-designed NUDT proprietary high-speed interconnect, called Arch (160 Gbit/s) • Purpose: petroleum exploration, aircraft design • $88 million to build, $20 million to maintain annually, operated by about 200 workers
China: Sunway Blue Light • Sunway Blue Light: • 26thon the TOP500 • 795.90 TFLOPS Rmax and 1.07016 PFLOPS Rpeak • 8,704 ShenWei SW1600 processors @ 975 MHz • 150 TB main memory • 2 PB external storage • Total power consumption 1.074 MW • 9-rack water-cooled system • Infiniband Interconnect, Linux OS
USA: Intrepid • Located in Argonne National Laboratory • IBM Blue Gene/P • 23rd in TOP500 list • 40 racks • 1024 nodes per rack • 850 MHz quad-core processors • 640 I/O nodes • 458 TFLOPS Rmax and 557 TFLOPS Rpeak • 1260 KW
Russia: Lomonosov • Located at Moscow State University Research Computer Center • Most powerful HPC system in Eastern Europe • 1.373 PFLOPS Rpeak and and 674.11 TFLOPS Rmax (TOP500) • 1.7 PFLOPS Rpeak and 901.9 TFLOPS Rmax (according to the system’s website) • 22nd on the TOP500 • 5,104 CPU nodes (T-Platforms T-Blade2 system) • 1,065 GPU nodes (T-Platforms TB2-TL) • Applications: regional and climate research, nanoscience, protein simulations http://www.t-platforms.com/solutions/lomonosov-supercomputer.html http://www.msu.ru
Cray Systems • Cray XK6 • Cray’s Gemini interconnect, AMD’s multicore scalar processors, and NVIDIA’s GPU processors • Scalable up to 50 Pflops of combined performance • 70 Tflops per cabinet • 16 core AMD Opteron 6200 (96 /cabinet) 1536 cores/cabinet • NVIDIA Tesla 2090 (96 per cabinet) • OS: Cray Linux Environment • Cray XE6 • Cray’s Gemini interconnect, AMD’s Opteron 6100 processors • 20.2 Tflops per cabinet • 12 core AMD Opteron 6100 (192/cabinet) 2034 cores/cabinet • 3D torus interconnect
ORNL’s “Titan” System • Upgrade of Jaguar from Cray XT5 to XK6 • Cray Linux Environment operating system • Gemini interconnect • 3-D Torus • Globally addressable memory • Advanced synchronization features • AMD Opteron 6274 processors (Interlagos) • New accelerated node design using NVIDIA multi-core accelerators • 2011: 960 NVIDIA x2090 “Fermi” GPUs • 2012: 14,592 NVIDIA “Kepler” GPUs • 20+ PFlops peak system performance • 600 TB DDR3 mem. + 88 TB GDDR5 mem 26
USA: Blue Waters • Petascale supercomputer to be deployed at NCSA at UIUC • IBM terminated the contract in August 2011 • Cray was chosen to deploy the system • Specifications: • Over 300 cabinets (Cray XE6 and XK6) • Over 25,000 compute nodes • 11.5 PFLOPS peak performance • Over 49,000 AMD processors (380,000 cores) • Over 3,000 NVIDIA GPUs • Over 1.5 PB overall system memory (4 GBper core)
UK: HECToR • Located in University of Edinburgh in Scotland • 32nd on the TOP500 list • 660.24 TFLOPS Rmax and 829.03 TFLOPS Rpeak • Currently 30 cabinets • 704 compute blades • Cray XE6 system • 16-core Interlagos Opteron @ 2.3 GHz • 90,112 cores total http://www.epcc.ed.ac.uk/news/hector-leads-the-way-the-world%E2%80%99s-first-production-cray-xt6-system http://www.hector.ac.uk/
Germany: HERMIT • Located at High Performance Computing Center Stuttgart • 24th on the TOP500 list • 831.40 TFLOPS Rmax and 1043.94 TFLOPS Rpeak • Currently 38 cabinets • Cray XE6 • 3552 compute nodes, 113,664 compute cores • Dual socker AMD Interlagos @ 2.3 GHz (16 cores) • 2 MW max power consumption • Research at HPCCS: • OpenMP Validation Suite, MPI-Start, Molecular Dynamics, Biomechanical simulations, Microsoft Parallel Visualization, Virtual Reality, Augmented reality http://www.hlrs.de/
NEC SX-9 • Supercomputer, an SMP system • Single-chip vector processor: • 3.2 GHz • 8-way replicated vector pipes (each has 2 multiply, 2 addition units) • 102.4 GFLOPS peak performance • 65 nm CMOS technology • Up to 64 GB of memory http://www.nec.com/en/de/en/prod/solutions/hpc-solutions/sx-9/index.html
Passing of the Torch IV. In Memoriam
Alan Turing - Centennial • Father of computability with the abstract “Turing Machine” • Father of Colossus • Arguably first digital electronic supercomputer • Broke German enigma codes • Saved 10s of thousands of lives;
Passing of the Torch V. The Flame Burns On
Key HPC Awards • Seymour Cray Award: Charles L. Seitz • For innovation in high-performance message passing architectures and networks • A member of National Academy of Engineering • President of Myricom, Inc. until last year • Projects he worked on include VLSI design, programming and packet-switching techniques for 2nd generation multicomputers, Myrinet high-performance interconnects • Sidney Fernbach Award: Cleve Moler • For fundamental contribution to linear algebra, mathematical software, and enabling tools for computational science • Professor at the University of Michigan, Stanford University and University of New Mexico • Co-founder of MathWorks, Inc. (in 1984) • One of the authors of LINPACK and EISPACK • Member of National Academy of Engineering • Past president of Society for Industrial and Applied Mathematics
Key Awards • Ken Kennedy Award: Susan L. Graham • For foundational compilation algorithms and programming tools; • research and discipline leadership; and exceptional mentoring • Fellow: Association for Computing Machinery, American Association for the Advancement of Science, American Academy of Arts and Sciences, National Academy of Engineering • Awards: IEEE John von Neumann Medal in 2009 • Research includes work on Harmonia, a language-based framework for interactive software development and Titanium, a Java-based parallel programming language, compiler and runtime system • ACM Turing Award: Judea Pearl • For fundamental contributions to artificial intelligence through the development of a calculus for probabilistic and causal reasoning • Professor of Computer Science at UCLA • Pioneered probabilistic and causal reasoning • Created computational foundation for processing information under certainty • Awards: IEEE intelligent systems, ACM Allen Newell, and many others • Member: AAAI, IEEE, National Academy of Engineering, and others
Picking up the Exascale Gauntlet • International Exascale Software Project • 7th meeting in Cologne, Germany • 8th meeting in Kobe, Japan • European Exascale Software Initiative • EESI initial study completed • EESI-2 inaugurated • US DOE X-stack Program • 4 teams selected to develop system software stack • To be announced • Plans in formulation by many nations
Strategic Exascale Issues • Next Generation Execution Model(s) • Runtime systems • Numeric parallel algorithms • Programming models and languages • Transition of legacy codes and methods • Computer architecture • Highly controversial • Must be COTS, but can’t be the same • Managing power and reliability • Good fast machines or just machines fast
Canonical Machine: Selection Process • Calculate histograms for selected non-numeric technical parameters • Architecture: Cluster – 410, MPP – 89, Constellations - 1 • Processor technology: Intel Nehalem – 196, other Intel EM64T – 141, AMD x86_64 – 63, Power – 49, etc. • Interconnect family: GigE – 224, IB – 209, Custom – 29, etc. • OS: Linux – 427, AIX – 28, CNK/SLES 9 – 11, etc. • Find the largest intersection including single modality for each parameter • (Cluster, Intel Nehalem, GigE, Linux): 110 machines matching • Note: this does not always include the largest histogram buckets. For example, in 2003 the dominant system architecture was cluster, but he largest parameter subset is generated for (Constellations, PA-RISC, Myrinet, HP-UX) • In the computed subset, find the centroid for selected numerical parameters • Rmax • Number of compute cores • Processor frequency • Other useful metric would be power, but it is not available for all entries in the list • Order machines in the subset by increasing distance (least square error) from the centroid • The parameter values have to be normalized due to different units used • The canonical machine is found at the head of the list
Centroid Placement in Parameter Space Centroid coordinates: #cores: 10896 Rmax: 64.3 TFlops Processor frequency: 2.82 GHz
The Canonical HPC System – 2012 • Architecture: commodity cluster • Dominant class of HEC system • Processor: 64-bit Intel Xeon Westmere-EP • E56xx (32nm die shrink of E55xx) in most deployed machines • Older E55xx Nehalem in the second place by machine count • Minimal changes since the last year: strong Intel dominance, with AMD and POWER based systems in distant second and third place • Intel maintains the strong lead in total number of cores deployed (4,337,423), followed by AMD Opteron (2,038,956) and IBM POWER (1,434,544) • Canonical example: #315 in the TOP500 • Rmax of 62.3 TFlops • Rpeak of 129 Tflops • 11016 cores • System node • Based on HS22 BladeCenter (3rd most popular system family) • Equipped with Intel X5670 6C clocked at 2.93 GHz (6 cores/12 threads per processor) • Homogeneous • Only 39 accelerated machines in the list; accelerator cores constitute 5.7% of all cores in TOP 500 • IBM systems integrator • Remains the most popular vendor (225 machines in the list) • Gigabit Ethernet interconnect • Infiniband still did not surpass GigE in popularity • Linux • By far #1 OS • Power consumption: 309 kW, 201.6 MFLOPS/W • Industry owned and operated (telecommunication company)
Next Year (if invited) • Leipzig • 10th Anniversary of Retrospective Keynotes • Focus on Impact and Accomplishments across an entire Decade • Application oriented: science and industry • System Software looking forward