GPU based cloud computing
Dairsie Latimer, Petapath, UK
About Petapath
• Founded in 2008 to focus on delivering innovative hardware and software solutions into the high performance computing (HPC) markets
• Partnered with HP and SGI to deliver two Petascale prototype systems as part of the PRACE WP8 programme
• The system is a testbed for new ideas in usability, scalability and efficiency of large computer installations
• Active in exploiting emerging standards for acceleration technologies; members of the Khronos Group, sitting on the OpenCL working committee
• We also provide consulting expertise for companies wishing to explore the advantages offered by heterogeneous systems
What is Heterogeneous or GPU Computing?
• Computing with CPU + GPU: Heterogeneous Computing
[Diagram: x86 CPU and GPU connected by the PCIe bus]
Low Latency or High Throughput? • CPU • Optimised for low-latency access to cached data sets • Control logic for out-of-order and speculative execution • GPU • Optimised for data-parallel, throughput computation • Architecture tolerant of memory latency • More transistors dedicated to computation
NVIDIA GPU Computing Ecosystem
[Diagram labels: CUDA Development Specialist, TPP / OEM, ISV, CUDA Training Company, Hardware Architect, VAR]
Science is Desperate for Throughput
[Chart: simulation size against required Gigaflops on a log scale from 1 Gigaflop through 1 Petaflop to 1 Exaflop, with a time axis marked 1982, 1997, 2003, 2006, 2010 and 2012]
• BPTI: 3K atoms
• Estrogen Receptor: 36K atoms
• F1-ATPase: 327K atoms
• Ribosome: 2.7M atoms
• Chromatophore: 50M atoms
• Bacteria: 100s of Chromatophores
• Chart annotation: Ran for 8 months to simulate 2 nanoseconds
Power Crisis in Supercomputing
Household Power Equivalent
• Gigaflop Block: 60,000 Watts (1982)
• Teraflop Neighborhood: 850,000 Watts (1996)
• Petaflop Town: 7,000,000 Watts (2008; Jaguar, Los Alamos)
• Exaflop City: 25,000,000 Watts (2020)
Enter the GPU
NVIDIA GPU Product Families
• GeForce®: Entertainment
• Tesla™: High-Performance Computing
• Quadro®: Design & Creation
Introducing the ‘Fermi’ Tesla Architecture: The Soul of a Supercomputer in the body of a GPU
• 3 billion transistors
• Up to 2× the cores (C2050 has 448)
• Up to 8× the peak DP performance
• ECC on all memories
• L1 and L2 caches
• Improved memory bandwidth (GDDR5)
• Up to 1 Terabyte of GPU memory
• Concurrent kernels
• Hardware support for C++
[Diagram: Fermi die layout with six DRAM interfaces, host interface, GigaThread scheduler and shared L2 cache]
Design Goal of Fermi
• Expand performance sweet spot of the GPU
• Bring more users, more applications to the GPU
[Diagram: spectrum from instruction parallel (CPU, many decisions) to data parallel (GPU, large data sets)]
Streaming Multiprocessor Architecture
• 32 CUDA cores per SM (512 total)
• 8× peak double precision floating point performance
  • 50% of peak single precision
• Dual Thread Scheduler
• 64 KB of RAM for shared memory and L1 cache (configurable)
[Diagram: SM block with instruction cache, two schedulers and dispatch units, register file, 32 cores, 16 load/store units, 4 special function units, interconnect network, 64K configurable cache/shared memory and uniform cache]
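The core counts on this slide translate into theoretical peak throughput with simple arithmetic. The sketch below assumes each CUDA core retires one fused multiply-add (two floating-point operations) per clock and uses the slide's statement that double precision runs at 50% of single precision. The 1.15 GHz clock is the published shader clock of the Tesla C2050 and does not appear in the slides; `peak_gflops` is a hypothetical helper name.

```python
def peak_gflops(cores: int, clock_ghz: float, flops_per_core_per_clock: int = 2) -> float:
    """Theoretical peak in GFLOPS, assuming one FMA (2 flops) per core per clock."""
    return cores * clock_ghz * flops_per_core_per_clock


CORES_PER_SM = 32        # from the slide
SM_COUNT = 16            # 512 total cores / 32 cores per SM
C2050_CORES = 448        # from the Fermi overview slide
C2050_CLOCK_GHZ = 1.15   # assumption: not stated anywhere in the slides

full_chip_cores = SM_COUNT * CORES_PER_SM
c2050_sp = peak_gflops(C2050_CORES, C2050_CLOCK_GHZ)
c2050_dp = 0.5 * c2050_sp  # slide: double precision is 50% of peak single precision

print(f"Full Fermi chip: {full_chip_cores} cores")
print(f"C2050 peak single precision: {c2050_sp:.1f} GFLOPS")
print(f"C2050 peak double precision: {c2050_dp:.1f} GFLOPS")
```

These are upper bounds only; real kernels are usually limited by memory bandwidth or latency well before they reach them.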
CUDA Core Architecture
• New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs
• Fused multiply-add (FMA) instruction for both single and double precision
• New integer ALU optimized for 64-bit and extended precision operations
[Diagram: CUDA core within the SM, showing dispatch port, operand collector, FP unit, INT unit and result queue]
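What a fused multiply-add buys numerically can be shown without a GPU. The sketch below is an emulation under stated assumptions, not how the hardware works: it uses Python's exact `Fraction` arithmetic to compute `a*b + c` with a single final rounding (the behaviour IEEE 754-2008 specifies for FMA) and compares it with the ordinary two-step version, which rounds after the multiply and again after the add. The function names are hypothetical.

```python
from fractions import Fraction


def mul_then_add(a: float, b: float, c: float) -> float:
    """Unfused: the product is rounded to double before the addition."""
    return a * b + c


def fma_emulated(a: float, b: float, c: float) -> float:
    """Fused (emulated): exact product and sum, rounded to double only once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # correctly rounded conversion to the nearest double


# a * b is exactly 1 - 2**-60, which is not representable as a double.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

unfused = mul_then_add(a, b, c)
fused = fma_emulated(a, b, c)

print("unfused:", unfused)
print("fused:  ", fused)
```

In the unfused path the product rounds to 1.0, so the small term is lost before the addition; the fused path keeps it.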
Cached Memory Hierarchy
• First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory
• L1 Cache per SM (32 cores)
  • Improves bandwidth and reduces latency
• Unified L2 Cache (768 KB)
  • Fast, coherent data sharing across all cores in the GPU
[Diagram: Parallel DataCache™ memory hierarchy, with DRAM interfaces, host interface, GigaThread scheduler and shared L2]
Larger, Faster, Resilient Memory Interface
• GDDR5 memory interface
  • 2× signaling speed of GDDR3
• Up to 1 Terabyte of memory attached to GPU
  • Operate on larger data sets (3 and 6 GB Cards)
• ECC protection for GDDR5 DRAM
• All major internal memories are ECC protected
  • Register file, L1 cache, L2 cache
GigaThread Hardware Thread Scheduler
• Concurrent Kernel Execution + Faster Context Switch
[Diagram: timeline comparing serial kernel execution, where Kernels 1 to 5 run one after another, with parallel kernel execution, where several kernels occupy the GPU at the same time]
GigaThread Streaming Data Transfer Engine
• Dual DMA engines
• Simultaneous CPU→GPU and GPU→CPU data transfer
• Fully overlapped with CPU and GPU processing time
[Diagram: activity snapshot for Kernels 0 to 3, each with CPU, SDT0, GPU and SDT1 phases overlapping in time]
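The benefit of two DMA engines can be estimated with a simple pipeline model. The sketch below is a timing model only, under the assumption that upload, kernel and download each run on their own engine and that every chunk takes the same time per stage; the function names and the example timings are hypothetical and not taken from the slides.

```python
from typing import Sequence


def serial_time(chunks: int, stage_times: Sequence[float]) -> float:
    """No overlap: every chunk runs upload, kernel, download back to back."""
    return chunks * sum(stage_times)


def pipelined_time(chunks: int, stage_times: Sequence[float]) -> float:
    """Full overlap: each stage has its own engine and handles one chunk at a time."""
    engine_free_at = [0.0] * len(stage_times)
    for _ in range(chunks):
        previous_stage_done = 0.0
        for i, duration in enumerate(stage_times):
            # A stage starts once its input is ready and its engine is idle.
            start = max(previous_stage_done, engine_free_at[i])
            engine_free_at[i] = start + duration
            previous_stage_done = engine_free_at[i]
    return engine_free_at[-1]


# Hypothetical per-chunk times in ms: CPU->GPU copy, kernel, GPU->CPU copy.
stages = (1.0, 2.0, 1.0)
print("serial:   ", serial_time(4, stages), "ms")
print("pipelined:", pipelined_time(4, stages), "ms")
```

With overlap, the total is bounded by the slowest stage rather than by the sum of all three, which is why transfers can hide behind compute when the kernel dominates.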
Enhanced Software Support
• Many new features in CUDA Toolkit 3.0
  • To be released on Friday
• Including early support for the Fermi architecture:
  • Native 64-bit GPU support
  • Multiple Copy Engine support
  • ECC reporting
  • Concurrent Kernel Execution
  • Fermi HW debugging support in cuda-gdb
Enhanced Software Support
• OpenCL 1.0 Support
  • First class language citizen in CUDA Architecture
  • Supports ICD (so interoperability between vendors is a possibility)
  • Profiling support available
  • Debug support coming to Parallel Nsight (NEXUS) soon
• gDebugger CL from graphicREMEDY
  • Third party OpenCL profiler/debugger/memory checker
• Software Tools Ecosystem is starting to grow
  • Given boost by existence of OpenCL
“Oak Ridge National Lab (ORNL) has already announced it will be using Fermi technology in an upcoming super that is "expected to be 10-times more powerful than today's fastest supercomputer." Since ORNL's Jaguar supercomputer, for all intents and purposes, holds that title, and is in the process of being upgraded to 2.3 PFlops…. …we can surmise that the upcoming Fermi-equipped super is going to be in the 20 Petaflops range.” September 30 2009
Tesla GPU Computing Products: 10 Series SuperMicro 1U GPU SuperServer Tesla S1070 1U System Tesla C1060 Computing Board Tesla Personal Supercomputer
Tesla GPU Computing Products: 20 Series Tesla S2050 1U System Tesla S2070 1U System Tesla C2050 Computing Board Tesla C2070 Computing Board
Data Centers: Space and Energy Limited
Traditional Data Center Cluster
• 1000’s of cores, 1000’s of servers
• Quad-core CPU, 8 cores per server
• 2x Performance requires 2x Number of Servers
Heterogeneous Data Center Cluster
• 10,000’s of cores, 100’s of servers
• Augment/replace host servers
Cluster Deployment
• Now a number of GPU aware Cluster Management Systems
  • ActiveEon ProActive Parallel Suite® Version 4.2
  • Platform Cluster Manager and HPC Workgroup
  • Streamline Computing GPU Environment (SCGE)
• Not just installation aids
  • i.e. putting the driver and toolkits in the right place
  • now starting to provide GPU node discovery and job steering
• NVIDIA and Mellanox
  • Better interoperability between Mellanox InfiniBand adapters and NVIDIA Tesla GPUs
  • Can provide as much as a 30% performance improvement by eliminating unnecessary data movement in a multi node heterogeneous application
Cluster Deployment • A number of cluster and distributed debug tools now support CUDA and NVIDIA Tesla • Allinea® DDT for NVIDIA CUDA • Extends well known Distributed Debugging Tool (DDT) with CUDA support • TotalView® debugger (part of an Early Experience Program) • Extends with CUDA support, have also announced intentions to support OpenCL • Both based on the Parallel Nsight (NEXUS) Debugging API
NVIDIA Reality Server 3.0
• Cloud computing platform for running 3D web applications
• Consists of a Tesla RS GPU-based server cluster running RealityServer software from mental images
• Deployed in a number of different sizes
  • From 2 to 100’s of 1U Servers
• iray®: Interactive Photorealistic Rendering Technology
  • Streams interactive 3D applications to any web connected device
  • Designers and architects can now share and visualize complex 3D models under different lighting and environmental conditions
Distributed Computing Projects
• Traditional distributed computing projects have been making use of GPUs for some time (non-commercial)
• Typically have 1,000’s to 10,000’s of contributors
  • Folding@Home has access to 6.5 PFLOPS of compute
  • Of which ~95% comes from GPUs or PS3s
• Many are bio-informatics, molecular dynamics and quantum chemistry codes
  • Represent the current sweet spot applications
• Ubiquity of GPUs in home systems helps
Distributed Computing Projects
• Folding@Home
  • Directed by Prof. Vijay Pande at Stanford University (http://folding.stanford.edu/)
  • Most recent GPU3 Core based on OpenMM 1.0 (https://simtk.org/home/openmm)
• OpenMM library provides tools for molecular modeling simulation
  • Can be hooked into any MM application, allowing that code to do molecular modeling with minimal extra effort
  • OpenMM has a strong emphasis on hardware acceleration, providing not just a consistent API but much greater performance
• Current NVIDIA target is via CUDA Toolkit 2.3
• OpenMM 1.0 also provides Beta support for OpenCL
  • OpenCL is the long term convergence software platform
Distributed Computing Projects • Berkeley Open Infrastructure for Network Computing • BOINC project (http://boinc.berkeley.edu/) • Platform infrastructure originally evolved from SETI@home • Many projects use BOINC and several of these have heterogeneous compute implementations (http://boinc.berkeley.edu/wiki/GPU_computing) • Examples include: • GPUGRID.net • SETI@home • Milkyway@home (IEEE 754 Double precision capable GPU required) • AQUA@home • Lattice • Collatz Conjecture
Distributed Computing Projects
• GPUGRID.net
  • Dr. Gianni De Fabritiis, Research Group of Biomedical Informatics, University Pompeu Fabra-IMIM, Barcelona
  • Uses GPUs to deliver high-performance all-atom biomolecular simulation of proteins using ACEMD (http://multiscalelab.org/acemd)
• ACEMD is a production bio-molecular dynamics code specially optimized to run on graphics processing units (GPUs) from NVIDIA
  • It reads CHARMM/NAMD and AMBER input files with a simple and powerful configuration interface
  • A commercial implementation of ACEMD is available from Acellera Ltd (http://www.acellera.com/acemd/)
• What makes this particularly interesting is that it is implemented using OpenCL
Distributed Computing Projects
• Have had to use brute force methods to deal with robustness
  • Run the same work unit (WU) with multiple users and compare results
• Running on purpose designed heterogeneous grids with ECC
  • Means that some of the paranoia can be relaxed (can at least detect there have been soft errors or WU corruption)
  • Results in better throughput on these systems
• But does result in divergence between Consumer and HPC devices
  • Should be compensated for by HPC class devices being about 4x faster
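The "run the same WU with multiple users and compare results" approach can be sketched as a quorum check. This is a minimal illustration, assuming each volunteer returns a comparable digest of its result; `validate_work_unit` and the digest strings are hypothetical, and this is not the validator API of BOINC or Folding@Home.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def validate_work_unit(results: Sequence[Hashable], quorum: int = 2) -> Optional[Hashable]:
    """Return the result that at least `quorum` volunteers agree on, else None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count >= quorum else None


# Hypothetical result digests returned by three volunteers for one work unit.
returned = ["a1f3", "a1f3", "9c0e"]
accepted = validate_work_unit(returned)
print("accepted result:", accepted)

# Two volunteers that disagree: no quorum, so the work unit must be reissued.
print("accepted result:", validate_work_unit(["a1f3", "9c0e"]))
```

Every accepted result costs at least `quorum` executions, which is the throughput penalty that ECC-protected hardware lets an operator relax.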
Tesla Bio Workbench Accelerating New Science January, 2010 http://www.nvidia.com/bio_workbench
Introducing Tesla Bio WorkBench
• Applications: TeraChem, LAMMPS, MUMmerGPU, GPU-AutoDock
• Community: Download, Documentation, Technical papers, Discussion Forums, Benchmarks & Configurations
• Platforms: Tesla GPU Clusters, Tesla Personal Supercomputer
Tesla Bio Workbench Applications
• AMBER (MD), ACEMD (MD), GROMACS (MD), GROMOS (MD), LAMMPS (MD), NAMD (MD), TeraChem (QC), VMD (Visualization MD & QC)
• Docking: GPU AutoDock
• Sequence analysis: CUDASW++ (Smith-Waterman), MUMmerGPU, GPU-HMMER, CUDA-MEME Motif Discovery
Recommended Hardware Configurations
• Tesla Personal Supercomputer
  • Up to 4 Tesla C1060s per workstation
  • 4GB main memory / GPU
• Tesla GPU Clusters
  • Tesla S1070 1U: 4 GPUs per 1U
  • Integrated CPU-GPU Server: 2 GPUs per 1U + 2 CPUs
Specifics at http://www.nvidia.com/bio_workbench
Molecular Dynamics and Quantum Chemistry Applications
• AMBER (MD)
• ACEMD (MD)
• HOOMD (MD)
• GROMACS (MD)
• LAMMPS (MD)
• NAMD (MD)
• TeraChem (QC)
• VMD (Viz. MD & QC)
• Typical speed ups of 3-8x on a single Tesla C1060 vs Modern 1U
• Some applications (compute bound) show 20-100x speed ups
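One common way to reason about the gap between the 3-8x and 20-100x figures is Amdahl's law, which the slides do not mention and which is brought in here as an assumption. The sketch below computes the overall speed-up when only a fraction of the runtime is accelerated; the fractions and the 50x kernel speed-up are illustrative values, not measurements from any of the listed applications.

```python
def amdahl_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """Overall speed-up when only part of the runtime runs faster (Amdahl's law)."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / kernel_speedup)


# Illustrative only: the same 50x kernel speed-up, different accelerated fractions.
for fraction in (0.80, 0.90, 0.99, 1.00):
    print(f"{fraction:.0%} accelerated -> {amdahl_speedup(fraction, 50.0):.1f}x overall")
```

Under this model a code that spends 90% of its time in GPU-friendly kernels lands in single digits overall, while an almost entirely compute-bound code approaches the raw kernel speed-up.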
Usage of TeraGrid National Supercomputing Grid Half of the cycles
Summary
• ‘Fermi’ debuts HPC/Enterprise features
  • Particularly ECC and high performance double precision
• Software development environments are now more mature
• Significant software ecosystem is starting to emerge
  • Broadening availability of development tools, libraries and applications
  • Heterogeneous (GPU) aware cluster management systems
• Economics, open standards and improving programming methodologies
  • Heterogeneous computing is gradually changing the long held perception that it is just an ‘exotic’ niche technology
AMBER Molecular Dynamics
• Implicit solvent GB results: 1 Tesla GPU 8x faster than 2 quad-core CPUs
• Alpha now: Generalized Born
• Beta release Q1 2010: PME (Particle Mesh Ewald)
• Beta 2 release Q2 2010: Multi-GPU + MPI support
[Chart: Generalized Born Simulations, with bars labelled 7x and 8.6x]
More Info http://www.nvidia.com/object/amber_on_tesla.html
Data courtesy of San Diego Supercomputing Center
GROMACS Molecular Dynamics
• PME results: 1 Tesla GPU 3.5x-4.7x faster than CPU
• Beta now: Particle Mesh Ewald (PME), implicit solvent GB, arbitrary forms of non-bonded interactions
• Beta 2 release Q2 2010: Multi-GPU + MPI support
[Chart: GROMACS on Tesla GPU vs CPU, for Reaction-Field Cutoffs and Particle-Mesh-Ewald (PME), with bars labelled 22x, 3.5x and 5.2x]
More Info http://www.nvidia.com/object/gromacs_on_tesla.html
Data courtesy of Stockholm Center for Biomembrane Research
HOOMD Blue Molecular Dynamics Written bottom-up for CUDA GPUs Modeled after LAMMPS Supports multiple GPUs 1 Tesla GPU outperforms 32 CPUs running LAMMPS More Info http://www.nvidia.com/object/hoomd_on_tesla.html Data courtesy of University of Michigan
LAMMPS: Molecular Dynamics on a GPU Cluster
• Available as beta on CUDA
  • Cut-off based non-bonded terms
  • 2 GPUs outperform 24 CPUs
• PME based electrostatics
  • Preliminary results: 5X speed-up
• Multiple GPU + MPI support enabled
[Chart: 2 GPUs = 24 CPUs]
More Info http://www.nvidia.com/object/lammps_on_tesla.html
Data courtesy of Scott Hampton & Pratul K. Agarwal, Oak Ridge National Laboratory
NAMD: Scaling Molecular Dynamics on a GPU Cluster
• Feature complete on CUDA: available in NAMD 2.7 Beta 2
  • Full electrostatics with PME
  • Multiple time-stepping
  • 1-4 Exclusions
• 4 GPU Tesla PSC outperforms 8 CPU servers
• Scales to a GPU cluster
[Chart: 4 GPUs = 16 CPUs]
More Info http://www.nvidia.com/object/namd_on_tesla.html
Data courtesy of Theoretical and Computational Biophysics Group, UIUC