TESLA GPU Computing • Supercomputing at 1/10th the Cost • http://www.nvidia.com/tesla
PARALLEL COMPUTING: Tesla™ • VISUALIZATION: Quadro™ • PERSONAL COMPUTING: GeForce™, Tegra™
GPGPU Revolutionizes Computing • Latency Processor (CPU) + Throughput Processor (GPU)
Tesla Data Center & Workstation GPU Solutions • Tesla M-series GPUs: M2070, M2050, M1060 • Tesla S-series 1U Systems: S2050, S1070 • Tesla C-series GPUs: C2070, C2050, C1060 • Integrated CPU-GPU Servers & Blades • OEM CPU Server + Tesla S-series 1U • Workstations: 2 to 4 Tesla GPUs
GPU Servers Go Mainstream • Tesla S870: Dec 2007 • Tesla S1070 / M1060: 2008-2009 • Tesla M2050 / M2070: 2010
4.5x Lower Power & Cooling Costs • 37 TeraFlop System: Top 150 System (as per November 2009 Top 500 List) • 7x Less Space Required: 2 racks of GPU+CPUs vs. 15 racks of CPUs • 5x Lower Cost: $740K vs. $3.8M • 4.5x Power Savings Every Year: $117K vs. $524K
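The headline ratios on this slide follow from the quoted figures. The C sketch below only redoes that arithmetic; the function names `ratio` and `print_slide_ratios` are hypothetical, and the inputs are the slide's own numbers.

```c
#include <stdio.h>

/* Hypothetical helper: how many times larger the CPU-only figure is
   than the GPU+CPU figure. */
double ratio(double cpu_only, double gpu_cpu) {
    return cpu_only / gpu_cpu;
}

/* Prints the three ratios using the figures quoted on the slide. */
void print_slide_ratios(void) {
    printf("space: %.1fx\n", ratio(15.0, 2.0));      /* racks */
    printf("cost:  %.1fx\n", ratio(3800.0, 740.0));  /* purchase cost, $K */
    printf("power: %.1fx\n", ratio(524.0, 117.0));   /* power cost, $K per year */
}
```

The unrounded values are roughly 7.5x, 5.1x and 4.5x, so the slide appears to round the space and cost figures down to 7x and 5x.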
8x Higher Linpack • CPU 1U Server: 2x Intel Xeon X5550 (Nehalem) 2.66 GHz, 48 GB memory, $7K, 0.55 kW • GPU-CPU 1U Server: 2x Tesla C2050 + 2x Intel Xeon X5550, 48 GB memory, $11K, 1.0 kW
The World's Fastest Supercomputer • Tianhe-1A • 2.507 Petaflop • 7168 Tesla M2050 GPUs • National Supercomputing Center in Tianjin
Dawning Nebulae • Second Fastest Supercomputer in the World • 1.27 Petaflop • 4640 Tesla GPUs • 2x Better Performance / Watt
TSUBAME 2.0 • Results from G80 and T10 GPUs on Tsubame 1.2 • Tsubame 2.0 Cluster: 1408 nodes with peak perf • 4224 GPUs = 2175 TFlops • 2816 CPUs = 216 TFlops • Memory = 80.55 TB • SSD = 173.88 TB • HP SL390 Server: 3x NVIDIA Tesla M2050 GPUs • 2x Intel Westmere-EP CPU • 52 GB DDR3 Memory • 2x 60 GB SSD • 2x QDR InfiniBand
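The device counts on this slide are consistent with the per-node configuration: 1408 nodes with 3 GPUs and 2 CPUs each. The C sketch below checks that and derives the average peak per device; the function names are hypothetical, and the per-device averages are computed here from the slide's totals rather than quoted from it.

```c
#include <stdio.h>

/* Figures quoted on the slide. */
enum { NODES = 1408, GPUS_PER_NODE = 3, CPUS_PER_NODE = 2 };

int total_gpus(void) { return NODES * GPUS_PER_NODE; }
int total_cpus(void) { return NODES * CPUS_PER_NODE; }

/* Average peak per device in GFlops, from an aggregate given in TFlops. */
double gflops_per_device(double total_tflops, int devices) {
    return total_tflops * 1000.0 / devices;
}

/* Prints device totals and the derived per-device averages. */
void print_tsubame_summary(void) {
    printf("GPUs: %d, about %.0f GFlops each\n",
           total_gpus(), gflops_per_device(2175.0, total_gpus()));
    printf("CPUs: %d, about %.0f GFlops each\n",
           total_cpus(), gflops_per_device(216.0, total_cpus()));
}
```

This works out to about 515 GFlops per GPU and about 77 GFlops per CPU, so the GPUs account for roughly ten times the aggregate peak of the CPUs.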
1000+ GPU Clusters Around the World • Legend: Existing Deployment, Prospective Deployment • Map labels: St. Petersburg University Norwegian Univ of S & T Nizhegorodsky University Copenhagen Aarhus Kazan Univ Daresbury Lab Groningen Max Planck Institute WestGrid PNNL 256 GPUs Oxford Institute of Physics Wisconsin VaTech Braunschweig Cambridge Utah Argonne Lab Fermi Lab Peking University OSC Maryland Osaka Riken 220 GPUs NERSC CEA Johns Hopkins NCSA 384 GPUs Tsinghua University Chinese Academy of Sciences 2000+ GPUs Harvard KISTI Berkeley Jefferson Labs TACC SNU Georgia Tech Stanford Tokyo Tech 680 GPUs Yonsei Univ of Science & Tech Delaware IIT Delhi UNC Indian Inst of Tropical Meteorology Oak Ridge NIT Calicut Nagasaki NCHC Indian Institute of Science National Taiwan Univ LRDE Dept of Space IIT Madras Anna Univ Curtin University CSIRO 256 GPUs Swinburne University
Increasing Number of Professional CUDA Applications
Columns: Available Now, Future • Legend: Announced, Available
Tools: CUDA C/C++ • PGI Accelerators • Platform LSF Cluster Mgr • TauCUDA Perf Tools • Parallel Nsight Vis Studio IDE • TotalView Debugger • PGI CUDA x86 • MATLAB • PGI CUDA Fortran • CAPS HMPP • Bright Cluster Manager • Allinea DDT Debugger • ParaTools VampirTrace • AccelerEyes Jacket MATLAB • Wolfram Mathematica
Libraries: NVIDIA NPP Perf Primitives • EMPhotonics CULAPACK • CUDA FFT • CUDA BLAS • Thrust C++ Template Lib • MAGMA (LAPACK) • NVIDIA Video Libraries • RNG & SPARSE CUDA Libraries
Oil & Gas: Headwave Suite • OpenGeoSolutions OpenSEIS • GeoStar Seismic Suite • Acceleware RTM Solver • StoneRidge RTM • Paradigm RTM • Panorama Tech • ffA SVI Pro • VSG Open Inventor • Seismic City RTM • Tsunami RTM • Paradigm SKUA
Bio-Chemistry: AMBER • NAMD • HOOMD • TeraChem • BigDFT • ABINIT • Acellera ACEMD • DL-POLY • GROMACS • LAMMPS • VMD • GAMESS • CP2K
Bio-Informatics: OpenEye ROCS • PIPER Docking • MUMmerGPU • CUDA-BLASTP • CUDA-MEME • HEX Protein Docking • CUDA-EC • CUDA SW++ SmithWaterm • GPU-HMMR
CAE: ACUSIM AcuSolve 1.8 • Autodesk Moldflow • Prometech Particleworks • Remcom XFdtd 7.0 • ANSYS Mechanical • FluiDyna OpenFOAM • LSTC LS-DYNA 971 • Metacomp CFD++ • MSC.Software Marc 2010.2
Increasing Number of Professional CUDA Applications
Columns: Available Now, Future • Legend: Announced, Available
Video: Adobe Premiere Pro CS5 • ARRI Various Apps • GenArts Sapphire • TDVision TDVCodec • Black Magic Da Vinci • The Foundry Kronos • MainConcept CUDA Encoder • Fraunhofer JPEG2000 • Cinnafilm Pixel Strings • Assimilate SCRATCH • Elemental Video
Rendering: Bunkspeed Shot (iray) • Refractive SW Octane • Random Control Arion • ILM Plume • Autodesk 3ds Max • Cebas finalRender • Works Zebra Zeany • mental images iray (OEM) • NVIDIA OptiX (SDK) • Caustic Graphics • Weta Digital PantaRay • Lightworks Artisan • Chaos Group V-Ray GPU
Finance: NAG RNG • Numerix Risk • SciComp SciFinance • RMS Risk Mgt Solutions • Murex MACS • Aquimin AlphaVision • Hanweck Options Analy
EDA: Agilent EMPro 2010 • CST Microwave • Agilent ADS SPICE • Acceleware FDTD Solver • Rocketick Verilog Sim • Synopsys TCAD • SPEAG SEMCAD X • Gauda OPC • Acceleware EM Solution
Other: MvTec Machine Vis • Siemens 4D Ultrasound • Digisens Medical • Schrodinger Core Hopping • Useful Progress Med • MotionDSP Ikena Video • Manifold GIS • Dalsa Machine Vision • Digital Anarchy Photo
20+ Oil & Gas Companies Porting to CUDA • Successful Customers • Oil & Gas ISVs
Finance: 10+ Banks Porting to CUDA • Successful Customers 124x Several unannounced 77x • Finance ISVs UnRisk
Defense / Federal Agencies: Software Available • Opportunities: Defense Contractors, Federal Agencies, Defense services • Speedups 10x-50x • GIS: Manifold, PCI Geomatics, DigitalGlobe • Signal Processing: GPU VSIPL • MATLAB: GPU Plugin available • UAV video analysis: MotionDSP Ikena • Virtual Prototyping: RealityServer • Surveillance, Cryptography
Tesla Bio WorkBench: Bio-Chemistry & Bio-Informatics • Applications: TeraChem, Hex (Docking), LAMMPS, CUDA-MEME, CUDA-BLASTP, CUDA-EC, MUMmerGPU • Community: Download, Documentation, Technical papers, Discussion Forums, Benchmarks & Configurations • Platforms: Tesla GPU Clusters, Tesla Personal Supercomputer
ANSYS Mechanical • > 125K Commercial Seats • Faster = Better Quality
MATLAB • GPU Performance in High-Level Programming Tool • 1 million+ Users in 175+ Countries • 3,500+ Universities Worldwide • 1,500+ MATLAB/Simulink Books • Faster = Productivity
NVIDIA Developer Eco-System • GPU Compilers: C, C++, Fortran, OpenCL, DirectCompute, Java, Python • Parallelizing Compilers: PGI Accelerator, CAPS HMPP, mCUDA, OpenMP • Numerical Packages: MATLAB, Mathematica, NI LabView, pyCUDA • Debuggers & Profilers: cuda-gdb, NV Visual Profiler, Parallel Nsight Visual Studio, Allinea, TotalView • Libraries: BLAS, FFT, LAPACK, NPP, Video, Imaging, GPULib • GPGPU Consultants & Training: ANEO, GPU Tech • OEM Solution Providers
Doing GPU Computing Right: Combination of Hardware and Software • GPU Computing Applications • C, C++, Fortran, OpenCL™, DirectCompute, Java and Python wrappers • NVIDIA GPU: CUDA Parallel Computing Architecture • OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.
Parallel Nsight (Visual Studio) • Visual Profiler (for Linux) • cuda-gdb (for Linux)
Compiling C for CUDA Applications
void serial_function(… ) { ... }
void other_function(int ... ) { ... }
void saxpy_serial(float ... ) { for (int i = 0; i < n; ++i) y[i] = a*x[i] + y[i]; }
int main( ) { float x; saxpy_serial(..); ... }
• Key Kernels: modify into parallel CUDA code (C CUDA), compile with NVCC (Open64) into CUDA object files • Rest of C Application: compile with the CPU Compiler into CPU object files • Linker: combines both sets of object files into one CPU-GPU Executable
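The slide elides the argument list of `saxpy_serial`. A complete plain-C version might look like the sketch below; the parameter order `(n, a, x, y)` is an assumption, since the slide shows only `float ...`.

```c
/* Serial SAXPY: y[i] = a*x[i] + y[i] for every i in [0, n).
   The parameter list is assumed; the slide elides it. */
void saxpy_serial(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Each iteration touches only element `i`, so no iteration depends on another. That independence is what makes this loop a natural candidate for the "key kernel" path through NVCC.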
C for CUDA: C with a few keywords • Standard C Code • Parallel C Code
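The two code panels of this slide did not survive extraction. As a rough illustration of the idea, the plain-C sketch below (not CUDA, and not the slide's own code) emulates how a parallel SAXPY divides the work: the elements are split into blocks of logical threads, and each (block, thread) pair handles the single index `block * block_dim + thread`. The names `saxpy_element` and `saxpy_grid` and their parameters are hypothetical.

```c
/* Work done by one logical thread: handle a single element, if in range. */
void saxpy_element(int block, int block_dim, int thread,
                   int n, float a, const float *x, float *y) {
    int i = block * block_dim + thread;  /* global element index */
    if (i < n)                           /* the last block may be partly idle */
        y[i] = a * x[i] + y[i];
}

/* Serial stand-in for a kernel launch: visit every (block, thread) pair. */
void saxpy_grid(int n, float a, const float *x, float *y, int block_dim) {
    int num_blocks = (n + block_dim - 1) / block_dim;  /* round up */
    for (int block = 0; block < num_blocks; ++block)
        for (int thread = 0; thread < block_dim; ++thread)
            saxpy_element(block, block_dim, thread, n, a, x, y);
}
```

In CUDA C the two loops in `saxpy_grid` disappear: the per-element body becomes a `__global__` function, the index comes from the built-in `blockIdx`, `blockDim` and `threadIdx` variables, and the launch syntax supplies the number of blocks and threads per block. Those are the "few keywords" the slide title refers to.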
CUDA C/C++ Continuous Innovation
CUDA Toolkit 1.x: C Compiler • C Extensions • Single Precision • BLAS • FFT • SDK w/ 40 samples • Win XP 64 • Atomics support • Multi-GPU support
CUDA Toolkit 2.x: Double Precision • cuda-gdb • Visual Profiler • Compiler Optimizations • Vista 32/64 • Mac OSX • 3D Textures • HW Interpolation • DP FFT • 16-32 Conversion intrinsics • Performance enhancements
CUDA Toolkit 3.x: Fermi arch support • C++ Class Templates • C++ Class Inheritance • Tools updates • cuda-memcheck • GPUDirect™ • 16-way concurrency • Function pointers & recursion • Parallel Nsight (beta)
New in 3.2: New cuSPARSE Library • New cuRAND Library (Sobol) • Support for 6GB Tesla & Quadro • Multi-GPU Debugging • Math Library Perf Improvements • Cluster Management Features • Integrated TCC Mode
Performance Summary Preliminary data
Standard FFT Library: cuFFT 3.2 • cuFFT 3.2: NVIDIA Tesla C1060, Tesla C2050 (Fermi) • MKL 10.2.4.32: Quad-Core Intel Xeon 5550, 2.67 GHz
Standard BLAS Library: cuBLAS 3.2 • cuBLAS 3.2: NVIDIA Tesla C1060, Tesla C2050 (Fermi) • MKL 10.2.4.32: Quad-Core Intel Xeon 5550, 2.67 GHz
Matrix Size for Best cuBLAS 3.2 Performance • cuBLAS 3.2: NVIDIA Tesla C1060, Tesla C2050 (Fermi) • MKL 10.2.4.32: Quad-Core Intel Xeon 5550, 2.67 GHz
CULA 1.3 LAPACK Library from EM Photonics • Double Precision Results • Data Courtesy: EM Photonics
Sparse Matrix-Vector Multiplication (SpMV) • SpMV: CUDA 3.0, Tesla C1060 and Tesla C2050 • MKL 10.2: Intel Xeon 5550, 2.67 GHz • Preliminary data
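For readers unfamiliar with the operation being benchmarked, SpMV computes y = A*x where A stores only its nonzero entries. The slide does not say which storage format was measured; the plain-C sketch below assumes the common compressed sparse row (CSR) layout, and the name `spmv_csr` and its parameters are hypothetical.

```c
/* y = A*x for an m-row sparse matrix A in CSR form (assumed format).
   The nonzeros of row r are val[k] for k in [row_ptr[r], row_ptr[r+1]),
   and col_idx[k] gives the column of val[k]. */
void spmv_csr(int m, const int *row_ptr, const int *col_idx,
              const float *val, const float *x, float *y) {
    for (int r = 0; r < m; ++r) {
        float sum = 0.0f;
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[r] = sum;
    }
}
```

The rows are independent of one another, so a parallel version can assign rows to different threads. The indirect read `x[col_idx[k]]` makes memory access irregular, which is why SpMV is usually treated as a memory-bandwidth benchmark rather than a raw-arithmetic one.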