Learn about the powerful capabilities of GPU processing with CUDA applications on the Fermi architecture, including innovative uses in medical imaging, molecular dynamics, video transcoding, and more. Discover the various tools and libraries available to enhance your CUDA programming experience.
Parallel Processing on GPUs with the Fermi Architecture (Processamento Paralelo em GPUs na Arquitetura Fermi)
Arnaldo Tavares, Tesla Sales Manager for Latin America
Quadro or Tesla? Tesla™ vs. Quadro™
GPU Computing: CPU + GPU Co-Processing
• CPU: 4 cores, 48 GigaFlops (DP)
• GPU: 448 cores, 515 GigaFlops (DP) (average efficiency in Linpack: 50%)
50x – 150x Speedups Across Application Domains
• 146X Medical Imaging, U of Utah
• 36X Molecular Dynamics, U of Illinois, Urbana
• 18X Video Transcoding, Elemental Tech
• 50X Matlab Computing, AccelerEyes
• 100X Astrophysics, RIKEN
• 149X Financial Simulation, Oxford
• 47X Linear Algebra, Universidad Jaime
• 20X 3D Ultrasound, Techniscan
• 130X Quantum Chemistry, U of Illinois, Urbana
• 30X Gene Sequencing, U of Maryland
Increasing Number of Professional CUDA Apps (Available Now / Future)
• Tools: CUDA C/C++, PGI Accelerators, Platform LSF Cluster Mgr, TauCUDA Perf Tools, Parallel Nsight Visual Studio IDE, TotalView Debugger, MATLAB, PGI CUDA x86 Tools, PGI CUDA Fortran, CAPS HMPP, Bright Cluster Manager, Allinea DDT Debugger, ParaTools VampirTrace, AccelerEyes Jacket MATLAB, Wolfram Mathematica
• Libraries: NVIDIA NPP Perf Primitives, EMPhotonics CULAPACK, CUDA FFT, CUDA BLAS, Thrust C++ Template Lib, MAGMA (LAPACK), NVIDIA Video Libraries, RNG & SPARSE CUDA Libraries
• Oil & Gas: Headwave Suite, OpenGeoSolutions OpenSEIS, GeoStar Seismic Suite, Acceleware RTM Solver, StoneRidge RTM, Paradigm RTM, Panorama Tech, ffA SVI Pro, VSG Open Inventor, Seismic City RTM, Tsunami RTM, Paradigm SKUA
• Bio-Chemistry: AMBER, NAMD, HOOMD, TeraChem, BigDFT, ABINIT, Acellera ACEMD, DL-POLY, GROMACS, LAMMPS, VMD, GAMESS, CP2K, OpenEye ROCS, PIPER Docking
• Bio-Informatics: MUMmerGPU, CUDA-BLASTP, CUDA-MEME, HEX Protein Docking, CUDA-EC, CUDA SW++ (Smith-Waterman), GPU-HMMER
• CAE: ACUSIM AcuSolve 1.8, Autodesk Moldflow, Prometch Particleworks, Remcom XFdtd 7.0, LSTC LS-DYNA 971, FluiDyna OpenFOAM, ANSYS Mechanical, Metacomp CFD++, MSC.Software Marc 2010.2
(Legend: Announced / Available)
Increasing Number of Professional CUDA Apps (Available Now / Future)
• Video: Adobe Premiere Pro CS5, ARRI Various Apps, GenArts Sapphire, TDVision TDVCodec, Black Magic Da Vinci, The Foundry Kronos, MainConcept CUDA Encoder, Fraunhofer JPEG2000, Cinnafilm Pixel Strings, Assimilate SCRATCH, Elemental Video
• Rendering: Bunkspeed Shot (iray), Refractive SW Octane, Random Control Arion, ILM Plume, Autodesk 3ds Max, Cebas finalRender, Works Zebra Zeany, mental images iray (OEM), NVIDIA OptiX (SDK), Caustic Graphics, Weta Digital PantaRay, Lightworks Artisan, Chaos Group V-Ray GPU
• Finance: NAG RNG, Numerix Risk, SciComp SciFinance, RMS Risk Mgt Solutions, Murex MACS, Aquimin AlphaVision, Hanweck Options Analytics
• EDA: Agilent EMPro 2010, CST Microwave, Agilent ADS SPICE, Acceleware FDTD Solver, Rocketick Verilog Sim, Synopsys TCAD, SPEAG SEMCAD X, Gauda OPC
• Medical: Acceleware EM Solution, Siemens 4D Ultrasound, Digisens, Schrodinger Core Hopping, Useful Progress
• Other: MVTec Machine Vision, MotionDSP Ikena Video, Manifold GIS, Dalsa Machine Vision, Digital Anarchy Photo
(Legend: Announced / Available)
What if Every Supercomputer Had Fermi? (Linpack Teraflops, Top 500 Supercomputers, Nov 2009)
• Top 50: 450 GPUs, 110 TeraFlops, $2.2M
• Top 100: 225 GPUs, 55 TeraFlops, $1.1M
• Top 150: 150 GPUs, 37 TeraFlops, $740K
Hybrid ExaScale Trajectory * This is a projection based on Moore’s law and does not represent a committed roadmap
The March of the GPUs
[Chart: double-precision performance over time, NVIDIA GPU (ECC off) vs. x86 CPU]
Workstation / Data Center Solutions
• Integrated CPU-GPU Server: 2x Tesla M2050/70 GPUs in 1U
• OEM CPU Server + Tesla S2050/70: 4 Tesla GPUs in 2U
• Workstations: up to 4x Tesla C2050/70 GPUs
How is the GPU Used?
• Basic component: the "Streaming Multiprocessor" (SM)
• SIMD: "Single Instruction, Multiple Data": all cores execute the same instruction, each operating on different data (see the sketch below)
• "SIMD at the SM, MIMD at the GPU chip"
Source: Presentation from Felipe A. Cruz, Nagasaki University
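A minimal sketch of this execution model (not from the original slides; the kernel and variable names are illustrative): every thread runs the same instruction stream, but each indexes its own data element.

    __global__ void scale(float *data, float factor, int n)
    {
        // Same instruction for all cores: each thread applies the
        // identical operation to a different element of the array.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                  // threads past the end simply idle
            data[i] *= factor;
    }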
The Use of GPUs and Bottleneck Analysis Source: Presentation from Takayuki Aoki, Tokyo Institute of Technology
The Fermi Architecture
• 3 billion transistors
• 16 Streaming Multiprocessors (SMs)
• 6 x 64-bit memory partitions = 384-bit memory interface
• Host Interface: connects the GPU to the CPU via PCI-Express
• GigaThread global scheduler: distributes thread blocks to the SM thread schedulers
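As a hedged aside (not part of the original deck), these properties are visible from host code through the standard CUDA runtime call cudaGetDeviceProperties; on a full Fermi part it reports 16 multiprocessors.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0
        printf("SMs: %d\n", prop.multiProcessorCount);      // 16 on full Fermi
        printf("Global memory: %zu bytes\n", prop.totalGlobalMem);
        return 0;
    }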
SM Architecture
• 32 CUDA cores per SM (512 total)
• 16 Load/Store units: source and destination addresses calculated for 16 threads per clock
• 4 Special Function Units (sine, cosine, square root, etc.)
• 64 KB of RAM for shared memory and L1 cache (configurable)
• Dual Warp Scheduler
[Diagram: instruction cache; dual scheduler/dispatch; register file; 32 cores; 16 load/store units; 4 SFUs; interconnect network; 64K configurable cache/shared memory; uniform cache]
Dual Warp Scheduler
• 1 warp = 32 parallel threads
• 2 warps issued and executed concurrently
• Each warp is dispatched to 16 CUDA cores
• Most instructions can be dual-issued (exception: double-precision instructions)
• The dual-issue model allows near-peak hardware performance
CUDA Core Architecture
• Implements the new IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs
• Newly designed integer ALU optimized for 64-bit and extended-precision operations
• Fused multiply-add (FMA) instruction for both 32-bit single and 64-bit double precision (see the sketch below)
[Diagram: CUDA core with dispatch port, operand collector, FP unit, INT unit, and result queue]
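A small sketch of FMA in device code (an assumed example, not from the slides): fmaf(a, b, c) computes a*b + c with a single rounding step, which on Fermi maps onto the FMA hardware described above.

    __global__ void saxpy_fma(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            // One fused operation instead of a multiply followed by an
            // add: no intermediate rounding between the two steps.
            y[i] = fmaf(a, x[i], y[i]);
    }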
GigaThread™ Hardware Thread Scheduler (HTS)
• Hierarchically manages thousands of simultaneously active threads
• 10x faster application context switching (each program receives a time slice of processing resources)
• Concurrent kernel execution
GigaThread Hardware Thread Scheduler: Concurrent Kernel Execution + Faster Context Switch
[Diagram: timeline contrasting serial kernel execution (Kernels 1 through 5 running one after another) with parallel kernel execution (independent kernels sharing the GPU concurrently)]
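A minimal sketch of concurrent kernel launches using CUDA streams (the kernels are illustrative; the stream calls are standard CUDA runtime API). On Fermi, independent kernels placed in different streams may run in parallel instead of serializing:

    #include <cuda_runtime.h>

    __global__ void kernelA(float *p) { p[threadIdx.x] += 1.0f; }
    __global__ void kernelB(float *p) { p[threadIdx.x] *= 2.0f; }

    int main()
    {
        float *a, *b;
        cudaMalloc(&a, 256 * sizeof(float));
        cudaMalloc(&b, 256 * sizeof(float));

        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        // Independent kernels in different streams: the GigaThread
        // scheduler may execute them concurrently on Fermi.
        kernelA<<<1, 256, 0, s1>>>(a);
        kernelB<<<1, 256, 0, s2>>>(b);
        cudaDeviceSynchronize();

        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        cudaFree(a);
        cudaFree(b);
        return 0;
    }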
GigaThread Streaming Data Transfer (SDT) Engine
• Dual DMA engines
• Simultaneous CPU→GPU and GPU→CPU data transfer
• Fully overlapped with CPU and GPU processing time
[Diagram: activity snapshot of Kernels 0-3 with overlapping CPU, SDT0, GPU, and SDT1 timelines]
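A sketch of how the dual DMA engines are typically exploited (illustrative code, not from the deck; the process kernel and chunk sizes are assumptions): pinned host memory plus cudaMemcpyAsync in alternating streams lets uploads, kernels, and downloads overlap.

    #include <cuda_runtime.h>

    #define CHUNK  4096
    #define CHUNKS 8

    __global__ void process(float *p)        // hypothetical work kernel
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        p[i] = p[i] * 0.5f + 1.0f;
    }

    int main()
    {
        float *h_buf, *d_buf[2];
        cudaStream_t stream[2];

        // Pinned host memory is required for truly asynchronous copies.
        cudaMallocHost(&h_buf, CHUNKS * CHUNK * sizeof(float));
        for (int s = 0; s < 2; ++s) {
            cudaMalloc(&d_buf[s], CHUNK * sizeof(float));
            cudaStreamCreate(&stream[s]);
        }

        // Alternate chunks between two streams: the copy for one chunk can
        // overlap the kernel for the previous one, and the dual DMA engines
        // allow an upload and a download to proceed at the same time.
        for (int k = 0; k < CHUNKS; ++k) {
            int s = k % 2;
            cudaMemcpyAsync(d_buf[s], h_buf + k * CHUNK,
                            CHUNK * sizeof(float),
                            cudaMemcpyHostToDevice, stream[s]);
            process<<<CHUNK / 256, 256, 0, stream[s]>>>(d_buf[s]);
            cudaMemcpyAsync(h_buf + k * CHUNK, d_buf[s],
                            CHUNK * sizeof(float),
                            cudaMemcpyDeviceToHost, stream[s]);
        }
        cudaDeviceSynchronize();
        return 0;
    }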
Cached Memory Hierarchy
• First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory
• Shared memory / L1 cache per SM (64 KB): improves bandwidth and reduces latency
• Unified L2 cache (768 KB): fast, coherent data sharing across all cores in the GPU
• Global memory (up to 6 GB)
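The 64 KB shared-memory/L1 split can be chosen per kernel from the host. A brief sketch using the real runtime call cudaFuncSetCacheConfig (the kernel itself is a placeholder):

    #include <cuda_runtime.h>

    __global__ void stencil_kernel(float *p) { /* placeholder body */ }

    void configure_cache()
    {
        // 48 KB shared / 16 KB L1: for kernels that lean on shared memory.
        cudaFuncSetCacheConfig(stencil_kernel, cudaFuncCachePreferShared);
        // 16 KB shared / 48 KB L1: for kernels that benefit from caching.
        cudaFuncSetCacheConfig(stencil_kernel, cudaFuncCachePreferL1);
    }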
CUDA: Compute Unified Device Architecture
• NVIDIA's parallel computing architecture
• A software development platform targeting the GPU architecture
Thread Hierarchy
• Kernels (simple C functions) are executed by threads
• Threads are grouped into Blocks
• Threads in a Block can synchronize their execution
• Blocks are grouped into a Grid
• Blocks are independent (they must be executable in any order); see the launch sketch below
Source: Presentation from Felipe A. Cruz, Nagasaki University
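A hedged sketch of the hierarchy at launch time (kernel and variable names are illustrative): the host picks the block and grid dimensions, and each block executes independently of the others.

    __global__ void my_kernel(float *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
        data[i] = 2.0f * data[i];
        // __syncthreads() would act as a barrier for the threads of this
        // block only; there is no barrier across blocks inside a kernel.
    }

    void launch_sketch(float *d_data)
    {
        my_kernel<<<1024, 256>>>(d_data);  // 1024 independent blocks, 256 threads each
    }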
Memory and Hardware Hierarchy
• Threads access Registers; CUDA cores execute Threads
• Threads within a Block can share data/results via Shared Memory (sketched below)
• Streaming Multiprocessors (SMs) execute Blocks
• Grids share results through Global Memory (after kernel-wide global synchronization); the GPU executes Grids
Source: Presentation from Felipe A. Cruz, Nagasaki University
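A small sketch of this hierarchy in code (an illustrative block reduction, not from the slides; it assumes a launch with 256 threads per block, a power of two): registers hold per-thread values, __shared__ memory carries intra-block sharing, and global memory receives the per-block result.

    __global__ void block_sum(const float *in, float *out)
    {
        __shared__ float tile[256];     // shared by the threads of this block
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = in[i];      // each thread loads one element
        __syncthreads();                // block-wide barrier

        // Tree reduction in shared memory (blockDim.x must be a power of 2).
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                tile[threadIdx.x] += tile[threadIdx.x + s];
            __syncthreads();
        }

        if (threadIdx.x == 0)
            out[blockIdx.x] = tile[0];  // per-block result to global memory
    }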
IDs and Dimensions
[Diagram: a Device running Grid 1 as a 3x2 arrangement of Blocks; Block (1, 1) expanded into a 5x3 arrangement of Threads]
• Threads: 3D IDs, unique within a block
• Blocks: 2D IDs, unique within a grid
• Dimensions are set at launch time and can be unique for each grid
• Built-in variables: threadIdx, blockIdx, blockDim, gridDim (used in the sketch below)
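A sketch of the built-in variables in use (a hypothetical 2D example): each thread derives its global (x, y) coordinate from blockIdx, blockDim, and threadIdx.

    __global__ void clear_image(float *img, int width, int height)
    {
        // 2D block IDs and 3D thread IDs; two dimensions of each are used here.
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            img[y * width + x] = 0.0f;  // flatten (x, y) to a linear index
    }

    void launch_clear(float *d_img, int width, int height)
    {
        dim3 block(16, 16);   // blockDim: 16x16 threads per block
        dim3 grid((width + 15) / 16, (height + 15) / 16);  // gridDim covers the image
        clear_image<<<grid, block>>>(d_img, width, height);
    }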
Compiling C for CUDA Applications
Example: a serial C application containing a key kernel to be modified into parallel CUDA code:
    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }
• Key kernels are written in C for CUDA; the rest of the C application stays as-is
• NVCC (built on Open64) compiles the kernels into CUDA object files; the host code goes to the CPU compiler, producing CPU object files
• The linker combines both into a single CPU-GPU executable (build commands sketched below)
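Typical build commands for this flow (a sketch; the file names are illustrative): nvcc routes device code through the CUDA toolchain and hands host code to the system compiler before linking.

    # Compile a mixed host/device source file; nvcc splits it internally.
    nvcc -c saxpy.cu -o saxpy.o
    # Link CUDA and plain CPU object files into one CPU-GPU executable.
    nvcc saxpy.o main.o -o app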
C for CUDA: C with a few keywords. Standard C code vs. parallel C code (the slide's code panels did not survive extraction; a reconstruction follows).
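Below is the canonical saxpy pair that NVIDIA decks of this era used to make this point (a hedged reconstruction, consistent with the serial saxpy on the previous slide): the serial loop becomes a kernel, with the loop index replaced by the thread index.

    // Standard C code: serial saxpy
    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }
    // call: saxpy_serial(n, 2.0f, x, y);

    // Parallel C code: the same computation as a CUDA kernel,
    // one thread per element; only a few keywords differ.
    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }
    // call: saxpy_parallel<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);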
Software Programming
[Series of eight slides whose content was graphical. Source: Presentation from Andreas Klöckner, NYU]
CUDA C/C++ Leadership
• CUDA Toolkit 1.0: C compiler, C extensions, single precision, BLAS, FFT, SDK with 40 examples, Win XP 64
• CUDA Toolkit 1.1: atomics support, multi-GPU support
• CUDA Toolkit 2.0: double precision, compiler optimizations, Vista 32/64, Mac OS X, 3D textures, HW interpolation
• CUDA Visual Profiler 2.2; cuda-gdb HW debugger
• CUDA Toolkit 2.3: DP FFT, 16-32 conversion intrinsics, performance enhancements
• Parallel Nsight Beta
• CUDA Toolkit 3.0: C++ inheritance, Fermi architecture support, tools updates, driver/runtime interop