Learn about the powerful capabilities of GPU processing with CUDA applications on the Fermi architecture, including innovative uses in medical imaging, molecular dynamics, video transcoding, and more. Discover the various tools and libraries available to enhance your CUDA programming experience.
Parallel Processing on GPUs with the Fermi Architecture (Processamento Paralelo em GPUs na Arquitetura Fermi)
Arnaldo Tavares, Tesla Sales Manager for Latin America
Quadro or Tesla? Tesla™ vs. Quadro™
GPU Computing: CPU + GPU Co-Processing
• CPU: 4 cores, 48 GigaFlops (DP)
• GPU: 448 cores, 515 GigaFlops (DP) (average efficiency in Linpack: 50%)
50x – 150x Speedups Across Application Domains
• 146X Medical Imaging, U of Utah
• 36X Molecular Dynamics, U of Illinois, Urbana
• 18X Video Transcoding, Elemental Tech
• 50X Matlab Computing, AccelerEyes
• 100X Astrophysics, RIKEN
• 149X Financial Simulation, Oxford
• 47X Linear Algebra, Universidad Jaime
• 20X 3D Ultrasound, Techniscan
• 130X Quantum Chemistry, U of Illinois, Urbana
• 30X Gene Sequencing, U of Maryland
Increasing Number of Professional CUDA Apps (Available Now / Future)
• Tools: CUDA C/C++, PGI Accelerators, Platform LSF Cluster Mgr, TauCUDA Perf Tools, Parallel Nsight Visual Studio IDE, TotalView Debugger, MATLAB, PGI CUDA x86 Tools, PGI CUDA Fortran, CAPS HMPP, Bright Cluster Manager, Allinea DDT Debugger, ParaTools VampirTrace, AccelerEyes Jacket MATLAB, Wolfram Mathematica
• Libraries: NVIDIA NPP Perf Primitives, EMPhotonics CULAPACK, CUDA FFT, CUDA BLAS, Thrust C++ Template Lib, MAGMA (LAPACK), NVIDIA Video Libraries, RNG & SPARSE CUDA Libraries
• Oil & Gas: Headwave Suite, OpenGeoSolutions OpenSEIS, GeoStar Seismic Suite, Acceleware RTM Solver, StoneRidge RTM, Paradigm RTM, Panorama Tech, ffA SVI Pro, VSG Open Inventor, Seismic City RTM, Tsunami RTM, Paradigm SKUA
• Bio-Chemistry: AMBER, NAMD, HOOMD, TeraChem, BigDFT, ABINIT, Acellera ACEMD, DL-POLY, GROMACS, LAMMPS, VMD, GAMESS, CP2K, OpenEye ROCS, PIPER Docking
• Bio-Informatics: MUMmerGPU, CUDA-BLASTP, CUDA-MEME, HEX Protein Docking, CUDA-EC, CUDA SW++ (Smith-Waterman), GPU-HMMER
• CAE: ACUSIM AcuSolve 1.8, Autodesk Moldflow, Prometch Particleworks, Remcom XFdtd 7.0, LSTC LS-DYNA 971, FluiDyna OpenFOAM, ANSYS Mechanical, Metacomp CFD++, MSC.Software Marc 2010.2
(Legend: Announced / Available)
Increasing Number of Professional CUDA Apps (Available Now / Future)
• Video: Adobe Premiere Pro CS5, ARRI Various Apps, GenArts Sapphire, TDVision TDVCodec, Black Magic Da Vinci, The Foundry Kronos, MainConcept CUDA Encoder, Fraunhofer JPEG2000, Cinnafilm Pixel Strings, Assimilate SCRATCH, Elemental Video
• Rendering: Bunkspeed Shot (iray), Refractive SW Octane, Random Control Arion, ILM Plume, Autodesk 3ds Max, Cebas finalRender, Works Zebra Zeany, mental images iray (OEM), NVIDIA OptiX (SDK), Caustic Graphics, Weta Digital PantaRay, Lightworks Artisan, Chaos Group V-Ray GPU
• Finance: NAG RNG, Numerix Risk, SciComp SciFinance, RMS Risk Mgt Solutions, Murex MACS, Aquimin AlphaVision, Hanweck Options Analytics
• EDA: Agilent EMPro 2010, CST Microwave, Agilent ADS SPICE, Acceleware FDTD Solver, Rocketick Verilog Sim, Synopsys TCAD, SPEAG SEMCAD X, Gauda OPC
• Medical: Acceleware EM Solution, Siemens 4D Ultrasound, Digisens, Schrodinger Core Hopping, Useful Progress
• Other: MVTec Machine Vision, MotionDSP Ikena Video, Manifold GIS, Dalsa Machine Vision, Digital Anarchy Photo
(Legend: Announced / Available)
What if Every Supercomputer Had Fermi? (Linpack Teraflops, Top 500 Supercomputers, Nov 2009)
• Top 50: 450 GPUs, 110 TeraFlops, $2.2M
• Top 100: 225 GPUs, 55 TeraFlops, $1.1M
• Top 150: 150 GPUs, 37 TeraFlops, $740K
Hybrid ExaScale Trajectory * This is a projection based on Moore’s law and does not represent a committed roadmap
The March of the GPUs
[Chart: double-precision performance over time, NVIDIA GPU (ECC off) vs. x86 CPU]
Workstation / Data Center Solutions
• Integrated CPU-GPU Server: 2x Tesla M2050/70 GPUs in 1U
• OEM CPU Server + Tesla S2050/70: 4 Tesla GPUs in 2U
• Workstations: up to 4x Tesla C2050/70 GPUs
How is the GPU Used?
• Basic component: the "Streaming Multiprocessor" (SM)
• SIMD: "Single Instruction, Multiple Data": all cores execute the same instruction, each operating on different data (see the sketch below)
• "SIMD at the SM, MIMD at the GPU chip"
Source: Presentation from Felipe A. Cruz, Nagasaki University
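A minimal sketch of this execution model (not from the original slides; the kernel and variable names are illustrative): every thread runs the same instruction stream, but each indexes its own data element.

    __global__ void scale(float *data, float factor, int n)
    {
        // Same instruction for all cores: each thread applies the
        // identical operation to a different element of the array.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                  // threads past the end simply idle
            data[i] *= factor;
    }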
The Use of GPUs and Bottleneck Analysis Source: Presentation from Takayuki Aoki, Tokyo Institute of Technology
The Fermi Architecture
• 3 billion transistors
• 16 Streaming Multiprocessors (SMs)
• 6 x 64-bit memory partitions = 384-bit memory interface
• Host Interface: connects the GPU to the CPU via PCI-Express
• GigaThread global scheduler: distributes thread blocks to the SM thread schedulers
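As a hedged aside (not part of the original deck), these properties are visible from host code through the standard CUDA runtime call cudaGetDeviceProperties; on a full Fermi part it reports 16 multiprocessors.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0
        printf("SMs: %d\n", prop.multiProcessorCount);      // 16 on full Fermi
        printf("Global memory: %zu bytes\n", prop.totalGlobalMem);
        return 0;
    }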
SM Architecture
• 32 CUDA cores per SM (512 total)
• 16 Load/Store units: source and destination addresses calculated for 16 threads per clock
• 4 Special Function Units (sine, cosine, square root, etc.)
• 64 KB of RAM for shared memory and L1 cache (configurable)
• Dual Warp Scheduler
[Diagram: instruction cache; dual scheduler/dispatch; register file; 32 cores; 16 load/store units; 4 SFUs; interconnect network; 64K configurable cache/shared memory; uniform cache]
Dual Warp Scheduler
• 1 warp = 32 parallel threads
• 2 warps issued and executed concurrently
• Each warp is dispatched to 16 CUDA cores
• Most instructions can be dual-issued (exception: double-precision instructions)
• The dual-issue model allows near-peak hardware performance
CUDA Core Architecture
• Implements the new IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs
• Newly designed integer ALU optimized for 64-bit and extended-precision operations
• Fused multiply-add (FMA) instruction for both 32-bit single and 64-bit double precision (see the sketch below)
[Diagram: CUDA core with dispatch port, operand collector, FP unit, INT unit, and result queue]
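A small sketch of FMA in device code (an assumed example, not from the slides): fmaf(a, b, c) computes a*b + c with a single rounding step, which on Fermi maps onto the FMA hardware described above.

    __global__ void saxpy_fma(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            // One fused operation instead of a multiply followed by an
            // add: no intermediate rounding between the two steps.
            y[i] = fmaf(a, x[i], y[i]);
    }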
GigaThread™ Hardware Thread Scheduler (HTS)
• Hierarchically manages thousands of simultaneously active threads
• 10x faster application context switching (each program receives a time slice of processing resources)
• Concurrent kernel execution
GigaThread Hardware Thread Scheduler: Concurrent Kernel Execution + Faster Context Switch
[Diagram: timeline contrasting serial kernel execution (Kernels 1 through 5 running one after another) with parallel kernel execution (independent kernels sharing the GPU concurrently)]
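A minimal sketch of concurrent kernel launches using CUDA streams (the kernels are illustrative; the stream calls are standard CUDA runtime API). On Fermi, independent kernels placed in different streams may run in parallel instead of serializing:

    #include <cuda_runtime.h>

    __global__ void kernelA(float *p) { p[threadIdx.x] += 1.0f; }
    __global__ void kernelB(float *p) { p[threadIdx.x] *= 2.0f; }

    int main()
    {
        float *a, *b;
        cudaMalloc(&a, 256 * sizeof(float));
        cudaMalloc(&b, 256 * sizeof(float));

        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        // Independent kernels in different streams: the GigaThread
        // scheduler may execute them concurrently on Fermi.
        kernelA<<<1, 256, 0, s1>>>(a);
        kernelB<<<1, 256, 0, s2>>>(b);
        cudaDeviceSynchronize();

        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        cudaFree(a);
        cudaFree(b);
        return 0;
    }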
GigaThread Streaming Data Transfer (SDT) Engine
• Dual DMA engines
• Simultaneous CPU→GPU and GPU→CPU data transfer
• Fully overlapped with CPU and GPU processing time
[Diagram: activity snapshot of Kernels 0-3 with overlapping CPU, SDT0, GPU, and SDT1 timelines]
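A sketch of how the dual DMA engines are typically exploited (illustrative code, not from the deck; the process kernel and chunk sizes are assumptions): pinned host memory plus cudaMemcpyAsync in alternating streams lets uploads, kernels, and downloads overlap.

    #include <cuda_runtime.h>

    #define CHUNK  4096
    #define CHUNKS 8

    __global__ void process(float *p)        // hypothetical work kernel
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        p[i] = p[i] * 0.5f + 1.0f;
    }

    int main()
    {
        float *h_buf, *d_buf[2];
        cudaStream_t stream[2];

        // Pinned host memory is required for truly asynchronous copies.
        cudaMallocHost(&h_buf, CHUNKS * CHUNK * sizeof(float));
        for (int s = 0; s < 2; ++s) {
            cudaMalloc(&d_buf[s], CHUNK * sizeof(float));
            cudaStreamCreate(&stream[s]);
        }

        // Alternate chunks between two streams: the copy for one chunk can
        // overlap the kernel for the previous one, and the dual DMA engines
        // allow an upload and a download to proceed at the same time.
        for (int k = 0; k < CHUNKS; ++k) {
            int s = k % 2;
            cudaMemcpyAsync(d_buf[s], h_buf + k * CHUNK,
                            CHUNK * sizeof(float),
                            cudaMemcpyHostToDevice, stream[s]);
            process<<<CHUNK / 256, 256, 0, stream[s]>>>(d_buf[s]);
            cudaMemcpyAsync(h_buf + k * CHUNK, d_buf[s],
                            CHUNK * sizeof(float),
                            cudaMemcpyDeviceToHost, stream[s]);
        }
        cudaDeviceSynchronize();
        return 0;
    }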
Cached Memory Hierarchy
• First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory
• Shared memory / L1 cache per SM (64 KB): improves bandwidth and reduces latency
• Unified L2 cache (768 KB): fast, coherent data sharing across all cores in the GPU
• Global memory (up to 6 GB)
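The 64 KB shared-memory/L1 split can be chosen per kernel from the host. A brief sketch using the real runtime call cudaFuncSetCacheConfig (the kernel itself is a placeholder):

    #include <cuda_runtime.h>

    __global__ void stencil_kernel(float *p) { /* placeholder body */ }

    void configure_cache()
    {
        // 48 KB shared / 16 KB L1: for kernels that lean on shared memory.
        cudaFuncSetCacheConfig(stencil_kernel, cudaFuncCachePreferShared);
        // 16 KB shared / 48 KB L1: for kernels that benefit from caching.
        cudaFuncSetCacheConfig(stencil_kernel, cudaFuncCachePreferL1);
    }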
CUDA: Compute Unified Device Architecture
• NVIDIA's parallel computing architecture
• A software development platform targeting the GPU architecture
Thread Hierarchy
• Kernels (simple C functions) are executed by threads
• Threads are grouped into Blocks
• Threads in a Block can synchronize their execution
• Blocks are grouped into a Grid
• Blocks are independent (they must be executable in any order); see the launch sketch below
Source: Presentation from Felipe A. Cruz, Nagasaki University
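A hedged sketch of the hierarchy at launch time (kernel and variable names are illustrative): the host picks the block and grid dimensions, and each block executes independently of the others.

    __global__ void my_kernel(float *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
        data[i] = 2.0f * data[i];
        // __syncthreads() would act as a barrier for the threads of this
        // block only; there is no barrier across blocks inside a kernel.
    }

    void launch_sketch(float *d_data)
    {
        my_kernel<<<1024, 256>>>(d_data);  // 1024 independent blocks, 256 threads each
    }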
Memory and Hardware Hierarchy
• Threads access Registers; CUDA cores execute Threads
• Threads within a Block can share data/results via Shared Memory (sketched below)
• Streaming Multiprocessors (SMs) execute Blocks
• Grids share results through Global Memory (after kernel-wide global synchronization); the GPU executes Grids
Source: Presentation from Felipe A. Cruz, Nagasaki University
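A small sketch of this hierarchy in code (an illustrative block reduction, not from the slides; it assumes a launch with 256 threads per block, a power of two): registers hold per-thread values, __shared__ memory carries intra-block sharing, and global memory receives the per-block result.

    __global__ void block_sum(const float *in, float *out)
    {
        __shared__ float tile[256];     // shared by the threads of this block
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = in[i];      // each thread loads one element
        __syncthreads();                // block-wide barrier

        // Tree reduction in shared memory (blockDim.x must be a power of 2).
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                tile[threadIdx.x] += tile[threadIdx.x + s];
            __syncthreads();
        }

        if (threadIdx.x == 0)
            out[blockIdx.x] = tile[0];  // per-block result to global memory
    }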
IDs and Dimensions
[Diagram: a Device running Grid 1 as a 3x2 arrangement of Blocks; Block (1, 1) expanded into a 5x3 arrangement of Threads]
• Threads: 3D IDs, unique within a block
• Blocks: 2D IDs, unique within a grid
• Dimensions are set at launch time and can be unique for each grid
• Built-in variables: threadIdx, blockIdx, blockDim, gridDim (used in the sketch below)
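A sketch of the built-in variables in use (a hypothetical 2D example): each thread derives its global (x, y) coordinate from blockIdx, blockDim, and threadIdx.

    __global__ void clear_image(float *img, int width, int height)
    {
        // 2D block IDs and 3D thread IDs; two dimensions of each are used here.
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            img[y * width + x] = 0.0f;  // flatten (x, y) to a linear index
    }

    void launch_clear(float *d_img, int width, int height)
    {
        dim3 block(16, 16);   // blockDim: 16x16 threads per block
        dim3 grid((width + 15) / 16, (height + 15) / 16);  // gridDim covers the image
        clear_image<<<grid, block>>>(d_img, width, height);
    }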
Compiling C for CUDA Applications
Example: a serial C application containing a key kernel to be modified into parallel CUDA code:
    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }
• Key kernels are written in C for CUDA; the rest of the C application stays as-is
• NVCC (built on Open64) compiles the kernels into CUDA object files; the host code goes to the CPU compiler, producing CPU object files
• The linker combines both into a single CPU-GPU executable (build commands sketched below)
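Typical build commands for this flow (a sketch; the file names are illustrative): nvcc routes device code through the CUDA toolchain and hands host code to the system compiler before linking.

    # Compile a mixed host/device source file; nvcc splits it internally.
    nvcc -c saxpy.cu -o saxpy.o
    # Link CUDA and plain CPU object files into one CPU-GPU executable.
    nvcc saxpy.o main.o -o app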
C for CUDA: C with a few keywords. Standard C code vs. parallel C code (the slide's code panels did not survive extraction; a reconstruction follows).
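Below is the canonical saxpy pair that NVIDIA decks of this era used to make this point (a hedged reconstruction, consistent with the serial saxpy on the previous slide): the serial loop becomes a kernel, with the loop index replaced by the thread index.

    // Standard C code: serial saxpy
    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }
    // call: saxpy_serial(n, 2.0f, x, y);

    // Parallel C code: the same computation as a CUDA kernel,
    // one thread per element; only a few keywords differ.
    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }
    // call: saxpy_parallel<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);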
Software Programming
[Series of eight slides whose content was graphical. Source: Presentation from Andreas Klöckner, NYU]
CUDA C/C++ Leadership
• CUDA Toolkit 1.0: C compiler, C extensions, single precision, BLAS, FFT, SDK with 40 examples, Win XP 64
• CUDA Toolkit 1.1: atomics support, multi-GPU support
• CUDA Toolkit 2.0: double precision, compiler optimizations, Vista 32/64, Mac OS X, 3D textures, HW interpolation
• CUDA Visual Profiler 2.2; cuda-gdb HW debugger
• CUDA Toolkit 2.3: DP FFT, 16-32 conversion intrinsics, performance enhancements
• Parallel Nsight Beta
• CUDA Toolkit 3.0: C++ inheritance, Fermi architecture support, tools updates, driver/runtime interop