GPU Technology Conference 2014 Keynote

[Chart: TeraFLOPS, GPU vs. CPU, y-axis 0 to 5, 2003 to 2013]

GTC: GROWING AND EXPANDING. #1 topic and fastest growing topics: HPC / Supercomputing, Energy Exploration, Life Science & Genomics, Molecular Dynamics, Big Data Analytics, Machine Learning, Computer Vision. [Chart: 2010, 2012, 2014; data labels 729, 429, 397]

FOSTERING THE GPU ECOSYSTEM Big Data / Cloud / Computer Vision AudioStreamTV 2012 2013 2014

CUDA EVERYWHERE

“Large-scale CFD Applications and a Full GPU Implementation of a Weather Prediction Code on the TSUBAME Supercomputer”, Takayuki Aoki, Global Scientific Information and Computing Center, Tokyo Institute of Technology

PCIe BANDWIDTH BOTTLENECKS: PCI Express (between CPU and GPU) 16 GB/sec; CPU Memory 60 GB/sec; GPU Memory 288 GB/sec
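To put the three figures on this slide side by side, here is a minimal Python sketch. It computes how long a hypothetical 1 GB buffer would take to cross each path at the quoted peak rates. The payload size and the variable names are illustrative assumptions, not from the slide, and sustained throughput in practice is lower than these peaks.

```python
# Peak bandwidths quoted on the slide, in GB/sec.
bandwidth_gb_per_s = {
    "PCI Express": 16.0,
    "CPU Memory": 60.0,
    "GPU Memory": 288.0,
}

# Hypothetical payload size (an assumption for illustration only).
payload_gb = 1.0

# Idealized transfer time in milliseconds at each quoted peak rate.
transfer_ms = {
    name: payload_gb / bw * 1000.0 for name, bw in bandwidth_gb_per_s.items()
}

# Ratio of GPU memory bandwidth to PCI Express bandwidth.
gpu_vs_pcie = bandwidth_gb_per_s["GPU Memory"] / bandwidth_gb_per_s["PCI Express"]

for name, ms in transfer_ms.items():
    print(f"{name}: {ms:.2f} ms per GB")
print(f"GPU memory bandwidth is {gpu_vs_pcie:.0f}x PCI Express")
```

The ratio shows why the slide calls PCIe the bottleneck: data already resident in GPU memory can be read far faster than it can be delivered over the bus.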

INTRODUCING NVLINK: Differential with embedded clock; PCIe programming model (w/ DMA+); Unified Memory; Cache coherency in Gen 2.0; 5 to 12X PCIe. [Diagram labels: GPU, CPU, PCIe]

5X More Bandwidth for Multi-GPU Scaling. [Diagram labels: CPU, PCIe SWITCH, four GPUs]

3D MEMORY: 3D Chip-on-Wafer integration; Many X bandwidth; 2.5X capacity; 4X energy efficiency. [Chart: Memory Bandwidth, y-axis 0 to 1200, 2008 to 2016]

Blaise Pascal (1623-1662): Mechanical Calculator, Probability Theory, Pascal’s Theorem, Pascal’s Law

PASCAL: NVLink, 5 to 12X PCIe 3.0; 3D Memory, 2 to 4X memory BW & size; Module, 1/3 size of PCIe card

GPU ROADMAP: Tesla (CUDA); Fermi (FP64); Kepler (Dynamic Parallelism); Maxwell (DX12); Pascal (Unified Memory, 3D Memory, NVLink). [Chart: SGEMM / W Normalized, y-axis 0 to 20, 2008 to 2016]

MACHINE LEARNING: Branch of Artificial Intelligence; Computers that learn from data. [Image labels: person, car, bird, helmet, frog, motorcycle, person, person, hammer, dog, flower pot, chair, power drill]

Machine Learning using Deep Neural Networks. [Diagram labels: Input, Result]

“Building High-level Features Using Large Scale Unsupervised Learning”, Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, A. Ng, Stanford / Google: 1 billion connections; 10 million 200x200 pixel images; 1,000 machines (16,000 cores); 3 days

GOOGLE BRAIN: 600 kWatts; $5,000,000; 1,000 CPU Servers; 2,000 CPUs • 16,000 cores
Today’s Largest Networks: 1B connections; 10M images; ~3 days; ~30 ExaFLOPS
Human Brain: ~100B neurons x 1000 connections; 500M images; 5,000,000X “Google Brain”; ~150 YottaFLOPS; ~40,000 “Google Brain-Years”
SOURCE: Ian Goodfellow
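The 5,000,000X, ~150 YottaFLOPS and ~40,000 “Google Brain-Years” figures can be reproduced from the other numbers on the slide. The Python sketch below is one reading of that arithmetic, not a method stated in the keynote. It assumes that the work scales linearly with both the number of connections and the number of images, that the “ExaFLOPS” and “YottaFLOPS” figures are total operation counts rather than rates, and that a year is 365 days. All variable names are illustrative.

```python
# Figures from the slide (read here as totals, not rates).
gb_connections = 10**9           # 1B connections
gb_images = 10 * 10**6           # 10M images
gb_train_days = 3                # ~3 days
gb_total_flop = 30 * 10**18      # ~30 ExaFLOPS

brain_connections = 100 * 10**9 * 1000   # ~100B neurons x 1000 connections
brain_images = 500 * 10**6               # 500M images

# Assumption: work grows linearly with connections and with images.
scale = (brain_connections // gb_connections) * (brain_images // gb_images)

# Scale the total operation count and the training time by the same factor.
brain_total_flop = gb_total_flop * scale
brain_years = gb_train_days * scale / 365   # assumption: 365-day year

print(f"scale factor: {scale:,}X")
print(f"total work: {brain_total_flop // 10**24} Yotta (10^24) operations")
print(f"training time: about {brain_years:,.0f} Google Brain-years")
```

Under these assumptions the scale factor and the total work match the slide exactly, and the training time comes out slightly above the rounded ~40,000 figure.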

“Deep Learning with COTS HPC Systems”, A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, B. Catanzaro, Stanford / NVIDIA • ICML 2013
GOOGLE BRAIN: 600 kWatts; $5,000,000; 1,000 CPU Servers; 2,000 CPUs • 16,000 cores
STANFORD AI LAB: 4 kWatts; $33,000; 3 GPU-Accelerated Servers; 12 GPUs • 18,432 cores
“Now You Can Build Google’s $1M Artificial Brain on the Cheap” (Wired)

DEMO: MACHINE LEARNING, SIMPLE TRAINING SET

DEMO: MACHINE LEARNING, NYU OVERFEAT: 1.2M image training set; 1000 classes; 2 weeks of training; 7 GPUs; 25 EXAFLOPS total to train

CUDA for MACHINE LEARNING. [Table columns: Early Adopters, Talks @ GTC, Use Cases] Use cases: Image Detection, Face Recognition, Image Analytics for Creative Cloud, Speech/Image Recognition, Gesture Recognition, Video Search & Analytics, Speech Recognition & Translation, Image Classification, Hadoop, Recommendation Engines, Indexing & Search, Search Rankings, Recommendation

Big Data & Infinite Compute Turbocharge Deep Learning
[Chart: 800M photos uploaded per day; Millions; Facebook, Instagram, Snapchat, Flickr; y-axis 0 to 900; 2007 to 2014]
[Chart: 100 hours of video uploaded per minute; Hours (YouTube); y-axis 0 to 120; 2007 to 2013]
[Chart: Unstructured data exploding; Exabytes of data; y-axis 0 to 6,000; 2010 and 2015; data labels 1,104 and 5,379]
SOURCE: KPCB/Mary Meeker, company data. Unstructured data: IDC's Digital Universe Study.

DEMO: TITAN Z REVEAL

5,760 CUDA cores; 12GB memory; 8 TeraFLOPS; $2999

GOOGLE BRAIN / STANFORD AI LAB: 300X energy efficiency; 400X lower cost; Fits next to a desk
GOOGLE BRAIN: 600 kWatts; $5,000,000; 1,000 CPU Servers; 2,000 CPUs • 16,000 cores
1 Titan Z-Accelerated Server: 2 kWatts; $12,000; 3 Titan Zs • 17,280 cores
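The headline ratios on this slide follow from the power, cost and core figures quoted in the deck. The Python sketch below is a quick check. It assumes both systems do the same job in the same time, so that the power ratio can stand in for the energy efficiency ratio. The dictionary and variable names are illustrative.

```python
# Figures quoted in the deck.
google_brain = {"kwatts": 600, "cost_usd": 5_000_000, "cores": 16_000}
titan_z_server = {"kwatts": 2, "cost_usd": 12_000, "titan_zs": 3}
cores_per_titan_z = 5_760   # from the Titan Z specification slide

# Assumption: equal work in equal time, so power ratio ~ energy efficiency ratio.
power_ratio = google_brain["kwatts"] / titan_z_server["kwatts"]
cost_ratio = google_brain["cost_usd"] / titan_z_server["cost_usd"]
total_cores = titan_z_server["titan_zs"] * cores_per_titan_z

print(f"power ratio: {power_ratio:.0f}X")
print(f"cost ratio: {cost_ratio:.0f}X")
print(f"CUDA cores in one server: {total_cores:,}")
```

The power ratio matches the 300X claim and the core count matches 17,280. The cost ratio comes out slightly above 400, which the slide rounds down.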

First CGI Film Nominated for an Academy Award®: RenderMan with programmable shading; 1.5 hours to render each frame; CCI 6/32 minicomputer

2013 Academy Award® Winner, BEST VISUAL EFFECTS: State-of-the-art water simulator; 48 hours to simulate the base water; 250 hours to render each frame

DEMO: WHALE

DEMO: FLEX

DEMO: FLAMEWORKS

DEMO: UE4

One is a photo, One is Iray…

IRAY VCA: SCALABLE GPU RENDERING APPLIANCE. 8 Kepler-class GPUs; 23,040 CUDA cores; GPU memory: 12GB per GPU; Network: 2 x 1GigE, 2 x 10GigE, 1 x InfiniBand. [Logos: Catia, 3ds Max, Bunkspeed, Maya]

DEMO: IRAY / HONDA

IRAY VCA: SCALABLE GPU RENDERING APPLIANCE. Iray VCA MSRP $50,000. [Chart: Relative Performance, axis 0 to 80; bars: CPU-only Workstation, Quadro K5000 Workstation, Iray VCA] [Logos: Catia, 3ds Max, Bunkspeed, Maya]

GRID GPU in the Cloud

Ben Fathi, Chief Technology Officer. Horizon DaaS Platform

Mobile CUDA

“10 of the Top 10” Greenest Supercomputers Powered by CUDA GPUs

TEGRA K1 Mobile Super Chip: Unify GPU and Tegra Architecture; 192 fully programmable CUDA cores; 326 GFLOPS; 4X energy efficiency over A15. [Diagram: GPU ARCHITECTURE (Tesla, Fermi, Kepler, Maxwell) and MOBILE ARCHITECTURE (Tegra 3, Tegra 4, Tegra K1)]

Computer Vision on CUDA: Feature Detection / Tracking, ~30 GFLOPS @ 30 Hz; Object Recognition / Tracking, ~180 GFLOPS @ 30 Hz; 3D Scene Interpretation, ~280 GFLOPS @ 30 Hz
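One way to read these figures is as a per-frame compute budget. The Python sketch below divides each quoted rate by the 30 Hz frame rate and compares it with the 326 GFLOPS peak quoted for Tegra K1 elsewhere in the deck. It assumes the GFLOPS figures are the sustained rates needed to keep up with 30 frames per second, which is an interpretation rather than something the slide states. The names are illustrative.

```python
frame_rate_hz = 30

# Compute rates quoted on the slide, in GFLOPS.
workloads_gflops = {
    "Feature Detection / Tracking": 30,
    "Object Recognition / Tracking": 180,
    "3D Scene Interpretation": 280,
}

tegra_k1_peak_gflops = 326   # peak figure quoted for Tegra K1 in the deck

# Work available per frame (in GFLOP) and share of the Tegra K1 peak.
gflop_per_frame = {n: g / frame_rate_hz for n, g in workloads_gflops.items()}
share_of_peak = {n: g / tegra_k1_peak_gflops for n, g in workloads_gflops.items()}

for name in workloads_gflops:
    print(f"{name}: {gflop_per_frame[name]:.1f} GFLOP per frame, "
          f"{share_of_peak[name]:.0%} of Tegra K1 peak")
```

Under this reading, all three workloads fit within the quoted peak, with 3D scene interpretation using most of it.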

JETSON TK1: 1st MOBILE SUPERCOMPUTER FOR EMBEDDED SYSTEMS. 192 CUDA cores; 326 GFLOPS; VisionWorks SDK; $192

VISIONWORKS: COMPUTER VISION ON CUDA. Driver Assistance; Computational Photography; Augmented Reality; Robotics. [Stack diagram: Your Code; Sample Pipelines (Object Detection / Tracking, Structure from Motion, …); VisionWorks Primitives (Classifier, Corner Detection, …); CUDA; Jetson TK1]

TEGRA ROADMAP: Tegra 2; Tegra 3; Tegra 4; Tegra K1 (Kepler GPU, CUDA, 64b & 32b CPU); Erista (Maxwell GPU). [Chart: Single Precision GFLOPS / W Normalized, y-axis 0 to 80, 2011 to 2015]

Andreas Reich, Head of Audi Pre-Development

VIDEO: AUDI ADAS
