NVIDIA CEO Jen-Hsun Huang introduces NVLink and shares the GPU roadmap. Other main topics include the introduction of the GeForce GTX Titan Z, CUDA for machine learning, and the Iray VCA.
[Chart: GPU vs. CPU, TeraFLOPS (0 to 5), 2003-2013]
GTC: GROWING AND EXPANDING. #1 TOPIC and FASTEST GROWING TOPICS: HPC / Supercomputing; Energy Exploration; Life Science & Genomics; Molecular Dynamics; Big Data Analytics; Machine Learning; Computer Vision. [Chart: 2010, 2012, 2014; labeled values 729, 429, 397]
FOSTERING THE GPU ECOSYSTEM: Big Data / Cloud / Computer Vision; AudioStreamTV. [Timeline: 2012, 2013, 2014]
“Large-scale CFD Applications and a Full GPU Implementation of a Weather Prediction Code on the TSUBAME Supercomputer” Takayuki Aoki, Global Scientific Information and Computing Center, Tokyo Institute of Technology
PCIe BANDWIDTH BOTTLENECKS: PCI Express (between CPU and GPU) 16 GB/sec; CPU Memory 60 GB/sec; GPU Memory 288 GB/sec
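To make the bottleneck concrete, here is a minimal sketch that turns the three bandwidth figures on this slide into ideal transfer times. The 12 GB working-set size and the function name are assumptions chosen only for illustration, and the model ignores latency and protocol overhead.

```python
# Bandwidth figures taken from the slide, in GB/sec.
BANDWIDTH_GB_PER_SEC = {
    "PCI Express": 16,
    "CPU Memory": 60,
    "GPU Memory": 288,
}


def seconds_to_move(gigabytes, link):
    """Ideal time to move `gigabytes` over `link` (no latency, no overhead)."""
    return gigabytes / BANDWIDTH_GB_PER_SEC[link]


# Hypothetical working set: 12 GB is an assumed size, not a figure from the talk.
working_set_gb = 12
for link in BANDWIDTH_GB_PER_SEC:
    print(f"{link}: {seconds_to_move(working_set_gb, link):.3f} s")

# Ratio between the GPU's own memory and the PCIe link that feeds it.
gpu_vs_pcie = BANDWIDTH_GB_PER_SEC["GPU Memory"] / BANDWIDTH_GB_PER_SEC["PCI Express"]
print(f"GPU memory is {gpu_vs_pcie:.0f}x faster than PCI Express")
```

Under these figures the GPU can read its own memory 18 times faster than data can arrive over PCI Express, which is the gap the slide calls a bottleneck.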
INTRODUCING NVLINK: Differential with embedded clock; PCIe programming model (w/ DMA+); Unified Memory; Cache coherency in Gen 2.0; 5 to 12X PCIe. [Diagram labels: GPU, CPU, PCIe]
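As a rough reading of "5 to 12X PCIe", the sketch below multiplies the 16 GB/sec PCI Express figure from the bandwidth-bottleneck slide by the stated range. Using that figure as the baseline is an assumption; the slide itself gives only the multiplier, so the result is an implied range rather than a stated specification.

```python
PCIE_GB_PER_SEC = 16                # PCI Express figure from the bottleneck slide
NVLINK_MULTIPLIER_RANGE = (5, 12)   # "5 to 12X PCIe"


def implied_nvlink_range(pcie_gb_per_sec, multipliers):
    """Scale the PCIe baseline by the low and high multipliers."""
    low_mult, high_mult = multipliers
    return pcie_gb_per_sec * low_mult, pcie_gb_per_sec * high_mult


low, high = implied_nvlink_range(PCIE_GB_PER_SEC, NVLINK_MULTIPLIER_RANGE)
print(f"Implied NVLink bandwidth: {low} to {high} GB/sec")
```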
5X More Bandwidth for Multi-GPU Scaling. [Diagram labels: CPU, PCIe SWITCH, four GPUs]
3D MEMORY: 3D Chip-on-Wafer integration; Many X bandwidth; 2.5X capacity; 4X energy efficiency. [Chart: Memory Bandwidth, 0 to 1200, 2008-2016]
Blaise Pascal (1623-1662): Mechanical Calculator; Probability Theory; Pascal’s Theorem; Pascal’s Law
PASCAL: NVLink, 5 to 12X PCIe 3.0; 3D Memory, 2 to 4X memory BW & size; Module, 1/3 size of PCIe card
GPU ROADMAP [Chart: SGEMM / W Normalized, 0 to 20, 2008-2016]: Tesla (CUDA); Fermi (FP64); Kepler (Dynamic Parallelism); Maxwell (DX12); Pascal (Unified Memory, 3D Memory, NVLink)
MACHINE LEARNING: Branch of Artificial Intelligence; Computers that learn from data. [Image labels: person, car, bird, helmet, frog, motorcycle, hammer, dog, flower pot, chair, power drill]
Machine Learning using Deep Neural Networks. [Diagram labels: Input, Result]
“Building High-level Features Using Large Scale Unsupervised Learning” Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, A. Ng (Stanford / Google): 1 billion connections; 10 million 200x200 pixel images; 1,000 machines (16,000 cores); 3 days
GOOGLE BRAIN: 1,000 CPU Servers; 2,000 CPUs • 16,000 cores; 600 kWatts; $5,000,000. Today’s Largest Networks: 1B connections; 10M images; ~3 days; ~30 ExaFLOPS. Human Brain: ~100B neurons x 1000 connections; 500M images; 5,000,000X “Google Brain”; ~150 YottaFLOPS; ~40,000 “Google Brain-Years”. SOURCE: Ian Goodfellow
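The human-brain figures on this slide appear to follow from scaling the Google Brain numbers by the ratio of connections times the ratio of images. The sketch below reproduces that arithmetic; treating "ExaFLOPS" and "YottaFLOPS" as total operation counts rather than rates, and assuming training cost scales linearly with both connections and images, are readings of the slide rather than statements from it.

```python
# Figures from the slide (integers keep the arithmetic exact).
GOOGLE_BRAIN_CONNECTIONS = 10**9          # 1B connections
GOOGLE_BRAIN_IMAGES = 10 * 10**6          # 10M images
GOOGLE_BRAIN_DAYS = 3                     # ~3 days
GOOGLE_BRAIN_TOTAL_OPS = 30 * 10**18      # ~30 Exa, read as total operations

HUMAN_NEURONS = 100 * 10**9               # ~100B neurons
CONNECTIONS_PER_NEURON = 1000
HUMAN_IMAGES = 500 * 10**6                # 500M images

human_connections = HUMAN_NEURONS * CONNECTIONS_PER_NEURON

# Assumed linear scaling in both network size and training-set size.
scale = (human_connections // GOOGLE_BRAIN_CONNECTIONS) * (
    HUMAN_IMAGES // GOOGLE_BRAIN_IMAGES
)
total_ops = GOOGLE_BRAIN_TOTAL_OPS * scale
brain_years = GOOGLE_BRAIN_DAYS * scale / 365

print(f"Scale factor: {scale:,}X")
print(f"Total operations: {total_ops // 10**24} Yotta")
print(f"Google Brain-years: {brain_years:,.0f}")
```

The scale factor comes out to 5,000,000, the total to 150 Yotta, and the training time to roughly 41,000 years, consistent with the slide's rounded "~40,000".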
“Deep Learning with COTS HPC Systems” A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, B. Catanzaro (Stanford / NVIDIA • ICML 2013). “Now You Can Build Google’s $1M Artificial Brain on the Cheap” -Wired. GOOGLE BRAIN: 1,000 CPU Servers; 2,000 CPUs • 16,000 cores; 600 kWatts; $5,000,000. STANFORD AI LAB: 3 GPU-Accelerated Servers; 12 GPUs • 18,432 cores; 4 kWatts; $33,000
DEMO: MACHINE LEARNING, NYU OVERFEAT: 1.2M image training set; 1000 classes; weeks of training / GPUs / EXAFLOPS total to train: 2 / 7 / 25
CUDA for MACHINE LEARNING (columns: Early Adopters, Talks @ GTC, Use Cases). Use Cases: Image Detection; Face Recognition; Image Analytics for Creative Cloud; Speech/Image Recognition; Gesture Recognition; Video Search & Analytics; Speech Recognition & Translation; Image Classification; Hadoop; Recommendation Engines; Indexing & Search; Search Rankings; Recommendation
Big Data & Infinite Compute Turbocharge Deep Learning: 800M photos uploaded per day; 100 hours of video uploaded per minute; Unstructured data exploding. [Charts: Millions (Facebook, Instagram, Snapchat, Flickr), 2007-2014; Hours (YouTube), 2007-2013; Exabytes of data, 2010 and 2015, with labeled values 1,104 and 5,379] SOURCE: KPCB/Mary Meeker, company data. Unstructured data: IDC's Digital Universe Study.
GeForce GTX Titan Z: 5,760 CUDA cores; 12GB memory; 8 TeraFLOPS; $2999
GOOGLE BRAIN: 1,000 CPU Servers; 2,000 CPUs • 16,000 cores; 600 kWatts; $5,000,000. STANFORD AI LAB: 1 Titan Z-Accelerated Server; 3 Titan Zs • 17,280 cores; 2 kWatts; $12,000. 300X energy efficiency; 400X lower cost; Fits next to a desk
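The "300X" and "400X" claims can be checked against the power and cost figures on this slide and on the earlier Stanford AI Lab slide. This is a minimal sketch: the dictionary layout and function name are illustrative, and reading "energy efficiency" as a simple power ratio assumes both systems do the same work in the same time, which the slides do not state.

```python
# Power, cost, and core counts as quoted on the slides.
SYSTEMS = {
    "Google Brain (1,000 CPU servers)": {"kwatts": 600, "cost_usd": 5_000_000, "cores": 16_000},
    "Stanford AI Lab (3 GPU servers)": {"kwatts": 4, "cost_usd": 33_000, "cores": 18_432},
    "1 Titan Z server (3 Titan Zs)": {"kwatts": 2, "cost_usd": 12_000, "cores": 17_280},
}
BASELINE = "Google Brain (1,000 CPU servers)"


def ratios_vs_baseline(name):
    """Return (power ratio, cost ratio) of the baseline relative to `name`."""
    base, other = SYSTEMS[BASELINE], SYSTEMS[name]
    return base["kwatts"] / other["kwatts"], base["cost_usd"] / other["cost_usd"]


for name in SYSTEMS:
    power_ratio, cost_ratio = ratios_vs_baseline(name)
    print(f"{name}: {power_ratio:.0f}X less power, {cost_ratio:.0f}X lower cost")

titan_power_ratio, titan_cost_ratio = ratios_vs_baseline("1 Titan Z server (3 Titan Zs)")

# Cross-check the core count: 3 cards at 5,760 CUDA cores each.
titan_cores = 3 * 5_760
```

The power ratio is exactly 300, and the cost ratio is about 417, which the slide rounds down to 400X. Three Titan Zs at 5,760 cores each give the quoted 17,280 cores.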
First CGI Film Nominated for an Academy Award®: RenderMan with programmable shading; 1.5 hours to render each frame; CCI 6/32 minicomputer
2013 Academy Award® Winner, BEST VISUAL EFFECTS: State-of-the-art water simulator; 48 hours to simulate the base water; 250 hours to render each frame
One is a photo, One is Iray…
IRAY VCA, SCALABLE GPU RENDERING APPLIANCE: 8 Kepler-class GPUs; 23,040 CUDA cores; GPU memory: 12GB per GPU; Network: 2 x 1GigE, 2 x 10GigE, 1 x InfiniBand. Applications shown: Catia, 3ds Max, Bunkspeed, Maya
IRAY VCA, SCALABLE GPU RENDERING APPLIANCE: MSRP $50,000. [Chart: Relative Performance, 0 to 80, for CPU-only Workstation, Quadro K5000 Workstation, and Iray VCA] Applications shown: Catia, 3ds Max, Bunkspeed, Maya
GRID: GPU in the Cloud
Ben Fathi, Chief Technology Officer. Horizon DaaS Platform
“10 of the Top 10” Greenest Supercomputers Powered by CUDA GPUs
TEGRA K1, Mobile Super Chip: Unify GPU and Tegra Architecture; 192 fully programmable CUDA cores; 326 GFLOPS; 4X energy efficiency over A15. GPU ARCHITECTURE: Tesla, Fermi, Kepler, Maxwell. MOBILE ARCHITECTURE: Tegra 3, Tegra 4, Tegra K1
Computer Vision on CUDA: Feature Detection / Tracking, ~30 GFLOPS @ 30 Hz; Object Recognition / Tracking, ~180 GFLOPS @ 30 Hz; 3D Scene Interpretation, ~280 GFLOPS @ 30 Hz
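One way to read these throughput figures is as a per-frame compute budget, and as a share of the 326 GFLOPS quoted for Tegra K1 on the neighboring slides. The sketch below does both. Comparing against peak GFLOPS is an assumption made for illustration; sustained throughput on real workloads would be lower.

```python
# Workload figures from the slide, in GFLOPS at a 30 Hz frame rate.
WORKLOADS_GFLOPS = {
    "Feature Detection / Tracking": 30,
    "Object Recognition / Tracking": 180,
    "3D Scene Interpretation": 280,
}
FRAME_RATE_HZ = 30
TEGRA_K1_PEAK_GFLOPS = 326  # peak figure quoted for Tegra K1 / Jetson TK1


def gflop_per_frame(gflops, frame_rate_hz=FRAME_RATE_HZ):
    """Operations available for each frame, in GFLOP."""
    return gflops / frame_rate_hz


def share_of_peak(gflops, peak_gflops=TEGRA_K1_PEAK_GFLOPS):
    """Fraction of the quoted peak that the workload would consume."""
    return gflops / peak_gflops


for name, gflops in WORKLOADS_GFLOPS.items():
    print(
        f"{name}: {gflop_per_frame(gflops):.1f} GFLOP per frame, "
        f"{share_of_peak(gflops):.0%} of peak"
    )
```

All three workloads fit under the quoted peak, with 3D scene interpretation using most of it.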
JETSON TK1, 1st MOBILE SUPERCOMPUTER FOR EMBEDDED SYSTEMS: 192 CUDA cores; 326 GFLOPS; VisionWorks SDK; $192
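The core count and peak GFLOPS on this slide imply a GPU clock if one assumes each CUDA core retires one fused multiply-add (2 floating-point operations) per cycle. That per-cycle figure is an assumption, not something the slide states, so the result below is an inferred value rather than a published specification.

```python
CUDA_CORES = 192
PEAK_GFLOPS = 326
PRICE_USD = 192

# Assumption: one fused multiply-add (2 FLOP) per core per clock cycle.
FLOP_PER_CORE_PER_CYCLE = 2

# peak = cores * FLOP per cycle * clock, solved for clock (in GHz).
implied_clock_ghz = PEAK_GFLOPS / (CUDA_CORES * FLOP_PER_CORE_PER_CYCLE)
gflops_per_dollar = PEAK_GFLOPS / PRICE_USD

print(f"Implied GPU clock: {implied_clock_ghz:.3f} GHz")
print(f"Peak GFLOPS per dollar: {gflops_per_dollar:.2f}")
```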
VISIONWORKS, COMPUTER VISION ON CUDA. Applications: Driver Assistance; Computational Photography; Augmented Reality; Robotics. Stack: Your Code; Sample Pipelines (Object Detection / Tracking, Structure from Motion, …); VisionWorks Primitives (Classifier, Corner Detection, …); CUDA; Jetson TK1
TEGRA ROADMAP [Chart: Single Precision GFLOPS / W Normalized, 0 to 80, 2011-2015]: Tegra 2; Tegra 3; Tegra 4; Tegra K1 (Kepler GPU, CUDA, 64b & 32b CPU); Erista (Maxwell GPU)
Andreas Reich, Head of Audi Pre-Development