GPU Technology Conference 2014 Keynote

NVIDIA CEO Jen-Hsun Huang introduces NVLink and shares the GPU roadmap. Other primary topics include the introduction of the GeForce GTX Titan Z, CUDA for machine learning, and Iray VCA.

nvidia

Presentation Transcript


  1. [Chart: GPU vs. CPU performance in TeraFLOPS (0 to 5), 2003 to 2013]

  2. GTC: GROWING AND EXPANDING. Topics listed (#1 topic and fastest growing topics): HPC / Supercomputing, Energy Exploration, Life Science & Genomics, Molecular Dynamics, Big Data Analytics, Machine Learning, Computer Vision. [Chart: values 729, 429, 397; years 2010, 2012, 2014]

  3. FOSTERING THE GPU ECOSYSTEM Big Data / Cloud / Computer Vision AudioStreamTV 2012 2013 2014

  4. CUDA EVERYWHERE

  5. “Large-scale CFD Applications and a Full GPU Implementation of a Weather Prediction Code on the TSUBAME Supercomputer” Takayuki Aoki Global Scientific Information and Computing Center Tokyo Institute of Technology

  6. BANDWIDTH BOTTLENECKS: PCI Express 16 GB/sec; CPU Memory 60 GB/sec; GPU Memory 288 GB/sec. [Diagram labels: GPU, CPU, PCIe]

  7. INTRODUCING NVLINK: Differential with embedded clock; PCIe programming model (w/ DMA+); Unified Memory; Cache coherency in Gen 2.0; 5 to 12X PCIe. [Diagram labels: GPU, CPU, PCIe]

  8. PCIe SWITCH GPU GPU GPU GPU CPU 5X More Bandwidth for Multi-GPU Scaling

  9. 3D MEMORY: 3D Chip-on-Wafer integration; Many X bandwidth; 2.5X capacity; 4X energy efficiency. [Chart: Memory Bandwidth (0 to 1200), 2008 to 2016]

  10. Blaise Pascal 1623-1662 Mechanical Calculator Probability Theory Pascal’s Theorem Pascal’s Law

  11. PASCAL. NVLink: 5 to 12X PCIe 3.0. 3D Memory: 2 to 4X memory BW & size. Module: 1/3 size of PCIe card.

  12. GPU ROADMAP: Tesla (CUDA), Fermi (FP64), Kepler (Dynamic Parallelism), Maxwell (DX12), Pascal (Unified Memory, 3D Memory, NVLink). [Chart: SGEMM / W Normalized (0 to 20), 2008 to 2016]

  13. MACHINE LEARNING: Branch of Artificial Intelligence; Computers that learn from data. [Image labels: person car bird helmet frog motorcycle person person hammer dog flower pot chair power drill]

  14. Machine Learning using Deep Neural Networks Input Result

  15. Building High-level Features Using Large Scale Unsupervised Learning Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, A. Ng Stanford / Google 1 billion connections 10 million 200x200 pixel images 1,000 machines (16,000 cores) 3 days

  16. GOOGLE BRAIN. Today’s Largest Networks: 1B connections; 10M images; ~3 days; ~30 ExaFLOPS. Human Brain: ~100B neurons x 1000 connections; 500M images; 5,000,000X “Google Brain”; ~150 YottaFLOPS; ~40,000 “Google Brain-Years”. 1,000 CPU Servers (2,000 CPUs • 16,000 cores): 600 kWatts; $5,000,000. SOURCE: Ian Goodfellow

  17. GOOGLE BRAIN / STANFORD AI LAB. Deep Learning with COTS HPC Systems, A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, B. Catanzaro, Stanford / NVIDIA • ICML 2013. “Now You Can Build Google’s $1M Artificial Brain on the Cheap” -Wired. 3 GPU-Accelerated Servers (12 GPUs • 18,432 cores): 4 kWatts; $33,000. 1,000 CPU Servers (2,000 CPUs • 16,000 cores): 600 kWatts; $5,000,000.

  18. DEMO: MACHINE LEARNING, SIMPLE TRAINING SET

  19. DEMO: MACHINE LEARNING, NYU OVERFEAT. 1.2M image training set; 1000 classes; 2 weeks of training; 7 GPUs; 25 EXAFLOPS total to train.

  20. CUDA for MACHINE LEARNING Early Adopters Talks @ GTC Use Cases Image Detection Face Recognition Image Analytics for Creative Cloud Speech/Image Recognition Gesture Recognition Video Search & Analytics Speech Recognition & Translation Image Classification Hadoop Recommendation Engines Indexing & Search Search Rankings Recommendation

  21. Big Data & Infinite Compute Turbocharge Deep Learning: 800M photos uploaded per day; 100 hours of video uploaded per minute; Unstructured data exploding. [Charts: photos in millions (Facebook, Instagram, Snapchat, Flickr), 2007 to 2014; Hours (YouTube), 2007 to 2013; Exabytes of data, 2010 and 2015, with bar labels 1,104 and 5,379] SOURCE: KPCB/Mary Meeker, company data. Unstructured data: IDC's Digital Universe Study.

  22. DEMO: TITAN Z REVEAL

  23. 5,760 CUDA cores 12GB memory 8 TeraFLOPS $2999

  24. GOOGLE BRAIN / STANFORD AI LAB. 300X energy efficiency; 400X lower cost; Fits next to a desk. 1 Titan Z-Accelerated Server (3 Titan Zs • 17,280 cores): 2 kWatts; $12,000. 1,000 CPU Servers (2,000 CPUs • 16,000 cores): 600 kWatts; $5,000,000.

  25. First CGI Film Nominated for an Academy Award® RenderMan with programmable shading 1.5 hours to render each frame CCI 6/32 minicomputer

  26. 2013 Academy Award® Winner BEST VISUAL EFFECTS State-of-the-art water simulator 48 hours to simulate the base water 250 hours to render each frame

  27. DEMO: WHALE

  28. DEMO: FLEX

  29. DEMO: FLAMEWORKS

  30. DEMO: UE4

  31. One is a photo, One is Iray…

  32. IRAY VCA, SCALABLE GPU RENDERING APPLIANCE: 8 Kepler-class GPUs; GPU memory: 12GB per GPU; 23,040 CUDA cores; Network: 2 x 1GigE, 2 x 10GigE, 1 x InfiniBand. [Logos: Catia, 3ds Max, Bunkspeed, Maya]

  33. DEMO: IRAY / HONDA

  34. IRAY VCA, SCALABLE GPU RENDERING APPLIANCE: Iray VCA MSRP $50,000. [Chart: Relative Performance (0 to 80) for CPU-only Workstation, Quadro K5000 Workstation, Iray VCA] [Logos: Catia, 3ds Max, Bunkspeed, Maya]

  35. GRID GPU in the Cloud

  36. Ben Fathi Chief Technology Officer Horizon DaaS Platform

  37. Mobile CUDA

  38. “10 of the Top 10” Greenest Supercomputers Powered by CUDA GPUs

  39. Unify GPU and Tegra Architecture. TEGRA K1, Mobile Super Chip: 192 fully programmable CUDA cores; 326 GFLOPS; 4X energy efficiency over A15. GPU ARCHITECTURE: Tesla, Fermi, Kepler, Maxwell. MOBILE ARCHITECTURE: Tegra 3, Tegra 4, Tegra K1.

  40. Computer Vision on CUDA. Feature Detection / Tracking: ~30 GFLOPS @ 30 Hz. Object Recognition / Tracking: ~180 GFLOPS @ 30 Hz. 3D Scene Interpretation: ~280 GFLOPS @ 30 Hz.

  41. JETSON TK1 1st MOBILE SUPERCOMPUTER FOR EMBEDDED SYSTEMS 192 CUDA cores 326 GFLOPS VisionWorks SDK $192

  42. Your Code Sample Pipelines … VISIONWORKS COMPUTER VISION ON CUDA Object Detection / Tracking Structure from Motion Driver Assistance Computational Photography VisionWorks Primitives … Classifier Corner Detection CUDA Augmented Reality Robotics Jetson TK1

  43. TEGRA ROADMAP: Tegra 2, Tegra 3, Tegra 4, Tegra K1 (Kepler GPU, CUDA, 64b & 32b CPU), Erista (Maxwell GPU). [Chart: Single Precision GFLOPS / W Normalized (0 to 80), 2011 to 2015]

  44. Andreas Reich Head of Audi Pre-Development

  45. VIDEO: AUDI ADAS
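
Several slides quote ratios that can be cross-checked from the other numbers on the same slides (slides 6, 7, 16, 17, 23 and 24). The Python sketch below is only a sanity check of that arithmetic, not part of the keynote. Every input is copied from the slide text, the variable names are hypothetical, and one step is an assumption: that the "5 to 12X PCIe" NVLink claim is measured against the 16 GB/sec PCI Express figure on slide 6.

```python
# Arithmetic cross-checks for figures quoted on the slides.
# Inputs are copied from the slide text; all variable names are hypothetical.

EXA = 10**18
YOTTA = 10**24

# Slide 6: bandwidths in GB/sec.
pcie_gbps, cpu_mem_gbps, gpu_mem_gbps = 16, 60, 288
gpu_mem_vs_pcie = gpu_mem_gbps / pcie_gbps

# Slides 7 and 11: "5 to 12X PCIe".
# ASSUMPTION: the multiplier applies to the 16 GB/sec figure from slide 6.
nvlink_range_gbps = (5 * pcie_gbps, 12 * pcie_gbps)

# Slide 16: scaling "Google Brain" by 5,000,000X.
google_brain_flop = 30 * EXA      # "~30 ExaFLOPS" total
google_brain_days = 3             # "~3 days"
brain_scale = 5_000_000           # "5,000,000X Google Brain"
human_brain_flop = google_brain_flop * brain_scale
brain_years = google_brain_days * brain_scale / 365

# Slide 17: 3 GPU-accelerated servers vs. 1,000 CPU servers.
gpu_server_power_ratio = 600 / 4            # kWatts
gpu_server_cost_ratio = 5_000_000 / 33_000  # dollars

# Slides 23 and 24: one server with 3 Titan Zs vs. 1,000 CPU servers.
titan_z_cores = 3 * 5_760
titan_z_power_ratio = 600 / 2               # kWatts
titan_z_cost_ratio = 5_000_000 / 12_000     # dollars

print(f"GPU memory vs. PCIe bandwidth: {gpu_mem_vs_pcie:.0f}X")
print(f"Implied NVLink range (assumed baseline): {nvlink_range_gbps} GB/sec")
print(f"Human-brain-scale training: {human_brain_flop / YOTTA:.0f} Yotta, "
      f"{brain_years:,.0f} years")
print(f"GPU servers: {gpu_server_power_ratio:.0f}X power, "
      f"{gpu_server_cost_ratio:.0f}X cost")
print(f"Titan Z server: {titan_z_cores:,} cores, "
      f"{titan_z_power_ratio:.0f}X power, {titan_z_cost_ratio:.0f}X cost")
```

Under these inputs the slide 16 and slide 24 figures are consistent with rounding: 3 days times 5,000,000 is roughly 41,000 years (quoted as "~40,000"), and the cost ratio of about 417 is quoted as "400X lower cost".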
