Deep Learning with TensorFlow and Spark using GPUs and Docker Containers • Tom Phelan • BlueData CTO / HPE Fellow • (BlueData was recently acquired by HPE) • thomas.phelan@hpe.com, @tapbluedata
Agenda • Big Data, Artificial Intelligence, Machine Learning, and Deep Learning • Common Enterprise Use Cases • Deep Learning Application Requirements • Distributed TensorFlow & Horovod in Containers with GPUs • Use of Docker Containers to meet Requirements • Challenges and Solutions • Lessons Learned • Key Takeaways and Recommendations
Big Data Big Data refers to data sets that are so voluminous and complex that traditional data-processing application software is inadequate to deal with them. Characterized by: volume, variety, velocity. Source: https://en.wikipedia.org/wiki/Big_Data
Let’s Get Grounded … What is AI? • Artificial intelligence (AI): mimics human behavior; any technique that enables machines to solve a task in a way similar to how humans do. Example: Siri • Machine learning (ML): algorithms that allow computers to learn from examples without being explicitly programmed. Example: Google Maps • Deep learning (DL): subset of ML, using deep artificial neural networks as models, inspired by the structure and function of the human brain. Example: self-driving car
Game Changing Innovation Gartner 2019 CIO Agenda Q: Which technology areas do you expect will be a game changer for your organization? Source: Gartner, Insights From the 2019 CIO Agenda Report, by Andy Rowsell-Jones, et al.
Why should you be interested in AI / ML / DL? Everyone wants AI / ML / DL and advanced analytics… • AI and advanced analytics represent 2 of the top 3 CIO priorities • AI and advanced analytics infrastructure could constitute 15-20% of the market by 2021¹ • Enterprise AI adoption: 2.7x growth in the last 4 years² …but enterprises face many challenges: • Use cases • New roles, skill gaps • Culture and change • Data preparation • Legacy infrastructure — ¹ IDC, Goldman Sachs, HPE Corporate Strategy, 2018; ² Gartner, “2019 CIO Survey: CIOs Have Awoken to the Importance of AI”
Example Enterprise Deep Learning Use Cases • Fraud Detection: credit cards, car loans • Medical Diagnosis: detection / prediction of cancer, Alzheimer’s, pneumonia • Prediction: weather forecast, gas & oil location, buyer / trader action
AI / ML / DL Adoption in the Enterprise • Financial services: fraud detection, ID verification • Government: cyber-security, smart cities and utilities • Energy: seismic and reservoir modeling • Retail: video surveillance, shopping patterns • Health: personalized medicine, image analytics • Consumer tech: chatbots • Service providers: media delivery • Manufacturing: predictive and prescriptive maintenance
Financial Services Use Cases Wide range of ML / DL use cases for wholesale / commercial banking, credit card / payments, retail banking, etc. • Customer Segmentation: understanding customer quadrant, effective messaging & improved engagement, targeted customer support, enhanced retention • CLV Prediction and Recommendation: historical purchase view, pattern recognition, retention strategy, upsell, cross-sell, nurturing • Fraud Detection: real-time transactions, credit card, merchant, collusion, impersonation, social engineering fraud • Risk Modeling & Credit Worthiness Check: loan defaults, delayed payments, liquidity, market & currencies, purchases and payments, time series • Other: image recognition, NLP, security, video analysis, behavioral analysis
Fraud Detection Use Case • One of the most common use cases for ML / DL in Financial Services is to detect and prevent fraud • This requires: • Distributed Big Data processing frameworks such as Spark • ML / DL tools such as TensorFlow, H2O, and others • Continuous model training and deployment • Multiple large data sets
ML / DL in Healthcare – Use Cases • Precision Medicine and Personal Sensing • Disease prediction, diagnosis, and detection (e.g. genomics research) • Using data from local sensors (e.g. mobile phones) to identify human behavior • Electronic Health Record (EHR) correlation • “Smart” health records • Improved Clinical Workflow • Decision support for clinicians • Claims Management and Fraud Detection • Identify fraudulent claims • Drug Discovery and Development
360° View of the Patient • Demographics • Visits • Labs • Rx • Diagnosis • Care Site • Genomics • Studies
Sample Architecture & Applications • Publishers: Electronic Health Record systems, monitors / devices, and database access feeding Kafka Connect • Centralized publisher / subscriber hub in front of a secure HDFS data lake • Model build against the data lake, with a local store and model promotion • Speed layer for model scoring, with results / feedback published back to the hub
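A minimal sketch of the speed-layer scoring pattern above, using Spark Structured Streaming to subscribe to the hub and score events in micro-batches. The broker address, topic name, event schema, and scoring logic are hypothetical placeholders, not part of the original architecture.

```python
# Sketch of the "speed layer" scoring pattern: subscribe to the pub/sub hub
# (Kafka) and score incoming events. Topic, broker, and schema are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("speed-layer-scoring").getOrCreate()

event_schema = StructType([
    StructField("patient_id", StringType()),
    StructField("measurement", DoubleType()),
])

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")   # placeholder broker
          .option("subscribe", "ehr-events")                 # placeholder topic
          .load()
          .select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

def score_batch(batch_df, batch_id):
    # Placeholder: load the promoted model here and write results / feedback
    # back to the hub or to the HDFS data lake.
    batch_df.show(truncate=False)

query = events.writeStream.foreachBatch(score_batch).start()
query.awaitTermination()
```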
Requirements for Deep Learning Q: What hardware is commonly used for deep learning and matrix multiplications? A: Graphics Processing Units (aka GPUs)
Matrix Multiplication Source: https://www.khanacademy.org/math/precalculus/precalc-matrices/multiplying-matrices-by-matrices/a/multiplying-matrices
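To make the connection concrete, here is a tiny matrix multiplication in TensorFlow, the operation that GPUs accelerate for deep learning. This assumes TensorFlow 2.x eager execution (the stack later in this deck uses TF 1.9); the values are arbitrary illustration data.

```python
# A 2x3 by 3x2 matrix multiplication -- the core operation GPUs accelerate.
import tensorflow as tf

a = tf.constant([[1., 2., 3.],
                 [4., 5., 6.]])        # shape (2, 3)
b = tf.constant([[ 7.,  8.],
                 [ 9., 10.],
                 [11., 12.]])          # shape (3, 2)

c = tf.matmul(a, b)                    # shape (2, 2)
print(c.numpy())
# [[ 58.  64.]
#  [139. 154.]]
```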
Deep Neural Network (DNN) Source: https://www.kdnuggets.com/wp-content/uploads/deep-neural-network.jpg
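A deep neural network like the one pictured is just a stack of such matrix multiplications plus nonlinearities. A minimal Keras sketch is shown below; the layer widths, input size, and 10-class output are arbitrary illustration choices.

```python
# A minimal fully connected deep neural network (DNN) sketch using Keras.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),  # hidden layer 1
    tf.keras.layers.Dense(64, activation="relu"),                       # hidden layer 2
    tf.keras.layers.Dense(10, activation="softmax"),                    # output layer
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```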
NVIDIA GPUs and CUDA • NVIDIA is the most common manufacturer of GPUs • CUDA is the software that allows applications to access the NVIDIA GPU hardware • CUDA library or toolkit • CUDA kernel device driver • GPU-hardware specific
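A quick way to confirm that an application can see CUDA-capable GPUs through the toolkit and kernel driver is shown below; it uses the TensorFlow 2.x device API (an assumption, since the stack in this deck is TF 1.9, where the equivalent is device_lib).

```python
# Check that TensorFlow can see CUDA-capable GPUs (TensorFlow 2.x API assumed).
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print(f"Num GPUs visible: {len(gpus)}")
for gpu in gpus:
    print(gpu)

# On TensorFlow 1.x, the equivalent check is:
#   from tensorflow.python.client import device_lib
#   print(device_lib.list_local_devices())
```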
Architectural Challenges • Complexity, lack of repeatability and reproducibility across environments • Sharing data, not duplicating data • Need agility to scale compute resources up and down • Deploying multiple distributed platforms, libraries, applications, and versions • One size environment fits none • Need a flexible and future-proof solution that spans laptop, on-prem cluster, and off-prem cluster environments
Cost Challenges • How to maximize the use of expensive hardware resources • How to run clusters on heterogeneous host hardware • CPUs and GPUs, including multiple GPU versions • How to minimize manual operations • Automating the cluster creation and deployment process • Creating reproducible clusters and reproducible results • Enabling on-demand provisioning and elasticity
Support and Security Challenges • How to support the latest versions of software • Compatibility across version upgrades • How to ensure enterprise-class security • Network, storage, user authentication, and access • Applying security patches
Addressing The Challenges Simplify Deployments Innovate Faster Deploy Anywhere Docker is software that performs operating-system-level virtualization, also known as containerization. Containerization allows multiple isolated user-space instances to run on a single server. Source: https://en.wikipedia.org/wiki/docker_(software)
Surfacing a GPU Device into a Container • Use of the docker --device command line option • Use of a vendor-specific Docker wrapper such as nvidia-docker (introduced in 2016) • Use of the NVIDIA Container Runtime (introduced mid 2018) • Support for container runtimes other than Docker
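As a rough sketch of how to verify that one of these methods actually surfaced the GPU, the script below can be run inside the container. The image name is hypothetical, and the launch commands in the comments simply illustrate the two options named above (--device and the NVIDIA Container Runtime).

```python
# check_gpu_devices.py -- run inside the container to confirm the GPU device
# nodes and NVIDIA driver were surfaced. Example launches (image name is a
# placeholder):
#   docker run --runtime=nvidia my-tf-image python check_gpu_devices.py
#   docker run --device=/dev/nvidia0 --device=/dev/nvidiactl \
#              --device=/dev/nvidia-uvm my-tf-image python check_gpu_devices.py
import glob
import subprocess

# GPU device nodes mounted into the container
print("NVIDIA device nodes:", glob.glob("/dev/nvidia*"))

# nvidia-smi talks to the host kernel driver; it should list the GPUs
# if the chosen method surfaced the driver correctly.
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
```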
GPU Access from within a Container Source: http://www.nvidia.com/object/docker-container.html
Spark • A unified analytics engine for large-scale data processing • Easy to use • Java, Python, Scala, R • Runs everywhere • Hadoop, Mesos, Kubernetes, standalone
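For a feel of the API, here is a minimal PySpark example (the classic word count); the input path is a hypothetical placeholder.

```python
# Minimal PySpark example: word count over a text file.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, col

spark = SparkSession.builder.appName("word-count").getOrCreate()

lines = spark.read.text("hdfs:///data/sample.txt")            # placeholder path
counts = (lines
          .select(explode(split(col("value"), r"\s+")).alias("word"))
          .groupBy("word")
          .count()
          .orderBy(col("count").desc()))
counts.show(10)
spark.stop()
```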
TensorFlow • Distributed solution for training models in parallel, on multiple devices, using GPUs • Goal is to improve accuracy and speed
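As an illustrative sketch only, the snippet below shows single-node multi-GPU data parallelism with the TensorFlow 2.x tf.distribute API; the TF 1.9 stack used elsewhere in this deck relied on the older TF 1.x distribution mechanisms instead. Model shape and data are arbitrary placeholders.

```python
# Sketch: replicate a Keras model on each local GPU and train in parallel.
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()   # one replica per local GPU
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])

x = np.random.rand(1024, 20).astype("float32")   # placeholder data
y = np.random.randint(0, 2, size=(1024, 1))
model.fit(x, y, epochs=2, batch_size=128)
```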
Horovod • An open source framework developed by Uber that supports allreduce • Distributed training framework for • TensorFlow, PyTorch, Keras • Separates infrastructure capabilities from ML • Installs easily on an existing ML framework • pip install horovod • Uses a bandwidth-optimal communication protocol • RDMA, InfiniBand if available
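A sketch of Horovod's documented Keras usage pattern follows: initialize, pin each process to one GPU, scale the learning rate, wrap the optimizer for allreduce, and broadcast initial weights. The GPU-pinning call assumes the TF 2.x device API; model and data are placeholders.

```python
# train_hvd.py -- Horovod data-parallel training sketch.
# Launch with something like:  horovodrun -np 4 python train_hvd.py
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                         # one process per GPU

# Pin each process to its own local GPU (TF 2.x device API assumed).
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged with allreduce.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
model.compile(optimizer=opt, loss="binary_crossentropy", metrics=["accuracy"])

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync initial weights
x = np.random.rand(1024, 20).astype("float32")   # placeholder data
y = np.random.randint(0, 2, size=(1024, 1))
model.fit(x, y, epochs=2, batch_size=128, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```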
Spark, TensorFlow, NVIDIA, Horovod, on Docker • A Spark driver coordinates a Horovod cluster across multiple Spark executors • The Horovod cluster spans multiple GPUs, containers, and machines • Per-executor stack: TensorFlow 1.9, MPI 3.1.3, NCCL 2.3.7, GPU / CUDA 9 • All executors access shared data
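One way to wire the pieces together is Horovod's Spark integration, sketched below on the assumption that horovod.spark.run is available in the installed Horovod version; num_proc and the training body are illustrative placeholders, not the exact setup behind this slide.

```python
# Sketch: launch a Horovod training function from the Spark driver onto
# Spark executors (one Horovod process per executor slot).
import horovod.spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("horovod-on-spark").getOrCreate()

def train():
    # Runs in each executor process; each one trains a model replica
    # (see the Horovod sketch on the previous slide).
    import horovod.tensorflow.keras as hvd
    hvd.init()
    return hvd.rank()

ranks = horovod.spark.run(train, num_proc=3)   # num_proc is illustrative
print("Horovod ranks started on executors:", ranks)
spark.stop()
```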
Deep Learning and Docker Containers • ML / DL applications are compute hardware intensive as well as complex to set up and administer • They can benefit from the flexibility, agility, and resource sharing attributes of containerization • But care must be taken in how this is done, especially in a large-scale distributed environment
AI-Driven Solutions for the Enterprise • Example industry use cases: video surveillance, fraud detection, genome research, customer 360 • Solution stack: data science and ML / DL tools, data platforms, and infrastructure on top of an HDFS / NFS data store • Concerns addressed: data duplication, cloud, IT, user access, security, time to deploy, multi-tenancy
Distributed Deep Learning on Docker • On-demand GPU-enabled clusters for deep learning, with compute and data isolation • Support for parallelism; prototype models and save, load, share, and run them • Quickly deploy sandbox nodes in minutes • Secure, multi-tenant, elastic environments on shared infrastructure • Submit jobs using notebooks, web UI, or API • Web-based SSH access for CLI jobs • Shared storage access with security (HDFS data lake or NFS enterprise storage) • Cloud-ready architecture, using Docker containers with the CUDA runtime and cuDNN; GPU devices are mounted and device drivers surfaced into the containers • Runs on DIY / K8s / DC/OS containerized infrastructure for AI / ML / DL, across GPU and non-GPU hosts
Challenges Solved • Pre-built Docker images with CUDA and automated cluster creation for the entire stack • Easy to deploy, repeatable • Appropriate NVIDIA kernel module surfaced automatically to the containers • Scalable access to resources (e.g. single node, single GPU, multi-node, multi-GPU combinations) • UI, CLI, and API use patterns (notebooks, web, SSH) • Fast access to shared data • Support of hybrid on premise and cloud infrastructure
Faster ML / DL Deployment Time • Legacy deployment (~45 days): hardware, networking, storage, operating system, download / install, cluster configuration, init.d configuration, security (KDC, AD/LDAP, SSL), user access (SSH, SSL), add services, load balancing, port mapping, add / configure libraries, software management, onboard users; then the end user submits a job / model via SSH / UI • Deployment with BlueData (~10 minutes): Docker application images on physical servers; the end user submits a job / model via SSH / UI
Lessons Learned • Enterprises are using ML / DL today to solve difficult problems • Example use cases: fraud detection, disease prediction, etc. • Distributed ML / DL in the enterprise requires a complex stack, with multiple different tools • TensorFlow is one popular option • Deployments are challenging, with many potential pitfalls • Containerization can deliver agility and cost saving benefits
Takeaways • Distributed deep learning applications can be deployed on Docker containers • GPU resources can be effectively shared between multiple applications • Deep learning requires a complex software / hardware stack • DevOps complexity can be abstracted from the data scientist with self-service provisioning and automation • Data resources should be decoupled from compute resources in order to maximize platform flexibility and scalability
Recommendations • Start with a few key use cases to explore deep learning • Be prepared – the only constant is change • Business needs and use cases constantly evolve • Tools evolve in dog years • Leverage a flexible, scalable, and elastic platform for success • Containerization can deliver agility and cost saving benefits • But there are significant challenges … DIY will likely be DOA
Please Rate This Session Session page on conference website O’Reilly Events App
Tom Phelan @tapbluedata