Learn from Satish Dandu, Michael Balint, and Joshua Patterson how to accelerate anomaly detection and inferencing using deep learning and GPU data pipelines.
AGENDA
• Data Platform-as-a-Service
• Multi-Tenancy
• Anomaly Detection Platform
• Training & Inferencing Evolution
• Performance & Learnings
• GPU Acceleration
• Dashboarding use case & impetus for GPU acceleration
• Performance & Learnings
• Future & GOAI
DATA PLATFORM-AS-A-SERVICE

SCALE
• Handles 1M events/second
• Auto-scales the cluster automatically

HIGH AVAILABILITY
• HA with no data loss
• Always-on architecture
• Data replication

SECURITY
• Data platform security implemented with VPCs in AWS
• Dashboard access using NVIDIA LDAP

SELF SERVICE
• Log-to-analytics
• Kibana, JDBC access
• Accessing data using BI tools
ARCHITECTURE V1
DATA PAAS → ANOMALY DETECTION (AD) PAAS
*Images created from quickmeme.com
ANOMALY DETECTION USING DEEP LEARNING
• NGC/NGN GPU clusters and GPU cloud power the anomaly detection data platform
• AI framework: Keras + TensorFlow
• Top features: automated alerts & dashboards, early detection, self service, better accuracy & less noise
ANOMALY DETECTION FRAMEWORK
1. Raw dataset: time series (Time, X1, X2)
2. Feature learning: Recurrent Neural Networks (RNN) and Autoencoders (AE) produce learned features X', X''
3. Anomaly detection — supervised: logistic regression; unsupervised: multivariate Gaussian
4. Post-processing: univariate analysis yields an anomaly description per detection
5. Output: email alerts and dashboards; user feedback flows back into training
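The unsupervised branch of the framework above (a multivariate Gaussian fit over learned features) can be sketched as follows. This is a minimal, pure-Python illustration with a diagonal covariance (features treated as independent); the real pipeline feeds RNN/autoencoder features into the fit, and all names here are illustrative, not the production code.

```python
import math

def fit_gaussian(rows):
    """Fit per-feature mean and variance (diagonal-covariance Gaussian).

    rows: list of equal-length feature vectors from normal (non-anomalous)
    traffic. Assumes every feature has nonzero variance.
    """
    n = len(rows)
    dims = len(rows[0])
    mu = [sum(r[d] for r in rows) / n for d in range(dims)]
    var = [sum((r[d] - mu[d]) ** 2 for r in rows) / n for d in range(dims)]
    return mu, var

def density(x, mu, var):
    """Probability density of x under the fitted diagonal Gaussian."""
    p = 1.0
    for d in range(len(x)):
        p *= math.exp(-((x[d] - mu[d]) ** 2) / (2 * var[d])) \
             / math.sqrt(2 * math.pi * var[d])
    return p

def is_anomaly(x, mu, var, epsilon):
    """Flag x as anomalous when its density falls below epsilon."""
    return density(x, mu, var) < epsilon
```

In practice the model is fit on a window of normal behavior and the threshold epsilon is tuned on labeled validation data; points whose density falls below it become the alerts that users can later label.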
ANOMALY DETECTION BENEFITS WITH DEEP LEARNING
Top features (with DL vs. without DL):
• Automated alerts & dashboards
• Early detection
• Self service
• Better accuracy & less noise
ANOMALY DETECTION TRAINING
Evolution:
• V0: Manual feature creation
• V1: Automatic feature creation using DL (Theano)
• V2: Multi-GPU support + TensorFlow Serving (Keras + TensorFlow)
[Chart] CPU vs. GPU training time
Learnings:
• Manual feature extraction does not scale
• Dataset preparation is the long pole
• Training on CPU takes longer than the data collection rate
ANOMALY DETECTION INFERENCING
Goal: detect/inference anomalies in near real time from user activity, using trained models
1. Aggregated data output for live data
2. Deep learning model
3. AD platform uses the trained model to inference anomalies
4. Sends automated alerts with dashboards
5. Users can label alerts
6. Model gets updated during the next training cycle
Image: http://www.sci-news.com/othersciences/linguistics/science-mystery-words-gullivers-travels-03135.html
V1: DATA PREP FOR INFERENCING
• Use case: detecting anomalies in user activity
• Inferencing flow from 10,000 feet: live streaming data → live ETL → streaming data aggregations for inferencing → AD platform
• Started with Python scripts for windowed aggregation
• Learnings: hard to scale for near real time; the AD platform runs inferencing every 3 minutes because it is bound by the speed of data processing
[Chart] Python script aggregation time: 73 s for 10 min of data, 103 s for 30 min, 154 s for 60 min
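The kind of windowed aggregation the V1 Python scripts performed can be sketched like this. The event shape (epoch timestamp, user id) and the 3-minute window are assumptions chosen to match the inferencing cadence described above, not the real schema.

```python
from collections import defaultdict

def window_counts(events, window_secs=180):
    """Bucket raw events into fixed time windows per user.

    events: iterable of (epoch_seconds, user_id) tuples (assumed shape).
    Returns {(window_start_epoch, user_id): event_count}.
    """
    counts = defaultdict(int)
    for ts, user in events:
        # Align each event to the start of its fixed-size window.
        window_start = int(ts) - int(ts) % window_secs
        counts[(window_start, user)] += 1
    return dict(counts)
```

Doing this in a single-process script is exactly what made V1 hard to scale: the aggregation time grows with the window of data, as the chart above shows.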
V2: IMPROVING DATA PREP PERFORMANCE
• V2: To improve performance, we started using Presto with data on S3 in JSON format
• Live data is streamed from Kafka to S3; we use Presto for our data warehousing needs
• Presto is an open-source distributed SQL query engine optimized for low-latency, ad-hoc analysis of data
• Learnings: Presto with Parquet has the best performance, but we need to batch data at 30-second intervals, so it is not completely real time
[Chart] Query time by window size (10/30/60 min of data): Presto on JSON — 20/25/30 s; Presto on Parquet — 4/6/8 s
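The windowed aggregation then becomes a Presto query instead of a Python loop. The sketch below only builds the SQL text; the table and column names (`events`, `event_time`, `user_id`) are illustrative, not the real schema, while `to_unixtime` and `floor` are standard Presto functions.

```python
def presto_window_query(table="events", window_secs=180, lookback_mins=10):
    """Build a Presto SQL statement bucketing events into fixed windows.

    Mirrors the per-user windowed counts of the V1 script, pushed down
    into the SQL engine. All identifiers here are hypothetical.
    """
    return f"""
    SELECT floor(to_unixtime(event_time) / {window_secs}) AS window_id,
           user_id,
           count(*) AS event_count
    FROM {table}
    WHERE event_time > now() - interval '{lookback_mins}' minute
    GROUP BY 1, 2
    """
```

Pushing the aggregation into Presto is what buys the JSON-to-Parquet speedups above, at the cost of the 30-second batching interval.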
DASHBOARDING
GPU ACCELERATION
Accelerate the pipeline, not just deep learning!
• GPUs for deep learning = proven
• Where else and how else can we use GPU acceleration?
  • Dashboards
  • Accelerating the data pipeline: data ingestion, data processing, stream processing, visualization, inferencing, model training
  • Building better models faster
• First: GPU databases
1+ BILLION TAXI RIDES BENCHMARK
[Chart] Response times in milliseconds for Queries 1–4 on MapD (DGX-1), MapD (4 × P100), Redshift (6 nodes), and Spark (11 nodes): the MapD configurations answer in roughly 21–700 ms, Redshift in roughly 1.3–3 s, and Spark in roughly 8.1–86 s
Source: MapD benchmarks on DGX from internal NVIDIA testing, following the guidelines of Mark Litwintschik's blogs (@marklit82): Redshift, 6-node ds2.8xlarge cluster; Spark 2.1, 11 × m3.xlarge cluster w/ HDFS
MAPD + IMMERSE VS ELASTIC + KIBANA

MapD Core + Immerse:
• Very fast OLAP queries
• JIT LLVM query compiler
• GPUs for compute, CPUs for parse + ingest
• Limited join support (for now)
• Concurrency?
• Immerse: c3/d3 + crossfilter = nice dashboards
• Backend rendering

Elastic + Kibana:
• Fantastic for complex search
• Scales easily (up to a point)
• Indexing consumes more storage (~4–6x)
• Kibana for KPI dashboarding?
ARCHITECTURE V1
ARCHITECTURE V2 (with MapD)
MAPD VS KIBANA
Dashboards comparison + performance test method
DASHBOARD PERFORMANCE
MapD Immerse vs. Elastic Kibana
[Chart] Time to fully load (seconds) vs. days of data (1–31): MapD Immerse stays under 9 s on a DGX and under 12 s on a P2 instance across the full range, while Elastic Kibana's load time grows into the hundreds of seconds as more days of data are added
V3: DATA PREP USING GPU ACCELERATION
• V3: Explored GPU databases such as MapD to improve query performance on live streaming data
• MapD offers constant query response times
• MapD has some SQL limitations, so we use Presto as the interface and built a "MapD → Presto" connector for full ANSI SQL features
[Chart] Query time by window size (10/30/60 min of data): Presto on JSON — 20/25/30 s; Presto on Parquet — 4/6/8 s; MapD — ~1.2 s flat; Presto + MapD — ~0.1 s flat
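The core idea of a "MapD → Presto" connector is a routing decision: serve the query from the GPU database when it can run there, and fall back to Presto for full ANSI SQL. The sketch below shows only that decision; the unsupported-feature list and the two client callables are hypothetical, not MapD's or Presto's real interfaces.

```python
# Hypothetical list of SQL features our MapD build cannot handle;
# the real set differs by MapD version.
UNSUPPORTED_IN_MAPD = ("FULL OUTER JOIN", "WITH ", "INTERSECT", "EXCEPT")

def route_query(sql, mapd_client, presto_client):
    """Route SQL to MapD when supported, else fall back to Presto.

    mapd_client / presto_client: illustrative callables that execute
    SQL and return rows. Mirrors only the connector's routing logic.
    """
    upper = sql.upper()
    if any(feature in upper for feature in UNSUPPORTED_IN_MAPD):
        return presto_client(sql)
    return mapd_client(sql)
```

Simple dashboard aggregations thus get GPU-speed responses, while the occasional complex query still succeeds through Presto.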
FUTURE
EXPAND GPU USAGE
More data, less hardware
Scaling up and out with GPU
[Chart] Peak double-precision TFLOPS, 2008–2017: NVIDIA GPUs scale far faster than x86 CPUs
EXPAND GPU USAGE
Internal logs to cyber security: Cyber Security Analytics Platform
GOAI ECOSYSTEM
End-to-end data science
• Numba: an open-source, just-in-time optimizing Python compiler that uses LLVM to produce native machine instructions
• Dask: a flexible parallel computing library for analytic computing, with dynamic task scheduling and big data collections
GOAI ECOSYSTEM
Graph analytics & visualization
• Gunrock: a CUDA library for graph processing designed specifically for the GPU; it uses a high-level, bulk-synchronous, data-centric abstraction focused on vertex or edge operations
• Example use cases: SIEM attack escalation; Dropbox external sharing logs
BETTER DATA PIPELINES
User-defined functions at scale: libgdf, pygdf, dask_gdf
https://github.com/gpuopenanalytics
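The programming model behind "user-defined functions at scale" is a row-wise kernel applied across dataframe columns; pygdf/dask_gdf compile such kernels with Numba and run them over GPU-resident GDF columns. Below is a pure-Python, CPU stand-in for that pattern, included only to show the shape of the API; it is not pygdf's actual implementation, and the dataframe here is just a dict of column lists.

```python
def apply_rows(df, kernel, incols, outcol):
    """CPU stand-in for a GDF row-wise user-defined function.

    df: dict mapping column name -> equal-length list (toy dataframe)
    kernel: maps one value per input column to one output value
    Writes the result as a new column and returns the dataframe.
    """
    n = len(df[incols[0]])
    df[outcol] = [kernel(*(df[c][i] for c in incols)) for i in range(n)]
    return df

# Illustrative use: ratio of bytes out to bytes in per flow record.
df = {"bytes_in": [100, 5000], "bytes_out": [10, 2500]}
apply_rows(df, lambda i, o: o / i, ["bytes_in", "bytes_out"], "ratio")
```

On the GPU the same kernel runs over millions of rows without copying data off the device, which is what makes UDF-heavy pipelines practical at scale.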
BETTER DATA PIPELINES
HIVE to BlazingDB
BETTER DATA PIPELINES
More models: nvGRAPH
https://github.com/h2oai/h2o4gpu
# edges = E * 2^S ~34M
JOIN THE REVOLUTION
Everyone can help!
• GPU Open Analytics Initiative: http://gpuopenanalytics.com/ (@Gpuoai)
• Apache Arrow: https://arrow.apache.org/ (@ApacheArrow)
• Apache Parquet: https://parquet.apache.org/ (@ApacheParquet)
Integrations, feedback, documentation support, pull requests, new issues, or donations welcome!
THANK YOU!
NVIDIA team: Ed Clune, Ayush Jaiswal, Keith Kraus, Rohit Kulkarni, Jarod Maupin, Nixs Nataraja, Rohan Somvanshi, Michael Wendt