Learn how containerizing Spark at RBC enables customizability, on-demand provisioning, predictability, security, and infrastructure optimization for efficient big data processing. Explore containerized Spark deployment modes and cluster architecture, and how the approach compares with native Kubernetes and YARN. Discover how Kerberized HDFS keeps data storage secure, and see the benefits of containerized Spark on OpenShift, including logging, monitoring, and improved scalability.
Containerized Spark at RBC
Raj Channa & Dhwanil Raval
Big Data and Spark
[Ecosystem diagram: layers for Security, Data Formats (Parquet, Avro, ORC, Arrow), Metadata Management, Coordination & Management, Scheduler, SQL-over-Hadoop, Scripting, Stream Processing, Machine Learning, In-Memory Processing, NoSQL Database, Search Engine, Data Piping, Resource Management, and Storage]
Why Containerize Spark?
• Customizability: run different versions of Spark on the same platform
• Provisioning: provision a Spark cluster on demand, in an automated manner, for each production job
• Predictability: dedicated cores allow for consistent run times and predictable SLAs for batch jobs
• Security: vulnerability assessment and scanning of each container at deployment time
• Infrastructure optimization: efficient infrastructure utilization and resource sharing
High-Level Overview of What We Ended Up Doing
[Architecture diagram: Spark jobs (Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX) and other container workloads sit on the Spark Core Engine with the Spark Standalone Scheduler, running on Kubernetes; the cluster-manager options shown are the Spark Standalone Scheduler, Kubernetes, YARN, and Mesos]
Apache Spark Deployment Modes & Building Blocks (Driver/Master & Executor/Worker)
[Diagram contrasting the job-launching environment with the cluster.
Deploy-mode cluster: spark-submit runs in the launching environment, while the Spark driver and all Spark executors run on cluster nodes.
Deploy-mode client: spark-submit and the Spark driver run in the launching environment, and only the Spark executors run on cluster nodes.]
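To make the two modes concrete, here is a minimal spark-submit sketch; the master URL, class name, and jar are illustrative placeholders rather than values from the deck:

  # Cluster mode: the driver is scheduled on a cluster node alongside the executors.
  spark-submit \
    --master spark://spark-master.example.com:7077 \
    --deploy-mode cluster \
    --class com.example.MyJob \
    my-job.jar

  # Client mode: spark-submit and the driver stay in the launching environment;
  # only the executors run on cluster nodes.
  spark-submit \
    --master spark://spark-master.example.com:7077 \
    --deploy-mode client \
    --class com.example.MyJob \
    my-job.jar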
Spark Cluster on OpenShift Using Cluster Mode with K8s as the Resource Manager
[Diagram: the client runs spark-submit against the Kubernetes API; Kubernetes schedules the driver pod (D), which in turn requests the executor pods (E)]
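As a hedged sketch of what a native-Kubernetes submission (Spark 2.3+) looks like, with the API server URL, namespace, image, and jar as placeholders:

  # spark-submit talks directly to the Kubernetes/OpenShift API, which creates the
  # driver pod (D); the driver then requests executor pods (E).
  spark-submit \
    --master k8s://https://openshift-api.example.com:6443 \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.kubernetes.namespace=spark-jobs \
    --conf spark.kubernetes.container.image=example-registry/spark:2.4.0 \
    --conf spark.executor.instances=5 \
    local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar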
Oshinko Using the Spark Standalone Cluster Manager with Client Mode
[Diagram: the Oshinko CLI deploys the Spark cluster on OpenShift – one master pod (M) and initially zero worker replicas (W) – then scales the worker pods up; the client runs spark-submit against the Spark master (not K8s)]
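A rough sketch of the Oshinko flow follows; exact Oshinko CLI flags vary between releases, and the cluster name, class, and jar are placeholders:

  # Stand up a Spark standalone cluster with the Oshinko CLI, then submit in client
  # mode against the Spark master service (not the Kubernetes API).
  oshinko create mysparkcluster --workers=0    # create the cluster with zero worker replicas
  oshinko scale mysparkcluster --workers=5     # scale worker pods up before the job runs

  spark-submit \
    --master spark://mysparkcluster:7077 \
    --deploy-mode client \
    --class com.example.MyJob \
    my-job.jar

  oshinko scale mysparkcluster --workers=0     # scale back down once the job completes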
How Did Oshinko Compare Against Native K8s?
[Comparison table; the item captured here: keytab-based HDFS security in standalone mode (see https://issues.apache.org/jira/browse/SPARK-5158, "Allow for keytab-based HDFS security in Standalone mode")]
Kerberized HDFS as Storage for Spark – How Does YARN Do This?
[Diagram: Spark client, KDC, YARN Application Master, Node Managers (Spark executors), HDFS]
1. Spark client gets a Kerberos ticket using kinit
2. Spark client executes spark-submit with the --keytab and --principal options
3. YARN distributes HDFS delegation tokens to all worker nodes
4. Spark reads and writes data from and to HDFS using the delegation token
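A minimal sketch of steps 1–2 on the client side (the principal, keytab path, and jar are placeholders); YARN then handles steps 3–4 automatically:

  # Log in to the KDC, then let spark-submit renew credentials from the keytab;
  # YARN distributes HDFS delegation tokens to the executors.
  kinit -kt /etc/security/keytabs/sparkjob.keytab sparkjob@EXAMPLE.COM

  spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --principal sparkjob@EXAMPLE.COM \
    --keytab /etc/security/keytabs/sparkjob.keytab \
    --class com.example.MyJob \
    hdfs:///apps/my-job.jar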
Kerberized HDFS as Storage for Spark – Scenario for Standalone
[Diagram: Spark client, KDC, Spark master, Spark workers (each configured via spark-env.sh), HDFS, in a Spark standalone cluster on OpenShift]
1. Spark processes start with kinit and get an HDFS service ticket
2. Spark workers convert the Kerberos tickets to delegation tokens at startup using post hooks
3. Spark client submits the job with the token location
4. Spark executors read and write data from and to HDFS using the generated delegation token
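One way the worker-side "post hook" could look, as a minimal sketch in spark-env.sh; the keytab path, principal, and token location are assumptions, not necessarily what the deck used:

  # spark-env.sh: authenticate to the KDC, convert the ticket into an HDFS delegation
  # token, and point Hadoop clients at the token file.
  kinit -kt /etc/secrets/sparkjob.keytab sparkjob@EXAMPLE.COM   # get the HDFS service ticket
  hdfs fetchdt --renewer sparkjob /tmp/hdfs.token               # fetch a delegation token to disk
  export HADOOP_TOKEN_FILE_LOCATION=/tmp/hdfs.token             # executors read/write HDFS with this token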
Custom Template Using the Spark Standalone Cluster Manager with Client Mode
[Diagram: Spark-template.yaml deploys the Spark cluster – one master pod (M) and initially zero worker replicas (W) – then the worker pods are scaled up; the client runs spark-submit against the Spark master (not K8s)]
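A hedged sketch of driving the custom template with the oc CLI; the template parameter and object names (spark-master, spark-worker) are assumptions about what Spark-template.yaml creates:

  # Instantiate the template (this replaces the Oshinko CLI step), then scale workers.
  oc process -f spark-template.yaml -p CLUSTER_NAME=mysparkcluster | oc apply -f -

  # Workers start at zero replicas; scale up just before the job and back down after.
  oc scale dc/spark-worker --replicas=5
  oc scale dc/spark-worker --replicas=0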
How Do the 3 Options Compare?
[Comparison table of the three approaches (native K8s, Oshinko, and the standalone StatefulSet from the custom template); the item captured here: keytab-based HDFS security in standalone mode (see https://issues.apache.org/jira/browse/SPARK-5158)]
Spark Cluster on OpenShift – Logging
[Diagram: the Spark master (M) and worker (W) pods and how their logs are collected]
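Because the pods write to stdout, logs can be tailed straight from OpenShift (and picked up by whatever cluster log aggregation is in place); the pod names below are placeholders:

  # Tail the Spark master and a worker pod.
  oc logs -f spark-master-1-abcde
  oc logs -f spark-worker-1-fghij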
Spark Cluster on OpenShift – Monitoring
Using Prometheus and Grafana; all pods expose metrics on port 7777
[Diagram: master (M) and worker (W) pods exposing metrics on port 7777]
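One common way to expose JVM metrics from the Spark daemons on a fixed port such as 7777 is the Prometheus JMX exporter Java agent; the agent jar path and exporter config file below are assumptions rather than details from the deck:

  # spark-env.sh on the master/worker images: attach the JMX exporter agent so each
  # daemon JVM serves Prometheus metrics on port 7777 for scraping.
  export SPARK_DAEMON_JAVA_OPTS="-javaagent:/opt/jmx_prometheus_javaagent.jar=7777:/opt/prometheus-config.yaml"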
What Was the Result?
• Time to provision an environment – Before: up to 3 months; After: 45 seconds
• Scaling down to zero – Before: not possible; After: 30 seconds, automated
• Hardware requirements – After: 33% less hardware
• Security scanning for each run – Before: none; After: automated scanning of every container on each deployment
• Run times – Before: variable; After: predictable