Learn how containerizing Spark at RBC enables customizability, provisioning, predictability, security, and infrastructure optimization for efficient big data processing. Explore containerized Spark deployment modes, cluster architecture, and a comparison with native Kubernetes and YARN. Discover how Kerberized HDFS secures data storage, and see the benefits of containerized Spark on OpenShift with logging, monitoring, and improved scalability.
Containerized Spark at RBC
Raj Channa & Dhwanil Raval
Big Data and Spark
Ecosystem diagram. Categories shown: Security; Data Format (Parquet, Avro, ORC, Arrow); Metadata Management; Coordinate & Management; Scheduler; SQL-over-Hadoop; Scripting; Stream Processing; Machine Learning; In-Memory Processing; NoSQL Database; Search Engine; Data Piping; Resource Management; Storage.
Why Containerize Spark?
• Customizability: Ability to run different versions of Spark on the same platform
• Provisioning: Provision a Spark cluster on demand, in an automated manner, for each production job
• Predictability: Dedicated cores allow for consistent run times and predictable SLAs for batch jobs
• Security: Vulnerability assessment and scanning of containers at deployment time
• Infrastructure Optimization: Efficient infrastructure utilization and resource sharing
High-Level Overview of What We Ended Up Doing
Diagram elements: Spark Jobs; Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX; Spark Core Engine; Spark Standalone Scheduler; Kubernetes; other container workloads. Cluster manager options shown: Spark Standalone Scheduler, Kubernetes, YARN, Mesos.
Apache Spark Deployment Modes & Building Blocks (Driver/Master & Executor/Worker)
Diagram: a job launching environment and a cluster of nodes.
• Deploy-mode cluster: spark-submit runs in the job launching environment; the Spark driver and the Spark executors run on cluster nodes
• Deploy-mode client: spark-submit and the Spark driver run in the job launching environment; the Spark executors run on cluster nodes
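The two deploy modes differ only in where the driver runs. A minimal sketch of how the choice appears on the spark-submit command line follows. The helper names, the master URL, and the application file are hypothetical placeholders; only `--master` and `--deploy-mode` are actual spark-submit options.

```python
def spark_submit_command(master, deploy_mode, app, app_args=()):
    """Build a spark-submit argument list for the given deploy mode.

    master, app and app_args are placeholders chosen for illustration.
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return ["spark-submit",
            "--master", master,
            "--deploy-mode", deploy_mode,
            app, *app_args]


def driver_location(deploy_mode):
    """Say where the Spark driver runs for a given deploy mode."""
    if deploy_mode == "cluster":
        return "cluster node"
    if deploy_mode == "client":
        return "job launching environment"
    raise ValueError("deploy_mode must be 'client' or 'cluster'")


if __name__ == "__main__":
    for mode in ("cluster", "client"):
        cmd = spark_submit_command("spark://spark-master:7077", mode, "job.py")
        print(driver_location(mode), "<-", " ".join(cmd))
```

In both modes the executors run on cluster nodes, so the choice mainly affects where driver logs appear and which machine must stay up for the duration of the job.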
Spark cluster on OpenShift in cluster mode, using K8S as the resource manager
spark-submit to K8S
Diagram: the client submits to Kubernetes, which runs one driver pod (D) and several executor pods (E).
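With Kubernetes as the resource manager, spark-submit targets the Kubernetes API server rather than a Spark master. The sketch below shows what such a submission could look like. The API server address, namespace, image name, and application path are hypothetical; the `k8s://` master prefix and the three `spark.*` configuration keys are standard Spark-on-Kubernetes settings, though the slides do not say which settings were used at RBC.

```python
def k8s_submit_command(api_server, image, executors, app, namespace="default"):
    """Build a cluster-mode spark-submit argument list for Kubernetes.

    All argument values are illustrative placeholders.
    """
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",   # Kubernetes API server, not a Spark master
        "--deploy-mode", "cluster",          # the driver runs in a pod
        "--conf", f"spark.kubernetes.namespace={namespace}",
        "--conf", f"spark.kubernetes.container.image={image}",
        "--conf", f"spark.executor.instances={executors}",
        app,
    ]


if __name__ == "__main__":
    print(" ".join(k8s_submit_command(
        "https://openshift.example.com:8443",   # hypothetical API server
        "registry.example.com/spark:latest",    # hypothetical image
        5,
        "local:///opt/spark/app/job.py",        # hypothetical path inside the image
    )))
```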
Oshinko using the Spark standalone cluster manager with client mode
spark-submit to the Spark master (not K8S)
• Deploy the Spark cluster with the Oshinko CLI, starting with zero worker replicas
• Scale up worker pods
Diagram: client, master pod (M), and worker pods (W).
How did Oshinko compare against native K8S?
Allow for keytab-based HDFS security in Standalone mode: https://issues.apache.org/jira/browse/SPARK-5158
Kerberized HDFS as Storage for Spark – How Does YARN Do This?
Diagram components: Spark Client, YARN Application Master, Node Managers (Spark Executors), KDC, HDFS.
1. The Spark client gets a Kerberos ticket using kinit
2. The Spark client executes spark-submit with the options --keytab and --principal
3. YARN distributes HDFS delegation tokens to all worker nodes
4. Spark reads and stores data from and to HDFS using the delegation token
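Steps 1 and 2 can be sketched as two command lines. The principal, keytab path, and application name below are hypothetical, and obtaining the ticket from a keytab with `kinit -kt` is an assumption (the slide only says "using kinit"). `--principal` and `--keytab` are the spark-submit options named on the slide.

```python
def kinit_command(principal, keytab):
    """Step 1: obtain a Kerberos ticket (assumes keytab-based kinit)."""
    return ["kinit", "-kt", keytab, principal]


def yarn_submit_command(principal, keytab, app):
    """Step 2: submit to YARN, passing the principal and keytab to Spark."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--principal", principal,
        "--keytab", keytab,
        app,
    ]


if __name__ == "__main__":
    principal = "etl@EXAMPLE.COM"            # hypothetical principal
    keytab = "/etc/security/etl.keytab"      # hypothetical keytab path
    print(" ".join(kinit_command(principal, keytab)))
    print(" ".join(yarn_submit_command(principal, keytab, "job.py")))
```

Steps 3 and 4 need no user command: once the job is submitted this way, YARN handles token distribution to the executors.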
Kerberized HDFS as Storage for Spark – Scenario for Standalone
Diagram components: Spark standalone cluster on OpenShift (Spark Master and Spark Workers, each configured through spark-env.sh), Spark Client, KDC, HDFS.
1. Spark processes start with kinit and get an HDFS service ticket
2. Spark workers convert the Kerberos tickets to delegation tokens at startup using post hooks
3. The Spark client submits the job with the token location
4. Spark executors read and store data from and to HDFS using the generated delegation token
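The slide does not name the commands behind these steps, so the following is only one plausible sketch under stated assumptions: the worker hook runs `kinit` from a keytab and then `hdfs fetchdt` to write a delegation token file, and the client passes the token location to executors through the Hadoop environment variable `HADOOP_TOKEN_FILE_LOCATION` using Spark's `spark.executorEnv.*` setting. All paths, the principal, and the master URL are hypothetical.

```python
TOKEN_ENV = "HADOOP_TOKEN_FILE_LOCATION"  # read by Hadoop clients to locate a token file


def worker_startup_commands(principal, keytab, token_file):
    """Steps 1-2 (assumed commands): get a ticket, then write a delegation token."""
    return [
        ["kinit", "-kt", keytab, principal],
        ["hdfs", "fetchdt", token_file],
    ]


def submit_with_token(master, token_file, app):
    """Step 3 (assumed mechanism): tell executors where the token file lives."""
    return [
        "spark-submit",
        "--master", master,
        "--deploy-mode", "client",
        "--conf", f"spark.executorEnv.{TOKEN_ENV}={token_file}",
        app,
    ]


if __name__ == "__main__":
    token = "/tmp/hdfs.token"  # hypothetical token location
    for cmd in worker_startup_commands("etl@EXAMPLE.COM", "/etc/etl.keytab", token):
        print(" ".join(cmd))
    print(" ".join(submit_with_token("spark://spark-master:7077", token, "job.py")))
```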
Custom template using the Spark standalone cluster manager with client mode
spark-submit to the Spark master (not K8S)
• Deploy the Spark cluster from Spark-template.yaml, starting with zero worker replicas
• Scale up worker pods
Diagram: client, master pod (M), and worker pods (W).
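The scale-up, submit, scale-down cycle implied by this slide might be scripted roughly as follows. This is a sketch, not the presenters' tooling: the resource name `statefulset/spark-worker`, the master service name, and the application file are hypothetical, and the slides do not show the commands used. `oc scale --replicas=N <resource>` is the standard OpenShift scaling command, and 7077 is the default Spark standalone master port.

```python
def scale_workers_command(resource, replicas):
    """Build an `oc scale` call for the worker resource (name is hypothetical)."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    return ["oc", "scale", f"--replicas={replicas}", resource]


def job_lifecycle(resource, workers, master_host, app):
    """Scale workers up, submit in client mode to the Spark master, scale to zero."""
    return [
        scale_workers_command(resource, workers),
        ["spark-submit",
         "--master", f"spark://{master_host}:7077",  # Spark master, not K8S
         "--deploy-mode", "client",
         app],
        scale_workers_command(resource, 0),
    ]


if __name__ == "__main__":
    for cmd in job_lifecycle("statefulset/spark-worker", 5, "spark-master", "job.py"):
        print(" ".join(cmd))
```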
How do the 3 options compare? Standalone Stateful Set Allow for keytab-based HDFS security in Standalone mode https://issues.apache.org/jira/browse/SPARK-5158
Spark cluster on OpenShift – Logging
Diagram: master pod (M) and worker pods (W).
Spark cluster on OpenShift – Monitoring
• Using Prometheus and Grafana
• All pods expose metrics on port 7777
Diagram: master pod (M) and worker pods (W), each exposing port 7777.
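To illustrate what "all pods expose metrics on port 7777" means for Prometheus, the sketch below builds a scrape job with one target per pod. The job name and pod addresses are hypothetical, and a static target list is a simplification chosen for clarity; a cluster deployment would more likely rely on Prometheus's Kubernetes service discovery, which the slides do not detail.

```python
import json

METRICS_PORT = 7777  # port stated on the slide


def scrape_config(job_name, pod_ips, port=METRICS_PORT):
    """Build a Prometheus scrape job with one static target per pod."""
    return {
        "job_name": job_name,
        "static_configs": [
            {"targets": [f"{ip}:{port}" for ip in pod_ips]},
        ],
    }


if __name__ == "__main__":
    # Hypothetical pod IPs for one master and two workers.
    config = scrape_config("spark-cluster", ["10.0.0.10", "10.0.0.11", "10.0.0.12"])
    print(json.dumps({"scrape_configs": [config]}, indent=2))
```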
What was the result?
• Time to provision environment: up to 3 months before, 45 seconds after
• Scaling down to zero: not possible before, 30 seconds and automated after
• Hardware requirements: 33% less hardware
• Security scanning for each run: none before, automated scanning for every container on each deployment after
• Run times: variable before, predictable after