Running Spark Clusters in Containers with Docker
Silicon Valley Big Data Association Meetup, February 16, 2016
Tom Phelan (tap@bluedata.com) and Kartik Mathur (kartik@bluedata.com)
Outline
• Vocabulary
• Big Data New Realities
• Apache Spark
• Anatomy of a Spark Cluster
• Deployment Options: Public Cloud, On-Premises
• Demo
• Trade-Offs and Choices
Vocabulary
• Bare-Metal
• Virtual Machine (VM)
• Container
• Docker
• Microservice
• Monolithic (service)
Apache Spark
Apache Spark™ is a fast and general engine for large-scale data processing. Source: www.spark.apache.org
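For context, here is a minimal PySpark job of the kind such an engine runs. This is an illustrative sketch rather than material from the talk; the input and output paths are placeholders.

```python
# Minimal PySpark word count -- illustrative sketch; paths are placeholders.
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("wordcount-example"))

counts = (sc.textFile("hdfs:///data/input.txt")      # any Hadoop-compatible source
            .flatMap(lambda line: line.split())       # split lines into words
            .map(lambda word: (word, 1))              # pair each word with a count of 1
            .reduceByKey(lambda a, b: a + b))         # sum counts per word

counts.saveAsTextFile("hdfs:///data/wordcount-output")
sc.stop()
```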
Big Data Deployment Options
[survey chart] Source: Enterprise Strategy Group (ESG) Survey, 2015
Spark On-Premises
• Individual developers or data scientists build their own infrastructure on laptops, VMs, or bare-metal machines
• IT takes a bottom-up approach where everyone gets the same infrastructure/platform irrespective of their skill or use case
Why Change This Approach?
As the number of Spark users grows …
• IT needs to scale the deployment for additional use cases
• The application lifecycle requires dev/test/QA/prod environments
• Complexity overwhelms the organization, restricting adoption
Spark Adoption On-Premises
• Prototyping: get started with Spark for initial use cases and users
  • Evaluation, testing, development, and QA
  • Prototype multiple data pipelines quickly
• Departmental: dev/test and pre-production
  • Spin up dev/test clusters with a replica image of production
  • QA/UAT using production data without duplication
  • Offload specific users and workloads from production
• Spark-as-a-Service: multi-tenant Spark deployment on-premises, Spark in a secure production environment
  • LOB multi-tenancy with strict resource allocations
  • Bare-metal performance for business-critical workloads
  • Self-service, shared infrastructure with strict access controls
Big Data New Realities
Big Data Traditional Assumptions | Big Data New Realities | New Benefits and Value
Bare-metal | Containers and VMs | Big-Data-as-a-Service
Data locality | Compute and storage separation | Agility and cost savings
Data on local disks | In-place access on remote data stores | Faster time-to-insights
New Realities, New Requirements
• Software flexibility
  • Multiple distros, Hadoop and Spark, multiple configurations
  • Support new versions and apps as soon as they are available
• Multi-tenant support
  • Data access and network security
  • Differential Quality of Service (QoS)
• Stability, Scalability, Cost, Performance, and Security are always important
Big Data Deployment – Public Cloud
• Hadoop-as-a-Service
  • Amazon Web Services EC2 and EMR
  • Microsoft Azure HDInsight
  • Google Cloud Dataproc
  • IBM Bluemix ... and others
• Spark-as-a-Service
  • All of the above
  • Databricks
Big Data Deployment – On-Premises
• Bare-Metal
• Virtual Machines
  • VMware Big Data Extensions
  • OpenStack Sahara
• Containers
  • Mesos
  • BlueData
Running Spark in Cluster Mode
[diagram] Source: http://spark.apache.org/docs/1.3.0/cluster-overview.html
Common Deployment Patterns
Most common Spark deployment environments (cluster managers): Standalone mode 48%, YARN 40%, Mesos 11%
Source: Spark Survey Report, 2015 (Databricks)
Spark Cluster – Standalone Mode
[diagram: a Spark client and a Spark master, each on bare metal or a virtual machine, with three Spark slave nodes each running multiple tasks]
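A rough sketch of how a driver attaches to a standalone master follows; the host name, port, and resource settings are placeholders, not values from the talk.

```python
# Driver connecting to a Spark standalone master -- host/port and resource
# settings are placeholders; 7077 is the default standalone master port.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("standalone-example")
        .setMaster("spark://spark-master.example.com:7077")
        .set("spark.executor.memory", "2g")
        .set("spark.cores.max", "4"))

sc = SparkContext(conf=conf)
print(sc.parallelize(range(1000)).sum())
sc.stop()
```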
Spark Cluster – Hadoop YARN
[diagram: a Spark client and Spark master submitting to the YARN Resource Manager, with three Node Managers each hosting a Spark executor running multiple tasks]
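In practice a YARN application is usually launched with spark-submit --master yarn; the sketch below is a rough Spark 1.x-era Python equivalent, assuming HADOOP_CONF_DIR points at the cluster configuration (all other values are placeholders).

```python
# Spark-on-YARN driver sketch (Spark 1.x style) -- assumes HADOOP_CONF_DIR
# points at the cluster's YARN/HDFS configuration; values are placeholders.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("spark-on-yarn-example")
        .setMaster("yarn-client")                  # driver local, executors in YARN containers
        .set("spark.executor.instances", "3")
        .set("spark.executor.memory", "2g"))

sc = SparkContext(conf=conf)
print(sc.textFile("hdfs:///data/events").count())  # placeholder HDFS path
sc.stop()
```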
Spark Multi-Cluster + YARN
[diagram: multiple Spark clusters on YARN, each with a controller and worker nodes]
Spark Cluster – Mesos
[diagram, shown as a progressive build: a Spark client with the Spark scheduler (the Spark framework for Mesos) registering with the Mesos master, and three Mesos slaves each hosting a Spark executor running multiple tasks]
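A hedged sketch of a driver targeting a Mesos master follows; the URL and settings are placeholders, and a Spark install (or spark.executor.uri) on each slave is assumed.

```python
# Spark-on-Mesos driver sketch -- the Mesos master URL is a placeholder;
# 5050 is the default Mesos master port.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("spark-on-mesos-example")
        .setMaster("mesos://mesos-master.example.com:5050")
        .set("spark.executor.memory", "2g"))

sc = SparkContext(conf=conf)
print(sc.parallelize(range(10000)).map(lambda x: x * x).sum())
sc.stop()
```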
Spark Cluster – Hadoop YARN (in Virtual Machines)
[diagram: a Spark client/Zeppelin, the Resource Manager, and the Spark master running in virtual machines, with three VM-hosted Node Managers each running a Spark executor and its tasks]
AWS Spark-as-a-Service: Benefits
• Amazon EC2 Container Service (ECS)
  • Launch containers on EC2
  • Amazon EC2 Container Registry (ECR): Docker images
• Amazon Elastic MapReduce (EMR)
• Easy to use (see the sketch below)
• Low startup costs: hardware and human
• Expandable
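To illustrate the "easy to use, low startup cost" point, something like the boto3 sketch below can launch a short-lived EMR cluster with Spark installed. All names, roles, instance types, and the release label are placeholder assumptions; check the EMR API documentation for the exact parameters your account requires.

```python
# Launch a transient EMR cluster with Spark -- a hedged sketch; names, roles,
# instance types, and the release label are placeholder assumptions.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spark-demo",
    ReleaseLabel="emr-4.3.0",                   # an EMR release bundling Spark 1.6
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,   # short-running cluster
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```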
AWS Spark-as-a-Service: Challenges
• Data access
  • Already exists in S3 (see the sketch below)
  • Ingest time
  • Data security
• Software versions
  • Spark 1.6.0, Hadoop 2.7.1; MapR
• Cost
  • Short-running vs. long-running clusters
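On the data-access point, Spark can read S3 objects in place instead of ingesting them first; a rough sketch follows (the bucket, key pattern, and credential handling are placeholders).

```python
# Reading data in place from S3 -- bucket and key pattern are placeholders;
# assumes AWS credentials are supplied via the environment or Hadoop config.
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("s3-read-example"))

# s3n:// was common in the Spark 1.6 / Hadoop 2.7 era; newer stacks use s3a://.
events = sc.textFile("s3n://example-bucket/logs/2016/02/*.gz")
print(events.filter(lambda line: "ERROR" in line).count())
sc.stop()
```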
On-Premises – Spark + Containers + DCOS
Microservices deployment: Spark with Docker and Kubernetes/Swarm/Mesos
Spark Cluster – Mesos
[diagram: a Spark client with the Spark scheduler, the Mesos master, and three Mesos slaves, with each Spark executor and its tasks running inside containers]
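Spark's Mesos integration can pull the executor image from a Docker registry via the spark.mesos.executor.docker.image property. The sketch below assumes Mesos slaves with the Docker containerizer enabled; the image name and master URL are placeholders.

```python
# Spark executors in Docker containers on Mesos -- image name and Mesos master
# URL are placeholders; requires slaves configured with the Docker containerizer.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("spark-mesos-docker-example")
        .setMaster("mesos://mesos-master.example.com:5050")
        .set("spark.mesos.executor.docker.image", "example/spark:1.6.0")
        .set("spark.executor.memory", "2g"))

sc = SparkContext(conf=conf)
print(sc.parallelize(range(100)).count())
sc.stop()
```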
Spark + Docker + DCOS: Benefits
• Easy to set up a dev/demonstration environment
• Mesos framework for Spark available
• Container isolation
• Most of the pieces are available
• Complete control
• Customization
Spark + Docker + DCOS: Challenges
• Can be difficult to set up a production environment
• Multi-tenancy, QoS
• Software interoperability
• Container cluster network connectivity and security
Spark + Docker + Mesos: Challenges
[diagram: a Mesos master running the Marathon scheduler, with two Mesos slaves; each slave runs Mesos executors and schedulers with containerized tasks, plus containerized Name Node and Data Node services]
Spark + Docker + Mesos + Myriad
[diagram: the same layout with a Myriad scheduler added alongside the Marathon scheduler on the Mesos master; the slaves run many containerized tasks under Mesos executors, with containerized Name Node and Data Node services]
Spark + Docker + Mesos (microservice)
[diagram: the microservice variant, where individual jobs and tasks run in their own containers across the Mesos slaves under the Myriad and Marathon schedulers, alongside containerized Name Node and Data Node services]
On-Premises – Spark + Containers + BlueData
Monolithic deployment: Spark-as-a-Service in an on-premises deployment
Spark – Standalone with Containers
[diagram: the standalone-mode layout, with the Spark client, Spark master, and Spark slaves each running in containers on bare metal or virtual machines]
Spark + Docker + BlueData: Benefits
• Enterprise quality
• Deployment flexibility (on physical servers or VMs)
• Network connectivity
  • Persistent IP addresses
  • Externally visible IP addresses
  • No NAT required
• Cloud-like experience: Spark-as-a-Service
  • Self-service access to instant clusters, simple web UI
Spark + Docker + BlueData: Benefits
• Docker packaging of images
  • Distribution agnostic
  • Spark, Kafka, Cassandra, Zeppelin, and more
  • With or without YARN
• Bring your own BI/analytics tool
• Currently only on-premises
  • Future: on-premises, public cloud, or hybrid
Spark + Docker + BlueData: Benefits
• Multi-tenancy
  • Per-tenant QoS, not per-service
  • Private VLAN per tenant
  • Limit data access
• HA, software upgrades, data access, …
• BlueData's DataTap isolates data from compute
  • Upgrade compute independent of data
Trade-Offs (Not Unique to Spark)
[diagram: trade-off spectrums from Open Source to Proprietary and from On-Premises to Public Cloud, spanning less stable to more stable, less cost to more cost, and benefits realized now versus later]
Use Cases: Choice of Deployment
• Just Spark, just works, no customizations
  • Public cloud
• Lots of customizations, willing to tinker, limited QoS
  • Open source, microservice, Mesos
• Configurable, flexible, enterprise multi-tenancy
  • Monolithic (for the moment) container deployment
Thank You www.bluedata.com Try BlueData EPIC for Free: bluedata.com/free