Hadoop in the Cloud: Good Fit or Round Peg in a Square Hole? • Tom Phelan – co-founder & chief architect • Joel Baxter – software engineer
Outline • Traditional vs. New Deployment Models for Big Data • Big Data Use Case Characteristics • Rules for Mapping Use Cases to Deployment Models • Deployment Options Selected for Specific Use Cases • What’s Next … • Big Data Deployment Model Evolution
Traditional Hadoop Deployment Model • On-premises with bare metal • Co-location of compute & storage resources • High performance • Resources cannot be scaled independently • Challenging to manage, scale, upgrade • Used by a single big data application • Leads to proliferation of big data clusters
Traditional Hadoop Deployment Architecture • [Diagram: servers each running compute daemons (NodeManager, ResourceManager) co-located with storage daemons (DataNode, NameNode) on local storage]
Enterprises Use Multiple Deployment Models Today • Source: Enterprise Strategy Group (ESG) Survey
New Hadoop / Big Data Deployment Models • Public cloud or on-premises with virtualization • Separation of compute & storage resources • Resources scaled independently • Potential impact on performance* • Easy to manage, scale, upgrade • Easy to support multiple big data clusters * we’ll explore this next
Hadoop Deployment Architecture in Public Cloud • [Diagram: virtual machines running NodeManager and ResourceManager, accessing a shared file system or object storage via hdfs:// or s3a:// URIs]
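To make the decoupled compute/storage model concrete, here is a minimal PySpark sketch of the same job reading from co-located HDFS versus remote object storage through the s3a:// connector. The bucket name, paths, and credentials are hypothetical, and real cloud deployments usually rely on instance roles or credential providers rather than inline keys.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hdfs-vs-s3a-read")
    # Illustrative only: hypothetical credentials for the s3a connector.
    .config("spark.hadoop.fs.s3a.access.key", "EXAMPLE_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "EXAMPLE_SECRET_KEY")
    .getOrCreate()
)

# Same job logic, two storage back ends:
# traditional model, compute and storage co-located on cluster nodes
events_hdfs = spark.read.parquet("hdfs:///data/events/")
# public cloud model, storage decoupled from compute (hypothetical bucket)
events_s3 = spark.read.parquet("s3a://example-bucket/events/")

print(events_hdfs.count(), events_s3.count())
```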
Hadoop Deployment Architecture with Virtualization • [Diagram: virtual machines running NodeManager, ResourceManager, DataNode, and NameNode, with each VM's vdisks backed by shared block storage]
Key Trade-offs for Traditional vs. New Deployment • Performance • Agility, expandability, manageability Can you get both? Yes!
WHAT ABOUT PERFORMANCE? • Hadoop in Public Cloud (Google) vs. Traditional Hadoop • Accenture Study • Source: Accenture Labs, Cloud-Based Hadoop Deployments Study
WHAT ABOUT PERFORMANCE? • Hadoop On-Premises with Virtualization and Shared Storage vs. Traditional Hadoop • VCE Vblock with EMC Isilon Scale-out NAS vs. Traditional Hadoop • Source: ESG Lab Review, VCE Vblock with EMC Isilon for Hadoop
Examples: Performance Results • Quotes from technology labs and a US federal agency lab: • “In fact, we got better results with EMC Isilon than we would have in a traditional cluster.” • “I/O virtualization overhead did not cause a noticeable reduction in performance during our experiment.” • “In addition, our experiments prove that using remote storage to make data highly available outperforms local disk HDFS relying on data locality.” • “In fact, the performance against enterprise NFS is as good as or better than local HDFS in all cases.”
Traditional vs. New Big Data Deployment Models • There are reasons to choose one deployment model over the other but, depending on your use case, performance may not be one of them.
Hadoop in Public Cloud Tradeoffs • Cost • Potentially lower initial costs than bare metal • Ongoing operational costs can be significant • Data security / regulatory / governance requirements • Examples: AWS, Microsoft Azure, Google, many more
On-Premises with Virtualization: VM or Container • [Diagram: virtual machines vs. containers] • Source: https://www.docker.com/what-docker
Hadoop On-Premises with Virtualization Tradeoffs • Cost • Initial costs and ongoing operational costs lower than bare metal • Performance and security tradeoffs depend on virtualization technology • Hypervisor (Virtual Machines) • Performance: CPU tax • Security: Strong isolation and fault containment • Examples: VMware BDE, OpenStack Sahara • Linux Containers • Performance: No CPU tax • Security: Isolation and fault containment still developing • Example: BlueData EPIC
Key Big Data Use Case Characteristics • Cluster creation / expansion / upgrade time • Job runtime • Data security • Cost • These are important decision factors when selecting a big data deployment environment
Cluster Creation / Expansion / Upgrade Time • How often will new software be deployed? • How long will deployment take? • How fast can an existing cluster be expanded? • How fast can the software in an existing cluster be upgraded?
Job Runtime • Nobody wants their job to run slowly • There is always a performance vs. cost tradeoff • Lower-cost, higher-performance SSDs and networks have changed the traditional tradeoff assumptions • Job runtime includes the time required to: • Ingest data into the cluster, if required • Export data out of the cluster, if required
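As a back-of-the-envelope illustration of counting ingest and export in total job runtime, here is a small Python sketch. The stage functions are hypothetical stand-ins; in practice they would be distcp or object-store copies and the actual MapReduce or Spark job.

```python
import time
from contextlib import contextmanager

# Hypothetical stage stubs: sleeps stand in for real work.
def ingest_data():
    time.sleep(0.10)   # copy input into the cluster (skip if data is already in place)

def run_job():
    time.sleep(0.30)   # the analytics job itself

def export_results():
    time.sleep(0.05)   # move results out to the system of record

@contextmanager
def timed(stage, totals):
    start = time.perf_counter()
    yield
    totals[stage] = time.perf_counter() - start

totals = {}
with timed("ingest", totals):
    ingest_data()
with timed("compute", totals):
    run_job()
with timed("export", totals):
    export_results()

print(totals)
print("end-to-end runtime: %.2fs" % sum(totals.values()))
```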
Data Security • HIPAA, ISO 27001, etc. • Kerberos • Multiple physical copies of the data • “Each copy of my data is another security violation”
Cost • Initial Cost: CapEx • Creation of the first cluster • Continuing Cost: OpEx • If the cluster is not the storage system of record, then data duplication costs • Expansion Cost: CapEx/OpEx • Creation of second, third, fourth, etc. cluster
Rules of Thumb: General Guidelines Only • If there is one, and only one, cluster: • On-premises deployment on bare metal • If there is more than one cluster: • Public cloud or on-premises with virtualization • If agility and rapid cluster creation are priorities: • Public cloud or on-premises with virtualization • If minimal upfront cost is a priority: • On-premises deployment with bare metal is out
Rules of Thumb (cont’d) • If there are strong data security requirements: • Off-premises deployment in public cloud is out (in general) • If direct access to hardware features is needed: • On-premises deployment with bare metal or with virtualization using Linux containers • If strong cluster security / fault containment is required: • On-premises deployment with bare metal or with virtualization using hypervisor • (a rough sketch of this decision logic follows)
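The rules of thumb above can be read as a simple filter over candidate deployment models. The Python sketch below is illustrative only: the requirement fields, option names, and filtering choices are assumptions drawn from the two slides, not a definitive selection algorithm.

```python
from dataclasses import dataclass

@dataclass
class Requirements:
    cluster_count: int = 1
    agility_is_priority: bool = False
    minimal_upfront_cost: bool = False
    strong_data_security: bool = False
    direct_hardware_access: bool = False
    strong_fault_containment: bool = False

def candidate_deployments(req: Requirements) -> set:
    """Filter deployment models using the rules of thumb (roughly encoded)."""
    options = {"bare metal", "public cloud",
               "on-prem VMs (hypervisor)", "on-prem containers"}
    if req.cluster_count > 1 or req.agility_is_priority or req.minimal_upfront_cost:
        options.discard("bare metal")                 # multiple clusters / agility / low upfront cost
    if req.strong_data_security:
        options.discard("public cloud")               # off-premises is out, in general
    if req.direct_hardware_access:
        options -= {"public cloud", "on-prem VMs (hypervisor)"}   # bare metal or containers
    if req.strong_fault_containment:
        options -= {"public cloud", "on-prem containers"}         # bare metal or hypervisor
    return options

# Example: several clusters, strict data security, strong fault containment required.
print(candidate_deployments(Requirements(cluster_count=5,
                                         strong_data_security=True,
                                         strong_fault_containment=True)))
# -> {'on-prem VMs (hypervisor)'}
```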
How do these Big Data Deployment Options Fit Various Use Cases?
Use Case: Large Internet Portal • Large amounts of data generated on-prem • Configuration dominated by one giant cluster • In-house customized analytics & management software • Army of IT staff familiar w/ software • Only Apache Hadoop distro in use • Always looking for max performance • Big budget • Deployment: • On-premises with bare metal
Use Case: Netflix • Lots of data • Mostly no data security standards • Data is already in the cloud • Large and frequent capacity expansion and contraction • Variable load • Deployment: • Public cloud
Use Case: Large U.S. Financial Institution • Data security requirements: personal and financial data • Need many clusters • Want to offload low priority jobs from production cluster • But low priority jobs need to access production data • Want to avoid copying production data • Naive users • Clusters have internet access • Potential download of harmful software • Deployment: • On-premises with virtualization, using hypervisor
Use Case: Medical Company • Data security requirements: HIPAA • Need many clusters • Want to offload low priority jobs from production cluster • But low priority jobs need to access production data • Want to avoid copying production data • Need to use custom tools and analytics applications • Want to have tools deployed automatically • Deployment: • On-premises with virtualization, using containers or hypervisor
Use Case: State University • Data security requirements: none • Teaching environment: need many short-lived clusters • Students responsible for cluster creation • Students forget to delete clusters • Deployment: • On-premises with virtualization, using containers or hypervisor
Use Case: National Government Lab • Data security requirements: government regulations • Need multiple clusters • Want to offload low priority jobs from production cluster • But low priority jobs need to access production data • Want to avoid copying production data • Need to leverage CPU acceleration hardware • GPUs • Cannot make copies of data. Must be processed in place • Deployment: • On-premises with containers accessing remote storage
Big Data Deployment Model Evolution • In the enterprise, users want more … • They want the benefits of the public cloud: • Self-service, elasticity, and lower upfront cost • They also want: • Multi-tenancy, data security, access to shared storage • Quotas, prioritization, QoS / resource-sharing • They want more than Hadoop ... now it’s Spark + more • Increasingly, they may want on-premises + public cloud
Big-Data-as-a-Service (BDaaS) is Emerging • More than Hadoop-as-a-Service (and much more than IaaS) • Enable Spark-as-a-Service, Analytics-as-a-Service, and more • Multiple frameworks (Hadoop, Spark, Flink, Samza, Cassandra, etc.) • Both batch and real-time, support for BI / ETL / analytics tools, etc. • Integrated stack configured for big data jobs • Provide & hide the appropriate options for admins and end-users • Multi-tenant security and governance for enterprise needs • With QoS controls and priority management of users, jobs, and data • Support for ‘hybrid’ cloud model
Before too much longer, we may be asking whether bare metal Hadoop is the round peg in the square hole …
Q & A • twitter: @tapbluedata • email: tap@bluedata.com • email: joel@bluedata.com • www.bluedata.com • Visit our booth in the Expo