Multi-Data-Center Hadoop in a Snap

Dr. Konstantin Boudnik, Vice President, Open Source Development

My background
• 15-year Sun Microsystems veteran: JVM, distributed systems
• Vice President, Apache Bigtop
• Committer, PMC member, and contributor to various ASF projects
• Member of Apache IPMC
• Early Hadoop committer

WANdisco Background
• WANdisco: Wide Area Network Distributed Computing
• Enterprise-ready, high-availability software solutions that enable globally distributed organizations to meet today’s data challenges of secure storage, scalability and availability
• Leader in tools for software engineers – Subversion
• Apache Software Foundation sponsor
• Highly successful IPO, London Stock Exchange, June 2012 (LSE:WAND)
• US patent for active-active replication technology granted, November 2012
• Global locations
  • San Ramon (CA)
  • Chengdu (China)
  • Tokyo (Japan)
  • Boston (MA)
  • Sheffield (UK)
  • Belfast (UK)

Customers

Non-Stop Hadoop
Non-Intrusive Plugin Provides Continuous Availability
• In the LAN / Across the WAN
• Active/Active

3 Key Problems For Multi Cluster Hadoop LAN / WAN

Enterprise Ready Hadoop
Characteristics of Mission Critical Applications
• Require 100% Uptime of Hadoop
  • SLAs, Regulatory Compliance
• Require HDFS to be Deployed Globally
  • Share Data Between Data Centers
  • Data is Consistent, Not Eventually Consistent
• Ease Administrative Burden
  • Reduce Operational Complexity
  • Simplify Disaster Recovery
  • Lower RTO/RPO
• Allow Maximum Utilization of Resource
  • Within the Data Center
  • Across Data Centers

Breaking Away from Active/Passive
What’s in a NameNode

Single Standby
• Inefficient utilization of resource
  • Journal Nodes
  • ZooKeeper Nodes
  • Standby Node
• Performance Bottleneck
• Still tied to the beeper
• Limited to LAN scope

Active / Active
• All resources utilized
• Only NameNode configuration
• Scale as the cluster grows
• All NameNodes active
• Load balancing
• Set resiliency (# of active NN)
• Global Consistency

Breaking Away from Active/Passive
What’s in a Data Center

Standby Datacenter
• Idle Resource
  • Single Data Center Ingest
  • Disaster Recovery Only
• One way synchronization
  • DistCp
• Error Prone
  • Clusters can diverge over time
• Difficult to scale > 2 Data Centers
  • Complexity of sharing data increases

Active / Active
• DR Resource Available
  • Ingest at all Data Centers
  • Run Jobs in both Data Centers
• Replication is Multi-Directional
  • active/active
• Absolute Consistency
  • Single HDFS spans locations
• ‘N’ Data Center support
  • Global HDFS allows appropriate data to be shared
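The contrast between one-way synchronization and multi-directional replication can be sketched with a toy model. This is only an illustration under loud assumptions: plain Python dicts stand in for the HDFS namespace at each data center, and the function names `one_way_sync` and `apply_in_agreed_order` are hypothetical, not part of Hadoop, DistCp, or Non-Stop Hadoop. It does not show how the product reaches agreement on ordering; it only shows why a periodic copy leaves a stale window while applying the same operations in the same order everywhere does not.

```python
# Toy model only: dicts stand in for the HDFS namespace at two data centers.
# All names here are hypothetical and exist purely for illustration.

def one_way_sync(primary, standby):
    """Periodic DistCp-style copy: the standby becomes a snapshot of the primary."""
    standby.clear()
    standby.update(primary)


def apply_in_agreed_order(ops, *clusters):
    """Active/active sketch: every cluster applies the same ops in the same order."""
    for path, value in ops:
        for cluster in clusters:
            cluster[path] = value


# Active/passive: writes land on the primary between sync runs.
primary, standby = {}, {}
primary["/logs/a"] = "v1"
one_way_sync(primary, standby)
primary["/logs/b"] = "v1"  # written after the last sync, so the standby lacks it
stale_paths = sorted(set(primary) - set(standby))

# Active/active: writes may originate at either site but are applied at both.
dc1, dc2 = {}, {}
apply_in_agreed_order([("/logs/a", "v1"), ("/logs/b", "v1")], dc1, dc2)
```

In this model `stale_paths` holds whatever the standby would lose if the primary failed before the next sync, while `dc1` and `dc2` stay identical after every operation.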

Multiple Clusters
One Cluster Approach
• Example Applications
  • HBase
  • RT Query
  • MapReduce
• Poor Resource Management
• Data Locality Issues
• Network Use
• Complex

Multiple Clusters
Creating Multiple Clusters
• Example Applications
  • HBase
  • RT Query
  • MapReduce
• Need to share data between clusters
• DistCp / Stale Data
• Inefficient use of storage and/or network
• Some clusters may not be available

Cluster Zones
Zoning for Optimal Efficiency
• 100% HDFS Consistency

Multi Datacenter Hadoop
Disaster Recovery
• Absolute Consistency
• Maximum Resource Use
• Lower Recovery Time/Point
WAN Replication
• Replicate Only What You Want
• Better Utilization of Power/Cooling
• Lower TCO
• LAN Speed Performance

Architecture of a Non-Stop Hadoop

Technical Use Cases
• Eliminate Performance Bottleneck
  • HBase issues
• Multi Data-Center Ingest
  • Information doesn't need to be sent to one DC and then copied back to the other using DistCp
  • Parallel ingest methods don’t require redirected data streams
  • Ingest data at, or close to, the source
  • Global Analysis (Logs, Click Streams, etc.)
• Cluster Zones
  • Efficient use of resource based on application profile
  • HBase, MapReduce, Spark, etc.
• Maximize Data Center Resource Utilization
  • All datacenters can be used to run different jobs concurrently
• Disaster Recovery
  • Data is as current as possible (no periodic syncs)
  • Virtually zero downtime to recover from regional data center failure
  • Regulatory compliance
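To make the "no periodic syncs" point concrete, here is a back-of-the-envelope estimate of the data-loss window (RPO) for a schedule-driven one-way copy. It is a rough sketch, not a measurement: the function name and the example numbers are hypothetical, and it assumes a copy captures the source as of its start time and that a partially completed copy is not usable for recovery.

```python
def worst_case_rpo_minutes(sync_interval_min, copy_duration_min):
    """Upper bound on the data-loss window for a periodic one-way copy.

    A write made just after a copy starts is only protected once the NEXT
    copy finishes, i.e. one full interval plus one copy duration later.
    """
    return sync_interval_min + copy_duration_min


# Hypothetical schedule: a copy every 60 minutes, each taking 15 minutes.
hourly = worst_case_rpo_minutes(60, 15)
```

Under these assumptions the hourly schedule can lose up to 75 minutes of writes. With continuous replication there is no schedule term at all, so the window shrinks to whatever is still in flight over the WAN.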

Non-Stop Hadoop Demonstration

Q & A

Thank you
