Multi-Data-Center Hadoop in a Snap
Dr. Konstantin Boudnik, Vice President, Open Source Development
My background • 15-year Sun Microsystems veteran: JVM, distributed systems • Vice President, Apache Bigtop • Committer, PMC member & contributor to various ASF projects • Member of Apache IPMC • Early Hadoop committer
WANdisco Background • WANdisco: Wide Area Network Distributed Computing • Enterprise-ready, high-availability software solutions that enable globally distributed organizations to meet today’s data challenges of secure storage, scalability and availability • Leader in tools for software engineers – Subversion • Apache Software Foundation sponsor • Highly successful IPO, London Stock Exchange, June 2012 (LSE:WAND) • US patent for active-active replication technology granted, November 2012 • Global locations • San Ramon (CA) • Chengdu (China) • Tokyo (Japan) • Boston (MA) • Sheffield (UK) • Belfast (UK)
Non-Stop Hadoop • Non-Intrusive Plugin • Provides Continuous Availability • In the LAN / Across the WAN • Active/Active
3 Key Problems For Multi Cluster Hadoop LAN / WAN
Enterprise Ready Hadoop Characteristics of Mission Critical Applications • Require 100% Uptime of Hadoop • SLAs, Regulatory Compliance • Require HDFS to be Deployed Globally • Share Data Between Data Centers • Data is Consistent, Not Eventually Consistent • Ease Administrative Burden • Reduce Operational Complexity • Simplify Disaster Recovery • Lower RTO/RPO • Allow Maximum Utilization of Resources • Within the Data Center • Across Data Centers
Breaking Away from Active/Passive: What’s in a NameNode
Single Standby
• Inefficient utilization of resource
• Journal Nodes
• ZooKeeper Nodes
• Standby Node
• Performance Bottleneck
• Still tied to the beeper
• Limited to LAN scope
Active / Active
• All resources utilized
• Only NameNode configuration
• Scale as the cluster grows
• All NameNodes active
• Load balancing
• Set resiliency (# of active NN)
• Global Consistency
Breaking Away from Active/Passive: What’s in a Data Center
Standby Datacenter
• Idle Resource
• Single Data Center Ingest
• Disaster Recovery Only
• One way synchronization
• DistCp
• Error Prone
• Clusters can diverge over time
• Difficult to scale > 2 Data Centers
• Complexity of sharing data increases
Active / Active
• DR Resource Available
• Ingest at all Data Centers
• Run Jobs in both Data Centers
• Replication is Multi-Directional active/active
• Absolute Consistency
• Single HDFS spans locations
• ‘N’ Data Center support
• Global HDFS allows appropriate data to be shared
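The difference between one-way synchronization and active/active replication can be illustrated with a toy model. The sketch below is not WANdisco's implementation and does not use any Hadoop API: the `Replica` class, the operation tuples, and both helper functions are hypothetical names invented for illustration. It assumes only that active/active replication delivers the same operations to every site in one agreed order, which is what lets all sites hold the same namespace.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical stand-in for one data center's view of the file namespace."""
    name: str
    files: dict = field(default_factory=dict)

    def apply(self, op):
        # op is a (kind, path, data) tuple; only "write" and "delete" are modeled.
        kind, path, data = op
        if kind == "write":
            self.files[path] = data
        elif kind == "delete":
            self.files.pop(path, None)


def one_way_sync(primary, standby):
    """Periodic copy in one direction: the standby takes the primary's current state."""
    standby.files = dict(primary.files)


def replicate_in_agreed_order(replicas, ordered_ops):
    """Assumed active/active model: every replica applies the same ops in the same order."""
    for op in ordered_ops:
        for replica in replicas:
            replica.apply(op)


# One-way copy: anything that happens between syncs makes the sites differ.
primary, standby = Replica("dc1"), Replica("dc2")
primary.apply(("write", "/logs/a", "v1"))
one_way_sync(primary, standby)
standby.apply(("write", "/logs/b", "local"))  # ingest at the standby site
primary.apply(("write", "/logs/a", "v2"))     # update at the primary after the sync
diverged = primary.files != standby.files

# Agreed order: the same three operations, applied identically at both sites.
dc1, dc2 = Replica("dc1"), Replica("dc2")
ops = [
    ("write", "/logs/a", "v1"),
    ("write", "/logs/b", "local"),
    ("write", "/logs/a", "v2"),
]
replicate_in_agreed_order([dc1, dc2], ops)
consistent = dc1.files == dc2.files

print("one-way sync diverged:", diverged)
print("agreed-order replicas consistent:", consistent)
```

How the agreed order is reached across a WAN is the hard part and is deliberately left out of this sketch.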
Multiple Clusters One Cluster Approach • Example Applications • HBase • RT Query • MapReduce • Poor Resource Management • Data Locality Issues • Network Use • Complex
Multiple Clusters Creating Multiple Clusters • Example Applications • HBase • RT Query • MapReduce • Need to share data between clusters • DistCp / Stale Data • Inefficient use of storage and/or network • Some clusters may not be available
Cluster Zones Zoning for Optimal Efficiency • 100% HDFS Consistency
Multi Datacenter Hadoop Disaster Recovery Absolute Consistency Maximum Resource Use Lower Recovery Time/Point WAN REPLICATION Replicate Only What You Want Better Utilization of Power/Cooling Lower TCO LAN Speed Performance
Technical Use Cases • Eliminate Performance Bottleneck • HBase issues • Multi Data-Center Ingest • Information doesn't need to be sent to one DC and then copied back to the other using DistCp • Parallel ingest methods don’t require redirected data streams • Ingest data at, or close to, the source • Global Analysis (Logs, Click Streams, etc.) • Cluster Zones • Efficient use of resources based on application profile • HBase, MapReduce, Spark, etc. • Maximize Data Center Resource Utilization • All data centers can be used to run different jobs concurrently • Disaster Recovery • Data is as current as possible (no periodic syncs) • Virtually zero downtime to recover from regional data center failure • Regulatory compliance
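To make the "no periodic syncs" point concrete, here is a small back-of-the-envelope model of how much data a scheduled one-way copy can leave behind when the source site fails. It is a simplified sketch under stated assumptions, not a measurement: the function name `lost_writes`, the minute-based timeline, and the sample numbers are all hypothetical. It assumes each copy captures everything written up to the moment it starts and only becomes usable at the remote site once it finishes.

```python
def lost_writes(write_times, sync_starts, copy_duration, failure_time):
    """Return the write timestamps missing at the remote site when the source fails.

    Assumption: a copy started at time s contains all writes with t <= s and is
    usable remotely only if it finished (s + copy_duration) before the failure.
    """
    completed = [s for s in sync_starts if s + copy_duration <= failure_time]
    if not completed:
        # No copy ever finished, so every write before the failure is missing.
        return [t for t in write_times if t <= failure_time]
    last_snapshot = max(completed)
    return [t for t in write_times if last_snapshot < t <= failure_time]


# Hypothetical timeline in minutes: a write every 10 minutes, a copy every hour
# that takes 15 minutes, and a site failure at minute 70.
writes = list(range(0, 121, 10))
lost = lost_writes(writes, sync_starts=[0, 60, 120], copy_duration=15, failure_time=70)
print(lost)
```

In this model the copy that started at minute 60 had not finished when the failure hit, so the remote site falls back to the minute-0 copy. The exposure window is therefore bounded by the sync interval plus the copy duration rather than by the interval alone.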