
Hadoop YARN in the Cloud

Junping Du, Staff Engineer, VMware. China Hadoop Summit, 2013.


Presentation Transcript


  1. Hadoop YARN in the Cloud • Junping Du, Staff Engineer, VMware • China Hadoop Summit, 2013

  2. Agenda • Hadoop YARN – Hub for Big Data Applications • YARN and Cloud Computing • HVE (Hadoop Virtualization Extension) work on YARN

  3. Hadoop MapReduce v1 (Classic) • JobTracker: manages cluster resources and job scheduling • TaskTracker: per-node agent that manages tasks

  4. MapReduce v1 Limitations • Scalability: the JobTracker manages both cluster resources and job scheduling • SPOF (Single Point Of Failure): a JobTracker failure causes all queued and running jobs to fail; restart is very tricky due to complex state • Hard partition of resources into map and reduce slots: low resource utilization • Lacks support for alternate paradigms • Lacks wire-compatible protocols
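
The "hard partition into map and reduce slots" limitation can be illustrated with a little arithmetic. The following is a toy sketch, not Hadoop code: the function names and the slot counts are assumptions chosen for illustration. It compares how many slots are busy when slots are typed (MRv1 style) with the case where the same capacity is one shared pool.

```python
def typed_slot_utilization(map_slots, reduce_slots, pending_maps, pending_reduces):
    """Fraction of slots busy when map and reduce slots are separate pools."""
    busy = min(map_slots, pending_maps) + min(reduce_slots, pending_reduces)
    return busy / (map_slots + reduce_slots)


def pooled_utilization(total_slots, pending_maps, pending_reduces):
    """Fraction of slots busy when any slot can run any task."""
    busy = min(total_slots, pending_maps + pending_reduces)
    return busy / total_slots


# Hypothetical node: 8 map slots, 4 reduce slots, a job still in its map phase.
typed = typed_slot_utilization(8, 4, pending_maps=20, pending_reduces=0)
pooled = pooled_utilization(12, pending_maps=20, pending_reduces=0)
print(f"typed slots: {typed:.0%}, pooled: {pooled:.0%}")
```

In this made-up scenario the four reduce slots sit idle during the map phase, which is the low-utilization effect the slide refers to.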

  5. YARN Architecture • Splits up the two major functions of the JobTracker • ResourceManager (RM): cluster resource management • ApplicationMaster (AM): task scheduling and monitoring • NodeManager (NM): a new per-node slave that launches the applications’ containers, monitors their resource usage (CPU, memory), and reports to the ResourceManager • YARN maintains compatibility with existing MapReduce applications and supports other applications
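
The split of responsibilities on this slide can be sketched as three small classes. This is a simplified illustration only: the class and method names below are hypothetical and are not the real YARN APIs, memory is the only resource modeled, and heartbeats are omitted. The RM decides where capacity is available, the AM decides which tasks to run, and the NM launches containers and tracks their usage.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-node agent: launches containers and tracks their memory use."""
    node_id: str
    memory_mb: int
    containers: list = field(default_factory=list)

    def launch(self, app_id, memory_mb):
        self.containers.append((app_id, memory_mb))

    def used_mb(self):
        return sum(mb for _, mb in self.containers)


class ResourceManager:
    """Cluster-wide resource management: grants capacity, runs no tasks."""

    def __init__(self, node_managers):
        self.nodes = {nm.node_id: nm for nm in node_managers}
        self.granted_mb = {nm.node_id: 0 for nm in node_managers}

    def allocate(self, memory_mb):
        # First-fit over nodes; returns the chosen NodeManager or None.
        for node_id, nm in self.nodes.items():
            if nm.memory_mb - self.granted_mb[node_id] >= memory_mb:
                self.granted_mb[node_id] += memory_mb
                return nm
        return None


class ApplicationMaster:
    """Per-application task scheduling: asks the RM, then uses the NM."""

    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm

    def run_tasks(self, task_count, memory_mb):
        placements = []
        for _ in range(task_count):
            nm = self.rm.allocate(memory_mb)
            if nm is None:
                break  # cluster is full; a real AM would wait and retry
            nm.launch(self.app_id, memory_mb)
            placements.append(nm.node_id)
        return placements


rm = ResourceManager([NodeManager("nm1", 4096), NodeManager("nm2", 2048)])
am = ApplicationMaster("app_1", rm)
placements = am.run_tasks(task_count=4, memory_mb=2048)
print(placements)
```

Because the scheduling logic lives in the AM rather than in a central JobTracker, a different application type could supply its own AM class while reusing the same RM and NM.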

  6. YARN – Hub for Big Data Applications • App-specific AM • HOYA (HBase On YArn) • Long-running services (YARN-896) • LLAMA (Low Latency Application MAster) • Gang Scheduler (YARN-624) • [Diagram: OpenMPI, Distributed Shell, Spark, HBase, MapReduce, Tez, Storm, and Impala running on YARN, which runs on HDFS]

  7. YARN and Cloud • Two different perspectives • YARN-centric perspective: YARN is the key platform for apps; YARN is independent of infrastructure, and running on top of Cloud shows YARN’s generality • Cloud-centric perspective: YARN is an umbrella kind of application; supporting YARN shows Cloud’s generality

  8. YARN and Cloud: YARN-centric Perspective • [Diagram: big data apps (HBase, Open MPI, Distributed Shell, Spark, Impala, MapReduce, Tez, Storm, …) run on YARN; YARN runs on infrastructure, which is either cloud infrastructure (OpenStack, VMware, …) or bare-metal machines]

  9. YARN and Cloud: Cloud-centric Perspective • [Diagram: legacy apps, YARN apps, and non-YARN big data apps all run on cloud infrastructure (VMware, OpenStack, etc.); the YARN apps (HBase, Open MPI, Distributed Shell, Spark, Impala, MapReduce, Tez, Storm, …) run on YARN]

  10. YARN vs. Cloud • Similarity: both aim to share resources across applications and provide global resource management • Differences • YARN manages resources at the OS layer, whereas Cloud manages resources at the hypervisor layer (not directly comparable, but the hypervisor is more powerful than the OS) • Apps managed by YARN need a specific AppMaster, whereas apps managed by Cloud run exactly as they would on physical machines (advantage: Cloud) • YARN tracks application-specific metrics and progress, whereas Cloud tracks only the underlying resources (advantage: YARN)

  11. YARN + Cloud • Why YARN + Cloud? • Leverage virtualization for strong isolation, fine-grained resource sharing, and other benefits • A uniform infrastructure simplifies enterprise IT • What does it look like? • Run YARN NMs inside VMs managed by the cloud infrastructure • Build a communication channel between the YARN RM and the Cloud Resource Manager for coordination • How do we do it? • The first item above is easy and works smoothly • The second can be achieved in two ways • YARN becomes aware of, and can manipulate, cloud resource changes • YARN provides a generic resource notification mechanism that the Cloud Manager can use when resources change

  12. Elastic YARN Node in the Cloud • A VM’s resource boundary can be elastic • CPU is easy: time slicing (with constraints) • Memory is harder: page sharing and memory ballooning • In case of contention, enforce limits and proportional sharing • “Stealing” resources behind the apps’ backs could cause bad performance (paging) • App-aware resource management could address these issues • Hadoop YARN resource model • Dynamic at the cluster level: nodes can be added or removed • But static per node • In this case, should we enable resource elasticity on the VM? • If yes, performance is low when resource contention happens • If no, utilization is as low as on physical boxes, because free resources cannot be leveraged by other busy VMs • We need a better answer.
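
"Enforce limits and proportional sharing" under contention can be illustrated with a small allocation routine. This is a generic sketch of share-proportional division with per-VM caps, written for illustration; it is not taken from the talk and is not a description of how any particular hypervisor implements it. The function name, VM names, and numbers are all assumptions.

```python
def proportional_share(capacity, shares, limits):
    """Divide capacity among VMs in proportion to shares, capped by limits.

    VMs whose proportional offer would reach their limit are pinned at
    the limit, and the leftover is re-divided among the remaining VMs.
    """
    alloc = {vm: 0.0 for vm in shares}
    active = set(shares)
    remaining = float(capacity)
    while active and remaining > 0:
        total = sum(shares[vm] for vm in active)
        capped = {vm for vm in active
                  if remaining * shares[vm] / total >= limits[vm]}
        if not capped:
            # No limit is hit: hand out everything proportionally.
            for vm in active:
                alloc[vm] = remaining * shares[vm] / total
            break
        for vm in capped:
            alloc[vm] = float(limits[vm])
            remaining -= limits[vm]
        active -= capped
    return alloc


# Hypothetical host with 32 GB contended by three VMs.
alloc = proportional_share(
    capacity=32,
    shares={"vm_a": 1, "vm_b": 1, "vm_c": 2},
    limits={"vm_a": 2, "vm_b": 32, "vm_c": 32},
)
print(alloc)
```

The routine knows nothing about what runs inside each VM, which is the gap the slide points at: memory taken away this way may belong to a running YARN container, so an app-aware layer has to be told about the change.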

  13. HVE provides the answer! • Hadoop Virtualization Extensions • A project to enhance Hadoop running on virtualization • Goal: make Hadoop cloud-ready • Provide virtualization awareness to Hadoop, e.g. virtual topology and virtual resources • Deliver generic utilities that can be leveraged by virtualized platforms • Independent of virtualization platform and cloud infrastructure • 100% contributed to the Apache Hadoop community

  14. HVE • Philosophy • Make infrastructure-related components abstract • Deliver different implementations that can be configured properly • E.g. BlockPlacementPolicy • [Diagram: an abstract BlockPlacementPolicy with two implementations, "BlockPlacementPolicy Default" and "BlockPlacementPolicy For Virtualization"]
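
The "abstract component plus configurable implementations" philosophy can be sketched as follows. This is an illustration in Python rather than Hadoop's Java, and everything in it is hypothetical: the class names, the `placement.policy` configuration key, and the rule used by the virtualization-aware variant (avoid placing two replicas on VMs that share a physical host) are assumptions made to show the pattern, not the actual HVE code.

```python
from abc import ABC, abstractmethod


class BlockPlacementPolicy(ABC):
    """Abstract policy: callers depend only on this interface."""

    @abstractmethod
    def choose_targets(self, nodes, replicas):
        """Return the names of the nodes that should hold the replicas."""


class DefaultPolicy(BlockPlacementPolicy):
    def choose_targets(self, nodes, replicas):
        # Naive stand-in for a default policy: take the first N nodes.
        return [n["name"] for n in nodes[:replicas]]


class VirtualizationAwarePolicy(BlockPlacementPolicy):
    def choose_targets(self, nodes, replicas):
        # Skip VMs whose physical host already holds a replica.
        chosen, used_hosts = [], set()
        for n in nodes:
            if len(chosen) == replicas:
                break
            if n["host"] not in used_hosts:
                chosen.append(n["name"])
                used_hosts.add(n["host"])
        return chosen


# Hypothetical registry and config key used to select an implementation.
POLICIES = {"default": DefaultPolicy, "virtualization": VirtualizationAwarePolicy}


def load_policy(config):
    return POLICIES[config.get("placement.policy", "default")]()


nodes = [
    {"name": "vm1", "host": "h1"},
    {"name": "vm2", "host": "h1"},
    {"name": "vm3", "host": "h2"},
]
default_targets = load_policy({}).choose_targets(nodes, 2)
virt_targets = load_policy({"placement.policy": "virtualization"}).choose_targets(nodes, 2)
print(default_targets, virt_targets)
```

Switching behavior is then a configuration change rather than a code change, which is what lets the same code base run on both physical and virtualized clusters.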

  15. Elastic YARN Node in the Cloud • [Diagram: a virtualization host runs a virtual YARN node (NodeManager with containers, DataNode, VMDK) alongside other workloads. Callouts: "Add/Remove Resources?" on the NodeManager, "Grow/Shrink by tens of GB in memory?" on the node, and "Grow/Shrink resource of a VM" on the host]

  16. Implementation – YARN-291 (umbrella) • YARN-312 • AdminProtocol changes • REST API, JMX, etc. • YARN-311 • Core scheduler changes • YARN-313 • CLI: yarn rmadmin -updateNodeResource <NodeId> <Resource> • [Diagram: inside the Resource Manager, the AdminService receives UpdateNodeResource() from the Admin CLI or the Cloud Resource Manager; other components shown are the Scheduler (Cluster Resource, SchedulerNode), RMContext, RMNode, and the Resource Tracker Service, which receives heartbeats from the Node Manager]
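
The flow on this slide, where an admin call changes one node's capacity and the scheduler then allocates against the new value, can be sketched as below. This is a toy model under stated assumptions: the classes and method names are hypothetical stand-ins rather than the real ResourceManager code, only memory is modeled, and the behavior on shrink (simply blocking new allocations) is a choice made for this sketch.

```python
class Scheduler:
    """Tracks per-node capacity and allocations (memory only)."""

    def __init__(self):
        self.capacity_mb = {}
        self.allocated_mb = {}

    def add_node(self, node_id, memory_mb):
        self.capacity_mb[node_id] = memory_mb
        self.allocated_mb[node_id] = 0

    def update_node_resource(self, node_id, memory_mb):
        if node_id not in self.capacity_mb:
            raise KeyError(f"unknown node: {node_id}")
        self.capacity_mb[node_id] = memory_mb

    def available_mb(self, node_id):
        return self.capacity_mb[node_id] - self.allocated_mb[node_id]

    def cluster_capacity_mb(self):
        return sum(self.capacity_mb.values())

    def allocate(self, node_id, memory_mb):
        if self.available_mb(node_id) < memory_mb:
            return False
        self.allocated_mb[node_id] += memory_mb
        return True


class AdminService:
    """Entry point an admin CLI or a cloud manager would call."""

    def __init__(self, scheduler):
        self.scheduler = scheduler

    def update_node_resource(self, node_id, memory_mb):
        self.scheduler.update_node_resource(node_id, memory_mb)


scheduler = Scheduler()
admin = AdminService(scheduler)
scheduler.add_node("vm1", 4096)

first = scheduler.allocate("vm1", 3072)      # fits in the 4096 MB node
second = scheduler.allocate("vm1", 2048)     # rejected, only 1024 MB left
admin.update_node_resource("vm1", 8192)      # the VM was grown
third = scheduler.allocate("vm1", 2048)      # fits after the update
admin.update_node_resource("vm1", 2048)      # the VM was shrunk
fourth = scheduler.allocate("vm1", 512)      # rejected, node is overcommitted
print(first, second, third, fourth, scheduler.available_mb("vm1"))
```

After the shrink, available memory goes negative in this model, so no new containers are placed on the node until running ones finish; what to do with already-running containers is a policy decision the sketch leaves open.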

  17. References • YARN MapReduce 2.0: https://issues.apache.org/jira/browse/MAPREDUCE-279 • HVE topology extension: https://issues.apache.org/jira/browse/HADOOP-8468 • HVE topology extension for YARN: https://issues.apache.org/jira/browse/YARN-18 • HVE elastic resource configuration: https://issues.apache.org/jira/browse/YARN-291 • Gang scheduling: https://issues.apache.org/jira/browse/YARN-624 • Long-lived services in YARN: https://issues.apache.org/jira/browse/YARN-896

  18. Thanks! • Junping Du • jdu@vmware.com
