
What does it mean to virtualize the Hadoop File System?

Explore the concept of virtualizing the Hadoop File System (HDFS) and its different methods, advantages, and considerations. Learn when to choose HDFS storage virtualization.


Presentation Transcript


  1. What does it mean to virtualize the Hadoop File System? • Tom Phelan • Chief Architect for BlueData

  2. It is HDFS …

  3. Unless it is not

  4. Outline • There are questions to be answered … • Three “What”’s: • What is HDFS? • What does it mean to virtualize HDFS? • What are the different methods of virtualization? • Instances • Advantages and considerations • And a “When”: • When to choose HDFS storage virtualization?

  5. What is HDFS? Before we can virtualize it, we need to understand what “it” is.

  6. HDFS It is a distributed file system built with NameNodes and DataNodes. Source: David Engfer via slideshare.net http://image.slidesharecdn.com/introtohadoop-javamug-110414122200-phpapp01/95/intro-to-the-hadoop-stack-april-2011-javamug-14-728.jpg?cb=1302793500

  7. HDFS Implementation • hadoop-hdfs.jar • org.apache.hadoop.fs.FileSystem • org.apache.hadoop.hdfs.FileSystem • org.apache.hadoop.hdfs.DistributedFileSystem

  8. HDFS Implementation It is a stack of Java code used by Hadoop applications to access data. [Diagram layers: YARN; Hadoop Distributed File System API/Java class; Distributed File System Client; protocol at TCP/IP level, “over the wire”]

  9. HDFS Layers of Potential Virtualization • Generic Java Classes • Java class org.apache.hadoop.fs.FileSystem • HDFS over the wire protocol • Java class org.apache.hadoop.hdfs.DFSClient

  10. HDFS Implementation Wire Protocol [Diagram components: Host, Resource Manager, NameNode, Node Manager, App, HDFS Impl, DFSClient, DataNode, Local Disk]

  11. HDFS Virtualization The virtualization of either the HDFS Implementation or the Protocols

  12. Outline • There are questions to be answered … • Three “What”’s: • What is HDFS? • What does it mean to virtualize HDFS? • What are the different methods of virtualization? • Instances • Advantages and considerations • And a “When”: • When to choose HDFS storage virtualization?

  13. HDFS Virtualization Methods • Virtualize the HDFS Implementation • Implement one of the Hadoop Compatible File System (HCFS) Protocols • Implement a HCFS via the over-the-wire protocol (hdfs.DFSClient) • Implement a HCFS via the FileSystem protocol (fs.FileSystem)

  14. Virtualize the HDFS Implementation • This is the only method of HDFS virtualization that requires Hadoop compute virtualization. • Simple: install a Hadoop distro into a cluster of virtualized compute nodes and run the HDFS services in the cluster, storing data on vdisks/vmdks. • Instances of this type of HDFS virtualization include: • VMware BDE • OpenStack Sahara • Cloudera Director • Hortonworks Cloudbreak

  15. Virtualize the HDFS Implementation [Diagram components: Host, VM, Resource Manager, NameNode, Node Manager, App, HDFS Impl, DFSClient, DataNode, Local Disk]

  16. Virtualize the HDFS Implementation • Advantages: • Simple • No new Java code • Compute/data locality • Considerations: • Requires data ingest time • The clusters become stateful

  17. HDFS Virtualization Methods • Virtualize the HDFS Implementation • Implement a Hadoop Compatible File System – HCFS • Implement a HCFS via the over-the-wire protocol (hdfs.DFSClient) • Implement a HCFS via the FileSystem protocol (fs.FileSystem)

  18. Implement a HCFS via the over-the-wire protocol • Use the unmodified hadoop-hdfs jar • fs.defaultFS: hdfs://1.2.3.4:8020/path • Instance: • EMC Isilon
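The point of slide 18 is that only the endpoint changes while the client jar stays the same. As a minimal sketch of what that endpoint setting carries, the snippet below splits the example value from the slide into its parts using only the Java standard library. `DefaultFsEndpoint` is a hypothetical helper introduced for illustration; it is not a Hadoop class, and Hadoop performs its own parsing internally.

```java
import java.net.URI;

// Hypothetical helper, not part of Hadoop: splits an fs.defaultFS-style value
// into the pieces a client needs in order to reach the storage endpoint.
final class DefaultFsEndpoint {
    final String scheme; // selects the file system implementation, e.g. "hdfs"
    final String host;   // address that speaks the HDFS wire protocol
    final int port;      // RPC port of that endpoint
    final String path;   // root path inside the file system

    DefaultFsEndpoint(String defaultFs) {
        URI uri = URI.create(defaultFs);
        this.scheme = uri.getScheme();
        this.host = uri.getHost();
        this.port = uri.getPort();
        this.path = uri.getPath();
    }

    public static void main(String[] args) {
        // Example value taken from the slide.
        DefaultFsEndpoint e = new DefaultFsEndpoint("hdfs://1.2.3.4:8020/path");
        System.out.println(e.scheme + " " + e.host + " " + e.port + " " + e.path);
    }
}
```

Because the scheme is still `hdfs`, the stock DFSClient is used; whatever answers at that host and port only has to speak the same wire protocol as a NameNode and DataNodes would.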

  19. Implement a HCFS via the over-the-wire protocol [Diagram components: Host, Storage Service, Resource Manager, NameNode, Node Manager, App, HDFS Impl, DFSClient, DataNode, Local Disk]

  20. Implement a HCFS via the over-the-wire protocol • Advantages: • Multi-protocol • No new Java code • Enterprise storage services • Considerations: • Open source / proprietary • No compute / data locality

  21. HDFS Virtualization Methods • Virtualize the HDFS Implementation • Implement a Hadoop Compatible File System – HCFS • Implement a HCFS via the over-the-wire protocol (hdfs.DFSClient) • Implement a HCFS via the FileSystem protocol (fs.FileSystem)

  22. Implement a HCFS via the FileSystem Java classes Write the Java code that implements the class, build a jar file, put the jar file in the YARN services class path, and edit the core-site.xml file. • Instances: • S3 and S3a/S3n – org.apache.hadoop.fs.FileSystem • https://github.com/Aloisius/hadoop-s3a • GlusterFS – org.apache.hadoop.fs.FilterFileSystem • https://github.com/gluster/glusterfs-hadoop • Tachyon – org.apache.hadoop.fs.FileSystem • https://github.com/amplab/tachyon • Apache Ignite – org.apache.hadoop.fs.AbstractFileSystem • https://github.com/apache/ignite
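The steps on slide 22 amount to registering a new implementation class under a URI scheme so that applications pick it up without code changes. The sketch below models that idea with plain Java only. `MiniFileSystem` and `MiniFsRegistry` are hypothetical stand-ins, not Hadoop types; in a real cluster the mapping is conventionally declared in core-site.xml with a property of the form `fs.<scheme>.impl`, and the implementing class extends the Hadoop FileSystem class.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical, heavily simplified stand-in for a Hadoop-compatible file system.
interface MiniFileSystem {
    String describe(URI path);
}

// Hypothetical registry: maps a URI scheme to the implementation that serves it,
// playing the role that the core-site.xml entries play in a Hadoop cluster.
final class MiniFsRegistry {
    private final Map<String, Supplier<MiniFileSystem>> implsByScheme = new HashMap<>();

    void register(String scheme, Supplier<MiniFileSystem> impl) {
        implsByScheme.put(scheme, impl);
    }

    MiniFileSystem resolve(URI path) {
        Supplier<MiniFileSystem> impl = implsByScheme.get(path.getScheme());
        if (impl == null) {
            throw new IllegalArgumentException("No file system for scheme: " + path.getScheme());
        }
        return impl.get();
    }

    public static void main(String[] args) {
        MiniFsRegistry registry = new MiniFsRegistry();
        registry.register("hdfs", () -> p -> "HDFS client handles " + p.getPath());
        // "customfs" is an invented scheme used only for this illustration.
        registry.register("customfs", () -> p -> "Custom implementation handles " + p.getPath());

        URI input = URI.create("customfs://store/data/input");
        System.out.println(registry.resolve(input).describe(input));
    }
}
```

The application code only ever sees the interface, which is why a custom implementation can sit beside the stock HDFS one on the same nodes.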

  23. Implement a HCFS via the FileSystem Java classes [Diagram components: Host, Storage Service, Resource Manager, NameNode, Node Manager, App, HDFS Impl, CustomFS Impl, DFSClient, DataNode, Local Disk]

  24. Implement a HCFS via the FileSystem Java classes [Second diagram with the same components: Host, Storage Service, Resource Manager, NameNode, Node Manager, App, HDFS Impl, CustomFS Impl, DFSClient, DataNode, Local Disk]

  25. Implement a HCFS via the FileSystem Java classes • Advantages: • Open source / proprietary • Multiple file access protocols supported • Considerations: • These are file systems • New Java code • Possibly no compute / data locality • May lag latest HDFS feature set

  26. HDFS Virtualization Is there another way?

  27. HDFS Virtualization • Virtualize the HDFS Implementation • Implement a Hadoop Compatible File System – HCFS • Implement a HCFS via the over-the-wire protocol • Implement a HCFS via the FileSystem Java classes • Virtualize the Hadoop Compatible File System Protocol

  28. Virtualize the Hadoop Compatible File System Protocol • Translate the Hadoop file system calls into native calls to the back-end file systems • Insert an intelligent caching layer • Instance: • BlueData EPIC software – org.apache.hadoop.fs.FileSystem

  29. Virtualize the Hadoop Compatible File System Protocol [Diagram components: Host, Storage Service, Resource Manager, NameNode, Node Manager, DTAP Service, App, HDFS Impl, DTAP Impl, DFSClient, DataNode, Local Disk]

  30. HDFS mem cache [Diagram components: Page Cache, DataNode, page, HDFS Implementation, DFSClient] Application is cache aware

  31. Extend mem cache to any File System or Object storage [Diagram components: Page Cache, HDFS, GlusterFS, Object Store, page, DTAP Service, DTAP FileSystem Implementation] Application is cache unaware

  32. Virtualize the Hadoop Compatible File System Protocol • Advantages: • Not a file system • Transparent in memory cache • write back, read ahead • Supports multiple protocols • Supports compute / data locality • Considerations: • New Java code • Open source / proprietary • May lag latest HDFS feature set
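The "transparent in memory cache" with write back and read ahead can be pictured as a wrapper that sits between the application and any back end. The following is a minimal sketch under stated assumptions: `BlockStore`, `CountingBackend` and `CachingStore` are hypothetical names invented here, the cache is unbounded, and nothing in it reflects how BlueData's DTAP service is actually implemented.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical block-level storage interface; not a Hadoop or BlueData API.
interface BlockStore {
    byte[] read(long block);
    void write(long block, byte[] data);
}

// Stand-in for a remote back end; counts calls so the cache effect is visible.
final class CountingBackend implements BlockStore {
    private final Map<Long, byte[]> blocks = new HashMap<>();
    int reads;
    int writes;

    @Override public byte[] read(long block) { reads++; return blocks.get(block); }
    @Override public void write(long block, byte[] data) { writes++; blocks.put(block, data); }
}

// Wraps any BlockStore; callers use the same interface and need not know a cache exists.
final class CachingStore implements BlockStore {
    private final BlockStore backend;
    private final int readAhead; // extra blocks fetched after a miss
    private final Map<Long, byte[]> cache = new HashMap<>();
    private final Map<Long, byte[]> dirty = new LinkedHashMap<>();

    CachingStore(BlockStore backend, int readAhead) {
        this.backend = backend;
        this.readAhead = readAhead;
    }

    @Override public byte[] read(long block) {
        byte[] hit = cache.get(block);
        if (hit != null) return hit;
        // Miss: fetch the requested block plus the next readAhead blocks.
        for (long b = block; b <= block + readAhead; b++) {
            if (!cache.containsKey(b)) {
                byte[] data = backend.read(b);
                if (data != null) cache.put(b, data);
            }
        }
        return cache.get(block);
    }

    @Override public void write(long block, byte[] data) {
        // Write back: acknowledge from memory, defer the back-end write to flush().
        cache.put(block, data);
        dirty.put(block, data);
    }

    void flush() {
        for (Map.Entry<Long, byte[]> e : dirty.entrySet()) {
            backend.write(e.getKey(), e.getValue());
        }
        dirty.clear();
    }
}
```

With a read-ahead of 2, a sequential scan of blocks 0, 1 and 2 touches the back end only on the first read, and writes reach the back end only when `flush()` is called. A production cache would also need eviction, concurrency control and failure handling, none of which is shown here.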

  33. Let’s Review

  34. Outline • There are questions to be answered … • Three “What”’s: • What is HDFS? • What does it mean to virtualize HDFS? • What are the different methods of virtualization? • Instances • Advantages and considerations • And a “When”: • When to choose HDFS storage virtualization?

  35. A Few Words about Performance • Performance measurements are an art as well as a science • Bottlenecks in applications • Bottlenecks in infrastructure • network • CPU • disk • Configuration is key • block size • distro • security

  36. Virtualize the HDFS Implementation Performance – VMware BDE Source of graph: VMware Technical Paper – Virtualized Hadoop Performance with VMware vSphere 6 on High Performance Servers

  37. Performance – Isilon Implement a HCFS via the over-the-wire protocol Source of graph: Stefan Radtke blog post http://stefanradtke.blogspot.com/2015/05/comparing-hadoop-performance-on-das-and.html

  38. Performance – Tachyon Implement a HCFS via the FileSystem Java classes Source of graph: Haoyuan Li https://spark-summit.org/2014/wp-content/uploads/2014/07/Tachyon-Further-Improve-Sparks-Performance-Haoyuan-Li.pdf

  39. Performance – BlueData Virtualize the Hadoop Compatible File System Protocol Source of Graph: BlueData customer proof-of-concept results

  40. Virtualized HDFS solutions provide good performance Even with remote storage Even in virtualized environments

  41. When it comes to Hadoop storage virtualization, speed is not the whole story • Other factors to consider when implementing a virtualized HDFS option: • Use of a virtualized compute environment • Open source / proprietary solution • Required Hadoop File System features • Lifespan of Hadoop cluster

  42. When it comes to Hadoop storage virtualization, speed is not the whole story • Other factors to consider when selecting storage: • Data accessibility • Hadoop File System protocol • NFS, object store, other protocols • Enterprise storage services • data protection • geographical replication • offline backup

  43. Consider a Virtualized HDFS Solution When any of the following are true: • Hadoop and non-Hadoop applications are required to access the same data • Do not want to replicate the data • Enterprise storage data services required • Need to run Hadoop in a virtual compute environment

  44. Hadoop File System Volume, Velocity, Variety Virtualization

  45. Q & A • twitter: @tapbluedata • email: tap@bluedata.com • www.bluedata.com • Visit our booth in the Expo
