
GreenHDFS



Presentation Transcript


  1. GreenHDFS: A Self-Adaptive, Energy-Conserving Variant of the Hadoop Distributed File System – Kumar Sharshembiev

  2. Presentation plan • 1. Current energy issues with HDFS and large server farms • 2. Past approaches and solutions for energy conservation and cost reduction • 3. GreenHDFS's unique design and solution • 4. Conclusions and references

  3. Current energy issues with HDFS • The purpose of HDFS was to build a scalable file system that runs on a large number of commodity servers – currently ~155,500 at Yahoo!

  4. Current energy issues with HDFS • A large number of servers generates heat and consumes energy in very large quantities • Over the lifetime of a server, the operating energy cost is comparable to the initial acquisition cost, and ownership costs keep growing – power, cooling, etc. • A lot of effort and research has gone into energy-conservation solutions for extremely large-scale server farms

  5. Past approaches and solutions • One of the most commonly used techniques is the “scale-down” approach – transitioning servers into a low-power state • Example: many datacenters transfer workloads and their state to a smaller number of servers during low-activity hours • Problem? This approach works only when servers are stateless – i.e., they get all of their data from NAS/SAN

  6. Past approaches and solutions • “Scale-down” approaches work only with NAS/SAN, since all of the data is stored on dedicated storage devices – making it possible to migrate workloads to a smaller number of servers

  7. Problem with the past solutions • Hadoop distributes all of its files among many servers – any of the thousands of nodes may be participating at any moment

  8. GreenHDFS solution • Self-adaptive – depends only on HDFS and file access patterns • Applies data-classification techniques • Performs energy-aware placement of data • Trades off cost, performance, and power by separating the cluster into logical zones

  9. Key observations during research • The team did a detailed analysis of files in a production Yahoo! Hadoop cluster: • Files are heterogeneous in access and lifespan patterns – some are rarely accessed, some are deleted shortly after creation, some stay around for a while • 60% of the data is “cold” or dormant – it lies around without being accessed, but such files “need to exist” as history

  10. Key observations during research • 95–98% of files had a very short “hotness” lifespan of less than 3 days – meaning they were actively used only during their first 3 days • 90% of files in the top-level directory were dormant or “cold” for more than 18 days • The majority of the data had a news-server-like access pattern – most of the computation on a file happens soon after its creation

  11. GreenHDFS design • GreenHDFS organizes servers into logical Hot and Cold Zones, managed by three policies – FMP, SCP, and FRP – that trade off performance, cost, and power • Hot Zone: newly created and currently accessed files; high energy usage and high performance • Cold Zone: files with low to rare access; low energy use, servers mostly in sleeping mode
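To make the zoning concrete, here is a minimal sketch (in Java, since HDFS itself is implemented in Java) of how a cluster's DataNodes might be partitioned into logical Hot and Cold zones. The ZoneMap class, the index-based split, and the hotCount parameter are illustrative assumptions, not GreenHDFS's actual implementation.

    // Hypothetical sketch, not the authors' code: partition DataNodes into
    // logical Hot and Cold zones as described on the slide above.
    import java.util.*;

    enum Zone { HOT, COLD }

    class ZoneMap {
        private final Map<String, Zone> nodeZone = new HashMap<>();

        // Assign the first `hotCount` nodes to the Hot zone, the rest to the Cold zone.
        ZoneMap(List<String> dataNodes, int hotCount) {
            for (int i = 0; i < dataNodes.size(); i++) {
                nodeZone.put(dataNodes.get(i), i < hotCount ? Zone.HOT : Zone.COLD);
            }
        }

        Zone zoneOf(String dataNode) { return nodeZone.get(dataNode); }

        // List all nodes currently assigned to a given zone.
        List<String> nodesIn(Zone z) {
            List<String> out = new ArrayList<>();
            nodeZone.forEach((n, zone) -> { if (zone == z) out.add(n); });
            return out;
        }
    }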

  12. GreenHDFS design • The goal of GreenHDFS is to keep the maximum number of servers in the Hot Zone and minimize the number in the Cold Zone • Servers in the Cold Zone are storage-heavy • GreenHDFS relies heavily on the “temperature” of files – the higher the dormancy (the more rarely a file is accessed), the lower its temperature, and vice versa • Dormancy is determined simply by recording the last-access time on each file read
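A minimal sketch of the temperature idea, assuming dormancy is simply the time elapsed since the last recorded access; the class name, method names, and threshold comparison are illustrative, not taken from the GreenHDFS code.

    // Hypothetical sketch: derive a file's "temperature" from its last-access time.
    import java.time.Duration;
    import java.time.Instant;

    class FileTemperature {
        // Dormancy = time elapsed since the file was last read.
        static Duration dormancy(Instant lastAccess, Instant now) {
            return Duration.between(lastAccess, now);
        }

        // The longer the dormancy, the "colder" the file, and the better a
        // candidate it is for the storage-heavy Cold zone.
        static boolean isCold(Instant lastAccess, Instant now, Duration coldnessThreshold) {
            return dormancy(lastAccess, now).compareTo(coldnessThreshold) > 0;
        }
    }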

  13. FMP – File Migration Policy • FMP runs in the Hot Zone and monitors the dormancy of files • This gives higher storage efficiency in the Hot Zone, as rarely accessed files are moved to the Cold Zone • It also yields significant energy conservation • When a file's coldness exceeds a threshold, FMP migrates it from the Hot Zone (heavy computation) to the Cold Zone (mostly idle servers); files whose hotness exceeds a threshold remain in the Hot Zone
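A hedged sketch of what one FMP pass might look like: scan the Hot Zone's file metadata and select files whose dormancy exceeds the coldness threshold for migration to the Cold Zone. The FileMeta record and selectForMigration method are assumptions for illustration, not the paper's implementation.

    // Hypothetical sketch of a single File Migration Policy (FMP) pass.
    import java.time.Duration;
    import java.time.Instant;
    import java.util.*;

    class FileMigrationPolicy {
        record FileMeta(String path, Instant lastAccess) {}

        // Return the paths that should be moved from the Hot zone to the Cold zone.
        static List<String> selectForMigration(List<FileMeta> hotZoneFiles,
                                               Duration coldnessThreshold,
                                               Instant now) {
            List<String> toMigrate = new ArrayList<>();
            for (FileMeta f : hotZoneFiles) {
                Duration dormancy = Duration.between(f.lastAccess(), now);
                if (dormancy.compareTo(coldnessThreshold) > 0) {
                    toMigrate.add(f.path()); // coldness exceeded the threshold
                }
            }
            return toMigrate;
        }
    }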

  14. Server Power Conserver Policy (SCP) • SCP runs in the Cold Zone and determines which servers can go into standby/sleep mode • SCP uses hardware techniques to transition the CPU, disks, and DRAM into a low-power state • SCP wakes a server up only if: • Data on that server is accessed • New data needs to be placed on that server
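A minimal sketch of the SCP decision, assuming a Cold-Zone server becomes a standby candidate once it has seen no activity for longer than an idleness threshold, and is woken only on a data access or a new placement; the PowerState enum and the threshold are illustrative assumptions.

    // Hypothetical sketch of the Server Power Conserver Policy (SCP) decision.
    import java.time.Duration;
    import java.time.Instant;

    class ServerPowerConserver {
        enum PowerState { ACTIVE, STANDBY }

        // A Cold-zone server may sleep once it has been idle past the threshold.
        static PowerState decide(Instant lastActivity, Instant now, Duration idleThreshold) {
            Duration idle = Duration.between(lastActivity, now);
            return idle.compareTo(idleThreshold) > 0 ? PowerState.STANDBY : PowerState.ACTIVE;
        }

        // Called when data on a sleeping server is accessed or new data must be placed on it.
        static PowerState onAccess() {
            return PowerState.ACTIVE; // wake the server up
        }
    }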

  15. File Reversal Policy (FRP) • FRP runs in the Cold Zone and ensures that QoS, bandwidth, and response time are maintained if files become “popular” again • If the number of accesses to a file rises above a threshold, the file's metadata is updated and the file is “moved” back to the Hot Zone • The threshold values of FMP, SCP, and FRP should be chosen so that they yield maximum energy efficiency
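A sketch of the reversal check, assuming FRP simply counts accesses to Cold-Zone files and flags a file for a move back to the Hot Zone once the count crosses a threshold; the counter-based scheme and the names used here are illustrative.

    // Hypothetical sketch of the File Reversal Policy (FRP) check.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class FileReversalPolicy {
        private final Map<String, Integer> coldAccessCounts = new ConcurrentHashMap<>();
        private final int reversalThreshold;

        FileReversalPolicy(int reversalThreshold) {
            this.reversalThreshold = reversalThreshold;
        }

        // Record an access to a Cold-zone file; return true if the file has
        // become "popular" and should be moved back to the Hot zone.
        boolean recordAccess(String path) {
            int count = coldAccessCounts.merge(path, 1, Integer::sum);
            return count > reversalThreshold;
        }
    }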

  16. File lifespan – files are not equal • A file goes through several stages in its lifetime: • File creation – just created • Hot period – frequently used • Dormant period – not accessed • Deletion • GreenHDFS introduced various lifespan metrics and analyzed lifespan distributions to determine optimal threshold values for its policies: • FileLifeSpanCFR – file creation to first read • FileLifeSpanCLR – file creation to last read • FileLifeSpanLRD – last read access to deletion • FileLifeSpanFLR – first read access to last read • FileLifeTime – creation to deletion
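The lifespan metrics above follow directly from four per-file timestamps; this small sketch (field and method names assumed for illustration) shows how each metric could be computed.

    // Hypothetical sketch: the lifespan metrics from the slide, computed from
    // a file's creation, first-read, last-read, and deletion timestamps.
    import java.time.Duration;
    import java.time.Instant;

    record FileLifespan(Instant created, Instant firstRead, Instant lastRead, Instant deleted) {
        Duration lifeSpanCFR() { return Duration.between(created, firstRead); }  // creation -> first read
        Duration lifeSpanCLR() { return Duration.between(created, lastRead); }   // creation -> last read (hotness)
        Duration lifeSpanLRD() { return Duration.between(lastRead, deleted); }   // last read -> deletion (dormancy)
        Duration lifeSpanFLR() { return Duration.between(firstRead, lastRead); } // first read -> last read
        Duration lifeTime()    { return Duration.between(created, deleted); }    // creation -> deletion
    }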

  17. FileLifeSpanCFR – file creation to first read

  18. FileLifeSpanCLR – last read / hotness • The majority of files have a short hotness lifespan

  19. FileLifeSpanLRD – file dormancy • 80% of files in directory d have a dormancy period of more than 20 days

  20. GreenHDFS simulation • A simulation was used to evaluate GreenHDFS's energy conservation

  21. Energy savings with GreenHDFS • 24% reduction in energy consumption – roughly $2.1 million saved for 38,000 servers, or about $8.5 million at the ~155K servers of today
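As a rough sanity check on the scaling, assuming the saving grows roughly linearly with cluster size: $2.1 million / 38,000 servers ≈ $55 per server, and 155,500 servers × $55 ≈ $8.6 million, which is consistent with the ~$8.5 million figure quoted above.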

  22. Storage efficiency with GreenHDFS • More servers and space available = better performance

  23. Conclusion • GreenHDFS is a policy-driven, self-adaptive variant of HDFS • It relies on data-classification-driven data placement, which produces significant periods of idleness on a subset of servers • It divides the cluster into two logical zones: Hot and Cold • It applies a set of policies (FMP, SCP, FRP) to classify files as hot or cold and place them accordingly

  24. Conclusion • Energy consumption was reduced by 24%, saving $2.1 million for the 38,000 servers of that time; today it could be more than $8.5 million • Storage efficiency also increased, since dormant files are moved to the Cold Zone • More space and better utilization of the Hot Zone lead to better performance for HDFS/MapReduce

  25. References • http://www.cs.odu.edu/~mukka/cs775s11/Presentations/papers/kaushik.pdf • http://images.google.com/ • http://cloudera.com/ • http://hadoop.apache.org/
