
HBase MTTR, Stripe Compaction and Hoya

HBase MTTR, Stripe Compaction and Hoya. Ted Yu (tyu@hortonworks.com). About myself: working on HBase for 3 years; became committer and PMC member in June 2011. Outline: overview of HBase recovery, HDFS issues, stripe compaction, HBase on YARN, Q & A.



  1. HBase MTTR, Stripe Compaction and Hoya • Ted Yu (tyu@hortonworks.com)

  2. About myself • Have been working on HBase for 3 years • Became committer and PMC member in June 2011

  3. Outline • Overview of HBase recovery • HDFS issues • Stripe compaction • HBase on YARN • Q & A

  4. We’re in a distributed system • Hard to distinguish a slow server from a dead server • Everything, or nearly everything, is based on timeouts • Smaller timeouts mean more false positives • HBase works well with false positives, but they always have a cost • The fewer timeouts, the better

  5. HBase components for recovery

  6. Recovery in action

  7. Recovery process • Failure detection: ZooKeeper heartbeats the servers and expires the session when a server does not reply • Region assignment: the master reallocates the regions to the other servers • Failure recovery: read the WAL and rewrite the data • The client drops the connection to the dead server and goes to the new one • [Diagram labels: ZK Heartbeat; Master, RS, ZK; Region Assignment; Region Servers, DataNode; Data recovery; Client]

  8. Failure detection • Set the ZooKeeper timeout to 30s instead of the old 180s default • Beware of GC pauses, but lower values are possible • ZooKeeper detects the errors sooner than the configured timeout • In 0.96: HBase scripts clean the ZK node when the server is killed with kill -9 • => Detection time becomes 0 • Can be used by any monitoring tool
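
The detection rule on this slide can be sketched as a toy model. This is an illustration only, not ZooKeeper or HBase code: `SessionTracker`, `heartbeat`, `delete_node` and `check` are hypothetical names, and time is passed in explicitly so the behavior is deterministic.

```python
class SessionTracker:
    """Toy model of timeout-based failure detection (hypothetical, not ZooKeeper code)."""

    def __init__(self, timeout_s=30.0):
        self.timeout_s = timeout_s
        self.last_seen = {}  # server name -> time of its last heartbeat
        self.dead = set()    # servers whose session is gone

    def heartbeat(self, server, now):
        # A live server refreshes its session.
        self.last_seen[server] = now

    def delete_node(self, server):
        # Models a script removing the server's node right after a kill -9:
        # the server is declared dead without waiting for the timeout.
        if self.last_seen.pop(server, None) is not None:
            self.dead.add(server)

    def check(self, now):
        # Expire every session that has been silent for longer than the timeout.
        for server, seen in list(self.last_seen.items()):
            if now - seen > self.timeout_s:
                del self.last_seen[server]
                self.dead.add(server)
        return set(self.dead)
```

With a 30s timeout, a server that last heartbeated at t=0 is reported dead by `check(31)`, while `delete_node` reports a server dead on the very next `check`, which is the "detection time becomes 0" case.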

  9. With faster region assignment • Detection: from 180s to 30s • Data recovery: around 10s • Reassignment: from tens of seconds to seconds

  10. DataNode crash is expensive! • One replica of WAL edits is on the crashed DN • 33% of the reads during the RegionServer recovery will go to it • Many writes will go to it as well (the smaller the cluster, the higher that probability) • NameNode re-replicates the data (maybe TBs) that was on this node to restore the replica count • NameNode does this work only after a long timeout (10 minutes by default)

  11. HDFS – Stale mode • Live: as today, used for reads and writes, using locality • After 30 seconds (can be less): Stale, not used for writes, used as a last resort for reads • After 10 minutes (don’t change this): Dead, as today, not used • It is actually better to do the HBase recovery before HDFS re-replicates the TBs of data of this node
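
The three states on this slide can be expressed as a small classifier. This is a minimal sketch using the slide's thresholds (30 seconds and 10 minutes); the function names and the `(name, seconds_since_heartbeat)` input format are assumptions made for illustration and are not HDFS APIs.

```python
def datanode_state(seconds_since_heartbeat, stale_after_s=30.0, dead_after_s=600.0):
    """Classify a DataNode by the age of its last heartbeat (thresholds from the slide)."""
    if seconds_since_heartbeat >= dead_after_s:
        return "dead"
    if seconds_since_heartbeat >= stale_after_s:
        return "stale"
    return "live"


def read_order(replicas):
    """Live replicas first, stale ones only as a last resort, dead ones never."""
    live = [name for name, age in replicas if datanode_state(age) == "live"]
    stale = [name for name, age in replicas if datanode_state(age) == "stale"]
    return live + stale


def write_targets(replicas):
    """Only live DataNodes are eligible for writes."""
    return [name for name, age in replicas if datanode_state(age) == "live"]
```

For example, with replicas aged 45s, 5s and 700s, reads try the 5s node first and the 45s node last, and writes go only to the 5s node.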

  12. Results • Do more read/writes to HDFS during the recovery • Multiple failures are still possible • Stale mode will still play its role • And set dfs.timeout to 30s • This limits the effect of two failures in a row. The cost of the second failure is 30s if you were unlucky

  13. Here is the client

  14. The client • You want the client to be patient • Retrying when the system is already loaded is not good • You want the client to learn about region servers dying, and to be able to react immediately • You want the solution to be scalable

  15. Scalable solution • The master notifies the client • A cheap multicast message with the “dead servers” list, sent 5 times for safety • Off by default • On reception, the client immediately stops waiting on the TCP connection • You can now enjoy a large hbase.rpc.timeout
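
The client-side reaction to a "dead servers" message can be sketched as follows. This is a hypothetical model, not the HBase client: `ClientCalls`, `start` and `on_dead_servers` are invented names, and the multicast transport is left out. The point it illustrates is that handling the message is idempotent, so receiving the same list several times is harmless.

```python
class ClientCalls:
    """Toy model of a client aborting calls to servers reported dead (hypothetical)."""

    def __init__(self):
        self.pending = {}  # call id -> server the call is waiting on
        self.failed = []   # call ids aborted because their server was reported dead

    def start(self, call_id, server):
        self.pending[call_id] = server

    def on_dead_servers(self, dead_servers):
        # Abort every pending call to a dead server instead of waiting for the
        # RPC timeout; a repeated notification finds nothing left to abort.
        dead = set(dead_servers)
        aborted = [cid for cid, server in self.pending.items() if server in dead]
        for cid in aborted:
            del self.pending[cid]
            self.failed.append(cid)
        return aborted
```

Because aborted calls fail right away, the RPC timeout itself can stay large without slowing down the reaction to a dead server.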

  16. Faster recovery (HBASE-7006) • Previous algorithm • Read the WAL files • Write new HFiles • Tell the region server it got new HFiles • Puts pressure on the NameNode • Remember: avoid putting pressure on the NameNode • New algorithm • Read the WAL • Write to the RegionServer • We’re done (have seen great improvements in our tests) • TBD: Assign the WAL to a RegionServer local to a replica

  17. Distributed log splitting • [Diagram: RegionServers read WAL-file1 to WAL-file3 from HDFS, each holding interleaved edits from several regions (<region2:edit1><region1:edit2> … <region3:edit1> …), and write one split log file per region back to HDFS (Splitlog-file-for-region1 to Splitlog-file-for-region3)]
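
The regrouping step pictured on this slide, from WAL files with interleaved edits to one edit list per region, can be sketched in a few lines. This is a simplified in-memory illustration, not HBase's splitter: `split_wal` is a hypothetical name and each WAL is modeled as a list of `(region, edit)` pairs.

```python
from collections import defaultdict


def split_wal(wal_files):
    """Group edits by region, keeping the order in which they were read."""
    per_region = defaultdict(list)
    for wal in wal_files:            # WAL files are processed in the given order
        for region, edit in wal:     # edits from different regions are interleaved
            per_region[region].append(edit)
    return dict(per_region)
```

Each resulting list corresponds to one per-region split log file in the diagram.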

  18. Distributed log replay • [Diagram: RegionServers read WAL-file1 to WAL-file3 from HDFS and replay the edits to other RegionServers; HDFS also shows Recovered-file-for-region1 to Recovered-file-for-region3]

  19. Write during recovery • Concurrent writes allowed during the WAL replay – same memstore serves both • Events stream: your new recovery time is the failure detection time: max 30s, likely less! • Caveat: HBASE-8701 WAL Edits need to be applied in receiving order

  20. MemStore flush • Real life: some tables are updated at a given moment then left alone • With a non-empty MemStore • More data to recover • It’s now possible to guarantee that we don’t have a MemStore with old data • Improves real-life MTTR • Helps online snapshots

  21. .META. • There is no -ROOT- table in 0.95/0.96 • But .META. failures are critical • A lot of small improvements • The server now tells the client when a region has moved (the client can avoid going to meta) • And a big one • The .META. WAL is managed separately to allow an immediate recovery of META • With the new MemStore flush, ensures a quick recovery

  22. Data locality post recovery • HBase performance depends on data locality • After a recovery, you’ve lost it • Bad for performance • Here come region groups • Assign 3 favoredRegionServers for every region • On failures, assign the region to one of the secondaries • The data-locality issue is minimized on failures
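
The reassignment rule on this slide can be sketched as a preference-ordered lookup. This is a hypothetical illustration, not the HBase balancer: `reassign` and the shape of the `favored` mapping are assumptions, and the fallback to an arbitrary live server is one possible choice.

```python
def reassign(region, favored, live_servers):
    """Pick the first favored server that is still alive; otherwise any live server."""
    for server in favored[region]:   # ordered: primary, then the secondaries
        if server in live_servers:
            return server            # data for this region is expected to be local here
    # No favored server survived: locality is lost, fall back deterministically.
    return min(live_servers) if live_servers else None
```

If the primary `rs1` dies and `rs2` is a secondary for the region, the region moves to `rs2`, where a replica of its data should already be local.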

  23. Discoveries from cluster testing • HDFS-5016 Heartbeating thread blocks under some failure conditions leading to loss of datanodes • HBASE-9039 Parallel assignment and distributed log replay during recovery • Region splitting during distributed log replay may hinder recovery

  24. Compactions example • MemStore fills up, files are flushed • When enough files accumulate, they are compacted • [Diagram: writes go to the MemStore, which flushes HFiles to HDFS] Architecting the Future of Big Data
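
The flush-then-compact cycle on this slide can be modeled with plain dictionaries. This is a deliberately simplified sketch, not HBase's compaction policy: each file is a dict of row key to value, files are ordered oldest first, and the `min_files=3` trigger is an assumed value chosen for the example.

```python
def compact(hfiles):
    """Merge files (oldest first) into one sorted file; newer values win."""
    merged = {}
    for hfile in hfiles:
        merged.update(hfile)         # later (newer) files overwrite older values
    return dict(sorted(merged.items()))


def maybe_compact(hfiles, min_files=3):
    """Compact only once enough files have accumulated (threshold is an assumption)."""
    if len(hfiles) >= min_files:
        return [compact(hfiles)]
    return hfiles
```

Reads after the compaction touch one file instead of three, at the price of rewriting all of the data.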

  25. But compactions cause slowdowns • Looks like a lot of I/O for no apparent benefit • [Chart: example effect on reads (note better average)]

  26. Key ways to improve compactions • Read from fewer files • Separate files by row key, version, time, etc. • Allows large number of files to be present, uncompacted • Don't compact the data you don't need to compact • For example, old data in OpenTSDB-like systems • Obviously, results in less I/O • Make compactions smaller • Without too much I/O amplification or too many files • Results in less compaction-related outages • HBase works better with few large regions; however, large compactions cause unavailability

  27. Stripe compactions (HBASE-7667) • Somewhat like LevelDB, partition the keys inside each region/store • But only 1 level (plus optional L0) • Compared to regions, partitioning is more flexible • The default is a number of ~equal-sized stripes • To read, just read relevant stripes + L0, if present • [Diagram: HFiles laid out along the row-key axis of a region with start key ccc and end key iii, with eee and ggg marked in between, plus an L0 HFile; example lookup: get 'hbase']
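
The read path on this slide, "read relevant stripes + L0", can be sketched with a binary search over the stripe boundaries. This is an illustration, not the HBase store-file manager: the function names are hypothetical, and the boundaries eee and ggg are read off the slide's diagram on the assumption that they are the stripe boundaries inside a region spanning ccc to iii.

```python
from bisect import bisect_right


def stripe_index(row_key, boundaries):
    """Index of the stripe containing row_key; boundaries are the sorted inner split keys."""
    return bisect_right(boundaries, row_key)


def files_to_read(row_key, boundaries, stripe_files, l0_files):
    """A point read only needs the files of one stripe, plus any L0 files."""
    return stripe_files[stripe_index(row_key, boundaries)] + l0_files
```

With boundaries `["eee", "ggg"]`, the key `"hbase"` sorts after `"ggg"` and so falls in the last stripe; files in the other two stripes are never opened for that get.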

  28. Stripe compactions – writes • Data flushed from MemStore into several files • Each stripe compacts separately most of the time • [Diagram: MemStore flushing into several HFiles on HDFS]
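
Flushing the MemStore "into several files" can be sketched as partitioning the sorted keys by stripe boundary. This is a simplified model rather than HBase's flush code: `flush_to_stripes` is a hypothetical name, the MemStore is a plain dict, and each output list stands in for one HFile.

```python
from bisect import bisect_right


def flush_to_stripes(memstore, boundaries):
    """Write sorted MemStore contents into one file per stripe."""
    files = [[] for _ in range(len(boundaries) + 1)]
    for key in sorted(memstore):
        # Each key lands in the stripe whose key range contains it.
        files[bisect_right(boundaries, key)].append((key, memstore[key]))
    return files
```

Because every flushed file stays inside one stripe, a later compaction of that stripe only rewrites the files of that stripe.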

  29. Stripe compactions – other • Why Level0? • Bulk loaded files go to L0 • Flushes can also go into single L0 files (to avoid tiny files) • Several L0 files are then compacted into striped files • Can drop deletes if compacting one entire stripe + L0 • No need for major compactions, ever • Compact 2 stripes together – rebalance if unbalanced • Very rare, however - unbalanced stripes are not a huge deal • Boundaries could be used to improve region splits in future

  30. Stripe compactions - performance • EC2, c1.xlarge, preload; then measure random read perf • LoadTestTool + deletes + overwrites; measure random reads

  31. HBase on YARN • Hoya is a YARN application • All components are YARN services • Input is a cluster specification, persisted as a JSON document on HDFS • HDFS and ZooKeeper are shared by multiple cluster instances • The cluster can also be stopped and later resumed

  32. Hoya Architecture • Hoya Client: parses the command line, executes local operations, talks to HoyaMasterService • HoyaMasterService: AM service, deploys the HBase master locally • HoyaRegionService: installs and executes the region server

  33. HBase Master Service Deployment • HoyaMasterService requested to create cluster • Local HBase dir chosen for expanded image • User-supplied config dir overwrites conf files in conf directory • HBase conf patched with hostname of master • HoyaMasterService monitors reporting from RM

  34. Failure Handling • Region Service failures trigger new RS instances • MasterService failures do not trigger a restart • RegionService monitors the ZK node for the master • MasterService monitors the state of the HBase master

  35. Runtime classpath dependencies

  36. Q & A Thanks!
