HBase MTTR, Stripe Compaction and Hoya
Ted Yu (tyu@hortonworks.com)
About myself
• Been working on HBase for 3 years
• Became Committer & PMC member in June 2011
Outline
• Overview of HBase recovery
• HDFS issues
• Stripe compaction
• HBase-on-YARN
• Q & A
We’re in a distributed system
• Hard to distinguish a slow server from a dead server
• Everything, or nearly everything, is based on timeouts
• Smaller timeouts mean more false positives
• HBase works well with false positives, but they always have a cost
• The lower the timeouts, the better
Recovery process
• Failure detection: ZooKeeper heartbeats the servers and expires the session when a server does not reply (ZK)
• Region assignment: the master reallocates the regions to the other servers (Master, RS, ZK)
• Data recovery: read the WAL and rewrite the data again (Region Servers, DataNode)
• The client stops the connection to the dead server and goes to the new one (Client)
Failure detection
• Set the ZooKeeper session timeout to 30s instead of the old 180s default
• Beware of GC pauses, but lower values are possible
• ZooKeeper detects the errors sooner than the configured timeout
• In 0.96: HBase scripts clean the ZK node when the server is kill -9ed => detection time becomes 0
• Can be used by any monitoring tool
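A minimal sketch of the shorter session timeout, assuming the value discussed above; zookeeper.session.timeout would normally go into hbase-site.xml, and the Java Configuration API is used here only to show the key:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ZkTimeoutConfig {
    public static void main(String[] args) {
        // Normally set in hbase-site.xml; shown via the Java API only for illustration.
        Configuration conf = HBaseConfiguration.create();

        // Lower the ZooKeeper session timeout from the old 180s default to 30s,
        // so a dead region server is detected much sooner.
        conf.setInt("zookeeper.session.timeout", 30 * 1000);

        // Keep GC in mind: the timeout must stay above the longest
        // stop-the-world pause the region server JVM can hit.
        System.out.println(conf.get("zookeeper.session.timeout"));
    }
}
```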
With faster region assignment
• Detection: from 180s to 30s
• Data recovery: around 10s
• Reassignment: from tens of seconds to seconds
DataNode crash is expensive!
• One replica of the WAL edits is on the crashed DN
• 33% of the reads during the region server recovery will go to it
• Many writes will go to it as well (the smaller the cluster, the higher that probability)
• The NameNode re-replicates the data (maybe TBs) that was on this node to restore the replica count
• The NameNode does this work only after a long timeout (10 minutes by default)
HDFS – Stale mode
• Live: as today, used for reads & writes, using locality
• Stale (after 30 seconds without a heartbeat, can be less): not used for writes, used as a last resort for reads
• Dead (after 10 minutes, don't change this): as today, not used
• And actually, it's better to do the HBase recovery before HDFS re-replicates the TBs of data of this node
Results
• Do more reads/writes to HDFS during the recovery
• Multiple failures are still possible; stale mode will still play its role
• And set dfs.timeout to 30s
• This limits the effect of two failures in a row: the cost of the second failure is 30s if you were unlucky
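A hedged sketch of the HDFS side of this. The three stale-mode keys are standard hdfs-site.xml properties on the NameNode; mapping the slide's "dfs.timeout" to the client socket timeout is an assumption:

```java
import org.apache.hadoop.conf.Configuration;

public class HdfsStaleModeConfig {
    public static void main(String[] args) {
        // These keys normally live in hdfs-site.xml; shown here only for illustration.
        Configuration conf = new Configuration();

        // Mark a DataNode "stale" after 30s without a heartbeat (can be lower).
        conf.setLong("dfs.namenode.stale.datanode.interval", 30 * 1000L);

        // Stale nodes: avoided for writes, used only as a last resort for reads.
        conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
        conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);

        // Assumption: the slide's "dfs.timeout" refers to the client socket timeout;
        // lowering it bounds the cost of hitting a second dead node at ~30s.
        conf.setInt("dfs.client.socket-timeout", 30 * 1000);
    }
}
```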
The client
• You want the client to be patient; retrying when the system is already loaded is not good
• You want the client to learn about region servers dying, and to be able to react immediately
• You want the solution to be scalable
Scalable solution
• The master notifies the client: a cheap multicast message with the “dead servers” list, sent 5 times for safety
• Off by default
• On reception, the client immediately stops waiting on the TCP connection; you can now enjoy a large hbase.rpc.timeout
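A sketch of how this might be switched on, assuming hbase.status.published is the flag that enables the master's multicast publisher in 0.96; the RPC timeout value is purely illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class DeadServerNotificationConfig {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();

        // Master side: publish the dead-servers list over multicast (off by default).
        conf.setBoolean("hbase.status.published", true);

        // Client side: with fast notification, a large RPC timeout becomes affordable,
        // because the client no longer relies on the timeout to detect dead servers.
        conf.setInt("hbase.rpc.timeout", 300 * 1000);  // illustrative value
    }
}
```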
Faster recovery (HBASE-7006)
• Previous algorithm: read the WAL files, write new HFiles, tell the region server it got new HFiles
• This puts pressure on the NameNode; remember: avoid putting pressure on the NameNode
• New algorithm: read the WAL, write to the region server, and we're done (we have seen great improvements in our tests)
• TBD: assign the WAL to a RegionServer local to a replica
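A minimal sketch of enabling the new recovery path, assuming the 0.96 flag hbase.master.distributed.log.replay (off by default there) is what toggles it:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class LogReplayConfig {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();

        // Recover by replaying WAL edits directly to the re-opened regions,
        // instead of splitting the WAL into per-region files first.
        conf.setBoolean("hbase.master.distributed.log.replay", true);
    }
}
```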
[Diagram: distributed log splitting. The WAL files (WAL-file1..3), each containing interleaved edits for several regions, are read from HDFS by the surviving region servers, which write one split-log file per region (Splitlog-file-for-region1..3) back to HDFS; the region servers that re-open those regions then read the split-log files.]
[Diagram: distributed log replay. The same WAL files are read from HDFS, but the edits are replayed directly to the region servers now hosting the recovered regions (the per-region recovered edits are shown as Recovered-file-for-region1..3), instead of going through intermediate split-log files.]
Write during recovery
• Concurrent writes are allowed during the WAL replay – the same memstore serves both
• For a stream of events, your new recovery time is the failure detection time: max 30s, likely less!
• Caveat: HBASE-8701 – WAL edits need to be applied in receiving order
MemStore flush
• Real life: some tables are updated at a given moment, then left alone with a non-empty memstore
• That means more data to recover
• It's now possible to guarantee that we don't have a MemStore with old data
• Improves real-life MTTR and helps online snapshots
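One way to bound how stale an unflushed MemStore can get is a periodic flush; a sketch assuming hbase.regionserver.optionalcacheflushinterval is the relevant knob:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class PeriodicFlushConfig {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();

        // Flush any memstore whose edits are older than this interval (1 hour here),
        // so rarely-updated tables don't keep old data only in the WAL.
        conf.setLong("hbase.regionserver.optionalcacheflushinterval", 3600 * 1000L);
    }
}
```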
.META.
• There is no -ROOT- table in 0.95/0.96, but .META. failures are critical
• A lot of small improvements: the server now tells the client when a region has moved, so the client can avoid going to .META.
• And a big one: the .META. WAL is managed separately to allow an immediate recovery of .META.
• Together with the new MemStore flush, this ensures a quick recovery
Data locality post recovery
• HBase performance depends on data locality
• After a recovery, you've lost it: bad for performance
• Here come region groups: assign 3 favored region servers for every region
• On failure, assign the region to one of the secondaries
• The data-locality issue is minimized on failures
Discoveries from cluster testing
• HDFS-5016: heartbeating thread blocks under some failure conditions, leading to loss of DataNodes
• HBASE-9039: parallel assignment and distributed log replay during recovery
• Region splitting during distributed log replay may hinder recovery
Compactions example
• MemStore fills up, files are flushed
• When enough files accumulate, they are compacted
[Diagram: writes go into the MemStore, which is flushed to new HFiles in HDFS; once enough HFiles accumulate, they are compacted into one.]
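A sketch of the knobs behind this flush-then-compact cycle, with illustrative values (defaults vary by version):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class CompactionTriggerConfig {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();

        // Flush a region's memstore to a new HFile once it reaches 128 MB.
        conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);

        // Consider a minor compaction once a store holds this many HFiles.
        conf.setInt("hbase.hstore.compactionThreshold", 3);

        // Block further flushes (and therefore writes) when a store has too many files.
        conf.setInt("hbase.hstore.blockingStoreFiles", 10);
    }
}
```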
But, compactions cause slowdowns
• Looks like lots of I/O for no apparent benefit
• [Chart: example effect on reads (note the better average)]
Key ways to improve compactions
• Read from fewer files: separate files by row key, version, time, etc.; this allows a large number of files to be present, uncompacted
• Don't compact the data you don't need to compact: for example, old data in OpenTSDB-like systems; obviously results in less I/O
• Make compactions smaller: without too much I/O amplification or too many files; results in fewer compaction-related outages
• HBase works better with few large regions; however, large compactions cause unavailability
Stripe compactions (HBASE-7667)
• Somewhat like LevelDB: partition the keys inside each region/store
• But only 1 level (plus an optional L0)
• Compared to regions, partitioning is more flexible; the default is a number of ~equal-sized stripes
• To read, just read the relevant stripes + L0, if present
[Diagram: a get for row 'hbase' in a region spanning start key ccc to end key iii, with stripe boundaries at eee and ggg, only touches the HFiles of the stripe covering that key, plus L0.]
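A hedged sketch of turning stripe compactions on. The engine class comes with HBASE-7667; the stripe-count key, its value, and setting it globally rather than per column family are assumptions for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class StripeCompactionConfig {
    public static void main(String[] args) {
        // In practice these are usually set per table or column family;
        // the global Configuration is used here only to show the keys.
        Configuration conf = HBaseConfiguration.create();

        // Swap the default store engine for the stripe store engine (HBASE-7667).
        conf.set("hbase.hstore.engine.class",
                 "org.apache.hadoop.hbase.regionserver.StripeStoreEngine");

        // Assumed key: start with a fixed number of roughly equal-sized stripes.
        conf.setInt("hbase.store.stripe.initialStripeCount", 5);
    }
}
```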
Stripe compactions – writes
• Data is flushed from the MemStore into several files
• Each stripe compacts separately most of the time
[Diagram: one MemStore flush produces one HFile per stripe in HDFS.]
Stripe compactions – other
• Why Level 0? Bulk-loaded files go to L0; flushes can also go into single L0 files (to avoid tiny files); several L0 files are then compacted into striped files
• Can drop deletes if compacting one entire stripe + L0: no need for major compactions, ever
• Compact 2 stripes together and rebalance if unbalanced (very rare, however; unbalanced stripes are not a huge deal)
• Boundaries could be used to improve region splits in the future
Stripe compactions – performance
• EC2, c1.xlarge, preload; then measure random read perf
• LoadTestTool + deletes + overwrites; measure random reads
HBase on YARN
• Hoya is a YARN application
• All components are YARN services
• The input is a cluster specification, persisted as a JSON document on HDFS
• HDFS and ZooKeeper are shared by multiple cluster instances
• The cluster can also be stopped and later resumed
Hoya Architecture
• Hoya Client: parses the command line, executes local operations, talks to the HoyaMasterService
• HoyaMasterService: the AM service; deploys the HBase master locally
• HoyaRegionService: installs and executes the region server
HBase Master Service Deployment
• HoyaMasterService is requested to create a cluster
• A local HBase dir is chosen for the expanded image
• The user-supplied config dir overwrites conf files in the conf directory
• The HBase conf is patched with the hostname of the master
• HoyaMasterService monitors reporting from the RM
Failure Handling
• RegionService failures trigger new RS instances
• MasterService failures do not trigger a restart
• RegionService monitors the ZK node for the master
• MasterService monitors the state of the HBase master
Q & A Thanks!