HBase at Xiaomi

HBase at Xiaomi Liang Xie / Honghua Feng {xieliang, fenghonghua}@xiaomi.com www.mi.com

About Us
Honghua Feng, Liang Xie

Outline
• Introduction
• Latency practice
• Some patches we contributed
• Some ongoing patches
• Q&A

About Xiaomi
• Mobile internet company founded in 2010
• Sold 18.7 million phones in 2013
• Over $5 billion revenue in 2013
• Sold 11 million phones in Q1 2014

Hardware

Software

Internet Services

About Our HBase Team
• Founded in October 2012
• 5 members
  • Liang Xie
  • Shaohui Liu
  • Jianwei Cui
  • Liangliang He
  • Honghua Feng
• Resolved 130+ JIRAs so far

Our Clusters and Scenarios
• 15 clusters: 9 online / 2 processing / 4 test
• Scenarios
  • MiCloud
  • MiPush
  • MiTalk
  • Perf Counter

Our Latency Pain Points
• Java GC
• Stable page write in OS layer
• Slow buffered IO (FS journal IO)
• Read/write IO contention

HBase GC Practice
• Bucket cache with off-heap mode
• Xmn/SurvivorRatio/MaxTenuringThreshold
• PretenureSizeThreshold & repl src size
• GC concurrent thread number
GC time per day: [2500, 3000] s -> [300, 600] s!

Write Latency Spikes
HBase client put
->HRegion.batchMutate
->HLog.sync
->SequenceFileLogWriter.sync
->DFSOutputStream.flushOrSync
->DFSOutputStream.waitForAckedSeqno <Stuck here often!>
===================================================
DataNode pipeline write, in BlockReceiver.receivePacket():
->receiveNextPacket
->mirrorPacketTo(mirrorOut) //write packet to the mirror
->out.write/flush //write data to local disk. <- buffered IO
Added instrumentation (HDFS-6110) showed the stalled write was the culprit; the strace result also confirmed it.

Root Cause of Write Latency Spikes
• write() is expected to be fast
• But it is sometimes blocked by write-back!

Stable Page Write Issue Workaround
Workaround: 2.6.32.279 (6.3) -> 2.6.32.220 (6.2), or 2.6.32.279 (6.3) -> 2.6.32.358 (6.4)
Try to avoid deploying RHEL 6.3/CentOS 6.3 in an extremely latency-sensitive HBase cluster!

Root Cause of Write Latency Spikes
...
0xffffffffa00dc09d : do_get_write_access+0x29d/0x520 [jbd2]
0xffffffffa00dc471 : jbd2_journal_get_write_access+0x31/0x50 [jbd2]
0xffffffffa011eb78 : __ext4_journal_get_write_access+0x38/0x80 [ext4]
0xffffffffa00fa253 : ext4_reserve_inode_write+0x73/0xa0 [ext4]
0xffffffffa00fa2cc : ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
0xffffffffa00fa6c4 : ext4_generic_write_end+0xe4/0xf0 [ext4]
0xffffffffa00fdf74 : ext4_writeback_write_end+0x74/0x160 [ext4]
0xffffffff81111474 : generic_file_buffered_write+0x174/0x2a0 [kernel]
0xffffffff81112d60 : __generic_file_aio_write+0x250/0x480 [kernel]
0xffffffff81112fff : generic_file_aio_write+0x6f/0xe0 [kernel]
0xffffffffa00f3de1 : ext4_file_write+0x61/0x1e0 [ext4]
0xffffffff811762da : do_sync_write+0xfa/0x140 [kernel]
0xffffffff811765d8 : vfs_write+0xb8/0x1a0 [kernel]
0xffffffff81176fe1 : sys_write+0x51/0x90 [kernel]
XFS in the latest kernels can relieve the journal IO blocking issue and is friendlier to metadata-heavy scenarios like HBase + HDFS.

Write Latency Spikes
Testing: 8 YCSB threads; write 20 million rows, each 3*200 bytes; 3 DN; kernel: 3.12.17
Count the stalled write() calls that cost > 100 ms
The largest write() latency in Ext4: ~600 ms!
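The stalled-write counting described above can be illustrated with a minimal Python sketch. It is not the HDFS-6110 instrumentation itself; the class name `StallCountingWriter` and its fields are hypothetical, and only the 100 ms threshold comes from the slide.

```python
import time


class StallCountingWriter:
    """Wraps a file-like object and records write() calls slower than a threshold.

    Hypothetical helper for illustration; not part of HDFS or HBase.
    """

    def __init__(self, out, threshold_s=0.1, clock=time.monotonic):
        self.out = out
        self.threshold_s = threshold_s   # 100 ms, as on the slide
        self.clock = clock               # injectable so the logic can be tested
        self.stalls = []                 # durations (seconds) of stalled writes
        self.max_latency_s = 0.0

    def write(self, data):
        start = self.clock()
        written = self.out.write(data)
        elapsed = self.clock() - start
        self.max_latency_s = max(self.max_latency_s, elapsed)
        if elapsed > self.threshold_s:
            self.stalls.append(elapsed)
        return written
```

Wrapping the stream that receives buffered writes and reading `stalls` and `max_latency_s` after a run gives the two numbers the slide reports: how many writes stalled and the largest write() latency.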

Hedged Read (HDFS-5776)
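The slide only names the feature, so here is a rough Python sketch of the hedged-read idea: start a read on one replica and, if it has not returned within a threshold, start the same read on another replica and take whichever finishes first. In HDFS the behaviour is controlled by `dfs.client.hedged.read.threadpool.size` and `dfs.client.hedged.read.threshold.millis`; the function `hedged_read` and its parameters below are hypothetical and do not mirror the DFSClient code.

```python
import concurrent.futures as cf


def hedged_read(replicas, threshold_s, pool):
    """Read via the first replica, hedging to further replicas when it is slow.

    replicas: list of zero-argument callables, each reading the same data
              from a different replica (an assumption of this sketch).
    """
    pending = {pool.submit(replicas[0])}
    next_idx = 1
    while True:
        # Only use a timeout while there is still a replica left to hedge to.
        timeout = threshold_s if next_idx < len(replicas) else None
        done, pending = cf.wait(pending, timeout=timeout,
                                return_when=cf.FIRST_COMPLETED)
        for fut in done:
            if fut.exception() is None:
                return fut.result()      # first successful read wins
        if next_idx < len(replicas):
            # Threshold expired (or a read failed): try the next replica too.
            pending.add(pool.submit(replicas[next_idx]))
            next_idx += 1
        elif not pending:
            raise IOError("all replicas failed")
```

The slower read is left to finish in the pool and its result is discarded, which trades some extra read traffic for a lower tail latency.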

Other Meaningful Latency Work
• Long first “put” issue (HBASE-10010)
• Token invalid (HDFS-5637)
• Retry/timeout setting in DFSClient
• Reduce write traffic? (HLog compression)
• HDFS IO Priority (HADOOP-10410)

Wish List
• Real-time HDFS, esp. priority related
• Core data structure GC friendly
• More off-heap; Shenandoah GC
• TCP/Disk IO characteristic analysis
Need more eyes on OS
Stay tuned…

Some Patches Xiaomi Contributed
• New write thread model (HBASE-8755)
• Reverse scan (HBASE-4811)
• Per table/cf replication (HBASE-8751)
• Block index key optimization (HBASE-7845)

1. New Write Thread Model
Old model:
• 256 WriteHandlers: write to local buffer
• 256 WriteHandlers: write to HDFS
• 256 WriteHandlers: sync to HDFS
Problem: WriteHandler does everything, severe lock race!

New Write Thread Model
New model:
• 256 WriteHandlers: write to local buffer
• 1 AsyncWriter: write to HDFS
• 4 AsyncSyncers: sync to HDFS
• 1 AsyncNotifier: notify writers
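The division of labour in the new model can be sketched as a small Python pipeline: handlers only append to a local buffer and wait, while dedicated threads write, sync and notify. This is a toy under stated assumptions, not the HBASE-8755 code: the class `AsyncWal`, its queues, and the in-memory `log` list standing in for the HLog on HDFS are all hypothetical, and it uses a single syncer thread for brevity.

```python
import queue
import threading

_STOP = object()  # sentinel used to shut the pipeline down


class AsyncWal:
    def __init__(self):
        self._append_lock = threading.Lock()
        self._next_txid = 0
        self._pending = queue.Queue()    # local buffer of (txid, edit)
        self._to_sync = queue.Queue()    # highest txid written so far
        self._to_notify = queue.Queue()  # highest txid synced so far
        self._synced = threading.Condition()
        self._synced_txid = 0
        self.log = []                    # stands in for the HLog file on HDFS
        self.sync_calls = 0
        self._threads = [threading.Thread(target=t, daemon=True)
                         for t in (self._writer, self._syncer, self._notifier)]
        for t in self._threads:
            t.start()

    def append_and_sync(self, edit):
        """Called by write handlers: buffer the edit, then wait until it is synced."""
        with self._append_lock:
            self._next_txid += 1
            txid = self._next_txid
            self._pending.put((txid, edit))
        with self._synced:
            self._synced.wait_for(lambda: self._synced_txid >= txid)
        return txid

    @staticmethod
    def _drain(q, first):
        """Return `first` plus everything already queued, forming one batch."""
        batch = [first]
        while True:
            try:
                batch.append(q.get_nowait())
            except queue.Empty:
                return batch

    def _writer(self):
        while True:
            batch = self._drain(self._pending, self._pending.get())
            edits = [b for b in batch if b is not _STOP]
            if edits:
                self.log.extend(edit for _, edit in edits)
                self._to_sync.put(edits[-1][0])
            if len(edits) < len(batch):  # saw the stop sentinel
                self._to_sync.put(_STOP)
                return

    def _syncer(self):
        while True:
            batch = self._drain(self._to_sync, self._to_sync.get())
            txids = [b for b in batch if b is not _STOP]
            if txids:
                self.sync_calls += 1     # one sync covers the whole batch
                self._to_notify.put(max(txids))
            if len(txids) < len(batch):
                self._to_notify.put(_STOP)
                return

    def _notifier(self):
        while True:
            item = self._to_notify.get()
            if item is _STOP:
                return
            with self._synced:
                self._synced_txid = max(self._synced_txid, item)
                self._synced.notify_all()

    def close(self):
        self._pending.put(_STOP)
        for t in self._threads:
            t.join()
```

Under concurrent load several handlers' edits tend to be covered by one write and one sync, which is one plausible reading of why the slide reports gains only under heavy load.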

New Write Thread Model
• Low load: No improvement
• Heavy load: Huge improvement (3.5x)

2. Reverse Scan
1. All scanners seek to ‘previous’ rows (SeekBefore)
2. Figure out next row: max ‘previous’ row
3. All scanners seek to first KV of next row (SeekTo)
[Diagram: KVs of Row1 to Row6 spread across several scanners]
Performance: 70% of forward scan
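The three steps above can be sketched in Python over in-memory sorted lists. This is only an illustration of the row-selection logic: `SortedScanner`, `previous_row`, `row_kvs` and `reverse_scan` are hypothetical names, and real store file scanners work on HFile blocks rather than lists.

```python
import bisect


class SortedScanner:
    """A scanner over one sorted list of (row, qualifier) keys."""

    def __init__(self, kvs):
        self.kvs = sorted(kvs)
        self.rows = [row for row, _ in self.kvs]

    def previous_row(self, row):
        """SeekBefore: the largest row strictly smaller than `row`, or None."""
        i = bisect.bisect_left(self.rows, row)
        return self.rows[i - 1] if i > 0 else None

    def row_kvs(self, row):
        """SeekTo: all KVs of `row` held by this scanner, in forward order."""
        lo = bisect.bisect_left(self.rows, row)
        hi = bisect.bisect_right(self.rows, row)
        return self.kvs[lo:hi]


def reverse_scan(scanners):
    """Yield (row, merged KVs) from the largest row down to the smallest."""
    # Start from the last row of every non-empty scanner.
    candidates = [s.rows[-1] for s in scanners if s.rows]
    while candidates:
        row = max(candidates)                       # step 2: max 'previous' row
        kvs = sorted(kv for s in scanners for kv in s.row_kvs(row))  # step 3
        yield row, kvs
        previous = (s.previous_row(row) for s in scanners)           # step 1
        candidates = [r for r in previous if r is not None]
```

Rows come out in descending order while the KVs inside each row keep their forward order, and every step costs a seek on every scanner, which is consistent with reverse scans being slower than forward scans.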

3. Per Table/CF Replication
Source: T1: cfA, cfB; T2: cfX, cfY
PeerA (backup): T1: cfA, cfB; T2: cfX, cfY
PeerB (T2:cfX): ?
• PeerB creates T2 only: replication can’t work!
• PeerB creates T1 & T2: all data replicated!
Need a way to specify which data to replicate!

Per Table/CF Replication
• add_peer ‘PeerA’, ‘PeerA_ZK’
• add_peer ‘PeerB’, ‘PeerB_ZK’, ‘T2:cfX’
Source: T1: cfA, cfB; T2: cfX, cfY
PeerA: T1: cfA, cfB; T2: cfX, cfY
PeerB (T2:cfX): T2: cfX
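How a per-peer table/CF list could drive the replication decision is sketched below in Python. The spec syntax follows the strings shown on the slide ('T2:cfX', 'T1:cfA,cfB; T2:cfX,cfY'); the functions `parse_table_cfs` and `should_replicate`, and the rule that a table listed without column families means all of its families, are assumptions of this sketch rather than the HBASE-8751 implementation.

```python
def parse_table_cfs(spec):
    """Parse 'T1:cfA,cfB;T2:cfX' into {'T1': {'cfA', 'cfB'}, 'T2': {'cfX'}}.

    Returns None for an empty spec, meaning "replicate everything"
    (the behaviour of a peer added without a table/CF list).
    """
    if not spec or not spec.strip():
        return None
    table_cfs = {}
    for part in spec.split(";"):
        part = part.strip()
        if not part:
            continue
        table, _, cfs = part.partition(":")
        table_cfs[table.strip()] = {cf.strip() for cf in cfs.split(",") if cf.strip()}
    return table_cfs


def should_replicate(table_cfs, table, cf):
    """Decide whether an edit for (table, cf) is shipped to this peer."""
    if table_cfs is None:
        return True                  # no filter configured for this peer
    if table not in table_cfs:
        return False
    cfs = table_cfs[table]
    return not cfs or cf in cfs      # empty set: all families of the table
```

With this filter, PeerA (no list) receives T1 and T2 in full, while PeerB ('T2:cfX') receives only edits to column family cfX of T2.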

4. Block Index Key Optimization
k1: “ab” (last key of Block 1), k2: “ah, hello world” (first key of Block 2)
Before: ‘Block 2’ block index key = “ah, hello world/…”
Now: ‘Block 2’ block index key = “ac/…” (k1 < key <= k2)
• Reduce block index size
• Save seeking previous block if the searching key is in [‘ac’, ‘ah, hello world’]
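The condition k1 < key <= k2 leaves room to pick a much shorter key than k2. A minimal Python sketch on raw byte strings follows; the function name `short_index_key` is hypothetical, and the real patch works on full HBase keys (row, family, qualifier, and so on) rather than plain byte strings.

```python
def short_index_key(k1: bytes, k2: bytes) -> bytes:
    """Return a short key with k1 < key <= k2.

    k1 is the last key of the previous block, k2 the first key of this block.
    """
    if not k1 < k2:
        raise ValueError("expected k1 < k2")
    for i, (a, b) in enumerate(zip(k1, k2)):
        if a != b:
            # First differing byte: bump k1's byte by one and cut there.
            # Since a < b, the result is > k1 and <= k2.
            return k1[:i] + bytes([a + 1])
    # k1 is a proper prefix of k2: one extra byte of k2 is enough.
    return k2[:len(k1) + 1]
```

For the slide's example, `short_index_key(b"ab", b"ah, hello world")` returns `b"ac"`.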

Some ongoing patches
• Cross-table cross-row transaction (HBASE-10999)
• HLog compactor (HBASE-9873)
• Adjusted delete semantic (HBASE-8721)
• Coordinated compaction (HBASE-9528)
• Quorum master (HBASE-10296)

1. Cross-Row Transaction: Themis
http://github.com/xiaomi/themis
• Google Percolator: Large-scale Incremental Processing Using Distributed Transactions and Notifications
• Two-phase commit: strong cross-table/row consistency
• Global timestamp server: global strictly incremental timestamp
• No touch to HBase internals: based on HBase client and coprocessor
• Read: 90%, Write: 23% (same downgrade as Google Percolator)
• More details: HBASE-10999
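The "global strictly incremental timestamp" requirement can be illustrated with a tiny in-process Python sketch. It is not the Themis timestamp server: `TimestampOracle` is a hypothetical class, and a real server would also need to persist its high-water mark and serve requests over the network.

```python
import threading
import time


class TimestampOracle:
    """Hands out strictly increasing timestamps, even if the clock stalls or steps back."""

    def __init__(self, clock=time.time):
        self._clock = clock      # injectable wall clock, in seconds
        self._last = 0
        self._lock = threading.Lock()

    def next_timestamp(self):
        with self._lock:
            now_ms = int(self._clock() * 1000)
            # Never return a value that is not larger than the previous one.
            self._last = max(self._last + 1, now_ms)
            return self._last
```

In a Percolator-style protocol each transaction would take one timestamp when it starts and another when it commits, so strict ordering of these values is what orders the transactions.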

2. HLog Compactor
[Diagram: HLog 1, 2, 3; Region 1, Region 2, Region x; Memstore; HFiles]
Region x: few writes, but scattered across many HLogs
PeriodicMemstoreFlusher: flush old memstores forcefully
• ‘flushCheckInterval’/‘flushPerChanges’: hard to configure
• Results in ‘tiny’ HFiles
• HBASE-10499: problematic region can’t be flushed!

HLog Compactor
• Compact: HLog 1, 2, 3, 4 -> HLog x
• Archive: HLog 1, 2, 3, 4
[Diagram: HLog x; Region 1, Region 2, Region x; Memstore; HFiles]
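One way to read the compact step is: copy only the entries that are still needed (edits not yet flushed to HFiles) into a new log, after which the old logs can be archived. The Python sketch below encodes that reading; the function `compact_hlogs`, the tuple layout, and the per-region flushed sequence id map are assumptions of this sketch, not details confirmed by the slides or taken from HBASE-9873.

```python
def compact_hlogs(hlogs, flushed_seq_id):
    """Build the contents of the compacted log ("HLog x").

    hlogs: list of logs, each a list of (region, seq_id, edit) tuples.
    flushed_seq_id: region -> highest sequence id already persisted in HFiles.
    Entries at or below a region's flushed sequence id are dropped.
    """
    compacted = []
    for hlog in hlogs:
        for region, seq_id, edit in hlog:
            # Regions with no recorded flush keep all of their entries.
            if seq_id > flushed_seq_id.get(region, -1):
                compacted.append((region, seq_id, edit))
    return compacted
```

In the slide's scenario Region 1 and Region 2 have flushed, so only Region x's few edits survive into HLog x, and HLog 1 to 4 can be archived without forcing Region x to flush a tiny HFile.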

3. Adjusted Delete Semantic
Scenario 1
1. Write kvA at t0
2. Delete kvA at t0, flush to hfile
3. Write kvA at t0 again
4. Read kvA
Result: kvA can’t be read out
Scenario 2
1. Write kvA at t0
2. Delete kvA at t0, flush to hfile
3. Major compact
4. Write kvA at t0 again
5. Read kvA
Result: kvA can be read out
Fix: “delete can’t mask kvs with larger mvcc (put later)”
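The proposed fix can be expressed as a small change to the masking rule. The Python sketch below compares the two rules; `Cell`, `masked_old` and `masked_adjusted` are hypothetical names, and the sketch models a delete that covers all versions up to its timestamp rather than every HBase delete type.

```python
from collections import namedtuple

# ts: user-visible timestamp; mvcc: write order assigned by the region server.
Cell = namedtuple("Cell", "ts mvcc")


def masked_old(put, delete):
    """Current semantic: a delete masks every put with ts <= its ts."""
    return put.ts <= delete.ts


def masked_adjusted(put, delete):
    """Adjusted semantic: the delete cannot mask a put written after it."""
    return put.ts <= delete.ts and put.mvcc <= delete.mvcc
```

With the adjusted rule the second write of kvA at t0 is readable in both scenarios, because its mvcc is larger than the delete's, so the result no longer depends on whether a major compaction happened in between.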

4. Coordinated Compaction
[Diagram: RS, RS, RS all compacting against HDFS (global resource): compact storm!]
• Compaction uses a global HDFS, while whether to compact is decided locally!

Coordinated Compaction
[Diagram: each RS asks the Master “Can I?”; the Master answers OK / NO / OK; HDFS (global resource)]
• Compaction is scheduled by the master, no compact storm any longer
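The "Can I?" exchange can be sketched as a master-side admission check. The Python below assumes the simplest possible policy, a cap on cluster-wide concurrent compactions; the class `CompactionCoordinator`, its methods, and that policy are hypothetical, and the slides do not say how the master actually decides.

```python
import threading


class CompactionCoordinator:
    """Master-side gate: region servers ask before starting a compaction."""

    def __init__(self, max_concurrent):
        self._max = max_concurrent   # assumed policy: cap on concurrent compactions
        self._running = set()
        self._lock = threading.Lock()

    def request(self, region_server):
        """Return True ("OK") if the region server may compact now, else False ("NO")."""
        with self._lock:
            if region_server in self._running:
                return True
            if len(self._running) >= self._max:
                return False
            self._running.add(region_server)
            return True

    def finished(self, region_server):
        """Called when a region server completes its compaction."""
        with self._lock:
            self._running.discard(region_server)
```

A region server that gets "NO" would retry later, so the total compaction IO hitting HDFS at any moment stays bounded.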

5. Quorum Master
[Diagram: active Master (A) and standby Master read info/states from ZooKeeper (zk1, zk2, zk3); RS, RS, RS]
• When the active master serves, the standby master stays ‘really’ idle
• When the standby master becomes active, it needs to rebuild in-memory status

Quorum Master
[Diagram: Master 1, Master 2, Master 3 (A = active); RS, RS, RS]
• Better master failover perf: no phase to rebuild in-memory status
• Better restart perf for BIG clusters (10+K regions)
• No external (ZooKeeper) dependency
• No potential consistency issue
• Simpler deployment

Acknowledgement
Hangjun Ye, Zesheng Wu, Peng Zhang, Xing Yong, Hao Huang, Hailei Li, Shaohui Liu, Jianwei Cui, Liangliang He, Dihao Chen

Thank You!
xieliang@xiaomi.com
fenghonghua@xiaomi.com
