HBase at Xiaomi. Liang Xie / Honghua Feng. {xieliang, fenghonghua}@xiaomi.com. About Us. Honghua Feng. Liang Xie. Outline. Introduction Latency practice Some patches we contributed Some ongoing patches Q&A. About Xiaomi. Mobile internet company founded in 2010
HBase at Xiaomi Liang Xie / Honghua Feng {xieliang, fenghonghua}@xiaomi.com www.mi.com
About Us Honghua Feng Liang Xie
Outline • Introduction • Latency practice • Some patches we contributed • Some ongoing patches • Q&A
About Xiaomi • Mobile internet company founded in 2010 • Sold 18.7 million phones in 2013 • Over $5 billion revenue in 2013 • Sold 11 million phones in Q1, 2014
Hardware
Software
Internet Services
About Our HBase Team • Founded in October 2012 • 5 members • Liang Xie • Shaohui Liu • Jianwei Cui • Liangliang He • Honghua Feng • Resolved 130+ JIRAs so far
Our Clusters and Scenarios • 15 Clusters : 9 online / 2 processing / 4 test • Scenarios • MiCloud • MiPush • MiTalk • Perf Counter
Our Latency Pain Points • Java GC • Stable page write in OS layer • Slow buffered IO (FS journal IO) • Read/Write IO contention
HBase GC Practice • Bucket cache in off-heap mode • Xmn / SurvivorRatio / MaxTenuringThreshold • PretenureSizeThreshold & replication source size • Number of concurrent GC threads • Result: GC time per day dropped from [2500, 3000] s to [300, 600] s!
Write Latency Spikes
HBase client put
  -> HRegion.batchMutate
  -> HLog.sync
  -> SequenceFileLogWriter.sync
  -> DFSOutputStream.flushOrSync
  -> DFSOutputStream.waitForAckedSeqno   <often stuck here!>
DataNode pipeline write, in BlockReceiver.receivePacket():
  -> receiveNextPacket
  -> mirrorPacketTo(mirrorOut)   // write packet to the mirror
  -> out.write/flush             // write data to local disk (buffered IO)
Added instrumentation (HDFS-6110) showed that the stalled local write was the culprit; the strace result confirmed it.
Root Cause of Write Latency Spikes • write() is expected to be fast • But it is sometimes blocked by page write-back!
Stable Page Write Issue Workaround • Downgrade the kernel: 2.6.32.279 (6.3) -> 2.6.32.220 (6.2) • Or upgrade it: 2.6.32.279 (6.3) -> 2.6.32.358 (6.4) • Try to avoid deploying RHEL 6.3 / CentOS 6.3 in an extremely latency-sensitive HBase cluster!
Root Cause of Write Latency Spikes
...
0xffffffffa00dc09d : do_get_write_access+0x29d/0x520 [jbd2]
0xffffffffa00dc471 : jbd2_journal_get_write_access+0x31/0x50 [jbd2]
0xffffffffa011eb78 : __ext4_journal_get_write_access+0x38/0x80 [ext4]
0xffffffffa00fa253 : ext4_reserve_inode_write+0x73/0xa0 [ext4]
0xffffffffa00fa2cc : ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
0xffffffffa00fa6c4 : ext4_generic_write_end+0xe4/0xf0 [ext4]
0xffffffffa00fdf74 : ext4_writeback_write_end+0x74/0x160 [ext4]
0xffffffff81111474 : generic_file_buffered_write+0x174/0x2a0 [kernel]
0xffffffff81112d60 : __generic_file_aio_write+0x250/0x480 [kernel]
0xffffffff81112fff : generic_file_aio_write+0x6f/0xe0 [kernel]
0xffffffffa00f3de1 : ext4_file_write+0x61/0x1e0 [ext4]
0xffffffff811762da : do_sync_write+0xfa/0x140 [kernel]
0xffffffff811765d8 : vfs_write+0xb8/0x1a0 [kernel]
0xffffffff81176fe1 : sys_write+0x51/0x90 [kernel]
XFS in the latest kernels can relieve the journal IO blocking issue and is friendlier to metadata-heavy scenarios like HBase + HDFS.
Write Latency Spikes: Testing • 8 YCSB threads; write 20 million rows, each 3 * 200 bytes; 3 DataNodes; kernel 3.12.17 • Count the stalled write() calls that take > 100 ms • The largest write() latency on ext4: ~600 ms!
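The counting step on this slide can be approximated from the Java side. The following is only a minimal sketch, not the HDFS-6110 instrumentation itself: the class name `StallCountingOutputStream` is hypothetical, and the 100 ms threshold is passed in by the caller to match the slide.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.TimeUnit;

/** Hypothetical wrapper: times every write() and counts the calls that exceed a threshold. */
class StallCountingOutputStream extends OutputStream {
    private final OutputStream out;
    private final long thresholdNanos;
    private long stalledWrites = 0;
    private long maxNanos = 0;

    StallCountingOutputStream(OutputStream out, long thresholdMillis) {
        this.out = out;
        this.thresholdNanos = TimeUnit.MILLISECONDS.toNanos(thresholdMillis);
    }

    @Override
    public void write(int b) throws IOException {
        long start = System.nanoTime();
        out.write(b);
        record(System.nanoTime() - start);
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        long start = System.nanoTime();
        out.write(buf, off, len);
        record(System.nanoTime() - start);
    }

    /** Remembers the slowest call and counts calls slower than the threshold. */
    private void record(long elapsedNanos) {
        if (elapsedNanos > thresholdNanos) stalledWrites++;
        maxNanos = Math.max(maxNanos, elapsedNanos);
    }

    @Override public void flush() throws IOException { out.flush(); }
    @Override public void close() throws IOException { out.close(); }

    long stalledWrites() { return stalledWrites; }
    long maxMillis() { return TimeUnit.NANOSECONDS.toMillis(maxNanos); }
}
```

Wrapping a `FileOutputStream` on the file system under test and comparing `stalledWrites()` between ext4 and XFS would give a rough user-space view of the same effect; kernel-level tools such as strace remain the more direct evidence.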
Hedged Read (HDFS-5776)
Other Meaningful Latency Work • Long first “put” issue (HBASE-10010) • Token invalid (HDFS-5637) • Retry/timeout setting in DFSClient • Reduce write traffic? (HLog compression) • HDFS IO Priority (HADOOP-10410)
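The idea behind a hedged read is that if the first replica has not answered within a threshold, the client starts a second read against another replica and takes whichever answer arrives first. The sketch below illustrates only that idea; `HedgedReader` and its `read` method are hypothetical names, not the DFSClient code from HDFS-5776, and error handling is deliberately simplistic.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper illustrating the hedged-read idea. */
class HedgedReader {
    // Daemon threads so a stuck read never keeps the JVM alive.
    private final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "hedged-read");
        t.setDaemon(true);
        return t;
    });

    <T> T read(Callable<T> primary, Callable<T> backup, long thresholdMillis) throws Exception {
        CompletionService<T> cs = new ExecutorCompletionService<>(pool);
        Future<T> first = cs.submit(primary);

        // Give the primary replica a head start.
        Future<T> done = cs.poll(thresholdMillis, TimeUnit.MILLISECONDS);
        if (done != null) {
            return done.get();   // primary finished in time; get() rethrows if it failed
        }

        // Primary is slow: hedge with a second read and take the first result.
        Future<T> second = cs.submit(backup);
        try {
            return cs.take().get();
        } finally {
            first.cancel(true);   // cancelling a completed future is a no-op
            second.cancel(true);
        }
    }
}
```

A production version would also fall back to the other replica when the first completed read fails, and would bound the thread pool.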
Wish List • Real-time HDFS, esp. priority related • GC-friendly core data structures • More off-heap; Shenandoah GC • TCP/disk IO characteristic analysis • Need more eyes on the OS • Stay tuned…
Some Patches Xiaomi Contributed • New write thread model (HBASE-8755) • Reverse scan (HBASE-4811) • Per table/CF replication (HBASE-8751) • Block index key optimization (HBASE-7845)
1. New Write Thread Model • Old model: 256 WriteHandlers; each WriteHandler appends to the local buffer, writes to HDFS, and syncs to HDFS itself • Problem: WriteHandler does everything, severe lock race!
New Write Thread Model • New model: 256 WriteHandlers append to the local buffer • 1 AsyncWriter: write to HDFS • 4 AsyncSyncers: sync to HDFS • 1 AsyncNotifier: notify writers
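The staged hand-off in the new model can be illustrated with plain Java queues. This is a simplified sketch under several assumptions, not the HBASE-8755 code: the class `AsyncWal` is hypothetical, it uses a single syncer thread instead of four, and an in-memory list stands in for the HDFS file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical three-stage log: writer -> syncer -> notifier. */
class AsyncWal {
    /** One buffered edit plus the future its write handler waits on. */
    private static final class Pending {
        final String edit;
        final CompletableFuture<Void> synced = new CompletableFuture<>();
        Pending(String edit) { this.edit = edit; }
    }

    private interface Step { void run() throws InterruptedException; }

    private final BlockingQueue<Pending> localBuffer = new LinkedBlockingQueue<>();
    private final BlockingQueue<List<Pending>> written = new LinkedBlockingQueue<>();
    private final BlockingQueue<List<Pending>> syncedBatches = new LinkedBlockingQueue<>();
    final List<String> log = new ArrayList<>();          // stand-in for the HDFS file; only the writer thread appends
    final AtomicInteger syncCalls = new AtomicInteger(); // number of simulated sync calls

    AsyncWal() {
        start("AsyncWriter", () -> {
            List<Pending> batch = new ArrayList<>();
            batch.add(localBuffer.take());             // block until at least one edit is buffered
            localBuffer.drainTo(batch);                // then grab everything else that is waiting
            for (Pending p : batch) log.add(p.edit);   // "write to HDFS"
            written.put(batch);
        });
        start("AsyncSyncer", () -> {
            List<Pending> batch = written.take();
            syncCalls.incrementAndGet();               // "sync to HDFS": one sync covers the whole batch
            syncedBatches.put(batch);
        });
        start("AsyncNotifier", () -> {
            for (Pending p : syncedBatches.take()) p.synced.complete(null);  // wake the waiting handlers
        });
    }

    /** Called by write handlers: buffer the edit and wait on the returned future. */
    CompletableFuture<Void> append(String edit) {
        Pending p = new Pending(edit);
        localBuffer.add(p);
        return p.synced;
    }

    private static void start(String name, Step step) {
        Thread t = new Thread(() -> {
            try {
                while (true) step.run();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, name);
        t.setDaemon(true);
        t.start();
    }
}
```

Because handlers only enqueue and wait, they no longer compete for the write and sync locks; under heavy load many edits share one sync, which is consistent with the improvement reported only for the heavy-load case.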
New Write Thread Model • Low load: no improvement • Heavy load: huge improvement (3.5x)
2. Reverse Scan 1. All scanners seek to their ‘previous’ rows (SeekBefore) 2. Figure out the next row: the max of the ‘previous’ rows 3. All scanners seek to the first KV of the next row (SeekTo) [Diagram: KVs of Row1 to Row6 spread across several store files] Performance: 70% of forward scan
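The three steps can be sketched with sorted maps standing in for store files. This is an illustrative model only: `ReverseScanSketch`, its helper `store`, and the example file and row names are hypothetical, and `NavigableMap.lowerKey` plays the role of SeekBefore.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical model: each store file is a sorted map from row key to that row's KVs. */
class ReverseScanSketch {
    /** Builds a toy store file holding one KV per given row. */
    static NavigableMap<String, List<String>> store(String name, String... rows) {
        NavigableMap<String, List<String>> m = new TreeMap<>();
        for (String row : rows) m.put(row, List.of(name + "/" + row + "/kv1"));
        return m;
    }

    /** Steps 1 and 2: every store seeks before currentRow; the next row is the max of those rows. */
    static String previousRow(List<NavigableMap<String, List<String>>> stores, String currentRow) {
        String next = null;
        for (NavigableMap<String, List<String>> s : stores) {
            String prev = s.lowerKey(currentRow);   // SeekBefore: greatest row strictly below currentRow
            if (prev != null && (next == null || prev.compareTo(next) > 0)) next = prev;
        }
        return next;   // null when no store has an earlier row
    }

    /** Step 3: every store seeks to the first KV of the chosen row and emits that row's KVs. */
    static List<String> readRow(List<NavigableMap<String, List<String>>> stores, String row) {
        List<String> kvs = new ArrayList<>();
        for (NavigableMap<String, List<String>> s : stores) {
            kvs.addAll(s.getOrDefault(row, List.of()));
        }
        return kvs;
    }

    /** Emits all KVs in descending row order, starting below startRowExclusive. */
    static List<String> reverseScan(List<NavigableMap<String, List<String>>> stores,
                                    String startRowExclusive) {
        List<String> out = new ArrayList<>();
        String row = previousRow(stores, startRowExclusive);
        while (row != null) {
            out.addAll(readRow(stores, row));
            row = previousRow(stores, row);
        }
        return out;
    }
}
```

Each row costs one SeekBefore and one SeekTo per store, whereas a forward scan simply advances; that extra seeking is a plausible source of the gap to forward-scan performance.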
3. Per Table/CF Replication • Source: T1 : cfA, cfB; T2 : cfX, cfY • PeerA (backup): T1:cfA,cfB; T2:cfX,cfY • PeerB wants T2:cfX only • If PeerB creates only T2: replication can’t work! • If PeerB creates T1 and T2: all data is replicated! • Need a way to specify which data to replicate!
Per Table/CF Replication • add_peer ‘PeerA’, ‘PeerA_ZK’ • add_peer ‘PeerB’, ‘PeerB_ZK’, ‘T2:cfX’ • Source: T1 : cfA, cfB; T2 : cfX, cfY • PeerA receives T1:cfA,cfB; T2:cfX,cfY • PeerB receives T2:cfX only
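How a per-peer spec such as ‘T2:cfX’ could drive the filtering is sketched below. The class `TableCfFilter` is hypothetical and the parsing rules (";" between tables, "," between families, no spec meaning "replicate everything") are assumptions based on the notation used on these slides, not the HBASE-8751 implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-peer filter built from a spec like "T1:cfA,cfB;T2:cfX". */
class TableCfFilter {
    // table -> families to replicate; an empty set means "all families of that table"
    private final Map<String, Set<String>> tableCfs = new HashMap<>();

    /** A null or blank spec means the peer receives everything (like PeerA on the slide). */
    TableCfFilter(String spec) {
        if (spec == null || spec.trim().isEmpty()) return;
        for (String entry : spec.split(";")) {
            if (entry.trim().isEmpty()) continue;
            String[] parts = entry.trim().split(":", 2);
            Set<String> cfs = new HashSet<>();
            if (parts.length == 2) {
                for (String cf : parts[1].split(",")) {
                    if (!cf.trim().isEmpty()) cfs.add(cf.trim());
                }
            }
            tableCfs.put(parts[0].trim(), cfs);
        }
    }

    /** Decides whether an edit for the given table and column family goes to this peer. */
    boolean shouldReplicate(String table, String cf) {
        if (tableCfs.isEmpty()) return true;        // no filter configured for this peer
        Set<String> cfs = tableCfs.get(table);
        if (cfs == null) return false;              // table not listed for this peer
        return cfs.isEmpty() || cfs.contains(cf);   // whole table, or an explicitly listed family
    }
}
```

With this model, a filter built from `"T2:cfX"` accepts only edits to family cfX of table T2, while a filter built without a spec accepts every edit from the source.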
4. Block Index Key Optimization • k1 = “ab” is the last key of Block 1; k2 = “ah, hello world” is the first key of Block 2 • Before: block index key of ‘Block 2’ = “ah, hello world/…” • Now: block index key of ‘Block 2’ = “ac/…” (a short key with k1 < key <= k2) • Reduces block index size • Saves a seek into the previous block when the search key is in [‘ac’, ‘ah, hello world’]
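One way to pick a short key with k1 < key <= k2 is to keep the common prefix of the two keys and add a single distinguishing byte. The sketch below works on raw byte strings only; `ShortIndexKey.shortKey` is a hypothetical helper, and real HBase keys also carry family, qualifier, and timestamp, which this ignores.

```java
import java.util.Arrays;

/** Hypothetical helper that shortens a block index key. */
class ShortIndexKey {
    /**
     * Returns a short key k with lastOfPrev < k <= firstOfNext in unsigned byte order.
     * Assumes lastOfPrev < firstOfNext, which holds for adjacent sorted blocks.
     */
    static byte[] shortKey(byte[] lastOfPrev, byte[] firstOfNext) {
        int min = Math.min(lastOfPrev.length, firstOfNext.length);
        int i = 0;
        while (i < min && lastOfPrev[i] == firstOfNext[i]) i++;   // i = length of the common prefix

        // Start from the common prefix plus the next byte of firstOfNext.
        byte[] key = Arrays.copyOf(firstOfNext, i + 1);
        if (i < lastOfPrev.length) {
            // The keys differ at position i: bump the previous key's byte by one.
            // It cannot overflow because lastOfPrev[i] < firstOfNext[i] (unsigned).
            key[i] = (byte) (lastOfPrev[i] + 1);
        }
        // Otherwise lastOfPrev is a proper prefix of firstOfNext, and the copied
        // prefix of firstOfNext is already longer than, hence greater than, lastOfPrev.
        return key;
    }
}
```

For the slide's example, the keys “ab” and “ah, hello world” share the prefix “a”, and bumping ‘b’ by one gives “ac”.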
Some Ongoing Patches • Cross-table cross-row transaction (HBASE-10999) • HLog compactor (HBASE-9873) • Adjusted delete semantic (HBASE-8721) • Coordinated compaction (HBASE-9528) • Quorum master (HBASE-10296)
1. Cross-Row Transaction: Themis • http://github.com/xiaomi/themis • Based on Google Percolator: “Large-scale Incremental Processing Using Distributed Transactions and Notifications” • Two-phase commit: strong cross-table/row consistency • Global timestamp server: globally strictly increasing timestamps • No changes to HBase internals: built on the HBase client and coprocessors • Performance: read 90%, write 23% (same degradation as Google Percolator) • More details: HBASE-10999
2. HLog Compactor • Region x: few writes, but scattered across many HLogs (HLog 1, 2, 3), so those HLogs cannot be archived until Region x is flushed • PeriodicMemstoreFlusher: forcefully flushes old memstores • ‘flushCheckInterval’/‘flushPerChanges’: hard to configure • Results in ‘tiny’ HFiles • HBASE-10499: problematic region can’t be flushed!
HLog Compactor • Compact: HLog 1, 2, 3, 4 -> HLog x • Archive: HLog 1, 2, 3, 4 (HLog x is kept)
3. Adjusted Delete Semantic
Scenario 1: 1. Write kvA at t0 2. Delete kvA at t0, flush to HFile 3. Write kvA at t0 again 4. Read kvA. Result: kvA can’t be read.
Scenario 2: 1. Write kvA at t0 2. Delete kvA at t0, flush to HFile 3. Major compact 4. Write kvA at t0 again 5. Read kvA. Result: kvA can be read.
Fix: “a delete can’t mask KVs with a larger mvcc (i.e. put later)”
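The two scenarios and the proposed fix can be replayed with a toy visibility check. This is a simplified model for illustration: `DeleteSemanticSketch` and its `Cell` type are hypothetical, only same-timestamp deletes are modelled, and the mvcc numbers simply stand for write order.

```java
import java.util.List;

/** Hypothetical model of delete masking with and without the mvcc rule. */
class DeleteSemanticSketch {
    enum Type { PUT, DELETE }

    static final class Cell {
        final Type type;
        final long ts;     // user-visible timestamp
        final long mvcc;   // write order: larger means written later
        Cell(Type type, long ts, long mvcc) {
            this.type = type;
            this.ts = ts;
            this.mvcc = mvcc;
        }
    }

    /** Returns true if some put at timestamp ts is visible to a reader. */
    static boolean readable(List<Cell> cells, long ts, boolean adjusted) {
        for (Cell put : cells) {
            if (put.type != Type.PUT || put.ts != ts) continue;
            boolean masked = false;
            for (Cell del : cells) {
                if (del.type != Type.DELETE || del.ts != ts) continue;
                // Old semantic: any delete at this timestamp masks the put.
                // Adjusted semantic: only a delete written after the put (larger mvcc) masks it.
                if (!adjusted || del.mvcc > put.mvcc) masked = true;
            }
            if (!masked) return true;
        }
        return false;
    }
}
```

In Scenario 1 the store holds put (mvcc 1), delete (mvcc 2), and put (mvcc 3), all at t0: the old rule hides the second put, while the adjusted rule shows it. After the major compaction in Scenario 2 only the second put remains, so both rules show it; the adjusted rule therefore makes the two scenarios agree.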
4. Coordinated Compaction • All RSs compact at once against HDFS (a global resource): compaction storm! • Compaction uses the global HDFS, while whether to compact is decided locally!
Coordinated Compaction • Each RS asks the master “Can I?” before compacting; the master answers OK or NO • HDFS is a global resource • Compaction is scheduled by the master, so there are no compaction storms any longer
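The “Can I?” exchange can be sketched as a small permit tracker on the master side. The slides do not say how the master decides, so the fixed cap on concurrent compactions used here is purely an assumption, and `CompactionCoordinator` is a hypothetical class rather than the HBASE-9528 design.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical master-side coordinator: grants at most maxConcurrent compactions at a time. */
class CompactionCoordinator {
    private final int maxConcurrent;
    private final Set<String> running = new HashSet<>();

    CompactionCoordinator(int maxConcurrent) {
        this.maxConcurrent = maxConcurrent;
    }

    /** "Can I?" from a region server: true means OK, false means NO (ask again later). */
    synchronized boolean requestCompaction(String regionServer) {
        if (running.contains(regionServer)) return true;      // already holds a permit
        if (running.size() >= maxConcurrent) return false;    // cluster-wide budget is used up
        running.add(regionServer);
        return true;
    }

    /** Called when a region server reports that its compaction has finished. */
    synchronized void compactionFinished(String regionServer) {
        running.remove(regionServer);
    }
}
```

A real scheduler would more likely base the decision on observed HDFS load than on a fixed count, and would need to reclaim permits from region servers that die mid-compaction.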
5. Quorum Master • Current design: the active master (A) and the standby master read info/states from ZooKeeper (zk1, zk2, zk3) • While the active master serves, the standby master stays ‘really’ idle • When the standby master becomes active, it needs to rebuild its in-memory status
Quorum Master • Master 1, Master 2, and Master 3 form a quorum, with one of them active (A) • Better master failover performance: no phase to rebuild in-memory status • Better restart performance for BIG clusters (10K+ regions) • No external (ZooKeeper) dependency • No potential consistency issue • Simpler deployment
Acknowledgement: Hangjun Ye, Zesheng Wu, Peng Zhang, Xing Yong, Hao Huang, Hailei Li, Shaohui Liu, Jianwei Cui, Liangliang He, Dihao Chen
Thank You! xieliang@xiaomi.com fenghonghua@xiaomi.com