HBase

HBase
• HBase is a distributed, column-oriented database built on top of HDFS.
• It scales easily to meet demand.
• HBase is the Hadoop application to use when you require real-time, random read/write access to very large datasets.
• Use MapReduce to search the data.
• HBase depends on ZooKeeper and, by default, manages a ZooKeeper instance as the authority on cluster state.

Data Model
• The data model is similar to Bigtable's.
• A data row has a sortable row key and an arbitrary number of columns.
• The table is stored sparsely, so rows in the same table can have widely varying numbers of columns.
(Figures: Conceptual View; Physical Storage View)
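
The data model above can be sketched in a few lines of Python. This is a toy in-memory model, not the HBase API: the class name `SparseTable`, its methods, and the `cf:a`-style column names are assumptions made for illustration.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of a sparse table with sortable row keys (not the HBase API)."""

    def __init__(self):
        # row key -> {column name -> {timestamp -> value}}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column, value, timestamp):
        # Only cells that are written take up space, so the table is sparse.
        self._rows[row_key][column][timestamp] = value

    def get(self, row_key, column):
        """Return the newest version of a cell, or None if the cell is absent."""
        versions = self._rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def columns(self, row_key):
        """Columns present in one row; different rows may differ widely."""
        return sorted(self._rows.get(row_key, {}))

    def row_keys(self):
        """Row keys in sorted order."""
        return sorted(self._rows)
```

For example, after writing two columns to `row1` and one to `row2`, `columns("row1")` and `columns("row2")` return lists of different lengths, which is the "widely varying numbers of columns" property in miniature.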

Example
• Capture network packets into HDFS, saving one file per minute.
• Run a MapReduce application to estimate flow status:
  • count the number of TCP, UDP, and ICMP packets
  • compute the TCP, UDP, or total packet flow
• Save the results to HBase.
• The row key and timestamp are the capture time.
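
The counting step in this example can be illustrated with a small map/reduce-style sketch in plain Python. It does not use Hadoop; the packet tuple layout `(capture_time, protocol, size_bytes)` and all function names are assumptions, and the slides do not show the actual job.

```python
from collections import Counter


def minute_key(capture_time):
    """Truncate an epoch timestamp (seconds) to the start of its minute."""
    t = int(capture_time)
    return t - t % 60


def map_packet(packet):
    # packet is a hypothetical (capture_time, protocol, size_bytes) tuple
    capture_time, protocol, size = packet
    yield (minute_key(capture_time), protocol), (1, size)


def reduce_counts(pairs):
    """Sum packet counts and byte volume per (minute, protocol) key."""
    counts, volume = Counter(), Counter()
    for key, (n, size) in pairs:
        counts[key] += n
        volume[key] += size
    return counts, volume


def summarize(packets):
    pairs = (kv for p in packets for kv in map_packet(p))
    return reduce_counts(pairs)
```

Each `(minute, protocol)` key would then map naturally onto a row keyed by capture time, with one column per protocol.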

Display
• Specify a start time and a stop time to scan the table, then estimate the data and display it as a flow graph.
• Sample output
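
Because rows are sorted by key, a time-range scan reduces to two binary searches. The sketch below shows the idea on a sorted Python list; it assumes a start-inclusive, stop-exclusive range, and the function name `scan` is illustrative rather than an HBase call.

```python
from bisect import bisect_left


def scan(rows, start_time, stop_time):
    """Return (key, value) pairs with start_time <= key < stop_time.

    rows must be a list of (key, value) pairs sorted by key.
    """
    keys = [k for k, _ in rows]
    lo = bisect_left(keys, start_time)
    hi = bisect_left(keys, stop_time)
    return rows[lo:hi]
```

The returned slice holds the per-minute rows that a display layer would plot as the flow graph.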

Performance of accessing files in HDFS directly and through an HDFS-based FTP server

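Accessing files in HDFS directly (1/7)
• Log in to the namenode over ssh and issue commands.
• Upload a file to HDFS:
  • hadoop fs -Ddfs.block.size=<block size in bytes> -Ddfs.replication=<number of block replicas> -put <local data> <HDFS directory>
• Download a file from HDFS:
  • hadoop fs -get <data on HDFS> <local directory>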
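
When sweeping block sizes and replication counts for a benchmark, it helps to build the two commands above programmatically. The sketch below only assembles the argument lists using the flags shown on this slide; the helper names and the example paths are assumptions.

```python
def put_command(local_path, hdfs_dir, block_size_bytes, replication):
    """Argument list for uploading a file with a given block size and replication."""
    return [
        "hadoop", "fs",
        f"-Ddfs.block.size={block_size_bytes}",
        f"-Ddfs.replication={replication}",
        "-put", local_path, hdfs_dir,
    ]


def get_command(hdfs_path, local_dir):
    """Argument list for downloading a file from HDFS."""
    return ["hadoop", "fs", "-get", hdfs_path, local_dir]
```

On a machine with a Hadoop client installed, a list such as `put_command("data.bin", "/user/demo", 128 * 1024 * 1024, 2)` could be passed to `subprocess.run` and timed.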

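Accessing files in HDFS directly (2/7)
• Observe HDFS file access performance under different conditions by adjusting HDFS parameters. In the slide titles that follow, R is the replication (backup) count of a file in HDFS; for example, R=1.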

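Accessing files in HDFS directly (3/7, R=1)
(Horizontal axis: data block size, in bytes) (Vertical axis: time needed to write one dataset completely into HDFS, in seconds)

Accessing files in HDFS directly (4/7, R=1)
(Horizontal axis: data block size, in bytes) (Vertical axis: time needed to read one dataset completely from HDFS, in seconds)

Accessing files in HDFS directly (5/7, R=2)
(Horizontal axis: data block size, in bytes) (Vertical axis: time needed to write one dataset completely into HDFS, in seconds)

Accessing files in HDFS directly (6/7, R=2)
(Horizontal axis: data block size, in bytes) (Vertical axis: time needed to read one dataset completely from HDFS, in seconds)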

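Accessing files in HDFS directly (7/7)
• Conclusions
  • When files are uploaded and downloaded directly on the namenode server that runs the NameNode daemon, a data block size of 64 MB or 128 MB generally performs best.
  • A higher block replication count makes file writes take longer but slightly speeds up file reads.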
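
One way to see what the block size parameter changes is to count the blocks a file is split into and the raw bytes stored once replication is applied. This arithmetic sketch does not by itself explain the measured timings; the function names and the 1 GB example size are assumptions.

```python
MB = 1024 * 1024


def block_count(file_size, block_size):
    """Number of blocks needed for a file (the last block may be partial)."""
    return -(-file_size // block_size)  # integer ceiling division


def stored_bytes(file_size, replication):
    """Raw bytes written across the cluster for one file."""
    return file_size * replication


file_size = 1024 * MB  # illustrative 1 GB file
for block_mb in (1, 64, 128):
    print(block_mb, "MB blocks:", block_count(file_size, block_mb * MB))
```

A 1 GB file needs 1024 blocks at 1 MB but only 16 at 64 MB and 8 at 128 MB, so small blocks mean many more block allocations to coordinate, while R=2 doubles the bytes written.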

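Accessing files through an HDFS-based FTP server (1/3)
• Users connect to the FTP server with an FTP client.
• "lfs" denotes an ordinary FTP server daemon that accesses the local file system directly.
• "HDFS" denotes the FTP server daemon we wrote, which accesses HDFS after communicating with the NameNode daemon located on the same server.
• Each total upload or download time reported in the following slides is the average of 3 measurements.
• Network bandwidth stayed between roughly 10 Mb/s and 12 Mb/s.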
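
The measurement procedure can be sketched as two small helpers: one averages repeated timings, the other gives the lower bound on transfer time that the network link imposes. The sketch assumes "Mb/s" means megabits per second (10^6 bits), which the slide does not state; the function names are illustrative.

```python
def mean_seconds(samples):
    """Average of repeated timing measurements, in seconds."""
    return sum(samples) / len(samples)


def min_transfer_seconds(size_bytes, bandwidth_bits_per_s):
    """Lower bound on transfer time if the link is the only limit."""
    return size_bytes * 8 / bandwidth_bits_per_s


GB = 1024 ** 3
for mbps in (10, 12):
    bound = min_transfer_seconds(1 * GB, mbps * 1_000_000)
    print(f"1 GB at {mbps} Mb/s: at least {bound:.0f} s")
```

Comparing a measured average against this bound indicates how much of the total time is attributable to the network rather than to the storage backend.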

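Accessing files through an HDFS-based FTP server (2/3)
(Horizontal axis: size of the single uploaded file, in GB) (Vertical axis: total time to finish the upload, in seconds) (HDFS: block size 128 MB, replication count = 2)

Accessing files through an HDFS-based FTP server (3/3)
(Horizontal axis: size of the single downloaded file, in GB) (Vertical axis: total time to finish the download, in seconds) (HDFS: block size 128 MB, replication count = 2)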

Hadoop authentication analysis
• The name node has no notion of the identity of the real user.
• User identity:
  • The user name is the equivalent of "whoami".
  • The group list is the equivalent of "bash -c groups".
  • The super-user is the user with the same identity as the name node process itself. If you started the name node, then you are the super-user.

Why use a proxy to connect to the name node
• DataNodes do not enforce any access control on accesses to their data blocks. (A client can connect to a datanode directly and read or write a block simply by supplying its block ID.)
• Any user with a Hadoop client can access HDFS or submit a MapReduce job.
• Hadoop only works with SOCKS v5 (on the client side, for ClientProtocol and SubmissionProtocol).
• Conclusion: Hadoop (cluster on private IPs) + RADIUS + SOCKS proxy.

Architecture

Hadoop SOCKS
• The SOCKS connection only needs to be configured on the Hadoop client; the Namenode needs no configuration.
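
A client-side configuration fragment for this setup might be generated as below. The property names `hadoop.rpc.socket.factory.class.default` and `hadoop.socks.server` and the class `org.apache.hadoop.net.SocksSocketFactory` are recalled from Hadoop's default configuration and should be checked against the Hadoop version in use; the helper name and the proxy host and port are assumptions, and the slides do not show the actual settings.

```python
import xml.etree.ElementTree as ET


def socks_client_config(proxy_host, proxy_port):
    """Build a Hadoop-style <configuration> fragment for a SOCKS proxy.

    Property names are assumptions to verify against your Hadoop release.
    """
    props = {
        "hadoop.rpc.socket.factory.class.default":
            "org.apache.hadoop.net.SocksSocketFactory",
        "hadoop.socks.server": f"{proxy_host}:{proxy_port}",
    }
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")
```

The resulting XML would go into the client's site configuration only, consistent with the point that the Namenode side is left unchanged.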

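User authentication
• The username/password method of the SOCKS protocol is used to determine whether a user is permitted to transfer through the proxy.
• The RADIUS server records whether a user may access Hadoop (user-group).
• The user accesses Hadoop under the identity that the Hadoop client runs as (whoami).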
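
The username/password method mentioned above is defined for SOCKS v5 in RFC 1928 (method negotiation) and RFC 1929 (the sub-negotiation). The sketch below only builds and checks the byte messages a client would exchange with the proxy; it opens no sockets, and the function names are illustrative.

```python
def greeting():
    """Client greeting: VER=5, one method offered, 0x02 = username/password."""
    return bytes([0x05, 0x01, 0x02])


def auth_request(username, password):
    """RFC 1929 request: VER=1, ULEN, UNAME, PLEN, PASSWD."""
    u = username.encode("utf-8")
    p = password.encode("utf-8")
    if not (1 <= len(u) <= 255 and 1 <= len(p) <= 255):
        raise ValueError("username and password must each be 1 to 255 bytes")
    return bytes([0x01, len(u)]) + u + bytes([len(p)]) + p


def auth_succeeded(reply):
    """RFC 1929 reply: VER=1 followed by STATUS, where 0x00 means success."""
    return len(reply) == 2 and reply[0] == 0x01 and reply[1] == 0x00
```

In the design described here, the proxy would validate the submitted credentials against the RADIUS server before agreeing to forward any traffic to the cluster.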

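SOCKS proxy: pros and cons
• Advantages:
  • Users can be authenticated.
  • IP ranges can be filtered, restricting which networks may use the proxy.
  • Transferred packets are not stored, only forwarded.
• Disadvantages:
  • The client must support the SOCKS protocol.
  • The proxy may become a bottleneck; transfer speed depends on the hardware and on the SOCKS software chosen.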
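
The IP range filtering listed among the advantages can be sketched with Python's standard `ipaddress` module. The function name and the example networks are assumptions; a real SOCKS server would express the same rule in its own configuration syntax.

```python
import ipaddress


def is_allowed(client_ip, allowed_networks):
    """True if client_ip falls inside any of the allowed CIDR networks."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(net) for net in allowed_networks)


allowed = ["192.168.1.0/24", "10.10.0.0/16"]  # illustrative ranges
print(is_allowed("192.168.1.5", allowed))
print(is_allowed("172.16.0.9", allowed))
```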
