Experimentation – Group E2: Hadoop HDFS, HBase, Swift
HDFS Is ... • Open sourced • git://git.apache.org/hadoop-hdfs.git • A distributed file system • Provides high-throughput access to application data • Highly fault-tolerant • Suited to applications with large data sets, typically GBs to TBs in size • A write-once-read-many model that ensures data coherence • files should not be changed after being closed • Designed to store very large files reliably by splitting them into sequences of replicated blocks
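As a quick illustration of these properties, the commands below (a minimal sketch; the directory and file names are placeholders) copy a file into HDFS and then show how it is stored as a sequence of replicated blocks:

  # copy a local file into HDFS (write-once: it is immutable after close)
  hdfs dfs -mkdir -p /user/e2/data
  hdfs dfs -put crime_records.csv /user/e2/data/
  # list the file and inspect its block locations and replication
  hdfs dfs -ls /user/e2/data
  hdfs fsck /user/e2/data/crime_records.csv -files -blocks -locations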
Architecture http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
Failures • The NameNode (NN) uses periodic heartbeat messages from DataNodes to detect node and disk failures • The NN flags any DataNode (DN) with an absent heartbeat as dead • Data stored on a dead DN is no longer available to the cluster • The NN initiates re-replication whenever a dead DN causes the replication count of one or more blocks to fall below the target • Data integrity is verified through checksums computed on the stored data
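These failure and replication states can be observed directly from the command line; a small sketch (output formats vary by Hadoop version):

  # summary of live and dead DataNodes as seen by the NameNode
  hdfs dfsadmin -report
  # scan the namespace for under-replicated, corrupt, or missing blocks
  hdfs fsck / | egrep -i 'under.replicated|corrupt|missing'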
Configuration Configured with 1 master (NN), 3 slaves (DNs), and a default replication factor of 2
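These values normally live in core-site.xml and hdfs-site.xml on each node; assuming that setup, the running configuration can be confirmed as follows (a sketch):

  # confirm the configured default replication factor (expected: 2)
  hdfs getconf -confKey dfs.replication
  # confirm that all 3 DataNodes have registered with the NameNode
  hdfs dfsadmin -report | grep -i 'live datanodes'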
Testing - Simple MapReduce • MapReduce job: CrimeCount.java • Run over many input files • Result: HDFS installation confirmed
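The CrimeCount source is not reproduced here; assuming it is a standard counting job packaged as a jar (the jar name, main class, and HDFS paths below are placeholders), a run over many input files looks roughly like this:

  # stage the input files in HDFS
  hdfs dfs -put crime_data/ /user/e2/crime_input
  # submit the MapReduce job to the cluster
  hadoop jar CrimeCount.jar CrimeCount /user/e2/crime_input /user/e2/crime_output
  # inspect the reducer output
  hdfs dfs -cat /user/e2/crime_output/part-r-00000 | head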
TestDFSIO - Write Test • HDFS write test • Stress tests the file system • Runs as a MapReduce job, so it exercises the full HDFS cluster • Creates 10 output files of 1 GB each
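A sketch of the write invocation; the jar location and whether -fileSize takes megabytes or a unit suffix depend on the Hadoop version:

  # write benchmark: 10 files of 1 GB each (file size given in MB here)
  hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
    TestDFSIO -write -nrFiles 10 -fileSize 1000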
TestDFSIO - Read Test • HDFS read test • Stress tests the file system • Runs as a MapReduce job • Reads the 10 1 GB files produced by the write test
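The read benchmark reuses the files produced by the write test; a sketch with the same version caveats:

  # read benchmark over the same 10 x 1 GB files
  hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
    TestDFSIO -read -nrFiles 10 -fileSize 1000
  # remove the benchmark files when finished
  hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
    TestDFSIO -clean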
HDFS Resources http://en.wikipedia.org/wiki/Apache_Hadoop#File_system http://hortonworks.com/hadoop/hdfs/ http://wise.ajou.ac.kr/mata/hadoop-benchmark-testdfsio/ http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/
Apache HBase Is ... • Open sourced • git://git.apache.org/hbase.git • Distributed • scales to hundreds of servers • A sorted map datastore • Tables consist of rows and column families • Column families are: • configurable - compression, caching, etc. • stored separately on disk to minimize disk I/O • Modeled after Google Bigtable
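As an illustration of per-column-family configuration, the HBase shell lets attributes such as compression and caching be set when a table is created or altered; the table and family names below are placeholders:

  hbase shell
  # inside the HBase shell: create a table whose column family is gzip-compressed and kept in memory
  create 'metrics', {NAME => 'cf', COMPRESSION => 'GZ', IN_MEMORY => true}
  # column-family attributes can also be changed later
  alter 'metrics', {NAME => 'cf', BLOCKCACHE => true}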
Distributed Architecture http://www.slideshare.net/cloudera/chicago-data-summit-apache-hbase-an-introduction
ZooKeeper • A service that maintains configuration details • Handles distributed synchronization through a quorum-based protocol • Its namespace is organized like a standard file system, but kept entirely in memory • Offers high throughput and low latency • Replicated across multiple hosts (an ensemble) • Guarantees: sequential consistency, atomicity, a single system image, reliability, and timeliness http://zookeeper.apache.org/doc/trunk/zookeeperOver.html
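A quick way to see this in-memory, file-system-like namespace and what HBase keeps in it is the zkCli client shipped with ZooKeeper (the host name below is a placeholder; /hbase is HBase's default root znode):

  # connect to one member of the ZooKeeper ensemble
  zkCli.sh -server zk-host:2181
  # inside zkCli: list the znodes HBase maintains and read the active-master znode
  ls /hbase
  get /hbase/master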
Benefits/Drawbacks • Random read and write capability • Slower than HDFS, but roughly 1000x faster than an RDBMS (e.g. SQL Server) • Storage capacity of ~1 PB, versus 30+ PB for HDFS or ~32 TB for an RDBMS • MapReduce analysis • Scalable
Configuration Configured with 1 master, 4 region servers, and 3 ZooKeeper quorum members
Testing - Simple lifecycle of a table Table lifecycle: • Create (with a column family) • Add 3 rows • Display all rows • Display one row • Delete the table Result: HBase installation confirmed
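A minimal sketch of this lifecycle in the HBase shell (the table name, column family, and values are placeholders for the ones actually used):

  hbase shell
  # create a table with one column family
  create 'test', 'cf'
  # add 3 rows
  put 'test', 'row1', 'cf:a', 'value1'
  put 'test', 'row2', 'cf:a', 'value2'
  put 'test', 'row3', 'cf:a', 'value3'
  # display all rows, then a single row
  scan 'test'
  get 'test', 'row1'
  # delete the table (it must be disabled first)
  disable 'test'
  drop 'test'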
Testing - HBase Performance and Scalability [charts: Sequential Write and Sequential Read results] Analysis of these sequential write and read tests is still underway.
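These numbers were presumably gathered with HBase's bundled PerformanceEvaluation tool (its wiki page appears in the resources below); a sketch of the invocations, where the trailing argument is the number of client threads and exact options vary by HBase version:

  # sequential write benchmark with 4 clients
  hbase org.apache.hadoop.hbase.PerformanceEvaluation sequentialWrite 4
  # sequential read benchmark over the same rows
  hbase org.apache.hadoop.hbase.PerformanceEvaluation sequentialRead 4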
HBase Resources http://wiki.apache.org/hadoop/Hbase/HowToContribute http://zookeeper.apache.org/ http://hbase.apache.org/ http://hbase.apache.org/book http://www.slideshare.net/cloudera/chicago-data-summit-apache-hbase-an-introduction http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation http://blog.sematext.com/2010/08/30/hbase-case-study-using-hbasetestingutility-for-local-testing-development/
OpenStack Swift Cluster Architecture - Access Tier Working configuration: 4 physical nodes • Swift proxy node (1): schiper 10.176.68.240 • Swift object storage nodes (3): lamport 10.176.68.230, chandy 10.176.68.229, mattern 10.176.68.248 • Other combinations, such as 2 proxies with 2 storage servers, have also been configured; the cluster must be reconfigured before benchmarking each topology.
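For reference, a topology like this is expressed in Swift's object ring on the proxy node; a hedged sketch (the device names, zones, port, partition power, and replica count are assumptions, and ring-builder syntax varies across Swift releases):

  # build an object ring with 3 replicas spread across the 3 storage nodes
  swift-ring-builder object.builder create 18 3 1
  swift-ring-builder object.builder add r1z1-10.176.68.230:6000/sdb1 100   # lamport
  swift-ring-builder object.builder add r1z2-10.176.68.229:6000/sdb1 100   # chandy
  swift-ring-builder object.builder add r1z3-10.176.68.248:6000/sdb1 100   # mattern
  swift-ring-builder object.builder rebalance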
Initial Benchmarking with swift-bench: swift-bench /etc/swift/swift-bench.conf OpenStack Object Storage comes with a benchmarking tool named swift-bench. It runs through a series of PUTs, GETs, and DELETEs, calculating the throughput and reporting any failures in our OpenStack Object Storage environment.
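Beyond the bare invocation above, the workload can be tuned in /etc/swift/swift-bench.conf or overridden with command-line flags; a sketch (the concurrency, object size, and operation counts are illustrative):

  # 10 concurrent workers, 4 KB objects, 1000 PUTs, 10000 GETs, then deletes
  swift-bench -c 10 -s 4096 -n 1000 -g 10000 /etc/swift/swift-bench.conf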
Benchmarking with ssbench The main point of using ssbench and swift-bench is to study how new updates affect the average number of requests per second served by the testbed Object Storage under a fixed, specific load. Example ssbench run, gathered on a single node (RAM: 2 GB, 1 VCPU; timing cached reads: 5432.52 MB/sec, timing buffered disk reads: 57.91 MB/sec).
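ssbench drives the cluster from a scenario file describing the object-size mix and operation counts; a hedged sketch of a run against the test proxy (the auth URL, credentials, scenario file, and stats-file path are placeholders, and options vary by ssbench version):

  # run a predefined CRUD scenario against the Swift proxy; the run prints where it saves its stats file
  ssbench-master run-scenario -f very_small.scenario \
    -A http://10.176.68.240:8080/auth/v1.0 -U test:tester -K testing -u 4
  # turn a saved stats file into the requests-per-second report
  ssbench-master report-scenario -s /tmp/ssbench-results/very_small.scenario.stat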
Does it need OpenStack? • Experimented with standalone Swift • Installed DevStack with all OpenStack components, including storage (Swift) and networking (Neutron) • Is standalone Swift sufficient for benchmarking?
IOzone in Swift • Need to determine the port against which the performance evaluation should be run • Expected to work similarly to swift-bench