
Experimentation – Group E2



Presentation Transcript


  1. Experimentation – Group E2: Hadoop HDFS, HBase, Swift

  2. HDFS Is ...
  • Open sourced: git://git.apache.org/hadoop-hdfs.git
  • A distributed file system
  • Provides high-throughput access to application data
  • Highly fault-tolerant
  • For applications with large data sets, typically GBs to TBs in size
  • A write-once-read-many model (ensures data coherence): files should not be changed after being closed
  • Designed for reliability of very large files, by storing files as replicated block sequences
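The last bullet can be illustrated with a small sketch. This is not HDFS source code: it is a hypothetical, pure-Python illustration of splitting a file into fixed-size blocks and replicating each block across several DataNodes. The block size, node names, and round-robin placement are all made up for the example (real HDFS uses 128 MB blocks and rack-aware placement).

```python
# Illustrative sketch (not HDFS source): a file split into fixed-size
# blocks, each replicated on several datanodes.

BLOCK_SIZE = 4   # bytes, tiny for illustration; HDFS defaults to 128 MB
REPLICATION = 2  # matches the cluster configuration described later

def place_blocks(data: bytes, datanodes: list[str], replication: int = REPLICATION):
    """Split data into blocks and assign each block to `replication` nodes."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    placement = {}
    for b in range(len(blocks)):
        # round-robin placement; real HDFS placement is rack-aware
        placement[b] = [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
    return blocks, placement

blocks, placement = place_blocks(b"hello hdfs!!", ["dn1", "dn2", "dn3"])
print(len(blocks))   # 3 blocks of 4 bytes each
print(placement[0])  # ['dn1', 'dn2']
```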

  3. Architecture http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

  4. Failures
  • Heartbeat messages are used by the NameNode (NN) to detect DataNode failures
  • The NN flags any DataNode (DN) with an absent heartbeat as dead
  • Data on a dead DN is unavailable from then on
  • The NN initiates re-replication if a dead DN causes the replication count of one or more blocks to drop
  • Data integrity is verified through checksums on the data
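The heartbeat and re-replication logic above can be sketched as follows. This is a hypothetical illustration, not HDFS code; the timeout, node names, and block IDs are invented (the real NameNode default dead-node timeout is on the order of ten minutes).

```python
# Sketch of the NameNode behaviour described above: mark a DataNode dead
# when its heartbeat is absent past a timeout, then find blocks whose
# live replica count dropped below the target.

HEARTBEAT_TIMEOUT = 10.0  # seconds; illustrative, not the HDFS default

class NameNode:
    def __init__(self, replication: int):
        self.replication = replication
        self.last_heartbeat = {}   # datanode -> last heartbeat timestamp
        self.block_replicas = {}   # block id -> set of datanodes holding it

    def heartbeat(self, dn: str, now: float):
        self.last_heartbeat[dn] = now

    def dead_nodes(self, now: float):
        return {dn for dn, t in self.last_heartbeat.items()
                if now - t > HEARTBEAT_TIMEOUT}

    def under_replicated(self, now: float):
        # blocks needing re-replication after discounting dead nodes
        dead = self.dead_nodes(now)
        return {blk for blk, dns in self.block_replicas.items()
                if len(dns - dead) < self.replication}

nn = NameNode(replication=2)
nn.block_replicas = {"blk_1": {"dn1", "dn2"}, "blk_2": {"dn2", "dn3"}}
for dn in ("dn1", "dn2", "dn3"):
    nn.heartbeat(dn, now=0.0)
nn.heartbeat("dn2", now=20.0)
nn.heartbeat("dn3", now=20.0)     # dn1 last seen at t=0 -> dead by t=20
print(nn.dead_nodes(20.0))        # {'dn1'}
print(nn.under_replicated(20.0))  # {'blk_1'}
```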

  5. Configuration: 1 master (NN), 3 slaves (DN), default replication factor of 2
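The replication factor in such a setup is typically set in hdfs-site.xml on the cluster nodes. A minimal sketch (the property name `dfs.replication` is standard; everything else about the group's actual configuration files is not shown in the slides):

```xml
<!-- hdfs-site.xml (sketch): default replication of 2, as configured above -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```

The three DataNode hostnames would additionally be listed in the slaves file on the master.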

  6. Testing - Simple MapReduce
  • MapReduce job: CrimeCount.java, run over many input files
  • Result: HDFS installation confirmed
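CrimeCount.java itself is not shown in the slides. As a hedged illustration of what such a job's map and reduce phases might compute (counting records per crime type), here is a hypothetical pure-Python rendering; the CSV layout and field positions are assumptions, not the group's actual code.

```python
# Hypothetical sketch of a CrimeCount-style MapReduce job:
# map emits (crime_type, 1) per record, reduce sums the counts.
from collections import defaultdict

def map_phase(lines):
    # assume the crime type is the first field of a CSV record
    for line in lines:
        crime_type = line.split(",")[0].strip()
        yield crime_type, 1

def reduce_phase(pairs):
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

records = ["theft,2013-01-02", "assault,2013-01-02", "theft,2013-01-03"]
print(reduce_phase(map_phase(records)))  # {'theft': 2, 'assault': 1}
```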

  7. TestDFSIO - Write Test
  • HDFS write test
  • Stress-tests the file system
  • Executes a MapReduce job, and therefore exercises the full HDFS cluster
  • Creates ten 1 GB output files

  8. TestDFSIO - Read Test
  • HDFS read test
  • Stress-tests the file system
  • Executes a MapReduce job
  • Reads ten 1 GB input files
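The write and read tests above are typically invoked like this (a sketch requiring a running Hadoop cluster; the exact jar name and path depend on the installed Hadoop version):

```
hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
hadoop jar hadoop-*test*.jar TestDFSIO -read  -nrFiles 10 -fileSize 1000
hadoop jar hadoop-*test*.jar TestDFSIO -clean   # remove benchmark output
```

Here `-nrFiles 10 -fileSize 1000` matches the slides' ten files of roughly 1 GB (1000 MB) each.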

  9. HDFS Resources
  http://en.wikipedia.org/wiki/Apache_Hadoop#File_system
  http://hortonworks.com/hadoop/hdfs/
  http://wise.ajou.ac.kr/mata/hadoop-benchmark-testdfsio/
  http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/

  10. Apache HBase Is ...
  • Open sourced: git://git.apache.org/hbase.git
  • Distributed: scales to hundreds of servers
  • A sorted map datastore: tables consist of rows and column families
  • Column families are:
    • configurable - compression, caching, etc.
    • stored separately on disk to minimize disk I/O
  • Modeled after Google Bigtable
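The "sorted map of rows and column families" model can be sketched in a few lines. This is an illustrative in-memory stand-in, not HBase code: plain dicts take the place of HBase's persistent, distributed storage, and the table/family names are invented.

```python
# Sketch of HBase's data model as described above: a map of
# row key -> {column family -> {qualifier -> value}}, scanned in
# sorted row-key order.

class SortedMapTable:
    def __init__(self, column_families):
        # column families are fixed at table-creation time, as in HBase
        self.families = set(column_families)
        self.rows = {}  # row key -> {family: {qualifier: value}}

    def put(self, row, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row, {}).setdefault(family, {})[qualifier] = value

    def scan(self):
        # rows come back in sorted row-key order, as in an HBase scan
        for row in sorted(self.rows):
            yield row, self.rows[row]

t = SortedMapTable(["info"])
t.put("row2", "info", "city", "Chicago")
t.put("row1", "info", "city", "Evanston")
print([row for row, _ in t.scan()])  # ['row1', 'row2']
```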

  11. Distributed Architecture http://www.slideshare.net/cloudera/chicago-data-summit-apache-hbase-an-introduction

  12. ZooKeeper
  • Service which maintains configuration details
  • Handles distributed synchronization through a quorum-based protocol
  • Organized like a standard file system, but held entirely in memory
  • High throughput and low latency
  • Replicated across multiple hosts (an ensemble)
  • Guarantees: sequential consistency, atomicity, single system image, reliability, and timeliness
  http://zookeeper.apache.org/doc/trunk/zookeeperOver.html
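The quorum rule behind ZooKeeper's replication is simple enough to state in code. A minimal sketch, assuming only the majority-quorum property (an update is durable once a strict majority of the ensemble has acknowledged it):

```python
# Sketch of majority quorum: with 2f+1 servers, an ensemble tolerates
# f failures, and an update commits once a strict majority acks it.

def quorum_size(ensemble_size: int) -> int:
    return ensemble_size // 2 + 1

def committed(acks: int, ensemble_size: int) -> bool:
    return acks >= quorum_size(ensemble_size)

print(quorum_size(3))   # 2: a 3-member ensemble tolerates 1 failure
print(committed(2, 3))  # True
print(committed(1, 3))  # False
```

This is why the cluster described later uses 3 quorum members: it survives the loss of any single one.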

  13. Benefits/Drawbacks
  • Random read and write capability
  • Slower than raw HDFS access, but roughly 1000x faster than an RDBMS (e.g. SQL Server)
  • Storage capacity of ~1 PB, versus 30+ PB for HDFS or ~32 TB for an RDBMS
  • Supports MapReduce analysis
  • Scalable

  14. Configuration: 1 master, 4 region servers, and 3 ZooKeeper quorum members

  15. Testing - Simple Lifecycle of a Table
  • Create table (with a column family)
  • Add 3 rows
  • Display all rows
  • Display one row
  • Delete the table
  Result: HBase installation confirmed
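The lifecycle above maps onto standard HBase shell commands. A sketch requiring a running HBase cluster (table name, column family, and values are illustrative, not the group's actual test data):

```
create 'test', 'cf'                     # create table with one column family
put 'test', 'row1', 'cf:a', 'value1'    # add rows (repeat for row2, row3)
scan 'test'                             # display all rows
get 'test', 'row1'                      # display one row
disable 'test'                          # a table must be disabled before dropping
drop 'test'                             # delete the table
```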

  16. Testing - HBase Performance and Scalability
  Sequential write and sequential read benchmarks were run; analysis of these tests is still underway.

  17. HBase Resources
  http://wiki.apache.org/hadoop/Hbase/HowToContribute
  http://zookeeper.apache.org/
  http://hbase.apache.org/
  http://hbase.apache.org/book
  http://www.slideshare.net/cloudera/chicago-data-summit-apache-hbase-an-introduction
  http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation
  http://blog.sematext.com/2010/08/30/hbase-case-study-using-hbasetestingutility-for-local-testing-development/

  18. SWIFT

  19. OpenStack Swift Cluster Architecture - Access Tier
  Working configuration: 4 physical nodes
  • Swift proxy node (1): schiper 10.176.68.240
  • Swift object storage nodes (3): lamport 10.176.68.230, chandy 10.176.68.229, mattern 10.176.68.248
  Other combinations (e.g. 2 proxies, 2 storage servers) have also been configured; the configuration must be changed back before benchmarking each type.

  20. Initial Benchmarking with swift-bench
  swift-bench /etc/swift/swift-bench.conf
  OpenStack Object Storage ships with a benchmarking tool named swift-bench. It runs through a series of PUTs, GETs, and DELETEs, calculating throughput and reporting any failures in our OpenStack Object Storage environment.
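A swift-bench run is driven by the config file passed on the command line. A minimal sketch of /etc/swift/swift-bench.conf using standard swift-bench options; the credentials, port, and workload numbers below are illustrative assumptions, and only the proxy IP (schiper, 10.176.68.240) comes from the cluster description above:

```ini
# /etc/swift/swift-bench.conf (sketch; values are illustrative)
[bench]
auth = http://10.176.68.240:8080/auth/v1.0
user = test:tester
key = testing
concurrency = 10
object_size = 4096
num_objects = 1000
num_gets = 10000
delete = yes
```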

  21. Benchmarking with ssbench
  The main point of using ssbench and swift-bench is to study how new updates affect the average number of requests per second served by the testbed Object Storage under a fixed, specific load.
  Example of an ssbench report, gathered on a single node (RAM: 2 GB, vCPUs: 1):
  Timing cached reads: 5432.52 MB/sec
  Timing buffered disk reads: 57.91 MB/sec

  22. Does It Need OpenStack?
  • Experimented with standalone Swift
  • Installed DevStack with all OpenStack components, including storage (Swift) and networking (Neutron)
  • Open question: is standalone Swift sufficient for benchmarking?

  23. IOzone in Swift
  • Need to determine the port on which the performance evaluation should be run
  • Should be comparable to swift-bench
