Survey: Cloud Storage Systems Presented by: Nitya Shankaran, Ritika Sharma
Overview • Motivation behind the project • Need to know how your data can be stored efficiently in the cloud, i.e., how to choose the right kind of storage • What has been done so far
Types of Data Storage • Object storage • Relational database storage • Distributed file systems, etc.
Object Storage • Uses data objects instead of files to store and retrieve data • Maintains an index of Object ID (OID) numbers • Ideal for storing large files (a minimal sketch follows).
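To make the object model concrete, here is a minimal in-memory sketch of the interface described above: a flat index from OID to data, with no file hierarchy. All names here are hypothetical; real systems such as S3 and Swift distribute and replicate objects across many servers.

```python
import uuid
from typing import Optional

class ToyObjectStore:
    """A flat index of OID -> (data, metadata); illustrative only."""

    def __init__(self):
        self._index = {}

    def put(self, data: bytes, metadata: Optional[dict] = None) -> str:
        oid = uuid.uuid4().hex              # assign a unique Object ID
        self._index[oid] = (data, metadata or {})
        return oid                          # the caller keeps the OID for retrieval

    def get(self, oid: str) -> bytes:
        data, _ = self._index[oid]          # single flat lookup, no directory walk
        return data

store = ToyObjectStore()
oid = store.put(b"hello cloud", {"content-type": "text/plain"})
assert store.get(oid) == b"hello cloud"
```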
Amazon S3 • “Provides a simple web interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web.” • As of July 2011, S3 stored over 449 billion user objects and handled 900 million user requests a day. • Amazon claims that S3 provides infinite storage capacity, infinite data durability, 99.99% availability, and good data access performance.
How Amazon S3 stores data • Makes use of buckets • Objects are identified by a unique key assigned by the user • S3 stores objects of up to 5 TB in size, each accompanied by 2 KB of metadata (content type, date last modified, etc.) • ACLs: read for objects and buckets; write for buckets only; read and write for objects • Buckets and objects are created, listed, and retrieved using a REST-style HTTP interface or a SOAP interface (see the example below).
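As a concrete illustration of the bucket/key model, the sketch below uses the boto3 library to create a bucket, store an object under a user-assigned key with user metadata, and read it back. The bucket and key names are hypothetical, and AWS credentials are assumed to be configured locally.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket name; outside us-east-1 a CreateBucketConfiguration
# with a LocationConstraint is also required.
s3.create_bucket(Bucket="survey-demo-bucket")

# Objects are identified by a user-assigned key and may carry user metadata.
s3.put_object(
    Bucket="survey-demo-bucket",
    Key="reports/2011-07.txt",
    Body=b"object payload",
    ContentType="text/plain",
    Metadata={"project": "cloud-storage-survey"},
)

obj = s3.get_object(Bucket="survey-demo-bucket", Key="reports/2011-07.txt")
print(obj["ContentType"], obj["Body"].read())
```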
Evaluation of S3 • Experimental setup • Features and findings: • Data durability • Replica placement • Data reliability • Availability • Versioning of objects • Data access performance • Security • Ease of use
Swift • Used for creating redundant, scalable object storage, using clusters of standardized servers to store petabytes of accessible data. • Provides greater scalability, redundancy, and permanence because there is no central point of control. • Objects are written to multiple hardware devices in the data center, with the OpenStack software responsible for ensuring data replication and integrity across the cluster. • Storage clusters scale horizontally by adding new nodes; should a node fail, OpenStack works to replicate its content from other active nodes. • Used mainly for virtual machine images, photo storage, email storage, and backup archiving. • Swift has a RESTful API (see the example below).
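The sketch below exercises Swift's RESTful API directly over HTTP with the requests library: a PUT creates a container, another PUT uploads an object into it, and a GET fetches it back. The endpoint, account, and token are hypothetical; in a real deployment the auth service returns the storage URL and token.

```python
import requests

# Hypothetical storage URL and token, as returned by Swift's auth step.
storage_url = "https://swift.example.com/v1/AUTH_demo"
token = {"X-Auth-Token": "example-token"}

# Create a container, then upload an object into it over plain HTTP verbs.
requests.put(f"{storage_url}/backups", headers=token)
requests.put(f"{storage_url}/backups/log-2011-07.gz",
             headers=token, data=b"compressed log bytes")

# Retrieve the object; Swift serves it from one of its replicas.
resp = requests.get(f"{storage_url}/backups/log-2011-07.gz", headers=token)
print(resp.status_code, len(resp.content))
```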
Architecture of Swift • Proxy server: processes API requests and routes them to storage nodes • Auth server: authenticates and authorizes requests • Ring: represents the mapping between the names of entities stored on disk and their physical locations • Replicator: provides redundancy for accounts, containers, and objects • Updater: processes failed or queued updates • Auditor: verifies the integrity of objects, containers, and accounts • Account server: handles listings of containers, stored as SQLite DBs • Container server: handles listings of objects, stored as SQLite DBs • Object server: stores, retrieves, and deletes objects stored on local devices
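To show what the ring's name-to-location mapping looks like, here is a minimal sketch assuming a consistent-hashing scheme like Swift's (an MD5 hash of the path whose top bits select a partition, which in turn maps to devices). The partition power, device names, and the toy placement rule are all hypothetical simplifications of the real ring.

```python
import hashlib

PART_POWER = 8                                   # 2**8 = 256 partitions (toy value)
devices = [f"node{i}/disk0" for i in range(4)]   # hypothetical storage devices

def partition(path: str) -> int:
    # Top PART_POWER bits of the MD5 digest select the partition.
    digest = hashlib.md5(path.encode()).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - PART_POWER)

def replicas(path: str, count: int = 3):
    part = partition(path)
    # Toy placement: walk the device list from the partition; the real ring
    # also spreads replicas across zones and balances by device weight.
    return [devices[(part + i) % len(devices)] for i in range(count)]

print(replicas("/AUTH_demo/backups/log-2011-07.gz"))
```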
Evaluation of Swift • Data durability • Replica placement • Data reliability • Availability • Data scalability • Security • Limitations: objects must be < 5 GB; not a filesystem; no user quotas; no directory hierarchies; no writing to a byte offset in a file; no ACLs
Swift is mainly used for: • Storing media libraries (photos, music, videos, etc.) • Archiving video surveillance files • Archiving phone call audio recordings • Archiving compressed log files • Archiving backups (< 5 GB per object) • Storing and loading OS images, etc. • Storing file populations that grow continuously on a practically infinite basis • Storing small files (< 50 KB), which OpenStack Object Storage handles well • Storing billions of files • Storing petabytes (millions of gigabytes) of data
Relational Database Storage (RDS) • Aims to move much of the operational burden of provisioning, configuration, scaling, performance tuning, backup, privacy, and access control from the database users to the service operator, offering lower overall costs to users. • Advantages: lower hardware costs; lower operational costs • Disadvantages: relational databases are hard to scale well; managing them is labor-intensive and error-prone; complexity increases because each database package comes with its own configuration options, tools, performance sensitivities, and bugs.
Microsoft SQL Azure • Cloud-based relational database service built on SQL Server • Uses T-SQL as the query language and Tabular Data Stream (TDS) as the wire protocol • Unlike S3, does not provide a REST-based API to access the service over HTTP; clients connect via TDS instead (see the example below) • Allows relational queries to be made against stored data • Enables querying, search, data analysis, and data synchronization
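Because SQL Azure speaks TDS rather than HTTP, clients connect with ordinary SQL Server drivers. The sketch below uses pyodbc, whose SQL Server ODBC driver speaks TDS under the hood; the server name, database, and credentials are hypothetical.

```python
import pyodbc

# Hypothetical server, database, and credentials.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;"
    "DATABASE=surveydb;UID=admin@myserver;PWD=example-password"
)

cursor = conn.cursor()
# Ordinary T-SQL works because SQL Azure is built on SQL Server.
cursor.execute("SELECT TOP 5 name FROM sys.tables ORDER BY name")
for (name,) in cursor.fetchall():
    print(name)
conn.close()
```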
Network Topology – Part 1 (diagram): HTTP/REST client layer, located at the customer premises or on the Windows Azure Platform
Evaluation of SQL Azure • Replica placement • Data reliability • Availability • Data access performance • Security • Scalability: you can store any amount of data, from kilobytes to terabytes, in SQL Azure; however, an individual database is limited to 10 GB in size. • Sharding: partitioning the data is a technique many applications use to improve performance, scalability, and cost. For example, applications that store and process sales data using date or time predicates can benefit from processing a subset of the data instead of the entire data set (a sketch follows).
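Here is a minimal sketch of the date-predicate sharding idea above, assuming one database per year of sales data; the shard names and routing rule are hypothetical illustrations, not a SQL Azure API.

```python
import datetime

# Hypothetical shard map: one database per year of sales data.
SHARDS = {
    2009: "sales_2009.database.windows.net",
    2010: "sales_2010.database.windows.net",
    2011: "sales_2011.database.windows.net",
}

def shard_for(sale_date: datetime.date) -> str:
    """Route a query to the shard holding that date's partition."""
    return SHARDS[sale_date.year]

# A query with a date predicate only touches one shard, not the full data set.
print(shard_for(datetime.date(2011, 7, 14)))
```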
Google File System • Architecture • A single master and multiple chunkservers, which exchange HeartBeat messages • Chunk size = 64 MB • Metadata (kept by the master) • File and chunk namespaces • Mapping from files to chunks • Locations of each chunk's replicas
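Because chunks have a fixed 64 MB size, a GFS client can translate a file byte offset into a chunk index with simple arithmetic before asking the master which chunkservers hold that chunk's replicas. A small sketch with a hypothetical offset:

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB fixed-size chunks

def chunk_index(byte_offset: int) -> int:
    """Map a file byte offset to a chunk index, as a GFS client does before
    querying the master for that chunk's replica locations."""
    return byte_offset // CHUNK_SIZE

# Hypothetical example: a read at byte 200,000,000 falls in chunk index 2.
offset = 200_000_000
print(chunk_index(offset), offset % CHUNK_SIZE)  # chunk index, offset within chunk
```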
GFS Micro-benchmarks • GFS cluster: 1 master with 2 master replicas, 16 chunkservers, and 16 clients • Each machine: dual 1.4 GHz PIII processors and a 100 Mbps full-duplex Ethernet connection to a switch • The 19 servers are connected to switch S1 and the 16 clients to switch S2; S1 and S2 are connected by a 1 Gbps link • Reads • Peak limits: 125 MB/s for the 1 Gbps inter-switch link, 12.5 MB/s per client's 100 Mbps link • When 1 client is reading: read rate = 10 MB/s = 80% of the 12.5 MB/s per-client limit • When 16 clients are reading: aggregate read rate = 94 MB/s ≈ 6 MB/s per client = 75% of the 125 MB/s peak limit
GFS Micro-benchmarks (cont'd) • Writes • Limits: each client's input connection is 12.5 MB/s; the aggregate write limit is 67 MB/s, because each byte must be written to 3 of the 16 chunkservers, so the servers' combined 16 × 12.5 MB/s of input bandwidth is divided by 3 • When 1 client is writing: 6.3 MB/s (delays propagating data between servers) • When 16 clients are writing: 35 MB/s aggregate, 2.2 MB/s per client • Record appends • Performance is bounded by the network bandwidth of the chunkserver holding the last chunk of the file, independent of the number of clients • When 1 client is appending: limit = 6 MB/s • When 16 clients are appending: 4.8 MB/s (the limit arithmetic is reproduced below)
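The bandwidth limits quoted in these benchmarks follow directly from the network setup; a few lines of arithmetic reproduce them:

```python
# Reproducing the benchmark's bandwidth limits from first principles.
LINK = 100 / 8            # 12.5 MB/s per 100 Mbps NIC
INTER_SWITCH = 1000 / 8   # 125 MB/s for the 1 Gbps link between switches
CHUNKSERVERS, REPLICAS = 16, 3

read_limit = INTER_SWITCH                          # all reads cross the 1 Gbps link
write_limit = CHUNKSERVERS * LINK / REPLICAS       # each byte lands on 3 servers

print(read_limit, round(write_limit))              # 125.0, 67 (matches the limits)
```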
Features • Data integrity: checksums • Replica placement • Data reliability • Availability • Fast recovery • Chunk and master replication • Replica rebalancing: better disk-space utilization and load balancing • Garbage collection: storage is not reclaimed immediately; a deleted file is renamed to a hidden name and removed after 3 days • Stale replica detection: each chunk carries a version number that is incremented on every update (a sketch follows)
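Here is a minimal sketch of stale-replica detection via chunk version numbers, assuming the master bumps the authoritative version on each mutation; the chunk and chunkserver names are hypothetical.

```python
# Authoritative version per chunk, held by the master.
master_version = {"chunk-42": 7}
# Versions reported by each chunkserver; cs3 missed an update while it was down.
replica_versions = {"cs1": 7, "cs2": 7, "cs3": 6}

def stale_replicas(chunk: str) -> list:
    """Replicas whose version lags the master's are stale."""
    current = master_version[chunk]
    return [cs for cs, v in replica_versions.items() if v < current]

# Stale replicas are excluded from reads and later garbage-collected.
print(stale_replicas("chunk-42"))   # ['cs3']
```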
Hadoop Distributed File System • Architecture • NameNode: holds the metadata, i.e., the hierarchy of files and directories and their attributes; block size = 128 MB; primary and secondary NameNodes • DataNode: holds application data; each block replica is represented by 2 files, one containing the data itself and one containing the block's metadata (checksums) • Handshake between NameNode and DataNode: verifies the namespace ID and software version • Communication via TCP, with heartbeat messages from DataNodes • HDFS client: the interface between the user application and HDFS, used for reading and writing files (see the example below) • Single-writer, multiple-reader model
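As a hedged illustration of the client's role, the sketch below talks to WebHDFS, HDFS's REST gateway, which mirrors the client protocol: metadata operations go to the NameNode, while reads are redirected to a DataNode that streams the block data. The host, port, and paths are hypothetical, and the cluster must have WebHDFS enabled.

```python
import requests

# Hypothetical NameNode address; 50070 is the classic default WebHDFS port.
NAMENODE = "http://namenode.example.com:50070/webhdfs/v1"

# OPEN reads a file: the NameNode replies with a redirect to a DataNode
# holding a replica; requests follows the redirect automatically.
resp = requests.get(f"{NAMENODE}/logs/app.log", params={"op": "OPEN"})
print(resp.status_code, len(resp.content))

# LISTSTATUS is pure metadata, answered by the NameNode alone.
listing = requests.get(f"{NAMENODE}/logs", params={"op": "LISTSTATUS"}).json()
for entry in listing["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["length"])
```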
HDFS Benchmarks • HDFS clusters at Yahoo!: 3500 nodes, each with 2 quad-core Xeon processors @ 2.5 GHz, Linux, 16 GB RAM, and 1 Gbps Ethernet • DFSIO benchmark: read 66 MB/s per node, write 40 MB/s per node; busy-cluster read 1.02 MB/s per node, busy-cluster write 1.09 MB/s per node • NNThroughput benchmark: starts the NameNode application and multiple client threads on the same node
Features • Placement policy: at most 1 replica of a block per DataNode and no more than 2 replicas per rack, trading off data reliability, availability, and network bandwidth utilization (a sketch follows) • Replication management: under-replicated blocks are re-replicated from a priority queue • Load balancing: HDFS has no built-in strategy; the Balancer, an application program tool, evens out disk usage by keeping each node's utilization close to the cluster utilization (both ratios in the range (0, 1)) • Data integrity: a block scanner on each node verifies checksums • Inter-cluster data copy
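The placement rule above is easy to check mechanically for one block's replica set. A minimal sketch, with hypothetical DataNode and rack names:

```python
from collections import Counter

def placement_ok(replicas) -> bool:
    """Check one block's replicas against the HDFS placement rule:
    at most 1 replica per DataNode, no more than 2 replicas per rack.
    `replicas` is a list of (datanode, rack) pairs."""
    nodes = Counter(node for node, _ in replicas)
    racks = Counter(rack for _, rack in replicas)
    return max(nodes.values()) <= 1 and max(racks.values()) <= 2

# The default layout (one local replica, two on a single remote rack) passes.
print(placement_ok([("dn1", "rackA"), ("dn2", "rackB"), ("dn3", "rackB")]))  # True
# Two replicas on the same DataNode violate the rule.
print(placement_ok([("dn1", "rackA"), ("dn1", "rackA"), ("dn2", "rackB")]))  # False
```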