Data Management in the Cloud Paul Szerlip
The rise of data • Think about this • For the past two decades, the largest generator of data was humans -- now it's our devices • Cheap sensors • Cellphones are packed with sensory information • Images, video, audio, etc • Expensive sensors • DZero, high energy physics, generates 1 TB a day • How do you deal with that much data? [1,2]
Data in the cloud • Storing the data • Bigtable, S3, NoSQL, etc • Processing the data • MapReduce, Hadoop, etc
Good data management in the cloud • Availability • Accessible in cases of partial network failure or datacenter failure • Scalability • Support for massive database sizes - spread across many servers • Elasticity • Scaling up and scaling down • Performance • Efficient system storage utilization • Multitenancy • Many applications on the same hardware
Good data management (continued) • Load and Tenant Balancing • Moving load between servers • Fault Tolerance • Tolerating network or hardware failures • Running in heterogeneous environment • Dealing with hardware degradation • Flexible query interface • Supporting both SQL and non-SQL query languages
Overarching Themes • Frustration with ACID on the cloud • (Atomicity, consistency, isolation, durability) • Hard to maintain ACID guarantees with data replication over large geographic distances [1] • CAP theorem: of Consistency, Availability, and tolerance to Partitions, choose 2 • Rise of NoSQL (a misnomer) [2] • Eventual consistency can be okay; some ACID properties are relaxed or left to application developers
Investigating 3 Systems • Bigtable (Google) • And a quick look at MapReduce • Amazon: S3 and SimpleDB • Open source NoSQL alternatives: • Cassandra (key-value) • MongoDB (document)
Bigtable • Distributed storage designed to scale to petabyte-size databases spread across thousands of servers [1] • Used extensively by Google • Not fully relational • "Sparse, distributed, persistent multidimensional sorted map" [1] • Uses Google File System (GFS) under the hood • Indexed by row key • Tablet = range of row keys, used for load balancing
Bigtable (continued) • GFS • SSTable • Provides a persistent, immutable, ordered map • Chubby provides the locking mechanism • Ensures a single master • Stores the location of Bigtable data • Stores schema information and access control lists • Each Bigtable is allocated one master and many tablet servers • Master assigns tablets to tablet servers dynamically, based on server load • Tablet servers handle reads and writes
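The sorted-map model and the tablet-as-key-range idea from the Bigtable slides can be sketched in a few lines of Python. This is a toy, in-memory illustration only: `ToyTable`, `find_tablet`, and the sample row keys are hypothetical names chosen here, and nothing below reflects Bigtable's real API or its on-disk SSTable layout.

```python
import bisect
from typing import Dict, List, Optional, Tuple

# Hypothetical toy cell key: (row key, column, timestamp).
Cell = Tuple[str, str, float]


class ToyTable:
    """A sparse, sorted map from (row, column, timestamp) to a byte value."""

    def __init__(self) -> None:
        self._keys: List[Cell] = []          # kept sorted, so row ranges are contiguous
        self._values: Dict[Cell, bytes] = {}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        key = (row, column, timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get_latest(self, row: str, column: str) -> Optional[bytes]:
        # Find the last stored key that sorts at or before (row, column, +inf).
        i = bisect.bisect_right(self._keys, (row, column, float("inf")))
        if i == 0:
            return None
        key = self._keys[i - 1]
        if key[:2] != (row, column):
            return None
        return self._values[key]

    def scan_rows(self, start_row: str, end_row: str) -> List[Tuple[Cell, bytes]]:
        # Return every cell whose row key lies in [start_row, end_row).
        lo = bisect.bisect_left(self._keys, (start_row,))
        hi = bisect.bisect_left(self._keys, (end_row,))
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


def find_tablet(split_keys: List[str], row: str) -> int:
    """Map a row key to a tablet index, given sorted split points.

    Tablet 0 holds rows below split_keys[0]; tablet i holds rows in
    [split_keys[i-1], split_keys[i]); the last tablet holds the rest.
    """
    return bisect.bisect_right(split_keys, row)
```

Because the keys are kept sorted, a tablet is just a contiguous slice of the table, which is what makes moving a key range from one server to another a natural unit of load balancing.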
MapReduce • Introduced by Google in 2004 [1] • Often used to operate on Bigtable data [1] • A means to process large amounts of data in a distributed environment in a highly parallelized manner
MapReduce Steps • Input files split into M pieces, multiple copies of program started on cluster • One copy is the master; it assigns the M map tasks and R reduce tasks to idle workers • Worker reads file split contents, passes to map function - results buffered in memory • Buffered results written to local disk periodically, partitioned into R regions by partitioning function, locations passed to master
MapReduce (continued) • Reduce worker notified about location, reads buffered data from map workers, sorts so that same keys are grouped together • Reduce worker passes key and intermediate values to Reduce function, output is appended to final output file • After all map and reduce tasks completed, master wakes up user program
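The MapReduce steps above can be sketched as a single-process word count. This is only an illustration under stated assumptions: the function names (`map_fn`, `reduce_fn`, `partition`, `run_mapreduce`) are hypothetical, the "workers" are plain loops, in-memory lists stand in for local disk, and there is no master, scheduling, or fault handling.

```python
from itertools import groupby
from typing import Dict, Iterable, List, Tuple


def map_fn(split: str) -> Iterable[Tuple[str, int]]:
    # User-supplied map function: emit (word, 1) for every word in the split.
    for word in split.split():
        yield word.lower(), 1


def reduce_fn(key: str, values: List[int]) -> int:
    # User-supplied reduce function: sum the counts for one word.
    return sum(values)


def partition(key: str, r: int) -> int:
    # Assumed partitioning function: a deterministic stand-in for hash(key) mod R.
    return sum(key.encode("utf-8")) % r


def run_mapreduce(splits: List[str], r: int) -> Dict[str, int]:
    # Map phase: one task per input split (M = len(splits)); each task
    # partitions its intermediate pairs into R regions.
    map_outputs: List[List[List[Tuple[str, int]]]] = []
    for split in splits:
        regions: List[List[Tuple[str, int]]] = [[] for _ in range(r)]
        for key, value in map_fn(split):
            regions[partition(key, r)].append((key, value))
        map_outputs.append(regions)

    # Reduce phase: reduce task i reads region i from every map task,
    # sorts by key so equal keys are adjacent, then reduces each group.
    output: Dict[str, int] = {}
    for region in range(r):
        pairs = [p for regions in map_outputs for p in regions[region]]
        pairs.sort(key=lambda kv: kv[0])
        for key, group in groupby(pairs, key=lambda kv: kv[0]):
            output[key] = reduce_fn(key, [value for _, value in group])
    return output
```

Calling `run_mapreduce(["the cat", "the dog the end"], r=3)` counts words across two splits using three reduce partitions. Since every occurrence of a key lands in the same region, each reduce task sees all values for the keys it owns.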
S3 - Simple Storage Service • "Infinite" store for objects of variable size [1] • Organized in 2 levels • Buckets • Like folders, you can save any number of objects in them • Objects • Byte container (up to 5 GB) and metadata (up to 2KB) • Limited search • Single bucket, name only
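The two-level bucket/object organization can be pictured with a small in-memory model. This is a hedged sketch, not the S3 API: `ToyStore`, `ToyObject`, and their methods are hypothetical, the size limits are the figures quoted on the slide, and treating "name only" search as a name-prefix listing inside one bucket is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

MAX_OBJECT_BYTES = 5 * 1024 ** 3   # 5 GB object limit, as quoted on the slide
MAX_METADATA_BYTES = 2 * 1024      # 2 KB metadata limit, as quoted on the slide


@dataclass
class ToyObject:
    data: bytes
    metadata: Dict[str, str] = field(default_factory=dict)


class ToyStore:
    """Buckets hold objects; objects are a byte container plus metadata."""

    def __init__(self) -> None:
        self._buckets: Dict[str, Dict[str, ToyObject]] = {}

    def create_bucket(self, bucket: str) -> None:
        self._buckets.setdefault(bucket, {})

    def put_object(self, bucket: str, name: str, data: bytes,
                   metadata: Optional[Dict[str, str]] = None) -> None:
        metadata = dict(metadata or {})
        meta_size = sum(len(k.encode()) + len(v.encode()) for k, v in metadata.items())
        if len(data) > MAX_OBJECT_BYTES or meta_size > MAX_METADATA_BYTES:
            raise ValueError("object or metadata exceeds the size limit")
        self._buckets[bucket][name] = ToyObject(data, metadata)

    def get_object(self, bucket: str, name: str) -> ToyObject:
        return self._buckets[bucket][name]

    def list_names(self, bucket: str, prefix: str = "") -> List[str]:
        # Search is limited to object names within a single bucket.
        return sorted(n for n in self._buckets[bucket] if n.startswith(prefix))
```

Note that there is no way to query across buckets or by object content here, which mirrors the "limited search" point on the slide.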
SimpleDB • Organized into domains (tables) where you can insert data, get data, or run queries [1] • Each domain has items which are described by attribute name/value pairs • No schema • API access • CreateDomain, DeleteDomain, PutAttributes, DeleteAttributes, GetAttributes, and Select • Meant for fast reads • Keeps multiple copies of the domains
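The domain/item/attribute model behind those six API calls can be mimicked with nested dictionaries. The sketch below is an in-memory stand-in with hypothetical names (`ToySimpleDB` and its methods); it is not the AWS client, and its `select` takes a single attribute name and value, whereas the real Select call accepts a query expression.

```python
from typing import Dict, Iterable, List, Set


class ToySimpleDB:
    """Domains contain items; items are schema-less sets of name/value pairs."""

    def __init__(self) -> None:
        # domain -> item name -> attribute name -> set of values
        self._domains: Dict[str, Dict[str, Dict[str, Set[str]]]] = {}

    def create_domain(self, domain: str) -> None:
        self._domains.setdefault(domain, {})

    def delete_domain(self, domain: str) -> None:
        self._domains.pop(domain, None)

    def put_attributes(self, domain: str, item: str, attributes: Dict[str, str]) -> None:
        stored = self._domains[domain].setdefault(item, {})
        for name, value in attributes.items():
            # No schema: any item may carry any attribute, with several values.
            stored.setdefault(name, set()).add(value)

    def get_attributes(self, domain: str, item: str) -> Dict[str, Set[str]]:
        stored = self._domains[domain].get(item, {})
        return {name: set(values) for name, values in stored.items()}

    def delete_attributes(self, domain: str, item: str, names: Iterable[str]) -> None:
        stored = self._domains[domain].get(item, {})
        for name in names:
            stored.pop(name, None)
        if not stored:
            self._domains[domain].pop(item, None)

    def select(self, domain: str, name: str, value: str) -> List[str]:
        # Simplified query: names of items where attribute `name` includes `value`.
        return sorted(item for item, attrs in self._domains[domain].items()
                      if value in attrs.get(name, set()))
```

Two items in the same domain can have entirely different attributes, which is the practical meaning of "no schema".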
NoSQL • What does this mean? • More about relaxing ACID than being "No" SQL [2] • Lots of open source NoSQL systems • Zynga was big on NoSQL • Why use them? • Excellent elasticity • Flexible data models - often schema-less • CHEAP (relative to RDBMS) • (if you have lots of frequent, small writes)
Types of NoSQL • Key-value • Redis, Cassandra, etc. • Document store • CouchDB, MongoDB, etc. • Graph DBs, object stores • Won't go into these much
Cassandra • Highly scalable, eventually consistent, distributed, structured, key-value store [1] • Open sourced by Facebook (2008) [1] • ColumnFamily based • Column is a tuple of {key, value, timestamp} • ColumnFamilies contain many columns, all referenced by row-key • Kind of like a hybrid of Dynamo and Bigtable [1]
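The column tuple and the "eventually consistent" property can be illustrated together: if each column carries a timestamp, two replicas that received different writes can be reconciled by keeping the newest column per key. The sketch below assumes a simple last-write-wins rule; the names `Column`, `write`, and `merge_replicas` are hypothetical, and this is not Cassandra's actual reconciliation code.

```python
from typing import Dict, NamedTuple


class Column(NamedTuple):
    key: str        # column name
    value: str
    timestamp: int  # supplied by the writer


# A row maps column keys to columns; a ColumnFamily maps row keys to rows.
Row = Dict[str, Column]
ColumnFamily = Dict[str, Row]


def write(row: Row, column: Column) -> None:
    # Keep a column only if it is newer than what the row already holds.
    current = row.get(column.key)
    if current is None or column.timestamp > current.timestamp:
        row[column.key] = column


def merge_replicas(a: Row, b: Row) -> Row:
    # Assumed last-write-wins reconciliation of two replicas of one row.
    merged: Row = dict(a)
    for column in b.values():
        write(merged, column)
    return merged
```

Because the merge depends only on timestamps, replicas converge to the same row no matter the order in which they exchange updates, as long as no two writes to the same column share a timestamp.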
MongoDB • Document-oriented • High read/write throughput • High availability • Scalability • Flexible query language
References • [1] Sakr, S., Liu, A., Batista, D.M., Alomari, M., "A Survey of Large Scale Data Management Approaches in Cloud Environments," IEEE Communications Surveys & Tutorials, 2011. • [2] Cloud Computing: Theory and Practice (our lecture notes). • [3] http://www.mongodb.org/display/DOCS/Introduction