C-Store: Data Management in the Cloud Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Jun 5, 2009
What is the Cloud? • A definition from Wikipedia • Cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet. • Infrastructure as a service (e.g., Amazon EC2) • allows customers to rent computers (virtual machines) on which to run their own applications. • Platform as a service • Software as a service
Amazon EC2 (Elastic Compute Cloud) • EC2 uses Xen virtualization. • Each virtual machine, called an "instance", functions as a virtual private server in one of three sizes: • small, large or extra large. • Amazon.com sizes instances based on "EC2 Compute Units" • the equivalent CPU capacity of physical hardware. One EC2 Compute Unit is equivalent to a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.
Pricing • Amazon charges customers in two primary ways: • Hourly charge per virtual machine • Data transfer charge • Amazon advertising describes the pricing scheme as "you pay for resources you consume".
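The two charges above add up to a simple bill. A minimal sketch in Python, assuming a hypothetical function name and placeholder rates (the numbers are for illustration only and are not Amazon's actual prices); amounts are kept in cents to avoid floating-point rounding:

```python
def estimate_bill_cents(instances, hours, cents_per_instance_hour,
                        gb_transferred, cents_per_gb):
    """Return the total charge in cents: hourly VM charge plus data transfer charge."""
    compute_charge = instances * hours * cents_per_instance_hour
    transfer_charge = gb_transferred * cents_per_gb
    return compute_charge + transfer_charge


# Placeholder rates (assumptions, not real prices).
bill = estimate_bill_cents(instances=4, hours=10, cents_per_instance_hour=10,
                           gb_transferred=50, cents_per_gb=15)
print(f"Total: ${bill / 100:.2f}")
```

Because both terms scale with usage, halving the number of instance-hours halves the compute part of the bill, which is the sense of "you pay for resources you consume".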
Advantage of Public Cloud • Public clouds are hosted by large infrastructure companies such as • Amazon, Google, Yahoo, Microsoft, Sun • which can afford huge clouds. • For many companies, especially start-ups and medium-sized businesses, setting up a private cloud can be too expensive: • hardware cost • software cost • personnel cost for maintaining the system
Cloud Characteristics • Computing power is elastic, but only if the workload is parallelizable. • Computing power comes from a shared-nothing architecture. • Data is stored at an untrusted host. • A possible solution is encrypting the data. • Data is replicated, often across large geographic distances. • To provide data availability and durability.
Transactional Data Management (OLTP) • Typically does not use a shared-nothing architecture. • OLTP databases are usually smaller than 1 TB. • It is hard to maintain ACID guarantees in the face of data replication over large geographic distances. • Google’s Bigtable implements a replicated shared-nothing database by weakening the “A” (atomicity) guarantee of ACID. • The H-Store project is still at the vision stage. • There are big risks in storing transactional data on an untrusted host. • Transactional data include details at the lowest granularity.
First Conclusion • Transactional data management applications are not well suited for deployment in the cloud.
Analytical Data Management (DW) • Tends to be read-mostly (or read-only), with occasional batch inserts. • Shared-nothing architecture is a good match. • The ever-increasing amount of data is the primary driver for choosing shared-nothing. • Large scans, multidimensional aggregations, and star schema joins in analytical workloads are easy to parallelize on a shared-nothing system. • Infrequent writes eliminate the need for complex distributed locking and commit protocols.
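To illustrate why aggregations parallelize easily on a shared-nothing system, here is a minimal single-process sketch in Python. The "nodes" are simulated as plain lists, and the function names, the round-robin partitioning, and the sample rows are all assumptions made for illustration; a real system would run each local aggregation on a separate machine.

```python
from collections import defaultdict


def partition(rows, num_nodes):
    """Round-robin rows across nodes; each node owns its slice exclusively."""
    nodes = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        nodes[i % num_nodes].append(row)
    return nodes


def local_aggregate(rows):
    """Per-node step: sum sales by region using only that node's own rows."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["region"]] += row["sales"]
    return dict(totals)


def merge(partials):
    """Coordinator step: combine the small per-node partial sums."""
    result = defaultdict(int)
    for partial in partials:
        for region, subtotal in partial.items():
            result[region] += subtotal
    return dict(result)


rows = [
    {"region": "east", "sales": 10},
    {"region": "west", "sales": 5},
    {"region": "east", "sales": 7},
    {"region": "west", "sales": 3},
    {"region": "north", "sales": 1},
]
nodes = partition(rows, num_nodes=3)
totals = merge(local_aggregate(node_rows) for node_rows in nodes)
print(totals)
```

No node reads another node's data during the scan, and only the small partial results cross the network, so no distributed locking is involved for this read-only query.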
Analytical Data Management (DW): continued • ACID guarantees are typically not needed. • Snapshot isolation is usually enough. • Particularly sensitive data can often be left out of the analysis. • Less granular versions of the data are usually used for analysis.
Second Conclusion • Analytical Data Management applications are well-suited for deployment in the cloud.
Cloud DBMS Wish List • Efficiency • Fault Tolerance • If a query must restart each time a node fails, then long, complex queries are difficult to complete. • Ability to run in a heterogeneous environment. • Should prevent the slowest node from having a disproportionate effect on total query performance. • Ability to operate on encrypted data. • Ability to interface with business intelligence products.
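The fault-tolerance item can be made concrete with a small sketch in Python: instead of restarting the whole query when one node fails, only the failed task is re-run. Everything here is hypothetical (the `run_query` and `flaky_sum` names, the task layout, and the simulated failure); it shows the general idea of task-level retry, not how any particular system implements it.

```python
def run_query(tasks, run_task, max_attempts=3):
    """Run tasks independently; on failure, retry only the failed task."""
    results = {}
    for task_id, task in tasks.items():
        for attempt in range(1, max_attempts + 1):
            try:
                results[task_id] = run_task(task, attempt)
                break  # task finished; completed work is never redone
            except RuntimeError:
                if attempt == max_attempts:
                    raise  # give up on the query after repeated failures
    return results


def flaky_sum(task, attempt):
    """Simulated node: optionally fails on its first attempt, then succeeds."""
    if task["fails_first"] and attempt == 1:
        raise RuntimeError("simulated node failure")
    return sum(task["data"])


tasks = {
    "t0": {"data": [1, 2, 3], "fails_first": False},
    "t1": {"data": [4, 5], "fails_first": True},
}
results = run_query(tasks, flaky_sum)
print(results)
```

Task `t0` runs once and its result is kept; only `t1` is repeated after its simulated failure, so the cost of a failure is bounded by one task rather than the whole query.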
MapReduce vs. Parallel DBMS (1) • Efficiency • MapReduce is good for brute-force scans over unstructured data such as text documents. • A parallel DBMS is good for selective access to structured data. • Fault Tolerance • MapReduce treats it as a high priority. • Most parallel DBMSs restart a query upon a failure. • Ability to run in a heterogeneous environment. • MapReduce does well. • Parallel DBMSs are generally designed to run in a homogeneous environment.
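As a reminder of what a brute-force scan over text looks like in the MapReduce style, here is a minimal in-memory word count in Python. It mimics the map, shuffle, and reduce phases in a single process; the function names and sample documents are assumptions for illustration, and no real MapReduce framework is involved.

```python
from collections import defaultdict


def map_phase(document):
    """Map: scan every word of the document and emit (word, 1)."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


documents = ["the cloud is elastic", "the cloud is shared"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

Every document is read in full and no index is consulted, which suits unstructured text but is wasteful when a query needs only a few rows of structured data.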
MapReduce vs. Parallel DBMS (2) • Ability to operate on encrypted data. • Neither has a native ability to operate on encrypted data. • Ability to interface with business intelligence products. • MapReduce is not intended for interfacing with BI products. • Parallel DBMSs support BI products well.
A Call for a Hybrid Solution • Bring together ideas from MapReduce and parallel DBMSs. • The hybrid solution should combine • the fault tolerance, heterogeneous-cluster support, and out-of-the-box ease of use of MapReduce • with the efficiency, performance, and tool pluggability of shared-nothing parallel DBMSs.
References • Abadi, Daniel J. Data Management in the Cloud: Limitations and Opportunities. In IEEE Data Engineering Bulletin, 2009. • Vertica Company. Getting Started with Vertica Analytic Database for the Cloud. 2009.