C-Store: Data Management in the Cloud

Jianlin Feng
School of Software, SUN YAT-SEN UNIVERSITY
Jun 5, 2009

What is the Cloud?
• A definition from Wikipedia:
  • Cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet.
• Infrastructure as a service (e.g., Amazon EC2)
  • Allows customers to rent computers (virtual machines) on which to run their own applications.
• Platform as a service
• Software as a service

Amazon EC2 (Elastic Compute Cloud)
• EC2 uses Xen virtualization.
• Each virtual machine, called an "instance", functions as a virtual private server in one of three sizes:
  • small, large, or extra large.
• Amazon.com sizes instances in "EC2 Compute Units":
  • the equivalent CPU capacity of physical hardware.
  • One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.

Pricing
• Amazon charges customers in two primary ways:
  • Hourly charge per virtual machine
  • Data transfer charge
• Amazon advertising describes the pricing scheme as "you pay for resources you consume".
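
The two charges listed above simply add up. A minimal sketch of that arithmetic, assuming a hypothetical helper named `estimate_cost` and placeholder rates (the hourly and per-GB figures are made up for illustration and are not Amazon's actual prices):

```python
def estimate_cost(instances, hours, gb_transferred, hourly_rate, transfer_rate_per_gb):
    """Estimate a bill from the two charge types: instance-hours plus data transfer."""
    compute_charge = instances * hours * hourly_rate          # hourly charge per virtual machine
    transfer_charge = gb_transferred * transfer_rate_per_gb   # data transfer charge
    return compute_charge + transfer_charge


# Placeholder rates, chosen only to make the arithmetic easy to follow.
cost = estimate_cost(
    instances=4,
    hours=10,
    gb_transferred=50,
    hourly_rate=0.10,
    transfer_rate_per_gb=0.20,
)
print(f"Estimated cost: {cost:.2f}")
```

Because the compute charge scales with instances times hours, running 4 instances for 10 hours costs the same as running 40 instances for 1 hour under this model, which is what makes elastic scale-out attractive for parallelizable work.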

Advantages of the Public Cloud
• Public clouds are hosted by large infrastructure companies such as
  • Amazon, Google, Yahoo, Microsoft, Sun
  • These companies can afford to build huge clouds.
• For many companies (especially start-ups and medium-sized businesses), setting up a private cloud can be too expensive:
  • Hardware cost
  • Software cost
  • Personnel cost for maintaining the system

Cloud Characteristics
• Computing power is elastic, but only if the workload is parallelizable.
• Computing power comes from a shared-nothing architecture.
• Data is stored at an untrusted host.
  • A possible solution is encrypting the data.
• Data is replicated, often across large geographic distances.
  • This provides data availability and durability.
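
The first point, that elasticity only helps parallelizable workloads, can be illustrated with Amdahl's law. The slides do not name this law; it is brought in here as the standard bound on speedup when only part of a job can be spread across nodes. The function name `amdahl_speedup` is hypothetical.

```python
def amdahl_speedup(parallel_fraction, nodes):
    """Upper bound on speedup when only `parallel_fraction` of the work scales with nodes."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


# Compare a half-serial job with a fully parallel one as nodes are added.
for fraction in (0.5, 0.95, 1.0):
    speedups = [round(amdahl_speedup(fraction, n), 2) for n in (1, 10, 100)]
    print(fraction, speedups)
```

A job that is half serial can never run more than twice as fast, no matter how many instances are rented, so the extra machines only add cost.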

Transactional Data Management (OLTP)
• Typically does not use a shared-nothing architecture.
  • OLTP systems are usually less than 1 TB in size.
• It is hard to maintain ACID guarantees in the face of data replication over large geographic distances.
  • Google's Bigtable implements a replicated shared-nothing database by weakening the "A" (atomicity) guarantee of ACID.
  • The H-Store project is still in the vision stage.
• There are big risks in storing transactional data on an untrusted host.
  • Transactional data includes details at the lowest granularity.

First Conclusion
• Transactional data management applications are not well suited for deployment in the cloud.

Analytical Data Management (DW)
• Workloads tend to be read-mostly (or read-only), with occasional batch inserts.
• A shared-nothing architecture is a good match.
  • The ever-increasing amount of data is the primary driver for choosing shared-nothing.
  • The large scans, multidimensional aggregations, and star schema joins of analytical workloads are easy to parallelize on a shared-nothing system.
  • Infrequent writes eliminate the need for complex distributed locking and commit protocols.
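
Why aggregations parallelize easily on a shared-nothing system can be sketched as follows: each node aggregates only its own partition, and a coordinator merges the small partial results. This is an illustrative single-process simulation, not how Vertica or any particular DBMS implements it; the function names and the (region, amount) rows are hypothetical.

```python
from collections import Counter


def local_aggregate(partition):
    """Runs on one node: sum amounts per region using only that node's rows."""
    totals = Counter()
    for region, amount in partition:
        totals[region] += amount
    return totals


def merge_partials(partials):
    """Runs on the coordinator: combine the small per-node results."""
    combined = Counter()
    for partial in partials:
        combined.update(partial)  # Counter.update adds values key by key
    return dict(combined)


def parallel_sum_by_key(partitions):
    """Equivalent to SELECT region, SUM(amount) ... GROUP BY region over all partitions."""
    return merge_partials(local_aggregate(p) for p in partitions)


# Three "nodes", each holding a disjoint slice of the fact table.
partitions = [
    [("east", 10), ("west", 5)],
    [("east", 7)],
    [("west", 3), ("north", 1)],
]
print(parallel_sum_by_key(partitions))
```

No node reads another node's rows and nothing is written during the scan, so no distributed locking is needed; only the per-node totals cross the network.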

Analytical Data Management (DW), continued
• ACID guarantees are typically not needed.
  • Snapshot isolation is usually enough.
• Particularly sensitive data can often be left out of the analysis.
  • Less granular versions of the data are usually used for analysis.

Second Conclusion
• Analytical data management applications are well suited for deployment in the cloud.

Vertica (C-Store) for the Cloud

Cloud DBMS Wish List
• Efficiency
• Fault tolerance
  • If a query must restart each time a node fails, then long, complex queries are difficult to complete.
• Ability to run in a heterogeneous environment
  • Should prevent the slowest node from having a disproportionate effect on total query performance.
• Ability to operate on encrypted data
• Ability to interface with business intelligence products
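
The heterogeneity item can be made concrete with a small model. Assuming a query finishes only when its last node finishes, an equal split of work lets one slow node dominate, while a split proportional to node speed does not. Both function names and the speed figures are hypothetical, and real systems typically rebalance dynamically rather than knowing speeds in advance.

```python
def static_split_time(total_work, speeds):
    """Every node gets an equal share; the query ends when the slowest node ends."""
    share = total_work / len(speeds)
    return max(share / speed for speed in speeds)


def proportional_split_time(total_work, speeds):
    """Each node gets work in proportion to its speed, so all nodes finish together."""
    return total_work / sum(speeds)


# Three normal nodes and one node running at a quarter of the speed.
speeds = [1.0, 1.0, 1.0, 0.25]
print("equal split:       ", static_split_time(100, speeds))
print("proportional split:", round(proportional_split_time(100, speeds), 2))
```

With the equal split, the one slow node makes the four-node cluster no faster than a single normal node working alone on the whole job.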

MapReduce vs. Parallel DBMS (1)
• Efficiency
  • MapReduce is good for brute-force scans over unstructured data such as text documents.
  • A parallel DBMS is good for selective access to structured data.
• Fault tolerance
  • MapReduce treats it as a high priority.
  • Most parallel DBMSs restart a query upon a failure.
• Ability to run in a heterogeneous environment
  • MapReduce does well.
  • Parallel DBMSs are generally designed to run in a homogeneous environment.
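
For readers unfamiliar with the MapReduce model, the brute-force scan over text documents mentioned above can be sketched as a word count. This is a single-process illustration of the map, shuffle, and reduce phases with hypothetical function names; it is not the API of Hadoop or of Google's MapReduce.

```python
from collections import defaultdict


def map_phase(document):
    """Map: scan one document and emit a (word, 1) pair for every word."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["the cloud", "The data the"]))
```

Each map call touches one document and each reduce call touches one key, so a failed task can be rerun on its own input without restarting the whole job, which is the fault-tolerance property the slide contrasts with query restarts in parallel DBMSs.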

MapReduce vs. Parallel DBMS (2)
• Ability to operate on encrypted data
  • Neither has the native ability to operate on encrypted data.
• Ability to interface with business intelligence products
  • MapReduce is not intended for interfacing with BI products.
  • Parallel DBMSs support BI products well.

A Call for a Hybrid Solution
• Bring together ideas from MapReduce and parallel DBMSs.
• The hybrid solution should combine
  • the fault tolerance, heterogeneous-cluster support, and out-of-the-box ease of use of MapReduce
  • with the efficiency, performance, and tool pluggability of a shared-nothing parallel DBMS.

References
• Abadi, Daniel J. Data Management in the Cloud: Limitations and Opportunities. IEEE Data Engineering Bulletin, 2009.
• Vertica Company. Getting Started with Vertica Analytic Database for the Cloud. 2009.
