Jiaheng Lu Renmin Universtiy of China 2009-08-25

中科院软件所 中国人民大学 Cloud-based Data Management: Challenges & Opportunities Jiaheng Lu Renmin Universtiy of China 2009-08-25

Research experience and interesting • National University of Singapore PhD • XML query processing and XML keyword search • University of California, Irvine Postdoc • Approximate string processing • Data integration and data cleaning • Renmin University of China • Cloud data management • XML data management

Outline • Motivation: cloud data management • Database Future and Challenges： • Large-scale Data management & transaction processing • Cloud-based data indexing and query optimization • Recent research work： • An efficient multiple-dimensional indexes for cloud data management • CIKM Workshop CloudDB 2009

Motivation: Internet Chatter

BLOG Wisdom • “If you want vast, on-demand scalability, you need a non-relational database.” Since scalability requirements: • Can change very quickly and, • Can grow very rapidly. • Difficult to manage with a single in-house RDBMS server. • Although RDBMS scale well: • When limited to a single node. • Overwhelming complexity to scale on multiple sever nodes.

Current State • Most enterprise solutions are based on RDBMS technology. • Significant Operational Challenges: • Provisioning for Peak Demand • Resource under-utilization • Capacity planning: too many variables • Storage management: a massive challenge • System upgrades: extremely time-consuming

Internet Search Data Analytics: A Case Study • Data analytics: • Parsed WEB Logs ingested in a RDBMS store. • Hourly and Daily summarization for custom reporting. • Operational nightmare: • Maintaining live reporting system ON at all costs and at all times. • Timely completion of hourly summarization. • Constant tension between Ad-hoc workload versus reporting workload. • Data-driven feedback to live products. • Temporal depth of detailed data

Internet Search Data Analytics: A Case Study • Various solutions explored: • Data Warehousing appliance for fast summarization. • Parallel RDBMS technology for fast ad-hoc queries. • Business Intelligence Products (Data Cubes) for fast and intuitive reporting and analysis. • None of the solutions completely satisfactory: • Plans to migrate low-level data to file-based system to overcome Database scalability bottlenecks

Paradigm Shift in Computing

WEB is replacing the Desktop

What is Cloud Computing? • Old idea: Software as a service (SaaS) • Def: delivering applications over the internet • Recently: “[Hardware, infrastructure, Platform] as a service” • Poorly defined so we avoid all “X as a service” • Utility Computing: pay-as-you-go computing • Illusion of infinite resources • No up-front cost • Fine-grained billing (e.g. hourly)

Why Now? • Experience with very large datacenters • Unprecedented economies of scale • Other factors • Pervasive broadband internet • Pay-as-you-go billing model

Cloud Computing Spectrum • Instruction Set VM (Amazon EC2, 3Tera) • Framework VM • Google AppEngine, Force.com

Cloud Killer Apps • Mobile and web applications • Extensions of desktop software • Matlab, Mathematica • Batch processing/MapReduce

Economics of Cloud Users • Pay by use instead of provisioning for peak

Economics of Cloud Users • Risk of over-provisioning: underutilization

Economics of Cloud Users • Heavy penalty for under-provisioning

Economics of Cloud Providers • 5-7X economies of scale [Hamilton 2008] • Extra benefits • Amazon: utilize off-peak capacity • Microsoft: sell .NET tools • Google: reuse existing infrastructure

Engineering Definition • Providing services on virtual machines allocated on top of a large physical machine pool.

Business Definition • A method to address scalability and availability concerns for large scale applications.

Data Management in the Cloud?

Cloud Computing Implications on DBMSs • Where do Databases fit in this paradigm? • Generational reality: • Animoto.com • Started with 50 servers on Amazon EC2 • Growth of 25,000 users/hour • Need to scale to 3,500 servers in 2 days. • Many similar stories: • RightScale • Joyent • …

Clouded Data? • Reality Number Ⅰ： • Unlimited processing assumption • Interactive page views: • By targeting large number of SQL queries against MySQL • Still Expect sub-millisecond object retrieval • Reality Number Ⅱ: • Why can’t the database tier be replicated in the same way as the Web Server and App Server can? →These are the major challenges for Data Management in the cloud.

The Vision • R&D Challenges at the macro level: • Where and how does the DBMS fit into this model. • R&D Challenges at micro level: • Specific technology components that must be developed to enable the migration of enterprise data into the clouds.

Data and Networks: Attempt Ⅰ • Distributed Database (1980s): • Idealized view: unified access to distributed data • Prohibitively expensive: global synchronization • Remained a laboratory prototype: • Associated technology widely in-use: 2PC

Data and Networks: Attempt Ⅱ

Data and Networks: Pragmatics

Database on S3: SIGMOD’08 • Amazon’s Simple Storage Service(S3): • Updates may not preserve initiation order • No “force” writes • Eventual guarantee • Proposed solution: • Pending Update Queue • Checkpoint protocol to ensure consistent ordering • ACID: only Atomicity + Durability

Unbundling Txns in the Cloud • Research results: • CIDR’09 proposal to unbundle Transactions Management for Cloud Infrastructures • Attempts to refit the DBMS engine in the cloud storage and computing

Analytical Processing

Architectural and System Impacts • Current state: • MapReduce Paradigm for data analysis • What is missing: • Auxiliary structures and indexes for associative access to data (i.e., attribute-based access) • Caveat: inherent inconsistency and approximation • Future projection: • Eventual merger of databases (ODSs) and data warehouses as we have learned to use and implement them.

Underlying Principles: CIDR’2009 • Business data may not always reflect the state of the world or the business: • Inherent lack of perfect information • Secondary data need not be updated with primary data: • Inherent latency • Transactions/Events may temporarily violate integrity constraints: • Referential integrity may need to be compromised

Data Security & Privacy • Data privacy remains a show-stopper in the context of database outsourcing. • Encryption-based solutions are too expensive and are projected to be so in the foreseeable future: • Private Information Retrieval (Sion’2008) • Other approaches: • Information-theoretic approaches that uses data-partitioning for security (Emekci’2007) • Hardware-based solution for information security

Self management and self tuning in cloud-based data management • Self management and self tuning • Query optimization on thousands of nodes

Remarks • Data Management for Cloud Computing poses a fundamental challenge to database researchers: • Scalability • Reliability • Data Consistency • Radically different approaches and solution are warranted to overcome this challenge: • Need to understand the nature of new applications

References • Life Beyond Distributed Transactions: An Apostate’s Opinion by P.Helland, CIDR’07 • Building a Database on S3 M.Brartner, D.Florescu, D.Graf, D.Kossman, T.Kraska, SIGMOD’08 • Unbundling Transaction Services in the Cloud D.Lo,et, A.Fekete, G.Weikum, M.Zwilling, CIDR’09 • Principles of Inconsistency S.Finkelstein, R.Brendle, D.Jacobs, CIDR’09 • VLDB Database School (China) 2009 http://www.sei.ecnu.edu.cn/~vldbschool2009/VLDBSchool2009English.htm

An Efficient Multi-Dimensional Index for Cloud Data Management CIKM workshop CloudDB09

Outline INTRODUCTION MULTI-DIMENSIONAL INDEX WITH KDTREE AND RTREE Extended Nodes partition Node partition Cost Estimation Strategy EVALUATION

Cloud Computing Google File System Yahoo PNUTS

Distributed Cloud base? • BigTable How to query on other attributes besides primary key? • HBase

Distributed Index: Single Dimension? S. Wu and K.-L. Wu, “An indexing framework for efficient retrieval on the cloud,” IEEE Data Eng. Bull., vol. 32, pp.75–82, 2009. H. chih Yang and D. S. Parker, “Traverse: Simplified indexing on large map-reduce-merge clusters,” in Proceedings of DASFAA 2009, Brisbane, Australia, April 2009, pp. 308–322. M. K. Aguilera, W. Golab, and M. A. Shah, “A practical scalable distributed b-tree,” in Proceedings of VLDB’08, Auckland, New Zealand, August 2008, pp. 598–609.

Framework of Request Processing in Cloud

R-Tree R-trees is a tree data structure that is similar to a B-tree, but is used for spatial access methods

KD-Tree kd-tree (short for k-dimensional tree) is a space-partitioning data structure for organizing points in a k-dimensional space.

R-Tree & KD-Tree: RKDTree Master range ： 6800～9000, 3400~8900 range ： 2000～40000, 3400~8900 range ： 6300～7000, 599~1400 range： 0～2000, 500~1200 range： 800～3500, 300~1300 Slave Slave Slave Slave Slave

Random cutting: Pick several random values on the attributeand cut by the points. with the random method you may receivegreat performance, but also possible to have poor performance. Equal cutting: Cut the attribute into several equal intervals.This method is relatively stable since no extreme case willhappen. Clustering-based cutting: Cut the attribute by clustering valueson the attribute and cut between clusters. This methodmay receive foreseeable better performance, but the time costis also apparently higher. The time complexity of a clusteringalgorithm is typically O(nlogn) or even higher. Nodes partition for data summary

Nodes partition Random cutting Equal cutting Clustering-based cutting

Jiaheng Lu Renmin Universtiy of China 2009-08-25

Jiaheng Lu Renmin Universtiy of China 2009-08-25

Presentation Transcript

Renmin University of China School of Finance

2009 , 25

Science programs in universtiy

Renmin University of China School of Finance

Zhang Yu and Zhao Feng (Department of Economics, Renmin University of China)

Renmin University of China School of Finance

JIA Fan School of Finance Renmin University of China

陆嘉恒 2009-08-25

08/2009

Run Cao School of Information Renmin University of China

Capacity Building and Training in China Du Peng Renmin University of China

Jiaheng Lu, Ting Chen and Tok Wang Ling National University of Singapore

08-2009

Renmin University of China School of Finance

Renmin University of China School of Finance

Erming Xu, Ph.D. School of Business Renmin University of China emxu@ruc

By Li Shoujing Hu Bo Renmin University of China