ES2: A Cloud Storage for Supporting both OLTP and OLAP
Yu Cao, Chun Chen, Fei Guo, Dawei Jiang, Yuting Lin, Beng Chin Ooi, Hoang Tam Vo, Sai Wu, Quanqing Xu
“NoSQL” vs. “CloudDB”
• Mix ad-hoc requirements with trying to find/develop open-source systems: Amazon SimpleDB, BigTable, Azure Table Storage, HBase, Cassandra, PNUTS, Scalaris, Hypertable, InfoGrid, SciDB, HypergraphDB, MemcacheDB, LightCloud, Objectivity, Perst, GenieDB, CouchDB, KAI
• Know targets / build proprietary systems: MongoDB, Dynamo, Voldemort, Dynomite, and variants... (http://nosql-database.org)
System Design by Choice/Need
[Diagram: three design points — a DBMS (QP engine + storage engine) serving OLTP & updates; a data warehouse (QP engine + storage engine) fed from the DBMS via ETL, serving OLAP queries; and Web 2.0 analytics over a DFS with distributed databases (HBase, Cassandra, Bigtable, ...) and processing engines (MapReduce, Dryad, Hive, Pig, ...) reading files, streams, and query results]
Problems:
1. Data freshness / real-time search
2. Storage / investment cost
Challenges from Applications
• Real-time analysis
• Updatable warehouse
• Database as a Service (DaaS)
[Diagram: available-to-promise workflow — a new order triggers stock-level aggregation; if the order is available to promise, place the order; otherwise, request the supplier]
Challenge 1: How to support both OLTP and OLAP within the same storage and processing engine
Challenge 2: Similar functions as centralized DBMSes, such as indexes, but in a scalable/elastic cloud environment
Design of Cloud Data Management Systems
• Goals
  • Scalable DaaS
  • Integrate OLTP and OLAP into one data storage and processing system without loss of performance
• Scalable OLTP
  • Redesign the OLTP module to tackle huge volumes of insertions/deletions
  • Low latency is important
• Scalable OLAP
  • Redesign the OLAP module to perform cloud-scale data analysis
  • Fault-tolerance support in query processing
epiC: elastic, power-aware, data-intensive Cloud
• One data management system instance
  • Shared-nothing structure
  • Runs on all nodes
• Integrates OLTP and OLAP
  • OLTP and OLAP are separate modules (not separate systems)
  • Share the same storage layer
  • Workload dispatched based on query type
[Diagram: users/businesses send requests to a query dispatcher; the data management system's OLTP and OLAP modules share the storage layer (ES2), running on virtual machines in the cloud environment]
More info on the epiC project: http://www.comp.nus.edu.sg/~epiC
ES2 – an elastic cloud storage system
• Key features
  • Elastic scaling
  • Hybrid storage, supporting both OLTP and OLAP
  • Flexible data partitioning based on the database workload
  • Load-adaptive replication
  • Transactional semantics for bundled updates
  • DBMS-like index functionality
    • Multiple indexes of different types, e.g. hash, range, multi-dimensional, bitmap indexes
• Comparisons to other systems: Cassandra, PNUTS, and Megastore
Towards Elastic Transactional Cloud Storage with Range Query Support. H. T. Vo, B. C. Ooi, C. Chen. PVLDB 3(1): 506-517 (2010)
Architecture of ES2
[Diagram: data import control (import manager, write cache) loads data from plain files, databases, and applications into the distributed file system; data access control (data access interface, data manipulator) serves data requests via distributed indexing and the meta-data catalog, returning results]
Hybrid Data Partitioning
[Diagram: guided by a workload trace, a logical Emp table is split into horizontal partitions (e.g. rows 1-2 vs. rows 3-5) and vertical partitions (e.g. (id, name, age) vs. (id, salary, dept)), each stored in a PAX layout. Sample workload: select name from Emp where age > 35; select avg(salary) from Emp group by dept; update Emp set salary = 4K where id = 4; ...]
[PAX] Weaving Relations for Cache Performance. A. Ailamaki, D. J. DeWitt, M. D. Hill, M. Skounakis. VLDB 2001: 169-180
Hybrid Data Partitioning
• The rationales
  • Vertical partitioning
    • Group frequently accessed columns together
    • Minimize disk I/Os and improve query performance
  • Horizontal partitioning
    • Facilitate parallelism
    • Minimize the number of distributed transactions
  • PAX storage layout
    • Cache-conscious storage layout
    • Improve CPU cache hits and OLAP performance
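The partitioning ideas above can be sketched in a few lines. This is an illustrative toy, not ES2's actual partitioning algorithm: the `Emp` data follows the example on the previous slide, while the column groupings and the hash-on-key split are assumptions.

```python
# Logical Emp table as rows: (id, name, age, salary, dept)
rows = [
    (1, "Alice", 32, 2.5, "HR"),
    (2, "Fred", 49, 3.0, "FI"),
    (3, "Malice", 37, 4.0, "MA"),
    (4, "Fred", 24, 3.5, "FI"),
    (5, "Smith", 30, 6.0, "HR"),
]
columns = ["id", "name", "age", "salary", "dept"]

# Vertical partitioning: group columns the workload accesses together,
# e.g. (name, age) for selections, (salary, dept) for the aggregate.
vertical_groups = [["id", "name", "age"], ["id", "salary", "dept"]]

def horizontal_partition(rows, n_parts):
    # Horizontal partitioning for parallelism: hash on the primary key.
    parts = [[] for _ in range(n_parts)]
    for r in rows:
        parts[r[0] % n_parts].append(r)
    return parts

def pax_page(part, group):
    # PAX layout: within one page, each column is stored contiguously
    # (a "minipage"), so scans touching few columns stay cache-friendly.
    idx = [columns.index(c) for c in group]
    return {c: [r[i] for r in part] for c, i in zip(group, idx)}

parts = horizontal_partition(rows, 2)
page = pax_page(parts[0], vertical_groups[0])
print(page)  # one minipage (value list) per column in the group
```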
Distributed Indexing
• Why index?
  • OLTP queries: high selectivity, low latency expectation
  • Cloud storage system: huge volume of data
  • Parallel scan: scan 1 TB of data to get 10 tuples?
• Why distributed?
  • A central server may become a bottleneck
  • Facilitate parallelism and load balancing
• Objectives
  • Provide DBMS-like indexes
  • Multiple indexes of different types (hash, range, multi-dimensional, bitmap)
  • Extensibility of indexes as a separate research issue
Idea from P2P Networks
• Each cluster node
  • acts as a peer in the P2P overlay
  • maintains “local” indexes such as B+-trees and R-trees
• Index building
  • when data are imported
  • publish the index entries to different indexes based on P2P routing protocols
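A minimal sketch of the publication step, assuming a consistent-hashing ring as the routing protocol (standing in for the Chord/BATON/CAN overlays on the next slide; the node names and entry format are hypothetical):

```python
import bisect
import hashlib

class Overlay:
    """Toy consistent-hashing ring: each peer owns an arc of key hashes."""

    def __init__(self, peers):
        self.ring = sorted((self._h(p), p) for p in peers)

    @staticmethod
    def _h(x):
        return int(hashlib.md5(str(x).encode()).hexdigest(), 16)

    def responsible_peer(self, key):
        # Route to the first peer clockwise from the key's hash position.
        hashes = [h for h, _ in self.ring]
        i = bisect.bisect(hashes, self._h(key)) % len(self.ring)
        return self.ring[i][1]

# Each cluster node keeps a "local" index; on data import, entries are
# published (routed) to whichever peer is responsible for the key.
peers = ["node1", "node2", "node3"]
overlay = Overlay(peers)
local_indexes = {p: {} for p in peers}

def publish(key, entry):
    local_indexes[overlay.responsible_peer(key)][key] = entry

for k in ["alice", "fred", "smith"]:
    publish(k, f"pointer-to-{k}")
```

A lookup simply routes the same way: `local_indexes[overlay.responsible_peer(key)][key]`, so no central index server is needed.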
Challenges of Distributed Indexes
• Different overlays are required to support different types of indexes
  • BATON for B+-trees [1]
  • CAN for R-trees [2]
  • Chord for hashing
• Overlay routing and maintenance cost is too high
• Load balancing issues
[Diagram: peers A-F in an overlay; a range query Q(x, y) and hashed keys h(key1), h(key2) are routed to the responsible peers]
1. Efficient B+-tree Based Indexing for Cloud Data Processing. S. Wu, D. Jiang, B. C. Ooi, K. L. Wu. VLDB 2010
2. Indexing Multi-dimensional Data in a Cloud System. J. Wang, S. Wu, H. Gao, J. Li, B. C. Ooi. ACM SIGMOD 2010
Distributed Index Architecture
• Optimizations:
  • Index + base table vs. index-only plan
    • materialize a portion of the data record
  • Adaptive network connections
Data Access Optimizer
For a query Q, estimate the data access cost of a parallel scan (c_pscan) vs. an index scan (c_iscan), and choose the cheaper data access plan.
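The choice can be sketched with a toy cost model. The formulas and constants here are illustrative assumptions, not ES2's actual estimator:

```python
def choose_plan(n_tuples, n_nodes, selectivity,
                index_lookup_cost=50.0, per_tuple_cost=0.01):
    """Pick the cheaper of a parallel scan and an index scan."""
    # Parallel scan: every node scans its share of the table.
    c_pscan = n_tuples * per_tuple_cost / n_nodes
    # Index scan: overlay lookup plus fetching only the matching tuples.
    c_iscan = index_lookup_cost + selectivity * n_tuples * per_tuple_cost
    if c_iscan < c_pscan:
        return ("index scan", c_iscan)
    return ("parallel scan", c_pscan)

# High-selectivity OLTP point query: the index scan wins.
print(choose_plan(n_tuples=10**8, n_nodes=20, selectivity=1e-7))
# OLAP aggregate touching half the table: the parallel scan wins.
print(choose_plan(n_tuples=10**8, n_nodes=20, selectivity=0.5))
```

This mirrors the slide's point: OLTP lookups go through the distributed index, while large OLAP scans bypass it.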
BIDS: Bitmap Index for Database Service in the Cloud
• Challenge of supporting a large number of indexes: large index data
  • Compact size of BIDS
• BIDS supports a wider range of queries
  • If the queries only involve indexed attributes, we can answer them completely via the indexes
Query Processing with BIDS
SELECT sum(extendedprice * discount) as revenue
FROM Lineitem
WHERE shipdate >= x and shipdate < x + 1 year
  and discount >= y and discount < y + 0.02
  and quantity < z
• For TPC-H Q6:
  • Retrieve the following bitmap indexes:
    • B1: x <= shipdate < x + 1 year
    • B2: y <= discount < y + 0.02
    • B3: quantity < z
    • B4: bitmap index for extendedprice
    • B5: bitmap index for discount
  • Filter = B1 AND B2 AND B3
  • For tuples passing the filter, compute extendedprice * discount via B4 and B5
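The Q6 evaluation above can be sketched end to end. For clarity the measure columns (the slide's B4/B5) are shown as plain per-row value arrays rather than the bit-sliced encodings BIDS would actually use, and the column values are made up:

```python
# Toy Lineitem columns (hypothetical values)
shipdate = [1, 2, 1, 3, 1]
discount = [0.05, 0.06, 0.05, 0.10, 0.06]
quantity = [10, 30, 5, 10, 20]
extprice = [100.0, 200.0, 300.0, 400.0, 500.0]

# Predicate bitmaps, one bit per row (here x = 1, y = 0.05, z = 25)
B1 = [int(d == 1) for d in shipdate]            # x <= shipdate < x + 1 year
B2 = [int(0.05 <= d < 0.07) for d in discount]  # y <= discount < y + 0.02
B3 = [int(q < 25) for q in quantity]            # quantity < z

# Filter = B1 AND B2 AND B3: a bitwise AND over the predicate bitmaps
filt = [a & b & c for a, b, c in zip(B1, B2, B3)]

# Aggregate extendedprice * discount only over rows whose bit is set
revenue = sum(p * d for bit, p, d in zip(filt, extprice, discount) if bit)
print(filt, revenue)
```

The whole query is answered from the indexes alone, without touching base-table records.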
BIDS: Bitmap Index for Database Service in the Cloud
• A column has too many unique values?
  • The size of the bitmap index may be larger than the original dataset
• Compression solutions
  • WAH encoding [1]
  • Bit-sliced encoding [2]
  • Partial index
• All indexes are buffered in the distributed memory
• Index update?
1. Kesheng Wu, Ekow J. Otoo, and Arie Shoshani. Compressing Bitmap Indexes for Faster Search Operations. In SSDBM’02.
2. Denis Rinfret, Patrick O'Neil, and Elizabeth O'Neil. Bit-Sliced Index Arithmetic. In SIGMOD’01.
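The intuition behind the compression can be shown with plain run-length encoding; this is a simplification in the spirit of WAH's fill/literal words, not the actual word-aligned encoding of Wu et al.:

```python
def rle_encode(bits):
    """Compress a bitmap into (bit, run_length) pairs."""
    runs, prev, count = [], bits[0], 0
    for b in bits:
        if b == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = b, 1
    runs.append((prev, count))
    return runs

def rle_decode(runs):
    return [b for b, n in runs for _ in range(n)]

# A high-cardinality column produces one very sparse bitmap per value:
# long 0-runs collapse into a handful of (bit, length) pairs.
bitmap = [0] * 1000 + [1] + [0] * 1000
runs = rle_encode(bitmap)
print(runs)  # [(0, 1000), (1, 1), (0, 1000)]
assert rle_decode(runs) == bitmap
```

This is why a bitmap index over many unique values stays compact after compression, even though the uncompressed bitmaps would exceed the dataset size.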
Evaluation
• TPC-H dataset, scale 30 GB
• System size: 5-35 nodes
• Multi-dimensional query
  • Distributed multi-dimensional index on the (totalprice, orderdate) attributes of the Orders table
  • Base approach?
SELECT custkey, count(orderkey), sum(totalprice)
FROM Orders
WHERE totalprice >= y and totalprice <= y + 100
  and orderdate >= z and orderdate <= z + 1 month
GROUP BY (custkey)
Evaluation
• Data freshness observed by OLAP scans concurrent with OLTP updates
• System size: 64 nodes
• Data size: 32 GB to 512 GB
• Update rate: 5 nodes, each submitting 100 ops/sec, following uniform and normal distributions
• Metric: maximal version difference
• Comparisons: ES2 vs. recent
Other Results
• Data import performance
• Metadata catalog maintenance
• Additional index performance
• OLAP query performance of the epiC system
More info on the epiC project: http://www.comp.nus.edu.sg/~epiC
Thank you! Questions & Answers