Index for Cloud Data Management • Lab of Web And Mobile Data Management (WAMDM) • Youzhong MA
Outline • Motivating Applications • Existing Technologies • Conclusions & Future work
Motivating Application • Query: select sum(number) from Product where product.name = 'beer' and product.price <= 10 and product.price >= 5 (prices in dollars) • Queries on multiple attributes and on non-rowkey columns are quite common! • Table: Product • Cloud system: Big Data in a private cloud
Motivating Application: Mobile Coupon Distribution • Mobile Coupon Distributor • Current Location • Distribution Policy • Area • # of coupons • Coupon
Motivating Application: Mobile Coupon Distribution • 125,000,000 subscribers in Japan • System Scalability • Large amounts of data • High throughput • Efficient Complex Queries • Multi-dimensional query • Nearest neighbors query • [Figure: many users' current locations feed the distribution policy (area, # of coupons), which sends coupons back]
Outline • Motivating Applications • Existing Technologies • Conclusions & Future work
Existing Technologies at a reasonable price
Solutions overview • Local Index + Global Index • CAS • NEC
Efficient B-tree Based Indexing for Cloud Data Processing • S. Wu, D. Jiang, B. C. Ooi, and K.-L. Wu. PVLDB'10
Efficient B-tree Based Indexing for Cloud Data Processing • Motivation • Design a scalable, high-throughput indexing scheme that supports efficient queries over huge volumes of data in the cloud • Keep maintenance cost low while still supporting parallel search
System Architecture • [Figure: each node publishes part of its local index to the BATON overlay network]
Challenges • How to select the local B+-tree nodes to publish to the global index? • How to organize the global index? • How to maximize the throughput?
Selecting local B+-tree nodes • Cost modeling • Query cost • Routing cost (step 1: search in the global index) • Local search cost (step 2: search in the local index) • Update cost • Cost parameters: cost of sending an index message; cost of random I/O
Adaptive indexing strategy • Index expand • Index collapse • [Figure: local index]
BATON: Balanced Tree Overlay Network • A distributed tree structure for P2P systems • Supports range search
Index Construction • Assign a range to each node • For each node n • The range of its left sub-tree is less than that of n • The range of its right sub-tree is larger than that of n
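The range assignment above can be sketched as follows. This is a minimal illustration under stated assumptions, not BATON's actual protocol: `Node`, `build`, `assign_ranges`, and `find` are hypothetical names, the key space is split evenly, and the search simply descends from the root rather than using BATON's routing tables.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    low: int = 0
    high: int = 0  # the node manages keys in [low, high)


def build(n):
    """Build a balanced binary tree with n nodes (hypothetical helper)."""
    if n == 0:
        return None
    left_n = (n - 1) // 2
    return Node(left=build(left_n), right=build(n - 1 - left_n))


def in_order(node):
    """Yield nodes left subtree first, then the node, then the right subtree."""
    if node is not None:
        yield from in_order(node.left)
        yield node
        yield from in_order(node.right)


def assign_ranges(root, low, high):
    """Give the i-th node in in-order position the i-th slice of [low, high).

    This guarantees that every range in a node's left subtree is below the
    node's own range and every range in its right subtree is above it.
    """
    nodes = list(in_order(root))
    width = (high - low) // len(nodes)
    for i, node in enumerate(nodes):
        node.low = low + i * width
        node.high = high if i == len(nodes) - 1 else node.low + width
    return nodes


def find(root, key):
    """Locate the node responsible for key by comparing against ranges."""
    node = root
    while node is not None:
        if key < node.low:
            node = node.left
        elif key >= node.high:
            node = node.right
        else:
            return node
    return None
```

For example, `assign_ranges(build(7), 0, 70)` gives the root the middle range [30, 40), and `find` then reaches the node responsible for any key in [0, 70) by ordinary binary-search-tree comparisons.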
Maximizing the throughput • Eventually consistent model • Lazy update • If an update does not affect the key range of a local B+-tree, the stale index does not affect the correctness of query processing. • Eager update • Updates in the left-most and right-most nodes
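The lazy/eager distinction can be illustrated with a small sketch. It assumes, for illustration only, that the global index stores just the (min, max) key range of each local B+-tree; the class `LocalIndex` and its fields are hypothetical and are not taken from the paper.

```python
import bisect


class LocalIndex:
    """Toy stand-in for a local B+-tree whose key range is published globally."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        # Assumption: the global index only knows this (min, max) range.
        self.published = (self.keys[0], self.keys[-1])
        self.messages_sent = 0

    def insert(self, key):
        """Insert a key and report whether the global index needed a message."""
        bisect.insort(self.keys, key)
        new_range = (self.keys[0], self.keys[-1])
        if new_range == self.published:
            # The published range still covers every key, so queries routed
            # through the global index stay correct without any message.
            return "lazy"
        # The left-most or right-most boundary moved: propagate it at once.
        self.published = new_range
        self.messages_sent += 1
        return "eager"
```

With `idx = LocalIndex([10, 20, 30])`, inserting 15 leaves the published range untouched, while inserting 5 moves the left boundary and triggers one index message.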
Pros and cons • Pros • Supports efficient point queries and range queries on non-rowkey attributes • Proposes an adaptive indexing strategy based on a cost model of overlay routing • Cons • Does not support multi-dimensional queries
Multi-dimensional index [X. Zhang, CloudDB'09]
Multi-dimensional index [J. Wang, SIGMOD'10] [G. Chen, VLDB'11]
MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services • Shoji Nishimura, Sudipto Das. MDM'11
Contributions • Uses linearization to implement a scalable multi-dimensional index structure layered over a range-partitioned key-value store • Implements a K-d tree and a Quad tree on top of this design
Ordered Key-Value Stores • Sorted by key • Good at 1-D range queries • But our target is multi-dimensional (longitude, latitude, time)… • [Figure: an index over buckets of sorted key-value pairs, key00/value00 … keynn/valuenn]
Naïve Solution: Linearization • Projects n-D space to 1-D space • Apply a Z-ordering curve… • Simple, but problematic… • [Figure: key-value pairs key00/value00 … keynn/valuenn ordered along the curve]
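A Z-ordering key is obtained by interleaving the bits of the coordinates. The sketch below shows this for two dimensions; the function names and the choice that the first dimension contributes the more significant bit of each pair are assumptions made for illustration.

```python
def z_encode(x, y, bits):
    """Interleave the bits of x and y (x first) into one Z-order key."""
    z = 0
    for i in range(bits - 1, -1, -1):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z


def z_decode(z, bits):
    """Invert z_encode: split a Z-order key back into (x, y)."""
    x = y = 0
    for i in range(bits - 1, -1, -1):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y
```

With 2 bits per dimension, the point (10, 00) in binary maps to key 1000 and the point (11, 11) maps to key 1111, so nearby points usually, but not always, receive nearby keys.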
Problem: False positive scans • MD-query on linearized space • Translate the MD-query to a linearized range query. • Ex. query from 2 to 9. • Scan the queried linearized range. • Filter out points outside the queried area. • Ex. blue-hatched area (4 to 7) • Requires the boundary information of the original space.
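The "query from 2 to 9" example can be reproduced on a 4×4 grid. The sketch below assumes 2 bits per dimension with the first dimension taking the more significant bit of each pair; `linearized_scan` is a hypothetical helper, not part of MD-HBase.

```python
def z_encode(x, y, bits=2):
    """Interleave the bits of x and y (x first) into one Z-order key."""
    z = 0
    for i in range(bits - 1, -1, -1):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z


def z_decode(z, bits=2):
    """Split a Z-order key back into (x, y)."""
    x = y = 0
    for i in range(bits - 1, -1, -1):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


def linearized_scan(x_lo, y_lo, x_hi, y_hi, bits=2):
    """Scan the whole key range between the two corners, then filter.

    Returns (hits, false_positives): keys inside the queried rectangle and
    keys that were scanned although they lie outside of it.
    """
    z_lo = z_encode(x_lo, y_lo, bits)
    z_hi = z_encode(x_hi, y_hi, bits)
    hits, false_positives = [], []
    for z in range(z_lo, z_hi + 1):
        x, y = z_decode(z, bits)
        if x_lo <= x <= x_hi and y_lo <= y <= y_hi:
            hits.append(z)
        else:
            false_positives.append(z)
    return hits, false_positives
```

Calling `linearized_scan(1, 0, 2, 1)` scans keys 2 through 9; keys 2, 3, 8, and 9 fall inside the rectangle, while keys 4 to 7 are read and then discarded, which is half of the scanned range.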
MD-HBase • Build a multi-dimensional index layer on top of an ordered key-value store (e.g., BigTable, HBase, …) • [Figure: the MD-HBase multi-dimensional index sits above the store's single-dimensional index]
Space Partition by the K-d tree • Binary Z-ordering space (bitwise interleaving) • Space partitioned by the K-d tree • How do we represent these subspaces? • [Figure: two 4×4 grids with both axes labelled 00, 01, 10, 11]
Key Idea: The longest common prefix naming scheme • Subspaces are represented as the longest common prefix of their keys! • Ex. subspace 1***: replacing * with 0 gives 1000, the left-bottom corner (10, 00); replacing * with 1 gives 1111, the right-top corner (11, 11) • Remarkable property: preserves the boundary information of the original space • [Figure: 4×4 grid with subspaces 000* and 1***]
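The boundary reconstruction for a prefix such as 1*** can be sketched as below. The helper `prefix_to_bounds` is hypothetical, and the bit order (first dimension takes the more significant bit of each pair) is an assumption chosen to match the corner example on this slide.

```python
def z_decode(z, bits):
    """Split a Z-order key back into (x, y), x taking the higher bit of each pair."""
    x = y = 0
    for i in range(bits - 1, -1, -1):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


def prefix_to_bounds(prefix, bits):
    """Recover the corners of the subspace named by a prefix such as '1***'.

    Filling the wildcards with 0 yields the smallest key (left-bottom corner);
    filling them with 1 yields the largest key (right-top corner).
    """
    total = 2 * bits
    known = prefix.rstrip("*")
    low_key = int(known.ljust(total, "0"), 2)
    high_key = int(known.ljust(total, "1"), 2)
    return z_decode(low_key, bits), z_decode(high_key, bits)
```

For instance, `prefix_to_bounds("1***", 2)` returns the corners (2, 0) and (3, 3), that is (10, 00) and (11, 11) in binary, without consulting any data in the subspace.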
Build an index with the longest common prefix of keys • Index entries: 000*, 001*, 01**, 1*** • Allocate one bucket per subspace • [Figure: 4×4 grid partitioned into the subspaces 000*, 001*, 01**, and 1***]
Multi-dimensional Range Query • Scan 0010-1001 on the index • Index entries: 000*, 001* (scan), 01** (filtered out), 10** (scan), 11** • Subspace pruning: reconstruct the boundary info and check whether each subspace intersects the queried area
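The index scan with subspace pruning can be sketched as follows, assuming a 4×4 grid, 2 bits per dimension, and prefixes stored as strings. All function names are hypothetical, and the code only decides which buckets to read; it does not model HBase itself.

```python
def z_encode(x, y, bits=2):
    """Interleave the bits of x and y (x first) into one Z-order key."""
    z = 0
    for i in range(bits - 1, -1, -1):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z


def z_decode(z, bits=2):
    """Split a Z-order key back into (x, y)."""
    x = y = 0
    for i in range(bits - 1, -1, -1):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


def prefix_key_range(prefix, bits=2):
    """Smallest and largest key covered by a prefix such as '01**'."""
    known = prefix.rstrip("*")
    return (int(known.ljust(2 * bits, "0"), 2),
            int(known.ljust(2 * bits, "1"), 2))


def range_query_buckets(index, x_lo, y_lo, x_hi, y_hi, bits=2):
    """Return (to_scan, pruned) for a rectangle query over prefix-named buckets."""
    z_lo = z_encode(x_lo, y_lo, bits)
    z_hi = z_encode(x_hi, y_hi, bits)
    to_scan, pruned = [], []
    for prefix in index:
        low_key, high_key = prefix_key_range(prefix, bits)
        if high_key < z_lo or low_key > z_hi:
            continue  # not part of the linearized range at all
        # Reconstruct the subspace boundary from the prefix alone.
        (bx_lo, by_lo) = z_decode(low_key, bits)
        (bx_hi, by_hi) = z_decode(high_key, bits)
        intersects = (bx_lo <= x_hi and x_lo <= bx_hi and
                      by_lo <= y_hi and y_lo <= by_hi)
        (to_scan if intersects else pruned).append(prefix)
    return to_scan, pruned
```

For the index `["000*", "001*", "01**", "10**", "11**"]` and the rectangle from (1, 0) to (2, 1), the linearized range 0010 to 1001 touches 001*, 01**, and 10**, but the reconstructed boundary of 01** does not intersect the rectangle, so only 001* and 10** are scanned.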
Variations of Storage Layer • Table Share Model • Uses a single table and maintains bucket boundaries • Most space efficient • Table per Bucket Model • Allocates a table per bucket • Most flexible mapping (one-to-one, one-to-many, many-to-one) • Bucket split is expensive: all points are copied to the new buckets • Region per Bucket Model • Allocates a region per bucket • Most efficient bucket split • Requires modification of HBase
Experimental Results: Multi-dimensional Range Query • Dataset: 400,000,000 points • Queries: select objects within MD ranges, with varying selectivity • Cluster size: 16 nodes • MD-HBase responds 10 to 100 times faster than the others, and its response time is proportional to selectivity.
Experimental Results: Insert • Dataset: spatially skewed data • MD-HBase shows good scalability without significant overhead.
Conclusions • Designed a scalable multi-dimensional data store. • Maps multiple dimensions to a single dimension • Key idea: indexing the longest common prefix of keys • Demonstrated scalable insert throughput and excellent query performance. • Range query: 10 to 100 times faster than existing technologies. • Insert: 220K inserts/sec on a 16-node cluster without overhead
CCIndex: A Complemental Clustering Index on Distributed Ordered Tables for Multi-dimensional Range Queries • Y. Zou, J. Liu, S. Wang. NPC'10
Introduction • Motivation • Build indexes in DOTs to support multi-dimensional range queries • High performance, low space overhead, high reliability • DOT • Distributed Ordered Table • E.g., BigTable, HBase • Observations • DOTs usually keep 3 to 5 replicas • The number of indexes is usually less than 5 • Random reads are significantly slower than scans
Basic idea: Complemental Clustering Index • CCIT: converts slow random reads into fast sequential scans • CCT: for fast data recovery
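How a CCIT turns random reads into a sequential scan can be sketched with an in-memory sorted list. This is an illustration under assumptions, not the CCIndex implementation: it assumes each CCIT is keyed by (index value, rowkey) and carries the full row, and the names `build_ccit` and `range_scan` and the sample `product` table are hypothetical.

```python
import bisect

# Hypothetical sample table: rowkey -> row.
product = {
    "p1": {"name": "beer", "price": 4},
    "p2": {"name": "beer", "price": 6},
    "p3": {"name": "beer", "price": 9},
    "p4": {"name": "wine", "price": 8},
}


def build_ccit(rows, column):
    """Build a clustered index table for one column.

    Each entry is ((index value, rowkey), full row), sorted by its key, so
    rows with neighbouring index values are stored next to each other.
    """
    entries = [((row[column], rowkey), dict(row)) for rowkey, row in rows.items()]
    return sorted(entries, key=lambda entry: entry[0])


def range_scan(ccit, low, high):
    """Answer low <= value <= high with one seek plus a sequential scan."""
    keys = [key for key, _ in ccit]
    start = bisect.bisect_left(keys, (low,))
    result = []
    for (value, rowkey), row in ccit[start:]:
        if value > high:
            break
        # The full row is already here: no random read into the base table.
        result.append((rowkey, row))
    return result
```

`range_scan(build_ccit(product, "price"), 5, 10)` returns the rows p2, p4, and p3 in price order by reading one contiguous slice, at the cost of storing a full copy of the data per indexed column.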
Challenges • Performance • Reliability • Space overhead
Performance • Query optimization based on the region-to-server mapping information • Setup: HBase 0.20.1 • 16 nodes • 90 million records
Reliability: Fault tolerance • Get other index values from CCTs • Query the CCITs to recover data • Replicate CCTs
Space overhead • N: the number of index columns • X-axis: ratio of record length to index column length • Y-axis: overhead ratio
Conclusions • Proposed CCIndex to support multi-dimensional range queries in DOTs • Not suitable for more than 5 index columns • Writes are slower than on the original table
Outline • Motivating Applications • Existing Technologies • Conclusions & Future work
Conclusions • Indexes for non-rowkey attributes in cloud data management systems • Solutions • Local index + global index • Linearization • Secondary index • Key issues • Index reliability • Query result correctness • Index maintenance • …
Future work • Study the architecture of HDFS and HBase in detail • Test the existing index solutions in the cloud • Index framework and index structure
References • M. K. Aguilera, W. Golab, and M. A. Shah. A practical scalable distributed B-tree. PVLDB, 1(1):598–609, 2008. • Y. Zou, J. Liu, and S. Wang. CCIndex: A complemental clustering index on distributed ordered tables for multi-dimensional range queries. In NPC, 2010. • S. Wu and K.-L. Wu. An indexing framework for efficient retrieval on the cloud. IEEE Data Eng. Bull., 32:75–82, 2009. • J. Wang, S. Wu, H. Gao, J. Li, and B. C. Ooi. Indexing multi-dimensional data in a cloud system. In SIGMOD, 2010. • S. Wu, D. Jiang, B. C. Ooi, and K.-L. Wu. Efficient B-tree based indexing for cloud data processing. PVLDB, 3(1):1207–1218, 2010. • X. Zhang, J. Ai, Z. Wang, J. Lu, and X. Meng. An efficient multi-dimensional index for cloud data management. In CloudDB, pages 17–24, 2009. • S. Nishimura and S. Das. MD-HBase: A scalable multi-dimensional data infrastructure for location aware services. In MDM, 2011.