Lab of Web And Mobile Data Management （ WAMDM ） Youzhong MA

Index for Cloud Data Management • Lab of Web And Mobile Data Management (WAMDM) • Youzhong MA

Outline • Motivating Applications • Existing Technologies • Conclusions & Future work

Motivating Application: Big Data in a Private Cloud • Table: Product • Query sent to the cloud system: select sum(number) from Product where product.name = 'beer' and product.price <= 10 and product.price >= 5 (prices in $) • Queries on multiple attributes and on non-rowkey columns are quite common!

Motivating Application: Mobile Coupon Distribution • Mobile Coupon Distributor • Distribution Policy • Area • # of coupons • [Figure: subscribers report their Current Location to the distributor and receive a Coupon]

Motivating Application: Mobile Coupon Distribution • 125,000,000 subscribers in Japan • System Scalability • Large amounts of Data • High Throughput • Efficient Complex Queries • Multi-Dimensional Query • Nearest Neighbors Query • [Figure: many subscribers report their Current Location; coupons are sent according to the Distribution Policy (Area, # of coupons)]

Existing Technologies at a reasonable price

Solutions-overview Local Index + Global Index CAS NEC

Efficient B-tree Based Indexing for Cloud Data Processing. S. Wu, D. Jiang, B. C. Ooi, and K.-L. Wu. PVLDB'10

Efficient B-tree Based Indexing for Cloud Data Processing • Motivation • Designing a scalable, high-throughput indexing scheme that supports efficient queries over huge volumes of data in the cloud • Keeping maintenance cost low while also supporting parallel search

System Architecture • [Figure: nodes publish part of their local index to the BATON overlay network]

Challenges • How to select the local B+-tree nodes to publish in the global index? • How to organize the global index? • How to maximize the throughput?

Selecting local B+-tree nodes • Cost modeling • Query cost • Step 1: search in the global index • Step 2: search in the local index • Routing cost: • Local search cost: • Update cost: • Notation: cost of sending an index message; cost of random I/O

Adaptive indexing strategy • Index expand • Index collapse Local Index

BATON: Balanced Tree Overlay Network • A distributed tree structure for P2P systems • Supports range search

Index Construction • Assign a range to each node • For each node n • The range of its left sub-tree is less than that of n • The range of its right sub-tree is larger than that of n
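The range invariant on this slide can be illustrated with a small sketch. It is not BATON itself (the real overlay also keeps routing tables between peers); it only shows, under the assumption of a plain binary tree with hypothetical names `Node` and `range_search`, how the "left is smaller, right is larger" rule lets a range search skip whole sub-trees.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """One overlay node owning the half-open key range [low, high)."""
    name: str
    low: int
    high: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def range_search(node: Optional[Node], qlow: int, qhigh: int) -> List[str]:
    """Return names of nodes whose range overlaps the query [qlow, qhigh)."""
    if node is None:
        return []
    hits: List[str] = []
    # All ranges in the left sub-tree lie below node.low, so descend
    # only when the query starts below node.low.
    if qlow < node.low:
        hits += range_search(node.left, qlow, qhigh)
    if qlow < node.high and node.low < qhigh:
        hits.append(node.name)
    # All ranges in the right sub-tree lie at or above node.high.
    if qhigh > node.high:
        hits += range_search(node.right, qlow, qhigh)
    return hits


# Hypothetical three-node overlay covering keys 0..59.
root = Node("n1", 20, 40,
            left=Node("n2", 0, 20),
            right=Node("n3", 40, 60))

print(range_search(root, 15, 45))  # ['n2', 'n1', 'n3']
print(range_search(root, 25, 30))  # ['n1']
```

The second query touches only one node because neither sub-tree can hold keys inside [25, 30).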

Publish local B+-tree node to BATON

Maximizing the throughput • Eventually consistent model • Lazy update • If the update does not affect the key range of a local B+-tree, the stale index will not affect the correctness of query processing. • Eager update • Updates in the leftmost and rightmost nodes
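A minimal sketch of the lazy/eager decision, assuming the global index stores only the key range `[low, high]` of each local B+-tree. The names `PublishedEntry` and `apply_insert` are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class PublishedEntry:
    """Key range of a local B+-tree as last published to the global index."""
    low: int
    high: int


def apply_insert(entry: PublishedEntry, key: int) -> str:
    """Decide how an insert of `key` must be propagated to the global index."""
    if entry.low <= key <= entry.high:
        # The published range still covers the key, so queries are still
        # routed to this node correctly; the update can be deferred.
        return "lazy"
    # The key extends the range at the leftmost or rightmost end, so the
    # global index must be refreshed right away.
    entry.low = min(entry.low, key)
    entry.high = max(entry.high, key)
    return "eager"


entry = PublishedEntry(low=100, high=200)
print(apply_insert(entry, 150))  # lazy
print(apply_insert(entry, 250))  # eager
print(entry)                     # PublishedEntry(low=100, high=250)
```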

Pros and cons • Pros • Supports efficient point and range queries on non-rowkey attributes • Proposes an adaptive indexing strategy based on a cost model of overlay routing • Cons • Cannot support multi-dimensional queries

Multi-dimensional index [X. Zhang, CloudDB'09]

Multi-dimensional index [J. Wang, SIGMOD'10] [G. Chen, VLDB'11]

MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services. Shoji Nishimura, Sudipto Das. MDM'11

Contributions • Using linearization to implement a scalable multi-dimensional index structure layered over a range-partitioned key-value store • Implementing a K-d tree and a Quad tree using this design

Ordered Key-Value Stores • Sorted by key • Good at 1-D Range Query • [Figure: an index over buckets of sorted key-value pairs, key00/value00 … keynn/valuenn] • But, our target is multi-dimensional… (Longitude, Latitude, Time)

Naïve Solution: Linearization • Projects n-D space to 1-D space • Apply a Z-ordering curve… • [Figure: key-value pairs key00/value00 … keynn/valuenn reordered along the curve] • Simple, but problematic…
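Z-ordering by bitwise interleaving can be sketched as follows for two dimensions. The function names are hypothetical, and the bit order (x bit first, then y bit) is an assumption chosen to agree with the deck's later example, where the corner (10, 00) maps to the key 1000.

```python
from typing import Tuple


def z_encode(x: int, y: int, bits: int = 2) -> int:
    """Interleave the bits of x and y (x first) into one Z-order key."""
    key = 0
    for i in reversed(range(bits)):
        key = (key << 1) | ((x >> i) & 1)
        key = (key << 1) | ((y >> i) & 1)
    return key


def z_decode(key: int, bits: int = 2) -> Tuple[int, int]:
    """Split a Z-order key back into its (x, y) coordinates."""
    x = y = 0
    for i in reversed(range(bits)):
        x = (x << 1) | ((key >> (2 * i + 1)) & 1)
        y = (y << 1) | ((key >> (2 * i)) & 1)
    return x, y


print(format(z_encode(0b10, 0b00), "04b"))  # 1000
print(format(z_encode(0b11, 0b11), "04b"))  # 1111
print(z_decode(0b1001))                     # (2, 1)
```

Because the keys are plain integers, an ordered key-value store can sort and range-scan them like any 1-D key.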

Problem: False positive scans • MD-query on linearized space • Translate an MD-query to a linearized range query. • Ex. Query from 2 to 9. • Scan the queried linearized range. • Filter out points that fall outside the queried area. • Ex. blue-hatched area (4 to 7) • Requires the boundary information of the original space.
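The "2 to 9" example can be reproduced with a short sketch. The query rectangle used here (x from 1 to 2, y from 0 to 1 on a 4×4 grid) and the x-first bit order are assumptions; they were chosen because they yield exactly the linearized range 2..9 with keys 4 to 7 as false positives. All function names are hypothetical.

```python
from typing import List, Tuple


def z_encode(x: int, y: int, bits: int = 2) -> int:
    """Interleave the bits of x and y (x first) into one Z-order key."""
    key = 0
    for i in reversed(range(bits)):
        key = (key << 1) | ((x >> i) & 1)
        key = (key << 1) | ((y >> i) & 1)
    return key


def z_decode(key: int, bits: int = 2) -> Tuple[int, int]:
    """Split a Z-order key back into its (x, y) coordinates."""
    x = y = 0
    for i in reversed(range(bits)):
        x = (x << 1) | ((key >> (2 * i + 1)) & 1)
        y = (y << 1) | ((key >> (2 * i)) & 1)
    return x, y


def naive_range_query(x_lo: int, x_hi: int, y_lo: int, y_hi: int,
                      bits: int = 2) -> Tuple[List[int], List[int]]:
    """Scan the whole linearized range, then filter by the original box."""
    z_lo = z_encode(x_lo, y_lo, bits)  # key of the left-bottom corner
    z_hi = z_encode(x_hi, y_hi, bits)  # key of the right-top corner
    hits, false_positives = [], []
    for key in range(z_lo, z_hi + 1):
        x, y = z_decode(key, bits)
        if x_lo <= x <= x_hi and y_lo <= y <= y_hi:
            hits.append(key)
        else:
            false_positives.append(key)  # scanned, but outside the box
    return hits, false_positives


hits, wasted = naive_range_query(1, 2, 0, 1)
print(hits)    # [2, 3, 8, 9]
print(wasted)  # [4, 5, 6, 7]
```

Half of the scanned keys are discarded, which is the cost the slide calls false positive scans.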

MD-HBase • Build a multi-dimensional index layer on top of an ordered key-value store • [Figure: MD-HBase (multi-dimensional index) layered over an ordered key-value store, e.g. BigTable, HBase, … (single-dimensional index)]

Space Partition by the K-d tree • Binary Z-ordering space (bitwise interleaving) • Space partitioned by the K-d tree • [Figure: two 4×4 grids with both axes labeled 00, 01, 10, 11] • How do we represent these subspaces?

Key Idea: The longest common prefix naming scheme • Subspaces are represented as the longest common prefix of their keys! • Ex. subspace 1***: replacing * with 0 gives 1000, the left-bottom corner (10, 00); replacing * with 1 gives 1111, the right-top corner (11, 11) • Remarkable Property • Preserves the boundary information of the original space • [Figure: 4×4 grid with axes 00, 01, 10, 11 showing subspaces 000* and 1***]
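The naming scheme can be sketched in a few lines, assuming keys are 4-bit strings with x and y bits interleaved (x first). The helper names `longest_common_prefix` and `corners` are hypothetical.

```python
from typing import List, Tuple

Corner = Tuple[str, str]  # (x bits, y bits)


def longest_common_prefix(keys: List[str]) -> str:
    """Name a subspace by the common prefix of its keys, padded with '*'."""
    width = len(keys[0])
    prefix = ""
    for chars in zip(*keys):
        if len(set(chars)) != 1:
            break
        prefix += chars[0]
    return prefix + "*" * (width - len(prefix))


def corners(name: str) -> Tuple[Corner, Corner]:
    """Recover the left-bottom and right-top corners from a subspace name."""
    low = name.replace("*", "0")   # smallest key in the subspace
    high = name.replace("*", "1")  # largest key in the subspace
    # Even positions hold x bits, odd positions hold y bits.
    return (low[0::2], low[1::2]), (high[0::2], high[1::2])


print(longest_common_prefix(["1000", "1011", "1111"]))  # 1***
print(corners("1***"))  # (('10', '00'), ('11', '11'))
print(corners("000*"))  # (('00', '00'), ('00', '01'))
```

The name alone is enough to rebuild the subspace boundary, which is the property the range query relies on.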

Build an index with the longest common prefix of keys • Index entries: 000*, 001*, 01**, 1*** • Allocate one bucket per subspace • [Figure: 4×4 grid with axes 00, 01, 10, 11 partitioned into the four subspaces]

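Multi-dimensional Range Query • Scan 0010 - 1001 on the index (entries: 000*, 001*, 01**, 10**, 11**) • Subspace pruning: reconstruct the boundary info. and check whether each subspace intersects the queried area • Scan the remaining buckets and filter their points • [Figure: 4×4 grid with axes 00, 01, 10, 11 showing the queried area]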
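The pruning step can be sketched under the same assumptions as the slide's example: 4-bit keys with x and y bits interleaved (x first), the index entries listed on the slide, and a queried area of x from 1 to 2 and y from 0 to 1, which corresponds to the linearized range 0010 to 1001. The function names are hypothetical.

```python
from typing import List, Tuple

INDEX = ["000*", "001*", "01**", "10**", "11**"]  # entries from the slide


def bounds(name: str) -> Tuple[int, int, int, int]:
    """Rebuild (x_lo, x_hi, y_lo, y_hi) of a subspace from its prefix name."""
    low = name.replace("*", "0")
    high = name.replace("*", "1")
    return (int(low[0::2], 2), int(high[0::2], 2),
            int(low[1::2], 2), int(high[1::2], 2))


def plan_query(z_lo: str, z_hi: str,
               box: Tuple[int, int, int, int]) -> Tuple[List[str], List[str]]:
    """Return (buckets to scan, buckets pruned) for a range query."""
    qx_lo, qx_hi, qy_lo, qy_hi = box
    to_scan, pruned = [], []
    for name in INDEX:
        # Step 1: index scan on the linearized range.
        if name.replace("*", "1") < z_lo or name.replace("*", "0") > z_hi:
            continue
        # Step 2: reconstruct the boundary and test for intersection.
        x_lo, x_hi, y_lo, y_hi = bounds(name)
        if x_lo <= qx_hi and qx_lo <= x_hi and y_lo <= qy_hi and qy_lo <= y_hi:
            to_scan.append(name)
        else:
            pruned.append(name)
    return to_scan, pruned


to_scan, pruned = plan_query("0010", "1001", box=(1, 2, 0, 1))
print(to_scan)  # ['001*', '10**']
print(pruned)   # ['01**']
```

Bucket 01** lies inside the linearized range but is never read, which avoids the false positive scan of keys 4 to 7.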

Variations of Storage Layer (mapping buckets to tables) • Table Share Model • Use a single table, maintain bucket boundaries • Most space-efficient • Table per Bucket Model • Allocate a table per bucket • Most flexible mapping • One-to-one, one-to-many, many-to-one • Bucket split is expensive • Copy all points to the new buckets. • Region per Bucket Model • Allocate a region per bucket • Most efficient bucket split • Requires modification of HBase

Experimental Results: Multi-dimensional Range Query • Dataset: 400,000,000 points • Queries: select objects within MD ranges while varying the selectivity • Cluster size: 16 nodes • MD-HBase responds 10~100 times faster than the others, with response time proportional to selectivity.

Experimental Results: Insert • Dataset: spatially skewed data • MD-HBase shows good scalability without significant overhead.

Conclusions • Designed a scalable multi-dimensional data store. • Mapping multiple dimensions to a single dimension • Key Idea: indexing the longest common prefix of keys • Demonstrated scalable insert throughput and excellent query performance. • Range Query: 10-100 times faster than existing technologies. • Insert: 220K inserts/sec on a 16-node cluster without overhead

CCIndex: A Complemental Clustering Index on Distributed Ordered Tables for Multi-dimensional Range Queries. Y. Zou, J. Liu, S. Wang. NPC'10

Introduction • Motivation • Building indexes in DOTs to support multi-dimensional range queries • High performance, low space overhead, high reliability • DOT • Distributed Ordered Table • BigTable, HBase • Observations • DOTs usually keep 3 to 5 replicas • The number of indexes is usually less than 5 • Random reads are significantly slower than scans

Basic idea: Complemental Clustering Index • CCIT: converts slow random reads into fast sequential scans • CCT: for fast data recovery
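The contrast between random reads and a sequential scan can be sketched as follows. This is an illustration only: it assumes that a clustering index table stores the complete row under a key formed from the index column value plus the rowkey, so that a range on the index column becomes one contiguous scan. The sample rows and the names `build_clustering_index` and `range_scan` are hypothetical.

```python
import bisect
from typing import Dict, List, Tuple

Row = Dict[str, object]

# Hypothetical base table, keyed by rowkey.
ROWS: Dict[str, Row] = {
    "r1": {"name": "beer", "price": 7},
    "r2": {"name": "wine", "price": 12},
    "r3": {"name": "beer", "price": 4},
    "r4": {"name": "soda", "price": 6},
}


def build_clustering_index(rows: Dict[str, Row],
                           column: str) -> List[Tuple[Tuple, Row]]:
    """Store every full row again, sorted by (index value, rowkey)."""
    entries = [((row[column], rowkey), row) for rowkey, row in rows.items()]
    return sorted(entries, key=lambda entry: entry[0])


def range_scan(index: List[Tuple[Tuple, Row]], low, high) -> List[Row]:
    """Answer low <= value <= high with one seek and a sequential scan."""
    keys = [key for key, _ in index]
    pos = bisect.bisect_left(keys, (low,))  # single seek to the start
    result = []
    while pos < len(index) and index[pos][0][0] <= high:
        result.append(index[pos][1])  # row is stored inline: no random read
        pos += 1
    return result


price_index = build_clustering_index(ROWS, "price")
print(range_scan(price_index, 5, 10))
# [{'name': 'soda', 'price': 6}, {'name': 'beer', 'price': 7}]
```

A conventional secondary index would return only rowkeys r4 and r1 here and then need one random read per rowkey against the base table.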

Challenges • Performance • Reliability • Space overhead

Performance • Query optimization based on the region-to-server mapping information • Experimental setup • HBase 0.20.1 • 16 nodes • 90 million records

Reliability: Fault tolerance • Get the other index values from the CCTs • Query the CCITs to recover data • Replicate the CCTs

Space overhead • N: the number of index columns • X-axis: ratio of record length to index column length • Y-axis: overhead ratio

Conclusions • Proposed CCIndex to support multi-dimensional range queries in DOTs • Not suitable for more than 5 index columns • Write operations are slower than on the original table

Conclusions • Indexes for non-rowkey attributes in cloud data management systems • Solutions • Local index + global index • Linearization • Secondary index • Key issues • Index reliability • Query result correctness • Index maintenance • …

Future work • Study the architecture of HDFS and HBase in detail • Test the existing index solutions in the cloud • Index framework and index structure

References • M. K. Aguilera, W. Golab, and M. A. Shah. A practical scalable distributed B-tree. PVLDB, 1(1):598–609, 2008. • Y. Zou, J. Liu, and S. Wang. CCIndex: A complemental clustering index on distributed ordered tables for multi-dimensional range queries. In NPC, 2010. • S. Wu and K.-L. Wu. An indexing framework for efficient retrieval on the cloud. IEEE Data Eng. Bull., 32:75–82, 2009. • J. Wang, S. Wu, H. Gao, J. Li, and B. C. Ooi. Indexing multi-dimensional data in a cloud system. In SIGMOD, 2010. • S. Wu, D. Jiang, B. C. Ooi, and K.-L. Wu. Efficient B-tree based indexing for cloud data processing. PVLDB, 3(1):1207–1218, 2010. • X. Zhang, J. Ai, Z. Wang, J. Lu, and X. Meng. An efficient multidimensional index for cloud data management. In CloudDB, pages 17–24, 2009. • S. Nishimura and S. Das. MD-HBase: A scalable multi-dimensional data infrastructure for location aware services. In MDM, 2011.

Thank you
