MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services

MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services S. Nishimura (NEC Service Platforms Labs.), S. Das, D. Agrawal, A. Abbadi (University of California, Santa Barbara) Presenter: Zhuo Liu

Overview • A Motivating Story • Existing Technologies • Our proposal • Evaluation • Conclusion

Motivating Scenario: Mobile Coupon Distribution • [Figure labels: Mobile Coupon Distributor; Current Location; Coupon; Distribution Policy (Area, # of coupons)]

Motivating Scenario: Mobile Coupon Distribution • 125,000,000 subscribers in Japan • System Scalability: large amounts of data, high throughput • Efficient Complex Queries: multi-dimensional query, nearest neighbors query • [Figure labels: Current Location (many users); Coupon; Distribution Policy (Area, # of coupons)]

Existing Technologies at a reasonable price

Ordered Key-Value Stores • Sorted by key • Good at 1-D range queries • ex. BigTable, HBase • But our target is multi-dimensional (latitude, longitude, time)… • [Figure: an index over buckets of sorted key-value pairs, key00/value00 … keynn/valuenn]

Naïve Solution: Linearization • Projects n-D space to 1-D space • Apply a Z-ordering curve… • Simple, but problematic… • [Figure: key-value pairs ordered along the linearized key]

Problem: False positive scans • MD-query on linearized space • Translate an MD-query to a linearized range query (ex. query from 2 to 9) • Scan the queried linearized range • Filter out points outside the queried area (ex. blue-hatched area, 4 to 7) • Requires the boundary information of the original space
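A minimal sketch of the false-positive problem on a 4×4 grid. It assumes 2 bits per dimension and x-first bit interleaving; under that assumption the linearized range 2 to 9 corresponds to the rectangle x in [1, 2], y in [0, 1]. The function names are illustrative only.

```python
def interleave(x, y, bits=2):
    """Z-order value of (x, y): bits of x and y interleaved, x bit first."""
    z = 0
    for i in range(bits - 1, -1, -1):
        z = (z << 2) | (((x >> i) & 1) << 1) | ((y >> i) & 1)
    return z


def false_positive_demo(x_lo=1, x_hi=2, y_lo=0, y_hi=1, bits=2):
    """Scan the linearized range of a rectangle and split hits from false positives."""
    z_lo = interleave(x_lo, y_lo, bits)
    z_hi = interleave(x_hi, y_hi, bits)
    # Map every cell of the grid to its Z-order value.
    cells = {interleave(x, y, bits): (x, y)
             for x in range(1 << bits) for y in range(1 << bits)}
    hits, false_positives = [], []
    for z in range(z_lo, z_hi + 1):          # the 1-D range scan
        x, y = cells[z]
        if x_lo <= x <= x_hi and y_lo <= y <= y_hi:
            hits.append(z)
        else:
            false_positives.append(z)        # scanned, then filtered out
    return z_lo, z_hi, hits, false_positives


print(false_positive_demo())  # (2, 9, [2, 3, 8, 9], [4, 5, 6, 7])
```

Half of the scanned cells (4 to 7) lie outside the queried area, which matches the hatched region described on the slide.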

Our Approach: MD-HBase • Build a multi-dimensional index layer on top of an ordered key-value store • [Figure: MD-HBase's multi-dimensional index layered over the single-dimensional index of an ordered key-value store, ex. BigTable, HBase, …]

Introduce Multi-dimensional Index • Multi-dimensional index (ex. the K-d tree, the Quad tree) • Divides a space into subspaces containing almost the same # of points • Organizes the subspaces as a tree • Efficient subspace pruning → avoids false positive scans

Space Partition by the K-d tree • Binary Z-ordering space: bitwise interleaving (ex. x=00, y=11 → 0101) • Space partitioned by the K-d tree • How do we represent these subspaces? • [Figure: two 4×4 grids with both axes labeled 00, 01, 10, 11]
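The bitwise interleaving can be sketched as follows. This is an illustrative two-dimensional version that places the x bit before the y bit, the order implied by the example x=00, y=11 → 0101; the function names are not from the slides.

```python
def interleave(x: int, y: int, bits: int) -> int:
    """Interleave the bits of x and y (x bit first) into one Z-order key."""
    z = 0
    for i in range(bits - 1, -1, -1):      # most significant bit first
        z = (z << 1) | ((x >> i) & 1)      # next bit of x
        z = (z << 1) | ((y >> i) & 1)      # next bit of y
    return z


def deinterleave(z: int, bits: int):
    """Inverse of interleave: recover (x, y) from a Z-order key."""
    x = y = 0
    for i in range(bits - 1, -1, -1):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


key = interleave(0b00, 0b11, bits=2)
print(format(key, "04b"))  # 0101
```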

Key Idea: The longest common prefix naming scheme • Subspaces are represented as the longest common prefix of their keys! • Remarkable property: preserves the boundary information of the original space • Ex. subspace 1***: *→0 gives 1000 = (10, 00), the left-bottom corner; *→1 gives 1111 = (11, 11), the right-top corner • [Figure: 4×4 grid with subspaces 000* and 1*** marked]
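A short sketch of how the boundary can be reconstructed from a prefix name: replace every * with 0 to get the left-bottom corner and with 1 to get the right-top corner, then de-interleave. It assumes two dimensions and x-first interleaving; `prefix_bounds` is an illustrative name.

```python
def deinterleave(z, bits):
    """Recover (x, y) from a Z-order key with the x bit first."""
    x = y = 0
    for i in range(bits - 1, -1, -1):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


def prefix_bounds(prefix):
    """Return (left-bottom corner, right-top corner) of a subspace such as '1***'."""
    bits = len(prefix) // 2                    # bits per dimension
    low = int(prefix.replace("*", "0"), 2)     # smallest key in the subspace
    high = int(prefix.replace("*", "1"), 2)    # largest key in the subspace
    return deinterleave(low, bits), deinterleave(high, bits)


print(prefix_bounds("1***"))  # ((2, 0), (3, 3)), i.e. (10, 00) and (11, 11)
```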

Build an index with the longest common prefix of keys • Index entries: 000*, 001*, 01**, 1*** • A bucket is allocated per subspace • [Figure: 4×4 grid partitioned into the subspaces 000*, 001*, 01**, 1***, each mapped to a bucket]
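One way such an index could be built is sketched below. The bucket capacity, the split rule (fix the next wildcard bit, which alternates dimensions under bit interleaving), and the class name `PrefixIndex` are all assumptions for illustration; the slides do not specify the split policy.

```python
import bisect


class PrefixIndex:
    """Toy index: maps longest-common-prefix names to buckets of binary keys."""

    def __init__(self, key_bits=4, capacity=2):
        self.capacity = capacity
        self.buckets = {"*" * key_bits: []}    # prefix -> list of key strings

    def find(self, key):
        """Locate the bucket covering key via an ordered lookup on minimum keys."""
        prefixes = sorted(self.buckets, key=lambda p: p.replace("*", "0"))
        mins = [p.replace("*", "0") for p in prefixes]
        return prefixes[bisect.bisect_right(mins, key) - 1]

    def insert(self, key):
        prefix = self.find(key)
        self.buckets[prefix].append(key)
        self._split_if_full(prefix)

    def _split_if_full(self, prefix):
        keys = self.buckets[prefix]
        if len(keys) <= self.capacity or "*" not in prefix:
            return
        pos = prefix.index("*")                # next bit to fix
        del self.buckets[prefix]
        children = []
        for bit in "01":
            child = prefix[:pos] + bit + prefix[pos + 1:]
            self.buckets[child] = [k for k in keys if k[pos] == bit]
            children.append(child)
        for child in children:                 # a child may still be over capacity
            self._split_if_full(child)


index = PrefixIndex()
for k in ["0000", "0001", "0010", "0011", "0100", "0110", "1000", "1111"]:
    index.insert(k)
print(sorted(index.buckets))  # ['000*', '001*', '01**', '1***']
```

With this sample data the resulting entries coincide with the four subspaces shown on the slide.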

Multi-dimensional Range Query • Scan 0010 to 1001 on the index • Index entries: 000*, 001*, 01**, 10**, 11** • Subspace pruning: reconstruct the boundary info. and check whether each subspace intersects the queried area • 001* → scan, 01** → filter, 10** → scan
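The query flow can be sketched as below, under the same assumptions as the earlier slides (two dimensions, 2 bits each, x-first interleaving). The bucket contents are made-up sample points and `range_query` is an illustrative name, not an MD-HBase API.

```python
def interleave(x, y, bits):
    z = 0
    for i in range(bits - 1, -1, -1):
        z = (z << 2) | (((x >> i) & 1) << 1) | ((y >> i) & 1)
    return z


def deinterleave(z, bits):
    x = y = 0
    for i in range(bits - 1, -1, -1):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


def range_query(buckets, x_lo, x_hi, y_lo, y_hi, bits=2):
    """Return (scanned prefixes, pruned prefixes, matching points)."""
    z_lo, z_hi = interleave(x_lo, y_lo, bits), interleave(x_hi, y_hi, bits)
    scanned, pruned, results = [], [], []
    for prefix in sorted(buckets, key=lambda p: p.replace("*", "0")):
        low = int(prefix.replace("*", "0"), 2)
        high = int(prefix.replace("*", "1"), 2)
        if high < z_lo or low > z_hi:
            continue                           # outside the index scan range
        # Reconstruct the subspace boundary from its prefix name.
        (bx_lo, by_lo), (bx_hi, by_hi) = deinterleave(low, bits), deinterleave(high, bits)
        if bx_lo > x_hi or bx_hi < x_lo or by_lo > y_hi or by_hi < y_lo:
            pruned.append(prefix)              # no intersection: skip the bucket
            continue
        scanned.append(prefix)
        results += [(x, y) for x, y in buckets[prefix]
                    if x_lo <= x <= x_hi and y_lo <= y <= y_hi]
    return scanned, pruned, results


# Hypothetical bucket contents, keyed by subspace name.
buckets = {"000*": [(0, 0), (0, 1)], "001*": [(1, 0), (1, 1)],
           "01**": [(0, 2), (1, 3)], "10**": [(2, 0), (2, 1)], "11**": [(3, 3)]}
print(range_query(buckets, 1, 2, 0, 1))
# (['001*', '10**'], ['01**'], [(1, 0), (1, 1), (2, 0), (2, 1)])
```

Only 001* and 10** are scanned; 01** falls inside the linearized range but is pruned because its reconstructed boundary does not intersect the queried area.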

K Nearest Neighbors Query • The best-first algorithm can be applied • The most efficient technique in practical cases • See our paper for details
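A generic best-first search over prefix-named buckets is sketched below. It is not necessarily the exact algorithm from the paper: the priority queue layout, the squared Euclidean distance, and the sample buckets are assumptions made for illustration.

```python
import heapq


def deinterleave(z, bits):
    x = y = 0
    for i in range(bits - 1, -1, -1):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


def min_dist2(q, prefix):
    """Squared distance from q to the nearest cell of the subspace named by prefix."""
    bits = len(prefix) // 2
    x_lo, y_lo = deinterleave(int(prefix.replace("*", "0"), 2), bits)
    x_hi, y_hi = deinterleave(int(prefix.replace("*", "1"), 2), bits)
    dx = max(x_lo - q[0], 0, q[0] - x_hi)
    dy = max(y_lo - q[1], 0, q[1] - y_hi)
    return dx * dx + dy * dy


def knn(buckets, q, k):
    """Best-first search: always expand the closest bucket or report the closest point."""
    BUCKET, POINT = 0, 1
    heap = [(min_dist2(q, p), BUCKET, p) for p in buckets]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < k:
        _, kind, item = heapq.heappop(heap)
        if kind == BUCKET:
            for x, y in buckets[item]:         # expand: queue the bucket's points
                d = (x - q[0]) ** 2 + (y - q[1]) ** 2
                heapq.heappush(heap, (d, POINT, (x, y)))
        else:
            result.append(item)                # no unexplored bucket can be closer
    return result


# Hypothetical bucket contents, keyed by subspace name.
buckets = {"000*": [(0, 0), (0, 1)], "001*": [(1, 0), (1, 1)],
           "01**": [(0, 2), (1, 3)], "10**": [(2, 0), (2, 1)], "11**": [(3, 3)]}
print(knn(buckets, q=(2, 1), k=3))
```

A bucket's minimum distance is a lower bound on the distance of every point inside it, so a point popped from the queue is guaranteed to be at least as close as anything still unexplored.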

Variations of Storage Layer • Table Share Model: uses a single table and maintains bucket boundaries; most space efficient; bucket co-location may cause disk access congestion • Table per Bucket Model: allocates a table per bucket; most flexible mapping (one-to-one, one-to-many, many-to-one); bucket split is expensive (copies all points to the new buckets) • Region per Bucket Model: allocates a region per bucket; most efficient bucket split (asynchronous bucket split); requires modification of HBase

Experimental Results: Multi-dimensional Range Query • Dataset: 400,000,000 points • Queries: select objects within MD ranges while varying the selectivity • Cluster size: 4 nodes • MD-HBase responds 10~100 times faster than the others, with response time proportional to selectivity

Experimental Results: k Nearest Neighbors Query • Dataset: 400,000,000 points • Queries: choose a point and vary the number of neighbors • Cluster size: 4 nodes • MD-HBase responds in 1.5 sec where k ≦ 100, and in 11 sec even if k = 10,000

Experimental Results: Insert • Dataset: spatially skewed data generated by a Zipfian distribution • MD-HBase shows good scalability without significant overhead

Conclusions • Designed a scalable multi-dimensional data store • Scalability & efficient multi-dimensional queries • Key idea: indexing the longest common prefix of keys • Easily extends general ordered key-value stores • Demonstrated scalable insert throughput and excellent query performance • Range query: 10-100 times faster than existing technologies • kNN query: 1.5 s when k ≦ 100 • Insert: 220K inserts/sec on a 16-node cluster without overhead • Thank you. Any questions?
