
MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services


Presentation Transcript


  1. MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services. S. Nishimura (NEC Service Platforms Labs.), S. Das, D. Agrawal, A. El Abbadi (University of California, Santa Barbara). Presenter: Zhuo Liu

  2. Overview • A Motivating Story • Existing Technologies • Our proposal • Evaluation • Conclusion

  3. Motivating Scenario: Mobile Coupon Distribution. A mobile coupon distributor delivers coupons to users based on their current locations, following a distribution policy: the target area and the # of coupons.

  4. Motivating Scenario: Mobile Coupon Distribution (at scale). With 125,000,000 mobile subscribers in Japan, the service needs: • System scalability: large amounts of data, high throughput • Efficient complex queries: multi-dimensional range queries, nearest neighbors queries

  5. Existing Technologies: at a reasonable price?

  6. Ordered Key-Value Stores (ex. BigTable, HBase). Entries are sorted by key and partitioned into buckets: an index maps keys (key00, key11, …) to buckets of key-value pairs (key00→value00, …, keynn→valuenn), so they are good at 1-D range queries. But our target is multi-dimensional: latitude, longitude, and time.
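The 1-D range scan behavior the slide describes can be sketched as follows (a minimal Python model of an ordered key-value store, not HBase's actual API; the class and method names are illustrative):

```python
import bisect

# Minimal sketch of an ordered key-value store: keys are kept sorted,
# so a 1-D range query is a single contiguous scan over the key space.
class OrderedKVStore:
    def __init__(self):
        self._keys = []   # sorted list of keys
        self._vals = {}

    def put(self, key, value):
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def scan(self, start, stop):
        """Return (key, value) pairs with start <= key <= stop."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_right(self._keys, stop)
        return [(k, self._vals[k]) for k in self._keys[lo:hi]]

store = OrderedKVStore()
for k in ["key00", "key01", "key11", "key12"]:
    store.put(k, "value" + k[3:])
print(store.scan("key01", "key11"))  # [('key01', 'value01'), ('key11', 'value11')]
```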

  7. Naïve Solution: Linearization. Project the n-D space onto a 1-D space by applying a Z-ordering curve, then store the linearized keys (key00, key01, …, keynn) in the ordered key-value store. Simple, but problematic…
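The Z-ordering linearization can be sketched as follows (a minimal sketch assuming a 2-D space with a fixed number of bits per dimension):

```python
# Sketch of Z-order linearization: interleave the bits of each coordinate
# so that nearby points tend to receive nearby 1-D keys.
def z_order(x, y, bits=2):
    """Interleave bits of x and y, x contributing the higher bit of each pair."""
    key = 0
    for i in reversed(range(bits)):
        key = (key << 1) | ((x >> i) & 1)
        key = (key << 1) | ((y >> i) & 1)
    return key

# Example from the later slides: x=00, y=11 -> 0101
print(format(z_order(0b00, 0b11), "04b"))  # 0101
```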

  8. Problem: False positive scans. An MD-query on the linearized space is translated into a linearized range query (ex. query from 2 to 9); the queried linearized range is scanned, and points outside the queried area are filtered out (ex. the blue-hatched area, 4 to 7). This filtering requires the boundary information of the original space.
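The false-positive effect can be reproduced concretely (a sketch assuming a 4x4 grid with 2-bit Z-order keys; the numbers happen to match the slide's example, where the scanned range [2, 9] contains the out-of-area keys 4 to 7):

```python
# Z-order encode/decode for a 2-D grid with `bits` bits per dimension.
def z_order(x, y, bits=2):
    key = 0
    for i in reversed(range(bits)):
        key = (key << 1) | ((x >> i) & 1)
        key = (key << 1) | ((y >> i) & 1)
    return key

def z_decode(key, bits=2):
    x = y = 0
    for i in reversed(range(bits)):
        x = (x << 1) | ((key >> (2 * i + 1)) & 1)
        y = (y << 1) | ((key >> (2 * i)) & 1)
    return x, y

qx, qy = (1, 2), (0, 1)          # queried rectangle: x in [1,2], y in [0,1]
scanned = list(range(2, 10))     # naive linearized scan of the range [2, 9]

hits = []
for k in scanned:
    x, y = z_decode(k)
    if qx[0] <= x <= qx[1] and qy[0] <= y <= qy[1]:
        hits.append(k)

print(hits)                                # [2, 3, 8, 9]
print(sorted(set(scanned) - set(hits)))    # false positives: [4, 5, 6, 7]
```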

  9. Our Approach: MD-HBase • Build a multi-dimensional index layer on top of an ordered key-value store (ex. BigTable, HBase, …), mapping the multi-dimensional index onto the store's single-dimensional index.

  10. Introduce a Multi-dimensional Index (ex. the K-d tree, the Quad tree) • Divide the space into subspaces containing almost the same # of points • Organize the subspaces as a tree → efficient subspace pruning, to avoid false positive scans

  11. Space Partitioning by the K-d tree. The space is linearized by binary Z-ordering, i.e. bitwise interleaving of the coordinates (ex. x=00, y=11 → 0101), and then partitioned by the K-d tree. How do we represent these subspaces?

  12. *→0 *→1 1000 1111 (10, 00) (11, 11) Left-bottom corner Right-top corner Key Idea: The longest common prefix naming scheme Subspaces represented as the longest common prefix of keys! • Remarkable Property • Preserve boundary informationof the original space 11 10 01 00 1*** 00 01 10 11 000* 1***

  13. Build an index with the longest common prefix of keys. The index holds the subspace prefixes (000*, 001*, 01**, 1***), and a bucket is allocated per subspace.

  14. Multi-dimensional Range Query. Scan 0010-1001 on the index with subspace pruning: for each subspace, reconstruct the boundary info from its prefix and check whether it intersects the queried area; scan only the intersecting buckets (ex. 001*, 10**) and filter the scanned points.
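The pruning step can be sketched as follows (a sketch assuming 2-bit Z-order keys and the prefix-named buckets shown on the slides; the index contents and query rectangle are illustrative):

```python
# Reconstruct each bucket's bounding rectangle from its prefix and scan it
# only if the rectangle intersects the queried area.
def rect(prefix):
    """(xmin, xmax, ymin, ymax) of the subspace named by a '*'-padded prefix."""
    lo, hi = prefix.replace("*", "0"), prefix.replace("*", "1")
    return (int(lo[0::2], 2), int(hi[0::2], 2),
            int(lo[1::2], 2), int(hi[1::2], 2))

def intersects(a, b):
    return a[0] <= b[1] and b[0] <= a[1] and a[2] <= b[3] and b[2] <= a[3]

index = ["000*", "001*", "01**", "10**", "11**"]   # bucket names (prefixes)
query = (1, 2, 0, 1)   # rectangle x in [1,2], y in [0,1], i.e. scan 0010-1001

to_scan = [p for p in index if intersects(rect(p), query)]
print(to_scan)   # ['001*', '10**']  -- the other buckets are pruned
```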

  15. K Nearest Neighbors Query • The best first algorithm can be applied • The most efficient technique in practical cases • Check the details in our paper
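A minimal sketch of a best-first kNN search of this kind (the bucket layout, names, and data are illustrative assumptions, not the paper's implementation): a priority queue is ordered by distance to the query point, buckets enter with their minimum possible distance and points with their exact distance, so a popped point is guaranteed to be the next nearest neighbor.

```python
import heapq

def min_dist2(rect, q):
    """Squared distance from q to the closest point of rect=(xmin,xmax,ymin,ymax)."""
    dx = max(rect[0] - q[0], 0, q[0] - rect[1])
    dy = max(rect[2] - q[1], 0, q[1] - rect[3])
    return dx * dx + dy * dy

def knn(buckets, q, k):
    # Seed the queue with every bucket, keyed by its minimum distance to q.
    heap = [(min_dist2(r, q), "bucket", name) for name, (r, _) in buckets.items()]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < k:
        d, kind, item = heapq.heappop(heap)
        if kind == "bucket":
            # Expand the bucket into its points, keyed by exact distance.
            for p in buckets[item][1]:
                d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
                heapq.heappush(heap, (d2, "point", p))
        else:
            result.append(item)   # popped points arrive in distance order
    return result

buckets = {
    "00**": ((0, 1, 0, 1), [(0, 0), (1, 1)]),
    "11**": ((2, 3, 2, 3), [(3, 3), (2, 2)]),
}
print(knn(buckets, (0, 0), 2))   # [(0, 0), (1, 1)]
```

Because a bucket's minimum distance lower-bounds every point inside it, distant buckets are never expanded at all, which is the same pruning idea the range query uses.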

  16. Variations of the Storage Layer • Table Share Model • Uses a single table and maintains bucket boundaries • Most space efficient • Bucket co-location may cause disk access congestion • Table per Bucket Model • Allocates a table per bucket • Most flexible mapping: one-to-one, one-to-many, many-to-one • Bucket split is expensive: copies all points to the new buckets • Region per Bucket Model • Allocates a region per bucket • Most efficient bucket split: asynchronous bucket split • Requires modification of HBase

  17. Experimental Results: Multi-dimensional Range Query. Dataset: 400,000,000 points. Queries: select objects within MD ranges, varying selectivity. Cluster size: 4 nodes. MD-HBase responds 10~100 times faster than the others, with response time proportional to selectivity.

  18. Experimental Results: k Nearest Neighbors Query. Dataset: 400,000,000 points. Queries: choose a point and vary the number of neighbors. Cluster size: 4 nodes. MD-HBase responds in 1.5 sec where k ≦ 100, and in 11 sec even when k = 10,000.

  19. Experimental Results: Insert. Dataset: spatially skewed data generated by a Zipfian distribution. MD-HBase shows good scalability without significant overhead.

  20. Conclusions • Designed a scalable multi-dimensional data store • Scalability & efficient multi-dimensional queries • Key idea: indexing by the longest common prefix of keys • Easily extends general ordered key-value stores • Demonstrated scalable insert throughput and excellent query performance • Range query: 10-100 times faster than existing technologies • kNN query: 1.5 s when k ≦ 100 • Insert: 220K inserts/sec on a 16-node cluster without overhead. Thank you. Any questions?
