An Efficient Multi-Dimensional Index for Cloud Data Management. Xiangyu Zhang Jing Ai Zhongyuan Wang Jiaheng Lu Xiaofeng Meng School of Information Renmin University of China. Outline . Motivation Query Answering on the Cloud Related Work
An Efficient Multi-Dimensional Index for Cloud Data Management • Xiangyu Zhang, Jing Ai, Zhongyuan Wang, Jiaheng Lu, Xiaofeng Meng • School of Information, Renmin University of China
Outline • Motivation • Query Answering on the Cloud • Related Work • EMINC: Index the Cloud Efficiently • Node Bounding • Extended Node Bounding • Cost Estimation based Index Update • Evaluation • Conclusion & Future Work
Motivation • Cloud systems have proven well suited to web search applications • Simple structure, mostly key-value pairs • Flexible and efficient for analytic work • However, they are insufficient for complex data management needs • No query language as powerful as SQL • Hard to process complex queries • Lack of efficient index structures
Distributed Cloud base? • BigTable • HBase • How to query on attributes other than the primary key?
Motivation • As part of our Cloud-based DBMS project, we aim to build an efficient index structure on the Cloud.
Query Answering in the Cloud • Fast location of the relevant slave nodes • Efficient lookup on each slave node
Related Work • S. Wu and K.-L. Wu, “An indexing framework for efficient retrieval on the cloud,” IEEE Data Eng. Bull., vol. 32, pp. 75–82, 2009. • H.-C. Yang and D. S. Parker, “Traverse: Simplified indexing on large map-reduce-merge clusters,” in Proceedings of DASFAA 2009, Brisbane, Australia, April 2009, pp. 308–322. • M. K. Aguilera, W. Golab, and M. A. Shah, “A practical scalable distributed B-tree,” in Proceedings of VLDB’08, Auckland, New Zealand, August 2008, pp. 598–609.
Distributed Database • Data slicing in DDBS • Horizontal, vertical, etc. • Slices are defined by conditions • Condition conflicts are checked during query processing • Data distribution on the Cloud is different and could be very complex if expressed as a set of conditions • Condition checking is too expensive
EMINC: Node Bounding • Node cube of a table on a slave node • Value range of the table on this node • Example node cube: (1,1), (6,10)
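A node cube can be read as the axis-aligned bounding box of the records a slave node holds. The following is a minimal sketch under that reading; the function name `node_cube` and the sample records are hypothetical and were chosen only so that the result matches the example cube on the slide.

```python
from typing import Sequence, Tuple

Point = Tuple[float, ...]
Cube = Tuple[Point, Point]  # (lower corner, upper corner)


def node_cube(points: Sequence[Point]) -> Cube:
    """Return the per-dimension value range of the records on one node."""
    if not points:
        raise ValueError("cannot bound an empty set of records")
    # zip(*points) groups the coordinates dimension by dimension.
    lower = tuple(min(coords) for coords in zip(*points))
    upper = tuple(max(coords) for coords in zip(*points))
    return lower, upper


# Hypothetical (x, y) records stored on one slave node.
records = [(1, 4), (3, 1), (6, 10), (2, 7)]
print(node_cube(records))  # ((1, 1), (6, 10))
```

Only the two corner points need to be sent to the master, so the summary stays small no matter how many records the node stores.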
EMINC: Architecture • Each leaf node of the R-Tree corresponds to one node cube • A KD-Tree maintains the local index on each slave node
EMINC: Query Processing • Get the query cube of the query and look it up in the R-Tree to get the relevant data nodes • 1<x<2, 3<y<4 => Query Cube: (1,3), (2,4) • [Figure: a query cube that does not overlap the node cube (No) and one that does (Yes)]
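The routing decision comes down to a cube intersection test. The sketch below illustrates it with a plain linear scan standing in for the R-Tree lookup; the node names, the second node cube, and the function names are hypothetical, and cube boundaries are treated as closed, which is a simplification of the strict inequalities in the query.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, ...]
Cube = Tuple[Point, Point]  # (lower corner, upper corner)


def cubes_intersect(a: Cube, b: Cube) -> bool:
    """True if the two axis-aligned cubes overlap in every dimension."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return all(
        al <= bh and bl <= ah
        for al, ah, bl, bh in zip(a_lo, a_hi, b_lo, b_hi)
    )


def relevant_nodes(query_cube: Cube, node_cubes: Dict[str, Cube]) -> List[str]:
    """Linear scan standing in for the R-Tree lookup on the master."""
    return [
        name for name, cube in node_cubes.items()
        if cubes_intersect(query_cube, cube)
    ]


# Hypothetical node cubes registered on the master node.
node_cubes = {
    "node-a": ((1, 1), (6, 10)),
    "node-b": ((7, 0), (9, 2)),
}
# 1 < x < 2, 3 < y < 4  =>  query cube (1,3), (2,4)
query_cube = ((1, 3), (2, 4))
print(relevant_nodes(query_cube, node_cubes))  # ['node-a']
```

Nodes whose cubes do not intersect the query cube are never contacted, which is where the saving comes from.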
EMINC: Extended Node Bounding • Problem with single bounding • Bad performance for sparse nodes • Many queries will be misdirected to such a node
EMINC: Cube Cutting • Single node cube with low accuracy • Multiple node cubes with high accuracy
EMINC: Cube Methods • Random cutting • Equal cutting • Clustering-based cutting
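The slides do not define the cutting methods in detail. The sketch below shows one plausible reading of equal cutting, as an assumption: split the value range along one axis into equal-width slabs and bound the records in each non-empty slab. All names and the sample data are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Cube = Tuple[Point, Point]  # (lower corner, upper corner)


def bounding_cube(points: Sequence[Point]) -> Cube:
    """Axis-aligned bounding box of a non-empty set of points."""
    lower = tuple(min(c) for c in zip(*points))
    upper = tuple(max(c) for c in zip(*points))
    return lower, upper


def volume(cube: Cube) -> float:
    """Product of the side lengths of a cube."""
    lo, hi = cube
    result = 1.0
    for l, h in zip(lo, hi):
        result *= h - l
    return result


def equal_cut(points: Sequence[Point], parts: int, axis: int = 0) -> List[Cube]:
    """Assumed equal cutting: equal-width slabs along one axis."""
    lo = min(p[axis] for p in points)
    hi = max(p[axis] for p in points)
    width = (hi - lo) / parts
    if width == 0:
        # All points share the same coordinate on this axis.
        return [bounding_cube(points)]
    slabs: List[List[Point]] = [[] for _ in range(parts)]
    for p in points:
        # Clamp so the maximum value lands in the last slab.
        idx = min(int((p[axis] - lo) / width), parts - 1)
        slabs[idx].append(p)
    return [bounding_cube(s) for s in slabs if s]


# Hypothetical sparse node: two small clusters far apart.
sparse = [(1, 1), (2, 2), (9, 9), (10, 10)]
single = bounding_cube(sparse)
pieces = equal_cut(sparse, parts=2)
print(volume(single))                   # 81.0
print(sum(volume(c) for c in pieces))   # 2.0
```

In this toy case the two smaller cubes cover far less empty space than the single cube, so fewer queries that touch no data would be routed to the node.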
EMINC: Index Update Strategy • Index update issues: • Cubes may become invalid after certain data updates and then need reconstruction • When an insertion invalidates a cube • Create a node cube containing the new data • For regular maintenance of the index • Cost estimation based update strategy
EMINC: Cost Estimation Strategy • Cost of index update: • Recalculate cubes on the local node • Transfer them to the master node and maintain the R-Tree • Query performance will be affected • Benefit of index update: • More accurate query routing, less wasted work
EMINC: Two Phase Method • After one update: • Wait for a time period of deltaT • When deltaT expires, check whether an update is needed • Phase one: determine deltaT • Phase two: check for update • Assumption: the number of queries to be processed is proportional to the total size of the node cubes of this node
EMINC: Phase One • After the previous update: • benefit = decrement-of-query/time * deltaT • We enjoy the benefit of the previous update for a time period of deltaT • cost = number-of-queries missed • The number of queries we could have processed if the time spent on the previous update had been used to answer queries
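Requiring benefit > cost and solving for deltaT gives a lower bound on the waiting period. The sketch below works this out under two assumptions that go beyond the slide: decrement-of-query/time is taken to be a constant rate of queries no longer routed to the node, and the number of queries missed is taken to be the query rate multiplied by the duration of the previous update. All names and figures are hypothetical.

```python
def queries_missed(query_rate: float, update_duration: float) -> float:
    """Assumed cost: queries that could have been answered during the update."""
    return query_rate * update_duration


def min_delta_t(saved_per_sec: float, query_rate: float,
                update_duration: float) -> float:
    """Smallest deltaT for which saved_per_sec * deltaT reaches the cost."""
    if saved_per_sec <= 0:
        raise ValueError("the update must save some queries to pay off")
    return queries_missed(query_rate, update_duration) / saved_per_sec


# Hypothetical figures: the update took 3 s at 200 queries/s and now
# spares the node 5 misdirected queries per second.
threshold = min_delta_t(saved_per_sec=5.0, query_rate=200.0, update_duration=3.0)
print(threshold)  # 120.0
```

With these figures the previous update pays for itself only if the index is left untouched for more than 120 seconds.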
EMINC: Phase Two • benefit > cost => deltaT • After deltaT expires, check whether an update is needed. This check involves the following: • Record update frequency • Expected benefit ratio • Performance requirement • We leave this as future work
Evaluation • 6 machines • 1 as master node • 5 slave nodes simulating 100~1000 nodes • Each machine had a 2.33 GHz Intel Core2 Quad CPU, 4 GB of main memory, and a 320 GB disk. • Machines ran Ubuntu 9.04 Server OS.
Conclusion • In this paper we presented a series of approaches for building an efficient multi-dimensional index on a Cloud platform. • We developed the node bounding technique to reduce query processing cost on the Cloud platform. • To maintain the efficiency of the index, we proposed a cost estimation-based approach for index update.
Future Work • Complete the cost estimation model • Take replication of data into consideration • Implement in HBase to further verify performance
Thanks • Please visit our lab for more information: http://idke.ruc.edu.cn/