200 likes | 297 Views
An Improved Indexing Scheme for Range Queries. Yvonne Yao Adviser: Professor Huiping Guo. Database-as-a-Service. Business organizations handle a large amount of data (TB) Cost of managing and maintaining these data onsite is high DAS DBMSs outsourcing
E N D
An Improved Indexing Scheme for Range Queries Yvonne Yao Adviser: Professor Huiping Guo
Database-as-a-Service • Business organizations handle a large amount of data (TB) • Cost of managing and maintaining these data onsite is high • DAS • DBMSs outsourcing • Clients rely on service providers for data management and maintenance • Cost is a lot lowered. • But…
Database-as-a-Service • Security of data is not guaranteed • Service providers are untrusted • Store only an encrypted form of data onto the remote server • Only users with the correct key(s) can have access • How then can we query the encrypted data? • Retrieve and decrypt the entire table, and apply SQL statements on it. Too expensive! • A more realistic approach was discovered
Bucketization • Various approaches to build meta-data: B+-tree based, hash-based, and bucket-based • What is bucketization? • Partition of attribute data into several buckets • Each bucket is identified by an ID • Bucket IDs are stored, along with encrypted data, on the remote server • Client keeps partition information as meta-data • General bucketization approach • Equi-width • Equi-depth
Example 1 • User query: SELECT * FROM grades WHERE gpa < 3.0 • Qserver: SELECT * FROM egrades WHERE gpaID = ‘Bucket_1’ OR gpaID = ‘Bucket_2’ OR gpaID = ‘Bucket_3’ • Size of superset is 29, of which 7 of them are false positives
Query Optimal Bucketization • General idea: minimizing the bucket cost of each bucket • Input: <D = (V, F), M> • V = {v1, v2, v3, …, vn} where v1 < v2 < v3 < … <vn • F = Frequency of each value • M = Number of buckets to fill • Output: a matrix indicating the boundary of each bucket
Query Optimal Bucketization • QOB • Finds optimum solutions to two smaller sub-problems • one contains the leftmost M-1 buckets covering the (n-i) smallest points • Another contains the rightmost single bucket covering the remaining i points V = {v1, v2, v3, v4, v5, v6, …, vn-3, vn-2, vn-1, vn} n-i points go to last i points go to M-1 buckets last bucket
Example 2 • Qserver: SELECT * FROM egrades WHERE gpaID = ‘Bucket_1’ OR gpaID = ‘Bucket_2’ OR gpaID = ‘Bucket_3’ • Same as the general bucketization method • In most cases, QOB can outperform the conventional bucketization strategy, but not always
Deviation Bucketization • Built upon QOB, takes the same parameters • Has two levels of buckets • First level: same as those produced by QOB • Second level: bucketization of deviation values, the difference between the value itself to the average of the bucket • Each first-level-bucket has at most M second level buckets • QOB has at most M buckets, while DB has at most M2 buckets
Deviation Bucketization • DB • Run QOB (D, M) • Construct First-Level-Buckets from boundary matrix • For each First-Level-Bucket • Initialize empty datasets vi’ and fi’ • For each vi in the bucket • vi’ = vi’∪vi’ – avg() • fi’ = fi’ ∪ 1 • Create a new dataset di = (vi’, fi’) • Run QOB(di, M)
Example 3 • Qserver: SELECT * FROM egrades WHERE gpaID = ‘Bucket_1’ OR gpaID = ‘Bucket_2’ OR gpaID = ‘Bucket_3_1’ OR gpaID = ‘Bucket_3_2’ • In this case, no false positives are returned • Generally, false positives will still be returned, just the number of them will be greatly reduced
Experiments • Two datasets • Synthetic dataset: 105 integers from [0, 999] • Real dataset: 103 data points from the Aspect column of the Forest CoverType databasein UCI’s KDD Archive • Two sets of queries • Qsyn • Qreal