An Improved Indexing Scheme for Range Queries

An Improved Indexing Scheme for Range Queries Yvonne Yao Adviser: Professor Huiping Guo

Database-as-a-Service • Business organizations handle a large amount of data (TB) • Cost of managing and maintaining these data onsite is high • DAS • DBMSs outsourcing • Clients rely on service providers for data management and maintenance • Cost is a lot lowered. • But…

Database-as-a-Service • Security of data is not guaranteed • Service providers are untrusted • Store only an encrypted form of data onto the remote server • Only users with the correct key(s) can have access • How then can we query the encrypted data? • Retrieve and decrypt the entire table, and apply SQL statements on it. Too expensive! • A more realistic approach was discovered

Database-as-a-Service

Bucketization • Various approaches to build meta-data: B+-tree based, hash-based, and bucket-based • What is bucketization? • Partition of attribute data into several buckets • Each bucket is identified by an ID • Bucket IDs are stored, along with encrypted data, on the remote server • Client keeps partition information as meta-data • General bucketization approach • Equi-width • Equi-depth

Example 1

Example 1 • User query: SELECT * FROM grades WHERE gpa < 3.0 • Qserver: SELECT * FROM egrades WHERE gpaID = ‘Bucket_1’ OR gpaID = ‘Bucket_2’ OR gpaID = ‘Bucket_3’ • Size of superset is 29, of which 7 of them are false positives

Query Optimal Bucketization • General idea: minimizing the bucket cost of each bucket • Input: <D = (V, F), M> • V = {v1, v2, v3, …, vn} where v1 < v2 < v3 < … <vn • F = Frequency of each value • M = Number of buckets to fill • Output: a matrix indicating the boundary of each bucket

Query Optimal Bucketization • QOB • Finds optimum solutions to two smaller sub-problems • one contains the leftmost M-1 buckets covering the (n-i) smallest points • Another contains the rightmost single bucket covering the remaining i points V = {v1, v2, v3, v4, v5, v6, …, vn-3, vn-2, vn-1, vn} n-i points go to last i points go to M-1 buckets last bucket

Example 2

Example 2 • Qserver: SELECT * FROM egrades WHERE gpaID = ‘Bucket_1’ OR gpaID = ‘Bucket_2’ OR gpaID = ‘Bucket_3’ • Same as the general bucketization method • In most cases, QOB can outperform the conventional bucketization strategy, but not always

Deviation Bucketization • Built upon QOB, takes the same parameters • Has two levels of buckets • First level: same as those produced by QOB • Second level: bucketization of deviation values, the difference between the value itself to the average of the bucket • Each first-level-bucket has at most M second level buckets • QOB has at most M buckets, while DB has at most M2 buckets

Deviation Bucketization • DB • Run QOB (D, M) • Construct First-Level-Buckets from boundary matrix • For each First-Level-Bucket • Initialize empty datasets vi’ and fi’ • For each vi in the bucket • vi’ = vi’∪vi’ – avg() • fi’ = fi’ ∪ 1 • Create a new dataset di = (vi’, fi’) • Run QOB(di, M)

Example 3

Example 3 • Qserver: SELECT * FROM egrades WHERE gpaID = ‘Bucket_1’ OR gpaID = ‘Bucket_2’ OR gpaID = ‘Bucket_3_1’ OR gpaID = ‘Bucket_3_2’ • In this case, no false positives are returned • Generally, false positives will still be returned, just the number of them will be greatly reduced

Experiments • Two datasets • Synthetic dataset: 105 integers from [0, 999] • Real dataset: 103 data points from the Aspect column of the Forest CoverType databasein UCI’s KDD Archive • Two sets of queries • Qsyn • Qreal

Experiment 1

Experiment 2

Thank You

An Improved Indexing Scheme for Range Queries

An Improved Indexing Scheme for Range Queries

Presentation Transcript

Adaptively Parallelizing Distributed Range Queries

How range queries work

A Privacy-Preserving Index for Range Queries

An Improved Efficient Secret Handshakes Scheme with Unlinkability

Data Structures: Range Queries - Space Efficiency

Effectively Indexing Uncertain Moving Objects for Predictive Queries

Dynamic Skylines Considering Range Queries

A Privacy-Preserving Index for Range Queries

Range Queries in Distributed Networks

Data Structures for Orthogonal Range Queries

A Privacy Preserving Index for Range Queries

Improved Selective Acknowledgment Scheme for TCP

An improved LT encoding scheme with extended chain lengths

How range queries work

Spatio-Temporal Queries and Indexing

Indexing and Range Queries in Spatio- T emporal Databases

A Privacy – Preserving Index for Range queries

A Scheme for Visual Feature based Image Indexing

Towards Robust Indexing for Ranked Queries

Path-Hop: efficiently indexing large graphs for reachability queries

Towards Robust Indexing for Ranked Queries

Mercury: Scalable Routing for Range Queries