Boolean + Ranking: Querying a Database by K-Constrained Optimization

Boolean + Ranking: Querying a Database by K-Constrained Optimization Zhen Zhang Joint work with: Seung-won Hwang, Kevin C. Chang, Min Wang, Christian A. Lang, Yuan-chi Chang

Information retrieval Traditional databases Ranking query: Top 5 ranked by gpa Boolean query: dept = CS and year = 2 Many queries naturally combine Boolean and ranking Find top answers + B: dept = CS and year = 2 Qualifying constraint R: gpa Quantifying function Database applications on Web

Motivating scenarios • Data retrieval: • Find houses in certain price range with good price/sqrft ratio • Data analysis: • Find products with highest sale increase in consecutive years Selecth.addressfrom House h Whereh.price ≤ 200k ν h.price ≥ 400k Order byh.size/|h.price-300k|Limit1 Selecth.addressfrom House h, CrimeRate c Whereh.price ≤ 200k ν h.price ≥ 400k andh.zipcode = c.zipcode Order byh.size/|h.price-300k| *c.crimerate-1Limit10 Selectitemidfrom Sales s1, Sales s2 Wheres1.itemid = s2.itemid and s2.year – s1.year = 1 Order by s2.sale – s1.sale Limit10

Boolean + Ranking form a coherent goal function • Boolean B + Ranking R = Goal function G For a tuple t R(t) if B(t) is true 0if B(t) isfalse G(t) = B(t)*R(t) = (ie, lowest score)

G D The nature of Boolean + Ranking is K-constrained optimization query • Optimize goal function G over database D h.size/|h.price-300k| [h.price ≤ 200k ν h.price ≥ 400k ] Goal function G Database D

What is the query evaluation mechanism? Boolean query Ranking query + How to answer?

… … … D D B R Goal function G Current techniques lack of global search mechanism • If evaluated as separate operators • If search by an overall goal function G as a ranking function Boolean query B Ranking query R Boolean query B Ranking query R • Current techniques optimize only condition-by-condition • Current techniques restrict G to be monotonic

G OPT* D D Our thesis:Evaluate query as its nature suggests! Function optimization of G Optimize G over D Discrete state search over D

We view compound index as discrete space

b1 0-250 250-600 b2 b3 0-100 100-250 250-350 350-600 b6 b7 ……… 2 5 1 a1 0-3000 3000-4500 a2 a3 0-1500 1500-3000 3000-4000 4000-6000 a6 a7 ……… 5 1 We view compound index as discrete space Price (k) 600 1 350 2 5 250 4 3 100 6 size 1500 3000 4000 4500

We view compound index as discrete space Price (k) Mij =(ai, bj) b1 0-250 250-600 600 b2 b3 M11 1 350 0-100 100-250 250-350 350-600 b6 2 b7 5 M22 M32 M23 M33 ……… 250 2 5 1 4 3 … 100 … M55 M75 M56 M66 M77 M67 M76 6 size 1500 3000 4000 4500 4 2 5 1 a1 0-3000 3000-4500 a2 a3 0-1500 1500-3000 3000-4000 4000-6000 a6 a7 ……… 5 1

M22 M32 M23 M33 We view compound index as discrete space conceptually, combined space Price (k) Mij =(ai, bj) b1 0-250 250-600 600 b2 b3 M11 1 350 0-100 100-250 250-350 350-600 b6 2 b7 5 ……… 250 2 5 1 4 3 100 … M55 M75 M56 M66 M77 M67 M76 6 size 1500 3000 4000 4500 4 2 5 1 a1 0-3000 3000-4500 a2 a3 0-1500 1500-3000 3000-4000 4000-6000 a6 a7 ……… 5 1

How to perform the search in the space? • What is the search mechanism? • How to conceptually view the index space of D for search • How to guide the search? • How to use function G to focus the search

Challenge 1: What is the search mechanism?

We encode as A* because it’s optimal • What A* is: Finding the shortest path • Why we choose: Completeness and optimality with proper heuristics • Complete: guarantee to find shortest path • Optimal: visit least number of nodes origin 3 5 1 5 2 1 7 9 6 destination

Encoding our problem into shortest path is challenging • How to encode: • a tuple  a path? • score of tuple distance of path?

Therefore, we encode K-constrained opt. as: • How to encode a tuple to a path? • Adding a virtual target t* only reachable through tuples • How to encode maximal tuple with minimal path? • Quality of path depends solely on the tuple it passes by • For tuple state t D(t, t*) = - G(t) • For two states r, u D(r, u) = 0 M11 0 0 M22 M32 M23 M33 0 0 … M66 M67 M76 M75 M56 M77 M55 0 0 4 2 5 1 - G(1) - G(4) t*

Challenge 2: How to guide the search?

We use function opt. to sketch the landscape of G • Function optimization measures quality of states • Function optimization enables: • 1. How to define heuristics? • 2. How to configure space? • 3. Where to start the search?

1. Define admissible heuristics: Measure tightest upper bound • To guarantee completeness • A* requires admissible heuristics, ie, estimate optimistically • To ensure admissible heuristics • Function optimization gives tightest upper bound • Analytical approaches • Numeric analysis package H(region) = OPTMAX(G, region) ie, maximal value of G in the region

M22 M32 M23 M33 2. Configure descending space: disconnect uphills • To guarantee optimality • A* requires descending heuristics • To ensure descending heuristics • Remove uphill links M11 … M55 M75 M56 M66 M77 M67 M76 4 2 5 1

M22 M32 M23 M33 Find right start point: Start from local optima • To guarantee correctness • Every tuple state must be reachable from start states • Taking only downhills requires start with high points • To ensure reachability • Initial states should contain all local optima M11 … M55 M75 M56 M66 M77 M67 M76 4 2 5 1

Putting together: Executing A* on the configured space top-down M11 M22 M32 M23 M33 … M67 M76 M57 M55 M75 M56 M66 M77 4 2 5 1 • Search is implemented as priority queue driven traversal

Putting together: Executing A* on the configured space top-down M11 • Bottom-up approach is always better than top-down M22 M32 M23 M33 … M57 M67 M76 M55 M75 M56 M66 M77 4 2 5 1 bottom-up M11 M22 M32 M23 M33 … M57 M67 M76 M55 M75 M56 M66 M77 2 5 1 4

Experiments • Comparison vs. • Boolean then ranking • Ranking then boolean • Metrics: node accessed = Nl + Nt • Settings: • Benchmark queries over real dataset • Controlled queries over synthetic dataset

BR_clustered BR_unclustered OPT* Benchmark queries • Datasets: • 19,706 real estate listing crawled online • Queries • Q1: size * bedrms/| price-450k| : [40k<=price<=50k] • Q2: size * ebedrms / |price-350k| : [price<400k^size>4000] • Q3: size/price : [bedrms=3 ν bedrms=4] Q1 Q2 Q3

Controlled queries • Datasets • Three randomly generated datasets of 100k points • Uniform, gaussian, logvariatenormal • Queries • Linear average queries: (eg, 0.4*a + 0.6*b) • Nearest neighbor queries: (eg, (x-3)^2 + (y-4)^2) • Join queries: (0.4*R.a + 0.6*S.b: R.c=R.d)

Conclusion • Problem • Study K-constrained optimization queries as boolean + ranking • Abstraction • Encode K-constrained optimization into shortest path problem • Framework • Develop OPT* to process K-constrained optimization

Thank you! Questions?

How to implement function optimization? • How do we compare with RankSQL? • If bottom-up is always better, why consider top-down • Computing upper bound for each region is costly • Random vs. sequential I/O • Assuming indices on every attribute? • Materialize state space for every query? • Exponential number of states when attribute grows • Not every attribute has index on it • Selective choose the right index (attribute) to use • We do perform experiment to study how the system scale with #attr • Your algorithm is not optimal because you change the space

Boolean + Ranking: Querying a Database by K-Constrained Optimization

Boolean + Ranking: Querying a Database by K-Constrained Optimization

Presentation Transcript

ESI 4313 Operations Research 2

Multi-objective Optimization Using Particle Swarm Optimization

Chapter 4

Particle Swarm Optimization (PSO) Algorithm and Its Application in Engineering Design Optimization

A Survey of Differentially Constrained Planning

Basic Optimization Training

Combinational Circuits

Querying Sensor Networks

GPU Optimization using CUDA Framework

Optimization Order Entry Dispatch

C66x Code Optimization

Constrained Conditional Models for Natural Language Processing

Querying and Mining Data Streams: You Only Get One Look A Tutorial

Chapter 4 – Decisions

Code Optimization

PathCase A Web-Based Exploratory Querying and Visualization Tool for Biological Pathways

Querying Encrypted Data

Querying the Web of Data with SPARQL and XSPARQL

Chapter 7 Functions of Several Variables

Chapter 2-OPTIMIZATION

Introduction Boolean Equations Boolean Algebra From Logic to Gates Multilevel Combinational Logic

Querying Encrypted Data