300 likes | 382 Views
Boolean + Ranking: Querying a Database by K-Constrained Optimization. Zhen Zhang Joint work with: Seung-won Hwang, Kevin C. Chang, Min Wang, Christian A. Lang, Yuan-chi Chang. Information retrieval. Traditional databases. Ranking query: Top 5 ranked by gpa. Boolean query:
E N D
Boolean + Ranking: Querying a Database by K-Constrained Optimization Zhen Zhang Joint work with: Seung-won Hwang, Kevin C. Chang, Min Wang, Christian A. Lang, Yuan-chi Chang
Information retrieval Traditional databases Ranking query: Top 5 ranked by gpa Boolean query: dept = CS and year = 2 Many queries naturally combine Boolean and ranking Find top answers + B: dept = CS and year = 2 Qualifying constraint R: gpa Quantifying function Database applications on Web
Motivating scenarios • Data retrieval: • Find houses in certain price range with good price/sqrft ratio • Data analysis: • Find products with highest sale increase in consecutive years Selecth.addressfrom House h Whereh.price ≤ 200k ν h.price ≥ 400k Order byh.size/|h.price-300k|Limit1 Selecth.addressfrom House h, CrimeRate c Whereh.price ≤ 200k ν h.price ≥ 400k andh.zipcode = c.zipcode Order byh.size/|h.price-300k| *c.crimerate-1Limit10 Selectitemidfrom Sales s1, Sales s2 Wheres1.itemid = s2.itemid and s2.year – s1.year = 1 Order by s2.sale – s1.sale Limit10
Boolean + Ranking form a coherent goal function • Boolean B + Ranking R = Goal function G For a tuple t R(t) if B(t) is true 0if B(t) isfalse G(t) = B(t)*R(t) = (ie, lowest score)
G D The nature of Boolean + Ranking is K-constrained optimization query • Optimize goal function G over database D h.size/|h.price-300k| [h.price ≤ 200k ν h.price ≥ 400k ] Goal function G Database D
What is the query evaluation mechanism? Boolean query Ranking query + How to answer?
… … … D D B R Goal function G Current techniques lack of global search mechanism • If evaluated as separate operators • If search by an overall goal function G as a ranking function Boolean query B Ranking query R Boolean query B Ranking query R • Current techniques optimize only condition-by-condition • Current techniques restrict G to be monotonic
G OPT* D D Our thesis:Evaluate query as its nature suggests! Function optimization of G Optimize G over D Discrete state search over D
b1 0-250 250-600 b2 b3 0-100 100-250 250-350 350-600 b6 b7 ……… 2 5 1 a1 0-3000 3000-4500 a2 a3 0-1500 1500-3000 3000-4000 4000-6000 a6 a7 ……… 5 1 We view compound index as discrete space Price (k) 600 1 350 2 5 250 4 3 100 6 size 1500 3000 4000 4500
We view compound index as discrete space Price (k) Mij =(ai, bj) b1 0-250 250-600 600 b2 b3 M11 1 350 0-100 100-250 250-350 350-600 b6 2 b7 5 M22 M32 M23 M33 ……… 250 2 5 1 4 3 … 100 … M55 M75 M56 M66 M77 M67 M76 6 size 1500 3000 4000 4500 4 2 5 1 a1 0-3000 3000-4500 a2 a3 0-1500 1500-3000 3000-4000 4000-6000 a6 a7 ……… 5 1
M22 M32 M23 M33 We view compound index as discrete space conceptually, combined space Price (k) Mij =(ai, bj) b1 0-250 250-600 600 b2 b3 M11 1 350 0-100 100-250 250-350 350-600 b6 2 b7 5 ……… 250 2 5 1 4 3 100 … M55 M75 M56 M66 M77 M67 M76 6 size 1500 3000 4000 4500 4 2 5 1 a1 0-3000 3000-4500 a2 a3 0-1500 1500-3000 3000-4000 4000-6000 a6 a7 ……… 5 1
How to perform the search in the space? • What is the search mechanism? • How to conceptually view the index space of D for search • How to guide the search? • How to use function G to focus the search
We encode as A* because it’s optimal • What A* is: Finding the shortest path • Why we choose: Completeness and optimality with proper heuristics • Complete: guarantee to find shortest path • Optimal: visit least number of nodes origin 3 5 1 5 2 1 7 9 6 destination
Encoding our problem into shortest path is challenging • How to encode: • a tuple a path? • score of tuple distance of path?
Therefore, we encode K-constrained opt. as: • How to encode a tuple to a path? • Adding a virtual target t* only reachable through tuples • How to encode maximal tuple with minimal path? • Quality of path depends solely on the tuple it passes by • For tuple state t D(t, t*) = - G(t) • For two states r, u D(r, u) = 0 M11 0 0 M22 M32 M23 M33 0 0 … M66 M67 M76 M75 M56 M77 M55 0 0 4 2 5 1 - G(1) - G(4) t*
We use function opt. to sketch the landscape of G • Function optimization measures quality of states • Function optimization enables: • 1. How to define heuristics? • 2. How to configure space? • 3. Where to start the search?
1. Define admissible heuristics: Measure tightest upper bound • To guarantee completeness • A* requires admissible heuristics, ie, estimate optimistically • To ensure admissible heuristics • Function optimization gives tightest upper bound • Analytical approaches • Numeric analysis package H(region) = OPTMAX(G, region) ie, maximal value of G in the region
M22 M32 M23 M33 2. Configure descending space: disconnect uphills • To guarantee optimality • A* requires descending heuristics • To ensure descending heuristics • Remove uphill links M11 … M55 M75 M56 M66 M77 M67 M76 4 2 5 1
M22 M32 M23 M33 Find right start point: Start from local optima • To guarantee correctness • Every tuple state must be reachable from start states • Taking only downhills requires start with high points • To ensure reachability • Initial states should contain all local optima M11 … M55 M75 M56 M66 M77 M67 M76 4 2 5 1
Putting together: Executing A* on the configured space top-down M11 M22 M32 M23 M33 … M67 M76 M57 M55 M75 M56 M66 M77 4 2 5 1 • Search is implemented as priority queue driven traversal
Putting together: Executing A* on the configured space top-down M11 • Bottom-up approach is always better than top-down M22 M32 M23 M33 … M57 M67 M76 M55 M75 M56 M66 M77 4 2 5 1 bottom-up M11 M22 M32 M23 M33 … M57 M67 M76 M55 M75 M56 M66 M77 2 5 1 4
Experiments • Comparison vs. • Boolean then ranking • Ranking then boolean • Metrics: node accessed = Nl + Nt • Settings: • Benchmark queries over real dataset • Controlled queries over synthetic dataset
BR_clustered BR_unclustered OPT* Benchmark queries • Datasets: • 19,706 real estate listing crawled online • Queries • Q1: size * bedrms/| price-450k| : [40k<=price<=50k] • Q2: size * ebedrms / |price-350k| : [price<400k^size>4000] • Q3: size/price : [bedrms=3 ν bedrms=4] Q1 Q2 Q3
Controlled queries • Datasets • Three randomly generated datasets of 100k points • Uniform, gaussian, logvariatenormal • Queries • Linear average queries: (eg, 0.4*a + 0.6*b) • Nearest neighbor queries: (eg, (x-3)^2 + (y-4)^2) • Join queries: (0.4*R.a + 0.6*S.b: R.c=R.d)
Conclusion • Problem • Study K-constrained optimization queries as boolean + ranking • Abstraction • Encode K-constrained optimization into shortest path problem • Framework • Develop OPT* to process K-constrained optimization
Thank you! Questions?
How to implement function optimization? • How do we compare with RankSQL? • If bottom-up is always better, why consider top-down • Computing upper bound for each region is costly • Random vs. sequential I/O • Assuming indices on every attribute? • Materialize state space for every query? • Exponential number of states when attribute grows • Not every attribute has index on it • Selective choose the right index (attribute) to use • We do perform experiment to study how the system scale with #attr • Your algorithm is not optimal because you change the space