This presentation describes a technique for extracting all maximal empty rectangles from large data sets. The data is represented as a 0-1 matrix, and the method finds every maximal 0-rectangle in it. The slides compare the approach with earlier work in computational geometry and machine learning in terms of time complexity, space use, practical implementation, and scalability. They then apply the discovered empty rectangles to query optimization and report experimental results in which query rewrites improved performance.
Mining for Empty Rectangles in Large Data Sets
Jeff Edmonds, Jarek Gryz, Dongming Liang, Renee Miller
Matrix representation of π_{A,B}(R ⋈ S)
Tuples (A, B): (3, 6), (1, 7), (3, 8)
       A=1  A=2  A=3
B=6     0    0    1
B=7     1    0    0
B=8     0    0    1
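The matrix representation can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `zero_one_matrix` and its parameters are hypothetical, and the attribute domains are passed in explicitly because the slide lists A = 1, 2, 3 even though the value 2 never occurs in a tuple.

```python
def zero_one_matrix(pairs, a_values, b_values):
    """Return rows indexed by B and columns indexed by A.

    A cell is 1 if the (A, B) pair occurs in the projected join, else 0.
    """
    present = set(pairs)
    return [[1 if (a, b) in present else 0 for a in a_values]
            for b in b_values]


# Tuples and domains taken from the slide.
pairs = [(3, 6), (1, 7), (3, 8)]
matrix = zero_one_matrix(pairs, a_values=[1, 2, 3], b_values=[6, 7, 8])

for b, row in zip([6, 7, 8], matrix):
    print(b, row)
```

Keeping the domains as sorted lists matters later, because an empty rectangle is only meaningful over contiguous ranges of attribute values.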
Find All Maximal 0-Rectangles in π_{A,B}(R ⋈ S)
[Figure: the same tuples and 0-1 matrix, with 0-rectangles marked]
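To make the problem statement concrete, here is a brute-force reference that follows the definition directly: a 0-rectangle is maximal if it cannot be extended by one row or column in any direction while staying all-zero. It is a sketch for checking small inputs only, far slower than the algorithm described in these slides, and the function names are hypothetical. Rectangles are reported as 0-indexed `(x1, y1, x2, y2)` with inclusive corners.

```python
def is_empty(m, x1, y1, x2, y2):
    """True if every cell in the inclusive rectangle is 0."""
    return all(m[y][x] == 0
               for y in range(y1, y2 + 1)
               for x in range(x1, x2 + 1))


def maximal_zero_rectangles_bruteforce(m):
    """Enumerate all maximal 0-rectangles by testing every rectangle."""
    h, w = len(m), len(m[0])
    found = []
    for y1 in range(h):
        for y2 in range(y1, h):
            for x1 in range(w):
                for x2 in range(x1, w):
                    if not is_empty(m, x1, y1, x2, y2):
                        continue
                    # Can the rectangle grow by one line on any side?
                    grows = (
                        (x1 > 0 and is_empty(m, x1 - 1, y1, x1 - 1, y2))
                        or (x2 < w - 1 and is_empty(m, x2 + 1, y1, x2 + 1, y2))
                        or (y1 > 0 and is_empty(m, x1, y1 - 1, x2, y1 - 1))
                        or (y2 < h - 1 and is_empty(m, x1, y2 + 1, x2, y2 + 1))
                    )
                    if not grows:
                        found.append((x1, y1, x2, y2))
    return found


matrix = [[0, 0, 1],
          [1, 0, 0],
          [0, 0, 1]]
rects = maximal_zero_rectangles_bruteforce(matrix)
print(sorted(rects))
```

For the slide's 3 x 3 matrix this yields four maximal 0-rectangles.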
Example: π_{A,B}(R ⋈ S)
Relation attributes: Car, Year, …
             95   96   97
BMW Z3        0    0    1
Honda L2      1    0    0
Toyota 6A     0    0    1
First BMW Z3 series cars were made in 1997.
Relation to Previous Work
Previous work: [Namaad, Hsu, Lee]; [Lui, Ku, Hsu] & [Orlowski]

Problem: find all maximal empty rectangles
• Previous work: between points in the real plane
• Our work: within a 0-1 matrix
Purpose:
• Previous work: machine learning, computational geometry
• Our work: query optimization
# of maximal 0-rectangles:
• Previous work: O( (# 1's)^2 )
• Our work: O( # 0's )
Time:
• Previous work: O( # 1's log(# 1's) + # rectangles ) = O(|X||Y|)
• Our work: O( # 0's ) = O(|X||Y|)
Space:
• Previous work: O(|X||Y|)
• Our work: O(min(|X|, |Y|)); only two rows of the matrix are kept in memory

Practical?
Practical implementation:
• Previous work: intensive random memory access
• Our work: requires a single scan of the sorted data. IBM paid us $25,000 to patent it!
Scalable:
• Previous work: scales badly
• Our work: scales well with respect to
  • # of tuples in the join
  • # of maximal rectangles
  • # of values |X| & |Y|
Structure of Algorithm
• loop y = 1..|Y|
    loop x = 1..|X|
      • Construct staircase(x,y)
      • Output all maximal 0-rectangles with <x,y> as bottom-right corner
• Timing: O(1) amortized time per <x,y>
• Query Optimization & Experimental Results
[Figure: 0-1 matrix with axes X and Y and the current cell <x,y> marked]
Staircase(x,y)
A stack of steps (x_1,y_1), ..., (x_r,y_r).
[Figure: 0-1 matrix with axes X and Y showing the staircase of 0s ending at <x,y>, with steps (x_1,y_1) through (x_5,y_5) numbered]
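The slide does not spell out the definition, so the following sketch assumes one: staircase(x,y) is the set of cells <x',y'> for which the rectangle from <x',y'> (top-left) to <x,y> (bottom-right) contains only 0s, and its steps are the top-left corners where the outline changes height. The function name `staircase_steps` is hypothetical, and this version recomputes everything from scratch rather than incrementally.

```python
def staircase_steps(m, x, y):
    """Return the steps of staircase(x, y) as (x_i, y_i) top-left corners.

    Steps are listed from the tallest, narrowest one to the shortest,
    widest one. Indices are 0-based; m[y][x] addresses row y, column x.
    """
    steps = []
    limit = None  # topmost row shared by all columns scanned so far
    col = x
    while col >= 0 and m[y][col] == 0:
        # Topmost row of the run of 0s ending at row y in this column.
        top = y
        while top > 0 and m[top - 1][col] == 0:
            top -= 1
        new_limit = top if limit is None else max(limit, top)
        if limit is not None and new_limit != limit:
            # The outline drops here: the previous column closes a step.
            steps.append((col + 1, limit))
        limit = new_limit
        col -= 1
    if limit is not None:
        steps.append((col + 1, limit))
    return steps


matrix = [[0, 0, 1],
          [1, 0, 0],
          [0, 0, 1]]
print(staircase_steps(matrix, 1, 2))
```

Each step is the top-left corner of one candidate rectangle whose bottom-right corner is <x,y>.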
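Constructing Maximal Rectangles
Each step of staircase(x,y) gives a candidate rectangle with <x,y> as bottom-right corner. Candidates fall into three groups: Too Narrow, Maximal, Too Short.
[Figure: staircase at <x,y> with candidate rectangles labeled Too Narrow, Maximal, and Too Short]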
Constructing staircase(x,y) from staircase(x-1,y)
Case 1
[Figure: 0-1 matrix showing staircase(x-1,y) and the new cell <x,y>]
Case 2
[Figure: 0-1 matrix showing staircase(x-1,y) and the new cell <x,y>]
Constructing staircase(x,y) from staircase(x-1,y)
Steps of staircase(x-1,y) are marked Delete or Keep. The steps (x_1,y_1), ..., (x_r,y_r) are labeled Too Narrow, Maximal, or Too Short.
[Figure: 0-1 matrix with axes X and Y showing staircase(x-1,y), the new cell <x,y>, and the deleted and kept steps]
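The incremental construction can be sketched with a stack of steps per row. This is a variant under stated assumptions, not the authors' exact procedure: each step is stored as `(start_column, height)`, a step is deleted when the next column is shorter than it, and the rectangle is reported at deletion time because that is exactly when it can no longer be extended to the right. A second check against the next row discards rectangles that could still be extended downward. How these two checks map onto the slide's "Too Short" and "Too Narrow" labels is an assumption. For simplicity the whole matrix is held in memory here.

```python
def maximal_zero_rectangles(m):
    """Return all maximal 0-rectangles as 0-indexed (x1, y1, x2, y2)."""
    rows, cols = len(m), len(m[0])
    height = [0] * cols  # run of 0s ending at the current row, per column
    out = []
    for y in range(rows):
        for x in range(cols):
            height[x] = height[x] + 1 if m[y][x] == 0 else 0

        # Length of the run of 0s in the next row ending at each column.
        run_below = [0] * cols
        if y + 1 < rows:
            run = 0
            for x in range(cols):
                run = run + 1 if m[y + 1][x] == 0 else 0
                run_below[x] = run

        stack = []  # steps (start_column, height), heights increasing
        for x in range(cols + 1):
            cur = height[x] if x < cols else 0  # sentinel column of height 0
            start = x
            while stack and stack[-1][1] > cur:
                # Delete the step: it cannot extend right into column x.
                start, h = stack.pop()
                width = x - start
                if run_below[x - 1] < width:
                    # It cannot extend down either, so it is maximal.
                    out.append((start, y - h + 1, x - 1, y))
            if cur > 0 and (not stack or stack[-1][1] < cur):
                stack.append((start, cur))  # keep or create a step
    return out


matrix = [[0, 0, 1],
          [1, 0, 0],
          [0, 0, 1]]
found = maximal_zero_rectangles(matrix)
print(sorted(found))
```

Every step is pushed once and popped once, which is the source of the constant amortized cost per cell.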
Constructing x*(x,y) & y*(x,y)
x*(x,y) & y*(x,y) are constructed from x*(x-1,y) & y*(x,y-1) (saved).
[Figure: 0-1 matrix showing x*(x-1,y), y*(x,y-1), x*(x,y), and y*(x,y) around the cell <x,y>, which is labeled Query; steps (x_1,y_1), ..., (x_r,y_r) are marked]
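A sketch of this recurrence, assuming the following reading of the notation: x*(x,y) is the leftmost column of the run of 0s ending at <x,y> in row y, and y*(x,y) is the topmost row of the run of 0s ending at <x,y> in column x, with both undefined (here `None`) when the cell is 1. The generator name `x_star_y_star` is hypothetical. Only one saved row of y* values is kept between rows, in the spirit of the slides' space claim.

```python
def x_star_y_star(rows):
    """Stream (x, y, x_star, y_star) for every cell, row by row.

    `rows` is any iterable of equal-length 0-1 rows, so the matrix never
    has to be fully in memory.
    """
    saved_y_star = None  # y*(x, y-1) for every column x
    for y, row in enumerate(rows):
        if saved_y_star is None:
            saved_y_star = [None] * len(row)
        x_star = None  # x*(x-1, y) while scanning the row
        for x, cell in enumerate(row):
            if cell == 1:
                x_star = None
                saved_y_star[x] = None
            else:
                if x_star is None:
                    x_star = x  # a new horizontal run starts here
                if saved_y_star[x] is None:
                    saved_y_star[x] = y  # a new vertical run starts here
            yield x, y, x_star, saved_y_star[x]


matrix = [[0, 0, 1],
          [1, 0, 0],
          [0, 0, 1]]
stars = {(x, y): (xs, ys) for x, y, xs, ys in x_star_y_star(matrix)}
print(stars[(1, 2)], stars[(2, 1)], stars[(2, 0)])
```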
Timing
The only work that is not constant time is deleting steps.
Amortized # of steps deleted (per <x,y>) = # of steps created (per <x,y>) ≤ 1
[Figure: 0-1 matrix with axes X and Y showing the staircase at <x,y>, the deleted steps, and the Too Narrow, Maximal, and Too Short labels]
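The amortized argument can be checked empirically with a small counter. This sketch assumes a stack-of-heights formulation of the staircase (a hypothetical simplification that tracks only step heights) and counts how many steps are created and deleted over the whole matrix.

```python
def count_stack_work(m):
    """Return (steps_created, steps_deleted) over a full scan of m."""
    rows, cols = len(m), len(m[0])
    height = [0] * cols
    pushes = pops = 0
    for y in range(rows):
        stack = []  # step heights, strictly increasing
        for x in range(cols + 1):
            if x < cols:
                height[x] = height[x] + 1 if m[y][x] == 0 else 0
                cur = height[x]
            else:
                cur = 0  # sentinel column flushes the stack
            while stack and stack[-1] > cur:
                stack.pop()
                pops += 1
            if cur > 0 and (not stack or stack[-1] < cur):
                stack.append(cur)
                pushes += 1
    return pushes, pops


matrix = [[0, 0, 1],
          [1, 0, 0],
          [0, 0, 1]]
pushes, pops = count_stack_work(matrix)
zeros = sum(row.count(0) for row in matrix)
print(pushes, pops, zeros)
```

At most one step is created per 0 cell, and a step can only be deleted after it was created, so the total number of deletions is bounded by the number of 0s.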
Number of Maximal Rectangles
# of maximal 0-rectangles ≤ O( (# 1's)^2 ) [Namaad, Hsu, Lee]
# of maximal 0-rectangles ≤ running time of alg = O( # 0's )
How many empty rectangles are there?
Tests were done on 4 pairs of attributes with numerical domains that appear in typical joins in a real-world workload of a health insurance company.
Query rewrite: simple case
Original:
  select … from R, S,... where R.C=S.C and 60<R.A<80 and 20<S.B<80 and...
Rewritten:
  select … from R, S,... where R.C=S.C and 60<R.A<80 and 20<S.B<60 and...
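The simple case can be sketched as trimming one range predicate. The empty rectangle used below (no joined tuples with 60 ≤ A ≤ 80 and 60 ≤ B ≤ 80) is a hypothetical one chosen to be consistent with the rewrite on the slide; the slide does not state it. The function name `trim_b_range` and the tuple layouts are also assumptions. Query ranges use strict bounds, as in the SQL above, and the empty rectangle uses inclusive bounds.

```python
def trim_b_range(query, empty):
    """Shrink the B range of a query using one known empty rectangle.

    query: (a_lo, a_hi, b_lo, b_hi) meaning a_lo < A < a_hi, b_lo < B < b_hi
    empty: (ea_lo, ea_hi, eb_lo, eb_hi), inclusive, known to hold no tuples
    Returns the trimmed query, or None if the query cannot return rows.
    """
    a_lo, a_hi, b_lo, b_hi = query
    ea_lo, ea_hi, eb_lo, eb_hi = empty
    # The empty rectangle must span the query's whole A range.
    if not (ea_lo <= a_lo and a_hi <= ea_hi):
        return query
    if eb_lo <= b_lo and b_hi <= eb_hi:
        return None  # the whole query box is empty
    if eb_lo <= b_lo < eb_hi:
        b_lo = eb_hi  # lower end of the B range is empty
    elif eb_lo < b_hi <= eb_hi:
        b_hi = eb_lo  # upper end of the B range is empty
    return (a_lo, a_hi, b_lo, b_hi)


query = (60, 80, 20, 80)   # 60 < R.A < 80 and 20 < S.B < 80
empty = (60, 80, 60, 80)   # hypothetical empty rectangle
rewritten = trim_b_range(query, empty)
print(rewritten)
```

Only the B range is trimmed here; the symmetric trim of the A range is omitted to keep the sketch short.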
Query rewrite: complex case
Original:
  select … from R, S,... where R.C=S.C and 60<R.A<80 and 20<S.B<80 and...
Rewritten:
  select … from R, S,... where R.C=S.C and (… and …) or (… and …) or (… and …) or ...
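In the complex case the empty rectangle lies inside the query box, so the remaining region becomes a disjunction of smaller boxes. The sketch below shows one possible decomposition into at most four disjoint boxes; the slide elides the actual predicates, so the empty rectangle, the function name `subtract_box`, and the use of integer half-open ranges `[lo, hi)` are all assumptions made for illustration.

```python
def subtract_box(query, empty):
    """Split query minus empty into disjoint boxes.

    Boxes are (a_lo, a_hi, b_lo, b_hi) over integers, half-open:
    a_lo <= A < a_hi and b_lo <= B < b_hi.
    """
    qa_lo, qa_hi, qb_lo, qb_hi = query
    # Clip the empty rectangle to the query box.
    ea_lo, ea_hi = max(qa_lo, empty[0]), min(qa_hi, empty[1])
    eb_lo, eb_hi = max(qb_lo, empty[2]), min(qb_hi, empty[3])
    if ea_lo >= ea_hi or eb_lo >= eb_hi:
        return [query]  # no overlap, nothing to remove
    pieces = [
        (qa_lo, ea_lo, qb_lo, qb_hi),  # strip left of the empty rectangle
        (ea_hi, qa_hi, qb_lo, qb_hi),  # strip right of it
        (ea_lo, ea_hi, qb_lo, eb_lo),  # strip below it
        (ea_lo, ea_hi, eb_hi, qb_hi),  # strip above it
    ]
    return [p for p in pieces if p[0] < p[1] and p[2] < p[3]]


def cells(box):
    """All integer points of a half-open box, for checking only."""
    a_lo, a_hi, b_lo, b_hi = box
    return {(a, b) for a in range(a_lo, a_hi) for b in range(b_lo, b_hi)}


query = (61, 80, 21, 80)   # integer form of 60 < A < 80 and 20 < B < 80
empty = (65, 70, 40, 50)   # hypothetical empty rectangle inside the box
pieces = subtract_box(query, empty)
print(pieces)
```

Each piece corresponds to one `(… and …)` term of the rewritten predicate. Whether the extra terms pay off depends on the optimizer, which may be why only simple rewrites were considered in the experiments.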
Query optimization experiments
• real-world workload of 26 queries
• 5 of the queries “qualified” for the rewrite
• only simple rewrites were considered
• all rewrites led to improved performance