Reverse Nearest Neighbor Aggregates Over Data Streams
Flip Korn, S. Muthukrishnan and Divesh Srivastava. VLDB 2002
Alexander Izbinsky

Background
• RNN(q) returns the set of data points that have the query point q as their nearest neighbor.
• Advanced database applications:
  • Fixed wireless telephone access, the "load" detection problem: count how many users are currently using a specific base station q. If q's load is too heavy, activate an inactive base station to lighten the load of the overloaded base station.
• Asymmetric property
  • The nearest neighbor relation is not symmetric: the set of points that are closest to a query point (its nearest neighbors) differs from the set of points that have the query point as their nearest neighbor (its reverse nearest neighbors).

Nonsymmetrical Property of RNN Queries
[Figure: three points p, r and q]
• NN(q) = p does not imply NN(p) = q.
• If p is the nearest neighbor of q, then q need not be the nearest neighbor of p (in the figure, the nearest neighbor of p is r).
• Efficient NN algorithms therefore cannot be applied directly to RNN problems. Dedicated RNN algorithms are needed.
• A straightforward solution: check for each point whether it has q as its nearest neighbor. This is not suitable for large data sets.

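The straightforward solution can be sketched as follows. This is a minimal illustration on 1-D points, not code from the paper; the function names and the sample coordinates for q, p and r are assumptions chosen to reproduce the situation in the figure.

```python
def nearest_neighbor(points, idx):
    """Index of the point nearest to points[idx], excluding itself.

    Points are 1-D coordinates; ties go to the lowest index (an assumption).
    """
    others = (j for j in range(len(points)) if j != idx)
    return min(others, key=lambda j: abs(points[j] - points[idx]))


def reverse_nearest_neighbors(points, q):
    """Brute force: check every point to see whether q is its nearest neighbor."""
    return [i for i in range(len(points))
            if i != q and nearest_neighbor(points, i) == q]


# Hypothetical layout: q at 0, p at 3, r at 4 (indices 0, 1, 2).
points = [0.0, 3.0, 4.0]
nn_of_q = nearest_neighbor(points, 0)        # p is the nearest neighbor of q
nn_of_p = nearest_neighbor(points, 1)        # but r is the nearest neighbor of p
rnn_of_q = reverse_nearest_neighbors(points, 0)
```

Every query scans all points and each scan is itself linear, which is why this approach does not scale to large data sets.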
Two Versions of the RNN Problem
• Bichromatic version:
  • The data points fall into two categories, say red and blue. The query point q belongs to one of the categories, say blue, so RNN(q) must determine the red points that have q as their closest blue point.
  • E.g. the fixed wireless telephone access application: clients are red (e.g. call initiation or termination) and servers are blue (e.g. fixed wireless base stations).
• Monochromatic version:
  • All points are of the same color.

Introduction
• RNN queries have been studied for finite, stored data sets
• RNN can identify the "influence" of a data point on the database
  • [F. Korn and S. Muthukrishnan. Influence Sets Based on Reverse Nearest Neighbor Queries]
  • [I. Stanoi, M. Riedewald, D. Agrawal, A. El Abbadi. Discovery of Influence Sets in Frequently Updated Databases]
  • [C. Yang, King-Ip Lin. An Index Structure for Efficient Reverse Nearest Neighbor Queries]

Determining the Influence Set
• Finding the set of customers affected by the opening of a new store outlet location
• Notifying the subset of subscribers to a digital library who will find a newly added document most relevant
• Finding the set of users whose profiles are more similar to a new service offering than to any other service
The interest is not in the exact RNN set but in aggregates over this set: RNN aggregates (RNNA).

RNNA Application 1: Fixed Wireless Telephony Access
• Fixed physical position
• Defined coverage area
• Calls arrive in streams
• Worst-case "signal strength": RNN MAXDIST
• "Load" on base station: RNN COUNT
• Optimization RNNA problems

RNNA Application 2: Highway Traffic Monitoring
• Fixed physical position
• Detect vehicles, estimate speed and length
• User queries arrive in streams
• Periodic updates of closest sensor
• "Load" on sensor: RNN COUNT
• "Accuracy" of information: RNN MAXDIST
• Optimization RNNA problems

RNNA Computations
• Max-RNNA: given K servers, return the maximum RNNA over all clients to any of the servers
• List-RNNA: given K servers, return the RNNA over all clients to each of the servers
• Opt-RNNA: find a set of at most K servers for which their RNNAs are below a given threshold
Exact computation is not possible

RNNA Approximations
• Max-RNN-COUNT
  • Insertion and deletion: 3-approximation
  • Insertion only: (1+ε)-approximation
• Max-RNN-MAXDIST
  • (1+ε)-approximation
• List-RNN-COUNT and List-RNN-MAXDIST
  • Lower and upper bounds as a function of the true counts
• Opt-RNN-COUNT
  • 8-approximation
• Opt-RNN-MAXDIST
  • (1+ε)-approximation
Space: near-linear in the number of available servers

Related Work
• No previous work on RNNA over data streams
• Algorithms over data streams
• Algorithms for computing RNN over a conventional DB

Algorithms over Data Streams
• Space requirements of selection and sorting as a function of the number of passes over the data
  • [J. I. Munro and M. S. Paterson. Selection and Sorting with Limited Storage]
• Formalization of the data stream model
  • [A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M.J. Strauss. Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries]
  • [M. R. Henzinger, P. Raghavan, S. Rajagopalan. Computing on Data Streams]

Algorithms over Data Streams
• Computing the approximate median and other quantiles in a single pass over the data set
  • [R. Agrawal, A. Swami. A One-Pass Space-Efficient Algorithm for Finding Quantiles]
  • [G.S. Manku, S. Rajagopalan, B.G. Lindsay. Approximate Medians and other Quantiles in One Pass and with Limited Memory]
  • [G.S. Manku, S. Rajagopalan, B.G. Lindsay. Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets]
  • [M. Greenwald and S. Khanna. Space-Efficient Online Computation of Quantile Summaries]

Algorithms over Data Streams
• Computing approximate online quantiles with probabilistic guarantees over a data stream
  • [A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M.J. Strauss. How to Summarize the Universe: Dynamic Maintenance of Quantiles]
• Histogram construction over a data stream
  • [A.C. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, M.J. Strauss. Fast, Small-Space Algorithms for Approximate Histogram Maintenance]

Algorithms over Data Streams
• Maintaining summary structures for approximate aggregates over a data stream
  • [A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M.J. Strauss. Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries]
  • [M. R. Henzinger, P. Raghavan, S. Rajagopalan. Computing on Data Streams]
  • [J. Gehrke, F. Korn, and D. Srivastava. On Computing Correlated Aggregates over Continual Data Streams]

Algorithms over Data Streams: Mining Data Streams
• Construction of decision trees
  • [P. Domingos, G. Hulten. Mining High-Speed Data Streams]
  • [J. Gehrke, V. Ganti, R. Ramakrishnan, W.-Y. Loh. BOAT: Optimistic Decision Tree Construction]
• Association rules
  • [C. Hidber. Online Association Rule Mining]
• Similarity matching
  • [G. Cormode, M. Datar, P. Indyk, S. Muthukrishnan. Comparing Data Streams Using Hamming Norms]

Algorithms over Data Streams: Mining Data Streams
• Clustering algorithms (k-median clustering problem)
  • [M. Charikar, C. Chekuri, T. Feder, R. Motwani. Incremental Clustering and Dynamic Information Retrieval]
  • [S. Guha, N. Mishra, R. Motwani, L. O'Callaghan. Clustering Data Streams]

Algorithms over Data Streams: Dynamic Maintenance
• Lp norms
  • [P. Indyk. Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation]
• Hamming norms
  • [G. Cormode, M. Datar, P. Indyk, S. Muthukrishnan. Comparing Data Streams Using Hamming Norms]
• Quantiles
  • [A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M.J. Strauss. How to Summarize the Universe: Dynamic Maintenance of Quantiles]
• Sliding window
  • [M. Datar. Maintaining Stream Statistics over Sliding Windows]

Algorithms for Computing RNN over a Conventional DB
• Study of RNN in databases
  • [F. Korn and S. Muthukrishnan. Influence Sets Based on Reverse Nearest Neighbor Queries]
• Efficient access methods for indexing RNN
  • [I. Stanoi, M. Riedewald, D. Agrawal, A. El Abbadi. Discovery of Influence Sets in Frequently Updated Databases]
  • [C. Yang, King-Ip Lin. An Index Structure for Efficient Reverse Nearest Neighbor Queries]

Problem Definition
• A collection of n available servers (not necessarily active)
• l_i: location of server i
• Clients arrive and depart
• L_j: location of client j
• The RNN of server i is the set of all clients that have i as their NN server

Instances of Aggregates
• RNN-COUNT(i): the number of clients currently in the system for which i is the NN. This is the "load" for active servers.
• RNN-MAXDIST(i): the largest distance to a client that has i as its NN. This is the "quality" for active servers.
• Streams of clients are large and cannot be stored in memory, so approximate RNNA values are computed.

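To make the two aggregates concrete, here is a small exact computation over a stored client set on a line. It is only a reference point for what the streaming algorithms approximate; the function name, the tie-breaking rule and the sample data are assumptions.

```python
def rnn_aggregates(servers, clients):
    """Exact RNN-COUNT and RNN-MAXDIST per active server (1-D locations).

    Assumption: a client equidistant from two servers is assigned to the
    leftmost one. This stores all clients, which the stream setting forbids.
    """
    count = {s: 0 for s in servers}
    maxdist = {s: 0.0 for s in servers}
    ordered = sorted(servers)
    for c in clients:
        nn = min(ordered, key=lambda s: abs(s - c))  # NN server of client c
        count[nn] += 1
        maxdist[nn] = max(maxdist[nn], abs(c - nn))
    return count, maxdist


# Hypothetical data: two active servers, five clients.
count, maxdist = rnn_aggregates([0, 10], [1, 2, 4, 7, 12])
```

Here server 0 serves clients 1, 2 and 4, and server 10 serves clients 7 and 12.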
Focus of the Problem
• Max-RNNA: given K active servers, return the maximum RNNA over all clients to their closest active server ("worst-case load" or "quality")
• List-RNNA: given K active servers, return a list of the RNNA over all clients to each of the K active servers ("maximum load" or "worst-case quality")
• Opt-RNNA: find a set of at most K servers from the available ones to be active, for which their RNNAs are below a given threshold ("optimization")

Algorithm
Assumption: servers are on a straight line
Counters for servers i, j and client k:
• CL_ij counts clients with L_k ∈ [l_i, (l_i + l_j)/2)
• CR_ij counts clients with L_k ∈ ((l_i + l_j)/2, l_j]

Algorithm for RNN-COUNT(i)
The algorithm:
• Let l be the closest active server to the left of i and r the closest active server to the right
• RNN-COUNT(i) = CL_il + CR_ir
Requires O(n^2) space and O(n^2) updates
We want near-linear space and fewer updates, so approximation is needed

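The exact pairwise-counter scheme can be sketched as below. This is an illustration under assumptions, not the paper's code: the class name and the single `near` table (one counter per ordered pair, covering the half of the gap closer to the first server) are choices made here, servers are assumed to have distinct locations, and only a server with an active neighbor on both sides is handled. As with the half-open intervals on the slide, a client exactly at a midpoint is counted for neither server.

```python
class PairCounters:
    """Exact counters over all ordered server pairs (O(n^2) space).

    near[(i, j)] = number of clients lying between servers i and j
    that are strictly closer to i (clients located at l_i count on
    the right-hand side only, so they are never counted twice).
    """

    def __init__(self, locs):
        self.locs = sorted(locs)
        n = len(self.locs)
        self.near = {(i, j): 0 for i in range(n) for j in range(n) if i != j}

    def update(self, x, delta=1):
        """Insert (delta=+1) or delete (delta=-1) a client at location x.

        Touches every pair, hence O(n^2) work per update.
        """
        for (i, j) in self.near:
            li, lj = self.locs[i], self.locs[j]
            mid = (li + lj) / 2
            # j to the right of i: [l_i, mid); j to the left of i: (mid, l_i)
            if (li <= x < mid) or (mid < x < li):
                self.near[(i, j)] += delta

    def rnn_count(self, i, left, right):
        """RNN-COUNT of server i given its closest active neighbors."""
        return self.near[(i, left)] + self.near[(i, right)]


# Hypothetical example: servers at 0, 10, 20, 30; server index 2 is inactive.
pc = PairCounters([0, 10, 20, 30])
for x in [6, 9, 10, 14, 16, 26]:
    pc.update(x)
load = pc.rnn_count(1, left=0, right=3)
```

The quadratic number of counters touched on every client arrival or departure is the cost the approximate structures are designed to avoid.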
Data Structure
Definitions: s_1, ..., s_K are the K servers designated to be active
Assumption: servers are sorted by location, l_1, ..., l_n
Counter of the number of clients for server i:
• C(i) counts clients with L_k ∈ [l_i, l_{i+1}), i.e. on the right side of server i
• C(0) counts clients on the left side of server 1
Requires O(n) space and O(log n) per update (to look up the relevant server)

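A minimal sketch of these counters, assuming servers are numbered 1..n as on the slide. The binary search gives the O(log n) update. The `count_bounds` method is not the paper's estimator: it is an illustrative way, assumed here, to bracket RNN-COUNT of an active server by separating intervals that lie entirely inside its region from intervals that straddle the midpoint to a neighboring active server.

```python
import bisect


class IntervalCounters:
    """C(0..n): C(i) counts clients in [l_i, l_{i+1}); C(0) is left of server 1."""

    def __init__(self, locs):
        self.locs = sorted(locs)
        self.c = [0] * (len(self.locs) + 1)

    def update(self, x, delta=1):
        """Insert or delete a client at x in O(log n) via binary search."""
        self.c[bisect.bisect_right(self.locs, x)] += delta

    def _loc(self, k):
        return self.locs[k - 1]  # servers are 1-based, as on the slide

    def count_bounds(self, a, left=None, right=None):
        """(lower, upper) bounds on RNN-COUNT of active server a.

        left/right are the 1-based indices of the neighboring active
        servers, or None if there is none on that side.
        """
        lo = hi = 0
        last = len(self.locs) if right is None else right - 1
        for k in range(a, last + 1):          # intervals to the right of a
            if right is None:
                lo += self.c[k]
                hi += self.c[k]
                continue
            mid = (self._loc(a) + self._loc(right)) / 2
            if self._loc(k + 1) <= mid:       # interval entirely closer to a
                lo += self.c[k]
            if self._loc(k) < mid:            # interval at least partly closer
                hi += self.c[k]
        first = 0 if left is None else left
        for k in range(first, a):             # intervals to the left of a
            if left is None:
                lo += self.c[k]
                hi += self.c[k]
                continue
            mid = (self._loc(left) + self._loc(a)) / 2
            if self._loc(k) > mid:
                lo += self.c[k]
            if self._loc(k + 1) > mid:
                hi += self.c[k]
        return lo, hi


# Hypothetical example: servers 1..4 at 0, 10, 20, 30; servers 1, 2, 4 active.
ic = IntervalCounters([0, 10, 20, 30])
for x in [-3, 6, 9, 10, 14, 16, 26]:
    ic.update(x)
bounds_2 = ic.count_bounds(2, left=1, right=4)
```

In this example the true RNN-COUNT of server 2 is 5 (clients 6, 9, 10, 14, 16). The gap between the bounds comes from the interval [0, 10), which contains the midpoint between servers 1 and 2 and so cannot be split exactly from the counter alone.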
Answering Queries: Max-RNNA(s_1, ..., s_K)
Max-RNNA(s_1, ..., s_K) = max_i RNN-COUNT(s_i)

Example: Max-RNNA(s_1, ..., s_K)
[Figure: servers 1 to 4 on a line with counters C(0) to C(4); the spans labelled RNN-COUNT(s0) and RNN-COUNT(s1) are marked above the counters]

Answering Queries: List-RNNA(s_1, ..., s_K)
• M_i for each s_i
• The proof is similar to that of the previous theorem

Answering Queries: Opt-RNNA
• The greedy algorithm finds:
  • The minimal number of active servers, K
  • Such that max_i RNN-COUNT(s_i) ≤ C

Opt-RNNA "Dual" Problem
Minimize max_i RNN-COUNT(s_i), given an upper bound K on the number of servers
• Algorithm
  • Choose different values of C
  • Run the greedy algorithm of Opt-RNNA
  • Repeat until the problem is solved with a number of servers K* ≤ K

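One way to organize the "choose different values of C" loop is a binary search over the threshold. This is an assumption: the slides do not say how C is chosen. The greedy Opt-RNNA routine is treated as a black box, passed in as the hypothetical callback `min_servers_for`, and is assumed to need no more servers as C grows.

```python
def dual_opt(min_servers_for, k, c_lo, c_hi):
    """Smallest integer threshold C in [c_lo, c_hi] solvable with at most k servers.

    min_servers_for(C) stands in for the greedy Opt-RNNA algorithm and
    returns the number of active servers it needs for threshold C.
    Assumes that number is non-increasing in C. Returns None if even
    c_hi needs more than k servers.
    """
    if min_servers_for(c_hi) > k:
        return None
    while c_lo < c_hi:
        mid = (c_lo + c_hi) // 2
        if min_servers_for(mid) <= k:
            c_hi = mid          # feasible: try a tighter threshold
        else:
            c_lo = mid + 1      # infeasible: relax the threshold
    return c_lo


def toy_greedy(c):
    """Toy stand-in: 100 clients spread evenly need ceil(100 / c) servers."""
    return -(-100 // c)


best_c = dual_opt(toy_greedy, k=3, c_lo=1, c_hi=100)
```

Each probe runs the greedy algorithm once, so the search costs a logarithmic number of greedy runs in the range of C.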
Insert-Only Clients: Data Structure
Assumption: servers are sorted by location, l_1, ..., l_n
Counter of the number of clients for server i:
• C(i) counts clients with L_k ∈ [l_i, l_{i+1}), i.e. on the right side of server i
• C(0) counts clients on the left side of server 1
Count partitioning:
• Maintain l-quantiles (Greenwald & Khanna)
• c_i1, ..., c_il: the number of clients lying in [l_i, L_{c_ik}] is within (1 ± ε) k C(i)/l, where 1 ≤ k ≤ l
Requires O(log C(i) / ε) space

Answering Queries: Max-RNNA(s_1, ..., s_K)
Max-RNNA(s_1, ..., s_K) = max_i RNN-COUNT(s_i)

Insert-Only Clients
• List-RNNA(s_1, ..., s_K): implementation in the same way
• Opt-RNNA
• Maintenance of the data structure for deletions?

Algorithm for RNN-MAXDIST(i)
The algorithm: histogram based on space partitioning
• Assumption: servers are sorted by location, l_1, ..., l_n
• Exponentially sized buckets
• Domain size U, such that U = [min(L_j, l_i), max(L_j, l_i)]
• Dividers between servers i and i+1: g_ij at distance (1+ε)^j from l_i
• The number of dividers is O(log_{1+ε}(l_{i+1} - l_i))

Data Structure
The counter of the number of clients between g_ik and g_i(k+1) is #g_ik
• For an update of client j:
  • Find i such that L_j ∈ [l_i, l_{i+1})
  • Find k such that L_j ∈ [g_ik, g_i(k+1))
  • Update the value #g_ik
Requires O(n log_{1+ε} U) space and O(log_{1+ε} U) per update

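The update steps above can be sketched as follows. This is a partial illustration under assumptions: the class and method names are invented here, distances below 1 share a single first bucket, clients are assumed to lie at or to the right of the first server, and only the distance to the right of a server is estimated (a full RNN-MAXDIST also needs the left side and the boundary with the neighboring active server).

```python
import bisect


class ExpBuckets:
    """Per-server histogram with dividers at distance (1 + eps)^j to the right."""

    def __init__(self, locs, eps):
        self.locs = sorted(locs)
        self.eps = eps
        self.counts = {}  # (i, k) -> number of clients in bucket k right of server i

    def _bucket(self, x):
        # Step 1: find i such that x lies in [l_i, l_{i+1}) (0-based here).
        i = bisect.bisect_right(self.locs, x) - 1
        d = x - self.locs[i]
        # Step 2: find k such that d lies in [(1+eps)^(k-1), (1+eps)^k);
        # bucket 0 holds distances in [0, 1). Takes O(log_{1+eps} d) steps.
        k, bound = 0, 1.0
        while d >= bound:
            bound *= 1 + self.eps
            k += 1
        return i, k

    def update(self, x, delta=1):
        """Step 3: adjust the counter of the bucket containing x."""
        key = self._bucket(x)
        self.counts[key] = self.counts.get(key, 0) + delta

    def max_right_dist(self, i):
        """Upper edge of the farthest non-empty bucket right of server i.

        For a true maximum distance of at least 1 this overestimates it
        by a factor of at most (1 + eps).
        """
        ks = [k for (s, k), c in self.counts.items() if s == i and c > 0]
        return float((1 + self.eps) ** max(ks)) if ks else 0.0


# Hypothetical example: servers at 0 and 100, eps = 1 (buckets double in size).
h = ExpBuckets([0, 100], eps=1)
for x in [5, 20, 130]:
    h.update(x)
estimate = h.max_right_dist(0)
```

With these inputs the farthest client to the right of server 0 is at distance 20 and falls in the bucket [16, 32), so the estimate is the bucket's upper edge. Because buckets hold counts rather than a single maximum, deleting the client at 20 makes the estimate fall back to the next non-empty bucket.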
Answering Queries: Max-RNNA(s_1, ..., s_K)
Max-RNNA(s_1, ..., s_K) = max_i RNN-MAXDIST(s_i)

Max-RNNA(s_1, ..., s_K)
Details of the proof will be given in a future paper.

Answering Queries: List-RNNA(s_1, ..., s_K)
• D_i = max{RD_i, LD_i} for each s_i
• The proof is similar to that of the previous theorem

Answering Queries: Opt-RNNA
• The greedy algorithm with limited backtracking finds:
  • The minimal number of active servers, K
  • Such that max_i RNN-MAXDIST(s_i) ≤ D

Opt-RNNA
The proof will be given in a future paper.

Opt-RNNA "Dual" Problem
Minimize max_i RNN-MAXDIST(s_i), given an upper bound K on the number of servers
• Algorithm
  • Choose different values of D
  • Run the greedy algorithm of Opt-RNNA
  • Repeat until the problem is solved with a number of servers K* ≤ K

Extensions
• Assumption so far: the clients are on the same axis as the servers
• Let the space around a query point q be divided into six equal regions S_i (1 ≤ i ≤ 6) by straight lines intersecting at q. Each S_i is the space between two dividing lines.
• For a given 2-dimensional dataset, RNN(q) returns at most six data points. When there are exactly six, they must lie on the same circle centered at q.
[Figure: query point q with dividing lines L1, L2, L3 and regions s1 to s6]
• [R. Benetis, C.S. Jensen, G. Karciauskas, S. Saltenis. Nearest Neighbor and Reverse Nearest Neighbor Queries for Moving Objects]
• [SHOU Yu Tao. Reverse Nearest Neighbor Queries for Dynamic Databases]

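The six-region observation suggests a filter-then-verify procedure for the monochromatic 2-D case: within one 60-degree region only the point nearest to q can have q as its nearest neighbor, so at most six candidates need checking. The sketch below is an illustration under assumptions (function names, half-open regions starting at angle 0, ties counted in favor of q, data points distinct from q), not an algorithm taken from the slides.

```python
import math


def sector(q, p):
    """Index 0..5 of the 60-degree region around q that contains p."""
    ang = math.atan2(p[1] - q[1], p[0] - q[0]) % (2 * math.pi)
    return min(int(ang // (math.pi / 3)), 5)  # guard against rounding to 2*pi


def rnn_2d(points, q):
    """Monochromatic RNN(q) in the plane via six-region filtering."""
    # Filter: keep the point nearest to q in each region (at most six).
    best = {}
    for p in points:
        s = sector(q, p)
        if s not in best or math.dist(q, p) < math.dist(q, best[s]):
            best[s] = p
    # Verify: q must be at least as close to the candidate as any other point.
    return [p for p in best.values()
            if all(math.dist(p, q) <= math.dist(p, o) for o in points if o != p)]


# Hypothetical data: (3, 0.5) shares a region with (1, 0) and is filtered out.
pts = [(1, 0), (3, 0.5), (0, 2), (-1, -1)]
result = rnn_2d(pts, (0, 0))
```

The verification step here still scans all points for each candidate; an index supporting NN queries would replace that scan in practice.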
Experiments
The following aspects were tested:
Experimental data: CALIFORNIA (latitude of 63k buildings in California), uniform and binomial distributions

Average Error of the List-RNN-COUNT Test
AVG_i (Ĉ_i / C_i)

Average Error of the List-RNN-MAXDIST Test
AVG_i (D_i / D̂_i)

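Both error measures are averages of per-server ratios. A small sketch follows, assuming the hatted symbols denote estimates and the plain symbols true values; the function name and the sample numbers are made up for illustration and are not results from the experiments.

```python
def avg_ratio(numer, denom):
    """Average of numer[i] / denom[i] over servers with a non-zero denominator."""
    pairs = [(a, b) for a, b in zip(numer, denom) if b != 0]
    return sum(a / b for a, b in pairs) / len(pairs)


# Hypothetical values for two servers.
est_counts, true_counts = [12, 9], [10, 10]
count_error = avg_ratio(est_counts, true_counts)    # AVG_i(C^_i / C_i)

true_dists, est_dists = [4.0, 3.0], [5.0, 3.0]
dist_error = avg_ratio(true_dists, est_dists)       # AVG_i(D_i / D^_i)
```

A value of 1 means the estimates match the true values on average; note that the two measures place the estimate on opposite sides of the ratio.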