Continuous Processing of Preference Queries in Data Streams : a Survey

M. Kontaki, A.N. Papadopoulos, Y. Manolopoulos • Data Engineering Lab, Department of Informatics, Aristotle University of Thessaloniki

Presentation Layout • Preliminaries • Continuous skyline queries • Continuous top-k queries • Continuous top-k dominating queries • Summary

Data Streams • A data stream is an infinite sequence of objects. • Each object can be one-dimensional or multi-dimensional. • Streaming time series are finite sequences of objects. • Streaming time series change over time. • The arrival rate of objects usually varies.

Sliding Window Model (1) • Count-based window: the sliding window contains the W most recent tuples ("active"). • Older tuples expire. [Figure: timeline of tuples t1 to t8 with W=5, marking expired and active tuples.]

Sliding Window Model (2) • Time-based window: the sliding window contains the tuples ("active") of the W most recent timestamps. • Older records expire. [Figure: timeline of tuples t1 to t8 with W=5, marking expired and active tuples.]
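The two window models can be sketched in a few lines of Python. This is only an illustration of the definitions on the two slides above: the class names `CountWindow` and `TimeWindow` are hypothetical, and the time-based version assumes that timestamps arrive in non-decreasing order.

```python
from collections import deque


class CountWindow:
    """Count-based window: keeps the W most recent tuples."""

    def __init__(self, w):
        # A bounded deque drops its oldest element when a new one is appended.
        self.items = deque(maxlen=w)

    def insert(self, item):
        # The oldest tuple expires only if the window is already full.
        expired = self.items[0] if len(self.items) == self.items.maxlen else None
        self.items.append(item)
        return expired


class TimeWindow:
    """Time-based window: keeps the tuples of the W most recent timestamps."""

    def __init__(self, w):
        self.w = w
        self.items = deque()  # (timestamp, item) pairs in arrival order

    def insert(self, timestamp, item):
        # Assumption: timestamps are non-decreasing.
        self.items.append((timestamp, item))
        expired = []
        # Active timestamps are timestamp-W+1 .. timestamp; older ones expire.
        while self.items[0][0] <= timestamp - self.w:
            expired.append(self.items.popleft()[1])
        return expired
```

With W=5, inserting t1 to t8 into a `CountWindow` leaves t4 to t8 active. A `TimeWindow` may hold more or fewer than W tuples, because the number of tuples per timestamp varies with the arrival rate.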

[Figure: query evaluation in a database system. The user/application sends a query as input to the database system and receives a result.]

Continuous Evaluation in a Data Stream System [Figure: the user/application sends a query to the query processor and receives results.]

Motivation (1) • Numerous data stream contexts • Financial data analysis • Network management • Astronomical data analysis • Sensor networks • Telecommunication data management

Motivation (2) • Preference queries • Useful decision support tool • Many applications in data streams • Example 1 (telecommunication data): report the clients with the maximum call time and the maximum number of calls. • Example 2 (stock-market data): report the products with the maximum price, the minimum sales and the minimum number of buyers. [Slide callouts: continuous top-k dominating query; continuous skyline query]

Presentation Layout • Preliminaries • Continuous skyline queries • Continuous top-k queries • Continuous top-k dominating queries • Conclusions

Skyline Query • Skyline: contains all the tuples not dominated by any other tuple. • Dominant tuple: a tuple t dominates another tuple t' if • t is not worse than t' in all dimensions, and • t is better than t' in at least one dimension. [Figure: tuples T1 to T6 plotted by price versus distance.]
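The dominance definition above translates directly into code. The following is a minimal sketch of a static, quadratic-time skyline computation, not one of the streaming methods surveyed later. It assumes that smaller values are better in every dimension, and the sample points are hypothetical (distance, price) pairs rather than the coordinates of T1 to T6 in the figure.

```python
def dominates(t, u):
    """True if t dominates u (assumption: smaller is better in every dimension)."""
    not_worse_everywhere = all(a <= b for a, b in zip(t, u))
    better_somewhere = any(a < b for a, b in zip(t, u))
    return not_worse_everywhere and better_somewhere


def skyline(tuples):
    """Return the tuples that are not dominated by any other tuple."""
    return [t for t in tuples if not any(dominates(u, t) for u in tuples)]


# Hypothetical (distance, price) pairs.
points = [(1, 9), (2, 6), (4, 3), (5, 5), (6, 7), (8, 1)]
print(skyline(points))  # [(1, 9), (2, 6), (4, 3), (8, 1)]
```

Here (5, 5) is dominated by (4, 3) and (6, 7) is dominated by (2, 6), so neither appears in the result.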

Continuous Skyline Query • Problem definition: We have to continuously evaluate a skyline query in multidimensional streaming time series. • Application example: network data • Computers with suspicious behavior. • Network traffic, number of connections, number of destinations.

Basic Idea • The skyline changes due to • The insertion of a new skyline tuple. • The expiration of a skyline tuple. • LookOut [Morse, ICDE06] and Lazy [Tao, TKDE06] • Use of a spatial index • Advantage: simple implementation • Disadvantage: the expiration of a skyline tuple is not handled efficiently

Event Approach (1) • An existing skyline tuple expires: • How can we find the new skyline tuples? • Very costly operation • Skyline influence time (SIT) • Minimum time at which a tuple may become a skyline tuple. • Generate events based on SIT

Event Approach (2) • Eager [Tao, TKDE06] • Advantage: handles skyline expiration • Disadvantage: processing time per tuple [Figure: tuples A(1), B(2), C(3), D(4), E(5), F(6), G(7), H(8), I(9), J(10), K(11), L(12), labeled with arrival times, W=10. Tuple K can be discarded due to tuple L (younger and better). K.SIT=19]

n-of-N Skyline Queries (1) • n-of-N definition: the skyline of the n most recent tuples, for any n up to the window size N • S6 = {a,c} • S4 = {c,g} source: icde05

n-of-N Skyline Queries (2) n-of-N definition • S6 = {c,h} • S4 = {e,h} source: icde05

Method cnN (1) • Method cnN [Lin, ICDE05] is also based on events • Tuple K is redundant because tuple L is better and younger than K • Tuple L is dominated by D and E. • The dominance relation between L and E is critical because E is the youngest tuple which dominates L [Figure: tuples A(1), B(2), C(3), D(4), E(5), F(6), G(7), H(8), I(9), J(10), K(11), L(12), labeled with arrival times, W=10.]

Method cnN (2) • Generate intervals • For the skyline tuples, e.g. C = (0,3] • For the critical dominance relations, C -> G = (3,7] • Use an interval tree to store them • The dominance graph contains all the critical dominance relations [Figure: dominance graph over tuples A(1), B(2), C(3), D(4), E(5), F(6), G(7), with the redundant tuples and the critical dominance relations marked.]

Method cnN (3) • A tuple t is in the answer of an n-of-N skyline query iff there exists an interval of t containing the value M-n+1, where M is the total number of elements seen so far. • To answer an n-of-N query, apply a stabbing query at M-n+1 • Example with M = 7 and intervals C = (0,3], D = (0,4], C -> G = (3,7], D -> E = (4,5], D -> F = (4,6] • For n = 4, M-n+1 = 4, S4 = {D, G} • For n = 6, M-n+1 = 2, S6 = {C, D} [Figure: dominance graph over tuples A(1), B(2), C(3), D(4), E(5), F(6), G(7).]
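The stabbing query on this slide can be reproduced with a small sketch. It uses a linear scan over the intervals instead of the interval tree mentioned for cnN, so it illustrates the lookup rule only, not the method's data structure. It assumes that an interval written as X -> Y reports the dominated tuple Y, which is consistent with S4 = {D, G} on the slide. The function name is hypothetical.

```python
# Each entry is (low, high, tuple reported when the half-open interval
# (low, high] contains the stabbing point).
intervals = [
    (0, 3, "C"),  # C = (0,3], skyline tuple
    (0, 4, "D"),  # D = (0,4], skyline tuple
    (3, 7, "G"),  # C -> G = (3,7], critical dominance relation
    (4, 5, "E"),  # D -> E = (4,5]
    (4, 6, "F"),  # D -> F = (4,6]
]


def n_of_n_skyline(intervals, m, n):
    """Answer an n-of-N skyline query with a stabbing query at M-n+1."""
    point = m - n + 1
    # Linear scan; an interval tree would answer the same query faster.
    return sorted(name for low, high, name in intervals if low < point <= high)


print(n_of_n_skyline(intervals, m=7, n=4))  # ['D', 'G']
print(n_of_n_skyline(intervals, m=7, n=6))  # ['C', 'D']
```

Because every n maps to a single stabbing point, the same interval set serves queries for all values of n, which is where the multiple-query advantage of the method comes from.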

Method cnN (4) • Advantages • Good use of skyline properties • Multiple query processing • Disadvantages • Processing time per tuple • Increased memory requirements

Frequent Skyline - Motivation • Highly dynamic environment • The skyline results are meaningful only if the skyline tuples appear consistently • Frequent skyline: tuples on the skyline for a minimum user-defined interval. [Zhang, SIGMOD09]

Streaming Model • Client/server architecture • The server receives object updates from the clients. • Each object can be represented as a d-dimensional point. • Object update (point movement in the d-dimensional space) • The value in at least one dimension changes • Object insertion or deletion • Point movement from/to a nonexistent position • Goal: minimization of the communication cost

Filter • Safe region technique • The skyline remains unchanged if each object stays in its safe region • Communication happens only when the safe region is violated • The safe region approach leads to communication optimization [Figure: an object as a point and its filter (safe region); source: sigmod09]

Sampling • All clients report their skyline at the same sampled time • The clients are synchronized with the same random seed • Guaranteed quality if sampling rate is high enough

Hybrid • Hybrid solution • Combines Filter and Sampling • Small changes: apply Filter • Larger changes: apply Sampling • Disadvantage of all three methods • energy consumption is not uniform (critical in sensor networks)

k-dominant Skyline Query - Motivation • Skyline: contains all tuples not dominated by any other tuple. • Disadvantage: high dimensionality problem. • Solution: relax the notion of dominance. • k-dominant tuple: a tuple t k-dominates another tuple t' if • t is not worse than t' in at least k dimensions and • t is better than t' in at least one of them. • k-dominant skyline: contains all tuples not k-dominated by any other tuple [Kontaki, SAC08]
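A short sketch makes the relaxed dominance test concrete. It assumes that smaller values are better, and the three-dimensional sample tuples are hypothetical. The test counts the dimensions in which t is not worse than t'; if there are at least k of them and t is strictly better in one, a suitable set of k dimensions exists.

```python
def k_dominates(t, u, k):
    """True if t k-dominates u (assumption: smaller is better)."""
    not_worse = sum(a <= b for a, b in zip(t, u))
    better = any(a < b for a, b in zip(t, u))
    # Any strictly better dimension is also a not-worse dimension, so it can
    # always be included among the k chosen dimensions.
    return not_worse >= k and better


def k_dominant_skyline(tuples, k):
    """Return the tuples that are not k-dominated by any other tuple."""
    return [t for t in tuples if not any(k_dominates(u, t, k) for u in tuples)]


data = [(1, 2, 3), (2, 1, 3), (3, 3, 3)]  # hypothetical tuples
print(k_dominant_skyline(data, 3))  # [(1, 2, 3), (2, 1, 3)]
print(k_dominant_skyline(data, 2))  # []
```

With k equal to the number of dimensions, the result is the conventional skyline. With k=2 the first two tuples 2-dominate each other, so the 2-dominant skyline of this sample is empty.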

k-dominant Skyline Query - Example • T1 4-dominates T3 • T1 5-dominates T4 • T1 dominates T5 • Conventional skyline: {T1, T2, T3, T4} • 5-dominant skyline: {T1, T2, T3} • 4-dominant skyline: {T1, T2} • The smaller the k, the fewer tuples in the k-dominant skyline

Observations • Traditional or streaming skyline methods are inappropriate • Skyline properties do not hold • E.g. transitive property • k-dominance can be cyclic • Existence of multiple users and multiple queries.

Method CoSMuQ (1) • A query on D dimensions arrives. • Given a parameter value k, split the query into subqueries of d=k dimensions. • Compute the conventional skyline of each subquery. • The k-dominant skyline is the intersection of the skylines of the subqueries of the query.
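The decomposition on this slide can be illustrated on static data. The sketch below is not the CoSMuQ implementation: it recomputes every subspace skyline from scratch, ignores the stream, and does not share work between queries. It assumes that smaller values are better, and the sample tuples are hypothetical.

```python
from itertools import combinations


def dominates(t, u, dims):
    """Conventional dominance restricted to the dimensions in dims."""
    return all(t[i] <= u[i] for i in dims) and any(t[i] < u[i] for i in dims)


def subspace_skyline(tuples, dims):
    """Conventional skyline of one subquery."""
    return {t for t in tuples if not any(dominates(u, t, dims) for u in tuples)}


def k_dominant_skyline(tuples, k):
    """Intersect the skylines of all subqueries on k of the D dimensions."""
    total_dims = len(tuples[0])
    result = set(tuples)
    for dims in combinations(range(total_dims), k):
        result &= subspace_skyline(tuples, dims)
    return result


data = [(1, 2, 3), (2, 1, 4), (3, 3, 1), (4, 4, 4)]  # hypothetical tuples
print(k_dominant_skyline(data, 2))  # {(1, 2, 3)}
```

A tuple is k-dominated exactly when it is conventionally dominated in some k-dimensional subspace, which is why the intersection gives the k-dominant skyline. The number of subqueries grows with the number of k-subsets of the D dimensions.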

Method CoSMuQ (2) • Advantages • Based on conventional skyline (simple domination checks) • Properties of conventional skylines can be used • Exploits the overlap between different queries. • Disadvantages • Memory requirements increase in high dimensionality.

Continuous Skyline methods - Summary

Presentation Layout • Data streams - Preliminaries • Continuous skyline queries • Continuous top-k queries • Continuous top-k dominating queries • Summary

Top-k Query - Example • Given a preference function, a top-k query returns the k tuples with the best scores. [Figure: tuples T1 to T6 plotted by price versus distance, with the results for k=1 and k=2 under F=price+distance.]
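As a minimal illustration of the definition, the sketch below evaluates a static top-k query with the preference function F = price + distance. It assumes that a lower score is better, and the tuple coordinates are hypothetical, since the figure's actual values are not available.

```python
import heapq


def top_k(tuples, k, score):
    """Return the k tuples with the best (here: lowest) scores."""
    return heapq.nsmallest(k, tuples, key=score)


def price_plus_distance(t):
    """Preference function F = price + distance for (name, distance, price)."""
    return t[1] + t[2]


# Hypothetical (name, distance, price) tuples.
data = [
    ("T1", 1, 9), ("T2", 2, 6), ("T3", 4, 3),
    ("T4", 5, 5), ("T5", 6, 7), ("T6", 8, 1),
]
print([t[0] for t in top_k(data, 2, price_plus_distance)])  # ['T3', 'T2']
```

Changing the preference function changes the answer, so the result depends on a user-defined choice.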

Continuous Top-k Query • Problem definition: Continuous evaluation of top-k query in multidimensional streaming time series. • Application Example: network data • top-100 flows with the largest individual throughput • Common destination • DDoS attack

Basic Idea • A new tuple changes the top-k • It should belong to the influence region of the query • Top-k tuple expiration • Query computation from scratch • TMA (Top-k Monitoring Algorithm) [Mouratidis, SIGMOD06] • Advantage: simple implementation • Disadvantage: no efficient handling of an expired top-k tuple [Figure: influence region in the (x1, x2) plane, bounded by the line F = score(tk) = x1 + x2 through tuple tk; source: sigmod06]

Skyband - Example • k-skyband: contains all the tuples which are dominated by at most k-1 other tuples. • 1-skyband (tuples not dominated by other tuples): the 1-skyband is the skyline • 2-skyband (tuples dominated by at most 1 other tuple) • 3-skyband (tuples dominated by at most 2 other tuples) [Figure: tuples A, B, C, D, E, with one tuple marked as dominated by 2 other tuples (3-skyband).]
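The k-skyband definition can be checked with a small static sketch. It assumes that smaller values are better and uses hypothetical two-dimensional points rather than the tuples A to E of the figure.

```python
def dominates(t, u):
    """True if t dominates u (assumption: smaller is better)."""
    return all(a <= b for a, b in zip(t, u)) and any(a < b for a, b in zip(t, u))


def k_skyband(tuples, k):
    """Return the tuples dominated by at most k-1 other tuples."""
    return [t for t in tuples if sum(dominates(u, t) for u in tuples) <= k - 1]


points = [(1, 9), (2, 6), (4, 3), (5, 5), (6, 7), (8, 1)]  # hypothetical
print(k_skyband(points, 1))  # [(1, 9), (2, 6), (4, 3), (8, 1)]
print(k_skyband(points, 2))  # [(1, 9), (2, 6), (4, 3), (5, 5), (8, 1)]
```

The 1-skyband equals the skyline. The point (5, 5) is dominated only by (4, 3), so it enters at k=2, while (6, 7) has three dominators and enters only at k=4.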

Skyband Approach (1) • Transform tuples into the (score, expiration_time) space • Dominance counter (DC): number of tuples that are younger and better • Rule: keep tuples with DC < k • Observation: tuples appearing in some top-k result belong to the k-skyband in the (score, exp_time) space. [Figure: tuples T1 to T6 in the original space (price versus distance, F=price+distance) and in the transformed space (score versus exp_time), annotated with DC values of 0 or 1; the top-1 tuple is marked.]
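The pruning rule on this slide can be sketched as follows. This is an illustration of the dominance counter only, not the SMA algorithm, which maintains the skyband incrementally as tuples arrive and expire. It assumes that a lower score is better, and the scores and expiration times are hypothetical.

```python
def skyband_in_score_time(tuples, k):
    """Keep the tuples whose dominance counter (DC) is below k.

    tuples: list of (name, score, exp_time).
    Assumption: lower score is better; a later exp_time means a younger tuple.
    """
    kept = []
    for name, score, exp_time in tuples:
        # DC counts the tuples that are both better and younger.
        dc = sum(1 for _, s, e in tuples if s < score and e > exp_time)
        if dc < k:
            kept.append((name, dc))
    return kept


# Hypothetical (name, score, exp_time) tuples.
data = [
    ("T1", 10, 1), ("T2", 8, 2), ("T3", 7, 3),
    ("T4", 10, 4), ("T5", 13, 5), ("T6", 9, 6),
]
print(skyband_in_score_time(data, 1))  # [('T3', 0), ('T6', 0)]
print(skyband_in_score_time(data, 2))
# [('T2', 1), ('T3', 0), ('T4', 1), ('T5', 1), ('T6', 0)]
```

In this sample T1 has three better and younger tuples, so it can never enter a top-2 result during its lifetime and is safe to discard.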

Skyband Approach (2) • SMA (Skyband Monitoring Algorithm) proposed in [Mouratidis, SIGMOD06] • Advantage: independent of the dimensionality • 2-dimensional space (score, exp_time) • Disadvantage: • The k-skyband may contain fewer than k tuples • In this case, a top-k tuple expiration will cause query computation from scratch

Distributed Top-k • Continuously report the k largest values obtained from distributed data streams. • Objective is to minimize communication cost • Proposed by [Babcock, SIGMOD03]

Streaming Model • Nodes: N1, N2, …, Nm; coordinator node: N0 • Set of n data objects O1, O2, …, On associated with real values V1, V2, …, Vn • Value updates are represented as $\langle O_i, N_j, \Delta \rangle$ tuples: • Nj detects a change $\Delta$ in the value Vi of Oi. • The change is not seen by other nodes Nk ($k \neq j$) • The value Vi for an object Oi: $V_i = \sum_j V_{i,j}$ • where $V_{i,j}$ is the value of the i-th object at the j-th node

Method (1) • Initialize a top-k set at the coordinator node • Set arithmetic constraints at monitor nodes • Depend on the current top-k set • Constraints valid => no communication • Constraints invalidated • Client communicates with server • Possibly new top-k set • Recomputation of constraints

Method (2) - Adjustment Factors • Adjustment factors (AF) • Local top-k similar to global => low communication cost • Local top-ks differ from global top-k => unnecessary constraint violations => increased communication cost • To keep the results valid, the AFs for each object sum to zero • Disadvantage: energy consumption is not uniform [Figure: two nodes and two objects with Top-1 = {O1}; Node 1, Local Top-1 = {O1}; Node 2, Local Top-1 = {O2}; adjusted values shown on the slide: Node 2: V1,2 = 3+0 = 3, Node 2: V2,1 = 1+3 = 4.]
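The role of the adjustment factors can be shown with a small sketch of the local constraint check. It assumes the constraint takes the form "every adjusted top-k value is at least every adjusted non-top-k value at the same node", ignores any adjustment factor held at the coordinator, and uses hypothetical values and names; it is not the protocol of [Babcock, SIGMOD03], which also covers how violations are resolved.

```python
def local_constraints_hold(local_values, adjustments, top_k):
    """Check one node's constraints against the current global top-k set."""
    adjusted = {obj: local_values[obj] + adjustments[obj] for obj in local_values}
    lowest_top = min(adjusted[obj] for obj in top_k)
    rest = [adjusted[obj] for obj in adjusted if obj not in top_k]
    return not rest or lowest_top >= max(rest)


# Hypothetical local values; global values are O1 = 10 and O2 = 5.
node1 = {"O1": 9, "O2": 1}
node2 = {"O1": 1, "O2": 4}
top_1 = {"O1"}

no_af = {"O1": 0, "O2": 0}
af_node1 = {"O1": -4, "O2": 0}  # adjustment factors for each object
af_node2 = {"O1": 4, "O2": 0}   # sum to zero across the nodes

print(local_constraints_hold(node2, no_af, top_1))     # False
print(local_constraints_hold(node1, af_node1, top_1))  # True
print(local_constraints_hold(node2, af_node2, top_1))  # True
```

Without adjustment factors, node 2 reports a violation even though the global top-1 is correct. Since the factors of each object sum to zero, adding up the satisfied local constraints over all nodes yields the global ordering, so no communication is needed while they hold.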

Uncertain Data • Pk-topk query: returns the k tuples most likely to be in the top-k. • Example: Top-2: {6,5} with prob. {0.64, 0.5} [Figure: example with 6 tuples and 16 possible worlds; probabilities are computed by summing the world probabilities; source: pvldb08]

Pk-topk Query • Solution proposed by [Jin, PVLDB08] • Compact set based • Space-efficient solution • Discard unnecessary tuples and • Apply several compression schemes to compress the data • Disadvantages • Model assumption: the probabilities of the tuples are assumed to be random and independent of each other.

Continuous Top-k Methods - Summary

Presentation Layout • Preliminaries • Continuous skyline queries • Continuous top-k queries • Continuous top-k dominating queries • Summary

Top-k Dominating Query - Example • Skyline: contains all the tuples not dominated by any other tuple. • Disadvantage: high dimensionality problem. • Top-k: given a preference function, a top-k query returns the k tuples with the best scores. • Disadvantage: user-defined preference function. • Top-k dominating: the answer contains the k tuples with the highest domination power. • Combines the advantages of skyline and top-k queries and avoids their disadvantages. [Figure: tuples T1 to T6 plotted by price versus distance, with the results for k=1 and k=2 under F=price+distance.]
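A static sketch shows how the query ranks tuples without a preference function. It takes the domination power of a tuple to be the number of tuples it dominates, assumes that smaller values are better, and uses hypothetical points; ties are broken by input order.

```python
def dominates(t, u):
    """True if t dominates u (assumption: smaller is better)."""
    return all(a <= b for a, b in zip(t, u)) and any(a < b for a, b in zip(t, u))


def top_k_dominating(tuples, k):
    """Return (domination power, tuple) pairs for the k strongest tuples."""
    scored = [(sum(dominates(t, u) for u in tuples), t) for t in tuples]
    # Stable sort: tuples with equal power keep their input order.
    scored.sort(key=lambda pair: -pair[0])
    return scored[:k]


points = [(1, 9), (2, 6), (4, 3), (5, 5), (6, 7), (8, 1)]  # hypothetical
print(top_k_dominating(points, 2))  # [(2, (4, 3)), (1, (2, 6))]
```

The answer always has exactly k tuples, unlike the skyline, and the ranking needs no user-defined scoring function, unlike a top-k query.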

Continuous Top-k Dominating Query • Problem definition: Continuous evaluation of top-k dominating query in multidimensional streaming time series. • Application Example: sensor network • Areas with high probability of fire outbreak • Temperature, humidity and wind speed
