Estimating Rarity and Similarity over Data Stream Windows Paper written by: Mayur Datar, S. Muthukrishnan, Effi Goldstein
Agenda • Introduction • Motivation of windowed data stream algorithms • Define the problems • The impressive Results • Introducing the Algorithmic Tools we’ll use • Algorithm for Estimating rarity and similarity in unbounded data stream model • Algorithm for Estimating rarity and similarity over Windowed data streams
Introduction - motivation • The sliding window model • Often used for “observations” – telecom networks (packets in routers, telephone calls…) • Retrieving information “on the fly”… (e.g. highway control, stock exchange…) • Important restriction - we are only allowed polylogarithmic (in the window size) storage space. • This is very restrictive – consider even the simple problem of maintaining the minimum over the window… • That’s why we settle for a good estimation…
Introduction - motivation • Motivation for rarity and similarity – they extract unique and interesting information from a data stream • Rarity – • estimate the portion of users who are not satisfied… (online stores) • an indication of denial-of-service attacks. • Similarity – • Which items commonly appear together in a market basket. • Similarity of the IP addresses visiting two web sites… • All of these examples are well motivated by commercial uses.
Introduction - the problems • Recall our work space: • the window (of size N) – the last N items of the stream • the set of items - [U] = {1,…,u}. • Rarity - • an item x is a-rare if x appears precisely a times in the set. • #a-rare = no. of such items in the set. • #distinct = no. of distinct items in the set. • a-rarity - ρa = #a-rare / #distinct
Introduction - the problems • Rarity… examples:
S = { 2, 3, 2, 4, 3, 1, 2, 4 }
D(istinct) = {1, 2, 3, 4}
1-rare = {1}, 1-rarity = 1/4
2-rare = {3, 4}, 2-rarity = 1/2
3-rare = {2}, 3-rarity = 1/4
• note that 1-rarity is the fraction of items that do not repeat within the window.
Introduction - the problems • Similarity - here we have two streams A & B • define X(t) and Y(t) to be their sets of distinct items at time t • we use the `Jaccard coefficient` to measure their similarity: s(A, B) = |X(t) ∩ Y(t)| / |X(t) ∪ Y(t)| • similarity… example:
A = {1,2,4,2,5}, B = {2,3,1,3,2,6}
X(t) = {1,2,4,5}, Y(t) = {2,3,1,6}
==> s = |{1,2}| / |{1,2,3,4,5,6}| = 2/6
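The Jaccard coefficient above is easy to compute exactly when both streams fit in memory, which makes it a useful baseline for the sketch-based estimators that follow. A minimal sketch in Python (the function name `jaccard` is my own, not from the paper):

```python
def jaccard(a, b):
    """Jaccard coefficient |X ∩ Y| / |X ∪ Y| of the distinct items of two streams."""
    x, y = set(a), set(b)
    return len(x & y) / len(x | y)

A = [1, 2, 4, 2, 5]
B = [2, 3, 1, 3, 2, 6]
print(jaccard(A, B))  # X={1,2,4,5}, Y={1,2,3,6} -> 2/6
```

This exact computation needs space linear in the number of distinct items; the point of the paper is to approximate it in polylogarithmic space.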
Introduction - how good are the results... • First important result: these are the first known estimations for rarity & similarity in a windowed model! • This is the reason there are no comparison graphs at the end… • The final algorithm uses only • O(log N + log U) space • O(log log N) time per item • And estimates the results r, s within a factor of 1+ε, where ε can be reduced to any required constant.
Algorithmic Tools... • Min-wise hashing • let π be a random permutation over [U], and A ⊆ [U]; • the min-hash value of A under π is hπ(A) = argmin{ π(x) : x ∈ A }, which is the element of A with the smallest index after permuting the set. • The hashing function must be one-to-one on [U], i.e. a permutation.
Algorithmic Tools… - min-hash example • For example, consider the permutations (written as orderings of [5], so the first element listed gets the smallest index):
π1 = (1 2 3 4 5), π2 = (5 4 3 2 1), π3 = (3 4 5 1 2)
and the sets: A = {1,3,4}, B = {2,5}, C = {1,2,4}.
Their min-hash values are as follows:
hπ1(A) = 1, hπ1(B) = 2, hπ1(C) = 1
hπ2(A) = 4, hπ2(B) = 5, hπ2(C) = 4
hπ3(A) = 3, hπ3(B) = 5, hπ3(C) = 4
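The example can be checked mechanically. In this minimal Python sketch (my own, not from the paper), each permutation is written as a rank table mapping every element to its position in the ordering, and the min-hash is the element with the smallest rank:

```python
def min_hash(perm, s):
    """Return the element of s with the smallest rank under the permutation.
    perm maps each element of the universe to its rank (position in the ordering)."""
    return min(s, key=lambda x: perm[x])

# The slide's permutations as rank tables:
# (3 4 5 1 2) means element 3 has rank 1, element 4 rank 2, and so on.
p1 = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
p2 = {5: 1, 4: 2, 3: 3, 2: 4, 1: 5}
p3 = {3: 1, 4: 2, 5: 3, 1: 4, 2: 5}

A, B, C = {1, 3, 4}, {2, 5}, {1, 2, 4}
print(min_hash(p2, A))  # -> 4 (ranks of 1, 3, 4 under p2 are 5, 3, 2)
```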
Algorithmic Tools… - min-hash power... • An important property of min-hash functions: Pr[ hπ(A) = hπ(B) ] = |A ∩ B| / |A ∪ B| — simple to prove… however, it leads to powerful results: • Lemma 1: Let hπ1(A),…,hπk(A) (resp. hπ1(B),…,hπk(B)) be k independent min-hash values for the set A (resp. B). Let Ŝ(A, B) be the fraction of the min-hash values on which they agree; then E[ Ŝ(A, B) ] = |A ∩ B| / |A ∪ B|.
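Lemma 1 suggests a direct estimator: draw k random permutations, take each set's min-hash under each, and report the fraction of agreements. A minimal Python sketch under the assumption that we can afford to shuffle the whole (small) universe, which the paper's ε'-min-wise families avoid (the function name `estimate_jaccard` is mine):

```python
import random

def estimate_jaccard(A, B, k=200, seed=0):
    """Estimate |A ∩ B| / |A ∪ B| as the fraction of k independent
    min-hash values on which A and B agree (Lemma 1)."""
    rng = random.Random(seed)
    universe = list(A | B)
    agree = 0
    for _ in range(k):
        perm = universe[:]
        rng.shuffle(perm)                      # a fresh random permutation
        rank = {x: i for i, x in enumerate(perm)}
        if min(A, key=rank.get) == min(B, key=rank.get):
            agree += 1
    return agree / k

A, B = {1, 2, 4, 5}, {1, 2, 3, 6}
print(estimate_jaccard(A, B))  # close to the true value 2/6
```

With k independent permutations the standard deviation of the estimate shrinks as 1/√k, which is why k is later chosen as a constant depending only on the desired accuracy.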
Algorithmic Tools… - min-hash families... • Thus… we will need a family of independent min-hash functions. • The ideal family of min-hash functions is the set of all permutations over [U]. However, representing an arbitrary permutation requires O(u log u) bits. We can’t afford that. We need to find something else...
Algorithmic Tools… - min-hash families... • Approximate min-hash families, otherwise known as ε`-min-wise-independent hash families. • They have the property that for any A ⊆ [U] and any x ∈ A: Pr[ π(x) = min π(A) ] = (1 ± ε`) / |A| • It has been proven that any function from this family can be represented by only O(log u · log(1/ε`)) bits, and computed in O(log(1/ε`)) time! • The mentioned Lemma 1 still holds for this family! We just need to set the value of k appropriately in terms of ε`, and the expected error grows from εr to εr + ε`.
Algorithmic Tools… - min-hash families... • To conclude: we need only O(log u · log(1/ε`)) bits for storing the hash functions, and O(k) hash evaluations, to obtain the approximation of Lemma 1!
Estimating Rarity - in unbounded window • Recall our goal: estimate ρa at any time t, up to the desired precision. • Define: S - the multiset of items seen so far (the actual data stream). D - the set of distinct items in S. Ra - the set of items that appear exactly a times in S ==> ρa = |Ra| / |D|
Estimating Rarity - in unbounded window • Note 1: Ra ⊆ D, and thus ρa = |Ra| / |D| = |Ra ∩ D| / |Ra ∪ D| • Note 2: hπ(D) ∈ Ra iff the min-hash value of D appears exactly a times in S. ==> Hence, it suffices to maintain min-hash values for D only, as long as we can count their numbers of appearances.
Estimating Rarity - in unbounded window • To summarize: what we want is ρa, which equals |Ra| / |D| by our definition, which equals |Ra ∩ D| / |Ra ∪ D| (Note 1), which in turn is estimated by |{ l | 1 ≤ l ≤ k, hl(Ra) = hl(D) }| / k (Lemma 1), so it suffices to count the min-hash values of D that are a-rare (Note 2). These observations lead to the following algorithm:
Estimating Rarity - in unbounded window • The Algorithm: choose k min-hash functions h1,…,hk (k will be determined later). Maintain:
- hi*(t) - the min-hash value of the stream by time t.
- Ci(t) - a counter of the no. of appearances of the item achieving hi*(t).
Initialize the min-hash values hi* to ∞, and the counters to 0.
When item a(t+1) arrives, for each i:
1) compute hi(a(t+1))
2) if hi(a(t+1)) < hi*(t), set hi*(t+1) = hi(a(t+1)) and Ci(t+1) = 1
3) else if hi(a(t+1)) = hi*(t), set Ci(t+1) = Ci(t) + 1
4) otherwise keep hi*(t+1) = hi*(t) and Ci(t+1) = Ci(t).
Then process the next item a(t+2).
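The update rule above can be sketched compactly. This is a hedged Python illustration, not the paper's code: random linear hashes modulo a large prime stand in for the ε'-min-wise family, and the class and method names are my own:

```python
import random

INF = float("inf")

class UnboundedRarity:
    """Sketch: k min-hash values of the distinct set D, each paired with a
    counter of how many times the minimising item has appeared so far."""

    def __init__(self, k=100, prime=2_147_483_647, seed=0):
        rng = random.Random(seed)
        # Seeds (a, b) for k linear hashes h(x) = (a*x + b) mod prime.
        self.params = [(rng.randrange(1, prime), rng.randrange(prime))
                       for _ in range(k)]
        self.prime = prime
        self.h_star = [INF] * k   # current min-hash value per function
        self.count = [0] * k      # appearances of the minimising item

    def update(self, x):
        for i, (a, b) in enumerate(self.params):
            h = (a * x + b) % self.prime
            if h < self.h_star[i]:
                self.h_star[i], self.count[i] = h, 1
            elif h == self.h_star[i]:
                self.count[i] += 1

    def rarity(self, alpha):
        """Estimate of rho_alpha: fraction of min-hash items seen exactly alpha times."""
        return sum(c == alpha for c in self.count) / len(self.count)
```

Usage: feeding the earlier example stream { 2, 3, 2, 4, 3, 1, 2, 4 } and reading off `rarity(1)` should concentrate around the true 1-rarity of 1/4 as k grows.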
Estimating Rarity - in unbounded window • Now, we merely need to count the Ci(t)’s that equal a, since from Note 2 + our summary we get |{ l | 1 ≤ l ≤ k, hl(Ra^t) = hl(D^t) }| = |{ i | 1 ≤ i ≤ k, Ci(t) = a }| • Space complexity - we need O(k) words for the min-hash values (hi*) and the counters (Ci), and O(k) seeds for the ε`-min-hash functions (hi), each needing O(log u · log(1/ε`)) bits to store. We set k in terms of ε` (the desired accuracy), but in any case k = O(1). Finally, we get space complexity O(log u · log(1/ε`))!
Estimating Rarity - in unbounded window • Time complexity - in each step we need to compute k values of the ε`-min-hash functions, which takes O(k · log(1/ε`)), and also compare and update k values. Since k = O(1), we get time complexity O(log(1/ε`)).
Estimating Similarity - in unbounded window • Our goal: given 2 data streams X & Y, we want to estimate s = |X(t) ∩ Y(t)| / |X(t) ∪ Y(t)| • which, by Lemma 1, is estimated by |{ l | 1 ≤ l ≤ k, hl(Xt) = hl(Yt) }| / k. • we actually use an easier version of the rarity algorithm - now we only need to compare the hi*(t) that X & Y produce at time t: when item at arrives, we compute hi(at) and set hi*(t) = min{ hi(at), hi*(t-1) } • space and time complexity are as before.
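Since no counters are needed, the similarity estimator only tracks one running minimum per hash function per stream. A hedged Python sketch (again with linear hashes as a stand-in for the ε'-min-wise family, and class/method names of my own):

```python
import random

class UnboundedSimilarity:
    """Sketch: estimate the Jaccard similarity of two unbounded streams by
    comparing, per hash function, the smallest hash each stream produced."""

    def __init__(self, k=100, prime=2_147_483_647, seed=0):
        rng = random.Random(seed)
        self.params = [(rng.randrange(1, prime), rng.randrange(prime))
                       for _ in range(k)]
        self.prime = prime
        self.min_x = [float("inf")] * k
        self.min_y = [float("inf")] * k

    def _update(self, mins, item):
        # hi*(t) = min{ hi(item), hi*(t-1) } for each hash function i.
        for i, (a, b) in enumerate(self.params):
            h = (a * item + b) % self.prime
            if h < mins[i]:
                mins[i] = h

    def update_x(self, item):
        self._update(self.min_x, item)

    def update_y(self, item):
        self._update(self.min_y, item)

    def similarity(self):
        """Fraction of hash functions on which the two minima agree (Lemma 1)."""
        k = len(self.params)
        return sum(x == y for x, y in zip(self.min_x, self.min_y)) / k
```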
Estimating Similarity - in window data streams • We now consider the windowed model • We want to use an approach similar to the unbounded case, but maintaining a min-hash value here is difficult: the current minimum may expire and leave the window, and we cannot recover the next-smallest value without extra information. • Instead, we keep a list of possible min-hash values (and prove later that it is short enough…): • we use a “domination” property of min-hash functions:
Estimating Similarity - in window data streams • some definitions first – • an “active” item is an item that still ‘lives’ within the window boundary. • An active item a2 “dominates” an active item a1 if it arrived later in the window but hi(a2) < hi(a1) (it has a smaller hash value). Notice that a “dominated” item can never become the min-hash value of hi within the window, since there is always a ‘preferred’ item that outlives it... • “dominance” property example:
Estimating Similarity - in window data streams [Figure: example stream with window size N = 5, showing (arrival time, hash value) pairs and the dominating item] • Thus, the plan is to build a linked list of “dominant” min-hash values! • The list elements consist of the pairs (hi(aj), j) for active items aj at time t. • the list accepts only elements that satisfy the “dominance” property: j1 < j2 < … < jl and hi(aj1) < … < hi(ajl)
Estimating Similarity - in window data streams • Note that now hi*(t) = hi(aj1)! (hi*(t), the min-hash value in the window, is the head of the list) • The algorithm for maintaining the list Li:
- when item a(t+1) arrives, we compute hi(a(t+1)).
- delete all items in the list that have a bigger hash value (they are all being dominated…)
- if hi(a(t+1)) equals the last hash value on the list, just update that pair with the new arrival time.
- else, append the pair (hi(a(t+1)), t+1) to the end of the list.
- check whether the first item on the list has ‘expired’. If it has - delete it (it is no longer ‘active’).
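The maintenance steps above can be sketched as a single update function. This is a hedged Python illustration of the list maintenance, assuming one item arrives per time step and `h` is the item's hash value (the function name is mine, not the paper's):

```python
from collections import deque

def update_dominance_list(lst, h, t, N):
    """Maintain the deque of (hash value, arrival time) pairs that can still
    become the window minimum; both fields stay strictly increasing."""
    # The new arrival dominates every listed item with a larger hash.
    while lst and lst[-1][0] > h:
        lst.pop()
    if lst and lst[-1][0] == h:
        lst[-1] = (h, t)          # same hash: keep the later arrival time
    else:
        lst.append((h, t))
    # Expire the head if it has fallen out of the window of size N
    # (at most one entry can expire per step when called every step).
    if lst and lst[0][1] <= t - N:
        lst.popleft()
    return lst
```

The head of the list is always the window's min-hash value, which is exactly what the similarity estimator compares across the two streams.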
Estimating Similarity - in window data streams [Figure: min-hash list example, showing the (arrival time, hash value) pairs kept in the list as the window slides] • We only have to make sure the list Li isn’t too long. We use...
Estimating Similarity - in window data streams • Lemma 2 - with high probability, the length of Li is |Li| = Θ(HN), where HN is the Nth harmonic number (1 + 1/2 + 1/3 + … + 1/N), which is O(log N). • Since we now know the min-hash value hi* in the window (the first item on the list), we follow the logic we used in the unbounded stream – • We saw that (Lemma 1) – s ≈ |{ l | 1 ≤ l ≤ k, hl*(X) = hl*(Y) }| / k • So we just compare the min-hash values of the min-hash family for both streams X & Y.
Estimating Similarity - in window data streams • Space complexity – we use O(k) hash functions; for each one we keep a linked list of size O(log N), with elements of size O(log u) each. Overall, we use space complexity O((log N)(log u)). • Time complexity – when updating the list Li, we need to find the appropriate place for the new item. Since the list is sorted by hash value, a binary search suffices. We get O(log |Li|) = O(log log N).
Estimating Rarity - in window data streams • We use a concept similar to the one we used earlier: • we still want to keep a linked list of “dominant” min-hash values • But since now we need to find a instances of an item, we keep several arrival times per item. • So now, each entry is the pair (hi(aj), Tj), where Tj is an ordered list of the latest a arrival times of the item • So the list now looks like: [Figure: the list of (hash value, arrival-time list) pairs]
Estimating Rarity - in window data streams • Note that here we store a list of a arrival instances per item, while previously we stored only the latest arrival time of each item – which is the largest value in that list. • The algorithm for maintaining the list Li resembles the one before:
- when item a(t+1) arrives, we compute hi(a(t+1)).
- delete all items in the list that have a bigger hash value (they are all being “dominated”…)
- if hi(a(t+1)) equals the last hash value on the list, append t+1 to that item’s arrival list. If the arrival list now has more than a entries – delete the first (oldest) one.
Estimating Rarity - in window data streams - else, append the pair (hi(a(t+1)), {t+1}) to the end of the list, where the arrival list here is a singleton.
- check whether the first arrival time of the first item on the list has ‘expired’. If it has - delete it (it is no longer ‘active’). • By Lemma 2, the total length of the list here is O(a·log N): we keep up to a arrival times for each of the O(log N) items.
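The windowed-rarity list update can be sketched analogously to the similarity case. A hedged Python illustration (function name and the exact expiry handling are my own reading of the steps above, not the paper's code): each entry pairs a hash value with a deque of the latest `alpha` arrival times.

```python
from collections import deque

def update_rarity_list(lst, h, t, alpha, N):
    """Maintain the deque of (hash value, arrival-time deque) entries for
    the windowed rarity estimator; hash values stay strictly increasing."""
    # A later arrival with a smaller hash dominates every listed larger hash.
    while lst and lst[-1][0] > h:
        lst.pop()
    if lst and lst[-1][0] == h:
        times = lst[-1][1]
        times.append(t)
        if len(times) > alpha:
            times.popleft()           # keep only the latest alpha arrivals
    else:
        lst.append((h, deque([t])))
    # Drop expired arrival times at the head; drop the entry if none remain.
    if lst:
        head_times = lst[0][1]
        while head_times and head_times[0] <= t - N:
            head_times.popleft()
        if not head_times:
            lst.popleft()
    return lst
```

The rarity estimate then counts, across the k hash functions, the heads whose arrival-time list holds exactly a in-window arrivals, and divides by k.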
Estimating Rarity - in window data streams • And the same logic holds… since we get ρa = |Ra ∩ D| / |Ra ∪ D|, from Lemma 1 we get ρa ≈ |{ l | 1 ≤ l ≤ k, hl(Ra) = hl(D) }| / k, and from Note 2, hl(D) ∈ Ra iff the min-hash value of D appears exactly a times in the window. • Thus… we only have to count the min-hash values hi* (= hi(aj1)) whose arrival-time list is exactly a long!!
Estimating Rarity - in window data streams • Space complexity – we use O(k) hash functions; for each one we keep a linked list of total size O(a·log N), with elements of size O(log u) each. Since a and k are constants, we use space complexity O((log N)(log u)). • Time complexity – updating the list Li costs exactly as in the similarity case. We get time complexity O(log log N).
Concluding remarks • The algorithms presented here are (the authors claim) the first solutions for the windowed rarity and similarity problems. • Citation from the article – “We expect our technique to find applications in practice…”