Estimating Rarity and Similarity over Data stream Windows Paper written by: Mayur Datar S. Muthukrishnan Effi Goldstein
Agenda
• Introduction
• Motivation of windowed data stream algorithms
• Define the problems
• The impressive results
• Introducing the algorithmic tools we’ll use
• Algorithm for estimating rarity and similarity in the unbounded data stream model
• Algorithm for estimating rarity and similarity over windowed data streams
Introduction - motivation
• The sliding window model
• Often used for “observations”: telecom networks (packets in routers, telephone calls…)
• Retrieving information “on the fly” (e.g. highway control, stock exchange…)
• Important restriction: we are only allowed storage space that is polylogarithmic in the window size.
• This is very difficult: consider the problem of calculating the minimum of the window…
• That’s why we settle for a good estimate.
Introduction - motivation
• Motivation for rarity and similarity: they extract unique and interesting information from a data stream
• Rarity:
  • estimate the portion of users who are not satisfied (online stores)
  • an indication of a denial-of-service attack
• Similarity:
  • which items commonly appear together in a market basket
  • similarity between the IP addresses seen at two web sites
• All of these examples are well motivated by commercial uses.
Introduction - the problems
• Recall our work space:
  • the window (of size N): the N most recent items of the stream, a(t-N+1), …, a(t)
  • the set of items: [U] = {1,…,u}
• Rarity:
  • an item x is a-rare if x appears precisely a times in the window
  • #a-rare = no. of such items in the window
  • #distinct = no. of distinct items in the window
  • a-rarity: ra = #a-rare / #distinct
Introduction - the problems
• Rarity example:
  S = {2, 3, 2, 4, 3, 1, 2, 4}
  D(istinct) = {1, 2, 3, 4}
  1-rare = {1}      1-rarity = 1/4
  2-rare = {3, 4}   2-rarity = 1/2
  3-rare = {2}      3-rarity = 1/4
• Note that 1-rarity is the fraction of items that do not repeat within the window.
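The definition can be checked directly on the example window. The following is a minimal sketch of the exact (non-streaming) computation; the function name `a_rarity` is chosen here for illustration and the window is assumed to be non-empty.

```python
from collections import Counter
from fractions import Fraction

def a_rarity(window, a):
    """Exact a-rarity: (#items appearing exactly a times) / (#distinct items)."""
    counts = Counter(window)  # item -> number of appearances in the window
    a_rare = sum(1 for c in counts.values() if c == a)
    return Fraction(a_rare, len(counts))

S = [2, 3, 2, 4, 3, 1, 2, 4]
for a in (1, 2, 3):
    print(f"{a}-rarity =", a_rarity(S, a))
```

This exact version stores a counter per distinct item, which is precisely the space cost the windowed algorithms are designed to avoid.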
Introduction - the problems
• Similarity: here we have two streams A & B
• define X(t) and Y(t) to be the sets of distinct items of A and B at time t
• we use the `Jaccard coefficient` to measure their similarity: s(t) = |X(t) ∩ Y(t)| / |X(t) ∪ Y(t)|
• Similarity example:
  A = {1,2,4,2,5}   B = {2,3,1,3,2,6}
  X(t) = {1,2,4,5}   Y(t) = {2,3,1,6}
  --> s(t) = |{1,2}| / |{1,2,3,4,5,6}| = 2/6
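As a quick check of the example, here is a small sketch of the exact Jaccard coefficient of two item sequences. The function name `jaccard` is an illustrative choice, and at least one of the inputs is assumed to be non-empty.

```python
from fractions import Fraction

def jaccard(a_items, b_items):
    """Exact Jaccard coefficient of the distinct items of two sequences."""
    x, y = set(a_items), set(b_items)
    return Fraction(len(x & y), len(x | y))

A = [1, 2, 4, 2, 5]
B = [2, 3, 1, 3, 2, 6]
print(jaccard(A, B))  # Fraction reduces 2/6 to lowest terms
```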
Introduction - how good are the results...
• The first important result: no other estimation algorithm for rarity & similarity in a windowed model is known!
• This is the reason there are no graphs at the end…
• The final algorithm uses only
  • O(log N + log U) space
  • O(log log N) time
• It estimates the results r, s to within a factor of 1+e, where e can be reduced to any required constant.
Algorithmic Tools...
• Min-wise hashing
• let p be a random permutation over [U], and let A ⊆ [U]
• the min-hash value of A under p is hp(A) = min{ p(x) | x ∈ A }, which is actually the element with the smallest index after permuting the subset.
• The hash function should be one-to-one on the set [U], i.e. a permutation.
Algorithmic Tools… - min-hash example
• For example, consider the hash functions:
  p1 = (1 2 3 4 5) = x mod 5
  p2 = (5 4 3 2 1)
  p3 = (3 4 5 1 2)
  p4(x) = 2x+1 mod 5 = p2
  and the sets:
  A = {1,3,4}   B = {2,5}   C = {1,2,4}
• Their min-hash values are as follows:
  hp1(A) = 1   hp1(B) = 2   hp1(C) = 1
  hp2(A) = 4   hp2(B) = 5   hp2(C) = 4
  hp3(A) = 3   hp3(B) = 5   hp3(C) = 4
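The table of min-hash values can be reproduced with a few lines of code. This sketch assumes that each permutation is read as an ordering of [U] and that the min-hash value of a set is its first member in that ordering (an interpretation chosen here because it reproduces the values on the slide); the function name `min_hash` is illustrative.

```python
def min_hash(order, subset):
    """Return the member of `subset` that comes first in the ordering `order`."""
    for x in order:
        if x in subset:
            return x
    raise ValueError("subset must contain at least one element of the ordering")

p1 = (1, 2, 3, 4, 5)
p2 = (5, 4, 3, 2, 1)
p3 = (3, 4, 5, 1, 2)
A, B, C = {1, 3, 4}, {2, 5}, {1, 2, 4}

for name, p in (("p1", p1), ("p2", p2), ("p3", p3)):
    print(name, [min_hash(p, s) for s in (A, B, C)])
```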
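Algorithmic Tools… - min-hash power...
• An important property of min-hash functions: Pr[ hp(A) = hp(B) ] = |A ∩ B| / |A ∪ B|. It is simple to prove, yet it leads to powerful results:
• Lemma 1: Let h1(A), …, hk(A) (h1(B), …, hk(B)) be k independent min-hash values for the set A (B). Let S`(A, B) be the fraction of the min-hash values on which they agree: S`(A, B) = |{ l | 1 ≤ l ≤ k, hl(A) = hl(B) }| / k. Then, for k large enough, S`(A, B) approximates the similarity of A and B.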
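The standard min-hash collision property (two sets share a min-hash value with probability equal to their Jaccard coefficient) can be verified exhaustively on a small universe. The sketch below enumerates all permutations of {1,…,5}, so it is only feasible for tiny u; the function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

def min_hash(order, subset):
    """First member of `subset` in the ordering `order`."""
    return next(x for x in order if x in subset)

def agreement_probability(universe, a, b):
    """Fraction of all permutations under which a and b share a min-hash value."""
    perms = list(permutations(universe))
    agree = sum(1 for p in perms if min_hash(p, a) == min_hash(p, b))
    return Fraction(agree, len(perms))

A, C = {1, 3, 4}, {1, 2, 4}
print(agreement_probability(range(1, 6), A, C))  # compare with |A ∩ C| / |A ∪ C|
```

Averaging the agreement over k sampled permutations instead of all u! of them gives the estimator S`(A, B) of Lemma 1.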
Algorithmic Tools… - min-hash families...
• Thus, we need to find a set of independent min-hash functions.
• The ideal family of min-hash functions is the set of all permutations over [U]. However, representing an arbitrary permutation requires O(u log u) bits. We can’t afford that, so we need to find something else...
Algorithmic Tools… - min-hash families...
• Approximate min-hash family, otherwise known as an e`-min-wise independent hash family.
• Its functions have the property that for any X ⊆ [U] and any x ∈ X we get Pr[ h(x) = min h(X) ] = (1 ± e`) / |X|
• It has been proven that any function from this family can be represented with only O(log u log(1/e`)) bits, and computed in O(log(1/e`)) time!
• Lemma 1 still holds for this family! We just need to set the value of k appropriately in terms of e`, and change the expected error from er to er+e`.
Algorithmic Tools… - min-hash families...
• To conclude, we only need O(log u log(1/e`)) bits to store each hash function, and O(k) hash functions, to get the approximation promised by the lemma!
Estimating Rarity - in unbounded window
• Recall our goal: find ra(t), up to precision p, at any time t.
• Define:
  S - multiset, the actual data stream.
  D - set of distinct items from S.
  Ra - set of items that appear exactly a times in S.
  ==> ra = |Ra| / |D|
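Estimating Rarity - in unbounded window
• Note 1: Ra ⊆ D, and thus Ra ∩ D = Ra and Ra ∪ D = D, so ra = |Ra| / |D| = |Ra ∩ D| / |Ra ∪ D|. In other words, the rarity is the similarity of Ra and D.
• Note 2: h(Ra) = h(D) iff the item attaining the min-hash value of D appears exactly a times in S.
  ==> Hence, it suffices to maintain min-hash values for D only, as long as we can count the no. of appearances.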
Estimating Rarity - in unbounded window
• To summarize: what we want is ra, which equals |Ra| / |D| by our definition, which equals |Ra ∩ D| / |Ra ∪ D| (Note 1), which in turn is estimated by |{ l | 1 ≤ l ≤ k, hl(Ra) = hl(D) }| / k (Lemma 1), for which it suffices to count the min-hash values of D that are a-rare (Note 2). These observations lead to the following algorithm:
Estimating Rarity - in unbounded window
• The algorithm: choose k min-hash functions h1, …, hk (k will be determined later).
  Maintain:
  - hi*(t) = min{ hi(a(j)) | j ≤ t }, which is the min-hash value of the stream up to time t.
  - Ci(t), a counter of the no. of appearances of the item attaining hi*(t).
  Initialize the min-hash values (hi*) to ∞, and the counters to 0.
  When item a(t+1) arrives, for each i:
  1) compute hi(t+1) = hi(a(t+1))
  2) if hi(t+1) < hi*(t), set hi*(t+1) = hi(t+1), Ci(t+1) = 1
  3) if hi(t+1) = hi*(t), set hi*(t+1) = hi*(t), Ci(t+1) = Ci(t) + 1
  4) otherwise, set hi*(t+1) = hi*(t), Ci(t+1) = Ci(t)
  Then process the next item a(t+2).
Estimating Rarity - in unbounded window
• Now we merely need to count the counters Ci(t) that equal a and divide by k, since from Note 2 and our summary we get { l | 1 ≤ l ≤ k, hl(Ra(t)) = hl(D(t)) } = { l | 1 ≤ l ≤ k, Cl(t) = a }
• Space complexity: we need O(k) space for the min-hash values (hi*) and the counters (Ci), and O(k) seeds for the e`-min-hash functions (hi), each of which needs O(log u log(1/e`)) bits to store. We set k in terms of e` (the desired accuracy), but in any case k = O(1). Finally, we get space complexity O(log u log(1/e`))!
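The update and estimation steps for the unbounded stream can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the hash functions are random linear maps modulo a prime (a simple one-to-one stand-in, not a true e`-min-wise independent family), items are assumed to be integers in [0, P), and the names `RarityEstimator`, `update` and `estimate` are chosen here.

```python
import random

P = 2_147_483_647  # prime 2^31 - 1; assumed to exceed the universe size u

class RarityEstimator:
    """Keeps k (min-hash value, counter) pairs over an unbounded stream."""

    def __init__(self, k, a, seed=0):
        rng = random.Random(seed)
        self.a = a
        # Stand-in hash functions h(x) = (m*x + c) mod P with m != 0 (one-to-one).
        self.coeffs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]
        self.min_hash = [float("inf")] * k  # hi*, initialised to infinity
        self.count = [0] * k                # Ci, initialised to 0

    def update(self, item):
        for i, (m, c) in enumerate(self.coeffs):
            h = (m * item + c) % P
            if h < self.min_hash[i]:        # new minimum: reset the counter
                self.min_hash[i] = h
                self.count[i] = 1
            elif h == self.min_hash[i]:     # same item as the current minimum
                self.count[i] += 1
            # otherwise keep hi* and Ci unchanged

    def estimate(self):
        """Fraction of the k counters that equal a."""
        return sum(1 for c in self.count if c == self.a) / len(self.count)

est = RarityEstimator(k=64, a=2)
for x in [1, 2, 3, 1, 2, 3]:
    est.update(x)
print(est.estimate())
```

In this stream every item appears exactly twice, so whichever item attains a given minimum, its counter is 2.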
Estimating Rarity - in unbounded window
• Time complexity: in each step we need to compute k values of the e`-min-hash functions, which takes O(k log(1/e`)) time, and to compare and update k values. Since k = O(1), we get time complexity O(log(1/e`)).
Estimating Similarity - in unbounded window
• Our goal: given 2 data streams X & Y, we want to estimate |Xt ∩ Yt| / |Xt ∪ Yt|
• which, by Lemma 1, is approximated by |{ l | 1 ≤ l ≤ k, hl(Xt) = hl(Yt) }| / k.
• We actually use an easier version of the rarity algorithm, since now we only need to compare the hi*(t) that X & Y produced at time t: when item a(t) arrives, we compute hi(a(t)) and set hi*(t) = min{ hi(a(t)), hi*(t-1) }
• Space and time complexity are as before.
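A minimal sketch of this similarity estimator follows. As an assumption for illustration, the hash functions are random linear maps modulo a prime rather than a true e`-min-wise independent family, items are integers in [0, P), and both streams must be sketched with the same seed so that they share hash functions; all names are hypothetical.

```python
import random

P = 2_147_483_647  # prime 2^31 - 1; assumed to exceed the universe size u

class MinHashSketch:
    """k running min-hash values of one unbounded stream."""

    def __init__(self, k, seed=0):
        rng = random.Random(seed)
        # Stand-in hash functions h(x) = (m*x + c) mod P with m != 0.
        self.coeffs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]
        self.min_hash = [float("inf")] * k

    def update(self, item):
        for i, (m, c) in enumerate(self.coeffs):
            self.min_hash[i] = min(self.min_hash[i], (m * item + c) % P)

def estimate_similarity(sx, sy):
    """Fraction of hash functions on which the two sketches agree."""
    agree = sum(1 for hx, hy in zip(sx.min_hash, sy.min_hash) if hx == hy)
    return agree / len(sx.min_hash)

sx, sy = MinHashSketch(k=128, seed=42), MinHashSketch(k=128, seed=42)
for x in [1, 2, 4, 2, 5]:
    sx.update(x)
for y in [2, 3, 1, 3, 2, 6]:
    sy.update(y)
print(estimate_similarity(sx, sy))  # an estimate of the Jaccard coefficient 2/6
```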
Estimating Similarity - in window data streams
• We now consider the window of the N most recent items.
• We want to use an approach similar to the one for the unbounded window, but maintaining a single min-hash value here is difficult: when the current minimum expires, we must already know the next one.
• Instead, we keep a list of possible min-hash values (and prove later that it is short enough…).
• We use a “domination” property of min-hash functions:
Estimating Similarity - in window data streams
• Some definitions first:
• An “active” item is an item that still ‘lives’ within the window boundary.
• An active item a2 “dominates” an active item a1 if it arrived later in the window but has a smaller hash value, hi(a2) < hi(a1). Notice that a “dominated” item will never be the min-hash value of hi while it is in the window, since there is always a ‘preferred’ item that outlives it...
• “Dominance” property example:
Estimating Similarity - in window data streams
[Figure: “dominance” example with window size N=5, given as (arrival time, hash value) pairs: (10, 20), (11, 12), (12, 75), (13, 26), (14, 23), (15, 20), (16, 70); label: dominating item]
• Thus, the plan is to build a linked list of “dominant” min-hash values!
• The list elements are the pairs ( hi(aj), j ) for active items aj at time t.
• The list accepts only elements that satisfy the “dominating” property: j1 < j2 < … < jl & hi(aj1) < … < hi(ajl)
Estimating Similarity - in window data streams
• Note that now hi*(t) = hi(aj1)! (hi*(t) is the min-hash value in the window)
• The algorithm for maintaining the list Li(t):
  - when item a(t+1) arrives, we compute hi(a(t+1)).
  - delete all items in the list that have a bigger hash value (they are all dominated…)
  - if hi(a(t+1)) equals the last hash value on the list, just update that pair with the new arrival time.
  - else, append the pair ( hi(a(t+1)), t+1 ) to the end of the list.
  - check whether the first item on the list has ‘expired’. If it has, delete it (it is no longer ‘active’).
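The list maintenance steps can be sketched for a single hash function as below. This is an illustrative sketch, not the authors' code: a `deque` stands in for the linked list, the hash function is passed in by the caller, time is counted from 1, and the names `WindowMinHash`, `update` and `min_hash` are hypothetical.

```python
from collections import deque

class WindowMinHash:
    """Dominance list for one hash function over a sliding window of size N."""

    def __init__(self, window_size, hash_fn):
        self.n = window_size
        self.hash_fn = hash_fn
        self.entries = deque()  # (hash value, latest arrival time), both increasing
        self.t = 0

    def update(self, item):
        self.t += 1
        h = self.hash_fn(item)
        # Entries with a bigger hash value are dominated by the new arrival.
        while self.entries and self.entries[-1][0] > h:
            self.entries.pop()
        if self.entries and self.entries[-1][0] == h:
            self.entries[-1] = (h, self.t)    # same hash value: refresh arrival time
        else:
            self.entries.append((h, self.t))
        # Drop the head if it has left the window (the tail is never expired).
        while self.entries[0][1] <= self.t - self.n:
            self.entries.popleft()

    def min_hash(self):
        """Min-hash value of the current window (valid after the first update)."""
        return self.entries[0][0]

w = WindowMinHash(window_size=5, hash_fn=lambda x: x)  # identity hash for readability
for value in [20, 12, 75, 26, 23, 20, 15, 29, 40, 45, 32]:
    w.update(value)
    print(w.t, w.min_hash(), list(w.entries))
```

For windowed similarity, one such list would be kept per hash function and per stream, and the heads of corresponding lists compared as in the unbounded case.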
Estimating Similarity - in window data streams
[Figure: min-hash list example over the (arrival time, hash value) pairs (10, 20), (11, 12), (12, 75), (13, 26), (14, 23), (15, 20), (16, 15), (17, 29), (18, 40), (19, 45), (20, 32), with the resulting min-hash list]
• We only have to make sure the list Li isn’t too long. We use...
Estimating Similarity - in window data streams
• Lemma 2: with high probability, the length of the list Li(t) is |Li(t)| = Θ(HN), where HN is the Nth harmonic number (1+1/2+1/3+…+1/N), which is O(log N).
• We now know the min-hash value hi* in the window (the first item on the list, hi(aj1)), so we follow the logic we used for the unbounded stream:
• We saw that the similarity is approximated by |{ l | 1 ≤ l ≤ k, hl(Xt) = hl(Yt) }| / k (Lemma 1)
• So we just compare the min-hash values, over the min-hash family, for both streams X & Y.
Estimating Similarity - in window data streams
• Space complexity: we use O(k) hash functions, and for each one we keep a linked list of size O(log N), whose elements have size O(log u) each. Overall, the space complexity is O((log N)(log u)).
• Time complexity: when updating the list Li(t), we need to search for the appropriate place to insert the new item. Since the list is ordered by hash value, a binary search suffices, and we get O(log |Li(t)|) = O(log log N).
Estimating Rarity - in window data streams
• We use a concept similar to the one we used earlier:
• We still want to keep a linked list of “dominant” min-hash values.
• But since we now need to find a instances of an item, we keep several arrival times of the item.
• So now each entry is a pair ( hi(aj), Tj ), where Tj is an ordered list of the latest a arrival times of the item aj.
• So the list now looks like: Li(t) = { ( hi(aj1), Tj1 ), …, ( hi(ajl), Tjl ) }
Estimating Rarity - in window data streams
• Note that here we store a list of up to a arrival times per item, while previously we stored only the latest arrival time of each item in the list, which is the largest value in this arrival list.
• The algorithm for maintaining the list Li(t) resembles the one before:
  - when item a(t+1) arrives, we compute hi(a(t+1)).
  - delete all items in the list that have a bigger hash value (they are all “dominated”…)
  - if hi(a(t+1)) equals the last hash value on the list, append t+1 to that entry’s arrival list. If the arrival list now has more than a items, delete the first one.
Estimating Rarity - in window data streams
  - else, append the pair ( hi(a(t+1)), {t+1} ) to the end of the list, where the arrival list is a singleton.
  - check whether the first arrival time of the first item on the list has ‘expired’. If it has, delete it (it is no longer ‘active’).
• By Lemma 2, the list length here is O(a log N) (we keep up to a arrival times for each item).
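The rarity variant of the list can be sketched for one hash function as follows. This is an illustration under stated assumptions, not the authors' code: a `deque` stands in for the linked list, the hash function is assumed one-to-one and is supplied by the caller, and, as a deliberate deviation from the slides, each entry keeps up to a+1 arrival times so that "exactly a" can be told apart from "more than a". All names are hypothetical.

```python
from collections import deque

class WindowRarityList:
    """Dominance list with per-item arrival times, for one hash function."""

    def __init__(self, window_size, a, hash_fn):
        self.n = window_size
        self.a = a
        self.hash_fn = hash_fn  # assumed one-to-one on the item universe
        self.entries = deque()  # (hash value, deque of recent arrival times)
        self.t = 0

    def update(self, item):
        self.t += 1
        h = self.hash_fn(item)
        # Entries with a bigger hash value are dominated by the new arrival.
        while self.entries and self.entries[-1][0] > h:
            self.entries.pop()
        if self.entries and self.entries[-1][0] == h:
            times = self.entries[-1][1]
            times.append(self.t)
            if len(times) > self.a + 1:   # assumption: keep a+1 times, not a
                times.popleft()
        else:
            self.entries.append((h, deque([self.t])))
        # Expire old arrival times of the head; drop the head once it is empty.
        while self.entries:
            times = self.entries[0][1]
            while times and times[0] <= self.t - self.n:
                times.popleft()
            if times:
                break
            self.entries.popleft()

    def min_is_a_rare(self):
        """True iff the window's min-hash item appears exactly a times in the window."""
        return len(self.entries[0][1]) == self.a

r = WindowRarityList(window_size=4, a=2, hash_fn=lambda x: x)  # identity hash
for item in [5, 3, 3, 7, 3, 3, 9, 3, 1, 1, 4, 1]:
    r.update(item)
    print(r.t, r.min_is_a_rare())
```

With k such lists, one per hash function, the a-rarity estimate would be the fraction of lists for which `min_is_a_rare()` holds.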
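Estimating Rarity - in window data streams
• And the same logic holds: since Ra ⊆ D we get ra = |Ra| / |D| = |Ra ∩ D| / |Ra ∪ D|; from Lemma 1 we get that this is approximated by |{ l | 1 ≤ l ≤ k, hl(Ra) = hl(D) }| / k; from Note 2 we get hl(Ra) = hl(D) iff the item attaining the min-hash value of D appears exactly a times in the window.
• Thus we only have to count the min-hash values hi* (= hi(aj1)) whose arrival-time list has length a!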
Estimating Rarity - in window data streams
• Space complexity: we use O(k) hash functions, and for each one we keep a linked list of size O(a log N), whose elements have size O(log u) each. Overall, treating a as a constant, the space complexity is O((log N)(log u)).
• Time complexity: updating the list Li(t) costs exactly the same as updating the similarity list. We get time complexity O(log log N).
Concluding remarks
• According to the authors, the algorithms presented here are the first solutions for the windowed rarity and similarity problems.
• A quote from the paper: “We expect our technique to find applications in practice…”