What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically
By Graham Cormode & S. Muthukrishnan, Rutgers University, Piscataway NY
Presented by Tal Sterenzy
Motivation • A basic statistic on database relations is which items are hot – occur frequently • Goal: dynamically maintain the hot items in the presence of insert and delete transactions • Examples: • DBMS – keep statistics to improve performance • Telecommunication networks – network connections start and end over time
Overview • Definitions • Prior work • Algorithm description & analysis • Experimental results • Summary
Formal definition • Sequence of n transactions on m items [1…m] • n_i(t) – net occurrence of item i at time t: the number of times it has been inserted minus the number of times it has been deleted • f_i(t) = n_i(t) / Σ_j n_j(t) – current frequency of item i at time t • f*(t) = max_i f_i(t) – frequency of the most frequent item at time t • The k most frequent items at time t are those with the k largest values of n_i(t)
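To make the definitions concrete, here is a minimal Python sketch (my own illustration, not from the paper) that maintains the net occurrences n_i(t) over a stream of insert/delete transactions and reports the current frequencies:

from collections import defaultdict

net = defaultdict(int)   # net[i] = n_i(t): inserts minus deletes of item i

def process(item, transtype):
    # transtype is 'ins' or 'del'; the integrity constraint n_i(t) >= 0
    # means we never delete an item that is not currently present
    net[item] += 1 if transtype == 'ins' else -1

for item, op in [(3, 'ins'), (3, 'ins'), (7, 'ins'), (3, 'del'), (7, 'ins')]:
    process(item, op)

n = sum(net.values())                       # n(t): current total count
freq = {i: c / n for i, c in net.items()}   # f_i(t)
print(freq)   # ~{3: 0.33, 7: 0.67}; with k=1, item 7 is hot (f > 1/2)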
Finding k hot items • k is a parameter • Item i is a hot item if f_i(t) > 1/(k+1), i.e. n_i(t) > n(t)/(k+1) • Hot items are frequent items that account for a significant fraction of the entire dataset • There can be at most k hot items, and there can be none • Assume the basic integrity constraint n_i(t) ≥ 0: an item cannot be deleted more times than it has been inserted
Our algorithm • A highly efficient, randomized algorithm for maintaining hot items in a dynamically changing database • Monitors the changes to the data distribution and maintains O(k log(k/δ) log m) counters • When queried, we can find all hot items in time O(k log(k/δ) log m) with probability 1-δ • No need to scan the underlying relation
Small tail assumption • Restriction: let f_1 ≥ f_2 ≥ … ≥ f_m be the frequencies of the items in sorted order • A set of frequencies has a small tail if f_{k+1} + f_{k+2} + … + f_m < 1/(k+1) • If there are k hot items, then the small tail property holds • If the small tail property holds, some of the top k items might still not be hot • We shall analyze our solution in the presence and absence of this small tail property (STP)
Prior work – why is it not adaptable? • All these algorithms hold counters: • incremented when the item is observed • decremented or reallocated under certain circumstances • These algorithms can’t be directly adapted to handle insertions and deletions: • after deletions, the state of the algorithm differs from the state it would have reached had the deleted items never been inserted • Work on dynamic data is sparse, and provides no guarantees for the fully dynamic case with deletions
Our algorithm – idea • Do not keep counters for individual items, but rather for subsets of items • Ideas from group testing: • design a number of tests, each of which groups together some of the m items, in order to find up to k items which test positive • Here: the items that test positive are the hot items • Minimize the number of tests, where each group consists of a subset of items
General procedure • For each transaction on item i, determine which subsets it is included in: S(i) • Each subset has a counter: • for an insertion: increment all the counters of S(i) • for a deletion: decrement all the counters of S(i) • The test is: does the counter exceed a threshold? • Identifying the hot items is done by combining test results from several groups
The challenge is choosing the subsets • Bounding the number of required subsets • Finding a concise representation of the groups • Giving an efficient way to go from the results of the tests to the set of hot items • Let’s start with a simple case: k=1 (freq > 1/2) – a deterministic algorithm for maintaining the majority item
Finding the majority item • For insertions only: constant time and space per bit • Keep log m + 1 counters: • one counter c[0] of the number of items “alive” • the rest are labeled c[1] … c[log m], one per group • Group j contains the items whose jth bit is 1 in their binary representation • Each group consists of half of the items
Finding the majority item – cont. • bit(i,j) – reports the value of the jth bit of the binary representation of i • gt(i,j) – returns 1 if i > j, 0 otherwise • Scheme: • Insertion of item i: increment each counter c[j] such that bit(i,j) = 1, in time O(log m) • Deletion of item i: decrement each counter c[j] such that bit(i,j) = 1, in time O(log m) • Query: if there is a majority item, it is given by Σ_{j=1..log m} 2^(j-1) · gt(c[j], c[0]/2), computed in time O(log m)
Finding the majority item – cont. • Theorem: the algorithm finds the majority item, if there is one, with time O(log m) per operation • The state of the data structure is equivalent whether there were I insertions and D deletions or just c = I - D insertions • In the insertions-only case the majority is always found, so it is found in the dynamic case as well
UpdateCounters procedure

int c[0 … log m]

UpdateCounters(i, transtype, c[0 … log m])
  if (transtype = ins) then diff = 1 else diff = -1
  c[0] = c[0] + diff
  for j = 1 to log m do
    if bit(i,j) = 1 then
      c[j] = c[j] + diff
FindMajority procedure

FindMajority(c[0 … log m])
  position = 0, t = 1
  for j = 1 to log m do
    if (c[j] > c[0]/2) then
      position = position + t
    t = 2 * t
  return position
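A runnable Python sketch of the two procedures above (a direct translation of the pseudocode; the bit width M = 16 is my own choice for illustration):

M = 16                       # number of bits: supports items in [1 … 2^16]
c = [0] * (M + 1)            # c[0]: live items; c[j]: items with jth bit set

def update_counters(i, transtype):
    diff = 1 if transtype == 'ins' else -1
    c[0] += diff
    for j in range(1, M + 1):
        if (i >> (j - 1)) & 1:     # bit(i, j)
            c[j] += diff

def find_majority():
    position, t = 0, 1
    for j in range(1, M + 1):
        if c[j] > c[0] / 2:        # the majority item must have this bit set
            position += t
        t *= 2
    return position

update_counters(5, 'ins'); update_counters(5, 'ins'); update_counters(9, 'ins')
update_counters(9, 'del')          # net state: item 5 twice, so 5 is the majority
print(find_majority())             # prints 5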
Randomized constructions for finding hot items • Observation: if we select a subset that contains exactly one hot item, applying the majority algorithm to that subset will identify the hot item • Definition: a subset S is a good subset if it contains exactly one hot item and the total frequency of the other items in S is less than 1/(k+1), so the hot item is in the majority within S
How many subsets do we need? • Theorem: picking O(k log k) subsets, each by drawing m/k items uniformly from [1…m], means that with constant probability we have included k good subsets S_1 … S_k, such that each S_i contains hot item i and the total frequency of the other items in S_i is less than 1/(k+1) • Proof sketch: let p be the probability that a drawn subset is good for a given hot item • O(k log k) subsets guarantee, with constant probability, a good subset for each of the k hot items (the coupon collector’s problem)
Coupon collector’s problem • p is the probability that a coupon is good • X – number of trials required to collect at least one coupon of each type • Epoch i begins after the i-th success and ends with the (i+1)-th success • X_i – number of trials in the i-th epoch • X_i is distributed geometrically with p_i = p(k-i)/k
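Summing the expectations of the geometric epochs gives the O(k log k) bound quoted above (a standard coupon-collector calculation; H_k denotes the k-th harmonic number):

E[X] = \sum_{i=0}^{k-1} E[X_i] = \sum_{i=0}^{k-1} \frac{k}{p(k-i)} = \frac{k}{p} H_k = O\!\left(\frac{k}{p}\log k\right)

For constant p this is O(k log k) subsets, as claimed.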
Defining the groups with universal hash functions • The groups are chosen in a pseudo-random way using universal hash functions: • fix a prime P with P > m > 2k • a, b are drawn uniformly from [0 … P-1] • Then set h_{a,b}(x) = ((a·x + b) mod P) mod 2k • Fact: over all choices of a and b, for x ≠ y: Pr[h_{a,b}(x) = h_{a,b}(y)] ≤ 1/k
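A small Python sketch of this hash family (the concrete values of m, k, and the prime P are my own illustration):

import random

m, k = 1000, 10
P = 1009                    # a prime with P > m > 2k
a = random.randrange(1, P)  # hash parameters drawn uniformly
b = random.randrange(0, P)

def h(x):
    # maps items [1 … m] into 2k groups; pairwise collision probability <= 1/k
    return ((a * x + b) % P) % (2 * k)

print([h(x) for x in (1, 2, 3)])   # three pseudo-random group indices in [0, 2k)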
Choosing and updating the subsets • We choose T = log(k/δ) values of a and b, which creates 2kT = 2k·log(k/δ) subsets of items • Processing an item i means: • determine, for each of the T hash functions, which group i belongs to • for each one: update log m counters based on the bit representation of i • If the set is good, this gives us the hot item
Space requirements • Storing the a and b values, each O(m): O(log(k/δ)·log m) bits • Number of counters: 2k·log(k/δ)·(log m + 1) • Total space: O(k·log(k/δ)·log m) • Breakdown: log(k/δ) choices of (a,b) × 2k subsets each × (log m + 1) counters per subset
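For a feel of the sizes involved, a quick computation (the parameter values are my own illustration, not from the paper):

import math

k, delta, m = 20, 0.01, 2**20
T = math.ceil(math.log2(k / delta))              # number of hash functions: log(k/δ)
counters = 2 * k * T * (int(math.log2(m)) + 1)   # 2kT groups, log m + 1 counters each
print(T, counters)   # T = 11 hash functions, 9240 counters -- far below m = 1048576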
Probability of each hot item being in at least one good subset is at least 1-δ • Consider one hot item: in each of the T repetitions we put it in one of 2k groups • The expected total frequency f of the other items falling into its group is at most (1/2k)·(1 - f_i) ≤ 1/(2(k+1)) • If f < 1/(k+1), the majority is found – success • If f > 1/(k+1), the majority may not be found – failure • Probability of failure < 1/2 (by Markov’s inequality) • Probability of failing in all T repetitions < (1/2)^T = δ/k • Probability of any hot item failing is at most δ
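Chaining the bounds (a reconstruction of the calculation, with T = log(k/δ) as chosen earlier):

\Pr[\text{a fixed hot item fails all } T \text{ repetitions}] \le \left(\tfrac{1}{2}\right)^{T} = 2^{-\log(k/\delta)} = \frac{\delta}{k}

\Pr[\text{some hot item fails}] \le k \cdot \frac{\delta}{k} = \delta \quad \text{(union bound over the at most } k \text{ hot items)}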
Detecting good subsets • Given a subset and its associated counters, it is possible to detect deterministically whether the subset is a good subset • Proof: a subset can fail in two ways: • it contains no hot item (assuming STP): then c[0] ≤ n/(k+1), since the total frequency of all non-hot items is below 1/(k+1), and the test rejects the subset • it contains more than one hot item: then there is a bit position j on which two hot items differ, so both c[j] > n/(k+1) and c[0] - c[j] > n/(k+1), and the test rejects the subset • Otherwise a good subset is determined
ProcessItem procedure

Initialize c[0 … 2Tk-1][0 … log m] to 0
Draw a[1 … T], b[1 … T]; n = 0

ProcessItem(i, transtype, T, k)
  if (transtype = ins) then n = n + 1
  else n = n - 1
  for x = 1 to T do
    index = 2k(x-1) + ((i*a[x] + b[x]) mod P) mod 2k
    UpdateCounters(i, transtype, c[index])
GroupTest procedure

GroupTest(T, k, b)
  for i = 0 to 2Tk-1 do
    if c[i][0] > n*b then
      position = 0; t = 1
      for j = 1 to log m do
        if (c[i][j] > n*b and c[i][0] - c[i][j] > n*b) then
          skip to the next i   // both halves exceed the threshold: not a good subset
        if c[i][j] > n*b then
          position = position + t
        t = 2 * t
      output position
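A self-contained Python sketch of the whole scheme – my own translation of the two procedures above, with the class and parameter names being assumptions for illustration:

import math, random

class HotItems:
    def __init__(self, m, k, delta):
        self.m, self.k = m, k
        self.T = max(1, math.ceil(math.log2(k / delta)))  # T = log(k/δ) hash functions
        self.logm = m.bit_length()                        # bit-groups per subset
        self.P = 2**31 - 1                                # a Mersenne prime, P > m
        self.a = [random.randrange(1, self.P) for _ in range(self.T)]
        self.b = [random.randrange(0, self.P) for _ in range(self.T)]
        self.c = [[0] * (self.logm + 1) for _ in range(2 * k * self.T)]
        self.n = 0                                        # net number of live items

    def process_item(self, i, ins=True):
        diff = 1 if ins else -1
        self.n += diff
        for x in range(self.T):
            h = ((i * self.a[x] + self.b[x]) % self.P) % (2 * self.k)
            blk = self.c[2 * self.k * x + h]              # this group's counters
            blk[0] += diff
            for j in range(1, self.logm + 1):
                if (i >> (j - 1)) & 1:                    # bit(i, j)
                    blk[j] += diff

    def group_test(self):
        thresh = self.n / (self.k + 1)
        hot = set()
        for blk in self.c:
            if blk[0] <= thresh:
                continue                                  # cannot hold a hot item
            position, t, good = 0, 1, True
            for j in range(1, self.logm + 1):
                over, rest_over = blk[j] > thresh, blk[0] - blk[j] > thresh
                if over and rest_over:                    # split exceeds on both sides
                    good = False                          # -> more than one hot item
                    break
                if over:
                    position += t
                t *= 2
            if good:
                hot.add(position)
        return hot

Usage sketch:

hi = HotItems(m=2**20, k=2, delta=0.05)
for item, count in [(42, 60), (7, 50), (13, 10)]:
    for _ in range(count):
        hi.process_item(item)
print(hi.group_test())   # with probability >= 1-δ prints {42, 7} (n/(k+1) = 40)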
Algorithm correctness • With probability at least 1-δ, calling GroupTest(log(k/δ), k, 1/(k+1)) finds all hot items • Time to process an item: O(log(k/δ)·log m) • Time to find all hot items: O(k·log(k/δ)·log m) • With or without STP, we are still guaranteed to include all hot items with high probability • Without STP, we might also output infrequent items
Algorithm correctness – cont. • When will an infrequent item be output? (without STP) • A set with two or more hot items will be detected and rejected • A set with exactly one hot item never yields a wrong answer: even if some split’s half without the hot item exceeds the threshold, both halves then exceed it and the set is rejected • Only a set with no hot item, in which for every one of the log m splits exactly one half exceeds the threshold, causes the algorithm to output an infrequent item
Algorithm properties • The set of counters created with T = log(k/δ) can be used to find hot items with parameter k’ for any k’ < k, with probability of success 1-δ, by calling GroupTest(log(k/δ), k, 1/(k’+1)) • Proof idea: the analysis for k hot items carries over, since the groups were built with 2k ≥ 2k’ buckets and the same Markov bound still gives failure probability < 1/2 per repetition
Experiments • The GroupTesting algorithm was compared to the Lossy Counting and Frequent algorithms • The authors implemented them so that when an item is deleted, the corresponding counter is decremented if one exists • Recall: the proportion of the hot items found by the method out of all hot items • Precision: the proportion of the items output by the algorithm that are truly hot
Synthetic data (Recall) – Zipf parameter of the hot items: 0 – distributed uniformly, 3 – highly skewed
Synthetic data (Precision) – Zipf parameter of the hot items: 0 – distributed uniformly, 3 – highly skewed
Real data (Recall) – Real data was obtained from an AT&T network for part of a day.
Real data (Precision) – Real data has no guarantee of having the small tail property.
Varying frequency at query time – The data structure was built for queries at the 0.5% level, but was then tested with query thresholds ranging from 10% to 0.02%.
Conclusions and extensions • A new method that can cope with dynamic datasets is proposed • It would be interesting to use the algorithm to compare the differences in frequencies between different datasets • Can we find a combinatorial design that achieves the same properties with a deterministic construction for maintaining hot items?