What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically
By Graham Cormode & S. Muthukrishnan, Rutgers University, Piscataway NY
Presented by Tal Sterenzy
Motivation • A basic statistic on database relations is which items are hot – occur frequently • Goal: dynamically maintain the hot items in the presence of insert and delete transactions • Examples: • DBMS – keep statistics to improve performance • Telecommunication networks – network connections start and end over time
Overview • Definitions • Prior work • Algorithm description & analysis • Experimental results • Summary
Formal definition • Sequence of n transactions on m items [1…m] • n_i(t) – net occurrence of item i at time t: the number of times it has been inserted minus the number of times it has been deleted • f_i(t) = n_i(t) / Σ_j n_j(t) – current frequency of item i at time t • f*(t) = max_i f_i(t) – frequency of the most frequent item at time t • The k most frequent items at time t are those with the k largest values of n_i(t)
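To make the definitions concrete, here is a minimal Python sketch (my own illustration, not from the paper) that maintains the net occurrences n_i(t) over a stream of insert/delete transactions and reports the current frequencies:

from collections import defaultdict

net = defaultdict(int)   # net[i] = n_i(t): inserts minus deletes of item i

def process(item, transtype):
    # transtype is 'ins' or 'del'; the integrity constraint n_i(t) >= 0
    # means we never delete an item that is not currently present
    net[item] += 1 if transtype == 'ins' else -1

for item, op in [(3, 'ins'), (3, 'ins'), (7, 'ins'), (3, 'del'), (7, 'ins')]:
    process(item, op)

n = sum(net.values())                       # n(t): current total count
freq = {i: c / n for i, c in net.items()}   # f_i(t)
print(freq)   # ~{3: 0.33, 7: 0.67}; with k=1, item 7 is hot (f > 1/2)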
Finding k hot items • k is a parameter • Item i is a hot item if f_i(t) > 1/(k+1), i.e. n_i(t) > n(t)/(k+1) • Hot items are frequent items that account for a significant fraction of the entire dataset • There can be at most k hot items, and there can be none • Assume the basic integrity constraint n_i(t) ≥ 0: an item cannot be deleted more times than it has been inserted
Our algorithm • A highly efficient, randomized algorithm for maintaining hot items in a dynamically changing database • Monitors the changes to the data distribution and maintains O(k log(k/δ) log m) counters • When queried, we can find all hot items in time O(k log(k/δ) log m) with probability 1-δ • No need to scan the underlying relation
Small tail assumption • Restriction: let f_1 ≥ f_2 ≥ … ≥ f_m be the frequencies of the items in sorted order • A set of frequencies has a small tail if f_{k+1} + f_{k+2} + … + f_m < 1/(k+1) • If there are k hot items, then the small tail property holds • If the small tail property holds, some of the top k items might still not be hot • We shall analyze our solution in the presence and absence of this small tail property (STP)
Prior work – why is it not adaptable? • All these algorithms hold counters: • incremented when the item is observed • decremented or reallocated under certain circumstances • These algorithms can’t be directly adapted to handle insertions and deletions: • after deletions, the state of the algorithm differs from the state it would have reached had the deleted items never been inserted • Work on dynamic data is sparse, and provides no guarantees for the fully dynamic case with deletions
Our algorithm – idea • Do not keep counters for individual items, but rather for subsets of items • Ideas from group testing: • design a number of tests, each of which groups together some of the m items, in order to find up to k items which test positive • Here: the items that test positive are the hot items • Minimize the number of tests, where each group consists of a subset of items
General procedure • For each transaction on item i, determine which subsets it is included in: S(i) • Each subset has a counter: • for an insertion: increment all the counters of S(i) • for a deletion: decrement all the counters of S(i) • The test is: does the counter exceed a threshold? • Identifying the hot items is done by combining test results from several groups
The challenge is choosing the subsets • Bounding the number of required subsets • Finding a concise representation of the groups • Giving an efficient way to go from the results of the tests to the set of hot items • Let’s start with a simple case: k=1 (freq > 1/2) – a deterministic algorithm for maintaining the majority item
Finding the majority item • For insertions only: constant time and space per bit • Keep log m + 1 counters: • one counter c[0] of the number of items “alive” • the rest are labeled c[1] … c[log m], one per group • Group j contains the items whose jth bit is 1 in their binary representation • Each group consists of half of the items
Finding the majority item – cont. • bit(i,j) – reports the value of the jth bit of the binary representation of i • gt(i,j) – returns 1 if i > j, 0 otherwise • Scheme: • Insertion of item i: increment each counter c[j] such that bit(i,j) = 1, in time O(log m) • Deletion of item i: decrement each counter c[j] such that bit(i,j) = 1, in time O(log m) • Query: if there is a majority item, it is given by Σ_{j=1..log m} 2^(j-1) · gt(c[j], c[0]/2), computed in time O(log m)
Finding the majority item – cont. • Theorem: the algorithm finds the majority item, if there is one, with time O(log m) per operation • The state of the data structure is equivalent whether there were I insertions and D deletions or just c = I - D insertions • In the insertions-only case the majority is always found, so it is found in the dynamic case as well
UpdateCounters procedure

int c[0 … log m]

UpdateCounters(i, transtype, c[0 … log m])
  if (transtype = ins) then diff = 1 else diff = -1
  c[0] = c[0] + diff
  for j = 1 to log m do
    if bit(i,j) = 1 then
      c[j] = c[j] + diff
FindMajority procedure

FindMajority(c[0 … log m])
  position = 0, t = 1
  for j = 1 to log m do
    if (c[j] > c[0]/2) then
      position = position + t
    t = 2 * t
  return position
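A runnable Python sketch of the two procedures above (a direct translation of the pseudocode; the bit width M = 16 is my own choice for illustration):

M = 16                       # number of bits: supports items in [1 … 2^16]
c = [0] * (M + 1)            # c[0]: live items; c[j]: items with jth bit set

def update_counters(i, transtype):
    diff = 1 if transtype == 'ins' else -1
    c[0] += diff
    for j in range(1, M + 1):
        if (i >> (j - 1)) & 1:     # bit(i, j)
            c[j] += diff

def find_majority():
    position, t = 0, 1
    for j in range(1, M + 1):
        if c[j] > c[0] / 2:        # the majority item must have this bit set
            position += t
        t *= 2
    return position

update_counters(5, 'ins'); update_counters(5, 'ins'); update_counters(9, 'ins')
update_counters(9, 'del')          # net state: item 5 twice, so 5 is the majority
print(find_majority())             # prints 5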
Randomized constructions for finding hot items • Observation: if we select a subset that contains exactly one hot item, applying the majority algorithm to that subset will identify the hot item • Definition: a subset S is a good subset if it contains exactly one hot item and the total frequency of the other items in S is less than 1/(k+1), so the hot item is in the majority within S
How many subsets do we need? • Theorem: picking O(k log k) subsets, each by drawing m/k items uniformly from [1…m], means that with constant probability we have included k good subsets S_1 … S_k, such that each S_i contains hot item i and the total frequency of the other items in S_i is less than 1/(k+1) • Proof sketch: let p be the probability that a drawn subset is good for a given hot item • O(k log k) subsets guarantee, with constant probability, a good subset for each of the k hot items (the coupon collector’s problem)
Coupon collector’s problem • p is the probability that a coupon is good • X – number of trials required to collect at least one coupon of each type • Epoch i begins after the i-th success and ends with the (i+1)-th success • X_i – number of trials in the i-th epoch • X_i is distributed geometrically with p_i = p(k-i)/k
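Summing the expectations of the geometric epochs gives the O(k log k) bound quoted above (a standard coupon-collector calculation; H_k denotes the k-th harmonic number):

E[X] = \sum_{i=0}^{k-1} E[X_i] = \sum_{i=0}^{k-1} \frac{k}{p(k-i)} = \frac{k}{p} H_k = O\!\left(\frac{k}{p}\log k\right)

For constant p this is O(k log k) subsets, as claimed.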
Defining the groups with universal hash functions • The groups are chosen in a pseudo-random way using universal hash functions: • fix a prime P with P > m > 2k • a, b are drawn uniformly from [0 … P-1] • Then set h_{a,b}(x) = ((a·x + b) mod P) mod 2k • Fact: over all choices of a and b, for x ≠ y: Pr[h_{a,b}(x) = h_{a,b}(y)] ≤ 1/k
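A small Python sketch of this hash family (the concrete values of m, k, and the prime P are my own illustration):

import random

m, k = 1000, 10
P = 1009                    # a prime with P > m > 2k
a = random.randrange(1, P)  # hash parameters drawn uniformly
b = random.randrange(0, P)

def h(x):
    # maps items [1 … m] into 2k groups; pairwise collision probability <= 1/k
    return ((a * x + b) % P) % (2 * k)

print([h(x) for x in (1, 2, 3)])   # three pseudo-random group indices in [0, 2k)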
Choosing and updating the subsets • We choose T = log(k/δ) values of a and b, which creates 2kT = 2k·log(k/δ) subsets of items • Processing an item i means: • determine, for each of the T hash functions, which group i belongs to • for each one: update log m counters based on the bit representation of i • If the set is good, this gives us the hot item
Space requirements • Storing the a and b values, each O(m): O(log(k/δ)·log m) bits • Number of counters: 2k·log(k/δ)·(log m + 1) • Total space: O(k·log(k/δ)·log m) • Breakdown: log(k/δ) choices of (a,b) × 2k subsets each × (log m + 1) counters per subset
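For a feel of the sizes involved, a quick computation (the parameter values are my own illustration, not from the paper):

import math

k, delta, m = 20, 0.01, 2**20
T = math.ceil(math.log2(k / delta))              # number of hash functions: log(k/δ)
counters = 2 * k * T * (int(math.log2(m)) + 1)   # 2kT groups, log m + 1 counters each
print(T, counters)   # T = 11 hash functions, 9240 counters -- far below m = 1048576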
Probability of each hot item being in at least one good subset is at least 1-δ • Consider one hot item: in each of the T repetitions we put it in one of 2k groups • The expected total frequency f of the other items falling into its group is at most (1/2k)·(1 - f_i) ≤ 1/(2(k+1)) • If f < 1/(k+1), the majority is found – success • If f > 1/(k+1), the majority may not be found – failure • Probability of failure < 1/2 (by Markov’s inequality) • Probability of failing in all T repetitions < (1/2)^T = δ/k • Probability of any hot item failing is at most δ
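Chaining the bounds (a reconstruction of the calculation, with T = log(k/δ) as chosen earlier):

\Pr[\text{a fixed hot item fails all } T \text{ repetitions}] \le \left(\tfrac{1}{2}\right)^{T} = 2^{-\log(k/\delta)} = \frac{\delta}{k}

\Pr[\text{some hot item fails}] \le k \cdot \frac{\delta}{k} = \delta \quad \text{(union bound over the at most } k \text{ hot items)}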
Detecting good subsets • Given a subset and its associated counters, it is possible to detect deterministically whether the subset is a good subset • Proof: a subset can fail in two ways: • it contains no hot item (assuming STP): then c[0] ≤ n/(k+1), since the total frequency of all non-hot items is below 1/(k+1), and the test rejects the subset • it contains more than one hot item: then there is a bit position j on which two hot items differ, so both c[j] > n/(k+1) and c[0] - c[j] > n/(k+1), and the test rejects the subset • Otherwise a good subset is determined
ProcessItem procedure

Initialize c[0 … 2Tk-1][0 … log m] to 0
Draw a[1 … T], b[1 … T]; n = 0

ProcessItem(i, transtype, T, k)
  if (transtype = ins) then n = n + 1
  else n = n - 1
  for x = 1 to T do
    index = 2k(x-1) + ((i*a[x] + b[x]) mod P) mod 2k
    UpdateCounters(i, transtype, c[index])
GroupTest procedure

GroupTest(T, k, b)
  for i = 0 to 2Tk-1 do
    if c[i][0] > n*b then
      position = 0; t = 1
      for j = 1 to log m do
        if (c[i][j] > n*b and c[i][0] - c[i][j] > n*b) then
          skip to the next i   // both halves exceed the threshold: not a good subset
        if c[i][j] > n*b then
          position = position + t
        t = 2 * t
      output position
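A self-contained Python sketch of the whole scheme – my own translation of the two procedures above, with the class and parameter names being assumptions for illustration:

import math, random

class HotItems:
    def __init__(self, m, k, delta):
        self.m, self.k = m, k
        self.T = max(1, math.ceil(math.log2(k / delta)))  # T = log(k/δ) hash functions
        self.logm = m.bit_length()                        # bit-groups per subset
        self.P = 2**31 - 1                                # a Mersenne prime, P > m
        self.a = [random.randrange(1, self.P) for _ in range(self.T)]
        self.b = [random.randrange(0, self.P) for _ in range(self.T)]
        self.c = [[0] * (self.logm + 1) for _ in range(2 * k * self.T)]
        self.n = 0                                        # net number of live items

    def process_item(self, i, ins=True):
        diff = 1 if ins else -1
        self.n += diff
        for x in range(self.T):
            h = ((i * self.a[x] + self.b[x]) % self.P) % (2 * self.k)
            blk = self.c[2 * self.k * x + h]              # this group's counters
            blk[0] += diff
            for j in range(1, self.logm + 1):
                if (i >> (j - 1)) & 1:                    # bit(i, j)
                    blk[j] += diff

    def group_test(self):
        thresh = self.n / (self.k + 1)
        hot = set()
        for blk in self.c:
            if blk[0] <= thresh:
                continue                                  # cannot hold a hot item
            position, t, good = 0, 1, True
            for j in range(1, self.logm + 1):
                over, rest_over = blk[j] > thresh, blk[0] - blk[j] > thresh
                if over and rest_over:                    # split exceeds on both sides
                    good = False                          # -> more than one hot item
                    break
                if over:
                    position += t
                t *= 2
            if good:
                hot.add(position)
        return hot

Usage sketch:

hi = HotItems(m=2**20, k=2, delta=0.05)
for item, count in [(42, 60), (7, 50), (13, 10)]:
    for _ in range(count):
        hi.process_item(item)
print(hi.group_test())   # with probability >= 1-δ prints {42, 7} (n/(k+1) = 40)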
Algorithm correctness • With probability at least 1-δ, calling GroupTest(log(k/δ), k, 1/(k+1)) finds all hot items • Time to process an item: O(log(k/δ)·log m) • Time to find all hot items: O(k·log(k/δ)·log m) • With or without STP, we are still guaranteed to include all hot items with high probability • Without STP, we might also output infrequent items
Algorithm correctness – cont. • When will an infrequent item be output? (without STP) • A set with two or more hot items will be detected and rejected • A set with exactly one hot item never yields a wrong answer: even if some split’s half without the hot item exceeds the threshold, both halves then exceed it and the set is rejected • Only a set with no hot item, in which for every one of the log m splits exactly one half exceeds the threshold, causes the algorithm to output an infrequent item
Algorithm properties • The set of counters created with T = log(k/δ) can be used to find hot items with parameter k’ for any k’ < k, with probability of success 1-δ, by calling GroupTest(log(k/δ), k, 1/(k’+1)) • Proof idea: the analysis for k hot items carries over, since the groups were built with 2k ≥ 2k’ buckets and the same Markov bound still gives failure probability < 1/2 per repetition
Experiments • The GroupTesting algorithm was compared to the Lossy Counting and Frequent algorithms • The authors implemented them so that when an item is deleted, the corresponding counter is decremented if one exists • Recall: the proportion of the hot items found by the method out of all hot items • Precision: the proportion of the items output by the algorithm that are truly hot
Synthetic data (Recall) – Zipf parameter of the hot items: 0 – distributed uniformly, 3 – highly skewed
Synthetic data (Precision) – Zipf parameter of the hot items: 0 – distributed uniformly, 3 – highly skewed
Real data (Recall) – Real data was obtained from an AT&T network for part of a day.
Real data (Precision) – Real data has no guarantee of having the small tail property.
Varying frequency at query time – The data structure was built for queries at the 0.5% level, but was then tested with query thresholds ranging from 10% to 0.02%.
Conclusions and extensions • A new method that can cope with dynamic datasets is proposed • It would be interesting to use the algorithm to compare the differences in frequencies between different datasets • Can we find a combinatorial design that achieves the same properties with a deterministic construction for maintaining hot items?