Frequent Pattern Mining in Data Streams Rishi Gosai
Frequent Pattern Analysis: Introduction • Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set • First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining • Motivation: Finding inherent regularities in data • What products were often purchased together? • What are the subsequent purchases after buying a PC? • Can we automatically classify web documents? • Applications • Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis
Basic Concepts (figure: Venn diagram of customers buying beer, diaper, or both) • Itemset: a set of one or more items • k-itemset X = {x1, …, xk} • (absolute) support, or support count, of X: the frequency or number of occurrences of the itemset X • (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X) • An itemset X is frequent if X's support is no less than a minsup threshold
Association Rules Find all the rules X → Y with minimum support and confidence • support, s: probability that a transaction contains X ∪ Y • confidence, c: conditional probability that a transaction containing X also contains Y Let minsup = 50%, minconf = 50% Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3 Association Rules: Beer → Diaper (60%, 100%), Diaper → Beer (60%, 75%)
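As a quick sanity check of these definitions, here is a minimal Python sketch that computes support and confidence for the Beer/Diaper rules; the transaction contents and helper names are illustrative, chosen so the counts match the numbers on the slide.

```python
# Minimal sketch: support and confidence over a toy transaction list.
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """P(Y in transaction | X in transaction) = support(X u Y) / support(X)."""
    return support(set(x) | set(y)) / support(x)

print(support({"Beer", "Diaper"}))       # 0.6  -> 60% support
print(confidence({"Beer"}, {"Diaper"}))  # 1.0  -> Beer -> Diaper: 100% confidence
print(confidence({"Diaper"}, {"Beer"}))  # 0.75 -> Diaper -> Beer: 75% confidence
```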
The Apriori Algorithm (Pseudo-Code)
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
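The pseudo-code above can be turned into a small runnable sketch. The following Python version is mine (function and variable names are illustrative, and min_support is an absolute count rather than a fraction), but it follows the same generate-count-prune loop:

```python
from itertools import combinations
from collections import defaultdict

def apriori(transactions, min_support):
    """Return all itemsets with absolute support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    L = {s for s, c in counts.items() if c >= min_support}
    frequent = set(L)
    k = 1
    while L:
        # Candidate generation: join L_k with itself to form (k+1)-itemsets,
        # keeping only candidates all of whose k-subsets are frequent (Apriori property).
        candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k))}
        # Count candidate occurrences with one pass over the database.
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        L = {c for c, cnt in counts.items() if cnt >= min_support}
        frequent |= L
        k += 1
    return frequent
```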
The Apriori Algorithm: An Example (Supmin = 2). The figure traces three scans over the database TDB: the 1st scan produces C1 and L1, the 2nd scan produces C2 and L2, and the 3rd scan produces C3 and L3.
Construct FP-tree from a Transaction Database (min_support = 3)
TID  Items bought               (ordered) frequent items
100  {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200  {a, b, c, f, l, m, o}      {f, c, a, b, m}
300  {b, f, h, j, o, w}         {f, b}
400  {b, c, k, s, p}            {c, b, p}
500  {a, f, c, e, l, p, m, n}   {f, c, a, m, p}
• Scan the DB once, find the frequent 1-itemsets (single-item patterns)
• Sort the frequent items in frequency-descending order into the f-list: F-list = f-c-a-b-m-p
• Scan the DB again, construct the FP-tree
(Figure: header table with item frequencies f:4, c:4, a:3, b:3, m:3, p:3, linked into the resulting FP-tree)
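A compact Python sketch of the two-scan construction just described; the FPNode class, header-table representation, and function names are my own illustrative choices, not from the slides.

```python
from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Scan 1: count single items and keep the frequent ones in descending order (f-list).
    counts = Counter(item for t in transactions for item in t)
    flist = [i for i, c in counts.most_common() if c >= min_support]
    rank = {item: r for r, item in enumerate(flist)}

    root = FPNode(None, None)
    header = defaultdict(list)  # item -> list of tree nodes holding that item
    # Scan 2: insert each transaction's frequent items in f-list order.
    for t in transactions:
        items = sorted((i for i in t if i in rank), key=rank.get)
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header, flist

root, header, flist = build_fp_tree(
    [list("facdgimp"), list("abcflmo"), list("bfhjow"),
     list("bcksp"), list("afcelpmn")],
    min_support=3)
print(flist)  # frequency-descending; a, b, m, p all have count 3, so ties may reorder
```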
Challenges of streaming • Single pass: data arrive as a continuous "stream" • Limited memory: unlike a traditional stored DB, the sheer volume of a stream over its lifetime is huge, yet queries require timely answers • Enumeration of itemsets
Purpose • Present algorithms that compute the items whose frequency exceeds a threshold • Simple • Low memory footprint • Output is approximate, but the error is guaranteed not to exceed a user-specified error parameter • First developed for singleton items, then extended to handle variable-sized sets of items • Main contributions of the paper: • Proposed 2 algorithms to find the frequent items appearing in a data stream of items • Extended the algorithms to find frequent itemsets
Notations • Some notations: • Let N denote the current length of the stream • Let s ∈ (0,1) denote the support threshold • Let ε ∈ (0,1) denote the error tolerance • ε << s
Approximation guarantees • All itemsets whose true frequency exceeds sN are reported • No itemset whose true frequency is less than (s − ε)N is output • Estimated frequencies are less than the true frequencies by at most εN
Example • s = 0.1% • ε should be roughly one-tenth or one-twentieth of s, e.g. ε = 0.01% • Property 1: all elements with frequency exceeding 0.1% of N are output • Property 2: NO element with frequency below 0.09% of N is output • Elements between 0.09% and 0.1% may or may not be output • Property 3: estimated frequencies are less than the true frequencies by at most 0.01% of N
Problem definition • An algorithm maintains an ε-deficient synopsis if its output satisfies the aforementioned properties • Devise algorithms that support an ε-deficient synopsis using as little main memory as possible
The Algorithms for Frequent Items • Each transaction contains only 1 item • Two algorithms proposed: • Sticky Sampling Algorithm • Lossy Counting Algorithm • Features: • Sampling is used • The frequencies found are approximate, but the error is guaranteed not to exceed a user-specified tolerance level • For Lossy Counting, all frequent items are reported
Sticky Sampling Algorithm (figure: a stream of items, e.g. 28 31 41 34 15 30 23 35 19, from which counters are created by sampling)
Sticky Sampling Algorithm • User input: • Support threshold s • Error tolerance ε • Probability of failure δ • Counts are kept in a data structure S • Each entry in S has the form (e, f), where: • e: item • f: frequency of e since the entry was inserted into S • Output the entries in S where f ≥ (s − ε)N
Sticky Sampling Algorithm • r: sampling rate • Sampling an element with rate r means selecting the element with probability 1/r
Sticky Sampling Algorithm
• Initially, S is empty and r = 1
• For each incoming element e:
    if (e exists in S)
        increment the corresponding f
    else {
        sample the element with rate r
        if (sampled) add the entry (e, 1) to S
        else ignore e
    }
Sampling rate • Let t = (1/ε) log(s⁻¹ δ⁻¹), where δ is the probability of failure • The first 2t elements are sampled at rate r = 1 • The next 2t at rate r = 2 • The next 4t at rate r = 4, and so on…
Sticky Sampling Algorithm
Whenever the sampling rate r changes:
    for each entry (e, f) in S
        repeat {
            toss an unbiased coin
            if (toss is not successful) {
                diminish f by one
                if (f == 0) { delete the entry from S; break }
            }
        } until toss is successful
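Putting the pieces together, a minimal Python sketch of Sticky Sampling might look as follows; the class and variable names are mine, while the rate schedule, sampling step, and coin-tossing adjustment follow the preceding slides.

```python
import math
import random

class StickySampling:
    """A minimal sketch of Sticky Sampling (names are illustrative)."""

    def __init__(self, s, eps, delta):
        self.s, self.eps = s, eps
        self.S = {}                                 # item -> estimated frequency f
        self.t = math.ceil((1 / eps) * math.log(1 / (s * delta)))
        self.r = 1                                  # current sampling rate
        self.n = 0                                  # stream length seen so far
        self.next_change = 2 * self.t               # first 2t elements at rate 1

    def process(self, e):
        self.n += 1
        if e in self.S:
            self.S[e] += 1                          # existing entry: just count
        elif random.random() < 1.0 / self.r:        # sample new item with probability 1/r
            self.S[e] = 1
        if self.n == self.next_change:              # sampling rate doubles
            self.r *= 2
            self.next_change += self.r * self.t     # next r*t elements at the new rate
            for item in list(self.S):
                # Toss an unbiased coin until success,
                # diminishing f by one for every unsuccessful toss.
                while random.random() < 0.5:
                    self.S[item] -= 1
                    if self.S[item] == 0:
                        del self.S[item]
                        break

    def output(self):
        # Report entries whose estimated frequency is at least (s - eps) * N.
        return [item for item, f in self.S.items()
                if f >= (self.s - self.eps) * self.n]
```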
Lossy Counting • The data stream is conceptually divided into buckets of w = ⌈1/ε⌉ transactions each • Buckets are labeled with bucket ids, starting from 1 • The current bucket id is bcurrent, whose value is ⌈N/w⌉ • fe: true frequency of an element e in the stream seen so far • Each entry in the data structure D has the form (e, f, Δ), where: • e: item • f: frequency of e • Δ: the maximum possible error in f
Lossy Counting • Δ is the maximum number of times e could have occurred in the first bcurrent − 1 buckets (this value is exactly bcurrent − 1 at insertion time) • Once an entry is inserted into D, its Δ value remains unchanged
Lossy Counting
• Initially, D is empty
• On receiving an element e:
    if (e exists in D)
        increment its frequency f by 1
    else
        create a new entry (e, 1, bcurrent − 1)
• At every bucket boundary, prune D by the following rule: (e, f, Δ) is deleted if f + Δ ≤ bcurrent
• When the user requests a list of items with threshold s, output those entries in D where f ≥ (s − ε)N
Lossy Counting
• function prune(D, b)
    for each entry (e, f, Δ) in D do
        if f + Δ ≤ b then
            remove the entry from D
        endif
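A minimal runnable sketch of Lossy Counting for single items, combining the insertion, pruning, and output steps above; the class and variable names, and the per-element bucket-boundary check, are my own choices.

```python
import math

class LossyCounting:
    """A minimal sketch of Lossy Counting for single items."""

    def __init__(self, eps):
        self.eps = eps
        self.w = math.ceil(1 / eps)    # bucket width: ceil(1/eps) transactions
        self.D = {}                    # item -> [f, delta]
        self.n = 0                     # elements seen so far (N)

    def process(self, e):
        self.n += 1
        b_current = math.ceil(self.n / self.w)
        if e in self.D:
            self.D[e][0] += 1
        else:
            self.D[e] = [1, b_current - 1]
        if self.n % self.w == 0:       # bucket boundary: prune
            self.D = {item: fd for item, fd in self.D.items()
                      if fd[0] + fd[1] > b_current}

    def output(self, s):
        # Report entries whose estimated frequency is at least (s - eps) * N.
        return [item for item, (f, d) in self.D.items()
                if f >= (s - self.eps) * self.n]
```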
Lossy Counting (figure: frequency counts over the first window and the next window; D starts out empty, and at each window boundary the entries with f + Δ ≤ bcurrent are removed)
Lossy Counting • Lossy Counting guarantees that: • When a deletion occurs, bcurrent ≤ εN • An entry (e, f, Δ) is deleted only if fe ≤ bcurrent • fe: actual frequency count of e • Hence, if an entry (e, f, Δ) is deleted, fe ≤ εN • Finally, f ≤ fe ≤ f + εN
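Written out as a single chain of inequalities (a restatement of the bullets above, with w = ⌈1/ε⌉ and Δ fixed at insertion time):

```latex
\Delta \;\le\; b_{\mathrm{current}} - 1
       \;=\; \Bigl\lceil \tfrac{N}{w} \Bigr\rceil - 1
       \;<\; \tfrac{N}{w}
       \;\le\; \varepsilon N
\qquad\Longrightarrow\qquad
f \;\le\; f_e \;\le\; f + \Delta \;\le\; f + \varepsilon N .
```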
Sticky Sampling vs Lossy Counting • Sticky Sampling is non-deterministic, while Lossy Counting is deterministic • Experimental results show that Lossy Counting requires fewer entries than Sticky Sampling
Sticky Sampling vs Lossy Counting • Lossy Counting is superior by a large factor • Sticky Sampling performs worse because of its tendency to remember every unique element that gets sampled • Lossy Counting is good at pruning low-frequency elements quickly
The more complex case: finding frequent itemsets • The Lossy Counting algorithm is extended to find frequent itemsets • Transactions in the data stream contain sets of items
Finding frequent itemsets (figure: a stream of transactions)
Finding frequent itemsets • Input: a stream of transactions, each transaction being a set of items from I • N: length of the stream • User specifies two parameters: support s, error ε • Challenge: • handling variable-sized transactions • avoiding explicit enumeration of all subsets of any transaction
Finding frequent itemsets • Data structure D: a set of entries of the form (set, f, Δ) • set: a subset of items • Transactions are divided into buckets • w = ⌈1/ε⌉: # of transactions in each bucket • bcurrent: current bucket id
Finding frequent itemsets • Transactions are not processed one by one. Main memory is filled with as many transactions as possible, and processing is done on a batch of transactions. • β: # of buckets in main memory in the current batch being processed
Finding frequent itemsets • D’s operations : • UPDATE_SET updates and deletes in D • Entry (set, f, ) count occurrence of set in the batch and update the entry • If updated entry satisfies f + bcurrent, removed it from D • NEW_SET inserts new entries into D • If set set has frequency f in batch and set doesn’t occur in D, create a new entry (set, f, bcurrent-)
Finding frequent itemsets • If fset ≥ εN, then set has an entry in D • If (set, f, Δ) ∈ D, then the true frequency fset satisfies the inequality f ≤ fset ≤ f + Δ • When the user requests a list of itemsets with threshold s, output those entries in D where f ≥ (s − ε)N • β needs to be a large number: any subset of I that occurs β + 1 times or more contributes an entry to D
Buffer: repeatedly reads in a batch of buckets of transactions into available main memory • Trie: maintains the data structure D • SetGen: generates subsets of item-id’s along with their frequency counts in the current batch • Not all possible subsets need to be generated • If a subset S is not inserted into D after application of both UPDATE_SET and NEW_SET, then no supersets of S should be considered
Three modules: • BUFFER repeatedly reads in a batch of transactions into available main memory • TRIE maintains the data structure D and implements UPDATE_SET and NEW_SET • SUBSET-GEN operates on the current batch of transactions
Module 1 - Buffer (figure: windows 1 through 6 of transactions held in main memory) • Read a batch of transactions • Transactions are laid out one after the other in a big array • A bitmap is used to remember transaction boundaries • After reading in a batch, BUFFER sorts each transaction by its item-ids
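A tiny sketch of this layout (the flat array plus boundary bitmap); the packing function and the item-id values are illustrative.

```python
def pack_batch(transactions):
    """Pack a batch into one flat array plus a boundary bitmap (illustrative)."""
    flat, starts = [], []
    for t in transactions:
        starts.extend([1] + [0] * (len(t) - 1))  # 1 marks the start of a transaction
        flat.extend(sorted(t))                   # BUFFER sorts each transaction by item-id
    return flat, starts

flat, starts = pack_batch([[50, 40, 30], [31, 29], [45, 32, 42, 30]])
# flat   == [30, 40, 50, 29, 31, 30, 32, 42, 45]
# starts == [1, 0, 0, 1, 0, 1, 0, 0, 0]
```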
Module 2 - TRIE (figure: a forest of nodes labeled with item-ids such as 29, 30, 31, 32, 40, 42, 45, 50, representing sets with their frequency counts)
Module 2 – TRIE cont… • Nodes are labeled {item-id, f, Δ, level} • Children of any node are ordered by their item-ids • Root nodes are also ordered by their item-ids • A node represents the itemset consisting of the item-ids in that node and all its ancestors • The TRIE is maintained as an array of entries of the form {item-id, f, Δ, level}, in pre-order of the trees; this is equivalent to a lexicographic ordering of the subsets it encodes • No pointers: the levels compactly encode the underlying tree structure
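To illustrate the pointer-free array encoding, here is a made-up pre-order array for a small forest and a helper that recovers each node's itemset from the level fields alone; all values and names are purely illustrative.

```python
# Each entry is (item_id, f, delta, level); the `level` fields alone
# recover the tree shape, so no child pointers are stored.
trie = [
    (29, 4, 0, 0),   # root: {29}
    (31, 3, 0, 1),   #   child: {29, 31}
    (45, 2, 0, 2),   #     child: {29, 31, 45}
    (32, 3, 0, 1),   #   child: {29, 32}
    (40, 5, 0, 0),   # root: {40}
    (50, 4, 0, 1),   #   child: {40, 50}
]

def itemsets(trie):
    """Yield (itemset, f, delta) for every node, using only the level fields."""
    path = []
    for item, f, delta, level in trie:
        path[level:] = [item]      # truncate the path to `level` items, then append
        yield set(path), f, delta

for s, f, delta in itemsets(trie):
    print(sorted(s), f, delta)
```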
Module 3 - SetGen (figure: BUFFER feeding frequency counts of subsets in lexicographic order) SetGen uses the following pruning rule: if a subset S does not make its way into the TRIE after applying both UPDATE_SET and NEW_SET, then no supersets of S need to be considered
Overall Algorithm (figure: BUFFER feeds SUBSET-GEN, which together with the old TRIE produces the new TRIE)
Thoughts on the Paper • For frequent itemsets, the algorithm does scan each record many times • Sometimes we are only interested in recent data • Good comparison metrics: more efficient than traditional Apriori
Conclusion • Sticky Sampling and Lossy Counting are 2 approximate algorithms that can find frequent items • Both algorithms produce frequency counts within a user-specified error tolerance level, though Sticky Sampling is non-deterministic • Lossy Counting can be extended to find frequent itemsets