Frequent Pattern Mining in Data Streams Rishi Gosai
Frequent Pattern Analysis: Introduction • Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set • First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining • Motivation: Finding inherent regularities in data • What products were often purchased together? • What are the subsequent purchases after buying a PC? • Can we automatically classify web documents? • Applications • Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis
Basic Concepts (figure: Venn diagram of customers buying beer, diaper, or both) • Itemset: a set of one or more items • k-itemset X = {x1, …, xk} • (absolute) support, or support count, of X: the frequency or number of occurrences of the itemset X • (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X) • An itemset X is frequent if X's support is no less than a minsup threshold
Association Rules Find all the rules X → Y with minimum support and confidence • support, s: probability that a transaction contains X ∪ Y • confidence, c: conditional probability that a transaction containing X also contains Y Let minsup = 50%, minconf = 50% Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3 Association Rules: Beer → Diaper (60%, 100%), Diaper → Beer (60%, 75%)
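As a quick sanity check of these definitions, here is a minimal Python sketch that computes support and confidence for the Beer/Diaper rules; the transaction contents and helper names are illustrative, chosen so the counts match the numbers on the slide.

```python
# Minimal sketch: support and confidence over a toy transaction list.
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """P(Y in transaction | X in transaction) = support(X u Y) / support(X)."""
    return support(set(x) | set(y)) / support(x)

print(support({"Beer", "Diaper"}))       # 0.6  -> 60% support
print(confidence({"Beer"}, {"Diaper"}))  # 1.0  -> Beer -> Diaper: 100% confidence
print(confidence({"Diaper"}, {"Beer"}))  # 0.75 -> Diaper -> Beer: 75% confidence
```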
The Apriori Algorithm (Pseudo-Code)
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
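The pseudo-code above can be turned into a small runnable sketch. The following Python version is mine (function and variable names are illustrative, and min_support is an absolute count rather than a fraction), but it follows the same generate-count-prune loop:

```python
from itertools import combinations
from collections import defaultdict

def apriori(transactions, min_support):
    """Return all itemsets with absolute support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    L = {s for s, c in counts.items() if c >= min_support}
    frequent = set(L)
    k = 1
    while L:
        # Candidate generation: join L_k with itself to form (k+1)-itemsets,
        # keeping only candidates all of whose k-subsets are frequent (Apriori property).
        candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k))}
        # Count candidate occurrences with one pass over the database.
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        L = {c for c, cnt in counts.items() if cnt >= min_support}
        frequent |= L
        k += 1
    return frequent
```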
The Apriori Algorithm: An Example (Supmin = 2). The figure traces three scans over the database TDB: the 1st scan produces C1 and L1, the 2nd scan produces C2 and L2, and the 3rd scan produces C3 and L3.
Construct FP-tree from a Transaction Database (min_support = 3)
TID  Items bought               (ordered) frequent items
100  {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200  {a, b, c, f, l, m, o}      {f, c, a, b, m}
300  {b, f, h, j, o, w}         {f, b}
400  {b, c, k, s, p}            {c, b, p}
500  {a, f, c, e, l, p, m, n}   {f, c, a, m, p}
• Scan the DB once, find the frequent 1-itemsets (single-item patterns)
• Sort the frequent items in frequency-descending order into the f-list: F-list = f-c-a-b-m-p
• Scan the DB again, construct the FP-tree
(Figure: header table with item frequencies f:4, c:4, a:3, b:3, m:3, p:3, linked into the resulting FP-tree)
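A compact Python sketch of the two-scan construction just described; the FPNode class, header-table representation, and function names are my own illustrative choices, not from the slides.

```python
from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Scan 1: count single items and keep the frequent ones in descending order (f-list).
    counts = Counter(item for t in transactions for item in t)
    flist = [i for i, c in counts.most_common() if c >= min_support]
    rank = {item: r for r, item in enumerate(flist)}

    root = FPNode(None, None)
    header = defaultdict(list)  # item -> list of tree nodes holding that item
    # Scan 2: insert each transaction's frequent items in f-list order.
    for t in transactions:
        items = sorted((i for i in t if i in rank), key=rank.get)
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header, flist

root, header, flist = build_fp_tree(
    [list("facdgimp"), list("abcflmo"), list("bfhjow"),
     list("bcksp"), list("afcelpmn")],
    min_support=3)
print(flist)  # frequency-descending; a, b, m, p all have count 3, so ties may reorder
```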
Challenges of streaming • Single pass: data arrive as a continuous "stream" • Limited memory: unlike a traditional stored DB, the sheer volume of a stream over its lifetime is huge, yet queries require timely answers • Enumeration of itemsets
Purpose • Present algorithms that compute the items whose frequency exceeds a threshold • Simple • Low memory footprint • Output is approximate, but the error is guaranteed not to exceed a user-specified error parameter • First developed for singleton items, then extended to handle variable-sized sets of items • Main contributions of the paper: • Proposed 2 algorithms to find the frequent items appearing in a data stream of items • Extended the algorithms to find frequent itemsets
Notations • Some notations: • Let N denote the current length of the stream • Let s ∈ (0,1) denote the support threshold • Let ε ∈ (0,1) denote the error tolerance • ε << s
Approximation guarantees • All itemsets whose true frequency exceeds sN are reported • No itemset whose true frequency is less than (s − ε)N is output • Estimated frequencies are less than the true frequencies by at most εN
Example • s = 0.1% • ε should be roughly one-tenth or one-twentieth of s, e.g. ε = 0.01% • Property 1: all elements with frequency exceeding 0.1% of N are output • Property 2: NO element with frequency below 0.09% of N is output • Elements between 0.09% and 0.1% may or may not be output • Property 3: estimated frequencies are less than the true frequencies by at most 0.01% of N
Problem definition • An algorithm maintains an ε-deficient synopsis if its output satisfies the aforementioned properties • Devise algorithms that support an ε-deficient synopsis using as little main memory as possible
The Algorithms for Frequent Items • Each transaction contains only 1 item • Two algorithms proposed: • Sticky Sampling Algorithm • Lossy Counting Algorithm • Features: • Sampling is used • The frequencies found are approximate, but the error is guaranteed not to exceed a user-specified tolerance level • For Lossy Counting, all frequent items are reported
Sticky Sampling Algorithm (figure: a stream of items, e.g. 28 31 41 34 15 30 23 35 19, from which counters are created by sampling)
Sticky Sampling Algorithm • User input: • Support threshold s • Error tolerance ε • Probability of failure δ • Counts are kept in a data structure S • Each entry in S has the form (e, f), where: • e: item • f: frequency of e since the entry was inserted into S • Output the entries in S where f ≥ (s − ε)N
Sticky Sampling Algorithm • r: sampling rate • Sampling an element with rate r means selecting the element with probability 1/r
Sticky Sampling Algorithm
• Initially, S is empty and r = 1
• For each incoming element e:
    if (e exists in S)
        increment the corresponding f
    else {
        sample the element with rate r
        if (sampled) add the entry (e, 1) to S
        else ignore e
    }
Sampling rate • Let t = (1/ε) log(s⁻¹ δ⁻¹), where δ is the probability of failure • The first 2t elements are sampled at rate r = 1 • The next 2t at rate r = 2 • The next 4t at rate r = 4, and so on…
Sticky Sampling Algorithm
Whenever the sampling rate r changes:
    for each entry (e, f) in S
        repeat {
            toss an unbiased coin
            if (toss is not successful) {
                diminish f by one
                if (f == 0) { delete the entry from S; break }
            }
        } until toss is successful
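Putting the pieces together, a minimal Python sketch of Sticky Sampling might look as follows; the class and variable names are mine, while the rate schedule, sampling step, and coin-tossing adjustment follow the preceding slides.

```python
import math
import random

class StickySampling:
    """A minimal sketch of Sticky Sampling (names are illustrative)."""

    def __init__(self, s, eps, delta):
        self.s, self.eps = s, eps
        self.S = {}                                 # item -> estimated frequency f
        self.t = math.ceil((1 / eps) * math.log(1 / (s * delta)))
        self.r = 1                                  # current sampling rate
        self.n = 0                                  # stream length seen so far
        self.next_change = 2 * self.t               # first 2t elements at rate 1

    def process(self, e):
        self.n += 1
        if e in self.S:
            self.S[e] += 1                          # existing entry: just count
        elif random.random() < 1.0 / self.r:        # sample new item with probability 1/r
            self.S[e] = 1
        if self.n == self.next_change:              # sampling rate doubles
            self.r *= 2
            self.next_change += self.r * self.t     # next r*t elements at the new rate
            for item in list(self.S):
                # Toss an unbiased coin until success,
                # diminishing f by one for every unsuccessful toss.
                while random.random() < 0.5:
                    self.S[item] -= 1
                    if self.S[item] == 0:
                        del self.S[item]
                        break

    def output(self):
        # Report entries whose estimated frequency is at least (s - eps) * N.
        return [item for item, f in self.S.items()
                if f >= (self.s - self.eps) * self.n]
```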
Lossy Counting • The data stream is conceptually divided into buckets of w = ⌈1/ε⌉ transactions each • Buckets are labeled with bucket ids, starting from 1 • The current bucket id is bcurrent, whose value is ⌈N/w⌉ • fe: true frequency of an element e in the stream seen so far • Each entry in the data structure D has the form (e, f, Δ), where: • e: item • f: frequency of e • Δ: the maximum possible error in f
Lossy Counting • Δ is the maximum number of times e could have occurred in the first bcurrent − 1 buckets (this value is exactly bcurrent − 1 at insertion time) • Once an entry is inserted into D, its Δ value remains unchanged
Lossy Counting
• Initially, D is empty
• On receiving an element e:
    if (e exists in D)
        increment its frequency f by 1
    else
        create a new entry (e, 1, bcurrent − 1)
• At every bucket boundary, prune D by the following rule: (e, f, Δ) is deleted if f + Δ ≤ bcurrent
• When the user requests a list of items with threshold s, output those entries in D where f ≥ (s − ε)N
Lossy Counting
• function prune(D, b)
    for each entry (e, f, Δ) in D do
        if f + Δ ≤ b then
            remove the entry from D
        endif
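A minimal runnable sketch of Lossy Counting for single items, combining the insertion, pruning, and output steps above; the class and variable names, and the per-element bucket-boundary check, are my own choices.

```python
import math

class LossyCounting:
    """A minimal sketch of Lossy Counting for single items."""

    def __init__(self, eps):
        self.eps = eps
        self.w = math.ceil(1 / eps)    # bucket width: ceil(1/eps) transactions
        self.D = {}                    # item -> [f, delta]
        self.n = 0                     # elements seen so far (N)

    def process(self, e):
        self.n += 1
        b_current = math.ceil(self.n / self.w)
        if e in self.D:
            self.D[e][0] += 1
        else:
            self.D[e] = [1, b_current - 1]
        if self.n % self.w == 0:       # bucket boundary: prune
            self.D = {item: fd for item, fd in self.D.items()
                      if fd[0] + fd[1] > b_current}

    def output(self, s):
        # Report entries whose estimated frequency is at least (s - eps) * N.
        return [item for item, (f, d) in self.D.items()
                if f >= (s - self.eps) * self.n]
```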
Lossy Counting (figure: frequency counts over the first window and the next window; D starts out empty, and at each window boundary the entries with f + Δ ≤ bcurrent are removed)
Lossy Counting • Lossy Counting guarantees that: • When a deletion occurs, bcurrent ≤ εN • An entry (e, f, Δ) is deleted only if fe ≤ bcurrent • fe: actual frequency count of e • Hence, if an entry (e, f, Δ) is deleted, fe ≤ εN • Finally, f ≤ fe ≤ f + εN
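Written out as a single chain of inequalities (a restatement of the bullets above, with w = ⌈1/ε⌉ and Δ fixed at insertion time):

```latex
\Delta \;\le\; b_{\mathrm{current}} - 1
       \;=\; \Bigl\lceil \tfrac{N}{w} \Bigr\rceil - 1
       \;<\; \tfrac{N}{w}
       \;\le\; \varepsilon N
\qquad\Longrightarrow\qquad
f \;\le\; f_e \;\le\; f + \Delta \;\le\; f + \varepsilon N .
```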
Sticky Sampling vs Lossy Counting • Sticky Sampling is non-deterministic, while Lossy Counting is deterministic • Experimental results show that Lossy Counting requires fewer entries than Sticky Sampling
Sticky Sampling vs Lossy Counting • Lossy Counting is superior by a large factor • Sticky Sampling performs worse because of its tendency to remember every unique element that gets sampled • Lossy Counting is good at pruning low-frequency elements quickly
The more complex case: finding frequent itemsets • The Lossy Counting algorithm is extended to find frequent itemsets • Transactions in the data stream contain sets of items
Finding frequent itemsets (figure: a stream of transactions)
Finding frequent itemsets • Input: a stream of transactions, each transaction being a set of items from I • N: length of the stream • User specifies two parameters: support s, error ε • Challenge: • handling variable-sized transactions • avoiding explicit enumeration of all subsets of any transaction
Finding frequent itemsets • Data structure D: a set of entries of the form (set, f, Δ) • set: a subset of items • Transactions are divided into buckets • w = ⌈1/ε⌉: # of transactions in each bucket • bcurrent: current bucket id
Finding frequent itemsets • Transactions are not processed one by one. Main memory is filled with as many transactions as possible, and processing is done on a batch of transactions. • β: # of buckets in main memory in the current batch being processed
Finding frequent itemsets • D’s operations : • UPDATE_SET updates and deletes in D • Entry (set, f, ) count occurrence of set in the batch and update the entry • If updated entry satisfies f + bcurrent, removed it from D • NEW_SET inserts new entries into D • If set set has frequency f in batch and set doesn’t occur in D, create a new entry (set, f, bcurrent-)
Finding frequent itemsets • If fset ≥ εN, then set has an entry in D • If (set, f, Δ) ∈ D, then the true frequency fset satisfies the inequality f ≤ fset ≤ f + Δ • When the user requests a list of itemsets with threshold s, output those entries in D where f ≥ (s − ε)N • β needs to be a large number: any subset of I that occurs β + 1 times or more contributes an entry to D
Buffer: repeatedly reads in a batch of buckets of transactions into available main memory • Trie: maintains the data structure D • SetGen: generates subsets of item-id’s along with their frequency counts in the current batch • Not all possible subsets need to be generated • If a subset S is not inserted into D after application of both UPDATE_SET and NEW_SET, then no supersets of S should be considered
Three modules: • BUFFER repeatedly reads in a batch of transactions into available main memory • TRIE maintains the data structure D and implements UPDATE_SET and NEW_SET • SUBSET-GEN operates on the current batch of transactions
Module 1 - Buffer (figure: windows 1 through 6 of transactions held in main memory) • Read a batch of transactions • Transactions are laid out one after the other in a big array • A bitmap is used to remember transaction boundaries • After reading in a batch, BUFFER sorts each transaction by its item-ids
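A tiny sketch of this layout (the flat array plus boundary bitmap); the packing function and the item-id values are illustrative.

```python
def pack_batch(transactions):
    """Pack a batch into one flat array plus a boundary bitmap (illustrative)."""
    flat, starts = [], []
    for t in transactions:
        starts.extend([1] + [0] * (len(t) - 1))  # 1 marks the start of a transaction
        flat.extend(sorted(t))                   # BUFFER sorts each transaction by item-id
    return flat, starts

flat, starts = pack_batch([[50, 40, 30], [31, 29], [45, 32, 42, 30]])
# flat   == [30, 40, 50, 29, 31, 30, 32, 42, 45]
# starts == [1, 0, 0, 1, 0, 1, 0, 0, 0]
```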
Module 2 - TRIE (figure: a forest of nodes labeled with item-ids such as 29, 30, 31, 32, 40, 42, 45, 50, representing sets with their frequency counts)
Module 2 – TRIE cont… • Nodes are labeled {item-id, f, Δ, level} • Children of any node are ordered by their item-ids • Root nodes are also ordered by their item-ids • A node represents the itemset consisting of the item-ids in that node and all its ancestors • The TRIE is maintained as an array of entries of the form {item-id, f, Δ, level}, in pre-order of the trees; this is equivalent to a lexicographic ordering of the subsets it encodes • No pointers: the levels compactly encode the underlying tree structure
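To illustrate the pointer-free array encoding, here is a made-up pre-order array for a small forest and a helper that recovers each node's itemset from the level fields alone; all values and names are purely illustrative.

```python
# Each entry is (item_id, f, delta, level); the `level` fields alone
# recover the tree shape, so no child pointers are stored.
trie = [
    (29, 4, 0, 0),   # root: {29}
    (31, 3, 0, 1),   #   child: {29, 31}
    (45, 2, 0, 2),   #     child: {29, 31, 45}
    (32, 3, 0, 1),   #   child: {29, 32}
    (40, 5, 0, 0),   # root: {40}
    (50, 4, 0, 1),   #   child: {40, 50}
]

def itemsets(trie):
    """Yield (itemset, f, delta) for every node, using only the level fields."""
    path = []
    for item, f, delta, level in trie:
        path[level:] = [item]      # truncate the path to `level` items, then append
        yield set(path), f, delta

for s, f, delta in itemsets(trie):
    print(sorted(s), f, delta)
```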
Module 3 - SetGen (figure: BUFFER feeding frequency counts of subsets in lexicographic order) SetGen uses the following pruning rule: if a subset S does not make its way into the TRIE after applying both UPDATE_SET and NEW_SET, then no supersets of S need to be considered
Overall Algorithm (figure: BUFFER feeds SUBSET-GEN, which together with the old TRIE produces the new TRIE)
Thoughts on the Paper • For frequent itemsets, the algorithm does scan each record many times • Sometimes we are only interested in recent data • Good comparison metrics: more efficient than traditional Apriori
Conclusion • Sticky Sampling and Lossy Counting are 2 approximate algorithms that can find frequent items • Both algorithms produce frequency counts within a user-specified error tolerance level, though Sticky Sampling is non-deterministic • Lossy Counting can be extended to find frequent itemsets