Approximate Mining of Consensus Sequential Patterns Hye-Chung (Monica) Kum University of North Carolina, Chapel Hill Computer Science Department School of Social Work http://www.cs.unc.edu/~kum/approxMAP
Knowledge Discovery & Data mining (KDD) • "The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" • The goal is to discover and present knowledge in a form that is easily comprehensible to humans, in a timely manner • combining ideas drawn from databases, machine learning, artificial intelligence, knowledge-based systems, information retrieval, statistics, pattern recognition, visualization, and parallel and distributed computing • Fayyad, Piatetsky-Shapiro, Smyth 1996
What is KDD ? • Purpose • Extract useful information • Source • Operational or Administrative Data • Example • VIC card database for buying patterns • monthly welfare service patterns
Example • Analyze buying patterns for sales marketing
Example • VIC card : 4/8 = 50%
Example • VIC card : 5/8=63%
Overview • What is KDD (Knowledge Discovery & Data mining) • Problem : Sequential Pattern Mining • Method : ApproxMAP • Evaluation Method • Results • Case Study • Conclusion
Sequential Pattern Mining • Detecting patterns in sequences of sets
Welfare Program Participation Patterns • What are the common participation patterns ? • What are the variations to them ? • How do different policies affect these patterns?
Thesis Statement • The author of this dissertation asserts that multiple alignment is an effective model to uncover the underlying trend in sequences of sets. • I will show that approxMAP • is a novel method that applies multiple alignment techniques to sequences of sets, • effectively extracts the underlying trend in the data • by organizing the large database into clusters • and by giving reasonable descriptors (weighted sequences and consensus sequences) for the clusters via multiple alignment • Furthermore, I will show that approxMAP • is robust to its input parameters, • is robust to noise and outliers in the data, • is scalable with respect to the size of the database, • and, in comparison to the conventional support model, can better recover the underlying patterns with little confounding information under most circumstances. • In addition, I will demonstrate the usefulness of approxMAP using real world data.
Thesis Statement • Multiple alignment is an effective model to uncover the underlying trend in sequences of sets. • ApproxMAP is a novel method to apply multiple alignment techniques to sequences of sets. • ApproxMAP can recover the underlying patterns with little confounding information under most circumstances including those in which the conventional methods fail. • I will demonstrate the usefulness of approxMAP using real world data.
Sequential Pattern Mining • Detecting patterns in sequences of sets • Nseq: Total # of sequences in the Database • Lseq: Avg # of itemsets in a sequence • Iseq : Avg # of items in an itemset • Lseq * Iseq : Avg length of a sequence
Conventional Methods : Support Model • Super-sequence: (A,B,D)(B)(C,D)(B,C) Sub-sequence: (A)(B)(C,D) • Support(P): # of super-sequences of P in D • Given D and a user threshold min_sup, • find the complete set of P s.t. Support(P) ≥ min_sup • Methods • Breadth first – Apriori Principle (GSP) • R. Agrawal and R. Srikant : ICDE 95 & EDBT 96 • Depth first – pattern growth (PrefixSpan) • J. Han and J. Pei : SIGKDD 2000 & ICDE 2001
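The containment test behind the support model can be sketched as follows. This is a minimal illustration, not the optimized counting used by GSP or PrefixSpan: a pattern is supported by a sequence when each itemset of the pattern is a subset of some itemset of the sequence, in order. The example database mirrors the slide's sequences; the helper names are my own.

```python
def is_subsequence(pattern, seq):
    """True if each itemset of `pattern` is a subset of some
    itemset of `seq`, in left-to-right order."""
    j = 0
    for itemset in pattern:
        # advance until an itemset of seq contains this pattern itemset
        while j < len(seq) and not itemset <= seq[j]:
            j += 1
        if j == len(seq):
            return False
        j += 1
    return True

def support(pattern, db):
    """# of sequences in db that are super-sequences of pattern."""
    return sum(is_subsequence(pattern, s) for s in db)

db = [
    [{"A", "B", "D"}, {"B"}, {"C", "D"}, {"B", "C"}],  # super-sequence from the slide
    [{"A"}, {"B"}, {"C", "D"}],                        # sub-sequence from the slide
]
p = [{"A"}, {"B"}, {"C", "D"}]
```

Both sequences in `db` are super-sequences of `p`, so `support(p, db)` is 2; with min_sup = 2 the pattern would be reported.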
Example: Support Model • {Dp, Br} {Mk, Dp} {Mk, Dp, Br} : 2/3 = 67% • 2^L − 1 = 2^7 − 1 = 128 − 1 = 127 subsequences • {Br} {Mk, Dp} {Mk, Dp, Br} • {Dp} {Mk, Dp} {Mk, Dp, Br} • {Dp, Br} {Dp} {Mk, Dp, Br} • {Dp, Br} {Mk} {Mk, Dp, Br} • {Dp, Br} {Mk, Dp} {Dp, Br} • {Dp, Br} {Mk, Dp} {Mk, Br} • {Dp, Br} {Mk, Dp} {Mk, Dp} • {Mk, Dp} {Mk, Dp, Br} • {Dp, Br} {Mk, Dp, Br} • … etc …
Inherent Problems : the model • Support • cannot distinguish between statistically significant patterns and random occurrences • Theoretically • Short random sequences occur often in long sequential data simply by chance • Empirically • # of spurious patterns grows exponentially w.r.t. Lseq
Inherent Problems : exact match • A pattern gets support from a sequence only if • the pattern is exactly contained in the sequence • Often fails to find general long patterns • Example • many customers may share similar buying habits • few follow exactly the same pattern
Inherent Problems : Complete set • Mines the complete set • Too many trivial patterns • Given long sequences with noise • too expensive and too many patterns • 2^L − 1 = 2^10 − 1 = 1023 • Finding max / closed sequential patterns • is non-trivial • In a noisy environment, still too many max / closed patterns
Possible Models • Support model • patterns in sets • unordered lists of items • Multiple alignment model • finds common patterns among strings • simple ordered lists of characters
Multiple Alignment • line up the sequences to detect the trend • Find common patterns among strings • DNA / bio sequences
Edit Distance • Pairwise Score (edit distance): dist(seq1, seq2) • Minimum # of ops required to change seq1 to seq2 • Ops = INDEL(X) and/or REPLACE(X, Y) • Recurrence relation • D(i, j) = min{ D(i−1, j) + INDEL, D(i, j−1) + INDEL, D(i−1, j−1) + REPLACE(Xi, Yj) } • Multiple Alignment Score • ∑ PS(seqi, seqj) over all pairs (1 ≤ i < j ≤ N) • Optimal alignment : minimum score
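The pairwise score above is the standard dynamic-programming edit distance, applied to sequences of itemsets. A minimal sketch, assuming INDEL cost 1 and the normalized set difference (defined later in the talk) as the REPLACE cost; function names are illustrative:

```python
def repl(x, y):
    """REPLACE cost: normalized set difference between two itemsets."""
    return (len(x - y) + len(y - x)) / (len(x) + len(y))

def dist(seq1, seq2):
    """Edit distance between two sequences of itemsets via the
    recurrence D(i,j) = min(D(i-1,j)+1, D(i,j-1)+1,
                            D(i-1,j-1)+repl(X_i, Y_j))."""
    n, m = len(seq1), len(seq2)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i          # delete all itemsets of seq1[:i]
    for j in range(1, m + 1):
        D[0][j] = j          # insert all itemsets of seq2[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + 1,                                   # INDEL
                D[i][j - 1] + 1,                                   # INDEL
                D[i - 1][j - 1] + repl(seq1[i - 1], seq2[j - 1]),  # REPLACE
            )
    return D[n][m]
```

For example, `dist([{"A","B"}], [{"A"}])` is a single REPLACE of cost 1/3, while deleting a whole itemset costs a full 1.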
Consensus Sequence • Weighted Sequence : • compression of aligned sequences into one sequence • strength(i, j) = (# of occurrences of item i in position j) / (total # of sequences) • A : 3/3 = 100% • E : 1/3 = 33% • H : 1/3 = 33%
Consensus Sequence • Weighted Sequence : • compression of aligned sequences into one sequence • strength(i, j) = (# of occurrences of item i in position j) / (total # of sequences) • Consensus itemset (j) : min_strength = 2 • { ia | ia ∈ I and strength(ia, j) ≥ min_strength } • Consensus sequence : • concatenation of the consensus itemsets
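The compression step can be sketched as follows. This assumes the multiple alignment has already been computed (ApproxMAP's actual job); here the aligned sequences are given directly, with gaps represented as empty sets, and the example data is purely illustrative. As in the slide's example, `min_strength` is applied as a count threshold:

```python
from collections import Counter

def consensus(aligned, min_strength=2):
    """Compress pre-aligned sequences (equal length, gaps = empty
    sets) into a weighted sequence of per-position item counts,
    then keep items whose count meets min_strength; empty
    consensus itemsets are dropped."""
    length = len(aligned[0])
    weighted = [Counter() for _ in range(length)]
    for seq in aligned:
        for j, itemset in enumerate(seq):
            weighted[j].update(itemset)
    result = []
    for counts in weighted:
        itemset = {item for item, c in counts.items() if c >= min_strength}
        if itemset:
            result.append(itemset)
    return result

aligned = [
    [{"A"},      {"B", "C"}, set()],   # gap in position 3
    [{"A"},      {"B"},      {"D"}],
    [{"A", "E"}, {"B"},      {"D"}],
]
```

Here A has strength 3/3 in position 1 while E has only 1/3, so with `min_strength=2` the consensus sequence is (A)(B)(D).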
Multiple Alignment Sequential Pattern Mining • Given • N sequences of sets, • Op costs (INDEL & REPLACE) for itemsets, and • Strength thresholds for consensus sequences • To (1) partition the N sequences into K sets of sequences such that the sum of the K multiple alignment scores is minimized, (2) find the multiple alignment for each partition, and (3) find the pattern consensus sequence and the variation consensus sequence for each partition
Overview • What is KDD (Knowledge Discovery & Data mining) • Problem : Sequential Pattern Mining • Method : ApproxMAP • Evaluation Method • Results • Case Study • Conclusion
ApproxMAP (Approximate Multiple Alignment Pattern mining) • Exact solution : Too expensive! • Approximation Method : ApproxMAP • Organize into K partitions • Use clustering • Compress each partition into • a weighted sequence • Summarize each partition into • a pattern consensus sequence • a variation consensus sequence
Tasks • Op costs (INDEL & REPLACE) for itemsets • Organize into K partitions • Use clustering • Compress each partition into • weighted sequences • Summarize each partition into • Pattern consensus sequence • Variation consensus sequence
Op costs for itemsets • Normalized set difference • R(X,Y) = (|X−Y| + |Y−X|) / (|X| + |Y|) • 0 ≤ R ≤ 1, and R is a metric • INDEL(X) = R(X, ∅) = 1
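The normalized set difference is direct to compute with Python's set operations; a minimal sketch (the guard for two empty itemsets is my own addition to avoid division by zero):

```python
def R(X, Y):
    """Normalized set difference between itemsets:
    0 when X == Y, 1 when X and Y are disjoint."""
    if not X and not Y:
        return 0.0  # convention: two empty itemsets are identical
    return (len(X - Y) + len(Y - X)) / (len(X) + len(Y))
```

For example, R({A,B}, {A,C}) = (1 + 1) / (2 + 2) = 0.5, and R(X, ∅) = 1 for any non-empty X, which is why the INDEL cost equals 1.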