CONTOUR: an efficient algorithm for discovering discriminating subsequences Jianyong Wang, Yuzhou Zhang, Lizhu Zhou, George Karypis, Charu C. Aggarwal DMKD, Vol. 18, No. 1, 2009, pp. 1-29. Presenter : Wei-Shen Tai 2009/3/11
Outline • Introduction • Problem formulation • Efficiently mining summarization subsequences • Summarization subsequence based clustering • Empirical results • Conclusions • Comments
Motivation • Make frequent sequence mining more efficient • Mining the complete set of frequent subsequences is very time consuming for large sequence databases. • One way to obtain a subset of useful frequent subsequences is to run any existing frequent sequence mining algorithm and then select from its output, which inherits that cost.
Objective • Devise effective search space pruning methods • Find a summarization subsequence to represent each original input sequence.
Problem formulation (example: CABAC → BAC) • Subsequence • If sequence Sα is contained in sequence Sβ, Sα is called a subsequence of Sβ. • Absolute support of a sequence • The number of input sequences in SDB that contain Sα, denoted by sup_SDB(Sα). • Summarization subsequences • A set of representative subsequences that serves as a concise summarization of the input sequences. • Internal similarity of micro-cluster Cλ
Efficiently mining summarization subsequences • Frequent subsequence enumeration • For each prefix, the mining algorithm builds its projected database and computes the set of locally frequent events. • Example with min_sup = 2: SDB|AA = {C, A}, but neither event can be used to extend the prefix AA (to AAC or AAA).
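The prefix-projection enumeration described on this slide can be sketched as follows. This is a minimal illustration in the style of prefix-projected pattern growth, not the CONTOUR implementation itself. The function names (`project`, `locally_frequent`, `mine`) and the three-sequence toy database are assumptions made for the example, and sequences are modeled as plain strings of single-character events.

```python
from collections import Counter

def project(projected, event):
    """Projected database for prefix+event: for every sequence that
    contains `event`, keep the suffix after its first occurrence."""
    out = []
    for seq in projected:
        i = seq.find(event)
        if i != -1:
            out.append(seq[i + 1:])  # the suffix may be empty
    return out

def locally_frequent(projected, min_sup):
    """Events that occur in at least `min_sup` projected sequences."""
    counts = Counter()
    for seq in projected:
        counts.update(set(seq))  # count each event once per sequence
    return sorted(e for e, c in counts.items() if c >= min_sup)

def mine(sdb, min_sup):
    """Enumerate all frequent subsequences with their absolute support."""
    result = {}

    def grow(prefix, projected):
        for e in locally_frequent(projected, min_sup):
            new_projected = project(projected, e)
            # One projected sequence per supporting input sequence.
            result[prefix + e] = len(new_projected)
            grow(prefix + e, new_projected)

    grow("", list(sdb))
    return result

# Hypothetical toy database (not the one used in the slides).
toy_sdb = ["CABAC", "ABB", "BAC"]
frequent = mine(toy_sdb, min_sup=2)
```

A prefix is extended only by events that are locally frequent in its projected database, so a pattern such as `AA` never appears in `frequent` for this toy input.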
Closed sequence-based optimization • BackScan search space pruning • Semi-maximum period • A piece of an input sequence lying between the first instance and the last instance of a prefix P (for example, prefix BB). • First, second, ..., m-th semi-maximum periods • If an event A appears in each of the first semi-maximum periods of BB, then every sequence containing BB also contains ABB; ABB is the longer one, so BB need not be extended. • [Figure: semi-maximum periods of prefix BB in example sequences such as ABCBA, ABCB, and ACBB]
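A simplified sketch of the BackScan check may make the idea concrete. It assumes the usual closed-sequence-mining reading of the definition: the i-th semi-maximum period runs from the end of the first instance of the prefix's first i-1 events up to the i-th "last-in-last" position, found by matching the prefix backward from the end of the sequence. That reading, the helper names, and the grouping of the strings ABCBA, ABCB, and ACBB into one database are assumptions, not details confirmed by the slides.

```python
def first_instance(seq, prefix):
    """Leftmost match positions of the prefix events, or None if absent."""
    positions, start = [], 0
    for event in prefix:
        i = seq.find(event, start)
        if i == -1:
            return None
        positions.append(i)
        start = i + 1
    return positions

def last_in_last(seq, prefix):
    """Rightmost match positions, obtained by scanning backward."""
    positions, end = [], len(seq)
    for event in reversed(prefix):
        i = seq.rfind(event, 0, end)
        if i == -1:
            return None
        positions.append(i)
        end = i
    return positions[::-1]

def semi_maximum_periods(seq, prefix):
    """List of the 1st..n-th semi-maximum periods of `prefix` in `seq`."""
    first = first_instance(seq, prefix)
    if first is None:
        return None
    last = last_in_last(seq, prefix)
    periods = []
    for i in range(len(prefix)):
        start = 0 if i == 0 else first[i - 1] + 1
        periods.append(seq[start:last[i]])
    return periods

def backscan_prunable(sdb, prefix):
    """True if some event occurs in every i-th semi-maximum period
    (for some i) across all sequences that contain the prefix."""
    per_seq = [p for p in (semi_maximum_periods(s, prefix) for s in sdb)
               if p is not None]
    if not per_seq:
        return False
    for i in range(len(prefix)):
        if set.intersection(*(set(p[i]) for p in per_seq)):
            return True
    return False

example_sdb = ["ABCBA", "ABCB", "ACBB"]
prune_bb = backscan_prunable(example_sdb, "BB")
```

For prefix BB, event A falls in the first semi-maximum period of each example sequence, which matches the slide's observation that ABB exists wherever BB does.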
Unpromising projected sequence pruning • Current Frequent Covering Subsequence (CFCS) • For an input sequence Si, the frequent subsequence covered by Si that has the largest weight among those discovered so far. • Trivial projected sequence • A short projected sequence may not contain enough events to generate any summarization subsequence. • For example, prefix p = C:5 • SDB|p = {PS1 = ABAC, PS3 = B, PS4 = BAC, PS5 = BBA, PS6 = BC} • CFCS1 = ABA:3, CFCS3 = ABCB:2, CFCS4 = BAC:2, CFCS5 = ABA:3, and CFCS6 = ABCB:2.
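The trivial-projected-sequence test can be illustrated on the slide's own example. The sketch below assumes, purely for illustration, that the weight of a subsequence is its length; the slides mention per-event weights, so the actual criterion in the paper may differ. Under this assumption, a projected sequence is trivial when even using all of its events cannot produce a pattern heavier than the CFCS already recorded for that input sequence. The function name `is_trivial` is hypothetical.

```python
def is_trivial(prefix, projected_seq, cfcs):
    """Assumed criterion (weight = length): the longest pattern that
    can still be grown is prefix + every remaining event."""
    upper_bound = len(prefix) + len(projected_seq)
    return upper_bound <= len(cfcs)

# Values taken from the slide's example for prefix p = C.
prefix = "C"
projected = {"S1": "ABAC", "S3": "B", "S4": "BAC", "S5": "BBA", "S6": "BC"}
cfcs = {"S1": "ABA", "S3": "ABCB", "S4": "BAC", "S5": "ABA", "S6": "ABCB"}

trivial = sorted(
    sid for sid in projected
    if is_trivial(prefix, projected[sid], cfcs[sid])
)
```

Under the length-as-weight assumption, the short projected sequences PS3 and PS6 are the ones flagged, since neither can yield a pattern longer than the four-event CFCS ABCB.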
Further discussions • Event weight assignment • Similar in concept to TF-IDF weighting • Multiple summarization subsequence mining • An input sequence may support multiple summarization subsequences.
Summarization subsequence based clustering • Micro-cluster generation • Input sequences with the same summarization subsequence are grouped together. • Macro-cluster creation • An agglomerative hierarchical clustering paradigm is used to create K macro-clusters. • [Figure: example micro-clusters labeled ABA, ABCB, CBAC]
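The two clustering stages can be sketched as below. Micro-clusters are built by grouping sequence ids on their summarization subsequence; macro-clusters are then formed by repeatedly merging the most similar pair until K remain. The similarity used here (`difflib.SequenceMatcher` ratio with average linkage) is a stand-in chosen for the example, not the internal-similarity measure from the paper, and the id-to-subsequence mapping reuses the CFCS values shown earlier only as sample input.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def micro_clusters(summaries):
    """Group sequence ids that share the same summarization subsequence."""
    groups = defaultdict(list)
    for seq_id, subsequence in summaries.items():
        groups[subsequence].append(seq_id)
    return dict(groups)

def label_similarity(a, b):
    """Stand-in similarity between two subsequences (an assumption)."""
    return SequenceMatcher(None, a, b).ratio()

def macro_clusters(micro, k, sim=label_similarity):
    """Agglomerative merging of micro-cluster labels down to k clusters."""
    clusters = [[label] for label in sorted(micro)]

    def avg_sim(c1, c2):
        # Average linkage over all label pairs.
        return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda pair: avg_sim(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe: combinations guarantees i < j
    return clusters

# Sample input: sequence id -> summarization subsequence (illustrative).
summaries = {"S1": "ABA", "S3": "ABCB", "S4": "BAC", "S5": "ABA", "S6": "ABCB"}
micro = micro_clusters(summaries)
macro = macro_clusters(micro, k=2)
```

Passing a different `sim` function is the natural place to plug in a similarity that accounts for event weights or cluster sizes.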
Conclusions • CONTOUR • A set of summarization subsequences is a concise representation of the original sequence database. • It preserves much structural information, and can be used to efficiently cluster the input sequences with a high clustering quality.
Comments • Advantage • This method provides a more concise representation of the original sequences than feature selection methods. • The summarization subsequences can be used efficiently with most conventional sequence mining methods. • Drawback • In Equations 1 and 2, the internal similarity is computed for a single summarization subsequence, so these equations may not be suitable when an input sequence has multiple summarization subsequences. • Application • Sequential pattern mining and clustering.