
CONTOUR: an efficient algorithm for discovering discriminating subsequences

CONTOUR: an efficient algorithm for discovering discriminating subsequences. Jianyong Wang, Yuzhou Zhang, Lizhu Zhou, George Karypis, Charu C. Aggarwal. DMKD, Vol. 18, No. 1, 2009, pp. 1-29. Presenter: Wei-Shen Tai, 2009/3/11. Outline. Introduction Problem formulation


Presentation Transcript


  1. CONTOUR: an efficient algorithm for discovering discriminating subsequences Jianyong Wang, Yuzhou Zhang, Lizhu Zhou, George Karypis, Charu C. Aggarwal DMKD, Vol. 18, No. 1, 2009, pp. 1-29. Presenter : Wei-Shen Tai 2009/3/11

  2. Outline • Introduction • Problem formulation • Efficiently mining summarization subsequences • Summarization subsequence based clustering • Empirical results • Conclusions • Comments

  3. Motivation • Make frequent sequence mining more efficient • Mining the complete set of frequent subsequences is very time consuming for large sequence databases. • One way to obtain a subset of useful frequent subsequences is to apply any existing frequent sequence mining algorithm and then select from its output.

  4. Objective • Effective search space pruning methods • Finding summarization subsequences that represent the original input sequences.

  5. Problem formulation • Subsequence • If sequence Sα is contained in sequence Sβ, Sα is called a subsequence of Sβ (for example, BAC is a subsequence of CABAC). • Absolute support of a sequence • The number of input sequences in SDB that contain Sα, denoted by supSDB(Sα). • Summarization subsequences • A set of representative subsequences that serves as a concise summarization of the input sequences. • Internal similarity of micro-cluster Cλ
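
The two definitions on this slide (subsequence containment and absolute support) can be sketched in a few lines of Python. This is only an illustration: sequences are modelled as plain strings with one character per event, and `toy_sdb` is a hypothetical database, not the one used in the paper.

```python
def is_subsequence(sub, seq):
    """Return True if the events of `sub` occur in `seq` in order (gaps allowed)."""
    remaining = iter(seq)
    # `event in remaining` consumes the iterator, so matches must go left to right.
    return all(event in remaining for event in sub)


def absolute_support(sub, sdb):
    """Count the input sequences in `sdb` that contain `sub`."""
    return sum(1 for seq in sdb if is_subsequence(sub, seq))


# Hypothetical toy database (an assumption made for this example).
toy_sdb = ["CABAC", "ABB", "CBAC"]

print(is_subsequence("BAC", "CABAC"))    # BAC is contained in CABAC
print(absolute_support("BAC", toy_sdb))  # number of sequences containing BAC
```

A subsequence is frequent when its absolute support reaches the user-chosen threshold min_sup.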

  6. Efficiently mining summarization subsequences • Frequent subsequence enumeration • For each prefix, the mining algorithm builds its projected database and computes the set of locally frequent events. • Example with min_sup = 2: the events in SDB|AA are {C, A}, but they cannot be used to extend the prefix AA (to AAC or AAA).
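
A minimal sketch of the enumeration step, assuming the usual prefix-projection scheme in which the projected sequence is whatever follows the leftmost match of the prefix. The function names and the toy database are hypothetical and chosen for illustration only.

```python
from collections import Counter


def project(prefix, seq):
    """Return the part of `seq` after the leftmost match of `prefix`, or None."""
    pos = 0
    for event in prefix:
        idx = seq.find(event, pos)
        if idx == -1:
            return None  # `seq` does not contain `prefix`
        pos = idx + 1
    return seq[pos:]


def projected_database(prefix, sdb):
    """Projected sequences of every input sequence that contains `prefix`."""
    projections = (project(prefix, seq) for seq in sdb)
    return [p for p in projections if p is not None]


def locally_frequent_events(prefix, sdb, min_sup):
    """Events that occur in at least `min_sup` projected sequences."""
    counts = Counter()
    for projected in projected_database(prefix, sdb):
        counts.update(set(projected))  # count each event once per sequence
    return sorted(e for e, c in counts.items() if c >= min_sup)


# Hypothetical toy database (an assumption made for this example).
toy_sdb = ["CABAC", "ABCB", "CBAC"]

print(projected_database("C", toy_sdb))
print(locally_frequent_events("C", toy_sdb, min_sup=2))
print(locally_frequent_events("AA", toy_sdb, min_sup=2))
```

In this toy database the projected database of AA still contains an event, yet no event is locally frequent, which is one way a prefix can fail to be extended.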

  7. Closed sequence-based optimization • BackScan search space pruning • Semi-maximum period • A subsequence between the first instance and the last instance of subsequence P (for example, prefix BB). • First, second, ..., m-th semi-maximum periods • If an event A appears in each of the first semi-maximum periods of BB, then ABB occurs wherever BB occurs. ABB is the longer one, so BB does not need to be extended. (Figure: example sequences ABCBA, ABCB, ACBB.)
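
The slide defines semi-maximum periods only informally, so the sketch below rests on an assumption: it uses the "last-in-first appearance" formulation commonly given for BackScan in closed sequence mining, where the i-th period runs from the end of the first instance of the first i-1 prefix events up to the i-th last-in-first appearance. All function names are hypothetical, and the input strings are the ones that appear on the slide.

```python
def first_instance_positions(prefix, seq):
    """Leftmost match positions of `prefix` in `seq`, or None if not contained."""
    positions, start = [], 0
    for event in prefix:
        idx = seq.find(event, start)
        if idx == -1:
            return None
        positions.append(idx)
        start = idx + 1
    return positions


def semi_maximum_periods(prefix, seq):
    """Return the list of semi-maximum periods of `prefix` in `seq` (assumed definition)."""
    if not prefix:
        return []
    first = first_instance_positions(prefix, seq)
    if first is None:
        return None
    # Last-in-first appearances, found right to left inside the first instance.
    last_in_first = [0] * len(prefix)
    bound = first[-1] + 1
    for i in range(len(prefix) - 1, -1, -1):
        last_in_first[i] = seq.rfind(prefix[i], 0, bound)
        bound = last_in_first[i]
    # The i-th period starts right after the first instance of prefix[:i].
    starts = [0] + [p + 1 for p in first[:-1]]
    return [seq[s:e] for s, e in zip(starts, last_in_first)]


def backscan_prunable(prefix, sdb):
    """Return (period index, shared events) if some event occurs in the same
    semi-maximum period of every sequence containing `prefix`, else None."""
    periods = [p for p in (semi_maximum_periods(prefix, s) for s in sdb)
               if p is not None]
    if not periods:
        return None
    for i in range(len(prefix)):
        common = set.intersection(*(set(p[i]) for p in periods))
        if common:
            return i + 1, sorted(common)
    return None


sequences = ["ABCBA", "ABCB", "ACBB"]  # strings shown on the slide
print([semi_maximum_periods("BB", s) for s in sequences])
print(backscan_prunable("BB", sequences))
```

Under this reading, A lies in the first semi-maximum period of BB in all three sequences, which matches the slide's remark that ABB accompanies every occurrence of BB.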

  8. Unpromising projected sequence pruning • Current Frequent Covering Subsequence (CFCS) • For an input sequence Si, the frequent subsequence covering Si that has the largest weight among those discovered so far. • Trivial projected sequence • A short projected sequence may not contain enough events to generate any summarization subsequence. • For example, prefix p = C:5 • SDB|p = {PS1 = ABAC, PS3 = B, PS4 = BAC, PS5 = BBA, PS6 = BC} • CFCS1 = ABA:3, CFCS3 = ABCB:2, CFCS4 = BAC:2, CFCS5 = ABA:3, and CFCS6 = ABCB:2.
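
One way to read the trivial projected sequence idea is as an upper-bound test. The sketch below is an assumption, not the paper's exact criterion: it gives every event unit weight unless a weight table is supplied, and it calls a projected sequence trivial when the prefix plus the whole projected sequence cannot outweigh the CFCS already recorded for that input sequence. The function names are hypothetical; the data are the values listed on the slide.

```python
def weight(seq, event_weight=None):
    """Total weight of a sequence; unit weight per event if no table is given."""
    if event_weight is None:
        return len(seq)
    return sum(event_weight[event] for event in seq)


def trivial_projected_sequences(prefix, projected, cfcs, event_weight=None):
    """Return the ids whose projected sequence cannot beat their current CFCS.

    Assumed test: weight(prefix) + weight(projected) is an upper bound on the
    weight of any pattern grown from `prefix` inside that sequence.
    """
    trivial = []
    for seq_id, projected_seq in projected.items():
        upper_bound = weight(prefix, event_weight) + weight(projected_seq, event_weight)
        if upper_bound <= weight(cfcs[seq_id], event_weight):
            trivial.append(seq_id)
    return sorted(trivial)


# Values from the slide: projected database of prefix C and the current CFCSs.
projected = {1: "ABAC", 3: "B", 4: "BAC", 5: "BBA", 6: "BC"}
cfcs = {1: "ABA", 3: "ABCB", 4: "BAC", 5: "ABA", 6: "ABCB"}

print(trivial_projected_sequences("C", projected, cfcs))
```

With unit weights, PS3 and PS6 are flagged because even C followed by the entire projected sequence is shorter than ABCB. Whether ties should count as trivial is a design choice made here for illustration.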

  9. Further discussions • Event weight assignment • It is similar to the TF-IDF concept. • Multiple summarization subsequence mining • An input sequence may support multiple summarization subsequences.

  10. Summarization subsequence based clustering • Micro-cluster generation • Input sequences with the same summarization subsequence are grouped together. • Macro-cluster creation • Micro-clusters are merged using the agglomerative hierarchical clustering paradigm to create K macro-clusters. (Figure labels: ABA, ABCB, CBAC.)
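
Micro-cluster generation amounts to grouping input sequences by their summarization subsequence, which can be sketched as follows. The assignment of summarization subsequences to sequence ids is hypothetical (it reuses the CFCS values from slide 8 purely as sample data), and the macro-cluster merge step is left out because it depends on the similarity equations that are not reproduced in this transcript.

```python
from collections import defaultdict


def micro_clusters(summary_of):
    """Group sequence ids that share the same summarization subsequence."""
    clusters = defaultdict(list)
    for seq_id, summary in summary_of.items():
        clusters[summary].append(seq_id)
    # Sort members so the result does not depend on input order.
    return {summary: sorted(ids) for summary, ids in clusters.items()}


# Hypothetical assignment: sequence id -> summarization subsequence.
summary_of = {1: "ABA", 3: "ABCB", 4: "BAC", 5: "ABA", 6: "ABCB"}

print(micro_clusters(summary_of))
```

Each key of the resulting dictionary labels one micro-cluster, and these micro-clusters are the starting points for the agglomerative merge into K macro-clusters.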

  11. Empirical results

  12. Conclusions • CONTOUR • A set of summarization subsequences is a concise representation of the original sequence database. • It preserves much structural information, and can be used to efficiently cluster the input sequences with a high clustering quality.

  13. Comments • Advantage • This method provides a more concise representation of the original sequences than feature selection methods. • The summarization subsequences can be used efficiently in most conventional sequence mining methods. • Drawback • In Equations 1 and 2, the internal similarity is computed for a single summarization subsequence, so these equations may not suit the case where an input sequence has multiple summarization subsequences. • Application • Sequence pattern mining and clustering.
