
7. Sequence Mining



Presentation Transcript


  1. 7. Sequence Mining
  • Sequences and Strings
  • Recognition with Strings
  • MM & HMM
  • Sequence Association Rules
  Data Mining – Sequences, H. Liu (ASU) & G. Dong (WSU)

  2. Sequences and Strings
  • A sequence x is an ordered list of discrete items, such as a sequence of letters or a gene sequence
  • Sequences and strings are often used as synonyms
  • String elements (characters, letters, or symbols) are nominal
  • Text is a type of particularly long string
  • |x| denotes the length of sequence x: |AGCTTC| is 6
  • Any contiguous string that is part of x is called a substring, segment, or factor of x: GCT is a factor of AGCTTC
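The notions above map directly onto Python strings; a minimal sketch (the variable names are illustrative):

```python
x = "AGCTTC"

# |x|: the length of the sequence
assert len(x) == 6

# GCT is a factor (contiguous substring) of x; Python's `in` tests exactly this
assert "GCT" in x

# enumerate all factors of length 3
factors3 = [x[i:i + 3] for i in range(len(x) - 2)]
print(factors3)  # ['AGC', 'GCT', 'CTT', 'TTC']
```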

  3. Recognition with Strings
  • String matching: given x and text, determine whether x is a factor of text
  • Edit distance (for inexact string matching): given two strings x and y, compute the minimum number of basic operations (character insertions, deletions, and exchanges) needed to transform x into y

  4. String Matching
  • Given |text| >> |x|, with characters taken from an alphabet A
  • A can be {0, 1}, {0, 1, 2, …, 9}, {A, G, C, T}, or {A, B, …}
  • A shift s is an offset needed to align the first character of x with character number s+1 in text
  • Determine whether there exists a valid shift at which every character of x matches the corresponding character of text

  5. Naïve (Brute-Force) String Matching
  • Given A, x, text, n = |text|, m = |x|
    s = 0
    while s ≤ n − m
        if x[1 … m] = text[s+1 … s+m]
            then print "pattern occurs at shift" s
        s = s + 1
  • Time complexity (worst case): O((n − m + 1) m)
  • Shifting one character at a time is not necessary
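The brute-force pseudocode above can be sketched as runnable Python (0-indexed, so shift s directly indexes text; function name is illustrative):

```python
def naive_match(x, text):
    """Return every valid shift s at which pattern x occurs in text."""
    n, m = len(text), len(x)
    shifts = []
    for s in range(n - m + 1):     # try every alignment, one shift at a time
        if text[s:s + m] == x:     # O(m) character comparison per shift
            shifts.append(s)
    return shifts

print(naive_match("AGCT", "TTAGCTAAGCT"))  # [2, 7]
```

Each of the n − m + 1 shifts costs up to m comparisons, giving the O((n − m + 1) m) worst case stated above.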

  6. Boyer-Moore and KMP
  • See StringMatching.ppt rather than the following algorithm
  • Given A, x, text, n = |text|, m = |x|
    F(x) = last-occurrence function
    G(x) = good-suffix function
    s = 0
    while s ≤ n − m
        j = m
        while j > 0 and x[j] = text[s+j]
            j = j − 1
        if j = 0
            then print "pattern occurs at shift" s
            s = s + G(0)
        else s = s + max[G(j), j − F(text[s+j])]
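As a concrete illustration of the last-occurrence idea, here is a sketch of the Boyer-Moore-Horspool variant, which uses only the bad-character shift and omits the good-suffix function G of full Boyer-Moore (function name is illustrative):

```python
def horspool_match(x, text):
    """Boyer-Moore-Horspool: bad-character shifts only (a simplification
    of full Boyer-Moore, which also applies the good-suffix rule)."""
    n, m = len(text), len(x)
    # shift table: distance from each character's last occurrence
    # (excluding the final position of x) to the end of x
    shift = {c: m - i - 1 for i, c in enumerate(x[:-1])}
    hits, s = [], 0
    while s <= n - m:
        if text[s:s + m] == x:
            hits.append(s)
        # jump based on the text character aligned with the last pattern position
        s += shift.get(text[s + m - 1], m)
    return hits

print(horspool_match("AGCT", "TTAGCTAAGCT"))  # [2, 7]
```

Unlike the naïve algorithm, a mismatch can advance the shift by up to m positions at once.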

  7. Edit Distance
  • The edit distance (ED) between x and y is the number of fundamental operations required to transform x into y
  • Fundamental operations (x = 'excused', y = 'exhausted'):
  • Substitution, e.g. 'c' is replaced by 'h'
  • Insertion, e.g. 'a' is inserted into x after 'h'
  • Deletion, e.g. a character of x is deleted
  • ED is one way of measuring the similarity between two strings
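The minimum number of such operations can be computed by dynamic programming; a sketch of the standard Levenshtein recurrence (function name is illustrative):

```python
def edit_distance(x, y):
    """Minimum substitutions, insertions, and deletions to turn x into y."""
    m, n = len(x), len(y)
    # d[i][j] = edit distance between the prefixes x[:i] and y[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                               # delete all of x[:i]
    for j in range(n + 1):
        d[0][j] = j                               # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution (or match)
    return d[m][n]

print(edit_distance("excused", "exhausted"))  # 3
```

For the slide's example, the distance is 3: substitute 'c' with 'h', insert 'a', and insert 't'.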

  8. Classification Using ED
  • The nearest-neighbor algorithm can be applied for pattern recognition
  • Training: strings are stored together with their class labels
  • Classification (testing): a test string is compared to each stored string and an ED is computed; the nearest stored string's label is assigned to the test string
  • The key is how to calculate ED
  • An example of calculating ED
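The nearest-neighbor procedure above can be sketched as follows; the training data and labels are hypothetical, invented for illustration:

```python
def edit_distance(x, y):
    """Levenshtein distance via dynamic programming (row-by-row)."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (cx != cy)))  # substitution / match
        prev = cur
    return prev[-1]

def nn_classify(test, training):
    """Assign the label of the stored string nearest to `test` in ED."""
    return min(training, key=lambda sc: edit_distance(test, sc[0]))[1]

# hypothetical training set: (string, class label) pairs
training = [("AGCTTC", "gene-A"), ("TTTGGC", "gene-B"), ("AGGTTC", "gene-A")]
print(nn_classify("AGCTTG", training))  # gene-A
```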

  9. Hidden Markov Model
  • Markov model: transition states
  • Hidden Markov model: additional visible states
  • Evaluation
  • Decoding
  • Learning

  10. Markov Model
  • The Markov property: given the current state, the transition probability is independent of any previous states
  • A simple Markov model:
  • State ω(t) at time t
  • Sequence of length T: ωT = {ω(1), ω(2), …, ω(T)}
  • Transition probability: P(ωj(t+1) | ωi(t)) = aij
  • It is not required that aij = aji
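A short simulation illustrates the Markov property: the next state is drawn using only the current state's row of the transition matrix. The three-state matrix below is hypothetical, chosen only so each row sums to 1:

```python
import random

# hypothetical transition matrix: A[i][j] = a_ij = P(state j at t+1 | state i at t)
# note a_ij need not equal a_ji (the matrix is not symmetric)
A = [[0.7, 0.2, 0.1],
     [0.3, 0.5, 0.2],
     [0.4, 0.1, 0.5]]

def simulate(A, start, T, rng=random.Random(0)):
    """Generate a state sequence ω(1..T); each step depends only on the
    current state, never on earlier history."""
    seq = [start]
    for _ in range(T - 1):
        state = seq[-1]
        # one draw from the categorical distribution in row `state`
        seq.append(rng.choices(range(len(A)), weights=A[state])[0])
    return seq

print(simulate(A, start=0, T=10))
```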

  11. Hidden Markov Model
  • Visible states: VT = {v(1), v(2), …, v(T)}
  • Emitting a visible state vk(t): P(vk(t) | ωj(t)) = bjk
  • Only the visible states vk(t) are accessible; the states ωi(t) are unobservable
  • A Markov model is ergodic if every state has a nonzero probability of occurring given some starting state

  12. Three Key Issues with HMMs
  • Evaluation: given an HMM, complete with transition probabilities aij and bjk, determine the probability that a particular sequence of visible states VT was generated by that model
  • Decoding: given an HMM and a set of observations VT, determine the most likely sequence of hidden states ωT that led to VT
  • Learning: given the number of hidden and visible states and a set of training observations of visible symbols, determine the probabilities aij and bjk
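The evaluation problem is solved by the forward algorithm, which sums over all hidden paths in O(c²T) instead of enumerating them. A sketch, using a hypothetical 2-state, 2-symbol model (all numbers invented for illustration):

```python
def forward(A, B, pi, obs):
    """Forward algorithm: P(observed sequence | model).
    A[i][j] = a_ij (transitions), B[j][k] = b_jk (emissions),
    pi = initial state distribution, obs = visible-symbol indices."""
    n = len(A)
    # alpha[j] = P(observations so far, hidden state j at current time)
    alpha = [pi[j] * B[j][obs[0]] for j in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

# hypothetical toy HMM
A = [[0.6, 0.4], [0.5, 0.5]]    # transition probabilities a_ij
B = [[0.9, 0.1], [0.2, 0.8]]    # emission probabilities b_jk
pi = [0.5, 0.5]                 # initial distribution
print(forward(A, B, pi, [0, 1, 0]))  # ≈ 0.12104
```

Decoding replaces the sum over previous states with a max (the Viterbi algorithm); learning iterates forward and backward passes (Baum-Welch).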

  13. Other Sequential Pattern Mining Problems
  • Sequence alignment (homology) and sequence assembly (genome sequencing)
  • Trend analysis: trend movement vs. cyclic variations, seasonal variations, and random fluctuations
  • Sequential pattern mining: various kinds of sequences (e.g. web logs); various methods, from GSP to PrefixSpan
  • Periodicity analysis: full periodicity, partial periodicity, cyclic association rules

  14. Periodic Patterns
  • Full periodic pattern: ABCABCABC
  • Partial periodic pattern: ABC ADC ACC ABC
  • Pattern hierarchy: ABC ABC ABC DE DE DE DE, ABC ABC ABC DE DE DE DE, ABC ABC ABC DE DE DE DE, i.e. sequences of transactions [ABC:3 | DE:4]
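The distinction between full and partial periodicity can be sketched with simple positional checks (function name is illustrative):

```python
def full_period(seq, p):
    """True if seq is fully periodic with period p: every symbol equals
    the symbol p positions earlier."""
    return all(seq[i] == seq[i - p] for i in range(p, len(seq)))

print(full_period("ABCABCABC", 3))  # True

# Partial periodicity: only some positions within the period are fixed.
# In ABC ADC ACC ABC, position 0 is always 'A' and position 2 is always 'C',
# while position 1 varies, so the partial pattern A*C holds with period 3.
seq = "ABCADCACCABC"
print(all(seq[i] == 'A' for i in range(0, len(seq), 3)))  # True
print(all(seq[i] == 'C' for i in range(2, len(seq), 3)))  # True
```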

  15. Sequence Association Rule Mining
  • SPADE (Sequential Pattern Discovery using Equivalence classes)
  • Constrained sequence mining (SPIRIT)

  16. Bibliography
  • R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification, 2nd Edition. Wiley-Interscience, 2001.

  17. [Figure: three-state Markov model, with transition probabilities a11 … a33 labeling the arcs between states 1, 2, 3]

  18. [Figure: the same three states extended to an HMM, with emission probabilities bjk attached to visible states v1 … v4]

  19. [Figure: HMM trellis, c hidden states unrolled over time steps t = 1 … T, showing transitions a12 … ac2 into a state and an emission b2k of visible symbol vk]

  20. [Figure: worked forward-algorithm example for the observation sequence v3 v1 v3 v2, showing the probabilities accumulated at each state over t = 0 … 4]

  21. [Figure: left-to-right HMM over states 0 … 7 for the phoneme sequence /v/ /i/ /t/ /e/ /r/ /b/ /i/ /-/, the spoken word "viterbi"]

  22. [Figure: Viterbi decoding trellis, with the maximizing state max(t) selected at each time step t = 1 … T]

  23. [Figure: worked Viterbi decoding example for the observation sequence v3 v1 v3 v2 over t = 0 … 4]
