7. Sequence Mining
• Sequences and Strings
• Recognition with Strings
• MM & HMM
• Sequence Association Rules

Data Mining – Sequences, H. Liu (ASU) & G. Dong (WSU)
Sequences and Strings
• A sequence x is an ordered list of discrete items, such as a sequence of letters or a gene sequence
• Sequences and strings are often used as synonyms
• String elements (characters, letters, or symbols) are nominal
• Text is a particularly long kind of string
• |x| denotes the length of sequence x
  • |AGCTTC| is 6
• Any contiguous string that is part of x is called a substring, segment, or factor of x
  • GCT is a factor of AGCTTC
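These definitions map directly onto Python's built-in string operations; a minimal illustration using the slide's own example string:

    x = "AGCTTC"
    print(len(x))          # |x| = 6
    print("GCT" in x)      # True: GCT is a contiguous substring (factor) of x
    print("AGT" in x)      # False: AGT is a subsequence of x but not a factor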
Recognition with Strings
• String matching
  • Given x and text, determine whether x is a factor of text
• Edit distance (for inexact string matching)
  • Given two strings x and y, compute the minimum number of basic operations (character insertions, deletions, and exchanges) needed to transform x into y
String Matching
• Given |text| >> |x|, with characters taken from an alphabet A
  • A can be {0, 1}, {0, 1, 2, …, 9}, {A, G, C, T}, or {A, B, …}
• A shift s is an offset needed to align the first character of x with character number s+1 in text
• Find whether there exists a valid shift such that the characters in x perfectly match the corresponding ones in text
Naïve (Brute-Force) String Matching
• Given A, x, text, n = |text|, m = |x|

    s = 0
    while s ≤ n − m
        if x[1 … m] = text[s+1 … s+m]
            then print "pattern occurs at shift" s
        s = s + 1

• Time complexity (worst case): O((n − m + 1)m)
• Shifting one character at a time is not necessary; see the runnable sketch below
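A direct Python transcription of the brute-force scan above; the function name and the test strings are mine, chosen for illustration:

    def naive_match(x, text):
        """Try every shift s and compare x against the window of
        text starting at s; report every shift that matches."""
        n, m = len(text), len(x)
        shifts = []
        for s in range(n - m + 1):
            if text[s:s + m] == x:       # compare x[1..m] with text[s+1..s+m]
                print("pattern occurs at shift", s)
                shifts.append(s)
        return shifts

    naive_match("GCT", "AGCTTCGCT")      # reports shifts 1 and 6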
Boyer-Moore and KMP
• See StringMatching.ppt and do not use the following algorithm
• Given A, x, text, n = |text|, m = |x|
  • F(x) = last-occurrence function
  • G(x) = good-suffix function

    s = 0
    while s ≤ n − m
        j = m
        while j > 0 and x[j] = text[s+j]
            j = j − 1
        if j = 0
            then print "pattern occurs at shift" s
            s = s + G(0)
        else
            s = s + max(G(j), j − F(text[s+j]))
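Full Boyer-Moore combines the last-occurrence and good-suffix heuristics. As a hedged sketch of just the last-occurrence (bad-character) idea, here is the simplified Boyer-Moore-Horspool variant, not the exact algorithm above:

    def horspool_match(x, text):
        """Scan the window right-to-left; on a match or mismatch,
        slide by the last-occurrence shift of the text character
        aligned with the final pattern position."""
        n, m = len(text), len(x)
        if m == 0 or m > n:
            return []
        # Distance from the last occurrence of each character in
        # x[0..m-2] to the end of the pattern; unseen characters get m
        shift = {c: m - 1 - j for j, c in enumerate(x[:-1])}
        shifts, s = [], 0
        while s <= n - m:
            j = m - 1
            while j >= 0 and x[j] == text[s + j]:
                j -= 1
            if j < 0:
                print("pattern occurs at shift", s)
                shifts.append(s)
            s += shift.get(text[s + m - 1], m)
        return shifts

    horspool_match("GCT", "AGCTTCGCT")   # shifts 1 and 6, with fewer comparisons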
Edit Distance
• The edit distance (ED) between x and y is the minimum number of fundamental operations required to transform x into y
• Fundamental operations (x = 'excused', y = 'exhausted'):
  • Substitutions, e.g. 'c' is replaced by 'h'
  • Insertions, e.g. 'a' is inserted into x after 'h'
  • Deletions, e.g. a character in x is deleted
• ED is one way of measuring the similarity between two strings
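The slides do not give an implementation, so here is one standard dynamic-programming formulation (Levenshtein distance), checked against the slide's example pair:

    def edit_distance(x, y):
        """d[i][j] = minimum number of substitutions, insertions,
        and deletions needed to turn x[:i] into y[:j]."""
        m, n = len(x), len(y)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i                       # delete all of x[:i]
        for j in range(n + 1):
            d[0][j] = j                       # insert all of y[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if x[i - 1] == y[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution or match
        return d[m][n]

    print(edit_distance("excused", "exhausted"))
    # 3: substitute c -> h, insert 'a', insert 't'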
Classification Using ED
• The nearest-neighbor algorithm can be applied for pattern recognition
  • Training: strings are stored together with their class labels
  • Classification (testing): the test string is compared to each stored string, an ED is computed, and the label of the nearest stored string is assigned to the test string
• The key is how to calculate ED
• An example of calculating ED appears in the sketch above; a minimal classifier sketch follows below
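A minimal 1-nearest-neighbor classifier over strings, reusing edit_distance() from the sketch above; the training strings and labels are made up for illustration:

    def nn_classify(test_string, training_data):
        """Return the class label of the stored string with the
        smallest edit distance to the test string."""
        best_label, best_dist = None, float("inf")
        for stored, label in training_data:
            dist = edit_distance(test_string, stored)
            if dist < best_dist:
                best_label, best_dist = label, dist
        return best_label

    train = [("AACGT", "class-1"), ("GGTTC", "class-2")]   # hypothetical labeled strings
    print(nn_classify("AACTT", train))                     # class-1 (distance 1 vs 4)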
Hidden Markov Model
• Markov model: transitional states
• Hidden Markov model: additional visible states
• Three key problems:
  • Evaluation
  • Decoding
  • Learning
Markov Model
• The Markov property: given the current state, the transition probability is independent of any previous states
• A simple Markov model:
  • State ω(t) at time t
  • Sequence of length T: ωT = {ω(1), ω(2), …, ω(T)}
  • Transition probability: P(ωj(t+1) | ωi(t)) = aij
  • It is not required that aij = aji
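A minimal simulation sketch of such a chain; the 3×3 transition matrix below is made up for illustration (and deliberately not symmetric, since aij need not equal aji):

    import random

    # a[i][j] = P(state j at t+1 | state i at t); each row sums to 1
    a = [[0.1, 0.4, 0.5],
         [0.6, 0.2, 0.2],
         [0.3, 0.3, 0.4]]

    def sample_sequence(a, start, T):
        """Generate ω(1) … ω(T): by the Markov property, each next
        state depends only on the current state's row of a."""
        seq = [start]
        for _ in range(T - 1):
            seq.append(random.choices(range(len(a)), weights=a[seq[-1]])[0])
        return seq

    print(sample_sequence(a, start=0, T=10))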
Hidden Markov Model
• Visible states: VT = {v(1), v(2), …, v(T)}
• Emitting a visible state vk(t): P(vk(t) | ωj(t)) = bjk
• Only the visible states vk(t) are accessible; the hidden states ωi(t) are unobservable
• A Markov model is ergodic if every state has a nonzero probability of occurring given some starting state
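Extending the previous sketch with made-up emission probabilities bjk: the generator walks the hidden chain but returns only the visible symbols, which is all an observer of an HMM gets to see:

    import random

    a = [[0.1, 0.4, 0.5],          # hidden-state transitions, as above
         [0.6, 0.2, 0.2],
         [0.3, 0.3, 0.4]]
    b = [[0.5, 0.3, 0.2],          # b[j][k] = P(visible symbol v_k | hidden state j)
         [0.1, 0.6, 0.3],
         [0.4, 0.2, 0.4]]

    def generate_visible(a, b, start, T):
        """Emit a visible symbol from the current hidden state, then
        transition; only the visible sequence is returned."""
        state, visibles = start, []
        for _ in range(T):
            visibles.append(random.choices(range(len(b[state])), weights=b[state])[0])
            state = random.choices(range(len(a)), weights=a[state])[0]
        return visibles

    print(generate_visible(a, b, start=0, T=8))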
Three Key Issues with HMM
• Evaluation
  • Given an HMM, complete with its transition probabilities aij and emission probabilities bjk, determine the probability that a particular sequence of visible states VT was generated by that model
• Decoding
  • Given an HMM and a sequence of observations VT, determine the most likely sequence of hidden states ωT that led to VT
• Learning
  • Given the number of hidden and visible states and a set of training observations of visible symbols, determine the probabilities aij and bjk
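Hedged sketches of the first two problems: the Forward algorithm for evaluation and the Viterbi algorithm for decoding. The initial distribution pi and all parameter values are assumptions added to make the example runnable; learning (e.g., Baum-Welch) is omitted:

    # Made-up parameters (not from the slides): 3 hidden states, 3 symbols
    a  = [[0.1, 0.4, 0.5], [0.6, 0.2, 0.2], [0.3, 0.3, 0.4]]   # transitions
    b  = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.4, 0.2, 0.4]]   # emissions
    pi = [1.0, 0.0, 0.0]                                       # assumed start state

    def forward(a, b, pi, V):
        """Evaluation: P(V | model), summing over all hidden paths."""
        n = len(a)
        alpha = [pi[i] * b[i][V[0]] for i in range(n)]
        for v in V[1:]:
            alpha = [sum(alpha[i] * a[i][j] for i in range(n)) * b[j][v]
                     for j in range(n)]
        return sum(alpha)

    def viterbi(a, b, pi, V):
        """Decoding: the single most likely hidden-state path for V."""
        n = len(a)
        delta = [pi[i] * b[i][V[0]] for i in range(n)]
        back = []
        for v in V[1:]:
            prev = [max(range(n), key=lambda i: delta[i] * a[i][j])
                    for j in range(n)]
            delta = [delta[prev[j]] * a[prev[j]][j] * b[j][v] for j in range(n)]
            back.append(prev)
        path = [max(range(n), key=lambda j: delta[j])]
        for prev in reversed(back):       # follow back-pointers to the start
            path.append(prev[path[-1]])
        return list(reversed(path))

    V = [0, 2, 1]                   # an observed visible-symbol sequence
    print(forward(a, b, pi, V))     # evaluation: P(V | model)
    print(viterbi(a, b, pi, V))     # decoding: most likely hidden path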
Other Sequential Pattern Mining Problems
• Sequence alignment (homology) and sequence assembly (genome sequencing)
• Trend analysis
  • Trend movement vs. cyclic variations, seasonal variations, and random fluctuations
• Sequential pattern mining
  • Various kinds of sequences (e.g., weblogs)
  • Various methods: from GSP to PrefixSpan
• Periodicity analysis
  • Full periodicity, partial periodicity, cyclic association rules
Periodic Pattern
• Full periodic pattern: every position repeats
  • ABCABCABC
• Partial periodic pattern: some positions are don't-cares (a checker is sketched below)
  • ABC ADC ACC ABC
• Pattern hierarchy over sequences of transactions
  • ABC ABC ABC DE DE DE DE   ABC ABC ABC DE DE DE DE   ABC ABC ABC DE DE DE DE
  • Higher-level pattern: [ABC:3 | DE:4]
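A minimal sketch of checking a partial periodic pattern, treating '*' as a don't-care position; the pattern strings and period are assumptions matching the slide's examples:

    def matches_partial_period(seq, pattern):
        """True if seq fits the periodic pattern: position i must
        equal pattern[i % p] unless that position is a don't-care."""
        p = len(pattern)
        return all(pattern[i % p] in ('*', c) for i, c in enumerate(seq))

    print(matches_partial_period("ABCABCABC", "ABC"))      # True: full periodicity
    print(matches_partial_period("ABCADCACCABC", "A*C"))   # True: partial periodicity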
Sequence Association Rule Mining
• SPADE (Sequential Pattern Discovery using Equivalence classes)
• Constrained sequence mining (SPIRIT)
Bibliography
• R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification, 2nd edition. Wiley-Interscience, 2001.
[Figure: state-transition diagram of a three-state Markov model; arrows between states 1, 2, and 3 carry the transition probabilities a11 … a33]
[Figure: the same three-state model as a hidden Markov model; in addition to the transitions aij, each hidden state emits visible symbols v1 … v4 with probabilities bjk]
[Figure: the HMM unrolled into a trellis over time t = 1 … T; each column holds the hidden states 1 … c, columns are linked by transition probabilities (a12, a22, a32, …, ac2), and state 2 at t = 2 emits symbol vk with probability b2k]
[Figure: numerical example of the evaluation (Forward) trellis over t = 0 … 4 for a visible sequence drawn from {v0, v1, v2, v3}; cell entries are per-state probabilities (e.g., 0.09, 0.0052, 0.0024), and the probability of the full sequence is 0.0011]
[Figure: a left-to-right HMM with states 0 … 7, one per phoneme in the sequence /v/ /i/ /t/ /e/ /r/ /b/ /i/ /-/]
[Figure: the decoding trellis; at each time step t = 1 … T the locally most probable state max(t) is marked, tracing the path max(1), max(2), max(3), …, max(T−1), max(T)]
[Figure: numerical example of decoding over t = 0 … 4 for the same visible sequence as in the evaluation example; the most probable hidden-state path is traced through the per-state probabilities]