140 likes | 307 Views
7. Sequence Mining. Sequences and Strings Recognition with Strings MM & HMM Sequence Association Rules. Sequences and Strings. A sequence x is an ordered list of discrete items, such as a sequence of letters or a gene sequence Sequences and strings are often used interchangeably
E N D
7. Sequence Mining Sequences and Strings Recognition with Strings MM & HMM Sequence Association Rules Data Mining by H. Liu, ASU
Sequences and Strings • A sequence x is an ordered list of discrete items, such as a sequence of letters or a gene sequence • Sequences and strings are often used interchangeably • String elements (characters, letters, or symbols) are nominal • A particularly long string is called text • |x| denotes the length of sequence x • |AGCTTC| is 6 • Any contiguous string that is part of x is called a substring, segment, or factor of x • GCT is a factor of AGCTTC Data Mining by H. Liu, ASU
Recognition with Strings • String matching • Given x and text, determine whether x is a factor of text • Edit distance • Given two strings x and y, compute the min number of basic operations (character insertions, deletions and exchanges) needed to transform x into y Data Mining by H. Liu, ASU
String Matching • Given |text| >> |x|, each discrete character is taken from an alphabet A • A can be {0, 1}, {0, 1, 2,…, 9}, {A,G,C,T}, or {A, B,…} • A shift s is an offset needed to align the first character of x with character number s+1 in text • Find if there exists a valid shift where there is a perfect match between each character in x and the corresponding one in text Data Mining by H. Liu, ASU
Naïve String Matching • Given alphabet A, x, text, n = |text|, m = |x| s = 0 whiles ≤ n-m ifx[1 …m] = text [s+1 … s+m] then print “pattern occurs at shift” s s = s + 1 • Time complexity (worst case): O((n-m+1)m) • One character shift at a time is not necessary Data Mining by H. Liu, ASU
Boyer-Moore String Matching • Given A, x, text, n = |text|, m = |x| F(x) = last-occurrence function G(x) = good-suffix function; s = 0 whiles ≤ n-m j = m while j>0 andx[j] = text [s+j] j = j-1 if j = 0 then print “pattern occurs at shift” s s = s + G(0) else s = s + max[G(j), j-F(text[s+j0])] Data Mining by H. Liu, ASU
Edit Distance • ED between x and y describes how many fundamental operations are required to transform x to y. • Fundamental operations (x=‘excused’, y=‘exhausted’) • Substitutions, ‘c’ is replaced by ‘h’ • Insertions, ‘a’ is inserted into x after ‘h’ • Deletions, a character in x is deleted • ED is one way of measuring similarity between two strings Data Mining by H. Liu, ASU
Classification using ED • Nearest-neighbor algorithm can be applied for pattern recognition. • Training: data of strings with their class labels stored • Classification (testing): a test string is compared to each stored string and an ED is computed; the nearest stored string’s label is assigned to the test string. • The key is how to calculate ED. • An example of calculating ED Data Mining by H. Liu, ASU
Hidden Markov Model • Markov Model: transitional states • Hidden Markov Model: additional visible states • Evaluation • Decoding • Learning Data Mining by H. Liu, ASU
Markov Model • The Markov property: • given the current state, the transition probability is independent of any previous states. • A simple Markov Model • State ω(t) at time t • Sequence of length T: • ωT = {ω(1), ω(2), …, ω(T)} • Transition probability • P(ωj(t+1)| ωi(t)) = aij • It’s not required that aij =aji Data Mining by H. Liu, ASU
Hidden Markov Model • Visible states • VT = {v(1), v(2), …, v(T)} • Emitting a visible state vk(t) • P(v k(t)| ωj(t)) = bjk • Only visible states vk (t) are accessibleand states ωi (t) are unobservable. • A Markov model is ergodic if every state has a nonzero prob of occuring give some starting state. Data Mining by H. Liu, ASU
Three Key Issues with HMM • Evaluation • Given an HMM, complete with transition probabilities aij and bjk. Determine the probability that a particular sequence of visible states VT was generated by that model • Decoding • Given an HMM and a set of observations VT. Determine the most likely sequence of hidden states ωT that led to VT. • Learning • Given the number of states and visible states and a set of training observations of visible symbols, determine the probabilities aij and bjk. Data Mining by H. Liu, ASU
Sequence Association Rule Mining • SPADE (Sequential Pattern Discovery using Equivalence classes) • Constrained sequence mining (SPIRIT) Data Mining by H. Liu, ASU
Bibliography • R.O. Duda, P.E. Hart, and D.G. Stork, 2001. Pattern Classification. 2nd Edition. Wiley Interscience. Data Mining by H. Liu, ASU