
Sequential Data Mining

Sequential Data Mining. Jianlin Cheng Department of Computer Science University of Missouri, Columbia 2011. Sequential Data. Sequence data: a series of alphabets or numbers Spatial sequence data : biological sequence, text document, a computer program, web page, a missile trajectory


Presentation Transcript


  1. Sequential Data Mining Jianlin Cheng Department of Computer Science University of Missouri, Columbia 2011

  2. Sequential Data • Sequence data: a series of letters (symbols from an alphabet) or numbers • Spatial sequence data: biological sequence, text document, a computer program, web page, a missile trajectory • Temporal sequence data: speech, movie clips, stock prices, a series of web clicks

  3. Key Issue • Measure the similarity between two sequences, i.e. sequence comparison • Sequence database search / retrieval • Sequence classification • Sequence clustering • Sequence pattern recognition • Defining and computing sequence similarity is harder than for vector-based data

  4. Sequence Alignment • Use biological sequences as an example • Optimal pairwise sequence alignment algorithm • Fast, heuristic sequence alignment (BLAST) • Statistical significance of sequence alignment score

  5. Sequence Data • DNA Sequence (gene) gaagagagcttcaggtttggggaagagaca acaactcccgctcagaagcaggggccgata • RNA Sequence caaaucacuacacacaggguagaagguggaacgcacaggagcaugucaacggggugc • Protein Sequence RELQVWGRDNNSLSEAGANRQGDVSFNLPQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEDIDLPGKWKP

  6. Global Pairwise Sequence Alignment ITAKPAKTTSPKEQAIGLSVTFLSFLLPAGVLYHL ITAKPQWLKTSESVTFLSFLLPQTQGLYHL

  7. Global Pairwise Sequence Alignment ITAKPAKTTSPKEQAIGLSVTFLSFLLPAGVLYHL ITAKPQWLKTSESVTFLSFLLPQTQGLYHL ITAKPAKT-TSPKEQAIGLSVTFLSFLLPAG-VLYHL ITAKPQWLKTSE-------SVTFLSFLLPQTQGLYHL Alignment (similarity) score

  8. Three Main Issues • Definition of the alignment score • Algorithms for finding the optimal alignment • Evaluation of the significance of the alignment score

  9. A Simple Scoring Scheme • Score of a character pair: S(match) = 1, S(not_match) = -1, S(gap-char) = -1 • Score of an alignment = sum of the scores of all aligned character pairs: ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL ITAKPQWLKSTE-------SVTFLSFLLPQTQGLYHL 5 - 7 - 7 + 10 - 4 + 4 = 1
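The scoring scheme on slide 9 can be checked with a few lines of Python. This is a minimal sketch: the function name `alignment_score` and its parameter names are illustrative and not from the slides, and it assumes both aligned strings have equal length with `-` marking a gap.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Sum per-column scores of two already-aligned strings (slide 9 scheme)."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap        # a character aligned against a gap
        elif x == y:
            total += match      # identical characters
        else:
            total += mismatch   # different characters
    return total

top = "ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL"
bottom = "ITAKPQWLKSTE-------SVTFLSFLLPQTQGLYHL"
print(alignment_score(top, bottom))  # 5 - 7 - 7 + 10 - 4 + 4 = 1
```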

  10. Optimization Problem • How many possible alignments exist for two sequences with length m and n? • How can we find the best alignment to maximize alignment score?

  11. Possible Alignments Shortest ACTG--- ATGGATG Intermediate AC-TG--- ATG-GATG Longest ACTG------- ----ATGGATG ACTG ATGGATG

  12. Total Number of Possible Alignments AGATCAGAAAT-G --AT-AG-AATCC m + n

  13. Total Number of Alignments Select m positions out of m+n possible positions: C(m+n, m). Exponential! If m = 300, n = 300, total = C(600, 300) ≈ 10^179
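The counting argument on slide 13 (choose m positions out of m+n) is a binomial coefficient, which Python can evaluate exactly. A small sketch, with `count_alignments` as an illustrative name; for m = n = 300 the result is a 180-digit number, i.e. on the order of 10^179.

```python
import math

def count_alignments(m, n):
    """Binomial coefficient C(m+n, m), following the slide's counting argument."""
    return math.comb(m + n, m)

total = count_alignments(300, 300)
# Report the order of magnitude from the number of decimal digits.
print(f"about 10^{len(str(total)) - 1} alignments")
```

Even at a billion alignments per second, enumerating that many candidates is out of reach, which is the motivation for dynamic programming.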

  14. Needleman and Wunsch Algorithm • Given sequences P and Q, we use a matrix M to record the optimal alignment scores of all prefixes of P and Q. M[i,j] is the best alignment score for the prefixes P[1..i] and Q[1..j]. • M[i,j] = max [ M[i-1,j-1] + S(P[i],Q[j]), M[i,j-1] + S(-,Q[j]), M[i-1,j] + S(P[i],-) ]

  15. Needleman and Wunsch Algorithm • Given sequences P and Q, we use a matrix M to record the optimal alignment scores of all prefixes of P and Q. M[i,j] is the best alignment score for the prefixes P[1..i] and Q[1..j]. • M[i,j] = max [ M[i-1,j-1] + S(P[i],Q[j]), M[i,j-1] + S(-,Q[j]), M[i-1,j] + S(P[i],-) ] Dynamic Programming

  16. Dynamic Programming Algorithm Three-Step Algorithm: • Initialization • Matrix filling (scoring) • Trace back (alignment)

  17. 1. Initialization of Matrix M [Matrix figure: columns (index j) labeled _ A T A G A A T; rows (index i) labeled _ A G A T C A G A A A T G]

  18. 2. Fill Matrix [Matrix figure: columns labeled _ A T A G A A T; rows labeled _ A G A T C A G A A A T G] M[i,j] = max [ M[i-1,j-1] + S(P[i],Q[j]), M[i,j-1] + S(-,Q[j]), M[i-1,j] + S(P[i],-) ]

  19. 2. Fill Matrix [Matrix figure: columns labeled _ A T A G A A T; rows labeled _ A G A T C A G A A A T G] M[i,j] = max [ M[i-1,j-1] + S(P[i],Q[j]), M[i,j-1] + S(-,Q[j]), M[i-1,j] + S(P[i],-) ]

  20. 2. Fill Matrix [Matrix figure: columns labeled _ A T A G A A T; rows labeled _ A G A T C A G A A A T G] M[i,j] = max [ M[i-1,j-1] + S(P[i],Q[j]), M[i,j-1] + S(-,Q[j]), M[i-1,j] + S(P[i],-) ]

  21. 3. Trace Back [Matrix figure: columns labeled _ A T A G A A T; rows labeled _ A G A T C A G A A A T G] Resulting alignment: AGATCAGAAATG / --AT-AG-AAT-

  22. Pairwise Alignment Algorithm Using Dynamic Programming • Initialization: Given two sequences with length m and n, create an (m+1)×(n+1) matrix M. Initialize the first row and first column according to the scoring scheme (cumulative gap scores). • For j in 1..n (column), for i in 1..m (row): M[i,j] = max( M[i-1,j-1] + S(i,j), M[i,j-1] + S(-,j), M[i-1,j] + S(i,-) ). Record the selected path toward (i,j). • Report alignment score M[m,n] and trace back to M[0,0] to generate the optimal alignment. What is the time complexity of this algorithm?
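The three steps on slide 22 (initialization, matrix filling, trace back) might look like the following in Python. This is a sketch under the simple scoring scheme of slide 9 (match = 1, mismatch = -1, gap = -1); the function name `needleman_wunsch` and its parameters are illustrative, and ties in the trace back are broken in favor of the diagonal move, which is one arbitrary choice among several valid ones.

```python
def needleman_wunsch(p, q, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; returns (score, aligned_p, aligned_q)."""
    m, n = len(p), len(q)
    # Step 1: initialization. First row/column hold cumulative gap scores.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = i * gap
    for j in range(1, n + 1):
        M[0][j] = j * gap
    # Step 2: fill the matrix with the recurrence from the slides.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if p[i - 1] == q[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,   # align P[i] with Q[j]
                          M[i][j - 1] + gap,     # gap in P
                          M[i - 1][j] + gap)     # gap in Q
    # Step 3: trace back from M[m][n] to M[0][0].
    a, b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if p[i - 1] == q[j - 1] else mismatch
            if M[i][j] == M[i - 1][j - 1] + s:
                a.append(p[i - 1]); b.append(q[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and M[i][j] == M[i - 1][j] + gap:
            a.append(p[i - 1]); b.append("-")
            i -= 1
        else:
            a.append("-"); b.append(q[j - 1])
            j -= 1
    return M[m][n], "".join(reversed(a)), "".join(reversed(b))

score, top, bottom = needleman_wunsch("AGATCAGAAATG", "ATAGAAT")
print(score)
print(top)
print(bottom)
```

The two nested loops visit each of the (m+1)×(n+1) cells once and do constant work per cell, so both time and memory are O(mn).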

  23. Local Sequence Alignment Using DP • Biological sequences often have only local similarity. • During evolution, only functionally and structurally important regions are highly conserved. • Global alignment sacrifices local similarity to maximize the global alignment score. • We need an alignment method that identifies the locally similar regions regardless of the other, dissimilar regions.

  24. Transcription binding site

  25. Local Alignment Algorithm Goal: find an alignment of the substrings of P and Q with maximum alignment score. Naïve Algorithm: (m+1)*m/2 substrings of P, (n+1)*n/2 substrings of Q. Using DP for each substring pair: m^2 * n^2 * O(mn) = O(m^3 n^3)

  26. Smith-Waterman Algorithm Same dynamic programming algorithm as global alignment except for three differences. • All negative scores are converted to 0 (why?) • Alignment can start anywhere in the matrix • Alignment can end anywhere in the matrix

  27. Local Alignment Algorithm • Initialization: Given two sequences with length m and n, create a (m+1)×(n+1) matrix M. Initialize the first row and first column to 0s. • For j in 1..n (column) for i in 1..m (row) M[i,j] = max( 0, M[i-1,j-1]+S(i,j), M[i,j-1]+S(-,j), M[i-1, j] + S(i,-) ) Record the selected path. • Find elements in matrix M with maximum values. Trace back till 0 and report the alignment corresponding to the path.
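The local alignment procedure on slide 27 can be sketched in Python as follows, again assuming the simple scheme match = 1, mismatch = -1, gap = -1. The name `smith_waterman` and its parameters are illustrative. The sketch reports only the first maximum-scoring cell it encounters; the slides note that several cells can share the maximum.

```python
def smith_waterman(p, q, match=1, mismatch=-1, gap=-1):
    """Local alignment by dynamic programming; returns (score, aligned_p, aligned_q)."""
    m, n = len(p), len(q)
    # First row and first column stay at 0.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if p[i - 1] == q[j - 1] else mismatch
            M[i][j] = max(0,                     # restart: drop a negative prefix
                          M[i - 1][j - 1] + s,
                          M[i][j - 1] + gap,
                          M[i - 1][j] + gap)
            if M[i][j] > best:
                best, best_pos = M[i][j], (i, j)
    # Trace back from the best cell until a 0 is reached.
    a, b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and M[i][j] > 0:
        s = match if p[i - 1] == q[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + s:
            a.append(p[i - 1]); b.append(q[j - 1])
            i -= 1; j -= 1
        elif M[i][j] == M[i - 1][j] + gap:
            a.append(p[i - 1]); b.append("-")
            i -= 1
        else:
            a.append("-"); b.append(q[j - 1])
            j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))

score, top, bottom = smith_waterman("AGATCAGAAATG", "ATAGAAT")
print(score)
print(top)
print(bottom)
```

Clamping at 0 lets an alignment start fresh at any cell instead of carrying a negative-scoring prefix, which is why the result can begin and end anywhere in the matrix.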

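  28. 1. Initialization [Matrix figure: columns labeled _ A T A G A A T; rows labeled _ A G A T C A G A A A T G]

  29. 2. Fill the Matrix [Matrix figure: columns labeled _ A T A G A A T; rows labeled _ A G A T C A G A A A T G]

  30. 3. Trace Back [Matrix figure: columns labeled _ A T A G A A T; rows labeled _ A G A T C A G A A A T G]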

  31. Two Best Local Alignments • ATCAGAAAT / AT-AGA-AT • ATCAGAA / AT-AGAA

  32. Scoring Matrix • How to accurately measure the similarity between amino acids (or nucleotides) is one key issue of sequence alignment. • For nucleotides, a simple identical / not identical scheme is mostly ok. • Due to various properties of amino acids, it is hard and also critical to measure the similarity between amino acids.

  33. Evolutionary Substitution Approach • During evolution, the substitution of similar amino acids is more likely to be selected within a protein family of similar sequences than random substitutions (M. Dayhoff) • The frequency / probability that one residue substitutes another one is an indicator of their similarity.

  34. PAM Scoring Matrices(M. Dayhoff) • Select a number of protein families. • Align sequences in each family and count the frequency of amino acid substitution in each column: Pij. • Similarity score is logarithm of the ratio of observed substitution probability over the random substitution probability. S(i,j) = log(Pij / (Pi * Pj)). Pi is the observed probability of residue i and Pj is the observed probability of residue j • PAM: Point Accepted Mutation • Let data tell strategy! - NBA

  35. A Simplified Example Substitution Frequency Table ACGTCGAGT ACCACGTGT CACACTACT ACCGCATGA ACCCTATCT TCCGTAACA ACCATAAGT AGCATAAGT ACTATAAGT ACGATAAGT Total number of substitutions: 90 P(A<->C) = 0.07+0.07=0.14

  36. A Simplified Example Substitution Frequency Table ACGTCGAGT ACCACGTGT CACACTACT ACCGCATGA ACCCTATCT TCCGTAACA ACCATAAGT AGCATAAGT ACTATAAGT ACGATAAGT Total number of substitutions: 90 P(A<->C) = 0.07+0.07=0.14 S(A,C) = log(0.14/(0.6*0.1)) = 0.36

  37. PAM250 Matrix (log odds multiplied by 10)

  38. BLOSUM Matrices (Henikoff and Henikoff) • PAM calculation is based on global alignments • PAM matrices don’t work well for aligning sequences with little similarity. • BLOSUM: BLOcks SUbstitution Matrix • BLOSUM is based on highly conserved local regions (blocks) without gaps.

  39. Block 1 Block2

  40. BLOSUM62 Matrix

  41. Significance of Sequence Alignment Score • Why do we need a significance test? • Mathematical view: unusual versus “by chance” • Biological view: evolutionarily related or not?

  42. Randomization Approach • Randomization is a fundamental idea due to the statistician Fisher. • Randomly permute the characters within sequences P and Q to generate new sequences (P’ and Q’). Align the random sequences and record the alignment scores. • Assuming these scores follow a normal distribution, compute the mean (u) and standard deviation (σ) of the alignment scores

  43. Normal distribution of alignment scores of two sequences [Figure labels: 95%; x-axis: alignment score] • If S = u + 2σ, the probability of observing an alignment score equal to or more extreme than this by chance is about 2.5%, i.e., P(S >= u + 2σ) ≈ 2.5%. • Thus we are 97.5% confident that the alignment score is not by chance, i.e. significant. • For any score x, we can compute P(S >= x), which is called the p-value.
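The randomization approach of slides 42 and 43 might be sketched as follows. Everything here is illustrative: the function names, the number of trials, the fixed random seed, and the use of a global alignment score with the simple +1/-1/-1 scheme are assumptions, not details from the slides. The p-value is taken from a normal distribution fitted to the shuffled scores, as the slides propose.

```python
import random
import statistics

def global_score(p, q, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score only, keeping one matrix row at a time."""
    prev = [j * gap for j in range(len(q) + 1)]
    for i in range(1, len(p) + 1):
        cur = [i * gap]
        for j in range(1, len(q) + 1):
            s = match if p[i - 1] == q[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, cur[j - 1] + gap, prev[j] + gap))
        prev = cur
    return prev[-1]

def shuffled(seq, rng):
    """Random permutation of the characters of seq."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)

def randomization_test(p, q, trials=200, seed=0):
    """Return (observed score, mean u, std sigma, p-value under a normal fit)."""
    rng = random.Random(seed)
    observed = global_score(p, q)
    scores = [global_score(shuffled(p, rng), shuffled(q, rng))
              for _ in range(trials)]
    u = statistics.mean(scores)
    sigma = statistics.stdev(scores)
    if sigma == 0:
        raise ValueError("all shuffled scores are identical; cannot fit a normal")
    # P(S >= observed) under Normal(u, sigma)
    p_value = 1.0 - statistics.NormalDist(u, sigma).cdf(observed)
    return observed, u, sigma, p_value

print(randomization_test("AGATCAGAAATG", "ATAGAAT"))
```

Shuffling preserves the length and composition of each sequence, so the comparison isolates the effect of character order on the score.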

  44. Model-Based Approach (Karlin and Altschul) http://www.people.virginia.edu/~wrp/cshl02/Altschul/Altschul-3.html • Extreme Value Distribution • K and λ are statistical parameters depending on the substitution matrix. For BLOSUM62, λ = 0.252, K = 0.35

  45. P-Value • P(S≥x) is called the p-value. It is the probability that two random sequences have an alignment score >= x. • A smaller p-value means a more significant score. • E-value = p-value * # sequences
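The transcript does not preserve the extreme value formula from slide 44. The sketch below therefore assumes the commonly quoted Karlin-Altschul form, P(S ≥ x) ≈ 1 − exp(−K·m·n·e^(−λx)), where m and n are the two sequence lengths; treat that formula as an assumption rather than as the slide's own content. The default λ and K are the values quoted on slide 44, the E-value follows the definition on slide 45, and the function names are illustrative.

```python
import math

def karlin_altschul_pvalue(x, m, n, lam=0.252, K=0.35):
    """P(S >= x) under an assumed extreme value form: 1 - exp(-K*m*n*exp(-lam*x))."""
    expected_hits = K * m * n * math.exp(-lam * x)
    # -expm1(-y) equals 1 - exp(-y) but stays accurate when y is tiny.
    return -math.expm1(-expected_hits)

def e_value(p_value, num_sequences):
    """E-value as defined on slide 45: p-value times the number of sequences."""
    return p_value * num_sequences

p = karlin_altschul_pvalue(x=60, m=300, n=300)
print(p, e_value(p, num_sequences=1_000_000))
```

Because the p-value falls roughly exponentially with the score, a modest increase in x can change the E-value for a large database by orders of magnitude.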

  46. Problems of Using Dynamic Programming to Search Large Sequence Databases • Searching for homologs in DNA and protein databases is often the first step of a bioinformatics study. • DP is too slow for searching large sequence databases such as GenBank and UniProt. Each DP search can take hours. • Most DP search time is wasted on unrelated sequences or dissimilar regions. • Developing fast, practical sequence comparison methods for database search is important.

  47. Fast Sequence Search Methods • All successful, rapid sequence comparison methods are based on a simple fact: similar sequences share some common words. • The first such method was FASTP (Pearson & Lipman, 1985) • The most widely used methods are BLAST (Altschul et al., 1990) and PSI-BLAST (Altschul et al., 1997).
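The "common words" idea on slide 47 can be illustrated with a toy word index. This is not how FASTP or BLAST are actually implemented; it only shows the filtering principle, with illustrative function names and an arbitrary word length k.

```python
def words(seq, k):
    """All overlapping length-k words (k-mers) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_words(a, b, k):
    """Words of length k that occur in both sequences."""
    return words(a, k) & words(b, k)

def candidates(query, database, k=3, min_shared=2):
    """Keep only database sequences sharing at least min_shared words with query."""
    qwords = words(query, k)
    return [s for s in database if len(qwords & words(s, k)) >= min_shared]

db = ["AGATCAGAAATG", "CCCCGGGGTTTT", "TTAGAATCC"]
print(candidates("ATAGAAT", db))
```

Only the sequences that pass this cheap filter would then be handed to a slower, more sensitive alignment step.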
