1 / 53

Probabilistic Sequence Alignment

Probabilistic Sequence Alignment. BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014. What you’ve seen thus far. The pairwise sequence alignment task Notion of a “best” alignment – scoring schemes

shandi
Download Presentation

Probabilistic Sequence Alignment

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014

  2. What you’ve seen thus far • The pairwise sequence alignment task • Notion of a “best”alignment – scoring schemes • Dynamic programming algorithms for efficiently finding a “best” alignment • Variants on the task • global, local, different gap penalty functions • Heuristic methods for large-scale alignment - BLAST

  3. Tasks addressed today • How can we express the uncertainty of an alignment? • How can we estimate the parameters for alignment? • How can we align multiple sequences to each other?

  4. Alignment summary: 27 mismatches, 12 gaps, 116 spaces Alignment summary: 45 mismatches, 4 gaps, 214 spaces Picking Alignments mel TGTTGTGTGATGTTGATTTCTTTACGACTCCTATCAAACTAAACCCATAAAGCATTCAATTCAAAGCATATA------------ pse T----TTTGATGTTGATTTCTTTACGAGTTTGATAGAACTAAACCCATAAAGCATTCAATTCGTAGCATATAGCTCTCCTCTGC * * ******************** * ** ************************** ******** mel -------CATGTGAAAATCCCAGCGAGAACTCCTTATTAATCCAG--------------------CGCAGTCGGCGGCGGCGGC pse CATTCGGCATGTGAAAA-------------TCCTTATTAATCCAGAACGTGTGCGCCAGCGTCAGCGCCAGCGCCGGCAGCAGC ********** *************** *** ** **** ** ** mel GCGCAGTCAGC----------GGTGGCAGCGCAGTATATAAATAAAGTCTTATAAGAAACTCGTGAGCG--------------- pse -CGCAG-CAGCAAAACGGCACGCTGGCAGCGGAGTATATAAATAA--TCTTATAAGAAACTCGTGTGAGCGCAACCGGGCAGCG ***** **** * ******** ************* ****************** * * mel ---AAAGAGAGCG-TTTTATTTATGTGCGTCAGCGTCGGCCGCAACAGCGCCGTCAGCACTGGCAGCGACTGCGAC pse GCCAAAGAGAGCGATTTTATTTATGTG-----------------ACTGCGCTGCCTG----------GTCCTCGGC ********** ************* ** **** * * * * * ** * Alignment 1 mel TGTTGTGTGATGTTGATTTCTTTACGACTCCTATCAAACTAAACCCATAAAGCATTCAATTCAAAGCATATACATGTGAAAATC pse T----TTTGATGTTGATTTCTTTACGAGTTTGATAGAACTAAACCCATAAAGCATTCAATTCGTAGCATATAGCTCTCCTCTGC * * ******************** * ** ************************** ******** * * * mel CCAGCGAGA------ACTCCTTATTAATCCAGCGCAGTCGGCGGCGGCGGCGCGCAGTCAGCGGTGGCAGCGCAGTATATAAAT pse CATTCGGCATGTGAAAATCCTTATTAATCCAGAAC------------------------------------------------- * ** * * *************** * mel AAAGTCTTATAAGAAACTCGTGAGCGAAAGAGAGCGTTTTATTTATGTGCGTCAGCGTCGGCCGCAACAGCGCCGTCAGCACTG pse --------------------------------------------GTGTGCGCCAGCGTCAGCGCCAGCGCCGGCAGCAGCCGCA ****** ******* ** ** * ** * **** mel GCAGCGA----------------------------------------------------------------------------- pse GCAGCAAAACGGCACGCTGGCAGCGGAGTATATAAATAATCTTATAAGAAACTCGTGTGAGCGCAACCGGGCAGCGGCCAAAGA ***** * mel ----------------------------------CTGCGAC pse GAGCGATTTTATTTATGTGACTGCGCTGCCTGGTCCTCGGC * ** * Alignment 2

  5. Antp: antennapedia >chr3R:2824911-2825170 TGTTGTGTGATGTTGATTTCTTTACGACTCCTATCAAACTAAACCCATAAAGCATTCAAT TCAAAGCATATACATGTGAAAATCCCAGCGAGAACTCCTTATTAATCCAGCGCAGTCGGC GGCGGCGGCGCGCAGTCAGCGGTGGCAGCGCAGTATATAAATAAAGTCTTATAAGAAACT CGTGAGCGAAAGAGAGCGTTTTATTTATGTGCGTCAGCGTCGGCCGCAACAGCGCCGTCA GCACTGGCAGCGACTGCGAC Adf1→Antp:06447: Binding site for transcription factor Adf1 An Elusive Cis-Regulatory Element Drosophila melanogaster polytene chromosomes

  6. TGTGCGTCAGCGTCGGCCGCAACAGCG TGTGCGCCAGCGTCAGCGCCAGCGCCG ****** ******* ** ** * ** 74% identity TGTGCGTCAGCGTCGGCCGCAACAGCG TGTG-----------------ACTGCG **** ** *** 33% identity Alignment summary: 27 mismatches, 12 gaps, 116 spaces Alignment summary: 45 mismatches, 4 gaps, 214 spaces The Conservation of Adf1→Antp:06447 mel TGTTGTGTGATGTTGATTTCTTTACGACTCCTATCAAACTAAACCCATAAAGCATTCAATTCAAAGCATATA------------ pse T----TTTGATGTTGATTTCTTTACGAGTTTGATAGAACTAAACCCATAAAGCATTCAATTCGTAGCATATAGCTCTCCTCTGC * * ******************** * ** ************************** ******** mel -------CATGTGAAAATCCCAGCGAGAACTCCTTATTAATCCAG--------------------CGCAGTCGGCGGCGGCGGC pse CATTCGGCATGTGAAAA-------------TCCTTATTAATCCAGAACGTGTGCGCCAGCGTCAGCGCCAGCGCCGGCAGCAGC ********** *************** *** ** **** ** ** mel GCGCAGTCAGC----------GGTGGCAGCGCAGTATATAAATAAAGTCTTATAAGAAACTCGTGAGCG--------------- pse -CGCAG-CAGCAAAACGGCACGCTGGCAGCGGAGTATATAAATAA--TCTTATAAGAAACTCGTGTGAGCGCAACCGGGCAGCG ***** **** * ******** ************* ****************** * * mel ---AAAGAGAGCG-TTTTATTTATGTGCGTCAGCGTCGGCCGCAACAGCGCCGTCAGCACTGGCAGCGACTGCGAC pse GCCAAAGAGAGCGATTTTATTTATGTG-----------------ACTGCGCTGCCTG----------GTCCTCGGC ********** ************* ** **** * * * * * ** * Alignment 1 mel TGTTGTGTGATGTTGATTTCTTTACGACTCCTATCAAACTAAACCCATAAAGCATTCAATTCAAAGCATATACATGTGAAAATC pse T----TTTGATGTTGATTTCTTTACGAGTTTGATAGAACTAAACCCATAAAGCATTCAATTCGTAGCATATAGCTCTCCTCTGC * * ******************** * ** ************************** ******** * * * mel CCAGCGAGA------ACTCCTTATTAATCCAGCGCAGTCGGCGGCGGCGGCGCGCAGTCAGCGGTGGCAGCGCAGTATATAAAT pse CATTCGGCATGTGAAAATCCTTATTAATCCAGAAC------------------------------------------------- * ** * * *************** * mel AAAGTCTTATAAGAAACTCGTGAGCGAAAGAGAGCGTTTTATTTATGTGCGTCAGCGTCGGCCGCAACAGCGCCGTCAGCACTG pse --------------------------------------------GTGTGCGCCAGCGTCAGCGCCAGCGCCGGCAGCAGCCGCA ****** ******* ** ** * ** * **** mel GCAGCGA----------------------------------------------------------------------------- pse GCAGCAAAACGGCACGCTGGCAGCGGAGTATATAAATAATCTTATAAGAAACTCGTGTGAGCGCAACCGGGCAGCGGCCAAAGA ***** * mel ----------------------------------CTGCGAC pse GAGCGATTTTATTTATGTGACTGCGCTGCCTGGTCCTCGGC * ** * Alignment 2

  7. TGTGCGTCAGCGTCGGCCGCAACAGCG TGTG-----------------ACTGCG **** ** *** TGTGCGTCAGCGTCGGCCGCAACAGCG TGTGCGCCAGCGTCAGCGCCAGCGCCG ****** ******* ** ** * ** The Polytope Sequence lengths = 260bp, 280bp 364 Vertices 760 Ridges 398 Facets

  8. Methodological machinery to be used • Hidden Markov models (HMMs) • Viterbi and Forward algorithms • Profile HMMs • Pair HMMs • Connections to classical sequence alignment

  9. Hidden Markov models • Generative probabilistic models of sequences • Explicitly models unobserved (hidden) states that “emit” the characters of the observed sequence • Primary task of interest is to infer the hidden states given the observed sequence • Alignment case: hidden states = alignment

  10. Two HMM random variables • Observed sequence • Hidden state sequence • HMM: • Markov chain over hidden sequence • Dependence between

  11. The Parameters of an HMM • as in Markov chain models, we have transition probabilities • since we’ve decoupled states and characters, we also have emission probabilities probability of a transition from state k to l represents a path (sequence of states) through the model probability of emitting character b in state k

  12. 0.4 0.2 A 0.4 C 0.1 G 0.2 T 0.3 A 0.2 C 0.3 G 0.3 T 0.2 0.8 begin end 0.6 0.5 1 3 0 5 A 0.4 C 0.1 G 0.1 T 0.4 A 0.1 C 0.4 G 0.4 T 0.1 0.5 0.9 0.2 2 4 0.1 A Simple HMM with Emission Parameters probability of a transition from state 1 to state 3 probability of emitting character A in state 2 0.8

  13. Three Important Questions • How likely is a given sequence? the Forward algorithm • What is the most probable “path” (sequence of hidden states) for generating a given sequence? the Viterbi algorithm • How can we learn the HMM parameters given a set of sequences? the Forward-Backward (Baum-Welch) algorithm

  14. How Likely is a Given Path and Sequence? • the probability that the path is taken and the sequence is generated: (assuming begin/end are the only silent states on path)

  15. begin end How Likely Is A Given Path and Sequence? 0.4 0.2 A 0.4 C 0.1 G 0.2 T 0.3 A 0.2 C 0.3 G 0.3 T 0.2 0.8 0.6 0.5 1 3 0 5 A 0.4 C 0.1 G 0.1 T 0.4 A 0.1 C 0.4 G 0.4 T 0.1 0.5 0.9 0.2 2 4 0.1 0.8

  16. How Likely is a Given Sequence? • We usually only observe the sequence, not the path • To find the probability of a sequence, we must sum over all possible paths • but the number of paths can be exponential in the length of the sequence... • the Forward algorithm enables us to compute this efficiently

  17. How Likely is a Given Sequence: The Forward Algorithm • A dynamic programming solution • subproblem: define to be the probability of generating the first i characters and ending in state k • we want to compute , the probability of generating the entire sequence (x) and ending in the end state (state N) • can define this recursively

  18. 0.4 0.2 A 0.4 C 0.1 G 0.2 T 0.3 A 0.2 C 0.3 G 0.3 T 0.2 0.8 0.6 begin end 0.5 1 3 0 5 A 0.4 C 0.1 G 0.1 T 0.4 A 0.1 C 0.4 G 0.4 T 0.1 0.5 0.9 0.2 2 4 0.1 0.8 The Forward Algorithm • because of the Markov property, don’t have to explicitly enumerate every path • e.g. compute using

  19. The Forward Algorithm • initialization: probability that we’re in the start state and have observed 0 characters from the sequence

  20. The Forward Algorithm • recursion for emitting states (i =1…L): • recursion for silent states:

  21. The Forward Algorithm • termination: probability that we’re in the end state and have observed the entire sequence

  22. 0.4 0.2 A 0.4 C 0.1 G 0.2 T 0.3 A 0.2 C 0.3 G 0.3 T 0.2 0.8 0.6 0.5 1 3 begin end 0 5 A 0.4 C 0.1 G 0.1 T 0.4 A 0.1 C 0.4 G 0.4 T 0.1 0.5 0.9 0.2 2 4 0.1 0.8 Forward Algorithm Example • given the sequence x = TAGA

  23. Forward Algorithm Example • given the sequence x = TAGA • initialization • computing other values

  24. Three Important Questions • How likely is a given sequence? • What is the most probable “path”for generating a given sequence? • How can we learn the HMM parameters given a set of sequences?

  25. Finding the Most Probable Path: The Viterbi Algorithm • Dynamic programming approach, again! • subproblem: define to be the probability of the most probable path accounting for the first i characters of x and ending in state k • we want to compute , the probability of the most probable path accounting for all of the sequence and ending in the end state • can define recursively • can use DP to find efficiently

  26. Finding the Most Probable Path: The Viterbi Algorithm • initialization:

  27. The Viterbi Algorithm • recursion for emitting states (i =1…L): keep track of most probable path • recursion for silent states:

  28. The Viterbi Algorithm • termination: • traceback: follow pointers back starting at

  29. begin end Forward & Viterbi Algorithms • Forward/Viterbi algorithms effectively consider all possible paths for a sequence • Forward to find probability of a sequence • Viterbi to find most probable path • consider a sequence of length 4…

  30. HMM parameter estimation • Easy if the hidden path is known for each sequence • In general, the paths are unknown • Baum-Welch (Forward-Backward) algorithm is used to compute maximum likelihood estimates • Backward algorithm is analog of forward algorithm for computing probabilities of suffixes of a sequence

  31. Learning Parameters: The Baum-Welch Algorithm • algorithm sketch: • initialize the parameters of the model • iterate until convergence • calculate the expected number of times each transition or emission is used • adjust the parameters to maximize the likelihood of these expected values

  32. How can we use HMMs for pairwise alignment? • What is the observed sequence? • one of the two sequences? • both sequences? • What is the hidden path? • the alignment

  33. Profile HMM for pairwise alignment • Select one sequence to be observed (the query) • The other sequence (the reference) defines the states of the HMM • Three classes of states • Match: corresponds to aligned positions • Delete: positions of the reference that are deleted in the query • Insert: positions on the query that are insertions relative to the reference

  34. A 0.01 R 0.12 D 0.04 N 0.29 C 0.01 E 0.03 Q 0.02 G 0.01 d d d 3 1 2 i i i i 1 2 3 0 start end Insert and match states have emission distributions over sequence characters m m 2 m 3 1 Profile HMMs Delete states are silent; they Account for characters missing in some sequences Insert states account for extra characters in some sequences Match states represent key conserved positions

  35. Example Profile HMM insert states delete states (silent) Figure from A. Krogh, An Introduction to Hidden Markov Models for Biological Sequences match states

  36. Profile HMM considerations • Odd asymmetry: have to pick one sequence as reference • Models conditional probability P(X|Y) of query sequence (X) given reference sequence (Y) • Is there something more natural here? • Yes, Pair HMMs • We will revisit Profile HMMs for multiple alignment a bit later

  37. Pair Hidden Markov Models • each non-silent state emits one or a pair of characters H: homology (match) state I: insert state D: delete state

  38. PHMM Paths = Alignments sequence 1: AAGCGC sequence 2: ATGTC B H A A H A T I G I C H G G D T H C C E hidden: observed:

  39. Transition Probabilities • probabilities of moving between states at each step state i+1 state i

  40. Emission Probabilities Deletion (D) Insertion (I) Homology (H) single character single character pairs of characters

  41. Pair HMM Viterbi • probability of most likely sequence of hidden states generating length i prefix of x and length j prefix of y, with the last state being: H I D • note that the recurrence relations here allow ID and DI transitions

  42. PHMM Alignment • calculate probability of most likely alignment • traceback, as in Needleman-Wunsch (NW), to obtain sequence of state states giving highest probability HIDHHDDIIHH...

  43. Correspondence with Needleman-Wunsch (NW) • NW values ≈ logarithms of Pair HMM Viterbi values

  44. Posterior Probabilities • there are similar recurrences for the Forward and Backward values • from theForward andBackwardvalues, we can calculate the posterior probability of the event that the path passes through a certain state S, after generating length i and j prefixes

  45. Uses for Posterior Probabilities • sampling of suboptimal alignments • posterior probability of pairs of residues being homologous (aligned to each other) • posterior probability of a residue being gapped • training model parameters (EM)

  46. Posterior Probabilities plot of posterior probability of each alignment column

  47. Parameter Training • supervised training • given: sequences and correct alignments • do: calculate parameter values that maximize joint likelihood of sequences and alignments • unsupervised training • given: sequence pairs, but no alignments • do: calculate parameter values that maximize marginal likelihood of sequences (sum over all possible alignments)

  48. Multiple Alignment with Profile HMMs • given a set of sequences to be aligned • use Baum-Welch to learn parameters of model • may also adjust length of profile HMM during training • to compute a multiple alignment given the profile HMM • run the Viterbi algorithm on each sequence • Viterbi paths indicate correspondences among sequences

  49. Multiple Alignment with Profile HMMs

  50. More common multiple alignment strategy: Progressive alignment -TGTTAAC -TGT-AAC -TGT--AC ATGT---C ATGT-GGC -TGTAAC -TGT-AC ATGT--C ATGTGGC TGTAAC TGT-AC ATGT--C ATGTGGC TGTTAAC TGTAAC TGTAC ATGTC ATGTGGC

More Related