1 / 38

Sequence Similarity

Sequence Similarity. x i. ―. x i. y j. MATCH. PROBCONS: Probabilistic Consistency-based Multiple Alignment of Proteins. INSERT X. INSERT Y. ―. y j. x i. y j. MATCH. INSERT X. INSERT Y. x i. ―. ―. y j. A pair-HMM model of pairwise alignment.

carl
Download Presentation

Sequence Similarity

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Sequence Similarity

  2. xi ― xi yj MATCH PROBCONS: Probabilistic Consistency-based Multiple Alignment of Proteins INSERT X INSERT Y ― yj

  3. xi yj MATCH INSERT X INSERT Y xi ― ― yj A pair-HMM model of pairwise alignment • Parameterizes a probability distribution, P(A), over all possible alignments of all possible pairs of sequences • Transition probabilities ~ gap penalties • Emission probabilities ~ substitution matrix (from BLOSUM) x ABRACA-DABRA AB-ACARDI--- y

  4. Computing Pairwise Alignments • The Viterbi algorithm • conditional distribution P(α | x, y) reflects model’s uncertainty over the “correct” alignment of x and y • identifies highest probability alignment, αviterbi, in O(L2) time Caveat: the mostlikely alignment is not the mostaccurate • Alternative: find the alignment of maximum expected accuracy P(α) αviterbi P(α | x, y)

  5. 4. F 4. T 4. F 4. F 4. F B A- A A- A 4. F 4. F 4. T 4. F 4. F B- B+ B+ B- C The Lazy-Teacher Analogy • 10 students take a 10-question true-false quiz • How do you make the answer key? • Approach #1: Use the answer sheet of the best student! • Approach #2: Weighted majority vote!

  6. Viterbi picks single alignment with highest chance of being completely correct mathematically, finds the alignment α that maximizes Eα*[1{α = α*}] Maximum Expected Accuracy picks alignment with highest expected number of correct predictions mathematically, finds the alignment α that maximizes Eα*[accuracy(α, α*)] 4. T A Viterbi vs. Maximum Expected Accuracy (MEA) 4. F 4. F 4. T 4. F 4. F B A- A A- A 4. F 4. F 4. F 4. F 4. T C B- B+ B+ B-

  7. Computing MEA alignments • Define accuracy (α, α*) = Eα*(accuracy(α, α*) | x, y) ~ Eα*(∑(xi, yj) in α1((xi, yj) in α*) | x,y) = ∑α’P(α’ | x, y) ∑(xi, yj) in α 1((xi, yj) in α’) = ∑(xi, yj) in α ∑α’P(α’ | x, y) 1((xi, yj) in α’) = ∑(xi, yj) in α P(xi, yj in α’ | x, y) • Define M[i, j] = posterior probability that xi is aligned to yj # of correct predicted matches length of shorter sequence

  8. Computing MEA alignments • Define accuracy (α, α*) = • Then, MEA alignment is highest summing path through the matrix M[i, j] = P(xi is aligned to yj | x, y) • M[i, j] = posterior probability that xi is aligned to yj • Can compute with forward, backward dynamic programming in O(L2) time # of correct predicted matches length of shorter sequence

  9. Computing MEA alignments • Define accuracy (α, α*) = • Then, MEA alignment is highest summing path through the matrix M[i, j] = P(xi is aligned to yj | x, y) • M[I, j] = posterior probability that xi is aligned to yj • Can compute with forward, backward dynamic programming in O(L2) time # of correct predicted matches length of shorter sequence

  10. The consistency signal zk z xi x y yj yj’

  11. To estimate P(xiyj | x, y, z) Method 1:triplet-HMM P(xi ~ yj | x, y, z) = ∑k P(xi~yj~zk | x, y, z) Parameters trained with unsupervised EM Running time: O(N3L3) N: # sequences L: sequence lengths

  12. Probabilistic consistency • Compute P(xi is aligned to yj | x, y) P(xi is aligned to yj | x, y, z) • 2 approaches: • 1) Exact – triplet HMM, O(L3) time • 2) Approximate – use independence assumptions ∑k P(xi ~ zk and zk ~ yj | x, y, z) = ∑k P(xi ~ zk | x, z) P(zk ~ yj | x, y, z, xi ~ zk)  (assume indep.) ∑k P(xi ~ zk | x, z) P(zk ~ yj | z, y)

  13. Probabilistic consistency • Compute P(xi is aligned to yj | x, y, z) To compute P(xi ~ yj | x, y, z) ~ ∑k P(xi ~ zk | x, z) P(zk ~ yj | z, y) Notice that for any given i, most entries k and j will be close to 0 -- sparse matrices Pxy|z PxzPzy Finally, let Pxy|S 1/|S|∑z in S PxzPzy

  14. ABRACA-DABRA AB-ACARDI--- ABRA---DABI- ABRACA-DABRA AB-ACARDI--- ABRACADABRA ABRA--DABI- AB-ACARDI--- ABRA---DABI- Multiple sequence alignment • A straightforward generalization • sum-of-pairs • tree-based progressive alignment • iterative refinement

  15. ABRACA-DABRA AB-ACARDI--- ABRA---DABI- ABRACA-DABRA AB-ACARDI--- ABRA---DABI- ABRACADABRA ABRA--DABI- ABRACA-DABRA AB-ACARDI--- ABRA---DABI- ABRACA-DABRA AB-ACARDI--- ABRACADABRA ABRA--DABI- AB-ACARDI--- ABRA---DABI- ABRACA-DABRA AB-ACARD--I- ABRA---DABI- ABRACA-DABRA AB-ACARDI--- ABACARDI ABRACADABRA ABACARDI ABRADABI Multiple sequence alignment • A straightforward generalization • sum-of-pairs • tree-based progressive alignment • iterative refinement

  16. Summary of PROBCONS Algorithm • Given K sequences to be aligned, • Compute M[i, j] for all pairs of sequences, x and y • (2) Use probabilistic consistency to reestimate M[i, j] • Build a tree of the sequences by connecting closest first • “Closest” defined according to expected accuracy • EA(x, y) = E(accuracy) of MEA alignment of x and y • Perform progressive alignment along the tree • Score of a column: sum-of-pairs M[i, j] • Apply iterative refinement

  17. Training/testing methodology • 3 reference benchmark sets • PROBCONS parameters trained via unsupervised EM on unaligned sequences from BAliBASE. • Quality score: Q(α, α*) = BAliBASE PREFAB SABmark # of correct predicted matches total # of true matches

  18. Evaluation of Algorithm Components all-pairs pairwise multiple

  19. Performance of different alignment tools

  20. Resources for alignment Protein Multiple Aligners http://www.ebi.ac.uk/clustalw/ CLUSTALW – most widely used (1994) http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py MUSCLE – most scalable (2004) http://probcons.stanford.edu/ PROBCONS – most accurate (2004) Some more protein multiple aligners: MULTALIGN, MSA, DIALIGN, DCA, MACAW, TCOFFEE, MAFFT, DSC, MUSEQUAL, TOPLIGN, SACHMO, MATCHBOX, PRRN, SAM, MAXHOM, STRAP, ALIGN, AMAS, PILEUP, etc……. ProbCons: Chuong (Tom) Do

  21. Profile hidden Markov models for sequence famillies

  22. PFAM Protein FAMilies database of alignments • Profile HMMs describe each family • For each family in Pfam you can: • Look at multiple alignments • View protein domain architectures • Examine species distribution • Follow links to other databases • View known protein structures

  23. PFAM Pfam-A – curated multiple alignments • Grows slowly; quality controlled by experts Pfam-B – automatic clustering (ProDom derived) • New sequences instantly incorporated; unchecked • Search by: Sequence, keyword, domain, taxonomy • Browsing by family or genome • Evolutionary tree • Source of seed alignments: • Pfam-B families • Published articles • ‘Domain hunting' studies

  24. Profile HMMs • Each M state has a position-specific pre-computed substitution table • Each I state has position-specific gap penalties (and in principle can have its own emission distributions) • Each D state also has position-specific gap penalties • In principle, D-D transitions can also be customized per position Dm-1 Dm D1 D2 BEGIN END I0 I1 Im-1 Im M1 M2 Mm Protein Family F

  25. Profile HMMs • transition between match states – αM(i)M(i+1) • transitions between match and insert states – αM(i)I(i), αI(i)M(i+1) • transition within insert state – αI(i)I(i) • transition between match and delete states – αM(i)D(i+1), αD(i)M(i+1) • transition within delete state – αD(i)D(i+1) • emission of amino acid b at a state S – εS(b) Dm-1 Dm D1 D2 BEGIN END I0 I1 Im-1 Im M1 M2 Mm Protein Family F

  26. Profile HMMs • transition probabilities ~ frequency of a transition in alignment • emission probabilities ~ frequency of an emission in alignment • pseudocounts are usually introduced Dm-1 Dm D1 D2 BEGIN END I0 I1 Im-1 Im M1 M2 Mm Protein Family F

  27. Alignment of a protein to a profile HMM To align sequence x1…xn to a profile HMM: We will find the most likely alignment with the Viterbi DP algorithm • Define • VjM(i): score of best alignment of x1…xi to the HMM ending in xi being emitted from Mj • VjI(i): score of best alignment of x1…xi to the HMM ending in xi being emitted from Ij • VjD(i): score of best alignment of x1…xi to the HMM ending in Dj (xi is the last character emitted before Dj) • Denote by qa the frequency of amino acid a in a ‘random’ protein

  28. Alignment of a protein to a profile HMM Vj-1M(i – 1) + log αM(j-1)M(j) • VjM(i) = log (εM(j)(xi) / qxi) + max Vj-1I(i – 1) + log αI(j-1)M(j) Vj-1D(i – 1) + log αD(j-1)M(j) VjM(i – 1) + log αM(j)I(j) • VjI(i) = log (εI(j)(xi) / qxi) + max VjI(i – 1) + log αI(j)I(j) VjD(i – 1) + log αD(j)I(j) Vj-1M(i) + log αM(j-1)D(j) • VjD(i) = max Vj-1I(i) + log αI(j-1)D(j) Vj-1D(i) + log αD(j-1)D(j)

  29. Weight of each sequence i h g f • One simple weighting scheme is to find how much edge length each leaf contributes • Example: edge 1 belongs to a • Example: edge 3 belongs both to a, and to b: e3e1/(e1+e2) goes to a Δwi = ecurrent wi / (leaves k below ecurrentwk) e d c 2 b 3 1 a

  30. How to build a profile HMM

  31. Resources on the web • HMMer – a free profile HMM software • http://hmmer.wustl.edu/ • SAM – another free profile HMM software • http://www.cse.ucsc.edu/research/compbio/sam.html • PFAM – database of alignments and HMMs for protein families and domains • http://www.sanger.ac.uk/Software/Pfam/ • SCOP – a structural classification of proteins • http://scop.berkeley.edu/data/scop.b.html

More Related