
Multiple sequence alignment and their reliability



Presentation Transcript


  1. Multiple sequence alignment and their reliability The Bioinformatics Unit G.S. Wise Faculty of Life Science Tel Aviv University, Israel January 2013 By Haim Ashkenazy http://guidance.tau.ac.il/workshop_2013/ TAU Bioinformatics Workshop

  2. What are alignments good for? • To compare sequences • Find homology • Similar sequence → similar function • To learn about sequence evolution • Mismatch = point mutation • Gap = indel (insertion or deletion) • Reconstruct phylogenetic tree • Infer selection forces, e.g., detecting positive selection, co-evolving sites • For structure prediction • Similar regions potentially have similar structure

  3. Making an alignment (pairwise) • For 2 sequences – Pairwise alignment • Local alignment – finds regions of high similarity in parts of the sequences. • Global alignment – finds the best alignment across the entire length of both sequences • Use an exact solution • Needleman-Wunsch (for global) or Smith-Waterman (for local) - http://www.ebi.ac.uk/Tools/psa/
     Example alignments from the slide:
       ADLGAVFALCDRYFQ
       ||||     |||| |
       ADLGRTQN CDRYYQ

       ADLGAVFALCDRYFQ
       ||||     |||| |
       ADLGRTQN-CDRYYQ
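The exact global method named on this slide can be sketched in a few lines. This is a minimal Needleman-Wunsch illustration, not the implementation behind the EBI tools: the scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for readability, whereas real protein alignments normally use a substitution matrix and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of two strings with a linear gap penalty (toy scoring)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


aligned_a, aligned_b, total = needleman_wunsch("ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ")
print(aligned_a)
print(aligned_b)
print("score:", total)
```

The traceback always prefers the diagonal move when several moves are equally good, so it reports only one of possibly many co-optimal alignments.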

  4. Sequence evolution ATGAAATAA 30 MYA ATGTTTTAA ATGCCCAAATAA 5 MYA ATGCCCAAA ATGTTTTCA ATGTTTTAA Today

  5. Alignment and phylogeny are mutually dependent MSA Unaligned sequences Sequence alignment Phylogeny reconstruction Inaccurate tree building

  6. Alignment and phylogeny are both challenging: ~25% of residues are wrongly aligned (based on BAliBASE, a large representative set of proteins)

  7. Alignment and phylogeny are both challenging: 5% of tree branches are wrong (based on simulations of 100 protein sequences)

  8. Making an alignment (MSA) • For more sequences - Multiple sequence alignment (MSA) • Exact methods are not feasible (too slow) • We use heuristic methods • Several advanced MSA programs are available. Basically, two methods are recommended: • MAFFT – fastest and one of the most accurate • PRANK – distinct from all other MSA programs because of its correct treatment of insertions/deletions

  9. Progressive alignment A B C D E First step: compute pairwise distances Compute the pairwise alignments for all against all (10 pairwise alignments). The similarities are converted to distances and stored in a table
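As a rough illustration of the first step, the sketch below builds the all-against-all distance table for five sequences. It assumes a stand-in similarity measure (Python's `difflib.SequenceMatcher.ratio`) in place of real pairwise alignment scores, and it borrows the toy sequences from the sequence evolution slide; the names A to E are arbitrary labels.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_distances(seqs):
    """Return {(name_i, name_j): distance} for every unordered pair of sequences."""
    table = {}
    for (name_a, seq_a), (name_b, seq_b) in combinations(seqs.items(), 2):
        # Stand-in similarity in [0, 1]; a real pipeline would derive this
        # from a pairwise alignment score instead.
        similarity = SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()
        table[(name_a, name_b)] = 1.0 - similarity
    return table


seqs = {
    "A": "ATGAAATAA",
    "B": "ATGTTTTAA",
    "C": "ATGCCCAAATAA",
    "D": "ATGCCCAAA",
    "E": "ATGTTTTCA",
}
table = pairwise_distances(seqs)
print(len(table))  # 5 choose 2 = 10 pairs
for pair, distance in sorted(table.items(), key=lambda item: item[1]):
    print(pair, round(distance, 3))
```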

  10. A B C D E Second step: build a guide tree • Cluster the sequences to create a tree (guide tree): • represents the order in which pairs of sequences are to be aligned • similar sequences are neighbors in the tree • distant sequences are distant from each other in the tree The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences!
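One simple way to turn a distance table into a guide tree is average-linkage clustering (UPGMA). The slide does not say which clustering method is used, so UPGMA here is an assumption made for illustration, and the distances in the example are invented.

```python
def upgma(names, dist):
    """Average-linkage clustering; returns a nested tuple such as (("A", "B"), "C").

    dist maps frozenset({x, y}) to the distance between leaves x and y.
    """
    # Each cluster is stored as (subtree, tuple_of_leaf_names).
    clusters = [(name, (name,)) for name in names]

    def cluster_distance(c1, c2):
        # Mean leaf-to-leaf distance between the two clusters.
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(dist[frozenset((x, y))] for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


names = ["A", "B", "C", "D"]
dist = {frozenset(pair): 0.8 for pair in
        [("A", "C"), ("A", "D"), ("B", "C"), ("B", "D")]}
dist[frozenset(("A", "B"))] = 0.1  # hypothetical: A and B are very similar
dist[frozenset(("C", "D"))] = 0.2  # hypothetical: C and D are similar
guide_tree = upgma(names, dist)
print(guide_tree)  # (('A', 'B'), ('C', 'D'))
```

The result only fixes the order of alignment; as the slide stresses, it should not be read as the true evolutionary tree.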

  11. Sequence A Sequence B Sequence C Sequence D Sequence E A B C D E Third step: align sequences in a bottom-up order • Align the most similar (neighboring) pairs • Align pairs of pairs • Align sequences clustered to pairs of pairs deeper in the tree
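The bottom-up order can be read off a guide tree with a post-order walk, in which every subtree is handled before its parent. The sketch below only lists which groups would be aligned at each step; the tree shape is a hypothetical example, and the actual profile-to-profile alignment is left out.

```python
def merge_order(tree):
    """Post-order walk of a nested-tuple guide tree.

    Returns (leaves, steps), where steps lists the (left_leaves, right_leaves)
    groups to align, children before parents.
    """
    if isinstance(tree, str):
        return [tree], []
    left, right = tree
    left_leaves, left_steps = merge_order(left)
    right_leaves, right_steps = merge_order(right)
    steps = left_steps + right_steps + [(left_leaves, right_leaves)]
    return left_leaves + right_leaves, steps


tree = ((("A", "B"), ("C", "D")), "E")  # hypothetical guide tree
leaves, steps = merge_order(tree)
for left, right in steps:
    print("align", "+".join(left), "with", "+".join(right))
# align A with B
# align C with D
# align A+B with C+D
# align A+B+C+D with E
```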

  12. Multiple sequence alignment (MSA) A B C D E Pairwise distance table Guide tree A MSA B C D E progressive alignment Iterative

  13. Sources of alignment errors Progressive alignment algorithms are greedy heuristics • Co-optimal solutions → Heads-or-Tails (HoT) scores (Landan & Graur 2007)
       GEELTNWPSPVCHNRLASGIDDSTAFRFPRPQKWIISYSLHCVI...
       GEELTLWPSPVCHNRLASGIDASIAFRFPRAQKRFYRYSLHCVI...
       TEELTHWPFPVCRNRLARGIGSAIAFRCPRSQEHI-RNSLPCVI...
       TEELRYWPFPVCQN--ARGNGSVIEARNPGSQ-----KVLPYVI...

       ...IVCHLSYSIIWKQPRPFRFATSDDIGSALRNHCVPSPWNTLEEG
       ...IVCHLSYRYFRKQARPFRFAISADIGSALRNHCVPSPWLTLEEG
       ...IVCPLSNRI-HEQSRPCRFAIASGIGRALRNRCVPFPWHTLEET
       ...IVYPLVK-----QSGPNRAEIVSGNGRA--NQCVPFPWYRLEET
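The Heads-or-Tails idea is to align the sequences once as given and once reversed, then flip the second result back and compare the two. The sketch below shows only that wrapper. The `pad_right_aligner` function is a deliberately naive, hypothetical stand-in for a real MSA program, chosen because its directional bias makes the heads and tails results differ visibly.

```python
def pad_right_aligner(seqs):
    """Toy stand-in for an MSA program: pad shorter sequences with trailing gaps."""
    width = max(len(s) for s in seqs)
    return [s + "-" * (width - len(s)) for s in seqs]


def heads_and_tails(seqs, aligner):
    """Align the sequences as given (heads) and reversed (tails).

    The tails alignment is reversed again so both results read left to right.
    """
    heads = aligner(seqs)
    reversed_inputs = [s[::-1] for s in seqs]
    tails = [row[::-1] for row in aligner(reversed_inputs)]
    return heads, tails


heads, tails = heads_and_tails(["ACGT", "AC"], pad_right_aligner)
print(heads)  # ['ACGT', 'AC--']
print(tails)  # ['ACGT', '--AC']
```

Any callable that takes a list of sequences and returns equally long gapped rows could be passed as `aligner`; regions where heads and tails disagree are the ones to treat as uncertain.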

  14. GUIDANCE: Guide-tree based alignment confidence scores [Figure: Base alignment; Bootstrap sampling of NJ trees (Tree 1, Tree 2, …, Tree 99, Tree 100); Progressive alignment (MSA 1, MSA 2, …, MSA 99, MSA 100); GUIDANCE scores on a scale from 1 to 0] Penn, Privman et al. MBE. 2010
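The scoring step can be approximated as follows: for each pair of residues aligned in the base MSA, count the fraction of alternative MSAs (one per bootstrap guide tree) that align the same pair, then average those fractions per column. This is a simplified sketch of that idea, not the GUIDANCE code itself; the function names are hypothetical, and the example uses two hand-written alternative MSAs instead of 100 bootstrap-derived ones.

```python
from itertools import combinations


def column_pairs(msa):
    """For each column, the set of residue pairs that the column aligns.

    A residue is identified as (row index, position in the ungapped sequence).
    """
    counters = [0] * len(msa)
    columns = []
    for col in range(len(msa[0])):
        residues = []
        for row, seq in enumerate(msa):
            if seq[col] != "-":
                residues.append((row, counters[row]))
                counters[row] += 1
        columns.append(set(combinations(residues, 2)))
    return columns


def guidance_like_column_scores(base_msa, alternative_msas):
    """Per-column mean support of the base MSA's residue pairs."""
    alt_pair_sets = [set().union(*column_pairs(m)) for m in alternative_msas]
    scores = []
    for pairs in column_pairs(base_msa):
        if not pairs:
            scores.append(None)  # fewer than two residues: nothing to score
            continue
        support = [sum(p in alt for alt in alt_pair_sets) / len(alt_pair_sets)
                   for p in pairs]
        scores.append(sum(support) / len(support))
    return scores


base = ["AC-T", "ACGT"]
alternatives = [["AC-T", "ACGT"], ["A-CT", "ACGT"]]
print(guidance_like_column_scores(base, alternatives))  # [1.0, 0.5, None, 1.0]
```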

  15. Comparing alignments Common measures to quantify the distance between two MSAs: • Column score (CS): Each column of the MSA that is identically aligned in the other MSA is given a score of 1; all other columns are given a score of 0. • Sum-of-pairs score (SP): Each pair of residues in the MSA that is identically aligned in the other MSA is given a score of 1; all other residue pairs are given a score of 0. • Sum-of-pairs column score (SPC): The score of each column is the average of the SP scores over all residue pairs in it.
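A small sketch of the CS and SP measures, under the assumption that the overall score is reported as the fraction of columns (or residue pairs) that receive a score of 1. Function names are illustrative, and both MSAs are assumed to list the same sequences in the same row order.

```python
from itertools import combinations


def columns_as_residues(msa):
    """Represent each column as a tuple of residue indices (None for a gap)."""
    counters = [0] * len(msa)
    columns = []
    for col in range(len(msa[0])):
        column = []
        for row, seq in enumerate(msa):
            if seq[col] == "-":
                column.append(None)
            else:
                column.append(counters[row])
                counters[row] += 1
        columns.append(tuple(column))
    return columns


def column_score(msa, reference):
    """Fraction of columns in `msa` that appear identically in `reference`."""
    ref_columns = set(columns_as_residues(reference))
    cols = columns_as_residues(msa)
    return sum(c in ref_columns for c in cols) / len(cols)


def sum_of_pairs_score(msa, reference):
    """Fraction of aligned residue pairs in `msa` that are also aligned in `reference`."""
    def pairs(m):
        found = set()
        for column in columns_as_residues(m):
            present = [(row, idx) for row, idx in enumerate(column) if idx is not None]
            found.update(combinations(present, 2))
        return found

    own, ref = pairs(msa), pairs(reference)
    return len(own & ref) / len(own) if own else 0.0


msa = ["AC-T", "ACGT"]
reference = ["A-CT", "ACGT"]
print(column_score(msa, reference))        # 0.5
print(sum_of_pairs_score(msa, reference))  # 2 of 3 pairs agree
```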

  16. Accuracy of GUIDANCE scores

  17. As a rule of thumb, use HoT for fewer than 8 sequences http://guidance.tau.ac.il

  18. http://guidance.tau.ac.il Un-aligned sequences (FASTA format) Choose sequence type Choose alignment method

  19. GUIDANCE results MSA colored by confidence score

  20. GUIDANCE results Confident Uncertain Sequence score Column score

  21. GUIDANCE outputs Download MSA for down-stream analysis Text files with all scores Mask residue by score Remove unreliable sequences

  22. GUIDANCE results Confident Uncertain Sequence score Column score

  23. GUIDANCE outputs Remove unreliable sequences Sequences left after filtration Re-align sequences after filtration

  24. Filtering sequences with low scores and re-aligning But always remember not to remove too much data and consider the biology…

  25. GUIDANCE outputs Remove unreliable columns MSA after filtration

  26. Filtering columns with low scores
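Column filtering amounts to dropping every alignment column whose confidence score falls below a chosen cutoff. The sketch below assumes the per-column scores are already available as a list; the scores and the cutoff value are hypothetical.

```python
def filter_columns(msa, column_scores, cutoff):
    """Keep only the columns whose score is at least `cutoff`."""
    keep = [i for i, s in enumerate(column_scores) if s is not None and s >= cutoff]
    return ["".join(seq[i] for i in keep) for seq in msa]


msa = ["ACG-T", "AC-AT", "ACGAT"]
scores = [1.0, 0.95, 0.40, 0.55, 1.0]  # hypothetical per-column scores
print(filter_columns(msa, scores, cutoff=0.9))  # ['ACT', 'ACT', 'ACT']
```

Columns without a score (`None`) are dropped as well, which is a design choice of this sketch and not necessarily what the server does.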

  27. GUIDANCE outputs Masking unreliably aligned residues

  28. Filtering residues with low scores
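Residue masking replaces individual low-scoring residues with a placeholder while keeping every column, so the alignment length and the gap positions stay unchanged. In this sketch the per-residue scores, the cutoff, and the mask character `X` (a common choice for proteins) are all assumptions.

```python
def mask_residues(msa, residue_scores, cutoff, mask_char="X"):
    """Replace residues scoring below `cutoff` with `mask_char`; gaps are untouched."""
    masked = []
    for seq, scores in zip(msa, residue_scores):
        row = []
        for char, score in zip(seq, scores):
            if char != "-" and score is not None and score < cutoff:
                row.append(mask_char)
            else:
                row.append(char)
        masked.append("".join(row))
    return masked


msa = ["ACG-T", "AC-AT"]
residue_scores = [  # hypothetical per-residue scores; None marks a gap
    [1.0, 0.9, 0.3, None, 1.0],
    [1.0, 0.9, None, 0.2, 1.0],
]
print(mask_residues(msa, residue_scores, cutoff=0.5))  # ['ACX-T', 'AC-XT']
```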

  29. Filtering unreliable regions can improve down-stream analysis (Mol Biol Evol 2012;29:1-5)

  30. Acknowledgments • Prof. Tal Pupko • Dr. Eyal Privman • Dr. Osnat Penn • Pupko’s lab members
      • Penn, O., Privman, E., Ashkenazy, H., Landan, G., Graur, D. and Pupko, T. (2010). GUIDANCE: a web server for assessing alignment confidence scores. Nucleic Acids Research 38 (Web Server issue): W23-W28. doi: 10.1093/nar/gkq443
      • Penn, O., Privman, E., Landan, G., Graur, D. and Pupko, T. (2010). An alignment confidence score capturing robustness to guide-tree uncertainty. Molecular Biology and Evolution 27(8): 1759-67. doi: 10.1093/molbev/msq066
      • Landan, G. and Graur, D. (2008). Local reliability measures from sets of co-optimal multiple sequence alignments. Pac Symp Biocomput 13: 15-24

  31. Thanks for your attention!

  32. Try it! • Download and save the sequences file "Seq_For_GUIDANCE.fs" from http://guidance.tau.ac.il/workshop_2013/ (File → “Save as”). This file contains 20 protein sequences in FASTA format. • Run the GUIDANCE web server to create a protein alignment: • Use the GUIDANCE algorithm • Select “amino acids” as the sequence type • Select MAFFT as the alignment method • Run (press the “Submit” button). • (In case it does not run for you, you can see the results at: http://guidance.tau.ac.il/results/13589321556364/output.php) • What is the alignment score? What does it mean about the alignment achieved? • Which sequences can be removed to improve the alignment? What is the biological justification for that?
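Before uploading, it can help to check that the downloaded file really holds the expected number of sequences. The following is a minimal FASTA reader written for that purpose; it is a sketch that keeps the full header line as the record name and silently overwrites duplicate names, so a dedicated parser is preferable for real work.

```python
def read_fasta(lines):
    """Parse FASTA text given as an iterable of lines; returns {header: sequence}."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


example = ">seq1 demo\nMKV\nLLA\n>seq2\nMKI\n"  # made-up records for illustration
records = read_fasta(example.splitlines())
print(records)  # {'seq1 demo': 'MKVLLA', 'seq2': 'MKI'}

# With the workshop file saved locally, the count should be 20 according to the slide:
# with open("Seq_For_GUIDANCE.fs") as handle:
#     print(len(read_fasta(handle)))
```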

  33. Appendix – MSA servers

  34. MAFFT • Web server & download: http://mafft.cbrc.jp/alignment/server/

  35. Choosing a MAFFT strategy: quick & dirty or slow but accurate • Efficiency-tuned variants: quick & dirty or slow but accurate

  36. Choosing a MAFFT strategy: quick & dirty or slow but accurate

  37. Choosing a MAFFT strategy: quick & dirty or slow but accurate

  38. Choosing a MAFFT strategy: quick & dirty or slow but accurate
      E-INS-i
        oooooooooXXX------XXXX---------------------------------XXXXXXXXXXX-XXXXXXXXXXXXXXXooooooooooooo
        ---------XXXXXXXXXXXXXooo------------------------------XXXXXXXXXXXXXXXXXX-XXXXXXXX-------------
        -----ooooXXXXXX---XXXXooooooooooo----------------------XXXXX----XXXXXXXXXXXXXXXXXXooooooooooooo
        ---------XXXXX----XXXXoooooooooooooooooooooooooooooooooXXXXX-XXXXXXXXXXXX--XXXXXXX-------------
        ---------XXXXX----XXXX---------------------------------XXXXX---XXXXXXXXXX--XXXXXXXooooo--------
      G-INS-i
        XXXXXXXXXXX-XXXXXXXXXXXXXXX
        XX-XXXXXXXXXXXXXXX-XXXXXXXX
        XXXXX----XXXXXXXX---XXXXXXX
        XXXXX-XXXXXXXXXX----XXXXXXX
        XXXXXXXXXXXXXXXX----XXXXXXX
      L-INS-i
        ooooooooooooooooooooooooooooooooXXXXXXXXXXX-XXXXXXXXXXXXXXX------------------
        --------------------------------XX-XXXXXXXXXXXXXXX-XXXXXXXXooooooooooo-------
        ------------------ooooooooooooooXXXXX----XXXXXXXX---XXXXXXXooooooooooo-------
        --------ooooooooooooooooooooooooXXXXX-XXXXXXXXXX----XXXXXXXoooooooooooooooooo
        --------------------------------XXXXXXXXXXXXXXXX----XXXXXXX------------------
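For a locally installed MAFFT, the three accuracy-oriented strategies are usually selected with command-line flags. The sketch below only assembles the commands. The flag sets are quoted from memory of the MAFFT manual and should be treated as assumptions to verify against `mafft --help` for the installed version; the input file name is the one from the exercise.

```python
import shlex

# Flag sets as commonly documented for MAFFT; verify against your installed version.
MAFFT_STRATEGIES = {
    "L-INS-i": ["--localpair", "--maxiterate", "1000"],
    "G-INS-i": ["--globalpair", "--maxiterate", "1000"],
    "E-INS-i": ["--ep", "0", "--genafpair", "--maxiterate", "1000"],
}


def mafft_command(strategy, fasta_path):
    """Build the argument list for running MAFFT with the given strategy."""
    return ["mafft", *MAFFT_STRATEGIES[strategy], fasta_path]


cmd = mafft_command("L-INS-i", "Seq_For_GUIDANCE.fs")
print(shlex.join(cmd))
# mafft --localpair --maxiterate 1000 Seq_For_GUIDANCE.fs

# To actually run it (MAFFT writes the alignment to standard output):
# import subprocess
# result = subprocess.run(cmd, capture_output=True, text=True, check=True)
# aligned_fasta = result.stdout
```

Building the command as a list instead of a single string avoids shell-quoting problems when file names contain spaces.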

  39. MAFFT output Choose a format (Clustal or FASTA) and save as a text file A colored view of the alignment Run GUIDANCE also from here!!

  40. PRANK

  41. PRANK will generally make fewer mistakes that affect phylogeny reconstruction and other analyses Classical alignment errors for HIV env

  42. PRANK • Web server: http://www.ebi.ac.uk/goldman-srv/webPRANK/

  43. PRANK output If you need a different format – copy the results to the READSEQ sequence converter: http://www-bimas.cit.nih.gov/molbio/readseq/
