
Sequence Alignments Some key terms and concepts : Indel (insertion and deletion)


mills

Presentation Transcript


  1. Sequence Alignments. Some key terms and concepts: • Indel (insertion and deletion) • Pairwise sequence alignment (e.g., BLAST) versus multiple sequence alignment (e.g., Clustal) • Scoring models and matrices (apply to both amino-acid and nucleotide sequences) • Alignment formats (FASTA, Clustal, Phylip, Nexus) • Pairwise clustering (e.g., UPGMA) • Format conversion tools • Interleaved vs. sequential format

  2. Sequence Alignments • Two goals: • 1.  Similarity searches to identify homologs. • FASTA, BLAST • 2.  Creation of multiple sequence alignments for comparative analysis (phylogenetics, structure-function, etc.). • Clustal and others

  3. Pairwise alignments: optimize the pairwise alignment based on some scoring scheme • Some famous methods: • FASTA (Pearson and Lipman) • BLAST • Smith-Waterman

  4. Origin of FASTA

  5. From W.R. Pearson, 1994

  6. FASTA format: the sequence name is preceded by “>” and followed by a hard return; the sequence is also followed by a hard return.
     >SeqA
     GATCGCGTTTCCC
     >SeqB
     GATCGATTTCCC
     >SeqC
     GATCGGATTTCCC
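The FASTA layout on this slide can be read with a few lines of code. The following is a minimal sketch, not a full parser: the function name `parse_fasta` is hypothetical, and it assumes that the first word after ">" is the sequence name and that a sequence may be wrapped over several lines.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping name -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: take the first word after ">" as the name.
            words = line[1:].split()
            name = words[0] if words else ""
            records[name] = ""
        elif name is not None:
            # Sequence line: append, since a sequence may span several lines.
            records[name] += line
    return records


example = """>SeqA
GATCGCGTTTCCC
>SeqB
GATCGATTTCCC
>SeqC
GATCGGATTTCCC
"""

records = parse_fasta(example)
for seq_name, seq in records.items():
    print(seq_name, len(seq), seq)
```

Printing the lengths makes the problem discussed on the following slides visible: the three sequences differ in length, so indels must be placed before they can be compared column by column.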

  7. Two types of alignments • 1.  Global • assume homology over entire sequence • 2.  Local • look for windows of similarity

  8. The issue:  Comparative analysis of homologous nucleotide and amino-acid sequences requires an accurate alignment of sequences. • The problem: Construction of alignments is complicated by indels (insertions and deletions).  At issue is where the indels get placed in the alignment. • Approaches to solving this problem can be manual (involving subjective decisions of the person making the alignment) or algorithmic (almost always employing a computer).  Often, alignments are created by computer and then refined by the scientist.

  9. Species 1: 3510188 CTGATCCGAGGTCAACCTTGGGTT-GTGAAGGTCGTTTTACGGCTGGAAC 3510237
                        |||||||||||||||||||||| | | |||||||||||||||||||||||
     Species 2:     562 CTGATCCGAGGTCAACCTTGGGGTCGCGAAGGTCGTTTTACGGCTGGAAC 513

  10. But decisions need to be made. Easy:
      Seq A ---GATCGAGTTTCCC---
      Seq B ---GATCGATTTCCC---
      Aligned:
      Seq A ---GATCGAGTTTCCC---
      Seq B ---GATCGA-TTTCCC---

  11. Less Easy (this one needs a gap plus a mismatch)
      Seq A ---GATCGCGTTTCCC---
      Seq B ---GATCGATTTCCC---
      Either:
      Seq A ---GATCGCGTTTCCC---
      Seq B ---GATCG-ATTTCCC---
      OR?
      Seq A ---GATCGCGTTTCCC---
      Seq B ---GATCGA-TTTCCC---

  12. Computationally-based methods use scoring procedures. For example, a simple nucleotide scoring matrix such as the following:
         A  G  T  C  -
      A  1  0  0  0  0
      G  0  1  0  0  0
      T  0  0  1  0  0
      C  0  0  0  1  0
      -  0  0  0  0  0
      Nucleotide substitution scoring can be refined if it is known that certain kinds of substitutions are more common (for example, transitions are frequently more common than transversions). Amino-acid sequence alignments usually employ a PAM-based or BLOSUM-based scoring matrix; they can also be scored based on the minimum number of mutational steps from one amino acid to another. Gaps (indels) typically carry a penalty and are limited in terms of allowed length.
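To see what such a matrix does in practice, the sketch below scores the two candidate alignments from the "Less Easy" slide. It is an illustration under stated assumptions: the function names are hypothetical, and the transition score of 0.5 is an arbitrary value chosen only to show the effect of a refined model, not a value taken from the slides or from any program.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(a, b):
    """True for a purine<->purine or pyrimidine<->pyrimidine substitution."""
    return a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES)


def score_alignment(row_a, row_b, match=1, mismatch=0, gap=0, transition=None):
    """Sum column scores for two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap          # any column containing a gap
        elif a == b:
            total += match        # identical bases
        elif transition is not None and is_transition(a, b):
            total += transition   # optional refined score (assumed value)
        else:
            total += mismatch     # every other substitution
    return total


seq_a = "GATCGCGTTTCCC"
option_1 = "GATCG-ATTTCCC"   # gap, then a G/A mismatch (a transition)
option_2 = "GATCGA-TTTCCC"   # a C/A mismatch (a transversion), then a gap

# Simple matrix from the slide: both options receive the same score.
print(score_alignment(seq_a, option_1), score_alignment(seq_a, option_2))

# Hypothetical refinement: transitions score 0.5, so option 1 is preferred.
print(score_alignment(seq_a, option_1, transition=0.5),
      score_alignment(seq_a, option_2, transition=0.5))
```

Under the simple matrix the two placements tie, which is why the choice between them depends on the scoring model.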

  13. Scoring is model dependent

  14. Less Easy (this one needs a gap plus a mismatch)
      Seq A ---GATCGCGTTTCCC---
      Seq B ---GATCGATTTCCC---
      Either:
      Seq A ---GATCGCGTTTCCC---
      Seq B ---GATCG-ATTTCCC---
      OR?
      Seq A ---GATCGCGTTTCCC---
      Seq B ---GATCGA-TTTCCC---

  15. Clustal scoring procedures seem to be either simple (as described earlier) or IUB (1.9 for a match and 0 for a mismatch, where a match is an identity or a pairing with X or N). And see below.

  16. Computationally-based methods use scoring procedures. For example, a simple nucleotide scoring matrix such as the following:
         A  G  T  C  -
      A  1  0  0  0  0
      G  0  1  0  0  0
      T  0  0  1  0  0
      C  0  0  0  1  0
      -  0  0  0  0  0
      Nucleotide substitution scoring can be refined if it is known that certain kinds of substitutions are more common (for example, transitions are frequently more common than transversions). Amino-acid sequence alignments usually employ a PAM-based or BLOSUM-based scoring matrix; they can also be scored based on the minimum number of mutational steps from one amino acid to another. Gaps (indels) carry a penalty and are limited in terms of allowed length.

  17. A nice goal: consider every possible alternative alignment (and find the best) • For example, consider all alignment possibilities for two 2-nucleotide sequences (AG vs AA), assuming at least one homologous position:
      AG   AG-  AG-  A-G  A-G  -AG
      AA   -AA  -AA  -AA  AA-  AA-

  18. Computationally, things get complicated quickly. Consider some possibilities for two 3-base sequences (AGC vs AAC). Note that anything goes as long as gaps are not lined up together.
      AGC    -AGC   AGC-   -AGC   -AGC   AGC-
      AAC    AAC-   -AAC   AA-C   A-AC   A-AC

      AGC-   A-GC   A-GC   A-GC   -A-GC  -A-GC
      AA-C   AAC-   -AAC   AA-C   AAC--  A-AC-

      A-GC-  A-G-C  -AG-C  --AGC  AGC--
      AA-C-  -AAC-  AA-C-  AAC--  --AAC
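How quickly the possibilities grow can be checked by counting rather than listing them. The sketch below assumes the rule stated on the slide (any column is allowed except a gap over a gap) and counts every such alignment, including those with no aligned pair of residues, so its totals are larger than a list restricted to at least one homologous position. The function name `count_alignments` is hypothetical.

```python
def count_alignments(m, n):
    """Count gapped alignments of sequences of lengths m and n
    in which no column holds a gap in both rows."""
    # table[i][j] = number of alignments of the first i and first j residues.
    # With one sequence empty there is exactly one alignment (all gaps).
    table = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # The last column is residue/gap, gap/residue, or residue/residue.
            table[i][j] = table[i - 1][j] + table[i][j - 1] + table[i - 1][j - 1]
    return table[m][n]


for length in (2, 3, 10, 300):
    total = count_alignments(length, length)
    print(length, len(str(total)), "digits")
```

The count depends only on the two lengths, not on the bases themselves, which is why the growth cannot be avoided by choosing "easy" sequences.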

  19. So, the number of possibilities is greater than 2^N (N = length of sequence).  This means that the number of alignment possibilities for two 300-amino-acid-long proteins is greater than 2^300, which is approximately equal to 10^90.  There may be 10^80 elementary particles in the known universe.  Therefore, no computer can consider every pairwise alignment possibility. • Then how does one find the optimal alignment?  Solution: dynamic programming methods can eliminate search pathways that will not lead to identifying the optimal (best score) alignment.
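One classic dynamic programming method for global alignment is the Needleman-Wunsch algorithm; the slides do not name it, so it is used here only as an example of the idea. The sketch below is a minimal version under assumed scores (match 1, mismatch 0, and a linear gap penalty of -1 per gap position); real programs typically use substitution matrices and separate gap-opening and gap-extension penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment by dynamic programming.
    Returns (best score, aligned a, aligned b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap            # prefix of a against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap            # prefix of b against gaps only

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Each cell keeps only the best of three ways to end the alignment,
    # which is how inferior search pathways are discarded.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


best, row_a, row_b = needleman_wunsch("GATCGAGTTTCCC", "GATCGATTTCCC")
print(best)
print(row_a)
print(row_b)
```

The table has (len(a) + 1) × (len(b) + 1) cells, so the work grows with the product of the two lengths rather than exponentially.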

  20. Multiple Sequence Alignments

  21. Multiple sequence alignment (Clustal format interleaved)

  22. Multiple sequence alignment (Phylip format)
       20 372
      ThNM012b   TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
      ThNM012    TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
      ThNM043    TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
      ThlanugQH0 TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
      ThNM069    TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
      ThNM070    TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
      ThNM037    TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
      ThNM076    TTCCGCCGGG GGGGTNGTCC CNNGGCTCGG TGTGCCCCCG GGGCCCGTGC
      ThNM032    TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
      ThNM075    CTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TTT--TGCCG GGGCGCGTGC
      ThNM007    CTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TTT--TGCCG GGGCGCGTGC
      Talthermo  CTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TTT--TGCCG GGGCGCGTGC
      ThNM073    CTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TTT--TGCCG GGGCCCGTGC
      ThNM002    CTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TTT--TGCCG GGGCCCGTGC
      AfumHQ6310 GGCCGCCGGG GAGGC-CTTG CGC------- -----CCCC- GGGCCCGCGC
      ThNM0026A  -GCCGCCGGG GAGGC-CTTG CGC------- -----CCCC- GGGCCCGCGC
      ThNM025a   -GCCGCCGGG GAGGC-CTTG CGC------- -----CCCC- GGGCCCGCGC
      Aspni5     -GCCGCCGGG GGGGCGCCTC TGC------- -----CCCCC GGGCCCGTGC
      ThNM001    --CCGCCGGG GGGCGTGTCC CGC------- -----CCCC- GGGCCCGCGC
      ThaurantT8 --CCGCCGGG GGGCGTGTCC CGC------- -----CCCC- GGGCCCGCGC

  23. Multiple sequence alignment (Phylip format interleaved)

  24. Multiple sequence alignment (Phylip format sequential)
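As an illustration of the sequential layout, the sketch below writes an alignment in a strict Phylip-style sequential form: a header line with the number of sequences and the number of columns, then one line per sequence with the name in the first 10 characters. The function name `to_phylip_sequential` is hypothetical, and the details (names truncated or padded to 10 characters, each sequence on a single line) are assumptions of this sketch; individual programs may expect slightly different spacing.

```python
def to_phylip_sequential(alignment):
    """Format a dict of name -> aligned sequence as sequential Phylip text."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_chars = lengths.pop()
    # Header: number of sequences, then number of alignment columns.
    lines = [f"{len(alignment)} {n_chars}"]
    for name, seq in alignment.items():
        # Name occupies exactly 10 characters (truncated or space-padded).
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"


aligned = {
    "SeqA": "GATCGCGTTTCCC",
    "SeqB": "GATCG-ATTTCCC",
}
phylip_text = to_phylip_sequential(aligned)
print(phylip_text)
```

In the sequential layout each sequence is written in full before the next one begins, whereas the interleaved layout writes the sequences in successive blocks of columns.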

  25. Multiple sequence alignment (Nexus format interleaved)

  26. Multiple sequence alignment: ClustalW (and others). Steps: 1.  Pairwise alignments 2.  UPGMA or Neighbor-Joining tree based on pairwise scores (guide tree) 3.  Multiple alignment informed by the guide tree
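Step 2 can be sketched with a bare-bones UPGMA clustering. This is an illustration, not ClustalW's implementation: the function name `upgma` is hypothetical, only the tree topology is returned (no branch lengths), and the distances are assumed to be 100 minus the percent identity, using values read from the four-sequence identity matrix on slide 27.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a guide-tree topology by UPGMA.
    names: list of unique labels.
    dist: dict mapping frozenset({a, b}) -> distance between a and b.
    Returns the tree as nested tuples."""
    clusters = {name: 1 for name in names}   # cluster -> number of members
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Distance to the new cluster is the size-weighted average.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


identity = {
    ("C1_3006", "C8_9737"): 98.36,
    ("C1_3006", "C2_4297"): 97.99,
    ("C1_3006", "C4_3894"): 96.73,
    ("C8_9737", "C2_4297"): 95.33,
    ("C8_9737", "C4_3894"): 95.93,
    ("C2_4297", "C4_3894"): 98.78,
}
# Assumed conversion: distance = 100 - percent identity.
distances = {frozenset(pair): 100.0 - pct for pair, pct in identity.items()}
tree = upgma(["C1_3006", "C8_9737", "C2_4297", "C4_3894"], distances)
print(tree)
```

The nested tuples give the order in which sequences or groups of sequences would be joined in step 3, most similar first.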

  27. There is a problem with large alignments, which are common in the genomics age. The number of pairwise comparisons for a given set of sequences (N) is N(N-1)/2. Example: 4(3)/2 = 6
      1: C1_3006  100.00   98.36   97.99   96.73
      2: C8_9737          100.00   95.33   95.93
      3: C2_4297                  100.00   98.78
      4: C4_3894                          100.00
      What if 100,000 sequences? 100,000(99,999)/2 ≈ 5 billion (many days of computer time)
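The arithmetic on this slide is easy to check. The short sketch below uses a hypothetical helper, `n_pairwise`, for the formula N(N-1)/2 and confirms the small case by listing the pairs explicitly.

```python
from itertools import combinations


def n_pairwise(n):
    """Number of distinct pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


names = ["C1_3006", "C8_9737", "C2_4297", "C4_3894"]
pairs = list(combinations(names, 2))   # every unordered pair, listed once
print(len(pairs), n_pairwise(len(names)))

# The same formula for 100,000 sequences, printed with thousands separators.
print(f"{n_pairwise(100_000):,}")
```

Because the count grows with the square of N, going from 4 to 100,000 sequences multiplies the number of comparisons by close to a billion.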

  28. Multiple sequence alignment: Clustal Omega and others. Steps: 1.  Instead of a matrix of all pairwise comparisons, start with a set of “reference” sequences that are aligned to the remaining sequences 2.  UPGMA or Neighbor-Joining tree based on pairwise scores (guide tree) 3.  Multiple alignment informed by the guide tree

  29. Some useful resources • Clustal Omega analysis: European Bioinformatics Institute (https://www.ebi.ac.uk/Tools/msa/clustalo) • ClustalW analysis: UNM CETI Galaxy site (http://emil.unm.edu/galaxy) • ClustalX download (http://www.clustal.org/clustal2)
