Sequence Alignments
Some key terms and concepts:
• Indel (insertion and deletion)
• Pairwise sequence alignment (e.g., BLAST) versus multiple sequence alignment (e.g., Clustal)
• Scoring models and matrices (applies to both amino-acid and nucleotide sequences)
• Alignment formats (FASTA, Clustal, Phylip, Nexus)
• Pairwise clustering (e.g., UPGMA)
• Format conversion tools
• Interleaved vs. sequential format
Sequence Alignments
Two goals:
1. Similarity searches to identify homologs.
   • FASTA, BLAST
2. Creation of multiple sequence alignments for comparative analysis (phylogenetics, structure-function, etc.).
   • Clustal and others
Pairwise alignments
Optimize the pairwise alignment based on some scoring scheme.
Some famous methods:
• FASTA (Pearson and Lipman)
• BLAST
• Smith-Waterman
FASTA format
Each sequence name is preceded by ">" and followed by a hard return. Each sequence is also followed by a hard return.
>SeqA
GATCGCGTTTCCC
>SeqB
GATCGATTTCCC
>SeqC
GATCGGATTTCCC
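To make the format concrete, here is a minimal sketch of a FASTA reader in Python. The function name `parse_fasta` is hypothetical, and it assumes the name is the first word after ">" and that a sequence may wrap over several lines. It is an illustration only and does not replace an established parser.

```python
def parse_fasta(text):
    """Return a dict mapping each sequence name to its sequence string."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Header line: take the first word after ">" as the name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            # Sequence line: append, so wrapped sequences are joined.
            records[name] += line
    return records


example = """>SeqA
GATCGCGTTTCCC
>SeqB
GATCGATTTCCC
>SeqC
GATCGGATTTCCC
"""

sequences = parse_fasta(example)
for seq_name, seq in sequences.items():
    print(seq_name, len(seq), seq)
```

The three sequences differ in length (13, 12, and 13 bases), which is why they still need to be aligned.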
Two types of alignments
1. Global
   • assumes homology over the entire sequence
2. Local
   • looks for windows of similarity
• The issue: Comparative analysis of homologous nucleotide and amino-acid sequences requires an accurate alignment of sequences.
• The problem: Construction of alignments is complicated by indels (insertions and deletions). At issue is where the indels get placed in the alignment.
• Approaches to solving this problem can be manual (involving subjective decisions of the person making the alignment) or algorithmic (almost always employing a computer). Often, alignments are created by computer and then refined by the scientist.
Species 1: 3510188 CTGATCCGAGGTCAACCTTGGGTT-GTGAAGGTCGTTTTACGGCTGGAAC 3510237
                   |||||||||||||||||||||| | | |||||||||||||||||||||||
Species 2:     562 CTGATCCGAGGTCAACCTTGGGGTCGCGAAGGTCGTTTTACGGCTGGAAC 513
But decisions need to be made
Easy
Unaligned:
Seq A ---GATCGAGTTTCCC---
Seq B ---GATCGATTTCCC---
Aligned:
Seq A ---GATCGAGTTTCCC---
Seq B ---GATCGA-TTTCCC---
Less Easy (this one needs a gap plus a mismatch)
Unaligned:
Seq A ---GATCGCGTTTCCC---
Seq B ---GATCGATTTCCC---
Aligned:
Seq A ---GATCGCGTTTCCC---
Seq B ---GATCG-ATTTCCC---
OR?
Seq A ---GATCGCGTTTCCC---
Seq B ---GATCGA-TTTCCC---
Computationally-based methods use scoring procedures
For example, a simple nucleotide scoring matrix such as the following:
    A  G  T  C  -
A   1  0  0  0  0
G   0  1  0  0  0
T   0  0  1  0  0
C   0  0  0  1  0
-   0  0  0  0  0
• Nucleotide substitution scoring can be refined if it is known that certain kinds of substitutions are more common (for example, transitions are frequently more common than transversions).
• Amino-acid sequence alignments usually employ a PAM-based or BLOSUM-based scoring matrix. They can also be scored based on the minimum number of mutational steps from one amino acid to another.
• Gaps (indels) typically carry a penalty and are limited in terms of allowed length.
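As a rough illustration of why refined scoring matters, the sketch below scores the two candidate gap placements from the "Less Easy" example. It scores them first with the simple matrix and then with a transition/transversion-aware scheme. The penalty values in `refined_score` (-1, -2, -3) are hypothetical and chosen only for this example. They are not Clustal's or any other program's defaults.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def identity_score(a, b):
    """Simple matrix from the text: 1 for identical bases, 0 otherwise."""
    return 1 if a == b and a != "-" else 0


def refined_score(a, b, transition=-1, transversion=-2, gap=-3):
    """Hypothetical refinement: transitions are penalized less than transversions."""
    if a == "-" or b == "-":
        return gap
    if a == b:
        return 1
    # A transition stays within purines (A<->G) or within pyrimidines (C<->T).
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion


def score_alignment(row_a, row_b, column_score):
    """Sum the per-column scores of two equal-length aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(column_score(a, b) for a, b in zip(row_a, row_b))


seq_a = "GATCGCGTTTCCC"
option_1 = "GATCG-ATTTCCC"  # gap under C, then G/A mismatch (a transition)
option_2 = "GATCGA-TTTCCC"  # C/A mismatch (a transversion), then gap under G

for label, row_b in (("option 1", option_1), ("option 2", option_2)):
    print(label,
          score_alignment(seq_a, row_b, identity_score),
          score_alignment(seq_a, row_b, refined_score))
```

With the simple matrix both placements score 11, so it cannot choose between them. Under the assumed refined scheme, option 1 scores 7 and option 2 scores 6, because option 1 pairs G with A (a transition) rather than C with A (a transversion).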
Clustal nucleotide scoring procedures seem to be either simple (as described earlier) or IUB (1.9 for a match, 0 for a mismatch, with X and N treated as matches to any IUB ambiguity symbol).
A nice goal: consider every possible alternative alignment (and find the best)
For example, consider all alignment possibilities for two 2-nucleotide sequences (AG vs AA), assuming at least one homologous position.
AG    AG-   AG-   A-G   A-G   -AG
AA    -AA   -AA   -AA   AA-   AA-
Computationally, things get complicated quickly. Consider some possibilities for two 3-base sequences (AGC vs AAC). Note that anything goes as long as gaps are not lined up together.
AGC    -AGC   AGC-   -AGC   -AGC   AGC-
AAC    AAC-   -AAC   AA-C   A-AC   A-AC

AGC-   A-GC   A-GC   A-GC   -A-GC  -A-GC
AA-C   AAC-   -AAC   AA-C   AAC--  A-AC-

A-GC-  A-G-C  -AG-C  --AGC  AGC--
AA-C-  -AAC-  AA-C-  AAC--  --AAC
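To get a feel for how quickly the possibilities multiply, the following sketch counts every gapped alignment of two sequences in which no column pairs a gap with a gap. The function name `count_alignments` is hypothetical. The count follows a standard recurrence: the last column holds either a residue pair, a gap in the first sequence, or a gap in the second. Note that this count also includes alignments in which no two residues are paired at all.

```python
def count_alignments(m, n):
    """Number of alignments of sequences of length m and n without gap/gap columns."""
    # row[j] holds the count for the current length of the first sequence
    # against the first j residues of the second sequence.
    row = [1] * (n + 1)  # an empty first sequence can only be aligned one way
    for _ in range(m):
        new_row = [1]  # an empty second sequence can only be aligned one way
        for j in range(1, n + 1):
            # last column: gap in second seq + gap in first seq + residue pair
            new_row.append(row[j] + new_row[j - 1] + row[j - 1])
        row = new_row
    return row[n]


for length in (2, 3, 10, 300):
    total = count_alignments(length, length)
    print(length, len(str(total)), "digits")
```

For two 2-base sequences the count is 13 and for two 3-base sequences it is 63. The number grows exponentially with sequence length.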
So, the number of possibilities is greater than 2^N (N = length of sequence). This means that the number of alignment possibilities for two 300-amino-acid-long proteins is greater than 2^300, which is approximately equal to 10^90. There may be 10^80 elementary particles in the known universe. Therefore, no computer can consider every pairwise alignment possibility.
• Then how does one find the optimal alignment?
Solution: dynamic programming methods can eliminate search pathways that will not lead to identifying the optimal (best score) alignment.
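One well-known dynamic programming method for global alignment is Needleman-Wunsch. The sketch below is a minimal version with a linear gap penalty. The function name and the scores (match 1, mismatch -1, gap -2) are assumptions chosen for illustration, not the settings of any particular program. It fills a table of size (len A + 1) x (len B + 1) and traces back through it, so it never enumerates the alignments individually.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Return (best score, aligned seq_a, aligned seq_b) for a global alignment."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column/row: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Each cell keeps only the best of the three ways to reach it.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # residue pair
                              score[i - 1][j] + gap,       # gap in seq_b
                              score[i][j - 1] + gap)       # gap in seq_a
    # Traceback from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                out_a.append(seq_a[i - 1])
                out_b.append(seq_b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


best, row_a, row_b = needleman_wunsch("GATCGAGTTTCCC", "GATCGATTTCCC")
print(best)
print(row_a)
print(row_b)
```

For the "Easy" pair of sequences this recovers the single-gap alignment (GATCGA-TTTCCC) with a score of 10 under the assumed scores. The work is proportional to the product of the two sequence lengths rather than to the number of possible alignments.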
Multiple sequence alignment (Phylip format)
20 372
ThNM012b   TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
ThNM012    TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
ThNM043    TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
ThlanugQH0 TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
ThNM069    TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
ThNM070    TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
ThNM037    TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
ThNM076    TTCCGCCGGG GGGGTNGTCC CNNGGCTCGG TGTGCCCCCG GGGCCCGTGC
ThNM032    TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
ThNM075    CTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TTT--TGCCG GGGCGCGTGC
ThNM007    CTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TTT--TGCCG GGGCGCGTGC
Talthermo  CTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TTT--TGCCG GGGCGCGTGC
ThNM073    CTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TTT--TGCCG GGGCCCGTGC
ThNM002    CTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TTT--TGCCG GGGCCCGTGC
AfumHQ6310 GGCCGCCGGG GAGGC-CTTG CGC------- -----CCCC- GGGCCCGCGC
ThNM0026A  -GCCGCCGGG GAGGC-CTTG CGC------- -----CCCC- GGGCCCGCGC
ThNM025a   -GCCGCCGGG GAGGC-CTTG CGC------- -----CCCC- GGGCCCGCGC
Aspni5     -GCCGCCGGG GGGGCGCCTC TGC------- -----CCCCC GGGCCCGTGC
ThNM001    --CCGCCGGG GGGCGTGTCC CGC------- -----CCCC- GGGCCCGCGC
ThaurantT8 --CCGCCGGG GGGCGTGTCC CGC------- -----CCCC- GGGCCCGCGC
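As a small illustration of the layout, the sketch below writes already-aligned sequences in a sequential Phylip-style format. It emits a header line with the number of sequences and the alignment length, then one line per sequence with the name padded to 10 characters. The function name `to_phylip` is hypothetical. The single space after the name and the truncation of long names are assumptions of this sketch, and individual programs may be stricter or more relaxed about these details.

```python
def to_phylip(records, name_width=10):
    """Format a dict of name -> aligned sequence as sequential Phylip-style text."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    # Header: number of sequences, then number of alignment columns.
    lines = [f"{len(records)} {lengths.pop()}"]
    for name, seq in records.items():
        # Truncate or pad the name to a fixed width, then add the sequence.
        lines.append(f"{name[:name_width]:<{name_width}} {seq}")
    return "\n".join(lines) + "\n"


aligned = {
    "SeqA": "GATCGAGTTTCCC",
    "SeqB": "GATCGA-TTTCCC",
}
phylip_text = to_phylip(aligned)
print(phylip_text)
```

Because names are cut to 10 characters here, two long names that share their first 10 characters would collide. A real conversion tool would need to handle that case.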
Multiple sequence alignment: ClustalW (and others)
Steps
1. Pairwise alignments
2. UPGMA or Neighbor-Joining tree based on pairwise scores (guide tree)
3. Multiple alignment informed by guide tree
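Step 2 can be sketched with a bare-bones UPGMA clustering. This is a simplified illustration and not ClustalW's actual implementation. The function name `upgma`, the nested-tuple tree representation, and the four distances (read here as 100 minus a percent identity) are all illustrative assumptions.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a guide tree as nested tuples.

    dist maps frozenset({a, b}) to the distance between sequences a and b.
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of leaves
    d = dict(dist)
    while len(sizes) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Distance to the new cluster is the size-weighted average (UPGMA rule).
        for other in sizes:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


names = ["C1", "C8", "C2", "C4"]
dist = {
    frozenset(("C1", "C8")): 1.64,
    frozenset(("C1", "C2")): 2.01,
    frozenset(("C1", "C4")): 3.27,
    frozenset(("C8", "C2")): 4.67,
    frozenset(("C8", "C4")): 4.07,
    frozenset(("C2", "C4")): 1.22,
}
guide_tree = upgma(names, dist)
print(guide_tree)
```

With these example distances, C2 and C4 are joined first, then C1 and C8, and the two pairs are joined last. A progressive aligner would then align sequences and sub-alignments in that order (step 3).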
There is a problem with large alignments, which are common in the genomics age
The number of pairwise comparisons for a given set of N sequences is N(N-1)/2
Example: 4(3)/2 = 6
1: C1_3006  100.00   98.36   97.99   96.73
2: C8_9737          100.00   95.33   95.93
3: C2_4297                  100.00   98.78
4: C4_3894                          100.00
What if there are 100,000 sequences? 100,000(99,999)/2 ≈ 5 billion (many days of computer time)
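The arithmetic is easy to check directly. The short sketch below lists the six pairs for the four example sequences and evaluates N(N-1)/2 for larger sets. The helper name `pair_count` is hypothetical.

```python
from itertools import combinations


def pair_count(n):
    """Number of distinct unordered pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


names = ["C1_3006", "C8_9737", "C2_4297", "C4_3894"]
pairs = list(combinations(names, 2))  # every pairwise comparison, listed once
print(len(pairs), pairs)

for n in (4, 1_000, 100_000):
    print(n, pair_count(n))
```

For 100,000 sequences the exact count is 4,999,950,000 pairwise comparisons, which matches the "≈ 5 billion" figure.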
Multiple sequence alignment: Clustal Omega (and others)
Steps
1. Instead of a matrix of all pairwise comparisons, start with a set of "reference" sequences that are aligned to the remaining sequences
2. UPGMA or Neighbor-Joining tree based on pairwise scores (guide tree)
3. Multiple alignment informed by guide tree
Some useful resources
• Clustal Omega analysis: European Bioinformatics Institute (https://www.ebi.ac.uk/Tools/msa/clustalo)
• ClustalW analysis: UNM CETI Galaxy site (http://emil.unm.edu/galaxy)
• ClustalX download (http://www.clustal.org/clustal2)