Sequence Alignments
Some key terms and concepts:
• Indel (insertion and deletion)
• Pairwise sequence alignment (e.g. BLAST) versus multiple sequence alignment (e.g. Clustal)
• Scoring models and matrices (applies to both amino-acid and nucleotide sequences)
• Alignment formats (FASTA, Clustal, Phylip, Nexus)
• Pairwise clustering (e.g. UPGMA)
• Format conversion tools
• Interleaved vs sequential format

Sequence Alignments
• Two goals:
• 1. Similarity searches to identify homologs.
   • FASTA, BLAST
• 2. Creation of multiple sequence alignments for comparative analysis (phylogenetics, structure-function, etc.).
   • Clustal and others

Pairwise alignments
Optimize the pairwise alignment based on some scoring scheme
• Some famous methods
   • FASTA (Pearson and Lipman)
   • BLAST
   • Smith-Waterman

Origin of FASTA

From W.R. Pearson, 1994

FASTA format
Sequence name preceded by “>” and followed by a hard return; sequence followed by a hard return

>SeqA
GATCGCGTTTCCC
>SeqB
GATCGATTTCCC
>SeqC
GATCGGATTTCCC
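The format rule above (a ">" name line, then sequence lines) can be sketched as a small reader. This is a minimal illustration, not a full parser: the function name `parse_fasta` is hypothetical, and it assumes that the name is the first word after ">" and that a sequence may be wrapped over several lines.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping name -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""  # first word after ">" is the name
            records[name] = ""
        elif name is not None:
            records[name] += line  # a sequence may span several lines
    return records


example = """>SeqA
GATCGCGTTTCCC
>SeqB
GATCGATTTCCC
>SeqC
GATCGGATTTCCC
"""

records = parse_fasta(example)
for seq_name, seq in records.items():
    print(seq_name, len(seq), seq)
```

The three sequences differ in length (13, 12 and 13 bases), which is why an alignment step is needed before they can be compared column by column.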

Two types of alignments
• 1. Global
   • assume homology over entire sequence
• 2. Local
   • look for windows of similarity

• The issue: Comparative analysis of homologous nucleotide and amino-acid sequences requires an accurate alignment of sequences.
• The problem: Construction of alignments is complicated by indels (insertions and deletions). At issue is where the indels get placed in the alignment.
• Approaches to solving this problem can be manual (involving subjective decisions of the person making the alignment) or algorithmic (almost always employing a computer). Often, alignments are created by computer and then refined by the scientist.

Species 1: 3510188 CTGATCCGAGGTCAACCTTGGGTT-GTGAAGGTCGTTTTACGGCTGGAAC 3510237
                   |||||||||||||||||||||| | | |||||||||||||||||||||||
Species 2:     562 CTGATCCGAGGTCAACCTTGGGGTCGCGAAGGTCGTTTTACGGCTGGAAC 513

But decisions need to be made
Easy

Seq A ---GATCGAGTTTCCC---
Seq B ---GATCGATTTCCC---

Seq A ---GATCGAGTTTCCC---
Seq B ---GATCGA-TTTCCC---

Less Easy (this one needs gap plus mismatch)

Seq A ---GATCGCGTTTCCC---
Seq B ---GATCGATTTCCC---

Seq A ---GATCGCGTTTCCC---
Seq B ---GATCG-ATTTCCC---

OR?

Seq A ---GATCGCGTTTCCC---
Seq B ---GATCGA-TTTCCC---

Computationally-based methods use scoring procedures
For example, a simple nucleotide scoring matrix such as the following:

     A  G  T  C  -
A    1  0  0  0  0
G    0  1  0  0  0
T    0  0  1  0  0
C    0  0  0  1  0
-    0  0  0  0  0

• Nucleotide substitution scoring can be refined if it is known that certain kinds of substitutions are more common (for example, transitions are frequently more common than transversions).
• Amino-acid sequence alignments usually employ a PAM-based or BLOSUM-based scoring matrix. They can also be scored based on the minimum number of mutational steps from one amino acid to another.
• Gaps (indels) typically carry a penalty and are limited in terms of allowed length.

Scoring is model dependent
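To make the model dependence concrete, the sketch below scores the two "less easy" candidate alignments under two schemes: the simple identity matrix shown earlier, and a hypothetical refinement that rewards transitions over transversions and penalizes gaps. The specific values (0.5 for a transition, 0 for a transversion, -1 per gap) and the function names are assumptions chosen for illustration only, not values from any particular program.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def identity_score(a, b):
    """Simple matrix: 1 for identical bases, 0 for everything else (gaps included)."""
    return 1 if a == b and a != "-" else 0


def ts_tv_score(a, b, transition=0.5, transversion=0.0, gap=-1.0):
    """Hypothetical refinement: transitions score higher than transversions."""
    if a == "-" or b == "-":
        return gap
    if a == b:
        return 1.0
    # A transition stays within purines (A<->G) or within pyrimidines (C<->T).
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion


def score_alignment(row_a, row_b, score):
    """Sum the column scores of two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(score(a, b) for a, b in zip(row_a, row_b))


seq_a = "GATCGCGTTTCCC"
option_1 = "GATCG-ATTTCCC"  # gap opposite C, then G/A mismatch (a transition)
option_2 = "GATCGA-TTTCCC"  # C/A mismatch (a transversion), then gap opposite G

id_1 = score_alignment(seq_a, option_1, identity_score)
id_2 = score_alignment(seq_a, option_2, identity_score)
tt_1 = score_alignment(seq_a, option_1, ts_tv_score)
tt_2 = score_alignment(seq_a, option_2, ts_tv_score)
print("identity matrix:", id_1, id_2)
print("transition/transversion model:", tt_1, tt_2)
```

Under the identity matrix the two candidates tie, so the scheme cannot choose between them. Under the assumed transition/transversion model the first candidate scores higher because its mismatch is a purine-to-purine change.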


Clustal's nucleotide scoring options appear to be either simple (as described earlier) or IUB (1.9 for a match, 0 for a mismatch; X and N are treated as matches to any IUB ambiguity symbol).


A nice goal: Consider every possible alternative alignment (and find the best)
• For example, consider all alignment possibilities for two 2-nucleotide sequences (AG vs AA), assuming at least one homologous position.

AG    AG-   AG-   A-G   A-G   -AG
AA    -AA   -AA   -AA   AA-   AA-

Computationally, things get complicated quickly. Consider some possibilities for two 3-base sequences (AGC vs AAC). Note that anything goes as long as gaps are not lined up together.

AGC     -AGC    AGC-    -AGC    -AGC    AGC-
AAC     AAC-    -AAC    AA-C    A-AC    A-AC

AGC-    A-GC    A-GC    A-GC    -A-GC   -A-GC
AA-C    AAC-    -AAC    AA-C    AAC--   A-AC-

A-GC-   A-G-C   -AG-C   --AGC   AGC--
AA-C-   -AAC-   AA-C-   AAC--   --AAC
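How quickly the possibilities grow can be sketched with a short count. The function below (the name `count_alignments` is hypothetical) counts every way to place gaps in two sequences of lengths m and n such that no column pairs a gap with a gap. It assumes nothing else about the sequences, so it also counts arrangements with no aligned pair at all and is therefore an upper bound on hand enumerations that require at least one homologous position.

```python
def count_alignments(m, n):
    """Count gapped alignments of two sequences of lengths m and n.

    Each alignment ends in one of three column types: residue over residue,
    residue over gap, or gap over residue. Summing the three cases gives the
    recurrence used below.
    """
    table = [[1] * (n + 1) for _ in range(m + 1)]  # one way when either length is 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = (
                table[i - 1][j - 1]  # last column pairs two residues
                + table[i - 1][j]    # last column: residue over gap
                + table[i][j - 1]    # last column: gap over residue
            )
    return table[m][n]


for length in (2, 3, 10):
    print(length, count_alignments(length, length), 2 ** length)
```

For every length tried, the count exceeds 2 raised to that length, and the gap widens rapidly as the sequences get longer.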

So, the number of possibilities is greater than 2^N (N = length of sequence). This means that the number of alignment possibilities for two 300-amino-acid-long proteins is greater than 2^300, which is approximately equal to 10^90. There may be 10^80 elementary particles in the known universe. Therefore, no computer can consider every pairwise alignment possibility.
• Then how does one find the optimal alignment? Solution: dynamic programming methods can eliminate search pathways that will not lead to identifying the optimal (best score) alignment.
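One classic dynamic programming method for global pairwise alignment is the Needleman-Wunsch algorithm, sketched below. It is offered as an illustration of the idea, not as the method used by any specific program mentioned here. The scores (match = 1, mismatch = -1, gap = -2) are assumed values, and the linear gap penalty is a simplification.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming.

    Returns (best_score, aligned_a, aligned_b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,             # a[i-1] opposite a gap
                score[i][j - 1] + gap,             # b[j-1] opposite a gap
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


easy_score, easy_a, easy_b = needleman_wunsch("GATCGAGTTTCCC", "GATCGATTTCCC")
hard_score, hard_a, hard_b = needleman_wunsch("GATCGCGTTTCCC", "GATCGATTTCCC")
print(easy_score, easy_a, easy_b, sep="\n")
print(hard_score, hard_a, hard_b, sep="\n")
```

The table has (len(a) + 1) × (len(b) + 1) cells, so two 300-residue proteins need about 90,000 cell updates instead of an enumeration of more than 2^300 alignments. When several alignments tie for the best score, as in the "less easy" case, the traceback reports only one of them.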

Multiple Sequence Alignments

Multiple sequence alignment (Clustal format interleaved)

Multiple sequence alignment (Phylip format)

20 372
ThNM012b   TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
ThNM012    TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
ThNM043    TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
ThlanugQH0 TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
ThNM069    TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
ThNM070    TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
ThNM037    TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
ThNM076    TTCCGCCGGG GGGGTNGTCC CNNGGCTCGG TGTGCCCCCG GGGCCCGTGC
ThNM032    TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
ThNM075    CTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TTT--TGCCG GGGCGCGTGC
ThNM007    CTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TTT--TGCCG GGGCGCGTGC
Talthermo  CTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TTT--TGCCG GGGCGCGTGC
ThNM073    CTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TTT--TGCCG GGGCCCGTGC
ThNM002    CTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TTT--TGCCG GGGCCCGTGC
AfumHQ6310 GGCCGCCGGG GAGGC-CTTG CGC------- -----CCCC- GGGCCCGCGC
ThNM0026A  -GCCGCCGGG GAGGC-CTTG CGC------- -----CCCC- GGGCCCGCGC
ThNM025a   -GCCGCCGGG GAGGC-CTTG CGC------- -----CCCC- GGGCCCGCGC
Aspni5     -GCCGCCGGG GGGGCGCCTC TGC------- -----CCCCC GGGCCCGTGC
ThNM001    --CCGCCGGG GGGCGTGTCC CGC------- -----CCCC- GGGCCCGCGC
ThaurantT8 --CCGCCGGG GGGCGTGTCC CGC------- -----CCCC- GGGCCCGCGC

Multiple sequence alignment (Phylip format interleaved)

Multiple sequence alignment (Phylip format sequential)
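As a rough sketch of what a format conversion tool does, the function below writes an alignment in sequential Phylip layout: a header line with the number of sequences and the alignment length, then one line per sequence. It assumes the strict variant in which the name occupies exactly 10 columns (longer names are truncated); relaxed variants exist and are not handled. The function name and the two-sequence example are hypothetical.

```python
def to_phylip_sequential(alignment):
    """Format a dict of name -> aligned sequence as sequential Phylip text."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_chars = lengths.pop()
    lines = [f" {len(alignment)} {n_chars}"]  # header: sequence count, alignment length
    for name, seq in alignment.items():
        # Strict Phylip: the name is padded or truncated to exactly 10 columns,
        # and the whole sequence follows on the same line.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"


aligned = {
    "SeqA": "GATCGCGTTTCCC",
    "SeqB": "GATCG-ATTTCCC",
}
phylip_text = to_phylip_sequential(aligned)
print(phylip_text)
```

In the interleaved layout the same rows would instead be written in blocks, with the first block carrying the names and later blocks carrying only the continuing sequence.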

Multiple sequence alignment (Nexus format interleaved)

Multiple sequence alignment
ClustalW (and others)
Steps
1. Pairwise alignments
2. UPGMA or Neighbor-Joining tree based on pairwise scores (guide tree)
3. Multiple alignment informed by guide tree

There is a problem with large alignments, which are common in the genomics age
The number of pairwise comparisons for a given set of sequences (N) is N(N-1)/2
Example: 4(3)/2 = 6

1: C1_3006   100.00   98.36   97.99   96.73
2: C8_9737           100.00   95.33   95.93
3: C2_4297                   100.00   98.78
4: C4_3894                           100.00

What if 100,000 sequences? 100,000(99,999)/2 ≈ 5 billion (many days of computer time)
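The pair count and the guide-tree step can be sketched together. The example below assumes the four-sequence table above is an upper-triangular percent-identity matrix, converts identity to distance as 100 minus identity (an assumption made for illustration), and reports the order in which UPGMA would join clusters. The function names are hypothetical, and branch lengths are omitted.

```python
from itertools import combinations


def n_pairs(n):
    """Number of pairwise comparisons among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


names = ["C1_3006", "C8_9737", "C2_4297", "C4_3894"]
identity = {  # percent identity, read from the upper triangle of the table
    ("C1_3006", "C8_9737"): 98.36,
    ("C1_3006", "C2_4297"): 97.99,
    ("C1_3006", "C4_3894"): 96.73,
    ("C8_9737", "C2_4297"): 95.33,
    ("C8_9737", "C4_3894"): 95.93,
    ("C2_4297", "C4_3894"): 98.78,
}


def upgma_merge_order(names, identity):
    """Return the sequence of cluster merges chosen by UPGMA."""
    dist = {frozenset(pair): 100.0 - pct for pair, pct in identity.items()}
    clusters = [(name,) for name in names]
    merges = []

    def average_distance(c1, c2):
        # UPGMA: mean of all member-to-member distances between two clusters.
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average_distance(*p))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)  # the merged cluster replaces its two parts
        merges.append((c1, c2))
    return merges


print(n_pairs(4), n_pairs(100_000))
merges = upgma_merge_order(names, identity)
for left, right in merges:
    print(left, "+", right)
```

The most similar pair is joined first, and each later step joins the two clusters with the smallest average distance. A progressive aligner would then align sequences and groups in that same order.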

Multiple sequence alignment
Clustal Omega and others
Steps
1. Instead of a matrix of all pairwise comparisons, start with a set of “reference” sequences that are aligned to the remaining sequences
2. UPGMA or Neighbor-Joining tree based on pairwise scores (guide tree)
3. Multiple alignment informed by guide tree

Some useful resources
• Clustal Omega analysis: European Bioinformatics Institute (https://www.ebi.ac.uk/Tools/msa/clustalo)
• ClustalW analysis: UNM CETI Galaxy site (http://emil.unm.edu/galaxy)
• ClustalX download (http://www.clustal.org/clustal2)
