A Brief Introduction to Biological Sequence Alignment Sun Kim CSE SNU For Bio Data Mining 4541.776.002 Sep 2011 Bio & Health Informatics Lab, SNU
Aligning a Pair of Sequences
• Problem: given a pair of sequences, find the best alignment among all possible alignments.
• Computing the best alignment requires:
  • the type of alignment
  • a scoring scheme
    • a scoring matrix
    • a gap penalty scheme
• Two types of alignment problems:
  • Local sequence alignment
  • Global sequence alignment
Glutamate-ammonia ligase related sequences

Query sequence 1
>A8XYH6 A8XYH6_CAEBR CBR-GLN-2 protein [Caenorhabditis briggsae]
MTHLNFETRMPLGQAVIDQFLGLRPHPTKIQATYVWIDGTGENLRSKTRTFDRLPKKIED
YPIWNYDGSSTGQAKGRDSDRYLRPVAAYPDPFLGGANKLVMCDTLDHEMQPTATNHRQA
CAEIMNEIRDTRPWFGMEQEYLIVDRDEHPLGWPKHGFPAPQGKYYCSVGADRAFGREVV
ETHYRACLHAGLNIFGTNAEVTPGQWEFQIGTCEGIDMGDQLWMSRYILHRVAEQFGVCV
SLDPKPKVTMGDWNGAGCHTNFSTAEMRAPGGIAAIEAAMEGLKRTHLEAMKVYDPHGGE
DNLRRLTGRHETSSADKFSWGVANRGCSIRIPRQVAAERKGYLEDRRPSSNCDPYQVTAM
IAQSILL

Query sequence 2
>O02225 O02225_CAEEL Protein C28D4.3, confirmed by transcript evidence [Caenorhabditis elegans]
MSHLNYETRLPLGQATIDHFMGLPAHPTKCQATYVWIDGTGEHLRAKTRTINTKPQYLSE
YPIWNYDGSSTGQADGLNSDRYLRPVAVFPDPFLGGLNVLVMCDTLDHEMKPTATNHRQM
CAELMKKVSDQQPWFGMEQEYLIVDRDEHPLGWPKHGYPAPQGKYYCGIGADRAFGREVV
ETHYRACLHAGITIFGSNAEVTPGQWEFQIGTCLGIEMGDQLWMARYILHRVAEQFGVCV
SLDPKPRVTMGDWNGAGCHTNFSTIDMRRPDGLETIIAAMEGLKKTHSEAMKVYDPNGGH
DNLRRLTGRHETSQADQFSWGIANRACSVRIPRQVADEGRGYLEDRRPSSNCDPYLVTAM
IVKSVLIN
A Pairwise Alignment of the Two Sequences
Scoring Matrix: BLOSUM62
Computing a Score for a Pairwise Alignment of the Two Sequences
• Add up the scoring-matrix entries for each aligned pair of residues: S(M,M) + S(T,S) + S(H,H) + …
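The column-by-column sum can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the names `BLOSUM62_EXCERPT`, `pair_score`, and `ungapped_alignment_score` are hypothetical, and the dictionary holds only the three BLOSUM62 entries needed for the first three aligned residues of the two query sequences (MTH vs. MSH).

```python
# Excerpt of BLOSUM62: only the three residue pairs used in this example.
# A real program would load the full 20x20 matrix.
BLOSUM62_EXCERPT = {
    ("M", "M"): 5,
    ("S", "T"): 1,
    ("H", "H"): 8,
}


def pair_score(a, b, matrix):
    """Look up S(a, b); the matrix is symmetric, so try both key orders."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def ungapped_alignment_score(row1, row2, matrix):
    """Sum the matrix scores over the aligned columns (no gaps handled here)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(row1, row2))


score = ungapped_alignment_score("MTH", "MSH", BLOSUM62_EXCERPT)
print(score)  # S(M,M) + S(T,S) + S(H,H) = 5 + 1 + 8 = 14
```

Columns that contain a gap are not covered by the scoring matrix; they are charged through the gap penalty scheme instead.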
Gap Penalty and Scoring Matrix
• Gap penalty: http://en.wikipedia.org/wiki/Gap_penalty
• Scoring matrices: http://www.brc.dcs.gla.ac.uk/~drg/courses/bioinformaticsHM/slides/scoring_matrices.pdf
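Two common gap penalty schemes, linear and affine, can be compared with a short sketch. The function names and the default penalty values below are assumptions chosen for illustration, and conventions for the affine form vary between tools (here the opening cost already covers the first gap position).

```python
import re


def gap_lengths(aligned_row):
    """Return the length of every run of '-' in one row of an alignment."""
    return [len(m.group()) for m in re.finditer(r"-+", aligned_row)]


def linear_gap_penalty(length, per_position=4):
    """Linear scheme: every gap position costs the same."""
    return per_position * length


def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Affine scheme: opening a gap is expensive, extending it is cheap."""
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)


row = "AC---GT-A"          # hypothetical aligned row with gaps of length 3 and 1
runs = gap_lengths(row)    # [3, 1]
print(sum(linear_gap_penalty(n) for n in runs))   # 4*3 + 4*1 = 16
print(sum(affine_gap_penalty(n) for n in runs))   # (10 + 2) + 10 = 22
```

The affine scheme favors a few long gaps over many short ones, which is why it is often preferred for biological sequences.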
Computing the Best Alignment
• Until now, we have assumed that an alignment is “given” and computed its score.
• The pairwise sequence alignment problem is to compute “the best alignment” among all possible alignments.
  • Alignment 1: score 1
  • Alignment 2: score 2
  • …
  • Then select Alignment k, whose score is the best among all.
• However, there are too many alignments to enumerate: their number grows exponentially with the sequence lengths.
• Fortunately, dynamic programming finds the best alignment in quadratic time and space, i.e., O(mn) for sequences of lengths m and n.
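To get a feeling for how many alignments there are, the sketch below counts them. It assumes one particular counting convention that the slides do not state: an alignment is a sequence of columns, each column is a residue pair or a residue against a gap, and no column consists of two gaps. Under that convention the count follows a three-term recurrence (the Delannoy numbers). The function name `count_alignments` is hypothetical.

```python
def count_alignments(m, n):
    """Count global alignments of sequences with lengths m and n.

    The last column is either (residue, residue), (residue, gap) or
    (gap, residue), which gives
        C(m, n) = C(m-1, n-1) + C(m-1, n) + C(m, n-1),  C(m, 0) = C(0, n) = 1.
    """
    prev = [1] * (n + 1)                 # row for m = 0
    for _ in range(m):
        cur = [1]                        # C(i, 0) = 1
        for j in range(1, n + 1):
            cur.append(prev[j] + cur[j - 1] + prev[j - 1])
        prev = cur
    return prev[n]


print(count_alignments(2, 2))  # 13
print(count_alignments(3, 3))  # 63
for length in (10, 50, 100):
    # The number of digits shows how quickly exhaustive enumeration blows up.
    print(length, len(str(count_alignments(length, length))), "digits")
```

The counting function is itself a small dynamic program over the same (m+1) x (n+1) grid that the alignment algorithms use.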
Levenshtein Distance (Edit Distance)
• http://en.wikipedia.org/wiki/Levenshtein_distance
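The edit distance is the simplest instance of the dynamic programming idea used for alignment. The following is a minimal sketch of the standard row-by-row recurrence with unit costs for insertion, deletion, and substitution; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))       # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        cur = [i]                        # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows are kept in memory, which is enough when the distance alone is needed; recovering the actual edit script requires the full table.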
Global Alignment Algorithm
• Needleman-Wunsch algorithm
• http://en.wikipedia.org/wiki/Needleman-Wunsch_algorithm
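A compact sketch of global alignment by dynamic programming follows. It is a simplified illustration under stated assumptions, not a reference implementation: it uses a match/mismatch score and a linear gap penalty (hypothetical defaults +1, -1, -1) instead of BLOSUM62 with affine gaps, and it returns one optimal alignment even when several exist.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    m, n = len(a), len(b)
    # F[i][j] = best score for aligning a[:i] with b[:j]
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,     # a[i-1] against a gap
                          F[i][j - 1] + gap)     # b[j-1] against a gap

    # Trace back from the bottom-right corner to recover the alignment.
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                row_a.append(a[i - 1])
                row_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return F[m][n], "".join(reversed(row_a)), "".join(reversed(row_b))


score, top, bottom = needleman_wunsch("GCATGCG", "GATTACA")
print(score)
print(top)
print(bottom)
```

Both the table fill and the table itself are O(mn), which is the quadratic time and space cost of the dynamic programming approach.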
Local Alignment Algorithm
• Smith–Waterman algorithm
• http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm
• http://docencia.ac.upc.edu/master/AMPP/slides/ampp_sw_presentation.pdf
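Local alignment changes the global recurrence in two places: cell values are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The sketch below illustrates this with hypothetical scoring defaults (match +2, mismatch -1, linear gap -1); it is a simplified illustration rather than a tuned implementation.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                      # start a fresh local alignment
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


print(smith_waterman("AAAACGTTTT", "GGGACGCCCC"))  # (6, 'ACG', 'ACG')
```

In the usage line, the only region the two toy sequences share is ACG, so the local alignment reports just that segment and ignores the unrelated flanks.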
BLAST
• http://en.wikipedia.org/wiki/BLAST
FASTA
• http://en.wikipedia.org/wiki/FASTA
Statistical Evaluation of Search Results
• Alignment algorithms look for the ‘optimal’ alignment, i.e., the best one under a given scoring scheme. Even though the scoring scheme incorporates domain knowledge, there is no guarantee that this human-defined optimum is biologically meaningful.
• Thus the final step is to estimate how likely it is that a score this good would arise by chance.
• The definition of the random model is very important; in many cases, how to define random models (negative models) is itself an important research topic.
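One simple random model is to shuffle one of the sequences, which keeps its length and composition but destroys the residue order, and to ask how often a shuffled sequence scores at least as well as the real one. The sketch below illustrates that idea only; it is not how BLAST or FASTA compute their statistics. The names `local_score` and `empirical_p_value`, the scoring defaults, and the number of shuffles are assumptions made for this example.

```python
import random


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Score-only local alignment (Smith-Waterman style, linear gaps)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


def empirical_p_value(query, target, score_fn, n_shuffles=200, seed=0):
    """Estimate P(score >= observed) under a shuffled-target random model."""
    rng = random.Random(seed)            # fixed seed for reproducibility
    observed = score_fn(query, target)
    letters = list(target)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        if score_fn(query, "".join(letters)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_shuffles + 1)


# First 20 residues of the two query sequences shown earlier.
q = "MTHLNFETRMPLGQAVIDQF"
t = "MSHLNYETRLPLGQATIDHF"
observed, p = empirical_p_value(q, t, local_score)
print(observed, p)
```

A different random model (for example, one that preserves dipeptide frequencies) would give a different p-value for the same alignment, which is why the choice of negative model matters.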
Multiple Sequence Alignment
• Aligning multiple sequences is important for many applications in bioinformatics.
• Computing the optimal multiple sequence alignment is still an open problem.
  • Defining the optimality criteria (which scoring scheme? which gap penalty?).
  • Computational complexity.
Local vs. Global Multiple Sequence Alignment
• As with pairwise sequence alignment, there are two types of alignment problems: local and global.
• Since there are many sequences, another factor needs to be considered.
  • Should the alignment cover the whole input set or only a subset of the sequences?
Scoring Scheme for the Multiple Sequence Alignment
• Sum of pairs.
  • A natural choice, since any scoring matrix, e.g., BLOSUM62, gives a score for only a pair of amino acid or nucleotide characters.
• Information-theoretic scoring scheme.
  • A nice way to consider multiple characters together, but it is hard to utilize the domain knowledge (well-established scoring matrices, e.g., BLOSUM62).
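Both scoring ideas can be sketched on alignment columns. In the example below, the sum-of-pairs score adds a pairwise score for every pair of rows in each column, and the information-theoretic score is represented by the Shannon entropy of a column (lower entropy means a better conserved column). The toy `identity_score`, the gap handling (gap against residue costs -4, gap against gap costs 0), and all function names are assumptions for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def identity_score(a, b):
    """Toy pairwise score standing in for a real matrix such as BLOSUM62."""
    return 1 if a == b else -1


def sum_of_pairs(alignment, pair_score, gap_score=-4):
    """Sum pairwise scores over all row pairs in every column."""
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue                 # gap against gap: no contribution
            if a == "-" or b == "-":
                total += gap_score
            else:
                total += pair_score(a, b)
    return total


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one column, ignoring gaps."""
    counts = Counter(c for c in column if c != "-")
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


msa = ["ACG", "ACG", "A-G"]              # hypothetical three-row alignment
print(sum_of_pairs(msa, identity_score))  # 3 + (1 - 4 - 4) + 3 = -1
print(column_entropy("AACC"))             # 1.0
```

The entropy looks at all characters in a column at once but has no place to plug in a substitution matrix, which is the trade-off the slide describes.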
Global Multiple Sequence Alignment
• Progressive alignment.
• Pattern (k-mer)-based strategy.
• Computing the optimal alignment.
Local Multiple Sequence Alignment
• This is also known as the motif discovery problem.
• Many machine learning techniques are used: Gibbs sampling, Expectation-Maximization, information theory.
• It will be covered in a separate lecture.
List of Multiple Sequence Alignment Methods
• http://en.wikipedia.org/wiki/Multiple_sequence_alignment
1 Dynamic programming and computational complexity
2 Progressive alignment construction
3 Iterative methods
4 Hidden Markov models
5 Genetic algorithms and simulated annealing
6 Motif finding
7 Visualization and editing tools
ClustalW
• One of the most widely used “progressive alignment” algorithms.
• Starts by computing alignments of all possible pairs of input sequences.
• Builds a guide tree from the resulting pairwise distances (neighbor-joining by default; UPGMA is available as an option in ClustalW2).
• Following the guide tree, it constructs the multiple sequence alignment in a “greedy” fashion.
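The guide-tree step can be illustrated with UPGMA, which repeatedly merges the two closest clusters and averages their distances to everything else. This is a sketch of textbook UPGMA only, not ClustalW's code: the function name `upgma`, the nested-tuple tree representation, and the distance values are hypothetical, and branch lengths are omitted.

```python
def upgma(names, dist):
    """Build a guide tree (nested tuples) from a symmetric distance matrix."""
    if not names:
        raise ValueError("need at least one sequence")
    # Active clusters: id -> (subtree, number of leaves)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Pairwise distances between active clusters, keyed by (smaller id, larger id)
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(d, key=d.get)                  # closest pair of clusters
        tree_i, size_i = clusters.pop(i)
        tree_j, size_j = clusters.pop(j)
        del d[(i, j)]
        for k in clusters:
            dik = d.pop((min(i, k), max(i, k)))
            djk = d.pop((min(j, k), max(j, k)))
            # Size-weighted average = average over all leaf pairs
            d[(k, next_id)] = (size_i * dik + size_j * djk) / (size_i + size_j)
        clusters[next_id] = ((tree_i, tree_j), size_i + size_j)
        next_id += 1
    tree, _size = next(iter(clusters.values()))
    return tree


names = ["A", "B", "C", "D"]
dist = [[0, 2, 10, 10],
        [2, 0, 10, 10],
        [10, 10, 0, 4],
        [10, 10, 4, 0]]
print(upgma(names, dist))  # (('A', 'B'), ('C', 'D'))
```

A progressive aligner would then walk this tree from the leaves upward, aligning A with B, C with D, and finally the two resulting sub-alignments with each other.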
MUSCLE
• MUSCLE (multiple sequence comparison by log-expectation), Nucleic Acids Research, 2004, Vol. 32, No. 5
• An iterative progressive alignment algorithm that uses k-mer counting to estimate distances between sequences quickly, without aligning them first.
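The idea of a k-mer based distance can be sketched as follows: count the k-mers of each sequence, measure how many they share, and convert that fraction into a distance. This is a simplified stand-in for illustration and should not be read as MUSCLE's exact measure; the function names and the default k = 3 are assumptions.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=3):
    """1 minus the fraction of k-mers shared, relative to the shorter sequence."""
    shortest = min(len(a), len(b)) - k + 1
    if shortest <= 0:
        raise ValueError("both sequences must have at least k residues")
    # Counter intersection keeps the minimum count of each shared k-mer.
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return 1.0 - shared / shortest


print(kmer_distance("ACGT", "ACGA"))        # 0.5: ACG is shared, CGT/CGA are not
print(kmer_distance("ACGTACGT", "ACGTACGT"))  # 0.0
```

Because no alignment is computed, the cost is linear in the sequence lengths, which makes such a measure attractive for building a first guide tree over many sequences.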