In this lecture, Dr. Emad Nabil explains the concepts and importance of multiple sequence alignment (MSA) in computational biology and bioinformatics. He discusses scoring functions, algorithms for MSA, and the tasks involved in creating an alignment.
Computational Biology and Bioinformatics: Multiple Sequence Alignment
Lecture #6, by Dr. Emad Nabil, Fall 2018, FCI-CU
At the end of this lecture you will be able to develop a program that takes the input below and produces the output below.
Input:
>Rosalind_18
GACATGTTTGTTTGCCTTAAACTCGTGGCGGCCTAGCCGTAAGTTAAG
>Rosalind_23
ACTCATGTTTGTTTGCCTTAAACTCTTGGCGGCTTAGCCGTAACTTAAG
>Rosalind_51
TCCTATGTTTGTTTGCCTCAAACTCTTGGCGGCCTAGCCGTAAGGTAAG
>Rosalind_7
CACGTCTGTTCGCCTAAAACTTTGATTGCCGGCCTACGCTAGTTAGTTA
>Rosalind_28
GGGGTCATGGCTGTTTGCCTTAAACCCTTGGCGGCCTAGCCGTAATGTTT
Output: phylogenetic tree (http://www.ebi.ac.uk/goldman-srv/webprank/)
More MSA tools: http://www.ebi.ac.uk/Tools/msa/
Agenda
• What is MSA?
• What is its importance?
• Scoring function:
  • Entropy based
  • Sum of pairs
• How to align many sequences? Algorithms
  • Progressive alignment
    • Star
      • Dependent upon a center
      • Keep adding all pairs of aligned sequences with the current alignment
    • Tree
      • Create an approximate guide tree
      • Use tree to align the sequences
  • Iterative alignment
    • Don’t commit to the fixed ordering, revisit the alignment until score does not change
What is Multiple Sequence Alignment?
• A multiple sequence alignment (MSA) is a sequence alignment of three or more biological sequences, generally
  • protein
  • DNA
  • RNA
Why is MSA important?
• Build phylogenetic trees and determine evolutionary relationships between sequences
• Represent a family of proteins with similar function, so that a new sequence can be compared to a “family” of known proteins
• Discover common signatures or protein domains among a group of proteins
• Identify genetic variation among individuals of a population
Why is MSA important?
• A low (and statistically insignificant) similarity between two sequences becomes significant if it is present in many other sequences.
• Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal.
• What is the most similar set of DNAs from the above group?
Alignment of Three A-domains
A success story of MSA: identification of the non-ribosomal code
YAFDLGYTCMFPVLLGGGELHIVQKETYTAPDEIAHYIKEHGITYIKLTPSLFHTIVNTASFAFDANFESLRLIVLGGEKIIPIDVIAFRKMYGHTE-FINHYGPTEATIGA
-AFDVSAGDFARALLTGGQLIVCPNEVKMDPASLYAIIKKYDITIFEATPALVIPLMEYI-YEQKLDISQLQILIVGSDSCSMEDFKTLVSRFGSTIRIVNSYGVTEACIDS
IAFDASSWEIYAPLLNGGTVVCIDYYTTIDIKALEAVFKQHHIRGAMLPPALLKQCLVSA----PTMISSLEILFAAGDRLSSQDAILARRAVGSGV-Y-NAYGPTENTVLS
The tasks in Multiple Sequence Alignment
• Algorithms for creating an alignment
• Scoring an alignment
Generalizing Pairwise to Multiple Alignment
• Alignment of 2 sequences is a 2-row matrix.
• Alignment of 3 sequences is a 3-row matrix:
A T - G C G -
A - C G T - A
A T C A C - A
• Our scoring function should score alignments with conserved columns higher.
Analogy
• Think of the k=2 case: every alignment is a path through a 2D matrix
• The three possible directions (down-right, down, right) correspond to the three possible column types (XX, X_, _X)
• As the path grows, we align growing prefixes of both sequences
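The path view of the k=2 case can be sketched in code. This is a minimal illustration, not code from the lecture: the function name `align_pair` and the scores (match +1, mismatch -1, gap -1) are assumptions chosen only to make the example concrete.

```python
def align_pair(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment as a path through a 2D matrix.

    The scoring values are illustrative assumptions, not from the lecture.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning prefix a[:i] with prefix b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # down-right: column XX
                score[i - 1][j] + gap,            # down:       column X_
                score[i][j - 1] + gap,            # right:      column _X
            )

    # Walk the path backwards from the bottom-right corner.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = align_pair("ACGT", "AGT")
print(best)
print(top)
print(bottom)
```

Each step of the traceback emits exactly one alignment column, so the length of the path equals the number of columns in the alignment.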
Multiple Alignment: Dynamic Programming
• Assume k=3. Think of a 3-dimensional cube with one sequence along each dimension
• Now, we have paths aligning growing prefixes of three sequences
• Every column has seven possible alternatives (XXX, XX_, X_X, _XX, X_ _, _X_, _ _X)
• Dynamic programming in 2D: (x, y) is an entry in the 2-D scoring matrix.
• Dynamic programming in 3D: (x, y, z) is an entry in the 3-D scoring matrix.
[Figures: alignment path in the 2D matrix; alignment path in the 3D matrix]
Multiple Alignment: Dynamic Programming
[Figures: (x, y) is an entry in the 2-D scoring matrix; (x, y, z) is an entry in the 3-D scoring matrix]
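The step from the 2-D to the 3-D scoring matrix can be written as a recurrence. The following is a sketch of the standard three-sequence form rather than a formula recovered from the slides: the sequence names u, v, w, the table s, and the column score δ are notation assumed here for illustration. Terms whose indices would be negative are skipped.

```latex
s_{x,y,z} = \max
\begin{cases}
s_{x-1,\,y-1,\,z-1} + \delta(u_x, v_y, w_z) \\
s_{x-1,\,y-1,\,z} + \delta(u_x, v_y, -) \\
s_{x-1,\,y,\,z-1} + \delta(u_x, -, w_z) \\
s_{x,\,y-1,\,z-1} + \delta(-, v_y, w_z) \\
s_{x-1,\,y,\,z} + \delta(u_x, -, -) \\
s_{x,\,y-1,\,z} + \delta(-, v_y, -) \\
s_{x,\,y,\,z-1} + \delta(-, -, w_z)
\end{cases}
\qquad s_{0,0,0} = 0
```

The seven cases correspond one-to-one to the seven column types XXX, XX_, X_X, _XX, X_ _, _X_, _ _X.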
Multiple Alignment: Running Time
For 3 sequences of length n:
– There are 3 indices, one per sequence, so the full search space is a cube of about $n^3$ cells
– Each cell (viewed as the bottom-right-front corner of a unit cube) is computed from the 7 other corners
– Together: $O(7 \cdot n^3)$ computations, where $7 = 2^3 - 1$
For k sequences of length n:
– There are about $n^k$ cells in the k-dimensional table
– For each cell, we need to look at $2^k - 1$ other corners
– Together: $O(2^k \cdot n^k)$ computations
The problem is NP-complete
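The count of $2^k - 1$ corners per cell can be checked by enumerating them. This is a small sketch; the helper names `predecessor_moves` and `dp_work` are hypothetical, and `dp_work` uses n^k as the approximate cell count from the estimate above.

```python
from itertools import product


def predecessor_moves(k):
    """All non-zero 0/1 step vectors leading into a cell of a k-dimensional table.

    A 1 in position d means "consume one symbol of sequence d";
    the all-zero vector is excluded because it would not move at all.
    """
    return [move for move in product((0, 1), repeat=k) if any(move)]


def dp_work(n, k):
    """Approximate number of cell updates: (2^k - 1) predecessors for each of n^k cells."""
    return (2 ** k - 1) * n ** k


for k in (2, 3, 4):
    print(k, len(predecessor_moves(k)), dp_work(100, k))
```

For sequences of length 100, going from k=3 to k=4 multiplies the work by roughly 200, which is why exact dynamic programming is only used for very few sequences.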
Find a Highest-Scoring Multiple Sequence Alignment
The score of an alignment column is 1 if all three symbols are identical and 0 otherwise.
Note: the backtracking matrix is 3D, and each cell holds one of seven values (0 to 6, or 1 to 7), one per possible move.
http://rosalind.info/problems/ba5m/
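A possible solution to this exercise is sketched below, assuming the scoring rule stated above (1 for a column of three identical symbols, 0 otherwise). The function name `align_three` and the particular numbering of the seven moves are choices made for this example; when several alignments tie, it returns just one of them.

```python
from itertools import product

# The 7 possible moves; a move's index (0 to 6) is what the backtracking table stores.
MOVES = [m for m in product((0, 1), repeat=3) if any(m)]


def align_three(u, v, w):
    """Highest-scoring alignment of three strings; a column scores 1 only if all symbols match."""
    nu, nv, nw = len(u), len(v), len(w)
    score = [[[0] * (nw + 1) for _ in range(nv + 1)] for _ in range(nu + 1)]
    back = [[[-1] * (nw + 1) for _ in range(nv + 1)] for _ in range(nu + 1)]

    # Lexicographic order guarantees every predecessor is filled before its successor.
    for i, j, k in product(range(nu + 1), range(nv + 1), range(nw + 1)):
        if i == j == k == 0:
            continue
        best, best_move = None, -1
        for idx, (di, dj, dk) in enumerate(MOVES):
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue  # this move would leave the cube
            all_three = (di, dj, dk) == (1, 1, 1)
            gain = 1 if all_three and u[pi] == v[pj] == w[pk] else 0
            cand = score[pi][pj][pk] + gain
            if best is None or cand > best:
                best, best_move = cand, idx
        score[i][j][k] = best
        back[i][j][k] = best_move

    # Follow the stored moves back to the origin, emitting one column per step.
    rows = ([], [], [])
    i, j, k = nu, nv, nw
    while (i, j, k) != (0, 0, 0):
        di, dj, dk = MOVES[back[i][j][k]]
        rows[0].append(u[i - 1] if di else "-")
        rows[1].append(v[j - 1] if dj else "-")
        rows[2].append(w[k - 1] if dk else "-")
        i, j, k = i - di, j - dj, k - dk
    return score[nu][nv][nw], tuple("".join(reversed(r)) for r in rows)


total, aligned = align_three("ATATCCG", "TCCGA", "ATGTACTG")
print(total)
print("\n".join(aligned))
```

Time and memory both grow with the product of the three lengths, so this sketch is only suitable for short inputs.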
Scoring a Multiple Sequence Alignment (MSA)
• Entropy
• Sum of pairs
Some notations
• Let m denote a multiple sequence alignment
• $m_i$ is the ith column of the alignment m
• $m_{ij}$ is the symbol in the ith column and jth row
• $c_{ia}$ is the count of residue a in column i
Example alignment (each line is a row, each position is a column):
G A R F I E L D T H E F A T C A T
G A R F I E L D T H E - - - C A T
G A R F I E L D T H A T - - C A T
G A R R Y - L I K E D A - - C A T
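The column counts can be computed directly from the rows of the example alignment. This is a minimal sketch; the function name `column_counts` is hypothetical, and columns are indexed from 0 here, following Python's convention.

```python
from collections import Counter


def column_counts(alignment):
    """Return one Counter per column; counts[i][a] is the count of symbol a in column i."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    # zip(*alignment) transposes the rows into columns.
    return [Counter(column) for column in zip(*alignment)]


msa = [
    "GARFIELDTHEFATCAT",
    "GARFIELDTHE---CAT",
    "GARFIELDTHAT--CAT",
    "GARRY-LIKEDA--CAT",
]
counts = column_counts(msa)
print(counts[0])   # a fully conserved column
print(counts[3])   # a column with one substitution
print(counts[12])  # a column that is mostly gaps
```

Gaps are counted like any other symbol, which is the convention the entropy score relies on.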
Scoring a Multiple Sequence Alignment (MSA) • Key issue: how do we score a multiple sequence alignment? • Usually, we assume that columns of an alignment are independent • For now, we will simplify the score by assuming a linear gap penalty • Linear gap penalty can be incorporated into the substitution matrix • S(a,-)=-s=S(-,a) • S(-,-)=0
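Scoring of a column: Sum of Pairs
• Compute the sum of the pairwise scores: score of column i $= \sum_{k < l} S(m_{ik}, m_{il})$
  • The sum iterates over all pairs of rows k < l in the column
  • S(a, b) is the substitution score from a substitution matrix such as BLOSUM or PAM
• Example, for a column containing A, C, G, T: score of the column = S(A,C) + S(A,G) + S(A,T) + S(C,G) + S(C,T) + S(G,T)
• Number of pairs: $\binom{4}{2} = \frac{4!}{2!\,2!} = 6$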
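The sum-of-pairs score can be sketched as follows. A real application would look scores up in BLOSUM or PAM; here `simple_score` is a toy stand-in with assumed values (match +1, mismatch -1, gap -2), and all function names are hypothetical. It follows the convention that a symbol paired with a gap costs the gap penalty and a gap paired with a gap scores 0.

```python
from itertools import combinations


def simple_score(a, b, match=1, mismatch=-1, gap=-2):
    """Toy substitution score with a linear gap penalty folded in (assumed values)."""
    if a == "-" and b == "-":
        return 0          # S(-,-) = 0
    if a == "-" or b == "-":
        return gap        # S(a,-) = S(-,a) = -s
    return match if a == b else mismatch


def sum_of_pairs_column(column, score=simple_score):
    """Sum the pairwise scores over all unordered pairs of rows in one column."""
    return sum(score(a, b) for a, b in combinations(column, 2))


def sum_of_pairs(alignment, score=simple_score):
    """Columns are assumed independent, so the alignment score is the sum over columns."""
    return sum(sum_of_pairs_column(col, score) for col in zip(*alignment))


print(sum_of_pairs_column("ACGT"))  # 6 pairs, all mismatches
print(sum_of_pairs_column("AAAA"))  # 6 pairs, all matches
print(sum_of_pairs(["AC", "AG", "AT"]))
```

With r rows, each column contributes r(r-1)/2 pair scores, so scoring an alignment of length L costs on the order of L·r² lookups.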
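• Entropy is a measure of the uncertainty of a probability distribution $(p_1, \dots, p_N)$: $H(p_1, \dots, p_N) = -\sum_{n=1}^{N} p_n \log p_n$
• Entropy for a multiple alignment is the sum of the entropies of its columns: $H(m) = \sum_i H(m_i)$, where column i uses the frequencies $p_{ia} = c_{ia} / \sum_{a'} c_{ia'}$
• A gap is treated as one more symbol, just like a base.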
Example: Entropy = 0.9503
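The entropy score can be sketched in a few lines. The function names are hypothetical, and the logarithm base is left as a parameter because the slides do not state it; base 2 (bits) is only a default chosen for this example. Gaps are counted as ordinary symbols.

```python
import math
from collections import Counter


def column_entropy(column, base=2):
    """Entropy of the symbol frequencies in one column (gaps count as a symbol)."""
    total = len(column)
    entropy = 0.0
    for count in Counter(column).values():
        p = count / total
        entropy += p * math.log(1.0 / p, base)  # same as -p * log(p)
    return entropy


def alignment_entropy(alignment, base=2):
    """Entropy of an alignment: the sum of the entropies of its columns."""
    return sum(column_entropy(col, base) for col in zip(*alignment))


print(column_entropy("AAAA"))  # fully conserved column
print(column_entropy("ACGT"))  # four different symbols
print(round(column_entropy("AAACG", base=math.e), 4))
```

A fully conserved column has entropy 0, and lower total entropy means a better alignment. As an aside, a five-row column with frequencies 3/5, 1/5, 1/5 (such as AAACG) evaluates to about 0.9503 with the natural logarithm; the column behind the slide's example was not recoverable, so this is only one distribution consistent with that value.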