Profiles and multiple Sequence alignments

Understanding Bioinformatics
9th KIAS winter school
Lee, Juyong

Contents
• Defining profile
  • PSSM by PSI-BLAST
  • Profile HMM
• Aligning profiles
  • PSSM & Profile HMM
• Generate multiple sequence alignment
  • Progressive
  • Other methods

What is a profile?
• Represents the general properties of a set of sequences
• A set of sequences contains more information than a single sequence
• The environment of each position is taken into account
• Two types
  • Position Specific Scoring Matrix (PSSM)
  • Profile Hidden Markov Model (profile HMM)

Example PSSM

Position specific scoring matrix
$> blastpgp -b 0 -j 3 -h 0.001 -d myDB -i mySEQ.fasta -Q myPSSM.mtx -o myMSA.bla

A set of sequences has more information
Pairwise alignment:
K-IAS-
KAI-ST
• Are K, I and S meaningful?
• Are A & T meaningless?
Multiple alignment:
K-IAS--
KAI-ST-
K-I-ST-
KRISS--
K-I-STI
• K, I and S are highly conserved!
• T in the sixth column is also conserved
• The 2nd and 4th columns show no preference

Generating PSSM
• Log-odds score m of amino acid a at position u, computed from a multiple sequence alignment
• Lack of information must be handled!
• Not good: if a is not observed at position u, m → -∞
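A minimal sketch of the problem described above, using the five-sequence example alignment. It assumes the common log-odds form m(u,a) = log(q(u,a) / p(a)), where q(u,a) is the observed fraction of amino acid a in column u and p(a) is a background frequency. The uniform background of 1/20, the rule of ignoring gaps, and the function name raw_log_odds are illustrative assumptions, not taken from the slides.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: uniform background frequency for every amino acid.
BACKGROUND = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}


def raw_log_odds(msa, background=BACKGROUND):
    """Return one dict per column mapping amino acid -> log(q / p).

    q is the fraction of non-gap residues in the column equal to the
    amino acid. Unobserved amino acids get -inf, which is the problem
    that pseudo-counts are meant to solve.
    """
    pssm = []
    for u in range(len(msa[0])):
        column = [seq[u] for seq in msa if seq[u] != "-"]
        counts = Counter(column)
        total = len(column)
        scores = {}
        for a in AMINO_ACIDS:
            q = counts[a] / total if total else 0.0
            scores[a] = math.log(q / background[a]) if q > 0 else -math.inf
        pssm.append(scores)
    return pssm


msa = ["K-IAS--", "KAI-ST-", "K-I-ST-", "KRISS--", "K-I-STI"]
pssm = raw_log_odds(msa)
print(pssm[0]["K"])  # K is the only residue seen in column 1
print(pssm[0]["A"])  # A is never seen in column 1
```

With only five sequences, most of the 20 amino acids are unobserved in every column, so most entries of this raw matrix are -inf.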

Generating PSSM (2)
Pseudo-counts
• The estimate combines the fraction of amino acid a at position u with the overall distribution of amino acid a
• α & β are scaling parameters
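A small sketch of how pseudo-counts remove the -∞ scores. The slide's formula did not survive extraction, so the code assumes a commonly used form, q(u,a) = (α f(u,a) + β p(a)) / (α + β), with f(u,a) the observed fraction and p(a) the background distribution. The uniform background, the values α = 4 and β = 10, and the function name are chosen only for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pseudocount_log_odds(column, alpha, beta, background=None):
    """Log-odds scores for one alignment column with pseudo-counts.

    Assumed form: q(a) = (alpha * f(a) + beta * p(a)) / (alpha + beta),
    where f(a) is the observed fraction and p(a) the background.
    """
    if background is None:
        # Assumption: uniform background frequencies.
        background = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
    residues = [r for r in column if r != "-"]
    counts = Counter(residues)
    total = len(residues)
    scores = {}
    for a in AMINO_ACIDS:
        f = counts[a] / total if total else 0.0
        q = (alpha * f + beta * background[a]) / (alpha + beta)
        scores[a] = math.log(q / background[a])
    return scores


# Column 1 of the example alignment: K in all five sequences.
scores = pseudocount_log_odds("KKKKK", alpha=4.0, beta=10.0)
print(scores["K"], scores["A"])
```

Every score is now finite: the conserved K stays positive, while unobserved amino acids receive a moderate negative score instead of -∞.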

Generating PSSM (3)
More realistic pseudocounts
• Use substitution matrix information rather than random alignment!
• The pseudocount of amino acid a is built from F, the frequency of each amino acid b at position u
• This is the formula used in PSI-BLAST
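For reference, a sketch of the substitution-matrix-based pseudocount as it is described in the PSI-BLAST paper (Altschul et al., 1997). The slide's own notation was lost, so the symbol names here are assumptions: f_{u,b} is the observed frequency of amino acid b at position u, p_b is the background frequency, t_{ab} is the target frequency of the pair (a, b) implied by the substitution matrix, and g_{u,a} is the resulting pseudocount frequency.

```latex
g_{u,a} = \sum_{b} \frac{f_{u,b}}{p_b}\, t_{ab},
\qquad
q_{u,a} = \frac{\alpha f_{u,a} + \beta g_{u,a}}{\alpha + \beta},
\qquad
m_{u,a} = \log \frac{q_{u,a}}{p_a}
```

Compared with a background-only pseudocount, g_{u,a} is large when the residues actually observed at position u tend to substitute for a, so chemically similar amino acids are favored over unrelated ones.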

Example of Pseudocount

PSI-BLAST is a sequence DB searching program
• Goal: find sequence homologs!
• First, perform a regular BLAST local search
• Build a PSSM from the first-round result
• Align sequences against the PSSM
• Update the sequence alignment!
• Repeat these steps iteratively!

Sequence Logo

Profile HMM
• Represents the general properties of a set of sequences based on a Hidden Markov Model
[Figure: HMM state diagram; each state emits an amino acid, and the arrows carry transition probabilities]

Profile HMM (2)
KIA-S-
K-AIST
KI--ST
[Figure: profile HMM with match states M1 to M4, insert states I0 to I4 and delete states D1 to D4 between Start and END; emitted residues K, I, A, S, T]

Profile HMM (3)
KIA-S-
K-AIST
KI--ST
KIAS-
KI-ST
[Figure: profile HMM with match states M1 to M4, insert states I0 to I4 and delete states D1 to D4 between Start and END; emitted residues K, I, S, T]

Estimate probabilities
• Transition probability between states
• Amino acid emission probability
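The slide's formulas were not preserved. As a sketch, the standard maximum-likelihood estimates for an HMM trained from an alignment, in which every residue is already assigned to a state, are simple normalized counts. The symbols are assumptions: A_{kl} is the number of observed transitions from state k to state l, and E_k(a) is the number of times state k emits amino acid a.

```latex
a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}},
\qquad
e_k(a) = \frac{E_k(a)}{\sum_{a'} E_k(a')}
```

As with the PSSM, unobserved transitions or emissions would get probability zero, so pseudocounts are typically added to the counts before normalizing.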

Profile HMM requires a lot of data
• Many parameters to be trained
  • Transition probabilities: ~ Nseq × 9
  • Amino acid emission probabilities: ~ Nseq × 20
• For a 100-residue sequence (Nseq = 100): 900 + 2000, i.e. ~3000 parameters to be tuned
• Generally at least 20 to 30 related sequences are required to build an accurate profile HMM

Many possible paths! We need to score them...
QUERY: KRISS
[Figure: two alternative paths of the query through the profile HMM (states Start, M1 to M4, I0 to I4, D1 to D4, END), each emitting K, R, I, S, S]

How to score a sequence against a profile HMM
• Two ways of evaluating the fit of a sequence to a profile HMM
• Through the most probable path
  • Viterbi algorithm
  • Faster, less accurate
• Consider all possible paths!
  • Forward (Backward) algorithm
  • Slower, more accurate

Viterbi algorithm
• Equivalent to the dynamic programming of pairwise alignment

Forward algorithm
• Consider all possible paths!
• Uses the probability of emitting x_i at state S_u
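The two scoring approaches can be compared on a toy model. This is a sketch on a generic two-state discrete HMM, not a full profile HMM: there are no delete states and no Start/END states, and the state names "M" and "I", the four-letter alphabet and every probability are made up for illustration. Viterbi keeps the maximum over predecessor states, while Forward sums over them.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Probability and state sequence of the single most probable path."""
    # best[s] = (probability of the best path ending in s, that path)
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for x in obs[1:]:
        new_best = {}
        for s in states:
            prob, path = max(
                ((best[r][0] * trans_p[r][s], best[r][1]) for r in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit_p[s][x], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


def forward(obs, states, start_p, trans_p, emit_p):
    """Total probability of the observations, summed over all paths."""
    f = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for x in obs[1:]:
        f = {
            s: emit_p[s][x] * sum(f[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(f.values())


# Hypothetical toy model: a match-like state M and an insert-like state I.
states = ("M", "I")
start_p = {"M": 0.9, "I": 0.1}
trans_p = {"M": {"M": 0.8, "I": 0.2}, "I": {"M": 0.6, "I": 0.4}}
emit_p = {
    "M": {"K": 0.4, "R": 0.05, "I": 0.25, "S": 0.3},
    "I": {"K": 0.25, "R": 0.25, "I": 0.25, "S": 0.25},
}

query = "KRISS"
best_prob, best_path = viterbi(query, states, start_p, trans_p, emit_p)
total_prob = forward(query, states, start_p, trans_p, emit_p)
print(best_path, best_prob, total_prob)
```

The Forward probability is never smaller than the Viterbi probability, because the best path is one of the paths included in the sum. Real implementations work with log probabilities to avoid underflow on long sequences.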

Summary
• Profile → general properties of a query sequence derived from a set of related sequences
  • Position Specific Scoring Matrix
  • Profile Hidden Markov Model
• Can find remote sequence homologs
  • Those that cannot be detected by pairwise alignment of sequences

Aligning Profiles
• Comparing PSSMs
  • LAMA: no gaps allowed, uses Pearson correlation of scores
  • Prof_sim: gaps allowed, uses the amino acid distribution at each column
  • COMPASS: gaps allowed, pseudocounts are used in a way similar to PSI-BLAST

Aligning profile HMMs
• COACH and HHsearch are available
• Can find very remote homologs
• Position-dependent gap scoring is possible

Multiple Sequence Alignment (MSA)

Why is MSA difficult?
• DP for pairwise alignment is easy and applicable
  • Only three cases per column
• With three sequences...
  • Seven cases...
• For six sequences...
  • 60TB memory required
  • DP is impossible
[Figure: example alignment columns built from the residues A, V, L and gaps]
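The growth behind these numbers can be sketched as follows. In each column every sequence contributes either a residue or a gap, and the all-gap column is excluded, which gives 2^N - 1 cases for N sequences. A full DP lattice needs one cell per combination of positions. The sequence length of 200 and the function names are hypothetical, since the length and bytes per cell behind the 60TB figure are not stated.

```python
def column_cases(n_sequences):
    """Residue/gap patterns per column: each sequence contributes a
    residue or a gap, and the all-gap column is excluded."""
    return 2 ** n_sequences - 1


def dp_cells(n_sequences, length):
    """Cells in the full dynamic-programming lattice, (length + 1) ** n."""
    return (length + 1) ** n_sequences


# Hypothetical length of 200 residues for every sequence.
for n in (2, 3, 6):
    print(n, column_cases(n), dp_cells(n, length=200))
```

For six sequences of this assumed length the lattice has more than 10^13 cells, which is tens of terabytes even at one byte per cell.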

Methods to align sequences
• Progressive method
  • Add one sequence at a time
  • ClustalW, T-COFFEE, etc.
• Iterative method
  • Deletion and realigning steps are introduced
  • Prrp, DIALIGN, MUSCLE, etc.

Order is important!
Let's align the following sequences.
Case 1 (order: D-G-D, G-G, D-G-G) →
--D-G-D
--G-G--
D-G-G--
Case 2 (order: D-G-G, G-G, D-G-D) → [resulting alignment missing]

Determine the order!
• Build a phylogenetic tree from the matrix of all pairwise distances
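A minimal sketch of deriving a merge order from a pairwise distance matrix. It uses UPGMA (average-linkage clustering) as a simple stand-in; ClustalW is usually described as building its tree with neighbor joining instead. The sequence names and distances are hypothetical.

```python
def upgma_order(names, dist):
    """Return the merges (cluster_a, cluster_b, distance) made by UPGMA.

    `dist` maps frozenset({name_i, name_j}) to the pairwise distance.
    Clusters are tuples of sequence names.
    """
    clusters = [(n,) for n in names]

    def cluster_distance(c1, c2):
        # UPGMA: average distance over all pairs of members.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        pairs = [
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        d, i, j = min(pairs)  # closest pair of clusters
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical distances between four sequences.
names = ["A", "B", "C", "D"]
dist = {
    frozenset(("A", "B")): 0.1,
    frozenset(("A", "C")): 0.4,
    frozenset(("A", "D")): 0.5,
    frozenset(("B", "C")): 0.4,
    frozenset(("B", "D")): 0.5,
    frozenset(("C", "D")): 0.3,
}
for step in upgma_order(names, dist):
    print(step)
```

The merge list is the alignment order: the closest pair is aligned first, and the more distant sequences or groups are added later.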

Which MSA is better? Scoring scheme
• Usually the Sum of Pairs score is used
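A minimal sketch of a Sum of Pairs score: every pair of rows is scored in every column and the results are added up. The scoring values (match +1, mismatch -1, residue against gap -2, gap against gap 0) are a toy scheme chosen for illustration; real programs use a substitution matrix and affine gap penalties.

```python
from itertools import combinations


def sum_of_pairs(msa, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score with a toy scoring scheme.

    Every pair of rows is scored in every column: residue/residue by
    match or mismatch, residue/gap by the gap penalty, gap/gap by 0.
    """
    score = 0
    for column in zip(*msa):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue
            if x == "-" or y == "-":
                score += gap
            elif x == y:
                score += match
            else:
                score += mismatch
    return score


msa = ["KIA-S-", "K-AIST", "KI--ST"]
print(sum_of_pairs(msa))
```

Two candidate alignments of the same sequences can be compared by computing this score for each and keeping the higher one.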

Scores
• ClustalW
  • Similar to the schemes for pairwise alignment
  • Employs residue-specific gap opening penalties

Scores (2)
• T-COFFEE
  • A score is given if the aligned column is present in the Library
  • The Library is built from diverse alignments
  • Local & Global

Library Extension of T-COFFEE
• Different weights for individual columns

Other methods - DIALIGN
• Construct the whole alignment from ungapped local alignments
• Find all ungapped alignments and weight them!
• Key idea: pairwise alignment can miss biologically important regions

Other methods - SAGA
• Genetic Algorithm
• Alignment ⇔ generation
• Evolve through mutation & crossover

Other methods - MSACSA

Thank you!
