Profiles and multiple Sequence alignments. Understanding Bioinformatics 9th KIAS winter school Lee, Juyong. Contents. Defining profile PSSM by PSI-BLAST Profile HMM Aligning profiles PSSM & Profile HMM Generate multiple sequence alignment Progressive Other methods. What is Profile?.
Profiles and multiple Sequence alignments Understanding Bioinformatics 9th KIAS winter school Lee, Juyong
Contents • Defining profile • PSSM by PSI-BLAST • Profile HMM • Aligning profiles • PSSM & Profile HMM • Generate multiple sequence alignment • Progressive • Other methods
What is a Profile? • Represents the general properties of a set of sequences • A set of sequences contains more information than a single sequence • The sequence environment is taken into account • Two types • Position Specific Scoring Matrix (PSSM) • Profile Hidden Markov Model (profile HMM)
Position specific scoring matrix $> blastpgp -b 0 -j 3 -h 0.001 -d myDB -i mySEQ.fasta -Q myPSSM.mtx -o myMSA.bla
A set of sequences has more information • Are K, I and S meaningful? Are A & T meaningless? • K, I and S are highly conserved! • T in the sixth column is also conserved • The 2nd and 4th columns show no preference • Alignment rows: K-IAS-- KAI-ST- K-I-ST- KRISS-- K-I-STI K-IAS- KAI-ST
Generating PSSM • Input: multiple sequence alignment • Log-odds score m of amino acid a at position u • If a is not observed at position u, m → -∞: not good! • This lack of information must be handled
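The log-odds score on this slide can be sketched as follows. This is a minimal illustration, assuming the usual form m(u,a) = log(q(u,a) / p(a)), a uniform background p(a) = 1/20, and that gap characters are simply ignored; the function name `log_odds_column` and the toy alignment are illustrative, not taken from the slides.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def log_odds_column(column, background=None):
    """Log-odds score log(q/p) of every amino acid for one alignment column.

    Gaps are ignored and the background defaults to uniform; both are
    assumptions of this sketch.
    """
    if background is None:
        background = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
    residues = [c for c in column if c in AMINO_ACIDS]
    counts = Counter(residues)
    total = len(residues)
    scores = {}
    for a in AMINO_ACIDS:
        q = counts[a] / total if total else 0.0
        # An unobserved amino acid has q = 0, so its score is minus infinity.
        scores[a] = math.log(q / background[a]) if q > 0 else float("-inf")
    return scores

msa = ["K-IAS--", "KAI-ST-", "K-I-ST-", "KRISS--", "K-I-STI"]
first_column = [row[0] for row in msa]
scores = log_odds_column(first_column)
print(scores["K"])  # positive: K is fully conserved in this column
print(scores["A"])  # -inf: A was never observed in this column
```

The `-inf` entries are the problem the slide points out: any sequence with an unobserved residue at that position would be rejected outright.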
Generating PSSM (2) • Pseudo-counts • Terms in the formula: the fraction of amino acid a at position u, and the background distribution of amino acid a • α & β are scaling parameters
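The formula itself did not survive on this slide. A common form that matches the labelled terms mixes the observed fraction f(u,a) with the background p(a) as q(u,a) = (α·f(u,a) + β·p(a)) / (α + β). The sketch below assumes that form; the function name and the values of α and β are arbitrary illustrations.

```python
import math

def smoothed_fraction(f_ua, p_a, alpha, beta):
    """Mix the observed fraction with the background distribution.

    Assumed form: q = (alpha * f + beta * p) / (alpha + beta).
    """
    return (alpha * f_ua + beta * p_a) / (alpha + beta)

p_a = 0.05                      # assumed uniform background (1/20)
q = smoothed_fraction(0.0, p_a, alpha=4.0, beta=1.0)  # amino acid never observed
score = math.log(q / p_a)
print(q, score)                 # q stays above zero, so the score is finite
```

With β > 0, an unobserved amino acid receives a low but finite score instead of -∞.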
Generating PSSM (3) • More realistic pseudocounts • Use substitution matrix information rather than a random-alignment background! • The pseudocount of amino acid a is computed from F, the frequency of each amino acid b at position u • This is the formula used in PSI-BLAST
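The equations are missing from this slide. The following is a reconstruction of the pseudocount scheme as it is usually stated for PSI-BLAST, offered as an assumption rather than a copy of the original. The symbol names are chosen here: $f_{u,b}$ is the observed frequency of amino acid $b$ at position $u$, $p_b$ the background frequency, and $q_{ab}$ the target frequency of the pair $(a,b)$ implied by the substitution matrix.

```latex
\begin{align}
  g_{u,a} &= \sum_{b} \frac{f_{u,b}}{p_b}\, q_{ab}
          && \text{(pseudocount frequency of amino acid } a \text{ at position } u\text{)} \\
  q_{u,a} &= \frac{\alpha f_{u,a} + \beta g_{u,a}}{\alpha + \beta}
          && \text{(estimated frequency)} \\
  m_{u,a} &= \log \frac{q_{u,a}}{p_a}
          && \text{(PSSM score)}
\end{align}
```

Compared with the simpler scheme, the background term $p_a$ in the mixture is replaced by $g_{u,a}$, so an unobserved amino acid scores better when it substitutes readily for the residues that were observed in the column.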
PSI-BLAST is a sequence database search program • Goal: find sequence homologs! • First, perform a regular BLAST local search • Build a PSSM from the first-round results • Align sequences against the PSSM • Update the sequence alignment! • Repeat these steps iteratively!
Profile HMM • Represents the general properties of a set of sequences with a Hidden Markov Model • [Figure: HMM state diagram; arrows carry transition probabilities and each state emits an amino acid]
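Profile HMM (2) • Aligned sequences: KIA-S-, K-AIST, KI--ST • [Figure: profile HMM with Start and END states, match states M1 to M4, insert states I0 to I4 and delete states D1 to D4; amino acids A, S, K, T, I shown as emissions]
Profile HMM (3) • Aligned sequences: KIA-S-, K-AIST, KI--ST, KIAS-, KI-ST • [Figure: the same profile HMM topology (Start, M1 to M4, I0 to I4, D1 to D4, END); amino acids I, S, T, K shown as emissions]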
Estimate probabilities • Transition probability between states • Amino acid emission probability
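The estimation formulas are not preserved on this slide. A minimal sketch of the simplest approach is given below, assuming plain relative frequencies (observed count divided by the total count leaving or emitted by a state, with no pseudocounts). The path representation and the toy paths are hypothetical.

```python
from collections import Counter, defaultdict

def estimate_probabilities(paths):
    """Estimate transition and emission probabilities by relative frequency.

    paths: list of state paths; each path is a list of
    (state, emitted_symbol_or_None) tuples.
    """
    transitions = defaultdict(Counter)
    emissions = defaultdict(Counter)
    for path in paths:
        # Count each observed state-to-state step.
        for (state, _), (next_state, _) in zip(path, path[1:]):
            transitions[state][next_state] += 1
        # Count each symbol emitted by an emitting state.
        for state, symbol in path:
            if symbol is not None:
                emissions[state][symbol] += 1

    def normalise(table):
        return {
            s: {k: n / sum(c.values()) for k, n in c.items()}
            for s, c in table.items()
        }

    return normalise(transitions), normalise(emissions)

# Three toy sequences threaded through a two-column model (hypothetical data).
paths = [
    [("Start", None), ("M1", "K"), ("M2", "I"), ("End", None)],
    [("Start", None), ("M1", "K"), ("D2", None), ("End", None)],
    [("Start", None), ("M1", "K"), ("M2", "A"), ("End", None)],
]
transitions, emissions = estimate_probabilities(paths)
print(transitions["M1"])
print(emissions["M2"])
```

With so few sequences, most probabilities come out as 0 or 1, which is why real profile HMM builders add pseudocounts or priors.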
Profile HMM requires a lot of data • Many parameters to be trained • Transition probabilities: ~9 per position • Amino acid emission probabilities: ~20 per position • For a 100-residue sequence, ~3000 parameters to be tuned (100 × 29 = 2900) • Generally at least 20 to 30 related sequences are required to build an accurate profile HMM
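The parameter estimate above can be checked with a few lines of arithmetic. This sketch assumes the factors 9 and 20 are counted per model position, which is the reading that reproduces the slide's figure of about 3000 for a 100-residue sequence; the function name is illustrative.

```python
def profile_hmm_parameter_estimate(length,
                                   transitions_per_position=9,
                                   emissions_per_position=20):
    """Rough number of trainable parameters for a profile HMM of a given length."""
    return length * (transitions_per_position + emissions_per_position)

estimate = profile_hmm_parameter_estimate(100)
print(estimate)  # 2900, i.e. roughly 3000
```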
Many possible paths! We need to score them • QUERY: KRISS • [Figure: two alternative paths for the query through the profile HMM (Start, M1 to M4, I0 to I4, D1 to D4, END)]
How to score a sequence against a profile HMM • Two ways of evaluating the fit of a sequence to a profile HMM • Through the most probable path • Viterbi algorithm • Faster, less accurate • Considering all possible paths • Forward (Backward) algorithm • Slower, more accurate
Viterbi algorithm Equivalent to the dynamic programming of pairwise alignment
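A minimal Viterbi sketch follows. It uses a generic two-state HMM rather than the full match/insert/delete topology of a profile HMM, and every number in the toy model is invented for illustration. "M" stands for a match-like state and "I" for an insert-like state.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log probability, state path) of the single most probable path."""
    def log(x):
        return math.log(x) if x > 0 else float("-inf")

    # best[i][s]: best log probability of a path emitting observations[:i+1], ending in s
    best = [{s: log(start_p[s]) + log(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    back = [{}]
    for obs in observations[1:]:
        row, pointers = {}, {}
        for s in states:
            # Choose the best predecessor state r for state s.
            prev, score = max(
                ((r, best[-1][r] + log(trans_p[r].get(s, 0.0))) for r in states),
                key=lambda item: item[1],
            )
            row[s] = score + log(emit_p[s].get(obs, 0.0))
            pointers[s] = prev
        best.append(row)
        back.append(pointers)

    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[-1][last], path

# Toy model (all values are assumptions for illustration).
states = ["M", "I"]
start_p = {"M": 0.9, "I": 0.1}
trans_p = {"M": {"M": 0.8, "I": 0.2}, "I": {"M": 0.6, "I": 0.4}}
emit_p = {"M": {"K": 0.4, "I": 0.3, "S": 0.3},
          "I": {"K": 0.1, "R": 0.6, "I": 0.1, "S": 0.2}}

log_p, path = viterbi("KRISS", states, start_p, trans_p, emit_p)
print(path, math.exp(log_p))
```

The recursion has the same shape as pairwise dynamic programming: each cell keeps the best score over its predecessors plus a back-pointer for traceback. Here only the insert-like state can emit R, so the best path is forced through "I" at the second residue.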
Forward algorithm • Consider all possible paths! • Uses the probability of emitting x_i at state S_u
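A minimal forward-algorithm sketch follows, again on a generic two-state HMM with invented numbers rather than a real profile HMM. It replaces Viterbi's maximum over predecessor states with a sum, so the result is the total probability of the sequence over all paths. The model has no explicit end state, which is a simplification of this sketch.

```python
def forward(observations, states, start_p, trans_p, emit_p):
    """Total probability of the observations, summed over all state paths."""
    # f[s]: probability of emitting the prefix seen so far and being in state s
    f = {s: start_p[s] * emit_p[s].get(observations[0], 0.0) for s in states}
    for obs in observations[1:]:
        f = {
            s: emit_p[s].get(obs, 0.0)
               * sum(f[r] * trans_p[r].get(s, 0.0) for r in states)
            for s in states
        }
    return sum(f.values())

# Toy model (all values are assumptions for illustration).
states = ["M", "I"]
start_p = {"M": 0.9, "I": 0.1}
trans_p = {"M": {"M": 0.8, "I": 0.2}, "I": {"M": 0.6, "I": 0.4}}
emit_p = {"M": {"K": 0.4, "I": 0.3, "S": 0.3},
          "I": {"K": 0.1, "R": 0.6, "I": 0.1, "S": 0.2}}

total = forward("KRISS", states, start_p, trans_p, emit_p)
print(total)
```

For long sequences the raw probabilities underflow, so practical implementations work in log space or rescale each step.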
Summary • Profile: general properties of a query sequence derived from a set of related sequences • Position Specific Scoring Matrix • Profile Hidden Markov Model • Can find remote sequence homologs • Including those that cannot be detected by pairwise sequence alignment
Aligning Profiles • Comparing PSSMs • LAMA: no gaps allowed, uses the Pearson correlation of scores • Prof_sim: gaps allowed, uses the amino acid distribution at each column • COMPASS: gaps allowed, pseudocounts are used in a way similar to PSI-BLAST
Aligning profile HMMs • COACH and HHsearch are available • Can find very remote homologs • Position-dependent gap scoring is possible
Why is MSA difficult? • DP for pairwise alignment is easy and practical • Only three cases per column • With three sequences: seven cases • With six sequences: 60TB memory required • Full DP is impossible in practice • [Figure: the possible column patterns of residues A, V, L and gaps]
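The growth described above can be made concrete with a short calculation. Each alignment column holds a residue or a gap for every sequence, except the all-gap column, giving 2^N - 1 cases. A full DP table needs one cell per combination of positions. The sequence length and bytes per cell used below are assumptions, so the memory figure only indicates the scale.

```python
def dp_cases(n_sequences):
    """Number of column types (predecessor moves) in an N-sequence DP."""
    return 2 ** n_sequences - 1

def dp_cells(n_sequences, length):
    """Number of cells in a full DP table for N sequences of equal length."""
    return (length + 1) ** n_sequences

for n in (2, 3, 6):
    print(n, dp_cases(n))

# Assumed values for illustration: 200 residues per sequence, 1 byte per cell.
assumed_length = 200
bytes_per_cell = 1
terabytes = dp_cells(6, assumed_length) * bytes_per_cell / 1e12
print(f"{terabytes:.1f} TB")
```

Both the table size and the number of cases per cell grow exponentially with the number of sequences.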
Methods to align sequences • Progressive method • Add one sequence at a time • ClustalW, T-COFFEE, etc. • Iterative method • Deletion and realignment steps are introduced • Prrp, DIALIGN, MUSCLE, etc.
Order is important! Case 1 Let's align the following --D-G-D D-G-D --G-G-- G-G D-G-G D-G-G-- Case 2 D-G-G G-G D-G-D
Determine the order! • Build a phylogenetic tree (guide tree) from the matrix of all pairwise distances
Which MSA is better? Scoring scheme • Usually the sum-of-pairs score is used
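One simple way to turn a pairwise distance matrix into a joining order is average-linkage clustering (UPGMA), sketched below. UPGMA is used here only because it is short; the slides do not say which tree-building method a given program uses. The labels and distances are invented.

```python
from itertools import combinations

def upgma_join_order(names, dist):
    """Return the order in which clusters are merged under average linkage.

    names: list of sequence labels
    dist:  dict mapping frozenset({a, b}) to the distance between a and b
    """
    clusters = {name: (name,) for name in names}  # cluster label -> member leaves
    d = dict(dist)
    order = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merged = f"({a},{b})"
        members = clusters[a] + clusters[b]
        for c in clusters:
            if c not in (a, b):
                # Size-weighted mean of the distances to the two merged clusters.
                d[frozenset((merged, c))] = (
                    len(clusters[a]) * d[frozenset((a, c))]
                    + len(clusters[b]) * d[frozenset((b, c))]
                ) / len(members)
        del clusters[a], clusters[b]
        clusters[merged] = members
        order.append((a, b))
    return order

# Hypothetical distances between four sequences A, B, C, D.
names = ["A", "B", "C", "D"]
dist = {
    frozenset(("A", "B")): 2, frozenset(("A", "C")): 6, frozenset(("A", "D")): 10,
    frozenset(("B", "C")): 6, frozenset(("B", "D")): 10, frozenset(("C", "D")): 8,
}
order = upgma_join_order(names, dist)
print(order)
```

A progressive aligner would follow this order: align the closest pair first, then add the remaining sequences or sub-alignments as the tree dictates.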
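A minimal sum-of-pairs sketch: score every pair of rows in every column and add the results. The match, mismatch and gap values are arbitrary assumptions. Real programs use a substitution matrix and affine gap penalties.

```python
from itertools import combinations

def sum_of_pairs(msa, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment given as equal-length strings."""
    score = 0
    for column in zip(*msa):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue          # gap against gap contributes nothing
            elif x == "-" or y == "-":
                score += gap      # residue against gap
            elif x == y:
                score += match
            else:
                score += mismatch
    return score

msa = ["KAI-ST", "K-IAST", "KRI-SS"]   # toy alignment
sp_score = sum_of_pairs(msa)
print(sp_score)
```

Two candidate alignments of the same sequences can then be compared by this single number.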
Scores • ClustalW • Similar to the schemes for pairwise alignment • Employs residue-specific gap opening penalties
Scores (2) • T-COFFEE • Score if aligned column is present in the Library • Diverse alignment • Local & Global
Library Extension of T-COFFEE Different Weights for individual columns
Other methods - DIALIGN • Construct the whole alignment from ungapped local alignments • Find all ungapped alignments and weight them! • Key idea: pairwise alignment can miss biologically important regions
Other methods - SAGA • Genetic Algorithm • Alignment generation • Evolve through mutation & Crossover