52930 Protein informatics Liisa Holm
Organization • Lectures • Wednesdays 6 September to 14 October • Exam Friday 16 October • Essay type question • Numerical problems • Textbook • DW Mount: Bioinformatics. Sequence and genome analysis. 2nd edition. Chapters 3-7,10-11 • Web site • http://ekhidna.biocenter.helsinki.fi/teaching/winter2009/proteiinianalyysi
Aims & scope • Expose biology students to background of methods • Related practical course • Practical course in protein informatics (Proteiinianalyysinharjoitustyöt) • Hands-on practice in using web servers that implement methods • Neither course required for the other
Topics • Pairwise alignment • Probability and statistical analysis of sequence alignments • Multiple sequence alignment • Database searching • Phylogenetic prediction • Protein classification and structure prediction • Genome annotation
Why align sequences? • Common ancestor • Infer common evolutionary origin from similarity • Then can infer function and structure • Similarity can be due to • Gene duplication + speciation • Horizontal gene transfer • Gene fusion • Convergence (similarity without homology) [Figure: Sequence A and Sequence B diverging from an ancestral sequence by x and y steps]
Similar sequences are likely homologous • Dissimilar sequences are less likely to be homologous
4-letter word example • This is not the usual substitution model • WORD (d=0, p=1/N^4) • WORE (d=1, p=4/N^3) • GORE (d=2, p=6/N^2) • GONE (d=3, p=4/N) • GENE (d=4, p=1)
Optimal alignment • Assuming independence between scores for each position, the optimal alignment can be determined using dynamic programming • Setup: scoring matrix, gap penalties
Dynamic programming • Maximal path sum from BEGIN to END? • Enumerate every path (brute force) • Use induction: only one optimal path needs to be kept up to any node in the graph. [Figure: weighted directed graph with nodes BEGIN, A, B, C, D, END]
Example: all paths leading to B [Figure: the weighted graph with nodes BEGIN, A, B, C, D, END and path sums marked]
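The induction argument can be sketched in a few lines of Python. This is a minimal illustration, not the graph from the slide: the edge weights in `EDGES` and the node order in `ORDER` are hypothetical, because the figure's weights cannot be recovered from the transcript. For every node, only the best path sum reaching it is kept.

```python
# Hypothetical weighted directed acyclic graph (weights are made up).
EDGES = {
    "BEGIN": {"A": 3, "D": 1},
    "A": {"B": 0, "C": 3},
    "D": {"B": 1, "C": 2},
    "B": {"END": 4},
    "C": {"END": 2},
    "END": {},
}
# A topological order: every node appears after all of its predecessors.
ORDER = ["BEGIN", "A", "D", "B", "C", "END"]


def best_path(edges, order, start, goal):
    """Return (maximal path sum, path) from start to goal."""
    best = {start: 0}   # best path sum found so far for each node
    prev = {}           # predecessor on that best path
    for node in order:
        if node not in best:
            continue    # node not reachable from start
        for nxt, weight in edges[node].items():
            candidate = best[node] + weight
            if nxt not in best or candidate > best[nxt]:
                best[nxt] = candidate
                prev[nxt] = node
    # Trace back from the goal to recover the optimal path.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return best[goal], path[::-1]


score, path = best_path(EDGES, ORDER, "BEGIN", "END")
print(score, path)
```

Each edge is examined once, so the work grows with the number of edges rather than with the number of paths, which is the point of avoiding brute-force enumeration.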
Global alignment • Needleman-Wunsch algorithm • Maximal trace from beginning to end • Global alignment score may be negative
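A minimal sketch of the Needleman-Wunsch recursion, assuming a simple match/mismatch score and a linear gap penalty (the parameter values are illustrative, not those of any particular program). It returns only the optimal global score, not the alignment itself.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    # Leading gaps are penalised: the alignment must span both sequences.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                F[i - 1][j] + gap,    # gap in b
                F[i][j - 1] + gap,    # gap in a
            )
    return F[-1][-1]


print(needleman_wunsch("GENE", "GONE"))
print(needleman_wunsch("AAAA", "TTTT"))  # negative: nothing matches
```

The second call illustrates that a global score can be negative, because every residue has to be placed in the alignment.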
Local alignment • Aligned region truncated to segment giving the largest positive contribution
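Local alignment is commonly computed with the Smith-Waterman variant of the same recursion. The sketch below assumes the same illustrative match/mismatch/gap scores as a simple example; the only changes from the global case are that cell values are floored at zero and the best cell anywhere in the matrix is reported.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal local alignment score of strings a and b."""
    best = 0
    prev = [0] * (len(b) + 1)  # previous matrix row
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                0,                # start a new local alignment here
                prev[j - 1] + s,  # extend along the diagonal
                prev[j] + gap,    # gap in b
                cur[j - 1] + gap, # gap in a
            )
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


print(smith_waterman("XXGENEYY", "ZZGENEWW"))
```

Because of the floor at zero, flanking regions that would lower the score are simply left out of the aligned segment.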
Scoring alignments • Substitution matrices • Gap penalties • Significance • Aligning two sequences, would you expect the same level of similarity by chance alone?
Conversion between odds score, log odds and bit scores • Odds score = ratio of likelihoods of two events or outcomes. E.g. observed frequency of aligned A and B in related sequences divided by the frequency with which A and B align by chance • f(A and B) / [ f(A) * f(B)] • Odds scores are often converted to logarithms to create log odds scores. • Log odds scores are additive. • Bit score = log odds score converted to a logarithm to the base 2
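The conversions can be written out directly. This is a small sketch with made-up frequencies; the function names are hypothetical.

```python
import math


def odds_score(f_ab, f_a, f_b):
    """Observed frequency of the aligned pair over its chance frequency."""
    return f_ab / (f_a * f_b)


def log_odds(odds, base=math.e):
    """Log odds score in the chosen logarithm base."""
    return math.log(odds, base)


def bit_score(odds):
    """Log odds score expressed in bits (logarithm to the base 2)."""
    return math.log2(odds)


# Made-up frequencies: the pair is seen 4 times more often than by chance.
odds = odds_score(0.02, 0.1, 0.05)
print(odds, bit_score(odds))

# Additivity: multiplying odds corresponds to adding log odds.
print(bit_score(4.0 * 0.5), bit_score(4.0) + bit_score(0.5))
```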
Bit-scores • The score needed to distinguish an MSP from chance is approximately the number of bits needed to specify where the MSP starts in each of the two sequences being compared • MSP = maximally scoring pair • Ungapped alignment case • log2(N) bits are needed to distinguish among N possibilities • Two proteins of 250 residues: 16 bits • Database of 4M residues: 30 bits [160 M: 34 bits]
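The first two figures can be checked with a short calculation. The sketch assumes that the number of possibilities N is the number of start-position pairs, i.e. the product of the two lengths, and that the database case uses a 250-residue query; both are assumptions made for this illustration.

```python
import math


def bits_needed(length_1, length_2):
    """Bits needed to specify one start position in each sequence."""
    return math.log2(length_1 * length_2)


print(round(bits_needed(250, 250)))        # two proteins of 250 residues
print(round(bits_needed(250, 4_000_000)))  # 250-residue query vs 4M residues
```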
Dayhoff model • Markov chain: mutations independent of previous mutations • Data: 71 groups of closely related sequences (>85 % similarity), yielding 1572 substitution events • Mutability of amino acid types (per 100 accepted point mutations)
PAM1 and PAM250 for Phe -> X These are mutation probabilities!
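Under the Markov chain assumption, the PAM250 mutation probabilities follow from multiplying the PAM1 matrix by itself 250 times. The sketch below shows the mechanics on a hypothetical 3-letter alphabet; `TOY_PAM1` is made up and is not Dayhoff's data.

```python
def mat_mul(X, Y):
    """Product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, power):
    """M raised to a non-negative integer power by repeated squaring."""
    n = len(M)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    base = M
    while power:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Toy one-step matrix: row = original residue, column = replacement.
# Each row sums to 1 and about 1 % of residues change per step.
TOY_PAM1 = [
    [0.990, 0.007, 0.003],
    [0.006, 0.990, 0.004],
    [0.002, 0.008, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(x, 3) for x in row])
```

After many steps the probability of remaining unchanged drops well below its one-step value, while each row still sums to 1, which is why the entries are mutation probabilities rather than scores.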
Log odds form of PAM250 • Unit is 10 * logarithm to the base 10 of ratio • S(A,B) = ½ * (10 * log10(p(A->B)/f(A)) + 10 * log10(p(B->A)/f(B))) • Range -8 … +17 • Local alignment scores are maximal when the PAM distance corresponds to the similarity of the target sequences
BLOSUM matrices • The BLOSUM matrix assigns a probability score for each residue pair in an alignment based on: • the frequency with which that pairing is known to occur within conserved blocks of related proteins. • BLOSUM matrices are constructed from observations which lead to observed probabilities
BLOSUM substitution matrices • BLOSUM matrices are used in ‘log-odds’ form based on actually observed substitutions. • This is because: • Ease of use: scores can simply be added (the raw probabilities would have to be multiplied) • Ease of interpretation: • S=0 : substitution is observed exactly as often as expected by chance • S<0 : substitution is observed less often than expected by chance • S>0 : substitution is observed more often than expected by chance • Unit is half-bits (logarithm base 2 of the odds ratio, multiplied by 2)
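A half-bit score can be computed from an observed pair frequency and the two background frequencies as sketched below. The frequencies are made up for illustration and are not taken from the BLOCKS data.

```python
import math


def half_bit_score(q_ab, p_a, p_b):
    """Log-odds score in half-bits, rounded to the nearest integer."""
    return round(2 * math.log2(q_ab / (p_a * p_b)))


# Made-up frequencies with background p_a = p_b = 0.1 (chance pairing 0.01).
print(half_bit_score(0.01, 0.1, 0.1))    # observed as often as chance
print(half_bit_score(0.04, 0.1, 0.1))    # observed 4x more often than chance
print(half_bit_score(0.0025, 0.1, 0.1))  # observed 4x less often than chance
```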
Information content • Using a standard measure for overall amino acid frequencies gives the information content of a random protein sequence as 4.19 bits/residue. • Thus, for an average size protein domain (150 residues), the message length is ~630 bits and the probability that 2 random sequences would specify the same message is 2^-630 (≈10^-190). > Database searching for protein similarities is doable, even for fairly short sequences • BUT, for a transcription factor binding site of 8-10 bp, the odds of 2 random sequences arriving at the same message are about 10^-5. > Database searching for regulatory elements does not work well as databases get larger
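The arithmetic behind these figures can be reproduced as follows. The sketch assumes Shannon entropy as the measure of information content and, for the DNA case, four equally frequent bases (2 bits per base); the helper names are hypothetical.

```python
import math


def entropy_bits(freqs):
    """Shannon entropy in bits per symbol for the given frequencies."""
    return -sum(p * math.log2(p) for p in freqs if p > 0)


def log10_match_probability(message_bits):
    """log10 of the chance that a random message matches, 2^-bits."""
    return -message_bits * math.log10(2)


# Upper bound: 20 equally frequent amino acids give log2(20) bits/residue.
print(round(entropy_bits([1 / 20] * 20), 2))

# Protein domain: 150 residues at 4.19 bits/residue.
protein_bits = 150 * 4.19
print(round(protein_bits), round(log10_match_probability(630)))

# Binding site: 8 bp at an assumed 2 bits per base.
print(round(log10_match_probability(8 * 2)))
```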
Relative entropy H of target and background distributions • Scale score matrix s to bits • $H = \sum_{i,j} q_{ij}\, s_{ij} = \sum_{i,j} q_{ij} \log_2 \frac{q_{ij}}{p_i p_j}$ • q = target frequencies of aligned amino acid pairs • p = background frequencies • H measures the average information available per position to distinguish the alignment from chance
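A minimal sketch of the relative entropy calculation on a hypothetical two-letter alphabet; the target frequencies in `q_related` are made up, and the background frequencies are taken as equal.

```python
import math


def relative_entropy(q, p):
    """H = sum over pairs of q_ij * log2(q_ij / (p_i * p_j)), in bits."""
    return sum(q_ij * math.log2(q_ij / (p[i] * p[j]))
               for (i, j), q_ij in q.items() if q_ij > 0)


p = {"A": 0.5, "B": 0.5}  # background frequencies

# Made-up target frequencies: identical pairs are enriched.
q_related = {("A", "A"): 0.4, ("B", "B"): 0.4,
             ("A", "B"): 0.1, ("B", "A"): 0.1}
# Target equal to background: pairs occur exactly as often as by chance.
q_chance = {pair: 0.25 for pair in q_related}

print(round(relative_entropy(q_related, p), 3))
print(relative_entropy(q_chance, p))
```

When the target distribution equals the product of the background frequencies, H is zero: no position carries information that separates the alignment from chance.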
Score of an alignment • $\text{Score} = \sum_{i,j} f_{ij}\, s_{ij} \sim \sum_{i,j} f_{ij} \ln \frac{q_{ij}}{p_i p_j}$ • Optimal scoring matrix: target distribution q = frequencies in alignment f
Affine gap penalties • Gap opening penalty (g) • Gap extension penalty (r) • W(x) = g + rx • x is the length of the gap • Gap penalties that work well: • BLOSUM62 (-11,-1)
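The formula W(x) = g + rx can be sketched as below, using g = 11 and r = 1 as penalty magnitudes. Conventions differ between programs (some charge the extension only from the second gap position onwards), so the function follows the formula on the slide and nothing more.

```python
def affine_gap_penalty(length, gap_open=11, gap_extend=1):
    """Penalty W(x) = g + r*x for a gap of the given length."""
    if length <= 0:
        return 0  # no gap, no penalty
    return gap_open + gap_extend * length


for x in (1, 2, 5, 10):
    print(x, affine_gap_penalty(x))
```

One long gap is much cheaper than several short ones of the same total length, which reflects the idea that a single insertion or deletion event can span many residues.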
Statistical Significance • A good way to determine if an alignment score has statistical meaning is to compare it with the score generated from the alignment of two random sequences • A model of ‘random’ sequences is needed. The simplest model chooses the amino acid residues in a sequence independently, with background probabilities (Karlin & Altschul (1990) Proc. Natl. Acad. Sci. USA 87: 2264-2268)
Alignment score • Optimal alignment scores follow an extreme value distribution • Exact theory for ungapped local alignments, which requires that: • There is at least one positive score sij • The average (expected) score is negative • Results hold empirically for gapped alignments
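In the Karlin-Altschul framework the extreme value distribution leads to the standard formulas E = K·m·n·e^(-λS) and S' = (λS - ln K)/ln 2 for the bit score. The sketch below applies them; the values of `lam` and `K` are illustrative assumptions (of the kind reported for gapped BLOSUM62 searches), and in practice they should be read from the search program's output.

```python
import math


def bit_score(raw_score, lam, K):
    """Normalised bit score S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value(raw_score, m, n, lam, K):
    """Expected number of chance hits scoring at least raw_score."""
    return K * m * n * math.exp(-lam * raw_score)


def e_value_from_bits(bits, m, n):
    """The same expectation computed from the bit score."""
    return m * n * 2.0 ** (-bits)


def p_value(e):
    """Probability of at least one chance hit, 1 - exp(-E)."""
    return -math.expm1(-e)


lam, K = 0.267, 0.041     # assumed, illustrative parameters
m, n = 250, 4_000_000     # query length and database size in residues
raw = 100

print(bit_score(raw, lam, K))
print(e_value(raw, m, n, lam, K))
print(e_value_from_bits(bit_score(raw, lam, K), m, n))
```

Because λ and K are absorbed into the bit score, bit scores from searches with different scoring systems can be compared directly.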
The need for statistics • Statistics is very important for bioinformatics. • It is very easy to have a computer analyze the data and give you back a result. • The problem is to decide whether the answer the computer gives you is any good at all. • Questions: • How statistically significant is the answer? • What is the probability that this answer could have been obtained by chance? What does this depend on?
Basics • Population (size N) • Sample (size n) • Descriptive statistics • Probability [Figure: a sample of size n drawn from a population of size N]
Substitution matrices • Score of amino acid a with amino acid b: $S_{ab} = \frac{1}{\lambda} \ln \frac{P_{ab}}{f_a f_b}$ • P_ab is the observed frequency with which residues a and b are aligned because of homology • Lambda (λ) is a scaling factor equal to 0.347, set so that the scores can be rounded off to sensible integers • f_a f_b is the expected frequency of seeing residues a and b paired together, which is just the product of the frequency of residue a and the frequency of residue b • Source: Eddy S., Where did the BLOSUM62 alignment score matrix come from? Nat. Biotech. 22, Aug 2004
Observed/expected (O/E) ratio = e^(λS), with λ = 0.347 • i) S=0 : O/E ratio = 1 • ii) Compare S=5 and S=10: O/E ratio = 5.7 and 32.1. The ratio is an exponential function of the score • iii) S=-10 : O/E ratio = 0.031 ≈ 1/32 • iv) Ratio of scores S1, S2 in terms of probabilities of observed/random = e^(λ(S1 - S2))
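The observed/expected ratios quoted here follow from inverting the score definition, O/E = e^(λS). A short check, using the λ = 0.347 given for the substitution matrix:

```python
import math

LAMBDA = 0.347  # scaling factor quoted for the substitution matrix


def odds_ratio(score, lam=LAMBDA):
    """Observed/expected ratio implied by an integer score."""
    return math.exp(lam * score)


for s in (0, 5, 10, -10):
    print(s, round(odds_ratio(s), 3))

# Two scores compare through their difference: e^(lambda * (S1 - S2)).
print(round(odds_ratio(10) / odds_ratio(5), 1))
```

Doubling the score from 5 to 10 does not double the odds ratio; it squares it.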
Example: BLAST • Motivations • Exact algorithms are exhaustive but computationally expensive. • Exact algorithms are impractical for comparing a query sequence to millions of other sequences in a database (database scanning), • and so database scanning requires a heuristic alignment algorithm (at the cost of optimality).
Interpret BLAST results - Description • ID (GI #, RefSeq #, DB-specific ID #): click to access the record in GenBank • Gene/sequence definition • Bit score: higher is better. Click to access the pairwise alignment • Expect value: lower is better. It is the number of hits with at least this score expected by chance, so it indicates how likely the hit is to be random • Links
Problems with BLAST • Why do results change? • How can you compare results from different BLAST tools which may report different types of values? • How are results (e.g. E-value) affected by the query? • There are _many_ values reported in the output: what do they mean?
Example: Importance of BLAST statistics • But, first a review.
Review • What is a distribution? • A plot showing the frequency of a given variable or observation.
Features of a Normal Distribution • Symmetric distribution • Has an average or mean value at the centre (μ = mean) • Has a characteristic width called the standard deviation (S.D. = σ) • Most common type of distribution known
Mean, Median & Mode [Figure: distribution curve with the mode, median and mean marked]
Mean, Median, Mode • In a Normal Distribution the mean, mode and median are all equal • In skewed distributions they are unequal • Mean - average value, affected by extreme values in the distribution • Median - the “middlemost” value, usually half way between the mode and the mean • Mode - most common value
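The difference between symmetric and skewed data can be seen with Python's `statistics` module. Both data sets are made up for illustration.

```python
import statistics

symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
skewed = [1, 1, 1, 2, 2, 3, 4, 8, 14]  # long tail towards high values

for name, data in (("symmetric", symmetric), ("skewed", skewed)):
    print(name,
          "mean:", statistics.mean(data),
          "median:", statistics.median(data),
          "mode:", statistics.mode(data))
```

In the symmetric set the three values coincide; in the skewed set the two extreme values pull the mean above the median, while the mode stays at the most common value.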
Different Distributions • Unimodal • Bimodal
Other Distributions • Binomial Distribution • the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p. • Poisson Distribution • expresses the probability of a number of events occurring in a fixed period of time if these events occur with a known average rate and independently of the time since the last event. • Extreme Value Distribution • Gumbel distribution • used to model the distribution of the maximum (or the minimum) of a number of samples of various distributions.
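The three distributions can be written down directly from their textbook formulas. This is a sketch using only the standard library; the Gumbel function uses the standard location/scale form for maxima, with `mu` and `beta` as its parameters.

```python
import math


def binomial_pmf(k, n, p):
    """P(k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, rate):
    """P(k events) when events occur independently at the given mean rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def gumbel_cdf(x, mu=0.0, beta=1.0):
    """P(maximum <= x) for a Gumbel (extreme value) distribution."""
    return math.exp(-math.exp(-(x - mu) / beta))


print(binomial_pmf(3, 10, 0.5))
print(poisson_pmf(2, 2.0))
print(gumbel_cdf(0.0), 1 - gumbel_cdf(5.0))
```

The last value is the upper tail of the extreme value distribution, which is the quantity of interest when asking how often a chance alignment reaches a given score.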