
52930 Protein informatics

52930 Protein informatics. Liisa Holm. Organization. Lectures: Wednesdays, 6 September to 14 October. Exam: Friday 16 October (essay type question, numerical problems). Textbook: DW Mount: Bioinformatics. Sequence and genome analysis. 2nd edition. Chapters 3-7, 10-11. Web site: http://ekhidna.biocenter.helsinki.fi/teaching/winter2009/proteiinianalyysi


Presentation Transcript


  1. 52930 Protein informatics Liisa Holm

  2. Organization • Lectures • Wednesdays 6 September to 14 October • Exam Friday 16 October • Essay type question • Numerical problems • Textbook • DW Mount: Bioinformatics. Sequence and genome analysis. 2nd edition. Chapters 3-7,10-11 • Web site • http://ekhidna.biocenter.helsinki.fi/teaching/winter2009/proteiinianalyysi

  3. Aims & scope • Expose biology students to background of methods • Related practical course • Practical course in protein informatics (Proteiinianalyysinharjoitustyöt) • Hands-on practice in using web servers that implement methods • Neither course required for the other

  4. Topics • Pairwise alignment • Probability and statistical analysis of sequence alignments • Multiple sequence alignment • Database searching • Phylogenetic prediction • Protein classification and structure prediction • Genome annotation

  5. Pairwise alignment

  6. Why align sequences? • Common ancestor • Infer common evolutionary origin from similarity • Then can infer function and structure • Similarity can be due to • Gene duplication + speciation • Horizontal gene transfer • Gene fusion • Convergence (similarity without homology) [Figure: Sequence A and Sequence B diverge from an ancestral sequence by x and y steps, respectively]

  7. Similar sequences are likely homologous • Dissimilar sequences are less likely to be homologous

  8. 4-letter word example • This is not the usual substitution model • WORD (d=0, p=1/N^4) • WORE (d=1, p=4/N^3) • GORE (d=2, p=6/N^2) • GONE (d=3, p=4/N) • GENE (d=4, p=1)

  9. Optimal alignment • Assuming independence between scores for each position, the optimal alignment can be determined using dynamic programming • Setup: scoring matrix, gap penalties

  10. Dynamic programming • Maximal path sum BEGIN → END? • Enumerate every path (brute force) • Use induction: only one optimal path up to any node in the graph [Figure: weighted graph with nodes BEGIN, A, B, C, D and END]

  11. Example: all paths leading to B [Figure: weighted graph with nodes BEGIN, A, B, C, D and END]
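
The induction idea on slides 10 and 11 can be sketched in Python: the best score from any node is computed once and reused, instead of enumerating every path. The graph and its edge weights below are hypothetical and are not the ones in the slide figure.

```python
from functools import lru_cache

# Hypothetical weighted directed acyclic graph: node -> {successor: edge weight}.
# These weights are an assumption for illustration only.
GRAPH = {
    "BEGIN": {"A": 3, "C": 1},
    "A": {"B": 0, "D": 4},
    "C": {"B": 3, "D": 1},
    "B": {"END": 3},
    "D": {"END": 2},
    "END": {},
}


def best_path_sum(graph, start, end):
    """Return the maximal path sum from start to end, solving each node once."""

    @lru_cache(maxsize=None)
    def best_from(node):
        if node == end:
            return 0
        # The best continuation from a node does not depend on how we got there.
        options = [weight + best_from(nxt) for nxt, weight in graph[node].items()]
        return max(options) if options else float("-inf")

    return best_from(start)


print(best_path_sum(GRAPH, "BEGIN", "END"))
```

Memoising `best_from` is what turns brute-force enumeration into dynamic programming: the work grows with the number of edges rather than the number of paths.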

  12. Global alignment • Needleman-Wunsch algorithm • Maximal trace from beginning to end • Global alignment score may be negative
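
A minimal sketch of the Needleman-Wunsch recurrence, assuming a toy scoring scheme (match +1, mismatch -1, linear gap penalty -2) rather than a real substitution matrix. It returns only the score, not the traceback.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of strings a and b (toy scoring, linear gaps)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Leading gaps are penalised in a global alignment.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,    # gap in b
                score[i][j - 1] + gap,    # gap in a
            )
    # The global score is read from the last cell and may be negative.
    return score[-1][-1]


print(needleman_wunsch_score("GONE", "GENE"))
print(needleman_wunsch_score("WORD", "GENE"))
```

The second call illustrates the last bullet of the slide: two unrelated words get a negative global score.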

  13. Local alignment • Aligned region truncated to segment giving the largest positive contribution
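
Local alignment can be sketched with the Smith-Waterman recurrence, which differs from the global case by clamping every cell at zero and taking the best cell anywhere in the matrix. The scoring values (match +2, mismatch -1, linear gap -2) are assumptions for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (toy scoring, linear gaps)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 option lets an alignment restart, truncating negative flanks.
            cell = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(smith_waterman_score("XXGENEYY", "AAGENEBB"))
```

Only the shared segment contributes to the score; the mismatching flanks are dropped because including them would lower the total.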

  14. Scoring alignments • Substitution matrices • Gap penalties • Significance • Aligning two sequences, would you expect the same level of similarity by chance alone?

  15. Conversion between odds score, log odds and bit scores • Odds score = ratio of likelihoods of two events or outcomes. E.g. observed frequency of aligned A and B in related sequences divided by the frequency with which A and B align by chance • f(A and B) / [ f(A) * f(B)] • Odds scores are often converted to logarithms to create log odds scores. • Log odds scores are additive. • Bit score = log odds score converted to a logarithm to the base 2
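
The conversions on this slide can be sketched as follows. The frequencies are hypothetical numbers chosen so that the odds come out round.

```python
import math


def odds_score(f_ab, f_a, f_b):
    """Observed frequency of the aligned pair divided by its chance frequency."""
    return f_ab / (f_a * f_b)


def bit_score(f_ab, f_a, f_b):
    """Log odds score expressed in bits (logarithm to the base 2)."""
    return math.log2(odds_score(f_ab, f_a, f_b))


# Hypothetical frequencies (assumptions, not taken from any real matrix).
favoured = bit_score(0.02, 0.1, 0.05)       # odds of about 4, so about 2 bits
disfavoured = bit_score(0.0025, 0.1, 0.05)  # odds of about 0.5, so about -1 bit

# Log odds are additive: summing them corresponds to multiplying the odds.
total_bits = favoured + disfavoured
combined_odds = odds_score(0.02, 0.1, 0.05) * odds_score(0.0025, 0.1, 0.05)
print(total_bits, math.log2(combined_odds))
```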

  16. Bit-scores • The score needed to distinguish an MSP from chance is approximately the number of bits needed to specify where the MSP starts in each of the two sequences being compared • MSP = maximally scoring pair • Ungapped alignment case • log2(N) bits are needed to distinguish among N possibilities • Two proteins of 250 residues: 16 bits • Database of 4M residues: 30 bits [160 M: 34 bits]
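
The first two figures on this slide follow from counting the possible start positions, one in each sequence. A minimal sketch, assuming the second figure refers to a 250-residue query against a 4M-residue database:

```python
import math


def bits_to_locate(m, n):
    """Bits needed to specify a start position in each of two sequences.

    There are m * n possible pairs of start positions, hence log2(m * n) bits.
    """
    return math.log2(m * n)


print(round(bits_to_locate(250, 250)))        # two proteins of 250 residues
print(round(bits_to_locate(250, 4_000_000)))  # assumed 250-residue query vs 4M residues
```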

  17. Dayhoff model • Markov chain: mutations independent of previous mutations • Data: 71 groups of closely related sequences (>85 % similarity), yielding 1572 substitution events • Mutability of amino acid types (per 100 accepted point mutations)

  18. PAM1 and PAM250 for Phe -> X These are mutation probabilities!

  19. Log odds form of PAM250 • Unit is 10 * logarithm to the base 10 of ratio • S(A,B) = ½ * (10 * log10(p(A->B)/f(B)) + 10 * log10(p(B->A)/f(A))) • Range -8 … +17 • Local alignment scores are maximal when the PAM distance corresponds to the similarity of the target sequences

  20. BLOSUM matrices • The BLOSUM matrix assigns a probability score for each residue pair in an alignment based on: • the frequency with which that pairing is known to occur within conserved blocks of related proteins. • BLOSUM matrices are constructed from observations which lead to observed probabilities

  21. BLOSUM substitution matrices • BLOSUM matrices are used in ‘log-odds’ form based on actually observed substitutions. • This is because: • Ease of use: ‘Scores’ can be just added (the raw probabilities would have to be multiplied) • Ease of interpretation: • S=0 : substitution is just as likely to occur as random • S<0 : substitution is more likely to occur randomly than observed • S>0 : substitution is less likely to occur randomly than observed • Unit is half-bits (odds ratio to logarithm base 2, multiplied by 2)

  22. Information content • Using a standard measure for overall amino acid frequencies gives the information content of a random protein sequence as 4.19 bits/residue. • Thus, for an average size protein domain (150 residues), the message length is ~630 bits and the probability that 2 random sequences would specify the same message is 2^-630 (10^-190). → Database searching for protein similarities is doable, even for fairly short sequences • BUT, for a transcription factor binding site of 8-10 bp, the odds of 2 random sequences arriving at the same message are about 10^-5. → Database searching for regulatory elements does not work well as databases get larger
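
The arithmetic on this slide can be checked with a short sketch. The value of 2 bits per base for DNA is an assumption (four equally frequent bases); the 4.19 bits/residue figure is taken from the slide.

```python
import math

PROTEIN_BITS_PER_RESIDUE = 4.19  # from the slide
DNA_BITS_PER_BASE = 2.0          # assumption: four equally likely bases


def message_bits(length, bits_per_symbol):
    """Total information content of a sequence of the given length."""
    return length * bits_per_symbol


def log10_match_probability(bits):
    """log10 of 2**-bits, the chance that two random sequences agree."""
    return -bits * math.log10(2)


domain_bits = message_bits(150, PROTEIN_BITS_PER_RESIDUE)  # roughly 630 bits
site_bits = message_bits(8, DNA_BITS_PER_BASE)             # 16 bits for an 8 bp site

print(domain_bits, log10_match_probability(domain_bits))
print(site_bits, log10_match_probability(site_bits))
```

The protein domain lands near 10^-190 while the 8 bp site lands near 10^-5, which is why short regulatory elements drown in chance matches as databases grow.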

  23. Relative entropy H of target and background distributions • Scale score matrix s to bits • $H = \sum_{i,j} q_{ij} s_{ij} = \sum_{i,j} q_{ij} \log_2 \frac{q_{ij}}{p_i p_j}$ • q = target frequencies of amino acids • p = background frequencies • H measures the average information available per position to distinguish the alignment from chance
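
A minimal sketch of the relative entropy calculation, using a hypothetical two-letter alphabet so the numbers can be checked by hand. Real target frequencies would come from a set of trusted alignments.

```python
import math


def relative_entropy_bits(q, p):
    """H = sum over pairs of q_ij * log2(q_ij / (p_i * p_j)), in bits."""
    h = 0.0
    for (i, j), q_ij in q.items():
        if q_ij > 0:  # a zero target frequency contributes nothing
            h += q_ij * math.log2(q_ij / (p[i] * p[j]))
    return h


# Hypothetical background and target frequencies (assumptions for illustration).
p = {"A": 0.5, "B": 0.5}
q_related = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}
q_chance = {pair: p[pair[0]] * p[pair[1]] for pair in q_related}

print(relative_entropy_bits(q_related, p))  # positive: identities are enriched
print(relative_entropy_bits(q_chance, p))   # zero: target equals background
```

When the target distribution equals the product of the background frequencies, every log term vanishes and no position carries information about relatedness.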

  24. $\text{Score} = \sum_{i,j} f_{ij} s_{ij} \sim \sum_{i,j} f_{ij} \ln \frac{q_{ij}}{p_i p_j}$ • Optimal scoring matrix: target distribution q = frequencies in alignment f

  25. Affine gap penalties • Gap opening penalty (g) • Gap extension penalty (r) • W(x) = g + rx • x is the length of the gap • Gap penalties that work well: • BLOSUM62: (-11, -1)
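
The affine penalty W(x) = g + rx can be written out directly. The defaults below are the BLOSUM62 values from the slide, read as g = -11 and r = -1.

```python
def affine_gap_penalty(x, g=-11, r=-1):
    """Score contribution W(x) = g + r * x of a single gap of length x."""
    if x < 1:
        raise ValueError("gap length must be at least 1")
    return g + r * x


one_long_gap = affine_gap_penalty(5)
five_short_gaps = 5 * affine_gap_penalty(1)
print(one_long_gap, five_short_gaps)
```

Because opening is much more expensive than extending, one gap of length 5 costs far less than five separate gaps of length 1, which favours a few long indels over many scattered ones.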

  26. Statistical Significance • A good way to determine if an alignment score has statistical meaning is to compare it with the score generated from the alignment of two random sequences • A model of ‘random’ sequences is needed. The simplest model chooses the amino acid residues in a sequence independently, with background probabilities (Karlin & Altschul (1990) Proc. Natl. Acad. Sci. USA 87, 2264-2268)

  27. Alignment score • Optimal alignment scores follow an extreme value distribution • Exact theory for ungapped local alignments, assuming: • There is at least one positive score sij • Average score is negative • Results hold empirically for gapped alignments
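
A minimal sketch of the Karlin-Altschul statistics for ungapped local alignments: the expected number of chance hits with score at least S is E = K·m·n·e^(-λS), and the probability of at least one such hit is P = 1 - e^(-E). The values of K and λ used below are illustrative assumptions; in practice they depend on the scoring matrix and the background frequencies.

```python
import math


def expected_hits(score, m, n, lam, k):
    """E-value: expected number of chance local alignments scoring >= score."""
    return k * m * n * math.exp(-lam * score)


def p_value(score, m, n, lam, k):
    """Probability of at least one chance alignment scoring >= score."""
    # expm1 keeps precision when the E-value is very small.
    return -math.expm1(-expected_hits(score, m, n, lam, k))


# Assumed parameters for illustration only.
LAM, K = 0.318, 0.134
QUERY_LEN, DB_LEN = 250, 4_000_000

for s in (40, 70, 100):
    print(s, expected_hits(s, QUERY_LEN, DB_LEN, LAM, K), p_value(s, QUERY_LEN, DB_LEN, LAM, K))
```

For small E the P-value and the E-value are nearly identical, which is why the two are often used interchangeably for strong hits.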

  28. Probability and statistics

  29. The need for statistics • Statistics is very important for bioinformatics. • It is very easy to have a computer analyze the data and give you back a result. • The problem is to decide whether the answer the computer gives you is any good at all. • Questions: • How statistically significant is the answer? • What is the probability that this answer could have been obtained by chance? What does this depend on?


  31. Basics [Figure: a sample of size n drawn from a population of size N, labelled "Descriptive statistics" and "Probability"]

  32. Substitution matrices • Score of amino acid a with amino acid b: $S_{ab} = \frac{1}{\lambda} \ln \frac{P_{ab}}{f_a f_b}$ • P_ab is the observed frequency that residues a and b are correlated because of homology • Lambda (λ) is a scaling factor equal to 0.347, set so that the scores can be rounded off to sensible integers • f_a f_b is the expected frequency of seeing residues a and b paired together, which is just the product of the frequency of residue a multiplied by the frequency of residue b • Source: Eddy S., "Where did the BLOSUM62 alignment score matrix come from?", Nat. Biotech. 22, Aug 2004



  35. i) S=0: O/E ratio = 1 ii) Compare S=5 and S=10: O/E ratios are 5.7 and 32.1, because the ratio is an exponential function of the score iii) S=-10: O/E ratio = 0.031 ≈ 1/32 iv) Ratio of scores S1, S2 in terms of probabilities of observed/random = $e^{\lambda (S_1 - S_2)}$
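
The numbers on this slide can be reproduced by inverting the score definition, assuming natural-log scoring with λ = 0.347, so that the observed/expected ratio is e^(λS). The function name is hypothetical.

```python
import math

LAMBDA = 0.347  # scaling factor quoted on the slides


def observed_over_expected(score, lam=LAMBDA):
    """Observed/expected (O/E) ratio implied by an integer matrix score."""
    return math.exp(lam * score)


for s in (0, 5, 10, -10):
    print(s, round(observed_over_expected(s), 3))

# The ratio between two O/E values depends only on the score difference.
print(observed_over_expected(10) / observed_over_expected(5))
```

Doubling the score from 5 to 10 squares the O/E ratio rather than doubling it, which is the point of comparing the two cases.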


  38. Example: BLAST • Motivations • Exact algorithms are exhaustive but computationally expensive. • Exact algorithms are impractical for comparing a query sequence to millions of other sequences in a database (database scanning), • and so, database scanning requires a heuristic alignment algorithm (at the cost of optimality).

  39. Interpret BLAST results: Description • ID (GI #, RefSeq #, DB-specific ID #): click to access the record in GenBank • Gene/sequence definition • Bit score: higher is better; click to access the pairwise alignment • Expect value: lower is better; it is the number of hits with at least this score expected by chance • Links

  40. Problems with BLAST • Why do results change? • How can you compare results from different BLAST tools which may report different types of values? • How are results (e.g. E-value) affected by the query? • There are _many_ values reported in the output. What do they mean?

  41. Example: Importance of BLAST statistics • But, first a review.

  42. Review • What is a distribution? • A plot showing the frequency of a given variable or observation.


  44. Features of a Normal Distribution • Symmetric distribution • Has an average or mean value (μ) at the centre • Has a characteristic width called the standard deviation (S.D. = σ) • Most common type of distribution known

  45. Standard Deviations (Z-score)

  46. Mean, Median & Mode [Figure: distribution with the mode, median and mean marked]

  47. Mean, Median, Mode • In a Normal Distribution the mean, mode and median are all equal • In skewed distributions they are unequal • Mean - average value, affected by extreme values in the distribution • Median - the “middlemost” value, usually half way between the mode and the mean • Mode - most common value
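
The difference between symmetric and skewed distributions can be seen with Python's standard `statistics` module. Both data sets are made-up examples.

```python
import statistics

# Hypothetical samples: one symmetric, one with a long right tail.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
skewed = [1, 1, 1, 2, 2, 3, 4, 6, 16]

for name, data in (("symmetric", symmetric), ("skewed", skewed)):
    print(
        name,
        "mean =", statistics.mean(data),
        "median =", statistics.median(data),
        "mode =", statistics.mode(data),
    )
```

In the skewed sample the single extreme value pulls the mean upwards, while the median moves much less and the mode not at all.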

  48. Different Distributions • Unimodal • Bimodal

  49. Other Distributions • Binomial Distribution • the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p. • Poisson Distribution • expresses the probability of a number of events occurring in a fixed period of time if these events occur with a known average rate and independently of the time since the last event. • Extreme Value Distribution • Gumbel distribution • used to model the distribution of the maximum (or the minimum) of a number of samples of various distributions.
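
The three distributions named on this slide have simple closed forms, sketched below with the standard library only. The function names and the parameter values in the printed examples are arbitrary choices for illustration.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, rate):
    """Probability of exactly k events when the average number is rate."""
    return math.exp(-rate) * rate**k / math.factorial(k)


def gumbel_cdf(x, mu=0.0, beta=1.0):
    """P(X <= x) for the Gumbel (extreme value) distribution of maxima."""
    return math.exp(-math.exp(-(x - mu) / beta))


print(binomial_pmf(3, 10, 0.5))
print(poisson_pmf(0, 2.0))
print(1 - gumbel_cdf(3.0))  # upper tail, the quantity of interest for alignment scores
```

The Gumbel upper tail decays exponentially rather than like a Gaussian, so treating alignment scores as normally distributed would overstate the significance of high scores.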
