Position-Specific Substitution Matrices (PSSM)
PSSM • A regular substitution matrix uses the same scores for any given pair of amino acids regardless of where in the protein they are. • This is obviously only an approximation: within a family of related proteins, some residues are very important for function and hardly change at all, while others can vary quite a bit. • Position-specific substitution matrices are an approach to this problem: developing a different substitution matrix for each position in a set of aligned proteins. • Requires a set of aligned, related proteins. • Gaps can be a problem: do you have a separate gap opening and extension penalty for each position, or do you use the same values for all positions? • Most PSSMs use a single set of gap penalties for all positions. • Hidden Markov Models address this question specifically. • PSI-BLAST is the most widely used general-purpose application of PSSMs.
Some Aligned Sequences
gi|154350476|gb|ABS72555.1|   AVPLMQPEAPIVGTGMEYVSGKDSGAAVICKHPGIVERVEAKNVWVRRYE
gi|225184649|emb|CAB11883.2|  AVPLMQPEAPFVGTGMEYVSGKDSGAAVICKHPGIVERVEAKNVWVRRYE
gi|157679649|gb|ABV60793.1|   AVPLMQPESPIVGTGMEYVSGKDSGAAVICRYPGVVERVEAKNIWVRRYE
gi|52346465|gb|AAU39099.1|    AVPLMQPESPIVGTGMEYVSAKDSGAAVICRHPGIVERVEAKNIWVRRYE
BMQ_0128                      AVPLLNPEAPIVGTGMEYVSGKDSGAAVICKYPGVVERVEAKQIIVRRYE
gi|42735098|gb|AAS39038.1|    AVPLMNPESPIVGTGMEYVSAKDSGAAVICKHPGIVERVEAREVWVRRYV
gi|10172738|dbj|BAB03845.1|   AVPLLVPEAPIVGTGMEHVSAKDSGAAIVSKHRGIVERVTAKEIWVRRLE
gi|56908158|dbj|BAD62685.1|   AVPLLVPEAPLVGTGMEHVSAKDSGAAVVSKYAGIVERVTAKEIWVRRIE
                              ****: **:*:******:**.******::.:: *:**** *::: ***
Making a PSSM • There are several variations on the theme, but the best way is analogous to how substitution matrices are made, using the log-odds method. • Start with a set of aligned sequences. For each position, count how many times each amino acid occurs. The frequency of amino acid $a$ in column $u$ is $q_{u,a}$. • Note that we aren't counting substitutions here, since in a multiple alignment we don't know how the different sequences are related. • We also need to know the frequency of amino acid $a$ among sequences in general, $p_a$. • The odds ratio is the frequency of amino acid $a$ observed in the column (the result of real-world evolution) divided by the frequency expected if amino acids occurred at random: $q_{u,a} / p_a$. • Finally, take the logarithm so scores can be added. • $m_{u,a}$ is the score used for amino acid $a$ in column $u$: $m_{u,a} = \log(q_{u,a} / p_a)$. This needs to be done for all amino acids in all columns. • It is possible to weight the scores to compensate for bias in the original sequence selection.
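As a minimal sketch of the log-odds recipe on this slide, the Python below builds $m_{u,a} = \log_2(q_{u,a}/p_a)$ from the first five columns of the aligned sequences shown earlier. The base-2 logarithm, the uniform background $p_a = 1/20$, and the function names are assumptions made for illustration; a real PSSM would use database-derived background frequencies.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_frequencies(alignment):
    """Return one dict per column u mapping amino acid a to q[u][a] = count / N."""
    n_seqs = len(alignment)
    freqs = []
    for column in zip(*alignment):
        counts = Counter(column)
        freqs.append({a: counts[a] / n_seqs for a in AMINO_ACIDS})
    return freqs

def log_odds_pssm(alignment, background):
    """m[u][a] = log2(q[u][a] / p[a]); -inf for residues never seen in a column."""
    pssm = []
    for q in column_frequencies(alignment):
        row = {}
        for a in AMINO_ACIDS:
            if q[a] > 0:
                row[a] = math.log2(q[a] / background[a])
            else:
                row[a] = float("-inf")  # the missing-data case: log of 0
        pssm.append(row)
    return pssm

def score_sequence(seq, pssm):
    """Add up the per-column scores for an ungapped sequence."""
    return sum(pssm[u][a] for u, a in enumerate(seq))

# First five columns of the eight aligned sequences (no gaps).
alignment = ["AVPLM", "AVPLM", "AVPLM", "AVPLM",
             "AVPLL", "AVPLM", "AVPLL", "AVPLL"]
# Assumption: uniform background, used only to keep the example small.
background = {a: 1 / 20 for a in AMINO_ACIDS}

pssm = log_odds_pssm(alignment, background)
best = score_sequence("AVPLM", pssm)
```

Column 5 contains M five times and L three times, so M scores higher than L there, and any residue that never appears in a column scores negative infinity.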
The Missing Data Problem • You are trying to determine the frequency of all 20 amino acids at each position in the sequence. Inevitably, some amino acids never occur at certain positions in the alignment. • If one of them does occur at that position in a new sequence, its score $m_{u,a} = \log(q_{u,a} / p_a)$ would be negative infinity, the logarithm of 0. This is not a useful score. • The simplest solution is to start counting at 1 instead of 0. The added counts are referred to as pseudocounts. • Normally the frequency of amino acid $a$ in column $u$ is $q_{u,a} = n_{u,a} / N$, where $n_{u,a}$ is the number of times $a$ occurs in column $u$ and $N$ is the number of sequences being examined. • With pseudocounts, $q_{u,a} = (n_{u,a} + 1) / (N + 20)$. The $N + 20$ term arises because one pseudocount is added for each of the 20 amino acids. • Slightly more sophisticated is using the proportion of each amino acid in the database, $p_a$, as its pseudocount. The sum of all 20 $p_a$ is 1, so $q_{u,a} = (n_{u,a} + p_a) / (N + 1)$. • Related to this is using data from a substitution matrix as the source of the proportions. • By adding constants, you can vary the proportions of pseudocounts and real counts, depending on how much real data you have. • More sophisticated methods also exist.
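The two pseudocount formulas on this slide can be sketched in a few lines of Python. The `weight` parameter is a hypothetical name for the constant that shifts the balance between pseudocounts and real counts; with `weight=1.0` it reduces to $(n_{u,a} + p_a)/(N + 1)$. The uniform background is an assumption used only to keep the example short.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def freq_add_one(column):
    """q = (n + 1) / (N + 20): one pseudocount per amino acid."""
    counts = Counter(column)
    n_seqs = len(column)
    return {a: (counts[a] + 1) / (n_seqs + 20) for a in AMINO_ACIDS}

def freq_background(column, background, weight=1.0):
    """q = (n + weight * p) / (N + weight): pseudocounts proportional to p."""
    counts = Counter(column)
    n_seqs = len(column)
    return {a: (counts[a] + weight * background[a]) / (n_seqs + weight)
            for a in AMINO_ACIDS}

# Column 5 of the aligned sequences shown earlier: five M and three L.
column = "MMMMLMLL"
# Assumption: uniform background frequencies, for illustration only.
background = {a: 1 / 20 for a in AMINO_ACIDS}

q_simple = freq_add_one(column)
q_weighted = freq_background(column, background)
```

Both versions still sum to 1 over the 20 amino acids, and every frequency is now greater than 0, so every log-odds score is finite.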
Information and Entropy • The modern theory of information was developed by Claude Shannon in 1948. • It is the basis for most modern communication. A common application is ZIP files, which compress information. • Entropy is a measure of the uncertainty of the results of an event. Entropy = number of bits (binary, yes/no decisions) needed to store or communicate the results. • The result of a coin flip, with 2 equally likely outcomes, needs 1 bit to describe. • Rolling a die, with 6 equally likely outcomes, needs somewhat more than 2 bits to describe. • Related to the concept of entropy in thermodynamics. • The entropy of an event ($H$) is $-1$ times the sum, over all possible outcomes $x$, of the probability of each outcome ($p_x$) times the base 2 logarithm of that probability: $H = -\sum_x p_x \log_2 p_x$ • Units are bits. • $p \log_2 p = 0$ when $p = 0$, by convention. • Thus for a coin flip, $p_H = p_T = 1/2$. The base 2 log of 1/2 is $-1$, so $H = -\left(\tfrac{1}{2} \cdot (-1) + \tfrac{1}{2} \cdot (-1)\right) = 1$, or 1 bit of information. • For a 6-sided die, each possible outcome has a 1/6 probability. $\log_2(1/6) = -2.58$, so rolling a die has $H = -6 \cdot \tfrac{1}{6} \cdot (-2.58) = 2.58$ bits of information.
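The entropy formula is short enough to check directly. This is a small sketch; the function name `entropy` is chosen here for illustration.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; terms with p = 0 contribute 0 by convention."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

coin = entropy([0.5, 0.5])   # 1 bit
die = entropy([1 / 6] * 6)   # log2(6), about 2.58 bits
```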
Information Content • Outcomes with different probabilities affect the entropy. Entropy is maximal when all outcomes are equally likely. Entropy is 0 when there is only 1 possible outcome. • Imagine a loaded die, where the probability of a 6 is 1/2 and the probability of any other number is 1/10. • $H = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + 5 \cdot \tfrac{1}{10}\log_2\tfrac{1}{10}\right) = -\left(-0.5 + 5 \cdot \tfrac{1}{10} \cdot (-3.322)\right) = 0.5 + 1.661 = 2.161$ bits • Compare this with a fair die, which has an entropy of 2.58 bits. The fair die's outcome is more uncertain than the loaded die's. • The information of an event is the loss of uncertainty concerning an outcome. It is the difference between the maximum possible entropy (with all outcomes equally likely) and the actual entropy calculated with the outcomes' real probabilities. For the loaded die this is $2.58 - 2.16 = 0.42$ bits.
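The loaded-die arithmetic can be checked with a short sketch that computes information as maximum entropy minus actual entropy. The function names are illustrative, and the maximum entropy is taken as $\log_2$ of the number of possible outcomes.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; terms with p = 0 contribute 0 by convention."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def information(probabilities):
    """Maximum entropy (all outcomes equally likely) minus the actual entropy."""
    h_max = math.log2(len(probabilities))
    return h_max - entropy(probabilities)

fair_die = [1 / 6] * 6
loaded_die = [1 / 10] * 5 + [1 / 2]   # faces 1-5 at 1/10 each, a 6 at 1/2

h_loaded = entropy(loaded_die)        # about 2.16 bits
i_loaded = information(loaded_die)    # about 0.42 bits
i_fair = information(fair_die)        # 0 bits: nothing is known in advance
```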
Sequence Logos • A sequence logo is a visual representation of a PSSM, showing the relative importance of different positions and which residues contribute the most. • Based on Shannon information theory. • Consider a single position in a set of aligned protein sequences. • If all 20 amino acids are equally likely, the entropy of that position is $H_{max} = -20 \cdot \tfrac{1}{20}\log_2\tfrac{1}{20} = -\log_2\tfrac{1}{20} = \log_2 20 = 4.32$ bits. • The information $I$ of position $u$ is $I = H_{max} - H_u$. • $I = 0$ when all amino acids are equally likely. • If there is only 1 amino acid ever found at a position (completely conserved), there is no uncertainty about it, so its entropy is 0 and the information content is 4.32 bits. • A more complicated example: say that this position has a 1/3 chance of being R and a 2/3 chance of being K. • $H_u = -\left(\tfrac{1}{3}\log_2\tfrac{1}{3} + \tfrac{2}{3}\log_2\tfrac{2}{3}\right) = -(-0.528 - 0.390) = 0.918$ bits. • $I = 4.32 - 0.918 = 3.402$ bits. • For a sequence logo, the relative frequency of each amino acid is multiplied by the position's information content, which is then converted into a height.
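A rough sketch of the height calculation for one logo column, using the R/K example from this slide. The function name is illustrative, and heights are left in bits. Logo-drawing tools often add a small-sample correction, which is deliberately omitted here.

```python
import math
from collections import Counter

H_MAX = math.log2(20)  # maximum entropy for 20 amino acids, about 4.32 bits

def logo_heights(column):
    """Return {amino acid: height in bits} for one alignment column."""
    counts = Counter(column)
    total = len(column)
    freqs = {a: c / total for a, c in counts.items()}
    h_u = -sum(f * math.log2(f) for f in freqs.values())
    info = H_MAX - h_u
    # Each letter's height is its relative frequency times the column information.
    return {a: f * info for a, f in freqs.items()}

heights = logo_heights("RKK")    # 1/3 R and 2/3 K
conserved = logo_heights("GGG")  # completely conserved column
```

K ends up twice as tall as R, and the stacked heights add up to the column's information content. The fully conserved column gives a single letter of the maximum height, $\log_2 20$.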
PSI-BLAST • Part of the BLAST programs available at NCBI. • Used for finding new family members that are too distant to be hit by the original query. • An iterative process: • First the database (usually nr) is searched with an initial query sequence, and all hits with e-values better than some cutoff (default = 0.005) are taken. • These aligned sequences are used to construct a PSSM. • The PSSM is then used to search the database again. • If new sequences better than the e-value cutoff are found, the PSSM is updated to include them, and the search is run again. • Eventually, no new sequences are found and the PSI-BLAST search is complete. • Considerably slower than regular BLAST. • You have to start each iteration manually, at the top of the Descriptions area. • After 3 iterations with ORF00135 we get no more new hits. • With “conserved hypothetical protein” BMQ_0196 (next slide) we get new hits for at least 4 iterations, and also extensions on the length of match of many hits. • Most are hypothetical genes, but some mention possible functions. • Unfortunately, you can’t download the PSSM, but you can save it and re-use it if you like.
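The iterative loop (search, rebuild the profile from the hits, repeat until nothing new turns up) can be illustrated with a toy in Python. This is not how PSI-BLAST scores anything: the "profile" here is just the set of residues seen in each column, the cutoff is a fraction of matching columns rather than an e-value, and the sequences and names are all hypothetical.

```python
def build_profile(query, members):
    """Toy profile: the set of residues seen in each column of query + members."""
    return [set(column) for column in zip(query, *members)]

def search(profile, database, cutoff):
    """Return ids of sequences whose fraction of matching columns is >= cutoff."""
    hits = set()
    for seq_id, seq in database.items():
        matches = sum(residue in allowed
                      for residue, allowed in zip(seq, profile))
        if matches / len(profile) >= cutoff:
            hits.add(seq_id)
    return hits

def iterative_search(query, database, cutoff=0.75, max_rounds=10):
    """Repeat search and profile update until a round finds no new hits."""
    included = set()
    profile = build_profile(query, [])
    for round_number in range(1, max_rounds + 1):
        new_hits = search(profile, database, cutoff) - included
        if not new_hits:
            return included, round_number  # converged
        included |= new_hits
        profile = build_profile(query, [database[i] for i in included])
    return included, max_rounds

# Hypothetical, equal-length, ungapped toy sequences.
database = {"s1": "AAAB", "s2": "AABB", "s3": "CCCC"}
members, rounds = iterative_search("AAAA", database)
```

Here `s2` is too far from the query to be found in round 1, but it is picked up in round 2 once `s1` has widened the profile. Round 3 finds nothing new, so the search stops with `s1` and `s2` included.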
Another Sequence for PSI-BLAST
>BMQ_0196 | QMB1551_chromosome:164387-165967 | conserved hypothetical protein
MDKLMNRSWVMKIIALLLAFMLYLSVNLDDGASSSNKILNRSSSANTGVETLTDVPVQVS
YNEKNRIVRGVPDTVIMTLEGPKNILAQTKLQKDYQAYIDLDNLSLGQHRVKVQYRNISD
NLNVVVKPDIVNVTIEERDSKQFSVEASYDKNKVKNGYEAGEATVSPRAVTVTGASSQLD
QVAYVKAIIDLDNASKTVTKQATVVALDKNLNKLNVTVQPETVNVTIPVRNISKKVPIDV
IQEGTPGDGVNITKLEPKTDTVKIIGPSDSLEKIDKIDNIPVDVTGITKSKDIKVNVPVP
DGIDSVSPKQITVHVEVDKQGDEKDAEETDASAAETKSFKNLPVSLTGQSSKYTYELLSP
TSVDADVKGPKSDLDKLTKSGISLSANVGNLSAGEHTVPIIINSPDSVTSTLSTKQAKVR
VTAKKQSGTNDEQTDDKETSGSTSDKETSGSTSDKETKPDTGTGSGTNPGTGNSGDSADK
PSEETDTPEDNTDTPTDSTETGDDSSNQSDENSTPVDGQTDNTSGN