Improved techniques for the identification of pseudogenes L. Coin and R. Durbin
Introduction • Pseudogene? • Sequences • Originally derived from functional genes. • No longer translated into functional protein products. • ~20,000 human pseudogenes. • Two types • Unprocessed pseudogene. • From genome duplication. • Subsequently lost its function. • Rapid degeneration observed in prokaryotes (not in eukaryotes). • Processed pseudogene • From reverse transcription (no introns). • Role • A regulatory role has been observed for human pseudogenes.
Introduction • Mis-annotation problem • Current approaches • Presence of stop codons and frameshift mutations. • But half of human pseudogenes have no detectable frameshifts or internal stop codons. • Ratio of non-synonymous to synonymous substitutions (dN/dS). • Not enough to accurately find pseudogenes.
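The first of the current approaches listed above can be illustrated with a minimal sketch. It assumes the input is a coding sequence already in reading frame 0; the function names `internal_stop_codons` and `length_suggests_frameshift` are hypothetical, and the length check is only a crude stand-in for real frameshift detection, which needs an alignment to a reference.

```python
from typing import List

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def internal_stop_codons(cds: str) -> List[int]:
    """Return 0-based codon indices of in-frame stop codons.

    The final codon is skipped because a stop is expected there.
    """
    seq = cds.upper()
    n_codons = len(seq) // 3
    return [
        i for i in range(n_codons - 1)
        if seq[3 * i:3 * i + 3] in STOP_CODONS
    ]


def length_suggests_frameshift(cds: str) -> bool:
    """Crude heuristic (an assumption of this sketch): a length that is
    not a multiple of three hints at an insertion or deletion."""
    return len(cds) % 3 != 0


if __name__ == "__main__":
    print(internal_stop_codons("ATGTAAGGGTGA"))
    print(length_suggests_frameshift("ATGGGTGA"))
```

A sequence that passes both checks is not thereby shown to be functional, which is the limitation the slide points out.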
Algorithm • New approach • Look at pattern of substitution in conserved protein domains • Algorithm • Input • Alignment A • Unrooted tree T • Profile HMM D • Output • Score for each leaf-node which represents the belief that the node is a pseudogene.
Algorithm • Alignment (ClustalW) • Xn. : row corresponding to leaf-node n. • X.i : i-th column. • A\Xn. : Alignment A excluding Xn. • Tree (Neighbor joining tree) • mj : j-th match column of profile HMM. • pn : parent node of n. • bn : branch from pn to n. • T\bn : Tree T excluding bn.
Algorithm • Assumption • Null model : protein domain evolution on the whole tree • Test if • The final branch to the query node evolved under an alternative model instead • Score for branch b is • Log-odds ratio of • Neutral (non-coding) DNA (Pnuc(b)) versus null Pfam domain model (Pdom(b)). • Protein coding (Pprot(b)) versus null Pfam domain model (Pdom(b)).
Algorithm • PSILC score • Cnuc = {Pnuc(bn), Pdom(T\bn)} : neutral DNA on bn, domain encoding elsewhere. • Cprot = {Pprot(bn), Pdom(T\bn)} : protein coding on bn, domain encoding elsewhere. • Cdom = {Pdom(bn), Pdom(T\bn)} : domain encoding on all of T, including bn.
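The score formula itself does not survive in this transcript. A hedged reconstruction from the definitions on this and the preceding slides (a log-odds ratio of an alternative constraint set against the null domain constraint set) would read as follows; the exact notation, including the superscript k/dom, is an assumption.

```latex
\mathrm{PSILC}^{k/\mathrm{dom}}(n)
  = \log
    \frac{P\left(X_{n\cdot} \mid A \setminus X_{n\cdot},\, T,\, C_{k}\right)}
         {P\left(X_{n\cdot} \mid A \setminus X_{n\cdot},\, T,\, C_{\mathrm{dom}}\right)},
\qquad k \in \{\mathrm{nuc},\ \mathrm{prot}\}
```

A positive value means the row for leaf-node n is better explained by the alternative model on branch bn than by domain-constrained evolution.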
Algorithm • Assumption • Xni in the row Xn. is conditionally independent of the other entries Xni' (i'≠i), given the other rows of the alignment A\Xn., tree T and constraint Ck. • (3) assumes that Xni is conditionally independent of all other columns in the alignment (given X.i\Xni, tree T and constraint Ck). • (4) uses the tree property that a leaf-node is conditionally independent of all other nodes in the tree given its parent.
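The equations these assumptions refer to are not present in this transcript. The chain below is a hedged reconstruction of what the three assumptions imply when applied in order; the symbol x for the state at the parent node, and the correspondence of individual lines to the slide's equation numbers, are assumptions.

```latex
\begin{align*}
P\left(X_{n\cdot} \mid A \setminus X_{n\cdot},\, T,\, C_k\right)
  &= \prod_{i} P\left(X_{ni} \mid A \setminus X_{n\cdot},\, T,\, C_k\right) \\
  &= \prod_{i} P\left(X_{ni} \mid X_{\cdot i} \setminus X_{ni},\, T,\, C_k\right) \\
  &= \prod_{i} \sum_{x}
     P\left(X_{ni} \mid X_{p_n i} = x,\, b_n,\, C_k\right)\,
     P\left(X_{p_n i} = x \mid X_{\cdot i} \setminus X_{ni},\, T \setminus b_n,\, C_k\right)
\end{align*}
```

The first line uses independence of entries within the row, the second drops the other columns, and the third sums over the possible states at the parent node pn.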
Algorithm • From (4) • Calculate the frequency distribution at the parent node given the constraints on all branches excluding the branch to the query node. • For each possible base at the parent node, calculate the transition probability to the child node assuming the appropriate evolutionary constraints on the branch to the child node. • For the first calculation above, • Construct a new tree • Re-root the tree at the parent node pn. • Remove the branch to the node n. • In (6) • The first term is the likelihood of the reduced alignment conditional on each possible base at the root of T\bn. • The second term is the prior probability at the root given the evolutionary constraints.
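The two calculations above can be sketched for a single alignment column. This is a toy illustration under stated assumptions, not the authors' implementation: both models share a nucleotide alphabet, the per-base likelihood of the reduced column at the re-rooted parent is passed in rather than computed on a tree, and the helper `toy_transition` and all numbers are hypothetical.

```python
import math
from typing import Dict

BASES = "ACGT"
Dist = Dict[str, float]
Trans = Dict[str, Dist]


def parent_posterior(likelihood: Dist, prior: Dist) -> Dist:
    """Bayes' rule at the re-rooted parent: P(x | rest) is proportional
    to (likelihood of reduced column given x) * (prior of x)."""
    joint = {x: likelihood[x] * prior[x] for x in BASES}
    total = sum(joint.values())
    return {x: v / total for x, v in joint.items()}


def leaf_probability(observed: str, posterior: Dist, trans: Trans) -> float:
    """Sum over parent states x of P(x | rest) * P(observed | x)."""
    return sum(posterior[x] * trans[x][observed] for x in BASES)


def column_log_odds(observed: str, likelihood: Dist, prior: Dist,
                    trans_alt: Trans, trans_null: Trans) -> float:
    """Log-odds of the alternative branch model against the null one."""
    post = parent_posterior(likelihood, prior)
    return math.log(leaf_probability(observed, post, trans_alt)
                    / leaf_probability(observed, post, trans_null))


def toy_transition(stay: float) -> Trans:
    """Hypothetical branch model: keep the base with probability `stay`,
    otherwise move to one of the other three bases uniformly."""
    off = (1.0 - stay) / 3.0
    return {a: {b: (stay if a == b else off) for b in BASES} for a in BASES}


if __name__ == "__main__":
    likelihood = {"A": 0.90, "C": 0.03, "G": 0.04, "T": 0.03}
    prior = {b: 0.25 for b in BASES}
    neutral, constrained = toy_transition(0.70), toy_transition(0.95)
    for base in "AG":
        print(base, column_log_odds(base, likelihood, prior,
                                    neutral, constrained))
```

Summing such per-column values over the match columns gives a row score in the spirit of the log-odds described earlier; in this toy setup a leaf base that disagrees with the likely parent base favours the looser model.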
Algorithm • Prior distribution at root • Use the equilibrium distribution of the rate matrix. • Observed frequencies in the alignment for nucleotide and amino acid models. • Emission frequency distribution of the match state for the domain model. • Transition probability between different bases given different evolutionary constraints. • Pk(t) : matrix of transition probabilities at time t under the evolutionary constraint Pk. • Q : rate matrix • r : rate parameter • For amino acid models, : database estimates. • For nucleotide models, • : Parameterized model (e.g. HKY model, which introduces the transition/transversion parameter κ). • : uniform distribution. • Free parameter f : trade-off between frequencies in the equilibrium distribution resulting from pressure to mutate from (f=1) and pressure to mutate toward (f=0) a particular base. • Calculate the values of r, f, which maximize the likelihood of alignment A given the tree.
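How a rate matrix Q and rate parameter r give transition probabilities can be sketched for the nucleotide case. The sketch assumes the standard relation P(t) = exp(r t Q) and a standard HKY rate matrix normalized to one expected substitution per unit time; it does not model the free parameter f, whose exact form is not given in this transcript. The function names, the example frequencies, and the choice of a truncated Taylor series with scaling and squaring are all assumptions of this illustration.

```python
from typing import List, Sequence

Matrix = List[List[float]]
BASES = "ACGT"
PURINES = {"A", "G"}


def hky_rate_matrix(pi: Sequence[float], kappa: float) -> Matrix:
    """HKY rates: kappa * pi[j] for transitions (A<->G, C<->T),
    pi[j] for transversions; rows sum to zero."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(BASES):
        for j, b in enumerate(BASES):
            if i != j:
                is_transition = (a in PURINES) == (b in PURINES)
                q[i][j] = (kappa if is_transition else 1.0) * pi[j]
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 when summed
    mean_rate = -sum(pi[i] * q[i][i] for i in range(4))
    return [[v / mean_rate for v in row] for row in q]


def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_probabilities(q: Matrix, r: float, t: float,
                             squarings: int = 10, terms: int = 12) -> Matrix:
    """Approximate exp(r * t * Q): Taylor series on a scaled-down matrix,
    then repeated squaring to undo the scaling."""
    n = len(q)
    factor = r * t / (2 ** squarings)
    m = [[factor * v for v in row] for row in q]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


if __name__ == "__main__":
    pi = [0.1, 0.2, 0.3, 0.4]  # hypothetical base frequencies for A, C, G, T
    p = transition_probabilities(hky_rate_matrix(pi, kappa=2.0), r=1.0, t=0.5)
    for base, row in zip(BASES, p):
        print(base, [round(v, 4) for v in row])
```

For long branches each row approaches the equilibrium frequencies, which is why the equilibrium distribution is a natural prior at the root.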
Algorithm • Directionality of the calculation • The score on an alignment of two transcripts x1, x2 is not symmetric. • If base X1i is more likely than X2i at a particular match state, but the two are equally likely under the protein model, the score for x2 being a pseudogene is higher than the score for x1. • dN/dS does not have this property.
Results • Test Data • 598 coding transcripts. • 97 pseudogenes. • Only applicable when a Pfam domain can be aligned. • 68%/61% of coding transcripts/pseudogenes. • PSILCprot/dom out-performs all other methods. • PSILCnuc/dom was expected to out-perform PSILCprot/dom, but it did not. • Possible reason: the scoring somehow penalizes DNA evolution relative to protein evolution.
Discussion • Applicable to large-scale analysis • Quality check on the gene annotation databases to identify potential pseudogenes. • A scan of various genomes for pseudogenes. • Analysis of functional DNA constraints on pseudogenes. • Future work • Infer loss of constraint along an entire clade of a tree. • Score mutation to predict the potential loss of functionality from a SNP. • Potential problem • Genes under positive selection will be misclassified as pseudogenes.