Gene Finding in Viral Genomes
Stephen McCauley, Jotun Hein
Introduction

Viral genomes are small and exploit overlapping reading frames to code more compactly. In the regions of the genome where two or more genes overlap, the nucleotide composition differs from that of other regions because of the evolutionary constraints that coding in multiple frames imposes. In nature there are three classes of overlapping genes (see Figure 1): unidirectional, where the 3’ end of one gene overlaps with the 5’ start of another gene in a different reading frame; convergent, where the 3’ end of one gene overlaps with the 3’ end of another gene (reading in the opposing direction); and divergent, where the 5’ start of one gene overlaps with the 5’ start of another gene (reading in the opposing direction). This listing roughly follows the relative preponderance of these types of overlapping genes in nature: unidirectional overlaps are more common than convergent, which are more common than divergent (1: Rogozin et al 2002). Our methodology models unidirectional overlaps and is extended to allow for three genes which overlap in a unidirectional manner.

It is more common than not to observe overlapping genes in a viral genome, and even in eukaryotic genomes (where there are not the same evolutionary constraints on genome size) they are not uncommon. There are clear limitations to the amino acid coding capabilities of overlapping regions (a UUU encoding Phe may overlap in one reading frame with a UUA Leu in another, but a UUU Phe may not overlap with a GGG Gly). It seems intuitive that these regions of overlap might be compositionally biased in some manner, and it is possible to examine these overlaps mathematically and propose likely biases. These biases have been observed, and we can use this information when predicting genes in viral genomes.
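The UUU/UUA versus UUU/GGG example can be checked mechanically. The following is a minimal sketch, assuming the standard genetic code and two codons on the same strand; the names `CODON_TABLE` and `can_overlap` are illustrative and do not come from the poster.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code with codons enumerated in U, C, A, G order
# (first base varies slowest); "*" marks a STOP codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def can_overlap(codon_a, codon_b, shift):
    """Return True if codon_b can start `shift` bases (1 or 2) after codon_a.

    Both codons are read in the same direction, so the last (3 - shift)
    bases of codon_a must equal the first (3 - shift) bases of codon_b.
    """
    if shift not in (1, 2):
        raise ValueError("shift must be 1 or 2")
    return codon_a[shift:] == codon_b[:3 - shift]


for a, b in [("UUU", "UUA"), ("UUU", "GGG")]:
    for shift in (1, 2):
        print(a, CODON_TABLE[a], b, CODON_TABLE[b], "shift", shift,
              can_overlap(a, b, shift))
```

With either frame shift, Phe (UUU) can share bases with Leu (UUA) but not with Gly (GGG), which is the restriction on the coding space described above.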
Below we briefly discuss the nucleotide compositional constraints on unidirectional overlapping genes and the methodology that we have employed to predict genes in viral genomes. We employ a Hidden Markov Model (HMM) framework which assigns a genomic annotation at the nucleotide level. A simulation study clearly illustrates the potential improvement that such a methodology may bring. We discuss these simulation results and leave prediction on actual viral genomes for future publications.

Nucleotide Compositional Constraints on Unidirectional Overlapping Genes

Information-theoretic entropy measurements have shown that overlapping genes tend to have a more uniform nucleotide composition than non-overlapping genes. In addition, they tend to have higher-order structure, which takes the form of a greater frequency of amino acid residues with a high level of degeneracy (2: Pavesi et al 1997). These observations can be understood in terms of a mathematical analysis of the potential overlapping codon pairs (3: Kozlov 2000) and from evolutionary observations of viral genomes undergoing simultaneous positive and purifying selection on overlapping reading frames (4: Hughes et al 2001).

Kozlov 2000 examines the set of potential overlapping amino acids. Considering two codons in non-overlapping genes yields a space of 20 × 20 = 400 possible amino acid pairs. This space is reduced to only 80 possible amino acid pairs under some overlapping constraints. 50% of this space incorporates Ser, Leu or Arg as one of the encoded amino acids in the pair. A more detailed examination of the potential 61*61 codon-pair space (of which the amino acid pair space is a summary, excluding STOP codons) indicates that substantially more than 50% of the potential codon pairs encode Ser, Leu or Arg (unpublished).
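The entropy measurements referred to at the start of this section can be illustrated with a short sketch. It assumes the simplest such measure, the Shannon entropy of single-nucleotide frequencies; the exact measure used by Pavesi et al is not reproduced here, and the function name and example sequences are hypothetical.

```python
from collections import Counter
from math import log2


def nucleotide_entropy(seq):
    """Shannon entropy (in bits) of the single-nucleotide composition of seq.

    The value is 0 for a sequence made of one nucleotide and reaches its
    maximum of 2 bits when A, C, G and U are equally frequent.
    """
    if not seq:
        raise ValueError("sequence must not be empty")
    total = len(seq)
    counts = Counter(seq)
    return -sum((c / total) * log2(c / total) for c in counts.values())


uniform = "ACGU" * 25            # perfectly uniform composition
skewed = "A" * 70 + "CGU" * 10   # composition dominated by A
print(nucleotide_entropy(uniform), nucleotide_entropy(skewed))
```

Under this measure a more uniform composition gives a value closer to 2 bits, so the reported observation corresponds to overlapping regions scoring higher than non-overlapping ones.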
These amino acids are six-fold degenerate at the codon level, and although we know that nature often favours one or two of these degenerate codons, we would nevertheless be surprised if the nucleotide composition of unidirectional overlapping genes were not biased towards Ser-, Leu- and Arg-rich coding, since the majority of overlapping codon pairs incorporate these amino acids, whereas in non-overlapping genes only 18/61 codons code for Ser, Leu or Arg.

Hughes et al 2001 described the observed pattern of simultaneous positive selection in the tat gene of SIV and purifying selection in the corresponding overlapping region of the vpr gene. Nonsynonymous substitutions were observed in the region of the tat gene which encodes an epitope (positive selection, indicative of and favouring immune escape). These nonsynonymous substitutions in the tat reading frame were associated with synonymous substitutions in the vpr reading frame. This evolutionary mode is only possible when degenerate amino acids are employed (see Figure 2). Amino acids which are multiply degenerate are involved in the greater proportion of the potential overlapping coding space. Overlapping regions which employ the greatest proportion of these offer increased flexibility for evolutionary adaptation under selection pressure, which perhaps explains their greater documented abundance (2: Pavesi et al 1997).

A Hidden Markov Model for Explicit Modelling of Unidirectional Overlapping Genes at the Nucleotide Level

The composition of nucleotides in non-gene and gene regions differs. Furthermore, the nucleotide composition of genes whose reading frames overlap differs from that of conventional non-overlapping genes. We define an HMM with 8 active states as follows: one non-gene state, 3 single gene states (one for each of 3 unidirectional reading frames), 3 paired overlapping gene states (genes in reading frames 1&2, 2&3 and 1&3) and a triple overlapping state.
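The eight states can be viewed as the eight subsets of the three reading frames. The sketch below enumerates them on that basis. The state numbering is an assumption: it is chosen so that State 1 is the non-gene state, State 3 is a gene in frame 2 only, State 6 is frames 1&3 and State 8 is the triple overlap, matching the state numbers used in the discussion of Figure 3. The names `STATES` and `describe` are illustrative.

```python
from itertools import combinations

FRAMES = (1, 2, 3)

# Each state is the set of reading frames that are coding at a nucleotide:
# size 0 = non-gene, size 1 = single gene, size 2 = paired overlap,
# size 3 = triple overlap. Ordering by subset size gives 1 + 3 + 3 + 1 = 8.
STATES = [frozenset(c) for k in range(4) for c in combinations(FRAMES, k)]


def describe(state):
    """Human-readable label for a state given as a set of coding frames."""
    if not state:
        return "non-gene"
    kind = {1: "single gene", 2: "paired overlap", 3: "triple overlap"}[len(state)]
    return f"{kind}, frame(s) " + "&".join(str(f) for f in sorted(state))


for number, state in enumerate(STATES, start=1):
    print(number, describe(state))
```

Representing a state as a set of active frames makes it easy to ask which genes a nucleotide belongs to, for example `2 in STATES[5]` for State 6.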
Each of these states emits nucleotides according to a defined conditional emission probability distribution, and transitions between states (from one nucleotide to the next) are permitted according to a set of defined conditional transition probability matrices. We shall examine one particular transition matrix to illustrate how the model operates.

Consider Figure 3, which is concerned with the first nucleotide position in reading frame 2 (so nucleotide loci 2, 5, 8, etc). State 1 is the non-gene state. If the previous nucleotide were in this state, then the nucleotide under consideration could be the first position of a codon in reading frame 2. This is described as State 3, and so there is a defined probability of transitioning from State 1 to State 3 (there is also the probability of remaining in State 1). Now consider State 8, which represents the triple overlapping gene state. Were the previous nucleotide in this state, then the HMM could remain in this state (by continuing with a new codon in reading frame 2), or it could leave the gene state in reading frame 2 and transition to the doubly overlapping gene State 6 (which represents a gene in reading frame 1 overlapping with a gene in reading frame 3).

The transition matrix in Figure 3 is populated with stars which denote non-zero probabilities. We use the star notation because we further condition these state transition matrices on whether the previous nucleotide triple could represent a START codon, a STOP codon or neither. There are further nuances which need to be employed to start and end the HMM in a consistent manner, but the above description represents the crux of the model. We employ the Viterbi and Posterior Decoding procedures to infer, respectively, the most likely genomic state annotation and the most likely annotation state for each individual nucleotide.
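The transition structure and the Viterbi step can be sketched together. This is a simplified illustration and not the poster's implementation. It assumes that only the frame whose codon starts at a locus may start or end a gene there, that the path starts in the non-gene state, and that emissions depend only on how many frames are coding. It ignores the START/STOP conditioning and the start and end nuances mentioned above. All probabilities are made up so that the example runs.

```python
from itertools import combinations
from math import log

FRAMES = (1, 2, 3)
STATES = [frozenset(c) for k in range(4) for c in combinations(FRAMES, k)]
NEG_INF = float("-inf")

P_TOGGLE = 0.01  # hypothetical chance that a gene starts or ends at a codon boundary
EMIT = {  # hypothetical P(nucleotide | number of coding frames)
    0: {"A": 0.35, "C": 0.15, "G": 0.15, "U": 0.35},
    1: {"A": 0.30, "C": 0.20, "G": 0.20, "U": 0.30},
    2: {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25},
    3: {"A": 0.24, "C": 0.26, "G": 0.26, "U": 0.24},
}


def log_transition(prev, nxt, locus):
    """Log-probability of moving from prev to nxt at a 1-based locus."""
    frame = (locus - 1) % 3 + 1  # frame whose codon begins at this locus
    if nxt == prev:
        return log(1.0 - P_TOGGLE)
    if nxt == prev ^ {frame}:    # that frame's gene starts or ends here
        return log(P_TOGGLE)
    return NEG_INF               # every other move is forbidden


def viterbi(seq):
    """Most likely state path, as indices into STATES, for an RNA string."""
    if not seq:
        return []
    n = len(STATES)
    score = [NEG_INF] * n
    score[0] = log(EMIT[0][seq[0]])  # assumption: start in the non-gene state
    back = []
    for locus in range(2, len(seq) + 1):
        base = seq[locus - 1]
        new_score, pointers = [], []
        for nxt in STATES:
            cands = [score[i] + log_transition(STATES[i], nxt, locus) for i in range(n)]
            best_i = max(range(n), key=cands.__getitem__)
            new_score.append(cands[best_i] + log(EMIT[len(nxt)][base]))
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    state = max(range(n), key=score.__getitem__)
    path = [state]
    for pointers in reversed(back):  # follow back-pointers to the first locus
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG"))
```

At loci 2, 5, 8, ... only frame 2 may change, so the non-gene state can move to the frame 2 state and the triple state can drop to frames 1&3, as in the Figure 3 examples. Posterior decoding would replace the maximisation with forward and backward sums and then pick the most probable state at each nucleotide; it is not shown here.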
Obviously these annotations are only optimal in so far as (a) the parameters of the HMM describe nature and (b) the HMM is a suitable model (for example, this methodology implicitly assumes that gene length is geometrically distributed). We also do not model introns in this methodology. Since we are annotating viral genomes, where introns are less common than in eukaryotic genomes, we suspect this should not be a major weakness, and the model can be extended if deemed necessary.

Simulation Results and Suggestions for Further Work

We parameterised an HMM as described above using HIV-1 sequence as a guide. From this HMM we simulated many thousands of genomes of length 10,000 nucleotides (approximately the length of the HIV genome). We then annotated the sequences with the Viterbi and Posterior Decoding algorithms and compared these annotations with the known simulated state sequences. Greater than 98% of gene nucleotides were correctly annotated using either the Viterbi or the Posterior Decoding procedure. Of course, these results come from annotating sequences generated according to the HMM itself with known parameters, so they likely represent an upper bound on annotation performance for real genomes, where neither condition necessarily holds.

We also designed a simpler gene finder in which the overlapping gene regions were not explicitly modelled: genes were annotated in separate reading frames and then combined in a final annotation. Using this simplified model, both the Viterbi and Posterior Decoding procedures performed very poorly (the simulated data had many overlaps, typical of viral genomes). This encourages us that our hypothesis, that modelling gene overlaps explicitly is an important consideration in viral gene prediction, is likely correct. We are currently applying this methodology to actual viral genomes, where the realistic performance of the procedure can be ascertained.
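The comparison between a predicted annotation and the known simulated state sequence can be sketched as follows. The poster does not state its exact scoring rule, so this sketch assumes that a gene nucleotide counts as correct only when its predicted state equals its true state. The function name and the toy annotations are hypothetical.

```python
def gene_nucleotide_accuracy(true_states, predicted_states, non_gene=0):
    """Fraction of true gene nucleotides that were assigned the correct state.

    Positions whose true state is `non_gene` are excluded, so the score
    measures how well gene (including overlapping gene) nucleotides are found.
    """
    if len(true_states) != len(predicted_states):
        raise ValueError("annotations must have the same length")
    gene_positions = [i for i, s in enumerate(true_states) if s != non_gene]
    if not gene_positions:
        raise ValueError("no gene nucleotides in the true annotation")
    correct = sum(true_states[i] == predicted_states[i] for i in gene_positions)
    return correct / len(gene_positions)


# Toy annotations: 0 = non-gene, other integers = gene or overlap states.
true_path = [0, 0, 1, 1, 1, 4, 4, 4, 2, 2, 0]
predicted = [0, 0, 1, 1, 1, 1, 4, 4, 2, 2, 2]
print(gene_nucleotide_accuracy(true_path, predicted))  # 7 of 8 gene positions match
```

Averaging this score over many simulated genomes would give a figure comparable in spirit to the percentage quoted above, subject to the scoring assumption.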
There are several extensions to the procedure that we already wish to apply. Modelling introns explicitly may be necessary, and we may also be able to use first-pass annotations to help parameterise the HMM in a genome-specific fashion (some EM procedure may work well with such a starting point). We would like to employ some type of evolutionary model to help annotate a multiple alignment of viral genomes where the phylogeny may be well documented.

1: Purifying and directional selection in overlapping prokaryotic genes. Rogozin IB, Spiridonov AN, Sorokin AV, Wolf YI, Jordan IK, Tatusov RL, Koonin EV. Trends Genet. 2002 May;18(5):228-32.
2: On the informational content of overlapping genes in prokaryotic and eukaryotic viruses. Pavesi A, DeIaco B, Granero MI, Porati A. J Mol Evol. 1997 Jun;44(6):625-31.
3: Analysis of a set of overlapping genes. Kozlov NN. Dokl Biochem. 2000 Jul-Aug;373(1-6):119-22.
4: Simultaneous positive and purifying selection on overlapping reading frames of the tat and vpr genes of simian immunodeficiency virus. Hughes AL, Westover K, da Silva J, O'Connor DH, Watkins DI. J Virol. 2001 Sep;75(17):7966-72.