ORF identification in Allgenes Project

Vladimir Babenko, Brian Brunk, Jonathan Crabtree, Li Li, Christian Stoeckert
Center for Bioinformatics, University of Pennsylvania

AllGenes (www.allgenes.org)

AllGenes is a gene index project created at the University of Pennsylvania, based on EST and mRNA sequences. The project currently contains up to a million assembled nucleic acid sequences (assemblies) for mouse and human. These assemblies were generated from the EST and mRNA sequences in GenBank (as of August 2001) using the CAP4 program (Paracel). Given the need to process hundreds of thousands of assembled transcripts, ORF construction must be efficient.

We are interested in ORFs for the following reasons:
• Applying BLASTP is computationally cheaper than applying BLASTX.
• The translations can be used as candidates for predicting function and properties, especially for unknown proteins.
• Putative errors can be corrected.

It is therefore important to provide a computer program with a reasonable combination of accuracy and speed.

Allgenes Project at CBIL: input and output statistics (panels: input sequences; resulting assemblies)

Problem area and approaches for ORF reconstruction
• Find the most efficient and accurate translation of the source nucleic acid sequence.
• Approaches include dynamic programming, HMMs, and neural networks.
• Features used by ORF reconstruction methods include codon (di-codon) preference and start codon context.
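As a baseline for the approaches listed above, the simplest possible translation can be sketched as a six-frame scan for the longest stop-free stretch. This is a minimal sketch and not the algorithm of framefinder or of any other program compared here; the function names, and the choice to define the baseline as "longest stretch between stop codons", are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq, offset=0):
    """Translate one frame; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(offset, len(seq) - 2, 3)
    )


def longest_orf(seq):
    """Return (strand, frame offset, peptide) of the longest stop-free stretch.

    Hypothetical helper: no start codon is required, ties keep the first hit.
    """
    seq = seq.upper()
    best = ("+", 0, "")
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            for segment in translate(s, offset).split("*"):
                if len(segment) > len(best[2]):
                    best = (strand, offset, segment)
    return best
```

On the toy input "ATGAAATAG" the longest stop-free stretch falls on the reverse strand, although the forward strand carries the ATG; this is one reason methods also use codon preference and start codon context.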

Comparison of Computer Programs

Four programs were tested on three data sets: 1000 assembly consensus sequences from AllGenes, the 1000 longest mRNAs contained in the same set of assemblies, and the same set of mRNAs with 2 artificially generated frameshifts (insertions) per sequence. The results are presented in Table 1. Considering the speed/accuracy ratio, we chose framefinder as the most appropriate program.
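The third test set can be imitated with a few lines of code. The sketch below assumes that each artificial frameshift is a single-nucleotide insertion at a uniformly random position; the poster does not state how the insertions were generated, so the function name, the insertion model, and the default of 2 insertions taken from the text are illustrative only.

```python
import random


def insert_frameshifts(seq, n_insertions=2, rng=None):
    """Return a copy of seq with n_insertions random single-base insertions.

    Assumption: one random base (A, C, G or T) is inserted at a uniformly
    chosen position for each frameshift.
    """
    rng = rng or random.Random()
    bases = list(seq)
    for _ in range(n_insertions):
        position = rng.randrange(len(bases) + 1)
        bases.insert(position, rng.choice("ACGT"))
    return "".join(bases)
```

Passing a seeded `random.Random` instance as `rng` makes the generated test set reproducible across program comparisons.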

Method for assessing the significance of observed vs expected ORF length

The null hypothesis is that a nucleotide sequence of length $N_{nuc}$ is non-coding and has no significant compositional bias. The probability of a stop codon in the translated sequence is then $q = 3/64$, and the probability of a non-stop codon is $p = 61/64$.

We assess the significance of a particular ORF of length $L_{max}$. The number of trials for ORF reconstruction over the 6 frames is

$$N_{pep} = 6\,([N_{nuc}/3] - L_{max} + 1).$$

The longest ORF among all those found, $Y_{max}$, is taken as the test statistic. With ORF lengths following a geometric distribution,

$$P(Y_{max}) = 1 - (1 - P_l)^{n}, \qquad (1)$$

where $P_l$ is the probability that a single ORF reaches length $l$ (equal to $p^{l}$ under the null hypothesis, as used in (2)), and $n$ is the number of ORFs. The number of ORFs corresponds to the number of stop codons and can be approximated by the average number of stop codons in the trial space, that is, in the 6-frame translated nucleotide sequence: $n = q N_{pep}$. For the observed length $L_{max}$, the P-value follows from (1):

$$P(L_{max}) = 1 - (1 - p^{L_{max}})^{q N_{pep}}. \qquad (2)$$

In turn, (2) can be well approximated by a Poisson distribution and converted to the simplified form

$$P(L_{max}) = 1 - e^{-p^{L_{max}}\, q N_{pep}}. \qquad (3)$$
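Formulas (2) and (3) are easy to evaluate numerically. The sketch below is one possible implementation under the stated null hypothesis; the function name and the `exact` switch are assumptions, and the ORF length is taken in codons.

```python
import math

P_NONSTOP = 61 / 64  # p: probability of a non-stop codon
Q_STOP = 3 / 64      # q: probability of a stop codon


def orf_p_value(n_nuc, l_max, exact=False):
    """P-value of the longest ORF (l_max codons) in n_nuc nucleotides.

    exact=True evaluates formula (2); the default evaluates the
    Poisson approximation, formula (3).
    """
    n_pep = 6 * (n_nuc // 3 - l_max + 1)  # number of trials over 6 frames
    if n_pep <= 0:
        raise ValueError("ORF is longer than the translated sequence")
    n_orfs = Q_STOP * n_pep               # expected number of stop codons
    tail = P_NONSTOP ** l_max             # chance that one ORF reaches l_max
    if exact:
        return 1.0 - (1.0 - tail) ** n_orfs
    return -math.expm1(-tail * n_orfs)    # 1 - exp(-x), numerically stable
```

For a 3000-nucleotide sequence, a 100-codon ORF is far from significant under this model, while a 300-codon ORF is highly significant; the two formulas agree closely in both cases.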

Flowchart of the Framefinder plug-in (a plug-in is an application that works with the Allgenes Oracle layer through a specialized OO Perl library created at CBIL)

Flowchart elements: FrameFinder ORF; P<0.2? (yes/no); trivial translation; Length >3ff?; Adjust gap penalty: F = -c + ln(p_value); Creation of trace file: H = -P(dash)*logP(dash) - P(^dash)*logP(^dash); ATG? (yes); Diana; Submit to Allgenes
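The two formulas in the flowchart can be written out as small functions. This is a sketch only: it assumes that P(^dash) is the complement 1 - P(dash), that the logarithm in H is natural, and that c is a positive constant chosen by the plug-in; none of these details is given in the flowchart, and the function names are hypothetical.

```python
import math


def adjusted_gap_penalty(p_value, c):
    """F = -c + ln(p_value); c is an unspecified constant (assumption)."""
    return -c + math.log(p_value)


def dash_entropy(p_dash):
    """H = -P(dash)*log P(dash) - P(^dash)*log P(^dash).

    Assumes P(^dash) = 1 - P(dash) and uses the convention 0*log(0) = 0.
    """
    entropy = 0.0
    for p in (p_dash, 1.0 - p_dash):
        if p > 0.0:
            entropy -= p * math.log(p)
    return entropy
```

Under these assumptions, a smaller P-value makes F more negative, and H is largest when dash and non-dash are equally likely.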

Statistics for ORFs generated by framefinder on 363520 mouse assemblies

Distribution of ORF length for 71709 mouse non-singletons (truncated at x=500; x = 5377 is the maximal value). Median value is x=41.

Example of the BLASTP program applied to an ORF with a frameshift, searched against the nrdb database. The frameshift corrected by framefinder is shown in red. We conclude from this alignment that the frameshift was introduced appropriately.

Blastp version
Query= 6041361 (691 letters)
>gi|13878191|ref|NP_113550.1| opioid growth factor receptor; RIKEN cDNA 2010013E17 gene [Mus musculus] gi|12667207|gb|AAK01353.1|AF303894_1 (AF303894) opioid growth factor receptor [Mus musculus]
Length = 633
Score = 1031 bits (2667), Expect = 0.0
Identities = 612/632 (96%), Positives = 614/632 (96%), Gaps = 18/632 (2%)

Query: 421 VANEVRKRRKVEEGAEGDGVASNTQVQASALSPTPSECPESQKDGNGPEDPKSQVGPEDP 480
           VANEVRKRRKVEEGAEGDGVASNTQVQASALSPTPSECPESQKDGNGPEDPKSQVGPEDP
Sbjct: 421 VANEVRKRRKVEEGAEGDGVASNTQVQASALSPTPSECPESQKDGNGPEDPKSQVGPEDP 480

Query: 481 KSQVGPEDPKSQVGPEDPKAQVGPEDPKGQVE------------------PEDPKGQVGP 522
           KSQVGPEDPKSQVGPEDPK+QVGPEDPKGQVE                  PEDPKGQVGP
Sbjct: 481 KSQVGPEDPKSQVGPEDPKSQVGPEDPKGQVEPEDPKGQVGPEDPKGQVGPEDPKGQVGP 540

Query: 523 EDPKSQVGPEDPKSQVEPEDPKSQVEPEDPKSQVEPEDPKSQVGPEDPQSQVGPEQAASK 582
           EDPKSQVGPEDPKSQVEPEDPKSQVEPEDPKSQVEPEDPKSQVGPEDPQSQVGPEQAASK
Sbjct: 541 EDPKSQVGPEDPKSQVEPEDPKSQVEPEDPKSQVEPEDPKSQVGPEDPQSQVGPEQAASK 600

Query: 583 SLGEDPDSDTTGTSMSESEELARIEASVEPPK 614
           SLGEDPDSDTTGTSMSESEELARIEASVEPPK
Sbjct: 601 SLGEDPDSDTTGTSMSESEELARIEASVEPPK 632

Assessing the framefinder output by comparing BLAST results against nrdb, p<0.05

Comparison of top BLAST hits (subjects) for assemblies that both have hits in mRNA (BLASTP:BLASTX)

BLASTP alignment of long ORFs for assembly 6054719 with no correction against protein nrdb.
A) Alignment with the corrected ORF (1240 letters); 1006 bp of the target protein is covered. The same result is obtained with a BLASTX alignment of the nucleic acid assembly.
B) Initial ORF (1462 letters) produced by framefinder; only 585 letters correspond to the target protein.

Illustration that trivial translation is sometimes the best:
A) ORF 6044285 without any correction, blasted against nrdb;
B) protein with correction, blasted against nrdb;
C) BLASTX alignment of assembly 496991 against protein nrdb. The whole target protein is covered.

DEATH domain identification

Figure 8. Alignment of a protein with no protein similarities (GUS ACC: 6282896) against:
A. the nrdb database. Only very poor similarity (1e-4) is obtained;
B. the Pfam database. A non-random domain appears to exist within this protein (1e-6).

A.
Query= 6282896 (417 letters)
Database: nr 736,524 sequences; 233,319,389 total letters
7|pir||T14892 transcription factor NF-kappaB - sea urchin (Strongylocentrotus purpuratus) gi|4165051|gb|AAD08653.1| (AF064258) NFkB [Strongylocentrotus purpuratus]
Length = 1125
Score = 48.5 bits (114), Expect = 1e-04
Identities = 26/74 (35%), Positives = 43/74 (57%), Gaps = 1/74 (1%)

Query: 12   EQLQMLLEPNSVTGNDWRRLASHLGLCGMKIRFLSCQRSPAAAILELFEEQNGSLQELHY 71
            E+L  +L+ N  T   W  LA+ LGL  M + FL    SP A IL+ FE  +G+++EL
Sbjct: 1012 EKLGSMLDDNYPTTQSWFTLANRLGLSNM-LNFLKLVPSPTAVILKQFEAMDGTIKELRD 1070

Query: 72   LMTSMERLDCASAI 85
            +++SM  ++  + +
Sbjct: 1071 VLSSMNHIEAVALL 1084

B.
Subject: gnl|Pfam|pfam00531, death, Death domain.
CD-Length = 83 residues, 91.6% aligned
Score = 47.0 bits (110), Expect = 2e-06

Query: 21 NSVTGNDWRRLASHLGLCGMKIRFLSCQ----RSPAAAILELFEEQ---NGSLQELHYLM 73
Sbjct: 8  DDPLGRDWRRLARKLGLSESEIDQIENEYPRLASPTYQLLDLWEQRGGKNATVGTLLQAL 67

Query: 74 TSMERLDCASAIQNYL 89
Sbjct: 68 RKMGRRDAVELLESAL 83

Conclusions
• We have chosen framefinder as the most appropriate application from the set considered.
• In 98% of cases the BLASTP subjects for ORFs with p<0.05 were consistent with the BLASTX subjects. This high degree of consistency provides an internal check of the validity of the translations. Cases where ORFs had no homology to known proteins (11% of the total set with p<0.05) are therefore not likely to be artifacts.
• A significant percentage of assemblies do not contain a prominent protein sequence, probably because they contain short proteins; these could also be non-coding mRNAs or 5'/3' truncated products.
• We found at least some cases where trivial translation is more efficient than framefinder. We provide the trivial translation in cases of a) poor framefinder performance or b) a long sequence with frameshifts.
• Providing the most probable ORF makes it possible to analyze novel proteins and to correct frameshifts.
