Gene Prediction
Motivation • Genome sequencing projects are progressing very fast: human, mouse, rat, rice, Anopheles… • The identification of genes is the next important step in the analysis of genomes. • How can we identify genes in the sequence?
Gene Identification • Identify genes coding for known proteins. • Only a few proteins are known. • Identify genes based on homology with other genomes. • Identify genes based on gene characteristics.
Gene Characteristics • Differences between prokaryotic and eukaryotic gene characteristics: • Prokaryotes: the genome is more compact. Several genes may reside on the same mRNA in different reading frames. • Eukaryotes: a gene may contain introns. • The human genome: average gene ~ 27,800 b; exon ~ 100 b; intron 100-30,000 b. • Promoter regions are different: in prokaryotes the signals are more conserved. • Differences between different types of genes in the same genome: each type has its own characteristics.
Prediction Approaches • The problem of gene prediction is still very much open, even in well-studied genomes. • The number of genes in yeast keeps changing. • The identification of promoter regions in E. coli is considered a great challenge of bioinformatics. • Next we consider prediction of the following: • Protein-coding genes (ORFs). • Functional RNA-coding genes.
ORF Finding • Open Reading Frames (ORFs): sequences that may code for proteins. • How can ORFs be detected? • All six reading frames are checked. • Search for initiation and termination codons within a sequence. Are these codons totally conserved?
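The search described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a production gene finder: it assumes ATG is the only initiation codon and TAA/TAG/TGA are the only termination codons (the slide itself asks whether these are totally conserved), and the names `find_orfs` and `min_codons` are hypothetical.

```python
# Assumption: ATG is the only start codon, TAA/TAG/TGA are the only stops.
STOPS = {"TAA", "TAG", "TGA"}
START = "ATG"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Scan all six reading frames for ATG...stop stretches.

    Returns tuples (strand, frame, start, end). start and end are 0-based
    offsets on the scanned strand; end is exclusive and includes the stop.
    min_codons counts the codons before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START:
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None                   # look for the next ORF
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))
```

Raising `min_codons` is the usual way to suppress the many short ORFs that occur by chance in random sequence.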
Prediction of Protein-Coding Genes • Three factors that affect how mRNA is translated make prediction difficult: • deviations from the standard genetic code; alternative splicing; RNA editing. • A coding sequence is not a random choice among the possible codons for each amino acid. It is an ordered list of codons that reflects evolutionary origin and constraints related to gene expression. • Each species has its own coding preferences: codon usage.
General Codon Preferences • Codon usage differs between highly and weakly expressed genes. • In E. coli, genes were divided into 3 groups based on their codon usage: • regular genes (70%) • highly expressed genes (15%) • horizontally transferred genes (15%) • There are strong preferences in ORFs for specific codon pairs and for specific codons near terminators. • The base in the third position of each codon tends to repeat itself within the same ORF.
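Codon usage can be tabulated directly from a set of known coding sequences. The sketch below computes relative codon frequencies for a single in-frame sequence; the function name `codon_usage` is hypothetical, and a real analysis would pool many genes and normalize per amino acid.

```python
from collections import Counter


def codon_usage(cds):
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


usage = codon_usage("ATGGCTGCTGCCTAA")
for codon, freq in sorted(usage.items()):
    print(codon, round(freq, 2))
```

Comparing such tables between groups of genes (for example, highly expressed versus regular genes) is one way to observe the preferences listed above.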
Signal-Based Identification • Prokaryotes: signals such as the RBS, the Ribosome Binding Site (Shine-Dalgarno), are conserved. • Located at about -15, upstream of the AUG (in B. subtilis the RBS is AGGAGG). • Eukaryotes • Transcription signals: TATA box (~ -30 relative to the TSS), cap signal, poly-adenylation site. Any signal may be missing. • Translation signals: Kozak signal (immediately upstream of the ATG), termination codon. • Splicing signals: the spliceosome recognizes donor and acceptor sites (introns usually start with GT and end with AG) and the branch point inside the intron.
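As a small illustration of signal-based identification, the following sketch checks whether a Shine-Dalgarno-like motif occurs in a window upstream of a candidate start codon. The window size, the mismatch tolerance, and the name `has_rbs` are assumptions made for the example; only the AGGAGG motif comes from the slide.

```python
def has_rbs(seq, start_codon_pos, motif="AGGAGG", window=20, max_mismatches=1):
    """True if `motif` occurs (with few mismatches) just upstream of a start codon.

    start_codon_pos is the 0-based index of the A in ATG. The window of 20
    bases and the tolerance of 1 mismatch are illustrative assumptions.
    """
    upstream = seq[max(0, start_codon_pos - window):start_codon_pos].upper()
    for i in range(len(upstream) - len(motif) + 1):
        candidate = upstream[i:i + len(motif)]
        if sum(a != b for a, b in zip(candidate, motif)) <= max_mismatches:
            return True
    return False


seq = "CCCCAGGAGGCCCCCCCCATGAAA"
print(has_rbs(seq, seq.index("ATG")))
```

The same pattern (a fixed motif, a positional window, and a mismatch tolerance) applies to the eukaryotic signals listed above, although real tools typically use position weight matrices rather than exact motifs.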
Prediction Reliability Tests • Where no experimental verification is available, the reliability of a prediction can be measured by: • Third-base repeat in an ORF: does not require any prior knowledge. • Codon usage: requires prior knowledge per species. • Predicted-protein sequence comparison: if homologs are found, the prediction is more reliable. Homologs can be searched for in protein databases, EST databases, cDNA databases, etc. The quality of the results depends on the quality of the database (ESTs are error prone).
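The third-base test needs nothing but the ORF itself. One simple proxy, offered here only as an illustration and not as the statistic used by any particular tool, is the fraction of codons that share the most common third-position base; the name `third_base_bias` is hypothetical.

```python
from collections import Counter


def third_base_bias(orf):
    """Fraction of codons whose third base is the most common third base."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3          # ignore a trailing partial codon
    thirds = [orf[i + 2] for i in range(0, usable, 3)]
    if not thirds:
        return 0.0
    return Counter(thirds).most_common(1)[0][1] / len(thirds)


print(third_base_bias("ATGGCGCTGAAG"))   # every codon ends in G
print(third_base_bias("ATGGCTAAA"))      # third bases G, T, A
```

With four possible bases, a value near 0.25 indicates no bias, while values well above that suggest a repeated third base.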
Computational Approaches to Prediction • Gene prediction is carried out by various computational methods, including decision trees, neural nets, Markov models and Hidden Markov Models (HMMs). • A model is learned from known genes and then applied to genomic sequences. • Each genome defines its own model.
Markov Models – Probabilistic Approach • A Markov model is described by a set of states and the probabilities of transition from one state to the next. • A Markov chain progresses in steps; each step corresponds to a move between states. The probability of being at state X in step i depends only on the state reached at step i-1. • It has been found that ORFs have a reading-frame-specific hexamer (6-mer) composition. => The probability of the 6th base can be computed using the previous 5. => The probability that a sequence is an ORF in a specific reading frame can be computed.
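The hexamer idea can be made concrete with a small frame-specific fifth-order Markov model. The sketch below is a toy under stated assumptions: training sequences are assumed to start at the first position of a codon, a pseudocount of 1 is used for unseen hexamers, and the names `train` and `log_score` are hypothetical. It is not the model of any specific gene finder.

```python
import math
from collections import defaultdict

ORDER = 5  # fifth order: each base is conditioned on the previous five


def train(seqs, pseudocount=1.0):
    """Estimate P(base | previous 5 bases, codon position) from in-frame sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        seq = seq.upper()
        for i in range(ORDER, len(seq)):
            counts[(seq[i - ORDER:i], i % 3)][seq[i]] += 1

    def prob(context, phase, base):
        seen = counts.get((context, phase), {})
        total = sum(seen.values()) + 4 * pseudocount
        return (seen.get(base, 0.0) + pseudocount) / total

    return prob


def log_score(seq, prob, frame=0):
    """Log-probability of seq, assuming its first codon starts at offset `frame`."""
    seq = seq.upper()
    return sum(
        math.log(prob(seq[i - ORDER:i], (i - frame) % 3, seq[i]))
        for i in range(ORDER, len(seq))
    )


prob = train(["ATGGCTGCTGCTGCTGCTTAA"] * 5)
query = "ATGGCTGCTGCTTAA"
for frame in range(3):
    print(frame, round(log_score(query, prob, frame), 2))
```

Scoring the same sequence under each assumed frame, and against a non-coding model, is how such a model indicates whether a stretch looks like an ORF in a particular reading frame.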
Finding the genes in genomic DNA. Chris Burge and Samuel Karlin. Curr Opin Struct Biol. 1998 Jun;8(3):346-54. Review.
Markov Models – Figure Legend • Circles represent DNA bases or states. Numbers indicate codon positions. Arrows indicate dependency. • a) Three-periodic 5th-order Markov model. The next base is generated conditionally on the previous 5 bases and on the codon position. • b) Homogeneous 5th-order Markov model. • c) Hidden Markov model. Upper circles represent hidden states, corresponding to whether the position is coding or non-coding; upper arrows indicate that the states are generated according to a first-order Markov model. Lower circles correspond to DNA bases; lower arrows indicate that each base is generated conditionally on the identity of the hidden state. • d) As c), with variable lengths of the hidden states.
Prediction of Complete Gene Structures in Human Genomic DNA. Chris Burge and Samuel Karlin. J Mol Biol. 1997 Apr 25;268(1):78-94.
Gene Prediction Tools • Glimmer at TIGR (The Institute for Genomic Research). • GeneMark at Georgia Tech. • Grail at Oak Ridge National Laboratory. • Genefinder at Baylor College of Medicine. • Genscan at MIT. • Prediction tools are compared using two criteria: • Sensitivity: the percentage of true genes in the genome that are correctly predicted. • Specificity: the percentage of predicted genes that are true genes. • Both need to be high (correlation tests ~ 0.7-0.9).
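The two criteria, as defined on this slide, reduce to simple ratios. The sketch below assumes that a prediction counts as correct only when its coordinates match a true gene exactly, which is a simplification; the function name is hypothetical.

```python
def sensitivity_specificity(true_genes, predicted_genes):
    """Return (sensitivity, specificity) as defined on the slide.

    Genes are compared as exact (start, end) pairs, an assumption made
    to keep the example short.
    """
    true_genes, predicted_genes = set(true_genes), set(predicted_genes)
    correct = len(true_genes & predicted_genes)
    sensitivity = correct / len(true_genes) if true_genes else 0.0
    specificity = correct / len(predicted_genes) if predicted_genes else 0.0
    return sensitivity, specificity


true = [(1, 100), (200, 300), (400, 500), (600, 700)]
predicted = [(1, 100), (200, 300), (800, 900)]
print(sensitivity_specificity(true, predicted))
```

Note that "specificity" in this sense is the fraction of predictions that are correct, which other fields call precision.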
The General Scheme • 1. Obtain a new genomic DNA sequence. • 2. a) Translate in all 6 reading frames and compare to protein databases. • b) Perform a database similarity search against an expressed sequence tag (EST) database of the same organism, or against cDNA sequences if available. • 3. Use a gene prediction program to locate genes. • 4. Analyze regulatory sequences in the gene (signals). • This can help characterize putative genes.
Functional RNA Genes • RNA genes are transcribed but not translated, so no codon preference exists. How can rRNA, tRNA and small RNA genes be predicted? • Promoter regions can be characterized, but remain a big challenge. • RNA secondary structure is important. It can be predicted using RNA structure prediction tools (e.g., MFOLD).
Characteristics of E. coli promoters • -35 hexamer: TTGACA • Spacer between the -35 and -10 hexamers: 15 to 19 bases • -10 hexamer: TATAAT • Interval between the -10 hexamer and the transcription start site: 5 to 9 bases • Actual promoters exhibit large sequence variation. • When promoters are predicted, known ones are missed and many false ones emerge.
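A consensus-based promoter scan following these characteristics might look like the sketch below. It searches for the two hexamers separated by 15 to 19 bases and tolerates a few mismatches per hexamer; the mismatch threshold and the name `find_promoters` are assumptions for illustration. As the slide warns, such a naive scan misses real promoters and reports many false ones.

```python
MINUS_35 = "TTGACA"
MINUS_10 = "TATAAT"


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_promoters(seq, max_mismatches=2, spacers=range(15, 20)):
    """Return (pos35, pos10, spacer, total_mismatches) for each candidate.

    max_mismatches is applied to each hexamer separately (an assumption).
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 5):
        m35 = mismatches(seq[i:i + 6], MINUS_35)
        if m35 > max_mismatches:
            continue
        for spacer in spacers:
            j = i + 6 + spacer                 # start of the -10 hexamer
            if j + 6 > len(seq):
                break
            m10 = mismatches(seq[j:j + 6], MINUS_10)
            if m10 <= max_mismatches:
                hits.append((i, j, spacer, m35 + m10))
    return hits


seq = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GGGGGGG"
print(find_promoters(seq, max_mismatches=0))
```

Loosening `max_mismatches` quickly increases the number of hits, which illustrates why promoter prediction yields so many false positives.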
Characteristics of rho-independent terminators • Stem-loop structure with free energy below -7 kcal/mole • Stem of 5-10 base pairs, at least 60% GC • Loop of 3-8 bases • At least 4 U residues following the stem (5’ UUUU 3’)
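Most of these criteria can be checked with plain string operations. The sketch below works on DNA (T in place of U), requires a perfectly paired stem, and reports the longest qualifying stem at each start position. It does not compute free energy, so the -7 kcal/mole threshold is not applied; that would need an RNA folding tool. The names `hairpin_at` and `find_terminators` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(s):
    """Reverse complement of a DNA string."""
    return s.translate(COMPLEMENT)[::-1]


def gc_fraction(s):
    return (s.count("G") + s.count("C")) / len(s)


def hairpin_at(seq, start, min_t=4):
    """Return (stem_len, loop_len) of the longest qualifying hairpin at start, or None."""
    for stem in range(10, 4, -1):                  # stems of 10 down to 5 bp
        left = seq[start:start + stem]
        if len(left) < stem or gc_fraction(left) < 0.6:
            continue
        for loop in range(3, 9):                   # loops of 3 to 8 bases
            r = start + stem + loop                # start of the right arm
            right = seq[r:r + stem]
            tail = seq[r + stem:r + stem + min_t]
            if right == revcomp(left) and tail == "T" * min_t:
                return stem, loop
    return None


def find_terminators(seq):
    """Return (start, stem_len, loop_len) for each candidate terminator."""
    seq = seq.upper()
    hits = []
    for start in range(len(seq)):
        found = hairpin_at(seq, start)
        if found:
            hits.append((start, *found))
    return hits


seq = "CACA" + "GCCGCC" + "GAAA" + "GGCGGC" + "TTTT" + "GG"
print(find_terminators(seq))
```

Real terminators often contain mismatches or bulges in the stem, so a practical search would relax the perfect-pairing requirement and add the energy filter.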
Identifying Small RNAs • The role of small RNAs (sRNAs) is a hot topic in current biology. sRNA genes fill many regulatory functions, e.g., regulating the translation of mRNA (antisense). They are hard to find experimentally. • A group of researchers from the Hebrew University and from Sweden combined bioinformatic predictions with experimental verification. Argaman et al., Current Biology 2001.
Identifying Small RNAs • Based on 10 known sRNAs in E. coli, they predicted 24 sRNAs, of which 14 were experimentally verified. • Three subsequent studies identified ~ 20 more sRNA genes in E. coli.
Predictive scheme • Locate “empty” regions in the E. coli genome (regions between annotated ORFs). • Search for promoter DNA sequences recognized by σ70 of RNA polymerase (-35 and -10 hexamers upstream of +1). • Identify rho-independent terminators (stem-loop followed by TTTT). • Extract sequences in which the distance between the promoter and the terminator is 50 to 400 bases. • Check sequences for conservation in other bacteria.
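The distance filter in this scheme is straightforward to express in code. The sketch below pairs promoter and terminator positions that fall in the same "empty" region and are 50 to 400 bases apart. It is a simplified illustration, not the published pipeline: it considers one strand only, takes positions as plain integers, and the name `candidate_srnas` is hypothetical.

```python
def candidate_srnas(promoters, terminators, empty_regions, min_len=50, max_len=400):
    """Pair promoters with downstream terminators inside the same empty region.

    promoters, terminators: 0-based positions on one strand (assumption).
    empty_regions: (start, end) intervals with end exclusive.
    """
    candidates = []
    for region_start, region_end in empty_regions:
        ps = [p for p in promoters if region_start <= p < region_end]
        ts = [t for t in terminators if region_start <= t < region_end]
        for p in ps:
            for t in ts:
                if min_len <= t - p <= max_len:
                    candidates.append((p, t))
    return candidates


promoters = [100, 1000, 5000]
terminators = [220, 1020, 5600]
empty = [(0, 500), (900, 1500), (4900, 6000)]
print(candidate_srnas(promoters, terminators, empty))
```

The surviving candidates would then go to the final step of the scheme, the check for conservation in other bacteria.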