Gene Finding Charles Yan
Gene Finding Content sensors • Extrinsic content sensors • Compare with protein sequences • Compare with cDNA and ESTs • Genomic comparisons • Intrinsic content sensors • Prediction methods Signal sensors
Intrinsic content sensors • Originally, intrinsic content sensors were defined for prokaryotic genomes. • In such genomes, only two types of regions are usually considered: the regions that code for a protein and will be translated, and intergenic regions.
Intrinsic content sensors • Since coding regions will be translated, they are characterized by the fact that three successive bases in the correct frame define a codon which, using the genetic code rules, will be translated into a specific amino acid in the final protein.
Intrinsic content sensors • In prokaryotic sequences, genes define (long) uninterrupted coding regions that must not contain stop codons. • Therefore, the simplest approach for finding potential coding sequences is to look for sufficiently long open reading frames (ORFs), defined as sequences not containing stops, i.e. as sequences between a start and a stop codon.
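The ORF search described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular gene finder: the function name `find_orfs`, the `min_codons` threshold and the restriction to ATG starts on the forward strand are all assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=100):
    """Return (start, end, frame) for forward-strand ORFs.

    An ORF here runs from an ATG to the first in-frame stop codon;
    `end` is the index just past the stop codon. `min_codons` is an
    illustrative length threshold (codons before the stop).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF
    return orfs

print(find_orfs("CCATGAAATTTTAGCC", min_codons=3))  # [(2, 14, 2)]
```

A real scan would also cover the reverse complement and, for prokaryotes, alternative start codons; both are left out here to keep the sketch short.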
Intrinsic Content Sensors In eukaryotic sequences, however, the translated regions may be very short and the absence of stop codons becomes meaningless.
Intrinsic Content Sensors Several other measures have therefore been defined that try to more finely characterize the fact that a sequence is 'coding' for a protein: • Nucleotide composition and especially (G+C) content (introns being more A/T-rich than exons, especially in plants) • Codon composition • Hexamer frequency
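As a small illustration of the first measure, (G+C) content can be computed over sliding windows. The function names and the default window and step sizes below are assumptions chosen for the example, not values taken from any program.

```python
def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_windows(seq, window=100, step=50):
    """(start, GC fraction) for each window; sizes are illustrative."""
    last_start = max(len(seq) - window + 1, 1)
    return [(i, gc_content(seq[i:i + window]))
            for i in range(0, last_start, step)]

print(gc_windows("GGGGAAAA", window=4, step=4))  # [(0, 1.0), (4, 0.0)]
```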
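Codon Composition In random DNA, the expected ratio is Leucine : Alanine : Tryptophan = 6 : 4 : 1, because the genetic code assigns these amino acids 6, 4 and 1 codons respectively.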
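The 6 : 4 : 1 ratio follows from the number of codons that the standard genetic code assigns to each amino acid. The sketch below rebuilds the standard code table (bases in T, C, A, G order, third position varying fastest) and counts codons per amino acid; the variable names are chosen for the example.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon; '*' marks a stop codon.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

GENETIC_CODE = {"".join(codon): aa
                for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# With equiprobable bases, every codon has probability 1/64, so the
# expected amino acid ratio is simply the ratio of codon counts.
codons_per_aa = Counter(GENETIC_CODE.values())
print(codons_per_aa["L"], codons_per_aa["A"], codons_per_aa["W"])  # 6 4 1
```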
Codon Composition • Compare the codon (or amino acid) frequencies observed in a candidate region with the background frequency expected in random DNA.
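One possible way to make this comparison concrete is a log-odds score per codon. In the sketch below the background is taken to be the product of the single-base frequencies of the sequence itself; that choice, like the function name `codon_log_odds`, is an assumption made for illustration rather than the definition used by any specific program.

```python
import math
from collections import Counter

def codon_log_odds(seq, frame=0):
    """log2(observed / expected) for each codon read in `frame`.

    Expected frequency assumes independent bases drawn with the
    base composition of `seq` (an illustrative background model).
    """
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    base_freq = {b: n / len(seq) for b, n in Counter(seq).items()}
    scores = {}
    for codon, count in Counter(codons).items():
        observed = count / len(codons)
        expected = math.prod(base_freq[b] for b in codon)
        scores[codon] = math.log2(observed / expected)
    return scores
```

Positive scores mark codons that are used more often than the background predicts, negative scores mark codons that are avoided.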
Hexamer Frequency Among the large variety of coding measures that have been tested, hexamer usage (i.e. usage of 6 nt long words) was shown in 1992 to be the most discriminative variable between coding and non-coding sequences.
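A hexamer-based content sensor can be sketched as a log-likelihood ratio between hexamer frequencies learned from coding and from non-coding training sequences. The function names, the pseudocount and the toy training strings in the usage lines are assumptions for the example; real programs train on large annotated sets and often use frame-specific counts.

```python
import math
from collections import Counter

def hexamer_counts(seq, k=6):
    """Count all overlapping words of length k in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def hexamer_score(seq, coding_counts, noncoding_counts, pseudocount=1.0):
    """Sum of log2(P_coding / P_noncoding) over the hexamers of seq.

    The pseudocount (an illustrative choice) keeps unseen hexamers
    from producing zero probabilities.
    """
    n_words = 4 ** 6
    cod_total = sum(coding_counts.values()) + pseudocount * n_words
    non_total = sum(noncoding_counts.values()) + pseudocount * n_words
    score = 0.0
    for word, n in hexamer_counts(seq).items():
        p_cod = (coding_counts[word] + pseudocount) / cod_total
        p_non = (noncoding_counts[word] + pseudocount) / non_total
        score += n * math.log2(p_cod / p_non)
    return score

# Toy training data, purely hypothetical.
coding = hexamer_counts("ATGGCC" * 10)
noncoding = hexamer_counts("ATATAT" * 10)
print(hexamer_score("ATGGCCATGGCC", coding, noncoding) > 0)  # True
```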
Intrinsic Content Sensors • In general, most currently existing programs use two types of content sensors: one for coding sequences and one for non-coding sequences, i.e. introns, UTRs and intergenic regions. A few programs refine this by using a different model for the different types of non-coding regions (e.g. one model for introns, one for intergenic regions and an optional specific 3’- and 5’-UTR model in EuGene).
Gene Finding Content sensors • Extrinsic content sensors • Intrinsic content sensors Signal sensors
Signals • Transcription (transcription factor binding sites and TATA boxes) • Splicing (donor and acceptor sites and branch points) • Polyadenylation [poly(A) site] • Translation (initiation site, generally ATG with exceptions, and stop codons)
Signal Sensors • Splice site prediction • Promoter prediction • Poly(A) sites prediction • Translation initiation codon prediction
Splice site prediction • The basic and natural approach to finding a signal that may represent the presence of a functional site is to search for a match with a consensus sequence (with possible variations allowed), the consensus being determined from a multiple alignment of functionally related documented sequences. • e.g. for splice site predictions: SPLICEVIEW and SplicePredictor
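Consensus matching with a limited number of allowed mismatches can be sketched as below. The consensus string `MAGGTRAGT` (a commonly quoted donor-site pattern written with IUPAC codes) and the function name `consensus_hits` are assumptions for illustration; this is not how SPLICEVIEW or SplicePredictor are implemented.

```python
# Subset of IUPAC nucleotide codes used by the example consensus.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
         "S": "CG", "W": "AT", "N": "ACGT"}

def consensus_hits(seq, consensus, max_mismatches=1):
    """Return (position, window, mismatches) for every window of seq
    that differs from the consensus in at most max_mismatches places."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(consensus) + 1):
        window = seq[i:i + len(consensus)]
        mismatches = sum(base not in IUPAC[sym]
                         for base, sym in zip(window, consensus))
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits

print(consensus_hits("TTCAGGTAAGTCC", "MAGGTRAGT", max_mismatches=0))
# [(2, 'CAGGTAAGT', 0)]
```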
Splice site prediction • A more flexible representation of signals is offered by the so-called positional weight matrices (PWMs), which indicate the probability that a given base appears at each position of the signal (again computed from a multiple alignment of functionally related sequences). • The PWM weights can also be optimized by a neural network method, e.g. NetPlantGene and NetGene2
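A PWM can be estimated from aligned example sites and then used to score a candidate window as a log-odds sum. The sketch below assumes a uniform background of 0.25 per base and a pseudocount of 1; the example sites are hypothetical, and the neural-network optimization mentioned above is not covered.

```python
import math

def build_pwm(sites, pseudocount=1.0):
    """Per-position base probabilities from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({b: (column.count(b) + pseudocount) / total
                    for b in "ACGT"})
    return pwm

def pwm_score(window, pwm, background=0.25):
    """Log-odds score of a window against a uniform background
    (the background value is an illustrative assumption)."""
    return sum(math.log2(pwm[i][base] / background)
               for i, base in enumerate(window))

# Hypothetical aligned donor-like sites, for illustration only.
sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]
pwm = build_pwm(sites)
print(pwm_score("GTAAGT", pwm) > pwm_score("CCCCCC", pwm))  # True
```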
Splice site prediction • In order to capture possible dependencies between adjacent positions of a signal, one may use higher order Markov models or hidden Markov models. • VEIL, MORGAN, and NetGene2
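To show what a dependency between adjacent positions buys over a PWM, here is a minimal position-specific first-order Markov model: the probability of each base is conditioned on the base at the previous position. The function names, the pseudocount and the toy training sites are assumptions for the example, and the sketch is far simpler than the models used in VEIL, MORGAN or NetGene2.

```python
import math

def train_markov_signal(sites, pseudocount=0.5):
    """Fit P(first base) and P(base at pos | base at pos-1) for each
    position of equal-length aligned sites."""
    first = {b: pseudocount for b in "ACGT"}
    for site in sites:
        first[site[0]] += 1
    total = sum(first.values())
    first = {b: c / total for b, c in first.items()}

    transitions = []
    for pos in range(1, len(sites[0])):
        counts = {p: {b: pseudocount for b in "ACGT"} for p in "ACGT"}
        for site in sites:
            counts[site[pos - 1]][site[pos]] += 1
        transitions.append(
            {p: {b: c / sum(row.values()) for b, c in row.items()}
             for p, row in counts.items()})
    return first, transitions

def markov_log_prob(window, model):
    """Natural-log probability of a window under the fitted model."""
    first, transitions = model
    log_p = math.log(first[window[0]])
    for pos in range(1, len(window)):
        log_p += math.log(transitions[pos - 1][window[pos - 1]][window[pos]])
    return log_p

# Toy sites where the second base depends on the first.
model = train_markov_signal(["AC", "AC", "GT", "GT"])
print(markov_log_prob("AC", model) > markov_log_prob("AT", model))  # True
```

A PWM trained on the same four sites would score "AC" and "AT" identically, since each column is treated independently; the conditional model separates them.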
Splice site prediction When using splice site prediction programs, one ends up with a list of potential splice sites, from which various gene structures may be built. The main purpose of such programs is not to find the gene structure but to try to find the correct exon boundaries. They are thus very useful in addition to an exon or gene predictor in order to refine an existing gene structure.
Signal Sensors HMMs have also been used to represent other types of signals, such as poly(A) sites and promoters. Promoter predictions deserve another chapter.
Signal Sensors Another important signal to identify when trying to predict a coding sequence is the translation initiation codon. A few programs exist specifically dedicated to this problem, but most of them have a rather limited efficiency, which may be related to the lack of proper learning sets for eukaryotic genomes.
Gene Finding Content sensors • Extrinsic content sensors • Intrinsic content sensors Signal sensors • Splice site prediction • Promoter prediction • Poly(A) sites prediction • Translation initiation codon prediction Combining the evidence to predict gene structures