Chapter 8 Gene Prediction

Automated sequencing of genomes requires automated gene assignment • Includes detection of open reading frames (ORFs) • Identification of introns and exons • Gene prediction is a very difficult problem in pattern recognition • Coding regions generally do not have conserved sequences • Much progress has been made with prokaryotic gene prediction • Eukaryotic genes are more difficult to predict correctly

Ab initio methods • Predict genes from the given sequence alone • Use gene signals • Start/stop codons • Intron splice sites • Transcription factor binding sites • Ribosomal binding sites • Poly-A sites • Codon structure demands a coding length that is a multiple of three nucleotides • Gene content • Nucleotide composition (use HMMs) • Homology-based methods • Matches to known genes • Matches to cDNA • Consensus-based methods • Use output from more than one program

Prokaryotic gene structure • ATG is the usual start codon (GTG or TTG less frequent) • Ribosome binding site (Shine-Dalgarno sequence) complementary to 16S rRNA of the ribosome • Consensus AGGAGGT • Stop codons TAA, TAG and TGA • Transcription termination site (ρ-independent termination) • Stem-loop secondary structure followed by a string of Ts

Translate sequence into 6 reading frames • In random sequence a stop codon occurs about every 20 codons (3 of the 64 codons are stops) • Look for frames longer than 30 codons (normally 50-60 codons) • Check for presence of start codon and Shine-Dalgarno sequence • Translate putative ORF into protein, and search databases • 3rd base of codon is non-random, more frequently G/C • Plotting wobble base GC% can identify ORFs • 3rd base also repeats, thus repetition gives a clue to gene location
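
The scan described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`find_orfs`, `gc3`), the default 30-codon cutoff taken from the slide, and the choice to open an ORF at the first start codon in a frame are all hypothetical choices, not the behaviour of any particular gene finder.

```python
# Minimal six-frame ORF scan (illustrative sketch, not a real gene finder).
STARTS = {"ATG", "GTG", "TTG"}   # ATG is usual; GTG/TTG are less frequent starts
STOPS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Return (strand, frame, start, end, codons) for each ORF found.

    start/end are 0-based offsets on the scanned strand (for '-' they refer
    to the reverse complement); end includes the stop codon, while codons
    counts the codons from the start codon up to, but not including, the stop.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in STARTS:
                    start = i                      # open an ORF at the first start codon
                elif codon in STOPS:
                    if start is not None:
                        n = (i - start) // 3
                        if n >= min_codons:
                            orfs.append((strand, frame, start, i + 3, n))
                    start = None                   # a stop closes any open ORF
    return orfs


def gc3(orf_seq):
    """Fraction of G/C at the third (wobble) codon position of an in-frame ORF."""
    third = orf_seq.upper()[2::3]
    return sum(base in "GC" for base in third) / len(third) if third else 0.0
```

A candidate returned by `find_orfs` would still need the follow-up checks listed on the slide, such as a Shine-Dalgarno sequence upstream of the start and a database search with the translated protein.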

Markov chains and HMMs • In a model of order k, each nucleotide depends on the k previous positions • The higher the order of a Markov model to describe a gene, the more non-randomness the model includes • Genes described in codons or hexamers • HMMs trained with known genes • Codon pairs are often found, thus 6-nucleotide patterns often occur in ORFs (5th-order Markov chain) • 5th-order HMM gives very accurate gene predictions • Problem may be that in short genes there are not enough hexamers • Interpolated Markov Model (IMM) samples Markov chains of different lengths • Weighting scheme places less weight on rare k-mers • Final probability is the combined probability of all weighted k-mers • Typical and atypical genes
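
As a rough illustration of how a fixed-order Markov chain scores a sequence, the sketch below trains one chain on known coding sequences and one on background sequence and compares their log-probabilities. It is a plain k-th order chain with add-one smoothing, not an interpolated model and not a full HMM; the class name `MarkovChain`, the pseudocount, and the helper `coding_log_odds` are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


class MarkovChain:
    """k-th order Markov chain over A/C/G/T with pseudocount smoothing."""

    def __init__(self, k=5, pseudocount=1.0):
        self.k = k
        self.pseudocount = pseudocount
        self.counts = defaultdict(Counter)   # context (k-mer) -> next-base counts

    def train(self, sequences):
        """Count which base follows each k-mer context in the training set."""
        for seq in sequences:
            seq = seq.upper()
            for i in range(self.k, len(seq)):
                self.counts[seq[i - self.k:i]][seq[i]] += 1
        return self

    def log_prob(self, seq):
        """Log-probability of seq given its first k bases."""
        seq = seq.upper()
        total = 0.0
        for i in range(self.k, len(seq)):
            ctx = self.counts.get(seq[i - self.k:i], {})
            n = sum(ctx.values())
            p = (ctx.get(seq[i], 0) + self.pseudocount) / (n + 4 * self.pseudocount)
            total += math.log(p)
        return total


def coding_log_odds(seq, coding_model, background_model):
    """Positive values mean seq looks more like the coding training set."""
    return coding_model.log_prob(seq) - background_model.log_prob(seq)
```

With `k=5` each base is conditioned on the preceding five, so the model is effectively counting hexamers. The short-gene problem mentioned on the slide shows up here as contexts with very few counts, which is what an IMM addresses by falling back to shorter contexts.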

GeneMark (http://exon.gatech.edu/genemark/) • Trained on complete microbial genomes • Most closely related organism used for predictions • Glimmer (Gene Locator and Interpolated Markov Modeler) (http://www.cbcb.umd.edu/software/glimmer/) • FGENESB (http://linux1.softberry.com/) • 5th-order HMM • Trained with bacterial sequences • Linear discriminant analysis (LDA) • RBSFinder (ftp://ftp.tigr.org) • Takes output from Glimmer and searches for S-D sequences close to start sites

Performance evaluation • Sensitivity Sn = TP/(TP+FN) • Specificity Sp = TP/(TP+FP) • Correlation coefficient $CC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TN+FN)(TP+FN)(TN+FP)}}$
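
A small worked example of these measures, assuming the four counts come from comparing predicted and annotated coding nucleotides. The function name `evaluate` is hypothetical, and the correlation coefficient uses the standard four-term denominator (TP+FP)(TN+FN)(TP+FN)(TN+FP).

```python
import math


def evaluate(tp, fp, tn, fn):
    """Return (Sn, Sp, CC) using the gene-prediction definitions.

    Sn = TP/(TP+FN); Sp = TP/(TP+FP), which is the convention in this field
    and differs from the TN-based specificity used elsewhere in statistics.
    """
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, cc


# 1000 nucleotides: 100 truly coding, 100 predicted coding, 80 in common.
sn, sp, cc = evaluate(tp=80, fp=20, tn=880, fn=20)
print(f"Sn={sn:.2f} Sp={sp:.2f} CC={cc:.3f}")
```

CC is useful because Sn and Sp can each be pushed to 1 trivially (predict everything, or almost nothing, as coding), whereas CC drops in both cases.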

Gene prediction in eukaryotes • Low gene density (3% in humans) • Space between genes is very large, with multiply repeated sequences and transposable elements • Eukaryotic genes are split (introns/exons) • Transcript is capped (methylation of 5' residue) • Splicing in spliceosome • Alternative splicing • Polyadenylation (~250 As added) downstream of CAATAAA(T/C) consensus box • Major issue: identification of splice sites • GT-AG rule (GTAAGT / Y12NCAG at 5'/3' intron splice junctions) • Codon use frequencies • ATG start codon • Kozak sequence (CCGCCATGG)
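
To see why splice-site identification is the major issue, it helps to apply the GT-AG rule naively. The sketch below lists every span that starts with GT and ends with AG within a length range; the function name `candidate_introns` and the length limits are assumptions chosen only for illustration.

```python
def candidate_introns(seq, min_len=70, max_len=2000):
    """List (start, end) spans that begin with GT and end with AG.

    Coordinates are 0-based, end-exclusive. The length bounds are
    illustrative assumptions, not biological constants.
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    spans = []
    for d in donors:
        for a in acceptors:
            length = a + 2 - d
            if min_len <= length <= max_len:
                spans.append((d, a + 2))
    return spans
```

On real genomic sequence this returns a very large number of spans, because GT and AG dinucleotides are common. That is why practical programs score the extended consensus around each junction and combine it with coding statistics rather than relying on the dinucleotides alone.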

Ab initio programs • Gene signals • Start/stop • Putative splice signals • Consensus sequences • Poly-A sites • Gene content • Coding statistics • Non-random nucleotide distributions • Hexamer frequencies • HMMs

Discriminant analysis • Plot 2D graph of coding length versus 3' splice site • Place diagonal line (LDA) that separates true coding from non-coding sequences based on knowledge learned from training data • QDA fits a quadratic curve instead • FGENES uses LDA • MZEF (Michael Zhang's Exon Finder) uses QDA
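
The "diagonal line" can be made concrete with Fisher's linear discriminant on two features. The sketch below is a minimal pure-Python version that assumes equal class priors and a shared covariance; the function name `fit_lda`, the feature pair (coding length, splice-site score), and the toy numbers are all hypothetical and are not taken from FGENES.

```python
def fit_lda(coding, noncoding):
    """Fit a two-feature Fisher linear discriminant.

    Each argument is a list of (x, y) feature pairs. Returns (w, c) such that
    w[0]*x + w[1]*y > c predicts 'coding' (equal priors assumed).
    """
    def mean(points):
        n = len(points)
        return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

    def scatter(points, m):
        sxx = sum((p[0] - m[0]) ** 2 for p in points)
        sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
        syy = sum((p[1] - m[1]) ** 2 for p in points)
        return sxx, sxy, syy

    m1, m0 = mean(coding), mean(noncoding)
    # Pooled within-class scatter matrix [[sxx, sxy], [sxy, syy]].
    sxx, sxy, syy = (u + v for u, v in zip(scatter(coding, m1), scatter(noncoding, m0)))
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    # w = S^-1 (m1 - m0), using the closed-form inverse of a 2x2 matrix.
    w = ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)
    # Threshold halfway between the projected class means.
    c = 0.5 * (w[0] * (m1[0] + m0[0]) + w[1] * (m1[1] + m0[1]))
    return w, c


def is_coding(point, w, c):
    return w[0] * point[0] + w[1] * point[1] > c


# Toy training data: (coding length in nt, splice-site score); values are made up.
coding = [(120, 0.8), (150, 0.9), (90, 0.7), (200, 0.75)]
noncoding = [(30, 0.2), (45, 0.4), (60, 0.3), (20, 0.35)]
w, c = fit_lda(coding, noncoding)
```

QDA differs in that each class keeps its own covariance matrix, which turns the straight boundary into a quadratic curve.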

Neural Nets • A series of input, hidden and output layers • Gene structure information is fed to the input layer, and is separated into several classes • Hexamer frequencies • Splice sites • GC composition • Weights are calculated in the hidden layer to generate the exon output • When the input layer is challenged with a new sequence, the rules that were learned to output exons are applied to the new sequence

HMMs • GenScan (http://genes.mit.edu/GENSCAN.html) • 5th-order HMM • Combines hexamer frequencies with coding signals • Initiation codons • TATA boxes • CAP site • Poly-A • Trained on vertebrate, Arabidopsis and maize data • Extensively used in the human genome project • HMMgene (http://www.cbs.dtu.dk/services/HMMgene) • Identifies sub-regions of exons from cDNA or proteins • Locks such regions and uses HMM extension into neighboring regions

Homology-based programs • Use translations to search for ESTs, cDNAs and proteins in databases • GenomeScan (http://genes.mit.edu/genomescan.html) • Combines GENSCAN with BLASTX • EST2Genome (http://bioweb.pasteur.fr/seqanal/interfaces/est2genome.html) • Compares ESTs and cDNA to user sequence • TwinScan • Similar to GenomeScan

Consensus-based programs • Use several different programs to generate lists of predicted exons • Only exons predicted in common are retained • GeneComber (http://www.bioinformatics.ubc.ca/gencombver/index.php) • Combines HMMgene with GenScan • DIGIT (http://digit.gsc.riken.go.jp/cgi-bin/index.cgi) • Combines FGENESH, GENSCAN and HMMgene

Accuracy

Chapter 9 Promoter and regulatory element prediction

Promoters are short regions upstream of the transcription start site • Contain short (6-8 nt) transcription factor recognition sites • Extremely laborious to define by experiment • Sequence is not translated into protein, so no homology matching is possible • Each promoter is unique, with a unique combination of factor binding sites, thus there is no consensus promoter

Prokaryotic gene (figure labels: TF site, polymerase, ORF, -35 box, -10 box) • σ70 factor binds to the -35 and -10 boxes and recruits the full polymerase enzyme • -35 box consensus sequence: TTGACA • -10 box consensus sequence: TATAAT • Transcription factors that activate or repress transcription • Bind to regulatory elements • DNA loops to allow long-distance interactions
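
A consensus search for the two boxes can be sketched as follows. The function name `find_sigma70_promoters` is hypothetical, and two parameters are assumptions not stated on the slide: the allowed number of mismatches per box, and a spacer of 15 to 19 bp between the boxes (a commonly quoted range around 17 bp).

```python
MINUS_35 = "TTGACA"   # -35 box consensus
MINUS_10 = "TATAAT"   # -10 box consensus


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_sigma70_promoters(seq, max_mismatch=1, spacer=(15, 19)):
    """Return (pos_35, pos_10, total_mismatches) for each candidate pair.

    Positions are 0-based starts of the two hexamers on the given strand.
    max_mismatch applies to each box separately (assumed scoring rule).
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 5):
        m35 = mismatches(seq[i:i + 6], MINUS_35)
        if m35 > max_mismatch:
            continue
        for gap in range(spacer[0], spacer[1] + 1):
            j = i + 6 + gap
            if j + 6 > len(seq):
                break
            m10 = mismatches(seq[j:j + 6], MINUS_10)
            if m10 <= max_mismatch:
                hits.append((i, j, m35 + m10))
    return hits
```

Few natural promoters match both consensus hexamers exactly, so in practice some mismatches have to be tolerated, and each extra allowed mismatch increases the number of false positives sharply.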

Eukaryotic gene structure (figure labels: TF site, Pol II, TATA, Inr) • Polymerase I, II and III • Basal transcription factors (TFIID, TFIIA, TFIIB, etc.) • TATA box (TATA(A/T)A(A/T)) • "Housekeeping" genes often do not contain TATA boxes • Initiator site (Inr), (C/T)(C/T)CA(C/T)(C/T), coincides with transcription start • Many TF sites • Activation/repression

Ab initio methods • Promoter signals • TATA boxes • Hexamer frequencies • Consensus sequence matching • PSSM • Numerous FPs • HMMs incorporate neighboring information
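
A position-specific scoring matrix (PSSM) generalises consensus matching by scoring each base at each position. The sketch below builds a log-odds matrix from aligned sites and slides it along a sequence; the function names, the pseudocount, the uniform background of 0.25, and the example sites are all assumptions for illustration.

```python
import math


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PSSM from equal-length aligned binding sites."""
    total = len(sites) + 4 * pseudocount
    pssm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pssm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pssm


def scan(seq, pssm, threshold):
    """Return (position, score) for every window scoring at least threshold."""
    seq = seq.upper()
    width = len(pssm)
    hits = []
    for i in range(len(seq) - width + 1):
        # Bases outside ACGT (for example N) contribute a neutral score of 0.
        score = sum(pssm[k].get(seq[i + k], 0.0) for k in range(width))
        if score >= threshold:
            hits.append((i, round(score, 2)))
    return hits


# Hypothetical aligned TATA-like sites, used only to demonstrate the calls.
sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT", "TATATT"]
pssm = build_pssm(sites)
hits = scan("GGCGTATAATGCGC", pssm, threshold=5.0)
```

Lowering the threshold quickly increases the hit count on long sequences, which is the source of the numerous false positives noted on the slide: a 6-8 nt motif carries too little information on its own.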

Promoter prediction in prokaryotes • Find operon • Region upstream of first gene is the promoter • Wang rules (distance between genes, no ρ-independent termination, number of genomes that display linkage) • BPROM (http://www.softberry.com) • Based on arbitrary setting of operon gene distances • 200 bp upstream of first gene • Many FPs • FindTerm (http://sun1.softberry.com) • Searches for ρ-independent termination signals

Prediction in eukaryotes • Searching for consensus sequences in databases (TRANSFAC) • Increase specificity by searching for CpG islands • High density of transcription factor binding sites • CpGProD (http://pbil.univ-lyon1.fr/software/cpgprod.html) • CG% in moving window • Eponine (http://servlet.sanger.ac.uk:8080/eponine/) • Matches TATA box, CCAAT box, CpG island to PSSM • Cluster-Buster (http://zlab.bu.edu/cluster-buster/cbust.html) • Detects high concentrations of TF sites • FirstEF (http://rulai.cshl.org/tools/FirstEF/) • QDA of first exon boundary • McPromoter (http://genes.mit.edu/McPromoter.html) • Neural net of DNA bendability, TATA box, initiator box • Trained for Drosophila and human sequences
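
The moving-window idea behind CpG island detection can be sketched as below. This is not the CpGProD algorithm; the function name `cpg_windows` and the default thresholds (200 bp window, GC fraction of at least 0.5, observed/expected CpG ratio of at least 0.6) are widely used illustrative criteria that are assumed here rather than taken from the slides.

```python
def cpg_windows(seq, window=200, step=1, min_gc=0.5, min_oe=0.6):
    """Return (start, gc_fraction, cpg_obs_exp) for windows passing both cutoffs.

    Observed/expected CpG = count(CG) * window / (count(C) * count(G)).
    """
    seq = seq.upper()
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        c, g = w.count("C"), w.count("G")
        cpg = w.count("CG")
        gc = (c + g) / window
        oe = cpg * window / (c * g) if c and g else 0.0
        if gc >= min_gc and oe >= min_oe:
            hits.append((start, gc, oe))
    return hits
```

Overlapping windows that pass would normally be merged into a single island before reporting; that step is omitted to keep the sketch short.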

Phylogenetic footprinting technique • Identify conserved regulatory sites • Human-chimpanzee too close • Human-fish too distant • Human-mouse appropriate • ConSite (http://mordor.cgb.ki.se/cgi-bin/CONSITE/consite) • Aligns two sequences by global alignment algorithm • Identifies conserved regions and compares them to the TRANSFAC database • High-scoring hits returned as positives • rVISTA (http://rvista.dcode.org) • Identifies TRANSFAC sites in two orthologous sequences • Aligns sequences with local alignment algorithm • Highest-identity regions returned as hits • Bayes aligner (http://www.bioinfo.rpi.edu/applications/bayesian/bayes/bayes.align12.pl) • Aligns two sequences with Bayesian algorithm • Even weakly conserved regions identified

Expression-profiling based method • Microarray analysis allows identification of co-regulated genes • Assume that their promoters contain similar regulatory sites • Find such sites by EM and Gibbs sampling, iteratively refining a PSSM • Caveat: co-expressed genes may be regulated at higher levels • MEME (http://meme.sdsc.edu/meme/website/meme-intro.html) • AlignACE (http://atlas.med.harvard.edu/cgi-bin/alignace.pl) • Gibbs sampling algorithm

Web humour…
