Gene Prediction approaches

Gene Prediction approaches Talk By Joy Scaria

The value of genome sequences lies in their annotation • Annotation – Characterizing genomic features using computational and experimental methods • Genes: Four levels of annotation • Gene Prediction – Where are genes? • What do they look like? • Domains – What do the proteins do? • Role – What pathway(s) involved in?

How many genes? • Consortium: 35,000 genes? • Celera: 30,000 genes? • Affymetrix: 60,000 human genes on GeneChips? • Incyte and HGS: over 120,000 genes? • GenBank: 49,000 unique gene coding sequences? • UniGene: > 89,000 clusters of unique ESTs?

Current consensus (in flux …) • 15,000 known genes (similarity to previously isolated genes and expressed sequences from a large variety of different organisms) • 17,000 predicted (GenScan, GeneFinder, GRAIL) • Based on and limited to previous knowledge

How do we get from here …

to here,

What are genes? - 1 • Complete DNA segments responsible for making functional products • Products • Proteins • Functional RNA molecules • Interfering RNA (the effectors of RNAi) • rRNA (ribosomal RNA) • snRNA (small nuclear) • snoRNA (small nucleolar) • tRNA (transfer RNA)

What are genes? - 2 • Definition vs. dynamic concept • Consider • Prokaryotic vs. eukaryotic gene models • Introns/exons • Posttranscriptional modifications • Alternative splicing • Differential expression • Multi-subunit proteins

Prokaryotic gene model: ORF-genes • “Small” genomes, high gene density • Haemophilus influenzae genome 85% genic • Operons • One transcript, many genes • No introns • One gene, one protein • Open reading frames • One ORF per gene • ORFs begin with a start codon and end with a stop codon (by definition) • TIGR: http://www.tigr.org/tigr-scripts/CMR2/CMRGenomes.spl • NCBI: http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html

Promoter…

Eukaryotic gene model: spliced genes • Posttranscriptional modification • 5’-CAP, polyA tail, splicing • Open reading frames • Mature mRNA contains ORF • All internal exons contain open “read-through” • Pre-start and post-stop sequences are UTRs • Multiple translates • One gene – many proteins via alternative splicing

Expansions and Clarifications • ORFs • Start – triplets – stop • Prokaryotes: gene = ORF • Eukaryotes: spliced genes or ORF genes • Exons • Remain after introns have been removed • Flanking parts contain non-coding sequence (5’- and 3’-UTRs)
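The "start – triplets – stop" definition can be sketched as a small six-frame scan. This is a minimal illustration, not one of the tools named in this talk; it assumes ATG as the only start codon and TAA/TAG/TGA as stops, and the names `find_orfs` and `min_codons` are made up for the example.

```python
# Minimal ORF scanner: start codon, in-frame triplets, stop codon.
# Assumption: ATG is the only start codon; TAA/TAG/TGA are the stops.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Return (strand, start, end, orf) tuples for ORFs in all six frames.

    start/end are 0-based, end-exclusive offsets on the scanned strand.
    min_codons counts the start codon but not the stop codon.
    """
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if orf_start is None and codon == START:
                    orf_start = i  # first in-frame start opens the ORF
                elif orf_start is not None and codon in STOPS:
                    if (i - orf_start) // 3 >= min_codons:
                        found.append((strand, orf_start, i + 3, s[orf_start:i + 3]))
                    orf_start = None  # look for the next ORF in this frame
    return found


if __name__ == "__main__":
    for hit in find_orfs("CCATGAAATAGCC"):
        print(hit)
```

In a prokaryote each hit is a gene candidate; in a eukaryote the same scan only makes sense on the spliced (mature) mRNA, since introns interrupt the reading frame.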

Where do genes live? • In genomes • Example: human genome • 3,200,000,000 base pairs • Chromosomes: 1-22, X, Y, mt • 28,000-45,000 genes (current estimate) • 25% of the genome is genic (introns and exons) • 1% of the genome codes for amino acids (CDS) • 30 kb gene length (average) • 1.4 kb ORF length (average) • 3 transcripts per gene (average)
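The figures on this slide can be cross-checked against each other with a little arithmetic. The sketch below only divides the slide's own numbers; the variable names are illustrative, and the results are back-of-the-envelope estimates rather than gene counts.

```python
# Rough consistency check using only the numbers quoted on the slide.
GENOME_BP = 3_200_000_000
AVG_GENE_BP = 30_000   # average gene length (introns + exons)
AVG_ORF_BP = 1_400     # average ORF length

genic_bp = GENOME_BP * 25 // 100   # 25% of the genome is genic
coding_bp = GENOME_BP * 1 // 100   # 1% of the genome is CDS

genes_from_genic = genic_bp / AVG_GENE_BP    # genes implied by genic fraction
genes_from_coding = coding_bp / AVG_ORF_BP   # genes implied by coding fraction

print(f"genic bases:  {genic_bp:,}")
print(f"coding bases: {coding_bp:,}")
print(f"genes implied by genic fraction:  {genes_from_genic:,.0f}")
print(f"genes implied by coding fraction: {genes_from_coding:,.0f}")
```

Both estimates land in the low-to-mid twenty thousands, slightly below the quoted 28,000-45,000 range, which is about as close as rounded averages can be expected to agree.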

So much DNA – so “few” genes …

Genomic sequence features • Repeats (“Junk DNA”) • Transposable elements, simple repeats • RepeatMasker • Genes • Vary in density, length, structure • Identification depends on evidence and methods and may require concerted application of bioinformatics methods and lab research • Pseudogenes • Look-alikes of genes that obstruct gene-finding efforts.

Gene identification • Homology-based gene prediction • Similarity Searches (e.g. BLAST, BLAT) • Genome Browsers • RNA evidence (ESTs) • Ab initio gene prediction • Gene prediction programs • Prokaryotes • ORF identification • Eukaryotes • Promoter prediction • PolyA-signal prediction • Splice site, start/stop-codon predictions

Gene prediction through comparative genomics • Highly similar (conserved) regions between two genomes are likely to be functional, or else they would have diverged • If genomes are too closely related, all regions are similar, not just genes • If genomes are too far apart, corresponding regions may be too dissimilar to be found
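The idea of looking for conserved regions can be illustrated with a sliding-window identity scan. This sketch assumes the two sequences have already been aligned to equal length (gaps written as "-"); the function names, window size, and 0.8 threshold are arbitrary choices for the example, not values from the talk.

```python
def windowed_identity(a, b, window=10, step=5):
    """Return (start, fraction_identical) for each window of an alignment.

    Assumes a and b are pre-aligned strings of equal length; a column
    counts as a match only if both residues are equal and not a gap.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    result = []
    for start in range(0, len(a) - window + 1, step):
        cols = zip(a[start:start + window], b[start:start + window])
        matches = sum(1 for x, y in cols if x == y and x != "-")
        result.append((start, matches / window))
    return result


def conserved_windows(a, b, threshold=0.8, window=10, step=5):
    """Return start offsets of windows whose identity meets the threshold."""
    return [start
            for start, identity in windowed_identity(a, b, window, step)
            if identity >= threshold]


if __name__ == "__main__":
    human = "ACGTACGTACAAAAAAAAAA"
    other = "ACGTACGTACTTTTTTTTTT"
    print(windowed_identity(human, other))
    print(conserved_windows(human, other))
```

If the two genomes are too close, every window passes the threshold; if they are too distant, none does. That is the trade-off described on the slide.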

Genome Browsers • NCBI Map Viewer: www.ncbi.nlm.nih.gov/mapview/ • Generic Genome Browser (CSHL): www.wormbase.org/db/seq/gbrowse • Ensembl Genome Browser: www.ensembl.org/ • UCSC Genome Browser: genome.ucsc.edu/cgi-bin/hgGateway?org=human • Apollo Genome Browser: www.bdgp.org/annot/apollo/

Gene discovery using ESTs • Expressed Sequence Tags (ESTs) represent sequences from expressed genes. • If a region matches an EST with high stringency, the region is probably a gene or pseudogene. • An EST that overlaps an exon boundary gives an accurate prediction of that boundary.

Ab initio gene prediction • Prokaryotes • ORF-Detectors • Eukaryotes • Position, extent & direction: through promoter and polyA-signal predictors • Structure: through splice site predictors • Exact location of coding sequences: through determination of relationships between potential start codons, splice sites, ORFs, and stop codons
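As a toy version of the splice-site step, the sketch below lists every span that starts with GT and ends with AG, the dinucleotides that flank most introns. That GT-AG rule is not stated on the slides and is used here as an assumption. Real splice-site predictors score the surrounding sequence context, and the function names and `min_len` default are illustrative only.

```python
def candidate_introns(seq, min_len=4, max_len=None):
    """Return (start, end) spans that begin with GT and end with AG.

    Offsets are 0-based and end-exclusive. Assumption: canonical GT...AG
    introns only; every donor/acceptor pair within the length limits is
    reported, so most candidates will be false positives.
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    introns = []
    for d in donors:
        for a in acceptors:
            end = a + 2
            length = end - d
            if length >= min_len and (max_len is None or length <= max_len):
                introns.append((d, end))
    return introns


def remove_intron(seq, intron):
    """Return seq with the given (start, end) span spliced out."""
    start, end = intron
    return seq[:start] + seq[end:]


if __name__ == "__main__":
    pre_mrna = "AAGTCCCAGTT"
    for intron in candidate_introns(pre_mrna):
        print(intron, remove_intron(pre_mrna, intron))
```

Each candidate can then be checked against the other signals on the slide, for example whether splicing it out leaves an uninterrupted ORF between a start and a stop codon.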

Tools • ORF detectors • NCBI: http://www.ncbi.nih.gov/gorf/gorf.html • Promoter predictors • CSHL: http://rulai.cshl.org/software/index1.htm • BDGP: fruitfly.org/seq_tools/promoter.html • ICG: TATA-Box predictor • PolyA signal predictors • CSHL: argon.cshl.org/tabaska/polyadq_form.html • Splice site predictors • BDGP: http://www.fruitfly.org/seq_tools/splice.html • Start-/stop-codon identifiers • DNALC: Translator/ORF-Finder • BCM: Searchlauncher

Gene prediction programs • Rule-based programs • Use explicit set of rules to make decisions. • Example: GeneFinder • Neural Network-based programs • Use data set to build rules. • Examples: Grail, GrailEXP • Hidden Markov Model-based programs • Use probabilities of states and transitions between these states to predict features. • Examples: Genscan, GenomeScan
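To make "probabilities of states and transitions" concrete, here is a toy two-state hidden Markov model decoded with the Viterbi algorithm. It is not how Genscan or GenomeScan are parameterized: the two states (N = non-coding, C = coding) and every probability below are invented for illustration, with coding sequence assumed to be GC-rich.

```python
import math

# Toy model: all states and probabilities are illustrative assumptions.
STATES = ("N", "C")  # N = non-coding, C = coding
START_P = {"N": 0.5, "C": 0.5}
TRANS_P = {"N": {"N": 0.9, "C": 0.1},
           "C": {"N": 0.1, "C": 0.9}}
EMIT_P = {"N": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
          "C": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}}


def viterbi(seq):
    """Return the most probable state path for seq as a string of N/C."""
    seq = seq.upper()
    if not seq:
        return ""
    # Log-probabilities avoid numeric underflow on long sequences.
    score = {s: math.log(START_P[s]) + math.log(EMIT_P[s][seq[0]])
             for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES,
                       key=lambda p: score[p] + math.log(TRANS_P[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS_P[prev][s])
                            + math.log(EMIT_P[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    dna = "AAAAAAGCGCGCGCAAAAAA"
    print(dna)
    print(viterbi(dna))
```

Production gene finders use many more states (exon phases, introns, UTRs, promoters) and train the probabilities on known genes, which is the source of the training bias listed under the common difficulties.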

Common difficulties • First and last exons are difficult to annotate because they contain UTRs. • Small genes often fail to reach statistical significance, so they are discarded. • Algorithms are trained on sequences from known genes, which biases them against genes about which nothing is known. • Masking repeats frequently removes potentially indicative chunks from the untranslated regions of genes that contain repetitive elements.

The annotation pipeline • Mask repeats using RepeatMasker. • Run sequence through several programs. • Take predicted genes and do similarity search against ESTs and genes from other organisms. • Do similarity search for non-coding sequences to find ncRNA.

Annotation nomenclature • Known Gene – Predicted gene matches the entire length of a known gene. • Putative Gene – Predicted gene contains region conserved with known gene. Also referred to as “like” or “similar to”. • Unknown Gene – Predicted gene matches a gene or EST of which the function is not known. • Hypothetical Gene – Predicted gene that does not contain significant similarity to any known gene or EST.
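The four labels above amount to a small decision rule. The sketch below is one possible encoding, assuming each predicted gene has already been reduced to its best similarity hit; the `BestHit` type, its fields, and the order of the checks are hypothetical and not part of any annotation standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BestHit:
    """Hypothetical summary of a predicted gene's best database match."""
    full_length: bool     # match spans the entire length of the known gene
    function_known: bool  # matched gene/EST has a characterized function


def annotate(hit: Optional[BestHit]) -> str:
    """Map a best hit (or None for no significant hit) to a slide label."""
    if hit is None:
        return "hypothetical gene"  # no significant similarity
    if not hit.function_known:
        return "unknown gene"       # matches something of unknown function
    if hit.full_length:
        return "known gene"         # full-length match to a known gene
    return "putative gene"          # only a conserved region matches


if __name__ == "__main__":
    print(annotate(BestHit(full_length=True, function_known=True)))
    print(annotate(BestHit(full_length=False, function_known=True)))
    print(annotate(BestHit(full_length=True, function_known=False)))
    print(annotate(None))
```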
