Gene Prediction and Annotation techniques Basics Chuong Huynh NIH/NLM/NCBI Sept 30, 2004 huynh@ncbi.nlm.nih.gov Acknowledgement: Daniel Lawson, Neil Hall
What is gene prediction?
• Detecting meaningful signals in uncharacterised DNA sequences.
• Requires knowledge of the interesting information in DNA.
• Sorting the ‘wheat from the chaff’.
• Gene prediction is ‘recognising protein-coding regions in genomic sequence’.
Basic Gene Prediction Flow Chart
Obtain new genomic DNA sequence
1. Translate in all six reading frames and compare to protein sequence databases
2. Perform a database similarity search against an expressed sequence tag (EST) database of the same organism, or against cDNA sequences if available
Use a gene prediction program to locate genes
Analyze regulatory sequences in the gene
Why is gene prediction important?
• Increased volume of genome data generated
• Paradigm shift from gene-by-gene sequencing (small scale) to large-scale genome sequencing
• No more one gene at a time. A lot of data.
• Foundation for all further investigation. Knowledge of the protein-coding regions underpins functional genomics.
Note: this presentation covers only the prediction of genes that encode proteins. It does not cover promoter prediction or other sequences that regulate the activity of protein-encoding genes.
[Figure: Map Viewer display with tracks for GenomeScan models, contig, GenBank genes, mouse EST hits and human EST hits]
Knowing what to look for
What is a gene?
• Not a full transcript with control regions
• The coding sequence (ATG -> STOP)
[Figure: gene diagram labelled Start, Middle, End]
ORF Finding in Prokaryotes
• The simplest method of finding DNA sequences that encode proteins is to search for open reading frames (ORFs)
• An ORF is a DNA sequence that contains a contiguous set of codons, each of which specifies an amino acid
• Six possible reading frames
• Good for prokaryotic systems (little or no post-transcriptional processing such as splicing)
• Runs from Met (AUG) on the mRNA to a stop codon TER (UAA, UAG, UGA)
• http://www.ncbi.nlm.nih.gov/gorf/ NCBI ORF Finder
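The ORF search described above can be sketched in a few lines of Python. This is a minimal illustration, not the NCBI ORF Finder algorithm: the function names, the `min_codons` threshold and the choice to report only ATG-to-stop stretches are assumptions made for the example.

```python
# Stop codons in DNA form (TAA, TAG, TGA correspond to UAA, UAG, UGA on the mRNA).
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse strand read 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_strand(seq, min_codons):
    """Yield (frame, start, end) for ATG..stop ORFs on one strand.
    Coordinates are 0-based with end exclusive; the stop codon is counted."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    yield frame, start, i + 3  # in-frame stop closes it
                start = None

def find_orfs(seq, min_codons=3):
    """Scan all six reading frames (three per strand)."""
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame, start, end in orfs_in_strand(s, min_codons):
            results.append((strand, frame, start, end, s[start:end]))
    return results

print(find_orfs("CCATGAAATGACC"))
```

Coordinates on the "-" strand refer to the reverse-complemented sequence. A real analysis would use a much larger `min_codons`, since short ORFs occur by chance.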
Annotation of eukaryotic genomes
[Figure: flow from sequence to function, with the annotation approach that applies at each stage]
Genomic DNA -> (transcription) -> Unprocessed RNA -> (RNA processing) -> Mature mRNA (Gm3 ... AAAAAAA) -> (translation) -> Nascent polypeptide -> (folding) -> Active enzyme -> Function (Reactant A -> Product B)
• ab initio gene prediction (without prior knowledge)
• Comparative gene prediction (use other biological data)
• Functional identification
Two Classes of Sequence Information
• Signal Terms – short sequence motifs (such as splice sites, branch points, polypyrimidine tracts, start codons, and stop codons)
• Content Terms – patterns of codon usage that are characteristic of a species and allow coding sequences to be distinguished from surrounding noncoding sequences by a statistical detection algorithm
Problems Using Codon Usage
• The program must be taught what the codon usage patterns look like by presenting it with a TRAINING SET of known coding sequences.
• Different programs search for different patterns.
• A NEW training set is needed for each species
• Untranslated regions (UTR) at the ends of the genes cannot be detected, but most programs can identify polyadenylation sites
• Non-protein-coding RNA genes cannot be detected (a few specialized programs attempt detection)
• None of these programs can detect alternatively spliced transcripts
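To make the idea of a training set concrete, here is a minimal sketch of a content measure: codon frequencies are learned from known coding sequences, and a candidate region is scored by its log-odds against a uniform background. The function names, the pseudocount and the uniform (1/64) background are assumptions for illustration; real gene finders use richer statistics such as hexamer frequencies or Markov models.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
CODON_SET = set(CODONS)

def codons(seq):
    """Split a sequence into consecutive in-frame triplets."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

def train_codon_usage(training_cds, pseudocount=1.0):
    """Estimate codon frequencies from a training set of known coding sequences."""
    counts = Counter()
    for cds in training_cds:
        counts.update(c for c in codons(cds) if c in CODON_SET)
    total = sum(counts.values()) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}

def coding_score(seq, usage):
    """Sum of log-odds of trained codon usage versus a uniform background.
    Positive scores look more like the training set than random sequence."""
    background = 1.0 / len(CODONS)
    return sum(math.log(usage[c] / background) for c in codons(seq) if c in usage)

usage = train_codon_usage(["ATGGCCGCCGCCTAA", "ATGGCCGCCTGA"])
print(coding_score("GCCGCCGCC", usage), coding_score("TTTTTTTTT", usage))
```

Because the frequencies come entirely from the training sequences, a model trained on one species will score another species poorly, which is why a new training set is needed each time.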
Explanation of False Positive/Negative in Gene Prediction Programs
Gene finding: Issues
• Issues regarding gene finding in general
• Genome size (larger genome ~ more genes, but …)
• Genome composition
• Genome complexity (more complexity -> less coding density; fewer genes per kb)
• cis-splicing (processing of mRNA in eukaryotes)
• trans-splicing (in kinetoplastids)
• alternative splicing (e.g. in different tissues; higher organisms)
• Variation of the genetic code from the universal code
Gene finding: genome
• Genome composition
  • Long ORFs tend to be coding
  • GC-rich genomes contain more putative ORFs, because the stop codons (UAA, UAG and UGA) are AT-rich and occur less often by chance
• Genome complexity
  • Simple repetitive sequences (e.g. dinucleotide) and dispersed repeats tend to be non-coding
  • May need to mask the sequence prior to gene prediction
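The effect of GC content on chance ORFs can be worked through numerically. The sketch below assumes independent bases with P(A) = P(T) and P(G) = P(C), which is a simplification of real genomes; the function names are made up for the example.

```python
def stop_codon_probability(gc):
    """Chance that a random codon is TAA, TAG or TGA, given GC fraction `gc`.
    Assumes independent bases with P(A) = P(T) and P(G) = P(C)."""
    at = (1.0 - gc) / 2.0   # probability of A (and of T)
    g = gc / 2.0            # probability of G (and of C)
    return at * at * at + 2 * at * at * g   # TAA + (TAG + TGA)

def expected_random_orf_codons(gc):
    """Mean number of codons until a stop appears by chance (geometric distribution)."""
    return 1.0 / stop_codon_probability(gc)

for gc in (0.3, 0.5, 0.7):
    print(gc, round(expected_random_orf_codons(gc), 1))
```

Under these assumptions, a stop is expected by chance about every 21 codons at 50% GC, but only about every 52 codons at 70% GC. Length alone is therefore a weaker signal of coding potential in GC-rich genomes.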
Gene finding: coding density
As the coding/non-coding length ratio decreases, exon prediction becomes more complex
[Figure: gene structures compared for Human, Fugu, worm and E. coli]
Gene finding: splicing
• cis-splicing of genes
  • Finding multiple (short) exons is harder than finding a single (long) exon.
• trans-splicing of genes
  • A trans-splice acceptor is no different from a normal splice acceptor
[Figure labels: worm, E. coli]
Gene finding: alternative splicing
• Alternatively spliced transcripts (isoforms) are very difficult to predict.
[Figure labels: Human A, Human B, Human C]
ab initio prediction
What is ab initio gene prediction?
• Prediction from first principles using the raw DNA sequence only.
• Requires ‘training sets’ of known gene structures to generate statistical tests for the likelihood of a prediction being real.
Gene finding: ab initio • What features of an ORF can we use? • Size - large open reading frames • DNA composition - codon usage / 3rd position codon bias • Kozak sequence CCGCCAUGG • Ribosome binding sites • Termination signal (stops) • Splice junction boundaries (acceptor/donor)
Gene finding: features
Think of a CDS gene prediction as a linear series of sequence features:
• Initiation codon
• Coding sequence (exon)
• Repeated N times: Splice donor (5’), Non-coding sequence (intron), Splice acceptor (3’), Coding sequence (exon)
• Termination codon
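The linear series of features can be turned into a simple consistency check on a candidate gene model. The sketch below is hypothetical: the function name and coordinate convention are invented, it handles the forward strand only, and it assumes the canonical GT donor and AG acceptor dinucleotides, which the slide does not spell out.

```python
STOPS = {"TAA", "TAG", "TGA"}

def check_gene_model(genome, exons):
    """exons: sorted (start, end) pairs, 0-based, end exclusive, forward strand.
    Returns a list of problems found; an empty list means the model passed."""
    genome = genome.upper()
    problems = []
    cds = "".join(genome[s:e] for s, e in exons)
    if not cds.startswith("ATG"):
        problems.append("no initiation codon")
    if len(cds) % 3 != 0:
        problems.append("CDS length not a multiple of 3")
    if cds[-3:] not in STOPS:
        problems.append("no termination codon")
    if any(cds[i:i + 3] in STOPS for i in range(0, len(cds) - 3, 3)):
        problems.append("in-frame stop before the end")
    # Each intron lies between the end of one exon and the start of the next.
    for (_, donor), (acceptor, _) in zip(exons, exons[1:]):
        if genome[donor:donor + 2] != "GT":
            problems.append(f"intron starting at {donor} lacks GT donor")
        if genome[acceptor - 2:acceptor] != "AG":
            problems.append(f"intron ending at {acceptor} lacks AG acceptor")
    return problems

genome = "ATGGCC" + "GTAAAAAG" + "GCCTAA"   # exon, intron, exon
print(check_gene_model(genome, [(0, 6), (14, 20)]))
```

A check like this only confirms that a model is structurally plausible. It says nothing about whether the gene is real.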
A model ab initio predictor
• Locate and score all sequence features used in gene models
• Use dynamic programming to build the highest-scoring model from the available features.
  • e.g. Genefinder (Green)
• Run the sequence 5’->3’ through a Markov model based on a typical gene model
  • e.g. TBparse (Krogh), GENSCAN (Burge) or GLIMMER (Salzberg)
• Run the sequence 5’->3’ through a neural net trained with confirmed gene models
  • e.g. GRAIL (Oak Ridge)
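The dynamic programming step can be illustrated with a toy version: given candidate exons that have already been scored, pick the non-overlapping set with the highest total score. This is a sketch of the general idea only, not how Genefinder works; the function name and the scores are invented. Real predictors also enforce reading-frame compatibility and score introns and splice sites.

```python
def best_exon_chain(candidates):
    """candidates: list of (start, end, score) with end exclusive.
    Returns (total_score, chain) for the best set of non-overlapping exons."""
    exons = sorted(candidates, key=lambda e: e[1])
    best = []  # best[i] = (score, chain) of the best chain ending with exons[i]
    for i, (start, end, score) in enumerate(exons):
        top_score, top_chain = 0.0, []
        for j in range(i):
            # exons[j] may precede exons[i] only if it ends before exons[i] starts
            if exons[j][1] <= start and best[j][0] > top_score:
                top_score, top_chain = best[j]
        best.append((top_score + score, top_chain + [exons[i]]))
    return max(best, key=lambda b: b[0]) if best else (0.0, [])

candidates = [(0, 100, 5.0), (50, 150, 8.0), (120, 200, 4.0), (160, 220, 3.0)]
print(best_exon_chain(candidates))
```

In this example, the chain of the second and fourth candidates (total 11.0) beats the chain of the first and third (total 9.0), even though the first candidate starts earlier.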
Ab initio Gene finding programs
• Most gene finding software packages use some variant of Hidden Markov Models (HMM).
• Predict coding, intergenic, and intron sequences
• Need to be trained on a specific organism.
• Never perfect!
What is an HMM? • A statistical model that represents a gene. • Similar to a “weight matrix” that can recognise gaps and treat them in a systematic way. • Has different “states” that represent introns, exons, and intergenic regions.
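A toy example may help show how the “states” are used. The sketch below runs the Viterbi algorithm over three states (intergenic, exon, intron) to find the most probable state for each base. All probabilities are invented for illustration and are not trained values. Real gene-finding HMMs have many more states, for example for codon position and splice sites.

```python
import math

STATES = ("intergenic", "exon", "intron")
# All probabilities below are invented for illustration, not trained values.
START = {"intergenic": 0.8, "exon": 0.1, "intron": 0.1}
TRANS = {
    "intergenic": {"intergenic": 0.9, "exon": 0.1, "intron": 0.0},
    "exon":       {"intergenic": 0.05, "exon": 0.9, "intron": 0.05},
    "intron":     {"intergenic": 0.0, "exon": 0.1, "intron": 0.9},
}
EMIT = {
    "intergenic": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "exon":       {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10},
    "intron":     {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40},
}

def log(p):
    """Log probability, with log(0) treated as minus infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(seq):
    """Return the most probable state path for a non-empty uppercase ACGT sequence."""
    scores = [{s: log(START[s]) + log(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for base in seq[1:]:
        row, ptr = {}, {}
        for s in STATES:
            # best previous state to arrive in state s
            prev = max(STATES, key=lambda p: scores[-1][p] + log(TRANS[p][s]))
            row[s] = scores[-1][prev] + log(TRANS[prev][s]) + log(EMIT[s][base])
            ptr[s] = prev
        scores.append(row)
        back.append(ptr)
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):   # trace back from the last base to the first
        state = ptr[state]
        path.append(state)
    return path[::-1]

print(viterbi("GCGCGCGCATATATATGCGCGCGC"))
```

The zero entries in `TRANS` encode the grammar of a gene: an intron can only be entered from, and left to, an exon.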
Malaria Gene Prediction Tools
• Hexamer – ftp://ftp.sanger.ac.uk/pub/pathogens/software/hexamer/
• Genefinder – email colin@u.washington.edu
• GlimmerM – http://www.tigr.org/softlab/glimmerm
• Phat – http://www.stat.berkeley.edu/users/scawley/Phat
• Already trained for malaria!
The more experimentally derived genes are used to train a gene prediction tool, the more reliable the gene predictor.
GlimmerM
• Salzberg et al. (1999) Genomics 59: 24-31
• Adaptation of the prokaryotic gene finder Glimmer. Delcher et al. (1999) NAR 27: 4636-4641
• Based on an interpolated Markov model (IMM).
• Uses only short chains of bases (Markov chains) to generate probabilities.
• Trained identically to Phat
An end to ab initio prediction
• ab initio gene prediction is inaccurate
  • Most predictors have high false positive rates but low false negative rates
  • Incorporating similarity information is meant to reduce the false positive rate, but at the same time it also increases the false negative rate.
  • The biggest determinant of false positives/negatives is gene size.
  • Exon prediction sensitivity can be good
• Rarely used as a final product
  • Human annotation runs multiple algorithms and scores exons predicted by multiple predictors.
• Used as a starting point for refinement/verification
  • Predictions need correction and validation
Why not just build gene models by comparative means?
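False positives and false negatives can be counted directly once a prediction is compared with a trusted annotation. The sketch below works at the nucleotide level. The function names are invented, and “specificity” is used in the sense common in gene-finding evaluations, TP / (TP + FP), which is an assumption about the intended definition.

```python
def coding_positions(exons):
    """Expand (start, end) exon intervals (end exclusive) into a set of positions."""
    return {pos for start, end in exons for pos in range(start, end)}

def nucleotide_accuracy(predicted_exons, true_exons):
    """Compare predicted and annotated coding nucleotides."""
    predicted = coding_positions(predicted_exons)
    actual = coding_positions(true_exons)
    tp = len(predicted & actual)   # coding and predicted as coding
    fp = len(predicted - actual)   # predicted as coding but not coding
    fn = len(actual - predicted)   # coding but missed by the predictor
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0   # gene-finding usage
    return {"TP": tp, "FP": fp, "FN": fn,
            "sensitivity": sensitivity, "specificity": specificity}

print(nucleotide_accuracy([(0, 100)], [(50, 150)]))
```

The same counts can be made per exon or per gene, which gives much lower figures because a prediction only counts as correct when every boundary matches.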
Annotation of eukaryotic genomes
[Figure: flow from sequence to function, with the annotation approach that applies at each stage]
Genomic DNA -> (transcription) -> Unprocessed RNA -> (RNA processing) -> Mature mRNA (Gm3 ... AAAAAAA) -> (translation) -> Nascent polypeptide -> (folding) -> Active enzyme -> Function (Reactant A -> Product B)
• ab initio gene prediction (without prior knowledge)
• Comparative gene prediction (use other biological data)
• Functional identification
If a cell was human?
• The cell ‘knows’ how to splice a gene together.
• We know some of these signals, but not all and not all of the time
• So compare with known examples from the species and others
Central dogma of molecular biology: DNA (Genome) -> RNA (Transcriptome) -> Protein (Proteome)
When a human looks at a cell
• DNA: extract DNA and sequence the genome
• RNA: extract RNA, reverse transcribe and sequence cDNA
• Protein: peptide sequence inferred from gene prediction
• Compare with the rest of the genome/transcriptome/proteome data
Comparative gene prediction
• Use knowledge of known coding sequences to identify coding regions of genomic DNA by similarity
  • transcriptome - transcribed DNA sequence
  • proteome - peptide sequence
  • genome - related genomic sequence
Transcript-based prediction: datasets • Generation of large numbers of Expressed Sequence Tags (ESTs) • Quick, cheap but random • Subtractive hybridisation to find rare transcripts • Use multiple libraries for different life-stages/conditions • Single-pass sequence prone to errors • Generation of small number of full length cDNA sequences • Slow and laborious but focused • Large-scale sequencing of (presumed) full length cDNAs • Systematic, multiplexed cloning/sequencing of CDS • Expensive and only viable if part of bigger project
Gene Prediction in Eukaryotes – Simplified
• For highly conserved proteins:
  • Translate the DNA sequence in all 6 reading frames
  • Use BLASTX or FASTX to compare the sequence to a protein sequence database
• Or:
  • Compare a protein against a nucleic acid database, including genomic sequence, that is translated in all six possible reading frames by the TBLASTN or TFASTX/TFASTY programs.
• Note: this gives an approximation of the gene structure only.
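Translation in all six reading frames is the step that programs such as BLASTX perform internally before the protein comparison. A minimal sketch using the standard genetic code follows; the function names and the frame labels (+1 to +3, -1 to -3) are choices made for this example, and "X" marks any codon containing a non-ACGT character.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(seq):
    """Translate consecutive codons; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frame_translation(seq):
    """Return a dict of frame label -> peptide for both strands."""
    seq = seq.upper()
    rev = seq.translate(COMPLEMENT)[::-1]   # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames

for label, peptide in sorted(six_frame_translation("ATGGCCTAA").items()):
    print(label, peptide)
```

Organisms that deviate from the universal code would need a different `AMINO` string.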
Transcript-based prediction: How it works
• Align transcript data to genomic sequence using a pair-wise sequence comparison
[Figure: gene model with aligned EST and cDNA sequences]
Transcript-based gene prediction: algorithm
• BLAST (Altschul) (36 hours)
  • Widely used and understood
  • HSPs often have ‘ragged’ ends, so alignments can extend past the exon boundary into the introns
• EST_GENOME (Mott) (3 days)
  • Dynamic programming post-process of BLAST
  • Slow and sometimes cryptic
• BLAT (Kent) (1/2 hour)
  • Next generation of alignment algorithm
  • Designed for looking at nearly identical sequences
  • Faster and more accurate than BLAST
Peptide-based gene prediction: algorithm
• BLAST (Altschul)
  • Widely used and understood
• Smith-Waterman
  • Preliminary to further processing
• Used in preference to DNA-based similarities for evolutionarily diverged species, as peptide conservation is significantly higher than nucleotide conservation
Genomic-based gene prediction: algorithm
• BLAST (Altschul)
  • Can be used in TBLASTX mode
• BLAT (Kent)
  • Can be used in a translated DNA vs translated DNA mode
  • Significantly faster than BLAST
• WABA (Kent)
  • Designed to allow for 3rd position codon wobble
  • Slow with some outstanding problems
  • Only really used in C. elegans v C. briggsae analysis
Comparative gene predictors
• These can be viewed as an extension of the ab initio prediction tools, where coding exons are defined by similarities and not codon bias
• GAZE (Howe) is an extension of Phil Green’s Genefinder in which transcript data is used to define coding exons. Other features are scored as in the original Genefinder implementation. This is being evaluated and used in the C. elegans project.
• GENEWISE (Birney) is an HMM-based gene predictor which attempts to predict the closest CDS to a supplied peptide sequence. This is the workhorse predictor for the ENSEMBL project.
Comparative gene predictors
• A new generation of comparative gene prediction tools is being developed to utilise the large amount of genomic sequence available.
• Twinscan (WashU) attempts to predict genes using related genomic sequences.
• Doublescan (Sanger) is an HMM-based gene predictor which attempts to predict 2 orthologous CDSs from genomic regions pre-defined as matching.
• Both of these predictors are in development and will be used for the C. elegans v C. briggsae match and the Mouse v Human match later this year.
Summary
• Genes are complex structures that are difficult to predict with the required level of accuracy/confidence
• We can predict stops better than starts
• We can only give gross confidence levels to predictions (i.e. confirmed, partially confirmed or predicted)
• Gene prediction is only part of the annotation procedure
• Movement from ab initio to comparative methodology as sequence data becomes available/affordable
• Curation of gene models is an active process – the set of gene models for a genome is fluid and WILL change over time.
The Annotation Process
[Figure labels: DNA SEQUENCE, ANALYSIS SOFTWARE, Annotator, Useful Information]
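Annotation Process
[Figure: analyses applied to a DNA sequence and the features they identify]
• Analyses of the DNA sequence: Gene finders, Blastn, Blastx, Halfwise, tRNA scan, RepeatMasker
• Features identified: Repeats, Promoters, rRNA, Pseudo-genes, Genes, tRNA
• Analyses of predicted proteins: Fasta, BlastP, Pfam, Prosite, Psort, SignalP, TMHMM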
Artemis • Artemis is a free DNA sequence viewer and annotation tool that allows visualization of sequence features and the results of analyses within the context of the sequence, and its six-frame translation. • http://www.sanger.ac.uk/Software/Artemis/
atcttttacttttttcatcatctatacaaaaaatcatagaatattcatcatgttgtttaaaataatgtattccattatgaactttattacaaccctcgttatcttttacttttttcatcatctatacaaaaaatcatagaatattcatcatgttgtttaaaataatgtattccattatgaactttattacaaccctcgtt tttaattaattcacattttatatctttaagtataatatcatttaacattatgttatcttcctcagtgtttttcattattatttgcatgtacagtttatca tttttatgtaccaaactatatcttatattaaatggatctctacttataaagttaaaatctttttttaattttttcttttcacttccaattttatattccg cagtacatcgaattctaaaaaaaaaaataaataatatataatatataataaataatatataataaataatatataatatataataaataatatataatat ataatatataataaataatatataatatataatatataataaataatatataataaataatatataatatataatatataatactttggaaagattattt atatgaatatatacacctttaataggatacacacatcatatttatatatatacatataaatattccataaatatttatacaacctcaaataaaataaaca tacatatatatatataaatatatacatatatgtatcattacgtaaaaacatcaaagaaatatactggaaaacatgtcacaaaactaaaaaaggtattagg agatatatttactgattcctcatttttataaatgttaaaattattatccctagtccaaatatccacatttattaaattcacttgaatattgttttttaaa ttgctagatatattaatttgagatttaaaattctgacctatataaacctttcgagaatttataggtagacttaaacttatttcatttgataaactaatat tatcatttatgtccttatcaaaatttattttctccatttcagttattttaaacatattccaaatattgttattaaacaagggcggacttaaacgaagtaa ttcaatcttaactccctccttcacttcactcattttatatattccttaatttttactatgtttattaaattaacatatatataaacaaatatgtcactaa taatatatatatatatatatatatatatatatattataaatgttttactctattttcacatcttgtccttttttttttaaaaatcccaattcttattcat taaataataatgtattttttttttttttttttttttttattaattattatgttactgttttattatatacactcttaatcatatatatatatttatatat atatatatatatatatatatattattcccttttcatgttttaaacaagaaaaaaaactaaaaaaaaaaaaaataataaaatatatttttataacatatgt attattaaaatgtatatataaaaatatatattccatttattattatttttttatatacattgttataagagtatcttctcccttctggtttatattacta ccatttcactttgaacttttcataaaaattaatagaatatcaaatatgtataatatataacaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaata tatatatatatatatatacatataatatatatttcatctaatcatttaaaattattattatatattttttaaaaaatatatttatgataacataaaaaga atttaattttaattaaatatatataattacatacatctaatattattatatatatataataagttttccaaatagaatacttatatattatatatatata 
tatatatatatatattcttccataaaaagaataaaataaaataaaaacaccttaaaagtatttgtaaaaaattccccacattgaatatatagttgtattt ataaaattaaagaaaaagcataaagttaccatttaatagtggagattagtaacattttcttcattatcaaaaatatttatttcctaattttttttttttg taaaatatatttaaaaatgtaatagattatgtattaaataatataaatatagcaaaatgttcaattttagaaatttgcctctttttgacaaggataattc aaaagatacaggtaaaaaaaaaaaaataaagtaaaacaaaacaaaacaaaaaacaaaaaaaaaaaaaaaaaaaaaaatgacatgttataatataatataa taaataaaaattatgtaatatatcataatcgaagaaacatatatgaaaccaaaaagaaacagatcttgatttattaatacatatataactaacattcata tctttatttttgtagatgatataaaaaattttataaactcttatgaagggatatatttttcatcatccaataaatttataaatgtatttctagacaaaat tctgatcattgatccgtcttccttaaatgttattacaataaatacagatctgtatgtagttgatttcctttttaatgagaaaaataagaatcttattgtt ttagggtaatgaaatatatatagatttatatttttatttatttattatatattattttttaatttttcttttatatatttattttatttagtgtataaaa tgatatcctttatatttatatttacatgggatattcaaataataacaaaaatgagtatacacatatatatatatatatatatatatatgtatattttttt tttttttttatgttcctataggaaagggaagaattcactgatttgtagtgtttacaatattagggaatgcaactttacacttttgaaaaaaattcagtta agcaaaaatattaataacattaaaaagacactgatagcaaaatgtaatgaatatataataacattagaaaataagaaaattactttttatttcttaaata aagattatagtataaatcaaagtgaattaatagaagacggaaaagaacttattgaaaatatctatttgtcaaaaaatcatatcttgttagtaataaaaaa ttcatatgtatatatataccaattagatattaaaaattcccatattagttatacacttattgatagtttcaatttaaatttatcctacctcagagaatct ataaataataaaaaaaagcatataaataaaataaatgatgtatcaaataatgacccaaaaaaggataataatgaaaaaaatacttcatctaataatataa cacataacaattataatgacatatcaaataataataataataataataatattaatggggtgaaagaccatataaataataacactctggaaaataatga tgaaccaatcttatctatatataatgaagatcttaatgttttatatatatgccaaaatatgtataacgtcctttttgttttgaatttaaataacctaagt
DNA in Artemis
[Figure: Artemis window showing a GC content plot, forward translations, reverse translations, and the DNA with amino acids. Black bar = stop codon]