Finding Genes In a Genome Cédric Notredame
Naked Genomes are Useless -Experimental Methods -ESTs, THS, DNA Chips… -Computational Methods -Homology, Ab-Initio Useful Genome Accurate Annotation
ANNOTATION -Where are the genes ? -What do they do: Biochemistry ? -When do they do it: Regulation ? -Who do they do it for: Metabolic ?
Outline: 1. Cleaning the genome 2. Similarity methods 3. Experimental methods 4. Ab-initio methods 5. How good are the methods? (Prokaryotes, then Eukaryotes) Naked Genome => Fully Dressed Sequence
Outline Prokaryotes Eukaryotes
Gene Fishing in Prokaryotic Genomes
What is a Prokaryotic Gene? Gene: Promoter, RBS, ATG, ORF, STOP, Terminator. Gene -> mRNA -> Protein
1-Ab-initio: -ORFing -Codon Bias 2-Homology Based Methods 3-Regulatory Sequence Detection -Non Coding -Short Genes (gene diagram: Promoter, RBS, mRNA, STOP, Terminator)
Prokaryotic Genomes: In a prokaryotic genome, any ORF longer than 300 nt can SAFELY be considered to be a gene -High Gene Density: Haemophilus influenzae: 85% -No Introns -Operons
Prokaryotic Genomes Clean-up ORFing Gene Prediction Homology Search Promoter Detection
Cleaning Your DNA Sequence
Cleaning a DNA Sequence -Cloning may lead to the inclusion of Vector Sequences. -These sequences must be removed Is My Sequence Contaminated ?
Paste in your new sequence
Our sequence displays two vector contaminations Crop
Contamination Matters BUT a genuine genome may contain similarity to the cloning vector (antibiotic resistance). Contaminations Look Like Horizontal Transfers -Wrong Phylogeny -Error Propagation in Secondary Databases -Eukaryote Genomes can also be cleaned this way
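The screening-and-cropping step above can be illustrated with a toy example. This is only a sketch under stated assumptions: it flags regions of a query that share exact k-mers with a single known vector sequence on the forward strand. Real screening tools rely on alignment against a vector database, and the function names, the value of `k` and the sequences used here are hypothetical.

```python
def vector_hits(query, vector, k=12):
    """Return merged (start, end) intervals of `query` covered by k-mers
    that also occur in `vector` (forward strand only, exact matches only)."""
    q, v = query.upper(), vector.upper()
    vector_kmers = {v[i:i + k] for i in range(len(v) - k + 1)}
    hits = []
    for i in range(len(q) - k + 1):
        if q[i:i + k] in vector_kmers:
            if hits and i <= hits[-1][1]:
                # Overlaps or touches the previous hit: extend it.
                hits[-1] = (hits[-1][0], i + k)
            else:
                hits.append((i, i + k))
    return hits


def retained_segments(query, hits):
    """Return the non-empty pieces of `query` left after cropping the hits."""
    segments, pos = [], 0
    for start, end in hits:
        if start > pos:
            segments.append(query[pos:start])
        pos = end
    if pos < len(query):
        segments.append(query[pos:])
    return segments
```

A match is not proof of contamination: as noted above, a genuine genome may carry vector-like sequence (for example antibiotic resistance genes), so flagged regions deserve a second look before cropping.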
ORFing Prokaryotic Genomes
Prokaryotic Genomes: ORFing Where are the ORFs In my Sequence ?
Prokaryotic Genomes: ORFing STOP Codons ATG (Start) Codons
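The ORFing idea (find an ATG start codon, read in frame until a STOP codon, keep ORFs above a length cutoff) can be sketched as follows. This is a minimal illustration, not the algorithm used by the NCBI ORF finder: it assumes ATG as the only start codon and the standard stop codons, scans all six frames, and uses the 300 nt cutoff mentioned earlier as a default. The function names are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=300):
    """Return (strand, start, end, length) for every ATG...STOP ORF of at
    least `min_len` nt. Coordinates are 0-based, end-exclusive, relative to
    the strand being scanned; the length includes the stop codon."""
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG after the previous stop
                elif start is not None and codon in STOPS:
                    length = i + 3 - start
                    if length >= min_len:
                        orfs.append((strand, start, i + 3, length))
                    start = None
    return orfs
```

Lowering `min_len` finds shorter candidates but also more random ORFs, which is why the length cutoff alone misses genuinely short genes.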
Prokaryotic Genomes: GORF www.ncbi.nih.gov/gorf/gorf.html
Prokaryotic Genomes: GORF TO COG TO BLAST
GORF: Can You Trust it? Random ORF: Random 3rd Position. Real ORF: Biased 3rd Position
Prokaryotic Genomes: GORFing cDNAs Works with Bacterial Genomes Good enough for ~85% proteome Works with Eukaryotic cDNA BUT… -Will NOT detect SHORT genes -Will NOT detect Non Coding Genes
Ab-Initio Gene Predictions In Prokaryotic Genomes
Predicting Genes What are the sequences in my genome that LOOK LIKE Genes
Using The Codon Biases: Coding Regions Do NOT Look Like Random DNA: -Codon Bias
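One simple way to look at codon bias, and at the biased 3rd position mentioned for real ORFs, is to tabulate codon frequencies and the base composition at third codon positions. The sketch below assumes the input is already in frame; GC content at the third position is just one possible summary statistic and is not taken from the slides.

```python
from collections import Counter


def codon_usage(orf):
    """Relative frequency of each codon in an in-frame sequence
    (a trailing incomplete codon is ignored)."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    counts = Counter(orf[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def gc3(orf):
    """Fraction of G or C at third codon positions."""
    thirds = orf.upper()[2::3]
    if not thirds:
        return 0.0
    return sum(base in "GC" for base in thirds) / len(thirds)
```

Comparing these numbers for a candidate ORF against those of known genes from the same genome gives a rough indication of whether the ORF looks like coding sequence or like random DNA.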
Predicting Genes ALL the characteristics of a Gene can be Built into a model Hidden Markov Model
Hidden Markov Model -Each Nucleotide has a STATE: Coding/Non Coding … -This STATE is HIDDEN -The HMM tries to UNCOVER the STATE of each Nucleotide.
Hidden Markov Model: The Occasionally Dishonest Casino … Observation: 122234455666125654151661661515566616166661 State : FFFFFFFFLLLLFFFFFFLLLLLLLLLLLLLLLLLLLLLLLLLLL (F = fair die, L = loaded die) -This STATE is HIDDEN in the data
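Uncovering the hidden state from the observations is usually done with the Viterbi algorithm. The sketch below applies it to the casino example. All probabilities are assumptions chosen for illustration (they do not come from the slides): the loaded die shows a six half of the time, and the casino rarely switches dice. The function expects a non-empty string of digits 1 to 6.

```python
import math

STATES = "FL"  # F = fair die, L = loaded die
START = {"F": 0.5, "L": 0.5}                      # assumed
TRANS = {"F": {"F": 0.95, "L": 0.05},             # assumed
         "L": {"F": 0.10, "L": 0.90}}
EMIT = {"F": {r: 1 / 6 for r in "123456"},        # assumed
        "L": {**{r: 0.1 for r in "12345"}, "6": 0.5}}


def viterbi(rolls):
    """Return the most probable state string (F/L) for a string of rolls."""
    # score[s]: log-probability of the best path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}
    backpointers = []
    for roll in rolls[1:]:
        prev, score, ptr = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][roll])
            ptr[s] = best
        backpointers.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))
```

A gene finder uses the same machinery with states such as coding and non-coding in place of fair and loaded, and nucleotides in place of die rolls.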
Simplified HMM for Coding Regions [figure: Start (S) -> 64 Codons -> 64 Codons -> End (E). Example emission probabilities: G: GGG 0.02, GGA 0.00, GGT 0.60, GGC 0.38; W: TGG 1.00]
Simplified HMM for Coding Regions: Emission Probabilities, Transition Probabilities
Simplified HMM for Coding Regions: HMM order 5: the 6th nucleotide depends on the 5 previous. Takes into account codon bias AND dipeptide composition. P(GGG-TGG | Model) = P(GGG) * P(GGG->TGG) * P(TGG)
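The product above can be worked through numerically. In this sketch the emission probabilities for the G and W states are the ones shown on the simplified HMM slide, while the transition probabilities between states are hypothetical placeholders, since the slides do not give them. Start and end transitions are ignored, and this is the codon-level simplification, not a full order-5 model.

```python
# Emission probabilities: P(codon | amino-acid state), from the slide.
EMISSION = {
    "G": {"GGG": 0.02, "GGA": 0.00, "GGT": 0.60, "GGC": 0.38},
    "W": {"TGG": 1.00},
}
# Transition probabilities between states: HYPOTHETICAL values.
TRANSITION = {("G", "G"): 0.10, ("G", "W"): 0.05, ("W", "G"): 0.08}

CODON_TO_STATE = {codon: aa for aa, table in EMISSION.items() for codon in table}


def sequence_probability(codons):
    """P(codon sequence | model) = emission * (transition * emission) * ..."""
    states = [CODON_TO_STATE[c] for c in codons]
    p = EMISSION[states[0]][codons[0]]
    for prev, cur, codon in zip(states, states[1:], codons[1:]):
        p *= TRANSITION[(prev, cur)] * EMISSION[cur][codon]
    return p
```

With these numbers, `sequence_probability(["GGG", "TGG"])` multiplies 0.02 (GGG), 0.05 (the assumed G->W transition) and 1.00 (TGG). Swapping GGG for the preferred glycine codon GGT raises the probability thirtyfold, which is how codon bias enters the score.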
Translate Predicted Genes into Proteins http://opal.biology.gatech.edu/GeneMark/ Text Output
GeneMark and HMM predictions Works Very Well Good enough for ~99% proteome BUT… -Will NOT detect Some SHORT genes -Will NOT detect Non Coding Genes