
Finding Genes In a Genome



Presentation Transcript


  1. Finding Genes In a Genome Cédric Notredame

  2. Naked Genome

  3. All Dressed Up!

  4. Naked Genomes are Useless
     - Experimental Methods: ESTs, THS, DNA Chips…
     - Computational Methods: Homology, Ab-Initio
     Useful Genome => Accurate Annotation

  5. ANNOTATION
     - Where are the genes?
     - What do they do: Biochemistry?
     - When do they do it: Regulation?
     - Who do they do it for: Metabolic?

  6. Outline (Prokaryotes / Eukaryotes)
     1. Cleaning the genome
     2. Similarity methods
     3. Experimental methods
     4. Ab-initio methods
     5. How good are the methods?
     Naked Genome => Fully Dressed Sequence

  7. Outline Prokaryotes Eukaryotes

  8. Gene Fishing in Prokaryotic Genomes

  9. What is a Prokaryotic Gene? [Diagram labels: Gene, RBS, Promoter, Terminator, ATG, STOP, mRNA, ORF, Protein]

  10. What is a Prokaryotic Gene: Operon

  11. [Diagram labels: RBS, Promoter, Terminator, mRNA, STOP]
      1. Ab-initio: ORFing, Codon Bias
      2. Homology Based Methods
      3. Regulatory Sequence Detection
         - Non Coding
         - Short Genes

  12. Prokaryotic Genomes
      In a prokaryotic genome, any ORF longer than 300 nt can SAFELY be considered to be a gene.
      - High gene density: Haemophilus influenzae: 85%
      - No introns
      - Operons

  13. Prokaryotic Genomes: Clean-up, ORFing, Gene Prediction, Homology Search, Promoter Detection

  14. Cleaning Your DNA Sequence

  15. Cleaning a DNA Sequence
      - Cloning may lead to the inclusion of vector sequences.
      - These sequences must be removed.
      Is My Sequence Contaminated?

  16. Paste in your new sequence

  17. Our sequence displays two vector contaminations. Crop.
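
The screen-and-crop step on slides 15 to 17 can be illustrated with a toy sketch. This is not the tool shown on the slides: it assumes a simple exact k-mer match against a known vector sequence, where a real screen would use an alignment search. The function names `find_vector_hits` and `crop`, and the default `k=12`, are hypothetical choices made for this illustration.

```python
def find_vector_hits(seq, vector, k=12):
    """Return merged (start, end) intervals of seq covered by k-mers found in vector.

    Assumption: exact k-mer sharing stands in for a real similarity search.
    """
    seq, vector = seq.upper(), vector.upper()
    vector_kmers = {vector[i:i + k] for i in range(len(vector) - k + 1)}
    hits = []
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in vector_kmers:
            if hits and i <= hits[-1][1]:
                hits[-1][1] = i + k          # overlaps or touches the previous hit: extend it
            else:
                hits.append([i, i + k])      # start a new contaminated interval
    return [tuple(h) for h in hits]


def crop(seq, hits):
    """Return seq with every (start, end) interval in hits removed."""
    kept, pos = [], 0
    for start, end in hits:
        kept.append(seq[pos:start])
        pos = end
    kept.append(seq[pos:])
    return "".join(kept)
```

As slide 18 warns, a hit is not proof of contamination: a genuine genome can share sequence with a cloning vector, so intervals reported this way deserve a manual look before cropping.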

  18. Contamination Matters
      BUT a genuine genome may contain similarity to the cloning vector (antibiotic resistance).
      Contaminations look like horizontal transfers.
      - Wrong phylogeny
      - Error propagation in secondary databases
      - Eukaryote genomes can also be cleaned this way

  19. ORFing Prokaryotic Genomes

  20. Prokaryotic Genomes: ORFing. Where are the ORFs in my sequence?

  21. Prokaryotic Genomes: ORFing. STOP Codons; ATG (Start) Codons
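
A minimal sketch of ORFing as described on these slides: scan each of the six reading frames for an ATG followed in frame by a STOP codon, and keep ORFs of at least 300 nt (the threshold from slide 12). It assumes ATG is the only start codon and that the standard stop codons TAA, TAG and TGA apply; the function names are hypothetical and this is not the GORF implementation.

```python
STOPS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_len=300):
    """Yield (start, end) for ATG...STOP ORFs of at least min_len nt in the 3 forward frames."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                        # first ATG after the previous stop
            elif start is not None and codon in STOPS:
                if i + 3 - start >= min_len:     # length includes the stop codon
                    yield start, i + 3
                start = None


def find_orfs(seq, min_len=300):
    """Return sorted (start, end, strand) tuples in forward-strand coordinates."""
    seq = seq.upper()
    n = len(seq)
    found = [(s, e, "+") for s, e in orfs_in_strand(seq, min_len)]
    rc = reverse_complement(seq)
    found += [(n - e, n - s, "-") for s, e in orfs_in_strand(rc, min_len)]
    return sorted(found)
```

Coordinates are 0-based and end-exclusive, so `seq[start:end]` gives the ORF on the `+` strand; for `-` strand hits, take the reverse complement of that slice.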

  22. Prokaryotic Genomes: ORFing

  23. Prokaryotic Genomes: GORF www.ncbi.nih.gov/gorf/gorf.html

  24. Prokaryotic Genomes: GORF

  25. Prokaryotic Genomes: GORF TO COG TO BLAST

  26. Prokaryotic Genomes: GORF

  27. GORF: Can You Trust it ??? Random ORF => Random 3rd Position; Real ORF => Biased 3rd Position
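
One rough way to look at the third-position bias mentioned here is to compare base composition at the three codon positions of a candidate ORF. The sketch below computes the GC fraction per codon position; using GC content as the measure of bias is an assumption made for illustration, and `gc_by_codon_position` is a hypothetical name.

```python
def gc_by_codon_position(orf):
    """Return the GC fraction at codon positions 1, 2 and 3 of an in-frame sequence."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3             # ignore a trailing partial codon
    fractions = []
    for pos in range(3):
        bases = orf[pos:usable:3]                # every base at this codon position
        gc = sum(base in "GC" for base in bases)
        fractions.append(gc / len(bases) if bases else 0.0)
    return tuple(fractions)
```

A third-position value that departs clearly from the other two positions, or from the genome-wide average, is the kind of signal the slide attributes to real ORFs; where to set a cut-off is left open here.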

  28. GORF: Can You Trust it ???

  29. Prokaryotic Genomes: GORFing cDNAs
      - Works with bacterial genomes: good enough for ~85% of the proteome
      - Works with eukaryotic cDNA
      BUT…
      - Will NOT detect SHORT genes
      - Will NOT detect non-coding genes

  30. Ab-Initio Gene Predictions In Prokaryotic Genomes

  31. Predicting Genes: What are the sequences in my genome that LOOK LIKE genes?

  32. Using The Codon Biases

  33. Using The Codon Biases. Coding Regions Do NOT Look Like Random DNA: Codon Bias

  34. Real Genes Use Mostly the Optimal Codons
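
To see which codons a gene favours, one can tabulate, for each amino acid, the fraction of its occurrences encoded by each synonymous codon. The sketch below assumes the standard genetic code (built from the usual TCAG-ordered table) and an in-frame coding sequence; `codon_usage` is a hypothetical helper name.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_usage(orf):
    """Return {amino_acid: {codon: fraction}} for an in-frame coding sequence."""
    orf = orf.upper()
    counts = Counter(orf[i:i + 3] for i in range(0, len(orf) - 2, 3))
    by_aa = defaultdict(dict)
    for codon, n in counts.items():
        aa = CODON_TABLE.get(codon)              # skip codons with ambiguous bases
        if aa is not None:
            by_aa[aa][codon] = n
    return {
        aa: {codon: n / sum(codons.values()) for codon, n in codons.items()}
        for aa, codons in by_aa.items()
    }
```

Run on a set of trusted genes, a table like this gives per-codon frequencies of the kind a coding-region model can use as codon probabilities.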

  35. Predicting Genes: ALL the characteristics of a gene can be built into a model: the Hidden Markov Model

  36. Hidden Markov Model
      - Each nucleotide has a STATE: Coding/Non Coding…
      - This STATE is HIDDEN
      - The HMM tries to UNCOVER the STATE of each nucleotide.

  37. Hidden Markov Model: Occasionally Dishonest Casino…
      Observation: 122234455666125654151661661515566616166661
      State      : FFFFFFFFLLLLFFFFFFLLLLLLLLLLLLLLLLLLLLLLLLLLL
      - This STATE is HIDDEN in the data
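
A minimal sketch of how an HMM can uncover the hidden fair (F) or loaded (L) state behind each roll, using Viterbi decoding. The slide gives no parameters, so the probabilities below are assumptions: the values commonly used for this textbook example (a loaded die showing 6 half the time, and a casino that rarely switches dice).

```python
import math

# Assumed parameters; none of these numbers appear on the slide.
STATES = "FL"
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {r: 1 / 6 for r in "123456"},
    "L": {**{r: 0.1 for r in "12345"}, "6": 0.5},
}


def viterbi(rolls):
    """Return the most probable F/L state string for a string of die rolls."""
    if not rolls:
        return ""
    score = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}
    back = []
    for roll in rolls[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best previous state to arrive from when entering state s.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][roll])
        back.append(pointers)
        score = new_score
    state = max(STATES, key=score.get)           # best final state
    path = [state]
    for pointers in reversed(back):              # trace back to the first roll
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

Gene finders apply the same idea with coding and non-coding states in place of the fair and loaded dice, and nucleotides or codons in place of rolls.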

  38. GeneMark

  39. Simplified HMM for Coding Regions [Diagram: states S and E around 64 codon states with codon probabilities, e.g. G: GGG 0.02, GGA 0.00, GGT 0.6, GGC 0.38; W: TGG 1.00]

  40. Simplified HMM for Coding Regions [Diagram labels: Emission Proba, Transition Proba]

  41. Simplified HMM for Coding Regions
      - HMM order 5: the 6th nucleotide depends on the 5 previous ones
      - Takes into account codon bias AND dipeptide composition
      Proba(GGG-TGG | Model) = Proba(GGG) * Proba(GGG->TGG) * Proba(TGG)
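
The product on this slide can be evaluated in log space, which avoids underflow on long sequences. In the sketch below the codon probabilities for GGG and TGG are the ones shown on slide 39, while the GGG->TGG transition value of 0.05 is a made-up number used only so the example runs; `codon_path_log_prob` is a hypothetical name.

```python
import math

# Codon probabilities from slide 39; the transition value is a hypothetical placeholder.
EXAMPLE_EMISSION = {"GGG": 0.02, "TGG": 1.00}
EXAMPLE_TRANSITION = {("GGG", "TGG"): 0.05}


def _log(p):
    """Natural log that maps a zero probability to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def codon_path_log_prob(codons, emission, transition):
    """Return log P(codons | model): codon terms plus transitions between neighbours."""
    log_p = 0.0
    for i, codon in enumerate(codons):
        log_p += _log(emission[codon])
        if i > 0:
            log_p += _log(transition[(codons[i - 1], codon)])
    return log_p
```

With these placeholder numbers, `codon_path_log_prob(["GGG", "TGG"], EXAMPLE_EMISSION, EXAMPLE_TRANSITION)` is the log of 0.02 * 0.05 * 1.00. A gene finder compares scores like this against the score of the same stretch under a non-coding model.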

  42. Translate Predicted Genes into Proteins: http://opal.biology.gatech.edu/GeneMark/ (Text Output)

  43. Non Standard FASTA

  44. GLIMMER: An alternative to GeneMark

  45. Main Problems

  46. GeneMark and HMM predictions
      - Work very well: good enough for ~99% of the proteome
      BUT…
      - Will NOT detect some SHORT genes
      - Will NOT detect non-coding genes