
Experiences and suggestions for the annotation of tomato BAC clones

2005-09-28. Dr. Cheol-Goo Hur, Plant Genome Lab., Genome Research Center, KRIBB, Korea. Contents: Phase-I Annotation, Define gene structures, Sample Annotations, Future Works, Acknowledgements.


Presentation Transcript


  1. Experiences and suggestions for the annotation of tomato BAC clones 2005-09-28 Dr. Cheol-Goo Hur Plant Genome Lab. Genome Research Center KRIBB, Korea

  2. Contents • Phase-I Annotation • Define gene structures • Sample Annotations • Future Works • Acknowledgements

  3. Phase-I Annotation *1. Version 0.9 March 31, 2005 *2. KRIBB

  4. Functional Annotation *1. Automated annotation should be converted to GO codes for easier comparison. *2. Classify gene annotations into 5 classes based on sequence similarity and availability of expression data (known / putative / similar to / expressed / no evidence). *3. Use the full Arabidopsis protein set to maximize the number of GO-assigned genes.

  5. Data Set for gene structure and annotation (Aug. 2005) • BACs • Sequenced: 29 (4 BACs overlapping in 2 pairs) • Annotated: 22 • ESTs : 200 015 (cf: Potato 193 233, Pepper 115 598) • Full-length mRNAs (GenBank): 596 • Full-length Proteins (UniProt 5.1): 1 044 • Protein DB (UniProt Release 5.1) • Swiss-PROT/TREMBL: 181 821 / 1 748 002 • Arabidopsis Proteins • GO associated (TAIR): 26 196 • Pathway/EC associated (KEGG): 1 520

  6. Defining the Gene Structure • New genomes, new challenges: lack of data • To get the best performance from the available data, a well-combined method is needed • Combine gene models based on experimental data • Extend the gene boundaries and fill in the missing parts with predicted gene models • Final manual curation • Example: EuGene for Medicago genome annotation

  7. Structure of Protein Coding Genes [Figure: gene model diagram. Labels: P, CpG, TSS, TIS at ATG (Met), CDS, intron splicing signal GT---AG, Stop (TAA, TAG, TGA), poly-A site (AATAAA), transcripts (alternative spliced forms, ESTs)]
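The sequence signals labelled on slide 7 (ATG start, TAA/TAG/TGA stops, GT---AG intron ends) can be checked mechanically on a candidate gene model. The following is a minimal sketch, not the KRIBB pipeline: the function names and the 0-based, end-exclusive coordinate convention are assumptions made for illustration, and it only tests the canonical signals.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons listed on the slide


def has_canonical_splice_sites(genome, intron_start, intron_end):
    """Return True if the intron begins with GT and ends with AG.

    Coordinates are 0-based and end-exclusive (an assumption of this sketch).
    """
    intron = genome[intron_start:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def is_complete_cds(cds):
    """Return True if the spliced CDS starts with ATG, ends with a stop
    codon, has a length divisible by 3, and has no in-frame internal stop."""
    cds = cds.upper()
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False
    if not cds.startswith("ATG") or cds[-3:] not in STOP_CODONS:
        return False
    # Scan the codons between the start codon and the final stop codon.
    return all(cds[i:i + 3] not in STOP_CODONS for i in range(3, len(cds) - 3, 3))


# Toy example: two exons separated by one GT...AG intron.
exon1, intron, exon2 = "ATGGCC", "GTAAGTTTTCAG", "GGATAA"
genome = exon1 + intron + exon2
```

Non-canonical introns (for example GC-AG) exist in plant genomes, so a failed check here is a flag for manual curation rather than proof of a wrong model.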

  8. 1. Define gene structure from several kinds of evidence [Figure labels: Predict, mRNA, EST, Protein] • Full-length evidenced genes (mRNAs / Proteins) • Full-length clue evidenced genes (Full-length clue ESTs from Kazusa full-length cDNA library) • Partially evidenced genes (Other partial ESTs) • No-evidenced genes (Prediction only)

  9. 1) Full-length Evidenced Genes [Figure labels: Predict, mRNA, Protein] • Gene locus with full-length mRNA / Protein (GMAP, GeneWise) • Almost complete gene structure: gene boundary (mRNA: TSS/poly-A, protein: CDS), Exon/Intron, (some alternative splicing structure) • Requirement: more than 1 mRNA or protein • Processing: • Merge the same AS forms • mRNA evidence: predict CDS (ESTscan etc.) • Protein evidence: mend gene boundary (TSS, poly-A)

  10. 2) Full-length Clue Evidenced Genes [Figure labels: Predict, EST] • Gene locus with full-length clue ESTs from Kazusa full-length cDNA library (GMAP) • Gene boundary (TSS, poly-A), some Exon/Intron • Requirement: more than 1 full-length clue EST • Processing: • Merge the same AS forms • Link the same-cloned ESTs • Mend incomplete portion with predicted model • CDS to be predicted (ESTscan / orfPredictor etc.)

  11. 3) Partially Evidenced Genes [Figure labels: Predict, EST1, EST2] • Gene locus with general ESTs (GMAP) • Some Exon/Intron, poly-A • More ESTs, more information expected • Requirement: more than 2 ESTs with more than 2 pairs of overlapping hard-edges • Processing: • Merge the same AS forms • Link the same-cloned ESTs • Mend incomplete portion with predicted model • CDS to be predicted (ESTscan / orfPredictor etc.)
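The "Merge the same AS forms" step that appears on slides 9 to 11 can be illustrated by grouping aligned transcripts that share an identical intron chain. This is a sketch under stated assumptions, not the method used at KRIBB: the exon representation as `(start, end)` tuples, the function names, and the choice to leave single-exon alignments unmerged are all hypothetical.

```python
from collections import defaultdict


def intron_chain(exons):
    """Return the introns of an alignment as a tuple of (donor, acceptor) pairs."""
    exons = sorted(exons)
    return tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))


def merge_same_forms(alignments):
    """Merge alignments that have exactly the same intron chain.

    Each alignment is a list of (start, end) exons on one strand of one locus.
    The merged model keeps the shared introns and takes the outermost 5' and
    3' boundaries seen in the group. Single-exon alignments are returned as is.
    """
    groups = defaultdict(list)
    singles = []
    for exons in alignments:
        chain = intron_chain(exons)
        if chain:
            groups[chain].append(sorted(exons))
        else:
            singles.append(sorted(exons))

    merged = []
    for chain, members in groups.items():
        start = min(m[0][0] for m in members)
        end = max(m[-1][1] for m in members)
        # Interleave the outer boundaries with the shared intron coordinates.
        bounds = [start] + [pos for intron in chain for pos in intron] + [end]
        merged.append([(bounds[i], bounds[i + 1]) for i in range(0, len(bounds), 2)])
    return merged + singles


est_a = [(100, 200), (300, 400)]
est_b = [(150, 200), (300, 450)]   # same intron as est_a, different ends
est_c = [(100, 200), (320, 400)]   # different acceptor: a separate AS form
models = merge_same_forms([est_a, est_b, est_c])
```

Here `est_a` and `est_b` collapse into one model, while `est_c` stays separate as an alternative form. A partial EST whose introns are only a subset of a longer chain is not merged by this sketch; handling that case would need an additional compatibility test.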

  12. 4) No-evidenced Genes [Figure label: Predict] • Predicted model only (hypothetical gene) • Predicted CDS
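The four evidence categories on slides 9 to 12 form a priority order, which could be sketched as a small decision function. The function name, argument names, and default thresholds below are assumptions: in particular, the slides say "more than 1" and "more than 2", and this sketch reads those as "at least 1" and "at least 2", which may not match the authors' intent. The thresholds are parameters so that the stricter reading can be used instead.

```python
def evidence_tier(full_length=0, fl_clue_ests=0, ests=0, shared_edge_pairs=0,
                  min_full_length=1, min_fl_clue=1, min_ests=2, min_edge_pairs=2):
    """Assign a gene locus to one of the four evidence categories.

    full_length       -- number of full-length mRNAs or proteins aligned
    fl_clue_ests      -- number of full-length clue ESTs aligned
    ests              -- number of other (partial) ESTs aligned
    shared_edge_pairs -- number of overlapping hard-edge pairs among the ESTs
    The min_* defaults are this sketch's interpretation, not values from the slides.
    """
    if full_length >= min_full_length:
        return "full-length evidenced"
    if fl_clue_ests >= min_fl_clue:
        return "full-length clue evidenced"
    if ests >= min_ests and shared_edge_pairs >= min_edge_pairs:
        return "partially evidenced"
    return "no evidence"
```

Checking the strongest evidence first means a locus with both a full-length mRNA and partial ESTs is reported once, under the most informative category.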

  13. 2. Transcript-Genome mappers • Tested BLAT / SIM4 / GMAP / GeneSeqer • BLAT: fast but inaccurate • SIM4 / GMAP / GeneSeqer: approximately the same results • KRIBB: prefilter ESTs with BLAT, then map with GMAP • Cutoff: Coverage > 80%, Identity > 92%
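The coverage and identity cutoff on slide 13 amounts to a simple filter over alignment records. The sketch below assumes coverage means aligned query length divided by query length and identity means matching bases divided by aligned length; the slides do not define either term, and the `Hit` class and its field names are hypothetical rather than the output format of BLAT or GMAP.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query_id: str
    query_len: int    # full length of the EST or mRNA
    aligned_len: int  # number of query bases included in the alignment
    matches: int      # number of identical bases within the alignment

    @property
    def coverage(self):
        return self.aligned_len / self.query_len if self.query_len else 0.0

    @property
    def identity(self):
        return self.matches / self.aligned_len if self.aligned_len else 0.0


def passes_cutoff(hit, min_coverage=0.80, min_identity=0.92):
    """Keep a hit only if both values are strictly above the cutoffs."""
    return hit.coverage > min_coverage and hit.identity > min_identity


hits = [
    Hit("est1", 500, 450, 430),  # coverage 0.90, identity about 0.956
    Hit("est2", 500, 380, 375),  # coverage 0.76: too short
    Hit("est3", 500, 450, 400),  # identity about 0.889: too divergent
]
kept = [h.query_id for h in hits if passes_cutoff(h)]
```

The same function could be reused for the protein prefilter on slide 16 by passing `min_identity=0.80`.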

  14. Problem of repeats and similarity? Or misassembly?

  15. Similarity cutoff needed

  16. 3. Protein-based Gene Models • GeneWise / FGENESH+ • KRIBB: GeneWise after prefiltering Proteins by BLASTx • BLASTx Cutoff: Coverage>80%, Identity>80%

  17. Sample Annotations: define gene structure and annotation

  18. 1) Full-length Evidenced Gene: C02HBa0025N15.220 • mRNA/Protein evidence • Annotation • Product: SNF1 [Lycopersicon esculentum] • IPR000719 Prot_kinase • GO:0006468(P) protein amino acid phosphorylation • GO:0004672(F) protein kinase activity • EC:2.7.1.-: Snf1-related protein kinase (KIN10) (SKIN10) • TMHMM: outside

  19. 2) Full-length Evidenced Gene: C02HBa0066C13.60 • Protein evidence • Annotation • Product: phytochrome E [Lycopersicon esculentum] • IPR001294 Phytochrome • GO:0006355(P) regulation of transcription, DNA-dependent • GO:0008020(F) G-protein coupled photoreceptor activity • TargetP/TMHMM: C/outside • FunCat: 30.01 intracellular signalling 70.01 cell wall

  20. 3) Full-length Clue Evidenced Gene: C02HBa0060J03.170 • Kazusa full-length cDNA/EST evidence • Annotation • Product: putative protein [Arabidopsis thaliana] • IPR001251: CRAL_bd_TRIO_C • TMHMM: outside [Figure labels: ~1 kb, 3 exons]

  21. 4) Partially Evidenced Gene: C02HBa0060J03.90 • EST evidence • Annotation • Product: putative protein [Arabidopsis thaliana] • IPR000719 Prot_kinase • GO:0006468(P) protein amino acid phosphorylation • GO:0004672(F) protein kinase activity • GO:0016020(C) membrane • TMHMM: outside

  22. 5) Gene with alternative splicing: C02HBa0060J03.40-4 • EST evidence • Annotation • Product: transformer-SR ribonucleoprotein [N.tabacum] • IPR000504 RNA-binding region RNP-1 • GO:0003676(F) nucleic acid binding • GO:0030529(C) ribonucleoprotein complex • TargetP/TMHMM: C/outside

  23. Annotation Results *1. All values from annotated 22 BACs.

  24. Future Works • Training data set for Tomato gene HMM models • Automation • Performance assessment • Manual curation (Apollo)

  25. Acknowledgement

  26. http://sol.kribb.re.kr Thank you for your attention!
