Experiences and suggestions for the annotation of tomato BAC clones. 2005-09-28. Dr. Cheol-Goo Hur Plant Genome Lab. Genome Research Center KRIBB, Korea. Contents. Phase-I Annotation Define gene structures Sample Annotations Future Works Acknowledgements. Phase-I Annotation.
Experiences and suggestions for the annotation of tomato BAC clones 2005-09-28 Dr. Cheol-Goo Hur Plant Genome Lab. Genome Research Center KRIBB, Korea
Contents
• Phase-I Annotation
• Define gene structures
• Sample Annotations
• Future Works
• Acknowledgements
Phase-I Annotation [table not recovered; footnotes only]
*1. Version 0.9, March 31, 2005
*2. KRIBB
Functional Annotation
*1. Automated annotation should be converted to GO codes for easier comparison.
*2. Classify gene annotations into 5 classes based on sequence similarity and availability of expression data (known / putative / similar to / expressed / no evidence).
*3. Use the full Arabidopsis protein set to maximize the number of GO-assigned genes.
Data Set for gene structure and annotation (Aug. 2005)
• BACs
  • Sequenced: 29 (4 BACs overlapping in 2 pairs)
  • Annotated: 22
• ESTs: 200 015 (cf. Potato 193 233, Pepper 115 598)
• Full-length mRNAs (GenBank): 596
• Full-length Proteins (UniProt 5.1): 1 044
• Protein DB (UniProt Release 5.1)
  • Swiss-PROT/TREMBL: 181 821 / 1 748 002
• Arabidopsis Proteins
  • GO associated (TAIR): 26 196
  • Pathway/EC associated (KEGG): 1 520
Defining the Gene Structure
• New Genomes, New Challenges: lack of data
• To get the best performance from the available data, a well-combined method is needed
  • Combine experimental data-based gene models
  • Extend the gene boundary and fill in the missing parts with predicted gene models
  • Final manual curation
• Ex) EuGene for Medicago Genome Annotation
Structure of Protein Coding Genes
[Figure labels: P, CpG, TSS, TIS (ATG, Met), CDS, intron splicing signal (GT---AG), Stop (TAA, TAG, TGA), poly-A site (AATAAA); transcripts (alternatively spliced forms), ESTs]
1. Define gene structure from various types of evidence
• Full-length evidenced genes (mRNAs / proteins)
• Full-length clue evidenced genes (full-length clue ESTs from the Kazusa full-length cDNA library)
• Partially evidenced genes (other partial ESTs)
• No-evidenced genes (prediction only)
1) Full-length Evidenced Genes
• Gene locus with full-length mRNA / protein (GMAP, GeneWise)
• Almost complete gene structure: gene boundary (mRNA: TSS/poly-A; protein: CDS), exon/intron, (some alternative splicing structure)
• Requirement: more than 1 mRNA or protein
• Processing:
  • Merge the same AS forms
  • mRNA evidence: predict CDS (ESTscan etc.)
  • Protein evidence: mend gene boundary (TSS, poly-A)
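The "merge the same AS forms" step can be sketched as grouping aligned transcripts by their intron chain. This is a minimal illustration, not the KRIBB pipeline: it assumes that "same AS form" means an identical set of intron boundaries, that all alignments passed in belong to one locus, and the names `intron_chain` and `merge_same_forms` and all coordinates are hypothetical.

```python
from collections import defaultdict


def intron_chain(exons):
    """Return the introns implied by a list of (start, end) exon intervals."""
    exons = sorted(exons)
    return tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))


def merge_same_forms(alignments):
    """Group alignments of one locus by intron chain and merge each group.

    alignments: dict mapping a transcript name to its exon intervals.
    Assumption: all alignments passed in belong to the same gene locus.
    """
    groups = defaultdict(list)
    for name, exons in alignments.items():
        groups[intron_chain(exons)].append(name)

    merged = []
    for chain, names in groups.items():
        merged.append({
            "introns": chain,
            # The merged model spans the widest boundaries seen in the group.
            "start": min(s for n in names for s, _ in alignments[n]),
            "end": max(e for n in names for _, e in alignments[n]),
            "members": sorted(names),
        })
    return sorted(merged, key=lambda m: (m["start"], m["end"]))


# Hypothetical coordinates, for illustration only.
models = merge_same_forms({
    "mRNA_a": [(100, 200), (300, 400)],
    "mRNA_b": [(120, 200), (300, 450)],
    "mRNA_c": [(100, 200), (320, 400)],
})
for m in models:
    print(m["members"], m["start"], m["end"], m["introns"])
```

Here `mRNA_a` and `mRNA_b` share one intron and collapse into a single model with the widest boundaries, while `mRNA_c` uses a different acceptor site and stays a separate alternatively spliced form.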
2) Full-length Clue Evidenced Genes
• Gene locus with full-length clue ESTs from the Kazusa full-length cDNA library (GMAP)
• Gene boundary (TSS, poly-A), some exon/intron
• Requirement: more than 1 full-length clue EST
• Processing:
  • Merge the same AS forms
  • Link the same-cloned ESTs
  • Mend incomplete portion with predicted model
  • CDS to be predicted (ESTscan / orfPredictor etc.)
3) Partially Evidenced Genes
• Gene locus with general ESTs (GMAP)
• Some exon/intron, poly-A
• More ESTs, more information expected
• Requirement: more than 2 ESTs with more than 2 pairs of overlapping hard edges
• Processing:
  • Merge the same AS forms
  • Link the same-cloned ESTs
  • Mend incomplete portion with predicted model
  • CDS to be predicted (ESTscan / orfPredictor etc.)
4) No-evidenced Genes
• Predicted model only (hypothetical gene)
• Predicted CDS
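The four evidence tiers above amount to a priority-ordered decision, which might be sketched as follows. The slides do not give an implementation, so everything here is an assumption: the field names are hypothetical, and the slides' "more than N" requirements are read as "at least N" (adjust the `min_*` parameters if a strict reading is intended).

```python
from dataclasses import dataclass


@dataclass
class LocusEvidence:
    """Evidence counts for one gene locus (hypothetical field names)."""
    full_length_mrnas: int = 0
    full_length_proteins: int = 0
    full_length_clue_ests: int = 0
    other_ests: int = 0
    shared_hard_edge_pairs: int = 0


def evidence_class(ev, min_full=1, min_clue=1, min_ests=2, min_edges=2):
    """Assign a locus to the strongest evidence tier it qualifies for.

    Assumption: thresholds interpret "more than N" as "at least N".
    """
    if ev.full_length_mrnas + ev.full_length_proteins >= min_full:
        return "full-length evidenced"
    if ev.full_length_clue_ests >= min_clue:
        return "full-length clue evidenced"
    if ev.other_ests >= min_ests and ev.shared_hard_edge_pairs >= min_edges:
        return "partially evidenced"
    return "no evidence"


# Hypothetical loci, for illustration only.
print(evidence_class(LocusEvidence(full_length_proteins=1)))
print(evidence_class(LocusEvidence(other_ests=3, shared_hard_edge_pairs=2)))
print(evidence_class(LocusEvidence()))
```

Checking the tiers from strongest to weakest means a locus with both a full-length mRNA and partial ESTs is reported once, under the full-length tier.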
2. Transcript-Genome mappers
• Tested BLAT / SIM4 / GMAP / GeneSeqer
  • BLAT: fast / inaccurate
  • SIM4 / GMAP / GeneSeqer: approximately the same results
• KRIBB: prefiltering ESTs with BLAT, then mapping with GMAP
• Cutoff: coverage > 80%, identity > 92%
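The coverage and identity cutoff can be sketched as a simple filter over alignment summaries. The slides do not define the two measures, so this sketch assumes coverage = aligned query bases / query length and identity = matching bases / aligned bases; the dictionary keys and the function names are hypothetical and do not correspond to any GMAP or BLAT output field.

```python
def coverage_and_identity(query_len, aligned_len, matches):
    """Return (coverage, identity) as fractions; (0.0, 0.0) for empty input."""
    if query_len <= 0 or aligned_len <= 0:
        return 0.0, 0.0
    return aligned_len / query_len, matches / aligned_len


def filter_hits(hits, min_coverage=0.80, min_identity=0.92):
    """Keep the names of hits strictly above both cutoffs."""
    kept = []
    for hit in hits:
        cov, ident = coverage_and_identity(
            hit["query_len"], hit["aligned_len"], hit["matches"]
        )
        if cov > min_coverage and ident > min_identity:
            kept.append(hit["name"])
    return kept


# Hypothetical alignment summaries, for illustration only.
hits = [
    {"name": "EST_1", "query_len": 1000, "aligned_len": 900, "matches": 850},
    {"name": "EST_2", "query_len": 1000, "aligned_len": 800, "matches": 790},
    {"name": "EST_3", "query_len": 1000, "aligned_len": 950, "matches": 860},
]
kept = filter_hits(hits)
print(kept)
```

`EST_2` is dropped because its coverage is exactly 80% rather than above it, and `EST_3` because its identity falls below 92%. Calling `filter_hits(hits, min_identity=0.80)` would apply the looser identity threshold quoted for the protein prefilter.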
3. Protein-based Gene Models
• GeneWise / FGENESH+
• KRIBB: GeneWise after prefiltering proteins with BLASTx
• BLASTx cutoff: coverage > 80%, identity > 80%
1) Full-length Evidenced Gene: C02HBa0025N15.220
• mRNA/Protein evidence
• Annotation
  • Product: SNF1 [Lycopersicon esculentum]
  • IPR000719 Prot_kinase
  • GO:0006468(P) protein amino acid phosphorylation
  • GO:0004672(F) protein kinase activity
  • EC:2.7.1.-: Snf1-related protein kinase (KIN10) (SKIN10)
  • TMHMM: outside
2) Full-length Evidenced Gene: C02HBa0066C13.60
• Protein evidence
• Annotation
  • Product: phytochrome E [Lycopersicon esculentum]
  • IPR001294 Phytochrome
  • GO:0006355(P) regulation of transcription, DNA-dependent
  • GO:0008020(F) G-protein coupled photoreceptor activity
  • TargetP/TMHMM: C/outside
  • FunCat: 30.01 intracellular signalling; 70.01 cell wall
3) Full-length Clue Evidenced Gene: C02HBa0060J03.170
• Kazusa full-length cDNA/EST evidence
• Annotation
  • Product: putative protein [Arabidopsis thaliana]
  • IPR001251: CRAL_bd_TRIO_C
  • TMHMM: outside
[Figure labels: ~1 kb, 3 exons]
4) Partially Evidenced Gene: C02HBa0060J03.90
• EST evidence
• Annotation
  • Product: putative protein [Arabidopsis thaliana]
  • IPR000719 Prot_kinase
  • GO:0006468(P) protein amino acid phosphorylation
  • GO:0004672(F) protein kinase activity
  • GO:0016020(C) membrane
  • TMHMM: outside
5) Gene with alternative splicing: C02HBa0060J03.40-4
• EST evidence
• Annotation
  • Product: transformer-SR ribonucleoprotein [N. tabacum]
  • IPR000504 RNA-binding region RNP-1
  • GO:0003676(F) nucleic acid binding
  • GO:0030529(C) ribonucleoprotein complex
  • TargetP/TMHMM: C/outside
Annotation Results [table not recovered; footnote only]
*1. All values are from the 22 annotated BACs.
Future Works
• Training data set for tomato gene HMM models
• Automation
• Performance assessment
• Manual curation (Apollo)
http://sol.kribb.re.kr
Thank you for your attention!