Solanaceae 2006 BAC Annotation
2006. 07. 26
Plant Genome Research Center, KRIBB, KOREA
Development Environments
• OS: SGI IRIX 6.5
• CPU: MIPS 500MHz, 12 CPUs
• MEM: 12288 MB
• OS: SUSE Linux 9.0, kernel version 2.6.11.4-21.11-bigsmp
• CPU: Intel(R) Xeon(TM) CPU 2.80GHz
• MEM: 6231 MB
• DBMS: MySQL-4.0.25
• Language: PHP 5.0.4, Apache 2.0.54, Perl-5.8.7
Data Sets
• BACs (SGN test BACs)
  • Annotated: 10
• ESTs: 200,015 (cf. 202,043 currently)
• Full-length mRNAs (GenBank): 596
• Protein DB (UniProt Release 7.7)
  • Swiss-Prot / TrEMBL: 228,917 / 2,914,826
  • Swiss-Prot / TrEMBL (plant): 15,203 / 219,361
• Arabidopsis Proteins
  • Proteins, Genomes (TAIR): 30,693
  • GO associated (TAIR): 28,812
  • Pathway/EC associated (KEGG): 1,521
• Tomato Chip Data: Tomato Expression Database (Cornell)
Define gene structure from multiple lines of evidence
[Figure: tracks labeled Predict, mRNA, EST, Protein]
• Full-length evidenced genes (mRNAs / Proteins)
• Full-length clue evidenced genes (full-length clue ESTs from the Kazusa full-length cDNA library)
• Partially evidenced genes (other partial ESTs)
• No-evidenced genes (prediction only)
1) Full-length Evidenced Genes
[Figure: sample locus; tracks labeled Predicted Genes, ESTs, mRNAs, Predict, mRNA, TIGR TC, Protein, stackPACK]
• Gene locus with full-length mRNA / protein (GMAP, GeneWise)
• Almost complete gene structure: gene boundary (mRNA: TSS/poly-A, protein: CDS), exon/intron, (some alternative splicing structure)
• Requirement: more than 1 mRNA or protein
• Processing:
  • Merge the same AS forms
  • mRNA evidence: predict CDS (ESTScan etc.)
  • Protein evidence: mend gene boundary (TSS, poly-A)
2) Full-length Clue Evidenced Genes
[Figure: sample locus; tracks labeled Predicted Genes, Full-length Clue ESTs (Kazusa), ESTs, Predict, EST]
• Gene locus with full-length clue ESTs from the Kazusa full-length cDNA library (GMAP)
• Gene boundary (TSS, poly-A), some exon/intron
• Requirement: more than 1 full-length clue EST
• Processing:
  • Merge the same AS forms
  • Link the same-cloned ESTs
  • Mend incomplete portion with predicted model
  • CDS to be predicted (ESTScan / OrfPredictor etc.)
3) Partially Evidenced Genes
[Figure: sample locus; tracks labeled Predicted Genes, ESTs, Predict, EST1, EST2]
• Gene locus with general ESTs (GMAP)
• Some exon/intron, poly-A
• More ESTs, more information expected
• Requirement: more than 2 ESTs with more than 2 pairs of overlapping hard edges
• Processing:
  • Merge the same AS forms
  • Link the same-cloned ESTs
  • Mend incomplete portion with predicted model
  • CDS to be predicted (ESTScan / OrfPredictor etc.)
4) No-evidenced Genes
[Figure: sample locus; track labeled Predict, no evidence]
• Predicted model only (hypothetical gene)
• Predicted CDS
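The four evidence classes can be read as a simple decision cascade, checked from strongest to weakest evidence. The sketch below is only an illustration of that reading, not the pipeline used at KRIBB: the `LocusEvidence` fields, the function name, and the thresholds are hypothetical, "more than N" is taken literally as strictly greater than N, and mRNA and protein counts are pooled for the first class. If the slides mean "at least N", the comparisons would need to change.

```python
from dataclasses import dataclass


@dataclass
class LocusEvidence:
    """Hypothetical per-locus evidence counts (field names are assumptions)."""
    full_length_mrnas: int = 0        # full-length mRNAs aligned to the locus
    proteins: int = 0                 # proteins aligned to the locus
    full_length_clue_ests: int = 0    # full-length clue ESTs (Kazusa library)
    other_ests: int = 0               # general (partial) ESTs
    overlapping_hard_edge_pairs: int = 0  # pairs of overlapping hard edges


def classify_locus(ev: LocusEvidence) -> str:
    """Assign a locus to one of the four evidence classes, strongest first."""
    # 1) Full-length evidenced: "more than 1 mRNA or protein" (pooled, assumed)
    if ev.full_length_mrnas + ev.proteins > 1:
        return "full-length evidenced"
    # 2) Full-length clue evidenced: "more than 1 full-length clue EST"
    if ev.full_length_clue_ests > 1:
        return "full-length clue evidenced"
    # 3) Partially evidenced: "more than 2 ESTs with more than 2 pairs of
    #    overlapping hard edges"
    if ev.other_ests > 2 and ev.overlapping_hard_edge_pairs > 2:
        return "partially evidenced"
    # 4) Everything else is supported by the predicted model only
    return "no-evidenced"


if __name__ == "__main__":
    print(classify_locus(LocusEvidence(full_length_mrnas=2)))
    print(classify_locus(LocusEvidence(other_ests=3, overlapping_hard_edge_pairs=3)))
```

Ordering the checks from strongest to weakest evidence means a locus with both full-length mRNAs and partial ESTs lands in the full-length class.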
Gene Structure Annotation - Problems
• False positive intergenic region: 2 annotated genes actually correspond to a single gene
• False negative intergenic region: one annotated gene structure actually contains 2 genes
• False negative gene prediction: missing gene (no annotation)
• Other:
  • Partially incorrect gene annotation
  • Missing annotation of alternative transcripts (alternative splicing)
• Pseudo-genes
• Promoter / Regulatory Elements
Estimated Gene Prediction
1) Hexamer signal A(A/U)UAAA: predicted polyadenylation signals (PASes)
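As a rough illustration of hexamer-based PAS prediction, the sketch below scans a sequence for the two canonical polyadenylation hexamers, AAUAAA and AUUAAA (AATAAA / ATTAAA on DNA). It assumes these two hexamers are the intended signal; the function name is hypothetical, and a real predictor would also weigh position relative to the cleavage site and surrounding sequence context.

```python
import re

# Lookahead so that overlapping hexamers are all reported.
PAS_PATTERN = re.compile(r"(?=(A[AT]TAAA))")


def find_pas_hexamers(sequence: str) -> list[tuple[int, str]]:
    """Return (0-based position, hexamer) for every AATAAA/ATTAAA match."""
    # Accept RNA or DNA input by normalizing U to T.
    dna = sequence.upper().replace("U", "T")
    return [(m.start(), m.group(1)) for m in PAS_PATTERN.finditer(dna)]


if __name__ == "__main__":
    for pos, hexamer in find_pas_hexamers("GGAATAAACCATTAAAGG"):
        print(pos, hexamer)
```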
Gene Structure Browser
[Figure: browser tracks labeled FGENESH, GENSCAN, Protein, Repeats / Domain, mRNA, dbESTs, TIGR TC, Kazusa Full ESTs, Unigene]
• Test BLAT / SIM4 / GMAP / GeneSeqer
  • BLAT: fast but inaccurate
  • SIM4 / GMAP / GeneSeqer: approximately the same results
• KRIBB: prefiltering ESTs by BLAT + GMAP
• Cutoff: coverage > 80%, identity > 90%
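The cutoff above amounts to a simple filter over EST-to-genome alignments. A minimal sketch follows, assuming each alignment has already been summarized as a percent coverage of the EST and a percent identity; the `Alignment` record and function names are hypothetical and do not correspond to BLAT or GMAP output formats.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical summary of one EST-to-BAC alignment."""
    est_id: str
    coverage: float  # percent of the EST length that is aligned
    identity: float  # percent identity over the aligned region


def passes_cutoff(aln: Alignment,
                  min_coverage: float = 80.0,
                  min_identity: float = 90.0) -> bool:
    """Keep an alignment only if coverage > 80% and identity > 90%."""
    return aln.coverage > min_coverage and aln.identity > min_identity


def filter_alignments(alignments: list[Alignment]) -> list[Alignment]:
    """Return the alignments that pass both cutoffs, in input order."""
    return [aln for aln in alignments if passes_cutoff(aln)]


if __name__ == "__main__":
    hits = [
        Alignment("EST_A", 95.0, 98.5),   # passes both cutoffs
        Alignment("EST_B", 75.0, 99.0),   # coverage too low
        Alignment("EST_C", 92.0, 88.0),   # identity too low
    ]
    print([aln.est_id for aln in filter_alignments(hits)])
```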
Functional Annotation: Protein DB / EC / GO
Functional Annotation: Protein DB / GO, TFBS / Promoter
Functional Annotation: TargetP / TMHMM, Enzyme / Pathway, Domain / Motif
Expression Annotation (Digital Expression)
Principle of identifying differentially expressed genes by the hypergeometric test
• N: ESTs for all genes in all tissues
• n: ESTs for the selected gene in all tissues
• K: ESTs for all genes in the selected tissue
• k: ESTs for the selected gene in the selected tissue
• P: significance of over- or under-expression in the selected tissue
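With the symbols defined above, the hypergeometric probability of seeing exactly k ESTs of the selected gene in the selected tissue is C(n, k) · C(N − n, K − k) / C(N, K), and P is a tail sum of that probability. The sketch below is one way to compute it with the Python standard library; the function names are hypothetical, and the choice of tail (upper for over-expression, lower for under-expression) is an assumption about how P is defined.

```python
from math import comb


def hypergeom_pmf(k: int, N: int, n: int, K: int) -> float:
    """P(X = k): k of the K tissue ESTs come from a gene with n ESTs among N."""
    return comb(n, k) * comb(N - n, K - k) / comb(N, K)


def hypergeom_tail(k: int, N: int, n: int, K: int, over: bool = True) -> float:
    """Tail probability: P(X >= k) if over, else P(X <= k)."""
    lowest = max(0, K - (N - n))   # smallest count that is possible
    highest = min(n, K)            # largest count that is possible
    if over:
        values = range(max(k, lowest), highest + 1)
    else:
        values = range(lowest, min(k, highest) + 1)
    return sum(hypergeom_pmf(i, N, n, K) for i in values)


if __name__ == "__main__":
    # Toy numbers: 10 ESTs in total, 4 from the gene, 5 from the tissue,
    # 3 of which belong to the gene.
    print(hypergeom_tail(3, N=10, n=4, K=5, over=True))
    print(hypergeom_tail(3, N=10, n=4, K=5, over=False))
```

A small upper-tail value suggests over-expression in the selected tissue, and a small lower-tail value suggests under-expression.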
Expression Annotation (Tissue Specific Genes)
Principle of identifying differentially expressed genes by Audic's test
• x: number of cognate ESTs of a given gene in the selected library
• N1: total number of ESTs in the selected library
• y: number of cognate ESTs of a given gene in the other library
• N2: total number of ESTs in the other library
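Audic's test (Audic and Claverie) gives the probability of observing y ESTs in the other library given x in the selected one as p(y|x) = (N2/N1)^y · (x+y)! / (x! · y! · (1 + N2/N1)^(x+y+1)). The sketch below evaluates that expression in log space to avoid overflow in the factorials; the function names are hypothetical, and how the probabilities are accumulated into a significance value is left as an assumption (a one-sided cumulative sum is shown).

```python
from math import exp, lgamma, log


def audic_claverie_prob(x: int, y: int, n1: int, n2: int) -> float:
    """p(y | x) for library sizes n1 (selected) and n2 (other)."""
    ratio = n2 / n1
    log_p = (
        y * log(ratio)
        + lgamma(x + y + 1)            # log (x + y)!
        - lgamma(x + 1)                # log x!
        - lgamma(y + 1)                # log y!
        - (x + y + 1) * log(1 + ratio)
    )
    return exp(log_p)


def audic_claverie_cumulative(x: int, y: int, n1: int, n2: int) -> float:
    """One-sided cumulative probability P(Y <= y | x)."""
    return sum(audic_claverie_prob(x, i, n1, n2) for i in range(y + 1))


if __name__ == "__main__":
    # Toy example: 20 ESTs of the gene in the selected library, 3 in the other,
    # with libraries of equal size.
    print(audic_claverie_cumulative(20, 3, n1=5000, n2=5000))
```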
Pepper tissue-specific gene analysis
[Figure: PCR expression panel; conditions: 25 cycles, annealing temp. 55℃. Sample labels: Fruit, Floral bud, Breaker, Flower, stem, Bark, Leaf, root, M.G, Xag, M.R, Buf, IM, Pathogen]
Genes (# of ESTs): CaActin, CacnA (16), CacnB (18), CacnC (13), CacnD (10), CacnE (25), CacnF (31), CacnG (20)
Thanks !!
Solanaceae 2006 BAC Annotation Test page
http://crop.kribb.re.kr/SOL-Test/
http://sol.kribb.re.kr/