Allgenes.org is a web interface providing access to EST and mRNA sequences within the GUS database, with both automated and manual annotation efforts to associate them with their corresponding genes. A manual annotation tool assists in validating the automated assignments and adding gene information.
Manual Annotation of the human and mouse gene index: www.allgenes.org
Brian Brunk, Jonathan Crabtree, Sharon Diskin, Joan Mazzarelli, Chris Stoeckert and Nico Zigouras
Computational Biology and Informatics Laboratory, Center for Bioinformatics, University of Pennsylvania, Philadelphia, PA 19104

Vera Bogdanova, Alexey Katohkin, Nikolay Kolchanov, Vorbjeva Nadezhda, Elena Semjonova and Vladimir Trifonoff
Institute of Cytology and Genetics SB RAS, Novosibirsk, Russia

Abstract

Allgenes.org is a web interface providing access to the assembled EST and mRNA sequences, or DoTS RNA transcripts, contained within GUS (Genomics Unified Schema), a relational database. The DoTS transcripts integrate annotation from cDNA libraries (tissue source) and RH mapping data also stored in GUS. Automated annotation has been applied to the DoTS transcripts to determine their predicted gene ownership, protein sequences and GO Functions. Manual annotation efforts have focused on validating the automated annotation and adding gene information.

Manual annotation of the gene index utilizes an annotation tool, the GUS annotator interface, which directly updates the GUS database. The interface allows the annotator to perform defined annotation tasks: determination of transcript gene membership using BLAST similarities and transcript alignments to genomic sequence, assignment of the approved (HUGO or MGI) gene symbol and gene synonyms, and confirmation or addition of protein GO Function assignments. Evidence for the automated annotation is stored in GUS and provided to the annotator to assist in the validation of the assignments. Evidence is also manually added by the annotator for each assignment and is stored in GUS.

The human DoTS transcripts have been aligned on the UCSC Golden Path contigs, allowing for the identification of new genes, alternative transcript forms and annotation of the genome.
Manual annotation efforts have focused initially on the genes contained within the region of chromosome 22 whose deletion causes DiGeorge syndrome, a developmental disorder.
See also poster #114

Allgenes.org

A web interface providing access to the assembled EST and mRNA sequences, or DoTS RNA transcripts, contained within GUS (Genomics Unified Schema), a relational database. Computed and manual annotation has been applied to the human and mouse DoTS RNA transcripts to associate them with their corresponding genes, creating a human and mouse gene index, Allgenes.
DoTS RNA transcripts

The assembly of sequences generates a consensus sequence, or DoTS transcript.

Incoming sequences (EST/mRNA): GenBank, dbEST sequences
• Make quality (remove vector, polyA, NNNs) → "Quality" sequences
• Block with RepeatMasker → blocked sequences
• BLASTN to cluster sequences → "unassembled" clusters
• Assemble sequences with CAP4 → CAP4 assemblies (generate consensus sequences) → DoTS consensus sequences
• BLASTN DoTS consensus sequences (98% identity, 150 bp) → gene cluster (RNAs in the gene)
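The final step, grouping DoTS consensus sequences into gene clusters from pairwise BLASTN hits, can be illustrated with a small sketch. It is not the GUS implementation: it assumes single-linkage grouping, treats the 98% identity and 150 bp figures as inclusive thresholds, and uses a hypothetical hit format of (query, subject, percent identity, aligned length). The transcript identifiers are placeholders.

```python
from collections import defaultdict

MIN_IDENTITY = 98.0  # percent identity, taken from the pipeline description
MIN_LENGTH = 150     # aligned length in bp, taken from the pipeline description


def cluster_transcripts(transcript_ids, hits):
    """Group transcripts into clusters by single linkage over qualifying hits.

    hits: iterable of (query, subject, percent_identity, aligned_length).
    This tuple layout is an assumption made for the sketch.
    """
    parent = {t: t for t in transcript_ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, identity, length in hits:
        if query == subject:
            continue  # ignore self-hits
        if identity >= MIN_IDENTITY and length >= MIN_LENGTH:
            root_q, root_s = find(query), find(subject)
            if root_q != root_s:
                parent[root_s] = root_q  # merge the two clusters

    clusters = defaultdict(set)
    for t in transcript_ids:
        clusters[find(t)].add(t)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    ids = ["tx1", "tx2", "tx3", "tx4", "tx5"]  # placeholder identifiers
    hits = [
        ("tx1", "tx2", 99.0, 200),  # qualifies
        ("tx2", "tx3", 98.5, 150),  # qualifies (thresholds inclusive here)
        ("tx3", "tx4", 97.0, 500),  # identity too low
        ("tx4", "tx5", 99.5, 100),  # alignment too short
    ]
    for cluster in cluster_transcripts(ids, hits):
        print(cluster)
```

Single linkage means two transcripts land in the same cluster if any chain of qualifying hits connects them, which is one plausible reading of "RNAs in the gene"; a stricter rule would yield smaller clusters.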
Diagram of GUS computed & manual annotation

[Flow diagram. Recoverable labels:
• Inputs: genomic sequence; mRNA/EST sequence
• Gene side: gene predictions (GRAIL/GenScan), predicted genes, merge genes
• Transcript side: clustering and assembly, DoTS consensus sequences, BLAST/SIM4, gene/RNA cluster assignment
• Protein side: framefinder translation of RNAs to proteins; BLASTX; BLASTP; PFAM, Smart, ProDom
• Annotate DoTS: BLAST similarities, protein motifs, functional predictions, GO Functions
• Other computed annotation: EPCR, AssemblyAnatomyPercent, Index Key Words, SNP analysis
• Outputs: manual annotation tasks, gene index]
Manual Annotation of a human disease-related gene associated with Chr 22 deletions

The example is a gene located on chromosome 22, implicated in DiGeorge syndrome (DGS). DGS, a developmental disorder, is marked by the absence or underdevelopment of the thymus and parathyroid glands. Most cases of DGS result from a deletion in chromosome 22. Several genes are lost, including HIRA, Homo sapiens HIR (histone cell cycle regulation defective, S. cerevisiae) homolog A.

DoTS transcripts aligned on genomic sequence aid in assessing predicted gene ownership

The HIRA DoTS transcripts have been aligned with BLAST/SIM4 to the region of the chromosome 22 contig containing the HIRA gene. The sequence alignment of DT.452034 reveals that the HIRA gene contains 25 exons. An alternative form of the gene transcript, TUPLE1 (DT.86855487), contains only 21 of the 25 exons as a result of alternative splicing, specifically exon skipping (1). Other DoTS transcripts, e.g., DT.92425739 and DT.86855485, align to this portion of the contig and possibly represent additional alternative HIRA RNA transcripts.
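Exon skipping of this kind can be read off from genomic alignments by comparing exon intervals. The following is a minimal sketch under stated assumptions: exons are given as inclusive (start, end) genomic coordinates, a reference exon counts as skipped when it lies inside the span of the alternative transcript but overlaps none of its exons, and the coordinates in the example are invented for illustration (they are not HIRA coordinates).

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def skipped_exons(reference_exons, transcript_exons):
    """Return 1-based numbers of reference exons skipped by a transcript.

    An exon is reported only if it falls within the transcript's aligned
    span, so exons missing because the transcript is truncated at either
    end are not counted as skipped.
    """
    if not transcript_exons:
        return []
    tx_start = min(start for start, _ in transcript_exons)
    tx_end = max(end for _, end in transcript_exons)

    skipped = []
    for number, exon in enumerate(reference_exons, start=1):
        inside_span = tx_start <= exon[0] and exon[1] <= tx_end
        covered = any(overlaps(exon, t) for t in transcript_exons)
        if inside_span and not covered:
            skipped.append(number)
    return skipped


if __name__ == "__main__":
    # Invented coordinates: a five-exon reference and a form lacking exon 3.
    reference = [(100, 200), (300, 400), (500, 600), (700, 800), (900, 1000)]
    alternative = [(100, 200), (300, 400), (700, 800), (900, 1000)]
    print(skipped_exons(reference, alternative))
```

Restricting the check to the transcript's aligned span matters for EST-derived assemblies, which are often partial: a short transcript should not be reported as skipping exons it simply never reached.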
Region deleted in DiGeorge Syndrome

* The human region on chromosome 22 commonly deleted in DGS patients and the orthologous region on mouse chromosome 16 are very similar in gene content but show a different genomic organization (from Epstein & Buck, Pediatric Research 2000, vol. 48, p. 722).
Annotation Tool implementation

Manual annotation efforts utilize a web-based tool to update GUS:
• Java servlet
• Java Database Connectivity (JDBC)
• XML updates using Perl object layer
Gene Annotation Tasks

Annotation tasks include:
1. Assigning reference RNA sequence.
2. Determining members of gene cluster (RNA transcripts):
   - removing or adding members
   - validating RNA transcripts assigned to gene using genomic alignment, BLAST similarity and/or cDNA clone linkages
3. Adding approved abbreviated gene name or symbol (if known) and evidence.

Gene description field:
4. Adding approved full gene name, aliases and evidence for them.
5. Adding gene synonyms and evidence for them.
Editing Gene Page: Annotator Interface DoTS RNA transcripts
Editing RNA Page: Annotator Interface
RNA Annotation Tasks

6. Modifying TS (RNA) description of reference sequence to reflect HUGO or MGI approved full gene name, if assigned.
7. GO Function assignment/verification: GO Functions are manually assigned; predicted GO Functions are verified.
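The GO Function task pairs every assignment with evidence and leaves only confirmed or manually added functions for display. A minimal data-model sketch of that workflow follows. The class, field, and function names are hypothetical and do not reflect the GUS schema or the annotator interface; the GO identifiers are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GoAssignment:
    """One GO Function assignment for an RNA (hypothetical record layout)."""
    go_id: str
    source: str                      # "predicted" or "manual"
    evidence: List[str] = field(default_factory=list)
    status: str = "unreviewed"       # "unreviewed", "confirmed" or "rejected"


def review(assignment, accept, annotator_evidence):
    """Record the annotator's decision on a predicted assignment.

    Evidence is required for every decision, mirroring the rule that the
    annotator adds evidence for each assignment.
    """
    if not annotator_evidence:
        raise ValueError("annotator evidence is required")
    assignment.evidence.append(annotator_evidence)
    assignment.status = "confirmed" if accept else "rejected"
    return assignment


def add_manual(go_id, annotator_evidence):
    """Create a manually added assignment; it is confirmed on creation."""
    if not annotator_evidence:
        raise ValueError("annotator evidence is required")
    return GoAssignment(go_id, "manual", [annotator_evidence], "confirmed")


def displayed(assignments):
    """GO identifiers shown on the RNA page: confirmed ones only."""
    return [a.go_id for a in assignments if a.status == "confirmed"]


if __name__ == "__main__":
    predicted_a = GoAssignment("GO:placeholder_A", "predicted", ["BLASTP similarity"])
    predicted_b = GoAssignment("GO:placeholder_B", "predicted", ["protein motif"])
    review(predicted_a, True, "annotator checked similarity")
    review(predicted_b, False, "motif match judged spurious")
    manual_c = add_manual("GO:placeholder_C", "annotator literature check")
    print(displayed([predicted_a, predicted_b, manual_c]))
```

Keeping rejected predictions in the record, with the annotator's evidence attached, preserves the reason for the decision instead of silently dropping the automated result.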
Example of Evidence Retrieved from GUS for GO Function Assignment
RNA annotation (the RNA description and the confirmed or manually added GO Functions) is displayed on the RNA page of allgenes.org.
Progress

• Number of RNAs (DoTS transcripts) annotated: 2818
• Number of genes annotated: 967

The annotation team has focused on annotating DoTS transcripts corresponding to mouse cDNAs on a pancreas-specific microarray chip (PancChip 2.0, http://www.cbil.upenn.edu/EPConDB/).

Future Annotation

• Chromosome 22 with focus on DiGeorge region
• Human & mouse genomic comparisons
Reference

1. Llevadot, R., Scambler, P., Estivill, X., and Pritchard, M. (1996) Genomic organization of TUPLE1/HIRA: a gene implicated in DiGeorge syndrome. Mammalian Genome 7:911-4.

Future Features of the next version of the Annotator Interface

• For gene annotation:
  - gene exon/intron structure analysis to create gene models
  - links to other databases
• For RNA annotation:
  - represent RNA forms with category (e.g., alt. polyadenylation site usage)
  - tissue expression
• For protein annotation:
  - protein name, synonyms
  - GO cellular component
  - GO biological process
  - protein-protein interactions, pathways