SO meets RNAO. Karen Eilbeck University of Utah RNAO Consortium Meeting May 28-29 2007. What SO is. How SO is used How SO is managed Where do SO and RNAO meet How SO and RNAO can work together If we have time - a demo of OBO-Edit.
SO meets RNAO Karen Eilbeck University of Utah RNAO Consortium Meeting May 28-29 2007
What SO is. • How SO is used • How SO is managed • Where do SO and RNAO meet • How SO and RNAO can work together • If we have time - a demo of OBO-Edit
The Sequence Ontology describes the features of biological sequence • Genome sequence • Annotation of regions • Coordinates • Need to agree on meaning of terms. E.g. Does the CDS include the stop codon?
An annotation captures what we know about a gene • [Figure: 3 alternate transcripts of the Glut1 gene, shown with evidence and annotations. Labels: start codon, 5’ UTR, coding exon, transposon within intron]
Structure of the ontology • SO is structured into a directed acyclic graph. • [Figure: graph with edges labelled d, i and P. Terms shown: exon, transcript, intron, processed transcript, polyA site, primary transcript, clip, splice site, protein coding primary transcript, nc primary transcript, CDS, ncRNA, mRNA, UTR, tRNA, rRNA, five_prime_UTR, three_prime_UTR]
GFF3 • SO is used to ‘type’ the features and relationships.
seqid   source  type             start  end   score  strand  phase  attributes
ctg123  .       gene             1000   9000  .      +       .      ID=gene00001;Name=EDEN
ctg123  .       TF_binding_site  1000   1012  .      +       .      ID=tfbs00001;Parent=gene00001
ctg123  .       mRNA             1050   9000  .      +       .      ID=mRNA00001;Parent=gene00001;Name=EDEN.1
ctg123  .       mRNA             1050   9000  .      +       .      ID=mRNA00002;Parent=gene00001;Name=EDEN.2
ctg123  .       mRNA             1300   9000  .      +       .      ID=mRNA00003;Parent=gene00001;Name=EDEN.3
ctg123  .       exon             1300   1500  .      +       .      ID=exon00001;Parent=mRNA00003
ctg123  .       exon             1050   1500  .      +       .      ID=exon00002;Parent=mRNA00001,mRNA00002
ctg123  .       exon             3000   3902  .      +       .      ID=exon00003;Parent=mRNA00001,mRNA00003
ctg123  .       exon             5000   5500  .      +       .      ID=exon00004;Parent=mRNA00001,mRNA00002,mRNA00003
ctg123  .       exon             7000   9000  .      +       .      ID=exon00005;Parent=mRNA00001,mRNA00002,mRNA00003
[Slide callouts: terms, relationships]
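As a rough illustration of how the type column and the Parent attribute carry the terms and relationships, here is a minimal Python sketch that reads a few of the feature lines above. The function name parse_gff3_line and the dictionary layout are hypothetical, not part of any GFF3 library. Real GFF3 files are tab-delimited with escaped attribute values; this sketch splits on whitespace only because the slide text is shown that way.

```python
from collections import defaultdict

# The nine GFF3 columns, in order.
GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]


def parse_gff3_line(line):
    """Parse one feature line into a dict (hypothetical helper, simplified)."""
    fields = line.split(None, 8)  # assumption: whitespace-separated, as on the slide
    if len(fields) != 9:
        raise ValueError("expected 9 columns, got %d" % len(fields))
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attrs = {}
    for pair in record["attributes"].strip().split(";"):
        key, _, value = pair.partition("=")
        # A feature may have several parents, separated by commas.
        attrs[key] = value.split(",") if key == "Parent" else value
    record["attributes"] = attrs
    return record


lines = [
    "ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN",
    "ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1",
    "ctg123 . exon 1050 1500 . + . ID=exon00002;Parent=mRNA00001,mRNA00002",
]

features = [parse_gff3_line(line) for line in lines]

# Build a parent -> children index from the Parent attributes.
children = defaultdict(list)
for feature in features:
    for parent_id in feature["attributes"].get("Parent", []):
        children[parent_id].append(feature["attributes"]["ID"])

for feature in features:
    print(feature["type"], feature["attributes"]["ID"])
```

The type values (gene, mRNA, exon) are the SO terms; the children index is one way to recover the part-whole relationships that the Parent attributes encode.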
Why we made SO • Standardize vocabulary used in genomics. • Clarify the relationships between the terms. • Make genomics data more computable by adding semantics to the sequence. It's not just about sequence similarity.
What is the scope of SO? • Features that can be located on a sequence with coordinates. exon, promoter, binding_site • Properties of these features: • Sequence attributes • Maternally_imprinted • Consequences of mutation • mutation_affecting_editing • Chromosome variation • aneuploid
The SO community • Model organism DBs: SGD, (MGI), FlyBase, WormBase, DictyBase, Pombe • GMOD • Comparative genomics • MGED Ontology • NLP
Genome annotation unification • The model organism databases use SO to type their features. • The GFF3 file format for annotation, the Chado db schema and DAS2 annotation protocol rely on SO to type features.
Genomic analysis • The Comparative Genomics Library written in Perl uses SO based annotations to perform complex analysis over multiple genomes. • Yandell M, Mungall CJ, Smith C, Prochnik S, Kaminker J, Hartzell G, Lewis S, Rubin GM. 2006. Large-Scale Trends in the Evolution of Gene Structures within 11 Animal Genomes. PLoS Comput Biol. 2:e15
Genome data integration • Multiple genomes are organized using SO: • Flymine, • Gramene, • the BRCs
NLP/text mining • Recently SO has been used for some new projects: • Semantic enrichment by the Royal Society of Chemistry. • Anaphora resolution by the NLIP group in Cambridge.
How SO is managed • SO uses CVS to manage and version the ontology. • There is a mailing list for developers to get things off their chest. • There is a tracker for term suggestions • There are workshops when we get a critical mass for a given problem. We want to do more workshops. • SO is expressed in OBO format.
Example of OBO format • http://www.geneontology.org/GO.format.obo-1_2.shtml
[Term]
id: SO:0000587
name: group_I_intron
def: "Group I catalytic introns are large self-splicing ribozymes. They catalyse their own excision from mRNA, tRNA and rRNA precursors in a wide range of organisms. The core secondary structure consists of 9 paired regions (P1-P9). These fold to essentially two domains, the P4-P6 domain (formed from the stacking of P5, P4, P6 and P6a helices) and the P3-P9 domain (formed from the P8, P3, P7 and P9 helices). Group I catalytic introns often have long ORFs inserted in loop regions." [http://www.sanger.ac.uk/cgi-bin/Rfam/getacc?RF00028]
subset: SOFA
is_a: SO:0000188 ! intron
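To show how little machinery the tag-value layout needs, here is a minimal Python sketch that reads [Term] stanzas into dictionaries. The function name parse_obo_terms is hypothetical and this is not the OBO-Edit or go-perl parser. It handles only simple tag: value lines and strips trailing "! comment" text naively, so a quoted def string that itself contains " ! " would be truncated.

```python
def parse_obo_terms(text):
    """Return {term id: {tag: [values]}} for each [Term] stanza (simplified sketch)."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are collected here.
            current = {} if line == "[Term]" else None
            if current is not None:
                stanzas.append(current)
            continue
        if current is None or ":" not in line:
            continue  # header lines, blank lines, or non-Term stanzas
        tag, _, value = line.partition(":")
        value = value.split(" ! ")[0].strip()  # naive removal of trailing comment
        current.setdefault(tag.strip(), []).append(value)
    return {s["id"][0]: s for s in stanzas if "id" in s}


example = """
format-version: 1.2

[Term]
id: SO:0000587
name: group_I_intron
subset: SOFA
is_a: SO:0000188 ! intron
"""

terms = parse_obo_terms(example)
print(terms["SO:0000587"]["name"][0], "is_a", terms["SO:0000587"]["is_a"][0])
```

Values are kept as lists because tags such as is_a and subset can repeat within one stanza.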
OBO and OWL • http://purl.org/obo/owl/SO • Mapping OBO and OWL http://www.bioontology.org/wiki/index.php/OboInOwl:Main_Page
Navigate SO using OBO-Edit • Search the ontology • Details for selected term • Structure of the ontology • All parents of the term
Annotating with SO and RNAO • Translational control element • [Slide shows a raw nucleotide sequence, omitted here] • Duchow HK, Brechbiel JL, Chatterjee S, Gavis ER. The nanos translational control element represses translation in somatic cells by a Bearded box-like motif. Developmental Biology, Volume 282, Issue 1, 1 June 2005, Pages 207-217
Overlap with RNAO • SO provides regions of sequence: start and stop coordinates with respect to the whole sequence, i.e. assembly / chromosome • Transcripts and parts of transcripts • Some secondary structure • Some motifs • Results of algorithms such as BLAST
Secondary structure • This part of SO needs work. • Any volunteers?
Divergent from RNAO • Where do SO and RNAO differ dramatically? • Multiple sequence alignments. SO does not provide a solution to this. It does however provide the terms to describe the results of sequence similarity searches. • Numerical results. SO has not needed to use values so far.
RNAO working groups • Motif identification/annotation • RNA interaction • Biochemical-structure mapping • Multiple sequence alignment • Backbone conformation • Base stacking
Working together • Remain 2 separate ontologies. • Give SO annotators option of ‘importing’ RNAO terms using the OBO programs • SO and RNAO work together to align key terms in their ontologies.
SO is still evolving • RNAO could use the SO features to describe regions of sequence • SO could reference RNAO for detailed annotation of structure and biochemical features.
Multiple ontologies in OBO • 2 options. • The ontologies reference each other: • Will always need to load both ontologies • There is a mapping file that you can load to import external terms. • Maintain separate ontologies and keep mapping up to date. http://obofoundry.org/wiki/index.php/Mappings
Example: Importing terms from SCOR. • 1. Make an OBO file from a subset of SCOR terms • 2. Work out where there is overlap • 3. Make an OBO mapping file between the two ontologies • 4. Load all 3 files at once.
scor.obo:
format-version: 1.2
date: 16:05:2007 15:26
saved-by: kareneilbeck
auto-generated-by: OBO-Edit 1.100

[Term]
id: SC:0000000
name: hairpin_loop

[Term]
id: SC:0000001
name: diloop
is_a: SC:0000000 ! hairpin_loop

[Term]
id: SC:0000002
name: triloop
is_a: SC:0000000 ! hairpin_loop
…

mapping file:
format-version: 1.2
date: 24:05:2007 10:37
saved-by: kareneilbeck
import: so-xp.obo
import: scor2.obo

id: SC:0000015 hairpin loop
is_a: SO:0000715 is_a RNA motif

id: SC:0000016 internal loop
is_a: SO:0000715 is_a RNA motif

id: SC:0000035 tertiary interaction
is_a: SO:0000122 is_a RNA sequence secondary structure
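The effect of loading the files together can be sketched in Python as a merge of is_a edges followed by a walk up the combined graph. This is an illustration only, not how OBO-Edit implements imports. The helper names merge_is_a and ancestors are hypothetical, the edges are typed in by hand from the two listings above, and the SO file itself is not loaded, so the walk stops at the first SO term it reaches.

```python
def merge_is_a(*edge_maps):
    """Combine several {child: parents} maps into one (hypothetical helper)."""
    merged = {}
    for edges in edge_maps:
        for child, parents in edges.items():
            merged.setdefault(child, set()).update(parents)
    return merged


def ancestors(term_id, is_a):
    """Collect every term reachable from term_id by following is_a edges."""
    seen = set()
    stack = [term_id]
    while stack:
        for parent in is_a.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# is_a edges within the SCOR subset (from scor.obo above).
scor_edges = {
    "SC:0000001": {"SC:0000000"},  # diloop is_a hairpin_loop
    "SC:0000002": {"SC:0000000"},  # triloop is_a hairpin_loop
}

# Cross-ontology edges (from the mapping file above).
mapping_edges = {
    "SC:0000015": {"SO:0000715"},  # hairpin loop is_a RNA motif
    "SC:0000016": {"SO:0000715"},  # internal loop is_a RNA motif
    "SC:0000035": {"SO:0000122"},  # tertiary interaction is_a RNA sequence secondary structure
}

combined = merge_is_a(scor_edges, mapping_edges)
print(sorted(ancestors("SC:0000015", combined)))
```

With the SO edges merged in as a third map, the same walk would continue from SO:0000715 up through the rest of SO, which is the point of loading all three files at once.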
OBO-Edit DEMO • Fingers crossed…
Possible action items • A SO-RNAO mailing list for discussion of collaboration • Phone/skype/webinars at intervals to keep track of progress.
Resources • GFF3 http://www.sequenceontology.org/gff3.shtml • Apollo http://www.fruitfly.org/annot/apollo/ • SO http://www.sequenceontology.org • OBO-Edit http://sourceforge.net/projects/geneontology • OBO foundry http://www.obofoundry.org • GO-perl http://www.godatabase.org/dev/go-perl/doc/go-perl-doc.html
Acknowledgements • SO is funded as part of the Gene Ontology Consortium, via the NIH P41-HG002274 • People: • Suzi Lewis and Michael Ashburner - the vision • Chris Mungall - programming infrastructure • John Richter - made OBO-Edit