Biological sequences and SO. Karen Eilbeck University of Utah Towards Interoperability of Biomedical Ontologies 27.03.2007 - 30.03.2007. SO categorizes the kinds of, parts of and properties of sequence. How SO is organized - from features and qualities to cross-products
Biological sequences and SO. Karen Eilbeck, University of Utah. Towards Interoperability of Biomedical Ontologies, 27.03.2007 - 30.03.2007.
SO categorizes the kinds of, parts of and properties of sequence. • How SO is organized - from features and qualities to cross-products • Let’s interoperate - questions that need to be answered.
What does sequence look like? • >3R:21066761,21072884 • tcaacgaaaactcggaggccatttacaaggagacggccaaagcaatcgaccgatcctttggcaaactttacctgggcgtcgtcaaaggtgtgttctccaaactgccgtatgccaagttttttgcggatgaatcgtgagttagctcttcaaagtgggcagagtccacataaagatactagatcatgttgtttgcgtactgacagatctaagttttgaggctagcaatcatcattaggtttaatggagttcgtgtttcgcgtttgaaagtgagaacacaagtaactactattaagccatctcagctaaataatctgtaagtgttgtgtggcaataaaagttacatatatgtagttagcacattgtaaattatttataagtgatacaaagaattctgtaaaataccataaaaacatttaaaactatgacccattattaattaagttacagtgagtggaaccctatagatcaggttgatccaaaagatgaaggaccgcctgaaagtatgtgttattcgcgcgcggagattccgaaaggcagggaatatctgtaactggaaaaggcagttacattaaaaaaaagcttgataaacaatctttgttgacttagccattaattagacgttgaaacgggaattaatgtgcgttttggggaaggccgatccaatttgcatatatcgagcaaattgcacccaaaacgcgattaggagcgattgaatgggacggggtcgatgtctggcttgggagttgggaacttgggagttcattaaagggaatcgtaaaatgaattcgccggctataaccagccactttgccatacagccagcctgccggtttcggtttataatccatttaactgactcaactgccaaacggtctaaagtcaaattctgtgcggctgaaacgcaaaagcggtttacggcaacaaaaacatgatacatttcaattgacgaaagtgactatataagtgttaacgcccgcggctaatggatcagtactcgattacgttcgccgccagcaattatggagctaactctcgccctcgtcctgctgttcggatgtgcgtccacctacggacacgcctccgatttccgtgagttgcacacaccccctttccgtaatatataatgtttatgtatatttcacaatcgccacgccccaatttacagcctcgggcatcgagaggtgcgccataatggacgagcagtgcctggaggacagggtgaacttcgtgctcaggaactacgccaaaagcggtatcaaggagctgggcttgatccccctcgatccgctgcacgtcaagaagttcaaaatcggacgcaatccgcacagtccggtcaacatcgatctcagcttccacgagatggacatcttgggtctgcatcagggagttgcgaagcgagtgaggtgagtggatcctcatttcattttatgatcgctctgcttactacatttttctgtttcggattttagcggattcacaagggatctcagccgctccatcgagctggtcatggaagttccagaaataggagtcagaggaccctactcggtggacggaagaatactcattctgcccatcaccggaaatggcattgctgacatacgcctcagtaagatttgcctcccacagctttgaaatctaaaatttttaatgtgtttctggaattcgcagctagaacaaaggtacgtgcacagatcaaattgaagcgcgtctccaagggcgatcatcaaacctacgccgaggtgatgaacataaaggttgagctggatccatcccatgtgacctaccagctggaaaatctgttcaacggccagaaagatctcagcgagaacatgcacgcgcttatcaatgagaactggaaggacatcttcaatgaactgaaaccgggcattggcgaggccttcggactgatagccaagtcggtggtggacaggatctttggca
aactgccgctcgaacagctctttgtagtctaagaccttagtacaaacaccctaattagtccaaacacaaatcgtaaatatttatttgactttcaaaatacaaatgcaaagcaaataagaaaaactggtaagttcctcatacaaataaaacgtagttgcaaaataaattcaggcactaaaggatttcttatttctaaagtttaagtaaaatacagatttataaaagtgaaaagcaaacacatttgtagttttgccaaataaatgtaaacacagttaaacttatataaatttgttatcaatctcaaaacaggggtaataaatcgttttcattttgattttgtttgtatcgatttgataaatatttttaaaaagcttatataagcttattcacgaatacaaatatggagtccgcactattggacaaatatatcttacactatagatatgtttactttacgaaattattgcttcccatgagaagagtagcttttttaaattgcatatttgctgtcattcttttatcgatgtgcacagcattagtttagcttctgaagcgaggtacacgtccggtgtgacgaggtggcgatgatggcacttcctcggtctcctcaaactcctcctccgacgctgacggctgctccatgctcactgcgccaagcttctgtggcccgcaaatggcgacaacactgggcgggaaataggtctctggattggcgatcagagcggagaggaactcatgctgctccagaacgcaatagatcggacctcccagtttgtatctctgggcgatgaacagatcctccgcactgattccacgcgcttcggccgcccttagctcggcaccggacaaaaggatggcccgctcgttcaacgacttgtggctttcggtcat
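The header line of the sequence above appears to follow an `id:start,end` pattern. A minimal sketch of reading it, assuming that interpretation (chromosome arm 3R, then start and end coordinates); `parse_region_header` and `Region` are hypothetical names, not part of any SO tooling:

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    seq_id: str  # e.g. chromosome arm "3R"
    start: int
    end: int


def parse_region_header(header: str) -> Region:
    """Parse a header such as '>3R:21066761,21072884'.

    Assumes the layout '>id:start,end'; this is a guess at the slide's format.
    """
    match = re.fullmatch(r">(\w+):(\d+),(\d+)", header.strip())
    if match is None:
        raise ValueError(f"unrecognised header: {header!r}")
    seq_id, start, end = match.groups()
    return Region(seq_id, int(start), int(end))


region = parse_region_header(">3R:21066761,21072884")
print(region)
```

On its own the header only says where the sequence comes from; it says nothing about what the bases mean, which is the gap SO annotations are meant to fill.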
What does the sequence mean? cgtaaaactttggccaggcgctctccggtctggtctggtctggtctgttcgtactgctcc gctctctttttccctcaaatgggccaaaaggaggcgacgtcgctgccgcggtcgcagcgc tgccgctgccgcagctaccgccgctgcagacgtcgcttacctgccgaagaagaagagcag cgttcAGTCGCGCAGCGCACGTCGTCCAACGCACACACGCTCAGAGACACACCGACACGC ACACAGATACAGATACGTTGAGTCGCCGCCGCCGCGAAAGATACCAGATACTATCTGCCA GATACGAAGAGTTGGGCCCTATAGTCGTCCCGCTTGCACCCATGGCCGCCTGAGTgtgag tgcaagagcggattggattgagtggaatacgaacgcgattccattccggtccacatccga acccacatccgaatcctatccgaagccacctaacccttgccgaccagcgcttaacccatg tcttcgtctttgtctcgtttcagAGTTGCAAGCGACCATGCGCGCATGGCTTCTACTCCT CGCAGTGCTGGCGACTTTTCAAACGATTGTTCGAGTTGCTAGCACCGAGGATATATCCCA GAGATTCATCGCCGCCATAGCGCCCGTTGCCGCTCATATTCCGCTGGCATCAGCATCAGG ATCAGGATCAGGACGATCTGGATCTAGATCGGTAGGAGCCTCGACCAGCACAGCATTAGC AAAAGCATTTAATCCATTCAGCGAGCCCGCCTCGTTCAGTGATAGTGATAAAAGCCATCG GAGTAAAACAAACAAAAAACCTAGCAAAAGTGACGCGAACCGACAGTTCAACGAAGTGCA TAAGCCAAGAACAGACCAATTAGAAAATTCCAAAAATAAGTCTAAACAATTAGTTAATAA ACCCAACCACAACAAAATGGCTGTCAAGGAGCAGAGGAGCCACCACAAGAAGAGCCACCA CCATCGCAGCCACCAGCCAAAGCAGGCCAGTGCATCCACAGAATCTCATCAATCCTCGTC GATTGAATCAATCTTCGTGGAGGAGCCGACGCTGGTGCTCGACCGCGAGGTGGCCTCCAT CAACGTGCCCGCCAACGCCAAGGCCATCATCGCCGAGCAGGGCCCGTCCACCTACAGCAA GGAGGCGCTCATCAAGGACAAGCTGAAGCCAGACCCCTCCACTCTAGTCGAGATCGAGAA GAGCCTGCTCTCGCTGTTCAACATGAAGCGGCCGCCCAAGATCGACCGCTCCAAGATCAT CATCCCCGAGCCGATGAAGAAGCTCTACGCCGAGATCATGGGCCACGAGCTCGACTCGGT CAACATCCCCAAGCCGGGTCTGCTGACCAAGTCGGCCAACACAGTGCGAAGTTTTACACA CAAAGgtgagtctccttttcaaatgtttaaaaccagaactagaaaaccggaagcggatat agaaaaactttgcattctaatggtattacttttaatacagcgagtatgattccttttgga Labelled regions: • 5’ UTR (1st exon) • intron • 5’ UTR part of 2nd exon • Start codon • Coding part of 2nd exon • intron
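In the sequence above, some stretches are uppercase and others lowercase. A rough sketch of recovering those stretches programmatically, assuming (this is not stated on the slide) that uppercase marks exonic sequence and lowercase marks intronic or flanking sequence. `case_segments` is a hypothetical helper, and the input is a made-up toy string with whitespace already removed:

```python
from itertools import groupby


def case_segments(seq: str) -> list[tuple[str, int, int]]:
    """Split a sequence into runs of upper- and lowercase letters.

    Returns (kind, start, end) tuples with 0-based, end-exclusive offsets,
    where kind is "upper" or "lower".
    """
    segments = []
    pos = 0
    for is_upper, run in groupby(seq, key=str.isupper):
        length = len(list(run))
        segments.append(("upper" if is_upper else "lower", pos, pos + length))
        pos += length
    return segments


toy = "acgtAGTCgtgagtttcagAGTT"  # made-up example, not the dpp sequence
for kind, start, end in case_segments(toy):
    print(kind, start, end, toy[start:end])
```

Letter case can only encode two states, so it cannot distinguish a 5’ UTR from coding sequence within the same exon. That is one reason to attach named feature types with coordinates instead.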
We can make pictures to help us understand the sequence. • 5 alternate transcripts of the gene decapentaplegic (dpp)
Structure of SO • Sequence features • 662 terms • have coordinates • Examples: exon, 5’UTR, promoter
Structure of SO part 2 • Sequence attributes • 358 terms • describe sequence features • Examples: imprinted, trans-spliced, fragment • Consequence of mutation - describes mutations such as SNPs. • Example: mutation_causes_exon_loss • (SNP is_a sequence_variant; synonym = mutation) • Chromosome variation - describes weird chromosomes • Example: interchromosomal_transposition
Cross product terms • 156 terms • A new SO term can be composed from a feature and an attribute • What makes a silenced_gene a special kind of gene is that it has the quality ‘silenced’. • Genus: gene • Differentia: silenced • Cross-product term: silenced_gene
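The genus and differentia composition described on this slide can be sketched in a few lines. This is only an illustration of the idea, not how SO itself is built; the `CrossProduct` class and the `attribute_feature` naming rule are assumptions taken from the single silenced_gene example:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrossProduct:
    genus: str        # the feature term, e.g. "gene"
    differentia: str  # the attribute term, e.g. "silenced"

    @property
    def name(self) -> str:
        # Naming pattern taken from the slide's example: attribute + "_" + feature.
        return f"{self.differentia}_{self.genus}"

    def definition(self) -> str:
        return f"A {self.genus} that has the quality '{self.differentia}'."


term = CrossProduct(genus="gene", differentia="silenced")
print(term.name)
print(term.definition())
```

Keeping the genus and differentia as separate fields means the composed term can still be queried as a kind of gene or as something silenced.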
Lots of potential to interoperate with SO
Annotation of scarlet gene
• GO: ATP binding; eye pigment precursor transporter activity; permease activity
• PATO (phenotype qualities):
[Term]
id: PATO:0000952
name: brown
is_a: PATO:0000014 ! color
• GO: Can GO annotators use SO terms to annotate cellular locations?
• SO
• RNA Ontology
• MGED Ontology
• Protein Ontology
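The PATO entry on this slide is written as an OBO-style stanza of `tag: value` lines. A minimal sketch of reading one such stanza into a dictionary follows; `parse_stanza` is a hypothetical helper and deliberately ignores most of the OBO format (escapes, qualifiers, multiple stanzas):

```python
def parse_stanza(text: str) -> dict[str, list[str]]:
    """Parse the tag-value lines of one OBO-style stanza into a dict.

    Trailing '! comment' text is dropped; repeated tags accumulate in a list.
    """
    tags: dict[str, list[str]] = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("["):  # skip blanks and the [Term] header
            continue
        tag, _, value = line.partition(":")
        value = value.split("!", 1)[0].strip()
        tags.setdefault(tag.strip(), []).append(value)
    return tags


stanza = """
[Term]
id: PATO:0000952
name: brown
is_a: PATO:0000014 ! color
"""
brown = parse_stanza(stanza)
print(brown)
```

Because identifiers such as `PATO:0000952` are plain strings, annotations from GO, PATO and SO can sit side by side on the same gene record once each is parsed this way.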
Question from the GO group. • Should GO annotators locate gene products to SO terms? An annotator wanted to further specify a protein with DNA binding function. • Examples: promoter, intergenic_region. • GOC decided not to use SO terms directly in the GO annotations, but to allow them to be used as “contextual information”.
Aim: Work out how SO fits into the grand scheme of things… • entity • continuant: independent entity (objects, aggregates, fiat parts, site, boundary); dependent entity (role, function, quality) • occurrent (exists in 4 dimensions)
Questions we need to be able to answer • Generally, • What kind of thing is an instance of SO? • Specifically, • What is a gene? • What is a genotype? • What is an allele?
Things people have said about sequence • Sequence is a molecular thing • Sequence is a mathematical thing • Sequence is abstract
Is a SO sequence a molecule? The intron sequence has relationships that relate it to other sequences. It is part of a gene, and adjacent to exon sequences. • GATACGAAGAGTTGGGCCCTAGTCGTCCCGCTTGCACCATGCCGCCTGAGTgtgagtgcaagagcggattggattgatggaatacgaacgcgattccattccggtccacatccgaacccacatccgaatcctatccgaagccacctaacccttgccgaccagcgcttaacccatgtcttcgtctttgtctcgtttcagAGTTGCAAGCGACCATGCGCGCATGGCTTCTACTCCT The intron molecule is not related to other sequences. It has 3 dimensional structure. The intron molecule has sequence.
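The relationships mentioned on this slide (an intron sequence is part of a gene and adjacent to exon sequences) can be written down as subject, relation, object triples. The sketch below is illustrative only; the feature identifiers are made up and the relation names simply mirror the slide's wording:

```python
# (subject, relation, object) triples; identifiers are made up for the example.
triples = {
    ("intron_1", "part_of", "gene_A"),
    ("exon_1", "part_of", "gene_A"),
    ("exon_2", "part_of", "gene_A"),
    ("intron_1", "adjacent_to", "exon_1"),
    ("intron_1", "adjacent_to", "exon_2"),
}


def related(subject: str, relation: str) -> set[str]:
    """Return every object linked to `subject` by `relation`."""
    return {o for s, r, o in triples if s == subject and r == relation}


print(related("intron_1", "part_of"))
print(sorted(related("intron_1", "adjacent_to")))
```

An excised intron molecule would have no such triples, which is the contrast the slide draws between the sequence and the molecule.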
Retroviral gag, pol and env genes are encoded in both RNA and DNA • The retrovirus genome exists as RNA. • It is copied into DNA by reverse transcriptase, and that DNA integrates into the host genome. • The host now contains gag, pol and env.
Is biological sequence mathematical? • It is sequential. • Therefore we can do coordinate-based calculations like ‘Are the coordinates of this exon located within the coordinates of the transcript?’
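The containment question on this slide reduces to two integer comparisons. A small sketch, assuming inclusive start and end coordinates on the same reference sequence; the `Interval` type and the coordinates are made up for illustration:

```python
from typing import NamedTuple


class Interval(NamedTuple):
    start: int  # inclusive
    end: int    # inclusive


def is_within(inner: Interval, outer: Interval) -> bool:
    """True if `inner` lies entirely inside `outer` (same sequence assumed)."""
    return outer.start <= inner.start and inner.end <= outer.end


transcript = Interval(100, 900)  # made-up coordinates
exon = Interval(250, 400)
print(is_within(exon, transcript))
```

A real check would also compare the sequence identifier and strand, which this sketch leaves out.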
Is sequence abstract? • Does sequence exist, or is it a quality that is dependent on another substrate? • An exon can be located on a genomic sequence and on an mRNA sequence. • SO is used in a representational way. People annotate where the interesting things are located on a genome.
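To make the point about one exon being located on two sequences concrete, here is a sketch that projects exon coordinates from a genomic sequence onto the spliced mRNA. It assumes forward-strand, 1-based inclusive, sorted and non-overlapping exons; `to_mrna_coords` and the coordinates are hypothetical:

```python
def to_mrna_coords(exons: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Project exon intervals from genomic onto spliced-mRNA coordinates.

    Exons are (start, end) pairs, 1-based and inclusive, on the forward strand,
    sorted and non-overlapping. Introns are simply skipped.
    """
    projected = []
    offset = 0  # number of mRNA bases already emitted
    for start, end in exons:
        length = end - start + 1
        projected.append((offset + 1, offset + length))
        offset += length
    return projected


genomic_exons = [(101, 200), (501, 650)]  # made-up coordinates
print(to_mrna_coords(genomic_exons))
```

The same exon ends up with two different coordinate pairs, one per substrate, while remaining a single annotated feature.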
Acknowledgements • SO is funded by the NIH via the Gene Ontology Consortium grant. • Suzi Lewis, Michael Ashburner, Chris Mungall, John Richter, Judy Blake. (GOC)