Intelligent Curation Using Ontologies

Presentation Transcript


    1. Intelligent Curation Using Ontologies. K. Wolstencroft

    2. Introduction: Developing an automated system for extracting and classifying proteins from newly sequenced genomes. Outline: background, architecture, advantages.

    3. Motivation: Genome sequencing techniques have greatly improved. More whole genomes are being sequenced quickly, so large amounts of data are being generated. Without analysis and classification, sequences are simply a series of letters. Data analysis is therefore now the rate-limiting step.

    4. Why Classify? Classification and curation of a genome is the first step in understanding the processes and functions at work in an organism. Classification enables comparative genomic studies that draw on what is already known in other organisms. The similarities and differences between processes and functions in related organisms often provide the greatest insight into the biology.

    5. Background: DNA to Proteins. Genome sequencing produces DNA sequences. DNA is the blueprint of an organism. DNA encodes complex molecules, mostly proteins. Proteins are the functional molecules of a cell.

    6. Proteins: Complex molecules constructed from sequences of amino acids. There are 20 different amino acids with different chemical properties.

    7. Proteins: Primary Structure. Amino acid sequences can be represented as a series of single letters:
       >1A5Y:_ PROTEIN TYROSINE PHOSPHATASE 1B
       MEMEKEFEQIDKSGSWAAIYQDIRHEASDFPCRVAKLPKNKNRNRYRDVSPFDHSRIKLHQEDNDYINASLIKMEEAQRSYILTQGPLPNTCGHFWEMVWEQKSRGVVMLNRVMEKGSLKCAQYWPQKEEKEMIFEDTNLKLTLISEDIKSYYTVRQLELENLTTQETREILHFHYTTWPDFGVPESPASFLNFLFKVRESGSLSPEHGPVVVHXSAGIGRSGTFCLADTCLLLMDKRKDPSSVDIKKVLLDMRKFRMGLIATAEQLRFSYLAVIEGAKFIMGDSSVQDQWKELSHEDLEPPPEHIPPPPRPPKRILEPHNGKCREFFPN

    8. Proteins: Tertiary Structure. Sequence determines structure.

    9. Searching for Features: The relationship between amino acid sequence and eventual protein structure means that we can search the sequence for distinct structural (and functional) domains. A domain could be several amino acids long or could span most of the protein.

    10. Example: A search of the linear sequence of protein tyrosine phosphatase type K identified 9 functional domains.
       >uniprot|Q15262|PTPK_HUMAN Receptor-type protein-tyrosine phosphatase kappa precursor (EC 3.1.3.48) (R-PTP-kappa).
       MDTTAAAALPAFVALLLLSPWPLLGSAQGQFSAGGCTFDDGPGACDYHQDLYDDFEWVHV
       SAQEPHYLPPEMPQGSYMIVDSSDHDPGEKARLQLPTMKENDTHCIDFSYLLYSQKGLNP
       GTLNILVRVNKGPLANPIWNVTGFTGRDWLRAELAVSSFWPNEYQVIFEAEVSGGRSGYI
       AIDDIQVLSYPCDKSPHFLRLGDVEVNAGQNATFQCIATGRDAVHNKLWLQRRNGEDIPV………..
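
Searching a linear sequence for a short feature can be illustrated with a simple pattern match. The sketch below is not the search used in the talk: it swaps in the well-known N-glycosylation pattern N-{P}-[ST]-{P} as a stand-in motif, and the function name `find_motif` is hypothetical. Domain databases such as InterPro rely on curated signatures and profile models rather than a single regular expression.

```python
import re

# Assumption: the N-glycosylation pattern N-{P}-[ST]-{P} is used purely as an
# illustrative motif. The lookahead lets overlapping hits be reported.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(sequence, pattern=N_GLYCOSYLATION):
    """Return (1-based start position, matched residues) for each hit."""
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(sequence)]


# Fragment copied from the PTPK_HUMAN sequence shown on the slide.
fragment = "PTMKENDTHCIDFSYLLYSQKGLNP"
hits = find_motif(fragment)
for start, residues in hits:
    print(start, residues)
```

The trailing `NP` in the fragment is not reported, because the pattern excludes proline after the asparagine.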

    11. Human Expert Annotation: Bioinformaticians use a series of tools to identify functional domains, through similarity searching and domain/motif identification. Tools include BLAST and InterPro. The tools simply show the presence of domains; expert knowledge is then used to classify proteins according to their domain arrangements, where the presence, order and number of each domain are important. Can an ontology capture this knowledge to the standard of a human annotator?

    12. Protein Family Classification: Proteins are divided into broad functional classes, “protein families”. Diagnostic domains/motifs often signify family membership. The initial study focuses on the protein phosphatase family.

    13. The Protein Phosphatases: A large superfamily of proteins involved in the removal of phosphate groups from molecules. They are important in almost all cellular processes and are involved in diseases such as diabetes and cancer. The human phosphatases are well characterised.

    14. Characterisation Allows Classification: Diagnostic phosphatase domains/motifs are sufficient for membership of the protein phosphatase superfamily. Other motifs determine a protein’s place within the family. This human expert knowledge can be captured and incorporated into the model if the domain organisations are represented in a formal OWL DL ontology.

    15. Protein Functional Domains

    16. Determining Class Definitions: R2A contains 2 protein tyrosine phosphatase domains, 1 transmembrane domain, 4 fibronectin domains, 1 immunoglobulin domain, 1 MAM domain and 1 cadherin-like domain.
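
A class definition of this kind amounts to a set of domain counts. A minimal sketch, assuming the definition is held as a plain dictionary and that counts must match exactly; the names `R2A_DEFINITION` and `satisfies` are hypothetical, and domain order, which the talk notes is also important, is not modelled here.

```python
from collections import Counter

# Domain counts for R2A as listed on the slide.
R2A_DEFINITION = {
    "protein tyrosine phosphatase": 2,
    "transmembrane": 1,
    "fibronectin": 4,
    "immunoglobulin": 1,
    "MAM": 1,
    "cadherin-like": 1,
}


def satisfies(definition, observed_domains):
    """True if every domain in the definition occurs exactly the stated number of times."""
    counts = Counter(observed_domains)
    return all(counts[name] == n for name, n in definition.items())


# Hypothetical protein whose domain list matches the R2A definition.
observed = (
    ["MAM", "immunoglobulin"]
    + ["fibronectin"] * 4
    + ["transmembrane", "cadherin-like"]
    + ["protein tyrosine phosphatase"] * 2
)
print(satisfies(R2A_DEFINITION, observed))
```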

    17. Protégé OWL Modelling

    18. Requirements: Extract the phosphatase sequences from the rest of the protein sequences in a whole genome. Identify the domains present in each. Compare these sequences to the formal ontology descriptions. Classify each protein instance to a place in the hierarchy.

    19. Technology

    20. myGrid Workflow: The workflow extracts sequences from the whole genome, performs simple filtering (patmatdb), runs InterProScan to determine domain architecture, and transforms the InterProScan results into abstract OWL instance descriptions.
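
The first three workflow steps can be sketched as a chain of plain functions. This is only an outline under stated assumptions: the real workflow calls patmatdb and InterProScan as services, whereas here the filter is a regular expression and the domain scan is a caller-supplied function. All function names and the toy input are hypothetical.

```python
import re
from collections import Counter


def extract_sequences(fasta_text):
    """Split FASTA text into {identifier: sequence}."""
    records, current = {}, None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


def simple_filter(records, pattern):
    """Stand-in for the patmatdb step: keep sequences that contain the pattern."""
    regex = re.compile(pattern)
    return {name: seq for name, seq in records.items() if regex.search(seq)}


def run_workflow(fasta_text, pattern, scan_domains):
    """scan_domains stands in for InterProScan: it maps a sequence to a list of domain ids."""
    candidates = simple_filter(extract_sequences(fasta_text), pattern)
    return {name: Counter(scan_domains(seq)) for name, seq in candidates.items()}


# Toy input: two made-up sequences and a fake scanner that always reports one domain.
toy_fasta = ">p1 toy\nAAACGGGGGRAA\n>p2 toy\nAAAAAAA\n"
result = run_workflow(toy_fasta, r"C.{5}R", lambda seq: ["IPR000387"])
print(result)
```

Passing the scanner in as an argument keeps the sketch runnable without the external tools and makes each step easy to replace.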

    21. myGrid Workflow

    22. InterProScan Results

    23. Conversion to abstract OWL format:
       restriction(<http://www.owl-ontologies.com/unnamed.owl#containsDomainIPR000340> cardinality(1))
       restriction(<http://www.owl-ontologies.com/unnamed.owl#containsDomainIPR001763> cardinality(1))
       restriction(<http://www.owl-ontologies.com/unnamed.owl#containsDomainIPR000387> cardinality(1))
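
The conversion step can be approximated by counting how often each InterPro identifier occurs and emitting one restriction per identifier. A minimal sketch, assuming the domain hits arrive as a list of identifiers; the function name `to_restrictions` is hypothetical, and the base IRI and property naming are copied from the slide.

```python
from collections import Counter

# Base IRI as it appears in the restrictions on the slide.
BASE_IRI = "http://www.owl-ontologies.com/unnamed.owl#"


def to_restrictions(domain_ids, base_iri=BASE_IRI):
    """Turn a list of InterPro ids into abstract-syntax cardinality restrictions."""
    counts = Counter(domain_ids)  # keeps first-seen order of the ids
    return [
        f"restriction(<{base_iri}containsDomain{ipr}> cardinality({n}))"
        for ipr, n in counts.items()
    ]


# Domain ids taken from the slide; each occurs once.
domain_hits = ["IPR000340", "IPR001763", "IPR000387"]
for line in to_restrictions(domain_hits):
    print(line)
```

A domain that occurs more than once in the input yields a single restriction with a higher cardinality.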

    24. Instance Store: The Instance Store enables reasoning over individuals and can support much larger numbers of individuals. The OWL ontology is loaded into the Instance Store. A DL reasoner (RACER) is used to compare individuals to the OWL ontology definitions.

    25. Instance Store

    26. Example Instances: Protein individual, Dual Specificity Phosphatase DUSE:
       restriction(<http://www.owl-ontologies.com/unnamed.owl#containsDomainIPR000340> cardinality(1))
       restriction(<http://www.owl-ontologies.com/unnamed.owl#containsDomainIPR000387> cardinality(1))
       Ontology definition of Dual Specificity Phosphatase: containsDomain IPR000340 is necessary and sufficient for class membership. The class also inherits containsDomain IPR000387 from its parent class PTP.
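
The way an individual is placed in the hierarchy can be imitated in a few lines. This is a plain-Python stand-in, not the Instance Store or a DL reasoner: it assumes a two-class toy hierarchy built from the slide, treats each restriction as "contains at least one such domain", and picks the most specific class whose own and inherited requirements are all met. The names `ONTOLOGY` and `classify` are hypothetical.

```python
# Toy hierarchy taken from the slide: Dual Specificity Phosphatase is a child of PTP.
ONTOLOGY = {
    "PTP": {"parent": None, "requires": {"IPR000387"}},
    "DualSpecificityPhosphatase": {"parent": "PTP", "requires": {"IPR000340"}},
}


def ancestors(cls):
    """Return the class followed by its parents, up to the root."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = ONTOLOGY[cls]["parent"]
    return chain


def required_domains(cls):
    """Domains required by the class itself plus those inherited from its parents."""
    needed = set()
    for c in ancestors(cls):
        needed |= ONTOLOGY[c]["requires"]
    return needed


def classify(domains):
    """Return the most specific class whose requirements are all present, or None."""
    present = set(domains)
    candidates = [c for c in ONTOLOGY if required_domains(c) <= present]
    return max(candidates, key=lambda c: len(ancestors(c)), default=None)


# The DUSE individual from the slide carries both domains.
print(classify(["IPR000340", "IPR000387"]))
```

A reasoner does considerably more than this, including handling cardinalities and inferring the class hierarchy itself, so the sketch only conveys the shape of the comparison.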

    27. So Far: Human phosphatases have been classified using the system. The ontology classification performed as well as expert classification. The ontology system also refined the classification: DUSC contains a zinc finger domain that is characterised and conserved but not used in the classification, and DUSA contains a disintegrin domain that was previously uncharacterised and is evolutionarily conserved.

    28. Aspergillus fumigatus: The phosphatase complement is very different from human: >100 in human, <50 in A. fumigatus. Whole subfamilies are ‘missing’. Different fungi-specific phosphorylation pathways? No requirement for tissue-specific variations? A novel serine/threonine phosphatase with a homeobox is conserved in Aspergillus and closely related species, but not in any other.

    29. Conclusions: Using an ontology allows automated classification to reach the standard of human expert annotation. Reasoning capabilities allow interpretation of domain organisation. The approach produces interesting biological questions and allows fast, efficient comparative genomics studies. The system currently describes protein phosphatases, but it could be expanded to other protein families.

    30. Acknowledgements: Group: myGrid. PhD supervisors: Andy Brass, Robert Stevens. Phosphatase biologist: Lydia Tabernero. Ontogrid and NIBHI.
