1. Intelligent Curation Using Ontologies K.Wolstencroft
2. Introduction Developing an automated system for extracting and classifying proteins from newly sequenced genomes
Background
Architecture
Advantages
3. Motivation Genome sequencing techniques have greatly improved
More whole genomes are being sequenced quickly - lots of data being generated
Without analysis and classification – sequences are simply a series of letters
Therefore, data analysis is now the rate-limiting step
4. Why Classify? Classification and curation of a genome is the first step in understanding the processes and functions happening in an organism
Classification enables comparative genomic studies - what is already known in other organisms
The similarities and differences between processes and functions in related organisms often provide the greatest insight into the biology
5. Background: DNA to Proteins Genome sequencing – produces DNA sequences
DNA – blueprint of an organism
DNA encodes complex molecules – mostly proteins
Proteins are the functional molecules of a cell
6. Proteins Complex molecules constructed from sequences of amino acids
20 different amino acids with different chemical properties
7. Proteins Primary Structure Amino acid sequences can be represented as a series of single letters
>1A5Y:_ PROTEIN TYROSINE PHOSPHATASE 1B
MEMEKEFEQIDKSGSWAAIYQDIRHEASDFPCRVAKLPKNKNRNRYRDVSPFDHSRIKLHQEDNDYINASLIKMEEAQRSYILTQGPLPNTCGHFWEMVWEQKSRGVVMLNRVMEKGSLKCAQYWPQKEEKEMIFEDTNLKLTLISEDIKSYYTVRQLELENLTTQETREILHFHYTTWPDFGVPESPASFLNFLFKVRESGSLSPEHGPVVVHXSAGIGRSGTFCLADTCLLLMDKRKDPSSVDIKKVLLDMRKFRMGLIATAEQLRFSYLAVIEGAKFIMGDSSVQDQWKELSHEDLEPPPEHIPPPPRPPKRILEPHNGKCREFFPN
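The single-letter representation above is in FASTA format: a header line starting with ">" followed by the sequence. A minimal sketch of reading such records in Python is shown here; the function name `parse_fasta` is hypothetical and not part of the original system, and the example sequence is only the first residues of the entry above.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped; collect them for joining.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Shortened example: only the start of the PTP1B sequence is used here.
example = ">1A5Y:_ PROTEIN TYROSINE PHOSPHATASE 1B\nMEMEKEFEQIDKSGSW\nAAIYQDIRHEASDF\n"
for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```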
8. Proteins: Tertiary Structure Sequence determines structure
9. Searching for Features The relationship between amino acid sequence and eventual protein structure means that we can search for distinct structural (and functional) domains within the sequence
Domains could be several amino acids long – or could span most of the protein
10. Example A search of the linear sequence of protein tyrosine phosphatase type K – identified 9 functional domains
>uniprot|Q15262|PTPK_HUMAN Receptor-type protein-tyrosine phosphatase kappa precursor (EC 3.1.3.48) (R-PTP-kappa).
MDTTAAAALPAFVALLLLSPWPLLGSAQGQFSAGGCTFDDGPGACDYHQDLYDDFEWVHV
SAQEPHYLPPEMPQGSYMIVDSSDHDPGEKARLQLPTMKENDTHCIDFSYLLYSQKGLNP
GTLNILVRVNKGPLANPIWNVTGFTGRDWLRAELAVSSFWPNEYQVIFEAEVSGGRSGYI
AIDDIQVLSYPCDKSPHFLRLGDVEVNAGQNATFQCIATGRDAVHNKLWLQRRNGEDIPV………..
11. Human Expert Annotation Bioinformaticians use a series of tools to identify functional domains
Similarity searching, domain/motif identification
Tools include BLAST and InterPro
Tools simply show presence of domains
Use expert knowledge to classify proteins according to domain arrangements
The presence, order and number of each domain are all important
Can an ontology be used to capture this knowledge to the standard of a human annotator?
12. Protein Family Classification Proteins are divided into broad functional classes, “Protein Families”
Diagnostic domains/motifs often signify family membership
The initial study focuses on the protein phosphatase family
13. The Protein Phosphatases A large superfamily of proteins involved in the removal of phosphate groups from molecules
Important proteins in almost all cellular processes
Involved in diseases such as diabetes and cancer
Human phosphatases are well characterised
14. Characterisation allows classification Diagnostic phosphatase domains/motifs – sufficient for membership of the protein phosphatase superfamily
Other motifs determine a protein’s place within the family
This human expert knowledge can be captured and incorporated into the model if the domain organisations are represented in a formal OWL DL ontology
15. Protein Functional Domains
16. Determining Class Definitions R2A
Contains 2 protein tyrosine phosphatase domains
Contains 1 transmembrane domain
Contains 4 fibronectin domains
Contains 1 immunoglobulin domain
Contains 1 MAM domain
Contains 1 cadherin-like domain
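The R2A definition above is essentially a set of exact domain counts. As a rough illustration only (the real system expresses this in OWL, not Python), the same definition could be written as a dictionary and checked against a protein's list of domains. The names `R2A_DEFINITION` and `matches_definition` are hypothetical.

```python
from collections import Counter

# Domain counts for class R2A, copied from the slide.
R2A_DEFINITION = {
    "protein tyrosine phosphatase": 2,
    "transmembrane": 1,
    "fibronectin": 4,
    "immunoglobulin": 1,
    "MAM": 1,
    "cadherin-like": 1,
}


def matches_definition(domains, definition):
    """Return True if every domain named in the definition occurs exactly
    the required number of times. Domains not named in the definition are
    ignored (an assumption of this sketch)."""
    counts = Counter(domains)
    return all(counts[name] == n for name, n in definition.items())


# A made-up protein whose domain list satisfies the definition.
protein = (
    ["MAM", "immunoglobulin"]
    + ["fibronectin"] * 4
    + ["cadherin-like", "transmembrane"]
    + ["protein tyrosine phosphatase"] * 2
)
print(matches_definition(protein, R2A_DEFINITION))
```

Counting rather than simply testing for presence reflects the point made earlier that the number of each domain matters for classification.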
17. Protégé OWL Modelling
18. Requirements Extract phosphatase sequences from the rest of the protein sequences in a whole genome
Identify the domains present in each
Compare these sequences to the formal ontology descriptions
Classify each protein instance to a place in the hierarchy
19. Technology
20. myGrid Workflow Extracts sequences from the whole genome
Performs simple filtering with patmatdb
Runs InterProScan to determine domain architecture
Transforms the InterProScan results into abstract OWL instance descriptions
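The filtering step can be pictured as a motif search over every protein sequence. The sketch below uses Python's `re` module as a stand-in for patmatdb. The slides do not say which pattern the workflow uses, so the C-x(5)-R pattern here (the commonly described protein tyrosine phosphatase active-site signature) is an assumption, and `filter_by_motif` is a hypothetical name.

```python
import re

# Assumed pattern: cysteine, any five residues, arginine (C-x(5)-R).
# The workflow's actual pattern is not given in the slides.
PTP_MOTIF = re.compile(r"C.{5}R")


def filter_by_motif(records, pattern=PTP_MOTIF):
    """Keep only the (header, sequence) records whose sequence contains
    at least one match for the motif."""
    return [(header, seq) for header, seq in records if pattern.search(seq)]


# Made-up sequences for illustration only.
records = [
    ("candidate_1", "AAAVHCSAGIGRSGTF"),  # contains C SAGIG R
    ("candidate_2", "MKLLAAGGSTPQ"),      # no match
]
print([header for header, _ in filter_by_motif(records)])
```

A cheap pattern filter like this only narrows the candidate set; the domain architecture still has to be determined by the InterProScan step that follows.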
21. myGrid Workflow
22. InterProScan Results
23. Conversion to abstract OWL format restriction(<http://www.owl-ontologies.com/unnamed.owl#containsDomainIPR000340> cardinality(1))
restriction(<http://www.owl-ontologies.com/unnamed.owl#containsDomainIPR001763> cardinality(1))
restriction(<http://www.owl-ontologies.com/unnamed.owl#containsDomainIPR000387> cardinality(1))
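A minimal sketch of how a list of InterPro identifiers might be turned into restriction strings of the form shown above. It assumes the cardinality equals the number of times each domain was reported for the protein; the slides do not spell this out, and the function name `to_restrictions` is hypothetical.

```python
from collections import Counter

# Namespace taken from the restriction strings on the slide.
BASE = "http://www.owl-ontologies.com/unnamed.owl#"


def to_restrictions(interpro_ids):
    """Build one abstract-syntax restriction per distinct InterPro domain,
    using the number of occurrences as the cardinality (an assumption)."""
    counts = Counter(interpro_ids)
    return [
        f"restriction(<{BASE}containsDomain{ipr}> cardinality({n}))"
        for ipr, n in sorted(counts.items())
    ]


for line in to_restrictions(["IPR000340", "IPR001763", "IPR000387"]):
    print(line)
```

The output is sorted by identifier, so the order may differ from the slide; the set of restrictions is what matters to the reasoner.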
24. Instance Store Instance Store enables reasoning over individuals
Can support much higher numbers of individuals
OWL ontology is loaded into the instance store
A DL reasoner (Racer) is used to compare individuals to the OWL ontology definitions
25. Instance Store
26. Example Instances Protein Individual
Dual Specificity Phosphatase DUSE
restriction(<http://www.owl-ontologies.com/unnamed.owl#containsDomainIPR000340> cardinality(1))
restriction(<http://www.owl-ontologies.com/unnamed.owl#containsDomainIPR000387> cardinality(1))
Ontology Definition of Dual Specificity Phosphatase
containsDomain IPR000340
Necessary and Sufficient for class membership
Also inherits
containsDomain IPR000387 from Parent Class PTP
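The example above can be mimicked in a few lines of plain Python to show the logic: a class requires its own domains plus those inherited from its parent, and an individual is placed in the most specific class whose requirements it meets. This is only an illustration and not a DL reasoner; the class labels and the functions `required_domains` and `classify` are hypothetical.

```python
# Toy class hierarchy based on the slide: the dual specificity class
# requires IPR000340 and inherits IPR000387 from its parent class PTP.
CLASSES = {
    "PTP": {"parent": None, "domains": {"IPR000387"}},
    "DualSpecificityPhosphatase": {"parent": "PTP", "domains": {"IPR000340"}},
}


def required_domains(name):
    """Collect the domains required by a class and all of its ancestors."""
    required = set()
    while name is not None:
        required |= CLASSES[name]["domains"]
        name = CLASSES[name]["parent"]
    return required


def classify(individual_domains):
    """Return the most specific classes whose required domains are all
    present in the individual."""
    matches = [c for c in CLASSES if required_domains(c) <= individual_domains]
    # Drop any matched class that is an ancestor of another matched class.
    ancestors = set()
    for c in matches:
        parent = CLASSES[c]["parent"]
        while parent is not None:
            ancestors.add(parent)
            parent = CLASSES[parent]["parent"]
    return [c for c in matches if c not in ancestors]


# The DUSE individual from the slide contains both domains.
print(classify({"IPR000340", "IPR000387"}))
```

A real reasoner also handles cardinalities and more expressive definitions, which this subset test does not.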
27. So Far….. Human phosphatases have been classified using the system
The ontology classification performed as well as expert classification
The ontology system refined classification
- DUSC contains zinc finger domain
Characterised and conserved – but not in classification
- DUSA contains a disintegrin domain
previously uncharacterised – evolutionarily conserved
28. Aspergillus fumigatus Phosphatase complement very different from human
>100 in human, <50 in A. fumigatus
Whole subfamilies ‘missing’
Different fungi-specific phosphorylation pathways?
No requirement for tissue-specific variations?
Novel serine/threonine phosphatase with homeobox
Conserved in Aspergillus and closely related species, but not in any other
29. Conclusions Using an ontology allows automated classification to reach the standard of human expert annotation
Reasoning capabilities allow interpretation of domain organisation
Produces interesting biological questions
Allows fast, efficient comparative genomics studies
System currently describes protein phosphatases - but possible to expand to other protein families
30. Acknowledgements Group : myGrid
PhD Supervisors: Andy Brass, Robert Stevens
Phosphatase Biologist: Lydia Tabernero
Ontogrid and NIBHI