Protein Data Integration through Ontologies
Digital Ecosystems & Business Intelligence Institute, Curtin University of Technology, Perth, Australia
http://www.debii.curtin.edu.au/

Outline
• Existing Interoperability of Biological Data
• Inconsistency in Protein Data Sources
• Need for Protein Ontology
• Protein Ontology Project
• Protein Ontology (PO)
• PO Algebra
• PO Instance Store
• Selected References

Existing Interoperability of Biological Data
• Biological data must be described in context rather than in isolation.
• Databases provide multiple links to other resources, but efficient use of these links requires intelligent retrieval systems.
• Attempts have been made to create inter-database links automatically, but they are restricted to a few selected data resources and have limited accuracy.
• An alternative approach is the warehouse: a centralized data resource that manages a variety of data collections translated into a common format.
• The recently emerged ‘middleware’ approach offers a chance to decouple data access from data management and to allow remote retrieval that goes beyond simple scripts fetching data from external databases.

Inconsistency in Protein Data Sources
• Problem of Synonyms: In many cases, creators use different data descriptors to refer to the same real-world protein data.
• Difference of Scope: Authors of protein data sources often use the same term with multiple meanings. Even when the meanings are not entirely different, the intended scope of the term differs.
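
The synonym problem can be illustrated with a small sketch. It assumes a hand-written lookup table that maps source-specific descriptors onto one shared term; the descriptor names (`protein_name`, `Molecule`, `ProteinName`, and so on) are hypothetical and are not taken from any real protein database or from PO itself.

```python
# Hypothetical synonym table: source-specific descriptor -> shared term.
# None of these names come from a real database schema.
SYNONYMS = {
    "protein_name": "ProteinName",
    "name": "ProteinName",
    "molecule": "ProteinName",
    "organism": "SourceOrganism",
    "source": "SourceOrganism",
}


def normalize(record):
    """Rename each descriptor to its shared term; unknown descriptors are kept unchanged."""
    return {SYNONYMS.get(key.lower(), key): value for key, value in record.items()}


# Two sources describing the same protein with different descriptors.
record_a = {"protein_name": "Prion protein", "organism": "Homo sapiens"}
record_b = {"Molecule": "Prion protein", "Source": "Homo sapiens"}

print(normalize(record_a) == normalize(record_b))
```

A flat table like this only addresses synonyms. It cannot resolve differences of scope, where one descriptor (for example `source`) carries different meanings in different databases; that is the part that needs a richer semantic model.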

Need for Protein Ontology
• We need to develop:
  • A shared representation of the semantics of protein information that can serve as the basis for interoperability between heterogeneous protein databases.
  • A query methodology that allows this semantic representation to be used for querying heterogeneous databases.

Protein Ontology Project
• The scope of the Protein Ontology Project can be described succinctly by the following components:
  • Develop a generic methodology for designing ontologies that integrate protein data and information sources.
  • Develop an ontological model for representing data and knowledge about proteins.
  • Develop a query algebra based on this ontological model for intelligent and dynamic information retrieval from protein data sources.
  • Evaluate the resulting protein ontology framework using data analysis techniques to demonstrate the strengths of the approach.

Protein Ontology (PO)
• We are building Protein Ontology to integrate protein data formats and provide a structured and unified vocabulary to represent protein synthesis concepts.
• PO consists of concepts, which are data descriptors for proteomics data, and the relationships among these concepts.
• PO has:
  • a hierarchical classification of concepts represented as classes, from general to specific;
  • a list of attributes related to each concept, for each class;
  • a set of relationships between classes that link concepts in the ontology in more complicated ways than the hierarchy implies, to promote reuse of concepts in the ontology; and
  • a set of algebraic operators for querying protein ontology instances.
• More details about Protein Ontology are at: http://www.proteinontology.info/
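
The first three ingredients listed above (class hierarchy, per-class attributes, and relationships beyond the hierarchy) can be sketched as a small data structure. This is an illustration only: the `Concept` class and every concept, attribute, and relationship name below are hypothetical and are not the actual PO vocabulary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    """One ontology class: a name, an optional parent, attributes, and named relations."""
    name: str
    parent: Optional["Concept"] = None
    attributes: List[str] = field(default_factory=list)
    relations: Dict[str, "Concept"] = field(default_factory=dict)

    def ancestors(self) -> List[str]:
        """Walk up the hierarchy from specific to general."""
        chain = []
        node = self.parent
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain


# Hypothetical hierarchy, general to specific.
protein_concept = Concept("ProteinConcept", attributes=["Description"])
structural = Concept("StructuralConcept", parent=protein_concept, attributes=["Method"])
chain = Concept("Chain", parent=structural, attributes=["ChainID"])

# A relationship that cuts across the hierarchy (hypothetical name).
residue = Concept("Residue", parent=protein_concept, attributes=["ResidueName"])
chain.relations["hasResidue"] = residue

print(chain.ancestors())
```

Keeping relationships in a separate mapping from the parent link mirrors the distinction the slide draws between the hierarchy and the additional links that allow concepts to be reused.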

[Figure: Protein Ontology (PO)]

PO Algebra
We defined rules that allow multiple levels of information stored in the ontology to be composed for information retrieval (referred to as PO Algebra).
• Unary operator: SELECT
• Binary operators: UNION, INTERSECTION, DIFFERENCE
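
The slide names the four operators but does not define their semantics. As a minimal sketch, assume each operand is a collection of ontology instances keyed by an identifier and that the binary operators behave like ordinary set operations on those identifiers. The function names, the instance layout, and the sample data are all assumptions made for illustration, not the published PO Algebra definitions.

```python
def select(instances, predicate):
    """Unary operator: keep the instances for which predicate(instance) is true."""
    return {key: inst for key, inst in instances.items() if predicate(inst)}


def union(a, b):
    """Binary operator: instances present in either operand (b wins on duplicate keys)."""
    return {**a, **b}


def intersection(a, b):
    """Binary operator: instances whose identifier appears in both operands."""
    return {key: a[key] for key in a.keys() & b.keys()}


def difference(a, b):
    """Binary operator: instances in a whose identifier does not appear in b."""
    return {key: a[key] for key in a.keys() - b.keys()}


# Hypothetical instance store: identifier -> attribute dictionary.
store = {
    "P1": {"family": "Prion", "organism": "Homo sapiens"},
    "P2": {"family": "Prion", "organism": "Bos taurus"},
    "P3": {"family": "Kinase", "organism": "Homo sapiens"},
}

prions = select(store, lambda p: p["family"] == "Prion")
human = select(store, lambda p: p["organism"] == "Homo sapiens")

print(sorted(intersection(prions, human)))
print(sorted(union(prions, human)))
print(sorted(difference(prions, human)))
```

Because every operator returns the same kind of collection it accepts, results can be fed back into further operators, which is one plausible reading of "composition of multiple levels of information".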

PO Instance Store
• Stores protein data as OWL files.
• Currently contains instances of 7424 protein families: http://www.proteinontology.info/proteins.php
• We carried out a preliminary investigation of the Prion dataset of PO using standard hierarchical mining algorithms (Tan et al., 2006):
  • Our group's work: MB3-Miner, X3-Miner, IMB3-Miner
  • Other work: VTreeMiner, PatternMatcher, FREQT
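
Since the instances are stored as OWL files, they can in principle be read with any XML or RDF tooling. The sketch below uses Python's standard `xml.etree.ElementTree` on a tiny inline RDF/XML fragment. The `po:` namespace URI, the class names, and the instance identifiers are placeholders invented for this example; the real PO files may be structured differently.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PO = "http://example.org/po#"  # placeholder namespace, not the real PO namespace

# Hypothetical OWL (RDF/XML) fragment with two typed instances.
OWL_TEXT = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:po="{PO}">
  <po:Family rdf:ID="Prion">
    <po:hasMember rdf:resource="#PRNP_HUMAN"/>
  </po:Family>
  <po:Protein rdf:ID="PRNP_HUMAN"/>
</rdf:RDF>"""


def list_instances(owl_text):
    """Return (class name, instance id) for each top-level element carrying rdf:ID."""
    root = ET.fromstring(owl_text)
    found = []
    for node in root:
        instance_id = node.get(f"{{{RDF}}}ID")
        if instance_id is not None:
            # ElementTree tags look like "{namespace}LocalName"; keep the local name.
            class_name = node.tag.split("}", 1)[1]
            found.append((class_name, instance_id))
    return found


print(list_instances(OWL_TEXT))
```

The nested, tree-shaped layout of such files is what makes them a natural input for the tree mining algorithms listed on the slide.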

[Figure: Mining PO Instance Store]

Selected References
• Protein Ontology
  • Sidhu, A.S., Dillon, T.S. and Chang, E. (2007) Protein Ontology. In Chen, J. and Sidhu, A.S. (eds), Biological Database Modeling. Artech House, New York, 63-80.
  • Sidhu, A.S., Dillon, T.S. and Chang, E. (2005) An Ontology for Protein Data Models. In Zhang, Y.T., Roux, C. and Zhuang, T.G. (eds), 27th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2005). Shanghai, IEEE Engineering in Medicine and Biology Society.
• PO Algebra
  • Sidhu, A.S., Dillon, T.S. and Chang, E. (2006) Towards Semantic Interoperability of Protein Data Sources. 2nd IFIP WG 2.12 & WG 12.4 International Workshop on Web Semantics (SWWS 2006) in conjunction with OTM 2006. France, Springer-Verlag.
• Mining PO Instance Store
  • Hadzic, F., Dillon, T.S., Sidhu, A.S., Chang, E. and Tan, H. (2006) Mining Substructures in Protein Data. 2006 IEEE Workshop on Data Mining in Bioinformatics (DMB 2006) in conjunction with 6th IEEE ICDM 2006. Hong Kong, IEEE Computer Society.