
Integrating Ontological Prior Knowledge into Relational Learning for Protein Function Prediction

Stefan Reckow, Max Planck Institute of Psychiatry; Volker Tresp, Siemens, Corporate Technology.


Presentation Transcript


  1. Integrating Ontological Prior Knowledge into Relational Learning for Protein Function Prediction. Stefan Reckow, Max Planck Institute of Psychiatry; Volker Tresp, Siemens, Corporate Technology

  2. Proteins and Protein Ontologies

  3. Protein and Protein Functions
     • motivation
        • proteins: molecular machines in any organism
        • understanding protein function is essential for all areas of the biosciences
        • diverse sources of knowledge about proteins
     • challenges
        • experimental determination of functions is difficult and expensive
        • homologies can be misleading
        • most proteins have several functions

  4. Protein function prediction: What function does this protein have? Candidate annotations, in order of increasing specificity:
     • catalytic activity (catalyzes a reaction)
     • isomerase activity
     • intramolecular oxidoreductase activity
     • intramolecular oxidoreductase activity, interconverting aldoses and ketoses
     • triose-phosphate isomerase activity (catalyzes a very specific reaction)

  5. "Function" Ontologies
     • ontologies are a way of bringing order to the functions of proteins
     • an ontology is a description of the concepts of a domain and their relationships
     • hierarchical representation (subclass relationship), either as a tree or as a directed acyclic graph

  6. "Complex" Ontology
     • complex: a structure formed by a group of two or more proteins to perform certain functions in concert

  7. Ontologies as a Great Source of Prior Knowledge in Machine Learning
     • A considerable amount of community effort is invested in designing ontologies
     • Typically this prior knowledge is deterministic (logical constraints)
     • Machine learning should be able to exploit this knowledge
     • Protein interactions are an important source of information for predicting function, which motivates statistical relational learning

  8. Statistical Relational Learning with the IHRM

  9. Statistical Relational Learning (SRL)
     • SRL generalizes standard machine learning to domains where relations between entities (and not just entity attributes) play a significant role
     • Examples: PRM, DAPER, MLN, RMN, RDN
     • The IHRM is an easily applicable general model that performs a cluster analysis of relational domains and requires no structural learning
     • Z. Xu, V. Tresp, K. Yu, and H.-P. Kriegel. Infinite hidden relational models. In Proc. 22nd UAI, 2006
     • C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Proc. AAAI, 2006

  10. Standard Latent Model for Protein Mixture Models [Figure: latent-variable model shown for Protein1 and Protein2.]
     • In a Bayesian approach, we can permit an infinite number of states in the latent variables and obtain a Dirichlet Process Mixture Model (DPM)
     • Advantage: the model only uses a finite number of those states, so no time-consuming structural optimization is required
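
The point that a DPM "only uses a finite number of those states" can be illustrated with the Chinese restaurant process, a standard sampling view of the Dirichlet process prior. The sketch below is not taken from the slides: the function name `crp_assignments`, the concentration value `alpha=1.0` and the number of items are assumptions chosen for illustration only.

```python
import random


def crp_assignments(n_items, alpha, seed=0):
    """Sample cluster labels for n_items from a Chinese restaurant process.

    alpha is the (assumed) concentration parameter and must be positive.
    Returns the label of every item and the size of every occupied cluster.
    """
    rng = random.Random(seed)
    labels = []
    counts = []  # counts[k] = number of items currently in cluster k
    for _ in range(n_items):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels.append(k)
    return labels, counts


labels, counts = crp_assignments(n_items=1000, alpha=1.0)
print(len(counts), "clusters occupied by", len(labels), "items")
```

Although the prior allows unboundedly many clusters, any finite data set occupies only finitely many of them, so the number of clusters does not have to be fixed or searched over in advance.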

  11. Infinite Hidden Relational Model (IHRM) [Figure: Protein1, Protein2 and Protein3, pairwise linked by "interact" relations.]
     • Permits us to include protein-protein interactions in the model

  12. Ground Network [Figure: three proteins with latent variables Z1, Z2 and Z3; each has function, motif and complex attributes, and each pair is linked by an "interact" relation.]
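
A generative reading of such a ground network can be sketched as follows: each protein gets a latent class, attributes depend on the protein's own class, and an interaction depends on the classes of both proteins involved. This is a simplified illustration, not the authors' implementation. It assumes a fixed number of classes instead of the Dirichlet process prior, a single binary attribute, and Bernoulli parameters named `theta` and `phi`; all of these names and choices are hypothetical.

```python
import random


def sample_ground_network(n_proteins, n_clusters, seed=0):
    """Sample latent classes, one binary attribute and pairwise interactions."""
    rng = random.Random(seed)
    # Latent class Z_i per protein (uniform here for simplicity; an
    # assumption, the IHRM itself places a Dirichlet process prior on Z).
    z = [rng.randrange(n_clusters) for _ in range(n_proteins)]
    # Hypothetical attribute parameters: theta[k] = P(attribute | Z_i = k).
    theta = [rng.random() for _ in range(n_clusters)]
    # Hypothetical relation parameters, symmetric in the two classes:
    # phi[k][l] = P(interact | Z_i = k, Z_j = l).
    phi = [[0.0] * n_clusters for _ in range(n_clusters)]
    for k in range(n_clusters):
        for l in range(k, n_clusters):
            phi[k][l] = phi[l][k] = rng.random()
    # The attribute depends only on the protein's own latent class.
    has_attribute = [rng.random() < theta[z[i]] for i in range(n_proteins)]
    # An interaction depends on the latent classes of both proteins.
    interact = {}
    for i in range(n_proteins):
        for j in range(i + 1, n_proteins):
            interact[(i, j)] = rng.random() < phi[z[i]][z[j]]
    return z, has_attribute, interact


z, has_attribute, interact = sample_ground_network(n_proteins=3, n_clusters=2)
print(z, has_attribute, interact)
```

With three proteins this produces exactly the three pairwise "interact" variables of the ground network; inference in the actual model runs in the opposite direction, from observed attributes and interactions to the latent classes.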

  13. Experimental Results: KDD Cup 2001
     • Yeast genome data
     • 1243 genes/proteins: 862 (training) / 381 (test)
     • Attributes
        • Chromosome
        • Motif (351) [1-6]: a gene might contain one or more characteristic motifs (information about the amino acid sequence of the protein)
        • Essential
        • Structural class (24) [1-2]: the protein coded by the gene might belong to one or more structural categories
        • Phenotype (11) [1-6]: observed phenotypes in the organism
        • Interaction
        • Complex (56) [1-3]: the gene's product can form a complex with others to build a larger protein
        • Function (14) [1-4]: cell growth, cell organization, transport, …
     • genes were anonymous

  14. Results: Comparison with Supervised Models [Figure: ROC curve. Table: accuracy by model.]

  15. IHRM Result [Figure: clustering result. Node: gene; link: interaction; color: cluster.]

  16. Integrating Ontological Prior Knowledge into the IHRM

  17. Integration of ontologies: deductive closure
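
As a minimal sketch of deductive closure over a subclass hierarchy: every annotation of a protein is extended with all ancestors of the annotated concept. The example reuses the function chain from slide 4 purely for illustration (the experiments apply the idea to the "complex" ontology); the names `PARENTS` and `deductive_closure` are hypothetical, and each concept is assumed to list its direct parents.

```python
# Direct subclass links (child -> list of parents), taken from the chain on
# slide 4. A directed acyclic graph may give a concept several parents.
PARENTS = {
    "triose-phosphate isomerase activity": [
        "intramolecular oxidoreductase activity, interconverting aldoses and ketoses"
    ],
    "intramolecular oxidoreductase activity, interconverting aldoses and ketoses": [
        "intramolecular oxidoreductase activity"
    ],
    "intramolecular oxidoreductase activity": ["isomerase activity"],
    "isomerase activity": ["catalytic activity"],
}


def deductive_closure(annotations, parents):
    """Return the annotations together with all of their ancestor concepts."""
    closed = set()
    stack = list(annotations)
    while stack:
        concept = stack.pop()
        if concept in closed:
            continue  # already expanded, e.g. reached through another parent
        closed.add(concept)
        stack.extend(parents.get(concept, ()))
    return closed


closure = deductive_closure({"triose-phosphate isomerase activity"}, PARENTS)
for concept in sorted(closure):
    print(concept)
```

A single specific annotation thus yields five annotations after closure, which is the mechanism by which the number of annotations per protein grows once ontological constraints are included.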

  18. Integration of ontologies [Figure: latent variable Zi with function, motif and complex attributes; the complex concepts shown (translocon, signal peptidase, cytoskeleton, actin filaments, microtubules) are marked as independent or dependent concepts.]

  19. Experiments: Including the "Complex" Ontology
     • Data collected from CYGD of MIPS
     • 1000 genes/proteins: 800 (training) / 200 (test)
     • Attributes: chromosome, motif, essential, structural class, phenotype, interaction, complex, function
     • interactions from DIP
     • usage of ontological knowledge on complex
        • five hierarchical levels
        • in our model: 258 nodes (concepts) using 66 top-level categories
        • every protein has at least one complex annotation
        • after including ontological constraints: about three annotations per protein on average

  20. Results (AUC)
     • 800 (training) / 200 (test): w/o ontology 0.895, with ontology 0.928
     • 200 (training) / 200 (test): w/o ontology 0.832, with ontology 0.894
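
For readers unfamiliar with the metric, AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name `auc` and the toy scores are made up for illustration and are unrelated to the experimental data.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels are 1 for positive and 0 for negative examples.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correctly ordered pair
    return wins / (len(positives) * len(negatives))


# Toy example (invented scores): 3 of the 4 positive/negative pairs are ordered correctly.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives a scale for reading the numbers reported on this slide.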

  21. Results: explicit modeling of dependencies

  22. Results
     • Grey: in test set
     • proteins concerned with secretion and transportation
        • The "Golgi apparatus" works together with the "endoplasmic reticulum (ER)" as the transport and delivery system of the cell
        • "SNARE" proteins help to direct material to the correct destination
        • The test proteins are also annotated with "cellular transport"
     • proteins acting in cell division
        • control proteins
        • "Septins": septins have several roles throughout the cell cycle and carry out essential functions in cytokinesis
        • The three highlighted proteins fit into this cluster ("cell fate" and "cell type differentiation")

  23. Results: sampling convergence

  24. Results: distribution of proteins in the clusters

  25. Results
     • Grey: former singletons
     • Cellular transport cluster
        • The former singleton "Clathrin light chain", a major constituent of coated vesicles (a component for transport), fits into this cluster quite well
     • Tasks occurring during DNA replication
        • The former singleton "DNA polymerase", a main actor in replication, is clearly assigned to the correct cluster here

  26. Conclusion
     • application of the IHRM to function prediction
        • competitive with supervised learning methods
        • insights into the solution
     • advantages of integrating ontological knowledge
        • improvement of the clustering structure
        • robustness: stable results with varying parameterization
        • deductive closure prior to learning is a general, powerful principle
     • future challenges
        • usage of several or more complex ontologies
        • further analysis of dependent vs. independent concepts
     • Acknowledgements: Karsten Borgwardt (MPIs Tübingen); Hans-Peter Kriegel (LMU)
