10 likes | 130 Views
ATM. 12917635- Oncogene (6.737). 12970738-Oncogene (6.737). 14500819-Nucleic Acids Res. (6.373). 14499692-Science (23.329). STK11. 12183403 – Cancer Res (8.30). 12234250 – Biochem J (4.326). 12805220 - EMBO J. (12.459). 11853558- Biochem J (4.326).
E N D
ATM 12917635- Oncogene (6.737) 12970738-Oncogene (6.737) 14500819-Nucleic Acids Res. (6.373) 14499692-Science (23.329) STK11 12183403 – Cancer Res (8.30) 12234250 – Biochem J (4.326) 12805220 - EMBO J. (12.459) 11853558- Biochem J (4.326) Set Similarity Measures for Gene Matching Mihail Popescu#, James Keller+, Joyce Mitchell# # Department of Health Management and Informatics;+Department of Electrical and Computer Engineering; University of Missouri-Columbia, Columbia, MO 65211 Example of Similarity Calculation for the Gene Ontology (GO) Dimension Why Similarity Measures? • For a unified clustering approach in a 4D gene space • Gene space dimensions (4D): sequence, microarray expression, literature abstracts (articles), gene ontology (GO) • Two dimensions are numeric (sequence, expression) and two symbolic • The existent symbolic measures are not adequate: • Dice, Jaccard: do not consider the weight of the elements • Maximum and average usually overestimates the or underestimates the similarity, respectively • Example: ATM (human ataxia telangiectasia mutated) and STK11 (serine/threonine kinase 11.) The geneticist assessed these two genes as quasi-similar (similarity ~0.5) because: • they both have protein serine/threonine kinase enzyme activity (they share a kinase domain) • They both cause cancers when mutated, including breast cancer. • s(ATM, STK11)=? (GO dimension) • Algorithm: • 1. Retrieve LocusLink GO annotations: • ATM={4674: “ protein serine/threonine kinase activity”, 3677: ” DNA binding”, 4428 ” inositol/phosphatidylinositol kinase activity”, 7131 : ” meiotic recombination”, 6281 : ” DNA repair”, 7165: ” signal transduction”, 5634: ” nucleus”, 16740: ” transferase activity”, 45786: ” negative regulation of cell cycle”} • STK11={5524: “ ATP binding”, 4674: ” protein serine/threonine kinase activity”, 6468: ” protein amino acid phosphorylation”, 16740: ” transferase activity”} • 2. Compute GO term densities using the Resnik formula [4], the normalized version [.] or the depth in the hierarchy (.) • Calculate the confidence of the pair g(A1, A2) =g(A1)*g(A2) and normalize using maximum value: • The pair-wise similarity values calculated using FMS are: • Similarity calculation: • Using weighted average: s(ATM, STK11)=0.37 • Using Choquet integral: s(ATM, STK11)=0.53 Conclusions • For the GO dimension, the best method of assigning densities was normalizing the information content [4] by the maximum value • The proposed fuzzy similarity measure (FMS) agrees better with our intuition of similarity: if the common elements have a high confidence, then the similarity is stronger. In addition, the non common terms have also a contribution to the similarity since the measure is computed apriori for each term set. • The Choquet similarity measure is much more general, depending only on the fuzzy measure. In addition the optimal fuzzy measure can be learned from examples. • 3. Compute the similarity: Possible similarity measures Example of Similarity Calculation for the Retrieved Abstracts Dimension Expression dimension: real measures (Euclidian measure, etc…). Sequence dimension: sequence similarity measure (Smith-Waterman, Needleman-Wunsch, etc…) GO and Abstract dimension: set similarity. Acknowledgements This research was supported by National Library of Medicine Biomedical and Health Informatics Research Training grant 2-T15-LM07089-11. Set similarity measures References • Set similarity: given two gene products, G1 and G2, we can consider them as being represented by collections of terms: Based on the two sets, the goal is to define a natural similarity between G1 and G2 and , denoted as : • Two types of set similarity: • element based (Dice, Jaccard, Cosine, fuzzy measure) • pair of elements based (Maximum, Average, OWA, Choquet) [1] C.D. Manning, H. Schutze, Foundations of Statistical Natural Language Processing, MIT Press, 2001. [2] R. Yager, “Criteria Aggregation Functions Using Fuzzy Measures and the Choquet Integral”, Int. Jour. of Fuzzy Systems, Vol.1, No. 2, December 1999. [3] J.J. Jiang, D.W. Conrath, “Semantic Similarity Based on Corpus Statistics and Lexical Ontology”, Proc. of Int. Conf. Research on Comp. Linguistics X, 1997, Taiwan. [4] P.W. Lord, R.D. Stevens, A. Brass, C.A. Goble, “Semantic similarity measure as a tool for exploring the gene ontology”, In Pacific Symposium on Biocomputing, pages 601-612, 2003. [5] M. Sugeno, Fuzzy measures and fuzzy integrals: a survey, (M.M. Gupta, G. N. Saridis, and B.R. Gaines, editors) Fuzzy Automata and Decision Processes, pp. 89-102, North-Holland, New York, 1977. [6] S. Raychaduri, R.B. Altman, “A literature-based method for assessing the functional coherence of a gene group”, Bioinformatics, 19(3), pp. 396:401, Feb. 2003. [7]. M. Grabisch, T. Murofushi, and M. Sugeno (eds.), Fuzzy Measures and Integrals: Theory and Applications, Springer-Verlag, 2000. [8]. Hvidsten TR, Komorowski J, Sandvik AK, Laegreid A. Predicting gene function from gene expressions and ontologies. Pac Symp Biocomput. 2001;:299-310. [9]. Trupti Joshi. Cellular function prediction for hypothetical proteins using high-throughput data. MS thesis, University of Tennessee, Knoxville, 2003. [10]. Keller J, Popescu M, Mitchell J. Soft Computing Tools for Gene Similarity Measures in Bioinformatics, FLINT-CIBI 2003, Berkeley, Dec 15-18, 2003. • s(ATM, STK11)=? (Abstract dimension) • Algorithm: • Retrieve PubMed abstracts for ATM, STK11 • Calculate all the pair-wise distances based on the MeSH indexing • Keep the 4 best-matching pairs • Find the impact factor for each journal: g(Ai), i=1…8