
Kati Iltanen a , Sami Kiviharju a , Lida Ao a , Martti Juhola a , Ilmari Pyykkö b

Clustering and Summarising Association Rules Mined from Phenotype, Genotype and Environmental Data Concerning Age-Related Hearing Impairment. Kati Iltanen a , Sami Kiviharju a , Lida Ao a , Martti Juhola a , Ilmari Pyykkö b a School of Information Sciences, University of Tampere, Finland



Presentation Transcript


  1. Clustering and Summarising Association Rules Mined from Phenotype, Genotype and Environmental Data Concerning Age-Related Hearing Impairment
     Kati Iltanen (a), Sami Kiviharju (a), Lida Ao (a), Martti Juhola (a), Ilmari Pyykkö (b)
     (a) School of Information Sciences, University of Tampere, Finland
     (b) School of Medicine, University of Tampere, Finland

  2. Introduction
     • Aim of the study: to examine the applicability of association rules for analysing the effects of genetic and environmental factors on age-related hearing impairment (ARHI)
       • To possibly generate new hypotheses for medical research
     • Association analysis
       • Data mining approach to discover items (variable-value pairs) that frequently co-occur in data
       • Association rules of the form “A → B” are generated from frequent item sets
       • Capability to do a complete search efficiently
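The idea of items as variable-value pairs and of frequent item sets can be illustrated with a minimal sketch. It is not the mining algorithm used in the study (the rules were mined with Magnum Opus): it is a brute-force enumeration, and the function name `frequent_itemsets`, the toy records and the support threshold are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(records, min_support, max_size=3):
    """Return {itemset: support} for item sets of up to max_size items.

    Each record is a set of (variable, value) pairs. This is a brute-force
    count over all small item sets, not an efficient complete search.
    """
    counts = Counter()
    for record in records:
        items = sorted(record)  # fixed order so each item set has one key
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    n = len(records)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


# Hypothetical toy data: four cases, three variables each.
records = [
    {("noise", "yes"), ("smoking", "yes"), ("arhi", "yes")},
    {("noise", "yes"), ("smoking", "no"), ("arhi", "yes")},
    {("noise", "no"), ("smoking", "no"), ("arhi", "no")},
    {("noise", "yes"), ("smoking", "yes"), ("arhi", "yes")},
]
freq = frequent_itemsets(records, min_support=0.5)
```

In this toy data the item set {noise = yes, arhi = yes} occurs in three of four cases, so it is frequent and could give rise to a candidate rule such as "noise = yes → arhi = yes".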

  3. Introduction
     • Challenge
       • High-dimensional data result in a very large number of association rules.
       • Rules may overlap.
       → Postprocessing is needed
     • Focus of the study: to develop an approach to cluster, summarise and represent association rules for easier exploration

  4. ARHI data
     • Originate from a European multicentre study on ARHI
     • Collected in nine medical centres from seven European countries (e.g. Van Laer et al., 2008)
     • 2428 cases: females and males aged 53 to 67
       • The cases represent the best and the worst hearing thirds of their population at high frequencies (2, 4 and 8 kHz)
       • 1241 cases with ARHI
       • Cases having pathologies (other than ARHI) possibly influencing hearing ability were excluded

  5. ARHI data
     • 764 variables
       • 42 phenotypes and environmental factors
         • Phenotypes: e.g. gender, age, body mass index, blood pressure, diabetes, cardiovascular disease and renal failure
         • Environmental and lifestyle factors: e.g. use of ototoxic medication, exposure to chemicals, exposure to noise, alcohol use, and tobacco smoking
       • 722 single nucleotide polymorphisms (SNPs) from 70 candidate genes

  6. ARHI rules
     • Rules were mined with Magnum Opus from RuleQuest Research.
     • Form for rules: LHS → Zhighbest > 0.147
       • LHS: genotype, phenotype and environmental variables, from 1 to 3 items
       • Zhighbest > 0.147: “Has a hearing impairment”
     • Zhighbest: averaged gender- and age-independent Z-score of high frequencies (2, 4 and 8 kHz) for the better hearing ear
     • 0.147: a threshold value given by the expert physician

  7. ARHI rules
     • Interestingness measures used for association rules
       • Support
       • Confidence
       • Lift
       • Statistical significance: Fisher exact test
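The measures listed above can be computed from four case counts, as in the following minimal sketch. It uses the standard textbook definitions, which may differ in detail from what Magnum Opus reports. The one-sided form of the Fisher exact test is an assumption, as are the function names and the counts for the rule's left-hand side (200 matching cases, 130 of them impaired); only the totals 2428 and 1241 come from the slides.

```python
from math import comb


def rule_measures(n, n_lhs, n_rhs, n_both):
    """Support, confidence and lift of a rule LHS -> RHS.

    n: all cases, n_lhs: cases matching LHS, n_rhs: cases matching RHS,
    n_both: cases matching both sides.
    """
    support = n_both / n
    confidence = n_both / n_lhs
    lift = confidence / (n_rhs / n)
    return support, confidence, lift


def fisher_one_sided(n, n_lhs, n_rhs, n_both):
    """One-sided Fisher exact test: P(at least n_both co-occurrences)
    under the hypergeometric null of independence between LHS and RHS."""
    upper = min(n_lhs, n_rhs)
    tail = sum(
        comb(n_rhs, k) * comb(n - n_rhs, n_lhs - k)
        for k in range(n_both, upper + 1)
    )
    return tail / comb(n, n_lhs)


# 2428 cases and 1241 ARHI cases are from the data description;
# the LHS counts (200 and 130) are hypothetical.
support, confidence, lift = rule_measures(2428, 200, 1241, 130)
p_value = fisher_one_sided(2428, 200, 1241, 130)
keep_rule = lift > 1 and p_value < 0.01
```

With these hypothetical counts the confidence is 0.65, compared with a baseline ARHI proportion of about 0.51, so the lift is above 1.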

  8. Clustering ARHI rules
     • Measure of similarity or closeness between two association rules: the proportion of cases matched by both rules among the cases matched by either one or both rules (a variant of a measure presented by Gupta et al., 1999)
     • Example
       • Intersection of R22 and R26: 187 cases (both R22 and R26 hold for 187 cases)
       • Union of R22 and R26: 190 cases
       • The similarity between R22 and R26: 187/190 ≈ 0.98
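The similarity described above is the ratio of intersection size to union size of the case sets matched by two rules. A minimal sketch follows; the function name `rule_similarity` and the case identifiers are hypothetical, chosen only so that the intersection and union sizes match the R22/R26 example (187 and 190).

```python
def rule_similarity(cases_a, cases_b):
    """Proportion of cases matched by both rules among the cases
    matched by either rule (intersection size over union size)."""
    union = cases_a | cases_b
    if not union:
        return 0.0  # convention for two rules that match no cases
    return len(cases_a & cases_b) / len(union)


# Hypothetical case identifiers: 187 shared cases, 190 cases in total.
r22_cases = set(range(0, 188))   # cases 0..187
r26_cases = set(range(1, 190))   # cases 1..189
similarity = rule_similarity(r22_cases, r26_cases)
print(round(similarity, 2))  # prints 0.98
```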

  9. Clustering ARHI rules
     • Clustering method based on graph-theoretic techniques
     • Implemented using Matlab, Java and PostgreSQL
     • Rule graph
       • Rules: nodes
       • Similarities between rules: weights of edges between nodes
       • Similarities above a chosen threshold: connections between nodes
     • One connected component is a rule subset or cluster.
     • Clustering: searching for connected components
     • Figure: a connected component (a threshold of 0.3 used for the similarity measure).
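The rule graph and the search for connected components can be sketched as below. This is an illustration in Python, not the authors' Matlab, Java and PostgreSQL implementation; the function names, the rule labels and the case sets are hypothetical. An edge is drawn only when the similarity is strictly above the threshold, which is one reading of "above chosen threshold".

```python
from itertools import combinations


def jaccard(a, b):
    """Intersection size over union size of two case sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_rules(rule_cases, threshold=0.3):
    """Group rules into connected components of the thresholded rule graph.

    rule_cases maps a rule label to the set of cases the rule matches.
    """
    # Build the graph: connect two rules if their similarity exceeds the threshold.
    adjacency = {rule: set() for rule in rule_cases}
    for r1, r2 in combinations(rule_cases, 2):
        if jaccard(rule_cases[r1], rule_cases[r2]) > threshold:
            adjacency[r1].add(r2)
            adjacency[r2].add(r1)

    # Depth-first search from each unvisited rule yields one component.
    clusters, seen = [], set()
    for start in rule_cases:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


# Hypothetical rules and matched cases.
rules = {
    "R1": {1, 2, 3, 4},
    "R2": {2, 3, 4, 5},
    "R3": {4, 5, 6},
    "R4": {10, 11},
}
clusters = cluster_rules(rules, threshold=0.3)
```

Here R1 and R3 are not directly similar enough (1/6), but both are connected to R2, so all three fall into one cluster. This chaining is a property of connected components that is worth keeping in mind when choosing the threshold.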

  10. Summarising rule subsets
     • Rules represented in HTML documents
     • Program implemented using Matlab
     • Rule subset information is given at different levels of detail
     • Overall summary listing for rule subsets
       • Number of rules, coverage, main item

  11. Summarising rule subsets • At the next level, rule subset information is enlarged with the information about the other items.

  12. Representing rule subsets
     • Gene colouring
     • Marking items of special interest
       • Important SNPs from earlier studies
     • Ordering items in rules on the basis of item frequencies

  13. Representing rule subsets • Ordering rules in clusters on the basis of item frequencies
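The slides do not specify the exact ordering scheme, so the following is only one plausible sketch: items inside each rule are sorted by how many rules of the cluster contain them (most frequent first), and the rules are then sorted by their reordered item lists. The function name `order_cluster`, the tie-breaking by item name and the example items are all assumptions.

```python
from collections import Counter


def order_cluster(rules):
    """Order items within rules, and rules within a cluster, by item frequency.

    rules is a list of rules, each given as a list of item labels.
    """
    # Number of rules in the cluster that contain each item.
    freq = Counter(item for rule in rules for item in set(rule))

    def item_key(item):
        # Most frequent items first; ties broken alphabetically (an assumption).
        return (-freq[item], item)

    ordered = [sorted(set(rule), key=item_key) for rule in rules]
    # Rules sharing the most frequent leading items end up next to each other.
    ordered.sort(key=lambda rule: [item_key(item) for item in rule])
    return ordered


# Hypothetical cluster of three rules.
cluster = [
    ["solvent=yes", "noisy_work=yes"],
    ["noisy_work=yes"],
    ["gender=male", "noisy_work=yes"],
]
ordered = order_cluster(cluster)
```

Under this scheme the item shared by all rules ("noisy_work=yes") moves to the front of every rule, which makes the common theme of the cluster visible at a glance.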

  14. Representing rule subsets

  15. Representing rule subsets
     • Similarities between the rules in a similarity matrix
     • Figure labels: “Noisy workplace” rules; highly overlapping rules; “Solvent exposure” rules

  16. Summary statistics of ARHI rules
     • Common threshold values: lift 1, Fisher exact test: α = 0.01

  17. Conclusions
     • The developed approach
       • simplified the rule exploration by grouping together
         • the rules concerning the same items
         • the rules concerning the same phenomenon
       • enabled the recognition of the overlapping rules, possibly suggesting more complex interactions
     • Association analysis
       • detected factors found significant in previous studies concerning this ARHI data
       • enabled more exhaustive analysis of more complex patterns
         • However, the problem of multiple testing has to be remembered.
       • gave new interesting information to the expert physician
         • especially rules concerning osteoporosis

  18. References and acknowledgments
     Acknowledgments
     The authors are grateful to Baur M, Bille M, Bonaconsa A, Cremers CW, Demeester K, Dhooge I, Diaz-Lacava AN, Espeso A, Fransen E, Hannula S, Hendrickx JJ, Huygen PL, Huyghe J, Huyghe JR, Jensen M, Konings A, Kremer H, Kunst S, Lacava A, Lemkens N, Manninen M, Mazzoli M, Mäki-Torkko E, Orzan E, Parving A, Pawelczyk M, Pfister M, Rajkowska E, Sliwinska-Kowalska M, Sorri M, Steffens M, Stephens D, Topsakal V, Tropitzsch A, Van Camp G, Van de Heyning PH, Van Eyken E, Van Laer L, Verbruggen K, and Wienker TF, for the possibility to use the ARHI data.
     References
     • Gupta et al., Distance based clustering of association rules. In: Intelligent Engineering Systems Through Artificial Neural Networks (Proceedings of ANNIE 1999), ASME Press, 1999, pp. 759-764.
     • Van Laer et al., The grainyhead like 2 gene (GRHL2), alias TFCP2L3, is associated with age-related hearing impairment. Hum Mol Genet 2008: 15: 159-69.
