280 likes | 461 Views
Correspondence analysis for data mining with applications in medicine. Annie Morin IRISA France amorin@irisa.fr. Correspondence analysis.
E N D
Correspondence analysis for data mining with applications in medicine Annie Morin IRISA France amorin@irisa.fr
Correspondence analysis • Statistical vizualization method for displaying the associations between the levels of a two-contingency table and the distances between the categories of each variable => exploratory method • Usually, Chi-square test for independence in a contingency table
CA • Duality between the row and the columns • Use of the row profiles and of the column profiles • Use of chi-square distance (distributional equivalence) • Factorial analysis method (eigen values of a ad-hoc matrix) and reduction of dimensionality
D4 D1 animal heart forest surgery D2 D3
Distances Between two columns Between two rows
Diagonalization of a « covariance matrix » to find the eigenvalues and corresponding eigenvectors • λ1≥λ2≥…….. ≥λp • Inertia of the cloud is ∑λi =2 / n • Distance to the independence model
Simultaneous representation • Of the rows and of the columns profiles on the same factorial plane • Validity of representation : • Inertia : contributions that describe the proportion of variance explained provided by each element (row or column profile) in building an axis • Quality of representation of each element by the axes
Applications in medicine • Pharmacology • Therapeutic trials (to avoid double blind procedures) : CA allows the physician to follow the evolution of the illness or/and of the therapy • Textual analysis : reports, business intelligence, bibliometry
Application on mucoviscidosis • Mucoviscidosis : rare disease • No specific keywords • No specific magazines • Goal : To define a minimum common vocabulary for the researchers working on mucoviscidosis (clinicians, geneticists, etc..)
SURGEON WORDS GENETICS WORDS TOPIC WORDS HYPOTHESIS : THE TYPICAL WORDS FOR A GIVEN TOPIC ARE INDEPENDENT OF THE TECHNIQUES
Processing • First step of the study : to create a “kernel” base which contains the references of scientific documents used by people working on the disease => 612 publications
30 axes with a positive side and a negative one • Each side of each axis is characterized by the words with a high relative contribution to the inertia (greatest than a threshold).
DATA • Two-table crossing the 612 documents (summaries) and 850 words • CA on this two-way table
Dimension of a word • The words of a topic are one-dimensional • The words of a filed are multidimensional • The dimension of a word is the number of axis on which this word has a high relative contribution to inertia • If we want to find the minimum common vocabulary, the dimension of a word must be high
Is a high dimension a sufficient condition to characterize the disease? To check it, we use other thematic databases and in each of them, we count the number of documents with at least two words among the previous 81 words.
5 thematic databases • BREAST CANCER …………………………..9871 doc • POLYAMINES……………………………...12726 doc • LEUCOCYTE INFILTRATED TUMOR ……586 doc • ACUTE LYPMPHOBLAST LEUKEMIA …2063 doc • MUCOVISCIDOSCIS………………………...612 doc
Conclusion • CA is a very powerful methof to display teh association among variables • It can be used with large datasets (one of the dimension must be « tractable »)
Thanks to Michel Kerbaol for allowing me to use its data on mucoviscidosis • Michel.Kerbaol@univ-rennes1.fr • Software : Qnomis