
Correspondence analysis for data mining with applications in medicine

Correspondence analysis for data mining with applications in medicine. Annie Morin IRISA France amorin@irisa.fr. Correspondence analysis.


Presentation Transcript


  1. Correspondence analysis for data mining with applications in medicine Annie Morin IRISA France amorin@irisa.fr

  2. Correspondence analysis • Statistical visualization method for displaying the associations between the levels of the two variables of a two-way contingency table, and the distances between the categories of each variable => exploratory method • Usual approach to a contingency table: chi-square test for independence

  3. CA • Duality between the rows and the columns • Use of the row profiles and of the column profiles • Use of the chi-square distance (distributional equivalence) • Factorial analysis method (eigenvalues of an ad hoc matrix) and reduction of dimensionality

  4. Example : Frequency table

  5. Row-profiles

  6. Column profile
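
The frequency table and the profile tables on slides 4 to 6 were images and did not survive in this transcript. As a minimal sketch of what those slides compute, assuming a small made-up table of counts (the numbers, and the pairing of documents D1 to D4 with the words animal, heart, forest and surgery, are illustrative assumptions, not the presentation's data):

```python
from typing import List

Table = List[List[float]]

# Hypothetical counts, NOT the presentation's data:
# rows are documents D1..D4, columns are the words animal, heart, forest, surgery.
counts: Table = [
    [6, 1, 5, 0],
    [1, 7, 0, 6],
    [5, 0, 7, 1],
    [0, 6, 1, 5],
]


def row_profiles(table: Table) -> Table:
    """Divide each row by its total, so every row profile sums to 1."""
    return [[cell / sum(row) for cell in row] for row in table]


def column_profiles(table: Table) -> Table:
    """Divide each column by its total; result[j] is the profile of column j."""
    return [[cell / sum(col) for cell in col] for col in zip(*table)]


for name, profile in zip(["D1", "D2", "D3", "D4"], row_profiles(counts)):
    print(name, [round(p, 3) for p in profile])
```

The sketch assumes no row or column is entirely zero; an empty row or column has no profile and would have to be dropped before the analysis.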

  7. [Figure: plot of the documents D1, D2, D3, D4 and the words animal, heart, forest, surgery]

  8. Distances • Between two columns • Between two rows
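
The distance formulas on this slide were images and are missing from the transcript. The sketch below uses the standard chi-square distance between profiles, which is assumed here to be what the slide showed: the squared differences between two row profiles are weighted by the inverse of the column masses. The two small tables are hypothetical and are chosen to illustrate distributional equivalence: merging two proportional columns leaves the distances between rows unchanged.

```python
import math
from typing import List

Table = List[List[float]]


def chi2_distance_rows(table: Table, i: int, k: int) -> float:
    """Chi-square distance between the profiles of rows i and k."""
    n = sum(sum(row) for row in table)
    col_mass = [sum(col) / n for col in zip(*table)]  # column totals / n
    total_i, total_k = sum(table[i]), sum(table[k])
    squared = sum(
        (a / total_i - b / total_k) ** 2 / mass
        for a, b, mass in zip(table[i], table[k], col_mass)
    )
    return math.sqrt(squared)


def chi2_distance_columns(table: Table, j: int, l: int) -> float:
    """Same distance between two column profiles, obtained by transposing."""
    transposed = [list(col) for col in zip(*table)]
    return chi2_distance_rows(transposed, j, l)


# Hypothetical table in which columns 0 and 1 are proportional (same profile).
table = [[2, 1, 5], [4, 2, 1], [6, 3, 2]]
# The same table with columns 0 and 1 merged into one.
merged = [[3, 5], [6, 1], [9, 2]]

print(chi2_distance_rows(table, 0, 1))
print(chi2_distance_rows(merged, 0, 1))
```

Both printed distances agree up to floating-point rounding, which is the property that makes the chi-square distance robust to how categories are split or pooled.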

  9. Diagonalization of a « covariance matrix » to find the eigenvalues and corresponding eigenvectors • λ1 ≥ λ2 ≥ … ≥ λp • Inertia of the cloud is ∑λi = χ² / n • Distance to the independence model
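
The inertia identity on this slide (the sum of the eigenvalues equals the chi-square statistic divided by n) can be checked numerically. The sketch below assumes the usual CA formulation, in which the diagonalized matrix is SᵀS with S the matrix of standardized residuals; the sum of its eigenvalues is then its trace, that is, the sum of the squared entries of S, so no eigen-solver is needed for the check. The table of counts is hypothetical.

```python
import math
from typing import List

Table = List[List[float]]


def chi_square_statistic(table: Table) -> float:
    """Pearson chi-square: sum of (observed - expected)^2 / expected."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n  # independence model
            stat += (observed - expected) ** 2 / expected
    return stat


def total_inertia(table: Table) -> float:
    """Trace of S^T S, where S holds the standardized residuals."""
    n = sum(sum(row) for row in table)
    r = [sum(row) / n for row in table]        # row masses
    c = [sum(col) / n for col in zip(*table)]  # column masses
    residuals = [
        [(table[i][j] / n - r[i] * c[j]) / math.sqrt(r[i] * c[j])
         for j in range(len(c))]
        for i in range(len(r))
    ]
    return sum(s * s for row in residuals for s in row)


# Hypothetical counts, not the presentation's data.
counts = [[6, 1, 5, 0], [1, 7, 0, 6], [5, 0, 7, 1], [0, 6, 1, 5]]
n = sum(sum(row) for row in counts)
print(total_inertia(counts), chi_square_statistic(counts) / n)
```

A table whose rows are all proportional matches the independence model exactly, so its inertia is zero; this is the sense in which the inertia measures the distance to independence.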

  10. Simultaneous representation • Of the row profiles and of the column profiles on the same factorial plane • Validity of the representation: • Inertia: the contributions give the share of an axis's inertia (variance) that each element (row or column profile) provides in building that axis • Quality of representation of each element by the axes

  11. Applications in medicine • Pharmacology • Therapeutic trials (to avoid double-blind procedures): CA allows the physician to follow the evolution of the illness and/or of the therapy • Textual analysis: reports, business intelligence, bibliometrics

  12. Application to mucoviscidosis • Mucoviscidosis (cystic fibrosis): rare disease • No specific keywords • No specific journals • Goal: to define a minimum common vocabulary for the researchers working on mucoviscidosis (clinicians, geneticists, etc.)

  13. SURGEON WORDS • GENETICS WORDS • TOPIC WORDS • HYPOTHESIS: THE TYPICAL WORDS FOR A GIVEN TOPIC ARE INDEPENDENT OF THE TECHNIQUES

  14. Processing • First step of the study : to create a “kernel” base which contains the references of scientific documents used by people working on the disease => 612 publications

  15. 30 axes, each with a positive side and a negative side • Each side of each axis is characterized by the words with a high relative contribution to the inertia (greater than a threshold).

  16. DATA • Two-way table crossing the 612 documents (summaries) and 850 words • CA on this two-way table

  17. Dimension of a word • The words of a topic are one-dimensional • The words of a field are multidimensional • The dimension of a word is the number of axes on which this word has a high relative contribution to the inertia • If we want to find the minimum common vocabulary, the dimension of a word must be high
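
The definition of a word's dimension translates directly into a count. The sketch below is illustrative only: the words, the contribution values, the four axes and the threshold of 0.2 are all made up, since the slides do not give the threshold used in the study (which worked with 30 axes).

```python
from typing import Dict, List


def word_dimension(contributions: List[float], threshold: float) -> int:
    """Number of axes on which the word's relative contribution exceeds the threshold."""
    return sum(1 for value in contributions if value > threshold)


def common_vocabulary(
    table: Dict[str, List[float]], threshold: float, min_dimension: int
) -> List[str]:
    """Words whose dimension is strictly greater than min_dimension."""
    return sorted(
        word
        for word, contributions in table.items()
        if word_dimension(contributions, threshold) > min_dimension
    )


# Hypothetical relative contributions of three words on four axes.
relative_contributions = {
    "gene": [0.30, 0.25, 0.40, 0.10],
    "lung": [0.05, 0.60, 0.02, 0.01],
    "patient": [0.22, 0.21, 0.05, 0.35],
}

for word, contributions in relative_contributions.items():
    print(word, word_dimension(contributions, threshold=0.2))
print(common_vocabulary(relative_contributions, threshold=0.2, min_dimension=2))
```

In this toy data "lung" behaves like a one-dimensional topic word, while "gene" and "patient" are multidimensional and would be kept as candidates for the common vocabulary.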

  18. MUCOVISCIDOSIS BASE

  19. 81 words have a dimension greater than 10

  20. Is a high dimension a sufficient condition to characterize the disease? To check it, we use other thematic databases and in each of them, we count the number of documents with at least two words among the previous 81 words.
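
The check described on this slide amounts to counting, in each database, the documents that contain at least two words of the selected list. A minimal sketch, assuming that "two words" means two distinct list words and that documents are plain-text summaries split on letters only; the sample documents and the four-word list are hypothetical stand-ins for the real databases and the 81 words:

```python
import re
from typing import Iterable, List


def count_matching_documents(
    documents: Iterable[str], vocabulary: Iterable[str], min_hits: int = 2
) -> int:
    """Count documents containing at least min_hits distinct vocabulary words."""
    vocab = {word.lower() for word in vocabulary}
    count = 0
    for text in documents:
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        if len(tokens & vocab) >= min_hits:
            count += 1
    return count


# Hypothetical summaries and word list, for illustration only.
documents: List[str] = [
    "Chloride transport in the lung of the patient",
    "Tumor growth and polyamines",
    "Gene mutation and chloride channel",
]
vocabulary = ["chloride", "lung", "gene", "mutation"]

hits = count_matching_documents(documents, vocabulary)
print(f"{hits} of {len(documents)} documents retrieved")
```

Running the same count on each thematic database gives a retrieval rate per database; a word list that characterizes the disease should retrieve most of the mucoviscidosis documents and few of the others.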

  21. 5 thematic databases • BREAST CANCER: 9871 documents • POLYAMINES: 12726 documents • LEUCOCYTE INFILTRATED TUMOR: 586 documents • ACUTE LYMPHOBLASTIC LEUKEMIA: 2063 documents • MUCOVISCIDOSIS: 612 documents

  22. RETRIEVAL STATISTICS WITH THE 81 WORDS

  23. CA of the 5 databases and 81 words

  24. The 20 remaining words

  25. Retrieval statistics with these 20 words

  26. Conclusion • CA is a very powerful method to display the associations among variables • It can be used with large datasets (one of the dimensions must be « tractable »)

  27. Thanks to Michel Kerbaol for allowing me to use his data on mucoviscidosis • Michel.Kerbaol@univ-rennes1.fr • Software: Qnomis
