AUTOMATIC CLASSIFICATION OF MEDICAL REPORT

This PhD thesis explores the automatic classification of medical reports using text mining techniques to assign diagnoses from the ICD10 code classification system. The state-of-the-art on classification methods, algorithms, and evaluation measures is discussed, along with the analysis of a dataset of 33,000 medical reports. The results show promising performance in accurately classifying medical reports into specific diagnostic codes.

Presentation Transcript


  1. AUTOMATIC CLASSIFICATION OF MEDICAL REPORT Didier Nakache September 26, 2007 CEDRIC laboratory – ISID team – CNAM of Paris Didier Nakache - PHD thesis

  2. Summary • Introduction: presentation of the project • State of the art in text mining for classification • The EDA and CLO3 algorithms • About evaluation • The Rhea project • Conclusion

  3. 1. Presentation

  4. General Presentation • Rhea is a decision-support tool for ICUs (intensive care units) with two major axes: • Rhea: data warehouse / data mining • CIREA: text mining • The main themes of Rhea are nosocomial infections and iatrogenic events • The CIREA subproject represents only a small part of Rhea

  5. The CIREA Subproject • The aim of the CIREA project is to classify medical reports written in natural language by finding their diagnoses among the 52,000 codes of ICD10 (International Classification of Diseases, version 10)

  6. ICD10 Classification • ICD10 is a hierarchical classification of diseases. It contains 52,000 codes. Example:

  7. 2. State of the art: classification of textual documents

  8. Overview • Timeline of approaches from 1970 to today (1970, 1980, 1990, 2000, today): statistics, keywords, rules and expert systems, machine learning, natural language processing, …

  9. Lexical Tables • Salton (1975) introduced the vector model, in which each document is represented by a bag of words organised as a vector. Each dimension represents a term (or a concept) of the corpus and is usually weighted. A contingency table with binary counts (present / not present) is commonly used.
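
As an illustration of the binary lexical table described above, here is a minimal Python sketch. The whitespace tokenisation, the function name `binary_term_table` and the three sample sentences are assumptions made for the example; they are not taken from the thesis.

```python
def binary_term_table(documents):
    """Build a binary document-term table (1 = term present, 0 = absent)."""
    # Naive tokenisation on whitespace (an assumption for this sketch).
    tokenized = [set(doc.lower().split()) for doc in documents]
    # One dimension per distinct term of the corpus.
    vocabulary = sorted(set().union(*tokenized))
    table = [
        [1 if term in tokens else 0 for term in vocabulary]
        for tokens in tokenized
    ]
    return vocabulary, table


# Hypothetical mini-corpus.
docs = ["fever and cough", "cough without fever", "acute renal failure"]
vocabulary, table = binary_term_table(docs)
for row in table:
    print(row)
```

Each row is one document vector; replacing the 0/1 entries with weights such as TF/IDF gives the weighted variant mentioned on the slide.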

  10. Main Algorithms • Naïve Bayes, • Decision trees, • TF/IDF, • SVM (Support Vector Machine) (Vapnik 95).

  11. Other algorithms • Many other algorithms have been used for the classification of textual documents: • Neural networks, LLSF (Linear Least Squares Fit), KNN, MMC, etc.

  12. Measures

  13. Distances • Given two vectors, each representing one document, we can compute several distances. The most widely used is the cosine function:
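
The formula shown on the slide is not part of this transcript. The standard definition is cos(u, v) = (u · v) / (‖u‖ ‖v‖); the sketch below implements that definition in plain Python. The function name and the two sample vectors are assumptions for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two document vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)


# Hypothetical binary vectors over a 4-term vocabulary.
doc_a = [1, 1, 0, 1]
doc_b = [1, 0, 0, 1]
similarity = cosine_similarity(doc_a, doc_b)
print(round(similarity, 4))
```

The value is a similarity (1 for vectors pointing in the same direction); a distance can be derived as 1 minus the similarity.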

  14. Other measures • Many other measures exist (Kullback-Leibler, Jaccard, etc.) • SMART distance (computes dissimilarities) • Mutual information • Dice coefficient • Information gain • Salton measure
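
The formulas for these measures are missing from the transcript. As a hedged illustration, the sketch below gives the usual set-based definitions of two of them, the Dice coefficient and the Jaccard index, applied to the sets of terms of two documents. Whether the thesis uses exactly these variants is an assumption.

```python
def dice_coefficient(terms_a, terms_b):
    """Dice: 2 |A ∩ B| / (|A| + |B|)."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard_index(terms_a, terms_b):
    """Jaccard: |A ∩ B| / |A ∪ B|."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical term sets of two reports.
report_1 = {"fever", "cough", "sepsis"}
report_2 = {"fever", "cough"}
print(dice_coefficient(report_1, report_2))
print(jaccard_index(report_1, report_2))
```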

  15. Which corpus and which measures for evaluation?

  16. Corpus • Comparisons of algorithms are usually based on the Reuters database, a collection of documents that appeared on the Reuters newswire in 1987. Documents are assembled and indexed with categories. • In the medical domain, the usual reference is OHSUMED, a dataset of classified medical papers.

  17. Evaluation of algorithms • An algorithm can be evaluated from a contingency table, from which precision, recall and F-measure are computed: precision = a/(a+b), recall = a/(a+c), F-measure = ((1+β²) × precision × recall) / (β² × precision + recall), with β² = 1 • Here a is the number of correct codes proposed, b the number of incorrect codes proposed, and c the number of correct codes missed.

  18. Example of F-measure • Suppose the correct diagnoses to find are: a, b, c, d • An algorithm proposes a, b, e • Precision is p = 2/3 = 0.67 • Recall is r = 2/4 = 0.5 • F-measure is 2 × p × r / (p + r) = 0.57
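
The worked example above can be checked with a short Python sketch. The function name `precision_recall_f` is an assumption; the formulas are the ones given on the evaluation slide, with β = 1 by default.

```python
def precision_recall_f(expected, proposed, beta=1.0):
    """Precision, recall and F-measure for one set of proposed codes."""
    expected, proposed = set(expected), set(proposed)
    a = len(expected & proposed)  # correct codes proposed
    b = len(proposed - expected)  # incorrect codes proposed
    c = len(expected - proposed)  # correct codes missed
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c) if a + c else 0.0
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return precision, recall, 0.0
    f_measure = (1 + beta ** 2) * precision * recall / denominator
    return precision, recall, f_measure


# The slide's example: a, b, c, d expected; a, b, e proposed.
p, r, f = precision_recall_f({"a", "b", "c", "d"}, {"a", "b", "e"})
print(round(p, 2), round(r, 2), round(f, 2))  # 0.67 0.5 0.57
```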

  19. Micro and macro average

  20. Comparison of algorithms

  21. [Dumais et al. 1998] proposed a comparison between ‘Find similar’ (similar to Rocchio), decision trees, Bayesian networks, Naive Bayes, and SVM:

  22. Comparison of algorithms • [Yang and Liu 1999] compared different methods on the same dataset: SVM, KNN, neural networks, Naive Bayes and LLSF.

  23. Comparison of methods? • Evaluation indicators: BEP (break-even point), macro F-measure, micro F-measure, …

  24. REUTERS

  25. Results for OHSUMED

  26. 3. What we did

  27. Analysis of the problem: general information

  28. Presentation • We built a dataset of 33,000 medical reports (30,000 for learning and 3,000 for testing) from many different hospitals all over France. • We built a database of concepts: 543,418 French words with their lemmas, 100,882 medical concepts, 957 medical acronyms, 224 stop words, and 1,445 medical prefixes and suffixes.

  29. Number of diagnoses per report • A medical report contains an average of 4.34 diagnoses per patient. The variation is large (from 1 to 32 diagnoses per patient), with a strong concentration between 1 and 6.

  30. Specificity of the problem • The distribution of diagnostic codes is strongly concentrated on only a few diagnoses. So an algorithm that proposed only a fixed list of the most frequent diagnoses could obtain good results, yet would be useless and ineffective in practice. • 10% of diagnoses can be found in 80% of medical reports

  31. The EDA Desuffixer algorithm

  32. Context • EDA results from 2 observations: • There are many different spellings, which make different (for the computer) what is actually identical, • Medical language has a very strong semantic structure. • We wanted to optimize our algorithms by exploiting these two observations. • EDA runs in two successive and independent phases

  33. EDA: step 1 • convert each word to lowercase, • split ligatures (“cœur” becomes “coeur”), • remove accents, • remove double letters, • replace some letters by their phonetic equivalent.
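
The normalisation steps listed above could be sketched in Python as follows. This is only an illustration under stated assumptions: the slide gives no list of ligatures or phonetic replacements, so `LIGATURES` and `PHONETIC_RULES` are hypothetical, and "remove double letters" is read here as collapsing repeated letters into one.

```python
import re
import unicodedata

# Hypothetical tables: the actual EDA rules are not given in the slides.
LIGATURES = {"œ": "oe", "æ": "ae"}
PHONETIC_RULES = [("ph", "f"), ("y", "i")]


def eda_step1(word):
    """Normalise one word: lowercase, ligatures, accents, doubles, phonetics."""
    word = word.lower()
    for ligature, replacement in LIGATURES.items():
        word = word.replace(ligature, replacement)
    # Remove accents: decompose characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    word = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse any run of the same letter into a single letter.
    word = re.sub(r"([a-z])\1+", r"\1", word)
    for old, new in PHONETIC_RULES:
        word = word.replace(old, new)
    return word


for sample in ["Cœur", "Hémorragie", "Phlébite"]:
    print(sample, "->", eda_step1(sample))
```

The ligatures are handled before accent removal because NFD decomposition leaves “œ” untouched.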

  34. EDA: step 1 bis • Finally, we apply 37 sequential rules, unless the remaining concept has fewer than 5 characters.

  35. Example: EDA step 1 bis

  36. EDA: step 2 • We observed a strong semantic structure in medical language, so prefixes, suffixes and other affixes are often meaningful. • We chose to add concepts derived from this structure. For example, if a word begins with « cardio », we add the concept « heart » to the report.
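
A minimal Python sketch of this enrichment step is given below. Only the « cardio » to « heart » mapping comes from the slide; the other entries of `PREFIX_CONCEPTS` and the function name are hypothetical, and suffixes are left out for brevity.

```python
# Only "cardio" -> "heart" is taken from the slide; the rest is hypothetical.
PREFIX_CONCEPTS = {
    "cardio": "heart",
    "hepat": "liver",
    "nephro": "kidney",
}


def add_affix_concepts(words):
    """Return the words of a report plus the concepts implied by prefixes."""
    concepts = list(words)
    for word in words:
        for prefix, concept in PREFIX_CONCEPTS.items():
            if word.startswith(prefix) and concept not in concepts:
                concepts.append(concept)
    return concepts


enriched = add_affix_concepts(["cardiomyopathie", "severe"])
print(enriched)  # ['cardiomyopathie', 'severe', 'heart']
```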

  37. EDA step 2: examples

  38. Results

  39. CLO3 Algorithm

  40. We developed CLO3: an algorithm for multilabel classification with 3 dimensions in a fuzzy environment

  41. Text mining approach for CIREA • Concepts: Concept 1, Concept 2, …, Concept i • Documents: Document 1, Document 2, …, Document j • Classes: Class 1, Class 2, …, Class k • One document = several classes and several concepts. Can we find a direct link between classes and concepts?

  42. A fuzzy environment

  43. Origin of CLO3 • CLO3 has its origins in both TF/IDF and Naïve Bayes • It relies on the assumption that a relationship can be found between concepts and diagnoses.

  44. Computing the Brut weight • For each term we compute the 'Brut weight' (raw weight), defined as follows: • Brut weight = variance of the frequency of the concept / average frequency of the concept • It is a dispersion measure, akin to the coefficient of variation, that quantifies how concentrated each term is across classes.
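
As a hedged illustration of this ratio, the sketch below computes variance / mean over the frequencies of one concept in each class. The slide does not say whether the population or the sample variance is used, nor how the frequencies are measured; the population variance and the sample figures are assumptions.

```python
from statistics import mean, pvariance


def brut_weight(frequencies_by_class):
    """Variance / mean of a concept's frequency across classes."""
    average = mean(frequencies_by_class)
    if average == 0:
        return 0.0  # the concept never occurs
    # Population variance is an assumption of this sketch.
    return pvariance(frequencies_by_class) / average


# Hypothetical frequencies of two concepts in four diagnostic classes.
concentrated = [0.9, 0.0, 0.0, 0.1]    # mostly seen with one diagnosis
spread_out = [0.25, 0.25, 0.25, 0.25]  # equally frequent everywhere
print(brut_weight(concentrated))
print(brut_weight(spread_out))
```

A concept concentrated in one class gets a high weight, while a concept spread evenly over all classes gets a weight of zero.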

  45. Net weight • We had to find a way to suppress rare words (otherwise their weight would be too high). The second step consists in computing the net weight: • Net weight = Brut weight × frequency(couple) × count(couple) • We multiply by the frequency and the count to give a high weight to frequently encountered occurrences. Doing so, links to merely associated (co-occurring) diagnoses are automatically eliminated.

  46. Suppose a patient suffers from diabetes and a heart attack. In the medical report, we will find the words ‘diabetes’ and ‘heart attack’, so we obtain 4 relations, one for each concept-diagnosis pair: (diabetes, diabetes), (diabetes, heart attack), (heart attack, diabetes), (heart attack, heart attack). But over all medical reports, the frequency of the couples ‘concept heart attack – diagnosis diabetes’ and ‘concept diabetes – diagnosis heart attack’ will be low. So, when we multiply by that frequency, we suppress the undesired links. Multiplying by the count reinforces the same effect.
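
The effect described above can be illustrated numerically. In the sketch below, the Brut weight, the frequencies and the counts are invented figures, and the function name `net_weight` is an assumption; only the product formula comes from the slides.

```python
def net_weight(brut, couple_frequency, couple_count):
    """Net weight = Brut weight * frequency(couple) * count(couple)."""
    return brut * couple_frequency * couple_count


# Hypothetical figures for the concept 'diabetes' (Brut weight 0.57).
# Couple (concept diabetes, diagnosis diabetes): frequent and numerous.
wanted_link = net_weight(0.57, couple_frequency=0.8, couple_count=400)
# Couple (concept diabetes, diagnosis heart attack): rare co-occurrence.
unwanted_link = net_weight(0.57, couple_frequency=0.02, couple_count=10)

print(wanted_link, unwanted_link)
```

With these figures the wanted link outweighs the unwanted one by more than three orders of magnitude, which is the suppression effect the slide describes.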

  47. Third step: A weight • The third step consists in standardising the computed net weights. The result is called the ‘A weight’. To do this, we divide each weight by the average weight of the diagnosis. Our objective is to accentuate the result by exponentiation: values under 1 (below the average) become lower, and values over 1 (above the average) become higher.
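
A small sketch of this standardisation, under the assumption that the average is taken over the net weights of all concepts linked to one diagnosis. The concept names and net weights are invented, and squaring is used only to show the accentuation effect.

```python
from statistics import mean


def a_weights(net_weights_by_concept):
    """Divide each net weight by the average net weight of the diagnosis."""
    average = mean(net_weights_by_concept.values())
    if average == 0:
        return {concept: 0.0 for concept in net_weights_by_concept}
    return {
        concept: weight / average
        for concept, weight in net_weights_by_concept.items()
    }


# Hypothetical net weights of three concepts for a single diagnosis.
net = {"diabete": 6.0, "insuline": 4.5, "fatigue": 1.5}
standardised = a_weights(net)
accentuated = {concept: w ** 2 for concept, w in standardised.items()}
print(standardised)
print(accentuated)
```

After exponentiation, the value above the average (1.5) grows to 2.25, while the value below the average (0.375) shrinks to about 0.14.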

  48. Computing the B weight • The next step consists in computing a new weight, inspired by Naive Bayes and the laws of probability. For each couple ‘concept - diagnosis’, we compute: • BWeight = count of couple / total count of concept

  49. CLO3 WEIGHT • From the A and B weights, we compute the final weight: • CLO3 = AWeight² × BWeight⁵
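
The combination can be sketched as below, reading the trailing 2 and 5 in the transcript's formula as exponents, which is an interpretation consistent with the "accentuation by exponentiation" mentioned for the A weight. The function names and sample figures are assumptions.

```python
def b_weight(couple_count, concept_total_count):
    """BWeight = count of the couple / total count of the concept."""
    if concept_total_count == 0:
        return 0.0
    return couple_count / concept_total_count


def clo3_weight(a_weight, b_weight_value):
    """CLO3 = AWeight^2 * BWeight^5 (exponents read from the slide)."""
    return a_weight ** 2 * b_weight_value ** 5


# Hypothetical couple: the concept occurs 50 times, 40 of them with the diagnosis.
b = b_weight(couple_count=40, concept_total_count=50)
weight = clo3_weight(a_weight=1.5, b_weight_value=b)
print(b, weight)
```

Because the B weight lies between 0 and 1, raising it to the fifth power sharply penalises concepts that are only loosely tied to a diagnosis.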

  50. Using the CLO3 weight • CLO3 is the weight that quantifies the relation between a concept and a class (a diagnosis). To classify a report, we sum the weights of its concepts, grouped by diagnosis, and rank the diagnoses by this sum. The best-ranked diagnoses are proposed. We filter the results and keep only those above 0.0005
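
The classification step could look like the following sketch. The weight table is invented, the ICD10 codes E11 and I21 are used purely as sample labels, and applying the 0.0005 threshold to the summed score (rather than to each individual weight) is an assumption, since the slide does not say which is meant.

```python
from collections import defaultdict

THRESHOLD = 0.0005  # value quoted on the slide


def classify(report_concepts, clo3_weights, threshold=THRESHOLD):
    """Rank diagnoses by the sum of CLO3 weights of the report's concepts."""
    scores = defaultdict(float)
    for concept in report_concepts:
        for diagnosis, weight in clo3_weights.get(concept, {}).items():
            scores[diagnosis] += weight
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Keep only the diagnoses whose summed score exceeds the threshold.
    return [(diagnosis, score) for diagnosis, score in ranked if score > threshold]


# Hypothetical CLO3 weights: concept -> {diagnosis code: weight}.
weights = {
    "diabete": {"E11": 0.6, "I21": 0.0001},
    "insuline": {"E11": 0.3},
    "infarctus": {"I21": 0.7, "E11": 0.0002},
}
proposals = classify(["diabete", "insuline"], weights)
print(proposals)
```

In this example only E11 is proposed: the weak link to I21 sums to 0.0001, which falls below the threshold.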
