This PhD thesis explores the automatic classification of medical reports using text mining techniques to assign diagnoses from the ICD10 code classification system. The state of the art in classification methods, algorithms, and evaluation measures is reviewed, along with the analysis of a dataset of 33,000 medical reports. The results show promising performance in accurately classifying medical reports into specific diagnostic codes.
AUTOMATIC CLASSIFICATION OF MEDICAL REPORTS Didier Nakache September 26, 2007 CEDRIC laboratory – ISID team – CNAM of Paris – PhD thesis
Summary • Introduction: presentation of the project • State of the art of text mining for classification • The EDA and CLO3 algorithms • About evaluation • The Rhea project • Conclusion
1. Presentation
General Presentation • Rhea is a decision-support tool for ICUs (intensive care units) with two major axes: Rhea (data warehouse / data mining) and Cirea (text mining) • The main themes of Rhea are nosocomial infections and iatrogenic events • The subproject CIREA represents only a small part of Rhea
The CIREA Subproject • The aim of the CIREA project is to classify medical reports written in natural language by assigning diagnoses from the 52,000 codes of ICD10 (International Classification of Diseases, 10th revision)
ICD10 Classification • The ICD10 classification is a hierarchical classification of diseases containing 52,000 codes. Example:
2. State of the art: classification of textual documents
Overview • [Timeline of approaches, 1970 → today: statistics, keywords, rules and expert systems, machine learning, natural language processing, …]
Lexical Tables • Salton (1975) introduced the vector space model, where each document is represented as a bag of words organised in a vector. Each dimension represents a term (or a concept) of the corpus and is typically weighted. A common choice is a contingency table with binary counts (present / not present).
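As an illustration, here is a minimal Python sketch of such a binary term–document table; the toy documents and the whitespace tokenisation are purely illustrative, not the preprocessing actually used in the thesis:

```python
# Minimal sketch of a binary term-document table (bag of words).
# The toy documents and the whitespace tokenisation are illustrative only.
documents = [
    "acute myocardial infarction treated in intensive care",
    "patient with diabetes and renal failure",
]

# One dimension per distinct term of the corpus.
vocabulary = sorted({word for doc in documents for word in doc.lower().split()})

# Binary weighting: 1 if the term is present in the document, 0 otherwise.
matrix = [
    [1 if term in doc.lower().split() else 0 for term in vocabulary]
    for doc in documents
]

for doc, row in zip(documents, matrix):
    print(row, "<-", doc)
```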
Main Algorithms • Naïve Bayes, • Decision trees, • TF/IDF, • SVM (Support Vector Machine) (Vapnik, 1995).
Other algorithms • Many other algorithms have been used for the classification of textual documents: neural networks, LLSF (Linear Least Squares Fit), KNN, MMC, etc.
Measures
Distances • From two vectors, each representing one document, we can compute several distances. The most widely used is the cosine similarity: cos(d1, d2) = (d1 · d2) / (‖d1‖ × ‖d2‖)
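A minimal sketch of this cosine computation in pure Python (the two toy vectors are illustrative):

```python
import math

def cosine(u, v):
    # cos(u, v) = (u . v) / (||u|| * ||v||)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Two toy document vectors over the same vocabulary (binary weights).
d1 = [1, 0, 1, 1, 0]
d2 = [1, 1, 0, 1, 0]
print(round(cosine(d1, d2), 3))  # 0.667
```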
Other measures • Many other measures exist (Kullback-Leibler, Jaccard, etc.) • SMART distance (computes dissimilarities) • Mutual information • Dice coefficient • Information gain • Salton measure
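As one concrete instance, the Dice coefficient between two term sets can be sketched as follows (standard definition 2·|A∩B| / (|A| + |B|); the example sets are illustrative, not taken from the thesis):

```python
def dice(a, b):
    # Dice coefficient: 2 * |A ∩ B| / (|A| + |B|)
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

print(dice({"heart", "attack", "acute"}, {"heart", "failure"}))  # 0.4
```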
Which corpus and which measures for evaluation?
Corpus • Comparisons of algorithms are usually based on the Reuters collection, a set of documents that appeared on the Reuters newswire in 1987, assembled and indexed with categories. • In the medical domain, the usual reference is OHSUMED, a dataset of classified medical papers.
Evaluation of algorithms • An algorithm can be evaluated from a contingency table, which allows us to compute precision, recall, and F-measure: precision = a/(a+b), recall = a/(a+c), F-measure = ((1+β²) × precision × recall) / ((β² × precision) + recall), with β² = 1
Example of F-measure • Suppose the correct diagnoses to find are a, b, c, d • An algorithm proposes a, b, e • Precision is p = 2/3 = 0.67 • Recall is r = 2/4 = 0.5 • F-measure is 2pr/(p+r) = 0.57
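The same worked example as a small Python sketch (the helper name is illustrative):

```python
def precision_recall_f(proposed, correct):
    proposed, correct = set(proposed), set(correct)
    a = len(proposed & correct)                  # correct codes actually found
    p = a / len(proposed) if proposed else 0.0   # precision = a / (a + b)
    r = a / len(correct) if correct else 0.0     # recall    = a / (a + c)
    f = 2 * p * r / (p + r) if (p + r) else 0.0  # F-measure with beta^2 = 1
    return p, r, f

p, r, f = precision_recall_f({"a", "b", "e"}, {"a", "b", "c", "d"})
print(round(p, 2), round(r, 2), round(f, 2))  # 0.67 0.5 0.57
```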
Micro and macro average
Comparison of algorithms
[Dumais et al. 1998] compared 'Find Similar' (similar to Rocchio), decision trees, Bayesian networks, Naive Bayes, and SVM:
Comparison of algorithms • [Yang and Liu 1999] compared different methods on the same dataset: SVM, KNN, neural networks, Naive Bayes, and LLSF.
Comparison of methods? • Evaluation indicators: BEP (break-even point), macro F-measure, micro F-measure, …
REUTERS
Results for OHSUMED
3. What we did
Analysis of the problem: general information
Presentation • We built a dataset of 33,000 medical reports (30,000 for learning and 3,000 for testing) from many different hospitals all over France. • We built a database of concepts: 543,418 French words with their lemmas, 100,882 medical concepts, 957 medical acronyms, 224 stop words, and 1,445 medical prefixes and suffixes.
Number of diagnoses per report • A medical report contains an average of 4.34 diagnoses per patient. Variation is high (from 1 to 32 diagnoses per patient), with a strong concentration between 1 and 6:
Specificity of the problem • The distribution of diagnostic codes shows a strong concentration on only a few diagnoses. So an algorithm that proposed only a fixed list of the most frequent diagnoses could obtain good scores, yet would be unusable and inefficient in practice. 10% of the diagnoses cover 80% of the medical reports.
The EDA Desuffixer algorithm
Context • EDA results from two observations: there are many different orthographic forms, which make things that are identical look different to the computer, and medical language has a very strong semantic structure. • We wanted to optimise our algorithms by exploiting these two observations. • EDA runs in two successive and independent phases.
EDA: step 1 • convert each word to lowercase, • split some characters (“cœur” becomes “coeur”), • remove accents, • remove double letters, • replace some letters by their phonetic equivalents.
EDA: step 1 bis • Finally, we apply 37 sequential rules, unless the remaining concept has fewer than 5 characters.
Example: EDA step 1 bis
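A minimal sketch of the step-1 normalisation; the slides do not list the phonetic substitutions or the 37 sequential rules, so the substitution table below is an illustrative placeholder only:

```python
import re
import unicodedata

# Illustrative phonetic substitutions (placeholders, not the thesis's actual rules).
PHONETIC = [("ph", "f"), ("qu", "k"), ("y", "i")]

def eda_step1(word):
    w = word.lower()                              # lowercase
    w = w.replace("œ", "oe").replace("æ", "ae")   # split ligatures ("cœur" -> "coeur")
    w = unicodedata.normalize("NFD", w)           # remove accents
    w = "".join(c for c in w if unicodedata.category(c) != "Mn")
    w = re.sub(r"(.)\1+", r"\1", w)               # remove double letters
    for src, dst in PHONETIC:                     # phonetic equivalents
        w = w.replace(src, dst)
    return w

print(eda_step1("Cœur"))        # coeur
print(eda_step1("Hémorragie"))  # hemoragie
```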
EDA: step 2 • We observed a strong semantic structure in medical language, so prefixes, suffixes, and affixes are often meaningful. • We chose to add concepts derived from this structure. For example, if a word begins with “cardio”, we add the concept “heart” to the report.
EDA step 2: examples
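A minimal sketch of step 2; the prefix-to-concept table below is a tiny illustrative subset, not the thesis's actual resource (which contains 1,445 medical prefixes and suffixes):

```python
# Illustrative subset of a prefix -> concept table.
PREFIX_CONCEPTS = {
    "cardio": "heart",
    "hepat": "liver",
    "nephro": "kidney",
    "pneumo": "lung",
}

def eda_step2(words):
    """Return the extra concepts to add to the report for the given words."""
    added = set()
    for word in words:
        for prefix, concept in PREFIX_CONCEPTS.items():
            if word.lower().startswith(prefix):
                added.add(concept)
    return added

print(sorted(eda_step2(["cardiomyopathie", "nephropathie"])))  # ['heart', 'kidney']
```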
Results
CLO3 Algorithm
We developed CLO3: an algorithm for multilabel classification with 3 dimensions in a fuzzy environment
Text mining approach for CIREA • [Diagram: documents 1…j described by concepts 1…i and labelled with classes 1…k] • One document = several classes and several concepts. Can we find a direct link between classes and concepts?
A fuzzy environment
Origin of CLO3 • CLO3's origins can be found in both TF/IDF and Naïve Bayes. • It is based on the assumption that a relationship can be found between concepts and diagnoses.
Computing the brut weight • For each term we compute the 'brut weight' (raw weight), defined as follows: • brut weight = variance of the concept's frequency / average frequency of the concept • This is the coefficient of variation, which measures the concentration of each term across classes.
Net weight • We had to find a way to suppress rare words (otherwise they would carry too much weight). The second step consists in computing the net weight: • net weight = brut weight × frequency(couple) × count(couple) • We multiply by frequency and count to give a large weight to frequently encountered occurrences. In doing so, spuriously associated diagnoses are automatically removed.
Suppose a patient suffers from diabetes and a heart attack. In the medical report, we will find the words 'diabetes' and 'heart attack', so we obtain the 4 following relations. But across all medical reports, the frequency of the couples 'concept heart attack – diagnosis diabetes' and 'concept diabetes – diagnosis heart attack' will be low. So, when we multiply by that frequency, we suppress the undesired links. Multiplying by the count reinforces the same effect.
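A minimal sketch of the brut and net weights as defined above; the frequency values and counts are hypothetical:

```python
from statistics import mean, pvariance

def brut_weight(freqs_per_class):
    # Brut weight = variance of the concept's frequency across classes
    #               / average frequency of the concept.
    avg = mean(freqs_per_class)
    return pvariance(freqs_per_class) / avg if avg else 0.0

def net_weight(brut, couple_frequency, couple_count):
    # Net weight = brut weight * frequency(couple) * count(couple):
    # rare (concept, diagnosis) couples are pushed towards zero.
    return brut * couple_frequency * couple_count

# Hypothetical frequencies of one concept in each diagnosis class.
freqs = [0.50, 0.02, 0.01, 0.03]
bw = brut_weight(freqs)
print(round(bw, 3))                                                      # 0.309
print(round(net_weight(bw, couple_frequency=0.4, couple_count=120), 3))  # 14.829
```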
Third step: A weight • The third step consists in standardising the computed net weights; the result is called the 'A weight'. To do this, we divide each weight by the average weight of the diagnosis. Our objective is to accentuate the result by exponentiation: values under 1 (below the average) become lower, and values over 1 (above the average) become higher.
Computing the B weight • The next step consists in computing a new weight, inspired by Naïve Bayes and probability laws. For each couple 'concept – diagnosis', we compute: • B weight = count of the couple / total count of the concept
CLO3 weight • From the A and B weights, we compute the final weight: • CLO3 = (A weight)² × (B weight)⁵
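A minimal sketch of the A, B, and CLO3 weights chained together; the input values for one (concept, diagnosis) couple are hypothetical:

```python
def a_weight(net, avg_net_for_diagnosis):
    # A weight: net weight standardised by the average net weight of the diagnosis.
    # Values under 1 sit below the average, values over 1 above it; the later
    # exponentiation accentuates this contrast.
    return net / avg_net_for_diagnosis if avg_net_for_diagnosis else 0.0

def b_weight(couple_count, concept_total_count):
    # B weight = count of the (concept, diagnosis) couple / total count of the concept.
    return couple_count / concept_total_count if concept_total_count else 0.0

def clo3_weight(a, b):
    # CLO3 = (A weight)^2 * (B weight)^5
    return a ** 2 * b ** 5

# Hypothetical values for one (concept, diagnosis) couple.
a = a_weight(net=14.8, avg_net_for_diagnosis=9.2)
b = b_weight(couple_count=120, concept_total_count=300)
print(round(clo3_weight(a, b), 4))  # 0.0265
```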
Using the CLO3 weight • CLO3 is the weight that quantifies the relation between a concept and a class (a diagnosis). To classify a report, we sum the weights for each concept, grouped by diagnosis, and rank the diagnoses; the best results are proposed. We filter the results and keep only those above 0.0005.
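A minimal sketch of this final scoring step; the weight table, concept names, ICD10 codes, and threshold handling are illustrative:

```python
from collections import defaultdict

# Hypothetical CLO3 weights indexed by (concept, diagnosis code).
CLO3 = {
    ("infarctus", "I21"): 0.031,
    ("coronaire", "I21"): 0.012,
    ("diabete",   "E11"): 0.027,
    ("infarctus", "E11"): 0.0001,
}

def classify(report_concepts, threshold=0.0005):
    # Sum the weights of the report's concepts, grouped by diagnosis,
    # rank the diagnoses, and keep only scores above the threshold.
    scores = defaultdict(float)
    for concept in report_concepts:
        for (c, diagnosis), weight in CLO3.items():
            if c == concept:
                scores[diagnosis] += weight
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(diag, round(score, 4)) for diag, score in ranked if score > threshold]

print(classify(["infarctus", "coronaire", "diabete"]))
# [('I21', 0.043), ('E11', 0.0271)]
```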