This PhD thesis explores the automatic classification of medical reports using text mining techniques to assign diagnoses from the ICD10 code classification system. The state-of-the-art on classification methods, algorithms, and evaluation measures is discussed, along with the analysis of a dataset of 33,000 medical reports. The results show promising performance in accurately classifying medical reports into specific diagnostic codes.
AUTOMATIC CLASSIFICATION OF MEDICAL REPORTS Didier Nakache September 26, 2007 CEDRIC laboratory – ISID team – CNAM of Paris
Summary • Introduction: presentation of the project • State of the art on text mining for classification • The EDA and CLO3 algorithms • About evaluation • The Rhea project • Conclusion
1. Presentation
General Presentation • Rhea is a decision-support tool for ICUs (intensive care units) with two major axes: • Rhea: data warehouse / data mining • Cirea: text mining • The main themes of Rhea are nosocomial infections and iatrogenic events • The CIREA subproject represents only a small part of Rhea
The CIREA Subproject • The aim of the CIREA project is to classify medical reports written in natural language by assigning diagnoses among the 52,000 codes of ICD10 (International Classification of Diseases, 10th revision)
ICD10 Classification • ICD10 is a hierarchical classification of diseases. It contains 52,000 codes.
2. State of the art: classification of textual documents
Overview • Timeline of approaches from 1970 to today: statistics, key-words, rules and expert systems, machine learning, natural language processing, …
Lexical Tables • Salton (1975) introduced the vector space model, in which each document is represented as a bag of words organised into a vector. Each dimension represents a term (or a concept) of the corpus and is usually weighted. A common choice is a contingency table with binary counts (present / not present).
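A minimal sketch of such a binary term-document table (Python; the documents and vocabulary are invented for illustration):

```python
# Minimal sketch of a binary bag-of-words table (illustrative data only).
documents = [
    "patient admitted with acute cardiac failure",
    "cardiac arrest followed by renal failure",
]

# Build the vocabulary: one dimension per distinct term of the corpus.
vocabulary = sorted({word for doc in documents for word in doc.split()})

# Binary contingency table: 1 if the term is present in the document, else 0.
table = [[1 if term in doc.split() else 0 for term in vocabulary]
         for doc in documents]

for doc, row in zip(documents, table):
    print(doc)
    print(dict(zip(vocabulary, row)))
```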
Main Algorithms • Naïve Bayes, • Decision trees, • TF/IDF, • SVM (Support Vector Machine) (Vapnik 95).
Other algorithms • Many other algorithms have been used for the classification of textual documents: • neural networks, LLSF (Linear Least Squares Fit), KNN, MMC, etc.
Measures
Distances • From two vectors, each representing one document, we can compute several distances. The most widely used is the cosine measure: cos(d1, d2) = (d1 · d2) / (‖d1‖ ‖d2‖)
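A minimal sketch of the cosine measure on two term-weight vectors (plain Python; the example weights are invented):

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-weight vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Two documents represented in the same 4-term space (illustrative weights).
d1 = [1, 0, 1, 1]
d2 = [1, 1, 0, 1]
print(cosine(d1, d2))  # ~0.667
```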
Other measures • Many other measures exist (Kullback-Leibler, Jaccard, etc.) • SMART distance (computes dissimilarities) • Mutual information • Dice coefficient • Information gain • Salton measure
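As an illustration, here is the common set-based definition of the Dice coefficient; this is an assumption, not necessarily the exact variant used in the thesis:

```python
def dice(set_a, set_b):
    """Dice coefficient between two sets of terms: 2|A∩B| / (|A| + |B|)."""
    if not set_a and not set_b:
        return 0.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))

# Illustrative example: two documents seen as sets of terms.
print(dice({"cardiac", "failure", "acute"}, {"cardiac", "arrest"}))  # 0.4
```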
Which corpus and which measures for evaluation?
Corpus • Comparisons of algorithms are usually based on the Reuters collection, a set of documents that appeared on the Reuters newswire in 1987, assembled and indexed with categories. • In the medical domain, the usual reference is OHSUMED, a dataset of medical papers with class labels.
Evaluation of algorithms • An algorithm can be evaluated from a contingency table, from which we compute precision, recall, and F-measure: • precision = a / (a + b), recall = a / (a + c) • F-measure = ((1 + β²) · precision · recall) / (β² · precision + recall), with β² = 1
Example of F-measure • Suppose the correct diagnoses to find are a, b, c, d • An algorithm proposes a, b, e • Precision is p = 2/3 ≈ 0.67 • Recall is r = 2/4 = 0.5 • F-measure is 2·p·r / (p + r) ≈ 0.57
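The same computation as a small Python sketch, reusing the slide's example:

```python
def precision_recall_f1(predicted, correct):
    """Precision, recall and F-measure (beta^2 = 1) from two sets of codes."""
    true_positives = len(predicted & correct)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# The slide's example: correct diagnoses {a, b, c, d}, the algorithm proposes {a, b, e}.
p, r, f = precision_recall_f1({"a", "b", "e"}, {"a", "b", "c", "d"})
print(p, r, f)  # 0.667, 0.5, 0.571
```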
Micro and macro average
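As a hedged illustration, the sketch below shows the usual way micro- and macro-averaged F-measures are computed over several classes (the per-class counts are invented):

```python
def micro_macro_f1(per_class_counts):
    """per_class_counts: list of (true_positives, false_positives, false_negatives)."""
    def f1(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    # Macro average: compute F1 per class, then average (every class counts equally).
    macro = sum(f1(*c) for c in per_class_counts) / len(per_class_counts)

    # Micro average: pool the counts over all classes, then compute a single F1
    # (frequent classes dominate the result).
    tp = sum(c[0] for c in per_class_counts)
    fp = sum(c[1] for c in per_class_counts)
    fn = sum(c[2] for c in per_class_counts)
    micro = f1(tp, fp, fn)
    return micro, macro

# Illustrative counts for three diagnostic codes.
print(micro_macro_f1([(90, 10, 10), (5, 5, 5), (1, 4, 9)]))
```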
Comparison of algorithms
[Dumais et al. 1998] proposed a comparison between 'Find Similar' (similar to Rocchio), decision trees, Bayesian networks, Naive Bayes, and SVM.
Comparison of algorithms • [Yang and Liu 1999] compared different methods on the same dataset: SVM, KNN, neural networks, Naive Bayes, and LLSF.
Comparison of methods? • Evaluation indicators: break-even point (BEP), macro F-measure, micro F-measure, …
REUTERS
Results for OHSUMED
3. What we did
Analysis of the problem: general information
Presentation • We built a dataset of 33,000 medical reports (30,000 for training and 3,000 for testing) from many different hospitals all over France. • We built a database of concepts: 543,418 French words with their lemmas, 100,882 medical concepts, 957 medical acronyms, 224 stop words, and 1,445 medical prefixes and suffixes.
Number of diagnoses per report • A medical report contains on average 4.34 diagnoses per patient. Variation is very high (from 1 to 32 diagnoses per patient), with a strong concentration between 1 and 6.
Specificity of the problem • The distribution of diagnostic codes shows a strong concentration on only a few diagnoses, so an algorithm that proposed only a fixed list of the most frequent diagnoses could score well while being unusable in practice. 10% of the diagnoses cover 80% of the medical reports.
The EDA Desuffixer algorithm
Context • EDA results from two observations: • there are many different orthographic forms, which make things that are identical look different to the computer; • medical language has a very strong semantic structure. • We wanted to optimise our algorithms by exploiting these two observations. • EDA runs in two successive and independent phases.
EDA: step 1 • transform each word to lowercase, • split ligatures ("cœur" becomes "coeur"), • remove accents, • remove double letters, • replace some letters by their phonetic equivalent.
EDA: step 1 bis • finally, we apply 37 sequential rules, except when the remaining concept has fewer than 5 characters.
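A rough sketch of steps 1 and 1 bis, assuming a tiny hypothetical rule table; the thesis' 37 rules and its exact phonetic substitutions are not reproduced here:

```python
import unicodedata

# Hypothetical suffix rules standing in for the 37 sequential rules of step 1 bis.
SUFFIX_RULES = [("ologie", "o"), ("ique", "")]

def eda_step1(word):
    """Character-level normalisation: lowercase, split ligatures, drop accents,
    remove doubled letters, apply a few phonetic substitutions."""
    word = word.lower()
    # NFKD decomposition splits ligatures ("cœur" -> "coeur") and isolates accents,
    # which are then dropped.
    word = "".join(c for c in unicodedata.normalize("NFKD", word)
                   if not unicodedata.combining(c))
    # Remove doubled letters.
    deduped = []
    for c in word:
        if not deduped or deduped[-1] != c:
            deduped.append(c)
    word = "".join(deduped)
    # A couple of phonetic equivalences (illustrative, not the thesis' full list).
    word = word.replace("ph", "f").replace("y", "i")
    return word

def eda_step1bis(word):
    """Apply the sequential rules, skipping any rule that would leave fewer than 5 characters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if len(candidate) >= 5:
                word = candidate
    return word

print(eda_step1bis(eda_step1("Cardiologie")))  # "cardio"
```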
Example: EDA step 1 bis
EDA: step 2 • We observed a strong semantic structure in medical language, so prefixes, suffixes, and affixes are often meaningful. • We chose to add concepts derived from this structure. For example, if a word begins with "cardio", we add the concept "heart" to the report.
EDA step 2: examples
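A possible sketch of step 2; only the "cardio" to "heart" mapping comes from the slide, the other prefix entries are hypothetical placeholders:

```python
# Prefix-to-concept table: only "cardio" -> "heart" comes from the slide,
# the other entries are illustrative placeholders.
PREFIX_CONCEPTS = {
    "cardio": "heart",
    "nephro": "kidney",
    "hepat": "liver",
}

def eda_step2(words):
    """Return the original words plus the concepts suggested by their prefixes."""
    enriched = list(words)
    for word in words:
        for prefix, concept in PREFIX_CONCEPTS.items():
            if word.startswith(prefix):
                enriched.append(concept)
    return enriched

print(eda_step2(["cardiopathie", "severe"]))  # ['cardiopathie', 'severe', 'heart']
```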
Results
CLO3 Algorithm
We developed CLO3: an algorithm for multilabel classification with 3 dimensions in a fuzzy environment
Text mining approach for CIREA • Documents (1 … j) × concepts (1 … i) × classes (1 … k). One document = several classes and several concepts. Can we find a direct link between classes and concepts?
A fuzzy environment
Origin of CLO3 • CLO3's origins can be found in both TF/IDF and Naïve Bayes. • It is based on the assumption that a relationship can be found between concepts and diagnoses.
Computing the brut weight • For each term we compute the 'brut weight', defined as follows: • brut weight = variance of the concept's frequency / average frequency of the concept • It is a coefficient of variation, which measures how concentrated each term is within classes.
Net weight • We had to find a way to suppress rare words (otherwise they would carry too much weight). The second step computes the net weight: • net weight = brut weight × frequency(couple) × count(couple) • We multiply by the frequency and the count to give a large weight to frequently encountered occurrences; in doing so, spurious concept-diagnosis associations are automatically removed.
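A hedged sketch of the brut and net weights as described above; function names and data are illustrative, not the thesis' implementation:

```python
from statistics import mean, pvariance

def brut_weight(freq_by_class):
    """Coefficient-of-variation style weight: variance of the concept's frequency
    across classes divided by its average frequency."""
    avg = mean(freq_by_class)
    return pvariance(freq_by_class) / avg if avg else 0.0

def net_weight(brut, couple_frequency, couple_count):
    """Net weight = brut weight * frequency(concept, diagnosis) * count(concept, diagnosis).
    Rare concept-diagnosis couples are driven toward zero."""
    return brut * couple_frequency * couple_count

# Illustrative data: frequency of the concept "diabetes" in each diagnostic class.
freqs = [0.02, 0.01, 0.45, 0.03]
b = brut_weight(freqs)
print(net_weight(b, couple_frequency=0.45, couple_count=120))
```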
Suppose a patient suffers from diabetes and a heart attack. In the medical report, we will find the words 'diabetes' and 'heart attack', so we get the 4 following relations between concepts and diagnoses. But over all medical reports, the frequency of the couples 'concept heart attack – diagnosis diabetes' and 'concept diabetes – diagnosis heart attack' will be low. So, when we multiply by that frequency, we suppress the undesired links, and multiplying by the count reinforces the same effect.
Third step: A weight • The third step standardises the computed net weights; the result is called the 'A weight'. To do this, we divide each weight by the average weight of the diagnosis. The objective is to accentuate the result by exponentiation: values under 1 (below the average) become lower, and values over 1 (above the average) become higher.
Computing the B weight • The next step computes a new weight, inspired by Naive Bayes and the laws of probability. For each couple 'concept - diagnosis', we compute: • B weight = count of the couple / total count of the concept
CLO3 weight • From the A and B weights, we compute the final weight: • CLO3 = AWeight² × BWeight⁵
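A hedged sketch of the A, B, and CLO3 weights; the exponents 2 and 5 come from the slide, while the standardisation exponent and all data are assumptions:

```python
def a_weight(net_weight, avg_net_weight_for_diagnosis, exponent=3):
    """Standardise a net weight by the diagnosis' average and accentuate it by
    exponentiation: values below 1 shrink, values above 1 grow.
    (The exact exponent is not given on the slide; 3 is a placeholder.)"""
    return (net_weight / avg_net_weight_for_diagnosis) ** exponent

def b_weight(couple_count, total_concept_count):
    """B weight = count of the (concept, diagnosis) couple / total count of the concept."""
    return couple_count / total_concept_count

def clo3_weight(a, b):
    """Final CLO3 weight = A_weight^2 * B_weight^5 (exponents from the slide)."""
    return (a ** 2) * (b ** 5)

# Illustrative values for one concept-diagnosis pair.
a = a_weight(net_weight=0.9, avg_net_weight_for_diagnosis=0.5)
b = b_weight(couple_count=60, total_concept_count=100)
print(clo3_weight(a, b))
```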
Using the CLO3 weight • CLO3 is the weight that quantifies the relation between a concept and a class (a diagnosis). To classify a report, we sum the weights of its concepts grouped by diagnosis and rank the diagnoses by that sum. The best results are proposed; we filter them and keep only those above 0.0005.
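A possible sketch of how the CLO3 weights could be used to rank diagnoses and apply the 0.0005 threshold; the weight table and ICD10 codes are invented:

```python
from collections import defaultdict

def classify(report_concepts, clo3_weights, threshold=0.0005):
    """Sum CLO3 weights of the report's concepts per diagnosis, rank diagnoses,
    and keep only those whose score exceeds the threshold.

    clo3_weights: dict mapping (concept, diagnosis) -> CLO3 weight."""
    scores = defaultdict(float)
    for concept in report_concepts:
        for (c, diagnosis), weight in clo3_weights.items():
            if c == concept:
                scores[diagnosis] += weight
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(d, s) for d, s in ranked if s > threshold]

# Illustrative weights for a few concepts and two ICD10 codes.
weights = {
    ("cardio", "I21"): 0.004, ("infarct", "I21"): 0.006,
    ("cardio", "E11"): 0.0001, ("glucose", "E11"): 0.003,
}
print(classify(["cardio", "infarct"], weights))  # [('I21', 0.01)]
```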