Incorporating N-gram Statistics in the Normalization of Clinical Notes By Bridget Thomson McInnes
Overview
• Ngrams
• Ngram Statistics for Spelling Correction
• Spelling Correction
• Ngram Statistics for Multi Term Identification
• Multi Term Identification
Ngram
Her dobutamine stress echo showed mild aortic stenosis with a subaortic gradient.

Bigrams:
her dobutamine, dobutamine stress, stress echo, echo showed, showed mild, mild aortic, aortic stenosis, stenosis with, with a, a subaortic, subaortic gradient

Trigrams:
her dobutamine stress, dobutamine stress echo, stress echo showed, echo showed mild, showed mild aortic, mild aortic stenosis, aortic stenosis with, stenosis with a, with a subaortic, a subaortic gradient
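As a rough illustration, here is a minimal Python sketch of this extraction, assuming simple lowercased, punctuation-stripped tokenization (the slides do not specify a tokenizer):

```python
import re

# A minimal sketch of bigram/trigram extraction; the tokenizer
# (lowercase, strip punctuation) is an assumption, not the talk's.
def ngrams(text, n):
    tokens = re.findall(r"[a-z]+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = ("Her dobutamine stress echo showed mild aortic "
            "stenosis with a subaortic gradient.")
print(ngrams(sentence, 2))  # ['her dobutamine', 'dobutamine stress', ...]
print(ngrams(sentence, 3))  # ['her dobutamine stress', ...]
```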
Contingency Tables

            word2     !word2
word1       n11       n12       n1p
!word1      n21       n22       n2p
            np1       np2       npp

• n11 = the joint frequency of word1 and word2
• n12 = the frequency word1 occurs and word2 does not
• n21 = the frequency word2 occurs and word1 does not
• n22 = the frequency neither word1 nor word2 occurs
• npp = the total number of ngrams
• n1p, np1, n2p, np2 are the marginal counts
Contingency Tables

            echo      !echo
stress      1         0         1
!stress     0         10        10
            1         10        11

Bigram counts from the example sentence:
her dobutamine 1, dobutamine stress 1, stress echo 1, echo showed 1, showed mild 1, mild aortic 1, aortic stenosis 1, stenosis with 1, with a 1, a subaortic 1, subaortic gradient 1
Contingency Tables: Expected Values

            word2     !word2
word1       n11       n12       n1p
!word1      n21       n22       n2p
            np1       np2       npp

Expected Values:
• m11 = (np1 * n1p) / npp
• m12 = (np2 * n1p) / npp
• m21 = (np1 * n2p) / npp
• m22 = (np2 * n2p) / npp
Contingency Tables

            echo      !echo
stress      1         0         1
!stress     0         10        10
            1         10        11

Expected Values:
• m11 = ( 1 * 1 ) / 11 = 0.09
• m12 = (10 * 1 ) / 11 = 0.91
• m21 = ( 1 * 10) / 11 = 0.91
• m22 = (10 * 10) / 11 = 9.09

What is this telling you? 'stress echo' occurs once in our example, while its expected frequency, if 'stress' and 'echo' were independent, is only 0.09 (m11).
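A small sketch of this computation, reusing the counts from the 'stress echo' table:

```python
# Sketch of the expected-value computation on the 'stress echo' table
# (n11=1, n12=0, n21=0, n22=10).
n11, n12, n21, n22 = 1, 0, 0, 10
n1p, n2p = n11 + n12, n21 + n22      # row marginals
np1, np2 = n11 + n21, n12 + n22      # column marginals
npp = n1p + n2p                      # total number of bigrams

m11 = (np1 * n1p) / npp   # 0.09
m12 = (np2 * n1p) / npp   # 0.91
m21 = (np1 * n2p) / npp   # 0.91
m22 = (np2 * n2p) / npp   # 9.09
```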
Ngram Statistics
• Measures of Association:
  • Log Likelihood Ratio
  • Chi Squared Test
  • Odds Ratio
  • Phi Coefficient
  • T-Score
  • Dice Coefficient
  • True Mutual Information
Log Likelihood Ratio

            word2     !word2
word1       n11       n12       n1p
!word1      n21       n22       n2p
            np1       np2       npp

Log Likelihood = 2 * ∑ ( nij * log( nij / mij ) )

The log likelihood ratio measures the difference between the observed values and the expected values: each cell contributes its observed count weighted by the log of the ratio of observed to expected.
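A minimal sketch of the measure over the four cells of the example table; zero cells are skipped, since their contribution tends to zero:

```python
from math import log

# Sketch of the log likelihood ratio on the 'stress echo' table;
# cells with nij = 0 contribute nothing to the sum.
observed = [1, 0, 0, 10]                 # n11, n12, n21, n22
expected = [0.09, 0.91, 0.91, 9.09]      # m11, m12, m21, m22
g2 = 2 * sum(n * log(n / m) for n, m in zip(observed, expected) if n > 0)
```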
Chi Squared Test

            word2     !word2
word1       n11       n12       n1p
!word1      n21       n22       n2p
            np1       np2       npp

x² = ∑ ( nij − mij )² / mij

The chi squared test also measures the difference between the observed values and the expected values: it sums, over the cells, the squared difference between observed and expected, normalized by the expected value.
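The same example table run through a sketch of the statistic:

```python
# Sketch of the chi squared test on the 'stress echo' table.
observed = [1, 0, 0, 10]
expected = [0.09, 0.91, 0.91, 9.09]
x2 = sum((n - m) ** 2 / m for n, m in zip(observed, expected))
```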
Odds Ratio

            word2     !word2
word1       n11       n12       n1p
!word1      n21       n22       n2p
            np1       np2       npp

Odds Ratio = (n11 * n22) / (n21 * n12)

The odds ratio is the ratio of the number of times an event takes place to the number of times it does not. It is the cross-product ratio of the 2x2 contingency table and measures the magnitude of association between two words.
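A sketch with hypothetical counts, since the measure is undefined when n12 or n21 is zero, as in the 'stress echo' table:

```python
# Sketch of the odds ratio; the counts here are made up for
# illustration, since a zero off-diagonal cell makes it undefined.
n11, n12, n21, n22 = 30, 10, 5, 100
odds = (n11 * n22) / (n21 * n12)   # 60.0: strong positive association
```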
Phi Coefficient

            word2     !word2
word1       n11       n12       n1p
!word1      n21       n22       n2p
            np1       np2       npp

Phi = ( (n11 * n22) - (n21 * n12) ) / sqrt( np1 * n1p * n2p * np2 )

The bigrams are considered positively associated if most of the data lies along the diagonal (meaning n11 and n22 are larger than n12 and n21) and negatively associated if most of the data falls off the diagonal.
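A sketch on the same hypothetical counts as above:

```python
from math import sqrt

# Sketch of the phi coefficient on hypothetical counts.
n11, n12, n21, n22 = 30, 10, 5, 100
n1p, n2p = n11 + n12, n21 + n22
np1, np2 = n11 + n21, n12 + n22
phi = (n11 * n22 - n21 * n12) / sqrt(np1 * n1p * n2p * np2)
```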
T Score

            word2     !word2
word1       n11       n12       n1p
!word1      n21       n22       n2p
            np1       np2       npp

T Score = ( n11 − m11 ) / sqrt( n11 )

The t-score tests whether there is some non-random association between two words. It is the difference between the observed and expected joint frequency divided by the square root of the observed frequency.
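On the example table, a sketch:

```python
from math import sqrt

# Sketch of the t-score on the 'stress echo' table.
n11, m11 = 1, 0.09
t = (n11 - m11) / sqrt(n11)   # 0.91
```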
Dice Coefficient

            word2     !word2
word1       n11       n12       n1p
!word1      n21       n22       n2p
            np1       np2       npp

Dice coefficient = 2 * n11 / (np1 + n1p)

The Dice coefficient depends on the frequency of the events occurring together and their individual frequencies.
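On the example table:

```python
# Sketch of the Dice coefficient on the 'stress echo' table.
n11, n1p, np1 = 1, 1, 1
dice = 2 * n11 / (np1 + n1p)   # 1.0: the two words always occur together
```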
True Mutual Information

            word2     !word2
word1       n11       n12       n1p
!word1      n21       n22       n2p
            np1       np2       npp

TMI = ∑ ( nij / npp ) * log( nij / mij )

True mutual information measures to what extent the observed frequencies differ from the expected ones.
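A sketch, again skipping zero cells:

```python
from math import log

# Sketch of true mutual information on the 'stress echo' table; zero
# cells are skipped, as with the log likelihood ratio.
observed = [1, 0, 0, 10]
expected = [0.09, 0.91, 0.91, 9.09]
npp = sum(observed)
tmi = sum((n / npp) * log(n / m) for n, m in zip(observed, expected) if n > 0)
```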
Spelling Correction
• Use context-sensitive information, in the form of bigram statistics, to rank a given set of possible spelling corrections for a misspelled word
• Given:
  • The first content word prior to the misspelled word
  • The first content word after the misspelled word
  • A list of possible spelling corrections
Spelling Correction Example
• Example sentence: Her dobutamine stress echo showed mild aurtic stenosis with a subaortic gradient.
• List of possible corrections: artic, aortic
• Statistical analysis: basic idea
Spelling Correction Statistics
[Scores for Possible 1 and Possible 2 were shown here]
• This lets us consider both the bigram formed with the word prior to the misspelling and the bigram formed with the word after it
• The possible words, with their scores, are then returned
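A hedged sketch of the ranking idea; `score_bigram` stands in for any of the association measures above and is a placeholder, not the actual system's scoring function:

```python
# Sketch: score each candidate correction by the association of the
# bigrams it forms with the content words around the misspelling.
def rank_corrections(prev_word, next_word, candidates, score_bigram):
    scored = []
    for cand in candidates:
        # bigram with the word before plus bigram with the word after
        score = score_bigram(prev_word, cand) + score_bigram(cand, next_word)
        scored.append((score, cand))
    return sorted(scored, reverse=True)

# e.g. rank_corrections('mild', 'stenosis', ['artic', 'aortic'], score_fn)
```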
Types of Results
• Gspell only
• Context sensitive only
• Hybrid of both Gspell and context
  • Takes the average of the Gspell and context-sensitive scores
  • Note: this turns into a backoff method when no statistical data is found for any of the possibilities
• Backoff method
  • Use only the context-sensitive score unless it does not exist, then revert to the Gspell score

A sketch of the two scoring schemes follows; the assumption that the Gspell and context-sensitive scores are on a comparable scale is mine, not stated in the talk.
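```python
# Sketches of the hybrid and backoff scoring; None marks a missing
# context-sensitive score (no bigram statistics found).
def hybrid_score(gspell, context):
    # average of the two scores; backs off to Gspell when no
    # statistical data was found
    return gspell if context is None else (gspell + context) / 2

def backoff_score(gspell, context):
    # context-sensitive score unless it does not exist
    return gspell if context is None else context
```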
Preliminary Test Set
• Test set: partially scrubbed clinical notes
• Size: 854 words
• Number of misspellings: 82
• Includes abbreviations
Preliminary Results
[GSpell results and context-sensitive results were shown here as tables]
Preliminary Results
[Hybrid method results were shown here as a table]
Notes on Log Likelihood
• Log likelihood is used quite often for context-sensitive spelling correction
• Problem with large sample sizes:
  • The marginal values are very large due to the sample size
  • This inflates the expected values, so the observed values are commonly much lower than the expected values
  • Very independent and very dependent ngrams end up with the same value
• Noticed similar characteristics with true mutual information
Example of Problem

            hip       !hip
follow      11        88951       88962
!follow     65729     69783140    69848869
            65740     69872091    69937831
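Working the marginals through the expected-value formula shows the effect:

```python
# On the 'follow hip' table, the huge marginals inflate the expected
# value, so the observed count sits far below the expected one, yet
# the log likelihood score still comes out large.
n11, n1p, np1, npp = 11, 88962, 65740, 69937831
m11 = (np1 * n1p) / npp   # ~83.6 expected vs. 11 observed
```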
Conclusions with Preliminary Results
• The Dice coefficient returns the best results
• The phi coefficient returns the second best
• Log likelihood and true mutual information should not be used
• The program now needs to be tested with a more extensive test bed, which is currently being created
NGram Statistics for Multi Term Identification
• Cannot use the previous statistics package
  • Memory constraints due to the amount of data
  • Would like to look for longer ngrams
• Alternative: suffix arrays (Church and Yamamoto)
  • Reduces the amount of memory
  • Two arrays:
    • One contains the corpus
    • One contains indices to the ngrams in the corpus
  • Two stacks:
    • One contains the longest common prefix
    • One contains the document frequency
  • Allows ngrams up to the size of the corpus to be found
Suffix Arrays

Corpus: to be or not to be

Suffixes:
to be or not to be
be or not to be
or not to be
not to be
to be
be

• Each array element is considered a suffix, running from its starting position to the end of the corpus
• Every ngram in the corpus is therefore a prefix of some suffix
Suffix Arrays

Actual (sorted) suffix array:
[0] = 5 => be
[1] = 1 => be or not to be
[2] = 3 => not to be
[3] = 2 => or not to be
[4] = 4 => to be
[5] = 0 => to be or not to be
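A minimal word-level sketch that reproduces this sorted array; real implementations sort integer offsets, not copies of the suffixes:

```python
# Word-level suffix array for "to be or not to be": sort the starting
# positions by the token sequence that follows each of them.
tokens = "to be or not to be".split()
suffixes = sorted(range(len(tokens)), key=lambda i: tokens[i:])
for i in suffixes:
    print(i, " ".join(tokens[i:]))
# 5 be
# 1 be or not to be
# 3 not to be
# 2 or not to be
# 4 to be
# 0 to be or not to be
```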
Term Frequency
• Term frequency (tf) is the number of times an ngram occurs in the corpus
• To determine the tf of an ngram:
  • Sort the suffix array; all suffixes beginning with the ngram form one contiguous block
  • tf = j − i + 1
    • i = index of the first occurrence in the sorted array
    • j = index of the last occurrence in the sorted array
• Example: "to be" begins entries [4] and [5] of the sorted array above, so tf = 5 − 4 + 1 = 2
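A sketch of the lookup using binary search over the sorted suffixes; the sentinel-based upper bound is a shortcut of mine, not Church and Yamamoto's algorithm:

```python
import bisect

# tf lookup: the suffixes starting with an ngram form one contiguous
# block [i, j] in the sorted array, so tf = j - i + 1.
tokens = "to be or not to be".split()
suffixes = sorted(" ".join(tokens[k:]) for k in range(len(tokens)))

def term_frequency(ngram):
    i = bisect.bisect_left(suffixes, ngram)
    # "\uffff" sorts after any real character, bounding the block
    j = bisect.bisect_right(suffixes, ngram + "\uffff") - 1
    return max(0, j - i + 1)

print(term_frequency("to be"))  # 2
```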
Measures of Association
• Residual Inverse Document Frequency (RIDF)
  • RIDF = −log( df / D ) + log( 1 − exp( −tf / D ) )
  • Compares the distribution of a term over documents to what would be expected from a random term
• Mutual Information (MI)
  • MI(xYz) = log( ( tf(xYz) * tf(Y) ) / ( tf(xY) * tf(Yz) ) )
  • Compares the frequency of the whole ngram to the frequencies of its parts
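Direct transcriptions of the two formulas as a sketch; the natural log is assumed, since the slides do not state a base, and the base only rescales the scores:

```python
from math import log, exp

# df = number of documents containing the term, D = total number of
# documents, tf = frequency of the term in the corpus.
def ridf(tf, df, D):
    return -log(df / D) + log(1 - exp(-tf / D))

def mi(tf_xyz, tf_y, tf_xy, tf_yz):
    # frequency of the whole ngram xYz against the frequencies of its parts
    return log((tf_xyz * tf_y) / (tf_xy * tf_yz))
```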
Present Work
• Calculated the MI and RIDF over the clinical notes for each of the possible sections: CC, CM, IP, HPI, PSH, SH and DX
  • Retrieved the respective text for each heading
  • Calculated the RIDF and MI for each possible ngram with a term frequency greater than 10 in the data under each section
• Noticed that different multi terms appear for each of the different sections
Conclusions
• Ngram statistics can be applied directly and indirectly to various problems
• Directly:
  • Spelling correction
  • Compound word identification
  • Term extraction
  • Name identification
• Indirectly:
  • Part of speech tagging
  • Information retrieval
  • Data mining
Packages
• Two statistical packages
• Contingency table approach
  • Measures for bigrams: Log Likelihood, True Mutual Information, Chi Squared Test, Odds Ratio, Phi Coefficient, T Score, and Dice Coefficient
  • Measures for trigrams: Log Likelihood and True Mutual Information
• Suffix array approach
  • Measures for all lengths of ngrams: Residual Inverse Document Frequency and Mutual Information