Incorporating N-gram Statistics in the Normalization of Clinical Notes

Presentation Transcript


  1. Incorporating N-gram Statistics in the Normalization of Clinical Notes By Bridget Thomson McInnes

  2. Overview • Ngrams • Ngram Statistics for Spelling Correction • Spelling Correction • Ngram Statistics for Multi Term Identification • Multi Term Identification

  3. Ngram

  Example sentence: Her dobutamine stress echo showed mild aortic stenosis with a subaortic gradient.

  Bigrams:
  her dobutamine
  dobutamine stress
  stress echo
  echo showed
  showed mild
  mild aortic
  aortic stenosis
  stenosis with
  with a
  a subaortic
  subaortic gradient

  Trigrams:
  her dobutamine stress
  dobutamine stress echo
  stress echo showed
  echo showed mild
  showed mild aortic
  mild aortic stenosis
  aortic stenosis with
  stenosis with a
  with a subaortic
  a subaortic gradient
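
As a sketch, the n-grams above can be generated in a few lines of Python (the function name ngrams and the whitespace tokenization are our own choices, not part of the presentation):

    def ngrams(tokens, n):
        """Return the list of n-grams (as tuples) in a token sequence."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = ("her dobutamine stress echo showed mild aortic "
              "stenosis with a subaortic gradient").split()
    bigrams = ngrams(tokens, 2)   # 11 bigrams, e.g. ('stress', 'echo')
    trigrams = ngrams(tokens, 3)  # 10 trigrams, e.g. ('stress', 'echo', 'showed')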

  4. Contingency Tables

                word2       !word2
    word1       n11         n12         n1p
    !word1      n21         n22         n2p
                np1         np2         npp

  • n11 = the joint frequency of word1 and word2
  • n12 = the frequency with which word1 occurs and word2 does not
  • n21 = the frequency with which word2 occurs and word1 does not
  • n22 = the frequency with which neither word1 nor word2 occurs
  • npp = the total number of ngrams
  • n1p, np1, n2p, np2 are the marginal counts

  5. Contingency Tables

  Bigram counts: her dobutamine 1, dobutamine stress 1, stress echo 1, echo showed 1, showed mild 1, mild aortic 1, aortic stenosis 1, stenosis with 1, with a 1, a subaortic 1, subaortic gradient 1

                echo        !echo
    stress      1           0           1
    !stress     0           10          10
                1           10          11
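
A minimal sketch of filling the 2x2 table for 'stress echo' from these bigram counts, reusing the ngrams sketch above (variable names mirror the slide):

    from collections import Counter

    counts = Counter(bigrams)          # every bigram in the sentence occurs once
    npp = sum(counts.values())         # 11 bigrams in total

    w1, w2 = "stress", "echo"
    n11 = counts[(w1, w2)]                                    # 1
    n1p = sum(c for (a, _), c in counts.items() if a == w1)   # 1
    np1 = sum(c for (_, b), c in counts.items() if b == w2)   # 1
    n12 = n1p - n11                                           # 0
    n21 = np1 - n11                                           # 0
    n22 = npp - n11 - n12 - n21                               # 10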

  6. Contingency Tables: Expected Values

                word2       !word2
    word1       n11         n12         n1p
    !word1      n21         n22         n2p
                np1         np2         npp

  • Expected Values
  • m11 = (np1 * n1p) / npp
  • m12 = (np2 * n1p) / npp
  • m21 = (np1 * n2p) / npp
  • m22 = (np2 * n2p) / npp

  7. Contingency Tables

                echo        !echo
    stress      1           0           1
    !stress     0           10          10
                1           10          11

  • Expected Values
  • m11 = ( 1 * 1 ) / 11 = 0.09
  • m12 = ( 1 * 10) / 11 = 0.91
  • m21 = ( 1 * 10) / 11 = 0.91
  • m22 = (10 * 10) / 11 = 9.09

  What is this telling you? 'stress echo' occurs once in our example. The expected number of occurrences of 'stress echo', if the two words were independent, is 0.09 (m11).
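
Continuing the sketch, the expected values follow directly from the marginals:

    n2p = npp - n1p   # 10
    np2 = npp - np1   # 10

    m11 = (np1 * n1p) / npp   # ( 1 *  1) / 11 ≈ 0.09
    m12 = (np2 * n1p) / npp   # (10 *  1) / 11 ≈ 0.91
    m21 = (np1 * n2p) / npp   # ( 1 * 10) / 11 ≈ 0.91
    m22 = (np2 * n2p) / npp   # (10 * 10) / 11 ≈ 9.09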

  8. Ngram Statistics • Measures of Association • Log Likelihood Ratio • Chi Squared Test • Odds Ratio • Phi Coefficient • T-Score • Dice Coefficient • True Mutual Information

  9. Log Likelihood Ratio (2x2 contingency table as in slide 4) Log Likelihood = 2 * ∑ ( nij * log( nij / mij ) ) The log likelihood ratio measures the divergence between the observed and the expected values: it is twice the sum, over the cells of the table, of each observed count times the log of the ratio of observed to expected.
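
A sketch of the measure in Python; cells with an observed count of zero are skipped, following the usual convention that n * log(n / m) tends to 0 as n goes to 0:

    import math

    def log_likelihood(observed, expected):
        """2 * sum over the four cells of n_ij * log(n_ij / m_ij)."""
        return 2 * sum(n * math.log(n / m)
                       for n, m in zip(observed, expected) if n > 0)

    ll = log_likelihood([n11, n12, n21, n22], [m11, m12, m21, m22])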

  10. Chi Squared Test (2x2 contingency table as in slide 4) X² = ∑ ( (nij – mij)² / mij ) The chi squared test also measures the difference between the observed and the expected values: it is the sum, over the cells of the table, of the squared difference between observed and expected, divided by the expected value.
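
The same cells plug into the chi squared statistic:

    def chi_squared(observed, expected):
        """Sum over the cells of the squared observed-minus-expected
        difference, each scaled by the expected value."""
        return sum((n - m) ** 2 / m for n, m in zip(observed, expected))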

  11. Odds Ratio (2x2 contingency table as in slide 4) Odds Ratio = (n11 * n22) / (n21 * n12) The odds ratio is the ratio of the number of times an event takes place to the number of times it does not. It is the cross product ratio of the 2x2 contingency table and measures the magnitude of the association between two words.
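
A direct transcription; note that the measure is undefined when n12 or n21 is zero (as in the stress/echo example), so a sketch needs some guard for that case:

    def odds_ratio(n11, n12, n21, n22):
        """Cross-product ratio of the 2x2 contingency table."""
        if n21 == 0 or n12 == 0:
            return float("inf")   # one possible convention among several
        return (n11 * n22) / (n21 * n12)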

  12. Phi Coefficient (2x2 contingency table as in slide 4) Phi = ( (n11 * n22) - (n21 * n12) ) / sqrt( np1 * n1p * n2p * np2 ) The bigram is considered positively associated if most of the data lies along the diagonal (that is, if n11 and n22 are larger than n12 and n21) and negatively associated if most of the data falls off the diagonal.
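
In code, computing the marginals from the four cells:

    import math

    def phi_coefficient(n11, n12, n21, n22):
        """Ranges from -1 (data off the diagonal) to +1 (data on the diagonal)."""
        n1p, n2p = n11 + n12, n21 + n22
        np1, np2 = n11 + n21, n12 + n22
        return (n11 * n22 - n21 * n12) / math.sqrt(np1 * n1p * n2p * np2)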

  13. T Score (2x2 contingency table as in slide 4) T Score = ( n11 – m11 ) / sqrt( n11 ) The t-score tests whether there is a non-random association between two words: the difference between the observed and the expected joint frequency, divided by the square root of the observed joint frequency.
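
In code:

    import math

    def t_score(n11, m11):
        """Observed minus expected joint frequency, over sqrt of the observed."""
        return (n11 - m11) / math.sqrt(n11)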

  14. Dice Coefficient (2x2 contingency table as in slide 4) Dice Coefficient = 2 * n11 / (np1 + n1p) The dice coefficient depends on the frequency with which the two events occur together and on their individual frequencies.
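
In code:

    def dice_coefficient(n11, n1p, np1):
        """Twice the joint frequency over the sum of the two marginal frequencies."""
        return 2 * n11 / (np1 + n1p)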

  15. True Mutual Information (2x2 contingency table as in slide 4) TMI = ∑ ( (nij / npp) * log( nij / mij ) ) True mutual information measures to what extent the observed frequencies differ from the expected ones.
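
In code, again skipping zero-count cells:

    import math

    def true_mutual_information(observed, expected, npp):
        """Sum over the four cells of (n_ij / npp) * log(n_ij / m_ij)."""
        return sum((n / npp) * math.log(n / m)
                   for n, m in zip(observed, expected) if n > 0)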

  16. Spelling Correction • Use context-sensitive information, through bigrams, to rank a given set of possible spelling corrections for a misspelled word • Given: • the first content word prior to the misspelled word • the first content word after the misspelled word • a list of possible spelling corrections

  17. Spelling Correction Example • Example sentence: • Her dobutamine stress echo showed mild aurtic stenosis with a subaortic gradient. • List of possible corrections: • artic • aortic • Statistical analysis: basic idea

  18. Spelling Correction Statistics Possible 1: Possible 2: [the per-candidate scoring formulas were not captured in the transcript] • This lets us take into account finding a bigram with the word prior to the misspelling and a bigram with the word after the misspelling • Each possible word is then returned with its score
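
A sketch of the scoring step. The slides do not spell out how the two bigram scores are combined, so averaging them is our assumption, and assoc is a hypothetical helper that returns one of the association measures above for a bigram:

    def score_candidate(prior, candidate, following, assoc):
        """Score a candidate correction from the bigram it forms with the
        content word before the misspelling and with the one after it.
        Averaging the two scores is an assumption, not the slide's method."""
        return (assoc(prior, candidate) + assoc(candidate, following)) / 2

    # e.g. ranking the candidates for 'aurtic' in context 'mild ___ stenosis':
    # sorted(["artic", "aortic"],
    #        key=lambda c: score_candidate("mild", c, "stenosis", assoc),
    #        reverse=True)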

  19. Types of Results • GSpell only • Context sensitive only • Hybrid of both GSpell and context • Take the average of the GSpell and context-sensitive scores • Note: this turns into a backoff method when no statistical data is found for any of the possibilities • Backoff method • Use only the context-sensitive score unless it does not exist; in that case, revert to the GSpell score
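
A sketch of the hybrid and backoff combinations just described, assuming a missing context-sensitive score is represented as None:

    def hybrid_score(gspell, context):
        """Average of the GSpell and context-sensitive scores; falls back to
        GSpell alone when no statistical data was found for the candidate."""
        return gspell if context is None else (gspell + context) / 2

    def backoff_score(gspell, context):
        """Use the context-sensitive score unless it does not exist."""
        return context if context is not None else gspell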

  20. Preliminary Test Set • Test set: partially scrubbed clinical notes • Size: 854 words • Number of misspellings: 82 • Includes abbreviations

  21. Preliminary Results • GSpell results and context-sensitive results [result tables not captured in the transcript]

  22. Preliminary Results • Hybrid method results [result table not captured in the transcript]

  23. Notes on Log Likelihood • Log likelihood is used quite often for context-sensitive spelling correction • Problem with large sample sizes • The marginal values are very large due to the sample size • This inflates the expected values, so the observed values commonly end up much lower than the expected values • Very independent and very dependent ngrams end up with the same value • Similar characteristics were noticed with true mutual information

  24. Example of Problem

                hip         !hip
    follow      11          88951         88962
    !follow     65729       69783140      69848869
                65740       69872091      69937831
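
Plugging this table into the earlier sketches makes the problem concrete: the expected joint frequency dwarfs the observed one.

    n11, n12, n21, n22 = 11, 88951, 65729, 69783140
    npp = n11 + n12 + n21 + n22   # 69937831
    n1p, np1 = n11 + n12, n11 + n21
    m11 = (np1 * n1p) / npp       # ≈ 83.6 expected, versus 11 observed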

  25. Conclusions from Preliminary Results • The Dice coefficient returns the best results • The Phi coefficient returns the second best • Log Likelihood and True Mutual Information should not be used • The program now needs to be tested against a more extensive test bed, which is in the process of being created

  26. Ngram Statistics for Multi Term Identification • Cannot use the previous statistics package • Memory constraints due to the amount of data • Would like to look for longer ngrams • Alternative: suffix arrays (Church and Yamamoto) • Reduces the amount of memory required • Two arrays • One contains the corpus • One contains identifiers of the ngrams in the corpus • Two stacks • One contains the longest common prefixes • One contains the document frequencies • Allows ngrams up to the size of the corpus to be found

  27. Suffix Arrays

  Corpus: to be or not to be

  Suffixes:
  to be or not to be
  be or not to be
  or not to be
  not to be
  to be
  be

  • Each array element is considered a suffix
  • An ngram extends from the start of a suffix to the end of the array

  28. Suffix Arrays

  Corpus: to be or not to be

  Actual suffix array:
  [0] = 5 => be
  [1] = 1 => be or not to be
  [2] = 3 => not to be
  [3] = 2 => or not to be
  [4] = 4 => to be
  [5] = 0 => to be or not to be
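
A word-level sketch in Python that reproduces the array above (an O(n^2 log n) toy; Church and Yamamoto describe far more economical constructions):

    def suffix_array(tokens):
        """Start positions of all suffixes, sorted lexicographically."""
        return sorted(range(len(tokens)), key=lambda i: tokens[i:])

    tokens = "to be or not to be".split()
    sa = suffix_array(tokens)   # [5, 1, 3, 2, 4, 0], matching the slide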

  29. Term Frequency • Term frequency (tf) is the number of times an ngram occurs in the corpus • To determine the tf of an ngram: • Sort the suffix array • tf = j – i + 1 • i = first occurrence of the ngram in the sorted array • j = last occurrence of the ngram in the sorted array

  [0] = 5 => be
  [1] = 1 => be or not to be
  [2] = 3 => not to be
  [3] = 2 => or not to be
  [4] = 4 => to be
  [5] = 0 => to be or not to be
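
A sketch of the tf computation; for clarity it scans the sorted array linearly, where a real implementation would binary-search for the first and last matching suffix:

    def term_frequency(ngram, tokens, sa):
        """tf = j - i + 1, where i and j are the first and last positions in
        the sorted suffix array whose suffixes start with the ngram."""
        hits = [k for k, start in enumerate(sa)
                if tokens[start:start + len(ngram)] == list(ngram)]
        return hits[-1] - hits[0] + 1 if hits else 0

    term_frequency(["to", "be"], tokens, sa)   # positions 4 and 5 -> tf = 2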

  30. Measures of Association • Residual Inverse Document Frequency (RIDF) • RIDF = - log( df / D ) + log( 1 – exp( -tf / D ) ) • Compares the distribution of a term over documents to what would be expected from a random (Poisson-distributed) term • Mutual Information (MI) • MI(xYz) = log( ( tf(xYz) * tf(Y) ) / ( tf(xY) * tf(Yz) ) ) • Compares the frequency of the whole ngram to the frequencies of its parts
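
Direct transcriptions of the two formulas, where df is the document frequency of the term, D the number of documents, and tf the term frequency:

    import math

    def ridf(df, tf, D):
        """Observed IDF minus the IDF a randomly (Poisson) distributed term
        of the same total frequency would be expected to have."""
        return -math.log(df / D) + math.log(1 - math.exp(-tf / D))

    def mutual_information(tf_xYz, tf_Y, tf_xY, tf_Yz):
        """Frequency of the whole ngram xYz against the frequencies of its parts."""
        return math.log((tf_xYz * tf_Y) / (tf_xY * tf_Yz))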

  31. Present Work • Calculated the MI and RIDF for the clinical notes for each of the possible sections: CC, CM, IP, HPI, PSH, SH and DX • Retrieved the respective text for each heading • Calculated the RIDF and MI for each possible ngram with a term frequency greater than 10 in the data under each section • Noticed that different multi terms appear for each of the different sections

  32. Conclusions • Ngram statistics can be applied directly and indirectly to various problems • Directly • Spelling correction • Compound word identification • Term extraction • Name identification • Indirectly • Part of Speech tagging • Information Retrieval • Data Mining

  33. Packages • Two Statistical Packages • Contingency Table approach • Measures for bigrams • Log Likelihood, True Mutual Information, Chi Squared Test, Odds Ratio, Phi Coefficient, T Score, and Dice Coefficient • Measures for trigrams • Log Likelihood and True Mutual Information • Suffix Array approach • Measures for all lengths of ngrams • Residual Inverse Document Frequency and Mutual Information
