1. Using the UMLS MetaMap as a Cause of Death Analyzer
Michael Hogarth, MD
Michael Resendez, MS
Univ. of California, Davis
2. Overview
Causes of Death: A Historical Perspective
Overview of the California EDRS
Cause of Death Analysis tool (BECA)
NLM MetaMap and the UMLS
BECA-MetaMap experiment
Discussion
3. Historical Perspectives on causes of death
Bills of Mortality (1532)
Arose from the need to better understand death rates in medieval England -- plague epidemics (1361, 1368, 1375, 1390, 1406, …)
John Graunt (1620-74)
Used the Bills of Mortality and found an infant death rate of 36% in England -- not previously known or understood
London Bills of Mortality classification
Used by Dr. John Snow to characterize a cholera outbreak traced to a water source in London
Evolved to become the International Classification of Diseases (1850s)
International Classification of Diseases (ICD) -- used for the last 150 years
4. CA-EDRS Causes of Death
5. Causes of Death
Importance
key epidemiological information is contained in the cause of death
Issues and Challenges
absolutely correct versus ‘close to correct’
absolute correctness requires significant time and manual effort
is ‘close to correct’ in an automated fashion still useful?
Typical process in California
COD --> SuperMICAR --> Stat Master File
turnaround for entire process can be lengthy (~2 years)
a trend in causes of death could emerge and remain unknown to local jurisdictions for 2 years
Today in California
a significant number of jurisdictions today don’t wait for the final statistical files from the State office to look at trends; they *manually* ‘code’ (if they have the staff), which takes time and funding
6. Preliminary COD classification
Possible uses of a preliminary COD classification using automated methods that are ‘close to correct’
early identification of trends in a local jurisdiction
disease vs. injury/poisoning -- coroner referral cross-checking
identify specific infectious causes (encephalitis, cholera, etc.)
What it is not
not for ‘absolutely correct’ cause of death classification
will not replace the nosologist’s expertise in understanding the sequence of events leading to death nor their understanding of ICD-10, with its includes/excludes
7. How to analyze causes of death?
Challenges
text is verbatim and thus ‘arbitrary’ (free text)
need to go beyond simple keyword matching
biomedical knowledge and content is vast -- and constantly changing!
A possible approach - text mining and computational linguistic techniques
8. BECA
We built BECA, a generic concept analyzer framework that can incorporate any ‘concept identifier’ engine such as NLM MetaMap and other text processing tools
BECA = BECA Enables Concept Analysis
Supports a ‘plug-in’ design for the concept matcher and other components (e.g., a spell checker)
Designed to support multiple transformations of the text in step-by-step fashion
transformations -- strip special characters, lower case, run it through the concept matcher engine (MetaMap or other), run it through an available spell checker (jazzy spell, etc.)
example transformations
convert to lowercase, remove all punctuation, map string using concept mapper, etc.
First version of BECA uses the NLM MetaMap as a concept mapper
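The step-by-step, plug-in design described on this slide can be sketched in Python as follows. This is only an illustration of the idea, not BECA's actual code: the function names, the `toy_mapper` stand-in for a concept mapper such as MetaMap, and the `TOY-CONCEPT-1` identifier are all hypothetical.

```python
import string
from typing import Callable, List, Tuple

# A transformation takes a string and returns a (possibly changed) string.
Transformation = Callable[[str], str]
# A concept mapper takes the final string and returns concept identifiers.
ConceptMapper = Callable[[str], List[str]]


def to_lowercase(text: str) -> str:
    return text.lower()


def strip_punctuation(text: str) -> str:
    # Drop ASCII punctuation, then collapse any repeated whitespace.
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())


def run_pipeline(text: str, steps: List[Transformation]) -> List[str]:
    """Apply each step in order and keep the text produced after every step."""
    history = [text]
    for step in steps:
        history.append(step(history[-1]))
    return history


def analyze(text: str, steps: List[Transformation],
            mapper: ConceptMapper) -> Tuple[List[str], List[str]]:
    """Run the transformations, then hand the final string to the plug-in mapper."""
    history = run_pipeline(text, steps)
    return history, mapper(history[-1])


def toy_mapper(text: str) -> List[str]:
    # Hypothetical stand-in for a real concept mapper (e.g. MetaMap).
    lookup = {"septic shock": "TOY-CONCEPT-1"}
    return [lookup[text]] if text in lookup else []


history, concepts = analyze("SEP[TIC SHOCK", [to_lowercase, strip_punctuation], toy_mapper)
print(history)
print(concepts)
```

Because each step is just a callable, a spell checker or a different concept mapper could be swapped in without changing `run_pipeline`.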
9. BECA system design
10. Example transformations
11. What is NLM MetaMap?
The National Library of Medicine’s MetaMap
a free, open source software component built by the NLM Lister Hill Laboratory
uses computational linguistic techniques to map biomedical text to a large corpus of biomedical content (the NLM Unified Medical Language System)
Provides a number of text processing functions
Includes a ‘concept mapper’ that attempts to match phrases with concepts in the UMLS Metathesaurus
Includes a UMLS concept-to-code mapping for multiple coding systems (ICD, SNOMED, etc.)
12. How does MetaMap work?
Takes text as input and attempts to identify ‘concepts’ in the text and match them to concepts in a large corpus of phrases and concepts in biomedicine (UMLS Metathesaurus)
The retrieved “candidate” matches include a score that reflects how confident MetaMap is that the match is correct
The candidates retrieved include their semantic type
“Disease or Syndrome”, “Injury or Poisoning”, etc.
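A minimal Python sketch of how a caller might hold and filter such candidates. The `Candidate` class, the `filter_candidates` helper, and the sample values are hypothetical and do not reproduce MetaMap's actual output format; only the semantic type names and the idea of a per-candidate score come from the slides.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Candidate:
    concept_name: str
    semantic_type: str
    score: int  # higher means a stronger match


def filter_candidates(candidates: List[Candidate],
                      min_score: int = 800,
                      excluded_types: FrozenSet[str] = frozenset()) -> List[Candidate]:
    """Keep candidates at or above min_score whose semantic type is not excluded,
    strongest match first."""
    kept = [c for c in candidates
            if c.score >= min_score and c.semantic_type not in excluded_types]
    return sorted(kept, key=lambda c: c.score, reverse=True)


# Made-up sample data for illustration only.
sample = [
    Candidate("Toy injury concept", "Injury or Poisoning", 861),
    Candidate("Myasthenia Gravis", "Disease or Syndrome", 1000),
    Candidate("Toy low-score concept", "Disease or Syndrome", 620),
]

for c in filter_candidates(sample, excluded_types=frozenset({"Injury or Poisoning"})):
    print(c.concept_name, c.semantic_type, c.score)
```

Raising `min_score` or excluding semantic types trades the number of matches for precision.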
13. The UMLS
Developed by the National Library of Medicine
Derived from over 100 sources (ICD, SNOMED, etc.)
The Unified Medical Language System
A system built to support information retrieval in biomedicine
Used in PubMed, ClinicalTrials.gov, etc.
Consists of:
(1) UMLS Metathesaurus
(2) UMLS Semantic Network
(3) UMLS SPECIALIST Lexicon
14. UMLS in detail
UMLS Metathesaurus -- the world’s largest repository of biomedical phrases
1.3 million concepts, 6.4 million unique phrases (concept names)
over 100 source vocabularies (ICD, SNOMED, CPT, etc.)
UMLS SPECIALIST Lexicon
a file that provides individual words found in the UMLS Metathesaurus and their linguistic information including grammatical ‘type’ (noun, verb, adjective, adverb, etc.)
UMLS Semantic Network
a set of files that classify each Metathesaurus ‘concept’ into a particular type
Examples -- “Disease”, “Injury/Poisoning”, “Neoplasm”, ...
15. MetaMap Algorithm
MetaMap’s algorithm consists of four steps
(1) Parsing
using a part-of-speech tagger, the text is decomposed into one or more noun phrases
“ocular complications of myasthenia gravis” ==> “ocular complications” and “myasthenia gravis”.
noun phrases are processed independently by decomposing them into their grammatical components
“ocular complications” ==> modifier “ocular” and head of the phrase “complications”
(2) Variant Generation -- ‘variants’ for each phrase are generated using SPECIALIST
variants -- all synonyms of the term, acronyms containing the term, abbreviations, plural/singular variants
each variant has a ‘distance’ score obtained from SPECIALIST
“ocular” - “eye”, “eyes”, “optic”, “ophthalmic”, “ophthalmia”, “oculus”, “oculi”
16. MetaMap Algorithm (continued)
(3) Candidate Retrieval from Metathesaurus
all Metathesaurus strings that contain at least one of the variants are retrieved
can exclude variants that are present in a large number of strings (i.e., very common strings)
(4) Candidate evaluation -- the MMTX score
each Metathesaurus candidate is evaluated by calculating the ‘strength’ of the similarity between the original input phrase and the candidate phrase from the Metathesaurus
the calculation involves a weighted average of four metrics: the distance scores of the variants matched from the input noun phrase (‘variation’), whether the match involves the ‘head’ of the phrase (‘centrality’), ‘coverage’, and ‘cohesiveness’
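Steps (3) and (4) can be sketched in Python as below. This is a toy illustration under stated assumptions, not MetaMap's implementation: the slides do not give the metric weights, so the weights `(1, 1, 2, 2)` are an assumption, each metric is assumed to lie between 0 and 1, the result is scaled to 0-1000 only because the scores reported in this deck go up to 1000, and the string list is made up.

```python
from typing import List, Optional, Sequence


def retrieve_candidates(variants: Sequence[str],
                        metathesaurus_strings: Sequence[str],
                        max_hits: Optional[int] = None) -> List[str]:
    """Return strings containing at least one variant as a whole word.
    Variants that hit more than max_hits strings are treated as too common
    and skipped (assumed behaviour for illustration)."""
    hits = set()
    for variant in variants:
        matching = [s for s in metathesaurus_strings
                    if variant in s.lower().split()]
        if max_hits is not None and len(matching) > max_hits:
            continue
        hits.update(matching)
    return sorted(hits)


def weighted_match_score(variation: float, centrality: float,
                         coverage: float, cohesiveness: float,
                         weights: Sequence[float] = (1, 1, 2, 2)) -> int:
    """Weighted average of the four metrics, scaled to 0-1000.
    The default weights are an assumption, not taken from the slides."""
    values = (variation, centrality, coverage, cohesiveness)
    weighted = sum(w * v for w, v in zip(weights, values))
    return round(1000 * weighted / sum(weights))


# Made-up stand-in for Metathesaurus strings.
toy_strings = ["Eye", "Optic nerve", "Complications of surgery",
               "Eye injuries", "Heart failure"]

print(retrieve_candidates(["ocular", "eye", "optic"], toy_strings))
print(weighted_match_score(1.0, 1.0, 0.5, 0.5))
```

With `max_hits=1` the variant "eye" would be skipped as too common in this toy list, leaving only the string matched by "optic".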
17. Example BECA MetaMap output
Input phrase: “ocular complications”
18. The question
Can BECA using the NLM MetaMap be useful in:
identifying biomedical concepts in a cause of death literal, which is narrative text?
“auto-coding” literals into ICD-10 codes?
19. Cause of Death Literals in CA-EDRS
CA-EDRS data is a combination of records initiated in EDRS (EDRS counties) and those submitted on paper (non-EDRS counties)
Causes of death are verbatim from the certifier and typically entered into EDRS or typed on a paper certificate by funeral home staff or hospital staff
Overall COD statistics for CA-EDRS
462,564 registered death certificates
985,330 unique literals (phrases) in all COD fields
88,719 unique literals (phrases) in the Immediate Cause of Death field
20. Experiment
We randomly selected 1,000 literals from the 88,719 unique literals in the Immediate Cause of Death field
We submitted these “as is” to BECA (MetaMap, no spell checking component)
BECA returned about 7.8 candidate matches per literal (7,791 candidates for 1,000 strings)
Candidate scores ranged from 517 - 1000
Match score distribution for the 7,791 candidates
21. Example Output
22. Literals with high score matches >= 800
23. High Score Candidate Matches
3,017 (38.7%) of the 7,791 candidates had a score >= 800
95.3% of the original literals (953/1000) had at least one candidate with a match score >= 800
54.5% of the original literals (545/1000) had at least one candidate with a match score >= 900
30.7% of the original literals (307/1000) had at least one candidate with a match score = 1000
Note: only 7.5% were the exact string as found in the UMLS Metathesaurus
Match score distribution for the 3,017 candidates
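The per-literal percentages on this slide count a literal once if any of its candidates reaches the threshold. A short Python sketch of that calculation follows; the function name and the four-literal data set are hypothetical, and only the totals in the last line come from the slides.

```python
from typing import Dict, List


def share_with_match(scores_by_literal: Dict[str, List[int]], threshold: int) -> float:
    """Fraction of literals having at least one candidate score >= threshold."""
    hits = sum(1 for scores in scores_by_literal.values()
               if any(score >= threshold for score in scores))
    return hits / len(scores_by_literal)


# Made-up candidate scores for four literals, for illustration only.
toy_scores = {
    "literal a": [1000, 700],
    "literal b": [861],
    "literal c": [640, 590],
    "literal d": [900, 820],
}

for threshold in (800, 900, 1000):
    print(threshold, share_with_match(toy_scores, threshold))

# Candidate-level share reported on the slide: 3,017 of 7,791 candidates.
print(round(100 * 3017 / 7791, 1))
```

Note the two different denominators: the 38.7% figure is per candidate, while the 95.3%, 54.5%, and 30.7% figures are per literal.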
24. Semantic Type correct matches
BECA with MetaMap correctly categorized 720 (72%) of the literals by semantic type
Of these, “Neoplastic Process” had the highest reliability
25. Wrong matches
Semantic types most frequently in error
26. ICD-10 Coding
252 of the 1,000 (25.2%) literals had an ICD-10 code matched by BECA-MetaMap
Categories
1 = good match
2 = approximate match (within ICD category)
0 = incorrect code
Results - 97% were good or approximate
82.5% “good match”
14.3% “approximate match”
3.2% “incorrect match”
27. ICD-10 Autocoding data
28. Some interesting challenges
“CSTFIOTRDPIRATORY FAILURE”
“CHRONIC ALCOHOLISHM”
“ESOPHAGELA VARICES”
“END STAGE RENAL DOSEASE”
“HEAR FAILURE”
“OVARION CANCER WITH METASTASES”
“LUNF CARCINOMA, METASTATIC”
“PENDING TOX & MICRO”
“SEP[TIC SHOCK”
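Several of these literals are one or two keystrokes away from a recognizable phrase. As a rough illustration of how a spell-checking step might recover them, the sketch below uses Python's standard `difflib` as a stand-in (it is not the jazzy spell checker mentioned earlier). The vocabulary is a hypothetical list of intended phrases assumed from the misspellings, and the 0.8 cutoff is an arbitrary choice.

```python
import difflib
from typing import Optional, Sequence

# Hypothetical vocabulary: intended phrases assumed from the misspelled literals.
VOCABULARY = [
    "heart failure",
    "septic shock",
    "esophageal varices",
    "end stage renal disease",
    "chronic alcoholism",
]


def suggest_correction(literal: str,
                       vocabulary: Sequence[str] = VOCABULARY,
                       cutoff: float = 0.8) -> Optional[str]:
    """Return the closest vocabulary phrase, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(literal.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for literal in ["HEAR FAILURE", "SEP[TIC SHOCK", "ESOPHAGELA VARICES",
                "PENDING TOX & MICRO"]:
    print(literal, "->", suggest_correction(literal))
```

A literal such as “PENDING TOX & MICRO” is not a misspelling at all, so string similarity alone returns no suggestion for it; that kind of entry needs separate handling.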
29. Discussion
MetaMap may be useful for preliminary categorization of causes of death by semantic type
Excluding certain semantic types would improve match precision (at the cost of fewer matches)
BECA-MetaMap only assigned an ICD-10 code 25.2% of the time
When BECA-MetaMap did assign an ICD-10 code, it was correct in about 83% of cases, and correct or near correct in 97% of cases
We found that MetaMap was “confused” if:
there are multiple concepts (noun phrases) in a single string
the phrase has a compound statement (“metastasis to brain and bone” or “gunshot wounds of the head and right arm”)
the phrase begins with certain words (e.g., complications, etc.)
30. Future Directions for BECA
Build a new “concept mapper” to replace MetaMap, and specifically design it to analyze cause of death phrases
include a spell checker
disambiguation for phrases that have compound statements
match SNOMED first, then match to ICD-10 (increases the hit rate for ICD-10 autocoding)
improve performance
implement ICD-10 includes/excludes using an open source rules engine (JBoss Rules Engine)
31. Credits
National Library of Medicine, Lister Hill Lab
University of California
Michael Resendez, MS
Cecil Lynch, MD, MS
California Department of Health (California Department of Public Health)
Terry Trinidad
David Fisher
Debbie McDowell
32. California EDRS
Developed by the University of California and California DHS (2004-2005)
Implementation (2005 - 2008)
all death certificates entered into EDRS since Jan 1, 2005
full EDRS (implemented counties) -- DC originates in EDRS and is completed electronically at the local level
KDE EDRS (non-EDRS counties) -- DC completed in standard ‘paper’ fashion, eventually entered by State office into EDRS
June 2007 - where are we?
today --> 510,000 certificates (2005 - present)
Originate locally (EDRS records) or are entered later into EDRS (non-EDRS records)
Today, June 2007, ~ 65% originate locally as EDRS electronic
By Nov 2007 over 90% of all CA records will originate in EDRS
33. Cause of Death Workflow with CA-EDRS
CA-EDRS does not provide electronic support for gathering of the COD today