Adapting an Algorithm to a Corpus Peter Nelson Carleton College J. Starren, M.D., Ph.D. L. Rasmussen
Project Purpose • In the context of a GWAS on hypothyroidism • A particular natural language processing algorithm used to identify contextual features • Discover and evaluate automatic and semi-automatic methods of adapting that algorithm to a corpus of medical records
Project Motivation • PMRP • eMERGE • Hypothyroidism GWAS • Phenotyping
Project Motivation - PMRP • Marshfield Clinic PMRP • ~ 20,000 people from central WI • EHR and blood samples • Studies in the fields of: • Population Genetics • Genetic Epidemiology • Pharmacogenetics • Leverage genetic data to improve care
Project Motivation - eMERGE • eMERGE Network • Organized by NHGRI • Members • Marshfield Clinic • Vanderbilt • Northwestern • Mayo Clinic • Group Health Cooperative • Genome Wide Association Studies
What is a GWAS? Why Do One? • “[A GWAS] involves rapidly scanning markers across the… genomes of many people to find genetic variations associated with a particular disease.” • “[R]esearchers can use the information to develop better strategies to detect, treat and prevent the disease.” • “…common, complex diseases, such as asthma, cancer, diabetes….” NHGRI website (http://www.genome.gov/20019523)
Hypothyroidism GWAS • Insufficient hormone production by the thyroid gland can cause fatigue, weight gain, and other symptoms • Diagnosable and treatable • About 3% of the American population has the clinical condition • Different causes
Hypothyroidism GWAS • eMERGE Study • Identify patients with presumptive Hashimoto’s disease-induced hypothyroidism (Cases) • Identify patients with normal thyroid function (Controls) • Genotype cases and controls (by testing for hundreds of thousands of SNPs) • Genome-wide association analysis
Phenotyping in a GWAS • Doctors design an algorithm for phenotyping based on the presence or absence of key procedures, medicines, and conditions in a patient’s medical history • EHR is used as a resource • Coded fields • Unmarked text • Images
Manual vs Electronic Phenotyping • Manual phenotyping by chart abstractors • Accurate (Gold standard) • Far too expensive (~20,000 medical records to process) • Electronic phenotyping by computers • Methods • Query database of coded fields • Natural language processing on free text • OCR and Image Processing on other resources • Comparatively cheap • Sample must be validated by chart abstractors
Natural Language Processing • What is it? • What problems must be solved? • How can they be solved?
Natural Language Processing • Search for concepts in free text of EHR • Simple keyword search insufficient • “There was no evidence of polyps or ulceration.” • “Rule out H. pylori, gastritis and gastropathy.” • “She should return to the Emergency Department if she experiences nausea or vomiting.” • “Patient should avoid any tests which involve the use of iodinated contrast material” • “The indication for this procedure is family history of colon cancer.”
Natural Language Processing • Search for concepts in free text of EHR • Negated • “There was no evidence of polyps or ulceration.” • “Rule out H. pylori, gastritis and gastropathy.” • “She should return to the Emergency Department if she experiences nausea or vomiting.” • “Patient should avoid any tests which involve the use of iodinated contrast material” • “The indication for this procedure is family history of colon cancer.”
Natural Language Processing • Search for concepts in free text of EHR • Hypothetical • “There was no evidence of polyps or ulceration.” • “Rule out H. pylori, gastritis and gastropathy.” • “She should return to the Emergency Department if she experiences nausea or vomiting.” • “Patient should avoid any tests which involve the use of iodinated contrast material” • “The indication for this procedure is family history of colon cancer.”
Natural Language Processing • Search for concepts in free text of EHR • Family History • “There was no evidence of polyps or ulceration.” • “Rule out H. pylori, gastritis and gastropathy.” • “She should return to the Emergency Department if she experiences nausea or vomiting.” • “Patient should avoid any tests which involve the use of iodinated contrast material” • “The indication for this procedure is family history of colon cancer.”
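A minimal sketch of why a simple keyword search is insufficient, using the example sentences from the slides. The function name `keyword_hits` is hypothetical and exists only for illustration.

```python
import re

# Example sentences taken from the slides above.
SENTENCES = [
    "There was no evidence of polyps or ulceration.",
    "Rule out H. pylori, gastritis and gastropathy.",
    "She should return to the Emergency Department if she experiences nausea or vomiting.",
    "The indication for this procedure is family history of colon cancer.",
]

def keyword_hits(concept, sentences):
    """Return every sentence that mentions the concept, ignoring all context."""
    pattern = re.compile(r"\b" + re.escape(concept) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

# Each search reports a "hit" even though the patient does not have the finding.
print(keyword_hits("polyps", SENTENCES))        # negated mention
print(keyword_hits("nausea", SENTENCES))        # hypothetical mention
print(keyword_hits("colon cancer", SENTENCES))  # family history, not the patient
```

All three searches return a sentence, yet none of them asserts that the patient currently has the condition. That gap is what the contextual features (negated, hypothetical, family history) are meant to close.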
NegEx • Simple • Performs well • Against gold standard • Against MedLEE • Against straight statistical methods • Recently extended • Hypothetical & Family History • “ConText”
NegEx “There was no evidence of polyps or ulceration.”
NegEx “There was no evidence of polyps or ulceration.” [scope diagram: dotted span across the sentence, ending in a boundary marker after the last word]
NegEx “Rule out H. pylori, gastritis, and gastropathy.”
NegEx “Rule out H. pylori, gastritis, and gastropathy.” [scope diagram: dotted span across the sentence, ending in a boundary marker after the last word]
NegEx “Quantitative PCR testing for BK Virus is negative.”
NegEx “Quantitative PCR testing for BK Virus is negative.” [scope diagram: boundary marker before the first word, followed by a dotted span across the sentence]
NegEx • “No evidence of spread of cancer to the lungs.” • “No residua of healed fractures can be seen otherwise.”
NegEx • “No evidence of spread of cancer to the lungs.” • “No residua of healed fractures can be seen otherwise.” [scope diagram: a dotted span ending in a boundary marker above each sentence]
NegEx • NegEx, and therefore ConText, require carefully tuned lists of triggers and pseudotriggers. • How big must a list be to perform well?
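The trigger and pseudotrigger idea can be sketched roughly as follows. This is a simplified illustration, not the NegEx implementation: the three lists are tiny illustrative samples, the names `PRE_TRIGGERS`, `POST_TRIGGERS`, `PSEUDOTRIGGERS`, and `is_negated` are hypothetical, and the scope of a trigger is assumed to run to the sentence boundary.

```python
import re

# Illustrative lists only; tuned lists are far longer.
PRE_TRIGGERS = ["no", "rule out", "denies", "without"]  # negate what follows
POST_TRIGGERS = ["is negative"]                         # negate what precedes
PSEUDOTRIGGERS = ["no increase", "gram negative"]       # contain a trigger word but do not negate

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def positions(tokens, phrase):
    """Start indices of every occurrence of a token phrase."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]

def is_negated(sentence, concept):
    tokens = tokenize(sentence)
    target = tokenize(concept)
    hits = positions(tokens, target)
    if not hits:
        return False
    start, end = hits[0], hits[0] + len(target)

    # Mask pseudotriggers so the trigger words inside them cannot fire.
    masked = list(tokens)
    for phrase in map(tokenize, PSEUDOTRIGGERS):
        for i in positions(tokens, phrase):
            masked[i:i + len(phrase)] = ["_"] * len(phrase)

    # A pre-trigger negates a concept that appears after it in the sentence.
    for phrase in map(tokenize, PRE_TRIGGERS):
        if any(i + len(phrase) <= start for i in positions(masked, phrase)):
            return True
    # A post-trigger negates a concept that appears before it in the sentence.
    for phrase in map(tokenize, POST_TRIGGERS):
        if any(i >= end for i in positions(masked, phrase)):
            return True
    return False

print(is_negated("There was no evidence of polyps or ulceration.", "ulceration"))
print(is_negated("Quantitative PCR testing for BK Virus is negative.", "BK Virus"))
# Made-up sentence showing a pseudotrigger suppressing the "no" trigger.
print(is_negated("There is no increase in the size of the nodule.", "nodule"))
```

With so few entries the sketch misses most real negations, which is the point of the question on the slide: performance depends on how complete the lists are.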
Scenarios • Annotated training set used to populate lists • Large unmarked training set used to extend existing lists
Using Annotated Data • NegEx/ConText creators provide annotated excerpts from medical records • Look for associations between words and negation to populate list of triggers • Look for associations between words near triggers and false positives to populate list of pseudotriggers
Identifying Triggers • Create a confusion matrix for each word • Sort words by some statistic based on these confusion matrices • Select or reject the top candidate as a trigger • Repeat on the still-unexplained sentences until a stopping condition is met
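The loop above might be sketched as follows. Several things here are assumptions: the toy `DATA` is invented for illustration, F-measure stands in for "some statistic", every top candidate is accepted automatically (the slides describe a select-or-reject step), and the stopping condition is "no negated sentences left or `max_triggers` reached".

```python
# Hypothetical toy data: (sentence, annotated as negated?) pairs.
DATA = [
    ("no evidence of polyps", True),
    ("no fever or chills", True),
    ("patient denies chest pain", True),
    ("denies nausea", True),
    ("evidence of gastritis", False),
    ("patient reports chest pain", False),
    ("fever and chills", False),
]

def confusion(word, data):
    """(tp, fp, fn, tn) for 'word is present' as a predictor of negation."""
    tp = sum(1 for s, neg in data if word in s.split() and neg)
    fp = sum(1 for s, neg in data if word in s.split() and not neg)
    fn = sum(1 for s, neg in data if word not in s.split() and neg)
    tn = len(data) - tp - fp - fn
    return tp, fp, fn, tn

def f_measure(tp, fp, fn, tn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def select_triggers(data, score=f_measure, max_triggers=10):
    triggers = []
    remaining = list(data)
    while len(triggers) < max_triggers and any(neg for _, neg in remaining):
        # Sorting the vocabulary makes tie-breaking deterministic.
        vocab = sorted({w for s, _ in remaining for w in s.split()})
        best = max(vocab, key=lambda w: score(*confusion(w, remaining)))
        if score(*confusion(best, remaining)) == 0:
            break
        triggers.append(best)
        # Negated sentences containing the new trigger are now explained.
        remaining = [(s, neg) for s, neg in remaining
                     if not (neg and best in s.split())]
    return triggers

print(select_triggers(DATA))
```

Passing a different `score` function swaps in another statistic without changing the loop.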
Identifying Triggers • Statistical measures used • Log-likelihood ratio • Precision (PPV) • Recall (Sensitivity) • F-measure
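The slides name these measures without giving formulas. The sketch below uses the standard definitions of precision, recall, and F-measure over a word's confusion matrix, and assumes that "log-likelihood ratio" means the G² statistic of the 2x2 table; that reading is an assumption, as are the function names.

```python
import math

def precision(tp, fp, fn, tn):
    """Fraction of sentences containing the word that are negated (PPV)."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fp, fn, tn):
    """Fraction of negated sentences that contain the word (sensitivity)."""
    return tp / (tp + fn) if tp + fn else 0.0

def f_measure(tp, fp, fn, tn):
    """Harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def log_likelihood_ratio(tp, fp, fn, tn):
    """G^2 = 2 * sum(O * ln(O / E)) over the four cells of the 2x2 table."""
    n = tp + fp + fn + tn
    rows = (tp + fp, fn + tn)   # word present, word absent
    cols = (tp + fn, fp + tn)   # negated, not negated
    observed = ((tp, fp), (fn, tn))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2 * total

# A word seen once, in a negated sentence: perfect precision, negligible recall.
print(precision(1, 0, 99, 100), recall(1, 0, 99, 100))
# A word independent of negation scores zero on G^2.
print(log_likelihood_ratio(10, 10, 10, 10))
```

G² measures the strength of association in either direction, so a ranking built on it would also need to check that the word co-occurs with negation rather than with its absence.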
Log-Likelihood Ratio • Triggers: { }
Log-Likelihood Ratio • Triggers: { no }
Log-Likelihood Ratio • Triggers: { no, denies }
Log-Likelihood Ratio • Triggers: { no, denies, not }
Log-Likelihood Ratio • Triggers: { no, denies, not, denied }
Log-Likelihood Ratio • Triggers: { no, denies, not, denied, without }
Log-Likelihood Ratio • Triggers: { no, denies, not, denied, without, negative }
Log-Likelihood Ratio • Triggers: { no, denies, not, denied, without, negative, resolved (post) }
Other Measures • Precision (PPV) • 271 words tie at 100% • Poor metric • Recall (sensitivity) • Catches all the same triggers as LLR • Also finds “any”, “the”, and “for” • Imprecise metric • F-measure • Identical results to LLR • Good metric
Identifying Pseudotriggers • Use an analogous method to find words that predict false positives • Limit to words next to triggers • Filter out prospects with low precision • Sort by LLR
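A rough sketch of these four steps for a single trigger. The toy `DATA`, the 0.9 precision cut-off, the function names, and the reading of LLR as the G² statistic of the 2x2 table are all assumptions made for illustration.

```python
import math

TRIGGER = "no"
# Hypothetical toy data: sentences containing the trigger, with the annotation
# saying whether the concept really is negated. False means the trigger misfired.
DATA = [
    ("no evidence of polyps", True),
    ("no fever", True),
    ("no increase in size of nodule", False),
    ("no increase in pain", False),
    ("no change in the mass", False),
    ("no evidence of fracture", True),
    ("there was no increase in swelling", False),
]

def adjacent_phrases(tokens, trigger):
    """Two-word phrases formed by the trigger and the word on either side."""
    phrases = set()
    for i, tok in enumerate(tokens):
        if tok != trigger:
            continue
        if i > 0:
            phrases.add(f"{tokens[i - 1]} {trigger}")
        if i + 1 < len(tokens):
            phrases.add(f"{trigger} {tokens[i + 1]}")
    return phrases

def llr(tp, fp, fn, tn):
    """G^2 statistic of the 2x2 table (assumed meaning of LLR)."""
    n = tp + fp + fn + tn
    rows, cols = (tp + fp, fn + tn), (tp + fn, fp + tn)
    cells = ((tp, 0, 0), (fp, 0, 1), (fn, 1, 0), (tn, 1, 1))
    return 2 * sum(o * math.log(o * n / (rows[i] * cols[j]))
                   for o, i, j in cells if o > 0)

def rank_pseudotriggers(data, trigger, min_precision=0.9):
    # For each sentence: its candidate phrases and whether it is a false positive.
    rows = [(adjacent_phrases(s.split(), trigger), not negated)
            for s, negated in data]
    candidates = set().union(*(p for p, _ in rows))
    scored = []
    for c in sorted(candidates):
        tp = sum(1 for p, is_fp in rows if c in p and is_fp)
        fp = sum(1 for p, is_fp in rows if c in p and not is_fp)
        fn = sum(1 for p, is_fp in rows if c not in p and is_fp)
        tn = len(rows) - tp - fp - fn
        # Drop candidates that rarely coincide with a false positive.
        if tp and tp / (tp + fp) >= min_precision:
            scored.append((llr(tp, fp, fn, tn), c))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [c for _, c in scored]

print(rank_pseudotriggers(DATA, TRIGGER))
```

On this toy data "no increase" ranks first, but an uninformative phrase such as "was no" also passes the filter, which suggests why a manual accept-or-reject step remains useful.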