This study focuses on extracting social and behavioral determinants of sexual health from electronic health records (EHRs). It describes the creation of a corpus and the use of information extraction and classification techniques. The results demonstrate the potential for using EHRs to identify and infer determinants of sexual health.
Towards the inference of determinants of sexual health from the EHR: corpus creation Information Extraction and Classification S39 Daniel J. Feller, MA, Jason Zucker, MD, Oliver Bear Don’t Walk IV, MS, Bharat Srikishan, MS, Roxana Martinez, Henry Evans, Michael T Yin, MD, Peter Gordon, MD, Noémie Elhadad, PhD Columbia University
Disclosure • I and my spouse/partner have no relevant relationships with commercial interests to disclose. AMIA 2018 | amia.org
Background: Social & Behavioral Determinants • Social & behavioral determinants of health (SBDH) are non-medical factors that impact health outcomes • Retrieval of SBDH is a common source of frustration among nurses and social workers (Weir 2015) • Targeted HIV and STI prevention strategies have created opportunities for a focus on the SBDH of sexual health
Much focus on community SBDH, less on individual-level • Community-level factors (e.g., access to healthy food, community poverty) influence health outcomes (Kind 2014, Walker 2014) • Publicly available data (e.g., census data) can be aggregated & integrated into the EHR (Kasthurirathne 2017, Cantor 2018) • Community estimates for certain SBDH variables are likely to be imprecise or biased if within-community variance is high (Ancker 2018) [Figure: Area Deprivation Index in SF zip codes]
Social determinants expressed in clinical notes (Navathe 2017) • Healthcare providers predominantly document determinants of sexual health in free-text notes (Walsh 2014, Chen 2011)
Related Work: Previous approaches for extracting SBDH • NLP has been successfully applied to extract a broad range of SBDH: • Smoking status (Uzuner 2008; 1st i2b2 NLP shared task) • Substance Abuse (Yetisgen 2017, Melton 2015, Chen 2015, Carrell 2015) • Adverse Childhood Events (Bejan 2017) • Social support (South 2017) • Homelessness (Melton 2016, Gundlapalli 2015, Oreskovic 2017, Bejan 2017) • Gender Identity (Romano 2017, Roblin 2016) • No study has extracted social & behavioral determinants related to sexual health from clinical notes
Research Gaps • There is no standard set of determinants of sexual health • We describe an expert curation of 38 such SBDH. • Limited research on the expression of SBDH of sexual health in clinical notes • We detail the challenge of creating such a corpus and report its high-level characteristics. • SBDH are documented infrequently in the patient record • We describe how semi-supervised learning accelerated the annotation process by recognizing clinical notes likely to contain SBDH content • No studies on high-throughput extraction of SBDH from patient records • We describe preliminary results on using supervised learning to infer an array of SBDH risk factors from clinical documentation.
Methods 1: Curation of a set of determinants of sexual health • Gender: Male, Female, F2M, M2F, Non-Conforming • Sexual Activity: History of STIs, History of RPR, Anal Intercourse, Vaginal Intercourse, Oral Sex, Condom Use • Sexual Orientation: Bisexual, MSW, MSM, WSM, WSW • Housing: Homeless, Living with Friends, Unstable Housing, Stable Housing • Alcohol Use: Active Alcohol Use, Alcoholism, Social Alcohol Use • Drug Use: Active Drug Use, Marijuana, Meth, Opioids, Cocaine, IVDU
Methods 2: Annotation at the document level • SBDH are not always expressed as named entities • e.g., “Continues to be frustrated about lack of an apartment” • Documents are less labor-intensive to label than mentions • Document classification is a common approach for biomedical information extraction and automated diagnosis coding (Perotte et al. 2014)
Methods 3: Semi-supervised learning to support manual annotation 1. Isolated social history sections from >340k notes to learn word embeddings using GloVe [Figure: simulated visualization of GloVe embeddings]
Methods 3: Semi-supervised learning to support manual annotation 2. Used embeddings to create SBDH domain centroids to find candidate notes
Methods 3: Semi-supervised learning to support manual annotation 3. Ranking of clinical notes using similarity to SBDH centroids
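The three steps above (embeddings, domain centroids, similarity ranking) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the 2-d vectors stand in for trained GloVe embeddings, the seed terms and note texts are invented, and cosine similarity with averaged word vectors is an assumption, since the slides do not name the similarity measure or the way notes were embedded.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def embed(tokens, embeddings):
    """Average the embeddings of known tokens; None if no token is known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    return mean_vector(known) if known else None

def rank_notes(notes, seed_terms, embeddings):
    """Rank notes by similarity to the centroid of a domain's seed terms."""
    centroid = embed(seed_terms, embeddings)
    if centroid is None:
        raise ValueError("no seed term has an embedding")
    scored = []
    for note_id, text in notes.items():
        vec = embed(text.lower().split(), embeddings)
        score = cosine(vec, centroid) if vec is not None else 0.0
        scored.append((note_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy embeddings standing in for GloVe vectors.
toy_embeddings = {
    "homeless": [0.9, 0.1],
    "shelter": [0.8, 0.2],
    "apartment": [0.7, 0.3],
    "cough": [0.1, 0.9],
    "fever": [0.2, 0.8],
}
toy_notes = {
    "note_a": "patient reports cough and fever",
    "note_b": "currently homeless staying in shelter",
}
# Hypothetical seed terms for a "housing" domain centroid.
ranking = rank_notes(toy_notes, ["homeless", "shelter", "apartment"], toy_embeddings)
```

In this toy setup the housing-related note ranks first, which is the behavior an annotator-facing candidate list would rely on.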
Methods 4: Experiments with Supervised Learning • χ² tests used to select top 200 features for each classifier • Support Vector Machines were trained for each SBDH label (a.k.a. binary relevance) • tf-idf weights • 41 UMLS SBDH concepts • 29,284 lexical features • In-house NER + NegEx
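To make the per-label χ² feature selection concrete, here is a small sketch under simplifying assumptions: it scores binary term presence against each label with a 2x2 χ² statistic and keeps the top-k terms per label, mirroring the binary-relevance setup of one selector per label. It does not reproduce the tf-idf weighting or the SVM training from the slide, and the documents, label names, and `TOP_K` value are invented for illustration.

```python
def chi2_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 table [[a, b], [c, d]] (no continuity correction).

    Rows: term present / term absent. Columns: label positive / label negative.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def rank_terms(docs, labels):
    """Rank vocabulary terms by chi-square association with one binary label."""
    token_sets = [set(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*token_sets))
    scores = {}
    for term in vocab:
        a = b = c = d = 0
        for tokens, y in zip(token_sets, labels):
            present = term in tokens
            if present and y:
                a += 1
            elif present:
                b += 1
            elif y:
                c += 1
            else:
                d += 1
        scores[term] = chi2_2x2(a, b, c, d)
    return sorted(vocab, key=lambda t: scores[t], reverse=True)

# Hypothetical toy corpus and label matrix (one binary column per SBDH label).
toy_docs = [
    "drinks vodka daily",
    "drinks beer socially",
    "lives in shelter",
    "lives with friends",
    "reports cough and fever",
]
toy_labels = {
    "alcohol_use": [1, 1, 0, 0, 0],
    "housing": [0, 0, 1, 1, 0],
}
TOP_K = 2  # the slide reports 200; 2 keeps the toy example readable

# Binary relevance: an independent feature selection (and, in practice, classifier) per label.
selected = {label: rank_terms(toy_docs, y)[:TOP_K] for label, y in toy_labels.items()}
```

Note that χ² rewards terms associated with either the presence or the absence of a label, so a term such as "lives" can still score for `alcohol_use` because it marks notes without that label.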
Results 1: Information Retrieval using Word Embeddings and Document Centroids • Manual Annotation (385 notes) • 61% (235) of clinical notes had >0 SBDH labels • avg. 5.0 SBDH mentions / note • Machine-assisted Review (747 notes) • 98.7% (737) of clinical notes had >0 SBDH labels • avg. 9.0 SBDH mentions / note
Results 2: Gold-Standard Corpus • Characteristics of Corpus • 4,625 annotated notes • 1,064 HIV+ individuals • Prevalence of SBDH domains • Substance Use (1512) • Alcohol Use (1430) • Housing Status (1329) • Sexual Orientation (1109) • Sexual History (908)
Results 3: Gold-Standard Corpus • Correlation matrix among SBDH labels • SBDH labels related to active substance use displayed strong inter-label correlation. Examples (correlation coefficients): 1. Alcohol abuse & cocaine use (0.47) 2. Methamphetamine & cocaine use (0.313) 3. Unstable housing & substance abuse (0.232) • However, a considerable number of SBDH labels exhibit little association with other labels.
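A label correlation matrix of this kind can be computed from the document-by-label indicator matrix. The sketch below assumes Pearson correlation on 0/1 indicators (equivalent to the phi coefficient); the slide does not say which coefficient was used, and the label names and values here are invented.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y) if var_x and var_y else 0.0

# Hypothetical indicator columns: one entry per annotated note.
toy_labels = {
    "alcohol_abuse": [1, 1, 0, 0, 1, 0],
    "cocaine_use": [1, 0, 0, 0, 1, 0],
    "stable_housing": [0, 1, 0, 1, 0, 1],
}

# Full (symmetric) correlation matrix keyed by label pair.
names = list(toy_labels)
corr = {(a, b): pearson(toy_labels[a], toy_labels[b]) for a in names for b in names}
```

Reading off `corr[("alcohol_abuse", "cocaine_use")]` gives the pairwise value that a heatmap cell would display.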
Results 4: Experiments with supervised learning* • Modest performance inferring domain-level SBDH labels (see table) • more data = better results • Poor performance inferring individual-level SBDH labels • Too little data • Lack of coverage in existing terminologies • 62% of notes with SBDH tagged with 0 relevant UMLS concepts • Diverse lexical realizations of SBDH *Results generated using a smaller corpus than in previous slides
Results 4: Experiments with supervised learning • Examples of SBDH Mentions for Alcohol Use • “has continued to relapse on crack and beer since starting treatment 3 months ago” • “Drinks heavily up to 1 pint vodka daily” • “noted that he used occasional social EtOH (scotch) at church functions remotely with no other toxic habits” • “now largely 'back on track' after having picked up her alcohol consuption for a few months”
Discussion • Our set of 32 individual-level and 6 domain-level indicators can be used to inform future efforts in corpus curation and computational methods for SBDH • Annotation guidelines available at github.com/danieljfeller/SBDSH • SBDH domains were observed infrequently in clinical notes: • Alcohol and substance use were the most prevalent domains (4% of annotated notes) • Sexual orientation was documented in less than 1% of notes. • Wide variation in lexical realizations of SBDH • Ranged from single words to multi-word expressions to whole sentences. • The inability to infer the presence of individual-level SBDH likely reflects the limited size of our annotated corpus, typical of other document classification systems for medical concept recognition (Bates 2016, Garla 2011)
Discussion • Our semi-supervised approach successfully increased the yield of manual annotation. • Annotators observed 9 distinct SBDH mentions per note throughout the machine-assisted annotation • Compared to 5 mentions per note randomly sampled from a cohort of HIV+ individuals. • The utility of distributional semantics techniques for modeling the diverse lexical realizations of SBDH in notes has been established (Bejan 2017) • The success of our approach will allow our research team to increase the size and diversity of our annotated corpus. [Figure: isolated word embeddings trained on 300k social history sections]
Future Work • Multi-label classification may be improved by accounting for the observed structure of SBDH labels. • Hierarchically structured sets of SVMs have demonstrated improved performance for multi-label classification of clinical documents (Perotte 2014, Zhang 2018) • Document-level SBDH labeling may benefit from document zoning. • Long documents like clinical notes typically contain many words unrelated to the modeling task, in sections potentially irrelevant to SBDH (e.g., ‘Review of Systems’) • Structured elements of the EHR such as laboratory tests and diagnosis codes may improve the inference of social determinants compared to using notes alone. • With a larger corpus, a neural network with an attention layer, which could provide transparency for classification decisions, may improve results (Baumel 2017)
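Document zoning could take many forms; one minimal sketch is a regex-based section splitter that keeps only the sections likely to carry SBDH content. The header pattern (a short title followed by a colon on its own line), the section names, and the sample note are all assumptions for illustration, since real clinical notes vary widely in how sections are marked.

```python
import re

# Assumed header format: a line consisting only of a short title and a colon.
SECTION_HEADER = re.compile(r"^(?P<name>[A-Za-z][A-Za-z /]+):[ \t]*$", re.MULTILINE)

def split_sections(note):
    """Map lower-cased section names to their text, based on header lines."""
    sections = {}
    matches = list(SECTION_HEADER.finditer(note))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(note)
        sections[match.group("name").strip().lower()] = note[match.end():end].strip()
    return sections

def zone(note, keep):
    """Concatenate only the sections whose names are in `keep`."""
    sections = split_sections(note)
    return "\n".join(text for name, text in sections.items() if name in keep)

# Hypothetical note text.
toy_note = (
    "Social History:\n"
    "Lives in a shelter. Drinks beer daily.\n"
    "Review of Systems:\n"
    "Denies fever, cough.\n"
)
zoned = zone(toy_note, keep={"social history"})
```

Feeding `zoned` rather than the full note to a classifier removes text such as the review of systems before feature extraction.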
Acknowledgements • Funding: • National Library of Medicine “Training in Biomedical Informatics at Columbia University” (T15 LM007079) • National Institute of General Medical Sciences “Extended Methods and Software Development for Health NLP” (R01GM114355)
Thank you! Email me at: djf2150@columbia.edu
Learning Objectives • After participating in this session the learner should be better able to: • Identify challenges related to… AMIA 2017 | amia.org
Deep Learning Architectures for Document Classification • CBOW: Pros: Simple. Cons: Disregards word order • CNN: Pros: Considers word order. Cons: Uninterpretable • HA-GRU: Pros: State-of-the-art performance; highlights text • Future Work: Can word embeddings and deep learning achieve better performance compared to the UMLS? • Baumel, Tal, et al. "Multi-Label Classification of Patient Notes a Case Study on ICD Code Assignment." (2017)
doc2vec • The ‘document vector’ acts as a memory that remembers what is missing from the current context, or as the topic of the paragraph. While the word vectors represent the concept of a word, the document vector intends to represent the concept of a document. https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec
NLP improves predictive models for HIV • Feller, Daniel J., et al. "Using Clinical Notes and Natural Language Processing for Automated HIV Risk Assessment." JAIDS (2018): 160-166. • Method: TF-IDF weighting of unigrams; feature selection using chi-square • ‘Clinical keywords’ predictive of HIV risk: amphetamine, anal, cervical, cocaine, condom, crack, crisis, enlarged, hepatitis, hiv, homeless, homosexual, ivd, lymph, lymphadenopathy, male, man, men, meningitis, mens, meth, msm, neurosyphillis, pyschiatrist, seronegative, sex, sexual, sti, strep, tb, tested, testing, transgender, unprotected, viral, psychology, pyschiatrist [Figure labels: Unigrams + structured EHR data; Structured EHR data]
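As a reminder of what TF-IDF weighting of unigrams computes, here is a small sketch using one common variant, relative term frequency times ln(N / df). The exact variant used in the cited work is not stated on the slide, so the formula choice is an assumption, and the tokenized documents are invented.

```python
import math
from collections import Counter

def tfidf(tokenized_docs):
    """Return one {term: weight} dict per document using tf * ln(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical tokenized notes.
toy_docs = [
    ["hiv", "testing"],
    ["hiv", "cough"],
    ["cough", "fever"],
]
weights = tfidf(toy_docs)
```

Terms that appear in fewer notes (here "testing") receive a higher weight than terms spread across many notes (here "hiv"), which is what makes the weights useful inputs for chi-square feature selection.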
Background: HIV & STDs • 39,543 HIV infections in 2016 • # of Hepatitis C & Syphilis infections increasing • Clinical decision support has had mixed impact on HIV screening (Schnall 2014) • Development of biomedical prevention modalities (e.g., PrEP) has created a need for precision approaches to preventative care [Figure: HIV prevalence in 2016 (Centers for Disease Control)]
Results 3: Annotated Corpus • Statistics on annotated notes • # records • Vocabulary size • Avg. tokens / record • Avg. sentences / record • # SBDH labels • # meta SBDH labels • SBDH label cardinality • SBDH label density • Correlation between SBDH labels
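Label cardinality and label density are standard multi-label corpus statistics: cardinality is the average number of labels per document, and density is cardinality divided by the size of the label set. The sketch below uses invented label sets and assumes a label inventory of 6 (matching the number of domain-level indicators), purely for illustration.

```python
def label_cardinality(label_sets):
    """Average number of labels assigned per document."""
    return sum(len(labels) for labels in label_sets) / len(label_sets)

def label_density(label_sets, num_labels):
    """Label cardinality normalized by the total number of possible labels."""
    return label_cardinality(label_sets) / num_labels

# Hypothetical annotations: one set of SBDH labels per note.
toy_annotations = [
    {"alcohol_use", "drug_use"},
    {"housing"},
    set(),
    {"alcohol_use"},
]
NUM_LABELS = 6  # assumed size of the label inventory for this example

cardinality = label_cardinality(toy_annotations)
density = label_density(toy_annotations, NUM_LABELS)
```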