This session explores using logistic regression to assess data completeness for infant mortality analysis in electronic health records, highlighting the importance of predictive ability. The study evaluates risk factors impacting infant mortality rates based on available datasets.
Using Logistic Regression to Verify Completeness of Electronic Health Records for Infant Mortality Analysis Session: Tools for Measuring Data Quality in EHR Systems Session #: S79 Luge Yang University of Pittsburgh Twitter: #AMIA2017
Disclosure I have no relevant relationships with commercial interests to disclose. www.TsuiLab.com AMIA 2017 | amia.org 2
Learning Objectives • After participating in this session the learner should be better able to: • Evaluate the completeness of data based on its ability to predict the target of interest
Background: Infant mortality • Infant mortality rate (IMR): number of deaths of infants under one year old per 1,000 live births. • It is an important indicator of national well-being because it reflects maternal health, health care quality and access, environmental quality, socioeconomic conditions and public health practices. • The infant mortality rate in the U.S. exceeded that of most high-income and many middle-income countries. • 6.24 infant deaths per 1,000 live births between 2008 and 2012 • Moreover, the infant mortality rate in Allegheny County, Pennsylvania is slightly higher than the U.S. average. • 6.64 infant deaths per 1,000 live births between 2008 and 2012 • The infant mortality rate disparity between black infants and white infants in Allegheny County is larger than that of the U.S. • Allegheny County: black infant IMR is 2.88 times higher than white infant IMR • U.S.: 2.26
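The IMR definition above is simple arithmetic, and a minimal sketch may make it concrete. The function names and the counts below are hypothetical and chosen only to illustrate the calculation; they are not the counts behind the figures on the slide.

```python
def infant_mortality_rate(infant_deaths: int, live_births: int) -> float:
    """Deaths of infants under one year old per 1,000 live births."""
    if live_births <= 0:
        raise ValueError("live_births must be positive")
    return 1000 * infant_deaths / live_births


def disparity_ratio(imr_group_a: float, imr_group_b: float) -> float:
    """Ratio of two group-level IMRs (for example black vs. white infants)."""
    return imr_group_a / imr_group_b


# Hypothetical counts, used only to show the arithmetic.
rate = infant_mortality_rate(infant_deaths=664, live_births=100_000)
print(f"IMR: {rate:.2f} per 1,000 live births")
```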
Background: Research objective • The research objective is to evaluate the completeness of data based on its ability to predict infant mortality. • Maternal health, health care quality and access, socioeconomic disadvantage, environmental exposures, and risky behaviors are broad and interrelated categories of infant mortality risks. • Due to restricted access to the data, only some of the risk categories can be retrieved. • Is the available dataset complete enough to predict infant mortality? • Conventional completeness evaluation methods are not sufficient to assess the predictive ability of a dataset. • The predictive ability of a dataset is not directly observed. • Data required for prediction are those that can contribute to prediction.
Background: Research datasets • Dataset 1: Magee Obstetric Medical and Infant (MOMI) • The MOMI database is generated and maintained by the University of Pittsburgh Medical Center (UPMC) Magee-Womens Hospital (Magee). • The database contains demographic and medical information for all deliveries at Magee. • Data period: 1/1/2002 to 12/31/2014 • 117,929 deliveries • 118,130 infants • 85,477 mothers • 110 features • Infant death indicator not available • Dataset 2: Allegheny County Department of Human Services (ACDHS) • 1,008 infant death cases, along with death causes, in Allegheny County between 2003 and 2013
Background: Data linkage between 2 datasets • Data linkage between MOMI and the 1,008 ACDHS death cases • Direct matching using mother’s SSN, mother’s full name, mother’s date of birth and delivery date • 496 infant death cases were identified in MOMI. • The 496 infant deaths were caused by 115 diseases and related health problems. • Only 35 of the 115 diseases and related health problems were reflected in the MOMI dataset.
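The direct matching step can be sketched as an exact join on the four identifiers named above. This is an illustration under assumptions, not the study's linkage code: the field names (`mother_ssn`, `mother_name`, `mother_dob`, `delivery_date`, `delivery_id`, `cause`), the normalization rules (dropping hyphens, upper-casing names) and the toy records are all hypothetical.

```python
from datetime import date


def linkage_key(record: dict) -> tuple:
    """Build an exact-match key from the four identifiers on the slide."""
    return (
        record["mother_ssn"].replace("-", "").strip(),    # assumed normalization
        " ".join(record["mother_name"].upper().split()),  # assumed normalization
        record["mother_dob"],
        record["delivery_date"],
    )


def link_deaths(delivery_records: list, death_records: list) -> list:
    """Return (delivery, death) pairs whose keys match exactly."""
    index = {}
    for rec in delivery_records:
        index.setdefault(linkage_key(rec), []).append(rec)
    pairs = []
    for death in death_records:
        for rec in index.get(linkage_key(death), []):
            pairs.append((rec, death))
    return pairs


# Toy records with obviously fake identifiers.
deliveries = [
    {"delivery_id": "D1", "mother_ssn": "000-00-0001", "mother_name": "Jane  Doe",
     "mother_dob": date(1980, 5, 1), "delivery_date": date(2010, 3, 2)},
    {"delivery_id": "D2", "mother_ssn": "000-00-0002", "mother_name": "Ann Roe",
     "mother_dob": date(1985, 7, 9), "delivery_date": date(2011, 8, 15)},
]
deaths = [
    {"cause": "hypothetical", "mother_ssn": "000000001", "mother_name": "jane doe",
     "mother_dob": date(1980, 5, 1), "delivery_date": date(2010, 3, 2)},
]
matched = link_deaths(deliveries, deaths)
print(len(matched), "death record(s) linked")
```

A key built this way identifies a delivery rather than an infant, so a multiple birth would need an additional rule to pick the right infant; the slides do not say how that case was handled.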
Background: Research question • Is the MOMI dataset complete enough to predict infant mortality?
Methods: Regression model 1 • Multiple logistic regression • Y denotes occurrence of infant death: Y=1 (presence of infant death) vs. Y=0 (absence of infant death). • $X_i$ denotes the $i$-th feature, $i = 1, \dots, p$, where $p$ is the number of features in the dataset. • $\log\frac{P(Y=1)}{1-P(Y=1)} = \beta_0 + \sum_{i=1}^{p} \beta_i X_i$, where $\beta_0, \beta_1, \dots, \beta_p$ are regression coefficients. • Select features with P values less than 0.05. • To evaluate predictive ability, 5-fold cross-validation was performed. • Evaluation metric: area under the ROC curve (AUC).
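The evaluation loop on this slide (fit a logistic regression, score held-out folds, report AUC) can be sketched in plain Python. Everything below is an assumption made for illustration: the model is fit by simple gradient ascent rather than by a statistics package, the data are synthetic, the folds are not stratified, and the AUC is computed on the pooled out-of-fold predictions (the slides do not say whether AUCs were pooled or averaged per fold). The P value based feature selection is omitted because it needs coefficient standard errors, which a statistics package would normally supply.

```python
import math
import random


def predict_proba(beta, x):
    """P(Y=1 | x) under log-odds = beta[0] + sum(beta[i+1] * x[i])."""
    z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=300):
    """Fit coefficients by batch gradient ascent on the log-likelihood."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)  # beta[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = yi - predict_proba(beta, xi)
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def cross_validated_auc(X, y, k=5, seed=0):
    """AUC of pooled out-of-fold predictions from k-fold cross-validation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    y_out, s_out = [], []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        beta = fit_logistic([X[i] for i in train], [y[i] for i in train])
        for i in fold:
            y_out.append(y[i])
            s_out.append(predict_proba(beta, X[i]))
    return auc(y_out, s_out)


# Synthetic data: the first feature drives the outcome, the second is noise.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(300)]
y = [1 if rng.random() < predict_proba([-1.0, 2.0, 0.0], x) else 0 for x in X]
cv_auc = cross_validated_auc(X, y)
print(f"5-fold cross-validated AUC: {cv_auc:.3f}")
```

With a rare outcome such as infant death, stratified folds would usually be preferable so that every fold contains some positive cases.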
Methods: Regression model 2 • Simultaneous logistic regression with adjusted P value + multiple logistic regression • Y denotes occurrence of infant death: Y=1 (presence of infant death) vs. Y=0 (absence of infant death). • $X_i$ denotes the $i$-th feature, $i = 1, \dots, p$, where $p$ is the number of features in the dataset. • First step: $\log\frac{P(Y=1)}{1-P(Y=1)} = \alpha_i + \gamma_i X_i$, where $\alpha_i$ and $\gamma_i$ are regression coefficients for the $i$-th feature. • An adjusted P value is reported for each feature: the Benjamini & Hochberg (BH) method controls the false discovery rate (FDR) at 0.05. • Select features with adjusted P values less than 0.05, denoted as $X_{s_1}, \dots, X_{s_q}$, where $q$ is the number of chosen features. • Second step: $\log\frac{P(Y=1)}{1-P(Y=1)} = \beta_0 + \sum_{j=1}^{q} \beta_j X_{s_j}$, where $X_{s_1}, \dots, X_{s_q}$ are the selected features and $\beta_0, \beta_1, \dots, \beta_q$ are regression coefficients. • To evaluate predictive ability, 5-fold cross-validation was performed. • Evaluation metric: area under the ROC curve (AUC).
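The BH adjustment between the first and second step can be sketched as follows. This is a minimal illustration, assuming the per-feature P values from the single-feature regressions are already available; the feature names and P values are hypothetical and are not results from the study.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_features(p_by_feature, alpha=0.05):
    """Keep features whose BH-adjusted P value is below alpha."""
    names = list(p_by_feature)
    adjusted = bh_adjust([p_by_feature[name] for name in names])
    return [name for name, a in zip(names, adjusted) if a < alpha]


# Hypothetical P values from the first-step single-feature regressions.
first_step_p = {
    "feature_a": 0.001,
    "feature_b": 0.02,
    "feature_c": 0.04,
    "feature_d": 0.30,
}
selected = select_features(first_step_p)
print(selected)
```

In this toy input `feature_c` has a raw P value below 0.05 but an adjusted value above it, so it is dropped; that is the practical difference between the selection rule here and the unadjusted rule in regression model 1.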
Results: Significant risk factors 1 • Table 1. Significant risk factors identified in MOMI dataset by multiple logistic regression.
Results: Significant risk factors 2 • Table 2. Demographic risk factors identified in MOMI dataset by simultaneous logistic regression with adjusted P value.
Results: Significant risk factors 2 • Table 3. Maternal behavioral and reproductive history risk factors identified in MOMI dataset by simultaneous logistic regression with adjusted P value.
Results: Significant risk factors 2 • Table 4. Labor and delivery characteristics and complications risk factors identified in MOMI dataset by simultaneous logistic regression with adjusted P value.
Results: Significant risk factors 2 • Table 5. Maternal and infant disease risk factors identified in MOMI dataset by simultaneous logistic regression with adjusted P value.
Results: Cross validation • 5-fold cross-validation • Regression model 1 (multiple logistic regression): AUC = 0.887 • Regression model 2 (simultaneous logistic regressions with adjusted P value): AUC = 0.92
Conclusion and discussion • Conclusion • Cross-validation results show that the AUCs from both regression models are at least 0.887, suggesting that the MOMI dataset is complete enough to predict infant mortality. • Discussion • Only 35 of the 115 death causes and related health problems are reflected in the MOMI dataset. • Although the MOMI dataset contains only a partial list of death causes for the 496 deaths, the analyses show that it contains sufficient information to predict infant mortality; the MOMI dataset is therefore complete with respect to its predictive ability.