Big Data Framework and Data Science in PROTECT

Big Data Framework and Data Science in PROTECT Northeast SRP Annual Meeting Zlatan Feric April 3, 2019

Overview • PROTECT background • Database management framework • Data overview • Analytics examples (papers, ongoing analysis)

Puerto Rico Test site for Exploring Contamination Threats (PROTECT) • NIEHS SRP P42 Center since 2010 • Over 1850 participants to date, close to 1400 completed pregnancies, >3000 data points per participant • Follow-on CRECE project tracks children’s health – fully integrated • Key research questions: • What is the contribution of environmental contaminants to preterm birth in Puerto Rico? • Can we develop better strategies for detection and green remediation to minimize or prevent exposure to environmental contamination? • Anticipated outcomes: • Define the relationship between exposure to environmental contaminants and preterm birth • Develop new technology for discovery, transport and exposure characterization & green remediation of contaminants in karst systems • Broader Impacts: • Support environmental public health practice, policy and awareness around our theme

Key Collaborators and Contributors • David Kaeli (Northeastern U) • Justin Manjourides (Northeastern U) • John Meeker (U. Michigan) • Ingrid Padilla (UPRM) • Zaira Rosario Pabon (UPR-MC) • Deborah Watkins (U. Michigan) • Emily Zimmerman (Northeastern U) • Akram Alshawabkeh (Northeastern U) • Gredia Alvarado (UPR-MC) • Ameera Aker (U. Michigan) • Jose Cordero (U. Georgia) • Shi Dong (Northeastern U) • Zlatan Feric (Northeastern U) • Roger Giese (Northeastern U) • Xiangyu Li (Northeastern U)

The Data Management and Modeling Core • Provides for the secure and reliable storage of, and access to, all data generated or utilized in PROTECT • Dropbox Professional portal used for remote data transfer and data backup (provides built-in encryption) • REDCap is used for cleaning all Human Subject data • EQuIS Electronic Data Processor (EDP) performs data cleaning for all projects, and leverages EQuIS Professional/Enterprise for diverse data management tasks • GitHub is used for sharing code and analytics models between researchers • Enables cross-Center data harmonization and analysis • Every data item is associated with at least one of the following indices (subject ID, time/day and/or space-GIS) • Complete online data dictionary for the Center • Automated data inventory, analytics and data mining capabilities customized for each project

Database Structure • PROTECT has adopted EQuIS, an environmental data management software package • Customizable schemas • Efficient data cleaning and reporting procedures • Web-based dashboards for distributed collaborators • Leverages a two-tiered server system for data export and storage • Web server – EQuIS Enterprise 6.3.0.1530 • Database server – Microsoft SQL Server V.10.0.160

Data Management Workflow • Data dictionary • Defines data type, data range, dependencies • Python programs for format file automation • Input: the data dictionary; output: format files in the following three forms: • 1. XML Schema • Applies the layout according to the PROTECT data dictionary and maps fields to database objects • Supports checks based on data ranges and data type • 2. VB • Supports complex checking: dependency / data conversion • 3. SQL • An additional data screening step • Enterprise Dashboards • Web-based • SQL reports
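The dictionary-to-schema step above can be illustrated with a minimal sketch. This is not the PROTECT format-file generator: the dictionary entries, field names, and ranges below are hypothetical, and the output is a generic XML Schema fragment rather than an EQuIS format file. It only shows how a data type and data range recorded in a dictionary could be turned into schema restrictions automatically.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)

# Hypothetical data dictionary entries (names, types, and ranges are
# illustrative assumptions, not the actual PROTECT dictionary).
DATA_DICTIONARY = [
    {"name": "gestational_age_weeks", "type": "decimal", "min": 0, "max": 45},
    {"name": "visit_number", "type": "integer", "min": 1, "max": 3},
    {"name": "subject_id", "type": "string"},
]


def build_schema(dictionary):
    """Return an XML Schema string with one range-checked element per field."""
    schema = ET.Element(f"{{{XS}}}schema")
    for field in dictionary:
        element = ET.SubElement(schema, f"{{{XS}}}element", name=field["name"])
        simple = ET.SubElement(element, f"{{{XS}}}simpleType")
        restriction = ET.SubElement(
            simple, f"{{{XS}}}restriction", base=f"xs:{field['type']}"
        )
        # Range checks are emitted only when the dictionary defines them.
        if "min" in field:
            ET.SubElement(
                restriction, f"{{{XS}}}minInclusive", value=str(field["min"])
            )
        if "max" in field:
            ET.SubElement(
                restriction, f"{{{XS}}}maxInclusive", value=str(field["max"])
            )
    return ET.tostring(schema, encoding="unicode")


schema_text = build_schema(DATA_DICTIONARY)
print(schema_text)
```

Keeping the dictionary as the single source of truth means a change to a field's range only has to be made in one place before the format files are regenerated.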

Data Overview • Human Subject Core: • 3,233 total fields/participant • Presently 16 different forms/medical questionnaires • Environmental Data Core: • 1046 wells (236 of them include water contaminant data) • 39 springs (10 of them include water contaminant data) • Tap water samples from homes • Targeted Biological Data Core: • Close to 3M data points! (Blood, Urine, Plasma) • Non-targeted Biological Data Core: • 5 fields, >1B data points in 6 urine samples • Mass-to-charge values • Data peaks

Study area overview

Pregnancy Outcomes from PROTECT • Since 2010, 1351 completed pregnancies and recorded gestational ages • 1156 Full term • 119 Preterm • 76 Early Pregnancy Loss • 10.3% Preterm

The data is stored and cleaned. How do we use it to derive insight and make decisions?

Technologies Used to Facilitate Analysis • Programming models: • Python - sklearn, statsmodels, mlxtend, tensorflow • R, SAS • Leverage High Performance Computing: • HPC infrastructure (NEU Discovery high performance computer) for many-node and GPU-accelerated analytics

Case Study 1 – Environmental phthalate exposure and preterm birth in the Puerto Rico Test site for Exploring Contamination Threats (PROTECT) birth cohort • Recruited into PROTECT to date: N = 1,824 • Excluded: 286 pregnancies completed August 2017 or later, 173 pregnancies in progress, 238 withdrawals • Pregnancies completed between 2011 and July 2017: N = 1,127 • Excluded: 15 stillbirths, 45 miscarriages • Live births: N = 1,067 • Excluded: 38 missing confirmation of delivery date, 1 set of twins • Live births included in present analysis: N = 1,028
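The cohort numbers on this slide are internally consistent, which a short arithmetic check makes explicit. The counts are taken from the slide; the variable names and the grouping of the subtracted counts into three steps are assumptions made for illustration.

```python
# Counts copied from the slide; grouping into steps is an assumption.
recruited = 1824

not_in_window = {
    "completed August 2017 or later": 286,
    "pregnancy in progress": 173,
    "withdrawal": 238,
}
completed = recruited - sum(not_in_window.values())

not_live_birth = {"stillbirth": 15, "miscarriage": 45}
live_births = completed - sum(not_live_birth.values())

not_analyzed = {"missing confirmation of delivery date": 38, "set of twins": 1}
analyzed = live_births - sum(not_analyzed.values())

print(completed, live_births, analyzed)
```

Each intermediate value matches the N reported on the slide (1,127, then 1,067, then 1,028).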

Analytics Example 1 – Environmental phthalate exposure and preterm birth in the Puerto Rico Test site for Exploring Contamination Threats (PROTECT) birth cohort • Analysis model: Logistic Regression • Found significant log-odds ratios relating an increase in phthalate concentration to preterm birth outcome for: • monoethyl phthalate (MEP) • mono-n-butyl phthalate (MBP) • mono-isobutyl phthalate (MiBP) • mono-hydroxybutyl phthalate (MHBP) • For visit 2, all p-values < 0.05
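To show what a log-odds ratio from a logistic regression means, here is a minimal sketch on synthetic data. It is not the study's analysis: the data are simulated, the single predictor stands in for a standardized log phthalate concentration, no covariates are adjusted for, and the fit uses a hand-written Newton's method with the standard library rather than statsmodels or sklearn.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_logistic(x, y, iterations=25):
    """Fit logit(P(y=1)) = b0 + b1*x by Newton's method.

    Returns (b0, b1, standard error of b1).
    """
    b0 = b1 = 0.0
    se_b1 = float("nan")
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p          # gradient of the log-likelihood
            g1 += (yi - p) * xi
            h00 += w              # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + H^{-1} g for the 2x2 matrix H.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
        # Inverse-information estimate of the slope's standard error,
        # evaluated at the coefficients from the start of this iteration.
        se_b1 = math.sqrt(h00 / det)
    return b0, b1, se_b1


# Synthetic cohort: all values below are simulated assumptions.
rng = random.Random(0)
TRUE_INTERCEPT, TRUE_SLOPE = -2.0, 0.8
exposure = [rng.gauss(0.0, 1.0) for _ in range(2000)]
preterm = [
    1 if rng.random() < sigmoid(TRUE_INTERCEPT + TRUE_SLOPE * xi) else 0
    for xi in exposure
]

b0, b1, se_b1 = fit_logistic(exposure, preterm)
odds_ratio = math.exp(b1)                      # odds multiplier per unit exposure
z = b1 / se_b1                                 # Wald statistic
p_value = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal p-value
print(f"log-odds ratio={b1:.3f}, odds ratio={odds_ratio:.3f}, p={p_value:.3g}")
```

The slope `b1` is the log-odds ratio: exponentiating it gives the multiplicative change in the odds of the outcome for a one-unit increase in the predictor.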

Analytics Example 2 – Is Infant Non-Nutritive Suck a Sensitive Measure of Prenatal Environmental Exposures? • Analysis model: Linear Regression • Based on 90 follow-up participants, these preliminary data suggest that certain features of infant non-nutritive suck (NNS), particularly NNS frequency and amplitude, are associated with specific prenatal phthalate exposures: • (MCNP) • (MECPP) • (MEHHP) • (MHBP)

Analytics Example 3 – A Hybrid Approach to Identifying Key Factors in Environmental Health Studies • Analysis model: • Decision Tree with two splitting criteria: • Information Gain (IG) • Area Under the ROC curve (AUC)
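The two criteria named above can each be computed for a single candidate feature, as in the sketch below. The toy values and labels are made up, and how the hybrid approach combines the two scores inside the tree is not described on the slide, so the code only evaluates each criterion separately.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on values <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    children = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - children


def auc(values, labels):
    """Probability that a positive case has a higher value than a negative one."""
    pos = [v for v, l in zip(values, labels) if l == 1]
    neg = [v for v, l in zip(values, labels) if l == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy feature values and binary outcomes (illustrative only).
feature = [1.0, 2.0, 3.0, 4.0]
outcome = [0, 0, 1, 1]

ig = information_gain(feature, outcome, threshold=2.5)
area = auc(feature, outcome)
print(ig, area)
```

Information gain scores one specific threshold, while AUC scores how well the feature ranks cases across all thresholds, which is one reason the two can prefer different features.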

Selected Features

Analytics Example 4 – Hierarchical Clustering in Food Frequency Data • Dendrogram shows how participants can be clustered into two groups based on their eating habits
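A minimal sketch of agglomerative (bottom-up) hierarchical clustering shows how such a two-group split can arise. The six food-frequency vectors are invented, and Euclidean distance with average linkage is an assumption, since the slide does not state which distance or linkage was used. The recorded merge order is the information a dendrogram would draw.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    merges = []  # (members of first, members of second, merge distance)
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical weekly servings per participant:
# [fruit, vegetables, fried food, soda]
participants = [
    [10, 12, 1, 0],
    [9, 11, 2, 1],
    [11, 10, 1, 1],
    [2, 3, 8, 9],
    [1, 2, 9, 10],
    [3, 2, 7, 8],
]

clusters, merges = agglomerate(participants, n_clusters=2)
groups = sorted(sorted(c) for c in clusters)
print(groups)
```

On real food frequency data the variables would normally be standardized first so that frequently eaten items do not dominate the distance.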

Conclusion • Facilitating data-driven decisions requires a strong data management framework • Deriving insight from data using data science requires collaboration with domain experts and the use of advanced statistical machine learning • Our preliminary findings suggest a strong connection between exposure, lifestyle, and adverse pregnancy outcomes

Questions? Thank You!
