Machine Learning in Public Health. Kristin P. Bennett, Dept of Mathematical Sciences and Dept of Computer Sciences, Rensselaer Polytechnic Institute, www.rpi.edu/~bennek
Science in 21st Century • Traditional: Hypothesis, Design, Experiment, Data, Data analysis, Result • Data Driven: Process/Experiment, Data, Data analysis, Result (No Prior Hypothesis) • “If your experiment needs statistics, you ought to have done a better experiment.” Ernest Rutherford
Public Health Challenges • Drug Design: Predict bio-activities of small molecules. Subtask: Drug metabolism • Control of Infectious Diseases: Use pathogen DNA fingerprinting to track and control disease. Subtask: Tuberculosis
CYP450-mediated metabolism of drug-like molecules. Charles Bergeron, Jed Zaretzki, Curt Breneman, and Kristin Bennett. NIH Molecular Roadmaps Initiative 1P20H6003899-01 • Motivation • Identify the problem • Customized Machine Learning Method • Results • Conclusions
Drug Metabolism • The rate-limiting step in the metabolism of drugs by the enzyme cytochrome CYP3A4 is hydrogen atom abstraction (removal). [Figures: Clozaril pill; clozapine molecule]
Motivation: Why is this important? • CYP450 isozymes metabolize the majority of drugs in clinical use • 3A4, 2D6, and 2C9 respectively metabolize 50%, 25%, and 16% of drugs on the market • Prediction of metabolic sites on lead candidates can circumvent metabolic liabilities later in the discovery pipeline, as well as aid pro-drug design • While in vitro techniques are increasingly high throughput, in silico identification of metabolic liabilities early in the drug discovery process helps avoid taking forward drug candidates that carry them
Identifying the problem • Developing a predictive model of regioselective metabolism by a CYP 450 isozyme • Issues: • For a given molecule, only the site of metabolism with the fastest reaction rate is known • There is no information about relative rates of metabolism for other sites on the molecule • Relative reaction rates between different molecules are unknown
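The three issues above mean the only usable training signal is a within-molecule preference: the observed site beats every other site on the same molecule, and nothing can be said across molecules. A minimal Python sketch of turning such data into preference pairs follows; the dictionary layout, the names `groups` and `observed_site`, and the toy molecules are assumptions made for illustration, not the authors' data format.

```python
from typing import Dict, List, Tuple


def preference_pairs(molecules: Dict[str, dict]) -> List[Tuple[str, str, str]]:
    """Return (molecule, preferred_group, other_group) triples.

    Only within-molecule pairs are produced, because relative rates
    between different molecules are unknown.
    """
    pairs = []
    for name, mol in molecules.items():
        winner = mol["observed_site"]
        for group in mol["groups"]:
            if group != winner:
                pairs.append((name, winner, group))
    return pairs


# Hypothetical toy data: group names and observed sites are made up.
molecules = {
    "molecule_1": {"groups": ["g1", "g2", "g3"], "observed_site": "g2"},
    "molecule_2": {"groups": ["g1", "g2"], "observed_site": "g1"},
}

pairs = preference_pairs(molecules)
for pair in pairs:
    print(pair)
```

Note that no pair ranks the losing groups against each other, which mirrors the missing information about relative rates at the other sites.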
Identifying the Problem: A racing metaphor [Figure: Race 1, Race 2, Race 3, each run to a FINISH line labeled OXIDIZE; molecule of lidocaine]
Representation: Identifying distinct regions of a molecule - Metabolophores • Topologically equivalent groups of hydrogens are identified, where abstraction of any member of the group results in the same metabolite.
Lidocaine: Metabolophores 1 to 7 [Figure: lidocaine with its hydrogens grouped into Metabolophores 1 to 7; the green group designates the experimentally determined site of metabolism] Base atom descriptors and atom descriptors: • Non-hydrogen bond count • Hydrogen bond count • Span • Ring information • Rotatable bonds • Physical environment • Distribution of atom types at 1, 2, 3 and 4 bonds away from base atom • AM1 charge • Hydrophobic moment • Bond length • Surface area
Customized Model: From Chemistry to Machine Learning • First Try: Classification • Is this hydrogen group abstracted or not? • Separate abstracted groups from all other groups. [Figure: Molecules 1 to 4, each divided into numbered hydrogen groups]
Almost Multiple Instance Classification • Is a hydrogen in the group abstracted or not? • Separate at least one hydrogen in each abstracted group from all other groups. [Figure: Molecules 1 to 4, each divided into numbered hydrogen groups]
Learning Model: Multiple Instance Ranking (Bergeron et al., 2008) • Which group within the molecule will be preferred? • MIRank finds a single ranking function across multiple molecules. [Figure: Molecules 1 to 4, each divided into numbered hydrogen groups]
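Once a single ranking function has been learned, predicting the site of metabolism reduces to scoring every group in a molecule and taking the top one. The sketch below assumes a linear ranking function with weight vector `w` and scores a group by its best-scoring hydrogen; both choices, along with the feature values, are assumptions for illustration rather than the published MIRank procedure.

```python
from typing import Dict, List, Sequence


def score(w: Sequence[float], x: Sequence[float]) -> float:
    """Linear ranking function: dot product of weights and descriptors."""
    return sum(wi * xi for wi, xi in zip(w, x))


def predict_site(w: Sequence[float], groups: Dict[str, List[Sequence[float]]]) -> str:
    """Return the name of the group ranked first within one molecule.

    Assumption: a group's score is the maximum score of its hydrogens.
    """
    group_scores = {
        name: max(score(w, hydrogen) for hydrogen in hydrogens)
        for name, hydrogens in groups.items()
    }
    return max(group_scores, key=group_scores.get)


# Hypothetical weights and two-dimensional descriptors.
w = [1.0, -0.5]
molecule = {
    "group_1": [[0.2, 0.1], [0.3, 0.4]],
    "group_2": [[0.9, 0.2]],
    "group_3": [[0.1, 0.9]],
}

predicted = predict_site(w, molecule)
print(predicted)
```

Because the same `w` is applied to every molecule, the comparison is always between groups of one molecule, never between molecules.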
Learning Model: Multiple Instance Ranking (Bergeron et al., 2008) • It is posed as a bilinear optimization problem • Objective: empirical risk plus a tradeoff parameter times the regularization term • Model: bilinear constraint • Convex combination weights sum to one • Convex combination weights are nonnegative • Empirical risk terms are nonnegative • The source code, data and paper are available online: http://reccr.chem.rpi.edu/MIRank/
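The slide's formula did not survive in this transcript; only its annotations did. One formulation consistent with those annotations is sketched below. All symbols are assumptions introduced here: w is the ranking weight vector, v holds the convex combination weights over the hydrogens I_m of the preferred group of molecule m, J_m indexes the hydrogens in the other groups, ξ are slack (empirical risk) terms, and ν is the tradeoff parameter. The exact form should be checked against Bergeron et al. (2008).

```latex
\begin{aligned}
\min_{w,\,v,\,\xi}\quad
  & \sum_{m}\sum_{j \in J_m} \xi_{mj} \;+\; \nu\,\lVert w \rVert
  && \text{(empirical risk + tradeoff parameter $\times$ regularization)} \\
\text{s.t.}\quad
  & \Big(\sum_{i \in I_m} v_{mi}\, x_{mi}\Big)^{\top} w \;-\; x_{mj}^{\top} w \;\ge\; 1 - \xi_{mj}
  && \text{(bilinear in $v$ and $w$)} \\
  & \sum_{i \in I_m} v_{mi} = 1
  && \text{(convex combination weights sum to one)} \\
  & v_{mi} \ge 0
  && \text{(convex combination weights are nonnegative)} \\
  & \xi_{mj} \ge 0
  && \text{(empirical risk terms are nonnegative)}
\end{aligned}
```

The product of v and w in the first constraint is what makes the problem bilinear: with v fixed it is a standard ranking problem in w, and with w fixed it is linear in v.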
Results: Comparison with other methods • Our descriptors and modeling techniques take advantage of the inherent molecule/metabolophore structure of the problem to effectively utilize limited experimental information. • Results are statistically equivalent to previously published results. • Predictions published by Sheridan utilize methods proprietary to Merck, while Metasite is a commercial product. • We are developing our method into a publicly available tool for online metabolic site predictions.
Results in Blind test for Major Pharmaceutical Company “Long story short, we're very impressed with the predictions for this preliminary test set of 20 compounds. If we had not had experimental data yet for these compounds, the predictions would have been very useful in directing our chemistry teams to the major or minor metabolic hot-spot for a large majority of the compounds.” (85% accuracy)
Conclusions and Directions: • Most accurate public domain method for hydrogen abstraction (online prototype). • Metabolite can be accurately determined using the predicted metabolophore. • Customize machine learning to the task • Model enhancements: Nonlinear kernel function, multi-task learning across isozymes, multi-level model • Algorithm enhancements: New faster class of nonsmooth nonconvex bundle methods for multiple instance learning.
TB-TRACK Disease Control and Dynamics Laboratory Tuberculosis Tracking and Control Amina Shabeer, Cagri Ozgalar, S. Vandenberg, B. Yener, L. Cowan, J. Driscoll, K. Bennett and more at CDC, NYCDOH, NYDOH, PHRI, Institut Pasteur NIH R01 LM009731 cs.rpi.edu/~bennek/tbtrack
Tackling Tuberculosis • More than 8 million new cases, 2.5 million deaths a year worldwide • WHO: 1/3 of world population is infected • Strong association with HIV epidemics, poverty • Emergence of multidrug-resistant strains • Extremely difficult to control • Goal: Use DNA fingerprinting of TB bacteria to track spread of TB, detect new outbreaks, guide control efforts
Genotyping helps TB Control • Two students/employees sick with TB. • TB Controller: Find source(s) of infection in order to identify people who need treatment and stop future transmission. • Genotype TB bacteria to see if patients are part of the same outbreak.
Identify the Problems Extract information valuable to TB control efforts in NYC beyond “match or no match” • Determine major phylogenetic lineages • Visualize genotype and patient information to find “clusters of interest” (outbreaks) • Spoligoforests • Patient/genotype clusters
Use M. tuberculosis Complex DNA fingerprinting • Insertion sequence 6110 restriction fragment length polymorphism (IS6110-RFLP) • Polymorphic GC-rich sequence – RFLP • Spacer oligonucleotide typing (Spoligotyping) • Mycobacterial interspersed repetitive units (MIRU) • Single nucleotide polymorphism (SNP) • Large sequence polymorphism (LSP) • Spoligotyping + MIRU: PCR based; routinely collected nationwide as part of TB surveillance data • New York City also has IS6110-RFLP, which needs culturing for several weeks
NYC Data from 2001-2008 • 4984 Patients • 137 Countries • 793 Spoligotypes • 2648 RFLPs • 3235 Distinct Genotypes • 594 “Named” Clusters • MIRU also available but incomplete
DNA Fingerprint: Spoligotyping • Direct repeats (DR) separated by variable spacers • Contiguous on chromosome, order well conserved • Forty-three spacers used • Presence of a spacer is detected: 1 = present, 0 = absent [Figure: DR locus with alternating spacers and direct repeats (one DVR marked); table of binary spoligotype descriptions for M. tuberculosis, Beijing, and M. bovis strains]
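A spoligotype is therefore just a 43-position presence/absence vector. A small Python sketch of that representation follows; the function names are hypothetical, and the example pattern (spacers 1 to 34 absent, 35 to 43 present) is the signature commonly reported for the Beijing family, used here only as an illustration.

```python
N_SPACERS = 43


def parse_spoligotype(bits: str) -> tuple:
    """Convert a 43-character string of 0s and 1s into a tuple of ints."""
    if len(bits) != N_SPACERS or set(bits) - {"0", "1"}:
        raise ValueError("expected a string of 43 characters, each 0 or 1")
    return tuple(int(b) for b in bits)


def present_spacers(spoligotype: tuple) -> list:
    """Return the 1-based numbers of the spacers that are present."""
    return [i + 1 for i, bit in enumerate(spoligotype) if bit]


# Illustrative Beijing-like pattern: spacers 1-34 absent, 35-43 present.
beijing_like = parse_spoligotype("0" * 34 + "1" * 9)
print(present_spacers(beijing_like))
```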
Major Genetic Lineages • Major genetic lineages as determined by LSPs and SNPs are widely accepted • How do you determine these based only on spoligotypes/MIRU? [Figure: Lineages in NYC]
Evolution of Spoligotypes • Mtb is highly clonal. Evolution of spoligotypes is slow. • One or more contiguous spacers are lost in one evolutionary event. • Distinct phylogeographic groups. • Dollo Parsimony Assumption: Once lost, spacers are never regained [Figure: table of binary spoligotype descriptions for East Asian, M. bovis, and Indo-Oceanic strains]
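These two assumptions translate directly into a test of whether one spoligotype could be the immediate ancestor of another. The sketch below is one way to code that test, under the assumption that "contiguous" means no surviving spacer lies between the first and last lost position; the function names and the 7-spacer toy strings are hypothetical (real spoligotypes have 43 spacers).

```python
def is_dollo_compatible(parent: str, child: str) -> bool:
    """True if the child never has a spacer that the parent lacks."""
    return all(not (p == "0" and c == "1") for p, c in zip(parent, child))


def lost_spacers(parent: str, child: str) -> list:
    """0-based positions present in the parent and absent in the child."""
    return [i for i, (p, c) in enumerate(zip(parent, child)) if p == "1" and c == "0"]


def is_single_contiguous_deletion(parent: str, child: str) -> bool:
    """True if the child differs from the parent by one contiguous loss.

    Assumption: the loss is contiguous when the child has no spacer
    between the first and last lost position.
    """
    if len(parent) != len(child) or not is_dollo_compatible(parent, child):
        return False
    lost = lost_spacers(parent, child)
    if not lost:
        return False
    return "1" not in child[lost[0]:lost[-1] + 1]


# Toy 7-spacer examples.
print(is_single_contiguous_deletion("1111111", "1100111"))  # one block lost
print(is_single_contiguous_deletion("1111111", "1010111"))  # two separate losses
print(is_dollo_compatible("1101111", "1111111"))            # spacer regained
```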
TB-LinRules: Determines Lineages • Precise rules use two types of features: deletion of contiguous spacers, and MIRU Locus 24 • If MIRU 24 > 1 then ancestral, otherwise modern. • Refines and clarifies rules developed from literature analysis by Dr. L. Cowan of US CDC. • Beta Version: http://www.cs.rpi.edu/~bennek/tbtrack/
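The two feature types can be expressed as very small predicates. The sketch below codes the MIRU 24 rule exactly as stated on the slide, plus a generic helper for testing a contiguous spacer deletion; the helper, its name, and the spacer range used in the example are assumptions for illustration and are not the actual TB-LinRules rule set.

```python
def ancestral_or_modern(miru24_repeats: int) -> str:
    """Slide rule: MIRU 24 > 1 means ancestral, otherwise modern."""
    return "ancestral" if miru24_repeats > 1 else "modern"


def spacers_absent(spoligotype: str, first: int, last: int) -> bool:
    """True if spacers first..last (1-based, inclusive) are all absent.

    Hypothetical helper for writing deletion-based rules.
    """
    return "1" not in spoligotype[first - 1:last]


# Illustrative pattern: spacers 1-34 absent, 35-43 present.
example = "0" * 34 + "1" * 9
print(ancestral_or_modern(2), ancestral_or_modern(1))
print(spacers_absent(example, 1, 34))
```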
Genetic Diversity of TB in US • Each node = Spoligotype • Size = # of patients (log) • Colors = 6 genetic lineages • 37K patients
Rules match lineage for 99.9% of the US CDC database (~37K CDC isolates) • Also tested on MIRU-VNTRPLUS.org datasets with >99% accuracy
Genetic Diversity of TB in NYC NYC Isolates
Phylogeographic Distribution Modern Strains Ancestral Strains
Identification of Sub-families • Sub-families are needed for further subgroup identification. • No complete delineation of sub-families exists, so unsupervised or semi-supervised learning is needed. • SPOTCLUST (Vitol et al., 2006): Generative Mixture Model
SPOTCLUST Hidden Parent Model • Captures one step of evolution • One hidden parent for each family (unknown) • Model infers hidden parents near leaves of tree without inferring phylogeny • Allows additional loss of spacers with low probability
Subfamily Probability Model • Visual rule generalizes to a probability model • Color represents the probability that a spacer is on • Multivariate Bernoulli distribution: assumes each spacer is independent within a subfamily [Figure: M. africanum]
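Under the independence assumption, the probability of a spoligotype within a subfamily is a product of per-spacer terms, which is easier to handle as a sum of logs. A minimal Python sketch follows; the function name and the three-spacer toy probabilities are assumptions for illustration, and the hidden parent is not modeled here.

```python
import math


def bernoulli_log_likelihood(spoligotype, probs):
    """Log-probability of a binary spoligotype under independent spacers.

    probs[i] is the probability that spacer i is present; each value
    must lie strictly between 0 and 1.
    """
    if len(spoligotype) != len(probs):
        raise ValueError("spoligotype and probs must have the same length")
    total = 0.0
    for bit, p in zip(spoligotype, probs):
        total += math.log(p) if bit else math.log(1.0 - p)
    return total


# Toy 3-spacer subfamily: P = 0.9 * (1 - 0.2) * 0.5.
ll = bernoulli_log_likelihood([1, 0, 1], [0.9, 0.2, 0.5])
print(math.exp(ll))
```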
Population is a mixture of subfamilies • Bayesian Network • Identifies 36 subfamilies using spoligotypes
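In a mixture model, assigning an isolate to a subfamily is an application of Bayes' rule: weight each subfamily's likelihood by its mixing proportion and normalize. The sketch below shows that step for a plain Bernoulli mixture without the hidden parent; the function names, the two toy subfamilies, and their parameters are assumptions for illustration, not SPOTCLUST's fitted values.

```python
import math


def log_likelihood(x, probs):
    """Log-probability of binary vector x under independent Bernoulli spacers."""
    return sum(math.log(p) if bit else math.log(1.0 - p) for bit, p in zip(x, probs))


def subfamily_posteriors(x, weights, subfamily_probs):
    """Posterior probability of each subfamily given spoligotype x."""
    log_joint = [
        math.log(w) + log_likelihood(x, probs)
        for w, probs in zip(weights, subfamily_probs)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_joint)
    unnormalized = [math.exp(v - top) for v in log_joint]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Two hypothetical subfamilies over three spacers, equal mixing weights.
weights = [0.5, 0.5]
subfamily_probs = [[0.9, 0.9, 0.1], [0.1, 0.1, 0.9]]
posteriors = subfamily_posteriors([1, 1, 0], weights, subfamily_probs)
print(posteriors)
```

Fitting the mixing weights and spacer probabilities themselves would typically alternate this posterior step with parameter updates, which is omitted here.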
M. tuberculosis Haarlem2 Family • Prototype • Bernoulli mixture model without Hidden Parent • Bernoulli mixture model with Hidden Parent
The NYC M. bovis Mystery • Extrapulmonary M. bovis strikes • Mexican immigrants • US-born children of Mexican immigrants • Hypothesized cause: Unpasteurized cheese
Conclusions • Machine learning extracts critical information from public health databases. • Customized machine learning methods for • Enhanced drug discovery • Infectious disease tracking and control • Framing the right question is half the battle. • Need innovative solutions on a wide range of learning tasks: (semi)(un)supervised, visualization • Goals are robust automated tools in the hands of front-line users. See http://reccr.chem.rpi.edu and http://www.cs.rpi.edu/~bennek/tbtrack