Analysis of High-Dimensional Structure-Activity Screening Datasets Using the Optimal Bit String Tree. Jacqueline M. Hughes-Oliver Department of Statistics North Carolina State University hughesol@stat.ncsu.edu *joint with Ke Zhang, GSK and Stan Young, NISS.
This work was funded by the National Institutes of Health through the NIH Roadmap for Medical Research, Grant 1 P20 HG003900-01. Information on the Molecular Libraries Roadmap Initiative can be obtained from http://nihroadmap.nih.gov/molecularlibraries/
Blackwell-Tapia - November 2008
Outline
• Background
• Recursive partitioning
• OBSTree
• Simulation study
• Screening for monoamine oxidase inhibitors
• Summary
Background
Estimate a function such that based on where Preferably, costs more than
http://pubchem.ncbi.nlm.nih.gov/
http://www.niss.org/PowerMV/
http://eccr.stat.ncsu.edu/
Background: Structure-Activity Relationship (SAR)
• Willett, Barnard, Downs (1998 JCICS)
• Molecular descriptors: Carhart atom pairs
• Atom type - distance - atom type, e.g., C(2,1)-04-C(3,1)
• Binary descriptors: few turned on
Recursive Partitioning
• Splitting variable chosen to optimize a "purity measure"
• Search space: size p
• Need definitions for:
• search space
• purity measure, splitting criterion
• stopping criterion
[Figure: example tree with splits X3=1 and X27=1, each with True and False branches]
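A single-descriptor split search over a space of size p can be sketched as follows. This is a minimal illustration, assuming entropy as the purity measure and a size-weighted average over the two child nodes; the function names and the toy data are hypothetical and do not come from the talk.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def best_single_split(X, y):
    """Scan all p binary descriptors and return (index, weighted child entropy).

    Lower weighted entropy means purer child nodes.
    """
    n, p = len(X), len(X[0])
    best_j, best_score = None, float("inf")
    for j in range(p):
        true_branch = [y[i] for i in range(n) if X[i][j] == 1]
        false_branch = [y[i] for i in range(n) if X[i][j] == 0]
        if not true_branch or not false_branch:
            continue  # descriptor is constant in this node, so it cannot split
        score = (len(true_branch) * entropy(true_branch)
                 + len(false_branch) * entropy(false_branch)) / n
        if score < best_score:
            best_j, best_score = j, score
    return best_j, best_score


# Toy data: 4 compounds, 3 binary descriptors (hypothetical values).
X = [[1, 0, 1],
     [1, 1, 0],
     [0, 1, 1],
     [0, 0, 0]]
y = [1, 1, 0, 0]

split_index, split_score = best_single_split(X, y)
print(split_index, split_score)
```

In this toy example only descriptor 0 separates the two classes perfectly, so it is the one selected.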
Recursive Partitioning: Rules are complex
[Figure: standard RP tree with numbered nodes]
• Are all splits necessary for the activity mechanism?
• Does an early split impede identification of other mechanisms?
Recursive Partitioning: Focus of Study
Need definitions for:
• Search space
• Purity measure, splitting criterion
• Stopping rule
Binary Formal Inference-Based Recursive Modeling (BFIRM)
• Cho, Shen, Hermsmeier (2000, JCICS)
• Rank predictors according to F-test
• Combine important predictors to form splitting variable
• Result is better QSAR rules
Recursive Partitioning/Simulated Annealing (RP/SA)
• Blower et al. (2002, JCICS)
• Best single predictor not necessarily best in combination
Tree Harvesting
• Yuan, Chipman, Welch (2006 tech report)
• "Trim" bits off each terminal node
Recursive Partitioning: RP/SA
• Splitting variables are based on a combination of K predictors
• Features are always present:
• Search space of size
• Uses simulated annealing (stochastic optimization)
• K is held fixed for all splits, and is assumed known
OBSTree
• Splitting variables are based on a combination of K predictors
• Combines approaches of BFIRM and RP/SA
• Features can be present or absent: chromosome selection
• Search space of size
• Uses simulated annealing + weighted sampling + trimming
• K can change from split to split, and is assumed unknown
• Uses a penalty entropy splitting criterion
• Usual stopping criteria applied, including cross-validation
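For a sense of scale, the sketch below compares three search spaces using standard counting arguments: p candidates for a single-descriptor split, C(p, K) descriptor sets when all K features must be present, and C(p, K)·2^K when each of the K features may be required present or absent. These expressions are assumptions based on counting, not formulas taken from the slides, and p = 500 with K = 5 are illustrative values only.

```python
from math import comb


def single_descriptor_space(p):
    """Number of candidate splits when one descriptor is used per split."""
    return p


def present_only_space(p, k):
    """Number of K-descriptor sets when every chosen feature must be present."""
    return comb(p, k)


def present_or_absent_space(p, k):
    """Number of (descriptor set, bit string) pairs: each of the K chosen
    descriptors is required to be either present (1) or absent (0)."""
    return comb(p, k) * 2 ** k


p, k = 500, 5  # illustrative values
print("single descriptor :", single_descriptor_space(p))
print("K features present:", present_only_space(p, k))
print("present or absent :", present_or_absent_space(p, k))
```

Even for modest K the combined space is far too large to enumerate, which is consistent with the use of a stochastic search such as simulated annealing.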
OBSTree: Flowchart
[Flowchart: Descriptor Pool, RP, Singly Important Descriptors, General Descriptors]
Pre-OBSTree Setup
• Remove unary descriptors
• Determine Singly Important group
• Specify parameters
OBSTree: Flowchart
Pre-OBSTree Setup
• Remove unary descriptors
• Determine Singly Important group
• Specify parameters
Initialize split at next depth:
• depth = depth + 1
• Select a set of K descriptors (X0) using WSS
• Determine best chromosome x0 of initial X0
Stop check: depth = d, or node size < 2·minor, or Ymax = 0, or Ybar > M-1?
• Yes: form last terminal node. STOP
• No: SA to determine "optimal" (XA, xA) for split using WSS
Trim
• Check 2^K - 1 subsets of current (XA, xA)
• Report best trimmed version as (X*, x*)
X* = x*?
• Yes: form terminal node
• No: branch target not recoverable from the flowchart
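The trimming step can be sketched as an exhaustive check of the 2^K - 1 non-empty subsets of the current descriptor/chromosome pair. This is a minimal illustration under stated assumptions: the scoring function (mean activity of the compounds matching the rule) is a hypothetical stand-in for the talk's splitting criterion, ties are broken in favor of the smaller subset, and all names and data are invented for the example.

```python
from itertools import combinations


def matches(compound, descriptors, chromosome):
    """True if the compound's bits equal the chromosome on the given descriptors."""
    return all(compound[d] == bit for d, bit in zip(descriptors, chromosome))


def trimmed_candidates(descriptors, chromosome):
    """Yield every non-empty subset of the (descriptor, bit) pairs: 2**K - 1 in total."""
    pairs = list(zip(descriptors, chromosome))
    for size in range(1, len(pairs) + 1):
        for subset in combinations(pairs, size):
            ds, bits = zip(*subset)
            yield ds, bits


def trim(X, y, descriptors, chromosome, score):
    """Return (score, descriptors, chromosome) for the best-scoring subset.

    Subsets are visited from small to large and replaced only on a strict
    improvement, so ties go to the simpler rule.
    """
    best = None
    for ds, bits in trimmed_candidates(descriptors, chromosome):
        node = [y[i] for i, row in enumerate(X) if matches(row, ds, bits)]
        if not node:
            continue  # rule selects no compounds
        s = score(node)
        if best is None or s > best[0]:
            best = (s, ds, bits)
    return best


def mean_activity(node):
    """Hypothetical stand-in score: average activity in the node."""
    return sum(node) / len(node)


# Toy data: descriptor 2 is irrelevant to activity (hypothetical values).
X = [[1, 0, 1],
     [1, 0, 0],
     [1, 1, 1],
     [0, 0, 1],
     [0, 1, 0]]
y = [3, 3, 0, 0, 0]

print(trim(X, y, descriptors=(0, 1, 2), chromosome=(1, 0, 1), score=mean_activity))
```

Here the full three-descriptor rule and the trimmed two-descriptor rule score equally, so the trimmed rule (descriptor 0 present, descriptor 1 absent) is reported, and it captures both active compounds rather than one.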
OBSTree: Splitting Criterion
• Node has N compounds
• Class i has proportion p_i in the node, with a total of n_i in the node
• Entropy (node impurity):
• Penalty entropy (penalizes the unwanted category)
Problem: Entropy = 0 (perfect) when a class of junk compounds is identified
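The problem noted on this slide can be seen with ordinary entropy. The sketch below assumes the usual definition, the negative sum of p_i log p_i with p_i = n_i / N and natural logarithms, and assumes class 0 denotes junk (inactive) compounds; the penalty entropy itself is not implemented because its formula is not given in the text.

```python
import math
from collections import Counter


def node_entropy(labels):
    """Entropy of a node: sum over classes of -p_i * log(p_i), with p_i = n_i / N."""
    n = len(labels)
    return sum(-(c / n) * math.log(c / n) for c in Counter(labels).values())


all_junk = [0] * 10        # node containing only inactive compounds
all_active = [3] * 10      # node containing only the most active class
mixed = [0] * 5 + [3] * 5  # evenly mixed node

# Ordinary entropy cannot tell a pure junk node from a pure active node.
print(node_entropy(all_junk), node_entropy(all_active), node_entropy(mixed))
```

Both pure nodes have entropy 0, so a criterion based on entropy alone rewards isolating junk as much as isolating actives, which is the motivation for a one-sided penalty.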
OBSTree: Stopping Criteria
• Maximum depth d
• The most active compound is junk
• The node size is less than 2j (j is the minimum node size)
• 5-fold cross-validation, e.g., choose depth d if
• # correct classifications levels off at depth d
• Accept H0: p_{d+1} = 0 for p_{d+1} = sensitivity between depths d and d+1
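The first three stopping rules can be combined into a single check, sketched below. It assumes that activity class 0 means junk, so "the most active compound is junk" becomes a maximum activity of 0; the function name and arguments are hypothetical, and the cross-validation rule is not covered.

```python
def should_stop(node_activities, depth, max_depth, min_node_size):
    """Return True if a node should not be split further.

    node_activities: activity classes of the compounds in the node
    depth:           current depth of the node
    max_depth:       maximum depth d
    min_node_size:   minimum node size j (a split needs at least 2j compounds)
    """
    if depth >= max_depth:
        return True
    if len(node_activities) < 2 * min_node_size:
        return True
    if max(node_activities) == 0:  # assumes class 0 is junk
        return True
    return False


print(should_stop([0, 0, 3, 2, 0, 1, 0, 0, 0, 3], depth=1, max_depth=4, min_node_size=5))
print(should_stop([0] * 20, depth=1, max_depth=4, min_node_size=5))
```

The size check runs before the activity check, so an empty node is handled without calling `max` on an empty list.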
Simulation Study
• 1000 compounds, 500 binary descriptors
• Four active groups (20 compounds per group): 8% active
Simulation Study: Standard RP Tree
[Figure: standard RP tree with numbered nodes; annotations: "5 compounds of 3 + 5 compounds of 0" and "7 compounds of 3"]
Simulation Study: Sample OBSTree
[Figure: OBSTree with splits written as descriptors/chromosome, followed by node labels]
1,2,3,4,5/1,0,1,0,1 3
15,16,17,18,19/1,1,0,1,1 1
5,6,7,8,9/0,1,1,1,1 3
3,11,12,13,17/1,1,1,1,1 0 2
Simulation Study: 5-fold Cross-validation
[Figure: cross-validation results; panels: OBSTree, RP]
Simulation Study: Sensitivity Analysis
• K, descriptor set size
• K > 7 perfectly found all mechanisms
• K = 7 perfectly found all but one mechanism
• Basic tree parameters
• Min node size is 5
• SA parameters
• Initial temperature
• Minimum temperature
• Temperature reduction rate
• # transitions at a given temperature
• # failures to accept new point before increasing transition counter
• Sampling weights in WSS
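The role of the SA parameters listed above can be seen in a generic simulated annealing loop over descriptor sets. This is a sketch under stated assumptions, not the OBSTree implementation: proposals are drawn uniformly rather than by weighted sampling (WSS), the failure counter is omitted, the objective is a toy function, and all names and default values are hypothetical.

```python
import math
import random


def anneal(p, k, objective, t_init=1.0, t_min=0.01, cooling=0.9,
           n_transitions=50, seed=0):
    """Search for a set of k descriptors (out of p) that maximizes `objective`.

    t_init, t_min:  initial and minimum temperature
    cooling:        temperature reduction rate (multiplied in at each step)
    n_transitions:  number of proposals at each temperature
    """
    if not 0 < k < p:
        raise ValueError("need 0 < k < p")
    rng = random.Random(seed)
    current = rng.sample(range(p), k)
    current_score = objective(current)
    best, best_score = list(current), current_score
    t = t_init
    while t > t_min:
        for _ in range(n_transitions):
            # Propose: swap one chosen descriptor for an unchosen one (uniform).
            candidate = list(current)
            unchosen = [j for j in range(p) if j not in current]
            candidate[rng.randrange(k)] = rng.choice(unchosen)
            candidate_score = objective(candidate)
            delta = candidate_score - current_score
            # Always accept improvements; accept worse moves with prob. exp(delta / t).
            if delta >= 0 or rng.random() < math.exp(delta / t):
                current, current_score = candidate, candidate_score
                if current_score > best_score:
                    best, best_score = list(current), current_score
        t *= cooling
    return best, best_score


def demo_objective(descriptors):
    """Toy objective: how many of the target descriptors {0, 1, 2} were chosen."""
    return len({0, 1, 2} & set(descriptors))


best_set, best_value = anneal(p=20, k=3, objective=demo_objective)
print(sorted(best_set), best_value)
```

A higher initial temperature or slower cooling accepts more downhill moves early on, which trades run time for a lower chance of stalling in a local optimum.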
Screening to Identify MAO Inhibitors
• Neuronal MAO deactivates neurotransmitters
• Pargyline, an MAO inhibitor, was used to treat depression
• MAO inhibitors no longer used due to toxicity & interactions
• Abbott Laboratories dataset of MAO inhibitors
• Brown & Martin (1996 JCICS)
• 1646 chemically diverse compounds
• 1380 binary 2D atom-pair descriptors
• Response variable: 0, 1, 2, 3 (ordered data) [1358/114/86/88]
• Category 3 has 2 well-known mechanisms (Rusinko et al., 1999 JCICS)
[Figure: trees fitted to the MAO data by OBSTree, RP/SA, and RP; labels as extracted]
OBSTree and RP/SA: 184,721,879; 81,177,579,183/1,1,1,0; 0/0/0/33; 0/1/0/6; 32,572,844; 1,579,1184,809/1,1,1,0; 0/0/0/15; 1/0/1/26
RP: 704; 1184; 81; 9/2/1/2; 2/0/0/32; 65; 2/0/5/24; 183; 959/85/55/18; 99/1/0/0
MAO: Activity Mechanism I
• "Irreversible binding to flavin cofactor of MAO"
• Pargyline-like compounds
• Typical features of pargyline-like compounds
• A triple bond
• A tertiary nitrogen
• An aromatic ring
• 1st terminal node of OBSTree
• Highest active terminal node of RP
• 1st terminal node of RP/SA
MAO: Activity Mechanism I
• Compound 1: Pargyline, y=3, has 579 & 81 & 177 but not 183
• Compound 2: y=0, has feature 183 so violates OBSTree
• Compound 3: y=0, falls in active node from RP
• Compound 4: y=0, falls in active node from RP and RP/SA
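The way an OBSTree rule separates Compounds 1 and 2 can be checked with a short sketch. It assumes the rule is descriptors 81, 177, 579, 183 with bit string 1,1,1,0, read as "81, 177, and 579 present, 183 absent". Pargyline's bits for these descriptors follow the slide; the bits for Compound 2 other than 183 are hypothetical, as is the function name.

```python
def satisfies(on_bits, descriptors, chromosome):
    """True if each descriptor is present where the chromosome has 1
    and absent where it has 0. `on_bits` is the set of descriptors turned on."""
    return all((d in on_bits) == bool(bit)
               for d, bit in zip(descriptors, chromosome))


rule_descriptors = (81, 177, 579, 183)
rule_chromosome = (1, 1, 1, 0)

pargyline = {81, 177, 579}        # has 579, 81, 177 but not 183 (from the slide)
compound_2 = {81, 177, 579, 183}  # 183 is from the slide; the other bits are hypothetical

print(satisfies(pargyline, rule_descriptors, rule_chromosome))
print(satisfies(compound_2, rule_descriptors, rule_chromosome))
```

A rule built only from present features could not exclude Compound 2 in this example, which is the practical benefit of allowing an absent bit.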
MAO: Activity Mechanism II
• "Binding to active site"
• -N-N-C(=O)- is a hydrazine feature that can be hydrolyzed to bind protein (MAO) as a nonselective, irreversible inhibitor
Descriptors: 579: C(2,1)-3-C(3,1); 1: C(1,0)-3-C(1,0); 1184: N(2,0)-2-N(2,0)
[Figure: chemical structure annotated with these atom pairs]
Absent Descriptor (809: C(3,1)-4-Br)
[Figure: two chemical structures, labeled Activity=3 and Activity=2, with C(3,1)-4-Br atom pairs marked]
Summary
• OBSTree: new RP algorithm for obtaining simplified output
• Model presence and absence of molecular features
• Combination size is data-driven, varies over splits
• Penalty entropy splitting criterion for one-sided purity
• Weighted sampling during optimization allows prior information
• Simpler verification of QSAR
• Standard RP and RP/SA are special cases of OBSTree
• Output is not deterministic
• As with any RP output, care should be taken when interpreting the results
• Can miss highly correlated but important predictors
• Different trees provide similar partitions of the data
• Because of hard thresholding, predictions are highly variable
• Computationally intensive!
Acknowledgements
• Atina Brooks, North Carolina State University
• Jiajun Liu, Merck
• Haojun Ouyang, North Carolina State University
• Abbott Laboratories
• Jack Liu, OmicSoft
• Jun Feng, NIH
• GoldenHelix