Analysis of High-Dimensional Structure-Activity Screening Datasets Using the Optimal Bit String Tree. Jacqueline M. Hughes-Oliver Department of Statistics North Carolina State University hughesol@stat.ncsu.edu *joint with Ke Zhang, GSK and Stan Young, NISS.


Presentation Transcript


  1. Analysis of High-Dimensional Structure-Activity Screening Datasets Using the Optimal Bit String Tree. Jacqueline M. Hughes-Oliver, Department of Statistics, North Carolina State University, hughesol@stat.ncsu.edu. *Joint with Ke Zhang, GSK and Stan Young, NISS. This work was funded by the National Institutes of Health through the NIH Roadmap for Medical Research, Grant 1 P20 HG003900-01. Information on the Molecular Libraries Roadmap Initiative can be obtained from http://nihroadmap.nih.gov/molecularlibraries/

  2. Blackwell-Tapia - November 2008 Outline • Background • Recursive partitioning • OBSTree • Simulation study • Screening for monoamine oxidase inhibitors • Summary

  3. Background Estimate a function such that based on where Preferably, costs more than

  4. http://pubchem.ncbi.nlm.nih.gov/ http://www.niss.org/PowerMV/ http://eccr.stat.ncsu.edu/

  5. Background – Structure-Activity Relationship (SAR) • Willett, Barnard, Downs (1998 JCICS) • Molecular descriptors: Carhart atom pairs • Atom type, distance, atom type, e.g., C(2,1)-04-C(3,1) • Binary descriptors, few turned on

  6. Recursive Partitioning • Splitting variable chosen to optimize “purity measure” • Search space: size p • Need definitions for: • search space • purity measure, splitting criterion • stopping criterion [Tree diagram: one node splits on X3=1 (True/False), another on X27=1 (True/False)]
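
A minimal sketch of the single-variable search described on this slide, assuming entropy as the purity measure and a weighted average over the two child nodes as the splitting criterion. The function names and the toy data are illustrative and do not come from the talk.

```python
import numpy as np

def entropy(y):
    """Shannon entropy (bits) of the class labels in one node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

def best_single_split(X, y):
    """Scan all p binary descriptors; return (index, weighted child entropy)."""
    n, p = X.shape
    best_j, best_score = None, np.inf
    for j in range(p):
        on = X[:, j] == 1
        if on.all() or not on.any():      # descriptor does not split the node
            continue
        score = (on.sum() * entropy(y[on]) + (~on).sum() * entropy(y[~on])) / n
        if score < best_score:
            best_j, best_score = j, score
    return best_j, best_score

# Toy data (made up): the third descriptor separates the two classes.
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 0]])
y = np.array([3, 3, 0, 0])
print(best_single_split(X, y))
```

The loop visits each of the p descriptors once, which is the "search space: size p" noted on the slide.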

  7. Recursive Partitioning: Rules are complex [Figure: recursive partitioning tree] • Are all splits necessary for the activity mechanism? • Does an early split impede identification of other mechanisms?

  8. Recursive Partitioning: Focus of Study Need definitions for: • Search space • Purity measure, splitting criterion • Stopping rule Binary Formal Inference-Based Recursive Modeling (BFIRM) • Cho, Shen, Hermsmeier (2000, JCICS) • Rank predictors according to F-test • Combine important predictors to form splitting variable • Result is better QSAR rules Recursive Partitioning/Simulated Annealing (RP/SA) • Blower et al. (2002, JCICS) • Best single predictor not necessarily best in combination Tree Harvesting • Yuan, Chipman, Welch (2006 tech report) • “Trim” bits off each terminal node

  9. Recursive Partitioning: RP/SA • Splitting variables are based on a combination of K predictors • Features are always present: • Search space of size • Uses simulated annealing – stochastic optimization • K is held fixed for all splits, and is assumed known
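
The search-space size on this slide was not captured in the transcript. As a rough illustration only, the sketch below counts unordered sets of K descriptors chosen from p, under the assumption that this is what a combination search ranges over. The function name and the example values of p and K are illustrative.

```python
from math import comb

def n_descriptor_sets(p, k):
    """Number of ways to choose k of p descriptors, ignoring order."""
    return comb(p, k)

# Illustrative values: 500 binary descriptors, combinations of 5.
print(n_descriptor_sets(500, 5))
```

Even for moderate p and K the count is far too large to enumerate, which is consistent with the slide's use of stochastic optimization.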

  10. OBSTree • Splitting variables are based on a combination of K predictors • Combine approaches of BFIRM and RP/SA • Features can be present or absent: chromosome selection • Search space of size • Uses simulated annealing + weighted sampling + trimming • “K” can vary across splits, and is assumed unknown • Uses a penalty entropy splitting criterion • Usual stopping criteria applied, including cross validation
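
A minimal sketch of a presence/absence split, assuming a "chromosome" is a 0/1 pattern over the chosen descriptors and a compound goes to the "True" branch only if it matches the whole pattern. The function name and toy matrix are illustrative, not taken from the talk.

```python
import numpy as np

def chromosome_split(X, descriptors, chromosome):
    """Boolean mask: True where a compound matches the whole bit pattern."""
    sub = X[:, descriptors]
    return (sub == np.asarray(chromosome)).all(axis=1)

# Toy descriptor matrix (made up): 4 compounds, 4 binary descriptors.
X = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 0, 0]])

# Require descriptors 0 and 1 present and descriptor 2 absent.
mask = chromosome_split(X, descriptors=[0, 1, 2], chromosome=[1, 1, 0])
print(mask)
```

Under this reading, each set of K descriptors admits 2^K possible chromosomes, so allowing absence enlarges the search space relative to presence-only combinations.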

  11. OBSTree: Flowchart. Pre-OBSTree Setup • Remove unary descriptors • Determine Singly Important group • Specify parameters [Diagram labels: Descriptor Pool, RP, Singly Important Descriptors, General Descriptors]

  12. OBSTree: Flowchart. Pre-OBSTree Setup • Remove unary descriptors • Determine Singly Important group • Specify parameters. Initialize split at next depth: • depth = depth + 1 • a set of K descriptors (X0) using WSS • Determine best chromosome x0 of initial X0. Check: depth = d or node size < 2minor, Ymax = 0 or Ybar > M-1? Yes: form last terminal node, STOP. No: SA to determine “optimal” (XA, xA) for split using WSS. Trim • Check 2^K − 1 subsets of current (XA, xA) • Report best trimmed version as (X*, x*). X*=x*? Yes: form terminal node
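
The trimming step can be sketched as an exhaustive check of sub-patterns, reading the subset count on the slide as 2^K − 1, the number of non-empty subsets of K descriptors. The function names and the lower-is-better `score` callback are assumptions for illustration; the talk does not specify them.

```python
from itertools import combinations

def nonempty_subsets(descriptors, chromosome):
    """Yield every non-empty sub-pattern of a (descriptors, chromosome) pair."""
    pairs = list(zip(descriptors, chromosome))
    for size in range(1, len(pairs) + 1):
        for subset in combinations(pairs, size):
            idx, bits = zip(*subset)
            yield list(idx), list(bits)

def trim(descriptors, chromosome, score):
    """Return the best-scoring sub-pattern; `score` is lower-is-better."""
    return min(nonempty_subsets(descriptors, chromosome),
               key=lambda c: score(*c))

# K = 4 gives 2**4 - 1 = 15 candidate sub-patterns.
candidates = list(nonempty_subsets([81, 177, 579, 183], [1, 1, 1, 0]))
print(len(candidates))
```

Because smaller subsets are generated first and `min` keeps the first minimum, ties are resolved in favor of shorter rules in this sketch.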

  13. OBSTree: Splitting Criterion • Node has N compounds • Class i has proportion p_i in the node, with a total of n_i in the node • Entropy (node impurity): • Penalty Entropy (penalize unwanted category) Problem: Entropy=0 (perfect) when a class of junk compounds is identified
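
The formulas on this slide were not captured in the transcript. The sketch below uses the standard Shannon entropy to reproduce the stated problem: a node containing only junk compounds scores a perfect 0. The `penalized_entropy` function is a hypothetical penalty, included only to show the idea of one-sided purity, and is not the definition used in the talk. Treating class 0 as the junk class is also an assumption.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (bits) from per-class counts n_i in a node."""
    n = np.asarray(counts, dtype=float)
    p = n[n > 0] / n.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

def penalized_entropy(counts, unwanted=0, weight=1.0):
    """HYPOTHETICAL penalty: entropy plus weight * proportion of unwanted class."""
    n = np.asarray(counts, dtype=float)
    return entropy(counts) + weight * n[unwanted] / n.sum()

junk_node = [40, 0, 0, 0]     # only class-0 compounds (assumed to be junk)
active_node = [0, 0, 0, 40]   # only class-3 compounds

# Plain entropy cannot tell these two pure nodes apart.
print(entropy(junk_node), entropy(active_node))
# The illustrative penalty ranks the active node ahead of the junk node.
print(penalized_entropy(junk_node), penalized_entropy(active_node))
```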

  14. OBSTree: Stopping Criteria • Maximum depth d • The most active compound is junk • The node size is less than 2j (j is the minimum node size) • 5-fold cross-validation, e.g., choose depth d if • # correct classifications levels off at depth d • Accept H0: p_{d+1} = 0 for p_{d+1} = sensitivity between depths d and d+1
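
The first three criteria can be sketched as a simple predicate, assuming class 0 is the junk class and that `y` holds the activity labels of the compounds in the node. The function and parameter names are illustrative. The cross-validation rule is not covered here.

```python
def should_stop(y, depth, max_depth, min_node_size):
    """True if a node should become terminal under the three simple rules."""
    if depth >= max_depth:              # maximum depth d reached
        return True
    if max(y) == 0:                     # most active compound is junk (class 0)
        return True
    if len(y) < 2 * min_node_size:      # fewer than 2j compounds in the node
        return True
    return False

print(should_stop([0, 0, 3, 1, 0, 0, 0, 2, 0, 0], depth=1, max_depth=4, min_node_size=5))
```

A node with fewer than 2j compounds cannot be divided into two children that both meet the minimum size j, which is one way to read the third rule.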

  15. Simulation Study • 1000 compounds, 500 binary descriptors • Four active groups (20 compounds per group) – 8% active
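
A dataset with these dimensions could be generated along the following lines. The mechanisms, the background bit density of 0.05, and the activity levels are all hypothetical choices for illustration; the transcript does not record how the study's data were actually constructed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_compounds, n_descriptors, group_size = 1000, 500, 20

# Sparse random background: each bit is on with probability 0.05 (assumed).
X = (rng.random((n_compounds, n_descriptors)) < 0.05).astype(int)
y = np.zeros(n_compounds, dtype=int)

# HYPOTHETICAL mechanisms: (descriptor indices, required bits, activity level).
mechanisms = [
    ([0, 1, 2, 3, 4],      [1, 0, 1, 0, 1], 3),
    ([5, 6, 7, 8, 9],      [0, 1, 1, 1, 1], 3),
    ([10, 11, 12, 13, 14], [1, 1, 1, 1, 1], 2),
    ([15, 16, 17, 18, 19], [1, 1, 0, 1, 1], 1),
]
for g, (idx, bits, level) in enumerate(mechanisms):
    rows = np.arange(g * group_size, (g + 1) * group_size)
    X[np.ix_(rows, idx)] = bits   # plant the pattern in this group's compounds
    y[rows] = level

# Four groups of 20 out of 1000 compounds gives the 8% active rate.
print((y > 0).mean())
```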

  16. Simulation Study: Standard RP Tree [Figure: standard RP tree; annotations: “5 compounds of 3 + 5 compounds of 0” and “7 compounds of 3”]

  17. Simulation Study: Sample OBSTree [Tree diagram, labels in transcript order, splits written as descriptors/chromosome: 1,2,3,4,5/1,0,1,0,1; 3; 15,16,17,18,19/1,1,0,1,1; 1; 5,6,7,8,9/0,1,1,1,1; 3; 3,11,12,13,17/1,1,1,1,1; 0; 2]

  18. Simulation Study: 5-fold Cross-validation [Figure: cross-validation results, panels labeled OBSTree and RP]

  19. Simulation Study: Sensitivity Analysis • K, descriptor set size • K >7 perfectly found all mechanisms • K =7 perfectly found all but one mechanism • Basic tree parameters • Min node size is 5 • SA parameters • Initial temperature • Minimum temperature • Temperature reduction rate • # transitions at a given temperature • # failures to accept new point before increasing transition counter • Sampling weights in WSS
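
To show where the SA parameters listed above enter, here is a generic simulated-annealing loop over sets of K descriptors. It is a textbook-style sketch, not the OBSTree implementation: the neighbor move, the Metropolis acceptance rule, the default values, and the toy objective are all assumptions, and the failure counter and WSS sampling weights from the slide are omitted.

```python
import math
import random

def anneal(score, p, k, t_init=1.0, t_min=1e-3, cooling=0.9,
           n_transitions=50, seed=0):
    """Minimize score(subset) over k-subsets of range(p) by simulated annealing."""
    rng = random.Random(seed)
    current = rng.sample(range(p), k)
    cur_val = score(current)
    best, best_val = list(current), cur_val
    t = t_init                                   # initial temperature
    while t > t_min:                             # minimum temperature
        for _ in range(n_transitions):           # transitions per temperature
            # Propose a neighbor: swap one chosen descriptor for an unused one.
            cand = list(current)
            unused = [j for j in range(p) if j not in cand]
            cand[rng.randrange(k)] = rng.choice(unused)
            val = score(cand)
            # Metropolis rule: accept improvements, sometimes accept worse moves.
            if val <= cur_val or rng.random() < math.exp((cur_val - val) / t):
                current, cur_val = cand, val
                if val < best_val:
                    best, best_val = list(cand), val
        t *= cooling                             # temperature reduction rate
    return sorted(best), best_val

# Toy objective (made up): number of target descriptors missing from the set.
target = {2, 5, 7}
result, value = anneal(lambda s: len(target - set(s)), p=20, k=3)
print(result, value)
```

Because the search is stochastic, different seeds can return different subsets, which matches the summary's remark that the output is not deterministic.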

  20. Screening to Identify MAO Inhibitors • Neuronal MAO deactivates neurotransmitters • Pargyline, an MAO inhibitor, was used to treat depression • MAO inhibitors no longer used due to toxicity & interactions • Abbott Laboratories dataset of MAO inhibitors • Brown & Martin (1996 JCICS) • 1646 chemically diverse compounds • 1380 binary 2D atom-pair descriptors • Response variable – 0, 1, 2, 3 (ordered data) [1358/114/86/88] • Category 3 has 2 well-known mechanisms - Rusinko et al. (1999 JCICS)

  21. [Figure: trees labeled OBSTree, RP/SA, and RP. Labels in transcript order: 184,721,879; 81,177,579,183/1,1,1,0; 0/0/0/33; 0/1/0/6; 32,572,844; 1,579,1184,809/1,1,1,0; 0/0/0/15; 1/0/1/26; 704; 1184; 81; 9/2/1/2; 2/0/0/32; 65; 2/0/5/24; 183; 959/85/55/18; 99/1/0/0]

  22. MAO: Activity Mechanism I • “Irreversible binding to flavin cofactor of MAO” • Pargyline-like compounds • Typical features of pargyline-like compounds • A triple bond • A tertiary nitrogen • An aromatic ring • 1st terminal node of OBSTree • Highest active terminal node of RP • 1st terminal node of RP/SA

  23. MAO: Activity Mechanism I • Compound 1: Pargyline, y=3, has 579 & 81 & 177 but not 183 • Compound 2: y=0, has feature 183 so violates OBSTree • Compound 3: y=0, falls in active node from RP • Compound 4: y=0, falls in active node from RP and RP/SA

  24. MAO: Activity Mechanism II • “Binding to active site” • –N-N-C(=O)- is a hydrazine feature that can be hydrolyzed to bind protein (MAO) as a nonselective, irreversible inhibitor [Figure: chemical structure annotated with descriptors 579: C(2,1)-3-C(3,1); 1: C(1,0)-3-C(1,0); 1184: N(2,0)-2-N(2,0)]

  25. Absent Descriptor (809: C(3,1)-4-Br) [Figure: two chemical structures labeled Activity=3 and Activity=2; annotations mark Br atoms and the atom pair C(3,1)-4-Br]

  26. Summary • OBSTree: new RP algorithm for obtaining simplified output • Model presence and absence of molecular features • Combination size is data-driven, varies over splits • Penalty entropy splitting criterion for one-sided purity • Weighted sampling during optimization allows prior information • Simpler verification of QSAR • Standard RP and RP/SA are special cases of OBSTree • Output is not deterministic • As with any RP output, care should be taken when interpreting the results • Can miss highly correlated but important predictors • Different trees provide similar partitions of the data • Because of hard thresholding, predictions are highly variable • Computationally intensive!

  27. Acknowledgements • Atina Brooks, North Carolina State University • Jiajun Liu, Merck • Haojun Ouyang, North Carolina State University • Abbott Laboratories • Jack Liu, OmicSoft • Jun Feng, NIH • GoldenHelix
