Outline

Presentation Transcript


  1. Outline • Motivation: Imprecise Data in BioInformatics • Drug Discovery and High Throughput Screening • Finding Clusters in Chemistry Space • Synergies: Clever Algorithms and Domain Knowledge • Mining Molecular Fragments • What the Chemist really wants: Imprecision (Fuzzy Atoms and flexible Chains) • Some Experimental Results (NCI HIV screens) • Conclusions & Outlook: Learning from Past Experience • Bits & Pieces of Evidence • Storing and Retrieving Knowledge

  2. Drug Discovery… Classic: • Expert Knowledge available: • Metabolic pathway information • Binding site information • After Specific Target is identified: • Generate Assay to identify desirable effect • Assemble & Test (focused) library of compounds • First Phase: High Throughput Screening (HTS): often hundreds of thousands of molecules tested in a highly automated fashion • …After clever data analysis… • Second Phase: Test a few hundred compounds more carefully (IC50)

  3. …Drug Discovery • And then (in the remaining 8-9 years): • Animal Testing • Several rounds of clinical testing • Approval procedures • And most often: late stage failure • Go back to start, do not collect $1,000,000,000 • Lead Rescue: eliminate side effects (ADME/Tox, cardiac effects, sometimes also avoid patents…); avoid bad areas in “drug space” (lead hopping)

  4. High Throughput Screening • Rapidly screen hundreds of thousands of candidates. • Problems • Often thousands of actives • Data extremely noisy (up to 50% false positives, unknown false negatives!) • Positives almost always active for different reasons → separate, diverse clusters! • Goal: Find common properties among similar subsets of active molecules (help the user understand activity patterns!)

  5. What is “Similarity”? Tropacocaine 1518-12246 – Local Anesthetic

  6. Types of Similarity • Structural similarity: • Same basic layout of overall graph • …or at least existence of a common subgraph • Geometrical similarity: • Roughly same shape in 3D, independent of exact atom matches • Instead of simple shape, also other properties (surface charge…) can be compared • Global properties: • Molecular weight • Number of hydrogen-bond donors/acceptors… • And many others…

  7. Motivation • Goal: Find (and describe!) structural groups of molecules that share activity. • For few molecules, manual inspection is feasible.

  8. Motivation • Goal: Find (and describe!) structural groups of molecules that share activity. • For few molecules, manual inspection is feasible. • For more molecules, automated methods are needed…

  9. Motivation

  10. Motivation

  11. Motivation

  12. Molecular Fragment Miner (MoFa) [Ch. Borgelt, M. R. Berthold, IEEE Data Mining, 2002.] Goal: • Find Fragments that are discriminative for a class of interest (high activity, good synthesis result, …): • Appear often in Positives: freq(high activity) > threshold • Appear rarely in Negatives: freq(low activity) < delta • MoFa: • Based on Market Basket Analysis (Eclat Algorithm) • Grow Fragment-Candidates from scratch atom-by-atom • Only report significant and unique fragments
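The two frequency conditions on this slide can be sketched as a simple filter. This is only an illustration, assuming the number of molecules containing a fragment has already been counted per class; the function name `is_discriminative` and the parameters `min_active_freq` and `max_inactive_freq` (standing in for the slide's threshold and delta) are hypothetical and not taken from MoFa.

```python
def is_discriminative(n_active_hits, n_active, n_inactive_hits, n_inactive,
                      min_active_freq, max_inactive_freq):
    """Return True if a fragment is frequent among actives and rare among inactives."""
    freq_active = n_active_hits / n_active        # freq(high activity)
    freq_inactive = n_inactive_hits / n_inactive  # freq(low activity)
    return freq_active > min_active_freq and freq_inactive < max_inactive_freq


# Hypothetical counts: 61 of 400 actives and 9 of 45000 inactives contain the fragment.
print(is_discriminative(61, 400, 9, 45000,
                        min_active_freq=0.15, max_inactive_freq=0.001))
```

Both conditions must hold at once; a fragment that is frequent everywhere fails the second test.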

  13. Example • 6 Example “Molecules” • Find all unique fragments that occur in ≥ 4 molecules [Figure: six example molecules built from C, N, O and S atoms joined by single (_) and double (=) bonds]

  14. Examples: (a) (b) (c) (d) (e) (f) [Figure: the six example molecules]

  15. Examples: (a) (b) (c) (d) (e) (f) [Figure: the six example molecules] • Atoms: C, N, O, S

  16. Examples: (a) (b) (c) (d) (e) (f) [Figure: the six example molecules] • Atoms: C (#=6), N (#=4), O (#=6), S (#=6)

  17. Examples: (a) (b) (c) (d) (e) (f) [Figure: the six example molecules] • Atoms: C (#=6), N (#=4), O (#=6), S (#=6) • One-bond fragments: C-C, C-S, N=S, N-S, O=S, S-C, S-N, S=O, S=N

  18. Examples: (a) (b) (c) (d) (e) (f) [Figure: the six example molecules] • Atoms: C (#=6), N (#=4), O (#=6), S (#=6) • One-bond fragments: C-C (#=3), C-S (#=6), N=S (#=4), N-S (#=4), O=S (#=4), S-C (#=6), S-N (#=4), S=O (#=4), S=N (#=4)

  19. Examples: (a) (b) (c) (d) (e) (f) [Figure: the six example molecules] • Atoms: C (#=6), N (#=4), O (#=6), S (#=6) • One-bond fragments: C-C (#=3), C-S (#=6), N=S (#=4), N-S (#=4), O=S (#=4), S-C (#=6), S-N (#=4), S=O (#=4), S=N (#=4)

  20. Examples: (a) (b) (c) (d) (e) (f) [Figure: the six example molecules] • Atoms: C (#=6), N (#=4), O (#=6), S (#=6) • One-bond fragments: S-C (#=6), S-N (#=4), S=O (#=4), S=N (#=4)

  21. Examples: (a) (b) (c) (d) (e) (f) [Figure: the six example molecules] • Atoms: C (#=6), N (#=4), O (#=6), S (#=6) • One-bond fragments: S-C (#=6), S-N (#=4), S=O (#=4), S=N (#=4) • Two-bond fragments (| and || are single and double bonds drawn vertically in the original figure): S-C-C; S-C | N; S-C || N; S-C || O; S=N || O; S-N | C; S-N || N; S-N || O; S=N | C; S=N | N; S=O | C; S=O | N; S=O || N

  22. Examples: (a) (b) (c) (d) (e) (f) [Figure: the six example molecules] • Atoms: C (#=6), N (#=4), O (#=6), S (#=6) • One-bond fragments: S-C (#=6), S-N (#=4), S=O (#=4), S=N (#=4) • Two-bond fragments: S-C-C; S-C | N; S-C || N; S-C || O; S=N || O; S-N | C; S-N || N; S-N || O; S=N | C; S=N | N; S=O | C; S=O | N; S=O || N

  23. Examples: (a) (b) (c) (d) (e) (f) [Figure: the six example molecules] • Atoms: C (#=6), N (#=4), O (#=6), S (#=6) • One-bond fragments: S-C (#=6), S-N (#=4), S=O (#=4), S=N (#=4) • Two-bond fragments: S-C-C; S-C | N; S-C || N; S-C || O; S=N || O; S-N | C; S-N || N; S-N || O; S=N | C; S=N | N; S=O | C; S=O | N; S=O || N

  24. Duplicate Fragments! • How do Apriori, Eclat & Co. avoid duplicate itemsets? Prefix tree [Figure: prefix tree over the items a, b, c with nodes a; b; c; a,b; a,c; b,c; a,b,c] • BUT: a prefix tree requires a global order defined on the items…
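To illustrate why an item order removes duplicates, here is a small sketch on plain item sets (not molecular graphs). The function names `grow_naive` and `grow_ordered` are made up for this illustration; the ordered variant only appends items that come after the last item added, which is the idea behind a prefix tree.

```python
def grow_naive(items, size):
    """Grow item sets by adding any unused item; the same set is reached repeatedly."""
    level = [(i,) for i in items]
    for _ in range(size - 1):
        level = [s + (i,) for s in level for i in items if i not in s]
    return level


def grow_ordered(items, size):
    """Grow item sets only with items ranked after the last one (prefix-tree order)."""
    rank = {item: pos for pos, item in enumerate(items)}
    level = [(i,) for i in items]
    for _ in range(size - 1):
        level = [s + (i,) for s in level for i in items if rank[i] > rank[s[-1]]]
    return level


items = ["a", "b", "c"]
naive = grow_naive(items, 2)
print(len(naive), "candidates,", len({frozenset(s) for s in naive}), "distinct sets")
print(grow_ordered(items, 2))
```

With three items, the naive growth reaches every two-item set twice (once from each member), while the ordered growth reaches it once.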

  25. Local Order on Atoms/Bonds • Global order on atoms/bonds is not possible • Use local order on atoms: C < N < O < S • In case of same atom type, use secondary order based on bond: single (-) < aromatic < double (=) < triple • Higher (or equal) extensions are only allowed on last atom extended, and • All extensions are allowed on atoms inserted after last atom extended.
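One possible reading of these rules as code, offered as a sketch rather than the MoFa implementation. It assumes that atoms of a fragment are indexed in the order they were inserted, that each extension is compared first by atom type and then by bond type, and that atoms inserted before the last extended atom may no longer be extended. The names `extension_key` and `extension_allowed`, and the bond symbols ":" (aromatic) and "#" (triple), are assumptions made for this illustration.

```python
ATOM_RANK = {"C": 0, "N": 1, "O": 2, "S": 3}   # C < N < O < S
BOND_RANK = {"-": 0, ":": 1, "=": 2, "#": 3}   # single < aromatic < double < triple


def extension_key(bond, atom):
    """Order extensions by atom type first, then by bond type."""
    return (ATOM_RANK[atom], BOND_RANK[bond])


def extension_allowed(anchor_index, bond, atom, last_anchor_index, last_key):
    """Decide whether adding `bond`+`atom` at fragment atom `anchor_index` is permitted.

    last_anchor_index: index of the atom that received the previous extension.
    last_key: extension_key of that previous extension.
    """
    if anchor_index > last_anchor_index:
        return True                       # atoms inserted later: everything allowed
    if anchor_index == last_anchor_index:
        return extension_key(bond, atom) >= last_key  # only higher or equal
    return False                          # assumed: earlier atoms are closed


# Seed S (index 0) was last extended with a single-bonded N (which became index 1).
last = extension_key("-", "N")
print(extension_allowed(0, "=", "O", 0, last))  # O ranks above N
print(extension_allowed(0, "-", "C", 0, last))  # C ranks below N
print(extension_allowed(1, "-", "C", 0, last))  # extension on the newer atom
```

Under this reading, once S has been extended with N, a later C on the same S is rejected, so a fragment with both neighbours is only generated along the branch that added C first.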

  26. Examples: (a) (b) (c) (d) (e) (f) [Figure: the six example molecules] • Atoms: C (#=6), N (#=4), O (#=6), S (#=6) • One-bond fragments: S-C (#=6), S-N (#=4), S=O (#=4), S=N (#=4) • Two-bond fragments (| and || are single and double bonds drawn vertically in the original figure): S-C-C; S-C | N; S-C || N; S-C || O; S=N || O; S-N | C; S-N || N; S-N || O; S=N | C; S=N | N; S=O | C; S=O | N; S=O || N

  27. Examples: (a) (b) (c) (d) (e) (f) [Figure: the six example molecules] • Atoms: C (#=6), N (#=4), O (#=6), S (#=6) • One-bond fragments: S-C (#=6), S-N (#=4), S=O (#=4), S=N (#=4) • Two-bond fragments: S-C-C; S-C | N; S-C || N; S-C || O; S-N || N; S-N || O

  28. Examples: (a) (b) (c) (d) (e) (f) [Figure: the six example molecules] • Atoms: C (#=6), N (#=4), O (#=6), S (#=6) • One-bond fragments: S-C (#=6), S-N (#=4), S=O (#=4), S=N (#=4) • Two-bond fragments: S-C-C (#=3); S-C | N (#=4); S-C || N (#=4); S-C || O (#=4); S-N || N (#=2); S-N || O (#=2); S=N || O (#=2)

  29. Support Based Pruning • Support of fragment A: supp(A) = frequency of appearance in molecules • Monotone conditions decline with size of fragment: fragment A is contained in fragment B ⇒ supp(A) ≥ supp(B) • If supp(node) in a branch is below the threshold, then all child nodes will also be below the threshold.
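The pruning rule can be sketched on a simplified model in which each molecule is just a set of labels and a fragment is a subset of labels. Real fragment mining has to match subgraphs, so this only illustrates the support argument. The function names and the toy data below are hypothetical.

```python
def support(fragment, molecules):
    """Number of molecules that contain every item of the fragment."""
    return sum(1 for m in molecules if fragment <= m)


def frequent_fragments(molecules, min_support):
    """Depth-first search over item sets, cutting branches below min_support."""
    items = sorted(set().union(*molecules))
    result = {}

    def grow(fragment, start):
        for pos in range(start, len(items)):
            candidate = fragment | {items[pos]}
            supp = support(candidate, molecules)
            if supp < min_support:
                continue  # prune: no superset of candidate can have higher support
            result[candidate] = supp
            grow(candidate, pos + 1)

    grow(frozenset(), 0)
    return result


# Toy data (not the molecules from the slides).
molecules = [
    {"C-S", "S-N", "S=O"},
    {"C-S", "S-N"},
    {"C-S", "S=O"},
    {"C-S", "S-N", "S=N"},
]
for fragment, supp in frequent_fragments(molecules, min_support=3).items():
    print(sorted(fragment), supp)
```

Because a larger fragment can only occur where all of its parts occur, skipping a branch as soon as its support drops below the threshold never loses a frequent fragment.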

  30. Examples: (a) (b) (c) (d) (e) (f) [Figure: the six example molecules] • Atoms: C (#=6), N (#=4), O (#=6), S (#=6) • One-bond fragments: S-C (#=6), S-N (#=4), S=O (#=4), S=N (#=4) • Two-bond fragments (| and || are single and double bonds drawn vertically in the original figure): S-C-C (#=3); S-C | N (#=4); S-C || N (#=4); S-C || O (#=4); S-N || N (#=2); S-N || O (#=2); S=N || O (#=2)

  31. Examples: (a) (b) (c) (d) (e) (f) [Figure: the six example molecules] • Atoms: C (#=6), N (#=4), O (#=6), S (#=6) • One-bond fragments: S-C (#=6), S-N (#=4), S=O (#=4), S=N (#=4) • Two-bond fragments: S-C | N (#=4); S-C || N (#=4); S-C || O (#=4)

  32. Examples: (a) (b) (c) (d) (e) (f) [Figure: the six example molecules] • Resulting fragments for supp(A) ≥ 4: S-C || O (#=4); S-C || N (#=4); C-S (#=6); C-S-N (#=4)

  33. Examples: (a) (b) (c) (d) (e) (f) [Figure: the six example molecules] • Some fragments which are not reported (due to redundant support): S || N (#=4); C (#=6); S (#=6) • Resulting fragments for supp(A) ≥ 4: S-C || O (#=4); S-C || N (#=4); C-S (#=6); C-S-N (#=4)

  34. Discriminative Fragments • Just finding frequent fragments is usually not interesting • Find fragments that are • frequent in one class of molecules • and infrequent in the remainder of molecules • Discriminative Fragments summarize shared properties. • The number of actives and inactives (and the ratio) that contain a fragment indicates its relevance.
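As a rough illustration of the last bullet, the sketch below turns per-class counts into percentages and sorts fragments by the active/inactive ratio. The names `class_coverage` and `rank_by_ratio`, the fragment labels, and all counts are invented for this example; treating a fragment with no inactive hits as having an infinite ratio is one possible choice, not something stated on the slides.

```python
def class_coverage(fragment_hits, n_active, n_inactive):
    """Map fragment -> (percent of actives, percent of inactives) containing it."""
    return {
        name: (100.0 * a / n_active, 100.0 * i / n_inactive)
        for name, (a, i) in fragment_hits.items()
    }


def rank_by_ratio(coverage):
    """Sort fragments by active/inactive coverage ratio, highest first."""
    def ratio(item):
        active_pct, inactive_pct = item[1]
        return float("inf") if inactive_pct == 0 else active_pct / inactive_pct
    return sorted(coverage.items(), key=ratio, reverse=True)


# Hypothetical (active hits, inactive hits) per fragment.
hits = {"frag_1": (40, 10), "frag_2": (30, 0), "frag_3": (20, 200)}
ranked = rank_by_ratio(class_coverage(hits, n_active=400, n_inactive=44600))
for name, (act, inact) in ranked:
    print(f"{name}: {act:.2f}% vs. {inact:.2f}%")
```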

  35. Example: [NCI HIV dataset, ~45000 compounds (~400 active), threshold = 15%] ….. 15.08% vs. 0.02%

  36. A few more fragments… 5.23% vs. 0.05% • 5.23% vs. 0.08% • 4.92% vs. 0.07% • 9.85% vs. 0.07% • 9.85% vs. 0.0% • 10.15% vs. 0.04%

  37. Problems… • However, some fragments puzzled our chemists… • Two of the underlying molecules: [Figure]

  38. Small Differences…

  39. Chemists’ view • Strict graph-based view of molecules is too restrictive • Some tolerances do not affect function, e.g.: • In a specific context, some atoms may be of different type (e.g. N/C equivalence in aromatic rings, all halogens are equivalent, …) • The exact length of a chain connecting two rigid substructures does not matter (e.g. chains of CH2 can be 2-4 carbons long, …)
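The chain-length tolerance can be illustrated with a deliberately simplified sketch. It assumes an unbranched molecule written as a plain string of atom symbols (not SMILES) and uses a regular expression with a bounded repeat; the helper name `flexible_chain_pattern` and the example strings are hypothetical.

```python
import re


def flexible_chain_pattern(left, chain_atom, min_len, max_len, right):
    """Build a pattern: `left`, then min_len..max_len chain atoms, then `right`."""
    chain = f"(?:{re.escape(chain_atom)}){{{min_len},{max_len}}}"
    return re.compile(re.escape(left) + chain + re.escape(right))


# Two end groups N and S joined by a carbon chain of length 2 to 4.
pattern = flexible_chain_pattern("N", "C", 2, 4, "S")
for atoms in ["NCS", "NCCS", "NCCCCS", "NCCCCCS"]:
    print(atoms, bool(pattern.fullmatch(atoms)))
```

A strict graph match would treat the two-, three- and four-carbon variants as three unrelated fragments; the bounded repeat groups them into one.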

  40. Fuzzy Matches [H. Hofer, Ch. Borgelt, M. R. Berthold, IDA, Berlin, 2003] Specifying wildcards via equivalence classes, here • Meta Atoms: Certain atoms can be matched • Maximum number of fuzzy-atoms allowed • Equivalence classes can overlap (e.g. {O,C} and {C,N}) • Fuzzy Chains: Model flexible chains explicitly • Specify min/max length of chains
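A minimal sketch of matching with overlapping equivalence classes and a cap on the number of fuzzy atoms. It compares two equally long atom sequences position by position, which is a simplification of matching fragments inside molecular graphs; `atoms_equivalent`, `fuzzy_match` and `max_fuzzy` are names chosen for this illustration.

```python
def atoms_equivalent(a, b, classes):
    """True if a and b differ but share at least one equivalence class."""
    return a != b and any(a in c and b in c for c in classes)


def fuzzy_match(pattern, atoms, classes, max_fuzzy):
    """Match two atom sequences, allowing at most max_fuzzy class-based substitutions."""
    if len(pattern) != len(atoms):
        return False
    fuzzy_used = 0
    for p, a in zip(pattern, atoms):
        if p == a:
            continue                      # exact match costs nothing
        if not atoms_equivalent(p, a, classes):
            return False                  # different atoms with no common class
        fuzzy_used += 1
    return fuzzy_used <= max_fuzzy


classes = [{"O", "C"}, {"C", "N"}]        # overlapping classes, as on the slide
print(fuzzy_match(["C", "S", "N"], ["O", "S", "N"], classes, max_fuzzy=1))
print(fuzzy_match(["O", "S", "N"], ["N", "S", "N"], classes, max_fuzzy=1))
```

Overlapping classes are not transitive in this sketch: C may stand in for O or for N, but O and N share no class and therefore do not match each other.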

  41. Fuzzy Atoms and Chains

  42. MoFa - Summary • Search based on parallel embeddings and large scale data mining algorithm (Apriori/Eclat) • Computationally very efficient • Discovered knowledge is immediately meaningful • Fragments understandable to chemist • Better than rules/decision trees on mystic attributes • Really useful after incorporating Expert Feedback: • Markush structures: allow for wildcards in fragments (fuzzy atoms and chains of flexible length) • Applied successfully to HTS data analysis, chemical synthesis success prediction.

  43. Conclusions… • Data Analysis in Life Sciences is inherently: • multi-disciplinary • imprecise • interactive • context-dependent notions of similarity • Focus is not exclusively on building good predictors • Instead the user wants understandable pieces of knowledge (“Information Mining”). • Value of knowledge depends on archival… • Store & Retrieve past “experience” • … and on usability

  44. Knowledge Recycling • HIV activity related fragments • New Structure: Good Candidate?

  45. Knowledge Recycling • Synthesis success fragments • HIV activity related fragments • Metabolic pathway information • 1000 molecule ion channel side effects (hERG assay) • 5000 molecule rat liver toxicity tests • New Structure: Good Candidate? • Rat cancer CoMFA model • Cluster model (3D similarity) kidney side effects • Gene expression data for other diseased cells • Competitors' patent space

  46. Knowledge Recycling Hardly ever do we find precise fits • find similar structures • chemical similarity • activity related similarity • … • determine related context • cardiac effects vs. ion channel effects (hERG assay) • appear in same metabolic pathway • Related gene expression profiles • … • and finally draw (inherently imprecise!) inferences Knowledge Archival, Management and Usability are crucial.

  47. Thank you. Preprints/Remarks/further Questions: send eMail to berthold@ieee.org
