Mining Imprecise Discriminative Molecular Fragments. Michael R. Berthold, Data Analysis Research Group, Tripos, Inc., South San Francisco, California, USA, and ALTANA-Chair for Bioinformatics and Information Mining, University of Konstanz, Germany. Email: berthold@inf.uni-konstanz.de
Outline • Motivation: Imprecise Data in Bioinformatics • Drug Discovery and High Throughput Screening • Finding Clusters in Chemistry Space • Synergies: Clever Algorithms and Domain Knowledge • Mining Molecular Fragments • What the Chemist really wants: Imprecision (Fuzzy Atoms and Flexible Chains) • Some Experimental Results (NCI HIV screens) • Conclusions
Drug Discovery… Classic: • Expert Knowledge available: • Metabolic pathway information • Binding site information • After Specific Target is identified: • Generate Assay to identify desirable effect • Assemble & Test (focused) library of compounds • First Phase: High Throughput Screening (HTS). Often hundreds of thousands of molecules are tested in highly automated fashion • …After clever data analysis… • Second Phase: Test a few hundred compounds more carefully (IC50)
…Drug Discovery • And then (in the remaining 8-9 years): • Animal Testing • Several rounds of clinical testing • Approval procedures • And most often: late stage failure • Go back to start, do not collect $1,000,000,000 • Lead Rescue: eliminate side effects (ADME/Tox, cardiac effects, sometimes also avoid patents…); avoid bad areas in “drug space” (lead hopping)
High Throughput Screening • Rapidly screen hundreds of thousands of candidates. • Problems: • Often thousands of actives • Data extremely noisy (up to 50% false positives, unknown false negatives!) • Positives almost always active for different reasons: separate, diverse clusters! • Goal: Find common properties among similar subsets of active molecules (help user understand activity patterns!)
Motivation • Goal: Find (and describe!) structural groups of molecules that share activity. • For few molecules, manual inspection is feasible. • For more molecules, automated methods are needed…
Molecular Fragment Miner (MoFa) [Ch. Borgelt, M. R. Berthold, IEEE Data Mining, 2002.] Goal: • Find fragments that are discriminative for a class of interest (high activity, good synthesis result, …): • Appear often in positives: freq(high activity) > threshold • Appear rarely in negatives: freq(low activity) < delta MoFa: • Based on Market Basket Analysis (Eclat algorithm) • Grow fragment candidates from scratch, atom by atom • Only report significant and unique fragments
Example • 6 example “molecules”, (a) to (f) • Find all unique fragments that occur in at least 4 molecules
[Figure: six example molecules (a) to (f), built from C, N, O and S atoms]
Search steps shown on the slides (#= is the number of molecules containing the fragment; | and || are single and double bonds drawn vertically on the slide):
• Single atoms: C #=6, N #=4, O #=6, S #=6
• Two-atom fragments: C-C #=3, C-S #=6, N=S #=4, N-S #=4, O=S #=4, S-C #=6, S-N #=4, S=O #=4, S=N #=4
• Two-atom fragments kept for further growth: S-C #=6, S-N #=4, S=O #=4, S=N #=4
• Three-atom candidates grown from these: S-C-C, S-C|N, S-C||N, S-C||O, S=N||O, S-N|C, S-N||N, S-N||O, S=N|C, S=N|N, S=O|C, S=O|N, S=O||N
Duplicate Fragments! • How do Apriori, Eclat & Co. avoid duplicate itemsets? With a prefix tree. [Figure: prefix tree over the items a, b, c with nodes a; b; c; a,b; a,c; b,c; a,b,c] • BUT: a prefix tree requires a global order defined on items…
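The prefix-tree idea for plain itemsets can be sketched as follows. This is a minimal illustration for market-basket items, not MoFa code; the function name `enumerate_itemsets` is an assumption made for this example. Each itemset is reached along exactly one path because a node may only be extended with items that come later in the global order.

```python
from typing import Iterator, List, Tuple


def enumerate_itemsets(items: List[str]) -> Iterator[Tuple[str, ...]]:
    """Yield every non-empty itemset exactly once by walking a prefix tree."""
    ordered = sorted(items)  # the global order the prefix tree relies on

    def grow(prefix: Tuple[str, ...], start: int) -> Iterator[Tuple[str, ...]]:
        for i in range(start, len(ordered)):
            child = prefix + (ordered[i],)
            yield child
            # Only items after position i may be appended, so the set {a, b}
            # is generated as (a, b) and never a second time as (b, a).
            yield from grow(child, i + 1)

    yield from grow((), 0)


itemsets = list(enumerate_itemsets(["a", "b", "c"]))
for itemset in itemsets:
    print(",".join(itemset))
```

For molecular fragments no such global order on atoms and bonds exists, which is why a different ordering rule is needed.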
Local Order on Atoms/Bonds • Global order on atoms/bonds is not possible • Use local order on atoms: C < N < O < S • In case of same atom type, use secondary order based on bond: single (-) < aromatic < double (=) < triple • Higher (or equal) extensions are only allowed on last atom extended, and • All extensions are allowed on atoms inserted after last atom extended.
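One possible reading of these rules, as a hedged sketch: an extension (bond plus new atom) gets a sort key from the atom order first and the bond order second, and an extension is accepted only if it targets an atom inserted after the last extended one, or targets the last extended atom with an equal or higher key. The names `extension_key` and `extension_allowed`, the atom indices, and the bond symbols `:` (aromatic) and `#` (triple) are assumptions for illustration, not taken from MoFa.

```python
ATOM_RANK = {"C": 0, "N": 1, "O": 2, "S": 3}   # C < N < O < S
BOND_RANK = {"-": 0, ":": 1, "=": 2, "#": 3}   # single < aromatic < double < triple


def extension_key(bond: str, atom: str) -> tuple:
    """Sort key of an extension: atom type first, bond type as tie-breaker."""
    return (ATOM_RANK[atom], BOND_RANK[bond])


def extension_allowed(target_index: int, bond: str, atom: str,
                      last_index: int, last_bond: str, last_atom: str) -> bool:
    """Decide whether adding `bond`+`atom` at fragment atom `target_index` is allowed.

    `last_index` is the index of the atom extended in the previous step;
    atoms are numbered in the order in which they were inserted.
    """
    if target_index > last_index:
        return True  # atoms inserted later may be extended freely
    if target_index == last_index:
        return extension_key(bond, atom) >= extension_key(last_bond, last_atom)
    return False  # atoms before the last extended atom are closed


# Fragment S-C: S has index 0, C has index 1, the last extension was "-C" at S.
print(extension_allowed(0, "-", "N", 0, "-", "C"))  # grow S-C with N at S
print(extension_allowed(0, "-", "C", 0, "-", "N"))  # grow S-N with C at S
```

Under this reading, a fragment containing S bonded to both C and N is reached only by adding C first and N second, so it is generated once.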
Example (continued): three-atom candidates remaining after applying the local order, with their support (| and || are single and double bonds drawn vertically on the slide):
• S-C-C #=3, S-C|N #=4, S-C||N #=4, S-C||O #=4, S-N||N #=2, S-N||O #=2, S=N||O #=2
Support Based Pruning • Support of fragment A: supp(A) = frequency of appearance in molecules (the number of molecules that contain A) • Support is anti-monotone, i.e. it can only decline as a fragment grows: fragment A is contained in fragment B ⇒ supp(A) ≥ supp(B) • If supp(node) in a branch is below the threshold, then all child nodes will also be below the threshold, so the branch can be pruned.
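The pruning rule can be illustrated with a toy stand-in in which molecules are plain atom strings and "contained in" means substring containment. This is an assumption made to keep the sketch short: real fragments are subgraphs, and strings are directional (`CS` and `SC` count as different here). The function names and the four example strings are hypothetical.

```python
from typing import Dict, List


def support(fragment: str, molecules: List[str]) -> int:
    """Number of molecules that contain the fragment (here: as a substring)."""
    return sum(fragment in molecule for molecule in molecules)


def frequent_fragments(molecules: List[str], min_support: int) -> Dict[str, int]:
    """Depth-first growth of fragments with support-based pruning."""
    if min_support < 1:
        raise ValueError("min_support must be at least 1")
    alphabet = sorted(set("".join(molecules)))
    result: Dict[str, int] = {}

    def grow(fragment: str) -> None:
        s = support(fragment, molecules)
        if s < min_support:
            return  # prune: every larger fragment in this branch is at most as frequent
        result[fragment] = s
        for atom in alphabet:
            grow(fragment + atom)  # extend by one atom on the right

    for atom in alphabet:
        grow(atom)
    return result


toy_molecules = ["CSN", "CSO", "NSC", "CSNO"]
frequent = frequent_fragments(toy_molecules, min_support=2)
print(frequent)
```

Because `SO` occurs in only one toy molecule, its whole branch is cut and no longer string starting with `SO` is ever counted.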
Example (continued), after support-based pruning with a threshold of 4 (| and || are single and double bonds drawn vertically on the slide):
• Three-atom fragments remaining: S-C|N #=4, S-C||N #=4, S-C||O #=4
• Resulting fragments for supp(A) ≥ 4: S-C||O #=4, S-C||N #=4, C-S #=6, C-S-N #=4
• Some fragments which are not reported (due to redundant support): S||N #=4, C #=6, S #=6
Discriminative Fragments • Just finding frequent fragments is usually not interesting • Find fragments that are • frequent in one class of molecules • and infrequent in the remainder of the molecules • Discriminative fragments summarize shared properties. • The number of actives and inactives that contain a fragment (and their ratio) indicates its relevance.
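The selection criterion can be written down as a small filter. This is a sketch under stated assumptions: the function name, the fragment labels and all counts below are hypothetical illustration values, not results from the talk.

```python
from typing import Dict, Tuple


def discriminative_fragments(
    counts: Dict[str, Tuple[int, int]],
    n_active: int,
    n_inactive: int,
    threshold: float,
    delta: float,
) -> Dict[str, Tuple[float, float]]:
    """Keep fragments that are frequent in actives and rare in inactives.

    counts maps a fragment to (number of actives containing it,
    number of inactives containing it).
    """
    selected = {}
    for fragment, (in_active, in_inactive) in counts.items():
        freq_active = in_active / n_active
        freq_inactive = in_inactive / n_inactive
        if freq_active > threshold and freq_inactive < delta:
            selected[fragment] = (freq_active, freq_inactive)
    return selected


# Hypothetical counts for three made-up fragments.
example_counts = {"frag_A": (60, 9), "frag_B": (200, 20000), "frag_C": (10, 0)}
selected = discriminative_fragments(
    example_counts, n_active=400, n_inactive=40000, threshold=0.10, delta=0.001
)
print(selected)
```

Here `frag_B` is frequent in both classes and `frag_C` is too rare among actives, so only `frag_A` passes both tests.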
Example: [NCI HIV dataset, ~45,000 compounds (~400 active), threshold = 15%] [Figure: discriminative fragment] … 15.08% vs. 0.02%
A few more fragments… [Figure: fragment structures, each labelled with a frequency pair] 5.23% vs. 0.05%; 5.23% vs. 0.08%; 4.92% vs. 0.07%; 9.85% vs. 0.07%; 9.85% vs. 0.0%; 10.15% vs. 0.04%
Problems… However, some fragments puzzled our chemists… Two of the underlying molecules: [Figure: two molecule structures]
Chemists’ view • Strict graph-based view of molecules is too restrictive • Some tolerances do not affect function, e.g.: • In a specific context, some atoms may be of different type (e.g. N/C equivalence in aromatic rings, all halogens are equivalent, …) • The exact length of a chain connecting two rigid substructures does not matter (e.g. chains of CH2 can be 2-4 carbons long, …)
Fuzzy Matches [H. Hofer, Ch. Borgelt, M. R. Berthold, IDA, Berlin, 2003] Specifying wildcards via equivalence classes, here: • Meta atoms: certain atoms can be matched by any member of their equivalence class • Maximum number of fuzzy atoms allowed • Equivalence classes can overlap (e.g. {O,C} and {C,N}) • Fuzzy chains: model flexible chains explicitly • Specify min/max length of chains
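A minimal sketch of both wildcard types, assuming fragments and molecule parts are given as equal-length atom lists (a simplification; the actual matching works on graphs). The function names and the position-by-position comparison are assumptions for illustration. The equivalence classes {O,C} and {C,N} are the overlapping pair named on the slide.

```python
from typing import List, Set


def fuzzy_atom_match(pattern: List[str], atoms: List[str],
                     classes: List[Set[str]], max_fuzzy: int) -> bool:
    """Match atom by atom, allowing up to max_fuzzy equivalence-class matches."""
    if len(pattern) != len(atoms):
        return False
    fuzzy_used = 0
    for expected, actual in zip(pattern, atoms):
        if expected == actual:
            continue  # exact match costs nothing
        # A fuzzy match needs ONE class containing both atoms; overlapping
        # classes are not merged, so O ~ C and C ~ N do not imply O ~ N.
        if any(expected in cls and actual in cls for cls in classes):
            fuzzy_used += 1
            if fuzzy_used > max_fuzzy:
                return False
        else:
            return False
    return True


def fuzzy_chain_match(chain: List[str], atom: str, min_len: int, max_len: int) -> bool:
    """A flexible chain matches if it consists only of `atom` and its length is in range."""
    return min_len <= len(chain) <= max_len and all(a == atom for a in chain)


classes = [{"O", "C"}, {"C", "N"}]
print(fuzzy_atom_match(["C", "S", "N"], ["C", "S", "C"], classes, max_fuzzy=1))
print(fuzzy_atom_match(["O", "S", "N"], ["N", "S", "N"], classes, max_fuzzy=1))
print(fuzzy_chain_match(["C", "C", "C"], "C", min_len=2, max_len=4))
```

Keeping a cap on the number of fuzzy atoms prevents a fragment from degenerating into a pattern that matches almost anything.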
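Fuzzy Matches: Fuzzy Atom Matches (HIV data) [Figure: fragments built from Cl, N, O and S atoms, with the wildcard atoms {O,N} and {O,S}] Frequencies as listed on the slide: CA 5.5%, 3.7%; CA 5.5%, 0.01%; CI 0.0%, 0.0%; CI 0.0%, 0.0%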
MoFa - Summary • Search based on parallel embeddings and large scale data mining algorithm (Apriori/Eclat) • Computationally very efficient • Discovered knowledge is immediately meaningful • Fragments understandable to chemist • Better than rules/decision trees on mystic attributes • Really useful after incorporating expert feedback regarding imprecision: • Markush structures: allow for wildcards in fragments (fuzzy atoms and chains of flexible length) • Applied successfully to HTS data analysis and chemical synthesis success prediction.
Thank you. Preprints/Remarks/further Questions: send email to berthold@inf.uni-konstanz.de
Conclusions… • Data Analysis in Life Sciences is inherently: • multi-disciplinary • imprecise • interactive • dependent on context-specific notions of similarity • Focus is not exclusively on building good predictors • Instead the user wants understandable pieces of knowledge (“Information Mining”). • Value of knowledge depends on archival… • Store & retrieve past “experience” • … and on usability
What is “Similarity”? Tropacocaine 1518-12246 – Local Anesthetic
Types of Molecular Similarity • Structural similarity: • Same basic layout of overall graph • …or at least existence of a common subgraph • Geometrical similarity: • Roughly same shape in 3D, independent of exact atom matches • Instead of simple shape, also other properties (surface charge…) can be compared • Global properties: • Molecular weight • Number of hydrogen bond donors/acceptors… • And many others…
Knowledge Recycling Hardly ever do we find precise fits • find similar structures • chemical similarity • activity related similarity • … • determine related context • cardiac effects vs. ion channel effects (hERG assay) • appear in same metabolic pathway • Related gene expression profiles • … • and finally draw (inherently imprecise!) inferences Knowledge Archival, Management and Usability are crucial.