A Randomized Exhaustive Propositionalization Approach for Molecule Classification Michele Samorani Manuel Laguna Kirk DeLisle Daniel Weaver
Drug Discovery • The process of developing new drugs • The cost of developing a drug typically ranges from $500 million to $2 billion • Molecule classification is used along the entire process to discriminate between: • Active and Non-Active compounds • Toxic and Non-Toxic compounds • During the development of a new drug: • Use the experiments done so far to train a classifier • Use the classifier to find the promising compounds to test next • An ideal classification algorithm: • Speeds up the design of new drugs • Gives insights into chemical properties
Data Mining in Drug Discovery • [Workflow diagram: the chemist designs a compound, its attribute representation is fed to a classifier, and the classifier predicts Active (1) or Non-Active (0)]
Molecule classification – Binary Fingerprints • One of the main attribute representations is the so-called binary fingerprint: • Every attribute represents the absence/presence (0/1) of a characteristic or a substructure • The attributes are pre-defined characteristics • The classification process does not discover new knowledge; it only identifies which of the pre-defined attributes are most important
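To illustrate, here is a minimal sketch of what a fingerprint row could look like; the substructure names below are hypothetical placeholders, not an actual fingerprint key set.

```python
# Minimal sketch of a binary-fingerprint row: each pre-defined key
# (hypothetical names) maps to 1 if the substructure is present, else 0.
PREDEFINED_KEYS = ["aromatic_ring", "carbonyl", "halogen", "nitro_group"]  # hypothetical

def fingerprint(substructures_present):
    """Return the 0/1 vector for one compound."""
    present = set(substructures_present)
    return [1 if key in present else 0 for key in PREDEFINED_KEYS]

# Example: a compound containing an aromatic ring and a nitro group.
print(fingerprint(["aromatic_ring", "nitro_group"]))  # [1, 0, 0, 1]
```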
Molecule classification – Binary Fingerprints • The focus of this work is NOT on improving the classification procedure • It is on how to generate a good attribute representation, one that yields new knowledge
Propositionalization • The starting point is a database • By navigating through the database, new features are generated, which represent the result of SQL queries • These features are added to the mining table
[Database schema diagram: the target table contains the compounds and is linked to the other tables through 1–n relationships]
Generating a new attribute • Two steps: • Find a path that starts from the target table • Roll up one simple attribute, through aggregations and refinements, from the last table of the path back to the target table
STEP 1: Find a path • [Schema diagram: a path starting from the target table; this path will find attributes of depth 2] • Depth = a measure of how complex the attribute is
STEP 2: Roll-up • [Schema diagram, rolled up step by step] • Aggregate to each Atom: count distinct bonds (CDB) • Attach to the target table: max(Atom.CDB) where Atom.ele = ‘C’, i.e. the maximum number of bonds in which a carbon atom participates
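To make the roll-up concrete, here is a hedged pandas sketch of the same attribute (the maximum count of distinct bonds among the carbon atoms of each molecule); the table layout and column names are assumptions, not the authors' actual schema.

```python
import pandas as pd

# Hypothetical tables mirroring the compound -> Atom -> Bond path of the slides;
# column names are assumptions, not the authors' actual schema.
atom = pd.DataFrame({
    "atom_id":     [1, 2, 3, 4],
    "molecule_id": [10, 10, 10, 10],
    "ele":         ["C", "C", "O", "H"],
})
bond = pd.DataFrame({
    "bond_id": [100, 101, 102],
    "atom_id": [1, 1, 2],          # one row per (bond, atom) participation
})

# Step A (roll Bond up to Atom): count distinct bonds per atom (CDB).
atom["CDB"] = atom["atom_id"].map(
    bond.groupby("atom_id")["bond_id"].nunique()).fillna(0)

# Step B (roll Atom up to the target table): max(Atom.CDB) where Atom.ele = 'C'.
feature = (atom[atom["ele"] == "C"]
           .groupby("molecule_id")["CDB"].max()
           .rename("max_CDB_among_carbons"))
print(feature)   # molecule 10 -> 2
```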
Propositionalization – graphically • [Diagram of generated attributes at depths 1 through 4]
Our contribution over traditional propositionalization • Our Randomized Exhaustive approach produces: • More expressive attributes (Exhaustive) • “Deeper” attributes (Randomized)
More expressive attributes – Example • Traditional propositionalization algorithms can generate the following attribute: • Count the number of double bonds in which each atom participates • Compute the maximum • But not the following attribute: • Count the number of double bonds in which each atom participates • Compute the maximum among the oxygen atoms
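A small pandas sketch of the difference, under an assumed per-atom table (column names are illustrative only): the exhaustive attribute applies the same aggregation after restricting to oxygen atoms.

```python
import pandas as pd

# Hypothetical per-atom table: number of double bonds each atom participates in.
atoms = pd.DataFrame({
    "molecule_id":  [1, 1, 1, 1],
    "ele":          ["C", "C", "O", "O"],
    "double_bonds": [2, 0, 1, 0],
})

# Traditional: maximum over ALL atoms of the molecule.
trad = atoms.groupby("molecule_id")["double_bonds"].max()

# Exhaustive: the same aggregation restricted to the oxygen atoms.
exh = atoms[atoms["ele"] == "O"].groupby("molecule_id")["double_bonds"].max()

print(trad.loc[1], exh.loc[1])   # 2 vs 1 -- the restriction changes the value
```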
Attributes: Traditional vs Exhaustive • [Chart comparing the attributes generated by the Traditional and Exhaustive approaches on the Activity and Mutagenicity datasets]
Design of the experiments • Given an attribute generation strategy: • Perform a 10-fold cross validation using 10 different classifiers (from Weka): • MultilayerPerceptron, BayesNet, Bagging, J48, ADTree, REPTree, RandomForest, PART, Nnge, Ridor • The average accuracy across the folds and across the classifiers is used as the performance measure of the strategy
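The slide describes a Weka-based setup; as a rough illustration of the averaging scheme only, here is a scikit-learn sketch with stand-ins for a few of the listed classifiers (not the authors' exact configuration).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

def strategy_score(X, y):
    """Average 10-fold CV accuracy across several classifiers
    (scikit-learn stand-ins for some of the Weka models named on the slide)."""
    classifiers = [
        MLPClassifier(max_iter=500),   # ~ MultilayerPerceptron
        GaussianNB(),                  # rough stand-in for BayesNet
        BaggingClassifier(),           # ~ Bagging
        DecisionTreeClassifier(),      # ~ J48 / REPTree
        RandomForestClassifier(),      # ~ RandomForest
    ]
    scores = [cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
              for clf in classifiers]
    return float(np.mean(scores))      # one number per attribute-generation strategy
```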
Up to a predefined depth • In general, the deeper we go, the higher the accuracy we obtain • Let’s generate attributes at depth > 4
Up to depth 4 + 1,000 in [5,7] • Generate all attributes up to depth 4, then add 1,000 attributes randomly sampled from depths 5 to 7 • [Chart: reported accuracies of 77.93%, 77.72%, 76.30%, and 74.95% for the compared strategies]
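A sketch of this "exhaustive up to depth 4, then sample deep" strategy, assuming a hypothetical enumerate_attributes(depth) generator of attribute definitions (not part of the authors' code):

```python
import random

def generate_attribute_set(enumerate_attributes, n_deep=1000, seed=0):
    """Sketch of the 'up to depth 4 + 1,000 in [5,7]' strategy.
    `enumerate_attributes(depth)` is a hypothetical function assumed to yield
    all attribute definitions of the given depth."""
    rng = random.Random(seed)
    # Exhaustive part: every attribute of depth 1..4.
    attrs = [a for d in range(1, 5) for a in enumerate_attributes(d)]
    # Randomized part: sample up to 1,000 attributes among depths 5..7.
    deep_pool = [a for d in range(5, 8) for a in enumerate_attributes(d)]
    attrs += rng.sample(deep_pool, min(n_deep, len(deep_pool)))
    return attrs
```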
Summary of the results • Exhaustive is significantly better than Traditional • Sampling deep attributes at the end of the attribute generation procedure is significantly better than continuing to generate non-deep attributes • (significance measured as the proportion of classifiers that perform better with one strategy than with the other)
Comparison to fingerprints • The difference in accuracy is not significant • Our attribute representation is built automatically in 2 hours of computing time, whereas the fingerprint representation took years of research effort to identify
Additional Attribute Generation Strategies • Let’s not sample deep attributes randomly • Strategy 1: find the best mix of depths (scatter search) • Strategy 2: use a Bayesian Network to retrieve attributes with high information gain
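For Strategy 2, the ranking quantity is information gain; below is a small self-contained sketch of information gain for binary attributes and a binary class (the Bayesian-network machinery itself is not shown, and the function names are ours):

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a 0/1 label vector."""
    p = np.bincount(y, minlength=2) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(x, y):
    """IG(Y; X) = H(Y) - H(Y | X) for a binary attribute x and binary class y."""
    ig = entropy(y)
    for value in (0, 1):
        mask = (x == value)
        if mask.any():
            ig -= mask.mean() * entropy(y[mask])
    return ig

# Rank candidate attributes (columns of X) by information gain, highest first:
# ranked = sorted(range(X.shape[1]),
#                 key=lambda j: information_gain(X[:, j], y), reverse=True)
```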
New knowledge • Although our best method does not improve upon fingerprints, it has the potential to generate new knowledge • The attributes used by the classifiers represent important characteristics: • Number of bromine atoms • The average number of double bonds among the atoms different from S • These attributes identify structures with characteristics that may prevent mutagenesis
New knowledge • But, sometimes, deep attributes are hard to interpret. • On Estrogen: • Label each atom A in the following way. 1) For each atom connected to A, count the bonds in which it participates (excluding the bond connecting it to A). 2) Sum these counts to obtain the label of A. Label the molecule with the minimum of these labels across all oxygen atoms. • Specifically, a high value would represent an oxygen atom that is connected to other atoms participating in a large number of additional bonds - presumably an oxygen atom that is somewhat buried and interacting with highly branched atoms.
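A minimal sketch of this labeling on a toy adjacency-list molecule (the graph, bond counts, and function name are made up for illustration, not the Estrogen data):

```python
def oxygen_label(adjacency, bond_count, element):
    """Label each atom A with the sum, over its neighbours, of the number of
    bonds each neighbour participates in, excluding the bond to A itself.
    Return the minimum label over the oxygen atoms (the slide's attribute)."""
    labels = {}
    for a, neighbours in adjacency.items():
        labels[a] = sum(bond_count[n] - 1 for n in neighbours)
    oxygens = [a for a in adjacency if element[a] == "O"]
    return min(labels[a] for a in oxygens)

# Toy molecule (hypothetical): O(1)-C(2), C(2)-C(3), C(2)-C(4)
adjacency  = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2]}
bond_count = {1: 1, 2: 3, 3: 1, 4: 1}     # bonds each atom participates in
element    = {1: "O", 2: "C", 3: "C", 4: "C"}
print(oxygen_label(adjacency, bond_count, element))  # 2
```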
Conclusions • The current attribute representations (fingerprints) used for molecule classification do not provide insights into the chemical properties of the compounds • Traditional propositionalization approaches do not achieve satisfactory accuracy • Our method extends the traditional propositionalization approach and: • Obtains an accuracy comparable to fingerprints • Has the potential to find new knowledge • Note that our method is applicable to any domain (marketing, medical, etc.)
Future Work • Accuracy improvement: • Scan & Sample => Scan & Smartly Sample • Improve the feature representation • A query-like representation is fine for computer scientists, but chemists would prefer a graphical one
Thank you for your attention Michael.Samorani@Colorado.edu