Discovering Substructures in Chemical Toxicity Domain. Masters Project Defense by Ravindra Nath Chittimoori. Committee: Dr. Lawrence B. Holder, Dr. Diane J. Cook, Dr. Lynn Peterson. Department of Computer Science and Engineering, University of Texas at Arlington.
Outline • Chemical Toxicity Database • Motivation and Goal • Knowledge Discovery in Databases (KDD) • SUBDUE Knowledge Discovery System • Experiments with Unsupervised SUBDUE • Experiments with Supervised SUBDUE • Discussion of Results • Conclusions • Future Work
Chemical Toxicity Database • Carcinogenesis Prediction Problem • Toxicology Evaluation Challenge • Domain:
Compounds          +     -     Total
Training set       162   136   298
Experimental set   27    25    69
Motivation and Goal • The number of chemical compounds is ever-increasing • Each compound needs analysis to obtain its structure-activity relationships • Goal: determine SUBDUE's applicability to the chemical toxicity domain
Knowledge Discovery in Databases (KDD) • Process of identifying valid, novel, potentially useful and understandable patterns in data • Goals of knowledge discovery: verification and discovery • Data mining methods • Model representation, evaluation and search
Steps in KDD • Identify the goal of the process • Collect, create and prepare the dataset • Select the data mining method • Select the data mining algorithm • Transform the data • Execute the algorithm • Interpret/evaluate the discovered patterns • Consolidate the knowledge discovered
SUBDUE Knowledge Discovery System • SUBDUE discovers patterns (substructures) in structural data sets • Vertices: objects or attributes • Edges: relationships [Figure: example graph of objects with shape attributes (triangle, square) linked by an "on" relationship; 4 instances of the substructure]
SUBDUE - Input Representation • Each atom is represented as a vertex with directed edges to the name, type and the partial charge of the atom • Bonds are represented as undirected edges • Each group is represented as a vertex having a string label specifying the group name with directed edges to all participating atom vertices
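The representation described on this slide can be sketched as a small labeled graph. This is a minimal illustration only, not SUBDUE's actual input file format: the `Graph` class and the helper names are hypothetical, the edge labels `n`, `t`, `p` and `gr` follow the legend used in the example figures, and the atom values are taken from the unsupervised input example.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Tiny labeled graph: vertices by id, edges as (src, dst, label, directed)."""
    vertices: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_vertex(self, label):
        vid = len(self.vertices) + 1
        self.vertices[vid] = label
        return vid

    def add_edge(self, src, dst, label, directed=True):
        self.edges.append((src, dst, label, directed))


def add_atom(g, name, atom_type, charge):
    """Atom vertex with directed edges to its name, type and partial charge."""
    atom = g.add_vertex("atom")
    g.add_edge(atom, g.add_vertex(name), "n")
    g.add_edge(atom, g.add_vertex(str(atom_type)), "t")
    g.add_edge(atom, g.add_vertex(str(charge)), "p")
    return atom


def add_group(g, group_name, atoms):
    """Group vertex with directed 'gr' edges to every participating atom."""
    group = g.add_vertex(group_name)
    for atom in atoms:
        g.add_edge(group, atom, "gr")
    return group


g = Graph()
c1 = add_atom(g, "c", 10, 0.063)
c2 = add_atom(g, "c", 10, 0.062)
g.add_edge(c1, c2, "1", directed=False)  # a bond is an undirected edge
methyl = add_group(g, "methyl", [c1, c2])
```

Keeping name, type and partial charge as separate vertices, rather than as fields on the atom, lets a discovered substructure mention only some of an atom's attributes.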
SUBDUE - Input Representation • Representation used in Unsupervised SUBDUE: a vertex with a string label specifying the alert, with directed edges to all the atoms in the compound • Representation used in Supervised SUBDUE: a vertex for each compound with the string label "compound"; the compound vertex has directed edges to all the vertices representing the activity of an alert on the compound
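Unsupervised SUBDUE Input Representation Example [Figure: two Atom vertices joined by a bond edge labeled 1; each atom has n, t and p edges to its name (C), type (10) and partial charge (0.063, 0.062); a Methyl group vertex is attached by gr edges and an Ames alert vertex by po edges] Legend: n - Name, t - Type, p - Partial charge, po - Positive, gr - group
Supervised SUBDUE Input Representation Example [Figure: the same two Atom vertices (C, type 10, partial charges 0.063 and 0.062) with bond edge 1 and a Methyl group vertex attached by gr edges; the graph also includes a Com vertex with contains edges and an Ames alert vertex labeled Positive] Legend: n - Name, t - Type, p - Partial charge, gr - group, Com - Compound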
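SUBDUE - Model Evaluation • Minimum Description Length Principle: the best theory to describe a graph is the one that minimizes I(S) + I(G|S), where I(S) is the number of bits needed to encode the substructure S and I(G|S) is the number of bits needed to encode the graph G after compressing it with S • Graph Compression
Other important Concepts of SUBDUE • Inexact Graph Match Approach • Concept Learning • Predefined Substructures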
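As a rough illustration of why repeated substructures compress a graph, the sketch below swaps the bit-level description length for a much simpler size-based proxy (vertices plus edges). It is not SUBDUE's MDL encoding. The function names are hypothetical, and it assumes non-overlapping instances, each replaced by a single vertex.

```python
def compressed_size(v, e, sub_v, sub_e, instances):
    """Size of G after replacing each instance of S with a single vertex.

    Assumes instances do not overlap; edges between an instance and the
    rest of the graph are kept, edges inside an instance disappear.
    """
    new_v = v - instances * sub_v + instances
    new_e = e - instances * sub_e
    return new_v + new_e


def compression_value(v, e, sub_v, sub_e, instances):
    """size(G) / (size(S) + size(G|S)); values above 1 mean S compresses G."""
    original = v + e
    described = (sub_v + sub_e) + compressed_size(v, e, sub_v, sub_e, instances)
    return original / described


# Hypothetical graph: 20 vertices, 24 edges; substructure: 4 vertices, 3 edges.
one_instance = compression_value(20, 24, 4, 3, instances=1)
four_instances = compression_value(20, 24, 4, 3, instances=4)
```

With these made-up numbers a single instance does not pay for the cost of describing the substructure, while four instances do, which is the intuition behind ranking substructures by how well they compress the graph.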
Unsupervised SUBDUE - Methodology • Training set further divided • 3 approaches to determine carcinogenicity of compounds in experimental set -- Apply SUBDUE individually to the compounds -- Inclusion of pre-defined substructures -- Check for matching of substructure in the compound to be classified
Unsupervised SUBDUE - Results [Figure: discovered substructure with two atom vertices (names c and br, types 10 and 3, partial charges 0.062 and 0.057) joined by a bond edge labeled 1] • Third approach used to classify compounds in experimental set • Accuracy level: 0.322 • Cyanate and ether groups were also discovered to be indicators of carcinogenic activity
Supervised SUBDUE - Methodology • Create set of indicators of carcinogenic activity • Create set of indicators of noncarcinogenic activity • Calculate value of substructures discovered in carcinogenic and noncarcinogenic set • Select a set of substructures to be used in classifying compounds in experimental set
Supervised SUBDUE - Methodology • Check for the existence of these substructures in the compound to be classified • Calculate the Carcinogenic Activity Value of the compound • Calculate the NonCarcinogenic Activity Value of the compound • Determine the activity of the compound
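The classification steps above can be sketched as follows. The slides do not say how the activity values are computed, so everything here is an assumption made for illustration: each substructure is reduced to a set of features (a stand-in for real subgraph matching), each carries a hypothetical numeric value, a compound's activity value is the sum of the values of the substructures it contains, and ties are resolved as noncarcinogenic.

```python
def activity_value(compound, substructures):
    """Sum the values of the substructures whose features all occur in the compound."""
    return sum(value for features, value in substructures if features <= compound)


def classify(compound, carcinogenic, noncarcinogenic):
    """Return '+' if the carcinogenic activity value is strictly larger, else '-'."""
    pos = activity_value(compound, carcinogenic)
    neg = activity_value(compound, noncarcinogenic)
    return "+" if pos > neg else "-"


# Hypothetical substructures and values; only the group and alert names come from the slides.
carcinogenic = [
    (frozenset({("ames", "positive")}), 1.4),
    (frozenset({("group", "halide10")}), 1.1),
]
noncarcinogenic = [
    (frozenset({("ames", "negative")}), 1.3),
    (frozenset({("group", "methoxy")}), 1.2),
]

compound_a = {("ames", "positive"), ("group", "methoxy")}
compound_b = {("ames", "negative"), ("group", "halide10")}
label_a = classify(compound_a, carcinogenic, noncarcinogenic)
label_b = classify(compound_b, carcinogenic, noncarcinogenic)
```

In the actual system the containment check would be a graph match between the discovered substructure and the compound's graph; the set-inclusion test only mimics that step.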
Supervised SUBDUE - Results • A set of 12 substructures discovered by SUBDUE was used to classify compounds in the experimental set • The 6 substructures from the carcinogenic set include substructures that form part of groups such as amino, di10, methyl, ether and halide10, and a substructure indicating that the compound tested positive on AMES, Salmonella, etc. • The 6 substructures from the noncarcinogenic set include substructures that form part of groups such as methoxy, Ar_Halide, di64, nitro and alkyl_halide, and a substructure indicating that the compound tested negative on AMES, Salmonella, etc.
Supervised SUBDUE - Substructure Example - Carcinogenic Set [Figure: Compound vertex linked to Ames, Salmonella and Salmonella_n alert vertices, each labeled positive]
Supervised SUBDUE - Substructure Example - Carcinogenic Set [Figure: two Atom vertices (names Cl and C, types 93 and 10, partial charges -0.123 and -0.024) attached by gr edges to a Halide10 group vertex] Legend: n - Name, t - Type, p - Partial charge, gr - group
Supervised SUBDUE - Substructure Example - NonCarcinogenic Set [Figure: Compound vertex linked to Ames, Salmonella and Cytogen_ca alert vertices, each labeled negative]
Supervised SUBDUE - Substructure Example - NonCarcinogenic Set [Figure: two Atom vertices (names Cl and C, types 93 and 10, partial charges -0.124 and 0.477) attached by gr edges to an A-H group vertex] Legend: n - Name, t - Type, p - Partial charge, gr - group, A-H - Alkyl Halide
Supervised SUBDUE - Results • PTE-1 Results:
Compounds              +     -     Total
PTE-1                  20    19    39
Correct Prediction     12    6     18
Incorrect Prediction   8     13    21
• Accuracy: 0.6 (+), 0.316 (-), 0.462 (total)
Supervised SUBDUE - Results • PTE-2 Results:
Compounds              +     -     Total
PTE-2                  7     6     13*
Correct Prediction     4     3     7
Incorrect Prediction   3     3     6
* : number of compounds whose activity is known
• Accuracy: 0.571 (+), 0.5 (-), 0.538 (total)
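The accuracy figures follow directly from the correct-prediction counts in the two tables. The short check below recomputes them; the function and variable names are illustrative, and the results agree with the slides to within rounding in the third decimal place.

```python
# (correct predictions, number of compounds) per class, taken from the tables.
pte1 = {"+": (12, 20), "-": (6, 19)}
pte2 = {"+": (4, 7), "-": (3, 6)}


def summarize(results):
    """Per-class and overall accuracy, rounded to three decimals."""
    summary = {label: round(c / t, 3) for label, (c, t) in results.items()}
    correct = sum(c for c, _ in results.values())
    total = sum(t for _, t in results.values())
    summary["total"] = round(correct / total, 3)
    return summary


pte1_accuracy = summarize(pte1)
pte2_accuracy = summarize(pte2)
```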
Results - Discussion • Unsupervised SUBDUE successful in discovering lead indicators of carcinogenic activity • Supervised SUBDUE also successful in discovering lead indicators of carcinogenic activity • ILP System PROGOL: PTE-1 (0.72), PTE-2 (0.62) • Ashby, TOPKAT are other toxicity prediction methods
Conclusions • Results are consistent with those obtained by logic-based systems like PROGOL • Prefer the Concept Learner when positive and negative examples of the target concept are available • SUBDUE is capable of discovering lead indicators of carcinogenic/noncarcinogenic activity in the chemical toxicity domain
Future Work • PTE-3 Evaluation Challenge • Trimmed Data Sets (Partial Charge) • Newer Version of Concept Learning SUBDUE being developed
Reference http://cygnus.uta.edu/subdue