
Discovering Substructures in Chemical Toxicity Domain

Discovering Substructures in Chemical Toxicity Domain. Masters Project Defense by Ravindra Nath Chittimoori. Committee: Dr. Lawrence B. Holder, Dr. Diane J. Cook, Dr. Lynn Peterson. Department of Computer Science and Engineering, University of Texas at Arlington.


Presentation Transcript


  1. Discovering Substructures in Chemical Toxicity Domain. Masters Project Defense by Ravindra Nath Chittimoori. Committee: Dr. Lawrence B. Holder, Dr. Diane J. Cook, Dr. Lynn Peterson. Department of Computer Science and Engineering, University of Texas at Arlington

  2. Outline • Chemical Toxicity Database • Motivation and Goal • Knowledge Discovery in Databases (KDD) • SUBDUE Knowledge Discovery System • Experiments with Unsupervised SUBDUE • Experiments with Supervised SUBDUE • Discussion of Results • Conclusions • Future Work

  3. Chemical Toxicity Database • Carcinogenesis Prediction Problem • Toxicology Evaluation Challenge • Domain:

     Compounds            +     -   Total
     Training set       162   136     298
     Experimental set    27    25      69

  4. Motivation and Goal • Ever-increasing number of chemical compounds • Analysis is needed to obtain the structure-activity relationships of compounds • Determine SUBDUE’s applicability to the chemical toxicity domain

  5. Knowledge Discovery in Databases (KDD) • Process of identifying valid, novel, potentially useful and understandable patterns in data • Goals of knowledge discovery: verification and discovery • Data mining methods • Model representation, evaluation and search

  6. Steps in KDD • Identify the goal of the process • Collect, create and prepare the dataset • Select the data mining method • Select the data mining algorithm • Transform the data • Execute the algorithm • Interpret/evaluate the discovered patterns • Consolidate the knowledge discovered

  7. SUBDUE Knowledge Discovery System • SUBDUE discovers patterns [substructures] in structural data sets • Vertices: objects or attributes • Edges: relationships [Figure: example substructure in which an object with shape triangle is "on" an object with shape square; 4 instances of it occur in the input graph.]

  8. SUBDUE - Input Representation • Each atom is represented as a vertex with directed edges to the name, type and the partial charge of the atom • Bonds are represented as undirected edges • Each group is represented as a vertex having a string label specifying the group name with directed edges to all participating atom vertices

  9. SUBDUE - Input Representation • Representation used in Unsupervised SUBDUE: a vertex having a string label specifying the alert, with directed edges to all the atoms in the compound • Representation used in Supervised SUBDUE: a vertex for each compound with the string label "compound"; the compound vertex has directed edges to all the vertices representing the activity of an alert on the compound
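The unsupervised representation described on slides 8 and 9 can be sketched as a small labeled graph. This is only an illustration: the `Graph` class, the `add_atom` helper, and the choice of edge labels (`n`, `t`, `p`, `gr`, `po`, taken from the legend on slide 10) are assumptions made here, not SUBDUE's actual input file format.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Hypothetical container for a labeled graph (not SUBDUE's file format)."""
    vertices: dict = field(default_factory=dict)  # vertex id -> label
    edges: list = field(default_factory=list)     # (src, dst, label, directed)

    def add_vertex(self, label):
        vid = len(self.vertices) + 1
        self.vertices[vid] = label
        return vid

    def add_edge(self, src, dst, label, directed=True):
        self.edges.append((src, dst, label, directed))


def add_atom(g, name, atom_type, charge):
    """Atom vertex with directed edges to its name, type and partial charge."""
    atom = g.add_vertex("atom")
    for edge_label, value in (("n", name), ("t", atom_type), ("p", charge)):
        g.add_edge(atom, g.add_vertex(value), edge_label)
    return atom


g = Graph()
c1 = add_atom(g, "C", 10, 0.063)
c2 = add_atom(g, "C", 10, 0.062)
g.add_edge(c1, c2, "1", directed=False)  # bonds are undirected edges

methyl = g.add_vertex("Methyl")          # group vertex
ames = g.add_vertex("Ames")              # alert vertex (unsupervised form)
for atom in (c1, c2):
    g.add_edge(methyl, atom, "gr")       # group -> participating atoms
    g.add_edge(ames, atom, "po")         # alert tested positive -> every atom

print(len(g.vertices), len(g.edges))     # 10 11
```

For the supervised form, the alert vertex would instead hang off a single compound vertex rather than pointing at every atom.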

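  10. Unsupervised SUBDUE Input Representation Example [Figure: two bonded Atom vertices (bond label 1), each with name C and type 10, with partial charges 0.063 and 0.062; a Methyl group vertex and an Ames alert vertex are linked to the atoms by gr and po edges.] Legend: n - Name, t - Type, p - Partial charge, po - Positive, gr - group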

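  11. Supervised SUBDUE Input Representation Example [Figure: two bonded Atom vertices (bond label 1), each with name C and type 10, with partial charges 0.063 and 0.062; a Methyl group vertex is linked to the atoms by gr edges, and a Com vertex is linked by contains edges, with Ames marked Positive.] Legend: n - Name, t - Type, p - Partial charge, gr - group, Com - Compound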

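  12. SUBDUE - Model Evaluation • Minimum Description Length Principle: the best theory to describe a graph G is the substructure S that minimizes I(S) + I(G|S), where I(S) is the number of bits needed to encode S and I(G|S) is the number of bits needed to encode G after it has been compressed using S • Graph Compression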
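As a rough illustration of the quantity being minimized, the sketch below replaces the bit-level encoding with a crude size proxy (vertex count plus edge count). This proxy, the function names, and the example numbers are all assumptions for illustration; SUBDUE's actual MDL encoding is more detailed.

```python
def size(n_vertices, n_edges):
    """Crude stand-in for description length: count vertices and edges."""
    return n_vertices + n_edges


def compressed_cost(graph, sub, n_instances):
    """Proxy for I(S) + I(G|S), assuming the instances do not overlap.

    graph and sub are (n_vertices, n_edges) pairs.
    """
    g_v, g_e = graph
    s_v, s_e = sub
    # Each instance collapses to one vertex; its internal edges disappear.
    rest_v = g_v - n_instances * (s_v - 1)
    rest_e = g_e - n_instances * s_e
    return size(s_v, s_e) + size(rest_v, rest_e)


graph = (40, 50)  # hypothetical input graph: 40 vertices, 50 edges
candidates = {
    "A": ((4, 3), 4),  # 4-vertex substructure with 4 instances
    "B": ((2, 1), 6),  # 2-vertex substructure with 6 instances
}

costs = {k: compressed_cost(graph, *v) for k, v in candidates.items()}
best = min(costs, key=costs.get)
print(costs, best)  # {'A': 73, 'B': 81} A
```

Under this proxy the uncompressed graph costs 90, so both candidates compress it, and the larger substructure A wins despite having fewer instances.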

  13. Other important Concepts of SUBDUE • Inexact Graph Match Approach • Concept - Learning • Predefined Substructures

  14. Unsupervised SUBDUE - Methodology • Training set further divided • 3 approaches to determine carcinogenicity of compounds in experimental set -- Apply SUBDUE individually to the compounds -- Inclusion of pre-defined substructures -- Check for matching of substructure in the compound to be classified

  15. Unsupervised SUBDUE - Results [Figure: discovered substructure of two bonded atom vertices (bond label 1); vertex labels shown are names c and br, types 10 and 3, and partial charges 0.062 and 0.057.] • Third approach used to classify compounds in experimental set • Accuracy level: 0.322 • Cyanate and ether groups are also discovered to be indicators of carcinogenic activity

  16. Supervised SUBDUE - Methodology • Create set of indicators of carcinogenic activity • Create set of indicators of noncarcinogenic activity • Calculate value of substructures discovered in carcinogenic and noncarcinogenic set • Select a set of substructures to be used in classifying compounds in experimental set

  17. Supervised SUBDUE - Methodology • Check for the existence of these substructures in the compound to be classified • Calculate the Carcinogenic Activity Value of the compound • Calculate the NonCarcinogenic Activity Value of the compound • Determine the activity of the compound
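The classification steps on slides 16 and 17 can be sketched as follows. The slides do not say how substructure values or the two activity values are computed, so everything here is an assumption: a compound is reduced to a set of group names, each indicator carries a hypothetical weight of 1.0, an activity value is the sum of the weights of the indicators present, and ties go to noncarcinogenic. The group names are taken from slide 18; real SUBDUE would test for substructures by graph matching, not set membership.

```python
# Hypothetical indicator sets; weights are placeholders, not values from the slides.
CARCINOGENIC = {"amino": 1.0, "methyl": 1.0, "ether": 1.0, "halide10": 1.0}
NONCARCINOGENIC = {"methoxy": 1.0, "nitro": 1.0, "alkyl_halide": 1.0}


def activity_value(compound_features, indicators):
    """Sum the weights of the indicators that occur in the compound."""
    return sum(w for name, w in indicators.items() if name in compound_features)


def classify(compound_features):
    """Return '+' (carcinogenic) or '-' (noncarcinogenic)."""
    carcinogenic_value = activity_value(compound_features, CARCINOGENIC)
    noncarcinogenic_value = activity_value(compound_features, NONCARCINOGENIC)
    # Assumed tie-break: equal values are classified as noncarcinogenic.
    return "+" if carcinogenic_value > noncarcinogenic_value else "-"


print(classify({"amino", "ether", "nitro"}))  # +
print(classify({"nitro"}))                    # -
```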

  18. Supervised SUBDUE - Results • A set of 12 substructures discovered by SUBDUE used to classify compounds in the experimental set • 6 substructures from carcinogenic set include substructures which form part of groups like amino, di10, methyl, ether, halide10 and substructure which indicates compound testing positive on AMES, Salmonella, etc. • 6 substructures from noncarcinogenic set include substructures which form part of groups like methoxy, Ar_Halide, di64, nitro and alkyl_halide and substructure which indicates compound testing negative on AMES, Salmonella, etc.

  19. Supervised SUBDUE - Substructure Example - Carcinogenic Set [Figure: a Compound vertex linked to Ames, Salmonella and Salmonella_n vertices, each marked positive.]

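  20. Supervised SUBDUE - Substructure Example - Carcinogenic Set [Figure: two Atom vertices linked by gr edges to a Halide10 group vertex; vertex labels shown are names Cl and C, types 93 and 10, and partial charges -0.123 and -0.024.] Legend: n - Name, t - Type, p - Partial charge, gr - group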

  21. Supervised SUBDUE - Substructure Example - NonCarcinogenic Set [Figure: a Compound vertex linked to Ames, Salmonella and Cytogen_ca vertices, each marked negative.]

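  22. Supervised SUBDUE - Substructure Example - NonCarcinogenic Set [Figure: two Atom vertices linked by gr edges to an A-H group vertex; vertex labels shown are names Cl and C, types 93 and 10, and partial charges -0.124 and 0.477.] Legend: n - Name, t - Type, p - Partial charge, gr - group, A-H - Alkyl Halide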

  23. Supervised SUBDUE - Results • PTE-1 Results:

      Compounds                +     -   Total
      PTE-1                   20    19      39
      Correct Prediction      12     6      18
      Incorrect Prediction     8    13      21

      • Accuracy: 0.6 (+), 0.316 (-), 0.462 (total)

  24. Supervised SUBDUE - Results • PTE-2 Results:

      Compounds                +     -   Total
      PTE-2                    7     6      13*
      Correct Prediction       4     3       7
      Incorrect Prediction     3     3       6

      * Number of compounds whose activity is known
      • Accuracy: 0.571 (+), 0.5 (-), 0.538 (total)
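The accuracy figures on slides 23 and 24 follow directly from the correct-prediction counts and the class sizes. A minimal check, where the function name `accuracies` is an assumption introduced here:

```python
def accuracies(correct_pos, total_pos, correct_neg, total_neg):
    """Return (accuracy on +, accuracy on -, overall accuracy)."""
    total = total_pos + total_neg
    return (
        correct_pos / total_pos,
        correct_neg / total_neg,
        (correct_pos + correct_neg) / total,
    )


pte1 = accuracies(12, 20, 6, 19)  # counts from slide 23
pte2 = accuracies(4, 7, 3, 6)     # counts from slide 24

print([round(x, 3) for x in pte1])  # [0.6, 0.316, 0.462]
print([round(x, 3) for x in pte2])  # [0.571, 0.5, 0.538]
```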

  25. Results - Discussion • Unsupervised SUBDUE successful in discovering lead indicators of carcinogenic activity • Supervised SUBDUE also successful in discovering lead indicators of carcinogenic activity • ILP System PROGOL: PTE-1 (0.72), PTE-2 (0.62) • Ashby, TOPKAT are other toxicity prediction methods

  26. Conclusions • Results are consistent with those obtained by logic-based systems like PROGOL • The Concept Learner is preferable when positive and negative examples of the target concept are available • SUBDUE is capable of discovering lead indicators of carcinogenic/noncarcinogenic activity in the chemical toxicity domain

  27. Future Work • PTE-3 Evaluation Challenge • Trimmed Data Sets (Partial Charge) • Newer Version of Concept Learning SUBDUE being developed

  28. Reference http://cygnus.uta.edu/subdue
