Biological Data Mining

Presentation Transcript


  1. Biological Data Mining: A comparison of Neural Network and Symbolic Techniques. http://www.cmd.port.ac.uk/biomine/

  2. 1. Objectives • The project aims: • to develop and validate techniques for extracting explicit information from bioinformatic data • to express this information as logical rules and decision trees • to apply these new procedures to a range of scientific problems related to bioinformatics and cheminformatics

  3. 2. Extracting information • Artificial neural networks can be trained to reproduce the non-linear relationships underlying bioinformatic data with good predictive accuracy • but it is often hard to comprehend those relationships from the internal structure of the network • with the result that networks are often regarded as ‘black boxes’. • Decision trees using symbolic rules are easier to interpret • leading to a greater likelihood of understanding the relationships in the data • allowing the behaviour of individual cases to be explained.

  4. 3. Extracting Decision Trees • The Trepan procedure (Craven, 1996) extracts decision trees from a neural network and a set of training cases by recursively partitioning the input space. • The decision tree is built in a best-first manner, expanding the tree at nodes where there is greatest potential for increasing the fidelity of the tree to the network.
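The best-first expansion described on this slide can be sketched with a priority queue. This is a minimal sketch and not the project's implementation: scoring a node as reach × (1 − fidelity) is an assumption about how "greatest potential for increasing fidelity" is measured, and the node names and numbers are hypothetical.

```python
import heapq
import itertools


def best_first_order(nodes):
    """Return node names in the order a best-first search would expand them.

    `nodes` maps a node name to (reach, fidelity), both in [0, 1]:
    reach is the fraction of cases arriving at the node, fidelity is the
    fraction of those cases on which tree and network agree (assumed scoring).
    """
    counter = itertools.count()  # tie-breaker so equal priorities pop in insertion order
    heap = []
    for name, (reach, fidelity) in nodes.items():
        priority = reach * (1.0 - fidelity)
        # heapq is a min-heap, so store the negated priority.
        heapq.heappush(heap, (-priority, next(counter), name))

    order = []
    while heap:
        _, _, name = heapq.heappop(heap)
        order.append(name)
    return order


# Hypothetical frontier of three unexpanded nodes.
frontier = {"A": (0.5, 0.9), "B": (0.3, 0.6), "C": (0.2, 0.95)}
expansion_order = best_first_order(frontier)
print(expansion_order)
```

In a full extraction loop the children of each expanded node would be pushed back onto the same queue; the sketch only orders a fixed frontier.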

  5. 4. Splitting Tests • The splitting tests at the nodes are m-of-n expressions, e.g. 2-of-{x1, ¬x2, x3}, where the xi are Boolean conditions. • Start with a set of candidate tests • binary tests on each value for nominal features • binary tests on thresholds for real-valued features • Use a beam search with a beam width of two. • Initialize the beam with the candidate test that maximizes the information gain.
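To make the m-of-n notation and the information-gain criterion concrete, here is a small sketch. The dictionary representation of a case, the function names and the toy data are assumptions made for illustration; they do not come from the slides.

```python
import math


def m_of_n(m, conditions, case):
    """True when at least m of the Boolean conditions hold for the case."""
    return sum(1 for cond in conditions if cond(case)) >= m


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / total
        result -= p * math.log2(p)
    return result


def information_gain(cases, labels, test):
    """Reduction in label entropy obtained by splitting the cases on a binary test."""
    n = len(labels)
    if n == 0:
        return 0.0
    inside = [y for x, y in zip(cases, labels) if test(x)]
    outside = [y for x, y in zip(cases, labels) if not test(x)]
    remainder = (len(inside) / n) * entropy(inside) + (len(outside) / n) * entropy(outside)
    return entropy(labels) - remainder


# The slide's example test: 2-of-{x1, not x2, x3}.
conditions = [lambda c: c["x1"], lambda c: not c["x2"], lambda c: c["x3"]]
case_a = {"x1": True, "x2": True, "x3": True}   # x1 and x3 hold
case_b = {"x1": False, "x2": True, "x3": True}  # only x3 holds
result_a = m_of_n(2, conditions, case_a)
result_b = m_of_n(2, conditions, case_b)

# Toy data in which the single condition x1 separates the classes perfectly.
cases = [{"x1": True}, {"x1": True}, {"x1": False}, {"x1": False}]
labels = ["active", "active", "inactive", "inactive"]
gain = information_gain(cases, labels, lambda c: c["x1"])
```

A single-condition test is just the 1-of-1 case, which is why the beam can be seeded with the best candidate test and then grown.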

  6. 5. Splitting Tests (II) • To each m-of-n test in the beam and each candidate test, apply two operators: • m-of-n+1 e.g. 2-of-{x1, x2} => 2-of-{x1, x2, x3} • m+1-of-n+1 e.g. 2-of-{x1, x2} => 3-of-{x1, x2, x3} • Admit new tests to the beam if they increase the information gain and are significantly different (chi-squared test) from existing tests.
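The two operators can be sketched as a function that takes one m-of-n test and one candidate condition. Representing a test as a pair (m, frozenset of condition names) is an assumption for illustration, and the sketch leaves out the information-gain and chi-squared checks that decide whether a new test enters the beam.

```python
def expand(test, candidate):
    """Apply the m-of-n+1 and m+1-of-n+1 operators to one test.

    `test` is (m, conditions) with conditions a frozenset of condition names;
    `candidate` is the name of the condition to add.
    """
    m, conditions = test
    if candidate in conditions:
        return []  # adding a condition already present changes nothing
    grown = conditions | {candidate}
    return [
        (m, grown),      # m-of-n+1: same threshold, one more condition
        (m + 1, grown),  # m+1-of-n+1: threshold and condition count both grow
    ]


# The slide's example: start from 2-of-{x1, x2} and add x3.
start = (2, frozenset({"x1", "x2"}))
new_tests = expand(start, "x3")
```

Here `new_tests` holds 2-of-{x1, x2, x3} and 3-of-{x1, x2, x3}, matching the two examples on the slide.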

  7. 6. Example: Substance P Binding to NK1 Receptors • Substance P is a neuropeptide with the sequence: H-Arg-Pro-Lys-Pro-Gln-Gln-Phe-Phe-Gly-Leu-Met-NH2 • Wang et al. used the multipin technique to synthesize 512 = 2^9 stereoisomers generated by systematic replacement of L- by D-amino acids at 9 positions • The aim was to measure binding potencies to NK1 receptors and identify the positions at which stereochemistry affects binding strength.
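The count of 512 follows from two choices (L or D) at each of 9 positions. The sketch below enumerates the configurations and encodes each one as 9 binary inputs; the 0/1 encoding is an assumption, and the slides do not say which 9 residues were varied, so positions are only indexed here.

```python
from itertools import product

N_POSITIONS = 9  # number of varied positions, as stated on the slide

# Every combination of L- or D-configuration across the 9 positions.
configurations = list(product("LD", repeat=N_POSITIONS))

# Assumed encoding for network input: 0 for L, 1 for D.
encoded = [[1 if c == "D" else 0 for c in config] for config in configurations]

print(len(configurations))
```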

  8. 7. Application of Trepan • A series of networks with 9:9:1 architectures were trained using 90% of the data as a training set. • For each network a decision tree was grown using Trepan. • The trees showed high fidelity with the networks on a 10% test set.
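Fidelity here means agreement between the tree and the network, as opposed to accuracy against the observed activities. A minimal sketch of both the fidelity score and a tree-versus-network confusion table follows; the prediction vectors are invented for illustration and are not the project's results.

```python
from collections import Counter


def fidelity(tree_predictions, network_predictions):
    """Fraction of cases on which the tree and the network give the same class."""
    if len(tree_predictions) != len(network_predictions):
        raise ValueError("prediction lists must have the same length")
    if not tree_predictions:
        raise ValueError("at least one case is required")
    agree = sum(t == n for t, n in zip(tree_predictions, network_predictions))
    return agree / len(tree_predictions)


def confusion_counts(row_labels, column_labels):
    """Count (row, column) label pairs, i.e. the cells of a confusion matrix."""
    return Counter(zip(row_labels, column_labels))


# Hypothetical predictions on a five-case test set (1 = active, 0 = inactive).
tree_out = [1, 0, 1, 1, 0]
net_out = [1, 0, 0, 1, 0]

score = fidelity(tree_out, net_out)
table = confusion_counts(tree_out, net_out)
```

Passing the observed classes instead of `net_out` gives the tree-versus-observed table with the same function.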

  9. 8. Results • Binding activity was determined by five positions, viz. • H-Arg-Pro-Lys-Pro-Gln-Gln-Phe-Phe-Gly-Leu-Met-NH2 • The positions identified agree with the FIRM (Formal Inference-based Recursive Modelling) analysis of Young and Hawkins • Young S & Hawkins D.M. (2000) Analysis of a large, high-throughput screening data using recursive partitioning. Molecular Modelling & Prediction of Bioactivity (ed. Gundertofte & Jørgensen).

  10. 9. A Typical Trepan Tree

  11. 10. Test set confusion matrix: tree versus network

  12. 11. Test set confusion matrix: tree versus observed

  13. 12. Future Work • Complete the implementation of the Trepan algorithm. • model the distribution of the input data and generate a set of query instances to be classified by the network & used as additional training cases during tree extraction. • Extend the algorithm to enable the extraction of regression trees. • Provide a Bayesian formulation for the decision tree extraction algorithm.
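The query-instance step can be sketched as follows: fit a simple model of the inputs, sample new instances from it, and let the network label them. Treating each binary input as an independent Bernoulli variable is a simplifying assumption, and `network` is a stand-in callable rather than a trained model.

```python
import random


def fit_marginals(cases):
    """Estimate, for each binary feature, the fraction of cases in which it is 1."""
    n = len(cases)
    return [sum(column) / n for column in zip(*cases)]


def sample_queries(marginals, count, rng):
    """Draw `count` binary instances, sampling each feature independently."""
    return [[1 if rng.random() < p else 0 for p in marginals] for _ in range(count)]


def label_with_network(queries, network):
    """Pair each query with the class the network assigns to it."""
    return [(query, network(query)) for query in queries]


# Hypothetical training inputs with two binary features.
training_cases = [[1, 0], [1, 1], [0, 0], [1, 0]]
marginals = fit_marginals(training_cases)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
queries = sample_queries(marginals, 20, rng)

# Stand-in for the trained network: predicts 1 when the first input is set.
extra_training_cases = label_with_network(queries, lambda x: x[0])
```

The labelled queries would be added to the cases reaching a node, so that splitting tests are chosen from more data than the original training set provides.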

  14. 13. Future Applications • Apply Trepan to ligand-receptor binding problems. • compare the performance of these algorithms with existing symbolic data mining techniques (ID3/C5).

  15. 14. References • Wang J-X et al. (1993) Study of stereo-requirements of substance P binding to NK1 receptors using analogues with systematic D-amino acid replacements. Bioorganic & Medicinal Chemistry Letters, 3, 451-456. • Young S & Hawkins D.M. (2000) Analysis of a large, high-throughput screening data using recursive partitioning. Molecular Modelling & Prediction of Bioactivity (ed. Gundertofte & Jørgensen).

  16. Grantholder • Professor Martyn Ford, Centre for Molecular Design, University of Portsmouth. martyn.ford@port.ac.uk Research Fellows • Dr Shuang Cang (Mar - Sept 2000) • Dr Abul Azad (Jan 2001 -)

  17. Collaborators • Dr Antony Browne School of Computing, Information Systems and Mathematics, London Guildhall University. abrowne@lgu.ac.uk • Professor Philip Picton School of Technology and Design, University College Northampton. phil.picton@northampton.ac.uk • Dr David Whitley Centre for Molecular Design, University of Portsmouth. david.whitley@port.ac.uk
