Comparison of Neural Network and Symbolic Techniques in Biological Data Mining

Biological Data Mining A comparison of Neural Network and Symbolic Techniques http://www.cmd.port.ac.uk/biomine/

1. Objectives • The project aims: • to develop and validate techniques for extracting explicit information from bioinformatic data • to express this information as logical rules and decision trees • to apply these new procedures to a range of scientific problems related to bioinformatics and cheminformatics

2. Extracting information • Artificial neural networks can be trained to reproduce the non-linear relationships underlying bioinformatic data with good predictive accuracy • but it is often hard to comprehend those relationships from the internal structure of the network • with the result that networks are often regarded as ‘black boxes’. • Decision treesusing symbolic rules are easier to interpret • leading to a greater likelihood of understanding the relationships in the data • allowing the behaviour of individual cases to be explained.

3. Extracting Decision Trees • The Trepan procedure (Craven,1996) extracts decision trees from a neural network and a set of training cases by recursively partitioning the input space. • The decision tree is built in a best-first manner, expanding the tree at nodes where there is greatest potential for increasing the fidelity of the tree to the network.

4. Splitting Tests • The splitting tests at the nodes are m-of-n expressions, e.g. 2-of-{x1, ¬x2, x3}, where the xi are Boolean conditions. • Start with a set of candidate tests • binary tests on each value for nominal features • binary tests on thresholds for real-valued features • Use a beam search with a beam width of two. • Initialize the beam with the candidate test that maximizes the information gain.

5. Splitting Tests (II) • To each m-of-n test in the beam and each candidate test, apply two operators: • m-of-n+1 e.g. 2-of-{x1, x2} => 2-of-{x1, x2, x3} • m+1-of-n+1 e.g. 2-of-{x1, x2} => 3-of-{x1, x2, x3} • Admit new tests to the beam if they increase the information gain and are significantly different(chi-squared) from existing tests.

6. Example: Substance P Binding to NK1 Receptors • Substance P is a neuropeptide with the sequence: H-Arg-Pro-Lys-Pro-Gln-Gln-Phe-Phe-Gly-Leu-Met-NH2 • Wang et al. used the multipin technique to synthesize 512 = 29 stereoisomers generated by systematic replacement of L- by D-amino acids at 9 positions • The aim was to measure binding potencies to NK1 receptors & identify the positions at which stereo-chemistry affects binding strength.

7. Application of Trepan • A series of networks with 9:9:1 architectures were trained using 90% of the data as a training set. • For each network a decision tree was grown using Trepan. • The trees showed high fidelity with the networks on a 10% test set.

8. Results • Binding activity was determined by five positions, viz. • H-Arg-Pro-Lys-Pro-Gln-Gln-Phe-Phe-Gly-Leu-Met-NH2 • The positions identified agree with the FIRM (Formal Inference-based Recursive Modelling) analysis of Young and Hawkins • Young S & Hawkins D.M. (2000) Analysis of a large, high-throughput screening data using recursive partitioning. Molecular Modelling & Prediction of Bioactivity (ed. Gundertofte & JØrgensen).

9. A Typical Trepan Tree

10. Test set confusion matrix: tree versus network

11. Test set confusion matrix: tree versus observed

12. Future Work • Complete the implementation of the Trepan algorithm. • model the distribution of the input data and generate a set of query instances to be classified by the network & used as additional training cases during tree extraction. • Extend the algorithm to enable the extraction of regression trees. • Provide a Bayesian formulation for the decision tree extraction algorithm.

13. Future Applications • Apply Trepan to ligand-receptor binding problems. • compare the performance of these algorithms with existing symbolic data mining techniques (ID3/C5).

14. References • Wang J-X et al. (1993)Study of stereo-requirements of substance P binding to NK1 receptors using analogues with systematic D-amino acid replacements. Biorganic & Medicinal Chemistry Letters, 3, 451-456. • Young S & Hawkins D.M. (2000) Analysis of a large, high-throughput screening data using recursive partitioning. Molecular Modelling & Prediction of Bioactivity (ed. Gundertofte & JØrgensen).

Grantholder Professor Martyn Ford Centre for Molecular Design University of Portsmouth martyn.ford@port.ac.uk Research Fellows Dr Shuang Cang Mar - Sept 2000 Dr Abul Azad Jan 2001 -

Collaborators • Dr Antony Browne School of Computing, Information Systems and Mathematics, London Guildhall University. abrowne@lgu.ac.uk • Professor Philip Picton School of Technology and Design, University College Northampton. phil.picton@northampton.ac.uk • Dr David Whitley Centre for Molecular Design, University of Portsmouth. david.whitley@port.ac.uk

Comparison of Neural Network and Symbolic Techniques in Biological Data Mining

Comparison of Neural Network and Symbolic Techniques in Biological Data Mining

Presentation Transcript

EECS 800 Research Seminar Mining Biological Data

Biological literature mining

BIOLOGICAL Data Mining

Mining Biological Data

EECS 800 Research Seminar Mining Biological Data

EECS 800 Research Seminar Mining Biological Data

G53BIO – Bioinformatics Biological Data Mining

EECS 800 Research Seminar Mining Biological Data

EECS 800 Research Seminar Mining Biological Data

EECS 800 Research Seminar Mining Biological Data

EECS 800 Research Seminar Mining Biological Data

EECS 800 Research Seminar Mining Biological Data

Biological Data Mining

EECS 800 Research Seminar Mining Biological Data

Biological Data Mining

EECS 800 Research Seminar Mining Biological Data

Elementary approach towards Biological Data Mining

EECS 800 Research Seminar Mining Biological Data

EECS 800 Research Seminar Mining Biological Data

EECS 800 Research Seminar Mining Biological Data

EECS 800 Research Seminar Mining Biological Data

Biological Data Mining