Biological Data Mining

Biological Data Mining A comparison of Neural Network and Symbolic Techniques http://www.cmd.port.ac.uk/biomine/

People Centre for Molecular Design, University of Portsmouth Professor Martyn Ford Dr David Whitley Dr Shuang Cang (Mar - Sept 2000) Dr Abul Azad (Jan 2001 - ) Dr Antony Browne, London Guildhall University. Professor Philip Picton, University College Northampton.

1. Objectives • The project aims: • to develop and validate techniques for extracting explicit information from bioinformatic data • to express this information as logical rules and decision trees • to apply these new procedures to a range of scientific problems related to bioinformatics and cheminformatics

2. Methods for Extracting Information • Artificial Neural Networks • good predictive accuracy • hard to decipher • often regarded as ‘black boxes’ • Decision Trees • symbolic rules easier to interpret • more likely to reveal relationships in the data • allow behaviour of individual cases to be explained

3. Extracting Decision Trees • The Trepan procedure (Craven,1996) extracts decision trees from a neural network and a set of training cases by recursively partitioning the input space. • The decision tree is built in a best-first manner, expanding the tree at nodes where there is greatest potential for increasing the fidelity of the tree to the network.

4. Splitting Tests • The splitting tests at the nodes are m-of-n expressions, e.g. 2-of-{x1, ¬x2, x3}, where the xi are Boolean conditions. • Start with a set of candidate tests • binary tests on each value for nominal features • binary tests on thresholds for real-valued features • Use a beam search with a beam width of two. • Initialize the beam with the candidate test that maximizes the information gain.

5. Splitting Tests (II) • To each m-of-n test in the beam and each candidate test, apply two operators: • m-of-n+1 e.g. 2-of-{x1, x2} => 2-of-{x1, x2, x3} • m+1-of-n+1 e.g. 2-of-{x1, x2} => 3-of-{x1, x2, x3} • Admit new tests to the beam if they increase the information gain and are significantly different(chi-squared) from existing tests.

6. Example: Substance P Binding to NK1 Receptors • Substance P is a neuropeptide with amino acid sequence H-Arg-Pro-Lys-Pro-Gln-Gln-Phe-Phe-Gly-Leu-Met-NH2 • Wang et al. (1993) used the multipin technique to synthesize 512 = 29 stereoisomers generated by systematic replacement of L- by D-amino acids at 9 positions, and measured binding potencies to central NK1 receptors. • The objective was to identify the positions at which stereo-chemistry affects binding strength.

7. Application of Trepan • A series of networks with 9:9:1 architectures were trained using 90% of the data as a training set. • For each network a decision tree was grown using Trepan. • The positions identified agree with the FIRM (Formal Inference-based Recursive Modelling) analysis of Young and Hawkins (1999).

8. A Typical Trepan Tree

9. Future Work • Complete the implementation of the Trepan algorithm. • model the distribution of the input data and generate from this a set of query instances that are classified using the network and used as additional training cases during extraction of the tree. • Extend the algorithm to enable the extraction of regression trees. • Provide a Bayesian formulation for the decision tree extraction algorithm. • Compare the performance of these algorithms with existing symbolic data mining techniques (ID3/C5). • Apply Trepan to ligand-receptor binding problems.

Biological Data Mining