1 / 11

Biological Data Mining

Biological Data Mining. A comparison of Neural Network and Symbolic Techniques http://www.cmd.port.ac.uk/biomine/. People. Centre for Molecular Design, University of Portsmouth Professor Martyn Ford Dr David Whitley Dr Shuang Cang (Mar - Sept 2000) Dr Abul Azad (Jan 2001 - )

janekeith
Download Presentation

Biological Data Mining

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Biological Data Mining A comparison of Neural Network and Symbolic Techniques http://www.cmd.port.ac.uk/biomine/

  2. People Centre for Molecular Design, University of Portsmouth Professor Martyn Ford Dr David Whitley Dr Shuang Cang (Mar - Sept 2000) Dr Abul Azad (Jan 2001 - ) Dr Antony Browne, London Guildhall University. Professor Philip Picton, University College Northampton.

  3. 1. Objectives • The project aims: • to develop and validate techniques for extracting explicit information from bioinformatic data • to express this information as logical rules and decision trees • to apply these new procedures to a range of scientific problems related to bioinformatics and cheminformatics

  4. 2. Methods for Extracting Information • Artificial Neural Networks • good predictive accuracy • hard to decipher • often regarded as ‘black boxes’ • Decision Trees • symbolic rules easier to interpret • more likely to reveal relationships in the data • allow behaviour of individual cases to be explained

  5. 3. Extracting Decision Trees • The Trepan procedure (Craven,1996) extracts decision trees from a neural network and a set of training cases by recursively partitioning the input space. • The decision tree is built in a best-first manner, expanding the tree at nodes where there is greatest potential for increasing the fidelity of the tree to the network.

  6. 4. Splitting Tests • The splitting tests at the nodes are m-of-n expressions, e.g. 2-of-{x1, ¬x2, x3}, where the xi are Boolean conditions. • Start with a set of candidate tests • binary tests on each value for nominal features • binary tests on thresholds for real-valued features • Use a beam search with a beam width of two. • Initialize the beam with the candidate test that maximizes the information gain.

  7. 5. Splitting Tests (II) • To each m-of-n test in the beam and each candidate test, apply two operators: • m-of-n+1 e.g. 2-of-{x1, x2} => 2-of-{x1, x2, x3} • m+1-of-n+1 e.g. 2-of-{x1, x2} => 3-of-{x1, x2, x3} • Admit new tests to the beam if they increase the information gain and are significantly different(chi-squared) from existing tests.

  8. 6. Example: Substance P Binding to NK1 Receptors • Substance P is a neuropeptide with amino acid sequence H-Arg-Pro-Lys-Pro-Gln-Gln-Phe-Phe-Gly-Leu-Met-NH2 • Wang et al. (1993) used the multipin technique to synthesize 512 = 29 stereoisomers generated by systematic replacement of L- by D-amino acids at 9 positions, and measured binding potencies to central NK1 receptors. • The objective was to identify the positions at which stereo-chemistry affects binding strength.

  9. 7. Application of Trepan • A series of networks with 9:9:1 architectures were trained using 90% of the data as a training set. • For each network a decision tree was grown using Trepan. • The positions identified agree with the FIRM (Formal Inference-based Recursive Modelling) analysis of Young and Hawkins (1999).

  10. 8. A Typical Trepan Tree

  11. 9. Future Work • Complete the implementation of the Trepan algorithm. • model the distribution of the input data and generate from this a set of query instances that are classified using the network and used as additional training cases during extraction of the tree. • Extend the algorithm to enable the extraction of regression trees. • Provide a Bayesian formulation for the decision tree extraction algorithm. • Compare the performance of these algorithms with existing symbolic data mining techniques (ID3/C5). • Apply Trepan to ligand-receptor binding problems.

More Related