
Biological Data Mining

This project compares the efficacy of artificial neural networks (ANNs) and symbolic techniques in extracting explicit information from bioinformatic data. The focus is on developing logical rules and decision trees for scientific problems in bioinformatics and cheminformatics.

lgodfrey

Presentation Transcript


  1. Biological Data Mining: A comparison of Neural Network and Symbolic Techniques • http://www.cmd.port.ac.uk/biomine/

  2. Grantholder • Professor Martyn Ford, Centre for Molecular Design, University of Portsmouth. martyn.ford@port.ac.uk

  3. Collaborators • Dr Anthony Browne School of Computing, Information Systems and Mathematics, London Guildhall University. abrowne@lgu.ac.uk • Professor Philip Picton School of Technology and Design, University College Northampton. phil.picton@northampton.ac.uk • Dr David Whitley Centre for Molecular Design, University of Portsmouth. david.whitley@port.ac.uk

  4. Objectives • The project aims: • to develop & validate techniques for extracting explicit information from bioinformatic data • to express this information as logical rules and decision trees • to apply these new procedures to a range of scientific problems related to bioinformatics and cheminformatics

  5. Extracting information • Artificial neural networks (ANNs) can be used to identify the non-linear relationships that underlie bioinformatic data, but . . . • trained ANNs do not lead to a concise and explicit model • specifying the underlying structure is therefore difficult • as a result, ANNs are often regarded as ‘black boxes’

  6. Data Mining and Neural Networks • Standard data mining algorithms exist (such as ID3 or C5), so why use an ANN? Using an ANN would be advantageous if the rules extracted from it: • Give a better fit to the data with the same number of rules (i.e. explain the data more accurately); • Give the same fit to the data with fewer rules (i.e. explain the data more comprehensibly); or • Give both a better fit to the data and use fewer rules (i.e. explain the data more comprehensibly and more accurately).
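The three criteria on this slide can be sketched as a comparison of two rule sets by fit and size. This is only an illustration: the names `RuleSet` and `advantage`, and the use of a single accuracy figure as the measure of fit, are assumptions made here and do not come from the presentation.

```python
from typing import NamedTuple


class RuleSet(NamedTuple):
    """Hypothetical summary of a rule set: how well it fits and how big it is."""
    name: str
    accuracy: float  # fraction of examples the rules explain correctly
    n_rules: int     # number of rules (or tree leaves)


def advantage(extracted: RuleSet, baseline: RuleSet) -> str:
    """Classify the extracted rule set against a baseline such as ID3 or C5."""
    better_fit = extracted.accuracy > baseline.accuracy
    same_fit = extracted.accuracy == baseline.accuracy
    fewer_rules = extracted.n_rules < baseline.n_rules
    same_size = extracted.n_rules == baseline.n_rules

    if better_fit and fewer_rules:
        return "more accurate and more comprehensible"
    if better_fit and same_size:
        return "more accurate"
    if same_fit and fewer_rules:
        return "more comprehensible"
    return "no clear advantage"
```

Any other outcome, such as a better fit bought with more rules, is a trade-off that the slide's criteria do not rank, so the sketch reports it as "no clear advantage".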

  7. Extracting Decision Trees • The TREPAN procedure (Craven, 1996) • extracts decision trees from ANNs • performs better than the symbolic learning algorithms ID3 and C5 • the current implementation is restricted to a particular network architecture, but • the underlying algorithm is independent of network architecture

  8. Trepan • Builds a decision tree representing the function the ANN has learnt by recursively partitioning the input space. • Draws query instances by taking into account the distribution of instances in the problem domain. • For real-valued features, uses kernel density estimates to generate a model of the underlying data, which is then used to select instances for presentation to the network.
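The query-drawing step on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the TREPAN implementation: each real-valued feature is modelled independently with a Gaussian kernel of fixed bandwidth, the trained network is represented by a hypothetical `oracle` callable, and the constraints that define a region of the input space are represented by a hypothetical `constraint` predicate applied by rejection sampling.

```python
import random


def sample_feature(values, bandwidth, rng):
    """Draw one value from a Gaussian kernel density estimate of `values`.

    Sampling from a Gaussian KDE is equivalent to picking one observed value
    at random and adding Gaussian noise with the kernel bandwidth as its
    standard deviation.
    """
    return rng.gauss(rng.choice(values), bandwidth)


def draw_queries(training_rows, oracle, n, bandwidth=0.1,
                 constraint=None, seed=0, max_tries=10000):
    """Generate up to n query instances and label them with the network.

    training_rows: sequence of equal-length rows of real-valued features
    oracle:        callable standing in for the trained ANN (assumption)
    constraint:    optional predicate restricting queries to one region
    """
    rng = random.Random(seed)
    columns = list(zip(*training_rows))  # one tuple of observed values per feature
    labelled = []
    tries = 0
    while len(labelled) < n and tries < max_tries:
        tries += 1
        x = [sample_feature(col, bandwidth, rng) for col in columns]
        # Keep only instances that fall inside the region of interest.
        if constraint is None or constraint(x):
            labelled.append((x, oracle(x)))
    return labelled
```

Because the network can label any instance it is given, the amount of data available for choosing a split is not limited by the size of the original training set. The `max_tries` cap simply stops the loop if the constraint is rarely satisfied, in which case fewer than `n` instances are returned.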

  9. Trepan • Builds the decision tree in a best-first manner: • as each node is added the fidelity of the decision tree to the ANN is maximised • this is done by examining the significance of the distributions at consecutive levels of the tree (Kolmogorov-Smirnov test for real-valued features, chi-squared for discrete ones) • Allows the user to control the size of the final tree by selecting appropriate stopping criteria.
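Two quantities mentioned on this slide can be sketched in a few lines: the fidelity of a tree to the network, and the two-sample Kolmogorov-Smirnov statistic for a real-valued feature. The function names and the callables `tree_predict` and `network_predict` are assumptions made for illustration. The sketch computes only the test statistic, not its significance level, and does not show how TREPAN itself uses the result.

```python
from bisect import bisect_right


def fidelity(tree_predict, network_predict, instances):
    """Fraction of instances on which the tree gives the same output as the network."""
    agree = sum(1 for x in instances if tree_predict(x) == network_predict(x))
    return agree / len(instances)


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    This is the largest absolute difference between the two empirical
    cumulative distribution functions. Both are step functions that change
    only at observed values, so checking the pooled observations is enough.
    """
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_values, x):
        # Proportion of the sample that is less than or equal to x.
        return bisect_right(sorted_values, x) / len(sorted_values)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)
```

Fidelity is measured against the network's outputs rather than the true labels, so a tree can have high fidelity while reproducing the network's mistakes. A statistic near 0 suggests the feature is distributed similarly in the two samples (for example at a node and at its parent), and a value near 1 suggests the distributions barely overlap.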

  10. Aims • Implement the TREPAN algorithm in a portable format, independent of network architecture. • Extend the algorithm to enable the extraction of regression trees. • Provide a Bayesian formulation for the decision tree extraction algorithm. • Compare the performance of these algorithms with existing symbolic data mining techniques (ID3/C5).

  11. Aims • Apply the extracted decision trees • to searches of bioinformatic databases (protein databases, genomic databases) • to searches of cheminformatic databases (chemical libraries, natural product databases) • to investigate ligand/receptor binding • to quantify molecular similarity/diversity • to identify new leads and optimise properties

  12. Case study: ligand interaction with GPCRs • 28 GPCRs • a number of putative interaction sites • 3 principal properties of amino acids (AAs) • MLR results for 2 ligands
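Reading MLR as multiple linear regression, one generic way to write such a model for this case study is shown below. All symbols are assumptions introduced here: the slide does not state the number of interaction sites (written S), the response that was measured, or which three amino acid properties were used.

```latex
% y_i      : measured response of one ligand at receptor i (assumed)
% z_{isk}  : k-th principal property of the amino acid at site s of receptor i
% S        : number of putative interaction sites (not given on the slide)
y_i = \beta_0 + \sum_{s=1}^{S} \sum_{k=1}^{3} \beta_{sk}\, z_{isk} + \varepsilon_i ,
\qquad i = 1, \dots, 28
```

In this form one model would be fitted per ligand. With only 28 receptors, the 3S + 1 coefficients can be estimated by ordinary least squares only if 3S + 1 does not exceed 28.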
