Biological Data Mining

Biological Data Mining A comparison of Neural Network and Symbolic Techniques http://www.cmd.port.ac.uk/biomine/

Grantholder Professor Martyn Ford Centre for Molecular Design University of Portsmouth martyn.ford@port.ac.uk

Collaborators • Dr Anthony Browne School of Computing, Information Systems and Mathematics, London Guildhall University. abrowne@lgu.ac.uk • Professor Philip Picton School of Technology and Design, University College Northampton. phil.picton@northampton.ac.uk • Dr David Whitley Centre for Molecular Design, University of Portsmouth. david.whitley@port.ac.uk

Objectives • The project aims: • to develop & validate techniques for extracting explicit information from bioinformatic data • to express this information as logical rules and decision trees • to apply these new procedures to a range of scientific problems related to bioinformatics and cheminformatics

Extracting information • Artificial neural networks (ANNs) can be used to identify the non-linear relationships that underlie bioinformatic data, but . . . • trained ANNs do not lead to a concise and explicit model • specifying the underlying structure is therefore difficult • as a result, ANNs are often regarded as ‘black boxes’

Data Mining and Neural Networks • Standard data mining algorithms exist (such as ID3 or C5), so why use an ANN? It would be advantageous if the rules extracted from an ANN: • give a better fit to the data with the same number of rules (i.e. explain the data more accurately); • give the same fit to the data with fewer rules (i.e. explain the data more comprehensibly); or • give a better fit to the data and use fewer rules (i.e. explain the data both more comprehensibly and more accurately).
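The three cases above amount to a dominance check on two criteria: fit and rule count. A minimal sketch follows, assuming fit is summarised as a single accuracy figure; `RuleSet` and `is_advantageous` are hypothetical names introduced for illustration, not part of the project.

```python
from typing import NamedTuple


class RuleSet(NamedTuple):
    """Summary of a rule set (hypothetical structure for illustration)."""
    accuracy: float  # fit to the data, between 0 and 1
    n_rules: int     # number of rules in the set


def is_advantageous(extracted: RuleSet, baseline: RuleSet) -> bool:
    """True if `extracted` is no worse on both criteria and better on at least one."""
    no_worse = (extracted.accuracy >= baseline.accuracy
                and extracted.n_rules <= baseline.n_rules)
    strictly_better = (extracted.accuracy > baseline.accuracy
                       or extracted.n_rules < baseline.n_rules)
    return no_worse and strictly_better


# Illustrative numbers only, not results from the project.
baseline = RuleSet(accuracy=0.85, n_rules=10)
same_size_better_fit = RuleSet(accuracy=0.90, n_rules=10)
same_fit_fewer_rules = RuleSet(accuracy=0.85, n_rules=7)
better_fit_more_rules = RuleSet(accuracy=0.90, n_rules=12)
```

A rule set that fits better but needs more rules (the last example) is a trade-off rather than a clear gain, so the check returns False for it.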

Extracting Decision Trees • The TREPAN procedure (Craven, 1996) • extracts decision trees from ANNs • performs better than the symbolic learning algorithms ID3 and C5 • the current implementation is restricted to a particular network architecture, but • the underlying algorithm is independent of network architecture

Trepan • Builds a decision tree representing the function the ANN has learnt by recursively partitioning the input space. • Draws query instances by taking into account the distribution of instances in the problem domain. • For real-valued features, uses kernel density estimates to build a model of the underlying data, which is then used to select instances for presentation to the network.
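The query-generation step for a single real-valued feature can be sketched as follows. This is an illustration under stated assumptions, not the TREPAN implementation: the Gaussian kernel, the normal-reference bandwidth rule, the function names, the data values and `toy_network` (a stand-in for a trained ANN) are all assumptions. Sampling from a Gaussian kernel density estimate reduces to picking an observed value at random and adding Gaussian noise scaled by the bandwidth.

```python
import random
import statistics


def rule_of_thumb_bandwidth(values):
    """Normal-reference bandwidth for a 1-D Gaussian kernel (an assumed choice)."""
    n = len(values)
    return 1.06 * statistics.stdev(values) * n ** (-1 / 5)


def sample_from_kde(values, n_samples, rng):
    """Draw samples from a Gaussian KDE fitted to `values`.

    Each draw picks one observed value uniformly at random and perturbs it
    with Gaussian noise whose standard deviation is the bandwidth.
    """
    h = rule_of_thumb_bandwidth(values)
    return [rng.gauss(rng.choice(values), h) for _ in range(n_samples)]


def toy_network(x):
    """Hypothetical stand-in for a trained ANN acting as the labelling oracle."""
    return 1 if x > 0.5 else 0


rng = random.Random(0)
training_values = [0.1, 0.2, 0.25, 0.4, 0.6, 0.7, 0.9]  # illustrative data only
queries = sample_from_kde(training_values, 200, rng)

# The network, not the original training labels, supplies the class of each query.
labelled = [(x, toy_network(x)) for x in queries]
```

Because the network labels the generated instances, the tree can be fitted to as many examples as needed rather than only to the original training set.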

Trepan • Builds the decision tree in a best-first manner: • as each node is added the fidelity of the decision tree to the ANN is maximised • this is done by examining the significance of the distributions at consecutive levels of the tree (Kolmogorov-Smirnov test for real-valued features, chi-squared for discrete ones) • Allows the user to control the size of the final tree by selecting appropriate stopping criteria.
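Best-first growth with a size limit can be sketched with a priority queue. The sketch below makes several assumptions: one real-valued feature, simple threshold splits, a fixed set of instances (no fresh queries are drawn at each node and no significance tests are run), and a node priority of reach × (1 − fidelity), so the node where the tree disagrees with the network on the largest share of instances is expanded first. `network` and all other names are hypothetical.

```python
import heapq
import itertools


def network(x):
    """Hypothetical stand-in for a trained ANN on one real-valued feature."""
    return 1 if 0.3 < x < 0.7 else 0


def majority(labels):
    return max(sorted(set(labels)), key=labels.count)


def agreements(labels):
    """Number of instances at a leaf that match its majority label."""
    return labels.count(majority(labels))


def best_split(xs, ys):
    """Threshold minimising disagreements in the two children, or None."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        errors = len(ys) - agreements(left) - agreements(right)
        if best is None or errors < best[0]:
            best = (errors, t)
    return None if best is None else best[1]


def grow_tree(xs, ys, max_internal_nodes):
    """Grow best-first; return (thresholds chosen, fidelity to the network)."""
    total = len(xs)
    tie_breaker = itertools.count()  # keeps the heap from comparing lists
    splits, leaves = [], []

    def priority(node_ys):
        reach = len(node_ys) / total
        node_fidelity = agreements(node_ys) / len(node_ys)
        return -(reach * (1.0 - node_fidelity))  # negated: heapq is a min-heap

    heap = [(priority(ys), next(tie_breaker), xs, ys)]
    while heap:
        prio, _, node_xs, node_ys = heapq.heappop(heap)
        # Stopping criteria: size limit reached, or node already agrees fully.
        can_split = len(splits) < max_internal_nodes and prio < 0
        t = best_split(node_xs, node_ys) if can_split else None
        if t is None:
            leaves.append(node_ys)
            continue
        splits.append(t)
        for side in (True, False):
            pairs = [(x, y) for x, y in zip(node_xs, node_ys) if (x < t) == side]
            cx, cy = [p[0] for p in pairs], [p[1] for p in pairs]
            heapq.heappush(heap, (priority(cy), next(tie_breaker), cx, cy))
    return splits, sum(agreements(leaf) for leaf in leaves) / total


xs = [i / 100 for i in range(100)]
ys = [network(x) for x in xs]  # labels come from the network
fidelity_by_size = {k: grow_tree(xs, ys, k)[1] for k in (0, 1, 2)}
```

On this toy problem, raising `max_internal_nodes` from 0 to 2 increases the fidelity of the tree to the network at each step, which is the trade-off between tree size and fidelity that the stopping criteria control.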

Aims • Implement the TREPAN algorithm in a portable format, independent of network architecture. • Extend the algorithm to enable the extraction of regression trees. • Provide a Bayesian formulation for the decision tree extraction algorithm. • Compare the performance of these algorithms with existing symbolic data mining techniques (ID3/C5).
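For the regression-tree aim, one conventional split criterion is the reduction in the sum of squared errors of the network's real-valued output. The sketch below shows that standard criterion only; it is an assumption about how such an extension could be approached, not the project's method, and `network_output` and the other names are hypothetical.

```python
import statistics


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = statistics.fmean(values)
    return sum((v - mean) ** 2 for v in values)


def best_regression_split(xs, outputs):
    """Return (threshold, gain) maximising the reduction in squared error."""
    parent = sse(outputs)
    best_t, best_gain = None, 0.0
    for t in sorted(set(xs))[1:]:
        left = [o for x, o in zip(xs, outputs) if x < t]
        right = [o for x, o in zip(xs, outputs) if x >= t]
        gain = parent - (sse(left) + sse(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain


def network_output(x):
    """Hypothetical stand-in for a network with a real-valued output."""
    return 2.0 if x >= 0.5 else 0.0


xs = [i / 10 for i in range(10)]
outputs = [network_output(x) for x in xs]
threshold, gain = best_regression_split(xs, outputs)
```

Each leaf of such a tree would predict the mean network output of the instances that reach it, in place of a majority class.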

Aims • Apply the extracted decision trees • to searches of bioinformatic databases • protein databases • genomic databases • to searches of cheminformatic databases • chemical libraries • natural product databases • to investigate ligand/receptor binding • to quantify molecular similarity/diversity • to identify new leads and optimise properties

Case study: ligand interaction with GPCRs • 28 GPCRs • a number of putative interaction sites • 3 principal properties of amino acids (AAs) • MLR results for 2 ligands
