Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory
Talk Overview • Our research group • Aims, people, publications • Machine learning • A balancing act • Bioinformatics • Holy grails • Our bioinformatics research projects • From small to large • A future direction • Integration of reasoning techniques
Computational Bioinformatics Laboratory • Our aim is to: • Study the theory, implementation and application of computational techniques to problems in biology and medicine • Our emphasis is on: • Machine learning representations, algorithms and applications • Our favourite techniques are: • ILP, SLPs, ATF, ATP, CSP, GAs, SVMs • Kernel methods, Bayes nets, Action Languages • The (major) research tools we’ve produced are: • Progol, HR, MetaLog (in production)
The Research Group Members • Hiroaki Watanabe (RA, BBSRC) • Alireza Tamaddoni-Nezhad (RA, DTI) • Stephen Muggleton (Professor) • Ali Hafiz (PhD) • Huma Lodhi (RA, DTI) • Simon Colton (Lecturer) • Jung-Wook Bang (RA, DTI) • (Nicos Angeloupolos, now in York) (RA, BBSRC) • Room 407 • http://www.doc.ic.ac.uk/bioinformatics
Some External Collaborators • Mike Sternberg (Biochemistry, Imperial) • Jeremy Nicholson (Biomedical Sciences, Imperial) • Steve Oliver (Biology, Manchester) • Ross King (Computing, Aberystwyth) • Doug Kell (Chemistry, Manchester) • Chris Rawlings (Oxagen) • Charlie Hodgman (GSK) • Alan Bundy (Informatics, Edinburgh) • Toby Walsh (Cork Constraint Computation Centre)
Some Departmental Collaborators • Krysia Broda, Allesandro Russo, Oliver Ray • Aspects of ILP and ALP • Marek Sergot • Action Languages • Tony Kakas (Visiting professor, Cyprus) • Abductive Logic Programming
Machine Learning Overview • Ultimately about writing programs which improve with experience • Experience through data • Experience through knowledge • Experience through experimentation (active) • Some common tasks: • Concept learning for prediction • Clustering • Association rule mining
Maintaining a Balance • A spectrum from predictive to descriptive tasks: • Predictive tasks (supervised learning): know what you’re looking for • In between: don’t know what you’re looking for • Descriptive tasks (unsupervised learning): don’t know you’re even looking
A Partial Characterisation of Learning Tasks • Concept learning • Outlier/anomaly detection • Clustering • Concept formation • Conjecture making • Puzzle generation • Theory formation
Maintaining a Balance in Predictive/Descriptive tasks • Predictive tasks • From accuracy to understanding • Need to show statistical significance • But hypotheses generated often need to be understandable • Difference between the stock market and biology • Descriptive tasks • From pebbles to pearls • Lots of rubbish produced • Cannot rely on statistical significance • Have to worry about notions of interestingness • And provide tools to extract useful information from output
Maintaining a Balance in Scientific Discovery tasks • Machine learning researchers • Are generally not also domain scientists • Extremely important to collaborate • To provide interesting projects • Remembering that we are scientists, not IT consultants • To gain materials • Data, background knowledge, heuristics, … • To assess the value of the output
Inductive Logic Programming • Concept/rule learning technique (usually) • Hypotheses represented as Logic Programs • Search for LPs • From general to specific or vice-versa • One method is inverse entailment • Use measures to guide the search • Predictive accuracy and compression (info. theory) • Search performed within a language bias • Produces good accuracy and understanding • Logic programs are easier to decipher than ANNs • Our implementation: Progol (and others)
Example learned LP • Predicting protein folds from helices
fold('Four-helical up-and-down bundle', P) :-
    helix(P, H1), length(H1, hi),
    position(P, H1, Pos), interval(1 =< Pos =< 3),
    adjacent(P, H1, H2), helix(P, H2).
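For a feel of what the learner consumes, a minimal sketch in plain Prolog of the kind of background knowledge and labelled examples from which a clause like the one above could be induced. Every protein, helix and value below is invented; helix_length/2 replaces length/2 to avoid the built-in list predicate, and the Progol-specific mode declarations are omitted.

% Hypothetical background knowledge about two invented proteins.
helix(p1, h1).   helix(p1, h2).   helix(p2, h3).
helix_length(h1, hi).   helix_length(h2, lo).   helix_length(h3, lo).
position(p1, h1, 2).    position(p2, h3, 7).
adjacent(p1, h1, h2).

% Labelled examples: p1 is known to fold this way, p2 is known not to.
pos_example(fold('Four-helical up-and-down bundle', p1)).
neg_example(fold('Four-helical up-and-down bundle', p2)).

% A candidate hypothesis of the kind shown above (interval/1 unfolded into
% arithmetic tests), testable directly against the data:
fold('Four-helical up-and-down bundle', P) :-
    helix(P, H1), helix_length(H1, hi),
    position(P, H1, Pos), 1 =< Pos, Pos =< 3,
    adjacent(P, H1, H2), helix(P, H2).

% ?- fold('Four-helical up-and-down bundle', P).   % succeeds only for P = p1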
Stochastic Logic Programs • Generalisation of HMMs • Probabilistic logic programs • More expressive language than LPs • Quantitative rather than qualitative • Express arbitrary intervals over probability distributions • Issues in learning SLPs • Structure estimation • Parameter estimation • Applications • More appropriate for biochemical networks
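As a rough, runnable illustration of the idea (not the group's SLP machinery, and with invented probabilities): an SLP is a set of probability-labelled clauses. Here the labels are stored in ordinary slp/3 facts so the sketch loads in standard Prolog, and a small sampler picks among matching clauses in proportion to them; the clauses are the structure to be learned, the labels the parameters.

:- use_module(library(random)).   % for random/1 (SWI-Prolog)

% slp(Label, Head, Body): a probability-labelled clause (toy distribution).
slp(0.7, state(hot),  true).
slp(0.3, state(cold), true).
slp(0.8, emit(hot,  high), true).
slp(0.2, emit(hot,  low),  true).
slp(0.4, emit(cold, high), true).
slp(0.6, emit(cold, low),  true).
slp(1.0, reading(T, S), (state(T), emit(T, S))).

% Sample a goal: choose among its labelled clauses with the given
% probabilities, then sample the chosen body.
sample(true) :- !.
sample((A, B)) :- !, sample(A), sample(B).
sample(Goal) :-
    findall(L-(Goal :- Body), slp(L, Goal, Body), Alts),
    random(R),
    pick(Alts, R, (Goal :- Body)),
    sample(Body).

pick([L-Clause|_], R, Clause) :- R =< L, !.
pick([L-_|Rest], R, Clause) :- R1 is R - L, pick(Rest, R1, Clause).

% ?- sample(reading(T, S)).   % e.g. T = hot, S = high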
Automated Theory Formation • Descriptive learning technique • Which can also be used for prediction tasks • Cycle of activity • Form concepts, make hypotheses, explain hypotheses, evaluate concepts, start again,… • 15 production rules for concepts • 7 methods to discover and extract conjectures • Uses third party software to prove/disprove (maths) • 25 heuristic measures of interestingness • Project: see whether this works in bioinformatics • Our implementation: HR
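A toy flavour of that cycle in plain Prolog, with invented predicates and data (HR's production rules and interestingness measures are far richer): compose a new concept from old ones, then extract implication and equivalence conjectures by exhaustive checking against the known objects.

% Invented background data.
protein(p1).  protein(p2).  protein(p3).
has_helix(p1).   has_helix(p2).
binds_atp(p1).   binds_atp(p2).

% "Compose" production rule: build a new concept from two existing ones.
helix_and_atp(P) :- has_helix(P), binds_atp(P).

% Conjecture extraction: does every object satisfying concept A also satisfy B?
implies(ConceptA, ConceptB) :-
    \+ ( protein(P), call(ConceptA, P), \+ call(ConceptB, P) ).

equivalent(A, B) :- implies(A, B), implies(B, A).

% ?- implies(binds_atp, has_helix).         % holds on this data: a conjecture
% ?- equivalent(helix_and_atp, binds_atp).  % also holds on this toy data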
Other Machine Learning Methods used in our Group • Genetic algorithms • To perform ILP search (Alireza) • Bayes nets • Introduction of hidden nodes (Philip) • Kernel methods • Relational kernels for SVMs and regression (Huma) • Action Languages • Stochastic (re)actions (Hiroaki)
Bioinformatics Overview • “Bioinformatics is the study of information content and information flow in biological systems and processes” (Michael Liebman) • Not just storage and analysis of huge DNA sequences • “Bioinformaticians have to be a Jack of all trades and a master of one” (Charlie Hodgman, GSK) • Highly collaborative • biology, mathematics, statistics, computer science, biochemistry, physics, chemistry, medicine, …
From Sequence to Structure • Is there a computer program…? attcgatcgatcgatcgatcaggcgcgctacgagcggcgaggacctcatcatcgatcag… MRPQAPGSLVDPNEDELRMAPWYWGRISREEAKSILHGKPDGSFLVRDALSMKGEYTLTLMKDGCEKLIKICHMDRKYGFIETDLFNSVVEMINYYKENSLSMYNKTLDITLSNPIVRAREDEESQPHGDLCLLSNEFIRTCQLLQNLEQNLENKRNSFNAIREELQEKKLHQSVFGNTEKIFRNQIKLNESFMKAPADA……
Holy Grail Number One • From protein sequence to protein function • Human Genome Project (HGP) data needs to be interpreted • Genome split into genes, each of which codes for a protein • Biological function of a protein dictated by its structure • Structure of many proteins already determined • By X-ray crystallography • Best idea so far: given a new gene sequence • Find the most similar sequence with known structure • And look at the structure/function of that protein • Other alternatives • Use ML techniques to predict where secondary structures will occur (e.g., hairpins, alpha-helices, beta-sheets)
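A crude, runnable caricature of the "most similar known sequence" idea; real pipelines use alignment tools such as BLAST, all sequences and structure labels here are invented, and SWI-Prolog's string handling is assumed.

% Known proteins with solved structures (invented fragments and labels).
known(seq1, "MRPQAPGS", four_helix_bundle).
known(seq2, "MKVLAGTS", beta_barrel).

% Percent identity between two equal-length residue strings.
identity(SeqA, SeqB, Identity) :-
    string_chars(SeqA, As), string_chars(SeqB, Bs),
    length(As, N), length(Bs, N),
    matches(As, Bs, 0, M),
    Identity is M / N.

matches([], [], M, M).
matches([X|Xs], [Y|Ys], Acc, M) :-
    ( X == Y -> Acc1 is Acc + 1 ; Acc1 = Acc ),
    matches(Xs, Ys, Acc1, M).

% Predict the structure of a query as that of the closest known sequence.
predict(Query, Structure) :-
    findall(I-S, (known(_, Known, S), identity(Query, Known, I)), Scored),
    max_member(_-Structure, Scored).

% ?- predict("MRPQAPTS", Structure).   % Structure = four_helix_bundle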
Holy Grail Number Two • Drug companies lose millions • Developing drugs which turn out to be toxic • Predictive Toxicology • Determine in advance which will be toxic • Approach 1: Mapping molecules to toxicity • Using ML and statistical techniques • Approach 2: • Producing metabolic explanations of toxic effects • Using probabilistic logics to represent pathways • And learning structures and parameters over this
Other aims of Bioinformatics • Organisation of Data • Cross referencing • Data integration is a massive problem • Analysing data from • High-throughput methods for gene expression • Ask Yike about this! • Produce Ontologies • And get everyone to use them?
Some Current Bioinformatics Projects • SGC • The Substructure Server • SGC and SHM • Discovery in medical ontologies • SHM • Studying biochemical networks (£400k, BBSRC) • Closed loop learning (£200k, EPSRC) • The Metalog project (£1.1 million, DTI) • APRIL 2 (£400k, EC)
A Substructure Server • Lesson from Automated Theorem Proving • The best (most complex) methods are not the most used • Other considerations: ease of use, stability, simplicity, e.g., Otter • Aim: provide a simple predictive toxicology program • Via a server with a very simple interface • Sub-projects • Find substructures in many positives, few negatives: Colton • Simple Prolog program, writing Java version, use ILP?? • Put program on server: Anandathiyagar (MSc.) • Distribute process over our Linux cluster: Darby (MEng.) • Babel preprocessor (50+ representations), Rasmol back-end: ???
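A sketch of how such a "simple Prolog program" might look, on invented compounds with pre-extracted named groups rather than real molecular graphs: flag any substructure seen in at least MinPos toxic compounds and at most MaxNeg non-toxic ones.

% Invented data: which substructures occur in which compounds.
has_sub(c1, nitro).   has_sub(c1, benzene).
has_sub(c2, nitro).   has_sub(c2, amine).
has_sub(c3, benzene).
has_sub(c4, amine).

toxic(c1).     toxic(c2).
nontoxic(c3).  nontoxic(c4).

flagged(Sub, MinPos, MaxNeg) :-
    setof(S, C^has_sub(C, S), Subs),
    member(Sub, Subs),
    aggregate_all(count, (toxic(P), has_sub(P, Sub)), Pos),
    aggregate_all(count, (nontoxic(N), has_sub(N, Sub)), Neg),
    Pos >= MinPos, Neg =< MaxNeg.

% ?- flagged(Sub, 2, 0).   % Sub = nitro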
Using Medical Ontologies • Use ontologies and ML for database integration • Muggleton and Tamaddoni-Nezhad • Bridge between two disparate databases • LIGAND (biochemical reactions) • Enzyme classification system (EC) = ontology • Automated ontology maintenance • Colton and Traganidas (MSc., last year) • Gene Ontology (big project) • Use data to find links between GO terms • Equivalence and implication finding using HR
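A hedged sketch of the link-finding idea over annotation data, with invented genes and GO-like terms (HR's actual conjecture making and pruning are more elaborate).

% Invented gene-annotation data: annotated(Gene, Term).
annotated(g1, dna_binding).  annotated(g1, nucleus).
annotated(g2, dna_binding).  annotated(g2, nucleus).
annotated(g3, nucleus).

% Term A "implies" term B on this data if every gene annotated with A is also
% annotated with B; equivalence is implication in both directions.
term_implies(A, B) :-
    A \== B,
    \+ ( annotated(G, A), \+ annotated(G, B) ).

term_equivalent(A, B) :- term_implies(A, B), term_implies(B, A).

% ?- term_implies(dna_binding, nucleus).   % a candidate link on this toy data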
Studying Biochemical Networks • Use SLPs to find mappings between genomes • Map function of pairs of homologous proteins • E.g., mouse and human • Homology is probabilistic • Developed SLP learning algorithms • Initial results applying them in biological networks • Work by • Muggleton, Angeloupolos and Watanabe
Closed Loop Machine Learning • Active learning • An information-theoretic algorithm designs and chooses the most informative, lowest-cost experiments to carry out • Implemented in the ASE-Progol system • Learning generates hypotheses • Being studied by Ali Hafiz (PhD) • Idea: use machine learning to guide experimentation • Using a real robot geneticist in a cyclic process • Aims of current project: determine the function of genes • Cost savings of 2 to 4 times over alternatives • Upcoming Nature article
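A toy stand-in for the experiment-selection step, with invented hypotheses, outcomes and costs; ASE-Progol's actual criterion is Bayesian and information-theoretic, whereas this simply scores each experiment by the hypothesis pairs it distinguishes per unit cost.

% Each candidate hypothesis predicts the outcome of each possible experiment.
predicts(h1, knockout_a, no_growth).  predicts(h1, knockout_b, grows).
predicts(h2, knockout_a, grows).      predicts(h2, knockout_b, grows).
predicts(h3, knockout_a, grows).      predicts(h3, knockout_b, no_growth).

cost(knockout_a, 3.0).
cost(knockout_b, 1.0).

% Score = number of hypothesis pairs the experiment distinguishes, per unit cost.
score(Exp, Score) :-
    cost(Exp, Cost),
    aggregate_all(count,
                  ( predicts(H1, Exp, O1), predicts(H2, Exp, O2),
                    H1 @< H2, O1 \== O2 ),
                  Pairs),
    Score is Pairs / Cost.

best_experiment(Exp) :-
    findall(S-E, score(E, S), Scored),
    max_member(_-Exp, Scored).

% ?- best_experiment(Exp).   % Exp = knockout_b on this toy data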
APRIL 2 • Applications of Probabilistic Relational Induction in Logic • Aim: develop representations and learning algorithms for probabilistic logics • Applications: bioinformatics • Metabolic networks • Phylogenetics • 2 RAs at Imperial (with Mike Sternberg) • Starting in January
The Metalog Project Overview • Aim: • Modelling disease pathways and predicting toxicity • Gap filling: existing representations correct but incomplete • Predict where the toxin is acting (focus) • Multi-layered problem representation • Meta-network level (Bayes nets) Philip • Network level (SLPs) Huma • Biochemical reaction level (LPs) Alireza • Problog lingua-franca developed • to represent learned knowledge • NMR Data from metabonomics from Jeremy Nicholson • KEGG Background knowledge from Mike Sternberg
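A minimal sketch of the biochemical reaction level as a logic program, with an invented KEGG-style fragment and invented observations, showing how a "gap" (an observed metabolite the current network cannot explain) surfaces as a candidate for abductive gap filling.

% Invented reaction-level fragment: reaction(Enzyme, Substrate, Product).
reaction(enz1, glucose, g6p).
reaction(enz2, g6p,     f6p).

% A metabolite is reachable from a source via some chain of known reactions.
reachable(M, M).
reachable(From, To) :- reaction(_, From, Mid), reachable(Mid, To).

% Metabolites observed to change (e.g. in the NMR metabonomic data).
observed(glucose).  observed(f6p).  observed(pyruvate).

% A "gap": an observed metabolite the network cannot explain from the nutrient
% source, i.e. a candidate site for abductive gap filling.
gap(Metabolite) :- observed(Metabolite), \+ reachable(glucose, Metabolite).

% ?- gap(M).   % M = pyruvate: no known route from glucose in this fragment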
The Metalog Project Progress • Year 1 achievements (all objectives achieved) • Function predictions from LIGAND • Mapping between KEGG and metabolic networks • Initial Bayes-net model • Has drawn much interest from experts • Agrees with KEGG, and disagrees in interesting ways • Interactions between metabolites which are not explained • Year 2 • Working towards abductive model for gap filling
Future Directions for Machine Learning in Bioinformatics • In-silico modelling of complete organisms • Representation and reasoning at all levels • From the patient to the molecule • Probabilistic models • For more complex biological processes • Such as biochemical pathways
Biochemical Pathways • 1/120th of a biochemical network
Future Directions for My Research • Descriptive Induction meets Biology data • Most ML bioinformatics projects are predictive • Very carefully compressed notions of interestingness • Into a single measure: predictive accuracy • Domain scientist not bombarded with a lot of information • A correctly answered question can be highly revealing • Can we push this envelope slightly? • Use descriptive induction (WARMR, CLAUDIEN, HR) • To tell biologists something they weren’t expecting about the data they have collated • Have to worry hard about dull output • Need to determine heuristics from domain scientists
More Future Directions • Put “Automated Reasoning” back together again • Essential for scientific discovery • ML, ATP, CSP, etc., all work well individually • Surely they work better in combination… • Improve ATP to prove a different theorem? • Make it flexible using CSP and ATP • Improve ML by rationalising input concepts? • Use ATF and ATP to find concepts and hypotheses • Improve CSP by introducing additional constraints? • Use ATF, ML to find constraints, ATP to prove them