Speeding Up Relational Data Mining by Learning to Estimate Candidate Hypothesis Scores
Frank DiMaio and Jude Shavlik, UW-Madison Computer Sciences
ICDM Foundations and New Directions of Data Mining Workshop, 19 November 2003
Rule-Based Learning • Goal: Induce a rule (or rules) that explains ALL positive examples and NO negative examples
Inductive Logic Programming (ILP) • Encode background knowledge in first-order logic as facts …
containsBlock(ex1,block1A). containsBlock(ex1,block1B).
isRed(block1A). isSquare(block1A).
isBlue(block1B). isRound(block1B).
onTopOf(block1B,block1A).
… and logical relations …
above(A,B) :- onTopOf(A,B).
above(A,B) :- onTopOf(A,Z), above(Z,B).
Inductive Logic Programming (ILP) • Covering algorithm applied to explain all the data (a sketch follows): Repeat until every positive example is covered: choose some positive example, generate the best rule that covers it, then remove all positive examples covered by that rule
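A minimal sketch of this covering loop, assuming hypothetical helpers learn_best_rule and covers stand in for the ILP rule search and clause evaluation:

```python
def covering_algorithm(positives, negatives, learn_best_rule, covers):
    """Learn rules until every positive example is covered (sketch of the loop above)."""
    rules = []
    uncovered = set(positives)
    while uncovered:
        seed = next(iter(uncovered))                         # choose some positive example
        rule = learn_best_rule(seed, uncovered, negatives)   # best rule that covers that example
        rules.append(rule)
        uncovered = {e for e in uncovered if not covers(rule, e)}  # drop covered positives
    return rules
```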
Inductive Logic Programming (ILP) • Saturate an example by writing down everything true about it (a simplified sketch follows) • The saturation of an example is the bottom clause (⊥), e.g. for example ex2 (a stack with block C on block B on block A):
positive(ex2) :- containsBlock(ex2,block2A), containsBlock(ex2,block2B), containsBlock(ex2,block2C), isRed(block2A), isRound(block2A), isBlue(block2B), isRound(block2B), isBlue(block2C), isSquare(block2C), onTopOf(block2B,block2A), onTopOf(block2C,block2B), above(block2B,block2A), above(block2C,block2B), above(block2C,block2A).
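A simplified sketch of saturation, assuming background facts are given as (predicate, args) tuples; real systems such as Progol or Aleph additionally use mode declarations and depth limits, which are omitted here:

```python
def saturate(example_constants, facts):
    """Collect every ground literal reachable from the example's constants."""
    known = set(example_constants)
    bottom_clause = []
    added = True
    while added:                                  # keep sweeping until no new literal is found
        added = False
        for pred, args in facts:
            if (pred, args) not in bottom_clause and any(a in known for a in args):
                bottom_clause.append((pred, args))
                known.update(args)                # constants in new literals become reachable
                added = True
    return bottom_clause
```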
Inductive Logic Programming (ILP) • Candidate clauses are generated by choosing literals from ⊥ and converting ground terms to variables (sketched below) • Search through the space of candidate clauses using a standard AI search algorithm • The bottom clause ensures the search space is finite • Example: selected literals from ⊥ are containsBlock(ex2,block2B), onTopOf(block2B,block2A), isRed(block2A); the resulting candidate clause is positive(A) :- containsBlock(A,B), onTopOf(B,C), isRed(C).
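A sketch of the variabilization step, with an illustrative (assumed) variable-naming scheme that reproduces the slide's example:

```python
from string import ascii_uppercase

def variabilize(example_id, selected_literals):
    """selected_literals: ground (predicate, args) literals chosen from the bottom clause."""
    var_of = {example_id: 'A'}                 # the example itself becomes variable A
    fresh = iter(ascii_uppercase[1:])          # B, C, D, ...
    body = []
    for pred, args in selected_literals:
        vars_ = []
        for term in args:
            if term not in var_of:
                var_of[term] = next(fresh)     # each new ground term gets a fresh variable
            vars_.append(var_of[term])
        body.append(f"{pred}({','.join(vars_)})")
    return f"positive(A) :- {', '.join(body)}."

# variabilize('ex2', [('containsBlock', ('ex2', 'block2B')),
#                     ('onTopOf', ('block2B', 'block2A')),
#                     ('isRed', ('block2A',))])
# -> 'positive(A) :- containsBlock(A,B), onTopOf(B,C), isRed(C).'
```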
ILP Time Complexity • Time complexity of ILP systems depends on: size of the bottom clause |⊥|, maximum clause length c, number of examples |E|, and search algorithm Π • O(|⊥|^c |E|) for exhaustive search • O(|⊥| |E|) for greedy search • Assumes constant-time clause evaluation!
Ideas in Speeding Up ILP • Search algorithm improvements: better heuristic functions and search strategies; Srinivasan's (2000) random uniform sampling (consider O(1) candidate clauses) • Faster clause evaluations: evaluation time of a clause (on one example) is exponential in the number of variables; clause reordering and optimizing (Blockeel et al. 2002, Santos Costa et al. 2003) • Evaluation of a candidate is still O(|E|)
A Faster Clause Evaluation • Our idea: predict a clause's evaluation in O(1) time (i.e., independent of the number of examples) • Use a multilayer feed-forward neural network to approximately score candidate clauses • NN inputs specify which bottom-clause (⊥) literals are selected • There is a unique input for every candidate clause in the search space
Neural Network Topology • Each literal of ⊥ corresponds to one binary network input: 1 if the literal is selected for the candidate clause, 0 otherwise • Example inputs for the candidate clause positive(A) :- containsBlock(A,B), onTopOf(B,C), isRed(C): containsBlock(ex2,block2B) = 1, onTopOf(block2B,block2A) = 1, isRed(block2A) = 1, isRound(block2A) = 0 • The network combines (Σ) the weighted inputs to produce the predicted output, the clause's approximate score (see the sketch below)
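A sketch of this encoding and an approximate scorer, using a generic scikit-learn MLPRegressor as a stand-in for the paper's network (the architecture and training settings here are illustrative assumptions, not the authors'):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def encode(candidate_literals, bottom_clause):
    """One binary input per bottom-clause literal: 1 if selected for the candidate clause."""
    selected = set(candidate_literals)
    return np.array([1.0 if lit in selected else 0.0 for lit in bottom_clause])

def train_scorer(encodings, true_scores):
    """Fit the network on (clause encoding, true evaluation score) pairs gathered so far."""
    net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000)
    net.fit(np.vstack(encodings), np.asarray(true_scores))
    return net

# Scoring a new candidate is then O(1) in the number of examples |E|:
#   predicted = net.predict(encode(candidate, bottom_clause).reshape(1, -1))[0]
```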
Speeding Up ILP • The trained neural network provides a tool for approximate evaluation in O(1) time • Given enough examples (large |E|), approximate evaluation is essentially free compared with evaluation on the data • During ILP's search over the hypothesis space: approximately evaluate every candidate explored; only evaluate a clause on the data if it is "promising" • Adaptive sampling: use real evaluations gathered during search to improve the approximation (a sketch of this loop follows)
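A sketch of that search loop, reusing encode() and the trained net from the previous sketch; propose_candidates, true_score, and is_promising are hypothetical helpers standing in for the real system's search operators, data-based evaluation, and promise test:

```python
def guided_search(net, bottom_clause, propose_candidates, true_score, is_promising, steps=100):
    best_clause, best_score = None, float('-inf')
    xs, ys = [], []                                    # real evaluations seen so far
    for _ in range(steps):
        for clause in propose_candidates(best_clause):
            x = encode(clause, bottom_clause).reshape(1, -1)
            predicted = net.predict(x)[0]              # O(1): independent of |E|
            if is_promising(predicted, best_score):
                score = true_score(clause)             # costly O(|E|) evaluation on the data
                xs.append(x[0]); ys.append(score)
                if score > best_score:
                    best_clause, best_score = clause, score
        if xs:
            net.fit(np.vstack(xs), np.asarray(ys))     # adaptive sampling: refine the approximation
    return best_clause, best_score
```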
When to Evaluate Approximated Clauses? • Treat the neural network's predicted score as the mean of a Gaussian distribution over the true score • Only evaluate a clause on the data when there is sufficient likelihood that it is the best seen so far • Example: with a current best score of 22, a potential move predicted at 11.1 has P(best) = 0.03, so don't evaluate it; a move predicted at 18.9 has P(best) = 0.24, so evaluate it
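A sketch of this test using a normal distribution via scipy; the standard deviation and probability threshold below are illustrative assumptions, not values from the paper:

```python
from scipy.stats import norm

def probability_best(predicted, best_so_far, sigma):
    """P(true score > best so far) when the true score ~ N(predicted, sigma^2)."""
    return 1.0 - norm.cdf(best_so_far, loc=predicted, scale=sigma)

def is_promising(predicted, best_so_far, sigma=5.0, threshold=0.2):
    return probability_best(predicted, best_so_far, sigma) >= threshold

# With a current best of 22, a clause predicted at 18.9 yields a much larger
# P(best) than one predicted at 11.1, so only the former is evaluated on the data.
```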
Results • Evaluated on benchmark datasets: Carcinogenesis, Mutagenesis, Protein Metabolism, Nuclear Smuggling • Clauses generated by random sampling • Clause evaluation metric: compression = (posCovered - negCovered - length + 1) / totalPositives • 10-fold cross-validation learning curves
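The evaluation metric above as code, a direct transcription of the slide's formula:

```python
def compression(pos_covered, neg_covered, clause_length, total_positives):
    """compression = (posCovered - negCovered - length + 1) / totalPositives"""
    return (pos_covered - neg_covered - clause_length + 1) / total_positives
```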
Future Work • Test in an ILP system: potential for speedup in datasets with many examples; will the approximation's inaccuracy hurt search? • The trained network defines a function (predicted score) over the space of candidate clauses • We can use this function to extract concepts and to escape local maxima in heuristic search
Acknowledgements Funding provided by • NLM grant 1T15 LM007359-01 • NLM grant 1R01 LM07050-01 • DARPA EELD grant F30602-01-2-0571