Learning Ensembles of First-Order Clauses for Recall-Precision Curves: A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and Jude Shavlik Department of Computer Sciences University of Wisconsin – Madison USA 19 Sept 2004
Talk Outline • Inductive Logic Programming • Biomedical Information Extraction • Our Gleaner Approach • Aleph Ensembles • Evaluation and Results • Future Work
Inductive Logic Programming • Machine Learning • Classify data into categories • Divide data into train and test sets • Generate hypotheses on train set and then measure performance on test set • In ILP, data are Objects … • person, block, molecule, word, phrase, … • and Relations between them • grandfather, has_bond, is_member, …
Learning daughter(A,B) • Positive examples • daughter(mary, ann) • daughter(eve, tom) • Negative examples • daughter(tom, ann) • daughter(eve, ann) • daughter(ian, tom) • daughter(ian, ann) • … • Background knowledge • mother(ann, mary) • mother(ann, tom) • father(tom, eve) • father(tom, ian) • female(ann) • female(mary) • female(eve) • male(tom) • male(ian) • [Figure: family tree linking Ann (mother of Mary and Tom) and Tom (father of Eve and Ian)] • Possible rules • daughter(A,B) :- true. • daughter(A,B) :- female(A). • daughter(A,B) :- female(A), male(B). • daughter(A,B) :- female(A), father(B,A). • daughter(A,B) :- female(A), mother(B,A). • …
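To make the coverage test concrete, here is a minimal Python sketch of evaluating one candidate rule against the slide's family facts (the encoding and the combined mother-or-father test are illustrative; the actual system uses a Prolog-based ILP engine):

```python
# Background knowledge from the slide, encoded as plain fact sets.
mother = {("ann", "mary"), ("ann", "tom")}
father = {("tom", "eve"), ("tom", "ian")}
female = {"ann", "mary", "eve"}

positives = {("mary", "ann"), ("eve", "tom")}
negatives = {("tom", "ann"), ("eve", "ann"), ("ian", "tom"), ("ian", "ann")}

# Candidate rule: daughter(A,B) :- female(A), (mother(B,A) ; father(B,A)).
def covers(a, b):
    return a in female and ((b, a) in mother or (b, a) in father)

# Count covered positives and negatives: this rule covers both positives and no negatives.
tp = sum(covers(a, b) for a, b in positives)
fp = sum(covers(a, b) for a, b in negatives)
print(tp, fp)  # 2 0
```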
ILP Domains • Object Learning • Trains, Carcinogenesis • Link Learning • Binary predicates
Biomedical Information Extraction *image courtesy of National Human Genome Research Institute
Biomedical Information Extraction • Given: Medical Journal abstracts tagged with protein localization relations • Do: Construct system to extract protein localization phrases from unseen text NPL3 encodes a nuclear protein with an RNA recognition motif and similarities to a family of proteins involved in RNA metabolism.
Biomedical Information Extraction • [Figure: parse tree of "NPL3 encodes a nuclear protein with …", breaking the sentence into noun, verb and prepositional phrases with part-of-speech tags, plus markers for alphanumeric and marked-location words]
Sample Extraction Structure • Find structures using ILP • [Figure: an example learned structure, with a protein phrase containing an alphanumeric noun, a location phrase containing a marked-location noun, and constraints on the phrases between and around them]
Protein Localization Extraction • Hand-labeled dataset (Ray & Craven ’01) • 7,245 sentences from 871 abstracts • Examples are phrase-phrase combinations • 1,810 positive & 279,154 negative • 1.6 GB of background knowledge • Structural, Statistical, Lexical and Ontological • In total, 200+ distinct background predicates
Our Generate-and-Test Approach • [Figure: the parsed sentence "NPL3 encodes a nuclear protein with …" with its noun phrases highlighted, and the candidate rel(Protein, Location) phrase pairs generated from it]
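As an illustration of the generate step, a minimal Python sketch that enumerates candidate phrase pairs from a parsed sentence (the phrase representation and names here are hypothetical; the real system builds protein/location candidates from the parser's noun phrases):

```python
from itertools import permutations

# Hypothetical parse: each noun phrase is (phrase_id, words).
noun_phrases = [
    ("np1", ["NPL3"]),
    ("np2", ["a", "nuclear", "protein"]),
    ("np3", ["an", "RNA", "recognition", "motif"]),
]

# Generate-and-test: every ordered pair of noun phrases is a candidate
# rel(Protein, Location) example; the learned clauses then filter these.
candidates = list(permutations(noun_phrases, 2))
print(len(candidates))  # 6 candidate pairs for this toy sentence
```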
Some Ranking Predicates • High-scoring words in protein phrases • repressor, ypt1p, nucleoporin • High-scoring words in location phrases • cytoskeleton, inner, predominately • High-scoring words BETWEEN protein & location phrases • cofraction, mainly, primarily, …, locate • Stemming seemed to hurt here … • Warning: these word statistics must be computed PER fold
Some Biomedical Predicates • On-Line Medical Dictionary • natural source for semantic classes • e.g., word occurs in category ‘cell biology’ • Medical Subject Headings (MeSH) • canonical method for indexing biomedical articles • ISA hierarchy of words and subcategories • Gene Ontology (GO) • another ISA hierarchy of biological knowledge
Some More Predicates • Look-ahead phrase predicates • few_POS_in_phrase(Phrase, POS) • phrase_contains_specific_word_triple(Phrase, W1, W2, W3) • phrase_contains_some_marked_up_arg(Phrase, Arg#, Word, Fold) • Relative location of phrases • protein_before_location(ExampleID) • word_pair_in_between_target_phrases(ExampleID, W1, W2)
Link Learning • Large skew toward negatives • 500 relational objects • 5,000 positive links means 245,000 negative links • Difficult to measure success • An always-negative classifier is 98% accurate • ROC curves look overly optimistic • Enormous quantity of data • 4,285,199,774 web pages indexed by Google • PubMed includes over 15 million citations
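The skew arithmetic behind the 98% figure, spelled out with the numbers on this slide:

```python
pos, neg = 5_000, 245_000                      # positive vs. negative links
always_negative_accuracy = neg / (pos + neg)   # classifier that predicts "no link" everywhere
print(always_negative_accuracy)                # 0.98, yet it finds none of the links we care about
```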
Our Approach • Develop fast ensemble algorithms focused on recall and precision evaluation • Key ideas of Gleaner • Keep a wide range of clauses • Create separate theories for different recall ranges • Evaluation • Area Under the Recall-Precision Curve (AURPC) • Time = number of clauses considered
Gleaner - Background • Predictions vs. actual labels give four outcomes: true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) • Focus on positive examples • Recall = TP / (TP + FN) • Precision = TP / (TP + FP)
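In code, the two measures are simply (a small sketch; the example counts are arbitrary):

```python
def recall(tp, fn):
    # Fraction of actual positives that are predicted positive.
    return tp / (tp + fn) if tp + fn else 0.0

def precision(tp, fp):
    # Fraction of positive predictions that are actually positive.
    return tp / (tp + fp) if tp + fp else 0.0

print(recall(750, 250), precision(750, 4750))  # 0.75 and roughly 0.14
```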
Gleaner - Background • Seed Example • A positive example that our clause must cover • Bottom Clause • All predicates which are true about the seed example • Rapid Random Restart (Zelezny et al., ILP 2002) • Stochastic selection of a starting clause • Time-limited local heuristic search • We store a variety of clauses (based on recall)
Gleaner - Learning • Create B recall bins • Generate clauses • Record the best clause per bin • Repeat for K seeds • [Figure: clauses plotted in recall-precision space, with the best clause retained in each recall bin]
Gleaner - Combining • Combine K clauses per bin • If at least L of K clauses match, call the example positive • How to choose L? • L=1 gives high recall, low precision • L=K gives low recall, high precision • Our method • Choose L such that the ensemble's recall matches bin b • Bin b’s precision should then be higher than that of any clause in it • We should now have a set of high-precision rule sets spanning the space of recall levels
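One reasonable reading of the threshold-selection step, sketched in Python: start from the strictest threshold and relax it until the "at least L of K" ensemble reaches the bin's recall (the clause/example match matrix and its names are illustrative):

```python
def choose_L(clause_matches, labels, target_recall):
    """clause_matches[i][j]: does clause i cover example j; labels[j]: is example j positive.
    Return the largest L whose 'at least L of K clauses match' ensemble still
    reaches the bin's target recall (larger L keeps precision higher)."""
    K, n = len(clause_matches), len(labels)
    votes = [sum(clause_matches[i][j] for i in range(K)) for j in range(n)]
    total_pos = sum(labels)
    for L in range(K, 0, -1):
        tp = sum(1 for j in range(n) if votes[j] >= L and labels[j])
        if total_pos and tp / total_pos >= target_recall:
            return L
    return 1
```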
How to use Gleaner • Generate curve • User selects recall bin • Return classifications with precision confidence • [Figure: recall-precision curve; e.g. selecting the bin at recall = 0.50 returns classifications with precision = 0.70]
Aleph - Learning • Aleph learns theories of clauses (Srinivasan, v4, 2003) • Pick a positive seed example, find its bottom clause • Use heuristic search to find the best clause • Pick a new seed from the uncovered positives and repeat until a threshold of positives is covered • A theory produces one recall-precision point • Learning complete theories is time-consuming • Can produce a ranking with ensembles
Aleph Ensembles • We compare to ensembles of theories • Algorithm (Dutra et al., ILP 2002) • Use K different initial seeds • Learn K theories containing C clauses each • Rank examples by the number of theories that cover them • Need to balance C for high performance • Small C leads to low recall • Large C leads to converging theories
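A sketch of the ensemble's ranking step, with theories and clauses represented as plain Python predicates (an illustration, not Aleph's own API):

```python
def rank_by_theory_votes(theories, examples):
    """theories: list of theories, each a list of clause predicates (example -> bool).
    An example's score is how many theories have at least one clause covering it;
    ranking all examples by this score traces out a recall-precision curve."""
    def votes(x):
        return sum(any(clause(x) for clause in theory) for theory in theories)
    return sorted(examples, key=votes, reverse=True)
```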
Evaluation Metrics • Area Under the Recall-Precision Curve (AURPC) • All curves standardized to cover the full recall range • AURPC averaged over 5 folds • Number of clauses considered • Rough estimate of time • Both are “stop anytime” parallel algorithms • [Figure: recall-precision plot with both axes running from 0 to 1.0]
AURPC Interpolation • Is convex (straight-line) interpolation valid in RP space? • Precision interpolation is counterintuitive • Example: 1,000 positive & 9,000 negative examples • [Figure: example counts, the corresponding ROC curves, and RP curves for the interpolated points]
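The fix is to interpolate between the underlying TP/FP counts rather than drawing a straight line between recall-precision points. A Python sketch, using endpoints consistent with the slide's 1,000-positive / 9,000-negative example (the exact endpoint counts are a reconstruction, not taken from the slide):

```python
def interpolate_rp(tp_a, fp_a, tp_b, fp_b, total_pos):
    """Step through true-positive counts between two curve points, adding false
    positives at the local rate, instead of linearly interpolating precision."""
    fp_per_tp = (fp_b - fp_a) / (tp_b - tp_a)   # extra FPs incurred per extra TP
    points = []
    for x in range(tp_b - tp_a + 1):
        tp, fp = tp_a + x, fp_a + x * fp_per_tp
        points.append((tp / total_pos, tp / (tp + fp)))   # (recall, precision)
    return points

# Endpoint A: TP=500, FP=500 (recall 0.5, precision 0.5); endpoint B: TP=1000, FP=9000.
curve = interpolate_rp(500, 500, 1000, 9000, 1000)
print(curve[250])   # (0.75, ~0.14): far below the straight-line value of 0.30
```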
Experimental Methodology • Performed five-fold cross-validation • Variation of parameters • Gleaner (20 recall bins) • # seeds = {25, 50, 75, 100} • # clauses = {1K, 10K, 25K, 50K, 100K, 250K, 500K} • Ensembles (0.75 minacc, 35,000 nodes) • # theories = {10, 25, 50, 75, 100} • # clauses per theory = {1, 5, 10, 15, 20, 25, 50}
Results: Testfold 5 at 1,000,000 clauses • [Figure: recall-precision curves comparing Gleaner with the Aleph ensembles]
Conclusions • Gleaner • Focuses on recall and precision • Keeps a wide spectrum of clauses • Good results in few CPU cycles • Aleph ensembles • ‘Early stopping’ helpful • Require more CPU cycles • AURPC • Useful metric for comparison • Interpolation unintuitive
Future Work • Improve Gleaner performance over time • Explore alternate clause combinations • Better understanding of AURPC • Search for clauses that optimize AURPC • Examine more ILP link-learning datasets • Use Gleaner with other ML algorithms
Acknowledgements • USA NLM Grant 5T15LM007359-02 • USA NLM Grant 1R01LM07050-01 • USA DARPA Grant F30602-01-2-0571 • USA Air Force Grant F30602-01-2-0571 • Condor Group • David Page • Vitor Santos Costa, Ines Dutra • Soumya Ray, Marios Skounakis, Mark Craven Dataset available at (URL in proceedings) ftp://ftp.cs.wisc.edu/machine-learning/shavlik-group/datasets/IE-protein-location
Deleted Scenes • Clause Weighting • Gleaner Algorithm • Director Commentary: on / off
Take-Home Message • Definition of Gleaner • One who gathers grain left behind by reapers • Gleaner and ILP • Many clauses constructed and evaluated in ILP hypothesis search • We need to make better use of those that aren’t the highest scoring ones • Thanks, Questions?
Clause Weighting • Single-theory ensemble • Rank by how many clauses cover each example • Weight clauses using tuning-set statistics • CN2 score (average precision of matching clauses) • Lowest false-positive rate • Cumulative F1 score • Cumulative recall • Cumulative precision • Diversity
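A sketch of one of these schemes, a CN2-style weighted vote in which an example is scored by the average tuning-set precision of the clauses that cover it (the representation is illustrative):

```python
def cn2_score(example, clauses, clause_precision):
    """clauses: list of predicates (example -> bool); clause_precision: dict mapping
    each clause to its tuning-set precision. Examples matched by no clause score zero."""
    matched = [clause_precision[c] for c in clauses if c(example)]
    return sum(matched) / len(matched) if matched else 0.0
```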
Gleaner Algorithm • Create B equal-sized recall bins • For each of K different seeds • Generate rules using Rapid Random Restart • Record the best rule (by precision × recall) found for each bin • For each recall bin b • Find the threshold L of K clauses such that the recall of “at least L of K clauses match the example” equals the recall for this bin • Find recall and precision on the test set using each bin’s “at least L of K” decision process
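Putting the steps together, a compact Python sketch of the learning loop (the clause generator and evaluator are supplied by the caller; this is an illustration of the algorithm above, not the released implementation):

```python
def gleaner_learn(seeds, generate_clauses, evaluate, B=20):
    """For each seed, search for clauses (e.g. via Rapid Random Restart) and keep
    the best clause, by precision x recall, falling into each of B recall bins;
    pool the kept clauses across seeds so every bin ends up with up to K clauses."""
    bins = {b: [] for b in range(B)}
    for seed in seeds:
        best = {}                                      # best (score, clause) per bin for this seed
        for clause in generate_clauses(seed):
            rec, prec = evaluate(clause)               # train-set recall and precision
            b = min(int(rec * B), B - 1)
            if b not in best or prec * rec > best[b][0]:
                best[b] = (prec * rec, clause)
        for b, (_, clause) in best.items():
            bins[b].append(clause)
    return bins
```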