This talk presents Gleaner, a fast ensemble algorithm for biomedical information extraction whose evaluation focuses on recall and precision, measured as the Area Under the Recall-Precision Curve (AURPC). Results show that Gleaner outperforms Aleph ensembles, with room left to improve performance over time and to explore alternate clause combinations.
Learning Ensembles of First-Order Clauses for Recall-Precision Curves: A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and Jude Shavlik Department of Computer Sciences University of Wisconsin – Madison USA 6 Sept 2004
Talk Outline • Link Learning and ILP • Our Gleaner Approach • Aleph Ensembles • Biomedical Information Extraction • Evaluation and Results • Future Work
ILP Domains • Object Learning • Trains, Carcinogenesis • Link Learning • Binary predicates
Link Learning • Large skew toward negatives • 500 relational objects with 5,000 positive links leave 245,000 negative links • Difficult to measure success • An always-negative classifier is 98% accurate • ROC curves look overly optimistic • Enormous quantity of data • 4,285,199,774 web pages indexed by Google • PubMed includes over 15 million citations
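To make the skew concrete, here is a quick sanity check in plain Python using the counts from this slide:

```python
# Counts from the slide: 500 relational objects give 500 * 500
# candidate links, of which 5,000 are true (positive) links.
objects = 500
candidates = objects * objects        # 250,000 candidate links
positives = 5_000
negatives = candidates - positives    # 245,000 negative links

# A classifier that always answers "negative" gets every negative
# right and every positive wrong:
accuracy = negatives / candidates
print(f"always-negative accuracy: {accuracy:.0%}")   # -> 98%
# ...while finding none of the links we actually care about.
```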
Our Approach • Develop fast ensemble algorithms focused on recall and precision evaluation • Key ideas of Gleaner • Keep a wide range of clauses • Create separate theories for different recall ranges • Evaluation • Area Under Recall-Precision Curve (AURPC) • Time = number of clauses considered
Gleaner - Background • Focus evaluation on positive examples • Recall = TP / (TP + FN) • Precision = TP / (TP + FP) • Rapid Random Restart (Zelezny et al., ILP 2002) • Stochastic selection of starting clause • Time-limited local heuristic search • We store a variety of clauses (based on recall)
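A minimal encoding of the two formulas above; the function names are ours:

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of true links that were found: TP / (TP + FN)."""
    return tp / (tp + fn)

def precision(tp: int, fp: int) -> float:
    """Fraction of predicted links that are real: TP / (TP + FP)."""
    return tp / (tp + fp)

# Neither score rewards the 245,000 easy negatives, so the
# always-negative classifier now scores recall = 0 rather than
# 98% accuracy.
```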
Gleaner - Learning • Create B recall bins • Generate clauses • Record best clause per bin • Repeat for K seeds [Figure: clauses plotted in recall-precision space; the best clause found in each recall bin is kept]
Gleaner - Combining • Combine K clauses per bin • If at least L of K clauses match, call example positive • How to choose L? • L = 1 gives high recall, low precision • L = K gives low recall, high precision • Our method • Choose L such that ensemble recall matches bin b • Bin b's precision should be higher than any clause in it • We now have a set of high-precision rule sets spanning the space of recall levels (see the sketch below)
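One plausible implementation of the threshold choice on a held-out tuning set, assuming each example's vote count (how many of the K clauses match it) is precomputed; `votes`, `labels`, and the function name are ours:

```python
def choose_threshold(votes, labels, k, target_recall):
    """Return the largest L such that the rule 'positive if at
    least L of the K clauses match' still reaches the bin's
    target recall on the tuning set.

    votes[i]  -- number of the K clauses matching example i
    labels[i] -- True if example i is a positive link
    """
    total_pos = sum(labels)
    for L in range(k, 0, -1):     # start strict: high L, high precision
        tp = sum(1 for v, y in zip(votes, labels) if y and v >= L)
        if tp / total_pos >= target_recall:
            return L              # strictest L that matches bin recall
    return 1                      # fall back to the most permissive rule
```

Scanning from K downward keeps precision as high as possible while still hitting the bin's recall target.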
How to use Gleaner • Generate curve • User selects recall bin • Return classifications with precision confidence [Figure: recall-precision curve; selecting the bin at Recall = 0.50 returns classifications with Precision = 0.70]
Aleph Ensembles • We compare to ensembles of theories • Algorithm (Dutra et al., ILP 2002) • Use K different initial seeds • Learn K theories, each containing C clauses • Rank examples by the number of theories that cover them • Need to balance C for high performance • Small C leads to low recall • Large C leads to converging theories
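The ranking step can be sketched as counting theory votes; the `covers` callable stands in for clause matching, which the slide does not specify:

```python
def theory_votes(theories, example, covers):
    """Score an example by how many of the K theories cover it.

    theories -- list of K theories, each a list of clauses
    covers   -- covers(clause, example) -> bool (placeholder)

    Sweeping a threshold over this score yields one
    recall-precision point per threshold value.
    """
    return sum(1 for theory in theories
               if any(covers(clause, example) for clause in theory))
```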
Biomedical Information Extraction • Given: medical journal abstracts tagged with protein-localization relations • Do: construct a system to extract protein-localization phrases from unseen text • Example: "NPL3 encodes a nuclear protein with an RNA recognition motif and similarities to a family of proteins involved in RNA metabolism."
Biomedical Information Extraction • Hand-labeled dataset (Ray & Craven ’01) • 7,245 sentences from 871 abstracts • Examples are phrase-phrase combinations • 1,810 positive & 279,154 negative • 1.6 GB of background knowledge • Structural, Statistical, Lexical and Ontological • In total, 200+ distinct background predicates
Evaluation Metrics • Two dimensions • Area Under Recall-Precision Curve (AURPC) • All curves standardized to cover the full recall range • AURPC averaged over 5 folds • Number of clauses considered • Rough estimate of time • Both are "stop anytime" parallel algorithms [Figure: recall-precision axes, each running from 0 to 1.0]
AURPC Interpolation • Is convex (straight-line) interpolation valid in RP space? • Precision interpolation is counterintuitive • Example: 1,000 positive & 9,000 negative examples [Figure: the counts TP = 750, FP = 4750 map to the ROC point (FPR = 0.53, TPR = 0.75) but to the RP point (Recall = 0.75, Precision = 0.14)]
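The safe way to interpolate is in TP/FP count space rather than in RP space. The sketch below reproduces the slide's numbers; the endpoint counts (TP = 500, FP = 500) and (TP = 1000, FP = 9000) are our reconstruction, chosen because their count-space midpoint is exactly the slide's TP = 750, FP = 4750:

```python
def interpolate_rp(tp_a, fp_a, tp_b, fp_b, total_pos, steps=10):
    """Interpolate between two recall-precision points by walking
    linearly in TP/FP *count* space.  Precision is recomputed at
    every step, producing the hyperbolic dip that a straight line
    drawn directly in RP space would hide."""
    points = []
    for s in range(steps + 1):
        t = s / steps
        tp = tp_a + t * (tp_b - tp_a)
        fp = fp_a + t * (fp_b - fp_a)
        points.append((tp / total_pos, tp / (tp + fp)))  # (recall, precision)
    return points

# 1,000 positives, 9,000 negatives; midpoint in count space:
mid = interpolate_rp(500, 500, 1000, 9000, total_pos=1000, steps=2)[1]
print(mid)   # (0.75, ~0.136): precision 0.14, far below a straight RP line
```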
Experimental Methodology • Performed five-fold cross-validation • Variation of parameters • Gleaner (20 recall bins) • # seeds = {25, 50, 75, 100} • # clauses = {1K, 10K, 25K, 50K, 100K, 250K, 500K} • Ensembles (0.75 minacc, 35,000 nodes) • # theories = {10, 25, 50, 75, 100} • # clauses per theory = {1, 5, 10, 15, 20, 25, 50}
Results: Testfold 5 at 1,000,000 clauses [Figure: recall-precision curves comparing Gleaner with Aleph ensembles]
Conclusions • Gleaner • Focuses on recall and precision • Keeps a wide spectrum of clauses • Good results in few CPU cycles • Aleph ensembles • 'Early stopping' is helpful • Require more CPU cycles • AURPC • Useful metric for comparison • Interpolation is unintuitive
Future Work • Improve Gleaner performance over time • Explore alternate clause combinations • Better understanding of AURPC • Search for clauses that optimize AURPC • Examine more ILP link-learning datasets • Use Gleaner with other ML algorithms
Take-Home Message • Definition of Gleaner • One who gathers grain left behind by reapers • Gleaner and ILP • Many clauses constructed and evaluated in ILP hypothesis search • We need to make better use of those that aren’t the highest scoring ones • Thanks, Questions?
Acknowledgements • USA NLM Grant 5T15LM007359-02 • USA NLM Grant 1R01LM07050-01 • USA DARPA Grant F30602-01-2-0571 • USA Air Force Grant F30602-01-2-0571 • Condor Group • David Page • Vitor Santos Costa, Ines Dutra • Soumya Ray, Marios Skounakis, Mark Craven Dataset available at (URL in proceedings) ftp://ftp.cs.wisc.edu/machine-learning/shavlik-group/datasets/IE-protein-location
Deleted Scenes • Aleph Learning • Clause Weighting • Sample Gleaner Recall-Precision Curve • Sample Extraction Clause • Gleaner Algorithm • Director Commentary: on / off
Aleph - Learning • Aleph learns theories of clauses (Srinivasan, v4, 2003) • Pick a positive seed example and saturate it • Use heuristic search to find the best clause • Pick a new seed from the uncovered positives and repeat until a threshold of positives is covered • A theory produces one recall-precision point • Learning complete theories is time-consuming • Can produce a ranking with theory ensembles (see the sketch below)
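A compact sketch of that covering loop; `saturate`, `search`, and `covers` are placeholders for Aleph's bottom-clause construction, heuristic clause search, and coverage test:

```python
def learn_theory(positives, negatives, saturate, search, covers,
                 min_coverage=0.95):
    """Covering loop: keep learning clauses from fresh seeds until
    the required fraction of positives is covered.  (In Aleph the
    learned clause always covers its seed, so `uncovered` shrinks
    on every iteration.)"""
    theory = []
    uncovered = set(positives)
    while len(uncovered) > (1 - min_coverage) * len(positives):
        seed = next(iter(uncovered))     # pick an uncovered positive
        bottom = saturate(seed)          # most specific clause for seed
        clause = search(bottom, uncovered, negatives)
        theory.append(clause)
        uncovered = {p for p in uncovered if not covers(clause, p)}
    return theory
```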
Clause Weighting • Single-theory ensemble: rank by how many clauses cover each example • Weight clauses using tuneset statistics • CN2 (average precision of matching clauses) • Lowest false-positive-rate score • Cumulative F1 score • Recall • Precision • Diversity
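As an illustration of the CN2 scheme named above (a sketch; the slide gives only the one-line description):

```python
def cn2_score(matching_clause_precisions):
    """CN2-style score for one example: the average tuning-set
    precision of the clauses that cover it (zero if none match).
    Ranking by this score lets accurate clauses outvote weak ones,
    unlike a raw count of matching clauses."""
    p = matching_clause_precisions
    return sum(p) / len(p) if p else 0.0

print(cn2_score([0.9, 0.6, 0.3]))   # mean precision, ~0.6
```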
Biomedical Information Extraction [Figure: parse of "NPL3 encodes a nuclear protein with …" — the sentence divides into noun phrase, verb phrase, noun phrase, and prep phrase, with word-level tags noun, verb, article, adj, and prep; features such as "alphanumeric" and "marked location" are attached to individual words]
Sample Extraction Clause • P = Protein, L = Location, S = Sentence • 29% recall, 34% precision on testset 1 [Figure: graphical form of the clause — sentence S links the protein phrase P (noun, contains alphanumeric) and the location phrase L (noun, contains marked location) through intermediate phrases, with an article test and the constraints "contains no between-half-X verb" and "contains alphanumeric"]
Gleaner Algorithm • Create B equal-sized recall bins • For each of K different seeds • Generate rules using Rapid Random Restart • Record the best rule (by precision × recall) found for each bin • For each recall bin b • Find the threshold L of K clauses such that the recall of "at least L of K clauses match examples" equals the recall for this bin • Find recall and precision on the testset using each bin's "at least L of K" decision process (sketched below)
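Putting the learning phase in code form, as a minimal sketch: `generate_clauses` stands in for the Rapid Random Restart search and `eval_clause` for training-set evaluation, neither of which the slide spells out:

```python
def gleaner_learn(num_bins, seeds, generate_clauses, eval_clause):
    """Learning phase of Gleaner as described above.  Returns
    best[b][seed] = (score, clause): the clause with the highest
    precision * recall seen in recall bin b for that seed."""
    width = 1.0 / num_bins
    best = [dict() for _ in range(num_bins)]
    for seed in seeds:
        for clause in generate_clauses(seed):       # Rapid Random Restart
            r, p = eval_clause(clause)              # (recall, precision)
            b = min(int(r / width), num_bins - 1)   # recall bin index
            incumbent = best[b].get(seed)
            if incumbent is None or p * r > incumbent[0]:
                best[b][seed] = (p * r, clause)
    return best
```

At test time, each bin b then classifies with its "at least L of K" rule, using the L chosen on the tuning set as in the earlier combining sketch.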