Gleaning Relational Information from Biomedical Text Mark Goadrich Computer Sciences Department University of Wisconsin - Madison Joint Work with Jude Shavlik and Louis Oliphant CIBM Seminar - Dec 5th 2006
Outline • The Vacation Game • Formalizing with Logic • Biomedical Information Extraction • Evaluating Hypotheses • Gleaning Logical Rules • Experiments • Current Directions
The Vacation Game • Positive • Negative
The Vacation Game • Positive • Apple • Feet • Luggage • Mushrooms • Books • Wallet • Beekeeper • Negative • Pear • Socks • Car • Fungus • Novel • Money • Hive
The Vacation Game • My Secret Rule • The word must have two adjacent letters which are the same letter. • Found by using inductive logic • Positive and Negative Examples • Formulating and Eliminating Hypotheses • Evaluating Success and Failure
Inductive Logic Programming • Machine Learning • Classify data into categories • Divide data into train and test sets • Generate hypotheses on train set and then measure performance on test set • In ILP, data are Objects … • person, block, molecule, word, phrase, … • and Relations between them • grandfather, has_bond, is_member, …
Formalizing with Logic • Objects: the word 'apple' becomes w2169, with letter objects w2169_1 through w2169_5 • Relations: links between the word, its letters, and the letter values a through z
Formalizing with Logic • 'apple' = w2169
word(w2169). letter(w2169_1). has_letter(w2169, w2169_2). has_letter(w2169, w2169_3). next(w2169_2, w2169_3). letter_value(w2169_2, 'p'). letter_value(w2169_3, 'p').
Rule (head :- body; capitalized terms are variables): pos(X) :- has_letter(X, A), has_letter(X, B), next(A, B), letter_value(A, C), letter_value(B, C).
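The facts and rule above can be run directly in a Prolog system. Below is a minimal sketch, assuming SWI-Prolog, that restates just the facts needed for the rule to fire on w2169 ('apple'):

```prolog
% Facts from the slide for the word "apple" (w2169); only the two
% adjacent 'p' letters needed by the rule are listed here.
word(w2169).
has_letter(w2169, w2169_2).
has_letter(w2169, w2169_3).
next(w2169_2, w2169_3).
letter_value(w2169_2, p).
letter_value(w2169_3, p).

% The secret rule: a word is positive if it contains two adjacent
% letters with the same value.
pos(X) :-
    has_letter(X, A),
    has_letter(X, B),
    next(A, B),
    letter_value(A, C),
    letter_value(B, C).

% ?- pos(w2169).
% true.
```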
Biomedical Information Extraction • From free text to a structured database (*image courtesy of SEER Cancer Training Site)
Biomedical Information Extraction http://www.geneontology.org
Biomedical Information Extraction NPL3 encodes a nuclear protein with an RNA recognition motif and similarities to a family of proteins involved in RNA metabolism. ykuD was transcribed by SigK RNA polymerase from T4 of sporulation. Mutations in the COL3A1 gene have been implicated as a cause of type IV Ehlers-Danlos syndrome, a disease leading to aortic rupture in early adult life.
Biomedical Information Extraction The dog running down the street tackled and bit my little sister.
Biomedical Information Extraction • Parse tree for 'NPL3 encodes a nuclear protein with …': sentence, noun phrase, verb phrase and prep phrase nodes above part-of-speech tags (noun, verb, article, adj, prep)
MedDict Background Knowledge http://cancerweb.ncl.ac.uk/omd/
MeSH Background Knowledge http://www.nlm.nih.gov/mesh/MBrowser.html
GO Background Knowledge http://www.geneontology.org
Some Prolog Predicates
Biomedical Predicates • phrase_contains_medDict_term(Phrase, Word, WordText) • phrase_contains_mesh_term(Phrase, Word, WordText) • phrase_contains_mesh_disease(Phrase, Word, WordText) • phrase_contains_go_term(Phrase, Word, WordText)
Lexical Predicates • internal_caps(Word) • alphanumeric(Word)
Look-ahead Phrase Predicates • few_POS_in_phrase(Phrase, POS) • phrase_contains_specific_word_triple(Phrase, W1, W2, W3) • phrase_contains_some_marked_up_arg(Phrase, Arg#, Word, Fold)
Relative Location of Phrases • protein_before_location(ExampleID) • word_pair_in_between_target_phrases(ExampleID, W1, W2)
Still More Predicates • High-scoring words in protein phrases • bifunction, repress, pmr1, … • High-scoring words in location phrases • golgi, cytoplasm, er • High-scoring words BETWEEN protein & location • across, cofractionate, inside, …
Biomedical Information Extraction • Given: Medical Journal abstracts tagged with biological relations • Do: Construct system to extract related phrases from unseen text • Our Gleaner Approach: Develop fast ensemble algorithms focused on recall and precision evaluation
Using Modes to Chain Relations • Predicates such as verb(…), alphanumeric(…), internal_caps(…), noun_phrase(…), long_sentence(…), phrase_child(…, …) and phrase_parent(…, …) chain Word, Phrase and Sentence objects together
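In Aleph this chaining is driven by mode declarations in the background-knowledge file. The sketch below shows hypothetical modeh/modeb/determination directives for the predicates named on this slide; the argument types and recall bounds are guesses for illustration, and the directives assume Aleph is loaded:

```prolog
% Hypothetical Aleph mode declarations (argument types are guesses).
% +/- mark input/output arguments; chaining an output variable of one
% literal into the input of another is what links Word, Phrase and
% Sentence objects during the clause search.
:- modeh(1, prot_loc(+phrase, +phrase, +sentence)).

:- modeb(*, noun_phrase(+phrase)).
:- modeb(*, phrase_child(+phrase, -word)).
:- modeb(*, phrase_parent(+phrase, -sentence)).
:- modeb(*, alphanumeric(+word)).
:- modeb(*, internal_caps(+word)).
:- modeb(*, long_sentence(+sentence)).

:- determination(prot_loc/3, noun_phrase/1).
:- determination(prot_loc/3, phrase_child/2).
:- determination(prot_loc/3, phrase_parent/2).
:- determination(prot_loc/3, alphanumeric/1).
:- determination(prot_loc/3, internal_caps/1).
:- determination(prot_loc/3, long_sentence/1).
```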
Growing Rules From Seed
NPL3 encodes a nuclear protein with …
prot_loc(ab1392078_sen7_ph0, ab1392078_sen7_ph2, ab1392078_sen7).
phrase_contains_novelword(ab1392078_sen7_ph0, ab1392078_sen7_ph0_w0). phrase_next(ab1392078_sen7_ph0, ab1392078_sen7_ph1). …
noun_phrase(ab1392078_sen7_ph2). word_child(ab1392078_sen7_ph2, ab9018277_sen5_ph11_w3). …
avg_length_sentence(ab1392078_sen7). …
Growing Rules From Seed prot_loc(Protein,Location,Sentence) :- phrase_contains_some_alphanumeric(Protein,E), phrase_contains_some_internal_cap_word(Protein,E), phrase_next(Protein,_), different_phrases(Protein,Location), one_POS_in_phrase(Location,noun), phrase_contains_some_arg2_10x_word(Location,_), phrase_previous(Location,_), avg_length_sentence(Sentence).
Rule Evaluation • Prediction vs Actual: Positive or Negative, True or False (TP, FP, TN, FN) • Focus on positive examples: Recall = TP / (TP + FN), Precision = TP / (TP + FP), F1 Score = 2PR / (P + R)
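For concreteness, these three metrics can be computed directly from confusion-matrix counts. A minimal sketch in Prolog (the counts in the sample query are hypothetical, chosen only to roughly reproduce the scores reported for Rule 1 on the next slide):

```prolog
% Recall, precision and F1 from confusion-matrix counts
% (assumes the denominators are non-zero).
recall(TP, FN, R)    :- R is TP / (TP + FN).
precision(TP, FP, P) :- P is TP / (TP + FP).
f1(P, R, F)          :- F is 2 * P * R / (P + R).

% Hypothetical counts: TP = 51, FP = 49, FN = 289.
% ?- recall(51, 289, R), precision(51, 49, P), f1(P, R, F).
% R = 0.15, P = 0.51, F = 0.2318...
```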
Protein Localization Rule 1 prot_loc(Protein,Location,Sentence) :- phrase_contains_some_alphanumeric(Protein,E), phrase_contains_some_internal_cap_word(Protein,E), phrase_next(Protein,_), different_phrases(Protein,Location), one_POS_in_phrase(Location,noun), phrase_contains_some_arg2_10x_word(Location,_), phrase_previous(Location,_), avg_length_sentence(Sentence). 0.15 Recall 0.51 Precision 0.23 F1 Score
Protein Localization Rule 2 prot_loc(Protein,Location,Sentence) :- phrase_contains_some_marked_up_arg2(Location,C), phrase_contains_some_internal_cap_word(Protein,_), word_previous(C,_). 0.86 Recall 0.12 Precision 0.21 F1 Score
Aleph - Learning • Aleph learns theories of rules (Srinivasan, v4, 2003) • Pick positive seed example • Use heuristic search to find best rule • Pick new seed from uncovered positives and repeat until threshold of positives covered • Learning theories is time-consuming • Can we reduce time with ensembles?
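To make the covering loop concrete, here is a small schematic sketch (not Aleph's actual code): learn_rule/2 and covers/2 are placeholder stubs standing in for Aleph's heuristic rule search and its coverage test.

```prolog
% Placeholder stubs for illustration only: in Aleph, learn_rule/2 would
% run a heuristic search over clauses and covers/2 would test coverage.
learn_rule(Seed, rule_for(Seed)).
covers(rule_for(Seed), Seed).

% Covering loop: pick an uncovered positive seed, learn a rule for it,
% drop the positives that rule covers, and repeat (here until no
% positives remain; Aleph can stop earlier at a coverage threshold).
learn_theory([], []).
learn_theory([Seed | Positives], [Rule | Rules]) :-
    learn_rule(Seed, Rule),
    exclude(covers(Rule), Positives, Uncovered),
    learn_theory(Uncovered, Rules).

% ?- learn_theory([p1, p2, p3], Theory).
% Theory = [rule_for(p1), rule_for(p2), rule_for(p3)].
```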
Gleaner • Definition of Gleaner • One who gathers grain left behind by reapers • Key Ideas of Gleaner • Use Aleph as underlying ILP rule engine • Search rule space with Rapid Random Restart • Keep wide range of rules usually discarded • Create separate theories for diverse recall
Gleaner - Learning • Create B Bins • Generate Clauses • Record Best per Bin (clauses plotted in recall-precision space)
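A sketch of the "record best per bin" bookkeeping (not Gleaner's implementation; the bin representation and the use of precision as the within-bin quality measure are assumptions made here for illustration):

```prolog
% Each recall bin keeps the best clause seen so far.
% Bins is a list of bin(Low, High, BestClause, BestPrecision) terms.
update_bins([], _, _, _, []).
update_bins([bin(Lo, Hi, Best0, P0) | Bins0], Clause, Recall, Prec,
            [bin(Lo, Hi, Best, P) | Bins]) :-
    (   Recall > Lo, Recall =< Hi, Prec > P0
    ->  Best = Clause, P = Prec      % clause falls in this bin and improves it
    ;   Best = Best0, P = P0         % otherwise keep the current best
    ),
    update_bins(Bins0, Clause, Recall, Prec, Bins).

% ?- update_bins([bin(0.0, 0.5, none, 0.0), bin(0.5, 1.0, none, 0.0)],
%                rule42, 0.3, 0.7, Bins).
% Bins = [bin(0.0, 0.5, rule42, 0.7), bin(0.5, 1.0, none, 0.0)].
```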
Gleaner - Learning • Repeat for Seed 1, Seed 2, Seed 3, …, Seed K, each seed contributing clauses across the recall bins
Gleaner - Ensemble • The rules from bin 5 are applied to each example; an example's score is the number of rules it matches (e.g. pos1: 12, pos2: 47, pos3: 55, neg1: 5, neg2: 14, neg3: 2, neg4: 18)
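A small sketch of this ensemble scoring step, assuming each rule is stored with the set of examples it matches (a representation chosen here for illustration, not Gleaner's actual data structures):

```prolog
% An example's score is the number of rules in the bin that match it.
score(Example, Rules, Score) :-
    include(matches(Example), Rules, Matched),
    length(Matched, Score).

matches(Example, rule(_Name, Covered)) :-
    memberchk(Example, Covered).

% ?- score(pos2, [rule(r1, [pos1, pos2]), rule(r2, [pos2, neg3]),
%                 rule(r3, [pos1])], S).
% S = 2.
```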
Gleaner - Ensemble • Examples sorted by score: pos3: 55, neg28: 52, pos2: 47, …, neg4: 18, neg475: 17, pos9: 17, neg15: 16, … • Each score threshold yields a recall-precision point, tracing out a curve
Gleaner - Overlap • For each bin, take the topmost recall-precision curve
How to use Gleaner • Generate Test Curve • User Selects Recall Bin • Return Classifications Ordered By Their Score (e.g. Recall = 0.50, Precision = 0.70)
Aleph Ensembles • We compare to ensembles of theories • Algorithm (Dutra et al., ILP 2002) • Use K different initial seeds • Learn K theories containing C rules • Rank examples by the number of theories that cover them • Need to balance C for high performance • Small C leads to low recall • Large C leads to converging theories
Evaluation Metrics • Area Under Recall-Precision Curve (AURPC) • All curves standardized to cover full recall range • Averaged AURPC over 5 folds • Number of clauses considered • Rough estimate of time
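A rough sketch of how such an area could be computed, assuming the curve is given as recall-precision points sorted by increasing recall. The trapezoidal rule with linear interpolation is only an approximation in recall-precision space, so this is illustrative rather than the exact measure used in the experiments:

```prolog
% Trapezoidal approximation of the area under a recall-precision curve.
% Points are Recall-Precision pairs sorted by increasing recall.
aurpc([_], 0.0).
aurpc([R1-P1, R2-P2 | Points], Area) :-
    aurpc([R2-P2 | Points], Rest),
    Area is Rest + (R2 - R1) * (P1 + P2) / 2.

% ?- aurpc([0.0-1.0, 0.5-0.8, 1.0-0.4], A).
% A = 0.75.
```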
YPD Protein Localization • Hand-labeled dataset(Ray & Craven ’01) • 7,245 sentences from 871 abstracts • Examples are phrase-phrase combinations • 1,810 positive & 279,154 negative • 1.6 GB of background knowledge • Structural, Statistical, Lexical and Ontological • In total, 200+ distinct background predicates
Experimental Methodology • Performed five-fold cross-validation • Variation of parameters • Gleaner (20 recall bins) • # seeds = {25, 50, 75, 100} • # clauses = {1K, 10K, 25K, 50K, 100K, 250K, 500K} • Ensembles (0.75 minacc, 1K and 35K nodes) • # theories = {10, 25, 50, 75, 100} • # clauses per theory = {1, 5, 10, 15, 20, 25, 50}
Current Directions • Learn diverse rules across seeds • Calculate probabilistic scores for examples • Directed Rapid Random Restarts • Cache rule information to speed scoring • Transfer learning across seeds • Explore Active Learning within ILP
Take-Home Message • Biology, Gleaner and ILP • Challenging problems in biology can be naturally formulated for Inductive Logic Programming • Many rules are constructed and evaluated during ILP hypothesis search • Gleaner makes use of rules that are not the highest-scoring ones, improving both speed and performance
Acknowledgements • USA DARPA Grant F30602-01-2-0571 • USA Air Force Grant F30602-01-2-0571 • USA NLM Grant 5T15LM007359-02 • USA NLM Grant 1R01LM07050-01 • UW Condor Group • David Page, Vitor Santos Costa, Ines Dutra, Soumya Ray, Marios Skounakis, Mark Craven, Burr Settles, Jesse Davis, Sarah Cunningham, David Haight, Ameet Soni