Gleaning Relational Information from Biomedical Text Mark Goadrich Computer Sciences Department University of Wisconsin - Madison Joint Work with Jude Shavlik and Louis Oliphant CIBM Seminar - Dec 5th 2006
Outline • The Vacation Game • Formalizing with Logic • Biomedical Information Extraction • Evaluating Hypotheses • Gleaning Logical Rules • Experiments • Current Directions
The Vacation Game • Positive • Negative
The Vacation Game • Positive • Apple • Feet • Luggage • Mushrooms • Books • Wallet • Beekeeper • Negative • Pear • Socks • Car • Fungus • Novel • Money • Hive
The Vacation Game • My Secret Rule • The word must have two adjacent letters which are the same letter. • Found by using inductive logic • Positive and Negative Examples • Formulating and Eliminating Hypotheses • Evaluating Success and Failure
Inductive Logic Programming • Machine Learning • Classify data into categories • Divide data into train and test sets • Generate hypotheses on train set and then measure performance on test set • In ILP, data are Objects … • person, block, molecule, word, phrase, … • and Relations between them • grandfather, has_bond, is_member, …
Formalizing with Logic • Objects: the word 'apple' becomes w2169, with letter objects w2169_1 through w2169_5 • Relations: links between the word, its letters, and the letter values a through z
Formalizing with Logic • 'apple' = w2169
word(w2169). letter(w2169_1). has_letter(w2169, w2169_2). has_letter(w2169, w2169_3). next(w2169_2, w2169_3). letter_value(w2169_2, 'p'). letter_value(w2169_3, 'p').
Rule (head :- body; capitalized terms are variables): pos(X) :- has_letter(X, A), has_letter(X, B), next(A, B), letter_value(A, C), letter_value(B, C).
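The facts and rule above can be run directly in a Prolog system. Below is a minimal sketch, assuming SWI-Prolog, that restates just the facts needed for the rule to fire on w2169 ('apple'):

```prolog
% Facts from the slide for the word "apple" (w2169); only the two
% adjacent 'p' letters needed by the rule are listed here.
word(w2169).
has_letter(w2169, w2169_2).
has_letter(w2169, w2169_3).
next(w2169_2, w2169_3).
letter_value(w2169_2, p).
letter_value(w2169_3, p).

% The secret rule: a word is positive if it contains two adjacent
% letters with the same value.
pos(X) :-
    has_letter(X, A),
    has_letter(X, B),
    next(A, B),
    letter_value(A, C),
    letter_value(B, C).

% ?- pos(w2169).
% true.
```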
Biomedical Information Extraction • From free text to a structured database (*image courtesy of SEER Cancer Training Site)
Biomedical Information Extraction http://www.geneontology.org
Biomedical Information Extraction NPL3 encodes a nuclear protein with an RNA recognition motif and similarities to a family of proteins involved in RNA metabolism. ykuD was transcribed by SigK RNA polymerase from T4 of sporulation. Mutations in the COL3A1 gene have been implicated as a cause of type IV Ehlers-Danlos syndrome, a disease leading to aortic rupture in early adult life.
Biomedical Information Extraction The dog running down the street tackled and bit my little sister.
Biomedical Information Extraction • Parse tree for 'NPL3 encodes a nuclear protein with …': sentence, noun phrase, verb phrase and prep phrase nodes above part-of-speech tags (noun, verb, article, adj, prep)
MedDict Background Knowledge http://cancerweb.ncl.ac.uk/omd/
MeSH Background Knowledge http://www.nlm.nih.gov/mesh/MBrowser.html
GO Background Knowledge http://www.geneontology.org
Some Prolog Predicates
Biomedical Predicates • phrase_contains_medDict_term(Phrase, Word, WordText) • phrase_contains_mesh_term(Phrase, Word, WordText) • phrase_contains_mesh_disease(Phrase, Word, WordText) • phrase_contains_go_term(Phrase, Word, WordText)
Lexical Predicates • internal_caps(Word) • alphanumeric(Word)
Look-ahead Phrase Predicates • few_POS_in_phrase(Phrase, POS) • phrase_contains_specific_word_triple(Phrase, W1, W2, W3) • phrase_contains_some_marked_up_arg(Phrase, Arg#, Word, Fold)
Relative Location of Phrases • protein_before_location(ExampleID) • word_pair_in_between_target_phrases(ExampleID, W1, W2)
Still More Predicates • High-scoring words in protein phrases • bifunction, repress, pmr1, … • High-scoring words in location phrases • golgi, cytoplasm, er • High-scoring words BETWEEN protein & location • across, cofractionate, inside, …
Biomedical Information Extraction • Given: Medical Journal abstracts tagged with biological relations • Do: Construct system to extract related phrases from unseen text • Our Gleaner Approach: Develop fast ensemble algorithms focused on recall and precision evaluation
Using Modes to Chain Relations • Predicates such as verb(…), alphanumeric(…), internal_caps(…), noun_phrase(…), long_sentence(…), phrase_child(…, …) and phrase_parent(…, …) chain Word, Phrase and Sentence objects together
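In Aleph this chaining is driven by mode declarations in the background-knowledge file. The sketch below shows hypothetical modeh/modeb/determination directives for the predicates named on this slide; the argument types and recall bounds are guesses for illustration, and the directives assume Aleph is loaded:

```prolog
% Hypothetical Aleph mode declarations (argument types are guesses).
% +/- mark input/output arguments; chaining an output variable of one
% literal into the input of another is what links Word, Phrase and
% Sentence objects during the clause search.
:- modeh(1, prot_loc(+phrase, +phrase, +sentence)).

:- modeb(*, noun_phrase(+phrase)).
:- modeb(*, phrase_child(+phrase, -word)).
:- modeb(*, phrase_parent(+phrase, -sentence)).
:- modeb(*, alphanumeric(+word)).
:- modeb(*, internal_caps(+word)).
:- modeb(*, long_sentence(+sentence)).

:- determination(prot_loc/3, noun_phrase/1).
:- determination(prot_loc/3, phrase_child/2).
:- determination(prot_loc/3, phrase_parent/2).
:- determination(prot_loc/3, alphanumeric/1).
:- determination(prot_loc/3, internal_caps/1).
:- determination(prot_loc/3, long_sentence/1).
```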
Growing Rules From Seed
NPL3 encodes a nuclear protein with …
prot_loc(ab1392078_sen7_ph0, ab1392078_sen7_ph2, ab1392078_sen7).
phrase_contains_novelword(ab1392078_sen7_ph0, ab1392078_sen7_ph0_w0). phrase_next(ab1392078_sen7_ph0, ab1392078_sen7_ph1). …
noun_phrase(ab1392078_sen7_ph2). word_child(ab1392078_sen7_ph2, ab9018277_sen5_ph11_w3). …
avg_length_sentence(ab1392078_sen7). …
Growing Rules From Seed prot_loc(Protein,Location,Sentence) :- phrase_contains_some_alphanumeric(Protein,E), phrase_contains_some_internal_cap_word(Protein,E), phrase_next(Protein,_), different_phrases(Protein,Location), one_POS_in_phrase(Location,noun), phrase_contains_some_arg2_10x_word(Location,_), phrase_previous(Location,_), avg_length_sentence(Sentence).
Rule Evaluation • Prediction vs Actual: Positive or Negative, True or False (TP, FP, TN, FN) • Focus on positive examples: Recall = TP / (TP + FN), Precision = TP / (TP + FP), F1 Score = 2PR / (P + R)
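For concreteness, these three metrics can be computed directly from confusion-matrix counts. A minimal sketch in Prolog (the counts in the sample query are hypothetical, chosen only to roughly reproduce the scores reported for Rule 1 on the next slide):

```prolog
% Recall, precision and F1 from confusion-matrix counts
% (assumes the denominators are non-zero).
recall(TP, FN, R)    :- R is TP / (TP + FN).
precision(TP, FP, P) :- P is TP / (TP + FP).
f1(P, R, F)          :- F is 2 * P * R / (P + R).

% Hypothetical counts: TP = 51, FP = 49, FN = 289.
% ?- recall(51, 289, R), precision(51, 49, P), f1(P, R, F).
% R = 0.15, P = 0.51, F = 0.2318...
```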
Protein Localization Rule 1 prot_loc(Protein,Location,Sentence) :- phrase_contains_some_alphanumeric(Protein,E), phrase_contains_some_internal_cap_word(Protein,E), phrase_next(Protein,_), different_phrases(Protein,Location), one_POS_in_phrase(Location,noun), phrase_contains_some_arg2_10x_word(Location,_), phrase_previous(Location,_), avg_length_sentence(Sentence). 0.15 Recall 0.51 Precision 0.23 F1 Score
Protein Localization Rule 2 prot_loc(Protein,Location,Sentence) :- phrase_contains_some_marked_up_arg2(Location,C), phrase_contains_some_internal_cap_word(Protein,_), word_previous(C,_). 0.86 Recall 0.12 Precision 0.21 F1 Score
Aleph - Learning • Aleph learns theories of rules (Srinivasan, v4, 2003) • Pick positive seed example • Use heuristic search to find best rule • Pick new seed from uncovered positives and repeat until threshold of positives covered • Learning theories is time-consuming • Can we reduce time with ensembles?
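To make the covering loop concrete, here is a small schematic sketch (not Aleph's actual code): learn_rule/2 and covers/2 are placeholder stubs standing in for Aleph's heuristic rule search and its coverage test.

```prolog
% Placeholder stubs for illustration only: in Aleph, learn_rule/2 would
% run a heuristic search over clauses and covers/2 would test coverage.
learn_rule(Seed, rule_for(Seed)).
covers(rule_for(Seed), Seed).

% Covering loop: pick an uncovered positive seed, learn a rule for it,
% drop the positives that rule covers, and repeat (here until no
% positives remain; Aleph can stop earlier at a coverage threshold).
learn_theory([], []).
learn_theory([Seed | Positives], [Rule | Rules]) :-
    learn_rule(Seed, Rule),
    exclude(covers(Rule), Positives, Uncovered),
    learn_theory(Uncovered, Rules).

% ?- learn_theory([p1, p2, p3], Theory).
% Theory = [rule_for(p1), rule_for(p2), rule_for(p3)].
```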
Gleaner • Definition of Gleaner • One who gathers grain left behind by reapers • Key Ideas of Gleaner • Use Aleph as underlying ILP rule engine • Search rule space with Rapid Random Restart • Keep wide range of rules usually discarded • Create separate theories for diverse recall
Gleaner - Learning • Create B Bins • Generate Clauses • Record Best per Bin (clauses plotted in recall-precision space)
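A sketch of the "record best per bin" bookkeeping (not Gleaner's implementation; the bin representation and the use of precision as the within-bin quality measure are assumptions made here for illustration):

```prolog
% Each recall bin keeps the best clause seen so far.
% Bins is a list of bin(Low, High, BestClause, BestPrecision) terms.
update_bins([], _, _, _, []).
update_bins([bin(Lo, Hi, Best0, P0) | Bins0], Clause, Recall, Prec,
            [bin(Lo, Hi, Best, P) | Bins]) :-
    (   Recall > Lo, Recall =< Hi, Prec > P0
    ->  Best = Clause, P = Prec      % clause falls in this bin and improves it
    ;   Best = Best0, P = P0         % otherwise keep the current best
    ),
    update_bins(Bins0, Clause, Recall, Prec, Bins).

% ?- update_bins([bin(0.0, 0.5, none, 0.0), bin(0.5, 1.0, none, 0.0)],
%                rule42, 0.3, 0.7, Bins).
% Bins = [bin(0.0, 0.5, rule42, 0.7), bin(0.5, 1.0, none, 0.0)].
```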
Gleaner - Learning • Repeat for Seed 1, Seed 2, Seed 3, …, Seed K, each seed contributing clauses across the recall bins
Gleaner - Ensemble • The rules from bin 5 are applied to each example; an example's score is the number of rules it matches (e.g. pos1: 12, pos2: 47, pos3: 55, neg1: 5, neg2: 14, neg3: 2, neg4: 18)
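A small sketch of this ensemble scoring step, assuming each rule is stored with the set of examples it matches (a representation chosen here for illustration, not Gleaner's actual data structures):

```prolog
% An example's score is the number of rules in the bin that match it.
score(Example, Rules, Score) :-
    include(matches(Example), Rules, Matched),
    length(Matched, Score).

matches(Example, rule(_Name, Covered)) :-
    memberchk(Example, Covered).

% ?- score(pos2, [rule(r1, [pos1, pos2]), rule(r2, [pos2, neg3]),
%                 rule(r3, [pos1])], S).
% S = 2.
```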
Gleaner - Ensemble • Examples sorted by score: pos3: 55, neg28: 52, pos2: 47, …, neg4: 18, neg475: 17, pos9: 17, neg15: 16, … • Each score threshold yields a recall-precision point, tracing out a curve
Gleaner - Overlap • For each bin, take the topmost recall-precision curve
How to use Gleaner • Generate Test Curve • User Selects Recall Bin • Return Classifications Ordered By Their Score (e.g. Recall = 0.50, Precision = 0.70)
Aleph Ensembles • We compare to ensembles of theories • Algorithm (Dutra et al., ILP 2002) • Use K different initial seeds • Learn K theories containing C rules • Rank examples by the number of theories that cover them • Need to balance C for high performance • Small C leads to low recall • Large C leads to converging theories
Evaluation Metrics • Area Under Recall-Precision Curve (AURPC) • All curves standardized to cover full recall range • Averaged AURPC over 5 folds • Number of clauses considered • Rough estimate of time
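A rough sketch of how such an area could be computed, assuming the curve is given as recall-precision points sorted by increasing recall. The trapezoidal rule with linear interpolation is only an approximation in recall-precision space, so this is illustrative rather than the exact measure used in the experiments:

```prolog
% Trapezoidal approximation of the area under a recall-precision curve.
% Points are Recall-Precision pairs sorted by increasing recall.
aurpc([_], 0.0).
aurpc([R1-P1, R2-P2 | Points], Area) :-
    aurpc([R2-P2 | Points], Rest),
    Area is Rest + (R2 - R1) * (P1 + P2) / 2.

% ?- aurpc([0.0-1.0, 0.5-0.8, 1.0-0.4], A).
% A = 0.75.
```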
YPD Protein Localization • Hand-labeled dataset(Ray & Craven ’01) • 7,245 sentences from 871 abstracts • Examples are phrase-phrase combinations • 1,810 positive & 279,154 negative • 1.6 GB of background knowledge • Structural, Statistical, Lexical and Ontological • In total, 200+ distinct background predicates
Experimental Methodology • Performed five-fold cross-validation • Variation of parameters • Gleaner (20 recall bins) • # seeds = {25, 50, 75, 100} • # clauses = {1K, 10K, 25K, 50K, 100K, 250K, 500K} • Ensembles (0.75 minacc, 1K and 35K nodes) • # theories = {10, 25, 50, 75, 100} • # clauses per theory = {1, 5, 10, 15, 20, 25, 50}
Current Directions • Learn diverse rules across seeds • Calculate probabilistic scores for examples • Directed Rapid Random Restarts • Cache rule information to speed scoring • Transfer learning across seeds • Explore Active Learning within ILP
Take-Home Message • Biology, Gleaner and ILP • Challenging problems in biology can be naturally formulated for Inductive Logic Programming • Many rules are constructed and evaluated during ILP hypothesis search • Gleaner makes use of rules that are not the highest-scoring ones, improving both speed and performance
Acknowledgements • USA DARPA Grant F30602-01-2-0571 • USA Air Force Grant F30602-01-2-0571 • USA NLM Grant 5T15LM007359-02 • USA NLM Grant 1R01LM07050-01 • UW Condor Group • David Page, Vitor Santos Costa, Ines Dutra, Soumya Ray, Marios Skounakis, Mark Craven, Burr Settles, Jesse Davis, Sarah Cunningham, David Haight, Ameet Soni