
Gleaning Relational Information from Biomedical Text



  1. Gleaning Relational Information from Biomedical Text Mark Goadrich Computer Sciences Department University of Wisconsin - Madison Joint Work with Jude Shavlik and Louis Oliphant CIBM Seminar - Dec 5th 2006

  2. Outline • The Vacation Game • Formalizing with Logic • Biomedical Information Extraction • Evaluating Hypotheses • Gleaning Logical Rules • Experiments • Current Directions

  3. The Vacation Game • Positive • Negative

  4. The Vacation Game • Positive • Apple • Feet • Luggage • Mushrooms • Books • Wallet • Beekeeper • Negative • Pear • Socks • Car • Fungus • Novel • Money • Hive

  5. The Vacation Game • My Secret Rule • The word must have two adjacent letters which are the same letter. • Found by using inductive logic • Positive and Negative Examples • Formulating and Eliminating Hypotheses • Evaluating Success and Failure

  6. Inductive Logic Programming • Machine Learning • Classify data into categories • Divide data into train and test sets • Generate hypotheses on train set and then measure performance on test set • In ILP, data are Objects … • person, block, molecule, word, phrase, … • and Relations between them • grandfather, has_bond, is_member, …

  7. Formalizing with Logic (Figure: Objects and Relations — the word 'apple' is object w2169, its letters are objects w2169_1 … w2169_5 with values a, p, p, l, e, drawn from the alphabet a–z)

  8. Formalizing with Logic (Figure: the word 'apple' is object w2169 with letter objects w2169_1 … w2169_5) Facts: word(w2169). letter(w2169_1). has_letter(w2169, w2169_2). has_letter(w2169, w2169_3). next(w2169_2, w2169_3). letter_value(w2169_2, 'p'). letter_value(w2169_3, 'p'). Rule (head :- body, with variables X, A, B, C): pos(X) :- has_letter(X, A), has_letter(X, B), next(A, B), letter_value(A, C), letter_value(B, C).
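As a usage sketch, assuming the remaining facts for 'apple' (its other letters and their has_letter and next links) are also asserted, the rule can be queried directly; the goals below are illustrative, not taken from the slides:

?- pos(w2169).    % succeeds because w2169_2 and w2169_3 both have letter_value 'p'
true.
?- pos(w_pear).   % w_pear is a hypothetical word object with no repeated adjacent letter
false.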

  9. Biomedical Information Extraction (Figure: unstructured text mapped into a structured database; image courtesy of SEER Cancer Training Site)

  10. Biomedical Information Extraction http://www.geneontology.org

  11. Biomedical Information Extraction NPL3 encodes a nuclear protein with an RNA recognition motif and similarities to a family of proteins involved in RNA metabolism. ykuD was transcribed by SigK RNA polymerase from T4 of sporulation. Mutations in the COL3A1 gene have been implicated as a cause of type IV Ehlers-Danlos syndrome, a disease leading to aortic rupture in early adult life.

  12. Biomedical Information Extraction The dog running down the street tackled and bit my little sister.

  13. Biomedical Information Extraction (Figure: parse tree for "NPL3 encodes a nuclear protein with …" — the sentence breaks into noun phrases, a verb phrase, and a prepositional phrase, with part-of-speech tags such as noun, verb, article, adjective, and preposition)

  14. MedDict Background Knowledge http://cancerweb.ncl.ac.uk/omd/

  15. MeSH Background Knowledge http://www.nlm.nih.gov/mesh/MBrowser.html

  16. GO Background Knowledge http://www.geneontology.org

  17. Some Prolog Predicates • Biomedical Predicates: phrase_contains_medDict_term(Phrase, Word, WordText), phrase_contains_mesh_term(Phrase, Word, WordText), phrase_contains_mesh_disease(Phrase, Word, WordText), phrase_contains_go_term(Phrase, Word, WordText) • Lexical Predicates: internal_caps(Word), alphanumeric(Word) • Look-ahead Phrase Predicates: few_POS_in_phrase(Phrase, POS), phrase_contains_specific_word_triple(Phrase, W1, W2, W3), phrase_contains_some_marked_up_arg(Phrase, Arg#, Word, Fold) • Relative Location of Phrases: protein_before_location(ExampleID), word_pair_in_between_target_phrases(ExampleID, W1, W2)

  18. Still More Predicates • High-scoring words in protein phrases • bifunction, repress, pmr1, … • High-scoring words in location phrases • golgi, cytoplasm, er • High-scoring words BETWEEN protein & location • across, cofractionate, inside, …
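A hedged sketch of how such a word-list predicate could be encoded as background knowledge; apart from word_child (which appears on slide 21), the predicate and helper names here are illustrative assumptions:

% Hypothetical analogue of the arg2 word-list predicate used in the rules on later slides.
phrase_contains_some_arg1_10x_word(Phrase, Word) :-
    word_child(Phrase, Word),
    word_text(Word, Text),              % hypothetical: maps a word object to its text
    high_scoring_protein_word(Text).

high_scoring_protein_word(bifunction).  % word list taken from the slide above
high_scoring_protein_word(repress).
high_scoring_protein_word(pmr1).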

  19. Biomedical Information Extraction • Given: Medical Journal abstracts tagged with biological relations • Do: Construct system to extract related phrases from unseen text • Our Gleaner Approach: Develop fast ensemble algorithms focused on recall and precision evaluation

  20. Using Modes to Chain Relations (Figure: predicates such as verb(…), alphanumeric(…), internal_caps(…), noun_phrase(…), phrase_child(…, …), phrase_parent(…, …), and long_sentence(…) link objects of type Word, Phrase, and Sentence)
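In Aleph this chaining is driven by mode declarations; a minimal sketch, assuming the type names word, phrase, and sentence (the exact declarations used in the system are not given on the slides):

:- modeh(1, prot_loc(+phrase, +phrase, +sentence)).               % rule head: the target relation
:- modeb(*, phrase_contains_some_alphanumeric(+phrase, -word)).   % introduces a new Word variable
:- modeb(*, internal_caps(+word)).                                % tests an already-bound Word
:- modeb(*, phrase_child(+phrase, -word)).
:- modeb(*, phrase_parent(+phrase, -sentence)).
:- modeb(1, long_sentence(+sentence)).
% '+' marks an input that must already be bound and '-' an output that binds a new
% variable, so body literals can only chain Word, Phrase, and Sentence objects.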

  21. Growing Rules From Seed NPL3 encodes a nuclear protein with … (Figure: the seed example relates a protein Phrase, a location Phrase, and their Sentence, together with their Words) prot_loc(ab1392078_sen7_ph0, ab1392078_sen7_ph2, ab1392078_sen7). phrase_contains_novelword(ab1392078_sen7_ph0, ab1392078_sen7_ph0_w0). phrase_next(ab1392078_sen7_ph0, ab1392078_sen7_ph1). … noun_phrase(ab1392078_sen7_ph2). word_child(ab1392078_sen7_ph2, ab9018277_sen5_ph11_w3). … avg_length_sentence(ab1392078_sen7). …

  22. Growing Rules From Seed prot_loc(Protein,Location,Sentence) :- phrase_contains_some_alphanumeric(Protein,E), phrase_contains_some_internal_cap_word(Protein,E), phrase_next(Protein,_), different_phrases(Protein,Location), one_POS_in_phrase(Location,noun), phrase_contains_some_arg2_10x_word(Location,_), phrase_previous(Location,_), avg_length_sentence(Sentence).

  23. Rule Evaluation • Prediction vs Actual: True or False, Positive or Negative (TP, FP, TN, FN) • Focus on positive examples • Recall = TP / (TP + FN) • Precision = TP / (TP + FP) • F1 Score = 2PR / (P + R)

  24. Protein Localization Rule 1 prot_loc(Protein,Location,Sentence) :- phrase_contains_some_alphanumeric(Protein,E), phrase_contains_some_internal_cap_word(Protein,E), phrase_next(Protein,_), different_phrases(Protein,Location), one_POS_in_phrase(Location,noun), phrase_contains_some_arg2_10x_word(Location,_), phrase_previous(Location,_), avg_length_sentence(Sentence). 0.15 Recall 0.51 Precision 0.23 F1 Score
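As a quick check against the formulas on slide 23, Rule 1's F1 score follows directly from its recall and precision: F1 = (2 × 0.51 × 0.15) / (0.51 + 0.15) = 0.153 / 0.66 ≈ 0.23.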

  25. Protein Localization Rule 2 prot_loc(Protein,Location,Sentence) :- phrase_contains_some_marked_up_arg2(Location,C), phrase_contains_some_internal_cap_word(Protein,_), word_previous(C,_). 0.86 Recall 0.12 Precision 0.21 F1 Score

  26. Precision-Focused Search

  27. Recall-Focused Search

  28. F1-Focused Search

  29. Aleph - Learning • Aleph learns theories of rules (Srinivasan, v4, 2003) • Pick positive seed example • Use heuristic search to find best rule • Pick new seed from uncovered positives and repeat until threshold of positives covered • Learning theories is time-consuming • Can we reduce time with ensembles?
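A minimal sketch of driving Aleph's covering loop from the Prolog top level; the file stem prot_loc and the parameter values are assumptions for illustration:

:- read_all(prot_loc).   % loads prot_loc.b (background), prot_loc.f (positives), prot_loc.n (negatives)
:- set(minacc, 0.75).    % minimum clause accuracy (the value quoted for the ensembles on slide 40)
:- set(nodes, 35000).    % bound on the clauses searched per seed (slide 40 uses 1K and 35K)
:- induce.               % pick a seed, search for the best rule, remove covered positives, repeat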

  30. Gleaner • Definition of Gleaner • One who gathers grain left behind by reapers • Key Ideas of Gleaner • Use Aleph as underlying ILP rule engine • Search rule space with Rapid Random Restart • Keep wide range of rules usually discarded • Create separate theories for diverse recall
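A hedged sketch of the corresponding Aleph search settings for rapid random restart; the parameter names are Aleph's, but the values are illustrative rather than the experimental configuration:

:- set(search, rls).     % randomised local search instead of exhaustive search
:- set(rls_type, rrr).   % rapid random restart
:- set(tries, 10).       % number of random restarts per seed (illustrative value)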

  31. Gleaner - Learning • Create B Bins • Generate Clauses • Record Best per Bin (Figure: clauses plotted in precision-recall space, with the recall axis divided into bins)

  32. Gleaner - Learning (Figure: the per-bin search is repeated for Seed 1, Seed 2, Seed 3, …, Seed K across the recall range)

  33. Gleaner - Ensemble (Figure: the rules kept in bin 5 are applied to every positive and negative example; each example's score is the number of those rules that cover it — e.g. pos1 scores 12, pos2 scores 47, pos3 scores 55, neg1 scores 5, neg2 scores 14, neg3 scores 2, neg4 scores 18)

  34. Gleaner - Ensemble (Figure: examples sorted by score — pos3 55, neg28 52, pos2 47, …, neg4 18, neg475 17, pos9 17, neg15 16, … — with the recall and precision obtained by thresholding at each rank, tracing out a recall-precision curve)
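A hedged Prolog sketch of this scoring step; rule_in_bin/2 and covers/2 are hypothetical predicates standing in for the learned rules and their application to examples:

% An example's score for a bin is the number of that bin's rules that cover it.
bin_score(Bin, Example, Score) :-
    findall(R, (rule_in_bin(Bin, R), covers(R, Example)), Rs),
    length(Rs, Score).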

  35. Gleaner - Overlap • For each bin, take the topmost curve (Figure: overlapping per-bin recall-precision curves)

  36. How to use Gleaner • Generate Test Curve • User Selects Recall Bin • Return Classifications Ordered By Their Score (Figure: e.g. the user selects Recall = 0.50 and the curve gives Precision = 0.70)

  37. Aleph Ensembles • We compare to ensembles of theories • Algorithm (Dutra et al ILP 2002) • Use K different initial seeds • Learn K theories containing C rules • Rank examples by the number of theories that cover them • Need to balance C for high performance • Small C leads to low recall • Large C leads to converging theories

  38. Evaluation Metrics • Area Under Recall-Precision Curve (AURPC) • All curves standardized to cover full recall range • Averaged AURPC over 5 folds • Number of clauses considered • Rough estimate of time (Figure: recall-precision axes, both ranging from 0 to 1.0)
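In symbols, writing p(r) for the interpolated precision at recall r, AURPC = ∫₀¹ p(r) dr; this area is computed on each fold and then averaged over the five folds.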

  39. YPD Protein Localization • Hand-labeled dataset(Ray & Craven ’01) • 7,245 sentences from 871 abstracts • Examples are phrase-phrase combinations • 1,810 positive & 279,154 negative • 1.6 GB of background knowledge • Structural, Statistical, Lexical and Ontological • In total, 200+ distinct background predicates

  40. Experimental Methodology • Performed five-fold cross-validation • Variation of parameters • Gleaner (20 recall bins) • # seeds = {25, 50, 75, 100} • # clauses = {1K, 10K, 25K, 50K, 100K, 250K, 500K} • Ensembles (0.75 minacc, 1K and 35K nodes) • # theories = {10, 25, 50, 75, 100} • # clauses per theory = {1, 5, 10, 15, 20, 25, 50}

  41. PR Curves - 100,000 Clauses

  42. PR Curves - 1,000,000 Clauses

  43. Protein Localization Results

  44. Genetic Disorder Results

  45. Current Directions • Learn diverse rules across seeds • Calculate probabilistic scores for examples • Directed Rapid Random Restarts • Cache rule information to speed scoring • Transfer learning across seeds • Explore Active Learning within ILP

  46. Take-Home Message • Biology, Gleaner and ILP • Challenging problems in biology can be naturally formulated for Inductive Logic Programming • Many rules are constructed and evaluated during ILP hypothesis search • Gleaner improves speed and performance by keeping and reusing rules that are not the highest-scoring and would normally be discarded

  47. Acknowledgements • USA DARPA Grant F30602-01-2-0571 • USA Air Force Grant F30602-01-2-0571 • USA NLM Grant 5T15LM007359-02 • USA NLM Grant 1R01LM07050-01 • UW Condor Group • David Page, Vitor Santos Costa, Ines Dutra, Soumya Ray, Marios Skounakis, Mark Craven, Burr Settles, Jesse Davis, Sarah Cunningham, David Haight, Ameet Soni
