
Using Weakly Labeled Data to Learn Models for Extracting Information from Biomedical Text


Presentation Transcript


  1. Using Weakly Labeled Data to Learn Models for Extracting Information from Biomedical Text Mark Craven Department of Biostatistics & Medical Informatics Department of Computer Sciences University of Wisconsin U.S.A. craven@biostat.wisc.edu www.biostat.wisc.edu/~craven

  2. The Information Extraction Task Analysis of Yeast PRP20 Mutations and Functional Complementation by the Human Homologue RCC1, a Protein Involved in the Control of Chromosome Condensation. Fleischmann M, Clark M, Forrester W, Wickens M, Nishimoto T, Aebi M. Mutations in the PRP20 gene of yeast show a pleiotropic phenotype, in which both mRNA metabolism and nuclear structure are affected. SRM1 mutants, defective in the same gene, influence the signal transduction pathway for the pheromone response . . . By immunofluorescence microscopy the PRP20 protein was localized in the nucleus. Expression of the RCC1 protein can complement the temperature-sensitive phenotype of PRP20 mutants, demonstrating the functional similarity of the yeast and mammalian proteins. → protein(PRP20), subcellular-localization(PRP20, nucleus)

  3. Motivation • assisting in the construction and updating of databases • providing structured summaries for queries What is known about protein X (subcellular & tissue localization, associations with diseases, interactions with drugs, …)? • assisting scientific discovery by detecting previously unknown relationships, annotating experimental data

  4. Three Themes in Our IE Research • Using “weakly” labeled training data • Representing sentence structure in learned models • Combining evidence when making predictions

  5. 1. Using “Weakly” Labeled Data • why use machine learning methods in building information-extraction systems? • hand-coding IE systems is expensive, time-consuming • there is a lot of data that can be leveraged • where do we get a training set? • by having someone hand-label data (expensive) • by coupling tuples in an existing database with relevant documents (cheap)

  6. “Weakly” Labeled Training Data • to get positive examples, match DB tuples to passages of text referencing constants in tuples [figure: YPD database tuples (P1, L1), (P2, L2), (P3, L3) matched to MEDLINE abstracts whose passages mention the corresponding protein/location pairs]
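The matching step can be sketched in a few lines. This is a minimal illustration of the idea, not the authors' implementation; all names here are hypothetical:

```python
# Minimal sketch of weak labeling: a sentence becomes a positive
# example for a tuple (protein, location) if it mentions both
# constants of the tuple.

def weakly_label(db_tuples, sentences):
    """Pair each database tuple with sentences mentioning both constants."""
    labeled = []
    for protein, location in db_tuples:
        for sent in sentences:
            text = sent.lower()
            if protein.lower() in text and location.lower() in text:
                labeled.append((sent, (protein, location)))
    return labeled

db_tuples = [("PRP20", "nucleus")]
sentences = [
    "By immunofluorescence microscopy the PRP20 protein was localized in the nucleus.",
    "SRM1 mutants influence the signal transduction pathway.",
]
positives = weakly_label(db_tuples, sentences)
```

Only the first sentence co-mentions both constants, so only it becomes a (weak) positive example.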

  7. Weakly Labeled Training Data • the labeling is weak in that many sentences with co-occurrences wouldn’t be considered positive examples if we were hand-labeling them • consider the sentences associated with the relation subcellular-localization(VAC8p, vacuole) after weak labeling: VAC8p is a 64-kD protein found on the vacuole membrane, a site consistent with its role in vacuole inheritance. In analogy, VAC8p may link the vacuole to actin during vacuole partitioning. In addition to its role in early vacuole inheritance, VAC8p is required to target aminopeptidase I from the cytoplasm to the vacuole.

  8. Learning Context Patterns for Recognizing Protein Names • we use AutoSlog [Riloff ’96] to find “triggers” that commonly occur before and after tagged proteins in a training corpus • selections from the training corpus: …gene encoding <p>gamma-glutamyl kinase</p> was… / …recognized genes encoding <p>vimentin</p>, heat… / …found that <p>E2F</p> binds specifically… / …<p>IleRS</p> binds to the acceptor… / …of <p>CPB II</p> binds 1 mol of… / …purified C/<p>EBP</p> binds at the same position… / …which interacts with <p>CD4</p>: both… / …14-3-3tau interacts with <p>protein kinase C mu</p>, a subtype… • induced patterns (with match counts): encoding [X] 2/4, [X] binds 4/5, interacts with [X] 2/6
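The trigger-finding idea can be illustrated with a toy counter over tagged text. This is a sketch only; AutoSlog's actual pattern templates are considerably richer than single-word contexts:

```python
import re
from collections import Counter

def context_triggers(corpus):
    """Count the token immediately before and after each <p>...</p> tag."""
    before, after = Counter(), Counter()
    for text in corpus:
        for m in re.finditer(r"<p>.*?</p>", text):
            pre = re.findall(r"\w+", text[:m.start()])
            post = re.findall(r"\w+", text[m.end():])
            if pre:
                before[pre[-1].lower()] += 1
            if post:
                after[post[0].lower()] += 1
    return before, after

corpus = [
    "gene encoding <p>gamma-glutamyl kinase</p> was",
    "found that <p>E2F</p> binds specifically",
    "<p>IleRS</p> binds to the acceptor",
]
before, after = context_triggers(corpus)
```

Frequently occurring context tokens (here “encoding” before, “binds” after) are candidates for extraction patterns like `encoding [X]` and `[X] binds`.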

  9. “Weak” Labeling Example SwissProt dictionary: ... D-AKAP-2 D-amino acid oxidase D-aspartate oxidase D-dopachrome tautomerase … DAG kinase zeta DAMOX DASOX DAT DB83 protein … PubMed abstract: Two distinct forms of oxidases catalysing the oxidative deamidation of D-alpha-amino acids have been identified in human tissues: <p>D-amino acid oxidase</p> and <p>D-aspartate oxidase</p>. The enzymes differ in their electrophoretic properties, tissue distribution, binding with flavin adenine dinucleotide, heat stability, molecular size and possibly in subunit structure. Neither enzyme exhibits genetic polymorphism in European populations, but a rare electrophoretic variant phenotype (<p>DASOX</p>2-1) was identified which suggests that the <p>DASOX</p> locus is autosomal and independent of the <p>DAMOX</p> locus.

  10. Protein Name Extraction Approach [pipeline figure] • select noun phrases that match AutoSlog patterns (encoding [X], [X] binds, interacts with [X]) • classify noun phrases using a naïve Bayes model • extract positive classifications • example: “Two distinct forms of oxidases catalysing the oxidative deamidation of D-alpha-amino acids have been identified in human tissues: D-amino acid oxidase and …” → D-amino acid oxidase

  11. Experimental Evaluation • hypothesis: we get more accurate models by using weakly labeled data in addition to manually labeled data • models use AutoSlog-induced context patterns + naïve Bayes on morphological/syntax features of candidate names • compare predictive accuracy resulting from • fixed amount of hand-labeled data • varying amounts of weakly labeled data + hand-labeled data

  12. Extraction Accuracy: Yapex Data Set

  13. Extraction Accuracy: Texas Data Set

  14. 2. Representing Sentence Structure in Learned Models • hidden Markov models (HMMs) have proven to be perhaps the best family of methods for learning IE models • typically these HMMs have a “flat” structure, and are able to represent relatively little about grammatical structure • how can we provide HMMs with more information about sentence structure?

  15. Hidden Markov Models: Example [figure: an HMM with states q1–q5 plus start and end states, annotated with transition probabilities and per-state emission distributions over words such as “the”, “protein”, and “bed1”] Pr(“... the Bed1 protein ...” | ... q1, q4, q2 ...)

  16. Hidden Markov Models for Information Extraction • there are efficient algorithms for doing the following with HMMs: • determining the likelihood of a sentence given a model • determining the most likely path through a model for a sentence • setting the parameters of the model to maximize the likelihood of a set of sentences
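The second of these, finding the most likely path, is the Viterbi algorithm. A compact log-space sketch on a toy two-state tagger follows; the states, words, and probabilities are invented for illustration and are not from the talk:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence for obs under a simple HMM (log-space)."""
    # table[s] = (log-probability of best path ending in state s, that path)
    table = {s: (math.log(start_p[s] * emit_p[s][obs[0]]), [s]) for s in states}
    for word in obs[1:]:
        new_table = {}
        for s in states:
            # best predecessor state for s at this position
            score, prev = max(
                ((table[p][0] + math.log(trans_p[p][s]), p) for p in states),
                key=lambda t: t[0])
            new_table[s] = (score + math.log(emit_p[s][word]),
                            table[prev][1] + [s])
        table = new_table
    return max(table.values(), key=lambda t: t[0])[1]

states = ["OTHER", "PROTEIN"]
start_p = {"OTHER": 0.8, "PROTEIN": 0.2}
trans_p = {"OTHER": {"OTHER": 0.7, "PROTEIN": 0.3},
           "PROTEIN": {"OTHER": 0.6, "PROTEIN": 0.4}}
emit_p = {"OTHER": {"the": 0.4, "bed1": 0.05, "protein": 0.1},
          "PROTEIN": {"the": 0.05, "bed1": 0.6, "protein": 0.3}}
path = viterbi(["the", "bed1", "protein"], states, start_p, trans_p, emit_p)
```

With these made-up parameters the decoded path tags “bed1” and “protein” with the PROTEIN state, which is how a path through labeled states signals an extraction.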

  17. Representing Sentences • we first process sentences by analyzing them with a shallow parser (Sundance, [Riloff et al., ’98]) [figure: shallow parse of “Our results suggest that the Bed1 protein is found in the ER”, segmented into clauses, noun phrases, verb phrases, and prepositional phrases, with part-of-speech tags on the words]

  18. Hierarchical HMMs for IE (Part 1) • [Ray & Craven, IJCAI ’01; Skounakis et al., IJCAI ’03] • states have types, emit phrases • some states have labels (PROTEIN, LOCATION) • our models have ≈ 25 states at this level [figure: phrase-level model with START and END states, NP-SEGMENT and PREP states, and labeled PROTEIN and LOCATION states]

  19. Hierarchical HMMs for IE (Part 2) [figure: the positive model, containing PROTEIN and LOCATION states, alongside a null model containing only unlabeled NP-SEGMENT and PREP states]

  20. Hierarchical HMMs for IE (Part 3) [figure: each phrase-level state expands into a word-level submodel with BEFORE, BETWEEN, AFTER, and ALL states; word-level states emit words with probabilities such as Pr(the) = 0.0003, Pr(and) = 0.0002, …, Pr(cell) = 0.0001]

  21. Hierarchical HMMs [figure: the model considers emitting “. . . is found in the ER”: a VP-SEGMENT state emits “is found”, a PP-SEGMENT state emits “in”, and the LOCATION NP-SEGMENT state emits “the ER” through its word-level submodel (START, BEFORE, LOCATION, AFTER, END)]

  22. Extraction with our HMMs • extract a relation instance if • sentence is more probable under positive model • Viterbi (most probable) path goes through special extraction states [figure: positive model with PROTEIN and LOCATION extraction states, alongside the null model]

  23. Representing More Local Context • we can have the word-level states represent more about the local context of each emission • partition sentence into overlapping trigrams “... the/ART Bed1/UNK protein/N is/COP located/V ...”
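The overlapping-trigram partition can be sketched as follows (the boundary padding tokens are my own assumption, not from the talk):

```python
def overlapping_trigrams(tagged_words):
    """Partition a tagged sentence into overlapping word trigrams,
    one centered on each word; boundary padding is an assumption."""
    padded = [("<s>", "PAD")] + tagged_words + [("</s>", "PAD")]
    return [tuple(padded[i - 1:i + 2]) for i in range(1, len(padded) - 1)]

sentence = [("the", "ART"), ("Bed1", "UNK"), ("protein", "N"),
            ("is", "COP"), ("located", "V")]
trigrams = overlapping_trigrams(sentence)
```

Each word becomes the center of one trigram, so each word-level state sees its immediate left and right context.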

  24. Representing More Local Context • states emit trigrams with probability: • note the independence assumption above: we compensate for this naïve assumption by using a discriminative training method [Krogh ’94] to learn parameters
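The emission formula itself did not survive the transcript; a plausible form consistent with the stated independence assumption (my reconstruction, with per-position distributions written here as Pr_b, Pr_c, Pr_a for the word before, at, and after the center) is:

```latex
\Pr(w_{i-1}, w_i, w_{i+1} \mid q) =
  \Pr\nolimits_b(w_{i-1} \mid q)\,
  \Pr\nolimits_c(w_i \mid q)\,
  \Pr\nolimits_a(w_{i+1} \mid q)
```

That is, the three positions of the trigram are emitted independently given the state, which is the naïve assumption the slide says is compensated for by discriminative training.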

  25. Experimental Evaluation • hypothesis: we get more accurate models by using a richer representation of sentence structure in HMMs • compare predictive accuracy of various types of representations (in order of increasing grammatical information): tokens, tokens w/ part of speech, phrases, hierarchical, hierarchical w/ context features • 5-fold cross validation on 3 data sets

  26. Weakly Labeled Data Sets for Learning to Extract Relations • subcellular_localization(PROTEIN, LOCATION) • YPD database • 769 positive, 6193 negative sentences • 939 tuples (402 distinct) • disorder_association(GENE, DISEASE) • OMIM database • 829 positive, 11685 negative sentences • 852 tuples (143 distinct) • protein_protein_interaction(PROTEIN, PROTEIN) • MIPS database • 5446 positive, 41377 negative sentences • 8088 tuples (819 distinct)

  27. Extraction Accuracy (YPD)

  28. Extraction Accuracy (MIPS)

  29. Extraction Accuracy (OMIM)

  30. 3. Combining Evidence when Making Predictions • in processing a large corpus, we are likely to see the same entities and relations in multiple places • in making extractions, we should combine evidence across the different occurrences/contexts in which we see some entity/relation

  31. Combining Evidence: Organizing Predictions into Bags [table: occurrences of “CAT” grouped into one bag, with predicted vs. actual labels per occurrence: “CAT is a 64-kD protein…”, “…the cat activated the mouse…”, “CAT was established to be…”, “…were removed from cat brains.”]

  32. Combining Evidence when Making Predictions • given a bag of predictions, estimate the probability that the bag contains at least one actual positive example:

  33. Combining Evidence: Estimating Relevant Probabilities • can model with two binomial distributions based on the estimated TP-rate and FP-rate of the model • can do something simple here (e.g. assume uniform priors) or can estimate this from data w/ a few assumptions
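Under the two-binomial reading of this slide, the bag-level probability might be computed as below. This is a sketch under my own assumptions: each occurrence in a truly positive bag is predicted positive at the TP rate, in a negative bag at the FP rate, independently per occurrence, with a uniform prior by default:

```python
from math import comb

def bag_posterior(k, m, tp_rate, fp_rate, prior=0.5):
    """Posterior probability that a bag is truly positive, given that
    k of its m occurrences were predicted positive (sketch: binomial
    likelihoods under each hypothesis, combined via Bayes' rule)."""
    def likelihood(p):
        return comb(m, k) * p**k * (1 - p) ** (m - k)

    pos = prior * likelihood(tp_rate)
    neg = (1 - prior) * likelihood(fp_rate)
    return pos / (pos + neg)
```

For example, with a TP rate of 0.7 and an FP rate of 0.1, a bag with 3 of 4 occurrences predicted positive gets a posterior near 1, while a bag with 0 of 4 gets a posterior near 0.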

  34. Evidence Combination: Protein-Protein Interactions

  35. Evidence Combination: Protein Names

  36. Conclusions • machine learning methods provide a means for learning/refining models for information extraction • learning is inexpensive when unlabeled/weakly labeled sources can be exploited • learning context patterns for protein names • learning HMMs for relation extraction • we can learn more accurate models by giving HMMs more information about syntactic structure of sentences • hierarchical HMMs • we can improve the precision of our predictions by carefully combining evidence across extractions

  37. Acknowledgments • my graduate students: Soumya Ray, Burr Settles, Marios Skounakis • NIH/NLM grant 1R01 LM07050-01 • NSF CAREER grant IIS-0093016
