
Semantic Relation Detection in Bioscience Text

Semantic Relation Detection in Bioscience Text. Marti Hearst SIMS, UC Berkeley http://biotext.berkeley.edu Supported by NSF DBI-0317510 and a gift from Genentech. BioText Project Goals. Provide flexible, intelligent access to information for use in biosciences applications. Focus on


Presentation Transcript


  1. Semantic Relation Detection in Bioscience Text Marti Hearst SIMS, UC Berkeley http://biotext.berkeley.edu Supported by NSF DBI-0317510 and a gift from Genentech

  2. BioText Project Goals • Provide flexible, intelligent access to information for use in biosciences applications. • Focus on • Textual Information from Journal Articles • Tightly integrated with other resources • Ontologies • Record-based databases

  3. Project Team • Project Leaders: • PI: Marti Hearst • Co-PI: Adam Arkin • Computational Linguistics • Barbara Rosario • Preslav Nakov • Database Research • Ariel Schwartz • Gaurav Bhalotia (graduated) • User Interface / IR • Adam Newberger • Dr. Emilia Stoica • Bioscience • Dr. TingTing Zhang • Janice Hamerja Supported primarily by NSF DBI-0317510 and a gift from Genentech

  4. BioText Architecture Sophisticated Text Analysis Annotations in Database Improved Search Interface

  5. The Nature of Bioscience Text Claim: Bioscience semantics are simultaneously easier and harder than general text. • Easier: • Fewer subtleties • Fewer ambiguities • “Systematic” meanings • Harder: • Enormous terminology • Complex sentence structure

  6. Sample Sentence “Recent research, in proliferating cells, has demonstrated that interaction of E2F1 with the p53 pathway could involve transcriptional up-regulation of E2F1 target genes such as p14/p19ARF, which affect p53 accumulation [67,68], E2F1-induced phosphorylation of p53 [69], or direct E2F1-p53 complex formation [70].”

  7. BioScience Researchers • Read A LOT! • Cite A LOT! • Curate A LOT! • Are interested in specific relations, e.g.: • What is the role of this protein in that pathway? • Show me articles in which a comparison between two values is significant.

  8. This Talk • Discovering semantic relations • Between nouns in noun compounds • Between entities in sentences • Acquiring labeled data: • Idea: use text surrounding citations to documents to identify paraphrases • A new direction; preliminary work only

  9. Noun Compound Relation Recognition

  10. Noun Compounds (NCs) • Technical text is rich with NCs, e.g.: “Open-labeled long-term study of the subcutaneous sumatriptan efficacy and tolerability in acute migraine treatment.” • An NC is any sequence of nouns that itself functions as a noun • asthma hospitalizations • health care personnel hand wash

  11. NCs: 3 computational tasks • Identification • Syntactic analysis (attachments) • [Baseline [headache frequency]] • [[Tension headache] patient] • Our Goal: Semantic analysis • Headache treatment → treatment for headache • Corticosteroid treatment → treatment that uses corticosteroid

  12. Descent of Hierarchy • Idea: • Use the top levels of a lexical hierarchy to identify semantic relations • Hypothesis: • A particular semantic relation holds between all 2-word NCs that can be categorized by a lexical category pair.

  13. Related work (Semantic analysis of NCs) • Rule-based • Finin (1980) • Detailed AI analysis, hand-coded • Vanderwende (1994) • Automatically extracts semantic information from an on-line dictionary, manipulates a set of handwritten rules. 13 classes, 52% accuracy • Probabilistic • Lauer (1995) • Probabilistic model, 8 classes, 47% accuracy • Lapata (2000) • Classifies nominalizations into subject/object. 2 classes, 80% accuracy

  14. Related work (Semantic analysis of NCs) • Lexical Hierarchy • Barrett et al. (2001) • WordNet, heuristics to classify an NC given its similarity to a known NC • Rosario and Hearst (2001) • Relations pre-defined • MeSH, Neural Network. 18 classes, 60% accuracy

  15. Linguistic Motivation Can cast NC into head-modifier relation, and assume head noun has an argument and qualia structure. • (used-in): kitchen knife • (made-of): steel knife • (instrument-for): carving knife • (used-on): putty knife • (used-by): butcher’s knife

  16. The lexical Hierarchy: MeSH 1. Anatomy [A] 2. Organisms [B] 3. Diseases [C] 4. Chemicals and Drugs [D] 5. Analytical, Diagnostic and Therapeutic Techniques and Equipment [E] 6. Psychiatry and Psychology [F] 7. Biological Sciences [G] 8. Physical Sciences [H] 9. Anthropology, Education, Sociology and Social Phenomena [I] 10. Technology and Food and Beverages [J] 11. Humanities [K] 12. Information Science [L] 13. Persons [M] 14. Health Care [N] 15. Geographic Locations [Z]

  17. The lexical Hierarchy: MeSH • 1. Anatomy [A]: Body Regions [A01]; Musculoskeletal System [A02]; Digestive System [A03]; Respiratory System [A04]; Urogenital System [A05]; … • 2. [B] • 3. [C] • 4. [D] • 5. [E] • 6. [F] • 7. [G] • 8. Physical Sciences [H] • 9. [I] • 10. [J] • 11. [K] • 12. [L] • 13. [M]

  18. Descending the Hierarchy • 1. Anatomy [A]: Body Regions [A01] (Abdomen [A01.047]; Back [A01.176]; Breast [A01.236]; Extremities [A01.378]; Head [A01.456]; Neck [A01.598]; …); Musculoskeletal System [A02]; Digestive System [A03]; Respiratory System [A04]; Urogenital System [A05]; … • 2. [B] • 3. [C] • 4. [D] • 5. [E] • 6. [F] • 7. [G] • 8. Physical Sciences [H] • 9. [I] • 10. [J] • 11. [K] • 12. [L] • 13. [M]

  19. Descending the Hierarchy • 1. Anatomy [A]: Body Regions [A01] (Abdomen [A01.047]; Back [A01.176]; Breast [A01.236]; Extremities [A01.378]; Head [A01.456]; Neck [A01.598]; …); Musculoskeletal System [A02]; Digestive System [A03]; Respiratory System [A04]; Urogenital System [A05]; … • 2. [B] • 3. [C] • 4. [D] • 5. [E] • 6. [F] • 7. [G] • 8. Physical Sciences [H]: Electronics; Astronomy; Nature; Time; Weights and Measures; … • 9. [I] • 10. [J] • 11. [K] • 12. [L] • 13. [M]

  20. Descending the Hierarchy • 1. Anatomy [A]: Body Regions [A01] (Abdomen [A01.047]; Back [A01.176]; Breast [A01.236]; Extremities [A01.378]; Head [A01.456]; Neck [A01.598]; …); Musculoskeletal System [A02]; Digestive System [A03]; Respiratory System [A04]; Urogenital System [A05]; … • 2. [B] • 3. [C] • 4. [D] • 5. [E] • 6. [F] • 7. [G] • 8. Physical Sciences [H]: Electronics (Amplifiers; Electronics, Medical; Transducers); Astronomy; Nature; Time; Weights and Measures; … • 9. [I] • 10. [J] • 11. [K] • 12. [L] • 13. [M]

  21. Descending the Hierarchy • 1. Anatomy [A]: Body Regions [A01] (Abdomen [A01.047]; Back [A01.176]; Breast [A01.236]; Extremities [A01.378]; Head [A01.456]; Neck [A01.598]; …); Musculoskeletal System [A02]; Digestive System [A03]; Respiratory System [A04]; Urogenital System [A05]; … • 2. [B] • 3. [C] • 4. [D] • 5. [E] • 6. [F] • 7. [G] • 8. Physical Sciences [H]: Electronics (Amplifiers; Electronics, Medical; Transducers); Astronomy; Nature; Time; Weights and Measures (Calibration; Metric System; Reference Standard); … • 9. [I] • 10. [J] • 11. [K] • 12. [L] • 13. [M]

  22. Descending the Hierarchy • 1. Anatomy [A]: Body Regions [A01] (Abdomen [A01.047]; Back [A01.176]; Breast [A01.236]; Extremities [A01.378]; Head [A01.456]; Neck [A01.598]; …); Musculoskeletal System [A02]; Digestive System [A03]; Respiratory System [A04]; Urogenital System [A05]; … • 2. [B] • 3. [C] • 4. [D] • 5. [E] • 6. [F] • 7. [G] • 8. Physical Sciences [H]: Electronics (Amplifiers; Electronics, Medical; Transducers); Astronomy; Nature; Time; Weights and Measures (Calibration; Metric System; Reference Standard); … • 9. [I] • 10. [J] • 11. [K] • 12. [L] • 13. [M] • Labels: Homogeneous, Heterogeneous

  23. Mapping Nouns to MeSH Concepts • headache recurrence C23.888.592.612.441 C23.550.291.937 • headache pain C23.888.592.612.441 G11.561.796.444

  24. Levels of Description headache pain • Level 0: C23 G11 • Level 1: C23.888 G11.561 • Level 2: C23.888.592 G11.561.796 • … • Original: C23.888.592.612.441 G11.561.796.444
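
The levels of description on slide 24 amount to truncating each MeSH tree number at a dot boundary. A minimal sketch, assuming level 0 keeps only the first component (for example C23) and each further level keeps one more; the function names `truncate_mesh` and `category_pair` are illustrative and do not come from the talk:

```python
def truncate_mesh(tree_number: str, level: int) -> str:
    """Keep the first (level + 1) dot-separated components of a MeSH tree number."""
    parts = tree_number.split(".")
    return ".".join(parts[: level + 1])


def category_pair(first: str, second: str, level_first: int = 0, level_second: int = 0):
    """Describe a two-word NC by the truncated codes of its two nouns."""
    return truncate_mesh(first, level_first), truncate_mesh(second, level_second)


# Tree numbers for "headache pain" as given on slide 23.
headache = "C23.888.592.612.441"
pain = "G11.561.796.444"

for level in range(3):
    print(level, category_pair(headache, pain, level, level))
```

Asking for a level deeper than the code itself simply returns the full code, so the original tree number is the limit of the descent.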

  25. Descent of Hierarchy • Idea: • Words falling in homogeneous MeSH subhierarchies behave “similarly” with respect to relation assignment • Hypothesis: • A particular semantic relation holds between all 2-word NCs that can be categorized by a MeSH category pair

  26. Grouping the NCs • CP: A02 C04 (Musculoskeletal System, Neoplasms) • skull tumors, bone cysts, bone metastases, skull osteosarcoma… • CP: C04 M01 (Neoplasms, Person) • leukemia survivor, lymphoma patients, cancer physician, cancer nurses…
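
Grouping NCs by category pair (CP), as on slide 26, can be sketched as a dictionary keyed by the level-0 codes of the two nouns. The lexicon below is a toy assumption: it lists only the codes that appear on the slides (A02, C04, M01.643, M01.526), not full MeSH tree numbers, and `group_by_category_pair` is a hypothetical helper name.

```python
from collections import defaultdict

# Toy noun -> MeSH code lexicon (assumption: codes only as deep as the slides show).
TOY_LEXICON = {
    "skull": "A02", "bone": "A02",
    "tumors": "C04", "cysts": "C04",
    "lymphoma": "C04", "cancer": "C04",
    "patients": "M01.643", "physician": "M01.526",
}


def top_level(tree_number: str) -> str:
    """Level-0 description: the first dot-separated component."""
    return tree_number.split(".")[0]


def group_by_category_pair(compounds, lexicon):
    """Map each (modifier, head) compound to its level-0 category pair."""
    groups = defaultdict(list)
    for modifier, head in compounds:
        cp = (top_level(lexicon[modifier]), top_level(lexicon[head]))
        groups[cp].append(f"{modifier} {head}")
    return dict(groups)


compounds = [("skull", "tumors"), ("bone", "cysts"),
             ("lymphoma", "patients"), ("cancer", "physician")]
groups = group_by_category_pair(compounds, TOY_LEXICON)
for cp, members in groups.items():
    print(cp, members)
```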

  27. Distribution of Category Pairs

  28. Collection • ~70,000 NCs extracted from titles and abstracts of Medline • 2,627 CPs at level 0 (with at least 10 unique NCs) • We analyzed • 250 CPs with Anatomy (A) • 21 CPs with Natural Science (H01) • 3 CPs with Neoplasm (C04) • This represents 10% of total CPs and 20% of total NCs

  29. Classification Method • For each CP • Divide its NCs into “training-testing” sets • “Training”: inspect NCs by hand • Start from level 0 0 • While NCs are not all similar • descend one level of the hierarchy • Repeat until all NCs for that CP are similar
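
The descent loop on slide 29 can be sketched in code, with one important substitution: in the talk the "are these NCs similar?" check is a manual inspection, whereas this sketch assumes each training NC already carries a hand-assigned relation label and treats a group as similar when all its labels agree. It descends only on the second noun, and only within groups that are still mixed. The names `descend` and `truncate` and the `max_level` cutoff are assumptions for illustration.

```python
from collections import defaultdict


def truncate(tree_number: str, level: int) -> str:
    """Keep the first (level + 1) dot-separated components."""
    return ".".join(tree_number.split(".")[: level + 1])


def descend(examples, level=0, max_level=4):
    """examples: list of (second_noun_code, relation_label) for one CP.

    Returns {code_prefix: sorted list of relations seen under it}. A prefix is
    final when its examples agree on one relation or max_level is reached.
    """
    groups = defaultdict(list)
    for code, relation in examples:
        groups[truncate(code, level)].append((code, relation))

    decisions = {}
    for prefix, members in groups.items():
        relations = {relation for _, relation in members}
        if len(relations) == 1 or level == max_level:
            decisions[prefix] = sorted(relations)
        else:
            # Mixed group: descend one level for this group only.
            decisions.update(descend(members, level + 1, max_level))
    return decisions


# CP "C04 M01" from slide 31: the second noun needs one level of descent.
c04_m01 = [
    ("M01.643", "Person afflicted by Disease"),
    ("M01.643", "Person afflicted by Disease"),
    ("M01.526", "Person who treats Disease"),
]
# CP "A02 C04": homogeneous at level 0, so no descent is needed.
a02_c04 = [("C04", "Location of Disease"), ("C04", "Location of Disease")]

print(descend(c04_m01))
print(descend(a02_c04))
```

If a group is still mixed when `max_level` is reached, the sketch reports all relations seen under that prefix instead of forcing a single label.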

  30. Classification Decisions • A02 C04 • B06 B06 • C04 M01 • C04 M01.643 • C04 M01.526 • A01 H01 • A01 H01.770 • A01 H01.671 • A01 H01.671.538 • A01 H01.671.868 • A01 M01 • A01 M01.643 • A01 M01.526 • A01 M01.898

  31. Classification Decisions + Relations • A02 C04 → Location of Disease • B06 B06 → Kind of Plants • C04 M01 • C04 M01.643 → Person afflicted by Disease • C04 M01.526 → Person who treats Disease • A01 H01 • A01 H01.770 • A01 H01.671 • A01 H01.671.538 • A01 H01.671.868 • A01 M01 • A01 M01.643 • A01 M01.526 • A01 M01.898

  32. Classification Decisions + Relations • A02 C04 → Location of Disease • B06 B06 → Kind of Plants • C04 M01 • C04 M01.643 → Person afflicted by Disease • C04 M01.526 → Person who treats Disease • A01 H01 • A01 H01.770 • A01 H01.671 • A01 H01.671.538 • A01 H01.671.868 • A01 M01 • A01 M01.643 → Person afflicted by Disease • A01 M01.526 • A01 M01.898

  33. Classification Decision Levels • Anatomy: 250 CPs • 187 (75%) remain first level • 56 (22%) descend one level • 7 (3%) descend two levels • Natural Science (H01): 21 CPs • 1 (4%) remain first level • 8 (39%) descend one level • 12 (57%) descend two levels • Neoplasms (C04): 3 CPs • 3 (100%) descend one level

  34. Evaluation • Test the decisions on “testing” set • Count how many NCs that fall in the groups defined in the classification decisions are similar to each other • Accuracy (for 2nd noun): • Anatomy: 91% • Natural Science: 79% • Neoplasm: 100% • Total accuracy: 90.8% • Generalization: our 415 classification decisions cover ~46,000 possible CP pairs
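
One way to read the evaluation on slide 34 is as a lookup followed by a count: each test NC is matched to the most specific classification decision covering its second noun, and it counts as correct when its relation agrees with that decision. A hedged sketch, assuming decisions are stored as a code-prefix to relation mapping; the deeper tree numbers in the test data are invented placeholders, and `predict` and `accuracy` are hypothetical names, not the authors' scoring procedure:

```python
def predict(code: str, decisions: dict):
    """Return the relation of the longest decision prefix matching `code`, or None."""
    parts = code.split(".")
    for n in range(len(parts), 0, -1):
        prefix = ".".join(parts[:n])  # match only at dot boundaries
        if prefix in decisions:
            return decisions[prefix]
    return None


def accuracy(test_examples, decisions) -> float:
    """Fraction of (code, gold_relation) pairs whose prediction equals the gold label."""
    correct = sum(1 for code, gold in test_examples if predict(code, decisions) == gold)
    return correct / len(test_examples)


# Decisions for the CP "C04 M01", taken from slide 31.
decisions = {
    "M01.643": "Person afflicted by Disease",
    "M01.526": "Person who treats Disease",
}
# Placeholder test items (the codes below M01.643 / M01.526 are made up).
test_set = [
    ("M01.643.100", "Person afflicted by Disease"),
    ("M01.526.200", "Person who treats Disease"),
    ("M01.526.300", "Person afflicted by Disease"),
]
print(f"accuracy = {accuracy(test_set, decisions):.3f}")
```

A test NC that no decision covers is scored as wrong here; whether such cases were counted that way in the reported figures is not stated on the slide.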

  35. Ambiguity – Two Types • Lexical ambiguity: • mortality • state of being mortal • death rate • Relationship ambiguity: • bacteria mortality • death of bacteria • death caused by bacteria

  36. Four Cases • Single MeSH senses, only one possible relationship: abdomen radiography, aciclovir treatment • Multiple MeSH senses, only one possible relationship: alcoholism treatment • Single MeSH senses, multiple relationships: hospital databases, education efforts, kidney metabolism • Multiple MeSH senses, multiple relationships: bacteria mortality • Ambiguity of relationship

  37. Four Cases • Single MeSH senses, only one possible relationship: abdomen radiography, aciclovir treatment • Multiple MeSH senses, only one possible relationship: alcoholism treatment • Single MeSH senses, multiple relationships: hospital databases, education efforts, kidney metabolism • Multiple MeSH senses, multiple relationships: bacteria mortality • Ambiguity of relationship: most problematic cases … but rare!

  38. Conclusions on NN Relation Classification • Very simple method for assigning semantic relations to two-word technical NCs • 90.8% accuracy • Lexical resource (MeSH) useful for this task • Probably works because of the relative lack of ambiguity in this kind of technical text.

  39. Entity-Entity Relation Recognition

  40. Treatment Disease Problem: Which relations hold between 2 entities? Cure? Prevent? Side Effect?

  41. Hepatitis Examples • Cure • These results suggest that con A-induced hepatitis was ameliorated by pretreatment with TJ-135. • Prevent • A two-dose combined hepatitis A and B vaccine would facilitate immunization programs • Vague • Effect of interferon on hepatitis B

  42. Two tasks • Relationship Extraction: • Identify the several semantic relations that can occur between the entities disease and treatment in bioscience text • Entity extraction: • Related problem: identify such entities

  43. The Approach • Data: MEDLINE abstracts and titles • Graphical models • Combine in one framework both relation and entity extraction • Both static and dynamic models • Simple discriminative approach: • Neural network • Lexical, syntactic and semantic features

  44. Related Work • We allow several DIFFERENT relations between the same entities • Thus differs from the problem statement of other work on relations • Many find one relation which holds between two entities (many based on ACE) • Agichtein and Gravano (2000), lexical patterns for location of • Zelenko et al. (2002) SVM for person affiliation and organization-location • Hasegawa et al. (ACL 2004) Person-Organization -> President “relation” • Craven (1999, 2001) HMM for subcellular-location and disorder-association • Doesn’t identify the actual relation

  45. Related work: Bioscience • Many hand-built rules • Feldman et al. (2002), • Friedman et al. (2001) • Pustejovsky et al. (2002) • Saric et al.; this conference

  46. Data and Relations • MEDLINE, abstracts and titles • 3662 sentences labeled • Relevant: 1724 • Irrelevant: 1771 • e.g., “Patients were followed up for 6 months” • 2 types of Entities, many instances • treatment and disease • 7 Relationships between these entities

  47. Semantic Relationships • 810: Cure • Intravenous immune globulin for recurrent spontaneous abortion • 616: Only Disease • Social ties and susceptibility to the common cold • 166: Only Treatment • Fluticasone propionate is safe in recommended doses • 63: Prevent • Statins for prevention of stroke

  48. Semantic Relationships • 36: Vague • Phenylbutazone and leukemia • 29: Side Effect • Malignant mesodermal mixed tumor of the uterus following irradiation • 4: Does NOT cure • Evidence for double resistance to permethrin and malathion in head lice
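
As a quick arithmetic check, the seven relationship counts on slides 47 and 48 add up to the 1724 relevant sentences reported on slide 46, and the labels are heavily skewed toward Cure. The short script below only restates the slide numbers and computes each label's share:

```python
# Counts as listed on slides 47 and 48.
relation_counts = {
    "Cure": 810,
    "Only Disease": 616,
    "Only Treatment": 166,
    "Prevent": 63,
    "Vague": 36,
    "Side Effect": 29,
    "Does NOT cure": 4,
}

total = sum(relation_counts.values())
shares = {label: count / total for label, count in relation_counts.items()}

print("total relevant sentences:", total)
for label, share in shares.items():
    print(f"{label:15s} {relation_counts[label]:4d} {share:6.1%}")
```

Cure alone accounts for just under half of the relevant sentences, while Does NOT cure has only four examples.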

  49. Features • Word • Part of speech • Phrase constituent • Orthographic features • ‘is number’, ‘all letters are capitalized’, ‘first letter is capitalized’ … • MeSH (semantic features) • Replace words, or sequences of words, with generalizations via MeSH categories • Peritoneum -> Abdomen
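
The orthographic and MeSH features on slide 49 can be sketched per token as below. This is a sketch under stated assumptions, not the feature extractor used in the talk: the one-entry `MESH_GENERALIZATION` table only reproduces the slide's Peritoneum -> Abdomen example, the function names are hypothetical, and the part-of-speech and phrase-constituent features are left out because they require a tagger and a chunker.

```python
# Toy generalization table (assumption: a single entry taken from the slide).
MESH_GENERALIZATION = {"peritoneum": "Abdomen"}


def orthographic_features(token: str) -> dict:
    """The three orthographic tests named on the slide."""
    letters = [ch for ch in token if ch.isalpha()]
    return {
        "is_number": token.replace(".", "", 1).isdigit(),
        "all_caps": bool(letters) and all(ch.isupper() for ch in letters),
        "init_cap": token[:1].isupper(),
    }


def token_features(token: str) -> dict:
    """Word, orthographic features, and a MeSH generalization when one is known."""
    features = {"word": token.lower()}
    features.update(orthographic_features(token))
    features["mesh"] = MESH_GENERALIZATION.get(token.lower(), token.lower())
    return features


for tok in ["Peritoneum", "TJ-135", "6", "hepatitis"]:
    print(tok, token_features(tok))
```

Falling back to the lowercased word when no MeSH category is known keeps the feature defined for every token; a real system would also need to match multi-word terms.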

  50. Models • 2 static generative models • 3 dynamic generative models • 1 discriminative model (neural network)
