
Automating Discovery from Biomedical Texts

Automating Discovery from Biomedical Texts. Marti Hearst & Barbara Rosario UC Berkeley Agyinc Visit August 16, 2000. UIs for building and reusing hypothesis seeking strategies. Statistical language analysis techniques for extracting propositions.


Presentation Transcript


  1. Automating Discovery from Biomedical Texts Marti Hearst & Barbara Rosario UC Berkeley Agyinc Visit August 16, 2000

  2. The LINDI Project: Linking Information for New Discoveries. Two Main Thrusts: • UIs for building and reusing hypothesis-seeking strategies • Statistical language analysis techniques for extracting propositions

  3. Scenario: Explore Functions of a Gene • Objective • Determine the functions of a newly sequenced Gene X. • Known facts • Gene X co-expresses (activated in the same cell) with Gene A, B, C • The relationship of Gene A, B, C with certain types of diseases (from medical literature) • Question • What types of diseases are Gene X related to?

  4. Gene Co-expression: Role in the genetic pathway [Diagram: pathway sketches with nodes Kall., PSA, PAP and unknown genes g?, h?] Other possibilities as well

  5. Make use of the literature • Look up what is known about the other genes. • Different articles in different collections • Look for commonalities • Similar topics indicated by Subject Descriptors • Similar words in titles and abstracts adenocarcinoma, neoplasm, prostate, prostatic neoplasms, tumor markers, antibodies ...

  6. Developing Strategies • Different strategies seem needed for different situations • First: see what is known about Kallikrein. • 7341 documents. Too many • AND the result with “disease” category • If result is non-empty, this might be an interesting gene • Now get 803 documents

  7. Explore Functions of New Gene X [Diagram labels: Medical Literature; Gene-A; Keywords. Operations: Query, Projection, Mapping] Slide adapted from K. Patel

  8. Developing Strategies • Different strategies seem needed for different situations • First: see what is known about Kallikrein. • 7341 documents. Too many • AND the result with “disease” category • If result is non-empty, this might be an interesting gene • Now get 803 documents • AND the result with PSA • Get 11 documents. Better!
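
The narrowing steps on this slide (start broad, then AND in one more constraint at a time) can be sketched with plain set intersection. This is a minimal illustration, not the LINDI implementation: the index, the document ids, and the `narrow` helper are all hypothetical, and the counts are toy values rather than the 7341 / 803 / 11 reported above.

```python
# Toy inverted index: term -> set of document ids (all values are made up).
index = {
    "kallikrein": {1, 2, 3, 4, 5, 6},
    "disease": {2, 3, 4, 9},
    "psa": {3, 4, 7},
}


def narrow(index, terms):
    """AND the terms together one at a time, recording the result size after each step."""
    result = None
    sizes = []
    for term in terms:
        postings = index.get(term, set())
        # The first term seeds the result; every later term intersects with it.
        result = set(postings) if result is None else result & postings
        sizes.append((term, len(result)))
    return result, sizes


docs, sizes = narrow(index, ["kallikrein", "disease", "psa"])
print(sizes)  # result size after each successive AND
print(docs)   # documents that satisfy all three constraints
```

Recording the size after every step mirrors how the researcher decides whether a result is still "too many" or small enough to read.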

  9. Explore Functions of New Gene X [Diagram labels: Medical Literature; Gene-A, Gene-B, Gene-C; Keywords. Operations: Query, Projection, Intersection]

  10. Developing Strategies • Look for commonalities among these documents • Manual scan through ~100 category labels • Would have been better if • Automatically organized • Intersections of “important” categories scanned for first

  11. Explore Functions of New Gene X [Diagram labels: Medical Literature; Gene-A, Gene-B, Gene-C; Keywords. Operations: Query, Projection, Intersection, Slicing, Mapping] Slide adapted from K. Patel

  12. Try a new tack • Researcher uses knowledge of field to realize these are related to prostate cancer and diagnostic tests • New tack: intersect search on all three known genes • Hope they all talk about diagnostics and prostate cancer • Fortunately, 7 documents returned • Bingo! A relation to regulation of this cancer

  13. Explore Functions of New Gene X [Diagram labels: Medical Literature; Gene-A, Gene-B, Gene-C; Keywords; Possible Function For Gene-X. Operations: Query, Projection, Intersection, Slicing, Mapping] Slide adapted from K. Patel

  14. Formulate a Hypothesis • Hypothesis: mystery gene has to do with regulation of expression of genes leading to prostate cancer • New tack: do some lab tests • See if mystery gene is similar in molecular structure to the others • If so, it might do some of the same things they do

  15. Strategies again • In hindsight, combining all three genes was a good strategy. • Store this for later • Might not have worked • Need a suite of strategies • Build them up via experience and a good UI

  16. The System • Doing the same query with slightly different values each time is time-consuming and tedious • Same goes for cutting and pasting results • IR systems don’t support varying queries like this very well. • Each situation is a bit different • Some automatic processing is needed in the background to eliminate/suggest hypotheses

  17. The User Interface • A general search interface should support • History • Context • Comparison • Operators: Intersection, Union, Slicing • Operator Reuse • Visualization (where appropriate) • We have an initial implementation • It needs lots of work

  18. Architecture of LINDI UI • Data Layer • Annotation Layer • User Interface Layer

  19. Data Layer • Purpose • Hide different formats of text collections • Components • Data: Abstractions representing records of a text collection • Operations: performed on the data • Data • A set of records • Each record is a set of tuples with types • Operations • union, intersection, projection, mapping
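
One way to read the data model on this slide (a data set is a set of records, each record a set of typed tuples) is sketched below. The type aliases, the `record` helper, and the exact semantics of `projection` and `mapping` are assumptions made for illustration; the slides name the operations but do not define them.

```python
from typing import FrozenSet, Tuple

# Assumed model: a field is a (type, value) tuple, a record is a set of fields,
# and a data set is a set of records.
Field = Tuple[str, str]
Record = FrozenSet[Field]
DataSet = FrozenSet[Record]


def record(*fields: Field) -> Record:
    return frozenset(fields)


def union(a: DataSet, b: DataSet) -> DataSet:
    return a | b


def intersection(a: DataSet, b: DataSet) -> DataSet:
    return a & b


def projection(data: DataSet, field_type: str) -> DataSet:
    """Keep only fields of one type in each record; drop records left empty."""
    projected = (frozenset(f for f in r if f[0] == field_type) for r in data)
    return frozenset(r for r in projected if r)


def mapping(data: DataSet, fn) -> frozenset:
    """Apply fn to every record (assumed meaning of 'mapping')."""
    return frozenset(fn(r) for r in data)


gene_a = frozenset({
    record(("title", "PSA in serum"), ("mesh", "prostate")),
    record(("title", "Kallikrein review"), ("mesh", "tumor markers")),
})
gene_b = frozenset({record(("title", "PAP assay"), ("mesh", "prostate"))})

# Records differ by title, so they only intersect after projecting onto MeSH headings.
shared = intersection(projection(gene_a, "mesh"), projection(gene_b, "mesh"))
print(shared)
```

Using immutable `frozenset` values keeps every operation side-effect free, which makes results safe to store and combine later.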

  20. Annotation Layer • Purpose • Associate data set with operations that produced them (history) • History is a first class object • Advantage • Streamline a sequence of operations • Reuse operations • Parameterize operations
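
Treating history as a first-class object could look roughly like the sketch below: a sequence of operations is recorded once and then replayed with a different parameter each time. The `History` class, its method names, and the toy documents are hypothetical; the slides do not describe the actual representation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class History:
    """An ordered list of named steps that can be replayed with a new parameter."""
    steps: List[Tuple[str, Callable[[Any, Any], Any]]] = field(default_factory=list)

    def record(self, name, fn):
        self.steps.append((name, fn))
        return self  # allow chaining

    def replay(self, data, param):
        # Each step receives the previous step's output plus the shared parameter.
        for _name, fn in self.steps:
            data = fn(data, param)
        return data

    def describe(self):
        return [name for name, _ in self.steps]


# Toy collection (made-up documents).
docs = [
    {"text": "kallikrein and PSA", "mesh": ["prostate"]},
    {"text": "PAP levels", "mesh": ["tumor markers"]},
]

strategy = (
    History()
    .record("query", lambda data, gene: [d for d in data if gene.lower() in d["text"].lower()])
    .record("project", lambda data, _gene: {m for d in data for m in d["mesh"]})
)

# Reuse the same recorded sequence with a different gene name each time.
results = {gene: strategy.replay(docs, gene) for gene in ["PSA", "PAP"]}
print(strategy.describe())
print(results)
```

Because the sequence is stored separately from any one result, it covers the three advantages listed on the slide: streamlining, reuse, and parameterization.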

  21. User Interface • Direct manipulation of information objects and access operations • Query • Intersection • Union • Mapping • Slicing • Record and reuse of past operations • Parameterization of operations • Streamlining of operations

  22. Initial Palette

  23. Query Structure Determined by Collection Type

  24. Query Operation Results

  25. Projection Operation and Subsequent Results

  26. Parameterized Query: Repeat operations with different values (GA, GB, GC)

  27. Intersection over Projected Attribute

  28. Intersection over Projected Attribute

  29. Example Interaction with UI Prototype (1) Query on gene names (2) Project out only MeSH headings (3) Intersect the results (4) Map to create a ranking (5) Slice out the top-ranked
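
The five steps on this slide can be strung together as in the sketch below. Everything here is illustrative: the corpus is invented, the function names are hypothetical, and ranking shared headings by how many retrieved documents carry them is an assumed scoring rule, since the slide does not say how the ranking is computed.

```python
from collections import Counter

# Toy corpus (made-up records): which genes a document mentions and its MeSH headings.
corpus = [
    {"genes": {"GA"}, "mesh": {"prostatic neoplasms", "tumor markers"}},
    {"genes": {"GA"}, "mesh": {"prostatic neoplasms"}},
    {"genes": {"GB"}, "mesh": {"prostatic neoplasms", "antibodies"}},
    {"genes": {"GB"}, "mesh": {"tumor markers"}},
    {"genes": {"GC"}, "mesh": {"prostatic neoplasms", "tumor markers"}},
]


def query(corpus, gene):
    return [d for d in corpus if gene in d["genes"]]


def project(docs):
    return [d["mesh"] for d in docs]  # one heading set per document


def run_strategy(corpus, genes, k):
    # 1-2. Query on each gene name, then project out only the MeSH headings.
    projected = {g: project(query(corpus, g)) for g in genes}
    # 3. Intersect: headings that occur for every gene.
    per_gene = [set().union(*sets) for sets in projected.values()]
    shared = set.intersection(*per_gene)
    # 4. Map to a ranking: count retrieved documents carrying each shared heading.
    counts = Counter(h for sets in projected.values() for s in sets for h in s if h in shared)
    ranking = sorted(shared, key=lambda h: (-counts[h], h))
    # 5. Slice out the top-ranked headings.
    return ranking[:k], counts


top, counts = run_strategy(corpus, ["GA", "GB", "GC"], k=1)
print(top, dict(counts))
```

The headings that survive the slice are the candidate topics a researcher would inspect first when forming a hypothesis about Gene X.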

  30. Future Work on UI • As currently designed • Better labeling • Better layout • Intuitive • Scalable • Connection to real backend • User Testing • Does direct manipulation work? • What operator sequences help? • How to improve parameterization? • More advanced • Support for strategies • Incorporation of NLP

  31. Language Analysis Component Goals: • Extract Propositions from Text • Make Inferences

  32. Language Analysis Component Why Extract Propositions from Text? • Text is how knowledge at the propositional level is communicated • Text is continually being created and updated by the outside world

  33. Example: Statistical Semantic Grammar To detect causal relationships between medical concepts • Title: Magnesium deficiency implicated in increased stress levels. • Interpretation: <nutrient><reduction> related-to <increase><symptom> • Inference: • Increase(stress, decrease(mg))
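
To make the title-to-inference step concrete, here is a deliberately simple, hand-written (non-statistical) sketch that reproduces the example on this slide. The lexicon, the set of linking words, and the `interpret` function are assumptions for illustration only; a statistical semantic grammar as proposed in the talk would learn such classes and patterns from data rather than hard-code them.

```python
# Hypothetical lexicon: word -> (semantic class, normalized value).
LEXICON = {
    "magnesium": ("nutrient", "mg"),
    "deficiency": ("reduction", "decrease"),
    "increased": ("increase", "increase"),
    "stress": ("symptom", "stress"),
}
# Words treated as the "related-to" connective (assumed).
LINKS = {"implicated"}


def interpret(title):
    """Match <nutrient><reduction> related-to <increase><symptom> and emit an inference."""
    tokens = [t.strip(".,").lower() for t in title.split()]
    tagged = [LEXICON[t] for t in tokens if t in LEXICON]
    classes = [cls for cls, _ in tagged]
    # Simplification: the linking word may appear anywhere in the title.
    has_link = any(t in LINKS for t in tokens)
    if has_link and classes == ["nutrient", "reduction", "increase", "symptom"]:
        nutrient, reduction, increase, symptom = (value for _, value in tagged)
        return f"{increase.capitalize()}({symptom}, {reduction}({nutrient}))"
    return None  # pattern did not match


inference = interpret("Magnesium deficiency implicated in increased stress levels.")
print(inference)  # Increase(stress, decrease(mg))
```

The brittleness is easy to see: any title that uses a word outside the lexicon or a different word order returns `None`, which is the limitation the probabilistic approach is meant to address.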

  34. Statistical Semantic Grammars • Empirical NLP has made great strides • But mainly applied to syntactic structure • Semantic grammars are powerful, but • Brittle • Time-consuming to construct • Idea: • Use what we now know about statistical NLP to build up a probabilistic grammar

  35. LINDI: Target Components • Special UI for retrieving appropriate docs • Language analysis on docs to detect causal relationships between concepts • Probabilistic representation of concepts and relationships • UI + User: Hypothesis creation
