
A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text

A. S. Schwartz & M. A. Hearst, UC Berkeley. Presented by Jing Jiang. The problem: to identify <“short form”, “long form”> pairs in biomedical text.


Presentation Transcript


  1. A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text A. S. Schwartz & M. A. Hearst UC Berkeley Presented by Jing Jiang

  2. The Problem – to Identify Acronyms • To identify <“short form”, “long form”> pairs from biomedical text: • Short form is abbreviation of long form • There exists character mapping from short form to long form • Example: • Gcn5-related N-acetyltransferase (GNAT) • A non-trivial problem: • Words in long form may be skipped • Internal letters in long form may be used

  3. Previous Work • Machine learning approach • Linear regression (Chang et al.) • Encoding and compression (Yeates et al.) • Heuristic approach • Rule-based • Factors considered include: • Distance between definition and abbreviation • Number of stop words • Capitalization

  4. Step 1: Identifying Candidates • Consider only two cases: • long form ‘(’ short form ‘)’ • short form ‘(’ long form ‘)’ • Short form: • No more than 2 words • Between 2 and 10 chars • At least one letter • First char alphanumeric • Long form: • Adjacent to short form • No more than min(|A| + 5, |A| * 2) words, where |A| is the number of characters in the short form
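
The candidate rules on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, whitespace tokenization is an assumption, and only the first pattern (long form followed by a parenthesized short form) is handled.

```python
import re
from typing import Iterator, Tuple


def is_valid_short_form(s: str) -> bool:
    """Check the short-form constraints listed on the slide."""
    s = s.strip()
    return (
        len(s.split()) <= 2                  # no more than 2 words
        and 2 <= len(s) <= 10                # between 2 and 10 characters
        and any(c.isalpha() for c in s)      # at least one letter
        and s[0].isalnum()                   # first character alphanumeric
    )


def max_long_form_words(short_form: str) -> int:
    """Upper bound on long-form length in words: min(|A| + 5, |A| * 2)."""
    a = len(short_form)
    return min(a + 5, a * 2)


def long_form_first_candidates(sentence: str) -> Iterator[Tuple[str, str]]:
    """Yield (short form, candidate long form) for 'long form (short form)'."""
    for m in re.finditer(r"\(([^()]+)\)", sentence):
        short = m.group(1).strip()
        if not is_valid_short_form(short):
            continue
        # Assumption: words are whitespace-separated tokens before the '('.
        words = sentence[: m.start()].split()
        if words:
            yield short, " ".join(words[-max_long_form_words(short):])
```

For the slide's example, `long_form_first_candidates("Gcn5-related N-acetyltransferase (GNAT) is a family")` yields the pair `("GNAT", "Gcn5-related N-acetyltransferase")`; the candidate long form is then narrowed in Step 2.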

  5. Step 2: Identifying Correct Long Forms • Scanning from right to left, find the shortest long form that matches the short form: • Each character in the short form must match a character in the long form, in the same order • The first character of the short form must match the initial character of the first word in the long form
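
The right-to-left matching rule can be sketched in Python as below. This is an illustrative version written for this transcript, not the Java listing shown in the talk; the function name and the choice to skip non-alphanumeric characters in the short form are assumptions.

```python
from typing import Optional


def find_best_long_form(short_form: str, long_form: str) -> Optional[str]:
    """Return the shortest suffix of long_form that matches short_form, or None."""
    s_index = len(short_form) - 1
    l_index = len(long_form) - 1
    while s_index >= 0:
        curr = short_form[s_index].lower()
        if not curr.isalnum():
            # Assumption: punctuation and spaces in the short form are ignored.
            s_index -= 1
            continue
        # Move left until the character matches; for the first short-form
        # character, also require that the match starts a word.
        while (l_index >= 0 and long_form[l_index].lower() != curr) or (
            s_index == 0 and l_index > 0 and long_form[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None  # ran out of long-form text: no match
        l_index -= 1
        s_index -= 1
    # Extend left to the start of the word containing the first match.
    start = long_form.rfind(" ", 0, l_index + 1) + 1
    return long_form[start:]
```

Called as `find_best_long_form("GNAT", "a member of the Gcn5-related N-acetyltransferase")`, the sketch returns `"Gcn5-related N-acetyltransferase"`. For `("5-HT", "serotonin")` it returns `None`, since no `h` is available to match.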

  6. Java Code for Finding the Best Long Form for a Given Short Form

  7. Evaluation • 1000 randomly selected MEDLINE abstracts • 82% recall, 95% precision • Medstract Gold Standard Evaluation Corpus • 82% recall, 96% precision • Compared with • 83% recall, 80% precision (Chang et al., linear regression) • 72% recall, 98% precision (Pustejovsky et al., heuristics)

  8. Missing Pairs • Skipped characters in short form • <CNS1, cyclophilin seven suppressor> • No match • <5-HT, serotonin> • Out of order • <ATN, anterior thalamus> • Partial match • <Pol I, RNA polymerase I>

  9. Discussion • Pros: • Simple method • Decent performance • Questions: • Tradeoff between complexity of rules and performance • Generality of the heuristic rules • Heuristics vs. machine learning

  10. Mining MEDLINE for Implicit Links between Dietary Substances and Diseases P. Srinivasan & B. Libbus U. Iowa Presented by Jing Jiang

  11. The Goal – to Discover Implicit Links between Topics • Open discovery • Start from topic A • Navigate through intermediate topics B1, B2, etc. • Reach terminal topics C1, C2, etc. • Closed discovery • Start from topics A and C • Find connections B1, B2, etc.

  12. General model for discovering implicit links between topics

  13. Terminology • Topic Profile: a set of terms that are highly related to the topic, together with a weight assigned to each term • MeSH: Medical Subject Headings • UMLS types: Unified Medical Language System semantic types

  14. Open Discovery Algorithm • Input: • Topic A • Two sets of UMLS types ST-B & ST-C • Threshold M • Output: • Terms related to A and of some type in ST-C

  15. Open Discovery Algorithm (cont.) • Build topic A’s profile AP • For each type in ST-B, select M top terms B1, B2, etc. from AP • Build Bi’s profiles BPi • Build combined profile CP from BPs limited to types in ST-C • Remove terms directly linked to A from CP
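
The open discovery steps above can be sketched in Python as follows. This is a rough outline under stated assumptions, not the authors' implementation: `build_profile` and `types_of` are hypothetical callables supplied by the caller, B profiles are combined by summing weights (the slides do not say how), and "directly linked to A" is approximated as "already present in A's profile".

```python
from typing import Callable, Dict, Iterable, Set

Profile = Dict[str, float]  # term -> weight


def open_discovery(
    topic_a: str,
    st_b: Iterable[str],
    st_c: Iterable[str],
    m: int,
    build_profile: Callable[[str], Profile],  # hypothetical profile builder
    types_of: Callable[[str], Set[str]],      # hypothetical UMLS type lookup
) -> Profile:
    st_c_set = set(st_c)
    ap = build_profile(topic_a)

    # For each type in ST-B, keep the M highest-weighted terms of that type.
    b_terms: Set[str] = set()
    for sem_type in st_b:
        typed = [t for t in ap if sem_type in types_of(t)]
        typed.sort(key=lambda t: ap[t], reverse=True)
        b_terms.update(typed[:m])

    # Build each B profile and merge them, keeping only ST-C typed terms.
    # Assumption: weights from different B profiles are summed.
    cp: Profile = {}
    for b in b_terms:
        for term, weight in build_profile(b).items():
            if types_of(term) & st_c_set:
                cp[term] = cp.get(term, 0.0) + weight

    # Assumption: a term in A's own profile counts as directly linked to A.
    return {t: w for t, w in cp.items() if t not in ap and t != topic_a}
```

The terms that remain in the returned profile are the candidate C topics, ranked by their combined weight.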

  16. Building Profile for Topic A • Search PubMed for A • Extract MeSH terms from relevant documents • Compute TF * IDF • TF: # occurrences of the term in retrieved document set • IDF: log(N/TF) • N: # retrieved documents • Normalize the weight vector
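
The weighting on this slide can be sketched in Python as below, using the slide's own definitions of TF, IDF and N. The function name is hypothetical, and two points are assumptions: each MeSH term is counted at most once per document (so TF never exceeds N), and "normalize" is read as scaling the vector to unit Euclidean length.

```python
import math
from collections import Counter
from typing import Dict, List


def build_profile_from_mesh(docs_mesh_terms: List[List[str]]) -> Dict[str, float]:
    """Weight MeSH terms of a retrieved document set by TF * log(N / TF)."""
    n_docs = len(docs_mesh_terms)  # N: number of retrieved documents
    # TF: occurrences of the term in the retrieved set.
    # Assumption: a term is counted once per document.
    tf = Counter(term for doc in docs_mesh_terms for term in set(doc))
    raw = {term: f * math.log(n_docs / f) for term, f in tf.items()}
    # Assumption: normalization means dividing by the L2 norm.
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0:
        return {term: 0.0 for term in raw}
    return {term: w / norm for term, w in raw.items()}
```

Under this formula a term that appears in every retrieved document gets weight 0, because log(N/TF) = log(1) = 0.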

  17. Testing with Turmeric • Topic A: Turmeric • ST-B: • Gene or Genome • Enzyme • Amino Acid, Peptide or Protein • ST-C: • Body Part, Organ or Organ Component • Disease or Syndrome • Neoplastic Process • M: 5, 10, 15

  18. Results • B terms: • 37% recall, 38% precision (compared with manually identified terms) • C terms: • 67% recall, 67% precision (compared with manual results)

  19. Novel C MeSH Terms

  20. Discussion • Pros: • Simple method • Domain knowledge (MeSH terms, UMLS types) to shape search direction • Questions: • TF & IDF? • Longer path? • What relationships? • Co-occurrence = link?

  21. End of Talk
