A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text

A. S. Schwartz & M. A. Hearst, UC Berkeley
Presented by Jing Jiang

The Problem – to Identify Acronyms
• Identify <“short form”, “long form”> pairs in biomedical text:
  • The short form is an abbreviation of the long form
  • There is a character mapping from the short form to the long form
• Example:
  • Gcn5-related N-acetyltransferase (GNAT)
• A non-trivial problem:
  • Words in the long form may be skipped
  • Internal letters of the long form may be used

Previous Work
• Machine learning approach
  • Linear regression (Chang et al.)
  • Encoding and compression (Yeates et al.)
• Heuristic approach
  • Rule-based
  • Factors considered include:
    • Distance between definition and abbreviation
    • Number of stop words
    • Capitalization

Step 1: Identifying Candidates
• Consider only two cases:
  • long form '(' short form ')'
  • short form '(' long form ')'
• Short form:
  • No more than 2 words
  • Between 2 and 10 characters
  • At least one letter
  • First character alphanumeric
• Long form:
  • Adjacent to the short form
  • No more than min(|A| + 5, |A| * 2) words, where |A| is the number of characters in the short form
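The candidate constraints above can be sketched in a few lines of Python. This is only an illustration: the function names `is_valid_short_form`, `max_long_form_words`, and `candidate_long_form` are hypothetical, and counting spaces toward the 2 to 10 character limit is an assumption of this sketch.

```python
def is_valid_short_form(short_form: str) -> bool:
    """Check the Step 1 constraints on a short-form candidate."""
    s = short_form.strip()
    words = s.split()
    return (
        0 < len(words) <= 2                 # no more than 2 words
        and 2 <= len(s) <= 10               # between 2 and 10 characters
        and any(ch.isalpha() for ch in s)   # at least one letter
        and s[0].isalnum()                  # first character alphanumeric
    )


def max_long_form_words(short_form: str) -> int:
    """Upper bound on long-form length in words: min(|A| + 5, |A| * 2)."""
    n = len(short_form)
    return min(n + 5, n * 2)


def candidate_long_form(text_before_paren: str, short_form: str) -> str:
    """Keep only the words adjacent to the parenthesis, up to the word limit."""
    limit = max_long_form_words(short_form)
    return " ".join(text_before_paren.split()[-limit:])
```

For the short form "GNAT", |A| = 4, so the long-form candidate may span at most min(9, 8) = 8 words.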

Step 2: Identifying Correct Long Forms
• Scanning from right to left, choose the shortest long form that matches the short form:
  • Each character in the short form must match a character in the long form, in the same order
  • The first character of the short form must match the initial character of the first word of the long form

Java Code for Finding the Best Long Form for a Given Short Form
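The Java listing from this slide is not reproduced in the transcript. As a stand-in, the right-to-left matching described in Step 2 can be sketched in Python. The function name `find_best_long_form` and the choice to return `None` when no match is found are assumptions of this sketch, and it should not be read as the authors' exact code.

```python
from typing import Optional


def find_best_long_form(short_form: str, long_form: str) -> Optional[str]:
    """Return the shortest suffix of long_form matching short_form, or None."""
    s = len(short_form) - 1
    l = len(long_form) - 1
    while s >= 0:
        c = short_form[s].lower()
        # Non-alphanumeric characters in the short form are ignored.
        if not c.isalnum():
            s -= 1
            continue
        # Move left until the character matches. For the first character of
        # the short form, also require that the match starts a word.
        while (l >= 0 and long_form[l].lower() != c) or (
            s == 0 and l > 0 and long_form[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None  # ran out of long-form text: no match
        l -= 1
        s -= 1
    # Extend the match back to the beginning of the word where it ended.
    start = long_form.rfind(" ", 0, l + 1) + 1
    return long_form[start:]


print(find_best_long_form("GNAT", "the Gcn5-related N-acetyltransferase"))
print(find_best_long_form("5-HT", "serotonin"))
```

The first call drops the leading "the" because the match for "G" is found at the start of "Gcn5-related"; the second returns `None`, which corresponds to the "No match" case listed under Missing Pairs.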

Evaluation
• 1000 randomly selected MEDLINE abstracts
  • 82% recall, 95% precision
• Medstract Gold Standard Evaluation Corpus
  • 82% recall, 96% precision
• Compared with:
  • 83% recall, 80% precision (Chang et al., linear regression)
  • 72% recall, 98% precision (Pustejovsky et al., heuristics)

Missing Pairs
• Skipped characters in short form
  • <CNS1, cyclophilin seven suppressor>
• No match
  • <5-HT, serotonin>
• Out of order
  • <ATN, anterior thalamus>
• Partial match
  • <Pol I, RNA polymerase I>

Discussion
• Pros:
  • Simple method
  • Decent performance
• Questions:
  • Tradeoff between complexity of rules and performance
  • Generality of the heuristic rules
  • Heuristics vs. machine learning

Mining MEDLINE for Implicit Links between Dietary Substances and Diseases
P. Srinivasan & B. Libbus, U. Iowa
Presented by Jing Jiang

The Goal – to Discover Implicit Links between Topics
• Open discovery
  • Start from topic A
  • Navigate through intermediate topics B1, B2, etc.
  • Reach terminal topics C1, C2, etc.
• Closed discovery
  • Start from topics A and C
  • Find connections B1, B2, etc.

General model for discovering implicit links between topics

Terminology
• Topic profile: a set of terms that are highly related to the topic, together with a weight assigned to each term
• MeSH: Medical Subject Headings
• UMLS types: Unified Medical Language System semantic types

Open Discovery Algorithm
• Input:
  • Topic A
  • Two sets of UMLS types, ST-B and ST-C
  • Threshold M
• Output:
  • Terms related to A and of some type in ST-C

Open Discovery Algorithm (cont.)
• Build topic A’s profile AP
• For each type in ST-B, select the M top terms B1, B2, etc. from AP
• Build each Bi’s profile BPi
• Build a combined profile CP from the BPs, limited to types in ST-C
• Remove terms directly linked to A from CP
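The steps above can be sketched as a short Python function. Several details are assumptions made only for illustration: `build_profile` and `types_of` are hypothetical callables supplied by the caller, the combined profile sums the weights from the individual B profiles, and "directly linked to A" is approximated as "appears in A's profile". The toy data at the end is invented and carries no biomedical meaning.

```python
def open_discovery(topic_a, st_b, st_c, m, build_profile, types_of):
    """Sketch of open discovery.

    build_profile(topic) -> {term: weight}   (hypothetical helper)
    types_of(term)       -> set of UMLS semantic types (hypothetical helper)
    """
    ap = build_profile(topic_a)

    # For each type in ST-B, take the M highest-weighted terms from AP.
    b_terms = []
    for sem_type in st_b:
        ranked = sorted(
            (t for t in ap if sem_type in types_of(t)),
            key=lambda t: ap[t],
            reverse=True,
        )
        for term in ranked[:m]:
            if term not in b_terms:
                b_terms.append(term)

    # Combine the B profiles, keeping only terms of a type in ST-C.
    # Summing weights is an assumption of this sketch.
    allowed = set(st_c)
    cp = {}
    for b in b_terms:
        for term, weight in build_profile(b).items():
            if types_of(term) & allowed:
                cp[term] = cp.get(term, 0.0) + weight

    # Drop terms already linked to A (approximated here by membership in AP).
    for term in ap:
        cp.pop(term, None)
    return cp


# Toy, invented data for illustration only.
profiles = {
    "A": {"b1": 0.9, "b2": 0.5, "c0": 0.3},
    "b1": {"c1": 0.6, "c0": 0.2, "x": 0.9},
    "b2": {"c1": 0.1, "c2": 0.4},
}
types = {
    "b1": {"Enzyme"}, "b2": {"Enzyme"},
    "c0": {"Disease"}, "c1": {"Disease"}, "c2": {"Disease"},
    "x": {"Other"},
}
result = open_discovery(
    "A", ["Enzyme"], ["Disease"], 2,
    lambda t: profiles.get(t, {}),
    lambda t: types.get(t, set()),
)
print(sorted(result))
```

In the toy run, `c0` is excluded because it already appears in A's profile, and `x` is excluded because its type is not in ST-C, leaving `c1` and `c2` as candidate C terms.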

Building Profile for Topic A
• Search PubMed for A
• Extract MeSH terms from relevant documents
• Compute TF * IDF
  • TF: # occurrences of the term in the retrieved document set
  • IDF: log(N/TF)
  • N: # retrieved documents
• Normalize the weight vector
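The weighting on this slide can be illustrated with a small Python sketch. It assumes that each MeSH term is counted at most once per document and that "normalize" means scaling the weight vector to unit length; neither detail is stated on the slide. The function name `build_profile` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def build_profile(docs_mesh_terms):
    """Weight each MeSH term by TF * log(N / TF), then scale to unit length.

    docs_mesh_terms: one list of MeSH terms per retrieved document.
    """
    n_docs = len(docs_mesh_terms)
    # TF: occurrences of the term across the retrieved set
    # (assumption: at most one occurrence per document).
    tf = Counter(term for terms in docs_mesh_terms for term in set(terms))
    raw = {term: count * math.log(n_docs / count) for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0:
        return {term: 0.0 for term in raw}
    return {term: w / norm for term, w in raw.items()}


# Invented example: three retrieved documents and their MeSH terms.
docs = [
    ["Curcumin", "Apoptosis"],
    ["Curcumin", "NF-kappa B"],
    ["Curcumin"],
]
profile = build_profile(docs)
print(profile)
```

One consequence of IDF = log(N/TF) is visible in the example: a term that occurs in every retrieved document gets weight 0, since log(N/N) = 0.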

Testing with Turmeric
• Topic A: Turmeric
• ST-B:
  • Gene or Genome
  • Enzyme
  • Amino Acid, Peptide or Protein
• ST-C:
  • Body Part, Organ or Organ Component
  • Disease or Syndrome
  • Neoplastic Process
• M: 5, 10, 15

Results
• B terms:
  • 37% recall, 38% precision (compared with manually identified terms)
• C terms:
  • 67% recall, 67% precision (compared with manual results)

Novel C MeSH Terms

Discussion
• Pros:
  • Simple method
  • Domain knowledge (MeSH terms, UMLS types) used to shape the search direction
• Questions:
  • TF & IDF?
  • Longer path?
  • What relationships?
  • Co-occurrence = link?

End of Talk
