A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text A. S. Schwartz & M. A. Hearst UC Berkeley Presented by Jing Jiang
The Problem – to Identify Acronyms • To identify <“short form”, “long form”> pairs from biomedical text: • Short form is abbreviation of long form • There exists character mapping from short form to long form • Example: • Gcn5-related N-acetyltransferase (GNAT) • A non-trivial problem: • Words in long form may be skipped • Internal letters in long form may be used
Previous Work • Machine learning approach • Linear regression (Chang et al.) • Encoding and compression (Yeates et al.) • Heuristic approach • Rule-based • Factors considered include: • Distance between definition and abbreviation • Number of stop words • Capitalization
Step 1: Identifying Candidates • Consider only two cases: • long form '(' short form ')' • short form '(' long form ')' • Short form: • No more than 2 words • Between 2 and 10 chars • At least one letter • First char alphanumeric • Long form: • Adjacent to short form • No more than min(|A| + 5, |A| * 2) words, where |A| is the number of characters in the short form
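The Step 1 constraints can be sketched as two small checks. This is a minimal sketch, not the authors' code: the names `CandidateRules`, `isValidShortForm`, and `maxLongFormWords` are hypothetical, and it assumes that |A| counts every character of the trimmed short form and that words are separated by whitespace.

```java
// Hypothetical helper illustrating the Step 1 candidate constraints.
final class CandidateRules {

    // Returns true if the candidate satisfies the short-form constraints.
    static boolean isValidShortForm(String candidate) {
        String s = candidate.trim();
        if (s.length() < 2 || s.length() > 10) return false;        // between 2 and 10 chars
        if (s.split("\\s+").length > 2) return false;                // no more than 2 words
        if (!Character.isLetterOrDigit(s.charAt(0))) return false;   // first char alphanumeric
        return s.chars().anyMatch(Character::isLetter);              // at least one letter
    }

    // Upper bound on the number of words in the long form: min(|A| + 5, |A| * 2).
    // Assumption: |A| is the character count of the trimmed short form.
    static int maxLongFormWords(String shortForm) {
        int a = shortForm.trim().length();
        return Math.min(a + 5, a * 2);
    }
}
```

Under these assumptions, `GNAT` passes the short-form check and allows a long form of at most min(9, 8) = 8 words.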
Step 2: Identifying Correct Long Forms • Scanning from right to left, find the shortest long form that matches the short form: • Each character in the short form must match a character in the long form, in the same order • The first character of the short form must match the first character of the first word in the long form
Java Code for Finding the Best Long Form for a Given Short Form
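A minimal sketch of the right-to-left matching described in Step 2, written from that description rather than copied from the authors' listing. The class name `LongFormFinder` is hypothetical. The sketch assumes matching is case-insensitive, that non-alphanumeric characters in the short form are skipped, and that the `longForm` argument is the candidate text adjacent to the short form.

```java
// Hypothetical class name; sketches the Step 2 right-to-left matching.
final class LongFormFinder {

    // Returns the shortest suffix of longForm (starting at a word boundary)
    // that matches shortForm, or null if no match exists.
    static String findBestLongForm(String shortForm, String longForm) {
        int lIndex = longForm.length() - 1;

        // Walk the short form from its last character to its first.
        for (int sIndex = shortForm.length() - 1; sIndex >= 0; sIndex--) {
            char curr = Character.toLowerCase(shortForm.charAt(sIndex));
            if (!Character.isLetterOrDigit(curr)) {
                continue; // assumption: punctuation in the short form is ignored
            }
            // Move left until the character matches. For the first short-form
            // character, additionally require that the match starts a word.
            while ((lIndex >= 0
                        && Character.toLowerCase(longForm.charAt(lIndex)) != curr)
                    || (sIndex == 0 && lIndex > 0
                        && Character.isLetterOrDigit(longForm.charAt(lIndex - 1)))) {
                lIndex--;
            }
            if (lIndex < 0) {
                return null; // ran out of long-form text: no match
            }
            lIndex--;
        }
        // Extend left to the beginning of the word containing the last match.
        int start = longForm.lastIndexOf(" ", lIndex) + 1;
        return longForm.substring(start);
    }
}
```

For the example pair, `findBestLongForm("GNAT", "Gcn5-related N-acetyltransferase")` returns the full phrase, while `findBestLongForm("5-HT", "serotonin")` returns `null` because no `h` can be matched.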
Evaluation • 1000 randomly selected MEDLINE abstracts • 82% recall, 95% precision • Medstract Gold Standard Evaluation Corpus • 82% recall, 96% precision • Compared with • 83% recall, 80% precision (Chang et al., linear regression) • 72% recall, 98% precision (Pustejovsky et al., heuristics)
Missing Pairs • Skipped characters in short form • <CNS1, cyclophilin seven suppressor> • No match • <5-HT, serotonin> • Out of order • <ATN, anterior thalamus> • Partial match • <Pol I, RNA polymerase I>
Discussion • Pros: • Simple method • Decent performance • Questions: • Tradeoff between complexity of rules and performance • Generality of the heuristic rules • Heuristics vs. machine learning
Mining MEDLINE for Implicit Links between Dietary Substances and Diseases P. Srinivasan & B. Libbus U. Iowa Presented by Jing Jiang
The Goal – to Discover Implicit Links between Topics • Open discovery • Start from topic A • Navigate through intermediate topics B1, B2, etc. • Reach terminal topics C1, C2, etc. • Closed discovery • Start from topics A and C • Find connections B1, B2, etc.
Terminology • Topic Profile: a set of terms that are highly related to the topic, together with weights assigned to each term • MeSH: Medical Subject Heading • UMLS types: Unified Medical Language System semantic types
Open Discovery Algorithm • Input: • Topic A • Two sets of UMLS types ST-B & ST-C • Threshold M • Output: • Terms related to A and of some type in ST-C
Open Discovery Algorithm (cont.) • Build topic A’s profile AP • For each type in ST-B, select M top terms B1, B2, etc. from AP • Build Bi’s profiles BPi • Build combined profile CP from BPs limited to types in ST-C • Remove terms directly linked to A from CP
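The five steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `OpenDiscovery`, `Term`, and the `buildProfile` callback are hypothetical names, "directly linked to A" is taken to mean "already present in A's profile", and weights of a C term reached through several B terms are combined by summation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of the open discovery steps.
final class OpenDiscovery {

    // A profile entry: term name, its UMLS semantic type, and its weight.
    static final class Term {
        final String name;
        final String type;
        final double weight;

        Term(String name, String type, double weight) {
            this.name = name;
            this.type = type;
            this.weight = weight;
        }
    }

    // buildProfile is a caller-supplied function mapping a topic to its profile.
    static Map<String, Double> run(String topicA, Set<String> stB, Set<String> stC,
                                   int m, Function<String, List<Term>> buildProfile) {
        // Build topic A's profile AP.
        List<Term> ap = buildProfile.apply(topicA);

        // For each type in ST-B, select the M highest-weighted terms from AP.
        List<String> bTerms = new ArrayList<>();
        for (String type : stB) {
            ap.stream()
              .filter(t -> t.type.equals(type))
              .sorted(Comparator.comparingDouble((Term t) -> t.weight).reversed())
              .limit(m)
              .forEach(t -> bTerms.add(t.name));
        }

        // Assumption: a term is "directly linked to A" if it appears in AP.
        Set<String> direct = new HashSet<>();
        for (Term t : ap) {
            direct.add(t.name);
        }

        // Build each BPi and merge into CP, keeping only ST-C types and
        // dropping A itself and terms directly linked to A.
        Map<String, Double> cp = new HashMap<>();
        for (String b : bTerms) {
            for (Term t : buildProfile.apply(b)) {
                if (stC.contains(t.type) && !direct.contains(t.name)
                        && !t.name.equals(topicA)) {
                    cp.merge(t.name, t.weight, Double::sum); // assumption: sum weights
                }
            }
        }
        return cp;
    }
}
```

Passing the profile builder as a function keeps the sketch independent of how profiles are actually retrieved.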
Building Profile for Topic A • Search PubMed for A • Extract MeSH terms from relevant documents • Compute TF * IDF • TF: # occurrences of the term in retrieved document set • IDF: log(N/TF) • N: # retrieved documents • Normalize the weight vector
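The weighting step can be sketched as below. The sketch follows the slide's definitions literally (TF as the term's count in the retrieved set, IDF = log(N/TF)), which differs from conventional TF-IDF, where IDF is based on document frequency over a whole collection. The names `TopicProfile` and `build` are hypothetical. It also assumes each retrieved document is given as a set of MeSH terms, that the natural logarithm is used, and that "normalize" means scaling to unit Euclidean length.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the profile weighting described on the slide.
final class TopicProfile {

    // docs: one set of MeSH terms per retrieved document (assumed input shape).
    static Map<String, Double> build(List<Set<String>> docs) {
        int n = docs.size(); // N: number of retrieved documents

        // TF: number of occurrences of each term in the retrieved set.
        Map<String, Integer> tf = new HashMap<>();
        for (Set<String> doc : docs) {
            for (String term : doc) {
                tf.merge(term, 1, Integer::sum);
            }
        }

        // Weight = TF * log(N / TF), as stated on the slide.
        Map<String, Double> weights = new HashMap<>();
        double sumSquares = 0.0;
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double w = e.getValue() * Math.log((double) n / e.getValue());
            weights.put(e.getKey(), w);
            sumSquares += w * w;
        }

        // Normalize to unit length (assumption: Euclidean norm).
        double norm = Math.sqrt(sumSquares);
        if (norm > 0) {
            for (Map.Entry<String, Double> e : weights.entrySet()) {
                e.setValue(e.getValue() / norm);
            }
        }
        return weights;
    }
}
```

Note that under this definition a term occurring in every retrieved document gets weight zero, since log(N/N) = 0.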
Testing with Turmeric • Topic A: Turmeric • ST-B: • Gene or Genome • Enzyme • Amino Acid, Peptide or Protein • ST-C: • Body Part, Organ or Organ Component • Disease or Syndrome • Neoplastic Process • M: 5, 10, 15
Results • B terms: • 37% recall, 38% precision (compared with manually identified terms) • C terms: • 67% recall, 67% precision (compared with manual results)
Discussion • Pros: • Simple method • Domain knowledge (MeSH terms, UMLS types) to shape search direction • Questions: • TF & IDF? • Longer path? • What relationships? • Co-occurrence = link?