
Identifying Abbreviation Definitions in Biomedical Text

Identifying Abbreviation Definitions in Biomedical Text. Ariel Schwartz Marti Hearst. The Problem. The volume of biomedical text is growing at a fast rate. New abbreviations are introduced frequently. Manual abbreviation dictionaries are out of date.


Presentation Transcript


  1. Identifying Abbreviation Definitions in Biomedical Text. Ariel Schwartz and Marti Hearst

  2. The Problem
     • The volume of biomedical text is growing rapidly, and new abbreviations are introduced frequently.
     • Manually maintained abbreviation dictionaries are out of date.
     • The goal is a simple, fast, and accurate algorithm that identifies abbreviations and their definitions in biomedical text.
     • We use this algorithm as one of many preprocessing steps applied to biomedical texts so that meaningful information can be extracted from them.

  3. Abbreviation Examples
     • “Heat-shock protein 40 (Hsp40) enables Hsp70 to play critical roles in a number of cellular processes, such as protein folding, assembly, degradation and translocation in vivo.”
     • “Glutathione S-transferase pull-down experiments showed the direct interaction of in vitro translated p110, p64, and p58 of the essential CBF3 kinetochore protein complex with Cbf1p, a basic region helix-loop-helix zipper protein (bHLHzip) that specifically binds to the CDEI region on the centromere DNA.”
     • “Hpa2 is a member of the Gcn5-related N-acetyltransferase (GNAT) superfamily, a family of enzymes with diverse substrates including histones, other proteins, arylalkylamines and aminoglycosides.”

  4. Related Work
     • Pustejovsky et al. present a solution based on hand-built regular expressions and syntactic information. It achieved 72% recall at 98% precision.
     • Chang et al. use linear regression on a pre-selected set of features. It achieved 83% recall at 80%* precision, and 75% recall at 95% precision.
     • Park and Byrd present a rule-based algorithm for extracting abbreviation definitions from general text.
     • Yoshida et al. present an approach close to ours, which first tries to match characters on word and syllable boundaries.
     * Counting partial matches and abbreviations missing from the “gold-standard”, their algorithm achieved 83% recall at 98% precision.

  5. The Algorithm
     • Much simpler than other approaches.
     • Extracts abbreviation-definition candidates adjacent to parentheses.
     • Finds correct definitions by matching characters in the abbreviation to characters in the definition, starting from the right.
     • The first character in the abbreviation must match a character at the beginning of a word in the definition.
     • To increase precision, a few simple heuristics are applied to eliminate incorrect pairs.
     • Example: Heat shock transcription factor (HSF).
     • The algorithm finds the correct definition, but not the correct alignment: [H]eat shock tran[s]cription [f]actor. Scanning from the right, the S is matched to the “s” inside “transcription” rather than to the first letter of “shock”.
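
The matching step on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the slide does not spell out the heuristics or how many words before the parenthesis are considered, so the short-form check (2 to 10 characters, at least one letter, alphanumeric first character) and the window of min(|A| + 5, 2|A|) words are assumed values. Only the "long form (short form)" pattern is handled.

```python
import re
from typing import List, Optional, Tuple

# Innermost parenthesized spans, e.g. "(HSF)".
PAREN = re.compile(r"\(([^()]+)\)")


def is_plausible_short_form(short_form: str) -> bool:
    """Assumed heuristic filter; the slide only says 'a few simple heuristics'."""
    return (
        2 <= len(short_form) <= 10
        and any(ch.isalpha() for ch in short_form)
        and short_form[0].isalnum()
    )


def find_best_long_form(short_form: str, long_form: str) -> Optional[str]:
    """Match abbreviation characters to the candidate text, right to left."""
    s = len(short_form) - 1
    l = len(long_form) - 1
    while s >= 0:
        c = short_form[s].lower()
        if not c.isalnum():
            s -= 1  # skip punctuation inside the abbreviation
            continue
        # Move left until the character matches. The first abbreviation
        # character must also sit at the beginning of a word.
        while l >= 0 and (
            long_form[l].lower() != c
            or (s == 0 and l > 0 and long_form[l - 1].isalnum())
        ):
            l -= 1
        if l < 0:
            return None  # ran out of text before matching every character
        l -= 1
        s -= 1
    # Start the definition at the beginning of the word holding the last match.
    start = long_form.rfind(" ", 0, l + 1) + 1
    return long_form[start:]


def extract_pairs(sentence: str) -> List[Tuple[str, str]]:
    """Return (short form, long form) pairs found next to parentheses."""
    pairs = []
    for m in PAREN.finditer(sentence):
        short_form = m.group(1).strip()
        if not is_plausible_short_form(short_form):
            continue
        # Assumed search window: a limited number of words before "(".
        words = sentence[: m.start()].split()
        max_words = min(len(short_form) + 5, len(short_form) * 2)
        candidate = " ".join(words[-max_words:])
        long_form = find_best_long_form(short_form, candidate)
        if long_form:
            pairs.append((short_form, long_form))
    return pairs
```

Calling `extract_pairs("Heat shock transcription factor (HSF)")` under these assumptions returns `[("HSF", "Heat shock transcription factor")]`. The returned definition is the expected one even though, as the slide notes, the S is aligned with a letter inside "transcription".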

  6. Results
     • On the “gold-standard” the algorithm achieved 83% recall at 96% precision.*
     • On a larger test collection the results were 90% recall at 95% precision.
     • An alternative algorithm, based on a modification of the Park and Byrd algorithm using decision lists, achieved only slightly better results: 83% recall at 97% precision, and 90% recall at 96% precision.
     • These results show that a very simple algorithm produces results comparable to those of the existing, more complex algorithms.
     * Counting partial matches and abbreviations missing from the “gold-standard”, our algorithm achieved 83% recall at 99% precision.
