Identifying Abbreviation Definitions in Biomedical Text. Ariel Schwartz, Marti Hearst
The Problem • The volume of biomedical text is growing at a fast rate, and new abbreviations are introduced frequently. • Manually maintained abbreviation dictionaries are out of date. • The goal is a simple, fast and accurate algorithm for identifying abbreviations and their definitions in biomedical text. • We use this algorithm as one of many preprocessing steps applied to biomedical texts, so that meaningful information can be extracted from them.
Abbreviation Examples • “Heat-shock protein 40 (Hsp40) enables Hsp70 to play critical roles in a number of cellular processes, such as protein folding, assembly, degradation and translocation in vivo.” • “Glutathione S-transferase pull-down experiments showed the direct interaction of in vitro translated p110, p64, and p58 of the essential CBF3 kinetochore protein complex with Cbf1p, a basic region helix-loop-helix zipper protein (bHLHzip) that specifically binds to the CDEI region on the centromere DNA.” • “Hpa2 is a member of the Gcn5-related N-acetyltransferase (GNAT) superfamily, a family of enzymes with diverse substrates including histones, other proteins, arylalkylamines and aminoglycosides.”
Related Work • Pustejovsky et al. present a solution based on hand-built regular expressions and syntactic information. It achieved 72% recall at 98% precision. • Chang et al. use linear regression on a pre-selected set of features. They achieved 83% recall at 80%* precision, and 75% recall at 95% precision. • Park and Byrd present a rule-based algorithm for extracting abbreviation definitions from general text. • Yoshida et al. present an approach close to ours, which first tries to match characters on word and syllable boundaries. * Counting partial matches and abbreviations missing from the “gold-standard”, their algorithm achieved 83% recall at 98% precision.
The Algorithm • Much simpler than other approaches. • Extracts abbreviation-definition candidates adjacent to parentheses. • Finds correct definitions by matching characters in the abbreviation to characters in the definition, starting from the right. • The first character in the abbreviation must match a character at the beginning of a word in the definition. • To increase precision, a few simple heuristics are applied to eliminate incorrect pairs. • Example: Heat shock transcription factor (HSF). • The algorithm finds the correct definition, but not the correct alignment: because matching proceeds from the right, the S is aligned with the “s” inside “transcription” rather than with the “s” of “shock”: Heat shock tranScription Factor
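The right-to-left matching described on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `match_definition` and `extract_pairs` and the `max_words` window are hypothetical, only the "definition (abbreviation)" order is handled, and the precision heuristics mentioned on the slide are not specified there, so they are omitted.

```python
import re
from typing import List, Optional, Tuple


def match_definition(short_form: str, candidate: str) -> Optional[Tuple[str, List[int]]]:
    """Match short_form against candidate text, scanning both from the right.

    Returns (definition, matched_positions) or None if no match exists.
    Positions are indices into `candidate`.
    """
    positions = []
    l = len(candidate) - 1
    for s in range(len(short_form) - 1, -1, -1):
        ch = short_form[s].lower()
        if not ch.isalnum():
            continue  # punctuation in the abbreviation is ignored
        # Move left until the character matches. For the first character of
        # the abbreviation (s == 0), also require a word-initial position.
        while l >= 0 and (
            candidate[l].lower() != ch
            or (s == 0 and l > 0 and candidate[l - 1].isalnum())
        ):
            l -= 1
        if l < 0:
            return None
        positions.append(l)
        l -= 1
    # The definition starts at the word containing the leftmost match.
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:], sorted(positions)


def extract_pairs(sentence: str, max_words: int = 10) -> List[Tuple[str, str]]:
    """Find "definition (abbreviation)" pairs in one sentence.

    max_words is an assumed cap on how many preceding words are searched.
    """
    pairs = []
    for m in re.finditer(r"\(([^()]+)\)", sentence):
        short_form = m.group(1).strip()
        words = sentence[: m.start()].split()
        candidate = " ".join(words[-max_words:])
        matched = match_definition(short_form, candidate)
        if matched is not None:
            pairs.append((short_form, matched[0]))
    return pairs


print(match_definition("HSF", "Heat shock transcription factor"))
print(extract_pairs("Heat-shock protein 40 (Hsp40) enables Hsp70 to play critical roles."))
```

For the HSF example, the matched positions fall on the "H" of "Heat", the "s" inside "transcription", and the "f" of "factor", which reproduces the misalignment noted on the slide while still returning the correct definition.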
Results • On the “gold-standard” the algorithm achieved 83% recall at 96% precision.* • On a larger test collection the results were 90% recall at 95% precision. • An alternative algorithm, based on a modification of the Park and Byrd algorithm using decision lists, achieved only slightly better results: 83% recall at 97% precision, and 90% recall at 96% precision. • These results show that a very simple algorithm produces results comparable to those of the existing, more complex algorithms. * Counting partial matches and abbreviations missing from the “gold-standard”, our algorithm achieved 83% recall at 99% precision.