A Discriminative Alignment Model for Abbreviation Recognition
Naoaki Okazaki 1, Sophia Ananiadou 2, 3, Jun'ichi Tsujii 1, 2, 3
1 The University of Tokyo  2 The University of Manchester  3 The National Centre for Text Mining
Abbreviation recognition (AR)
• To extract abbreviations and their expanded forms appearing in actual text
  The PCR-RFLP technique was applied to analyze the distribution of estrogen receptor (ESR) gene, follicle-stimulating hormone beta subunit (FSH beta) gene and prolactin receptor (PRLR) gene in three lines of Jinhua pigs. (PMID: 14986424)
Abbreviation recognition (AR): terminology
• In the example above, the parenthetical pairs are definitions; 'ESR', 'FSH beta', and 'PRLR' are abbreviations (short forms); 'estrogen receptor' and the other spelled-out terms are expanded forms (long forms; full forms); the slide also marks term variation among the expanded forms
AR for disambiguating abbreviations
• Sense inventories (abbreviation dictionaries)
• Training corpora for disambiguation (context information of expanded forms)
• Local definitions
AR for disambiguating abbreviations
• Sense inventories (abbreviation dictionaries)
  - What can 'CT' stand for? See Acromine: http://www.nactem.ac.uk/software/acromine/
AR for disambiguating abbreviations
• Training corpora for disambiguation (context information of expanded forms)
  - Sentences (contexts) in which CT is defined serve as training data:
    ... evaluated using preoperative computed tomography (CT) scan, ...
    ... by oral administration with the adjuvant cholera toxin (CT), ...
  - A classifier trained on these contexts can then resolve a new occurrence:
    Biopsies from bone metastatic lesions were performed under CT scan, ... → CT = computed tomography
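Below is a minimal sketch, not from the paper, of how such definition contexts could train a simple disambiguator. The Naive Bayes formulation and all function names are illustrative assumptions.

```python
from collections import Counter, defaultdict
import math

def train(contexts):
    """contexts: list of (sense, tokens) pairs taken from sentences
    in which the abbreviation is locally defined."""
    counts = defaultdict(Counter)   # sense -> token counts
    priors = Counter()              # sense -> number of contexts
    for sense, tokens in contexts:
        priors[sense] += 1
        counts[sense].update(tokens)
    return priors, counts

def predict(priors, counts, tokens):
    """Pick the sense with the highest add-one-smoothed NB score."""
    vocab = {t for c in counts.values() for t in c}
    best, best_score = None, float("-inf")
    for sense in priors:
        total = sum(counts[sense].values())
        score = math.log(priors[sense])
        for t in tokens:
            score += math.log((counts[sense][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = sense, score
    return best

# Contexts in which 'CT' was defined (from the slide's examples):
data = [("computed tomography", "evaluated using preoperative scan".split()),
        ("cholera toxin", "oral administration with the adjuvant".split())]
priors, counts = train(data)
print(predict(priors, counts, "bone metastatic lesions under scan".split()))
# -> 'computed tomography'
```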
AR for disambiguating abbreviations
• Local definitions (one-sense-per-discourse assumption)
  Mice can be sensitized to food proteins by oral administration with the adjuvant cholera toxin (CT), ... BALB/c mice were fed with CT or PBS. The impact of CT on DC subsets ...
AR for disambiguating abbreviations
• Sense inventories, training corpora for disambiguation, and local definitions all depend on AR
• AR thus plays a key role in managing abbreviations in text
Outline of this presentation
• Introduction (done)
• Methodologies
  - Common (steps 1 and 2): abbreviation candidate; region for definitions
  - Previous work: unsolved problems
  - This study: abbreviation alignment; computing features; maximum entropy modeling
• Experiments
• Conclusion
Step 0: Sample text
• The task: extract an abbreviation definition from this text; extract nothing if no abbreviation definition is found
  We investigate the effect of thyroid transcription factor 1 (TTF-1) in human C cells ...
  → TTF-1: thyroid transcription factor 1
Step 1: Abbreviation candidates in parentheses
• Parenthetical expressions as clues for abbreviations
• Requirements for abbreviation candidates (Schwartz and Hearst, 03), sketched in code below:
  - the inner expression consists of two words at most
  - the length is between two and ten characters
  - the expression contains at least one alphabetic letter
  - the first character is alphanumeric
• Abbreviation candidate: y = 'TTF-1'
  We investigate the effect of thyroid transcription factor 1 (TTF-1) in human C cells ...
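A small sketch of these candidate requirements as code; this is my reading of Schwartz & Hearst's conditions, and the function name is illustrative.

```python
def is_abbreviation_candidate(inner: str) -> bool:
    """Schwartz & Hearst (2003) conditions for a parenthesized string."""
    return (len(inner.split()) <= 2                # at most two words
            and 2 <= len(inner) <= 10              # 2-10 characters
            and any(c.isalpha() for c in inner)    # at least one letter
            and inner[0].isalnum())                # first char alphanumeric

assert is_abbreviation_candidate("TTF-1")
assert not is_abbreviation_candidate("p < 0.05")   # three words: rejected
```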
Step 2: Region for extracting abbreviation definitions
• Heuristics for regions in which to find expanded forms (Schwartz and Hearst, 03)
  - min(m + 5, 2m) words before the abbreviation, where m is the number of alphanumeric letters in the abbreviation
  - Take 8 words before the parentheses (m = 4):
    We investigate the effect of thyroid transcription factor 1 (TTF-1) in human C cells ...
• The remaining task is to extract a true expanded form (if any) in this region
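The window heuristic is easy to state in code; a sketch under the assumption that tokenization is plain whitespace splitting.

```python
def search_window(abbrev: str, words_before_paren: list) -> list:
    """Take min(m + 5, 2m) words before the parentheses, where m is
    the number of alphanumeric characters in the abbreviation."""
    m = sum(c.isalnum() for c in abbrev)
    n = min(m + 5, 2 * m)
    return words_before_paren[-n:]

words = "We investigate the effect of thyroid transcription factor 1".split()
print(search_window("TTF-1", words))   # last 8 words (m = 4)
```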
Previous studies: Finding expanded forms
• Rule-based
  - Deterministic algorithm (Schwartz & Hearst, 03): maps every alphanumeric letter of the abbreviation to a letter of the expanded form, scanning both the abbreviation and the candidate right to left from their ends
  We investigate the effect of thyroid transcription factor 1 (TTF-1) in human C cells ...
  → TTF-1: transcription factor 1
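A sketch of the right-to-left matching, assuming (as in Schwartz & Hearst) that only the first abbreviation character must start a word. Note that it reproduces the output shown above: the greedy scan starts the match at 'transcription' rather than 'thyroid'.

```python
def schwartz_hearst(short: str, candidate: str):
    """Right-to-left matching: each alphanumeric character of the
    abbreviation must appear in the candidate, in order; the first
    character must begin a word. Returns the expanded form or None."""
    i, j = len(short) - 1, len(candidate) - 1
    while i >= 0:
        if not short[i].isalnum():        # skip '-', '.', etc.
            i -= 1
            continue
        c = short[i].lower()
        # scan left for a matching character; for the first abbreviation
        # character, additionally require a word boundary before it
        while j >= 0 and (candidate[j].lower() != c or
                          (i == 0 and j > 0 and candidate[j - 1].isalnum())):
            j -= 1
        if j < 0:
            return None                   # no expanded form found
        i -= 1
        j -= 1
    return candidate[j + 1:]

print(schwartz_hearst("TTF-1",
      "We investigate the effect of thyroid transcription factor 1"))
# -> 'transcription factor 1'
```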
Previous studies: Finding expanded forms
• Rule-based
  - Deterministic algorithm (Schwartz & Hearst, 03)
  - Four scoring rules for multiple candidates (Adar, 04):
    +1 for every abbreviation letter taken from the head of a word
    -1 for every extra word between the definition and the parentheses
    +1 if the definition is immediately followed by the parentheses
    -1 for every extra word in the definition
  transcription factor 1 (TTF-1): +4
  thyroid transcription factor 1 (TTF-1): +5
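A hedged sketch of the four rules, scoring a given letter alignment rather than raw strings; the exact rule definitions in SaRAD differ in detail, but this simplified reading reproduces the +4/+5 example.

```python
def adar_score(alignment, n_definition_words, extra_words_after=0):
    """Simplified reading of Adar's (2004) four scoring heuristics.
    alignment: one (word_index, char_index) pair per abbreviation letter."""
    score = sum(1 for _, ci in alignment if ci == 0)   # +1 per head letter
    score -= extra_words_after                          # -1 per gap word
    if extra_words_after == 0:
        score += 1                                      # +1 if adjacent to '('
    used = {wi for wi, _ in alignment}
    score -= n_definition_words - len(used)             # -1 per unused word
    return score

# 'transcription factor 1': T and t both from "transcription" -> 3 head letters
print(adar_score([(0, 0), (0, 9), (1, 0), (2, 0)], 3))   # -> 4
# 'thyroid transcription factor 1': all four letters at word heads
print(adar_score([(0, 0), (1, 0), (2, 0), (3, 0)], 4))   # -> 5
```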
Previous studies: Finding expanded forms
• Rule-based
  - Deterministic algorithm (Schwartz & Hearst, 03)
  - Four scoring rules for multiple candidates (Adar, 04)
  - Detailed rules (Ao & Takagi, 05): two pages long (!) in their paper
Previous studies: Finding expanded forms
• Rule-based
  - Deterministic algorithm (Schwartz & Hearst, 03)
  - Four scoring rules for multiple candidates (Adar, 04)
  - Detailed rules (Ao & Takagi, 05)
• Machine-learning based (Nadeau & Turney, 05; Chang & Schütze, 06)
  - Aimed at obtaining an optimal set of rules through training
  - Uses 10-20 features that roughly correspond to the rules proposed in earlier work, e.g.:
    "# of abbreviation letters matching the first letter of a word"
    "# of abbreviation letters that are capitalized in the definition"
Problems in previous studies
• Hard to tweak the extraction rules by hand
  - of blood lactate accumulation (OBLA) vs. onset of blood lactate accumulation (OBLA)
• Hard to handle non-definitions (negatives)
  - of postoperative AF in patients submitted to CABG without cardiopulmonary bypass (off-pump)
• Hard to recognize shuffled abbreviations
  - receptor of estrogen (ER)
• No breakthrough was reported from applying machine learning
  - Previous studies used only a few features, reproduced from the rule-based methods
This study
• Predict the origins of abbreviation letters (alignment)
• Discriminative training of the abbreviation alignment model
  - A large number of features that directly express the events where letters in an expanded form produce, or do not produce, abbreviation letters
  - A corpus annotated with abbreviation alignments
[Overview figure: the abbreviation candidate and surrounding expression given by steps 1 and 2 → possible alignments → maximum entropy modeling]
Step 3: C(x, y): Alignment candidates
• Mark the positions of letters in x that also appear in the abbreviation
• Assign exactly one letter of x to each alphanumeric letter of y
• Always include an alignment that assigns no letters at all (a negative alignment, in case no definition is appropriate)
• Letters (or letter sequences) of the abbreviation may be reordered up to d times
  - Distortion = 1 (swap 'thyroid' and 'transcription')
(An enumeration sketch follows.)
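A sketch of candidate enumeration under these constraints, treating a 'distortion' as an order inversion between mapped positions; this is an illustrative reading, not necessarily the authors' exact definition of d.

```python
from itertools import combinations

def alignment_candidates(x: str, y: str, d: int = 0):
    """Enumerate candidate alignments: each alphanumeric letter of the
    abbreviation y is mapped to one matching position in x. With d = 0
    the positions must be in order; up to d inversions are allowed
    otherwise. The empty (negative) alignment is always included."""
    letters = [c.lower() for c in y if c.isalnum()]
    positions = [[i for i, xc in enumerate(x) if xc.lower() == c]
                 for c in letters]

    def rec(k, chosen):
        if k == len(letters):
            inversions = sum(1 for a, b in combinations(chosen, 2) if a > b)
            if inversions <= d:
                yield tuple(chosen)
            return
        for p in positions[k]:
            if p not in chosen:
                yield from rec(k + 1, chosen + [p])

    yield None                      # the 'no definition' alignment
    yield from rec(0, [])

x = "thyroid transcription factor 1"
for a in alignment_candidates(x, "TTF-1", d=1):
    print(a)   # includes the swapped 'thyroid'/'transcription' variant
```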
Step 4: Abbreviation alignment
• Letters 't' and 'f' at these points did not produce abbreviation letters
• Prefer the non-null mapping that is closest to the abbreviation letter
Atomic feature functions
• Atomic functions for x: letter type (x_ctype), letter position (x_position), lower-cased letter (x_char), lower-cased word (x_word), part-of-speech code (x_pos)
• Atomic functions for y: letter type (y_ctype), letter position (y_position), lower-cased letter (y_char)
• Atomic function for a: a_state (SKIP, MATCH, ABBR)
• Atomic functions for adjacent x: distance in letters (x_diff), distance in words (x_diff_wd)
• Atomic function for adjacent y: distance in letters (y_diff)
• Offset parameter δ
• Features are expressed by combinations of atomic functions; refer to Table 2 for the complete list of combination rules (a construction sketch follows)
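To make the feature construction concrete, here is a sketch that derives a few atomic observations and combines them into feature strings like those on the next slide; the combination rules shown are a small illustrative subset, not the full Table 2.

```python
def atomic_functions(x, i, y, j, state):
    """Atomic observations at one alignment point (illustrative subset)."""
    def ctype(c):
        return "U" if c.isupper() else "L" if c.islower() else "S"
    def pos(s, k):                  # H = head of a word, I = inside
        return "H" if k == 0 or s[k - 1] == " " else "I"
    return {"x_ctype": ctype(x[i]), "x_position": pos(x, i),
            "x_char": x[i].lower(),
            "y_ctype": ctype(y[j]), "y_position": pos(y, j),
            "y_char": y[j].lower(),
            "a_state": state}       # SKIP, MATCH, or ABBR

def features(atoms):
    """Combine atomic functions into feature strings."""
    a = atoms["a_state"]
    return [f"x_ctype0={atoms['x_ctype']}/{a}/",
            f"y_ctype0={atoms['y_ctype']}/{a}/",
            f"x_position0={atoms['x_position']}/{a}/",
            f"x_ctype0={atoms['x_ctype']}/y_ctype0={atoms['y_ctype']}/{a}/",
            f"x_position0={atoms['x_position']}/y_position0={atoms['y_position']}/{a}/"]

x, y = "thyroid transcription factor 1", "TTF-1"
print(features(atomic_functions(x, 0, y, 0, "MATCH")))
# -> ['x_ctype0=L/MATCH/', 'y_ctype0=U/MATCH/', 'x_position0=H/MATCH/', ...]
```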
Step 5: Computing features (each feature string is paired with its learned weight)
Bigram features:
  MATCH/MATCH/ 1.4
  x_diff_wd=1/MATCH/MATCH/ 3.3
  y_diff=1/MATCH/MATCH/ 0.8
  … 9.1
Unigram features:
  y_ctype0=U/MATCH/ 0.7
  y_position0=H/MATCH/ 0.9
  y_ctype0=U;y_position0=H/MATCH/ 1.6
  y_char0=t/MATCH/ -0.1
  … 5.5
  x_ctype0=L/y_ctype0=U/MATCH/ 0.5
  x_position0=H/y_position0=H/MATCH/ 0.4
  x_ctype0=L/y_position0=H/MATCH/ 0.3
  x_position0=H/y_ctype0=U/MATCH/ 1.1
  …
  x_ctype0=L/MATCH/ 0.1
  x_position0=H/MATCH/ 1.9
  x_ctype0=L;x_position0=H/MATCH/ 1.2
  x_ctype1=L/MATCH/ 0.3
  x_position1=I/MATCH/ 0.2
Step 6: Probabilities of alignments
• Sum the feature weights on each alignment (e.g., scores of 95.5, 83.2, and 37.5 for three candidate alignments)
• Take exponents of these values and normalize as probabilities: 0.99, 4.55e-6, 6.47e-26
[Figure: per-letter feature weights along each candidate alignment, summed into the alignment scores]
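The exponentiate-and-normalize step is an ordinary softmax; a short sketch that reproduces the probabilities quoted above from the three alignment scores.

```python
import math

def alignment_probabilities(scores):
    """Exponentiate the summed feature weights of each candidate
    alignment and normalize (the slide's step 6)."""
    m = max(scores)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Three candidate alignments with summed weights 95.5, 83.2, 37.5:
print(alignment_probabilities([95.5, 83.2, 37.5]))
# -> approx [0.99..., 4.55e-06, 6.47e-26]
```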
Maximum entropy modeling
• Conditional probability modeled by MaxEnt:
  P(a | x, y) = exp(w · f(a, x, y)) / Σ_{a′ ∈ C(x, y)} exp(w · f(a′, x, y))
  - w: vector of feature weights
  - f(a, x, y): vector of the numbers of occurrences of features on the alignment a
  - w · f(a, x, y): sum of feature weights on the alignment a
  - C(x, y): possible alignments
• Parameter estimation (training): maximize the log-likelihood of the probability distribution with maximum a posteriori (MAP) estimation
  - L1 regularization: solved by the OWL-QN method (Andrew, 07)
  - L2 regularization: solved by the L-BFGS method (Nocedal, 80)
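A minimal training sketch for this model, assuming each instance supplies its candidates as feature-count vectors; plain gradient ascent with an L2 (Gaussian) prior stands in here for the L-BFGS/OWL-QN solvers used in the paper, and all names are illustrative.

```python
import numpy as np

def train_maxent(instances, n_features, sigma=3.0, lr=0.1, epochs=100):
    """MAP training sketch for P(a|x,y) = exp(w.f(a)) / Z with an L2 prior.
    instances: list of (feature_vectors, gold_index), where
    feature_vectors[k] counts the features firing on candidate k."""
    w = np.zeros(n_features)
    for _ in range(epochs):
        grad = -w / sigma ** 2                  # gradient of the L2 prior
        for fvecs, gold in instances:
            F = np.asarray(fvecs, dtype=float)  # (n_candidates, n_features)
            s = F @ w                           # alignment scores
            p = np.exp(s - s.max())
            p /= p.sum()                        # candidate probabilities
            grad += F[gold] - p @ F             # observed - expected counts
        w += lr * grad                          # gradient ascent step
    return w

# Toy instance: two candidate alignments over three features.
inst = [([[1, 0, 1], [0, 1, 1]], 0)]
print(train_maxent(inst, 3))   # weight for feature 0 rises, feature 1 falls
```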
Training corpus
• Corpus for training/evaluation
  - Selected 1,000 abstracts randomly from MEDLINE
  - Annotated 1,420 parenthetical expressions manually, obtaining 864 positive instances (aligned abbreviation definitions) and 556 negative instances (non-definitions)
• Positive example: Measurement of hepatitis C virus (HCV) RNA may be beneficial in managing the treatment of …
• Negative example: The mean reduction of UCCA at month 48 was 5.7% for patients initially on placebo who received treatment at 24 months (RRMS) or …
Experiments
• Experimental settings (parameters)
  - L1 regularization or L2 regularization (σ = 3)
  - No distortion (d = 0) or distortion (d = 1)
  - Average number of alignment candidates per instance: 8.46 (d = 0) and 69.1 (d = 1)
  - Total number of features generated (d = 0): 850,009
• Baseline systems
  - Schwartz & Hearst (03), SaRAD (Adar, 04), ALICE (Ao, 05)
  - Chang & Schütze (06), Nadeau & Turney (05)
• Test data
  - Our abbreviation alignment corpus (10-fold cross validation; sketched below)
  - Medstract corpus: our method is trained on our corpus and tested on this
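For completeness, a sketch of a 10-fold split; the authors' exact partitioning is not specified, so the shuffling and fold assignment here are assumptions.

```python
import random

def ten_fold_splits(instances, seed=0):
    """Yield (train, test) pairs for 10-fold cross validation."""
    idx = list(range(len(instances)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::10] for i in range(10)]     # ten disjoint folds
    for k in range(10):
        test = set(folds[k])
        train = [instances[i] for i in idx if i not in test]
        yield train, [instances[i] for i in folds[k]]

for train, test in ten_fold_splits(list(range(1420))):
    print(len(train), len(test))                # 1278/142 per fold
```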
Performance on our corpus
[Results table: rule-based systems (simple → complex: S&H, SaRAD, ALICE), machine-learning systems (C&S, N&T), and the proposed method (no distortion / distortion)]
• The proposed method achieved the best F1 score of all
• Including distorted abbreviations (d = 1) gained the highest recall and F1
• Baseline systems with refined heuristics (SaRAD and ALICE) could not outperform the simplest system (S&H)
• Previous machine-learning approaches (C&S and N&T) were roughly comparable to the rule-based methods
• L1 regularization performed better than L2, probably because the number of features is far larger than the number of instances
Performance on the Medstract corpus
[Results table: rule-based systems (simple → complex), machine-learning systems, and the proposed method]
• The proposed method was trained on our corpus and applied to the Medstract corpus
• It still outperformed the baseline systems
• ALICE delivered much better results than S&H; its rules may have been tuned for this corpus
Alignment examples (1)
• Shuffled abbreviations were successfully recognized
Alignment examples (2)
• There are some confusing cases; the proposed method failed to choose the third alignment
Top seven features with high weights
1. Associate a head letter in a definition with an uppercase head letter of the abbreviation
2. Produce two abbreviation letters from two consecutive letters in the definition
3. Do not produce an abbreviation letter from a lowercase letter whose preceding letter is also lowercase
4. Produce two abbreviation letters from two lowercase letters in the same word
Conclusion
• Abbreviation recognition was successfully formalized as a sequential alignment problem
  - Showed remarkable improvements over previous methods
  - Obtained fine-grained features that express the events wherein an expanded form produces an abbreviation letter
• Future work
  - Handle different patterns (e.g., 'aka', 'abbreviated as')
  - Combine with the statistical approach (Okazaki, 06): construct a comprehensive abbreviation dictionary based on the n-best solutions and statistics of occurrences
  - Train the alignment model from a non-aligned corpus, inducing abbreviation alignments simultaneously
More and more abbreviations produced
• SaRAD (Adar, 04) extracted 6,574,953 abbreviation definitions from the whole of the MEDLINE database released in 2006