Automated Compounding as a means for Maximizing Lexical Coverage. Vincent Vandeghinste Centrum voor Computerlinguïstiek K.U. Leuven. Maximizing Lexical Coverage. Target : Reduction of the number of OOV-words Means : accurate content and organization of the recognizer lexicon
Automated Compounding as a means for Maximizing Lexical Coverage Vincent Vandeghinste Centrum voor Computerlinguïstiek K.U. Leuven
Maximizing Lexical Coverage • Target: Reduction of the number of OOV-words • Means: • accurate content and organization of the recognizer lexicon • taking care of a number of productive word formation processes • Evaluation: • implementation of test tool • test results • Conclusions
Lexicon: Content & Organization • Starting point: CGN-lexicon (570,000 entries) • Reduction to one entry per wordform per POS (300,000 entries) • Removal of compounds (160,000 entries) • Selection of most frequent entries (40,000) => Basic Word List (BWL) • Quasi-Word List (QWL): compounding word parts which do not appear in the BWL
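The reduction steps on this slide can be sketched as a small filter-and-rank routine. This is only an illustration: the tuple layout `(wordform, pos, frequency, is_compound)`, the function name `build_bwl`, and the choice to sum the frequencies of merged duplicates are assumptions, not details given in the slides.

```python
def build_bwl(entries, size):
    """Build a Basic Word List from (wordform, pos, frequency, is_compound) tuples.

    Assumed layout; the slides do not describe the CGN-lexicon record format.
    """
    merged = {}
    for form, pos, freq, is_compound in entries:
        if is_compound:
            continue  # compounds are removed from the BWL
        # one entry per wordform per POS; duplicate frequencies are summed (assumption)
        merged[(form, pos)] = merged.get((form, pos), 0) + freq
    # most frequent first, ties broken alphabetically for a stable result
    ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    return [key for key, _ in ranked[:size]]


# Toy data, invented for illustration only.
toy_entries = [
    ("tafel", "N", 50, False),
    ("tafel", "N", 10, False),
    ("tafelpoot", "N", 5, True),
    ("lopen", "V", 40, False),
    ("de", "DET", 900, False),
]
print(build_bwl(toy_entries, 2))
```

With the toy data, the two retained entries are `("de", "DET")` and `("tafel", "N")`; the compound `tafelpoot` never enters the list.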
Lexicon Accuracy • Careful selection of the words in BWL: • no compounds • frequent words • Organization of the lexicon: maximal applicability of compounding rules through lexicon split into BWL and QWL
Word Formation Processes • Input: number of word parts that can or cannot be compounded • Hybrid approach: Rule-based + Statistical Filters • Output: • compound + morpho-syntactic info + confidence measure • no compounding possible with given word parts
Word Formation Processes: Input • From BWL: full words, that can be part of a compound or can be words by themselves • From QWL: ‘words’ that can only be part of a compound • 2 up to 5 word parts
Word Formation Processes: Rules • Making use of rules for word formation, e.g.: modifier (N) + head (N) => compound (N) • Input from QWL: word part is N and can only be modifier • Input from BWL: word is looked up in CGN; its morpho-syntactic info is used in the rules • Rules use 2 word parts • When input > 2 word parts: rules are applied recursively
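A minimal sketch of binary rules applied recursively is given below. Only the N + N => N rule comes from the slide; the rule-table format, the function names, and the left-to-right bracketing for more than two parts are assumptions, and Dutch linking elements are ignored.

```python
# Only the N + N => N rule is taken from the slides; the table format is an assumption.
RULES = {("N", "N"): "N"}


def compound_pair(mod, head):
    """Apply a binary rule to two (form, pos) pairs; return the compound or None."""
    result_pos = RULES.get((mod[1], head[1]))
    if result_pos is None:
        return None
    return (mod[0] + head[0], result_pos)


def compound(parts):
    """Compound 2 to 5 word parts by recursing left to right (assumed bracketing)."""
    if not 2 <= len(parts) <= 5:
        raise ValueError("expected between 2 and 5 word parts")
    if len(parts) == 2:
        return compound_pair(parts[0], parts[1])
    left = compound(parts[:-1])
    if left is None:
        return None
    return compound_pair(left, parts[-1])


print(compound([("frequentie", "N"), ("tabel", "N")]))
print(compound([("woord", "N"), ("frequentie", "N"), ("tabel", "N")]))
print(compound([("bij", "PREP"), ("tabel", "N")]))
```

The last call returns `None` because the toy rule table has no PREP + N rule; a fuller table would be needed to cover other compound types.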
Word Formation Processes: Statistics • Relative Frequency Threshold Parameter • Confidence Measure of the Compound Probability
Relative Frequency Threshold • Makes use of the relative frequency (RF) of each POS for a word form • Makes use of a threshold value T (0.05% = 0.0005) • If RF > T: POS is used for this wordform • If RF < T: POS is rejected for this wordform • Example: RF(bij(PREP)) = 0.999 > T, RF(bij(N)) = 0.0004 < T, so only bij(PREP) is used
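The filter can be sketched as follows. The threshold of 0.05% is from the slide; the function name `allowed_pos` and the counts for `bij` are hypothetical, chosen so that the relative frequencies roughly match the example.

```python
THRESHOLD = 0.0005  # 0.05 %, the value quoted on the slide


def allowed_pos(pos_counts, threshold=THRESHOLD):
    """Return the POS tags whose relative frequency exceeds the threshold."""
    total = sum(pos_counts.values())
    if total == 0:
        return []
    return sorted(pos for pos, n in pos_counts.items() if n / total > threshold)


# Hypothetical counts: RF(bij, PREP) = 0.9996 and RF(bij, N) = 0.0004.
bij_counts = {"PREP": 9996, "N": 4}
print(allowed_pos(bij_counts))  # ['PREP']
```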
Confidence Measure of Compounding Probability • estimation of: P(comp(w1=mod, w2=head)) / P(comp(w1=*, w2=head)) where: • P(comp(w1=mod, w2=head)) is the probability that two consecutive word parts form a compound rather than being 2 separate words • P(comp(w1=*, w2=head)) is the probability of w2 being a head, with any modifier
Confidence Measure of Compound Probability (2) • If the compound is found in the frequency list, the ratio is estimated like this: [Fr(comp(w1=mod, w2=head))/Fr(comp(w1=*,w2=head))] x (1-Dhead) where: • Fr(comp(w1=mod, w2=head)) is the frequency of the compound that consists of w1 + w2 • Fr(comp(w1=*, w2=head)) is the frequency of the 2nd word part as a head, with any modifier • Dhead is the discount parameter: amount of probability reserved for words not in frequency list
Confidence Measure of Compounding Probability (3) • Discount parameter is estimated: Dhead = #diff(mod | head) / Fr(comp(w1=*, w2=head)) where: • #diff(mod | head) is the number of different modifiers occurring with the given head • Fr(comp(w1=*, w2=head)) is the frequency of the 2nd word part as a head, with any modifier • (1-Dhead) is the amount of probability reserved for words that can be found in the frequency list
Confidence Measure of Compounding Probability (4) • If the compound is not found in the frequency list, the ratio is estimated like this: Dhead x [Fr(comp(w1=mod, w2=*)) / Fr(*)] where: • Fr(comp(w1=mod, w2=*)) is the frequency of the 1st word part as a modifier of any head • Fr(*) is the total frequency of all words in the frequency list (= 79,862,581)
Confidence Measures: Examples • binnen + kijken • binnenkijken occurs in the frequency list • Fr(w1=binnen, w2=kijken) = 10 • Fr(w1=*, w2=kijken) = 2188 • #diff(mod | head=kijken) = 21 • (10 / 2188) x (1 - 21/2188) = 0.0045 • frequentie + tabel • frequentietabel does not occur in the frequency list • Fr(w1=*, w2=tabel) = 141 • #diff(mod | head=tabel) = 17 • Fr(w1=frequentie, w2=*) = 15 • (17 / 141) x (15 / 79,862,581) = 2.26e-8
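The two estimates can be reproduced with a few lines of arithmetic. The formulas and counts are those given on the slides; the function names are illustrative choices.

```python
TOTAL_FREQ = 79_862_581  # Fr(*), total frequency of the frequency list (from the slides)


def discount(n_diff_mods, head_freq):
    """Dhead = #diff(mod | head) / Fr(comp(w1=*, w2=head))."""
    return n_diff_mods / head_freq


def confidence_seen(comp_freq, head_freq, n_diff_mods):
    """Estimate for a compound that occurs in the frequency list."""
    return (comp_freq / head_freq) * (1 - discount(n_diff_mods, head_freq))


def confidence_unseen(mod_freq, head_freq, n_diff_mods, total_freq=TOTAL_FREQ):
    """Estimate for a compound that does not occur in the frequency list."""
    return discount(n_diff_mods, head_freq) * (mod_freq / total_freq)


binnenkijken = confidence_seen(comp_freq=10, head_freq=2188, n_diff_mods=21)
frequentietabel = confidence_unseen(mod_freq=15, head_freq=141, n_diff_mods=17)
print(f"{binnenkijken:.4f}")     # 0.0045
print(f"{frequentietabel:.2e}")  # 2.26e-08
```

The unseen compound receives a score several orders of magnitude lower than the seen one, which is why a single confidence threshold has to be tuned carefully.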
Evaluation • Test System • Test Results
The Test System • Takes a regular text as input • Converts punctuation marks into # • For the test system, a BWL of 35,000 entries was used • Every word is checked in the BWL: • if the word is not present in the BWL, it is split into a modifier (QWL or BWL) and a head (BWL) • no compounding rules are used in the splitting procedure • if no two-part split is found, a split into 3 parts is tried • If a word cannot be found in the BWL and cannot be split, it is classified as an OOV-word
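The splitting procedure can be sketched as a search over split points. The word lists below are toy data, and two details are assumptions rather than statements from the slides: in a three-part split the first two parts may come from the QWL or BWL, and the first split found is returned.

```python
def split_word(word, bwl, qwl):
    """Return the parts of a word, or None if the word is out of vocabulary."""
    if word in bwl:
        return [word]
    modifiers = bwl | qwl
    # Two-part split: modifier from QWL or BWL, head from BWL.
    for i in range(1, len(word)):
        if word[:i] in modifiers and word[i:] in bwl:
            return [word[:i], word[i:]]
    # Three-part split (assumed: non-final parts from QWL or BWL, head from BWL).
    for i in range(1, len(word) - 1):
        if word[:i] not in modifiers:
            continue
        for j in range(i + 1, len(word)):
            if word[i:j] in modifiers and word[j:] in bwl:
                return [word[:i], word[i:j], word[j:]]
    return None  # classified as OOV


# Toy lists, invented for illustration.
toy_bwl = {"frequentie", "tabel", "woord", "park"}
toy_qwl = {"stads"}
for w in ["tabel", "frequentietabel", "stadspark", "woordfrequentietabel", "Vandeghinste"]:
    print(w, split_word(w, toy_bwl, toy_qwl))
```

No compounding rules are consulted here, matching the slide: the split only checks list membership.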
The Test System (2) • For every 2 consecutive word parts, it was tested whether they can be compounded or not • Results are compared with original text • False compounding and false identification of noncompounds can be counted this way • Same was done for every 3 consecutive word parts • A threshold was set on the Confidence Measure: If Confidence Measure < Threshold, compound is rejected
Test Results • 3 test texts were used: • Thuis (dialogue of soap series): 3415 words, 3.08% OOV, 1.47% compounds • Aspe (chapter of a novel): 4589 words, 3.77% OOV, 6.08% compounds • Interview (transcript of spontaneous speech): 4645 words, 0.84% OOV, 2.95% compounds • Most of the OOVs are proper nouns or non-standard Dutch
Test Results (2) • Correct identification of noncompounds and compounds: • dependent on test text • dependent on parameter thresholds • There is a nearly perfect negative correlation (-0.98) between the optimal confidence threshold and the proportion of compounds in the test text
Conclusions • Identifying compoundability can be done with an accuracy of 94.5 to 98.5% • Lexical coverage can be assured with OOVs between 0.8 and 3.8% and a lexicon with a total size of 36,000 entries (BWL+QWL)
Conclusions (2) • Capturing already existing compounds by automated compounding proves to be successful • Capturing newly formed compounds proves to be a lot harder: the accuracy is a lot lower • Automated compounding proves to be a useful means for maximizing lexical coverage