Automated Compounding as a means for Maximizing Lexical Coverage

Vincent Vandeghinste, Centrum voor Computerlinguïstiek, K.U. Leuven

Maximizing Lexical Coverage
• Target: reduction of the number of out-of-vocabulary (OOV) words
• Means:
  • accurate content and organization of the recognizer lexicon
  • handling a number of productive word formation processes
• Evaluation:
  • implementation of a test tool
  • test results
• Conclusions

Lexicon: Content & Organization
• Starting point: CGN lexicon (570,000 entries)
• Reduction to one entry per word form per POS (300,000 entries)
• Removal of compounds (160,000 entries)
• Selection of the most frequent entries (40,000) => Basic Word List (BWL)
• Quasi-Word List (QWL): word parts used in compounding that do not appear in the BWL

Lexicon Accuracy
• Careful selection of the words in the BWL:
  • no compounds
  • frequent words
• Organization of the lexicon: maximal applicability of the compounding rules by splitting the lexicon into BWL and QWL

Word Formation Processes
• Input: a number of word parts that may or may not form a compound
• Hybrid approach: rule-based + statistical filters
• Output:
  • compound + morpho-syntactic info + confidence measure, or
  • no compounding possible with the given word parts

Word Formation Processes: Input
• From BWL: full words, which can be part of a compound or can be words by themselves
• From QWL: 'words' that can only be part of a compound
• 2 up to 5 word parts

Word Formation Processes: Rules
• Rules for word formation are used, e.g.: modifier (N) + head (N) => compound (N)
• Input from QWL: the word part is N and can only be a modifier
• Input from BWL: the word is looked up in CGN, and its morpho-syntactic info is used in the rules
• Each rule combines 2 word parts
• When the input has more than 2 word parts, the rules are applied recursively
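The recursive use of two-part rules can be illustrated with a minimal sketch. It is not the system's actual implementation: the rule table is hypothetical (only N + N => N is taken from the slide), each word part is assumed to have a single POS tag, and the structure is assumed to be left-branching, which the slides do not state.

```python
from typing import Optional

# Hypothetical rule table: (modifier POS, head POS) -> compound POS.
# Only the N + N => N rule comes from the slide; the others are illustrative.
RULES = {
    ("N", "N"): "N",
    ("ADJ", "N"): "N",
    ("PREP", "V"): "V",
}


def compound_pos(pos_tags: list[str]) -> Optional[str]:
    """Return the POS of the compound built from the parts, or None.

    Assumption: left-branching structure. The first two parts are combined
    and the result then acts as the modifier of the next part.
    """
    if len(pos_tags) < 2:
        return None  # a single part cannot form a compound
    result = RULES.get((pos_tags[0], pos_tags[1]))
    if result is None:
        return None  # no rule licenses this modifier/head pair
    if len(pos_tags) == 2:
        return result
    # Recursive step: treat the compound built so far as one modifier.
    return compound_pos([result] + pos_tags[2:])
```

With this table, `compound_pos(["N", "N", "N"])` first combines the two leading nouns and then combines the result with the third noun, while `compound_pos(["V", "PREP"])` returns `None` because no rule matches.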

Word Formation Processes: Statistics
• Relative Frequency Threshold parameter
• Confidence Measure of the Compounding Probability

Relative Frequency Threshold
• Uses the relative frequency (RF) of each POS for a word form
• Uses a threshold value T (0.05% = 0.0005)
• If RF > T: the POS is used for this word form
• If RF < T: the POS is rejected for this word form
• Example: RF(bij(PREP)) = 0.999 > T and RF(bij(N)) = 0.0004 < T, so only bij(PREP) is used
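A minimal sketch of this filter, assuming the relative frequencies per POS are already available as a dictionary. The function name and the data layout are illustrative, and the slide does not say what happens when RF equals the threshold exactly; here that case is rejected.

```python
THRESHOLD = 0.0005  # 0.05 %, the value given on the slide


def accepted_pos(rel_freqs: dict[str, float],
                 threshold: float = THRESHOLD) -> list[str]:
    """Keep only the POS tags whose relative frequency exceeds the threshold.

    Assumption: RF == threshold counts as rejected (not specified in the slides).
    """
    return [pos for pos, rf in rel_freqs.items() if rf > threshold]


# The 'bij' example from the slide: only the preposition reading survives.
bij = {"PREP": 0.999, "N": 0.0004}
print(accepted_pos(bij))
```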

Confidence Measure of Compounding Probability
• Estimation of:
  P(comp(w1=mod, w2=head)) / P(comp(w1=*, w2=head))
  where:
  • P(comp(w1=mod, w2=head)) is the probability that two consecutive word parts form a compound rather than being 2 separate words
  • P(comp(w1=*, w2=head)) is the probability of w2 being a head, with any modifier

Confidence Measure of Compounding Probability (2)
• If the compound is found in the frequency list, the ratio is estimated as:
  [Fr(comp(w1=mod, w2=head)) / Fr(comp(w1=*, w2=head))] x (1 - Dhead)
  where:
  • Fr(comp(w1=mod, w2=head)) is the frequency of the compound that consists of w1 + w2
  • Fr(comp(w1=*, w2=head)) is the frequency of the 2nd word part as a head, with any modifier
  • Dhead is the discount parameter: the amount of probability reserved for words not in the frequency list

Confidence Measure of Compounding Probability (3)
• The discount parameter is estimated as:
  Dhead = #diff(mod | head) / Fr(comp(w1=*, w2=head))
  where:
  • #diff(mod | head) is the number of different modifiers occurring with the given head
  • Fr(comp(w1=*, w2=head)) is the frequency of the 2nd word part as a head, with any modifier
  • (1 - Dhead) is the amount of probability reserved for words that can be found in the frequency list

Confidence Measure of Compounding Probability (4)
• If the compound is not found in the frequency list, the ratio is estimated as:
  Dhead x [Fr(comp(w1=mod, w2=*)) / Fr(*)]
  where:
  • Fr(comp(w1=mod, w2=*)) is the frequency of the 1st word part as a modifier of any head
  • Fr(*) is the total frequency of all words in the frequency list (= 79,862,581)

Confidence Measures: Examples
• binnen + kijken
  • binnenkijken occurs in the frequency list
  • Fr(w1=binnen, w2=kijken) = 10
  • Fr(w1=*, w2=kijken) = 2188
  • #diff(mod | head=kijken) = 21
  • (10 / 2188) x (1 - 21/2188) = 0.0045
• frequentie + tabel
  • frequentietabel does not occur in the frequency list
  • Fr(w1=*, w2=tabel) = 141
  • #diff(mod | head=tabel) = 17
  • Fr(w1=frequentie, w2=*) = 15
  • (17 / 141) x (15 / 79,862,581) = 2.26e-8
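The two estimates and the discount parameter can be written out as a short sketch that reproduces both worked examples. The function and parameter names are illustrative and not taken from the original system; the head frequency is assumed to be greater than zero.

```python
TOTAL_FREQ = 79_862_581  # Fr(*): total frequency of all words in the list


def discount(n_diff_modifiers: int, head_freq: int) -> float:
    """Dhead = #diff(mod | head) / Fr(comp(w1=*, w2=head)).

    Assumes head_freq > 0, i.e. the head was seen in at least one compound.
    """
    return n_diff_modifiers / head_freq


def confidence_seen(compound_freq: int, head_freq: int,
                    n_diff_modifiers: int) -> float:
    """Estimate for a compound that occurs in the frequency list."""
    d_head = discount(n_diff_modifiers, head_freq)
    return (compound_freq / head_freq) * (1 - d_head)


def confidence_unseen(modifier_freq: int, head_freq: int,
                      n_diff_modifiers: int,
                      total_freq: int = TOTAL_FREQ) -> float:
    """Estimate for a compound that does not occur in the frequency list."""
    d_head = discount(n_diff_modifiers, head_freq)
    return d_head * (modifier_freq / total_freq)


# binnen + kijken (seen): roughly 0.0045
print(confidence_seen(compound_freq=10, head_freq=2188, n_diff_modifiers=21))
# frequentie + tabel (unseen): roughly 2.26e-8
print(confidence_unseen(modifier_freq=15, head_freq=141, n_diff_modifiers=17))
```

Keeping the discount in its own function makes it visible that the same Dhead value splits the probability mass between the seen case (1 - Dhead) and the unseen case (Dhead).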

Evaluation
• Test System
• Test Results

The Test System
• Takes a regular text as input
• Converts punctuation marks into #
• For the test system, a BWL of 35,000 entries was used
• Every word is looked up in the BWL:
  • if the word is not present in the BWL, it is split into a modifier (QWL or BWL) and a head (BWL)
  • no compounding rules are used in the splitting procedure
  • if no two-part split is found, a split into 3 parts is tried
• If a word cannot be found in the BWL and cannot be split, it is classified as an OOV word
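The splitting procedure can be sketched as follows. This is an illustration under assumptions, not the original test tool: it returns the first (leftmost) split it finds, allows the middle part of a three-part split to come from either list, and ignores linking elements, none of which the slides specify.

```python
from typing import Optional


def split_word(word: str, bwl: set[str], qwl: set[str]) -> Optional[list[str]]:
    """Return the word parts of `word`, or None if it is an OOV word."""
    if word in bwl:
        return [word]
    modifiers = bwl | qwl  # a modifier may come from either list
    # Two parts: modifier (QWL or BWL) + head (BWL only).
    for i in range(1, len(word)):
        if word[:i] in modifiers and word[i:] in bwl:
            return [word[:i], word[i:]]
    # Three parts: tried only when no two-part split exists.
    for i in range(1, len(word) - 1):
        if word[:i] not in modifiers:
            continue
        for j in range(i + 1, len(word)):
            if word[i:j] in modifiers and word[j:] in bwl:
                return [word[:i], word[i:j], word[j:]]
    return None  # not in BWL and no split found: OOV


# Tiny hypothetical word lists, for illustration only.
bwl = {"binnen", "kijken", "frequentie", "tabel"}
qwl = {"school"}
print(split_word("frequentietabel", bwl, qwl))
```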

The Test System (2)
• For every 2 consecutive word parts, it is tested whether they can be compounded or not
• The results are compared with the original text
• False compounding and false identification of noncompounds can be counted this way
• The same is done for every 3 consecutive word parts
• A threshold is set on the Confidence Measure: if Confidence Measure < Threshold, the compound is rejected
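The counting step can be sketched as below, assuming each pair of consecutive word parts has already been given a confidence value (or `None` when the rules allow no compound) and a flag saying whether the original text wrote it as one word. The data layout, the category names, and the threshold value in the example are all hypothetical.

```python
from typing import Optional


def evaluate(pairs: list[tuple[Optional[float], bool]],
             threshold: float) -> dict[str, int]:
    """Count correct decisions and both error types.

    Each pair is (confidence, is_compound_in_text). A compound is predicted
    only if a confidence exists and is not below the threshold.
    """
    counts = {"correct": 0, "false_compound": 0, "false_noncompound": 0}
    for confidence, is_compound in pairs:
        predicted = confidence is not None and confidence >= threshold
        if predicted == is_compound:
            counts["correct"] += 1
        elif predicted:
            counts["false_compound"] += 1      # compounded, but 2 words in text
        else:
            counts["false_noncompound"] += 1   # rejected, but 1 word in text
    return counts


# Hypothetical data: two real compounds, two real word sequences.
pairs = [(0.0045, True), (2.26e-8, True), (0.002, False), (None, False)]
print(evaluate(pairs, threshold=1e-3))
```

Running the same pairs with several thresholds is one way to look for the value that maximizes the number of correct decisions for a given text.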

Test Results
• 3 test texts were used:
  • Thuis (dialogue of a soap series): 3415 words, 3.08% OOV, 1.47% compounds
  • Aspe (chapter of a novel): 4589 words, 3.77% OOV, 6.08% compounds
  • Interview (transcript of spontaneous speech): 4645 words, 0.84% OOV, 2.95% compounds
• Most of the OOVs are proper nouns or non-standard Dutch

Test Results (2)
• Correct identification of noncompounds and compounds is:
  • dependent on the test text
  • dependent on the parameter thresholds
• There is a nearly perfect negative correlation (-0.98) between the optimal confidence threshold and the proportion of compounds in the test text

Test Results (3)

Conclusions
• Compoundability can be identified with an accuracy of 94.5 - 98.5%
• Lexical coverage can be assured with OOV rates between 0.8 and 3.8% and a lexicon with a total size of 36,000 entries (BWL+QWL)

Conclusions (2)
• Capturing already existing compounds by automated compounding proves to be successful
• Capturing newly formed compounds proves to be a lot harder: the accuracy is a lot lower
• Automated compounding proves to be a useful means for maximizing lexical coverage
