
Automated Compounding as a means for Maximizing Lexical Coverage



1. Automated Compounding as a means for Maximizing Lexical Coverage
Vincent Vandeghinste
Centrum voor Computerlinguïstiek, K.U. Leuven

2. Maximizing Lexical Coverage
• Target: reduction of the number of OOV-words
• Means:
  • accurate content and organization of the recognizer lexicon
  • taking care of a number of productive word formation processes
• Evaluation:
  • implementation of a test tool
  • test results
• Conclusions

3. Lexicon: Content & Organization
• Starting point: the CGN lexicon (570,000 entries)
• Reduction to one entry per word form per POS (300,000 entries)
• Removal of compounds (160,000 entries)
• Selection of the most frequent entries (40,000) => Basic Word List (BWL)
• Quasi-Word List (QWL): compounding word parts that do not appear in the BWL
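
As a rough illustration of this reduction pipeline, consider the following Python sketch. The entry layout (word form, POS, frequency, compound flag) is an assumption for illustration, not the actual CGN format; the QWL would then be filled with compound parts that fall outside the resulting BWL.

```python
def build_basic_word_list(entries, n_most_frequent=40000):
    """Reduce a CGN-style lexicon to a Basic Word List (BWL).

    entries: iterable of (wordform, pos, frequency, is_compound) tuples
    (a hypothetical layout, not the actual CGN format).
    """
    # One entry per word form per POS: keep the most frequent reading.
    best = {}
    for wordform, pos, freq, is_compound in entries:
        key = (wordform, pos)
        if key not in best or freq > best[key][0]:
            best[key] = (freq, is_compound)

    # Remove compounds, then keep the n most frequent remaining entries.
    simplex = [(freq, wf, pos)
               for (wf, pos), (freq, comp) in best.items() if not comp]
    simplex.sort(reverse=True)
    return {(wf, pos) for _, wf, pos in simplex[:n_most_frequent]}
```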

4. Lexicon Accuracy
• Careful selection of the words in the BWL:
  • no compounds
  • frequent words
• Organization of the lexicon: maximal applicability of the compounding rules through a lexicon split into BWL and QWL

5. Word Formation Processes
• Input: a number of word parts that may or may not form a compound
• Hybrid approach: rule-based + statistical filters
• Output:
  • a compound + morpho-syntactic info + a confidence measure, or
  • no compounding possible with the given word parts

6. Word Formation Processes: Input
• From the BWL: full words that can be part of a compound or stand alone as words
• From the QWL: 'words' that can only occur as part of a compound
• 2 to 5 word parts

7. Word Formation Processes: Rules
• Rules for word formation are applied, e.g. modifier (N) + head (N) => compound (N)
• Input from the QWL: the word part is N and can only be a modifier
• Input from the BWL: the word is looked up in CGN, and its morpho-syntactic info is used in the rules
• Each rule combines 2 word parts
• With more than 2 input word parts, the rules are applied recursively (see the sketch below)
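
A minimal sketch of such recursive rule application, assuming a single N + N rule and a hypothetical lexicon lookup; the right-branching recursion is a simplification for illustration:

```python
def lookup_pos(part, bwl, qwl):
    """Hypothetical lookup: QWL entries are noun parts that can only be
    modifiers; BWL words carry their CGN part of speech."""
    if part in qwl:
        return "N-mod-only"
    return bwl.get(part)  # e.g. "N", "PREP", ... or None

def compound(parts, bwl, qwl):
    """Apply modifier(N) + head(N) => compound(N) recursively to
    2 to 5 word parts; returns (compound, POS) or None."""
    if len(parts) == 1:
        # The final head must be a full noun from the BWL.
        return (parts[0], "N") if lookup_pos(parts[0], bwl, qwl) == "N" else None
    if lookup_pos(parts[0], bwl, qwl) not in ("N", "N-mod-only"):
        return None  # no compounding possible with the given word parts
    head = compound(parts[1:], bwl, qwl)
    return (parts[0] + head[0], "N") if head else None

# compound(["frequentie", "tabel"], {"frequentie": "N", "tabel": "N"}, set())
# -> ("frequentietabel", "N")
```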

8. Word Formation Processes: Statistics
• Relative Frequency Threshold parameter
• Confidence Measure of Compounding Probability

9. Relative Frequency Threshold
• Uses the relative frequency (RF) of each POS for a given word form
• Uses a threshold value T (0.05%)
• If RF > T: the POS is used for this word form
• If RF < T: the POS is rejected for this word form
• Example: RF(bij(PREP)) = 0.999 > T and RF(bij(N)) = 0.0004 < T, so only bij(PREP) is used
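
A sketch of this filter; the per-word-form POS counts are an assumed input format:

```python
def filter_pos_by_relative_frequency(pos_counts, threshold=0.0005):
    """Keep only the POS tags whose relative frequency for one word
    form exceeds the threshold (0.05% on the slide).

    pos_counts: dict mapping a POS tag to its corpus frequency for a
    single word form (a hypothetical input format).
    """
    total = sum(pos_counts.values())
    return {pos for pos, count in pos_counts.items()
            if count / total > threshold}

# The 'bij' example: the PREP reading survives, the rare N reading is rejected.
# filter_pos_by_relative_frequency({"PREP": 9990, "N": 4}) -> {"PREP"}
```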

10. Confidence Measure of Compounding Probability
• Estimation of: P(comp(w1=mod, w2=head)) / P(comp(w1=*, w2=head)), where:
  • P(comp(w1=mod, w2=head)) is the probability that the two consecutive word parts form a compound rather than two separate words
  • P(comp(w1=*, w2=head)) is the probability of w2 being a head, with any modifier

11. Confidence Measure of Compounding Probability (2)
• If the compound is found in the frequency list, the ratio is estimated as: [Fr(comp(w1=mod, w2=head)) / Fr(comp(w1=*, w2=head))] x (1 - D_head), where:
  • Fr(comp(w1=mod, w2=head)) is the frequency of the compound consisting of w1 + w2
  • Fr(comp(w1=*, w2=head)) is the frequency of the 2nd word part as a head, with any modifier
  • D_head is the discount parameter: the amount of probability mass reserved for words not in the frequency list

12. Confidence Measure of Compounding Probability (3)
• The discount parameter is estimated as: D_head = #diff(mod | head) / Fr(comp(w1=*, w2=head)), where:
  • #diff(mod | head) is the number of different modifiers occurring with the given head
  • Fr(comp(w1=*, w2=head)) is the frequency of the 2nd word part as a head, with any modifier
  • (1 - D_head) is the amount of probability mass reserved for words that can be found in the frequency list

13. Confidence Measure of Compounding Probability (4)
• If the compound is not found in the frequency list, the ratio is estimated as: D_head x [Fr(comp(w1=mod, w2=*)) / Fr(*)], where:
  • Fr(comp(w1=mod, w2=*)) is the frequency of the 1st word part as a modifier of any head
  • Fr(*) is the total frequency of all words in the frequency list (= 79,862,581)
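
Putting slides 10-13 together, the confidence computation might look like this sketch; the caller is assumed to supply the counts, following the definitions above:

```python
def confidence(compound_freq, head_freq, mod_freq, n_diff_modifiers,
               total_freq=79_862_581):
    """Confidence that mod + head forms a compound (slides 10-13).

    compound_freq:    Fr(comp(w1=mod, w2=head)), 0 if not in the list
    head_freq:        Fr(comp(w1=*, w2=head))
    mod_freq:         Fr(comp(w1=mod, w2=*))
    n_diff_modifiers: #diff(mod | head)
    total_freq:       Fr(*), the total frequency of the list
    """
    d_head = n_diff_modifiers / head_freq  # discount parameter (slide 12)
    if compound_freq > 0:
        # Attested compound: discounted relative frequency (slide 11).
        return (compound_freq / head_freq) * (1 - d_head)
    # Unattested compound: back off to the modifier's productivity (slide 13).
    return d_head * (mod_freq / total_freq)
```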

14. Confidence Measures: Examples
• binnen + kijken
  • binnenkijken occurs in the frequency list
  • Fr(w1=binnen, w2=kijken) = 10
  • Fr(w1=*, w2=kijken) = 2188
  • #diff(mod | head=kijken) = 21
  • (10 / 2188) x (1 - 21/2188) = 0.0045
• frequentie + tabel
  • frequentietabel does not occur in the frequency list
  • Fr(w1=*, w2=tabel) = 141
  • #diff(mod | head=tabel) = 17
  • Fr(w1=frequentie, w2=*) = 15
  • (17 / 141) x (15 / 79,862,581) = 2.26e-8
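
Both worked examples can be reproduced with the confidence() sketch above:

```python
# binnen + kijken: the compound is attested in the frequency list.
confidence(compound_freq=10, head_freq=2188,
           mod_freq=0, n_diff_modifiers=21)    # -> 0.0045...

# frequentie + tabel: the compound is unattested, so back off.
confidence(compound_freq=0, head_freq=141,
           mod_freq=15, n_diff_modifiers=17)   # -> 2.26...e-08
```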

15. Evaluation
• Test system
• Test results

16. The Test System
• Takes a regular text as input
• Converts punctuation marks into #
• For the test system, a BWL of 35,000 entries was used
• Every word is checked against the BWL:
  • if the word is not present in the BWL, it is split into a modifier (QWL or BWL) and a head (BWL)
  • no compounding rules are used in the split-up procedure
  • if no possible split is found, a split into 3 parts is tried
• If a word cannot be found in the BWL and cannot be split up, it is classified as an OOV-word
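
A brute-force sketch of that split-up procedure, trying two parts first and then three; the exhaustive substring search is an assumption, since the actual split-up strategy is not specified on the slide:

```python
def split_word(word, bwl, qwl, max_parts=3):
    """Split an unknown word into a modifier (QWL or BWL) plus a head
    (BWL); if no two-part split works, try three parts."""
    if word in bwl:
        return [word]
    # Two-part splits: modifier + head.
    for i in range(1, len(word)):
        mod, head = word[:i], word[i:]
        if (mod in bwl or mod in qwl) and head in bwl:
            return [mod, head]
    # Three-part splits: modifier + (modifier + head).
    if max_parts >= 3:
        for i in range(1, len(word) - 1):
            mod = word[:i]
            if mod in bwl or mod in qwl:
                rest = split_word(word[i:], bwl, qwl, max_parts=2)
                if rest and len(rest) == 2:
                    return [mod] + rest
    return None  # unsplittable: classified as an OOV-word
```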

17. The Test System (2)
• For every 2 consecutive word parts, it was tested whether they can be compounded or not
• The results are compared with the original text
• False compounding and false identification of non-compounds can be counted this way
• The same was done for every 3 consecutive word parts
• A threshold was set on the confidence measure: if the confidence measure < threshold, the compound is rejected
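
A sketch of the pairwise test loop; score() stands in for a compoundability check such as the confidence() sketch above, and the gold labels are assumed to come from the original text:

```python
def test_pairs(parts, gold_is_compound, score, threshold):
    """For every 2 consecutive word parts, predict 'compound' iff the
    confidence clears the threshold, and count agreement with the text.

    parts:            word parts in text order
    gold_is_compound: one boolean per adjacent pair, from the original text
    score:            (mod, head) -> confidence value
    """
    correct = 0
    for i in range(len(parts) - 1):
        predicted = score(parts[i], parts[i + 1]) >= threshold
        correct += (predicted == gold_is_compound[i])
    return correct / (len(parts) - 1)  # accuracy over all pairs
```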

18. Test Results
• 3 test texts were used:
  • Thuis (dialogue from a soap series): 3415 words, 3.08% OOV, 1.47% compounds
  • Aspe (chapter of a novel): 4589 words, 3.77% OOV, 6.08% compounds
  • Interview (transcript of spontaneous speech): 4645 words, 0.84% OOV, 2.95% compounds
• Most of the OOVs are proper nouns or non-standard Dutch

19. Test Results (2)
• Correct identification of non-compounds and compounds:
  • depends on the test text
  • depends on the parameter thresholds
• There is a nearly perfect negative correlation (-0.98) between the optimal confidence threshold and the proportion of compounds in the test text

20. Test Results (3)
• [The results table/chart on this slide is not preserved in the transcript]

21. Conclusions
• Identifying compoundability can be done with an accuracy of 94.5-98.5%
• Lexical coverage can be assured with OOV rates between 0.8% and 3.8% and a lexicon with a total size of 36,000 entries (BWL + QWL)

22. Conclusions (2)
• Capturing already existing compounds by automated compounding proves successful
• Capturing newly formed compounds proves much harder: the accuracy is considerably lower
• Automated compounding proves to be a useful means for maximizing lexical coverage
