
Statistical modelling of MT output corpora for Information Extraction



Presentation Transcript


  1. Statistical modelling of MT output corpora for Information Extraction

  2. Overview
     - Using MT output for IE: requirements and evaluation of usability
     - S-score: measuring the degree of word “significance” for a text by contrasting “text” and “corpus” usages
     - Experiment set-up and MT evaluation metrics: using differences in S-scores for MT evaluation
     - Results of MT evaluation for IE: comparison of MT systems; correlations with human evaluation measures of MT
     - Issues of MT architecture and evaluation scores
     - Conclusions & future work

  3. Using MT for IE
     - Requirements for human use and for automatic processing are different:
       - fluency is less important than adequacy
       - stylistic errors are less important than factual errors, e.g. MT: * “Bill Fisher” ~ “to send a bill to a fisher”
     - Frequency issues: low-frequency words carry the most important information (and require accurate disambiguation)
     - Some IE tasks use statistical models (expected to be different for MT)

  4. Frequency issues… disambiguation: examples

  5. Frequency issues: statistical modelling for IE
     - Research on adaptive IE: automatic template acquisition via statistical means
       - find sentences containing statistically significant words
       - build templates around such sentences
     - Template element fillers (e.g., NEs) often appear among statistically significant words
     - The distribution of word frequencies is expected to be different for MT: checking whether this is the case
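The sentence-selection step of template acquisition could be sketched roughly as follows. This is only an illustration: the function name `candidate_sentences`, the `min_hits` parameter and the pre-tokenised input are assumptions not taken from the slides, and the default threshold of 1 is borrowed from the S-score > 1 cut-off used later in the deck.

```python
def candidate_sentences(sentences, scores, threshold=1.0, min_hits=1):
    """Pick sentences that contain statistically significant words.

    sentences: list of token lists (tokenisation is assumed to be done already).
    scores:    dict mapping a word to its significance score for this text.
    Returns a list of (tokens, significant_tokens) pairs.
    """
    selected = []
    for tokens in sentences:
        # Words without a score are treated as not significant.
        hits = [t for t in tokens if scores.get(t, float("-inf")) > threshold]
        if len(hits) >= min_hits:
            selected.append((tokens, hits))
    return selected
```

Templates would then be built around the returned sentences; that step is not sketched here because the slides give no detail about it.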

  6. Measuring statistical significance
     - S_word[text]: the score of statistical significance for a particular word in a particular text
     - P_word[text]: the relative frequency of the word in the text
     - P_word[rest-corp]: the relative frequency of the same word in the rest of the corpus, without this text
     - N_word[txt-not-found]: the proportion of texts in the corpus in which this word is not found (the number of texts where it is not found divided by the number of texts in the corpus)
     - P_word[all-corp]: the relative frequency of the word in the whole corpus, including this particular text
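The formula that combines these quantities is not preserved in the transcript. The sketch below assumes they combine as S = ln((P_word[text] − P_word[rest-corp]) × N_word[txt-not-found] / P_word[all-corp]); that combination is an assumption to be checked against the original slide. The function name `s_scores`, the token-list input and the choice to skip words whose logarithm argument is not positive are likewise illustrative.

```python
import math
from collections import Counter

def s_scores(texts, index):
    """Return {word: S} for texts[index]; texts is a list of token lists."""
    text = texts[index]
    rest = [tok for i, t in enumerate(texts) if i != index for tok in t]
    text_counts, rest_counts = Counter(text), Counter(rest)
    n_text, n_rest = len(text), len(rest)
    vocab_per_text = [set(t) for t in texts]

    scores = {}
    for word, count in text_counts.items():
        p_text = count / n_text                              # P_word[text]
        p_rest = rest_counts[word] / n_rest if n_rest else 0.0   # P_word[rest-corp]
        p_all = (count + rest_counts[word]) / (n_text + n_rest)  # P_word[all-corp]
        # N_word[txt-not-found]: share of texts that do not contain the word
        n_not_found = sum(word not in v for v in vocab_per_text) / len(texts)
        arg = (p_text - p_rest) * n_not_found / p_all
        if arg > 0:  # assumption: the score is left undefined otherwise
            scores[word] = math.log(arg)
    return scores
```

Under this reading, a word that is no more frequent in the text than in the rest of the corpus, or that occurs in every text, receives no score at all.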

  7. Intuitive appeal of significance scores
     Selecting words potentially important for IE:
     In the Marseille Facet of the Urba-Gracco Affair, Messrs. Emmanuelli, Laignel, Pezet, and Sanmarco Confronted by the Former Officials of the SP Research Department
     On Wednesday, February 9, the presiding judge of the Court of Criminal Appeals of Lyon, Henri Blondet, charged with investigating the Marseille facet of the Urba-Gracco affair, proceeded with an extensive confrontation among several Socialist deputies and former directors of Urba-Gracco. Ten persons, including Henri Emmanuelli and Andre Laignel, former treasurers of the SP, Michel Pezet, and Philippe Sanmarco, former deputies (SP) from the Bouches-du-Rhône, took part in a hearing which lasted more than seven hours

  8. ...Intuitive appeal of significance scores: ordering words

  9. Metric for usability of MT for IE
     - Suggestion: measuring differences in statistical significance for a human translation and MT allows estimating the amount of prospective problems
     - Question: do any human evaluation measures of MT correlate with differences in S-scores for different MT systems?

  10. Experiment setup
      - Available: 100 texts developed for the DARPA 94 MT evaluation exercise:
        - French originals
        - 2 different human translations (reference and expert)
        - 5 translations by MT systems (“French into English”):
          - knowledge-based: Systran; Reverso; Metal; Globalink
          - IBM statistical approach to MT: Candide
        - DARPA evaluation scores available for each system and for the human expert translation: Informativeness; Adequacy; Fluency
      - Calculating distances of combined S-scores between the human reference translation and the other translations (MT and the expert translation)

  11. The distance scores
      - Based on comparing sets of words with S-score > 1:
        - words significant in both texts, with different statistical significance scores
        - words not present in the reference translation (overgenerated in MT)
        - words not present in MT, but present in the reference translation (undergenerated in MT)
      - Computing distance scores:
        - o-score for «avoiding overgeneration» (~ Precision)
        - u-score for «avoiding undergeneration» (~ Recall)
        - u&o combined score (calculated as F-measure)

  12. Computing distance scores...
      - Words that changed their significance:
      - Overgeneration score:
      - Undergeneration score:

  13. … Computing distance scores
      - Scores for avoiding over- and under-generation:
      - Making scores compatible across texts (the number of significant words may be different):
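The formulas from the slides on computing distance scores are not preserved in the transcript, so the sketch below is a deliberately simplified set-overlap analogue rather than the authors' weighted scores. It treats the o-score as precision and the u-score as recall over the sets of words with S > 1, and combines them as a balanced F-measure. It ignores differences in S-score for words that are significant in both texts. Dividing by the size of each set is one assumed way of keeping scores comparable across texts with different numbers of significant words. All function names are hypothetical.

```python
def significant_words(scores, threshold=1.0):
    """Words whose S-score exceeds the threshold (S > 1 in the slides)."""
    return {w for w, s in scores.items() if s > threshold}

def distance_scores(ref_scores, mt_scores, threshold=1.0):
    """Return (o, u, f) for one text: simplified set-based analogues only."""
    ref = significant_words(ref_scores, threshold)
    mt = significant_words(mt_scores, threshold)
    shared = ref & mt
    # o: share of MT-significant words also significant in the reference
    #    (penalises overgeneration, like precision)
    o = len(shared) / len(mt) if mt else 0.0
    # u: share of reference-significant words also significant in MT
    #    (penalises undergeneration, like recall)
    u = len(shared) / len(ref) if ref else 0.0
    f = 2 * o * u / (o + u) if (o + u) else 0.0
    return o, u, f
```

A version closer to the slides would presumably weight each word by its S-score and add a penalty for words that changed their significance.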

  14. The resulting distance scores

  15. DARPA Adequacy and scores

  16. o-score & DARPA 94 Adequacy

  17. DARPA Fluency and scores

  18. u&o-score and DARPA 94 Fluency

  19. Results and correlation of scores
      - Human expert translation scores higher than MT
      - Statistical MT system «Candide» is characteristically different
      - Strong positive correlation found for o-score & DARPA adequacy
      - Weak positive correlation found for u&o & DARPA fluency
      - No correlation was found between u-score (high for statistical MT) and human MT evaluation measures
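The slides do not say which correlation coefficient was used. A minimal sketch, assuming Pearson's r computed over per-system scores (for example, o-scores against DARPA adequacy scores); the function name `pearson_r` is illustrative, and no values from the experiment are reproduced here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / math.sqrt(var_x * var_y)
```

With only five MT systems, any such coefficient rests on very few data points, so it is best read as indicative.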

  20. Conclusions
      - The word-significance measure S is useful in other areas (e.g., distinguishing lexical and morphological differences)
      - The threshold S > 1 distinguishes content and function words across different languages (checked for English, French and Russian)
      - Statistical modelling showed substantial differences between human translation and MT output corpora
      - Measures of contrastive frequencies for words in a particular text and the rest of the corpus correlate with human evaluation of MT (scores for adequacy)

  21. Future work
      - Statistical modelling of Example-based MT
      - Investigating the actual performance of IE systems on different tasks using MT of different quality (with different "usability for IE" scores) and its correlation with proposed MT evaluation measures
      - Establishing formal properties for intuitive judgements about translation quality (translation equivalence, adequacy, and fluency in human translation and MT)
