Statistical modelling of MT output corpora for Information Extraction
Overview
• Using MT output for IE
  • Requirements and evaluation of usability
• S-score: measuring the degree of word “significance” for a text by contrasting “text” and “corpus” usages
• Experiment set-up and MT evaluation metrics
  • using differences in S-scores for MT evaluation
• Results of MT evaluation for IE
  • Comparison of MT systems
  • Correlations with human evaluation measures of MT
  • Issues of MT architecture and evaluation scores
• Conclusions & Future work
Using MT for IE
• Requirements for human use and for automatic processing are different:
  • fluency is less important than adequacy
  • stylistic errors are less important than factual errors, e.g., MT may misread the name “Bill Fisher” as ‘to send a bill to a fisher’
• Frequency issues: low-frequency words carry the most important information (and require accurate disambiguation)
• Some IE tasks use statistical models (expected to be different for MT)
Examples: Frequency issues… disambiguation
Frequency issues: statistical modelling for IE
• Research on adaptive IE: automatic template acquisition via statistical means
  • find sentences containing statistically significant words
  • build templates around such sentences
• Template element fillers (e.g., NEs) often appear among statistically significant words
• Distribution of word frequencies is expected to be different for MT: checking if this is the case
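The first step of template acquisition (finding sentences that contain statistically significant words) can be sketched as follows. This is a minimal illustration, assuming a set of significant words has already been computed by some scoring method; the function name `sentences_with_significant_words`, the regex-based sentence splitter, and the lowercase matching are all assumptions made for the example, not part of the original work.

```python
import re


def sentences_with_significant_words(text, significant_words):
    """Return (sentence, matched_words) pairs for sentences containing
    at least one significant word.

    `significant_words` is assumed to be a set of lowercase word forms.
    Sentence splitting here is a naive split on ., ! or ? followed by
    whitespace; a real IE pipeline would use a proper segmenter.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    selected = []
    for sentence in sentences:
        tokens = set(re.findall(r"\w+", sentence.lower()))
        hits = tokens & significant_words
        if hits:
            selected.append((sentence, sorted(hits)))
    return selected
```

Templates would then be built around the selected sentences; that step depends on the IE system and is not sketched here.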
Measuring statistical significance
• S_word[text]: the score of statistical significance for a particular word in a particular text;
• P_word[text]: the relative frequency of the word in the text;
• P_word[rest-corp]: the relative frequency of the same word in the rest of the corpus, without this text;
• N_word[txt-not-found]: the proportion of texts in the corpus where this word is not found (the number of texts where it is not found divided by the number of texts in the corpus);
• P_word[all-corp]: the relative frequency of the word in the whole corpus, including this particular text
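The formula that combines these quantities is not preserved in the slide text. The sketch below assumes the form S_word[text] = ln((P_word[text] − P_word[rest-corp]) × N_word[txt-not-found] / P_word[all-corp]), which is how the S-score is usually stated in write-ups of this work; treat the formula as an assumption rather than a quotation. The function name `s_scores` and the representation of the corpus as a list of token lists are illustrative choices.

```python
import math
from collections import Counter


def s_scores(corpus, text_index):
    """Compute S-scores for every word of corpus[text_index].

    `corpus` is a list of texts, each text a non-empty list of tokens.
    ASSUMED formula (not shown in the slide text):
        S = ln((P_text - P_rest) * N_not_found / P_all)
    Words for which the argument of ln is not positive get no score.
    """
    text = corpus[text_index]
    rest = [tok for i, t in enumerate(corpus) if i != text_index for tok in t]
    all_tokens = [tok for t in corpus for tok in t]
    vocab_per_text = [set(t) for t in corpus]

    text_counts = Counter(text)
    rest_counts = Counter(rest)
    all_counts = Counter(all_tokens)

    scores = {}
    for word, count in text_counts.items():
        p_text = count / len(text)
        p_rest = rest_counts[word] / len(rest) if rest else 0.0
        p_all = all_counts[word] / len(all_tokens)
        # Proportion of texts in the corpus that do not contain the word.
        n_not_found = sum(1 for v in vocab_per_text if word not in v) / len(corpus)
        value = (p_text - p_rest) * n_not_found / p_all
        if value > 0:
            scores[word] = math.log(value)
    return scores
```

Under this assumed formula, a word that occurs in every text of the corpus has N_word[txt-not-found] = 0 and receives no score, which matches the intuition that function words are not significant for any single text.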
Intuitive appeal of significance scores
Selecting words potentially important for IE:
In the Marseille Facet of the Urba-Gracco Affair, Messrs. Emmanuelli, Laignel, Pezet, and Sanmarco Confronted by the Former Officials of the SP Research Department
On Wednesday, February 9, the presiding judge of the Court of Criminal Appeals of Lyon, Henri Blondet, charged with investigating the Marseille facet of the Urba-Gracco affair, proceeded with an extensive confrontation among several Socialist deputies and former directors of Urba-Gracco. Ten persons, including Henri Emmanuelli and Andre Laignel, former treasurers of the SP, Michel Pezet, and Philippe Sanmarco, former deputies (SP) from the Bouches-du-Rhône, took part in a hearing which lasted more than seven hours
Ordering words: ...Intuitive appeal of significance scores
Metric for usability of MT for IE
• Suggestion: measuring differences in statistical significance between a human translation and MT allows estimating the amount of prospective problems
• Question: do any human evaluation measures of MT correlate with differences in S-scores for different MT systems?
Experiment setup
• Available: 100 texts developed for the DARPA 94 MT evaluation exercise:
  • French originals
  • 2 different human translations (reference and expert)
  • 5 translations by MT systems (“French into English”):
    • knowledge-based: Systran; Reverso; Metal; Globalink
    • IBM statistical approach to MT: Candide
• DARPA evaluation scores available for each system and for the human expert translation: Informativeness; Adequacy; Fluency
• Calculating distances of combined S-scores between the human reference translation and the other translations (MT and the expert translation)
The distance scores
• Based on comparing sets of words with S-score > 1:
  • words significant in both texts with different statistical significance scores
  • words not present in the reference translation (overgenerated in MT)
  • words not present in MT, but present in the reference translation (undergenerated in MT)
• Computing distance scores:
  • o-score for «avoiding overgeneration» (~ Precision)
  • u-score for «avoiding undergeneration» (~ Recall)
  • u&o combined score (calculated as F-measure)
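The precision/recall analogy can be illustrated with a simplified set-overlap sketch. This is not the deck's scoring method: the actual distance scores also take into account how much the S-score of a shared word differs between the two texts, and their formulas are not preserved in the slide text. The function name `distance_scores` and the dictionary inputs (word to S-score) are assumptions made for the example.

```python
def distance_scores(ref_scores, mt_scores, threshold=1.0):
    """Simplified o-, u- and combined u&o scores from two S-score tables.

    `ref_scores` and `mt_scores` map words to S-scores for the reference
    translation and for the evaluated translation of the same text.
    SIMPLIFICATION: only set membership above the threshold is compared;
    differences in score magnitude for shared words are ignored.
    """
    ref_sig = {w for w, s in ref_scores.items() if s > threshold}
    mt_sig = {w for w, s in mt_scores.items() if s > threshold}

    overgenerated = mt_sig - ref_sig    # significant in MT only
    undergenerated = ref_sig - mt_sig   # significant in reference only

    # Precision-like: share of MT-significant words confirmed by the reference.
    o_score = 1 - len(overgenerated) / len(mt_sig) if mt_sig else 0.0
    # Recall-like: share of reference-significant words recovered by MT.
    u_score = 1 - len(undergenerated) / len(ref_sig) if ref_sig else 0.0
    # Combined score as the balanced F-measure (harmonic mean).
    total = o_score + u_score
    uo_score = 2 * o_score * u_score / total if total else 0.0
    return o_score, u_score, uo_score
```

Dividing by the number of significant words in each text keeps the scores comparable across texts whose sets of significant words differ in size.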
Computing distance scores...
• Words that changed their significance:
• Overgeneration score:
• Undergeneration score:
… Computing distance scores
• Scores for avoiding over- and under-generation:
• Making scores compatible across texts (the number of significant words may be different):
Results and correlation of scores
• Human expert translation scores higher than MT
• Statistical MT system «Candide» is characteristically different
• Strong positive correlation found for o-score & DARPA adequacy
• Weak positive correlation found for u&o & DARPA fluency
• No correlation was found between u-score (high for statistical MT) and human MT evaluation measures
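The slides do not say which correlation coefficient was used. As a sketch, assuming Pearson's r over one pair of values per system (an automatic score and a human evaluation score), the computation could look like this; the function name `pearson` is illustrative, and no values from the experiment are reproduced here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    ASSUMPTION: the source does not name the coefficient; Pearson's r is
    used here only as a common default. Both sequences must contain at
    least two values and must not be constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)
```

With only a handful of systems, as in this setup, such a coefficient rests on very few data points and should be read with caution.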
Conclusions
• The word-significance measure S is useful in other areas (e.g., distinguishing lexical and morphological differences)
• The threshold S > 1 distinguishes content words from function words across different languages (checked for English, French and Russian)
• Statistical modelling showed substantial differences between human translation and MT output corpora
• Measures of contrastive frequencies for words in a particular text and the rest of the corpus correlate with human evaluation of MT (scores for adequacy)
Future work
• Statistical modelling of Example-based MT
• Investigating the actual performance of IE systems on different tasks using MT of different quality (with different “usability for IE” scores) and its correlation with the proposed MT evaluation measures
• Establishing formal properties for intuitive judgements about translation quality (translation equivalence, adequacy, and fluency in human translation and MT)