
Presentation Transcript


  1. MEANT: a semi-automatic metric for MT evaluation via semantic frames (an assembly of ACL11, IJCAI11, and SSST11). Chi-kiu Lo & Dekai Wu. Presented by SUN Jun

  2. MT is often bad • MT3: So far , the sale in the mainland of China for nearly two months of SK – II line of products • MT1: So far , nearly two months sk –ii the sale of products in the mainland of China to resume sales. • MT2: So far, in the mainland of China to stop selling nearly two months of SK – 2 products sales resumed. • Ref: Until after their sales had ceased in mainland China for almost two months, sales of the complete range of SK – II products have now been resumed. • BLEU: 0.124 / 0.012 / 0.013
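
As a rough illustration of how sentence-level BLEU figures like the ones above can be computed, here is a minimal sketch using NLTK's sentence_bleu; the smoothing and n-gram settings are assumptions, so the numbers will not reproduce the slide's scores exactly.

```python
# Minimal sketch: sentence-level BLEU with NLTK. The smoothing method and
# default 4-gram weights are assumptions, not the settings used on the slide.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

ref = ("Until after their sales had ceased in mainland China for almost "
       "two months , sales of the complete range of SK - II products have "
       "now been resumed .").split()
mt1 = ("So far , nearly two months sk - ii the sale of products in the "
       "mainland of China to resume sales .").split()

# Smoothing avoids zero scores when a higher-order n-gram never matches.
smooth = SmoothingFunction().method1
score = sentence_bleu([ref], mt1, smoothing_function=smooth)
print(f"sentence BLEU for MT1: {score:.3f}")
```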

  3. Metrics besides BLEU have problems • Lexical-similarity-based metrics (e.g., NIST, METEOR) • Good at capturing fluency • Correlate poorly with human judgment on adequacy • Syntax-based metrics (e.g., STM; Liu and Gildea, 2005) • Much better at capturing grammaticality • Still more fluency-oriented than adequacy-oriented • Non-automatic metrics (e.g., HTER) • Use human annotators to solve the non-trivial problem of finding the minimum edit distance in order to evaluate adequacy • Require human training and are labor intensive

  4. MEANT: SRL for MT evaluation • Intuition behind the idea: a useful translation helps users accurately understand the basic event structure of the source utterance: “who did what to whom, when, where and why”. • Hypothesis of the work: MT utility can best be evaluated via SRL • Better than: • N-gram based metrics like BLEU (adequacy) • Human-training-intensive metrics like HTER (time cost) • Complex aggregate metrics like ULC (representation transparency)

  5. Q: Do PRED & ARGj correlate with human adequacy judgments?

  6. Q: Do PRED & ARGj correlate with human adequacy judgments?

  7. Experimental settings • Exp setting 1 – Corpus • ACL11: 40 sentences drawn from the newswire portion of the GALE P2.5 data set (with SRL annotated on reference and source; 3 MT outputs) • IJCAI11: 40 sentences, 35 from the previous data set, and 39 from broadcast news (WMT2010 MetricsMaTr)

  8. Experimental settings • Exp setting 2 – Annotation of SRL on MT reference and output • SRL: PropBank style

  9. Experimental settings • Exp setting 3 – SRL evaluation as MT evaluation • Each predicate and argument is judged correct, incorrect, or partial • Partial: part of the meaning is correctly translated • Extra meaning in a role filler is not penalized unless it belongs in another role • An incorrectly translated predicate means the entire frame is wrong (its arguments are not counted)
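
A minimal sketch of the counting rules on this slide, assuming each role filler has already been aligned and judged as correct, partial, or incorrect; the function and names below are illustrative, not taken from the MEANT papers.

```python
# Sketch of the per-frame counting rules: each predicate and argument is
# judged "correct", "partial", or "incorrect". If the predicate itself is
# incorrect, the whole frame counts as wrong and its arguments are not
# counted individually. Names here are illustrative, not from the papers.
from collections import Counter

def count_frame(pred_judgment, arg_judgments):
    """Return a Counter of judgments contributed by one semantic frame."""
    counts = Counter()
    if pred_judgment == "incorrect":
        counts["incorrect"] += 1        # whole frame wrong, args skipped
        return counts
    counts[pred_judgment] += 1          # predicate: correct or partial
    counts.update(arg_judgments)        # arguments judged independently
    return counts

print(count_frame("correct", ["correct", "partial", "incorrect"]))
# Counter({'correct': 2, 'partial': 1, 'incorrect': 1})
```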

  10. Experimental settings • Exp setting 3 – SRL evaluation as MT evaluation • F-measure based scores • Weights tuned via a confusion matrix on the dev set
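
A simplified sketch of the F-measure-style aggregation, with a single w_partial weight standing in for the weights the authors tune on the dev set; it shows the general shape of a weighted precision/recall/f-score, not the exact MEANT formula.

```python
# Simplified sketch: partial matches contribute with weight w_partial
# (a stand-in for the tuned weights); this is the general shape of the
# score, not the exact HMEANT/MEANT definition.

def weighted_f1(correct, partial, total_mt, total_ref, w_partial=0.5):
    matched = correct + w_partial * partial
    precision = matched / total_mt if total_mt else 0.0
    recall = matched / total_ref if total_ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: 6 correct and 2 partial role fillers, out of 10 fillers in the
# MT output and 12 in the reference.
print(round(weighted_f1(correct=6, partial=2, total_mt=10, total_ref=12), 3))
```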

  11. Experimental settings • Exp setting 4 – Evaluation of the evaluation • WMT and NIST MetricsMaTr (2010) • Kendall’s τ rank correlation coefficient • Evaluates the correlation of the proposed metric with human judgments on translation adequacy ranking • A higher τ indicates that the metric’s ranking is more similar to the human ranking • The correlation coefficient ranges over [-1, 1], where 1 means the metric ranks the systems identically to the human judgment and -1 means the ranking is completely reversed
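
A self-contained sketch of Kendall's τ over system rankings, counting concordant versus discordant pairs between a metric's scores and human adequacy judgments; the system scores below are made up for illustration.

```python
# Kendall's tau over system rankings: count concordant vs. discordant
# pairs between metric scores and human adequacy scores (ties ignored).
from itertools import combinations

def kendall_tau(metric_scores, human_scores):
    concordant = discordant = 0
    for i, j in combinations(range(len(metric_scores)), 2):
        product = (metric_scores[i] - metric_scores[j]) * (human_scores[i] - human_scores[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0

metric = [0.42, 0.31, 0.55]   # hypothetical metric scores for three systems
human  = [3.1, 2.8, 3.6]      # hypothetical human adequacy judgments
print(kendall_tau(metric, human))   # 1.0: the two rankings agree exactly
```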

  12. Observations • HMEANT vs. other metrics

  13. Observations • HMEANT on CV data

  14. Observations • HMEANT annotated monolingually vs. bilingually • Error analysis: annotators drop parts of the meaning in the translation when trying to align them to the source input

  15. Observations • HMEANT vs. MEANT (automatic SRL) • SRL tool: ASSERT, about 87% accuracy (Pradhan et al. 2004) • MEANT preserves about 80% of HMEANT’s correlation

  16. Q2: Impact of each individual semantic role on the metric’s correlation • A preliminary experiment • For each ARGj and PRED, we manually compared each English MT output against its reference translation. Using the counts thus obtained, we computed the precision, recall, and f-score for PRED and each ARGj type.
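
A small sketch of the per-role computation described above, turning counts of correctly matched role fillers into precision, recall, and f-score for each role type; the counts used here are illustrative, not the paper's.

```python
# Per-role precision/recall/f-score from match counts, one entry per role
# type (PRED, ARG0, ARG1, ...). The counts below are illustrative only.

def prf(correct, in_mt, in_ref):
    p = correct / in_mt if in_mt else 0.0
    r = correct / in_ref if in_ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

counts = {
    # role: (correctly matched, occurrences in MT output, occurrences in reference)
    "PRED": (30, 38, 40),
    "ARG0": (18, 25, 27),
    "ARG1": (22, 33, 35),
}
for role, (c, n_mt, n_ref) in counts.items():
    p, r, f = prf(c, n_mt, n_ref)
    print(f"{role}: P={p:.2f} R={r:.2f} F={f:.2f}")
```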

  17. IJCAI11: evaluating the individual impact • The preliminary experiment suggests effectiveness • Propose metrics for evaluating the individual impact

  18. IJCAI11: evaluating the individual impact • The preliminary experiment suggests effectiveness

  19. IJCAI11: evaluating the individual impact • Results

  20. IJCAI11: evaluating the individual impact • Results (2) • Automatic SRL tool: 76–93%

  21. Q: Can it be even more accurate? • SSST11...

  22. Conclusion • ACL11 • Introduces MEANT and HMEANT • HMEANT correlates with human judges as well as the more expensive HTER does • Automatic SRL preserves about 80% of the correlation • IJCAI11 • Studies the impact of each individual semantic role • SSST11 • Proposes a length-based weighting scheme to evaluate the contribution of each semantic frame
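
As a rough illustration of the length-based weighting idea credited to SSST11 above, the sketch below weights each frame's score by the number of tokens its role fillers cover, so frames carrying more of the sentence contribute more; this is an assumed simplification, not the authors' exact formulation.

```python
# Assumed simplification of length-based frame weighting: each frame's
# score contributes in proportion to the number of tokens it covers.

def length_weighted_score(frames):
    """frames: list of (frame_score, token_count) pairs."""
    total_tokens = sum(n for _, n in frames)
    if total_tokens == 0:
        return 0.0
    return sum(score * n for score, n in frames) / total_tokens

# Example: a long, mostly correct frame and a short, wrong one.
print(length_weighted_score([(0.9, 12), (0.0, 3)]))   # ≈ 0.72
```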

  23. END
