Towards the Use of Linguistic Information in Automatic MT Evaluation Metrics

Towards the Use of Linguistic Information in Automatic MT Evaluation Metrics • Thesis Proposal • Elisabet Comelles • Supervisors: Irene Castellon and Victoria Arranz

Outline • Introduction • State of the Art • Discussion of MT Evaluation Metrics • Hypothesis & Objective • Methodology & Schedule

Introduction • Quick access to multilingual information • Need for quick translation • Sharp increase in the number of MT systems • Need for evaluation of those MT systems • Evaluation needs to be quick and reliable

Introduction • Current and most used Evaluation Metrics show problems • New approaches to Evaluation using linguistic information: • Syntactic info • Semantic info • Our scenario: • Comparison between already existing systems • Direction of translation to test: English-Spanish

State of the Art • MT closely linked to MT Evaluation • Purpose of the evaluation methods: • Error analysis • System comparison • Chronologically: • Human MT Evaluation • Automatic MT Evaluation

State of the Art: Types of MT Evaluation • Focused on Context: • Context-based Evaluation (FEMTI) • Evaluates the suitability of the MT Technology & the MT System for the user’s purpose • Parameters of analysis: functionality, reliability, usability, efficiency, maintainability, portability, cost, etc. • Focused on Quantity & Quality: • Human Evaluation and Automatic Evaluation

State of the Art: Types of MT Evaluation • Human Evaluation: • Several approaches: • Fidelity (ALPAC report) • Intelligibility (ALPAC report) • Comprehensive evaluation of informativeness (ARPA) • Quality panel evaluation • Adequacy and Fluency (Semantics and Syntax) • Preferred Translation • Required Post-Editing

State of the Art: Types of MT Evaluation • Human Evaluation: • Advantage: human evaluators can evaluate the overall quality of the system • Disadvantages: • Time-consuming • Expensive • Subjective

State of the Art: Types of MT Evaluation • Automatic Evaluation: • Approaches: • Based on Lexical Matching • Based on Syntax • Based on Semantics

State of the Art: Types of MT Evaluation • Based on Lexical Matching: • Dominant approach to Automatic MT Evaluation • Looks for lexical similarities between MT output and reference translations • Types: • Edit Distance Measures (WER) • Precision-oriented Measures (BLEU) • Recall-oriented Measures (ROUGE) • Measures balancing Precision & Recall (GTM)
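
The lexical-matching families listed above can be illustrated with a minimal sketch, assuming whitespace tokenisation and a single reference translation. The function names (`wer`, `clipped_ngram_precision`, `ngram_recall`) are illustrative and not taken from any toolkit. The precision function computes only the clipped n-gram precision that BLEU builds on, not the full BLEU score (no brevity penalty, no averaging over n-gram orders), and the recall function is only a rough stand-in for the recall-oriented idea behind ROUGE.

```python
from collections import Counter


def wer(hypothesis: str, reference: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word missing from output
                            curr[j - 1] + 1,      # extra word in output
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_ngram_precision(hypothesis: str, reference: str, n: int = 1) -> float:
    """Share of output n-grams found in the reference, with clipped counts."""
    hyp, ref = ngrams(hypothesis.split(), n), ngrams(reference.split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    return sum(min(count, ref[gram]) for gram, count in hyp.items()) / total


def ngram_recall(hypothesis: str, reference: str, n: int = 1) -> float:
    """Share of reference n-grams that also appear in the output."""
    hyp, ref = ngrams(hypothesis.split(), n), ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    return sum(min(count, hyp[gram]) for gram, count in ref.items()) / total
```

For the output "the cat sat on mat" against the reference "the cat sat on the mat", the unigram precision is perfect while WER and recall both register the missing word, which is why precision-oriented and recall-oriented measures can rank the same output differently.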

State of the Art: Types of MT Evaluation • Based on Syntax: • Recently developed • Focused on the syntax of the output sentence • Types: • Constituency Parsing • Dependency Parsing • Combination of both analyses (Liu & Gildea 2005)
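
As a rough illustration of the dependency-based idea, the sketch below scores an output sentence by the overlap between its dependency triples and those of the reference. It assumes the triples (head, relation, dependent) have already been produced by some parser, and the relation labels in the usage example are invented for illustration. This is a simplified stand-in, not the metric of Liu & Gildea (2005); the name `dependency_overlap_f1` is hypothetical.

```python
from collections import Counter


def dependency_overlap_f1(hyp_triples, ref_triples) -> float:
    """F1 over the multisets of (head, relation, dependent) triples."""
    hyp, ref = Counter(hyp_triples), Counter(ref_triples)
    # Counter intersection keeps the minimum count of each shared triple
    matched = sum((hyp & ref).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(hyp.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hand-written triples for illustration only (no parser is run here)
reference = [("sat", "nsubj", "cat"), ("cat", "det", "the"), ("sat", "obl", "mat")]
output = [("sat", "nsubj", "cat"), ("cat", "det", "a"), ("sat", "obl", "mat")]
score = dependency_overlap_f1(output, reference)
```

Because the score is computed over syntactic relations rather than word sequences, two sentences with different word order but the same dependency structure can still match, which a purely lexical n-gram metric would penalise.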

State of the Art: Types of MT Evaluation • Based on Semantics: • Recently developed • Focused on the semantics of the output • Types: • Named Entities (NEs): Quality over NEs (NEE) • Semantic Roles: Similarities over Semantic Roles (SR)

Discussion of MT Evaluation Metrics • Human Evaluation: • Advantages: • Allows overall quality to be evaluated • Disadvantages: • Time-consuming • Expensive • Subjective

Discussion of MT Evaluation Metrics • Automatic Evaluation: • Advantages: • Fast • Inexpensive • Objective • Updatable • Disadvantages?

Discussion of MT Evaluation Metrics • Automatic Metrics based on Lexical Matching: • Great advance in MT Research in the last decade • Widely accepted & used by the SMT research community • BLEU is the most used Automatic Metric • Criticized by those not developing SMT systems • Usually depend on reference translations • Only take into account lexical similarities & disregard syntax • Biased

Discussion of MT Evaluation Metrics • Automatic Metrics based on Syntax: • Good improvement • Work at sentence level • Only focused on Syntax • What about meaning? • Automatic Metrics based on Semantics: • Good improvement • Only NEs & Semantic Roles • NEs not too relevant • Need further development • Only focused on meaning, what about syntax?

Discussion of MT Evaluation Metrics • Discussion of Automatic Metrics: • Each metric focuses on a partial aspect of quality • Strongly biased evaluations • Unfair comparison between systems • Overtuning of the system • Need for integration of metrics • Parametric vs. Non-parametric • Evaluation of the quality of a metric combination • Human likeness • Human acceptability
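
The parametric vs. non-parametric distinction can be sketched as follows, under the assumption that every individual metric has already been normalised to the range 0 to 1 with higher meaning better (an error rate such as WER would have to be inverted first). The function names, metric labels and all numbers are hypothetical. A non-parametric combination here is a plain uniform average; a parametric one needs weights, which in practice would have to be tuned on held-out data.

```python
def combine_uniform(scores: dict) -> float:
    """Non-parametric combination: every metric contributes equally."""
    if not scores:
        raise ValueError("at least one metric score is required")
    return sum(scores.values()) / len(scores)


def combine_weighted(scores: dict, weights: dict) -> float:
    """Parametric combination: weighted average with per-metric weights."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Illustrative values only, not experimental results
scores = {"lexical": 0.6, "syntactic": 0.4, "semantic": 0.5}
weights = {"lexical": 0.5, "syntactic": 0.25, "semantic": 0.25}
uniform_score = combine_uniform(scores)
weighted_score = combine_weighted(scores, weights)
```

The uniform variant avoids tuning and therefore the risk of overtuning mentioned above, at the cost of treating a weak metric the same as a strong one.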

Hypothesis & Objective • Hypothesis: Adding new linguistic information will improve the performance of Automatic Metrics • Main Objective: Proposing a new Automatic Evaluation Metric based on linguistic information.

Hypothesis & Objective • Secondary Objectives: • Explore linguistic information: • Syntactic info: POS, shallow parsing, chunking, full parsing, dependency parsing, constituency parsing, etc. • Semantic info: Semantic Roles, semantic features, WordNet, FrameNet, Lexical Semantics, etc. • Look for linguistic resources suitable for computational processing • Look for publicly available linguistic resources • Explore the appropriate way to combine this information

Methodology & Schedule • 4 stages: • Stage 1 (years 1 & 2): • Literature review and analysis: • Detailed exploration and analysis of Automatic Evaluation Metrics • Detailed exploration, analysis and selection of the appropriate linguistic information • Exploration of the feasibility and availability of the linguistic resources needed • Stage 2 (years 1 & 2): • Selection of the evaluation corpus

Methodology & Schedule • Stage 3 (year 3): • Experiments on how to combine this linguistic information and the automatic evaluation metrics • Evaluation of our metric combination based on either human likeness or human acceptability • Stage 4 (year 4): • Analysis & discussion of the results obtained • Summary of the findings and reflection on the results obtained • Proposal of a new evaluation metric
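
One common way to assess a metric in terms of human acceptability is to measure how well its scores correlate with human judgements of the same translations. The sketch below assumes that reading and uses Pearson correlation; the function name `pearson` and the sample numbers are purely illustrative and are not results from this work.

```python
from math import sqrt


def pearson(xs, ys) -> float:
    """Pearson correlation between two equally long lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return covariance / sqrt(var_x * var_y)


# Made-up scores for five translations: metric output vs. human adequacy ratings
metric_scores = [0.21, 0.35, 0.50, 0.62, 0.80]
human_scores = [1, 2, 3, 3, 5]
agreement = pearson(metric_scores, human_scores)
```

A value close to 1 would indicate that the metric ranks translations much as the human judges do; comparing this value across candidate metric combinations gives a simple basis for choosing between them.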
