VERTa : Linguistic Features in MT Evaluation

Joint work with: Elisabet Comelles (UB), Jordi Atserias (FBM), Victoria Arranz (ELDA/ELRA), Irene Castellon (UB)

Outline
• Introduction
• Methodology
• VERTa
  • Lexical Similarity Metric
  • Morphological Similarity Metric
  • Dependency Similarity Module
  • N-gram Similarity Module
  • Metrics Combination
• Experiments
• Conclusions & Future Work

Introduction
• MT metrics:
  • BLEU
  • Linguistically-motivated metrics:
    • Lexical information (Banerjee & Lavie 2005)
    • Syntactic information (Liu & Gildea 2005; He et al. 2010)
    • Semantic information (Giménez & Márquez 2007, 2008a)
• Combination of linguistic features:
  • Machine-learning approach (Leusch & Ney 2009; Albrecht & Hwa 2007)
  • Non-parametric approach (Giménez 2008b; Specia & Giménez 2010)
• Our proposal: VERTa (work in progress)
  • Linguistically motivated
  • Combination of linguistic features

Methodology
• Several linguistic phenomena need to be taken into account in MT evaluation:
  • Lexical semantics: “I believe the situation” vs. “I think the situation”
  • Syntax:
    “...a delegation of Moroccan police...” vs. “...a Moroccan police delegation...”
    “...were assassinated by unknown men...” vs. “...unknown men assassinated...”
  • Word order: “...Putin on Thursday announced that...” vs. “Putin announced on Thursday...”
  • Semantics + morphology: “...carrying out an attack in Moscow...” vs. “...Chechens carry out an attack in Moscow”

Methodology
• Linguistic knowledge organised in different levels:
  • Lexical information (lexical units)
  • Morphological information (lexical units & POS)
  • Syntactic information (dependency relations)
  • Sentence semantics (semantic arguments?)
• Evaluation of both adequacy & fluency
• Results for each module can help error analysis

Our Proposal: VERTa
[Diagram: Lexical, Morphological, Dependency and N-gram modules feeding a Weighted Combination; word matches shown as W1 -> W1, W4; W2 -> W3, W22; W3 -> W3]

VERTa: Lexical Similarity Module
• Aim: identifying lexical similarities
• Lexical matches
• System of weights (weighted average)

VERTa: Morphological Similarity Module
• Aim: accuracy
• Matches of pairs of features (lexical info + POS)
• System of weights (weighted average)

VERTa: Dependency Similarity Module
• Aim: capturing relations between constituents regardless of their position in the sentence
HYP: After a meeting on Monday night with the head of Egyptian intelligence chief Omar Suleiman Haniya said....
REF: Haniya said, after a meeting on Monday evening with the head of Egyptian Intelligence General Omar Suleiman...

VERTa: Dependency Similarity Module
• Based on the lexical similarity module
• Matches of triples: Label(Head, Mod)
• System of weights (weighted average)
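A minimal sketch of how triple matching could build on lexical word matches. Everything here is an assumption made for illustration: the `Triple` type, the synonym set standing in for the lexical similarity module, the scoring (fraction of hypothesis triples matched) and the hand-written dependency labels. The slides do not specify VERTa's actual weighting scheme.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A dependency triple written as Label(Head, Mod)."""
    label: str
    head: str
    mod: str


def words_match(a, b, synonyms):
    # Stand-in for the lexical module: identical words or a listed synonym pair.
    return a == b or frozenset((a, b)) in synonyms


def triple_matches(h, r, synonyms):
    # Same label, and both head and modifier match lexically.
    return (h.label == r.label
            and words_match(h.head, r.head, synonyms)
            and words_match(h.mod, r.mod, synonyms))


def matched_fraction(hyp, ref, synonyms=frozenset()):
    """Fraction of hypothesis triples that match some reference triple."""
    if not hyp:
        return 0.0
    hits = sum(any(triple_matches(h, r, synonyms) for r in ref) for h in hyp)
    return hits / len(hyp)


# Hand-written, illustrative triples for "a meeting on Monday night/evening";
# the labels are assumptions, not real parser output.
hyp_triples = [Triple("det", "meeting", "a"),
               Triple("prep_on", "meeting", "night"),
               Triple("nn", "night", "Monday")]
ref_triples = [Triple("det", "meeting", "a"),
               Triple("prep_on", "meeting", "evening"),
               Triple("nn", "evening", "Monday")]
SYNONYMS = {frozenset(("night", "evening"))}

print(matched_fraction(hyp_triples, ref_triples))            # exact words only
print(matched_fraction(hyp_triples, ref_triples, SYNONYMS))  # with lexical matches
```

With exact word matching only one of the three triples matches; once "night" and "evening" count as a lexical match, all three do.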

VERTa: Dependency Similarity Module (under development)
• Extra rules at phrase and sentence level
• Examples:
  HYP: ...between the two ministries of interior...
  REF: ...between the two interior ministries...
  HYP_prep_of(ministries, interior) = REF_amod(ministries, interior)

  HYP: After meeting the Moroccan news agency published a joint statement...
  REF: A joint statement published (...) by the Moroccan news agency...
  HYP_nsubj(published, agency) = REF_agent(published, agency)
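The two equivalences above can be sketched as a small table of label pairs. This is an illustrative assumption, not VERTa's rule format: the names `EQUIVALENT_LABELS` and `triples_equivalent` are hypothetical, and a real rule would presumably check more context (for instance, that the reference is a passive clause) before equating `nsubj` with `agent`.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A dependency triple written as Label(Head, Mod)."""
    label: str
    head: str
    mod: str


# Label pairs treated as equivalent, taken from the two examples on the slide.
EQUIVALENT_LABELS = {
    frozenset(("prep_of", "amod")),
    frozenset(("nsubj", "agent")),
}


def labels_equivalent(a, b):
    # Identical labels, or a pair listed in the rule table (order-independent).
    return a == b or frozenset((a, b)) in EQUIVALENT_LABELS


def triples_equivalent(h, r):
    # Head and modifier must be identical; only the label may differ by rule.
    return (labels_equivalent(h.label, r.label)
            and h.head == r.head
            and h.mod == r.mod)


hyp = Triple("prep_of", "ministries", "interior")
ref = Triple("amod", "ministries", "interior")
print(triples_equivalent(hyp, ref))
```

Storing each pair as a `frozenset` keeps the rule symmetric, so it does not matter which side is the hypothesis and which the reference.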

N-gram Similarity Module
• Aim: identifying the linear order of lexical elements
• Based on the word matches of the lexical similarity module
• Matching chunks (length: from 2 up to sentence length)
HYP: … the situation in the area…
REF: … the situation in the region…
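A rough sketch of chunk matching on the example above, assuming that "area" and "region" are matched by the lexical module (represented here by a hypothetical synonym set). The function names and the unclipped counting are simplifications chosen for illustration; the slides do not describe how VERTa scores the matched chunks.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def words_match(h, r, synonyms):
    # Stand-in for the lexical module: identical words or a listed synonym pair.
    return h == r or frozenset((h, r)) in synonyms


def ngram_matches(hyp, ref, synonyms=frozenset()):
    """For each n >= 2, count hypothesis n-grams matching some reference n-gram."""
    counts = {}
    for n in range(2, min(len(hyp), len(ref)) + 1):
        ref_grams = ngrams(ref, n)
        counts[n] = sum(
            any(all(words_match(h, r, synonyms) for h, r in zip(hg, rg))
                for rg in ref_grams)
            for hg in ngrams(hyp, n)
        )
    return counts


hyp = "the situation in the area".split()
ref = "the situation in the region".split()
SYNONYMS = {frozenset(("area", "region"))}

print(ngram_matches(hyp, ref))            # exact words only
print(ngram_matches(hyp, ref, SYNONYMS))  # with the lexical match area ~ region
```

Without the lexical match, every chunk that ends in "area" fails; with it, the whole five-word chunk matches.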

Metrics Combination
• Each metric receives a specific weight depending on:
  • The type of evaluation
  • The language evaluated
• Set of weights for the experiments:
  • Adequacy + English:
    • Lexical Module: 0.444
    • Morphology Module: 0.111
    • N-gram Module: 0.111
    • Dependency Module: 0.333
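As an illustration, the combination can be read as a weighted average of the four module scores. The weights below are the ones listed above; the function name, the assumption that each module score lies in [0, 1], and the sample scores are hypothetical. Because the listed weights sum to 0.999 (they look like 4/9, 1/9, 1/9 and 3/9 truncated), the sketch divides by their sum.

```python
# Weights from the slide (adequacy, English).
WEIGHTS = {
    "lexical": 0.444,
    "morphology": 0.111,
    "ngram": 0.111,
    "dependency": 0.333,
}


def combine(module_scores, weights=WEIGHTS):
    """Weighted average of per-module scores, each assumed to lie in [0, 1]."""
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * module_scores[name] for name in weights)
    return weighted_sum / total_weight


# Hypothetical module scores for one segment (illustrative values only).
scores = {"lexical": 0.8, "morphology": 0.7, "ngram": 0.5, "dependency": 0.6}
print(round(combine(scores), 3))
```

A different evaluation type or language would only require passing another `weights` dictionary.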

Experiments
• Preliminary tests:
  • To check the adequacy of the linguistic features used
  • To reconsider and improve the ongoing development of the metric
• Aimed at:
  • Influence of the dependency module
  • Influence of hyperonyms and hyponyms
  • Comparing VERTa with other metrics
• Data (MetricsMaTr 2010 Shared Task):
  • 8 different systems
  • 4 reference translations
  • 100 segments/system (28,000 words approx.)
  • Human judgments based on adequacy

Experiments: Influence of the Dependency Module and Use of Hyperonyms and Hyponyms
Segment level:
• The dependency module improves the performance of the metric
• The use of hyponyms and hyperonyms decreases the performance of the metric
• HYP: …the situation in the area […] is on its danger mark day today…
• REF: …the situation in the region […] been as dangerous as it is today…

Experiments: VERTa vs. other well-known metrics

Conclusions & Future Work
• The more linguistic information used, the higher the scores are
• Use of linguistic information is necessary in MT evaluation
• VERTa shows promising results
• Preliminary results are helpful to continue with our ongoing research:
  • Reconsidering the linguistic features used + using other linguistic information (NEs, MWs, semantics)
  • Finishing the dependency module
  • Tuning of weights
  • Meta-evaluation:
    • Analyze each level separately
    • Evaluate in terms of fluency
    • Test VERTa with other languages

MANY THANKS!
