Semantic Evaluation of Machine Translation Billy Wong, City University of Hong Kong 21st May 2010
Introduction
• Surface text similarity is not a reliable indicator in automatic MT evaluation
  • Insensitive to translation variation
  • Deeper linguistic analysis is preferred
• WordNet is widely used for matching synonyms (see the sketch below)
  • E.g. METEOR (Banerjee & Lavie 2005), TERp (Snover et al. 2009), ATEC (Wong & Kit 2010)…
• Is the similarity of words between MT outputs and references fully captured?
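For illustration, a minimal sketch of how such a synonym match is typically decided, assuming NLTK's WordNet interface (this is not the actual METEOR/TERp/ATEC code):

```python
# Minimal sketch of WordNet synonym matching as used by such metrics;
# assumes NLTK with the WordNet data installed (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def are_synonyms(w1, w2):
    """Two words count as synonyms if they share at least one synset."""
    return bool(set(wn.synsets(w1)) & set(wn.synsets(w2)))

print(are_synonyms("car", "automobile"))  # True: both in synset car.n.01
print(are_synonyms("journey", "tour"))    # False: similar, but no shared synset
```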
Motivation
• WordNet
  • Granularity of sense distinctions is highly fine-grained
  • Word pairs not in the same sense: [mom vs mother], [safeguard vs security], [expansion vs extension], [journey vs tour], [impact vs influence], etc. (see the check below)
  • These pairs have similar meanings; ignoring them in evaluation is problematic
• What is needed is a word similarity measure
• Proposal: utilize word similarity measures in automatic MT evaluation
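A quick illustrative check of these pairs, again a sketch assuming NLTK's WordNet interface (the restriction to noun senses is a simplifying assumption):

```python
# Sketch: the listed pairs share no WordNet synset (so synonym matching
# misses them), yet a similarity measure such as Wu-Palmer scores them highly.
from nltk.corpus import wordnet as wn

pairs = [("mom", "mother"), ("safeguard", "security"),
         ("expansion", "extension"), ("journey", "tour"),
         ("impact", "influence")]

for w1, w2 in pairs:
    synonyms = bool(set(wn.synsets(w1)) & set(wn.synsets(w2)))
    # Best Wu-Palmer score over all noun sense pairs.
    wup = max((s1.wup_similarity(s2) or 0.0)
              for s1 in wn.synsets(w1, pos=wn.NOUN)
              for s2 in wn.synsets(w2, pos=wn.NOUN))
    print(f"{w1}/{w2}: synonyms={synonyms}, wup={wup:.2f}")
```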
Word Similarity Measures
• Knowledge-based (WordNet)
  • Wup (Wu & Palmer 1994)
  • Res (Resnik 1995)
  • Jcn (Jiang & Conrath 1997)
  • Hso (Hirst & St-Onge 1998)
  • Lch (Leacock & Chodorow 1998)
  • Lin (Lin 1998)
  • Lesk (Banerjee & Pedersen 2002)
• Corpus-based
  • LSA (Landauer et al. 1998)
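Most of the knowledge-based measures are available through NLTK's WordNet interface; hso and lesk come from the Perl WordNet::Similarity package and are not in NLTK, and LSA needs a separately trained vector-space model. A minimal sketch:

```python
# Sketch: computing several WordNet-based measures with NLTK; requires
# nltk.download('wordnet') and nltk.download('wordnet_ic').
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')  # information content for res/jcn/lin

s1, s2 = wn.synset('journey.n.01'), wn.synset('tour.n.01')
print('wup:', s1.wup_similarity(s2))
print('lch:', s1.lch_similarity(s2))
print('res:', s1.res_similarity(s2, brown_ic))
print('jcn:', s1.jcn_similarity(s2, brown_ic))
print('lin:', s1.lin_similarity(s2, brown_ic))
```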
Experiment
• Three questions:
  • To what extent are two words considered similar?
  • Which word similarity measure(s) is/are more appropriate to use?
  • How much performance gain can an MT evaluation metric obtain by incorporating word similarity measures?
Setting
• Data
  • MetricsMATR08 development data: 1992 MT outputs, 8 MT systems, 4 references
• Evaluation metric
  • Unigram matching: exact match / synonym / semantically similar, all with the same weight
  • Three variants: precision (p), recall (r) and F-measure (f)
    p = m / |c|, r = m / |t|, f = 2pr / (p + r)
    where m is the number of matched unigrams, c the MT output and t the reference translation (a sketch follows below)
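A minimal sketch of this metric, assuming NLTK's WordNet with wup as the similarity measure; the threshold value and the bag-of-words matching (no one-to-one alignment) are simplifying assumptions, not the exact experimental setup:

```python
# Sketch of the unigram-matching metric: a word matches a reference word by
# exact match, WordNet synonymy, or semantic similarity above a threshold;
# all three match types carry the same weight.
from nltk.corpus import wordnet as wn

def match(w1, w2, threshold=0.8):
    if w1 == w2:                                   # exact match
        return True
    syn1, syn2 = wn.synsets(w1), wn.synsets(w2)
    if set(syn1) & set(syn2):                      # synonym match
        return True
    return any((s1.wup_similarity(s2) or 0.0) >= threshold   # similar match
               for s1 in syn1 for s2 in syn2)

def evaluate(c, t):
    """p, r, f over unigrams of MT output c and reference t."""
    # Counting matched words from each side approximates m without
    # computing an explicit alignment.
    m_c = sum(any(match(w, r) for r in t) for w in c)
    m_t = sum(any(match(r, w) for w in c) for r in t)
    p, r = m_c / len(c), m_t / len(t)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

print(evaluate("the journey was long".split(), "the tour was long".split()))
```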
Result (1)
• Correlation thresholds of each measure
Result (2)
• Correlation of the metric
Conclusion
• Semantically similar words are important in automatic MT evaluation
• Two word similarity measures, wup and LSA, perform relatively better
• Remaining problems
  • Semantic similarity vs. semantic relatedness, e.g. [committee vs chairman] rated similar by LSA
  • Most WordNet similarity measures run on verbs and nouns only (see below)
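The last limitation is easy to demonstrate with NLTK (a sketch): path-based WordNet measures rely on the hypernym taxonomy, which only covers nouns and verbs.

```python
# Adjectives (and adverbs) lack a hypernym hierarchy in WordNet, so
# taxonomy-based measures yield no score for them.
from nltk.corpus import wordnet as wn

good, happy = wn.synset('good.a.01'), wn.synset('happy.a.01')
print(good.wup_similarity(happy))  # None: no common hypernym for adjectives
```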