Identifying Translations
Philip Resnik, Noah Smith
University of Maryland

Reasons to identify translations
• Locating parallel text on the Web
• Filtering out poor quality translations
• Cross-language duplicate detection/caching

Identifying translations using structure
STRAND (Resnik, 1999)

Comparison         N     Agreement   κ
J1, J2             267   0.98        0.95
J1, STRAND         273   0.88        0.70
J2, STRAND         315   0.88        0.69
J1 ∩ J2, STRAND    261   0.90        0.75
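
The κ column is a chance-corrected agreement statistic. As a minimal sketch, assuming it is Cohen's kappa over two raters' binary translation/non-translation judgments (the slide does not say which variant was used), it could be computed as below. The function name `cohens_kappa` and the toy labels are hypothetical and not taken from the slides.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each rater's own label distribution.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)

# Toy judgments (invented for illustration): 1 = translation pair, 0 = not.
judge_1 = [1, 1, 1, 0, 0, 0, 1, 0]
judge_2 = [1, 1, 0, 0, 0, 0, 1, 0]
kappa = cohens_kappa(judge_1, judge_2)
```

On this toy data the raters agree on 7 of 8 items, while chance agreement is 0.5, which is why κ comes out lower than raw agreement.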

Related Work
• Web mining for parallel text (Nie et al. 1999)
• Sentence alignment (Fluhr et al. 2000)
• Duplicate detection (e.g. Broder et al. 1997)

Translational Equivalence as a Function over Sets
• Broder et al. (1997): document representation as a set of “shingles” S(D)
  $$r(D_1, D_2) = \frac{|S(D_1) \cap S(D_2)|}{|S(D_1) \cup S(D_2)|}$$
• Cross-language generalization: partial equality $e =_t f$ with confidence value t(e,f)
• $=_t$ used to define $\cap_t$ and $\cup_t$
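
The monolingual resemblance measure of Broder et al. can be sketched as follows. This is only an illustration under stated assumptions: shingles are taken to be contiguous runs of `w` whitespace-separated tokens, and the names `shingles`, `resemblance`, and the default `w=3` are hypothetical choices, not values given on the slide. The cross-language generalization is not shown here.

```python
def shingles(tokens, w=3):
    """Return the set S(D) of contiguous w-token shingles of a token list."""
    if len(tokens) < w:
        # Short documents contribute a single shingle (or none if empty).
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(doc1, doc2, w=3):
    """r(D1, D2) = |S(D1) & S(D2)| / |S(D1) | S(D2)|."""
    s1 = shingles(doc1.split(), w)
    s2 = shingles(doc2.split(), w)
    union = s1 | s2
    if not union:
        return 0.0  # two empty documents: define resemblance as 0
    return len(s1 & s2) / len(union)

d1 = "a rose is a rose is a rose"
d2 = "a rose is a flower which is a rose"
r = resemblance(d1, d2)
```

Here `d1` has 3 distinct shingles, all of which also occur among the 7 shingles of `d2`, so the resemblance is 3/7.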

Ways of computing equivalence
• Bilingual dictionaries
  • t(e,f) = 1 if (e,f) present in dictionary, 0 otherwise
• Translation model (Melamed 2000, model A)
  • t(e,f) = Pr(e,f)
• String similarity for cognates
  • t(e,f) = longest common subsequence ratio (LCSR) variant
  • Trained on non-zero entries in translation model
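
The three scoring options can be sketched as plain functions. This is a minimal illustration, not the authors' implementation: the dictionary and translation model are assumed to be Python containers keyed by (e, f) pairs, and LCSR is taken in Melamed's usual sense, the length of the longest common subsequence divided by the length of the longer string. The trained "variant" mentioned on the slide is not specified, so it is not reproduced. All function names are hypothetical.

```python
def dictionary_score(e, f, dictionary):
    """t(e,f) = 1 if (e,f) is in the bilingual dictionary, else 0."""
    return 1.0 if (e, f) in dictionary else 0.0

def model_score(e, f, model):
    """t(e,f) = Pr(e,f), read from a table of pair probabilities."""
    return model.get((e, f), 0.0)

def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcsr(e, f):
    """Plain LCSR: LCS length over the length of the longer string."""
    longest = max(len(e), len(f))
    return lcs_length(e, f) / longest if longest else 0.0

# Toy resources (invented for illustration).
toy_dictionary = {("house", "maison")}
toy_model = {("house", "maison"): 0.4}
cognate_score = lcsr("government", "gouvernement")
```

For the cognate pair above, all 10 letters of "government" appear in order inside the 12-letter "gouvernement", giving a score of 10/12.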

Evaluation task
• Given segmented corpus C1 in L1, C2 in L2
• Assume each segment has 0 or 1 translation equivalents
• Match up the equivalents
• Equivalent to maximum bipartite matching problem
• Exhaustive solution available for small sets
• Approximated using competitive linking (Melamed)
• True equivalence pairs give precision/recall curve
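
Competitive linking is a greedy approximation to the matching problem: repeatedly link the highest-scoring pair whose two segments are both still unlinked. The sketch below assumes pairwise scores are already available in a dict keyed by (i, j) segment indices; the names `competitive_linking`, `precision_recall`, the `threshold` parameter, and the toy scores are hypothetical. Sweeping the threshold is one plausible way to trace a precision/recall curve, though the slide does not say how the curve was produced.

```python
def competitive_linking(scores, threshold=0.0):
    """Greedy one-to-one matching of C1 segments to C2 segments.

    scores maps (i, j) -> similarity of segment i in C1 and segment j in C2.
    Returns the chosen links as (i, j, score), best first.
    """
    linked_1, linked_2, links = set(), set(), []
    # Visit candidate pairs from highest to lowest score.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for (i, j), s in ranked:
        if s <= threshold:
            break  # all remaining pairs score too low to link
        if i not in linked_1 and j not in linked_2:
            links.append((i, j, s))
            linked_1.add(i)
            linked_2.add(j)
    return links

def precision_recall(links, gold):
    """Precision and recall of predicted links against true pairs."""
    predicted = {(i, j) for i, j, _ in links}
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Toy scores and gold pairs (invented for illustration).
scores = {(0, 0): 0.9, (0, 1): 0.8, (1, 0): 0.7, (1, 1): 0.1, (2, 2): 0.05}
gold = {(0, 0), (1, 1)}
all_links = competitive_linking(scores)
strict_links = competitive_linking(scores, threshold=0.5)
```

On the toy scores the greedy choice of (0, 0) blocks both (0, 1) and (1, 0), whose combined weight would have been higher; this is the sense in which the procedure only approximates the optimal matching.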

Some results: sentence matching
• Task corpora:
  • Chinese-English: Hong Kong Laws sentences
    • 5622 training sentences, 191 test sentences
  • Spanish-English: U.N. Parallel Corpus
    • 4695 training sentences, 200 test sentences
[Result plots: English-Chinese, English-Spanish]

Some results: document matching
• Task corpora:
  • 232 English-French Web documents

New directions
• Exploiting the Internet Archive
  • 100-200 million pages (4TB) on disk
  • Exhaustive URL matching within site
  • STRAND now adapted for disk-based access
• Combining structure and content
• Improving document-level matching
• Selecting good chunks within documents