
Finding Translations for Low-Frequency Words in Comparable Corpora



  1. Finding Translations for Low-Frequency Words in Comparable Corpora
  Viktor Pekar, Ruslan Mitkov, Dimitar Blagoev, Andrea Mulloni
  ILP, University of Wolverhampton, UK
  Contact email: v.pekar@wlv.ac.uk

  2. Overview
  • Distributional Hypothesis and bilingual lexicon acquisition
  • The effect of data sparseness
  • Methods to model co-occurrence vectors of low-frequency words
  • Experimental evaluation
  • Conclusions

  3. Distributional Hypothesis in the bilingual context
  • Words of different languages that appear in similar contexts are translationally equivalent
  • Enables acquisition of bilingual lexicons from comparable, rather than parallel, corpora
  • Bilingual comparable corpora: not translated texts, but texts sharing topic, size, and style of presentation
  • Advantages over parallel corpora:
    • Broad coverage
    • Easy domain portability
    • Virtually unlimited number of language pairs
  • Parallel corpora, by contrast, largely amount to a restoration of existing dictionaries

  4. General approach
  • Comparable corpora in languages L1 and L2
  • Words to be aligned: N1 and N2
  • Extract co-occurrence data on N1 and N2 from the respective corpora: V1 and V2
  • Create co-occurrence matrices N1×V1 and N2×V2, each cell containing f(v,n) or p(v|n)
  • Create a translation matrix V1×V2 using a bilingual lexicon:
    • Covers equivalences between the core vocabularies only
    • Each cell encodes a translation probability
    • Used to map a vector from L1 into the vector space of L2 (see the sketch below)
  • Words with the most similar vectors are taken to be equivalent
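  The mapping step can be pictured as a single matrix multiplication. A minimal sketch, assuming dense NumPy arrays; the matrix contents and the names cooc_l1 and trans_matrix are illustrative, not taken from the original implementation:

```python
import numpy as np

# Co-occurrence matrix for L1: one row per noun n in N1, one column
# per context word v in V1, each cell holding p(v|n).
cooc_l1 = np.array([[0.5, 0.3, 0.2],
                    [0.1, 0.6, 0.3]])

# Translation matrix V1 x V2: each cell holds the probability that an
# L1 context word translates into a given L2 context word.
trans_matrix = np.array([[0.7, 0.3, 0.0, 0.0],
                         [0.0, 0.5, 0.5, 0.0],
                         [0.0, 0.0, 0.2, 0.8]])

# Map every L1 vector into the L2 vector space in one step; the result
# has shape |N1| x |V2| and can be compared directly against the L2
# co-occurrence vectors.
mapped = cooc_l1 @ trans_matrix
```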

  5. Data sparseness
  • The approach works quite unreliably on all but very frequent words (e.g., Gaussier et al. 2004)
  • Polysemy and synonymy: many-to-many correspondences between the two vocabularies
  • Noise introduced during the translation between vector spaces

  6. Data sparseness

  7. Dealing with data sparseness
  • How can one deal with data sparseness?
  • Various smoothing techniques exist: Good-Turing, Kneser-Ney, Katz's back-off
  • Previous comparative studies:
    • Class-based smoothing (Resnik 1993)
    • Web-based smoothing (Keller & Lapata 2003)
    • Distance-based averaging (Pereira et al. 1993; Dagan et al. 1999)

  8. Distance-based averaging
  • The probability of an unseen co-occurrence, p*(v|n), is estimated from the known probabilities of N′, a set of nearest neighbours of n:
    p*(v|n) = (1/norm) · Σ_{n′ ∈ N′} w(n,n′) · p(v|n′)
  • w is the weight with which n′ influences the average of the known probabilities of N′; it is computed from the distance/similarity between n and n′
  • norm is a normalisation factor (the sum of the weights)
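  A minimal sketch of this estimate, assuming co-occurrence vectors are stored as dicts from context words to p(v|n); all names and numbers are illustrative rather than taken from the original implementation:

```python
def dba_estimate(v, neighbour_weights, vectors):
    """p*(v|n) = (1/norm) * sum_{n' in N'} w(n,n') * p(v|n'),
    where norm is the sum of the neighbour weights w."""
    norm = sum(neighbour_weights.values())
    return sum(w * vectors[n2].get(v, 0.0)
               for n2, w in neighbour_weights.items()) / norm

# Example: estimate p('drink'|'ale') from the vectors of two
# distributionally similar neighbours of 'ale'.
vectors = {"beer": {"drink": 0.6, "brew": 0.4},
           "wine": {"drink": 0.7, "pour": 0.3}}
weights = {"beer": 0.9, "wine": 0.5}  # similarity of each n' to n
p = dba_estimate("drink", weights, vectors)  # ~0.636
```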

  9. Adjusting probabilities for rare co-occurrences
  • DBA was used to predict unseen probabilities
  • We would like to predict unseen probabilities as well as adjust seen, but unreliable, ones, blending the corpus-attested probability with the neighbour-based estimate:
    p*(v|n) = γ · p_DBA(v|n) + (1 − γ) · p(v|n)
  • 0 ≤ γ ≤ 1 is the degree to which the seen probability is smoothed with data on the neighbours
  • Problem: how does one estimate γ?
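  A one-line sketch of the adjustment, assuming the convex combination above (function and argument names are illustrative):

```python
def adjust(p_seen, p_dba, gamma):
    """Blend the corpus-attested probability with the DBA estimate;
    gamma in [0, 1] is the degree of smoothing (gamma = 1 discards
    the seen probability entirely, gamma = 0 leaves it untouched)."""
    return gamma * p_dba + (1.0 - gamma) * p_seen
```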

  10. Heuristical estimation of γ
  • The less frequent n is, the more it gets smoothed
  • Corpus counts are log-transformed to downplay differences between frequent words (sketch below)
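  One plausible instantiation of this heuristic, shown only as a sketch: the slide specifies the monotonic, log-scaled behaviour but not the exact formula, so the linear form below is an assumption:

```python
import math

def gamma_heuristic(freq_n, max_freq):
    """Smoothing weight that falls off linearly with the
    log-transformed corpus count: the rarest nouns get gamma close
    to 1 (heavy smoothing), the most frequent get gamma close to 0."""
    return 1.0 - math.log(1 + freq_n) / math.log(1 + max_freq)
```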

  11. Performance-based estimation of γ
  • The exact relationship between the corpus frequency of n and γ is determined on held-out pairs:
    • The held-out data are split into frequency ranges
    • The mean rank of the correct equivalent in each range is computed
    • A function g(x) is interpolated along the mean-rank points (sketch below)
  • g(n): the predicted rank for n
  • RR: random rank, the baseline mean rank expected from random assignment
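  A sketch of this procedure under stated assumptions: the bucket frequencies and mean ranks below are invented for illustration, and the final mapping from predicted rank to γ is not given on the slide, so the ratio used here is only one plausible choice:

```python
import numpy as np

# Mean rank of the correct equivalent, measured on held-out pairs
# bucketed by corpus frequency (all numbers are illustrative).
bucket_freqs = np.array([2.0, 8.0, 32.0, 128.0, 512.0])
bucket_mean_ranks = np.array([140.0, 110.0, 80.0, 60.0, 50.0])

def g(freq):
    """Predicted mean rank for a noun of the given corpus frequency,
    interpolated linearly along the held-out mean-rank points."""
    return np.interp(freq, bucket_freqs, bucket_mean_ranks)

RR = 500.0  # random rank: expected mean rank of a random assignment

def gamma_performance(freq):
    """Hypothetical mapping from predicted rank to smoothing weight:
    the closer g(freq) is to the random baseline RR, the less the
    corpus data can be trusted and the more smoothing is applied."""
    return float(min(g(freq) / RR, 1.0))
```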

  12. Smoothing functions

  13. Less frequent neighbours
  • Remove neighbours that are less frequent than n itself, in order to avoid “diluting” corpus-attested probabilities with noisier data (sketch below)
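  A minimal sketch of this filter, reusing the illustrative neighbour-weight dict from the DBA sketch above:

```python
def filter_neighbours(freq_n, neighbour_weights, freqs):
    """Keep only neighbours at least as frequent as n itself, so the
    averaging in DBA does not dilute well-attested probabilities
    with estimates drawn from sparser data."""
    return {n2: w for n2, w in neighbour_weights.items()
            if freqs[n2] >= freq_n}
```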

  14. Experimental setup
  • 6 language pairs: all combinations of English, French, German, and Spanish
  • Corpora (and parsers):
    • EN: WSJ (1987–1989), Connexor FDG
    • FR: Le Monde (1994–1996), Xerox Xelda
    • GE: die Tageszeitung (1987–1989, 1994–1998), Versley
    • SP: EFE (1994–1995), Connexor FDG
  • Extracted verb–direct object pairs from each corpus

  15. Experimental setup
  • Translation matrices:
    • Equivalents between verb synsets in EuroWordNet
    • Translation probabilities distributed equally among the different translations of a source word (sketch below)
  • Evaluation samples of noun pairs:
    • 1000 pairs from EWN for each language pair
    • Sampled from equidistant positions in a frequency-sorted list
    • Divided into 10 frequency ranges
    • A noun may have several translations in the sample (1.06 to 1.15 translations on average per language pair)
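  Building such a matrix from a bilingual lexicon is straightforward; a sketch with invented lexicon entries, distributing the probability mass evenly over all listed translations:

```python
# Each source word's translation probability is split uniformly
# among its dictionary translations (entries are illustrative).
lexicon = {"boire": ["drink"], "verser": ["pour", "spill"]}

trans_prob = {
    src: {tgt: 1.0 / len(tgts) for tgt in tgts}
    for src, tgts in lexicon.items()
}
# trans_prob["verser"] == {"pour": 0.5, "spill": 0.5}
```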

  16. Experimental setup
  • Assignment algorithm:
    • Pairs each source noun with a correct target noun
    • Similarity measured using the Jensen-Shannon divergence
    • The Kuhn-Munkres algorithm determines the optimal assignment over the entire set (sketch below)
  • Evaluation measure:
    • Mean rank of the correct equivalent
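  A compact sketch of the assignment step using SciPy's Hungarian (Kuhn-Munkres) solver; the vectors are invented, and squaring SciPy's Jensen-Shannon distance recovers the divergence:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import jensenshannon

# Mapped L1 vectors and L2 vectors (illustrative distributions).
src = np.array([[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]])
tgt = np.array([[0.1, 0.3, 0.6], [0.5, 0.4, 0.1]])

# Cost matrix: Jensen-Shannon divergence between every source-target
# pair (scipy's jensenshannon returns the distance, i.e. the square
# root of the divergence).
cost = np.array([[jensenshannon(s, t) ** 2 for t in tgt] for s in src])

# Kuhn-Munkres: the one-to-one assignment minimising total divergence.
rows, cols = linear_sum_assignment(cost)
# Here src[0] pairs with tgt[1] and src[1] with tgt[0].
```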

  17. Baseline: no smoothing

  18. DBA: replace p(v|n) with p*(v|n)

  19. Discard less frequent neighbours: significant reduction of Mean Rank for Fr-Ge, Fr-Sp, and Ge-Sp

  20. Heuristical estimation of γ: significant reduction of Mean Rank for all language pairs

  21. Performance-based estimation of γ: significant reduction of Mean Rank for all language pairs

  22. Relationship between k, frequency and Mean Rank

  23. Conclusions
  • Smoothing co-occurrence data on rare words using intra-language similarities improves retrieval of their translational equivalents
  • Two extensions of DBA to smooth rare co-occurrences:
    • Heuristical: the amount of smoothing is a linear function of frequency
    • Performance-based: the smoothing function is estimated on held-out data
  • Both lead to considerable improvement:
    • A reduction of up to 48 ranks (from 146 to 99, 32%) in the low-frequency ranges
    • A reduction of up to 27 ranks (from 81 to 54, 33%) overall
