EXTRACTION OF TRANSLATION CORRESPONDENCES from a parallel corpus using methods of distributional semantics. Yuliya Morozova, Institute for Informatics Problems of the Russian Academy of Sciences, Moscow
Distributional semantics • A new area of linguistic research • Infers semantic properties of linguistic units from corpora • Theoretical foundations: the distributional methodology of Z. Harris; ideas of F. de Saussure and L. Wittgenstein • Distributional hypothesis: semantically similar words occur in similar contexts • J. R. Firth: “You shall know a word by the company it keeps”
Vector space • drink coffee: occurred 1 time • drink tea: occurred 2 times
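The counts on this slide can be read as the coordinates of a word vector. The sketch below is an illustration only: the observation list and the function name `cooccurrence_vector` are assumptions chosen to reproduce the two counts shown, not code from the presentation.

```python
from collections import Counter

# Toy observations chosen to match the counts on the slide (assumed data):
# "drink coffee" once, "drink tea" twice.
observations = [("drink", "coffee"), ("drink", "tea"), ("drink", "tea")]

def cooccurrence_vector(target, pairs, dimensions):
    """Count how often `target` co-occurs with each context word in `dimensions`."""
    counts = Counter(context for word, context in pairs if word == target)
    return [counts[d] for d in dimensions]

dimensions = ["coffee", "tea"]
drink_vector = cooccurrence_vector("drink", observations, dimensions)
print(drink_vector)  # [1, 2]
```

Here the word "drink" becomes the point (1, 2) in a space whose axes are the context words "coffee" and "tea".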
Main application areas • lexical ambiguity resolution • information retrieval • dictionaries of semantic relations • multilingual dictionaries • semantic maps of different domains • modelling of synonymy • document topic detection • sentiment analysis
The present research • Goal: to apply distributional semantics models to extraction of translation correspondences from a parallel corpus. • Vector space model + test corpus
Test corpus • Patent texts in French translated into Russian • Texts split into sentences • Alignment at the sentence level, manually verified (in the visual editor MakeBilingua) • Uploaded to the Sketch Engine corpus manager
Preprocessing • Lemmatization • Frequent words removed (prepositions, conjunctions, etc.) • Punctuation marks removed
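The three preprocessing steps can be sketched roughly as follows. The lemma table, the stop-word list, and the sample sentence are hypothetical toy data; the presentation does not say which lemmatizer or stop list was used. Punctuation is stripped first here purely for convenience.

```python
import string

# Hypothetical toy resources; a real pipeline would use a proper lemmatizer
# and a fuller list of frequent function words.
LEMMAS = {"comprend": "comprendre", "dispositifs": "dispositif"}
STOP_WORDS = {"le", "la", "les", "de", "et", "un", "une"}

def preprocess(sentence):
    """Lowercase, strip punctuation, lemmatize, and drop frequent words."""
    tokens = sentence.lower().split()
    # Remove punctuation attached to tokens, then discard empty leftovers.
    tokens = [t.strip(string.punctuation) for t in tokens]
    tokens = [t for t in tokens if t]
    # Map each token to its lemma when the toy table knows it.
    tokens = [LEMMAS.get(t, t) for t in tokens]
    # Drop frequent function words (prepositions, conjunctions, articles).
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("Le dispositif comprend un crochet, et un ressort."))
```

For the sample sentence this leaves only the content lemmas `dispositif`, `comprendre`, `crochet`, and `ressort`.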
Vector space model • type of linguistic units: single words; • type of context: aligned regions; • frequency measure: Boolean frequency (equal either to 1 or 0); • method used to compute the distance between vectors: cosine measure.
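A minimal sketch of such a model, assuming toy data: each word gets one Boolean coordinate per aligned region (1 if the word occurs in that region, 0 otherwise), and a French word is paired with the Russian word whose vector has the highest cosine similarity. The aligned regions and the function names below are illustrative assumptions, not the author's implementation.

```python
import math

# Hypothetical aligned regions after preprocessing:
# (French lemmas, Russian lemmas) for each aligned sentence pair.
aligned_regions = [
    (["traiter", "signal"], ["обработка", "сигнал"]),
    (["crochet", "signal"], ["крюкообразный", "сигнал"]),
    (["traiter", "donnée"], ["обработка", "данные"]),
]

def boolean_vector(word, regions, side):
    """One coordinate per aligned region: 1 if the word occurs there, else 0."""
    return [1 if word in region[side] else 0 for region in regions]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def best_correspondence(fr_word, regions):
    """Return the Russian word whose region vector is closest to the French word's."""
    ru_vocab = sorted({w for _, ru in regions for w in ru})
    fr_vec = boolean_vector(fr_word, regions, 0)
    return max(ru_vocab, key=lambda w: cosine(fr_vec, boolean_vector(w, regions, 1)))

print(best_correspondence("traiter", aligned_regions))
```

In this toy corpus "traiter" and "обработка" occur in exactly the same regions (the first and the third), so their vectors coincide and the pair is selected.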
Example (aligned region as a context) • Aligned region #1
Results • A list of translation correspondences. • Linguistic filter: the same part of speech. • Precision: 78%.
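The same-part-of-speech filter and the precision figure can be illustrated with a short sketch. The candidate pairs, the POS tag names, and the set of pairs judged correct are hypothetical toy data, and precision is assumed here to mean the share of extracted pairs judged correct; the slide does not spell out the exact evaluation procedure.

```python
# Hypothetical candidate pairs: (French lemma, French POS, Russian lemma, Russian POS).
candidates = [
    ("signal", "NOUN", "сигнал", "NOUN"),
    ("traiter", "VERB", "обработка", "NOUN"),
    ("connaître", "VERB", "известный", "ADJ"),
]

def same_pos(pairs):
    """Keep only pairs whose two members share a part of speech."""
    return [(fr, ru) for fr, fr_pos, ru, ru_pos in pairs if fr_pos == ru_pos]

def precision(extracted, judged_correct):
    """Share of extracted pairs that a human judge marked as correct."""
    if not extracted:
        return 0.0
    return sum(1 for pair in extracted if pair in judged_correct) / len(extracted)

filtered = same_pos(candidates)
print(filtered)

# Toy judgement set, for illustration only.
judged_correct = {("signal", "сигнал")}
print(precision(filtered, judged_correct))
```

With this toy input only the noun-noun pair passes the filter; the pairs it rejects are of the kind discussed on the slides on correspondences with different POS.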
Correspondences with different POS • Syntactic transformations • verbal infinitive (French) → noun (Russian) traiter (“to process”) → обработка (“processing”) • noun (French) → adjective (Russian) crochet (“hook”) → крюкообразный (“hook-shaped”) • verbal infinitive (French) → adjective (Russian) connaître (“to know”) → известный (“well-known”)
Correspondences with different POS • Parts of multi-word expressions: au moins (“at least”) → по меньшей мере (“at least”) • The output of the program: moins → мера
Evaluation • Eduardo Cendejas, Grettel Barceló, Alexander Gelbukh, Grigori Sidorov. Incorporating Linguistic Information to Statistical Word-Level Alignment // Proceedings of the 14th Iberoamerican Conference on Pattern Recognition, CIARP 2009, Guadalajara, Jalisco, Mexico, November 15-18, 2009. • Vector space model + similarity measures PMI, T-score, Log-likelihood ratio and Dice coefficient. • Precision: 53%.
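For orientation, two of the association measures named on this slide can be computed from co-occurrence counts as sketched below. These are the common textbook definitions of the Dice coefficient and pointwise mutual information; the cited paper may define or normalize them differently, and the function names and the numbers in the example are assumptions.

```python
import math

def dice(c_xy, c_x, c_y):
    """Dice coefficient: 2 * joint count / (count of x + count of y)."""
    total = c_x + c_y
    return 2 * c_xy / total if total else 0.0

def pmi(c_xy, c_x, c_y, n):
    """Pointwise mutual information (base 2) from counts over n aligned regions."""
    if c_xy == 0:
        return float("-inf")
    return math.log2(c_xy * n / (c_x * c_y))

# Toy counts: both words occur in 2 of 8 regions, always together.
print(dice(2, 2, 2))
print(pmi(2, 2, 2, 8))
```

Here `c_x` and `c_y` are the numbers of aligned regions containing each word, `c_xy` is the number containing both, and `n` is the total number of regions.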
Conclusion • Distributional semantics methodology can be used to extract translation correspondences from a parallel corpus with a high level of precision (78% on the test corpus). • It can be used to study productive syntactic transformations occurring in translation. • The present vector space model needs to be enhanced to take into account multi-word expressions.