Improving Translation Selection using Conceptual Vectors • LIM Lian Tze • Computer Aided Translation Unit, School of Computer Sciences, Universiti Sains Malaysia
Presentation Overview • Problem Background & Motivation • Research Objectives • Methodology • Advantages & Contributions
Natural Language is Ambiguous • Example: “bank” [Figure: two candidate meanings of the word, shown as images]
Word Sense Disambiguation • Given: a list of meanings/senses of words (dictionaries) and input text containing occurrences of ambiguous words • Task: assign the correct sense to each instance of an ambiguous word in context • A.k.a. “sense-tagging” • Senses: bank#1: a financial institution that accepts deposits and channels the money into lending activities; bank#2: sloping land (especially the slope beside a body of water) • Example: “…withdraw money from the bank...” is tagged bank#1
Disambiguation in Machine Translation (1) • Senses and Malay translations: bank#1: a financial institution that accepts deposits and channels the money into lending activities → bank; bank#2: sloping land (especially the slope beside a body of water) → tebing • English input: …withdraw money from the bank... • Sense-tag (WSD): …withdraw money from the bank#1... • Select translation word • Malay output: …mengeluarkan wang dari bank... • That worked well…
Disambiguation in Machine Translation (2) • Sense: circulation#6: the spread or transmission of something (as news or money) to a wider group or area • Malay translations: edaran (for money), penyebaran (for berita, “news”) • English input: …50 ringgit notes in circulation... • Sense-tag (WSD): …50 ringgit notes in circulation#6... • Translate • Malay output: …duit kertas 50 ringgit dalam edaran?? penyebaran?... • That DIDN’T work well: one sense maps to two translation words, so sense-tagging alone cannot choose between them
Optimising WSD for MT (Lee and Kim 2002) • [Diagram: selection steps linking Input word, Sense number and Translation word]
Main Objective • Existing MT system: • Selects fragments (translation units) from previously translated examples • Re-combines the selected translation units to produce translation output for new input text • Goal: improve the translation quality of this MT system by adapting a WSD algorithm specifically for MT purposes
Need semantic knowledge about… • Word senses • Use dictionary definitions • Pairs of translation words • From bilingual knowledge bank (BKB) made up of pairs of sentences that are translations of each other • Corresponding words in each translation sentence pair are explicitly marked • Need a model to capture semantic knowledge of lexical items • Conceptual Vectors (Lafourcade 2001) • Using a selection of concepts or themes • Construct mathematical vectors from concepts • Thematic similarity between lexical items ≡ angle between CVs
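The conceptual vector idea on this slide can be sketched in a few lines. This is a minimal illustration, not the Lafourcade (2001) implementation: the concept inventory `CONCEPTS`, the helper names `make_cv` and `angular_distance`, and the hand-assigned weights are all assumptions made for the example. Each lexical item becomes a vector with one component per concept, and thematic similarity is read off the angle between two vectors (a smaller angle means closer themes).

```python
import math

# Hypothetical concept inventory; a real system would use a full concept hierarchy.
CONCEPTS = ["finance", "geography", "water", "communication"]

def make_cv(weights):
    """Build a conceptual vector from a {concept: weight} mapping."""
    return [float(weights.get(c, 0.0)) for c in CONCEPTS]

def angular_distance(x, y):
    """Angle in radians between two non-zero vectors (0 means identical themes)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    if norm == 0:
        raise ValueError("angular distance is undefined for a zero vector")
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))

# Illustrative, hand-assigned vectors (assumed values, not taken from the talk).
bank_1 = make_cv({"finance": 1.0})                  # financial institution
bank_2 = make_cv({"geography": 1.0, "water": 1.0})  # sloping land beside water
money = make_cv({"finance": 1.0})
river = make_cv({"water": 1.0, "geography": 0.5})
```

With these assumed weights, `money` lies closer to `bank_1` than to `bank_2`, while `river` lies closer to `bank_2`.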
Need to: • Compile CVs for word meanings on 2 levels: • Word sense (from dictionary) • Word/phrase translation unit (from BKB) using data compiled from previous step • Use compiled information during translation runtime to select correct translation units
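The two-level compilation described above might look roughly like the following. This is a sketch under stated assumptions, not the system presented in the talk: the data structures `SENSE_CVS` and `BKB`, the functions `compile_profiles` and `select_translation`, and the choice to build a translation unit's CV by summing the CVs of its sense-tagged context words are all hypothetical.

```python
import math
from collections import defaultdict

CONCEPTS = ["finance", "news", "media"]  # hypothetical concept inventory

# Level 1 (assumed data): CVs for sense-tagged dictionary entries.
SENSE_CVS = {
    "note#1": [1.0, 0.0, 0.0],
    "ringgit#1": [1.0, 0.0, 0.0],
    "newspaper#1": [0.0, 1.0, 1.0],
    "rumour#1": [0.0, 1.0, 0.0],
}

# Hypothetical BKB entries: (source word, translation, sense-tagged context words).
BKB = [
    ("circulation", "edaran", ["note#1", "ringgit#1"]),
    ("circulation", "penyebaran", ["rumour#1", "newspaper#1"]),
]

def add(x, y):
    """Component-wise vector sum."""
    return [a + b for a, b in zip(x, y)]

def angle(x, y):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def compile_profiles(bkb, sense_cvs):
    """Level 2: one CV per (source word, translation), summed over BKB contexts."""
    profiles = defaultdict(lambda: [0.0] * len(CONCEPTS))
    for source, translation, context in bkb:
        for tag in context:
            key = (source, translation)
            profiles[key] = add(profiles[key], sense_cvs[tag])
    return dict(profiles)

def select_translation(source, context_cv, profiles):
    """Run time: pick the translation whose profile is closest in angle."""
    candidates = {t: cv for (s, t), cv in profiles.items() if s == source}
    return min(candidates, key=lambda t: angle(context_cv, candidates[t]))

profiles = compile_profiles(BKB, SENSE_CVS)
```

For example, a context vector dominated by the `finance` concept, such as `[2.0, 0.0, 0.0]`, selects `edaran` for `circulation` under these assumed profiles, without assigning a sense number first.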
Brief Outline • [Diagram with two phases. Data Preparation Phase: Dictionary / Lexicon; word senses; tag “clues” (Concept Category Labels); word → sense number level knowledge; BKB; Examples; Translation units; Translation Unit Profile (word → translation level knowledge). EBMT Run-time Phase: Input Text; matching, comparison, selection; selected translation units; Translated Text]
During Translation • [Diagram: same components as the Brief Outline slide, covering the Data Preparation Phase (Dictionary / Lexicon, word senses, tag “clues”, Concept Category Labels, BKB, Examples, Translation units, Translation Unit Profile) and the EBMT Run-time Phase (Input Text; matching, comparison, selection; selected translation units; Translated Text)]
Some Results • Translating ‘circulation’ to Malay: edaran or penyebaran • TS: proposed translation selection using CVs • BS: baseline strategy, which chooses • the translation that co-occurs with the same input words (and same structure) as in the BKB • or the most frequently occurring translation
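For comparison, the baseline strategy (BS) can be approximated as below. This is a simplified sketch: it checks only shared context words and ignores the "same structure" condition, and the example data in `BKB_EXAMPLES` and the function name `baseline_select` are invented for illustration, not results or data from the talk.

```python
from collections import Counter

# Hypothetical BKB examples: (source word, translation, co-occurring input words).
BKB_EXAMPLES = [
    ("circulation", "edaran", {"notes", "ringgit"}),
    ("circulation", "edaran", {"coins"}),
    ("circulation", "penyebaran", {"news"}),
]

def baseline_select(source, input_context, examples):
    """Prefer a translation seen with the same context words; else the most frequent."""
    relevant = [(t, ctx) for s, t, ctx in examples if s == source]
    if not relevant:
        return None  # the source word never appears in the BKB
    # Score each example by how many context words it shares with the input.
    overlap, best = max(
        ((len(ctx & input_context), t) for t, ctx in relevant),
        key=lambda pair: pair[0],
    )
    if overlap > 0:
        return best
    # Fallback: the translation that occurs most often for this source word.
    return Counter(t for t, _ in relevant).most_common(1)[0][0]
```

When the input shares no context words with any stored example, this baseline can only fall back on frequency, which is the situation the CV-based selection (TS) is meant to handle better.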
Advantages and Weaknesses • Pros: • Optimized for EBMT • Focuses on translation selection, bypassing intermediate WSD at run time • Handles many-to-many mappings between source word senses and translation words • Allows bi-directional translation with sense-tagging for only one language • Mathematical operations on vectors are easy to implement • Avoids the combinatorial effect when the input contains multiple ambiguous words • Cons: • Not all ambiguities can be resolved using co-occurring concepts • Does not handle translation selection of function words • Manual work is required in data preparation
Research Contributions • Adaptation of a WSD approach for the specific aim of translation selection • Proposal of specific guidelines for assigning related concepts for word meanings from dictionaries • Production of knowledge about word meanings on two levels: • Word senses as in dictionaries • Translations as in parallel text
Summary • WSD can be customized for different NLP applications accordingly • Different requirements • Increase efficiency • WSD and related tasks based on concepts common to co-occurring word senses can be facilitated using conceptual vector model • Requires a concept category hierarchy and word sense list • Concepts related to a word sense modelled as mathematical vector • Conceptual similarity = angular distance between vectors • Future work • Automating data preparation tasks • Investigating suitable weights or normalizing factors during CV manipulation • Integration with other WSD or translation selection strategies
Future Work • Automate tagging tasks that are currently done manually • Investigate different weight values for CVs for different syntactic relations or word classes • Integrate with other WSD/translation selection tasks