Context Independent Term Mapper for European Languages

Mārcis Pinnis (marcis.pinnis@tilde.lv) www.taas-project.eu

Outline • Definition of the problem • The overall design of the term mapper • The term pre-processing • The term mapping method • Evaluation

Why is terminology important? • Professional communication • Translation • Human translation • Machine translation • Example – affect vs. effect, credit card vs. id card • Terminology does matter – misuse of terminology can cause consequences

What Is It All About? • Term glossaries • Used by humans in professional communication • Used by machine users in machine translation, computer-aided translation, etc. • Manual term glossary creation is a time-consuming process • Automation of certain steps of term glossary creation can help minimise the time required to create term glossaries • Term glossaries can be acquired from parallel/comparable corpora using monolingual term candidate extraction and cross-lingual term mapping techniques

Making terminology available

How do we identify term candidates with automatic methods? * POS/MS tagging – part-of-speech / morpho-syntactic tagging More on term tagging: Pinnis et al., 2012.

An example excerpt from TaaS Gold Standard Data for English
Plain text:
Today, most Wi-Fi networks provide a fraction of the capacity enjoyed by wired users.
Tokenised, POS-, lemma- and term-tagged (columns: token, POS tag, lemma, term tag, score):
Today	NN	today	O	0
,	,	,	O	0
most	JJS	most	O	0
Wi-Fi	NP	wi-fi	B-TERM	0.5
networks	NNS	network	I-TERM	0.5
provide	VVP	provide	O	0
a	DT	a	O	0
fraction	NN	fraction	O	0
of	IN	of	O	0
the	DT	the	O	0
capacity	NN	capacity	B-TERM	0.02
enjoyed	VVN	enjoy	O	0
by	IN	by	O	0
wired	VVN	wire	B-TERM	0.2
users	NNS	user	I-TERM	0.2
.	SENT	.	O	0
Term-tagged:
Today, most <TENAME MSD="NP NNS" SCORE="0.5" LEMMA="wi-fi network">Wi-Fi networks</TENAME> provide a fraction of the <TENAME MSD="NN" SCORE="0.02" LEMMA="capacity">capacity</TENAME> enjoyed by <TENAME MSD="VVN NNS" SCORE="0.2" LEMMA="wire user">wired users</TENAME>.
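As a rough illustration of how the B-TERM/I-TERM tags in the excerpt delimit term candidates, the sketch below groups tagged tokens into terms. The row layout (token, POS tag, lemma, term tag, score) is read off the excerpt; the function name `collect_terms` and the dictionary keys are hypothetical and not part of any TaaS tool.

```python
# Rows follow the excerpt: (token, POS tag, lemma, term tag, score).
ROWS = [
    ("Today", "NN", "today", "O", 0.0),
    (",", ",", ",", "O", 0.0),
    ("most", "JJS", "most", "O", 0.0),
    ("Wi-Fi", "NP", "wi-fi", "B-TERM", 0.5),
    ("networks", "NNS", "network", "I-TERM", 0.5),
    ("provide", "VVP", "provide", "O", 0.0),
    ("a", "DT", "a", "O", 0.0),
    ("fraction", "NN", "fraction", "O", 0.0),
    ("of", "IN", "of", "O", 0.0),
    ("the", "DT", "the", "O", 0.0),
    ("capacity", "NN", "capacity", "B-TERM", 0.02),
    ("enjoyed", "VVN", "enjoy", "O", 0.0),
    ("by", "IN", "by", "O", 0.0),
    ("wired", "VVN", "wire", "B-TERM", 0.2),
    ("users", "NNS", "user", "I-TERM", 0.2),
    (".", "SENT", ".", "O", 0.0),
]


def collect_terms(rows):
    """Group B-TERM/I-TERM runs into term candidates (hypothetical helper)."""
    terms, current = [], None
    for token, pos, lemma, tag, score in rows:
        if tag == "B-TERM":
            # A new term starts; close the previous one if it is still open.
            if current is not None:
                terms.append(current)
            current = {"tokens": [token], "lemmas": [lemma], "msd": [pos], "score": score}
        elif tag == "I-TERM" and current is not None:
            current["tokens"].append(token)
            current["lemmas"].append(lemma)
            current["msd"].append(pos)
        else:
            # Any other tag ends the open term.
            if current is not None:
                terms.append(current)
            current = None
    if current is not None:
        terms.append(current)
    return terms


for term in collect_terms(ROWS):
    print(" ".join(term["tokens"]), "|", " ".join(term["lemmas"]), "|", term["score"])
```

Each resulting term carries its surface form, lemma sequence, and morpho-syntactic tags, which is the information the TENAME elements encode.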

The Context Independent Term Mapper for European Languages • Performs term mapping using only: • Term pairs as input data (the context is discarded) • Linguistic resources built from a dictionary (probabilistic or manually crafted) • ... if there is one available • Can be run on term-tagged data that comes from sources of different lengths: • 1 term vs. a technical report • a PhD thesis vs. a news article • Comparable corpora • one term pair • etc. (the length of the context in which terms have been identified is irrelevant) • Works in a quadratic search space for: • Term lists • Document pairs

The Design of the Mapper • 2 main components: • Term pre-processing • Term mapping • A prototype consolidation method is integrated in the public release under: • https://github.com/pmarcis/mp-aligner

The Term Pre-Processing Module • Each term is pre-processed before mapping using resources that are available for a given language pair • Pre-processing: • increases the search space • Generates alternative variants of each token of a term

Pre-processing methods • The following pre-processing methods can be used: • Lowercasing of a token • Motorway → motorway (English → English) • Simple transliteration – each token is rewritten in English letters • индустрия → industrija (Russian → English) • Transliteration (Moses SMT character-based translation) • coordination → koordinācija, koordināciju, koordinācijas (English → Latvian) • Translation (dictionary-based) • computer → dators, kompjūters, skaitļotājs (English → Latvian)
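The two resource-free methods, lowercasing and simple transliteration, can be sketched as follows. This is only an illustration: the character table `CYRILLIC_TO_LATIN` is a hypothetical fragment covering just the letters of the slide's example, and `token_variants` is an assumed name, not the mapper's actual interface.

```python
# Hypothetical fragment of a character table; a real table would cover the full alphabet.
CYRILLIC_TO_LATIN = {
    "и": "i", "н": "n", "д": "d", "у": "u",
    "с": "s", "т": "t", "р": "r", "я": "ja",
}


def simple_transliterate(token):
    """Rewrite each character in Latin letters; unknown characters are kept as-is."""
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in token)


def token_variants(token):
    """Return the token plus its lowercased and transliterated variants, without duplicates."""
    lowered = token.lower()
    variants = [token]
    for candidate in (lowered, simple_transliterate(lowered)):
        if candidate not in variants:
            variants.append(candidate)
    return variants


print(token_variants("Motorway"))
print(token_variants("индустрия"))
```

Every variant added this way enlarges the set of strings the mapper can later compare, which is why pre-processing increases the search space.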

Transliteration Modules • Trained using data extracted from the initial dictionaries • Translations with a Levenshtein distance-based similarity higher than a threshold are used as training and tuning data • The Moses SMT system is used to train transliteration (character-based translation) models • Examples of training data: • a k t i v i t ā t e s → a c t i v i t y • h a l o g e n ē t i → h a l o g e n a t e d • s a n k c i j a s → s a n c t i o n s
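A minimal sketch of how such training data could be selected from a dictionary is given below. The slides do not define the similarity formula or the threshold, so both are assumptions here: similarity is taken as 1 minus the Levenshtein distance divided by the longer string's length, and the threshold of 0.5 is arbitrary. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Assumed normalisation: 1 - distance / length of the longer string."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def training_pairs(dictionary, threshold=0.5):
    """Keep orthographically similar pairs and split them into space-separated characters."""
    for source, target in dictionary:
        if similarity(source, target) > threshold:
            yield " ".join(source), " ".join(target)


toy_dictionary = [("sankcijas", "sanctions"), ("dators", "computer")]
for src, tgt in training_pairs(toy_dictionary):
    print(src, "|||", tgt)
```

The pair "dators"/"computer" is dropped because the two words share almost no characters, so only pairs that look like transliterations reach the character-based model.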

Translation Modules • The term mapper can (if available) use a dictionary to search for term token translation hypotheses • Dictionaries can be: • Filtered probabilistic dictionaries • Manually created dictionaries • Format: the simple Giza++ format

An Example of a Pre-processed Term Pair

The Term Mapping Module • The task of the mapping module is to decide whether a term pair can be mapped or not. • The term mapping is performed in three steps: • Identification of content overlaps • Maximisation of content overlaps • Scoring of the term pair

Bi-directional comparison sets • After pre-processing, terms may have multiple alternative variants (depending on what linguistic resources were used) • Each alternative variant is treated as a full substitute of the term

Identification of Content Overlaps • At first, for every alternative variant of a pre-processed token, we identify the longest common substring in all of the other term's pre-processed alternative variants that are in the same language • If the longest common substring overlap does not exceed a threshold, the mapper uses a fall-back method based on the Levenshtein distance

Source and Target Token Content Overlap Identification

Content Overlap Identification for Multi-word terms and Compounds • The result of this step is a list of binary alignment maps for constituent pairs. • The binary alignment maps for “chemotherapiedosis” and “dose” are “000000000000011100” and “1110”
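The binary alignment maps for this example can be reproduced with a short sketch that finds the longest common substring and marks its position in both strings. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in; the mapper's own implementation may differ, and `overlap_maps` is a hypothetical name.

```python
from difflib import SequenceMatcher


def overlap_maps(source, target):
    """Mark the longest common substring with 1s in a binary map for each string."""
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    match = matcher.find_longest_match(0, len(source), 0, len(target))

    def mark(length, start):
        # 0s before the overlap, 1s over it, 0s after it.
        return "0" * start + "1" * match.size + "0" * (length - start - match.size)

    return mark(len(source), match.a), mark(len(target), match.b)


source_map, target_map = overlap_maps("chemotherapiedosis", "dose")
print(source_map)
print(target_map)
```

Here the shared substring is "dos", so three positions are set in each map and the trailing "e" of "dose" stays unaligned.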

An Example of the Fall-back Method Using Levenshtein Distance

Invalid Alignment Dictionary • A dictionary can tell not only what is translated as what, but also which translations are wrong • Building on this idea, pairs of words which are not paired together in a dictionary, but have a Levenshtein distance-based similarity above a threshold, are used as entries for the invalid alignment dictionary • Examples from the German-English dictionary: • rivale – revival • replik – reply • genetik – generic
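The construction can be sketched as a cross-comparison of all source and target words in a dictionary. The toy dictionary, the similarity normalisation (1 minus distance over the longer length), and the threshold of 0.5 are assumptions made for illustration only; the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Assumed normalisation: 1 - distance / length of the longer string."""
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)


def invalid_alignments(dictionary, threshold=0.5):
    """Collect similar-looking word pairs that the dictionary does NOT pair together."""
    valid = set(dictionary)
    sources = {s for s, _ in dictionary}
    targets = {t for _, t in dictionary}
    return {
        (s, t)
        for s in sources
        for t in targets
        if (s, t) not in valid and similarity(s, t) > threshold
    }


# Toy German-English dictionary, invented for this example.
toy_dictionary = [
    ("rivale", "rival"),
    ("genetik", "genetics"),
    ("wiederbelebung", "revival"),
    ("generisch", "generic"),
]
print(sorted(invalid_alignments(toy_dictionary)))
```

With this toy input the look-alike pairs rivale/revival and genetik/generic end up in the invalid set, so a later string-similarity match between them can be rejected.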

Maximisation of Content Overlaps • The binary alignment lists that were identified in the previous step are used to identify the mapping sequence that maximises the content overlap between the two terms • The process is sequentially bi-directional • For each source and target term's token we create one binary alignment map that shows how much content can be nested from the source term to the target term (and vice versa)

Alternative Variants of Terms That Maximise the Content Overlap

Example of the Content Overlap Maximisation for Multi-word terms and Compounds

Scoring of Content Overlaps (and the term pair) • The alternative variants of the terms (source and target) that achieve the maximal content overlap are concatenated into two strings (keeping the alignment order) • Non-aligned source and target tokens (if there are any) are attached at the end of each string • The resulting strings are scored using the Levenshtein distance-based similarity metric multiplied by negative multipliers • Negative multipliers are used for translation and transliteration alternatives (if the resources provide such scores)
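The scoring step might look roughly like the sketch below. Several details are assumptions because the slides do not spell them out: tokens are joined with spaces, similarity is 1 minus the Levenshtein distance over the longer string's length, and each "negative multiplier" is modelled as a penalty factor in (0, 1] attached to a translated or transliterated variant. The name `score_pair` and its argument layout are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Assumed normalisation: 1 - distance / length of the longer string."""
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)


def score_pair(aligned, unaligned_source=(), unaligned_target=()):
    """aligned: list of (source_variant, target_variant, multiplier) in alignment order."""
    # Aligned variants come first; non-aligned tokens are attached at the end.
    source = " ".join([s for s, _, _ in aligned] + list(unaligned_source))
    target = " ".join([t for _, t, _ in aligned] + list(unaligned_target))
    score = similarity(source, target)
    for _, _, multiplier in aligned:
        score *= multiplier  # assumed penalty for translated/transliterated variants
    return score


# One perfectly matching variant that came from a resource with an assumed penalty of 0.8.
print(score_pair([("koordinācija", "koordinācija", 0.8)]))
# The same match with an extra non-aligned source token lowers the similarity.
print(score_pair([("koordinācija", "koordinācija", 1.0)], unaligned_source=["x"]))
```

Appending non-aligned tokens means unmatched material always lowers the score, while the multipliers let matches obtained through less reliable resources count for less than exact string matches.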

Evaluation • Automatic • EuroVoc thesaurus • Remember – quadratic search space • Manual • Comparable Web crawled medical corpus for English-Latvian

Automatic Evaluation

Resources are Important (if you value recall) • English-Latvian results with variable availability of linguistic resources

Direct (source <-> target) Resources are even More Important • Latvian-Lithuanian results with variable availability of linguistic resources

Automatic Evaluation Results Automatic evaluation performed on a thesaurus of official terminology of the European Union (EuroVoc)

The Evaluation for Different Language Pairs • The results (if you brought a telescope) are a little outdated; however, they show that there is out-of-the-box support for at least 23*22 language pairs (25*24 are expected)

Manual Evaluation Results Manual evaluation performed on a medical domain Latvian-English comparable corpus

In a Summary – Positive • The mapper is able to map multi-word terms and compound terms • Linguistic resources can be easily generated from a dictionary • If direct resources are unavailable, interlingua resources can still be used • Relatively high precision (up to and over 90%), which can be tuned for the right task (recall vs. precision) • A lot of room for improvements: • Monolingual invalid alignment dictionaries for filtering • Precision can be easily boosted if higher quality probabilistic dictionaries are used

In a Summary – Not so Positive • Cannot map original terms that are not transliterations if they are not in the dictionary • E.g., Latvian-English: dators & computer • This, however, is the case for all context-independent methods... • Recall is limited if dictionaries are not available • Only terms that are transliterated will be mapped.

Thank you! • The mapper is available on Github: • https://github.com/pmarcis/mp-aligner • If you are looking for the linguistic resources (dictionaries, Moses transliteration modules, etc.), drop me an e-mail: Marcis.Pinnis@tilde.lv • Follow our research on TaaS: http://taas-project.eu/

References used in the presentation Pinnis, M., Ljubešić, N., Ştefănescu, D., Skadiņa, I., Tadić, M., & Gornostay, T. (2012). Term Extraction, Tagging, and Mapping Tools for Under-Resourced Languages. Proceedings of the 10th Conference on Terminology and Knowledge Engineering (TKE 2012) (pp. 193–208). Madrid.
