Checking Terminology Consistency with Statistical Methods LRC XIII 2nd October 2008
About this presentation • Introduction • Internal Consistency Check • Step 1: Mine Source Terms • Step 2: Identify translations of Source Terms (Alignment) • Step 3: Consistency Check • Current Challenges • Tips • Future Improvements
Introduction • Terminology consistency: a key element of localised language quality • Terminology consistency is difficult to maintain: • It is hard to keep source and target in sync during the dev/loc process • Translation is done by several people (often working remotely) • Terminology changes (e.g. between product versions) • Manual Language Quality Assurance (QA) can help, however: • QA costs time and money • QA usually concentrates on a sample of the text • The reviewer must be familiar with the reference material • It is hard for humans to keep track of terminology
Introduction • Can we use technology to control consistency? • Yes, but… • Existing tools require term lists or term bases • Not all software companies have term bases set up • Companies that do have term bases won’t have every single term captured – building a term base is always a work in progress
Introduction • Our approach doesn't require a term base • Using Term Mining technology, we identify terms in the source strings • We then check the translation consistency of the mined terminology
Internal Consistency Check • [Diagram: three example segments (1, 2, 3) are compared; one translation is flagged with "Inconsistency!"]
Step 1: Source Term Mining • Bigram and trigram extraction • Noun phrases of the form Noun + Noun and Noun + Noun + Noun • Verb phrases are filtered out: only 5% of terms • Adjective phrases are filtered out: only 2% of terms • Monogram nouns are filtered out: most are common words, and only 27% of terms are monograms • In the future we might cover Adj + Noun forms (a mining sketch follows)
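To make the mining step concrete, here is a minimal sketch of Noun + Noun bigram/trigram extraction, assuming NLTK's off-the-shelf tokenizer and POS tagger. It illustrates the general technique only; it is not the presenters' actual implementation.

```python
# Minimal sketch: mine Noun+Noun bigrams and Noun+Noun+Noun trigrams.
# Assumes NLTK is installed and the "punkt" and
# "averaged_perceptron_tagger" resources have been downloaded.
from collections import Counter
import nltk

def mine_terms(source_strings):
    """Count candidate terms whose tokens are all tagged as nouns."""
    terms = Counter()
    for text in source_strings:
        tagged = nltk.pos_tag(nltk.word_tokenize(text.lower()))
        for n in (2, 3):                      # bigrams and trigrams only
            for i in range(len(tagged) - n + 1):
                window = tagged[i:i + n]
                if all(tag.startswith("NN") for _, tag in window):
                    terms[" ".join(tok for tok, _ in window)] += 1
    return terms

strings = ["Type a value in the input field.",
           "The input field accepts numbers only."]
print(mine_terms(strings).most_common(3))  # e.g. [('input field', 2), ...]
```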
Step 2: Translation Alignment • Problem statement: Given a mined source term S, identify the corresponding target term T in the translation column. Example: mined term "input field" (S) → "champ d'entrée" (T)
Step 2: Translation Alignment • We need to consider all possible term combinations • We call each combination an NGram • NGrams: where N = 2, 3, 4, maybe 5. For languages like German we even consider N = 1 • How do we decide which NGram is the correct translation for the term? • Bayesian statistics can help!
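As a rough illustration of the candidate-generation step, the sketch below enumerates every contiguous token window of the target string as a candidate NGram. Whitespace tokenisation is an assumption here; Asian-language targets would be word-broken first, as the Tips slide notes.

```python
def candidate_ngrams(target_string, n_values=(2, 3, 4)):
    """Enumerate every contiguous token window as a candidate target term."""
    tokens = target_string.split()   # assumes whitespace-tokenised text
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

# "champ d'entrée" appears among the candidates for this French string:
print(list(candidate_ngrams("entrez du texte dans le champ d'entrée")))
```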
Step 2: Translation Alignment • Problem statement: Given a source term S, obtain the NGram T that maximises the conditional probability function:
$\hat{T} = \arg\max_{T} P(T \mid S)$ [1]
But how do we calculate this?!
Step 2: Translation Alignment
Well, the multiplication rule of conditional probability tells us that
$P(S \cap T) = P(T \mid S) \, P(S)$
So [1] becomes:
$P(T \mid S) = \frac{P(S \cap T)}{P(S)}$ [2]
And we also know that:
$P(S \cap T) \approx \frac{|STSeg|}{|NGrams|}$
|NGrams| is the number of NGrams of the same N as T. For example, if T is a two-word term (a bigram), |NGrams| is the total number of NGrams made up of 2 words. |STSeg| is the number of segments (strings) that contain both S in the source column and T in the target column.
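To make [2] concrete with purely hypothetical numbers: if the target column contains 10,000 bigrams in total and 40 segments contain both "input field" (S) and the bigram "champ d'entrée" (T), then $P(S \cap T) \approx 40 / 10{,}000 = 0.004$, and $P(T \mid S) = 0.004 / P(S)$.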
Step 2: Translation Alignment • In our Best Target Term Selection Routine we compare the probabilities of different target terms ($T_k$'s):
$P(T_k \mid S) = \frac{P(S \cap T_k)}{P(S)}$
• Since P(S) remains constant during these comparisons, we can eliminate it. We call the resulting quantity $I(T_k)$:
$I(T_k) = P(S \cap T_k) \approx \frac{|ST_kSeg|}{|NGrams|}$ [3]
• The candidate $T_k$ with the highest I is our Best Target Term Candidate (a code sketch of this routine follows)
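The selection routine implied by [3] can be sketched as follows. The function name and the whitespace tokenisation are illustrative assumptions, not the presenters' code.

```python
from collections import Counter

def best_target_term(source_term, segments, n_values=(2, 3, 4)):
    """Pick the candidate T_k maximising I(T_k) = |ST_kSeg| / |NGrams|.

    `segments` is a list of (source_string, target_string) pairs.
    """
    co_counts = Counter()      # |ST_kSeg|: segments containing both S and T_k
    ngram_totals = Counter()   # |NGrams|: total NGram count per length N
    for src, tgt in segments:
        tokens = tgt.split()
        seen = set()
        for n in n_values:
            ngram_totals[n] += max(len(tokens) - n + 1, 0)
            for i in range(len(tokens) - n + 1):
                seen.add(" ".join(tokens[i:i + n]))
        if source_term in src:
            co_counts.update(seen)   # each candidate counted once per segment

    def score(term):
        n = len(term.split())
        return co_counts[term] / ngram_totals[n] if ngram_totals[n] else 0.0

    return max(co_counts, key=score) if co_counts else None
```

Normalising by the NGram total for the candidate's own length is what keeps bigram and trigram candidates comparable to each other.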
Step 2: Translation Alignment • Normalisation • Depending on context, any particular term can be translated in slightly different ways. For example, "file name" could be translated in Spanish as: • nombre de archivo • nombre del archivo • nombres de archivo • nombres de archivos • nombres de los archivos • Our algorithm has to be clever enough to realise that "nombres de archivo" is just a form of "nombre de archivo".
Step 2: Translation Alignment • Normalisation • So, during NGram generation, we need to generate regular expressions for our terms • Since Asian languages do not inflect, regular expressions are simpler for these languages • For European languages we use more complex regular expressions (an illustrative pattern follows)
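As an illustration, a normalisation pattern for the Spanish term "nombre de archivo" might look like the sketch below. This particular regex is hypothetical, chosen only to cover the five variants listed on the previous slide; the presenters' actual rules are not shown.

```python
import re

# Hypothetical pattern: optional plural -s on both nouns, and an optional
# article ("del", "de los", "de las") between them.
pattern = re.compile(r"\bnombres?\s+del?(?:\s+l[oa]s?)?\s+archivos?\b",
                     re.IGNORECASE)

variants = ["nombre de archivo", "nombre del archivo",
            "nombres de archivo", "nombres de archivos",
            "nombres de los archivos"]
print(all(pattern.search(v) for v in variants))  # True: all five match
```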
Step 3: Consistency Check • Detect the strings that do not use any of our admitted translations • Report these strings along with our findings to the user
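Putting the pieces together, the reporting step might look like this sketch, which reuses the hypothetical regex patterns from the normalisation step.

```python
def check_consistency(source_term, admitted_patterns, segments):
    """Report segments containing the source term whose translation
    matches none of the admitted target-term patterns."""
    return [(src, tgt) for src, tgt in segments
            if source_term in src
            and not any(p.search(tgt) for p in admitted_patterns)]
```

Each reported (source, target) pair is then shown to the user alongside the admitted translations, so a reviewer can decide whether it is a genuine inconsistency or a false positive.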
Current Challenges • False Positives • Due to “heavy” rephrasing • Unreliable for short, generic monograms
Current Challenges • Verbs can potentially cause problems • Due to high inflection: amar => amo, amas, ama, amamos, amáis, aman; venir => vengo, vienes, viene, venimos, venís, vienen • Difficult to differentiate from other parts of speech • Not all languages are supported: • Arabic • Complex Script languages
Current Challenges • Best Candidate Selection logic is very good, but it's not perfect: about 70% of term selections are correct. [Screenshot: examples of correct selections, incorrect selections, and the correct term highlighted]
Tips • Make sure your data is reasonably clean: • Remove any HTML/XML tags from your strings (see the sketch below) • Filter out any unlocalised and non-localisable strings • For Asian languages, run a word breaker tool on your target strings (required for proper NGram handling)
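For the tag-removal tip, a naive regex is often enough for a first pass, though a real pipeline would be safer with a proper HTML/XML parser. This sketch is illustrative only.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")   # naive: strips anything tag-shaped

def strip_tags(s):
    """Remove HTML/XML tags before term mining and NGram generation."""
    return TAG_RE.sub(" ", s)

print(strip_tags("Type a value in the <b>input field</b>."))
```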
Tips • If you already have source term lists you're interested in, you can use them to bypass the term mining process • If your source terms are well selected (i.e. each has a precise technical meaning), you'll achieve very good results
Tips • The more data you have, the more accurate your results will be • Try combining software data with help / user education data to increase term repetitions
Future Improvements • More work with Adj + Noun • Work with verbs • Add support for Complex Script languages and languages that inflect on different parts of the word • Further refine Best Translation Candidate Selection logic