
Checking Terminology Consistency with Statistical Methods



  1. Checking Terminology Consistency with Statistical Methods LRC XIII 2nd October 2008

  2. About this presentation • Introduction • Internal Consistency Check • Step 1: Mine Source Terms • Step 2: Identify translations of Source Terms (Alignment) • Step 3: Consistency Check • Current Challenges • Tips • Future Improvements

  3. Introduction • Terminology Consistency: A key element of localised language quality • Terminology Consistency: Difficult to maintain • Difficult to keep source and target in sync during the dev/loc process • Translation done by several people (often working remotely) • Terminology changes (e.g. between product versions) • Manual Language Quality Assurance (QA) can help, however: • QA costs time and money • QA usually concentrates on a sample of the text • Reviewer must be familiar with reference material • It’s hard for humans to keep track of terminology

  4. Introduction • Can we use technology to control consistency? • Yes, but… • Existing tools require term lists or term bases • Not all software companies have term bases set up • Companies that do have term bases won’t have every single term captured – building a term base is always a work in progress

  5. Introduction • Our approach doesn’t require a term base • Using term mining technology, we identify terms in the source strings • We then check the translation consistency of the mined terminology

  6. Internal Consistency Check • [Diagram: the same source term across segments 1, 2 and 3, with one divergent translation flagged: Inconsistency!]

  7. Step 1: Source Term Mining • Bigram and trigram extraction • Noun phrases of the form Noun + Noun and Noun + Noun + Noun • Verb phrases excluded: only 5% of terms • Adjective phrases excluded: only 2% of terms • Monogram nouns excluded: most are common words, and only 27% of terms are monograms • In the future we might cover Adj + Noun forms
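To make Step 1 concrete, here is a minimal sketch of this kind of bigram/trigram noun-phrase mining in Python. It assumes NLTK's stock English tokenizer and POS tagger (the presentation does not name the tooling actually used), and the frequency cut-off is an illustrative choice:

# Sketch of Step 1: mine Noun+Noun and Noun+Noun+Noun candidate terms.
# Assumes NLTK with the 'punkt' and 'averaged_perceptron_tagger'
# resources downloaded; the real tool behind the slides is unknown.
from collections import Counter
import nltk

def mine_source_terms(strings, min_freq=2):
    """Return bigram/trigram candidates whose words are all tagged as nouns."""
    candidates = Counter()
    for s in strings:
        tagged = nltk.pos_tag(nltk.word_tokenize(s))   # [(word, POS), ...]
        for n in (2, 3):                               # bigrams, then trigrams
            for i in range(len(tagged) - n + 1):
                window = tagged[i:i + n]
                if all(tag.startswith("NN") for _, tag in window):
                    candidates[" ".join(w.lower() for w, _ in window)] += 1
    # A term seen only once gives no consistency signal, so drop it.
    return {t: c for t, c in candidates.items() if c >= min_freq}

source = ["Type a value in the input field.",
          "The input field accepts numbers only."]
print(mine_source_terms(source))    # expected: {'input field': 2}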

  8. Step 2: Translation Alignment • Problem statement: Given a mined source term S, identify the corresponding target term T in the translation column. • Example: Mined term: “input field” (S) → “champ d’entrée” (T)

  9. Step 2: Translation Alignment • We need to consider all possible term combinations • We call each combination an NGram • We consider NGrams where N = 2, 3, 4, maybe 5. For languages like German we even consider N = 1 • How do we decide which NGram is the correct translation for the term? • Bayesian statistics can help!
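The NGram generation itself can be sketched in a few lines; the N sizes follow the slide, while lower-casing and punctuation stripping are my own simplifications:

import string

def target_ngrams(target_string, n_values=(2, 3, 4)):
    """All word NGrams of a target string for the N sizes we consider.
    For languages like German, add 1 to n_values (compounds are one word)."""
    words = [w.strip(string.punctuation)
             for w in target_string.lower().split()]
    grams = []
    for n in n_values:
        grams.extend(" ".join(words[i:i + n])
                     for i in range(len(words) - n + 1))
    return grams

print(target_ngrams("Le champ d'entrée est obligatoire."))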

  10. Step 2: Translation Alignment • Problem statement: Given a source term S, obtain the NGram T that maximises the conditional probability function: T* = argmax_T P(T | S)   [1] But how do we calculate this?!

  11. Step 2: Translation Alignment • Recall [1]: T* = argmax_T P(T | S) • Well, the multiplication rule of conditional probability tells us that P(S ∩ T) = P(T | S) · P(S) • So [1] becomes: T* = argmax_T P(S ∩ T) / P(S)   [2] • And we also know that: P(S ∩ T) ≈ |STSeg| / |NGrams| • |NGrams| is the number of NGrams of the same N as T. For example, if T is a 2-word term (a bigram), |NGrams| will be the number of NGrams made up of 2 words. • |STSeg| is the number of segments (strings) that contain both S in the source column and T in the target column.

  12. Step 2: Translation Alignment • In our Best Target Term Selection Routine we will be comparing probabilities of different target terms (Tk’s): P(Tk | S) = P(S ∩ Tk) / P(S) • Since P(S) remains constant during these comparisons, we can eliminate it. • We call the resulting equation I(Tk): I(Tk) = |STSegk| / |NGrams|   [3] (with |NGrams| counted for the same N as Tk) • The candidate Tk with the highest I is our Best Target Term Candidate
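Putting equation [3] to work, the following sketch selects the best candidate over an in-memory list of (source, target) segment pairs. It reuses target_ngrams from the sketch after slide 9, and the toy French segments are invented for illustration:

from collections import Counter

def best_target_term(source_term, segments, n_values=(2, 3, 4)):
    """Return the target NGram T that maximises I(T) = |STSeg| / |NGrams| [3]."""
    co_counts = Counter()   # |STSeg|: segments containing S plus candidate T
    totals = Counter()      # |NGrams|: total NGrams of each size N in the corpus
    for src, tgt in segments:
        grams = target_ngrams(tgt, n_values)
        for g in grams:
            totals[len(g.split())] += 1
        if source_term in src.lower():
            for g in set(grams):        # count each candidate once per segment
                co_counts[g] += 1
    scores = {t: c / totals[len(t.split())] for t, c in co_counts.items()}
    return max(scores, key=scores.get) if scores else None

segments = [
    ("Type a value in the input field.",
     "Saisissez une valeur dans le champ d'entrée."),
    ("The input field is required.",
     "Le champ d'entrée est obligatoire."),
    ("This input field accepts numbers only.",
     "Ce champ d'entrée accepte uniquement des nombres."),
]
print(best_target_term("input field", segments))   # expected: champ d'entrée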

  13. Step 2: Translation Alignment • Normalisation • Depending on context, any particular term can be translated in a slightly different way. For example, “file name” could be translated in Spanish as: • nombre de archivo • nombre del archivo • nombres de archivo • nombres de archivos • nombres de los archivos • Our algorithm has to be clever enough to realise that “nombres de archivo” is just a form of “nombre de archivo”.

  14. Step 2: Translation Alignment • Normalisation • So, during NGram generation, we need to generate regular expressions for our terms • Since Asian languages do not inflect, regular expressions are simpler for these languages • For European languages we use more complex regular expressions
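As an illustration of what such a regular expression might look like, the sketch below tolerates simple Spanish-style inflection. The heuristics (optional plural endings, de/del/de los variants) are my own guesses, not the actual rules behind the tool:

import re

def term_pattern(term):
    """Compile a regex matching simple inflected variants of a Spanish term."""
    parts = []
    for word in term.split():
        if word == "de":
            parts.append(r"de(?:l| las?| los)?")     # de / del / de la(s) / de los
        else:
            parts.append(re.escape(word) + r"e?s?")  # tolerate plural endings
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)

pat = term_pattern("nombre de archivo")
for variant in ("nombre de archivo", "nombres de archivos",
                "nombre del archivo", "nombres de los archivos"):
    print(variant, bool(pat.search(variant)))        # all True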

  15. Step 3: Consistency Check • Detect the strings that do not use any of our admitted translations • Report these strings along with our findings to the user
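A sketch of this final step, reusing term_pattern from the previous sketch; the admitted translations would come out of the alignment step, and the sample segments are invented:

def check_consistency(source_term, admitted_targets, segments):
    """Report segments that contain source_term but whose translation
    matches none of the admitted target variants."""
    patterns = [term_pattern(t) for t in admitted_targets]
    findings = []
    for i, (src, tgt) in enumerate(segments):
        if source_term in src.lower() and not any(p.search(tgt) for p in patterns):
            findings.append((i, src, tgt))
    return findings

segments = [
    ("Save the file name.", "Guarde el nombre de archivo."),
    ("File names must be unique.", "Los nombres de archivo deben ser únicos."),
    ("Enter a file name.", "Escriba un título de fichero."),  # inconsistent
]
for i, src, tgt in check_consistency("file name", ["nombre de archivo"], segments):
    print(f"Inconsistent segment {i}: {tgt!r}")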

  16. Current Challenges • False Positives • Due to “heavy” rephrasing • Unreliable for short, generic monograms

  17. Current Challenges • Verbs can potentially cause problems • Due to high inflection: amar => amo, amas, ama, amamos, amáis/aman, aman; venir => vengo, vienes, viene, venimos, venís/vienen, vienen • Difficult to differentiate from other parts of speech • Not all languages supported: • Arabic • Complex Script languages

  18. Current Challenges • Best Candidate Selection logic is very good, but it’s not perfect: about 70% of term selections are correct. • [Screenshot: examples of correct selections, incorrect selections, and the correct term highlighted]

  19. Tips • Make sure your data is reasonably clean: • Remove any HTML/XML tags from your strings • Filter out unlocalised and non-localisable strings • For Asian languages, run a word breaker tool on your target strings (required for proper NGram handling)
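For the tag-removal tip, something as simple as the sketch below can work on mostly clean data; the tag and placeholder patterns are crude illustrations, and real markup may warrant a proper parser:

import re

TAG = re.compile(r"<[^>]+>")                        # crude HTML/XML tag stripper
PLACEHOLDER = re.compile(r"\{\d+\}|%\d*\$?[sd]")    # e.g. {0}, %s, %1$s

def clean(s):
    """Strip markup and format placeholders before mining and alignment."""
    s = PLACEHOLDER.sub(" ", TAG.sub(" ", s))
    return re.sub(r"\s+", " ", s).strip()

print(clean("Click <b>Save</b> to store {0}."))     # -> 'Click Save to store .'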

  20. Tips • If you already have source term lists you’re interested in, you can use them to bypass the term mining process • If your source terms are well selected, you’ll achieve very good results – a well-selected source term has a precise technical meaning.

  21. Tips • The more data you have, the more accurate your results will be • Try combining software data with help / user education data to increase term repetitions

  22. Future Improvements • More work with Adj + Noun • Work with verbs • Add support for Complex Script languages and languages that inflect on different parts of the word • Further refine Best Translation Candidate Selection logic

  23. Questions?

  24. Thank You!
