A Universal Approach to Building QA Models
Leonid Glazychev, Logrus International Corporation
QA model: general considerations
• Reflecting perception and priorities of the target audience
  • Concentrating on factors producing the strongest impression
  • Separating global and local factors/issues
• Universal applicability
  • Covering the whole spectrum of materials
    • From slightly post-edited MT to ultra-polished manual translations
    • Same approach for knowledge bases and marketing leaflets
  • Common approach
    • Only adjusting acceptance criteria/thresholds based on expectations
• Viability
  • Clear, not overly complicated
  • Process-oriented, i.e. applicable in the real world
• Flexibility
  • Concentrating on methodology
  • Particular criteria/issue classification can be taken from elsewhere, for instance:
    • Based on MQM or other public source
    • Based on legacy client-sourced criteria…
Real-Life Scenario: No Reference Translations Available
• Two major criteria for any translation are always
  • Adequacy (correctly conveys the meaning) and Fluency (readability)
  • NEITHER of these depends on translation origin, target audience, brand impact, etc.
• No need to delve into technical details or error counts if the text is
  • Unreadable (incomprehensible) or
  • Inadequate (inaccurate)
• Acceptance thresholds depend on a number of parameters
  • Goals
  • Target audience
  • Speed
  • Expected longevity and brand impact, etc.
• Assessment is relatively quick
  • Often scanning through the text is sufficient
  • Especially so when quality is really low
  • One needs to be bilingual or have a bilingual expert ready just in case
Making Real-Life LQA as Objective as Possible
• NEITHER of the two major criteria is completely objective
  • An expert panel would produce a normal opinion curve around the average value
  • In real life there is no expert panel, but a single evaluator!
  • The grade assigned by this particular person will NOT be arbitrary, but…
    • It might fall anywhere within the standard ±2σ range
    • It depends on the individual’s taste, background, etc.
  • That is why both criteria can be called SEMI-OBJECTIVE or EXPERT OPINION-BASED
  • Both criteria are NOT too accurate by design!
• Consequences
  • EACH of these two major criteria should be evaluated SEPARATELY
    • Accurate but incomprehensible texts are as useless as fluent but inadequate ones
    • Two independent “coordinates” that can’t be combined mechanically
  • EACH should be evaluated on a threshold-based PASS/FAIL basis
    • Acceptance range needs to accommodate the whole spectrum of potential expert opinions
      • Marketing text: between 8 and 10 (10-point scale)
      • Knowledge base: between 5 and 8 (10-point scale)
  • The minimal scale to be used is a 10-point one, to accommodate the normal curve properly
    • Smaller scales just do not provide sufficient granularity
  • Acceptance threshold is defined by the area, visibility of materials, time constraints, target audience, etc.
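The threshold-based PASS/FAIL idea can be sketched in a few lines of Python. This is only an illustration: the two acceptance ranges are the ones quoted on the slide, while treating the lower bound of each range as the PASS threshold, and all names (`ACCEPTANCE_FLOOR`, `ExpertGrades`, `passes`, `accepted`), are assumptions made for the example.

```python
from dataclasses import dataclass

# Lower bounds of the acceptance ranges quoted on the slide (10-point scale).
# Using the lower bound as the PASS threshold is an assumption of this sketch.
ACCEPTANCE_FLOOR = {"marketing": 8, "knowledge_base": 5}


@dataclass(frozen=True)
class ExpertGrades:
    """Grades given by a single evaluator, each on a 0-10 scale."""
    adequacy: int
    fluency: int


def passes(grades: ExpertGrades, content_type: str) -> dict:
    """Evaluate each criterion SEPARATELY; the two grades are never averaged."""
    floor = ACCEPTANCE_FLOOR[content_type]
    return {
        "adequacy": grades.adequacy >= floor,
        "fluency": grades.fluency >= floor,
    }


def accepted(grades: ExpertGrades, content_type: str) -> bool:
    """A text is acceptable only if it passes on both coordinates."""
    return all(passes(grades, content_type).values())


grades = ExpertGrades(adequacy=9, fluency=6)
print(accepted(grades, "marketing"))       # fluency is below the marketing floor
print(accepted(grades, "knowledge_base"))  # both grades clear the lower floor
```

The same pair of grades fails as marketing copy and passes as knowledge-base content, which is the point of adjusting only the thresholds and not the method.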
The technical factor
• Only content that passes on both accounts is further analyzed for technical imperfections
  • Terminology inconsistency or deviations
  • Style guides, country standards
  • Tags, placeholders
  • Formatting
• Technical issues are OBJECTIVE
  • Grades expected to be similar irrespective of the reviewer’s personality
  • A typo is still a typo
  • An error in country standards is still an error anyway
• Issue categories can be based on
  • MQM or other public source
  • Legacy client-sourced criteria
• Error weights and acceptance thresholds depend on multiple factors
  • Expectations, target audience, time, brand impact, etc.
  • Each “quality vector” contains error weights for each category and acceptance levels
  • A limited number of “quality vectors” cover the whole spectrum
• The resulting technical (objective) quality grade is the third apex of the quality triangle
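One possible shape for a “quality vector” is sketched below in Python. The slides do not specify a formula, so the weighted penalty per 1000 words, the category names, the weights and the thresholds are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class QualityVector:
    """Hypothetical quality vector: per-category error weights plus an acceptance level."""
    weights: Dict[str, float]          # issue category -> weight per error
    max_penalty_per_1000_words: float  # acceptance level for the technical grade


def technical_penalty(error_counts: Dict[str, int], word_count: int,
                      vector: QualityVector) -> float:
    """Weighted error count normalized per 1000 words (an assumed formula)."""
    raw = sum(vector.weights[cat] * n for cat, n in error_counts.items())
    return raw * 1000 / word_count


def technical_pass(error_counts: Dict[str, int], word_count: int,
                   vector: QualityVector) -> bool:
    return technical_penalty(error_counts, word_count, vector) <= vector.max_penalty_per_1000_words


# Illustrative weights only; a real catalogue might come from MQM or client criteria.
WEIGHTS = {"terminology": 2.0, "standards": 2.0, "tags": 3.0, "formatting": 1.0}
MARKETING = QualityVector(WEIGHTS, max_penalty_per_1000_words=3.0)
KNOWLEDGE_BASE = QualityVector(WEIGHTS, max_penalty_per_1000_words=10.0)

errors = {"terminology": 3, "formatting": 2}
print(technical_penalty(errors, 2000, MARKETING))  # 4.0
print(technical_pass(errors, 2000, MARKETING))
print(technical_pass(errors, 2000, KNOWLEDGE_BASE))
```

Keeping the weights and the threshold together in one object mirrors the idea that a small set of preset vectors can cover the whole spectrum of content types.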
The quality triangle (or square)
[Figure: quality triangle/square with apexes ADEQUACY, FLUENCY, TECHNICAL and MAJOR ERRORS; labels: Acceptance Range, Filters]
Case study: US ACA Spanish website review
• Organized by GALA (Globalization and Localization Association, www.gala-global.org)
  • Logrus developed and provided methodology
  • Logrus organized the review and provided analytics
• Volunteer effort, crowdsourcing-based approach
  • Complicated special rules, strict definitions, lengthy training, etc. out of the question
  • Contributors chosen among language professionals only
• Simplified “quality square” methodology applied
  • Major errors (10 = None, 0 = More than 2)
  • Readability (fluency, 0 - 10)
  • Adequacy (accuracy, 0 - 10)
  • Technical (0 - 10)
• 18 language pros reviewing the website: www.CuidadoDeSalud.gov
Case study: US ACA Spanish website review (II)
• Major errors: None (11), More than 2 (7), 1 grade ignored
• Takeaways
  • Not too objective!
  • YOUR reviewer could contribute to ANY of the bars
  • Only threshold-based criteria really work
• [Chart] Readability / Intelligibility: mean value 6.2, std. deviation 2.1
• [Chart] Adequacy / Accuracy: mean value 6.6, std. deviation 1.9
Case study: US ACA Spanish website review (III)
• Biggest opinion spread for technical errors
  • Illustrates the gap between professional and crowdsourced work
  • No detailed criteria or training applied
  • Should be the most objective factor
• [Chart] Technical Issues: mean value 4.8, std. deviation 2.3
• Overall review results still quite reliable/convincing
  • Not a big surprise given the website’s initial quality…
  • “Obamacare’s poorly translated Spanish website frustrates users”, AP, January 12, 2014
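To see how wide a single reviewer's grade can range, the reported means and standard deviations can be plugged into the ±2σ rule mentioned earlier in the deck. The short Python sketch below does only that arithmetic; it assumes the roughly normal opinion curve the slides describe, and clipping the range to the 0-10 scale is an added assumption.

```python
# Mean and standard deviation as reported for the 18-reviewer case study.
REPORTED = {
    "readability": (6.2, 2.1),
    "adequacy": (6.6, 1.9),
    "technical": (4.8, 2.3),
}


def two_sigma_range(mean: float, std: float, lo: float = 0.0, hi: float = 10.0):
    """Return the mean ± 2σ interval, clipped to the grading scale."""
    return (max(lo, mean - 2 * std), min(hi, mean + 2 * std))


for name, (mean, std) in REPORTED.items():
    low, high = two_sigma_range(mean, std)
    print(f"{name}: a single reviewer may land anywhere in {low:.1f}..{high:.1f}")
```

For all three ratings the interval spans most of the scale, which is why a single grade is only meaningful when compared against an acceptance range.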
Why semi-objective and objective factors should not be combined
• Scope and nature
  • Objective factors are “local”, each applies to a particular small segment (sentence)
  • Semi-objective factors typically apply to the text as a whole or its large chunks
  • Semi-objective evaluations are imprecise by definition and can’t be used in formulas
    • Natural variation might affect the summary score dramatically
• Importance/weight
  • Adequacy and fluency issues are way more important than most others
  • Their relative weight will exceed everything else by orders of magnitude
  • Combined summary result too dependent on adequacy/fluency
    • Almost no sensitivity to other factors
• Cost, Time, Viability
  • No reason to waste time on counting/grading technical errors for an incomprehensible or incorrect text
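The weight argument can be checked numerically. The Python sketch below builds the kind of combined score the slide argues against, as a weighted average in which adequacy and fluency carry 100 times the weight of the technical grade. The formula and the 100:1 ratio are assumptions made purely to illustrate "orders of magnitude".

```python
def combined_score(adequacy: float, fluency: float, technical: float,
                   w_semi: float = 100.0, w_tech: float = 1.0) -> float:
    """Hypothetical single-number score: a weighted average of the three grades."""
    total = w_semi * (adequacy + fluency) + w_tech * technical
    return total / (2 * w_semi + w_tech)


baseline = combined_score(adequacy=8, fluency=8, technical=10)

# Technical grade drops from perfect to worst: the summary barely moves.
tech_swing = baseline - combined_score(adequacy=8, fluency=8, technical=0)

# One reviewer grades adequacy two points lower (well within ±2σ): the summary moves a lot.
reviewer_swing = baseline - combined_score(adequacy=6, fluency=8, technical=10)

print(f"technical 10 -> 0 changes the score by {tech_swing:.3f}")
print(f"adequacy   8 -> 6 changes the score by {reviewer_swing:.3f}")
```

With these weights, ordinary reviewer-to-reviewer variation in one semi-objective grade shifts the summary far more than the entire range of the technical grade, so the combined number mostly reflects who did the review.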
The “quality triangle/SQUARE” approach recipe
• Preparation
  • Select/build the appropriate issue classification for objective errors
  • Select/set the acceptance thresholds and error weights vector
  • Define show-stoppers
• Process
  • Apply expert opinion-based (semi-objective) criteria with a PASS/FAIL result
    • Adequacy (Accuracy)
    • Fluency (Readability)
  • Apply objective criteria based on error classification/typology (acceptable docs only)
    • Language (spelling & grammar)
    • References, lack of (over-/under-)translations
    • Country and other standards
    • Terminology, Style Guide and explicit client’s guidelines
    • Tags, placeholders, formatting, etc.
  • Ignore Subjective Complaints
• Obtain 3 or 4 resulting ratings for each reasonably translated document
  • Adequacy (Accuracy)
  • Fluency (Readability)
  • Objective (Technical) error rating
  • [Major problems]
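The recipe can be read as a short pipeline: check show-stoppers and the two expert grades first, then count technical errors only for documents that passed. The Python sketch below follows that order. Every name, weight and threshold in it is hypothetical, the penalty-per-1000-words formula is an assumption, and modelling "ignore subjective complaints" as skipping categories absent from the issue classification is one possible interpretation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class ContentProfile:
    """Hypothetical preset: acceptance thresholds plus the error-weights vector."""
    expert_floor: int                  # minimum PASS grade for adequacy and fluency
    error_weights: Dict[str, float]    # issue category -> weight per error
    max_penalty_per_1000_words: float  # acceptance level for technical errors
    max_major_errors: int = 0          # show-stopper limit


@dataclass
class Review:
    adequacy: int                      # 0-10, expert opinion
    fluency: int                       # 0-10, expert opinion
    major_errors: int
    word_count: int
    issues: Dict[str, int] = field(default_factory=dict)  # category -> count


def evaluate(review: Review, profile: ContentProfile) -> Dict[str, Optional[object]]:
    ratings: Dict[str, Optional[object]] = {
        "major_pass": review.major_errors <= profile.max_major_errors,
        "adequacy_pass": review.adequacy >= profile.expert_floor,
        "fluency_pass": review.fluency >= profile.expert_floor,
        "technical_penalty": None,
        "technical_pass": None,
    }
    # Technical errors are counted for acceptable documents only.
    if not (ratings["major_pass"] and ratings["adequacy_pass"] and ratings["fluency_pass"]):
        return ratings
    # Categories outside the classification (e.g. subjective complaints) get weight 0.
    raw = sum(profile.error_weights.get(cat, 0.0) * n for cat, n in review.issues.items())
    penalty = raw * 1000 / review.word_count
    ratings["technical_penalty"] = penalty
    ratings["technical_pass"] = penalty <= profile.max_penalty_per_1000_words
    return ratings


KB = ContentProfile(
    expert_floor=5,
    error_weights={"language": 1.0, "terminology": 2.0, "standards": 2.0, "tags": 3.0},
    max_penalty_per_1000_words=10.0,
)

good = Review(adequacy=7, fluency=6, major_errors=0, word_count=1000,
              issues={"language": 2, "terminology": 1, "preference": 5})
bad = Review(adequacy=3, fluency=6, major_errors=0, word_count=1000,
             issues={"language": 40})

print(evaluate(good, KB))
print(evaluate(bad, KB))
```

The second review stops after the adequacy check, so its 40 language errors are never weighed, which matches the cost argument for not grading technical errors in texts that already failed.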
Summary
• QA approach equally applicable to almost all real-life translations (without an existing reference)
  • Works for MT post-editing or even raw MT output
• Complements the MQM back-end, providing the methodology for quality assurance
• The only things that need to be chosen or fine-tuned are
  • Issue catalogue (for objective issues/errors)
  • The vector comprising all acceptance thresholds and error weights
    • Can be chosen from a limited number of preset templates (content profiles)
• See concept details in tcworld, February 2012, “Of power adapters and language quality assurance”: http://www.tcworld.info/tcworld/translation-and-localization/article/of-power-adapters-and-language-quality-assurance/
Separate case: REFERENCE translations available
• There are plenty of time/money-saving, automated methods to get a ballpark quality evaluation
• Applicability area is narrowed dramatically:
  • Comparing different MTs or different versions of the same MT
  • Evaluating test translations
• Results might be quick and cheap, but
  • Not directly related to quality of the translation
  • Rather illustrating translation’s closeness to the benchmark one
• Can be used for developing/improving MTs or quickly evaluating new translators/students
• Very limited usability for real-life translation scenarios
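The "closeness to the benchmark" point is easy to demonstrate with a toy reference-based score. The Python sketch below uses a deliberately simple word-overlap F1; it is not BLEU or any other established metric, and the function name and example sentences are invented for illustration.

```python
from collections import Counter


def overlap_f1(candidate: str, reference: str) -> float:
    """Toy reference-based score: F1 over lowercase word counts shared with the reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    common = sum((cand & ref).values())  # multiset intersection
    if common == 0:
        return 0.0
    precision = common / sum(cand.values())
    recall = common / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "press the red button to stop the machine"
paraphrase = "push the red switch to halt the device"

print(overlap_f1(reference, reference))   # 1.0
print(overlap_f1(paraphrase, reference))  # 0.5
```

The paraphrase conveys the same meaning and reads just as well, yet scores half of the maximum, because the score measures similarity to one particular reference and not adequacy or fluency.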