Explore the role of ITS 2.0 and Text Analysis in providing contextual information for translators, improving localization processes, and addressing challenges like proper name translation. Learn about data categories such as Localization Note, Terminology, Id Value, Domain, and how Text Analysis can automate proper name translation and suggest terms. Discover use cases, examples, and the manual vs automated generation of annotations.
Creating Translation Context with Disambiguation Tadej Štajner – Jožef Stefan Institute Yves Savourel – ENLASO Corporation Localization World – London – June 2013
Context: A Shortcoming • Traditionally, translation tools have been strong on code handling and re-use of existing translations. • But they have been weaker at providing context or linguistic resources for translators. • Things are improving and are bound to improve even more.
New Factors • Component-based processing is becoming widespread (e.g. source text goes through several preparation steps: TM, MT, etc.) • Web services allow a single process to tap into many different resources; specialization becomes easier. • Now ITS 2.0 provides a common way to carry various information across tools/services.
ITS: Internationalization Tag Set • A set of common internationalization and localization-related features (called “data categories”) for XML, and now with ITS 2.0 also for HTML5 • ITS 2.0 is being finalized at the W3C: http://www.w3.org/TR/its20/
ITS and “Context” • ITS 2.0 offers several data categories that can help with contextual information: Localization Note, Terminology, Id Value, Domain and Text Analysis. • Quick glance at the first four, then in-depth look at Text Analysis.
Localization Note Comments put in the source document and meant to be seen by the translators. <msg its:locNote="%s is for On or Off">Click the %s button</msg>
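A tool that prepares content for translators could surface these notes next to the source text. The following is a minimal Python sketch, assuming the note is given locally as an `its:locNote` attribute in the ITS namespace; the `<messages>` wrapper element and the function name are made up for illustration, and global rules and inheritance are not handled.

```python
import xml.etree.ElementTree as ET

# The ITS namespace URI; the <messages> wrapper below is a hypothetical container.
ITS_NS = "http://www.w3.org/2005/11/its"

SAMPLE = (
    '<messages xmlns:its="http://www.w3.org/2005/11/its">'
    '<msg its:locNote="%s is for On or Off">Click the %s button</msg>'
    '</messages>'
)


def collect_loc_notes(xml_text):
    """Return (source text, note) pairs for elements with a local its:locNote."""
    root = ET.fromstring(xml_text)
    attr = "{%s}locNote" % ITS_NS
    pairs = []
    for elem in root.iter():
        note = elem.get(attr)
        if note is not None:
            pairs.append(((elem.text or "").strip(), note))
    return pairs


for text, note in collect_loc_notes(SAMPLE):
    print(f"{text!r} <- note for translator: {note}")
```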
Terminology Annotates a “term” in the content and, optionally, provides additional related information. <p>We need a new <span its-term="yes" its-term-info-ref="http://en.wikipedia.org/wiki/Motherboard">motherboard</span>.</p>
Id Value Provides a way to associate unique IDs with parts of the content during translation. Can be useful for software text, where IDs are often descriptive. <its:idValueRule selector="//msg" idValue="@name"/>...<msg name="FILENOTFOUND">Not found</msg>
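The effect of such a rule can be illustrated with a short Python sketch. It does not evaluate the rule's XPath expressions; it hard-codes the equivalent of `selector="//msg"` and `idValue="@name"` as function parameters. The `<res>` wrapper and the second message (`DISKFULL`) are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical resource file; only the FILENOTFOUND entry comes from the slide.
SAMPLE = (
    "<res>"
    '<msg name="FILENOTFOUND">Not found</msg>'
    '<msg name="DISKFULL">Disk full</msg>'
    "</res>"
)


def ids_for(xml_text, element_name="msg", id_attribute="name"):
    """Map each selected element's ID (taken from an attribute) to its text."""
    root = ET.fromstring(xml_text)
    result = {}
    for elem in root.iter(element_name):
        unit_id = elem.get(id_attribute)
        if unit_id:  # skip elements that carry no usable ID
            result[unit_id] = elem.text or ""
    return result


print(ids_for(SAMPLE))
```

A translation tool can then show the descriptive ID next to each string, or use it to match strings across versions of the file.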
Domain Identifies the general topic area of the content to translate. Can be useful for selecting MT engines. <its:domainRule selector="/h:html" domainPointer="/h:html/h:head/h:meta[@name='keywords']/@content"/>...<meta name="keywords" content="automotive"/>
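As a rough illustration of MT engine selection by domain, the Python sketch below reads the `keywords` meta element and looks the values up in a table. The engine names and the table itself are hypothetical, and splitting the content on commas is an assumption of this sketch, as is reading the meta element directly instead of evaluating the rule's `domainPointer`.

```python
from html.parser import HTMLParser


class KeywordsParser(HTMLParser):
    """Collects the values of <meta name="keywords" content="...">."""

    def __init__(self):
        super().__init__()
        self.domains = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            # Assumption: several domains may be listed, separated by commas.
            self.domains += [d.strip().lower() for d in content.split(",") if d.strip()]


# Hypothetical mapping from domain values to MT engine identifiers.
ENGINES = {"automotive": "mt-automotive", "medical": "mt-medical"}


def pick_engine(html_text, default="mt-general"):
    """Return the engine for the first recognized domain, else the default."""
    parser = KeywordsParser()
    parser.feed(html_text)
    for domain in parser.domains:
        if domain in ENGINES:
            return ENGINES[domain]
    return default


page = '<html><head><meta name="keywords" content="automotive"/></head></html>'
print(pick_engine(page))
```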
Text Analysis • Annotates content with lexical or conceptual information. • Useful for many things: • Term suggestion • General context information • Suggestion of things not to translate • Automated transliteration of proper names • Etc.
Text Analysis: An Example Enrycher is an example of a component that generates Text Analysis annotations and can be easily integrated with translation tools or localization processes.
Motivation • Translating proper names … can be problematic for statistical MT systems
Motivation (2) • There are specific rules to translate (or transliterate) proper names • Solution: figure out what is actually being mentioned and see if any existing translated expression exists for that entity
Motivation (3) • Examples: personal names, product names, geographic names, chemical compounds, protein names • Names and phrases appear in situations without sufficient context (UI labels, etc.)
ITS 2.0 Text Analysis • Supports text analysis agents that enhance content by suggesting or identifying concepts, identified by IRIs. • A Text Analysis annotation marks a text fragment with: • entity type • entity identifier • confidence
Text Analysis in ITS 2.0 – what can it tell us? • Does a text fragment represent some entity? • London is lovely in the summer. • Out of 73 known entities named London, we mean a particular one: http://dbpedia.org/resource/London • … a particular type of entity? • London is a phrase, representing a location • … and with what confidence?
ITS 2.0 Text Analysis <!DOCTYPE html> <div its-annotators-ref="text-analysis|http://enrycher.ijs.si/mlw/toolinfo.xml#enrycher"> <span its-ta-ident-ref="http://dbpedia.org/resource/London" its-ta-class-ref="http://schema.org/Place">London</span> is the <span its-ta-ident-ref="http://purl.org/vocabularies/princeton/wn30/synset-capital-noun-3.rdf">capital</span> of <span its-ta-ident-ref="http://dbpedia.org/resource/United_Kingdom" its-ta-class-ref="http://schema.org/Place">United Kingdom</span>. </div>
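A consumer of such markup only needs to read the `its-ta-ident-ref` and `its-ta-class-ref` attributes. Below is a minimal Python sketch using the standard library HTML parser; the class and function names are made up, and the sketch assumes every element has an explicit end tag (an unclosed void element such as `<br>` would confuse its simple stack).

```python
from html.parser import HTMLParser


class TextAnalysisParser(HTMLParser):
    """Collects fragments annotated with its-ta-ident-ref / its-ta-class-ref."""

    def __init__(self):
        super().__init__()
        self.annotations = []
        self._open = []  # one entry per open element: annotation dict or None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "its-ta-ident-ref" in attributes or "its-ta-class-ref" in attributes:
            entry = {
                "text": "",
                "ident": attributes.get("its-ta-ident-ref"),
                "class": attributes.get("its-ta-class-ref"),
            }
        else:
            entry = None
        self._open.append(entry)

    def handle_endtag(self, tag):
        if self._open:
            entry = self._open.pop()
            if entry is not None:
                self.annotations.append(entry)

    def handle_data(self, data):
        # Text belongs to every annotated element that is currently open.
        for entry in self._open:
            if entry is not None:
                entry["text"] += data


def parse_annotations(html_text):
    parser = TextAnalysisParser()
    parser.feed(html_text)
    return parser.annotations


SAMPLE = (
    '<div><span its-ta-ident-ref="http://dbpedia.org/resource/London" '
    'its-ta-class-ref="http://schema.org/Place">London</span> is the '
    '<span its-ta-ident-ref="http://purl.org/vocabularies/princeton/wn30/'
    'synset-capital-noun-3.rdf">capital</span> of '
    '<span its-ta-ident-ref="http://dbpedia.org/resource/United_Kingdom" '
    'its-ta-class-ref="http://schema.org/Place">United Kingdom</span>.</div>'
)

for annotation in parse_annotations(SAMPLE):
    print(annotation["text"], "->", annotation["ident"], annotation["class"])
```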
Producing these annotations • Manual annotation • Automated NLP Techniques • Named entity extraction & disambiguation • Word sense disambiguation
Use cases • Informing a human agent (e.g. a translator) that a certain fragment of text is subject to specific translation rules: • proper names • officially regulated translations. • Informing a software agent (e.g. a CMS) about the conceptual type of a textual entity in order to enable special processing or indexing
Named entity disambiguation [Diagram with the labels: Document, Entity, Label, Mention]
Named entity disambiguation – behind the scenes • A difficult problem: • A name can refer to many entities, an entity can have many names • Which interpretation is correct? • Humans are pretty good at this • We have prior knowledge of the ‘usual’ meanings • We can glean the meaning from the context • Things that are related appear together
Named entity disambiguation – behind the scenes (2) • Prior knowledge: what is the most frequent meaning of ‘London’? • Context: someone using the word ‘London’ in the context of ‘Canada’ is likely to be referring to the other London, in Ontario
Named entity disambiguation – behind the scenes (3) • Relational similarity: things connected in the knowledge graph tend to appear together
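The interplay of prior knowledge and context can be shown with a toy scoring function. This Python sketch is not Enrycher's algorithm: the candidate table, the prior values, the related-term sets, and the additive score are all invented for illustration, and a real system would derive them from a knowledge graph and corpus statistics.

```python
# Toy candidate table. Priors and related terms are made-up illustration values.
CANDIDATES = {
    "London": [
        {
            "iri": "http://dbpedia.org/resource/London",
            "prior": 0.9,  # assumed: the most frequent meaning
            "related": {"united kingdom", "england", "thames"},
        },
        {
            "iri": "http://dbpedia.org/resource/London,_Ontario",
            "prior": 0.1,
            "related": {"canada", "ontario"},
        },
    ]
}


def disambiguate(mention, context_terms, context_weight=1.0):
    """Pick the candidate with the best prior-plus-context score, or None."""
    context = {term.lower() for term in context_terms}
    best_iri, best_score = None, float("-inf")
    for candidate in CANDIDATES.get(mention, []):
        # Related entities that also appear in the context raise the score.
        overlap = len(candidate["related"] & context)
        score = candidate["prior"] + context_weight * overlap
        if score > best_score:
            best_iri, best_score = candidate["iri"], score
    return best_iri


print(disambiguate("London", []))          # no context: the prior decides
print(disambiguate("London", ["Canada"]))  # context outweighs the prior
```

With no context the prior wins; once a related term such as "Canada" appears nearby, the overlap term pushes the less frequent meaning ahead.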
Building blocks of Enrycher • Token-level analysis • Sentence splitting • Tokenization • Lemmatization • Part-of-speech tagging • Entity-level analysis • Named entity extraction • Co-reference resolution • Anaphora resolution • Named entity disambiguation • Document-level analysis • Sentiment analysis • Topic classification • Keyword extraction (not used here)
Using Enrycher • An HTTP service endpoint: send HTML5 in, get enriched HTML5+ITS2.0 out • Multilingual: supports English and Slovene • See http://enrycher.ijs.si/mlw/, or try it from the command line: $ curl -d "<p>Welcome to London</p>" http://enrycher.ijs.si/mlw/en/entityIdent.html5its2 <p>Welcome to <span its-ta-ident-ref="http://dbpedia.org/resource/London" its-ta-class-ref="http://schema.org/Place">London</span></p>
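The same call could be made from a script. The Python sketch below mirrors the curl command with the standard library; the function names are made up, the content type is what `curl -d` sends by default, and UTF-8 is assumed for both request and response. Whether the service is still reachable at this URL is not something the sketch can guarantee, so the network call is defined but not executed.

```python
import urllib.request

# Endpoint taken from the curl example above.
ENDPOINT = "http://enrycher.ijs.si/mlw/en/entityIdent.html5its2"


def build_request(html_fragment, endpoint=ENDPOINT):
    """Build a POST request carrying the HTML fragment as the raw body."""
    return urllib.request.Request(
        endpoint,
        data=html_fragment.encode("utf-8"),  # assumption: UTF-8 input
        method="POST",
        # curl -d uses this content type unless told otherwise.
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def annotate(html_fragment, timeout=10):
    """Send the fragment to the service and return the annotated HTML."""
    request = build_request(html_fragment)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")  # assumption: UTF-8 output


request = build_request("<p>Welcome to London</p>")
print(request.get_method(), request.full_url)
# annotate("<p>Welcome to London</p>") would perform the actual HTTP call.
```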
Enrycher Integrated in Okapi • The Okapi Framework is an open-source and cross-platform set of components designed to help build localization processes. • One of its components is a client of the Enrycher services. • Text Analysis annotations can be applied to any document in a format supported by the Okapi filters.
One example of usage of the Enrycher Web services [Pipeline diagram with the labels: Input File, Extraction Step, Enrycher Step, Enrycher Server, Other Steps…, Trans-Kit Creation Step, Term Extraction Step, Translation Kit, XLIFF, Terms]
Enrycher Step • Converts batches of segments (in Okapi’s internal format) into HTML paragraphs and sends them to the Enrycher service. • Converts the annotated paragraphs back into Okapi’s internal format. • Next steps can use the Text Analysis metadata, e.g. XLIFF output, OmegaT comments, etc.
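The round trip described here can be sketched in a few lines of Python. This is not Okapi's code (Okapi is a Java framework with its own internal format): plain strings stand in for segments, the `seg-N` ids are an invented convention, and the sketch assumes the service returns the `<p>` elements with their `id` attributes unchanged and adds only inline markup.

```python
import html
import re


def segments_to_html(segments):
    """Wrap each plain-text segment in a <p> carrying a positional id."""
    return "\n".join(
        '<p id="seg-%d">%s</p>' % (index, html.escape(text, quote=False))
        for index, text in enumerate(segments)
    )


# Assumption: the annotated response keeps the <p id="seg-N"> wrappers intact.
_PARAGRAPH = re.compile(r'<p id="seg-(\d+)">(.*?)</p>', re.DOTALL)


def html_to_segments(annotated_html):
    """Recover the paragraph bodies in their original order, markup included."""
    found = {int(index): body for index, body in _PARAGRAPH.findall(annotated_html)}
    return [found[index] for index in sorted(found)]


batch = segments_to_html(["Welcome to London", "Terms & conditions"])
print(batch)
print(html_to_segments(batch))
```

The recovered bodies still contain escaped text and any inline `<span>` annotations, which a later step would convert into the tool's own annotation objects.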
Term Extraction Step • The Term Extraction Step offers various simple ways to guess terms in source content. • One of its methods is to re-use the content annotated with the Text Analysis metadata to feed the list of term candidates.
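Feeding term candidates from annotations amounts to collecting the annotated fragments and counting them. The Python sketch below shows the idea only; it is not the Okapi step's implementation, and the input shape (a list of text and identifier pairs) and the `min_count` threshold are assumptions made for the example.

```python
from collections import Counter


def term_candidates(annotations, min_count=1):
    """Turn (text, identifier) annotation pairs into ranked term candidates.

    Returns (term, occurrences, identifier) tuples, most frequent first.
    """
    counts = Counter()
    identifiers = {}
    for text, identifier in annotations:
        term = text.strip()
        if not term:
            continue
        counts[term] += 1
        identifiers.setdefault(term, identifier)  # keep the first identifier seen
    return [
        (term, count, identifiers[term])
        for term, count in counts.most_common()
        if count >= min_count
    ]


annotations = [
    ("London", "http://dbpedia.org/resource/London"),
    ("United Kingdom", "http://dbpedia.org/resource/United_Kingdom"),
    ("London", "http://dbpedia.org/resource/London"),
]
for term, count, identifier in term_candidates(annotations):
    print(term, count, identifier)
```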
Questions? • Enrycher: http://enrycher.ijs.si/ • Okapi Framework: http://okapi.opentag.com/ • ITS 2.0: http://www.w3.org/TR/its20/