Creating Translation Context with Disambiguation. Tadej Štajner – Jožef Stefan Institute Yves Savourel – ENLASO Corporation. Localization World – London – June 2013. Context: A Shortcoming. Traditionally, translation tools have been strong on code handling, re-use of existing translations.
Creating Translation Context with Disambiguation Tadej Štajner – Jožef Stefan Institute Yves Savourel – ENLASO Corporation Localization World – London – June 2013
Context: A Shortcoming • Traditionally, translation tools have been strong on code handling, re-use of existing translations. • But they have been less good at providing context or linguistic resources for the translators. • Things are improving and are bound to improve even more.
New Factors • Component-based processing is becoming widespread (i.e. source text goes through several preparation steps: TM, MT, etc.) • Web services allow a single process to tap into many different resources; specialization becomes easier. • Now ITS 2.0 provides a common way to carry various information across tools/services.
ITS: Internationalization Tag Set • A set of common internationalization and localization-related features (called “data categories”) for XML… and now with ITS 2.0 also for HTML5 • ITS 2.0 is being finalized at the W3C: http://www.w3.org/TR/its20/
ITS and “Context” • ITS 2.0 offers several data categories that can help with contextual information: Localization Note, Terminology, Id Value, Domain and Text Analysis. • Quick glance at the first four, then in-depth look at Text Analysis.
Localization Note Comments put in the source document and meant to be seen by the translators. <msg its:locNote="%s is for On or Off">Click the %s button</msg>
Terminology Annotates a “term” in the content and, optionally, provides additional related information. <p>We need a new <span its-term=yes its-term-info-ref="http://en.wikipedia.org/wiki/Motherboard">motherboard</span>.</p>
Id Value Provides a way to associate unique IDs with parts of the content during translation. Can be useful for software text where IDs are often descriptive. <its:idValueRule selector="//msg" idValue="@name"/> ... <msg name="FILENOTFOUND">Not found</msg>
Domain Identifies the general topic area of the content to translate. Can be useful for selecting MT engines. <its:domainRule selector="/h:html" domainPointer="/h:html/h:head/h:meta[@name='keywords']/@content"/> ... <meta name="keywords" content="automotive"/>
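As a rough illustration of what the Id Value rule above expresses, the following Python sketch pairs each `msg` element with the value of its `name` attribute. It is not an ITS processor: the rule (`selector="//msg"`, `idValue="@name"`) is hard-coded as two function parameters, and the wrapper element and the second message are hypothetical sample data.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; only the FILENOTFOUND message is from the slide.
DOC = """<messages>
  <msg name="FILENOTFOUND">Not found</msg>
  <msg name="DISKFULL">Disk full</msg>
</messages>"""


def id_values(xml_text, element_name="msg", id_attribute="name"):
    """Map each ID to its text, mimicking selector="//msg" idValue="@name"."""
    root = ET.fromstring(xml_text)
    # root.iter() visits the root and all descendants, like the XPath //msg.
    return {el.get(id_attribute): (el.text or "") for el in root.iter(element_name)}


ids = id_values(DOC)
print(ids)
```

A translation tool could show such an ID next to the segment, which is where a descriptive name like FILENOTFOUND becomes useful context.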
Text Analysis • Annotates content with lexical or conceptual information. • Useful for many things: • Term suggestion • General context information • Suggestion of things not to translate • Automated transliteration of proper names • Etc.
Text Analysis: An Example Enrycher is an example of a component that generates Text Analysis annotations and can be easily integrated with translation tools or localization processes.
Motivation • Translating proper names … can be problematic for statistical MT systems
Motivation (2) • There are specific rules to translate (or transliterate) proper names • Solution: figure out what is actually being mentioned and see if any existing translated expression exists for that entity
Motivation (3) • Examples: personal names, product names, or geographic names, chemical compounds, protein names • Names and phrases appear in situations without sufficient context (UI labels, etc.)
ITS 2.0 Text Analysis • Support text analysis agents that enhance content by suggesting or identifying concepts, identified by IRIs. • A TextAnalysis annotates a text fragment with: • entity type • entity identifier • confidence
Text Analysis in ITS 2.0 – what can it tell us? • Does a text fragment represent some entity? • London is lovely in the summer. • Out of 73 known entities named London, we mean a particular one: http://dbpedia.org/resource/London • … a particular type of entity? • London is a phrase, representing a location • … and with what confidence?
ITS 2.0 Text Analysis <!DOCTYPE html> <div its-annotators-ref="text-analysis|http://enrycher.ijs.si/mlw/toolinfo.xml#enrycher"> <span its-ta-ident-ref="http://dbpedia.org/resource/London" its-ta-class-ref="http://schema.org/Place">London</span> is the <span its-ta-ident-ref="http://purl.org/vocabularies/princeton/wn30/synset-capital-noun-3.rdf">capital</span> of <span its-ta-ident-ref="http://dbpedia.org/resource/United_Kingdom" its-ta-class-ref="http://schema.org/Place">United Kingdom</span>. </div>
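To show how a downstream tool might read such markup, here is a minimal Python sketch that collects the Text Analysis attributes from annotated HTML using the standard library parser. The class name and the record layout are illustrative choices, not part of ITS 2.0 or Enrycher; `its-ta-confidence` is the ITS 2.0 attribute for the confidence value, which this sample does not carry.

```python
from html.parser import HTMLParser

SAMPLE = (
    '<p><span its-ta-ident-ref="http://dbpedia.org/resource/London" '
    'its-ta-class-ref="http://schema.org/Place">London</span> '
    "is lovely in the summer.</p>"
)


class TextAnalysisCollector(HTMLParser):
    """Collects text, identifier, class and confidence of annotated spans."""

    def __init__(self):
        super().__init__()
        self.annotations = []
        self._open_spans = []  # one entry per open <span>: a record or None

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attributes = dict(attrs)
        if "its-ta-ident-ref" in attributes or "its-ta-class-ref" in attributes:
            record = {
                "text": "",
                "ident": attributes.get("its-ta-ident-ref"),
                "class": attributes.get("its-ta-class-ref"),
                "confidence": attributes.get("its-ta-confidence"),
            }
            self.annotations.append(record)
            self._open_spans.append(record)
        else:
            self._open_spans.append(None)

    def handle_endtag(self, tag):
        if tag == "span" and self._open_spans:
            self._open_spans.pop()

    def handle_data(self, data):
        # Text belongs to every annotated span that is currently open.
        for record in self._open_spans:
            if record is not None:
                record["text"] += data


collector = TextAnalysisCollector()
collector.feed(SAMPLE)
for annotation in collector.annotations:
    print(annotation["text"], annotation["ident"], annotation["class"])
```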
Producing these annotations • Manual annotation • Automated NLP Techniques • Named entity extraction & disambiguation • Word sense disambiguation
Use cases • Informing a human agent (e.g. a translator) that a certain fragment of text is subject to specific translation rules: • proper names • officially regulated translations. • Informing a software agent (e.g. a CMS) about the conceptual type of a textual entity in order to enable special processing or indexing
Named entity disambiguation [Diagram with the labels: Document, Entity, Label, Mention]
Named entity disambiguation – behind the scenes • A difficult problem: • A name can refer to many entities, an entity can have many names • Which interpretation is correct? • Humans are pretty good at this • We have prior knowledge of the ‘usual’ meanings • We can glean the meaning from the context • Things that are related appear together
Named entity disambiguation – behind the scenes (2) • Prior knowledge: what is the most frequent meaning of ‘London’? • Context: someone using the word ‘London’ in the context of ‘Canada’ is likely to be referring to another London in Ontario
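The interplay of prior knowledge and context can be sketched with a toy scorer in Python. This is not Enrycher's algorithm: the prior values, the context word sets and the additive scoring are all invented for illustration, and only the idea of combining the two signals comes from the slide.

```python
# Hypothetical toy data: priors and context words are invented for illustration.
CANDIDATES = {
    "London": [
        {
            "iri": "http://dbpedia.org/resource/London",
            "prior": 0.9,
            "context": {"uk", "england", "thames"},
        },
        {
            "iri": "http://dbpedia.org/resource/London,_Ontario",
            "prior": 0.1,
            "context": {"canada", "ontario"},
        },
    ]
}


def disambiguate(mention, text, context_weight=1.0):
    """Pick the candidate with the best prior-plus-context score."""
    words = set(text.lower().replace(".", " ").replace(",", " ").split())

    def score(candidate):
        overlap = len(candidate["context"] & words)
        return candidate["prior"] + context_weight * overlap

    return max(CANDIDATES[mention], key=score)["iri"]


print(disambiguate("London", "London is lovely in the summer."))
print(disambiguate("London", "She flew to London, Canada."))
```

With no helpful context the prior wins and the first sentence resolves to the British capital; the word "Canada" in the second sentence outweighs the prior and selects the Ontario entry.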
Named entity disambiguation – behind the scenes (3) • Relational similarity: things connected in the knowledge graph tend to appear together
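Relational similarity can be illustrated in the same toy fashion: choose the combination of candidates whose entities are most connected to each other. The graph, the entity labels and the brute-force search below are hypothetical and only meant to show the principle, not how Enrycher implements it.

```python
from itertools import product

# Hypothetical toy knowledge graph: undirected links between entity labels.
LINKS = {
    frozenset({"London_UK", "United_Kingdom"}),
    frozenset({"London_Ontario", "Canada"}),
    frozenset({"Paris_France", "France"}),
}

# Hypothetical candidate entities for each mention.
CANDIDATES = {
    "London": ["London_UK", "London_Ontario"],
    "Canada": ["Canada"],
    "United Kingdom": ["United_Kingdom"],
}


def coherence(entities):
    """Count how many pairs of chosen entities are linked in the graph."""
    return sum(
        1
        for i, first in enumerate(entities)
        for second in entities[i + 1:]
        if frozenset({first, second}) in LINKS
    )


def joint_disambiguate(mentions):
    """Try every combination of candidates and keep the most coherent one."""
    options = [CANDIDATES[mention] for mention in mentions]
    best = max(product(*options), key=coherence)
    return dict(zip(mentions, best))


print(joint_disambiguate(["London", "Canada"]))
print(joint_disambiguate(["London", "United Kingdom"]))
```

Enumerating all combinations grows exponentially with the number of mentions, so a real system would need a smarter search; the sketch keeps the exhaustive version for clarity.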
Building blocks of Enrycher • Token-level analysis • Sentence splitting • Tokenization • Lemmatization • Part-of-speech tagging • Entity-level analysis • Named entity extraction • Co-reference resolution • Anaphora resolution • Named entity disambiguation • Document-level analysis • Sentiment analysis • Topic classification • Keyword extraction (not used here)
Using Enrycher • An HTTP service endpoint: send HTML5 in, get enriched HTML5+ITS2.0 out • Multilingual: supports English and Slovene • See http://enrycher.ijs.si/mlw/, or try it from the command line: $ curl -d "<p>Welcome to London</p>" http://enrycher.ijs.si/mlw/en/entityIdent.html5its2 <p>Welcome to <span its-ta-ident-ref="http://dbpedia.org/resource/London" its-ta-class-ref="http://schema.org/Place">London</span></p>
Enrycher Integrated in Okapi • The Okapi Framework is an open-source and cross-platform set of components designed to help build localization processes. • One of its components is a client of the Enrycher services. • Text Analysis annotations can be applied to any document in a format supported by the Okapi filters.
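The same call can be sketched in Python with the standard library. The endpoint URL is the one shown on the slide and may no longer be reachable; treating the `en` path segment as a language parameter is an assumption, since only the English URL appears in the source. The function names are illustrative, and the request is only built here, not sent.

```python
import urllib.request

BASE_URL = "http://enrycher.ijs.si/mlw"  # from the slide; may no longer be reachable


def build_request(html_fragment, language="en"):
    """Build a POST request equivalent to the curl -d call on the slide."""
    # Assumption: the language code is the path segment before the operation name.
    url = f"{BASE_URL}/{language}/entityIdent.html5its2"
    return urllib.request.Request(
        url, data=html_fragment.encode("utf-8"), method="POST"
    )


def enrich(html_fragment, language="en", timeout=10):
    """Send the fragment to the service and return the annotated HTML."""
    request = build_request(html_fragment, language)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


request = build_request("<p>Welcome to London</p>")
print(request.get_method(), request.full_url)
```

Calling `enrich("<p>Welcome to London</p>")` would perform the network request; it is left out of the example so the sketch runs without access to the service.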
One example of usage of the Enrycher Web services [Pipeline diagram with the labels: Input File, Extraction Step, Enrycher Step, Enrycher Server, Other Steps…, Trans-Kit Creation Step, Term Extraction Step, Translation Kit, XLIFF, Terms]
Enrycher Step • Converts batches of segments (in Okapi’s internal format) into HTML paragraphs and sends them to the Enrycher service. • Converts the annotated paragraphs back into Okapi’s internal format. • Next steps can use the Text Analysis metadata, e.g. XLIFF output, OmegaT comments, etc.
Term Extraction Step • The Term Extraction Step offers various simple ways to guess terms in source content. • One of its methods re-uses the content annotated with Text Analysis metadata to feed the list of term candidates.
Questions? • Enrycher: http://enrycher.ijs.si/ • Okapi Framework: http://okapi.opentag.com/ • ITS 2.0: http://www.w3.org/TR/its20/
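The round trip described above can be sketched in Python. Okapi itself is a Java framework, so this is not its API: the function names, the batch size and the plain-string segments are hypothetical stand-ins for Okapi's internal format.

```python
import html
import re


def batches(segments, size=2):
    """Yield consecutive groups of at most `size` segments."""
    for start in range(0, len(segments), size):
        yield segments[start:start + size]


def segments_to_html(segments):
    """Wrap each plain-text segment in a <p> element, escaping markup characters."""
    return "\n".join(f"<p>{html.escape(s, quote=False)}</p>" for s in segments)


def html_to_segments(document):
    """Return the inner markup of each <p>, keeping any annotation spans."""
    return re.findall(r"<p>(.*?)</p>", document, flags=re.DOTALL)


segments = ["Welcome to London", "Fish & chips", "Mind the gap"]
for batch in batches(segments):
    payload = segments_to_html(batch)      # what would be sent to the service
    returned = html_to_segments(payload)   # here: parsed straight back, unannotated
    print(returned)
```

In a real pipeline the payload would go to the annotation service between the two conversions, and the returned paragraphs would carry `its-ta-*` spans to map back onto the original segments.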
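Feeding term candidates from Text Analysis annotations might look like the following Python sketch. It is a simplification, not the Okapi step: the regular expression assumes double-quoted attributes and no nested spans, and the ranking by frequency is an illustrative choice.

```python
import re
from collections import Counter

# Matches a span carrying its-ta-ident-ref; assumes double quotes, no nesting.
ANNOTATED_SPAN = re.compile(
    r'<span\b[^>]*\bits-ta-ident-ref="([^"]+)"[^>]*>(.*?)</span>', re.DOTALL
)

SAMPLE = (
    '<p>Welcome to <span its-ta-ident-ref="http://dbpedia.org/resource/London" '
    'its-ta-class-ref="http://schema.org/Place">London</span>.</p>'
    '<p><span its-ta-ident-ref="http://dbpedia.org/resource/London">London</span> '
    'is in the <span its-ta-ident-ref="http://dbpedia.org/resource/United_Kingdom">'
    "United Kingdom</span>.</p>"
)


def term_candidates(annotated_html):
    """Count annotated fragments and return them, most frequent first."""
    counts = Counter()
    for ident, text in ANNOTATED_SPAN.findall(annotated_html):
        counts[(text.strip(), ident)] += 1
    return counts.most_common()


for (term, ident), frequency in term_candidates(SAMPLE):
    print(frequency, term, ident)
```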