Localization and HTML5: Technical Aspects Felix Sasaki DFKI / W3C Fellow
Pitch: Why this presentation? • HTML5 is the upcoming (or existing) format for content on the Web • The Web is becoming multilingual • HTML5 localization is essential to make this happen • Localization workflows with HTML5 input / output need to take various aspects of HTML5 into account – learn more here
Acknowledgement • Thanks to Jirka Kosek for introducing the participants of the W3C MultilingualWeb-LT working group to the “do” and “do not” of HTML5 content creation and processing
Overview • HTML5 Serializations + Model • Localization Workflow with HTML5 • Metadata for (HTML5) Localization • What Else?
HTML5 – Serializations + Model • Two serializations: HTML5 vs. XHTML5 • HTML5: <!DOCTYPE html> <html> <head> <meta charset=utf-8> <title>My example</title> </head> <body>... </body> </html> • XHTML5: <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta charset="utf-8"/> <title>My example</title> </head> <body>... </body> </html> • One Document Object Model (DOM): document.getElementsByTagName("meta")
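The "two serializations, one model" point can be sketched in Python. This is only an illustration under stated assumptions: the standard-library `html.parser` is a tolerant tokenizer, not an implementation of the HTML5 parsing algorithm, and the helper names `metas_from_html` and `metas_from_xhtml` are hypothetical. Both serializations of the slide's example yield the same `meta` information.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

HTML_DOC = """<!DOCTYPE html>
<html><head><meta charset=utf-8><title>My example</title></head>
<body>...</body></html>"""

XHTML_DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta charset="utf-8"/><title>My example</title></head>
<body>...</body></html>"""

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


class MetaCollector(HTMLParser):
    """Collect the attributes of every <meta> start tag."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            self.metas.append(dict(attrs))


def metas_from_html(text):
    """Read the HTML serialization with a tolerant tag parser."""
    collector = MetaCollector()
    collector.feed(text)
    collector.close()
    return collector.metas


def metas_from_xhtml(text):
    """Read the XML serialization with a strict XML parser."""
    root = ET.fromstring(text)
    return [dict(el.attrib) for el in root.iter(XHTML_NS + "meta")]


print(metas_from_html(HTML_DOC))
print(metas_from_xhtml(XHTML_DOC))
```

In a browser the equivalent lookup is the slide's `document.getElementsByTagName("meta")`, which works the same way for either serialization.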
Rationale • More than 90% of the Web is invalid • See the Opera browser “MAMA” report • XHTML was revolution • HTML5 is evolution • Parsing algorithm for existing Web content • Two serializations as input • Detailed error handling • Output: one DOM
Localization Workflow with HTML5 • Central: an HTML5 parsing library, e.g. validator.nu • Diagram labels: HTML5 as XML; HTML5 as HTML; HTML5 as HTML with errors; XLIFF-based Localization; HTML5; XHTML5 • Towards XLIFF: HTML5 parsing > DOM creation > (XML serialization) > XLIFF generation • Back from XLIFF: Transformation > XHTML5 > HTML5 parsing > HTML5 or XHTML5
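The path "HTML5 parsing > DOM creation > (XML serialization) > XLIFF generation" can be sketched roughly as follows. The assumptions are significant: Python's `html.parser` stands in for a real HTML5 parsing library such as validator.nu, only `<p>` text is extracted and inline markup is flattened, and XLIFF 1.2 is chosen here for illustration (the slides do not name a version). The names `ParagraphText` and `html_to_xliff` are hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


class ParagraphText(HTMLParser):
    """Collect the flattened text of each <p>, tolerating unclosed <p> tags."""

    def __init__(self):
        super().__init__()
        self.units = []
        self._buf = None  # None means "not inside a paragraph"

    def flush(self):
        """Store the paragraph collected so far, if any."""
        if self._buf is not None:
            text = " ".join("".join(self._buf).split())
            if text:
                self.units.append(text)
        self._buf = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()  # a new <p> implicitly closes an open one
            self._buf = []

    def handle_endtag(self, tag):
        if tag in ("p", "body", "html"):
            self.flush()

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)


def html_to_xliff(html_text, original="page.html", source_lang="en"):
    """Return a minimal XLIFF 1.2 document with one trans-unit per paragraph."""
    parser = ParagraphText()
    parser.feed(html_text)
    parser.close()
    parser.flush()  # input may end without any closing tag

    xliff = ET.Element("xliff", {"version": "1.2", "xmlns": XLIFF_NS})
    file_el = ET.SubElement(xliff, "file", {
        "original": original,
        "source-language": source_lang,
        "datatype": "html",
    })
    body = ET.SubElement(file_el, "body")
    for number, text in enumerate(parser.units, start=1):
        unit = ET.SubElement(body, "trans-unit", {"id": str(number)})
        ET.SubElement(unit, "source").text = text
    return ET.tostring(xliff, encoding="unicode")


# "HTML5 as HTML with errors": the first paragraph is never closed.
print(html_to_xliff("<p>Hello <b>world</b><p>Second paragraph</p>"))
```

A production workflow would keep inline markup as XLIFF inline elements and record enough structure to merge translations back; this sketch deliberately omits both.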
Metadata for (HTML5) Localization: ITS 2.0 • “Internationalization Tag Set” 2.0 • Set of disjoint metadata items (“data categories”) for XML and HTML5 • Some items are part of the HTML5 spec • Data categories: Translate, Localization Note, Terminology, Directionality, Ruby, Language Information, Elements Within Text, Domain, Locale Filter, Provenance, Text Analysis Annotation, External Resource, Target Pointer, Id Value, Preserve Space, Localization Quality Issue, Localization Quality Précis, MT Confidence, Allowed Characters, Storage Size
“Translate” <!DOCTYPE html> <html> <head> <meta charset=utf-8> <title>Translate flag test: Default</title> </head> <body> <p>The <span translate=no>World Wide Web Consortium</span> is making the World Wide Web worldwide!</p> </body> </html>
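How a tool might honor the `translate` attribute can be sketched as follows. This is a simplified reading, not the normative processing: the flag is inherited from the parent element, `yes` or an empty value turns translation on, `no` turns it off, and any other value is treated as inherited. `html.parser` is again only a stand-in for an HTML5 parser, and `TranslatableText` and `segments` are hypothetical names.

```python
from html.parser import HTMLParser

# Void elements have no end tag, so they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TranslatableText(HTMLParser):
    """Split text into (text, translatable) pairs using inherited flags."""

    def __init__(self):
        super().__init__()
        self.stack = [True]  # assumed document default: translate=yes
        self.segments = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        flag = self.stack[-1]  # inherit from the parent element
        attr_map = dict(attrs)
        if "translate" in attr_map:
            value = (attr_map["translate"] or "").lower()
            if value in ("", "yes"):
                flag = True
            elif value == "no":
                flag = False
        self.stack.append(flag)

    def handle_endtag(self, tag):
        if tag not in VOID and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.segments.append((data, self.stack[-1]))


def segments(html_text):
    parser = TranslatableText()
    parser.feed(html_text)
    parser.close()
    return parser.segments


SAMPLE = ("<p>The <span translate=no>World Wide Web Consortium</span>"
          " is making the World Wide Web worldwide!</p>")
for text, translatable in segments(SAMPLE):
    print(translatable, repr(text))
```

The stack approach assumes reasonably balanced markup; with heavily broken input, a real HTML5 parser that builds a DOM first is the safer basis.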
ITS “global rules” • XPath-based metadata approach • Attach metadata to several nodes at once • Specify metadata for a document format or (HTML) template • Example: map proprietary HTML to ITS “translate” <its:rules ...> <its:translateRule translate="no" selector="//h:*[@class='notranslate']"/> </its:rules>
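A global rule like the one above can be applied with a few lines of Python. This is a sketch under assumptions: it uses the limited XPath subset of `xml.etree.ElementTree` (Python 3.8 or later for the `h:*` wildcard), it works on the XHTML serialization, and it resolves prefixes from a fixed map `NS` rather than from the namespace declarations of the rules file. The sample document, the rules wrapper attributes, and the function name `non_translatable` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: prefixes are resolved from this fixed map.
NS = {
    "h": "http://www.w3.org/1999/xhtml",
    "its": "http://www.w3.org/2005/11/its",
}

RULES = """<its:rules xmlns:its="http://www.w3.org/2005/11/its"
    xmlns:h="http://www.w3.org/1999/xhtml" version="2.0">
  <its:translateRule translate="no" selector="//h:*[@class='notranslate']"/>
</its:rules>"""

DOC = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<p>Run <code class="notranslate">make install</code> first.</p>
<p class="notranslate">ACME-42</p>
</body></html>"""


def non_translatable(doc_xml, rules_xml):
    """Return the elements selected by translate="no" rules, in document order."""
    doc = ET.fromstring(doc_xml)
    rules = ET.fromstring(rules_xml)
    hits = []
    for rule in rules.findall("its:translateRule", NS):
        if rule.get("translate") != "no":
            continue
        selector = rule.get("selector")
        # ElementTree cannot evaluate absolute paths on an element,
        # so "//x" is rewritten to the relative form ".//x".
        if selector.startswith("//"):
            selector = "." + selector
        hits.extend(doc.findall(selector, NS))
    return hits


for element in non_translatable(DOC, RULES):
    print(element.tag, element.text)
```

A full ITS processor would use a complete XPath engine and also handle rule precedence and inheritance, which this sketch leaves out.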
ITS “inline”, e.g. global rules in HTML5 • “Work” inside an HTML “script” element with the proper MIME type • Upcoming: application/its+xml • If possible: avoid; use linked rules <!DOCTYPE html> <html> ... <script type="application/xml"> <its:rules ...> <its:translateRule translate="no" selector="//h:code"/> </its:rules> </script> ... </html>
“Terminology” <!DOCTYPE html> <html lang=en> <head> <meta charset=utf-8> <title>Terminology test: default</title> </head> <body> <p>We need a new <span its-term=yes>motherboard</span> </p> </body> </html>
“Directionality” <!DOCTYPE html> <html lang=en> <head> <meta charset=utf-8> <title>Dir test: Default</title> </head> <body> <p>In Arabic, the title <q dir=rtl lang=ar>نشاط التدويل، W3C</q> means <q>Internationalization Activity, W3C</q>.</p> </body> </html>
“Ruby” – XHTML vs. HTML5 • XHTML: <ruby> <rb>日本</rb> <rt>にっぽん</rt> </ruby> <ruby> <rb>電</rb> <rt>でん</rt> </ruby> <ruby> <rb>気</rb> <rt>き</rt> </ruby> • HTML5: <ruby> 日本 <rt>にっぽん</rt> 電 <rt>でん</rt> 気 <rt>き</rt> </ruby>
“Domain” <its:rules xmlns:its="http://www.w3.org/2005/11/its" version="2.0"> <its:domainRule selector="/html/body" domainPointer="/html/head/meta[@name='keywords']/@content" domainMapping="automotive auto, medical medicine, 'criminal law' law, 'property law' law"/> </its:rules> Meaning: • Express domain information about the content of the “body” element • The domain information is in the “meta” element • Optional mapping of source content domains, e.g. automotive > auto • Purpose: not to define a domain vocabulary, but to pass domain information to an application (MT system, MT training tool)
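The optional mapping step can be illustrated with a short sketch. It assumes the `domainMapping` value is a comma-separated list of pairs whose left part may be quoted, that the `keywords` content is itself comma-separated, and that matching is done case-insensitively (a simplifying choice here). Splitting on commas would break on a quoted value that contains a comma. The function names and the sample keywords string are hypothetical.

```python
import shlex

MAPPING = ("automotive auto, medical medicine, "
           "'criminal law' law, 'property law' law")


def parse_domain_mapping(mapping):
    """Turn "a b, 'c d' e" into {"a": "b", "c d": "e"}."""
    table = {}
    for pair in mapping.split(","):
        parts = shlex.split(pair)  # honors the single quotes around phrases
        if len(parts) == 2:
            source, target = parts
            table[source.lower()] = target
    return table


def map_domains(keywords_content, mapping):
    """Map each source domain; unmapped values pass through unchanged."""
    table = parse_domain_mapping(mapping)
    result = []
    for value in keywords_content.split(","):
        value = value.strip()
        if not value:
            continue
        mapped = table.get(value.lower(), value)
        if mapped not in result:  # drop duplicates, keep first-seen order
            result.append(mapped)
    return result


# Hypothetical content of <meta name="keywords" content="...">
print(map_domains("automotive, criminal law, property law, sports", MAPPING))
```

The resulting list is what would be handed to the consuming application, for example as a domain selector for an MT system.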
“Storage Size” <!DOCTYPE html> <html lang=en> <head> <meta charset=utf-8> <title>Example</title> </head> <body> <p>String to translate:</p> <p contenteditable=true id=123 its-storage-size=25>Papua New-Guinea</p> <p contenteditable=true id=139 its-storage-size=25>Dominican Republic</p> </body> </html>
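A check against `its-storage-size` could look like the sketch below. It assumes the limit is a byte count in a given character encoding, with UTF-8 taken as the default; the function name `fits_storage` is hypothetical. The point of the example is that a translation with few characters can still exceed the limit once encoded.

```python
def fits_storage(text, storage_size, encoding="utf-8"):
    """Return True if the encoded text needs at most storage_size bytes."""
    return len(text.encode(encoding)) <= storage_size


# The two source strings from the slide, limit 25 bytes each.
print(fits_storage("Papua New-Guinea", 25))
print(fits_storage("Dominican Republic", 25))

# Four Japanese characters need three bytes each in UTF-8.
print(len("にっぽん"), len("にっぽん".encode("utf-8")))
print(fits_storage("にっぽん", 10))
```

A translation tool would run such a check on the target text, since that is where the length typically changes.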
“Translate” in XML and HTML5 • ITS namespace vs. HTML5 native “translate” attribute • XML: <article...> ... <para>You need a new <span its:translate="no">motherboard</span></para> ...</article> • HTML5: <!DOCTYPE html> <html>... <p>You need a new <span translate="no">motherboard</span>...</p>... </html>
“Terminology” in XML and HTML5 • ITS namespace vs. HTML5 its-* “term” attribute • XML: <article...> ... <para>You need a new <span its:term="yes">motherboard</span></para> ...</article> • HTML5: <!DOCTYPE html> <html>... <p>You need a new <span its-term=yes>motherboard</span>...</p>... </html>
“Quality” metadata in the browser <html>… <script id=its-standoff-1 type=application/xml> <its:locQualityIssues xml:id="lq1"…> <its:locQualityIssue locQualityIssueType="misspelling" …/> <its:locQualityIssue locQualityIssueType="typographical" …/> </its:locQualityIssues> </script>…… <p> <span its-loc-quality-issues-ref=#lq1>c'es</span> le contenu</p> …</html> • See live demo at http://tinyurl.com/its2-lq-html5
Rationale for its-* • HTML attributes are case-insensitive and have no qualified (namespaced) names • ITS 1.0/2.0 attributes use • camel case: its:locNote, its:termInfo, its:withinText, … • the ITS namespace • Good news: conversion to HTML5 is straightforward • its-loc-note, its-term-info, its-within-text, …
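The naming conversion described above is mechanical, as this sketch shows: drop the `its:` prefix, insert a hyphen before each upper-case letter, lower-case the result, and prepend `its-`. The function name `its_to_html_attr` is hypothetical, and the small `NATIVE_HTML` table is an assumption based on these slides, which show `translate` and `dir` as native HTML5 attributes.

```python
import re

# Assumption: these local names map to native HTML5 attributes.
NATIVE_HTML = {"translate": "translate", "dir": "dir"}


def its_to_html_attr(xml_name):
    """Convert e.g. 'its:locNote' to 'its-loc-note'."""
    prefix, _, local = xml_name.partition(":")
    if prefix != "its" or not local:
        raise ValueError("expected a name of the form its:localName")
    if local in NATIVE_HTML:
        return NATIVE_HTML[local]
    # Hyphen before every upper-case letter that is not at the start.
    hyphenated = re.sub(r"(?<!^)([A-Z])", r"-\1", local)
    return "its-" + hyphenated.lower()


for name in ("its:locNote", "its:termInfo", "its:withinText",
             "its:locQualityIssuesRef", "its:translate"):
    print(name, "->", its_to_html_attr(name))
```

Because HTML attribute names are case-insensitive, the lower-case hyphenated form is what survives a round trip through an HTML parser, which the camel-case form would not.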
Effect on Localization Workflow • translate, dir, its-loc-note, its-term-info, …: interpreted like its:translate, its:termInfo, … • Diagram labels: HTML5 as XML; HTML5 as HTML; HTML5 as HTML with errors; XLIFF-based Localization; HTML5; XHTML5 • Towards XLIFF: HTML5 parsing > DOM creation > (XML serialization) > XLIFF generation • Back from XLIFF: Transformation > XHTML5 > HTML5 parsing > HTML5 or XHTML5
Other HTML versions • “HTML legacy content”: no native support for its-* • HTML validation tools will complain • Good news: its-* attributes “work” in older versions of HTML (e.g. 3.2 or 4.01), e.g. they are recognized by HTML DOM parsers
Tool support • its-* attributes in the pipeline for the W3C HTML validator • Lots of (partially) ITS-aware XML+ITS / HTML5+ITS tools being developed in the W3C MultilingualWeb-LT working group • HTML5 validation with ITS 2.0 metadata, XML tool chain, online MT system, translation package creation, simple MT, HTML-to-TMS roundtrip, CMS support (Drupal), quality check, browser-based review, named entity annotation, … • *Very raw* details (but further links!) at http://tinyurl.com/its2-use-cases
What’s missing? • ITS 2.0 localization focuses on HTML markup • Elements, attributes • Server side / client side scripting content not taken into account • JavaScript, PHP, … • Using ITS 2.0 in HTML5 with XLIFF: still many bits missing • But: moving forward this week
Overview again … • HTML5 Serializations + Model • Localization Workflow with HTML5 • Metadata for (HTML5) Localization • What Else?
Thank you very much. Localization and HTML5: Technical Aspects Felix Sasaki DFKI / W3C Fellow
Localization and HTML5: Potential Slides for “Challenges and Promises”
What is HTML5? • DOM specification • Parsing algorithm to cover most of current (and future) Web content • A set of APIs: part of the HTML5 specification, or defined in separate documents • Explanatory and other documents: for markup authors, XML tool chains etc.
HTML5: current state • Developed within • W3C: HTML5 to become a standard • WHATWG http://www.whatwg.org/ - HTML as a “living standard” • High pressure in W3C to wrap up • Rationale: “We need one stable version” • At the same time: “We need more features!” – e.g. • ITS 2.0 • HTML accessibility http://www.w3.org/WAI/PF/html-task-force
Plan: HTML5 finalized by 2014 • Finish HTML5 specification in W3C by 2014 • Work closely with WHATWG and others on new features, for next version • Don’t try to get everything into HTML5! • Allow for extension specifications, e.g. ITS 2.0 • Moving forward at their own pace
HTML5 timeline • HTML5.0: 2012 CR start; 2013 ...CR, LC; 2014 Rec • HTML5.1: 2012 FPWD; 2014 LC + CR; 2015 ...CR; 2016 Rec • From http://dev.w3.org/html5/decision-policy/html5-2014-plan.html
Challenge: many extensions • HTML+RDFa - RDFa WG • Web Intents - Web Apps WG / Device APIs WG • HTML Editing APIs - HTML Editing APIs CG • HTML Media Capture - Device APIs WG • Media Capture and Streams - Device APIs WG / WebRTC WG • Media Fragments URI - Media Fragments WG • Encrypted Media Extensions - HTML WG • Media Source Extensions - HTML WG • ... many rel value specifications registered at the link type registry – Microformats
Promises: many extensions • See last slide • That also means: ease of adding localization features to HTML5
HTML5 and Localization Issues • Localization: mostly covered by ITS 2.0 • Technical aspects: see the presentation from Felix Sasaki on Tuesday • Important: get buy-in from browser vendors • Awareness of ITS 2.0 • Fostering browser-based implementations • Ease of adoption for web developers
HTML5 and Internationalization Issues • Many things to do • Ruby • International layout (work done mostly via CSS3 modules) • Here: our favorite i18n core issues