Variant Issues Project Devanagari Team Venue : C-DAC Pune Date : 16 th Sept. 2011. Part 1 : Overview of evolution of Devanagari Brief sketch of writing system of the language Part 2: Variants Issues as they exist in Devanagari based languages Part 3:
Variant Issues Project Devanagari Team
Venue: C-DAC Pune
Date: 16th Sept. 2011
Structure of the Report
Part 1: Overview of evolution of Devanagari; brief sketch of writing system of the language
Part 2: Variant issues as they exist in Devanagari based languages
Part 3: Registry – Registrar perspective
Appendices
Issues at large
1. Language vs Script
2. Variants
3. Unicode Normalization
4. Zero Width Joiner (ZWJ) and Zero Width Non-Joiner (ZWNJ)
5. Valid characters/combinations that are not valid as per IDNA 2008 protocol
Review of Issues Report
Reviewed by:
• Andrew Sullivan
• Dr. John Klensin
• Dr. Nicholas Ostler
• Karen Lentz
• Kim Davies
• Francisco Arias
• Naela Sarras
• Bal Krishna Bal
• Dr. S. Walawalikar
• Umamaheswaran
• Shashi Pathania
• NIXI
Typology of comments
1. Typos/Language (not many)
2. Rendering issues (4)
3. Requests for deletions (6)
4. Terminological corrections (5-6)
5. Requests for amplification (8)
6. Requests for change in content
Action Taken on Deletion Requests
Deletion requests were of three types:
1. Cases which were felt to be out of scope but were relevant to VIP. Action: Pushed into the appendices under the head of “Extraneous Issues”.
2. Cases which were felt not to be under the purview of ICANN but related to Unicode/software behavior. Action: Kept and described for relevance.
3. Cases which were felt not to be under the purview of this report. Action: Deleted.
Change in content
Please refer to this document.
Focal Issues
1. Variant Classification in Devanagari
2. Normalization related issues
3. Zero Width Joiner (ZWJ) & Zero Width Non-Joiner (ZWNJ)
4. Software Behavior (in light of rendering engine behavior)
5. IDNA Protocol related issues
Variant Classification - Devanagari
Variants that exist because of legacy ways of inputting the same logical character
Earlier versions of Unicode did not have certain characters. To generate these characters, alternative methods such as the use of Halanta followed by a ZWJ (U+200D) were used.
e.g. Eyelash-ra
Variant Classification - Devanagari
Variants that exist because of combining characters
These variants exist because Unicode allows two or more ways of representing certain characters. Unicode handles the issue through Normalization. In the case of Devanāgarī, the “nukta” character is the candidate for Normalization.
e.g. a sample of two such instances is provided.
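The nukta case can be checked with Python's standard `unicodedata` module. The sketch below is only an illustration, not part of the report: the two letters (QA and NNNA) are examples chosen here, and the helper name `same_after_nfc` is hypothetical. It shows that a precomposed nukta letter and the sequence "base letter + nukta (U+093C)" become identical once both are normalized to NFC.

```python
import unicodedata

# Precomposed nukta letters versus "base letter + combining nukta (U+093C)".
QA_PRECOMPOSED = "\u0958"       # DEVANAGARI LETTER QA
QA_SEQUENCE = "\u0915\u093C"    # KA + NUKTA
NNNA_PRECOMPOSED = "\u0929"     # DEVANAGARI LETTER NNNA
NNNA_SEQUENCE = "\u0928\u093C"  # NA + NUKTA


def same_after_nfc(a: str, b: str) -> bool:
    """Return True if the two strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(QA_PRECOMPOSED == QA_SEQUENCE)                    # False: raw code points differ
print(same_after_nfc(QA_PRECOMPOSED, QA_SEQUENCE))      # True
print(same_after_nfc(NNNA_PRECOMPOSED, NNNA_SEQUENCE))  # True
```

Note that the normalized form is not always the precomposed one: NFC keeps QA as KA + NUKTA, while NA + NUKTA is composed into NNNA. Either way, both inputs end up as the same string.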
Variant Classification - Devanagari
Variants that are look-alikes - Single character
These are characters which have confusingly similar shapes. However, this category of variants was not considered in the .भारत ccTLD policy, as there was a possibility that this approach would bar many useful domain names from being registered.
e.g.
Variant Classification - Devanagari
Variants that are look-alikes - Composite character
These are conjuncts that look alike and can be easily confused in the small URL bar of the browser. These have been considered as variants in the .भारत ccTLD policy.
e.g.
Variant Classification - Devanagari
Cross-script character variants
There is a possibility that scripts (or, so to say, code blocks) could be allowed to be mixed within certain gTLDs. Assuming that, a list of cross-script (cross-code-block) visual similarities between characters is provided.
e.g.
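Before cross-script look-alikes can be compared, a registry would need to notice that a label mixes code blocks at all. The sketch below is one possible way to do that, under stated assumptions: the function name `outside_devanagari` is hypothetical, only the Devanagari and Devanagari Extended blocks are treated as "in script", and anything else (including digits, hyphen, ZWJ and ZWNJ) is simply reported. It does not judge visual similarity.

```python
# Code-block ranges as listed in the Unicode block list.
DEVANAGARI_BLOCKS = (
    (0x0900, 0x097F),  # Devanagari
    (0xA8E0, 0xA8FF),  # Devanagari Extended
)


def outside_devanagari(label: str) -> list[str]:
    """Return the characters of `label` that lie outside the Devanagari blocks."""
    return [
        ch
        for ch in label
        if not any(lo <= ord(ch) <= hi for lo, hi in DEVANAGARI_BLOCKS)
    ]


pure = "\u092D\u093E\u0930\u0924"   # भारत, all code points in the Devanagari block
mixed = "\u092D\u093E\u0A17\u0924"  # same label with a Gurmukhi letter (U+0A17) mixed in

print(outside_devanagari(pure))   # []
print(outside_devanagari(mixed))  # ['ਗ']
```

A real policy would replace the simple report with a lookup into the agreed list of cross-script similarities.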
Variant Classification - Devanagari
Homophonic Variants
Devanāgarī based languages do have homophonic variants, i.e. spelling variants of the same word (as in English color/colour), e.g. हिंदी and हिन्दी. However, the rules for such variants are ill-defined, and admitting them could increase the chances of malfeasance. Within the ambit of the ccTLD policy for .भारत, such variants have not been considered.
Unicode Normalization Issues
Unicode has defined normalization rules, but there are still some cases where Unicode has proposed multiple ways of inputting a character/combination without making them part of normalization.
e.g. Eyelash-ra
Until Unicode identifies such cases as normalizable, they should be handled as variants.
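The eyelash-ra case can be illustrated with a short sketch. It assumes the two encodings in question are RRA + virama (U+0931 U+094D) and RA + virama + ZWJ (U+0930 U+094D U+200D). The fold table, the direction of the fold, and the function name `variant_key` are hypothetical choices made for this example, not a policy taken from the report.

```python
import unicodedata

EYELASH_RA_RRA = "\u0931\u094D"        # RRA + virama (halant)
EYELASH_RA_ZWJ = "\u0930\u094D\u200D"  # RA + virama + ZWJ

# Hypothetical variant table: fold the ZWJ spelling onto the RRA spelling.
VARIANT_FOLD = {EYELASH_RA_ZWJ: EYELASH_RA_RRA}


def variant_key(label: str) -> str:
    """Normalize to NFC, then apply the variant folds that NFC does not cover."""
    key = unicodedata.normalize("NFC", label)
    for source, target in VARIANT_FOLD.items():
        key = key.replace(source, target)
    return key


a = "\u0926" + EYELASH_RA_RRA + "\u092F\u093E"  # DA, eyelash-ra (RRA form), YA, AA
b = "\u0926" + EYELASH_RA_ZWJ + "\u092F\u093E"  # DA, eyelash-ra (ZWJ form), YA, AA

# Normalization alone keeps the two spellings distinct.
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # False
# With the variant fold applied, both labels map to the same key.
print(variant_key(a) == variant_key(b))  # True
```

Two labels with the same key would then be treated as variants of each other at registration time.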
ZWJ and ZWNJ Issue
ZWJ (U+200D) and ZWNJ (U+200C) are code points provided by the Unicode standard to instruct the rendering of a string where the script has the option between joining and non-joining characters. These have been categorized as CONTEXTJ in IDNA 2008.
Rule: The preceding character must be a virama.
Issue: Some character combinations linguistically do not have different joining or non-joining behavior. In these cases, the presence of ZWJ or ZWNJ does not make any visual difference. What amplifies this issue is the fact that this behavior varies from one rendering engine to another.
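The virama rule stated above can be sketched in a few lines. This is a simplified check, not a full IDNA 2008 implementation: it covers only the "preceded by a virama" condition, identifies viramas by canonical combining class 9, and leaves out the additional joining-type rule that RFC 5892 defines for ZWNJ. The function name `joiners_follow_virama` is hypothetical.

```python
import unicodedata

ZWNJ = "\u200C"
ZWJ = "\u200D"
VIRAMA_CCC = 9  # canonical combining class assigned to virama characters


def joiners_follow_virama(label: str) -> bool:
    """Return True if every ZWJ/ZWNJ in `label` is directly preceded by a virama."""
    for index, ch in enumerate(label):
        if ch in (ZWJ, ZWNJ):
            if index == 0:
                return False  # nothing precedes the joiner
            if unicodedata.combining(label[index - 1]) != VIRAMA_CCC:
                return False
    return True


print(joiners_follow_virama("\u0915\u094D\u200D\u0937"))  # True: KA, virama, ZWJ, SSA
print(joiners_follow_virama("\u0915\u200D\u0937"))        # False: ZWJ directly after KA
```

A label that passes this check can still contain a joiner with no visible effect, which is the issue described above; the check only says the joiner is permitted by the protocol rule.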
Software/Application Issues
Example: Web browser
What gets displayed as a domain name in the URL bar of the browser depends on the font that is applied there. Because the default fonts for various scripts are inconsistent at the same point size, the domain name becomes unreadable in some cases. Applications can be expected to improve eventually, but until they do, this remains an issue and an important consideration while identifying variants.
IDNA Protocol Issues
These issues are not directly related to variants, but they relate to the IDNA specifications and are therefore mentioned.
1. Case of U+02BC: Comes from a different code block than Devanagari.
2. Case of ya-phala (Bengali script): The input method suggested by Unicode conflicts with the CONTEXTJ rule for ZWJ specified in IDNA 2008.