The Impact of IDN Registration policy by UNICODE variants issue -- Case Study on Chinese Characters
Vincent WS Chen
TWNIC
October 28, 2002
CJK (Han) Characters in UNICODE
• IDN proposed standards will adopt Unicode 3.2
• CJK Unified Ideographs:
  • 4E00-9FAF
  • 3400-4DBF (Extension A)
  • 20000-2A6DF (Extension B)
• CJK Compatibility Ideographs:
  • F900-FAFF
  • 2F800-2FA1F (Supplement)
UNICODE and Local Encoding
[Figure: UNICODE, covering Greek, Cyrillic, CJK (Han) characters and other scripts, compared with local encodings: Chinese Big5, GB, Japanese JIS, Korean Hangul]
• The scope of UNICODE is larger than that of any local encoding.
• Unicode is character-based, not language-based.
• How can the characters corresponding to one language be specified?
Analysis Flow
• VCP: Valid code point
• twRV: Recommended variants by .tw
• cnRV: Recommended variants by .cn
• CV: Character variants
[Figure: analysis flow. Data sources (registered IDN.com, IDN.net, IDN.org and registered IDN.tw names) and the Chinese Character Mapping Table (valid code points 4E00-9FA5, 20,902) feed the name conflict analysis. Results: % collision with twRV, % collision with cnRV, % collision with twRV and cnRV, % collision with CV]
Chinese Character Mapping Table (CCMT) -- Sources of Character Codes
Based on the UCS, CNS 14649, published in 2002, and referred to as the Mapping Table Source. The range of codes is described below:
Block name: CJK Unified Ideographs; code range: 4E00-9FA5 (20,902)
• Character for registration (Valid code point): all Chinese character codes in the Mapping Table Source (20,902)
• Primary corresponding character (Recommended variants by .tw): T-source Chinese character codes in the Mapping Table Source (18,368)
• Secondary corresponding character (Recommended variants by .cn): G-source Chinese character codes in the Mapping Table Source (20,902)
• Relevant character (Character variants): all Chinese character codes in the Mapping Table Source
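As a rough illustration of the four roles listed on this slide, one row of the mapping table could be modeled as below. The class name `CcmtEntry`, its field names, and the sample row are assumptions made for this sketch, not the published table format; only the 萬 to 万 conversion is taken from the case-study examples later in the deck.

```python
from dataclasses import dataclass

# Valid code point (VCP) range stated on the slide: 4E00-9FA5.
VCP_FIRST, VCP_LAST = 0x4E00, 0x9FA5


@dataclass(frozen=True)
class CcmtEntry:
    """One hypothetical row of the mapping table (field names are assumed)."""
    vcp: str        # character for registration (valid code point)
    tw_rv: str      # primary corresponding character (.tw recommended variant)
    cn_rv: str      # secondary corresponding character (.cn recommended variant)
    cv: frozenset   # relevant characters (character variants)

    def is_valid(self) -> bool:
        """True if the registered character lies inside the VCP range."""
        return VCP_FIRST <= ord(self.vcp) <= VCP_LAST


# Illustrative row only: the twRV and CV values are assumptions.
entry = CcmtEntry(vcp="萬", tw_rv="萬", cn_rv="万", cv=frozenset({"萬", "万"}))

print(entry.is_valid())
print(VCP_LAST - VCP_FIRST + 1)  # size of the VCP range
```

The size of the range 4E00-9FA5 works out to the 20,902 code points quoted on the slide.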
Chinese Character Mapping Table (CCMT) -- Table format (cont.)
Chinese Character Mapping Table (CCMT) -- Table characters
• Singular-relation character (VCP=twRV=cnRV=CV): 13,888 (66.4%)
• VCP=twRV≠cnRV: 2,783 (13.3%)
• VCP=cnRV≠twRV: 2,453 (11.7%)
• VCP≠(twRV=cnRV): 333 (1.6%)
• VCP≠twRV≠cnRV: 387 (1.9%)
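The percentages on this slide can be checked against the 20,902 valid code points with a few lines of arithmetic. This is only a consistency check on the slide's figures; the last category is read here as VCP≠twRV≠cnRV, and the dictionary labels are shorthand chosen for the sketch.

```python
TOTAL_VCP = 20902  # valid code points, 4E00-9FA5

# Counts copied from the slide.
counts = {
    "VCP=twRV=cnRV=CV": 13888,
    "VCP=twRV≠cnRV": 2783,
    "VCP=cnRV≠twRV": 2453,
    "VCP≠(twRV=cnRV)": 333,
    "VCP≠twRV≠cnRV": 387,
}

# Share of each category, as a percentage rounded to one decimal place.
shares = {label: round(100 * n / TOTAL_VCP, 1) for label, n in counts.items()}
for label, pct in shares.items():
    print(f"{label}: {counts[label]} ({pct}%)")

listed = sum(counts.values())
print(f"listed categories: {listed} of {TOTAL_VCP}")
```

The five listed categories account for 19,844 of the 20,902 code points; the slide does not itemize the remainder.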
Chinese Character Mapping Table (CCMT) for Chinese Domain Name
Chinese Character Mapping Table (CCMT) for Chinese Domain Name
• The table draft was prepared by the CCMT Task Force, organized by TWNIC and active since January 2002.
• The task force has 9 members, drawn from linguists, computer experts and DNS experts.
• The table draft has been submitted to the Bureau of Standards, Ministry of Economic Affairs, for final review.
• The table is also currently being reviewed by linguists invited by CDNC members.
• The CNS Standard version is tentatively scheduled for publication in December 2002.
Case Study -- Data Sources
• CJK Han character IDN: an IDN in which any character falls within the CJK Unified Ideographs (VCP)
• IDN.tw: an IDN in which any character falls within the scope of Big5 characters
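A minimal sketch of how registered names might be selected for the two data sources, following the slide's "any character" wording. The function names are hypothetical, and Python's built-in `big5` codec is used only as an approximation of the Big5 character set that IDN.tw accepts.

```python
# VCP range used in the case study: CJK Unified Ideographs 4E00-9FA5.
VCP_FIRST, VCP_LAST = 0x4E00, 0x9FA5


def is_vcp(ch: str) -> bool:
    """True if the character is a valid code point (VCP)."""
    return VCP_FIRST <= ord(ch) <= VCP_LAST


def in_big5(ch: str) -> bool:
    """Approximate Big5 membership via Python's "big5" codec (an assumption)."""
    try:
        ch.encode("big5")
    except UnicodeEncodeError:
        return False
    return True


def is_cjk_han_idn(label: str) -> bool:
    """True if any character of the label is a VCP."""
    return any(is_vcp(ch) for ch in label)


def is_big5_han_idn(label: str) -> bool:
    """True if any character of the label is a Han character within Big5."""
    return any(is_vcp(ch) and in_big5(ch) for ch in label)


print(is_cjk_han_idn("財神"), is_cjk_han_idn("example"))
print(is_big5_han_idn("財神"), is_big5_han_idn("财"))
```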
Case Study -- Method for collision calculation
Apply Mapping Table to Case I ~ IV
• Convert to twRV: collision with twRV
  竹叶青 → 竹葉青
  竹葉青 → 竹葉青
• Convert to cnRV: collision with cnRV
  万事如意 → 万事如意
  萬事如意 → 万事如意
• Convert to CV: collision with CV
  一个 → 一个、一個、一箇
  一個 → 一个、一個、一箇
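The conversion-and-compare method on this slide can be sketched as follows. The two tiny tables hold only the characters from the slide's examples, and the collision rate is defined here as the share of distinct names whose converted form matches that of another name; both the helper names and that definition are assumptions, since the deck does not spell out its exact formula.

```python
from collections import defaultdict

# Tiny illustrative tables; the real CCMT covers 20,902 code points.
TW_RV = {"叶": "葉"}  # character -> recommended variant for .tw
CN_RV = {"萬": "万"}  # character -> recommended variant for .cn


def convert(name: str, table: dict) -> str:
    """Replace each character by its recommended variant, if it has one."""
    return "".join(table.get(ch, ch) for ch in name)


def collision_rate(names, table) -> float:
    """Share of distinct names that convert to the same string as another name."""
    unique = set(names)
    if not unique:
        return 0.0
    groups = defaultdict(set)
    for name in unique:
        groups[convert(name, table)].add(name)
    colliding = sum(len(g) for g in groups.values() if len(g) > 1)
    return colliding / len(unique)


registered = ["竹叶青", "竹葉青", "万事如意", "萬事如意", "麻將"]
print(collision_rate(registered, TW_RV))  # 竹叶青 and 竹葉青 collide
print(collision_rate(registered, CN_RV))  # 万事如意 and 萬事如意 collide
```

With this sample list, two of the five names collide under each table.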
Case Study Example
Real case in IDN.com: 为什么, 为什麽, 为甚么, 為什么, 為什麼, 為甚麼
These six registered names should be treated as one name.
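One way to see why these six names belong together is to reduce every character to a representative of its variant set and group names by the result. The variant sets below are assumed from this example alone (the CCMT defines the real ones), and `cv_key` is a hypothetical helper.

```python
from collections import defaultdict

# Variant sets assumed for this example only.
VARIANT_SETS = [{"为", "為"}, {"什", "甚"}, {"么", "麼", "麽"}]

# Map every character to one fixed representative of its variant set.
CANONICAL = {ch: min(group) for group in VARIANT_SETS for ch in group}


def cv_key(name: str) -> str:
    """Rewrite a name using the representative of each character's variant set."""
    return "".join(CANONICAL.get(ch, ch) for ch in name)


names = ["为什么", "为什麽", "为甚么", "為什么", "為什麼", "為甚麼"]

# Names sharing a key are treated as one bundle.
bundles = defaultdict(list)
for name in names:
    bundles[cv_key(name)].append(name)

print(len(bundles))
```

Choosing the lowest code point as representative is arbitrary; any fixed member of the set would give the same grouping.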
Case Study -- idn.tw Example
1. The current valid code points for IDN.tw are the Big5 character set (13,051), fewer than the VCP in the CCMT table (20,902).
2. idn.tw currently implements a tentative TC/SC mapping table (old version) that differs slightly from the CCMT table.
3. Even though the applied table differs slightly, the number of name conflicts in the case study is greatly reduced.
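Case Study -- real registered IDN.com name collision examples
財產 / 財産 / 财产; 財產保險 / 财产保险; 財產稅 / 财产税; 財產管理 / 財産管理 / 财产管理; 財神 / 财神; 財神到 / 财神到; 財神爺 / 财神爷
运财 / 運財; 运货汽车 / 運貨汽車; 运输 / 運輸; 运输学 / 運輸學; 运输服务 / 運輸服務; 运输设备 / 運輸設備; 運転 / 運轉
龍圖蛇業 / 龙图蛇业
龍之杰醫院 / 龙之杰医院; 龍之杰集團 / 龙之杰集团
歯科材料 / 齒科材料 / 齿科材料
黃金時代 / 黄金时代 / 黄金時代
黃山中旅 / 黄山中旅; 黃山之旅 / 黄山之旅; 黃山國旅 / 黄山国旅; 黃山旅遊 / 黄山旅遊; 黃帝 / 黄帝
鹿儿岛 / 鹿兒島; 鹿儿岛大学 / 鹿児島大学; 鹿児島市 / 鹿兒島市; 鹿児島銀行 / 鹿兒島銀行; 鹿岛 / 鹿島 / 鹿嶋; 鹿岛建设 / 鹿島建設
麻将 / 麻將; 麻将世界 / 麻將世界; 麻将桌 / 麻將桌; 麻将馆 / 麻將館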
Case Study -- Conclusion
• IDN.com case: without any mechanism to reduce name confusion, about 18% to 23% of registered IDN.com names have a name conflict problem.
• IDN.net case: about 16% to 21%
• IDN.org case: about 15% to 20%
• IDN.tw case: a very small percentage of names conflict once the mapping table mechanism is applied.
Case Study -- Conclusion (cont.)
• Without any mechanism to reduce name conflicts, the more IDN names are registered, the higher the percentage of name conflicts (for example, idn.com has a higher percentage of name conflicts than idn.org).
• For Chinese characters, applying the recommended variants rule removes the majority of name conflicts, and applying the character variants rule reduces them further.
• If the valid code points are expanded from CJK Unified Ideographs 4E00-9FA5 (20,902) to the whole set of CJK Unicode code points (68,156), the situation will be worse than in this case study.