The Impact of IDN Registration Policy by UNICODE Variants Issue -- Case Study on Chinese Characters
Vincent WS Chen
TWNIC
October, 2002
Analysis Flow
VCP: Valid code point
twRV: Recommended variants by .tw
cnRV: Recommended variants by .cn
CV: Character variants
Inputs: Registered IDN.com, IDN.net, IDN.org; Registered IDN.tw
Valid Code Point: 4E00-9FA5 (20,902)
Name Conflict Analysis, using the Chinese Character Mapping Table
Results: % Collision with twRV; % Collision with cnRV; % Collision with twRV and cnRV; % Collision with CV
Chinese Character Mapping Table (CCMT) for Chinese Domain Name
• The table draft was prepared by the CCMT Task Force, organized by TWNIC in January 2002.
• The task force has 9 members: linguists, computer experts and DNS experts.
• The table draft has been submitted to the Bureau of Standards, Ministry of Economic Affairs, for final review.
• The CNS Standard version is tentatively scheduled for publication in December 2002.
Chinese Character Mapping Table (CCMT) --- Sources of Character Codes
Based on the UCS, CNS 14649, published in 2002, and referred to as the Mapping Table Source. The range of codes is described below:
Block Name                 Code Range
CJK Unified Ideographs     4E00-9FA5 (20,902)
Character for registration (Valid code point): all Chinese character codes in the Mapping Table Source (20,902)
Primary corresponding character (Recommended variants by .tw): T-source Chinese character codes in the Mapping Table Source (18,368)
Secondary corresponding character (Recommended variants by .cn): G-source Chinese character codes in the Mapping Table Source (20,902)
Relevant character (Character variants): all Chinese character codes in the Mapping Table Source
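As a quick check on the figures above, the size of the 4E00-9FA5 range and a membership test for valid code points can be sketched as follows. The function names are illustrative and not part of the CCMT, and treating a label as registrable only when every character falls in this range is an assumption made for the sketch.

```python
# Inclusive bounds of the CJK Unified Ideographs range quoted on the slide.
CJK_FIRST = 0x4E00
CJK_LAST = 0x9FA5


def block_size() -> int:
    """Number of code points in the inclusive range 4E00-9FA5."""
    return CJK_LAST - CJK_FIRST + 1


def is_valid_code_point(ch: str) -> bool:
    """True if the single character ch lies inside the range (hypothetical helper)."""
    return CJK_FIRST <= ord(ch) <= CJK_LAST


def is_registrable(label: str) -> bool:
    """Assumption: a label qualifies only if it is non-empty and all-Han."""
    return bool(label) and all(is_valid_code_point(ch) for ch in label)


print(block_size())              # 20902, matching the slide
print(is_registrable("竹葉青"))   # True
print(is_registrable("abc"))     # False
```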
Chinese Character Mapping Table (CCMT) ---- Table format and categories
Chinese Character Mapping Table (CCMT) ---- Table format and categories (cont.) ?(個(500B)箇(7B87)): the recommended variant of 个(4E2A) is sometimes 個(500B) and sometimes 箇(7B87), depending on context.
Chinese Character Mapping Table (CCMT) ---- Table format and categories (cont.) ?(發(767C)髮(9AEE)): the recommended variant of 发(53D1) is sometimes 發(767C) and sometimes 髮(9AEE), depending on context.
Chinese Character Mapping Table (CCMT) ---- Table format and categories (cont.) ?(發(767C)髮(9AEE)): the recommended variant of 発(767A) is sometimes 發(767C) and sometimes 髮(9AEE), depending on context.
Chinese Character Mapping Table (CCMT) ---- Table format and categories (cont.) ?(發(767C)髮(9AEE)): the recommended variant of 髪(9AEA) is sometimes 發(767C) and sometimes 髮(9AEE), depending on context.
Characters Relationship
1. Singular-relation character: a single character with VCP = twRV = cnRV
2. Pair-relation character: a pair of characters (VCP1 and VCP2)
   2.1 twRV1 = cnRV1 = twRV2 = cnRV2
   2.2 (twRV1 = cnRV1 = cnRV2) ≠ twRV2
   2.3 (twRV1 = twRV2) ≠ (cnRV1 = cnRV2)
3. Multiple-relation character: (VCP1, VCP2, VCP3, …)
   3.1 with two or more twRV options (twRV11, twRV12, …)
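The singular and pair relations can be expressed as simple comparisons. The sketch below is one possible reading of the slide, not the task force's implementation: the `Entry` type and function names are hypothetical, and the sample values are illustrative (the cnRV of 萬 follows the 萬事如意 example in this deck; the twRV values and the 青 entry are assumed).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One mapping-table row (hypothetical layout)."""
    vcp: str    # valid code point
    tw_rv: str  # recommended variant by .tw
    cn_rv: str  # recommended variant by .cn


def is_singular(e: Entry) -> bool:
    """Relation 1: VCP = twRV = cnRV."""
    return e.vcp == e.tw_rv == e.cn_rv


def classify_pair(a: Entry, b: Entry) -> str:
    """Return '2.1', '2.2', '2.3', or 'other' for a pair of entries."""
    # Relation 2.2 is asymmetric, so both orderings are tried.
    for x, y in ((a, b), (b, a)):
        if x.tw_rv == x.cn_rv == y.tw_rv == y.cn_rv:
            return "2.1"
        if x.tw_rv == x.cn_rv == y.cn_rv and y.tw_rv != x.tw_rv:
            return "2.2"
        if x.tw_rv == y.tw_rv and x.cn_rv == y.cn_rv and x.tw_rv != x.cn_rv:
            return "2.3"
    return "other"


# Illustrative values only; the twRV entries here are assumptions.
wan_simplified = Entry(vcp="万", tw_rv="萬", cn_rv="万")
wan_traditional = Entry(vcp="萬", tw_rv="萬", cn_rv="万")
print(classify_pair(wan_simplified, wan_traditional))  # 2.3
print(is_singular(Entry(vcp="青", tw_rv="青", cn_rv="青")))  # True
```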
Chinese Character Mapping Table (CCMT) ---- Table characters
Singular-relation character (VCP = twRV = cnRV): 13,888 (66.4%)
VCP = twRV ≠ cnRV: 2,783 (13.3%)
VCP = cnRV ≠ twRV: 2,453 (11.7%)
VCP ≠ (twRV = cnRV): 333 (1.6%)
VCP ≠ twRV ≠ cnRV: 387 (1.9%)
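The percentages on this slide can be reproduced by dividing each count by the 20,902 valid code points. The short check below is only arithmetic on the published figures; the dictionary labels are shorthand chosen for the sketch. It also shows that the five listed categories do not add up to the full table, and the slide does not say how the remaining code points are classified.

```python
TOTAL_VCP = 20902  # valid code points in 4E00-9FA5

# Counts as given on the slide; keys are shorthand labels for this sketch.
counts = {
    "all three equal": 13888,
    "VCP equals twRV only": 2783,
    "VCP equals cnRV only": 2453,
    "twRV equals cnRV only": 333,
    "all three differ": 387,
}


def percent(n: int, total: int = TOTAL_VCP) -> float:
    """Share of the total, rounded to one decimal place."""
    return round(100 * n / total, 1)


shares = {label: percent(n) for label, n in counts.items()}
listed = sum(counts.values())
unlisted = TOTAL_VCP - listed

for label, share in shares.items():
    print(f"{label}: {share}%")
print(f"listed: {listed}, not covered by these categories: {unlisted}")
```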
Chinese Character Mapping Table (CCMT) for Chinese Domain Name
Case Study -- Sources
Han char. IDN: an IDN that contains any CJK Unified Ideographs character
IDN.tw: valid code points are limited to the Big5 code range
Case Study Method
Apply Mapping Table to Case I ~ IV
Convert to twRV -- collision with twRV:
  竹叶青 → 竹葉青
  竹葉青 → 竹葉青
Convert to cnRV -- collision with cnRV:
  万事如意 → 万事如意
  萬事如意 → 万事如意
Convert to CV -- collision with CV:
  一个 → 一个、一個、一箇
  一個 → 一个、一個、一箇
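The method above amounts to converting each name character by character and comparing the results. A minimal sketch follows, assuming a toy table that contains only the characters shown on this slide (the real CCMT covers 20,902 code points); the dictionary and function names are hypothetical.

```python
from itertools import product

# Toy tables holding only the entries visible on the slide.
TW_RV = {"叶": "葉"}  # VCP -> recommended variant by .tw
CN_RV = {"萬": "万"}  # VCP -> recommended variant by .cn
CV = {ch: ("个", "個", "箇") for ch in "个個箇"}  # VCP -> character variants


def convert(name: str, table: dict) -> str:
    """Replace each character by its recommended variant, if it has one."""
    return "".join(table.get(ch, ch) for ch in name)


def collides(a: str, b: str, table: dict) -> bool:
    """Two names collide when they convert to the same string."""
    return convert(a, table) == convert(b, table)


def cv_forms(name: str) -> set:
    """All names reachable by swapping characters for their variants."""
    choices = [CV.get(ch, (ch,)) for ch in name]
    return {"".join(combo) for combo in product(*choices)}


def collides_cv(a: str, b: str) -> bool:
    """Two names collide under CV when one is a variant form of the other."""
    return b in cv_forms(a)


print(convert("竹叶青", TW_RV))               # 竹葉青
print(collides("萬事如意", "万事如意", CN_RV))  # True
print(sorted(cv_forms("一个")))
```

Note that `cv_forms` grows multiplicatively with the number of variant-bearing characters in a name, so a production system would more likely compare canonical keys than enumerate every form.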
Case Study Example
Real case in IDN.com: 为什么、为什麽、为甚么、為什么、為什麼、為甚麼
These six registered names should be treated as one name.
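One way to detect that these six names belong together is to reduce every name to a canonical key and group names that share a key. In the sketch below the variant classes are inferred from the six names themselves, and the choice of representative for each class is arbitrary; neither is taken from the CCMT.

```python
from collections import defaultdict

# Variant classes inferred from the example; the first member of each
# class is used as its representative (an arbitrary choice for this sketch).
VARIANT_CLASSES = [
    ("為", "为"),
    ("什", "甚"),
    ("麼", "么", "麽"),
]
REPRESENTATIVE = {ch: cls[0] for cls in VARIANT_CLASSES for ch in cls}


def canonical_key(name: str) -> str:
    """Map every character to the representative of its variant class."""
    return "".join(REPRESENTATIVE.get(ch, ch) for ch in name)


def group_names(names):
    """Group registered names that share a canonical key."""
    groups = defaultdict(list)
    for name in names:
        groups[canonical_key(name)].append(name)
    return dict(groups)


registered = ["为什么", "为什麽", "为甚么", "為什么", "為什麼", "為甚麼"]
groups = group_names(registered)
for key, members in groups.items():
    print(key, len(members), members)
```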
Case Study -- idn.tw Example
1. The current valid code point set for IDN.tw is Big5 (13,051), smaller than the set in the CCMT tables (20,902).
2. The current tentative TC/SC mapping table (old version) differs slightly from the CCMT tables.
3. Even though the applied table differs slightly, the number of name conflicts is greatly reduced.
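Case Study -- real registered IDN name example
(names joined by 、 form one variant group; groups are separated by ;)
財產、財産、财产; 財產保險、财产保险; 財產稅、财产税; 財產管理、財産管理、财产管理; 財神、财神; 財神到、财神到; 財神爺、财神爷
运财、運財; 运货汽车、運貨汽車; 运输、運輸; 运输学、運輸學; 运输服务、運輸服務; 运输设备、運輸設備; 運転、運轉
龍圖蛇業、龙图蛇业
龍之杰醫院、龙之杰医院; 龍之杰集團、龙之杰集团
歯科材料、齒科材料、齿科材料
黃金時代、黄金时代、黄金時代
黃山中旅、黄山中旅; 黃山之旅、黄山之旅; 黃山國旅、黄山国旅; 黃山旅遊、黄山旅遊; 黃帝、黄帝
鹿儿岛、鹿兒島; 鹿儿岛大学、鹿児島大学; 鹿児島市、鹿兒島市; 鹿児島銀行、鹿兒島銀行; 鹿岛、鹿島、鹿嶋; 鹿岛建设、鹿島建設
麻将、麻將; 麻将世界、麻將世界; 麻将桌、麻將桌; 麻将馆、麻將館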
Case Study -- Conclusion
• IDN.com case: without any mechanism to reduce name confusion, about 18% to 23% of registered IDN.com names have a name conflict problem.
• IDN.net case: about 16% to 21% (considering character variants).
• IDN.org case: about 15% to 20% (considering character variants).
• IDN.tw case: a very small percentage of name conflicts, if the mapping table mechanism is applied.
Case Study -- Conclusion (cont.)
• The more IDN names are registered, the higher the percentage of name conflicts (idn.com has a higher percentage of name conflicts than idn.org).
• In the Chinese case, applying the recommended variants rule removes most name conflicts, and applying the character variants rule reduces them further.
• Without any mechanism to reduce name confusion, idn.com (242,512 IDN names), for example, will have about 18% to 23% name confusion. If the number of names increases, the percentage will increase too.
• If the valid code points are expanded from CJK Unified Ideographs 4E00-9FA5 (20,902) to the whole Unicode code space, the situation will be worse than in this case study.
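To put the idn.com percentages in absolute terms, the 18% to 23% range can be applied to the 242,512 registered names quoted above. This is plain arithmetic on the slide's figures, rounded to whole names; the function name is illustrative.

```python
IDN_COM_NAMES = 242512  # registered idn.com names quoted on the slide


def affected_names(total: int, pct: float) -> int:
    """Number of names corresponding to pct percent of total, rounded."""
    return round(total * pct / 100)


low = affected_names(IDN_COM_NAMES, 18)
high = affected_names(IDN_COM_NAMES, 23)
print(f"roughly {low:,} to {high:,} names with a potential conflict")
```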