
CJK (Han) Characters in UNICODE

The Impact of IDN Registration policy by UNICODE variants issue -- Case Study on Chinese Characters. Vincent WS Chen, TWNIC, October 28, 2002.


Presentation Transcript


  1. The Impact of IDN Registration policy by UNICODE variants issue -- Case Study on Chinese Characters. Vincent WS Chen, TWNIC, October 28, 2002

  2. CJK (Han) Characters in UNICODE
     • IDN proposed standards will adopt Unicode 3.2
     • CJK Unified Ideographs:
       • 4E00-9FAF
       • 3400-4DBF (Extension A)
       • 20000-2A6DF (Extension B)
     • CJK Compatibility Ideographs:
       • F900-FAFF
       • 2F800-2FA1F (Supplement)
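The ranges on this slide can be turned into a simple membership test. The following is a minimal Python sketch, not part of the original presentation; the names `CJK_RANGES` and `han_block` are hypothetical, and the boundaries are copied from the slide as written rather than from any particular Unicode data file.

```python
# Code point ranges as listed on slide 2 (inclusive bounds).
CJK_RANGES = [
    (0x4E00, 0x9FAF, "CJK Unified Ideographs"),
    (0x3400, 0x4DBF, "Extension A"),
    (0x20000, 0x2A6DF, "Extension B"),
    (0xF900, 0xFAFF, "CJK Compatibility Ideographs"),
    (0x2F800, 0x2FA1F, "CJK Compatibility Ideographs Supplement"),
]


def han_block(ch):
    """Return the slide's block name for a single character, or None."""
    cp = ord(ch)
    for lo, hi, name in CJK_RANGES:
        if lo <= cp <= hi:
            return name
    return None


def is_han_name(name):
    """True if every character of the name falls in one of the ranges."""
    return all(han_block(ch) is not None for ch in name)
```

A registry could use a check of this kind to decide which labels are subject to Han variant handling at all; labels mixing Han and other scripts would need a separate policy.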

  3. UNICODE and Local Encoding
     [Diagram: UNICODE covers CJK (Han) characters, Greek, Cyrillic, and other scripts; the local encodings shown are Chinese Big5, GB, Japanese JIS, and Korean Hangul.]
     • The scope of UNICODE is larger than that of any local encoding.
     • Unicode is character-based, not language-based.
     • How can the characters corresponding to one language be specified?

  4. Analysis Flow
     Abbreviations: VCP: valid code point; twRV: recommended variants by .tw; cnRV: recommended variants by .cn; CV: character variants
     • Data sources: registered IDN.com, IDN.net, IDN.org; registered IDN.tw
     • Table: Chinese Character Mapping Table; valid code points 4E00-9FA5 (20,902)
     • Name conflict analysis
     • Results: % collision with twRV; % collision with cnRV; % collision with twRV and cnRV; % collision with CV

  5. Chinese Character Mapping Table (CCMT) --- Sources of Character Codes
     Based on the UCS, CNS 14649, published in 2002, and referred to as the Mapping Table Source. The range of codes is described below:
     Block Name: CJK Unified Ideographs; Code Range: 4E00-9FA5 (20,902)
     • Character for registration (Valid code point): all Chinese character codes in the Mapping Table Source (20,902)
     • Primary corresponding character (Recommended variants by .tw): T-source Chinese character codes in the Mapping Table Source (18,368)
     • Secondary corresponding character (Recommended variants by .cn): G-source Chinese character codes in the Mapping Table Source (20,902)
     • Relevant character (Character variants): all Chinese character codes in the Mapping Table Source

  6. Chinese Character Mapping Table (CCMT) ---- Table format

  7. Chinese Character Mapping Table (CCMT)---- Table format (cont.)

  8. Chinese Character Mapping Table (CCMT) ---- Table characters
     • Singular-relation character (VCP=twRV=cnRV=CV): 13888 (66.4%)
     • VCP=twRV≠cnRV: 2783 (13.3%)
     • VCP=cnRV≠twRV: 2453 (11.7%)
     • VCP≠(twRV=cnRV): 333 (1.6%)
     • VCP≠twRV≠SCR: 387 (1.9%) (SCR is presumably the secondary corresponding character, i.e. cnRV)
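The percentages on this slide are consistent with dividing each count by the 20,902 valid code points in 4E00-9FA5. A short Python sketch of that arithmetic follows; the dictionary keys are shorthand labels chosen here for illustration, not names from the CCMT.

```python
# Size of the range 4E00-9FA5, inclusive: 20,902 code points.
TOTAL_VCP = 0x9FA5 - 0x4E00 + 1

# Counts as given on slide 8 (keys are illustrative shorthand).
COUNTS = {
    "VCP=twRV=cnRV=CV": 13888,
    "VCP=twRV!=cnRV": 2783,
    "VCP=cnRV!=twRV": 2453,
    "VCP!=(twRV=cnRV)": 333,
    "VCP!=twRV!=SCR": 387,
}


def share(count, total=TOTAL_VCP):
    """Percentage of the valid code points, rounded to one decimal."""
    return round(100 * count / total, 1)


SHARES = {label: share(n) for label, n in COUNTS.items()}
LISTED = sum(COUNTS.values())        # code points covered by the five rows
UNLISTED = TOTAL_VCP - LISTED        # code points not broken down on the slide
```

The five rows add up to 19,844 code points, so 1,058 of the 20,902 are not itemized on the slide; the transcript does not say which category they belong to.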

  9. Chinese Character Mapping Table(CCMT)for Chinese Domain Name

  10. Chinese Character Mapping Table (CCMT) for Chinese Domain Name
     • The table draft has been prepared by the CCMT Task Force, organized by TWNIC in January 2002.
     • The task force has 9 members: linguists, computer experts, and DNS experts.
     • The table draft has been submitted to the Bureau of Standards, Ministry of Economic Affairs, for final review.
     • The table is also currently being reviewed by linguists invited by CDNC members.
     • The CNS Standard version is tentatively scheduled for publication in December 2002.

  11. Analysis Flow
     Abbreviations: VCP: valid code point; twRV: recommended variants by .tw; cnRV: recommended variants by .cn; CV: character variants
     • Data sources: registered IDN.com, IDN.net, IDN.org; registered IDN.tw
     • Table: Chinese Character Mapping Table; valid code points 4E00-9FA5 (20,902)
     • Name conflict analysis
     • Results: % collision with twRV; % collision with cnRV; % collision with twRV and cnRV; % collision with CV

  12. Case Study -- Data Sources
     • CJK Han char. IDN: an IDN in which any character falls within the CJK Unified Ideographs (VCP)
     • IDN.tw: an IDN in which any character falls within the scope of Big5 characters

  13. Case Study -- Method for collision calculation
     Apply Mapping Table to Case I ~ IV
     • Convert to twRV → collision with twRV: 竹叶青 → 竹葉青; 竹葉青 → 竹葉青
     • Convert to cnRV → collision with cnRV: 万事如意 → 万事如意; 萬事如意 → 万事如意
     • Convert to CV → collision with CV: 一个 → 一个、一個、一箇; 一個 → 一个、一個、一箇
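The twRV and cnRV conversions on this slide can be sketched as a per-character lookup followed by grouping names that convert to the same string. The Python below is an illustration only: `TW_RV` and `CN_RV` contain just the two mappings visible in the slide's examples (the real CCMT is far larger), and the collision rate is defined here, as an assumption, as the share of names that share a converted form with at least one other name.

```python
from collections import defaultdict

# Hypothetical excerpts of the mapping table, taken from the slide's examples.
# Characters not listed are assumed to map to themselves.
TW_RV = {"叶": "葉"}   # recommended variant by .tw
CN_RV = {"萬": "万"}   # recommended variant by .cn


def convert(name, table):
    """Replace each character by its recommended variant."""
    return "".join(table.get(ch, ch) for ch in name)


def collisions(names, table):
    """Group names by converted form; keep groups with more than one name."""
    groups = defaultdict(list)
    for name in names:
        groups[convert(name, table)].append(name)
    return {key: group for key, group in groups.items() if len(group) > 1}


def collision_rate(names, table):
    """Share of names that collide with at least one other name."""
    colliding = sum(len(g) for g in collisions(names, table).values())
    return colliding / len(names)


registered = ["竹叶青", "竹葉青", "万事如意", "萬事如意", "台灣"]
tw_groups = collisions(registered, TW_RV)
cn_groups = collisions(registered, CN_RV)
```

With this sample list, the twRV pass groups the two 竹葉青 spellings and the cnRV pass groups the two 万事如意 spellings, so each pass flags two of the five names. The CV conversion differs because it maps one character to a set of variants rather than to a single recommended form.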

  14. Case Study– Result (only CJK domain name)

  15. Case Study Example
     Real case in IDN.com: 为什么, 为什麽, 为甚么, 為什么, 為什麼, 為甚麼
     These six registered names should be treated as one name.
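One way to see why these six names count as one is to reduce every character to a representative of its variant class. The sketch below is in Python; the three classes in `VARIANT_CLASSES` are inferred from the six names on this slide and are an assumption, not entries quoted from the CCMT. Picking the lowest code point as the representative is an arbitrary choice made only so that all variants produce the same key.

```python
# Hypothetical character-variant classes inferred from the slide's example.
VARIANT_CLASSES = [
    {"为", "為"},
    {"什", "甚"},
    {"么", "麼", "麽"},
]

# Map every character to one representative of its class (lowest code point).
CANON = {ch: min(cls) for cls in VARIANT_CLASSES for ch in cls}


def cv_key(name):
    """Key that is identical for all names differing only by variants."""
    return "".join(CANON.get(ch, ch) for ch in name)


names = ["为什么", "为什麽", "为甚么", "為什么", "為什麼", "為甚麼"]
keys = {cv_key(n) for n in names}
```

All six names reduce to a single key, so a registry comparing keys instead of raw strings would treat the second through sixth registrations as conflicts with the first.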

  16. Case Study -- idn.tw Example
     1. The current valid code points for IDN.tw are the Big5 character set (13,051), fewer than the VCP in the CCMT table (20,902).
     2. idn.tw currently implements a tentative TC/SC mapping table (old version) that differs slightly from the CCMT table.
     3. Even though the applied table differs slightly, the number of name conflicts in the case study is greatly reduced.
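Restricting registration to Big5 characters can be approximated by testing whether a name encodes in Big5. The Python sketch below uses the standard library's `big5` codec as a stand-in; that codec's repertoire is an assumption here and may not match the exact 13,051-character list used by idn.tw. The function name `in_big5` is hypothetical.

```python
def in_big5(name):
    """True if every character of the name can be encoded with the big5 codec."""
    try:
        name.encode("big5")
    except UnicodeEncodeError:
        return False
    return True


traditional_ok = in_big5("財產")   # traditional form
simplified_ok = in_big5("财产")    # simplified form
```

Under this approximation the traditional spelling is accepted and the simplified spelling is rejected at registration time, which is one reason a Big5-only policy sees fewer traditional/simplified collisions before any mapping table is applied.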

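  17. Case Study -- real registered IDN.com name collision examples
     • 財產 / 財産 / 财产; 財產保險 / 财产保险; 財產稅 / 财产税; 財產管理 / 財産管理 / 财产管理; 財神 / 财神; 財神到 / 财神到; 財神爺 / 财神爷
     • 运财 / 運財; 运货汽车 / 運貨汽車; 运输 / 運輸; 运输学 / 運輸學; 运输服务 / 運輸服務; 运输设备 / 運輸設備; 運転 / 運轉
     • 龍圖蛇業 / 龙图蛇业; 龍之杰醫院 / 龙之杰医院; 龍之杰集團 / 龙之杰集团
     • 歯科材料 / 齒科材料 / 齿科材料
     • 黃金時代 / 黄金时代 / 黄金時代
     • 黃山中旅 / 黄山中旅; 黃山之旅 / 黄山之旅; 黃山國旅 / 黄山国旅; 黃山旅遊 / 黄山旅遊; 黃帝 / 黄帝
     • 鹿儿岛 / 鹿兒島; 鹿儿岛大学 / 鹿児島大学; 鹿児島市 / 鹿兒島市; 鹿児島銀行 / 鹿兒島銀行; 鹿岛 / 鹿島 / 鹿嶋; 鹿岛建设 / 鹿島建設
     • 麻将 / 麻將; 麻将世界 / 麻將世界; 麻将桌 / 麻將桌; 麻将馆 / 麻將館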

  18. Case Study -- Conclusion
     • IDN.com case: without any mechanism to reduce name confusion, about 18% to 23% of registered IDN.com names have a name conflict problem.
     • IDN.net case: about 16% to 21%.
     • IDN.org case: about 15% to 20%.
     • IDN.tw case: a very low percentage of name conflicts when the mapping table mechanism is applied.

  19. Case Study -- Conclusion (cont.)
     • Without any mechanism to reduce name conflicts, the more IDN names are registered, the higher the percentage of name conflicts (for example, idn.com has a higher percentage of name conflicts than idn.org).
     • For Chinese characters, applying the recommended variants rule removes most name conflicts, and applying the character variants rule reduces them further.
     • If the valid code points are expanded from CJK Unified Ideographs 4E00-9FA5 (20,902) to all CJK Unicode code points (68,156), the situation will be worse than in this case study.

  20. Discussion ?
