Chinese Information Processing: Basic Concepts and Practice Week 2: Issues on Encoding Chinese Characters
Character Standard Set Character sets are standard sets of characters established for two main purposes: 1. education (non-coded) 2. computing (coded).
Non-coded Character Set: Hanzi in China Xiàndài Hànyǔ Tōngyòng Zìbiǎo 现代汉语通用字表 (List of Commonly Used Characters in Modern Chinese), published on March 25, 1988, is a standardized list of 7,000 hanzi. Among these characters, 2,500 are chángyòng zì 常用字 (frequently used characters) and 1,000 are cì chángyòng zì 次常用字 (secondary frequently used characters); together these make up the 3,500 hanzi of the Xiàndài Hànyǔ Chángyòng Zìbiǎo 现代汉语常用字表.
Non-coded Character Set: Hanzi in China Jiǎnhuàzì Zǒngbiǎo 简化字总表 (Simplified Character Table) enumerates 2,249 simplified hanzi.
Non-coded Character Set: Hanzi in Taiwan
1. The basic set of hanzi in Taiwan is listed in 常用國字標準字體表 Chángyòng Guózì Biāozhǔn Zìtǐ Biǎo (The Table of Standard Commonly Used Chinese Characters). It enumerates 4,808 hanzi.
2. An additional set of 6,431 hanzi is defined in 次常用國字標準字體表 Cìchángyòng Guózì Biāozhǔn Zìtǐ Biǎo (The Table of Standard Secondary Commonly Used Chinese Characters).
3. 18,480 rare hanzi are defined in 罕用字體表 Hǎnyòng Zìtǐ Biǎo (The Table of Rarely Used Characters).
4. 18,609 hanzi variants are defined in 異體國字字表 Yìtǐ Guózì Zìbiǎo (The Table of Character Variants).
(Source: Lunde, 1999, p. 68)
Coded Character Set: ASCII ASCII stands for American Standard Code for Information Interchange. The American Standards Association (ASA) announced it in 1963. It defines 128 characters in total.
Coded Character Set in China: GB GB is an abbreviation of Guójiā Biāozhǔn (国家标准), or "National Standard".
Coded Character Set in China: GB 2312-80
• symbols (94)
• numerals (72)
• ISO 646-CN (94 full-width characters)
• hiragana (83)
• katakana (86)
• Greek alphabet (48)
• Cyrillic (Russian) alphabet (66)
• pinyin characters (26) and bopomofo characters (37)
• line-drawing elements (76)
• hanzi level 1 (3,755, ordered by pinyin reading)
• hanzi level 2 (3,008, ordered by radical, then stroke count)
GBK Table (extension of GB 2312-80; the lead bytes shown here, 0x81 to 0x83, lie outside GB 2312-80 itself)
Row 1 (0x81): 丂丄丅丆丏丒丗丟丠両丣並丩丮丯丱丳丵丷丼乀乁乂乄乆乊乑乕乗乚乛乢乣乤乥乧乨乪乫乬乭乮乯乲乴乵乶乷乸乹乺乻乼乽乿亀亁亂亃亄亅亇亊亐亖亗亙亜亝亞亣亪亯亰亱亴亶亷亸亹亼亽亾仈仌仏仐仒仚仛仜仠仢仦仧仩仭仮仯仱仴仸仹仺仼仾伀伂伃伄伅伆伇伈伋伌伒伓伔伕伖伜伝伡伣伨伩伬伭伮伱伳伵伷伹伻伾伿佀佁佂佄佅佇佈佉佊佋佌佒佔佖佡佢佦佨佪佫佭佮佱佲併佷佸佹佺佽侀侁侂侅來侇侊侌侎侐侒侓侕侖侘侙侚侜侞侟価侢
Row 2 (0x82): 侤侫侭侰侱侲侳侴侶侷侸侹侺侻侼侽侾俀俁係俆俇俈俉俋俌俍俒俓俔俕俖俙俛俠俢俤俥俧俫俬俰俲俴俵俶俷俹俻俼俽俿倀倁倂倃倄倅倆倇倈倉倊個倎倐們倓倕倖倗倛倝倞倠倢倣値倧倫倯倰倱倲倳倴倵倶倷倸倹倻倽倿偀偁偂偄偅偆偉偊偋偍偐偑偒偓偔偖偗偘偙偛偝偞偟偠偡偢偣偤偦偧偨偩偪偫偭偮偯偰偱偲偳側偵偸偹偺偼偽傁傂傃傄傆傇傉傊傋傌傎傏傐傑傒傓傔傕傖傗傘備傚傛傜傝傞傟傠傡傢傤傦傪傫傭傮傯傰傱傳傴債傶傷傸傹傼
Row 3 (0x83): 傽傾傿僀僁僂僃僄僅僆僇僈僉僊僋僌働僎僐僑僒僓僔僕僗僘僙僛僜僝僞僟僠僡僢僣僤僥僨僩僪僫僯僰僱僲僴僶僷僸價僺僼僽僾僿儀儁儂儃億儅儈儉儊儌儍儎儏儐儑儓儔儕儖儗儘儙儚儛儜儝儞償儠儢儣儤儥儦儧儨儩優儫儬儭儮儯儰儱儲儳儴儵儶儷儸儹儺儻儼儽儾兂兇兊兌兎兏児兒兓兗兘兙兛兝兞兟兠兡兣兤兦內兩兪兯兲兺兾兿冃冄円冇冊冋冎冏冐冑冓冔冘冚冝冞冟冡冣冦冧冨冩冪冭冮冴冸冹冺冾冿凁凂凃凅凈凊凍凎凐凒凓凔凕凖凗
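The lead bytes in the table above (0x81 to 0x83) can be checked against Python's built-in codecs. The following is a minimal sketch, not part of the original slides: it assumes the standard `gbk` and `gb2312` codecs, and the variable names are illustrative only. It decodes the first cell of row 0x81 (trail byte 0x40) and shows that a strict GB 2312 decoder rejects it, since GB 2312 lead bytes in the usual 8-bit form start at 0xA1.

```python
# First cell of the row labeled 0x81: lead byte 0x81, trail byte 0x40.
first_cell = bytes([0x81, 0x40])

# A GBK decoder accepts this lead byte.
gbk_char = first_cell.decode("gbk")

# A strict GB 2312 decoder does not, because 0x81 is below its first lead byte (0xA1).
try:
    first_cell.decode("gb2312")
    in_gb2312 = True
except UnicodeDecodeError:
    in_gb2312 = False

print(f"GBK 0x8140 decodes to {gbk_char} (U+{ord(gbk_char):04X}); valid in GB 2312: {in_gb2312}")
```

The decoded character should match the first hanzi listed in row 0x81.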
Coded Character Set in Taiwan: Big 5 (so called because it was drawn up by "five large computer makers")
• symbols (157)
• symbols (157)
• symbols (94)
• hanzi level 1 (5,401 Chinese characters, ordered by number of strokes, then radical)
• hanzi level 2 (7,652 Chinese characters, ordered by number of strokes, then radical)
Big Five Table
Row 1 (0xA1): ,、。.‧;:?!︰…‥﹐﹑﹒·﹔﹕﹖﹗|–︱—︳╴︴﹏()︵︶{}︷︸〔〕︹︺【】︻︼《》︽︾〈〉︿﹀「」﹁﹂『』﹃﹄﹙﹚﹛﹜﹝﹞‘’“”〝〞‵′#&*※§〃○●△▲◎☆★◇◆□■▽▼㊣℅¯ ̄_ˍ﹉﹊﹍﹎﹋﹌﹟﹠﹡+-×÷±√<>=≦≧≠∞≒≡﹢﹣﹤﹥﹦~∩∪⊥∠∟⊿㏒㏑∫∮∵∴♀♂⊕⊙↑↓←→↖↗↙↘∥∣/
Row 2 (0xA2): \∕﹨$¥〒¢£%@℃℉﹩﹪﹫㏕㎜㎝㎞㏎㎡㎎㎏㏄°兙兛兞兝兡兣嗧瓩糎▁▂▃▄▅▆▇█▏▎▍▌▋▊▉┼┴┬┤├▔─│▕┌┐└┘╭╮╰╯═╞╪╡◢◣◥◤╱╲╳0123456789ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩ〡〢〣〤〥〦〧〨〩十卄卅ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuv
Row 4 (0xA4): 一乙丁七乃九了二人儿入八几刀刁力匕十卜又三下丈上丫丸凡久么也乞于亡兀刃勺千叉口土士夕大女子孑孓寸小尢尸山川工己已巳巾干廾弋弓才丑丐不中丰丹之尹予云井互五亢仁什仃仆仇仍今介仄元允內六兮公冗凶分切刈勻勾勿化匹午升卅卞厄友及反壬天夫太夭孔少尤尺屯巴幻廿弔引心戈戶手扎支文斗斤方日曰月木欠止歹毋比毛氏水火爪父爻片牙牛犬王丙
Row 5 (0xA5): 世丕且丘主乍乏乎以付仔仕他仗代令仙仞充兄冉冊冬凹出凸刊加功包匆北匝仟半卉卡占卯卮去可古右召叮叩叨叼司叵叫另只史叱台句叭叻四囚外央失奴奶孕它尼巨巧左市布平幼弁弘弗必戊打扔扒扑斥旦朮本未末札正母民氐永汁汀氾犯玄玉瓜瓦甘生用甩田由甲申疋白皮皿目矛矢石示禾穴立丞丟乒乓乩亙交亦亥仿伉伙伊伕伍伐休伏仲件任仰仳份企伋光兇兆先全
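The row labels in the Big Five table are lead bytes, and each cell in a row corresponds to one trail byte. As a hedged illustration (not from the original slides), the sketch below assumes Python's built-in `big5` codec and the usual Big5 trail-byte ranges 0x40 to 0x7E and 0xA1 to 0xFE. It counts the cells available per row and decodes the first two cells of row 0xA4; the variable names are illustrative only.

```python
# Big5 trail bytes fall in two ranges: 0x40-0x7E and 0xA1-0xFE.
trail_bytes = list(range(0x40, 0x7F)) + list(range(0xA1, 0xFF))
cells_per_row = len(trail_bytes)  # 63 + 94

# Decode the first two cells of the row with lead byte 0xA4.
first_two = bytes([0xA4, trail_bytes[0], 0xA4, trail_bytes[1]]).decode("big5")

print(f"cells per row: {cells_per_row}")
print(f"row 0xA4 starts with: {first_two}")
```

The cell count per row matches the figure of 157 given for each of the first two symbol rows of Big 5.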
CJKV Character Set Server This site generates properly encoded CJKV character sets, which are either displayed directly in your browser or sent to you by e-mail (in uuencoded form, if requested or necessary). http://www.oreilly.com/~lunde/cjkv-char.html
Coded Character Set: Unicode U.S. computer firms began work in the first half of the 1980s on multilingual character sets and multilingual character encoding systems, and Xerox Corporation and IBM Corporation successfully implemented computer systems based on their research results. The Xerox researchers then promoted their work to other U.S. software firms and eventually succeeded in launching a U.S. industry project called Unification Code, or Unicode, whose goal was to unify all of the world's character sets into a single large character set.
ISO/IEC 10646-1:1993
• ISO 646
• ISO 8859-1
• Eastern European accented characters
• International Phonetic Alphabet (IPA)
• Greek (including accented characters, "monotoniko" and "polytoniko")
• Cyrillic, Georgian and Armenian
• Hebrew
• Arabic characters (all four forms: initial, medial, final and stand-alone)
• Indian subcontinent character sets (including Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam)
• Thai and Lao
• Chinese/Japanese/Korean (CJK) ideographic characters (including hangul, katakana, hiragana, and bopomofo)
• Mathematical operators and special character forms
• Box and line drawing characters
• Geometric shapes and Dingbats
• Special OCR characters used on cheques
• Encircled characters and numbers
ᄀᄁᄂᄃᄄᄅᄆᄇᄈᄉᄊᄋᄌᄍᄎᄏᄐᄑᄒᄓᄔᄕᄖᄗᄘᄙᄚᄛᄜᄝᄞᄟᄠᄡᄢ გდევზთიკლმნოპჟრსტუფქღყშჩცძწჭხჯჰჱჲჳჴ अआइईउऊऋऌऍऎएऐऑऒओऔकखगघङचछजझञटठडढ أؤإئابةتثجحخدذرزسشـفقكلمٌٍَُِّْٕٓٔ٤٥٦٧٨
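Because all of these scripts share one code space, a single lookup can identify any of the sample characters. The sketch below is an illustration only, using Python's standard `unicodedata` module. It assumes the four sample runs above begin at the code points U+1100, U+10D2, U+0905 and U+0623; the `samples` dictionary and its labels are hypothetical names chosen for this example.

```python
import unicodedata

# Assumed first code point of each sample run shown above.
samples = {
    "Hangul Jamo": chr(0x1100),
    "Georgian": chr(0x10D2),
    "Devanagari": chr(0x0905),
    "Arabic": chr(0x0623),
}

# Look up the official Unicode character name for each sample.
names = {script: unicodedata.name(ch) for script, ch in samples.items()}

for script, ch in samples.items():
    print(f"U+{ord(ch):04X}  {ch}  {names[script]}")
```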
How to Encode Characters The ASCII letters are represented by binary numbers (made up of zeros and ones) in a character code table. Each ASCII code consists of 7 zeros and ones, so ASCII is called a 7-bit code. A byte is 8 bits, so in practice each ASCII code is stored in 1 byte with the highest bit set to 0.
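As a small illustration of the 7-bit code and its storage in one byte, the sketch below (an example added for clarity, not from the original slides) prints the code for the letter "A" in both forms. The variable names are illustrative only.

```python
letter = "A"

# The ASCII code of the letter as a number, and as 7 binary digits.
code = ord(letter)
bits7 = format(code, "07b")

# Stored in memory or on disk, the same code occupies one 8-bit byte.
stored = letter.encode("ascii")
byte8 = format(stored[0], "08b")

print(f"{letter}: code {code}, 7-bit {bits7}, stored as {byte8} ({len(stored)} byte)")
```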
Encoding Chinese Characters Chinese has far more characters than a 7-bit code can cover, so each character is encoded with two bytes. For "啊" in GB 2312-80, the first byte is 0110000 and the second byte is 0100001 in 7-bit form (0x30 and 0x21). Subtracting 32 (0x20) from each value gives the location of the character: zone 16, position 1. In the common 8-bit form the highest bit of each byte is set to 1, giving 10110000 and 10100001 (0xB0 and 0xA1).
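The zone and position of a character can be recovered from its bytes. The following is a minimal sketch under stated assumptions: it uses Python's built-in `gb2312` codec, which produces the 8-bit form, and the helper `gb2312_position` is a hypothetical name introduced for this example.

```python
def gb2312_position(ch):
    """Return (zone, position) of a two-byte GB 2312 character."""
    encoded = ch.encode("gb2312")
    if len(encoded) != 2:
        raise ValueError("not a two-byte GB 2312 character")
    # In the 8-bit form, each byte is the zone or position number plus 0xA0.
    return encoded[0] - 0xA0, encoded[1] - 0xA0

euc = "啊".encode("gb2312")

# Clearing the highest bit of each byte gives the 7-bit form.
seven_bit = [format(b & 0x7F, "07b") for b in euc]

zone, position = gb2312_position("啊")
print(f"8-bit bytes: {euc.hex()}, 7-bit form: {seven_bit}, zone {zone}, position {position}")
```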
Character Input Issue
• Input characters based on the shape
  • Radical input
  • Handwriting input
  • OCR
• Input characters based on the sound
  • Type in using Pinyin or Zhuyin
  • Speech-to-text (voice recognition)
Character Conversion Issue Microsoft Word conversion
Character Conversion Issue
• Online conversion sites
  • http://www.khngai.com/chinese/tools/convert.php
  • http://www.chinese-tools.com/tools/converter-tradsimp.html
  • http://www.popupchinese.com/tools/adso
Next class: Teaching Characters
• Using online resources
• Typing Chinese: Penless, NJStar, Microsoft IME
• Installing good input methods: Google, Sogou, 紫光…
• Animating characters