Chinese Information Processing (I): Basic Concepts and Practice
Unit 1: The Chinese Language, Chinese Script and Software
Nomenclature
• Mandarin - Guanhua, the official language used in the court; the language of officials
• Guoyu - National Language
• Putonghua - Common Speech, Common Language
• Huayu or Huawen - used in Singapore and overseas
• Hanwen - used in Korea and Japan
• Zhongguohua - Languages in China
• Zhongwen - alternative to Hanyu, focusing on the written language
Chinese dialects (number of speakers)
• Northern (Beijing) 647,000,000
• Wu (Shanghai) 77,000,000
• Yue (Cantonese, Guangzhou) 47,000,000
• Xiang (Hunan, Changsha) 46,000,000
• Min South (Southern Fujianese, Xiamen) 28,000,000
• Min North (Northern Fujianese, Fuzhou, Taiwan) 11,000,000
• Hakka (Mei Xian) 37,000,000
• Gan (Jiangxi, Nanchang) 22,000,000
Pronunciation
• The dialects are mutually unintelligible
• Northern dialects have no voiced initials (b-, d-, g-, z-, v-) and no entering-tone endings (-p, -t, -k, -ʔ)
• Wu dialect has voiced sounds and entering tones, and makes no distinction between z, c, s and zh, ch, sh
• Cantonese has entering tones, but no voiced sounds
Tonal differences
• The number of tones varies across dialects
Mandarin – 4 tones
Tone   Name          Contour   Example
1      Yīn Píng      55        媽
2      Yáng Píng     35        麻
3      Shǎng Shēng   214       馬
4      Qù Shēng      51        罵
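Software often stores Mandarin tones as the digits 1 to 4 appended to a Pinyin syllable (for example "ma3") and converts them to tone marks for display. The following is a minimal sketch of that conversion, assuming the usual placement rule (mark "a" or "e" if present, the "o" in "ou", otherwise the last vowel). The function name `mark_tone` and the table `TONE_MARKS` are illustrative names, not part of any particular product.

```python
# Illustrative table: each vowel maps to its four tone-marked forms (tones 1-4).
TONE_MARKS = {
    "a": "āáǎà",
    "e": "ēéěè",
    "i": "īíǐì",
    "o": "ōóǒò",
    "u": "ūúǔù",
    "ü": "ǖǘǚǜ",
}


def mark_tone(syllable):
    """Convert a tone-numbered syllable such as 'ma3' to 'mǎ'."""
    if not syllable or not syllable[-1].isdigit():
        return syllable  # no tone digit: return unchanged
    base, tone = syllable[:-1], int(syllable[-1])
    if tone not in (1, 2, 3, 4):
        return base  # any other digit is treated as neutral tone: no mark
    # Placement rule (assumed): 'a' or 'e' first, then the 'o' of 'ou',
    # otherwise the last vowel of the syllable.
    if "a" in base:
        idx = base.index("a")
    elif "e" in base:
        idx = base.index("e")
    elif "ou" in base:
        idx = base.index("o")
    else:
        vowels = [i for i, c in enumerate(base) if c in TONE_MARKS]
        if not vowels:
            return base
        idx = vowels[-1]
    return base[:idx] + TONE_MARKS[base[idx]][tone - 1] + base[idx + 1:]


if __name__ == "__main__":
    # The four syllables from the table above: 媽 麻 馬 罵
    print([mark_tone(s) for s in ("ma1", "ma2", "ma3", "ma4")])
```

Applied to the four example syllables, the sketch yields mā, má, mǎ and mà.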
Tones in Wu and Cantonese
• Wu dialect – 5 tones
• Cantonese – 9 tones
Why are dialect issues related to Chinese information processing?
1. Users often input characters by their pronunciation. If a user's pronunciation is not standard, the Pinyin typed will be incorrect, and the intended character may not be retrieved.
2. Since educated users know the structure of characters, the stroke count, character components or radicals may be used to input characters instead.
3. When voice recognition software is developed, dialect accents must be taken into consideration.
4. When OCR software is developed, character structure must be taken into consideration.
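One way an input method might tolerate the dialect confusion described in point 1 is "fuzzy" Pinyin matching, in which z/zh, c/ch and s/sh are treated as equivalent when looking up candidates. The sketch below illustrates the idea only. The lexicon is a tiny made-up sample, and `LEXICON`, `normalize` and `candidates` are hypothetical names rather than the interface of any real input method.

```python
# Hypothetical toy lexicon: Pinyin syllable -> candidate characters.
LEXICON = {
    "shi": ["是", "十"],
    "si": ["四", "思"],
    "zhi": ["知"],
    "zi": ["字"],
}

# Initials that Wu speakers (per the Pronunciation slide) may not distinguish.
FUZZY_INITIALS = [("zh", "z"), ("ch", "c"), ("sh", "s")]


def normalize(syllable):
    """Collapse zh/ch/sh onto z/c/s so confused spellings compare equal."""
    for retroflex, plain in FUZZY_INITIALS:
        if syllable.startswith(retroflex):
            return plain + syllable[len(retroflex):]
    return syllable


def candidates(typed, fuzzy=True):
    """Return candidate characters, exact matches first, fuzzy matches after."""
    result = list(LEXICON.get(typed, []))
    if fuzzy:
        key = normalize(typed)
        for syllable, chars in LEXICON.items():
            if syllable != typed and normalize(syllable) == key:
                result.extend(chars)
    return result


if __name__ == "__main__":
    print(candidates("si", fuzzy=False))  # exact lookup only
    print(candidates("si"))               # also offers the "shi" characters
```

With fuzzy matching enabled, a user who types "si" while meaning "shi" still sees 是 among the candidates. The cost is a longer candidate list to choose from.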
Chinese script: Issues Related to Chinese Information Processing • Number of characters • Structure of characters • Character evolution • Traditional vs. simplified characters
Number of Chinese Characters
==========================================================
Date   Dynasty or period   Name of dictionary     Number
----------------------------------------------------------
100    Eastern Han         Shuowen Jiezi           9,353
1615   Ming                Zihui                  33,179
1716   Qing                Kangxi Zidian          47,035
1916   Republic            Zhonghua Da Zidian     48,000
==========================================================
(Source: Norman, 1988)
Number of Frequently Used Chinese Characters
The Language and Script Committee and the Education Commission have published "The Frequently Used Characters of Modern Chinese", which includes 2,500 frequently used characters and 1,000 less frequently used characters. (Source: Li Xingjian and Fei Jinchang, People's Daily, 9/25/2001.)
Character Structure
Character: 好
Components: 女 子
Strokes: 丶 一 丨 丿
Important to remember:
• Single characters: 一, 乙
• Compound characters: 明, 海
• Radicals: 女, 人, 口
• Characters can be decomposed
• Characters have some basic components
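Because characters can be decomposed, software can index them by component and stroke count, which is the basis of structure-based input. The following is a minimal sketch using a hand-made table of three characters. The table `DECOMPOSITION` and the function `find_characters` are hypothetical names for illustration, and a real system would rely on a far larger decomposition database.

```python
# Hypothetical toy table: character -> (components, total stroke count).
DECOMPOSITION = {
    "好": (["女", "子"], 6),
    "明": (["日", "月"], 8),
    "海": (["氵", "每"], 10),
}


def find_characters(component, strokes=None):
    """Return characters containing `component`, optionally filtered by stroke count."""
    matches = []
    for char, (parts, count) in DECOMPOSITION.items():
        if component in parts and (strokes is None or count == strokes):
            matches.append(char)
    return matches


if __name__ == "__main__":
    print(find_characters("女"))             # characters built with 女
    print(find_characters("日", strokes=8))  # narrowed by stroke count
```

Looking up 女 returns 好. Adding the stroke count narrows the candidates further when many characters share a component.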
Character Evolution
(Figure. Source: library.thinkquest.org/C004203/art/chinese.jpg)
Definition of Character, Glyph, Typeface and Font
• Character - an abstract notion indicating a class of shapes declared to have the same meaning or form.
• Glyph - a specific instance of a character, e.g., 囘 回
• Typeface - the printed style of a glyph or character set: 中, 中, 中, 中
• Font - a single instance of a typeface, such as a specific point size: 中, 中, 中
Traditional vs. simplified characters
Simplification of characters has long been a disputed topic in China. Advocacy of character simplification began in the early years of the Republic, but simplification was truly implemented only after 1949. In 1956, the Committee on Language Reform promulgated a list of 515 simplified characters and 54 simplified components or parts. Currently, simplified characters are used in Mainland China and Singapore, while traditional characters are used in Taiwan and Hong Kong. In overseas Chinese communities, a mixed situation can be observed.
Problems this dual system causes for Chinese computing
• A computer must store two sets of characters (for display and printing), which greatly increases the storage space required.
• Input methods based on strokes or components may differ between the two sets.
• The radicals or components of traditional and simplified characters are different: 谢 and 謝 have different radicals, and 后 and 後 are completely different glyphs.
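The two character sets are also tied to different encodings: the GB code for simplified characters and the Big5 code for traditional characters, both of which appear in the software list later in this unit. The sketch below uses Python's built-in `gb2312` and `big5` codecs to check which encoding can represent a given character. The helper name `can_encode` is illustrative.

```python
def can_encode(char, codec):
    """Return True if `char` can be represented in the given encoding."""
    try:
        char.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for char in ("谢", "謝"):
        for codec in ("gb2312", "big5"):
            print(char, codec, can_encode(char, codec))
    # Both encodings use two bytes per Chinese character.
    print(len("谢".encode("gb2312")), len("謝".encode("big5")))
```

A text written in one script therefore cannot simply be reinterpreted in the other encoding. It has to be converted, which leads to the conversion problem.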
Problem of Conversion
Traditional:  後來  皇后  心臟  骯髒  關係
Simplified:   后来  皇后  心脏  肮脏  关系
Problems caused by conversion from simplified characters to traditional characters (incorrect result, with the correct form in parentheses):
后来 => 后來 (後來)
心脏 => 心髒 (心臟)
关系 => 關系 (關係)
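The errors above arise because one simplified character can correspond to more than one traditional character, so a character-by-character table must pick a single answer regardless of context. A common remedy is to consult a word-level table first. The sketch below contrasts the two approaches using only the examples from this slide. The tables `CHAR_MAP` and `WORD_MAP` and both function names are illustrative assumptions, not the method of any specific converter.

```python
# Hypothetical one-to-one character table (simplified -> traditional).
# Characters not listed map to themselves, e.g. 后, 心, 系.
CHAR_MAP = {"来": "來", "脏": "髒", "关": "關", "肮": "骯"}

# Hypothetical word table that resolves context-dependent cases.
WORD_MAP = {
    "后来": "後來",
    "皇后": "皇后",
    "心脏": "心臟",
    "肮脏": "骯髒",
    "关系": "關係",
}


def convert_naive(text):
    """Convert character by character, ignoring context."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)


def convert_with_words(text):
    """Greedy longest-match against WORD_MAP, falling back to CHAR_MAP."""
    max_len = max(len(word) for word in WORD_MAP)
    out = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in WORD_MAP:
                out.append(WORD_MAP[chunk])
                i += n
                break
        else:
            # No word matched at this position: convert a single character.
            out.append(CHAR_MAP.get(text[i], text[i]))
            i += 1
    return "".join(out)


if __name__ == "__main__":
    for word in ("后来", "皇后", "心脏", "肮脏", "关系"):
        print(word, convert_naive(word), convert_with_words(word))
```

The naive function reproduces the three incorrect results shown above, while the word-level lookup returns 後來, 心臟 and 關係. Word-level conversion still depends on the coverage of the word table and on segmenting the text correctly.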
Chinese Software • Chinese Word Processors • Chinese Systems • Chinese Windows • Third party Chinese systems
DOS-based software
• Byx: a simple DOS-based Chinese word processor. It handles simplified characters only, uses GB code, and has only one print font.
• NJSTAR: a DOS- and Windows-based Chinese word processor that handles both simplified and traditional characters.
• Kuochiao: a DOS-based Chinese system for traditional characters, using Big5 code.
• Yitien: same as Kuochiao.
• CCDOS: a DOS-based Chinese system for simplified characters, using GB code.
Windows-based software
• Twinbridge http://www.twinbridge.com
• Chinese Star http://www.suntendyusa.com/
• Unionway http://www.unionway.com/tea/html/0/1.html
• Richwin http://richwin.sina.com.cn/
• Microsoft Cwindows and Pwindows
• Microsoft multilingual support in Windows 2000 and XP
• 5.02 Install and Use of IME from Office 2000 multilanguage pack
• (Mac OS with multilingual support)