The Ideographic Composition Scheme and Its Applications in Chinese Text Processing
Qin LU, Department of Computing, The Hong Kong Polytechnic University
• Introduction to Ideographic Description Characters
• The ideographic composition scheme
• The Hong Kong Glyph Specification Project
What Are Ideographic Description Characters (IDCs)?
• 12 structure symbols used to describe how a character is formed from smaller ideographic functional units
Characteristics of Ideographs
• Ideographic characters are often formed from smaller ideographic elements such as radicals, ideographs proper, and other ideographic components
• This kind of composition is natural in the formation of characters
• Examples: 2 components => [example glyphs not preserved]
• Chinese speakers have long used components to describe characters, especially to tell apart characters with the same pronunciation
Problems with Ideographic Character Encoding
• Each character is treated as a distinct symbol and is thus given its own code point
• Code point assignment within a block does try to follow radical order, but it does not consider substructures (components), so that information is not revealed by the encoding
• Whenever a new character is created, a code point must be allocated, which makes standardization a potentially endless process
• Encoding rarely used ideographic characters wastes resources, both code space and standardization effort
Introduction of IDCs
• Work started in 1995 in ISO/IEC SC2/WG2/IRG
• Objective of the original proposal: use coded ideographs and “structure symbols” to describe not-yet-coded ideographs
• The original proposal had 15 “Ideograph Structure Symbols”, based on a study of Han characters; three of them did not make it into ISO 10646/Unicode:
  • Ideograph_Proper (日): every coded character is considered an ideograph proper, so the symbol is not needed
  • Left_Up_Encompass: no unencoded example
  • Mirror_Symmetry (非): the left part is mirrored to the right, but this can be described by Left_to_Right
• The remaining 12 symbols were renamed Ideographic Description Characters
Ideographic Composition Scheme
• An Ideographic Description Sequence (IDS) describes a character by its components and their relative positions.
• IDCs act as operators on the components.
• IDSs can be expressed by a context-free grammar in Backus-Naur Form. The grammar G has four components:
• Let G = {Σ, N, P, S}, where
  • Σ: the set of terminal symbols: coded radicals, coded ideographs, and the 12 IDCs
  • N: the set of 5 non-terminal symbols, N = {IDS, IDS1, Binary_Symbol, Ternary_Symbol, Ideograph_Component}
  • S = {IDS}, the start symbol of the grammar
  • P: a set of rewrite rules
• <IDS> ::= <Binary_Symbol> <IDS1> <IDS1> | <Ternary_Symbol> <IDS1> <IDS1> <IDS1>
• <IDS1> ::= <IDS> | <Ideograph_Component>
• <Ideograph_Component> ::= coded_ideograph | coded_radical | coded_component
• <Binary_Symbol> ::= ⿰ | ⿱ | ⿴ | ⿵ | ⿶ | ⿷ | ⿸ | ⿹ | ⿺ | ⿻
• <Ternary_Symbol> ::= ⿲ | ⿳
• Note that even though the IDCs are terminal symbols, they are not part of the ideograph components.
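The rewrite rules above map directly onto a small recursive-descent recognizer. The following is a minimal sketch in Python, not the project's implementation. It assumes the 12 IDCs are the code points U+2FF0 to U+2FFB (with U+2FF2 and U+2FF3 as the ternary symbols), and it treats every other character as an `<Ideograph_Component>` without checking that it is actually a coded ideograph, radical, or component. The function names are hypothetical.

```python
# Assumption: the 12 IDCs are U+2FF0..U+2FFB; U+2FF2 and U+2FF3 are ternary.
TERNARY = {"\u2ff2", "\u2ff3"}
BINARY = {chr(cp) for cp in range(0x2FF0, 0x2FFC)} - TERNARY


def _parse_ids(s: str, i: int) -> int:
    """Consume one <IDS> starting at index i; return the index just after it."""
    if i >= len(s):
        raise ValueError("unexpected end of sequence")
    if s[i] in BINARY:
        arity = 2
    elif s[i] in TERNARY:
        arity = 3
    else:
        raise ValueError("an IDS must start with an IDC")
    i += 1
    for _ in range(arity):
        i = _parse_ids1(s, i)
    return i


def _parse_ids1(s: str, i: int) -> int:
    """Consume one <IDS1>: either a nested <IDS> or a single component."""
    if i >= len(s):
        raise ValueError("unexpected end of sequence")
    if s[i] in BINARY or s[i] in TERNARY:
        return _parse_ids(s, i)
    # Simplification: any non-IDC character counts as an <Ideograph_Component>.
    return i + 1


def is_valid_ids(s: str) -> bool:
    """True if the whole string is exactly one <IDS> under the grammar."""
    try:
        return _parse_ids(s, 0) == len(s)
    except ValueError:
        return False


print(is_valid_ids("⿰女子"))      # binary IDC with two components
print(is_valid_ids("⿱艹⿰日月"))  # nested IDS as the second operand
print(is_valid_ids("⿲彳亍"))      # ternary IDC with only two operands
```

Under this grammar a bare component such as "女" is not an IDS on its own, because the start symbol always expands to an IDC followed by its operands.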
• An IDS allows a character to be described by different sequences
• That is, the composition scheme allows a character to be formed from different component characters
• An IDS describes ideographic character composition at an abstract level: it indicates the relative positions of the components but not their proportions.
• It is not intended for rendering.
• Nesting is natural in ideographs, and it is reflected in the IDS scheme.
Extending the Objectives of IDCs
• Using coded characters to describe not-yet-coded ideographs, both for representation and for exchange
• Limiting standardization to modern characters only, excluding rarely used characters
• Learning character composition (education)
• Revealing the substructures of ideographic characters
• Describing ideograph variants
The Hong Kong Glyph Specification Project
• Objective of the project: provide computer (font) vendors with a set of glyph specifications for all ISO 10646 characters and the Hong Kong Supplementary Character Set that adheres to Hong Kong’s common writing style, so as to facilitate publishing in Hong Kong.
• In effect, an H column as a horizontal extension of ISO 10646 (horizontal extension is a confusing concept to many)
• Different styles are due to Chinese character variants
Major References Leading to This Project
• The Hong Kong Education Institute’s book “The Common Character Glyph Set” 《常用字字形表》, published in 1997 and revised in 2000, for elementary school education
  • Number of characters: 4,751
  • Hand-written, with some inconsistencies and variants
• Hong Kong Supplementary Character Set (HKSCS; 4,702 characters), published in September 1999; some GCCS characters were unified with Big5 characters even though they are variants
• Extension to HKSCS: 97 characters
  • 69 in the BMP (including 10 in Extension A), 22 in Ext. B, and perhaps 6 in Ext. C
• State Language Commission (國家語言文字工作委員會), “Chinese Character Component Standard of the GB13000.1 Character Set for Information Processing” 《信息處理用GB13000.1字符集漢字部件規範》 (GF3001-1997, December 1997)
• Industrial Support Fund: support for the Hong Kong glyph specification, HKD 3.67M
Problems and Scope of Work
• The CCS gives only 4,751 characters, but ISO 10646 has 27,484 characters, and there are also over 1,000 HKSCS characters in Ext. B
• Avoid listing every character in ISO 10646 by using components
• The rationale: if “bone” should be written in a certain way, any character with “bone” as a component should follow the same style
• Characters in ISO 10646 that are out of scope:
  • Simplified characters: follow the mainland glyph
  • Characters from only one country/region source (no unification; the Chinese GE source is not considered an independent source): follow the glyph provided in ISO 10646
• A special working group was set up in October 2000
• www.comp.polyu.edu.hk/~glyphwg
Component Table
• Based on the components defined in GF3001, 1997, with a set of decomposition rules
• 620 components in total
• Some components are not coded, so internally created codes are used to represent them
Character Decomposition Table
• Has one entry for each character, giving its decomposition sequence (minimum decomposition, one level only)
• Characters that are commonly recognized as radicals or components are not decomposed further
• Structure symbols are retained to facilitate both upward and downward search
• Upward search: find all characters that contain a given component
• Downward search: find all components of a given character
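The two searches can be sketched over a one-level decomposition table as follows. This is an illustrative Python sketch, not the project's code: the four table entries are hypothetical samples rather than rows from the actual Character Decomposition Table, the IDC range U+2FF0 to U+2FFB is assumed, and both searches are assumed to follow decompositions through every level by chaining the one-level entries.

```python
from collections import defaultdict

# Assumption: IDCs are U+2FF0..U+2FFB.
IDCS = {chr(cp) for cp in range(0x2FF0, 0x2FFC)}

# Hypothetical sample entries (one level of decomposition per character).
DECOMPOSITION = {
    "明": "⿰日月",
    "萌": "⿱艹明",
    "好": "⿰女子",
    "字": "⿱宀子",
}


def direct_components(ch, table=DECOMPOSITION):
    """Components listed in ch's own entry, with the structure symbols skipped."""
    return [c for c in table.get(ch, "") if c not in IDCS]


def downward_search(ch, table=DECOMPOSITION):
    """All components of ch at every level, in depth-first order."""
    found = []
    for comp in direct_components(ch, table):
        found.append(comp)
        found.extend(downward_search(comp, table))
    return found


def build_upward_index(table=DECOMPOSITION):
    """Invert the table: component -> characters that list it directly."""
    index = defaultdict(set)
    for ch in table:
        for comp in direct_components(ch, table):
            index[comp].add(ch)
    return index


def upward_search(comp, index):
    """All characters that contain comp, directly or through another character."""
    result = set()
    stack = [comp]
    while stack:
        for ch in index.get(stack.pop(), ()):
            if ch not in result:
                result.add(ch)
                stack.append(ch)
    return result


index = build_upward_index()
print(downward_search("萌"))              # ['艹', '明', '日', '月']
print(sorted(upward_search("日", index)))  # ['明', '萌']
```

Keeping only one level per entry keeps the table small, while the recursion in `downward_search` and the inverted index in `upward_search` recover the deeper relationships on demand.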
Conclusion
• IDCs were introduced in Unicode 3.0
• Their use is going beyond the original objective
• We have already created an application using these symbols in the Hong Kong Glyph Specification, which is due out this year
• IDCs should also be useful in ideograph variant specifications
Appendix
• Components in Big5 and HKSCS not yet in Unicode