Dynamic Glyph Generation
Based on a variable-length encoding schema
Yap Cheah Shen, eForth Technology
Glyph & Typesetting Workshop, Kyoto, 29 Nov 2003
Outline of Presentation
• Morpheme: Latin vs. Han
• Latin text encoding
• Missing characters in Chinese text
• Solution
• Implementation details
  • Glyph decomposition database
  • Topological conversion of strokes
  • Automatic frame calculation
• Integrating into existing OS
• Other issues
Morpheme: Latin vs. Han
• A morpheme is the smallest meaningful unit in a language.
• For Latin text, it is the word.
• For Chinese text, it is the Hanzi or Kanji.
• Because morphemes represent real-world ideas, they keep changing over time.
• Morphemes form an open set.
Latin Text Encoding
• The alphabet forms a fixed set of symbols.
• All words can be represented as sequences of letters.
• Letters are the ideal encoding units for Latin text, e.g., ASCII.
• There is no “missing word” encoding problem.
Missing Characters in Chinese Text
• Not all existing Hanzi are encoded.
• Hanzi form an open set: theoretically, historically, and practically.
• Existing encoding schemas rest on wrong assumptions and designs.
• The result is an unending loop of code point assignment, OS updates, new fonts, and new input method tables.
• The industry is happy (users suffer).
Solution-1
• Parts or components as the encoding unit: 日 月 金 木 水 火 土 人 心 手 口 女 艹 疒 犭
• Most characters can be represented by a finite set of basic parts.
• Strokes are used to construct rarely used parts (thousands of parts appear only once or twice).
Solution-2
• A closed set of basic parts and strokes as the encoding units.
• 3 joining operators: horizontal, vertical, and enclosing.
• 1 shielding operator: for hiding a stroke.
• Prefix notation: allows recursive composition.
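A minimal sketch of how prefix notation allows recursive composition: each joining operator takes exactly two operands, and an operand may itself be another operator expression. The sketch assumes the three joining operators are written with the Unicode ideographic description characters ⿰, ⿱ and ⿴; the names `Join` and `parse` are hypothetical, and the shielding operator is left out.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# Assumption: the joining operators are the Unicode ideographic
# description characters for left-right, top-bottom and full enclosure.
JOIN_OPS = {
    "\u2FF0": "horizontal",  # ⿰
    "\u2FF1": "vertical",    # ⿱
    "\u2FF4": "enclosing",   # ⿴
}


@dataclass
class Join:
    """One joining operator applied to two operands (hypothetical type)."""
    kind: str
    first: "Node"
    second: "Node"


Node = Union[str, Join]


def _parse_one(s: str) -> Tuple[Node, str]:
    """Parse one operand from the front of s; return it and the rest."""
    if not s:
        raise ValueError("expression ended before all operands were supplied")
    head, rest = s[0], s[1:]
    if head in JOIN_OPS:
        # Prefix notation: the operator comes first, then its two operands,
        # each of which may recursively be another operator expression.
        first, rest = _parse_one(rest)
        second, rest = _parse_one(rest)
        return Join(JOIN_OPS[head], first, second), rest
    return head, rest  # a plain part or character


def parse(expr: str) -> Node:
    """Parse a complete glyph expression into a composition tree."""
    node, rest = _parse_one(expr)
    if rest:
        raise ValueError(f"trailing input: {rest!r}")
    return node


# 想 is 相 above 心, and 相 is 木 beside 目.
print(parse("⿱相心"))
print(parse("⿱⿰木目心"))
```

Both expressions describe the same character at different depths, which is what lets a renderer fall back to smaller parts when a larger part has no stored outline.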
Solution-3
• Ordinary CJK fixed-length encoding schema: a numeric value serves as the character code.
• Input method table
  • Converts input keystrokes to character codes.
• Static font file
  • Glyph data is pre-designed.
  • Glyph data is accessed by character code.
• Text file
  • A sequence of character codes.
Solution-4
• Additional features of a variable-length encoding CJK environment.
• Input
  • Characters can be sorted and filtered by parts.
  • Compatible with any existing input method.
• Display
  • The font file stores commonly used characters and parts.
  • Glyphs are generated on the fly from glyph descriptive sequences.
• Storage and data exchange
  • Compatible with Unicode.
  • Ideographic description sequences.
Dynamic Glyph Generator
• Input:
  • Various types of variable-length descriptive character code sequences.
  • 構字式 of Academia Sinica
  • 組字式 of CBETA
  • Unicode ideographic description characters
• Output: display & print
  • TrueType-compatible outlines
  • Rasterized bitmaps
  • Macromedia Flash, SVG
• The task: a layout problem, fitting a one-dimensional sequence into a two-dimensional square.
Implementation-1
The system consists of 3 major parts:
• Glyph decomposition database
  • Courtesy of Prof. Hsieh of Academia Sinica, Taiwan. http://www.sinica.edu.tw/~cdp/
• Outlines of strokes and components
  • Beijing ZhongYi Co., a professional outline font vendor. http://www.zhongyicts.com.cn/
• The eForth system: puts everything together; hardware-software co-engineering.
Implementation-2
• Glyph decomposition database
  • All CJK glyphs defined by Unicode 4.0, 71000+ in total.
  • 549 basic parts; stroke sequences are preserved.
  • 3996 parts in total.
  • Total part frequency: 165122
  • Accumulated frequency:
    • Top 50: 51389 = 31%
    • Top 200: 87381 = 53%
    • Top 1000: 129393 = 78%
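The accumulated-frequency percentages follow directly from the counts on this slide: each is the cumulative number of occurrences divided by the total part frequency. A short check, using only the slide's figures (the names `CUMULATIVE` and `coverage` are hypothetical):

```python
# Figures taken from the slide above.
TOTAL_PART_OCCURRENCES = 165122
CUMULATIVE = {50: 51389, 200: 87381, 1000: 129393}


def coverage(top_n: int) -> float:
    """Share of all part occurrences covered by the top_n most frequent parts."""
    return CUMULATIVE[top_n] / TOTAL_PART_OCCURRENCES


for n in CUMULATIVE:
    print(f"top {n}: {coverage(n):.1%}")
# top 50: 31.1%
# top 200: 52.9%
# top 1000: 78.4%
```

The steep curve is the practical argument for storing outlines only for commonly used parts: a quarter of the 3996 parts accounts for more than three quarters of all occurrences.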
Implementation-3
• Strokes are described as an outline with a skeletal line.
• Both the outline and the skeletal line are quadratic Bézier curves.
• Outline points are recalculated according to the scaled skeletal line.
• Result:
  • Stroke data is highly reusable.
  • Stroke weights are adjustable.
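One way to picture this, as a sketch only: scale the skeletal line to the target frame, then rebuild the outline around it, so that stroke thickness is controlled separately from the frame scaling. The data layout below (one offset vector per skeleton control point, mirrored to give the two sides of the outline) and the names `quad_bezier` and `rescale_stroke` are assumptions made for illustration; the slides do not specify the actual stroke format.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def quad_bezier(p0: Point, p1: Point, p2: Point, t: float) -> Point:
    """Evaluate a quadratic Bézier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    x = u * u * p0[0] + 2 * u * t * p1[0] + t * t * p2[0]
    y = u * u * p0[1] + 2 * u * t * p1[1] + t * t * p2[1]
    return (x, y)


def rescale_stroke(
    skeleton: List[Point],
    offsets: List[Point],
    sx: float,
    sy: float,
    weight: float = 1.0,
) -> Tuple[List[Point], List[Point]]:
    """Scale the skeleton by (sx, sy) and rebuild both outline sides.

    Assumed layout: offsets[i] is the vector from skeleton control point i
    to the matching outline control point on one side; the other side is
    its mirror image. Offsets are scaled by `weight` only, so squeezing
    the frame does not thin the stroke.
    """
    left, right = [], []
    for (x, y), (dx, dy) in zip(skeleton, offsets):
        cx, cy = x * sx, y * sy  # scaled skeleton control point
        left.append((cx + weight * dx, cy + weight * dy))
        right.append((cx - weight * dx, cy - weight * dy))
    return left, right


# A horizontal stroke squeezed to half width while its weight is doubled.
skeleton = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
offsets = [(0.0, 1.0)] * 3
left, right = rescale_stroke(skeleton, offsets, sx=0.5, sy=1.0, weight=2.0)
print(left, right)
print(quad_bezier(*left, 0.5))  # midpoint of one outline curve
```

Because the same skeleton and offsets serve every frame size and weight, one stored stroke can be reused across many parts.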
Implementation-4
• Automatic frame calculation
  • An algorithm estimates the complexity of each part to decide its proportion in the resulting glyph.
  • 漁: 氵 25%, 魚 70%, roughly.
  • 觀: 雚 55%, 見 40%, roughly.
• Result:
  • Clear glyph descriptive expressions
  • Search engine friendly
  • Human readable
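A rough sketch of complexity-proportional frame splitting. The slides do not say how complexity is estimated, so the example simply uses stroke counts as a stand-in, and it reserves a small gap between parts (an assumption, suggested only by the fact that the example percentages above add up to 95%). The function name `split_frame` and its parameters are hypothetical, and the output is not meant to reproduce the slide's figures.

```python
from typing import List, Sequence, Tuple

Frame = Tuple[float, float, float, float]  # x, y, width, height


def split_frame(
    frame: Frame,
    complexities: Sequence[float],
    direction: str,
    gap: float = 0.05,
) -> List[Frame]:
    """Divide a frame among parts in proportion to their complexity.

    direction is "horizontal" (parts side by side) or "vertical"
    (parts stacked). `gap` is the fraction of the extent left empty
    between parts; it is an assumed parameter, not taken from the slides.
    """
    if not complexities:
        raise ValueError("at least one part is required")
    if direction not in ("horizontal", "vertical"):
        raise ValueError(f"unknown direction: {direction!r}")
    x, y, w, h = frame
    n = len(complexities)
    if n == 1:
        gap = 0.0  # nothing to separate
    extent = w if direction == "horizontal" else h
    usable = extent * (1.0 - gap)
    spacing = extent * gap / (n - 1) if n > 1 else 0.0
    total = sum(complexities)

    frames: List[Frame] = []
    offset = 0.0
    for c in complexities:
        size = usable * c / total  # share proportional to complexity
        if direction == "horizontal":
            frames.append((x + offset, y, size, h))
        else:
            frames.append((x, y + offset, w, size))
        offset += size + spacing
    return frames


# 漁 = 氵 beside 魚, using stroke counts (3 and 11) as the complexity stand-in.
for part, f in zip("氵魚", split_frame((0, 0, 100, 100), [3, 11], "horizontal")):
    print(part, f)
```

Applied recursively to each sub-frame, the same split turns a one-dimensional prefix expression into nested rectangles inside the glyph square.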
Integrating into existing OS/GUI
• String manipulation library
  • Number of characters
    • -1 for operators, +1 for characters
  • Character width
• Graphics sub-system
  • Drawing a text line (e.g. ExtTextOut)
• Text handling widgets
  • Awareness of glyph expressions for caret, selection, and delete/backspace.
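The counting rule can be read as a running balance: add 1 for every ordinary character, subtract 1 for every joining operator, and a glyph expression is complete whenever the balance reaches 1. This works because each operator takes exactly two operands. The sketch below applies that reading to split a string into displayed glyphs; it assumes the operators are the Unicode characters ⿰, ⿱ and ⿴, ignores the shielding operator, and uses the hypothetical names `split_glyphs` and `glyph_count`.

```python
from typing import List

# Assumption: joining operators are the Unicode ideographic description
# characters for left-right, top-bottom and full enclosure.
JOIN_OPS = frozenset("\u2FF0\u2FF1\u2FF4")


def split_glyphs(text: str) -> List[str]:
    """Split text into units that each display as one glyph."""
    glyphs: List[str] = []
    start = 0
    balance = 0
    for i, ch in enumerate(text):
        balance += -1 if ch in JOIN_OPS else 1
        if balance == 1:
            # One complete glyph: either a plain character or a full
            # operator expression with all operands supplied.
            glyphs.append(text[start:i + 1])
            start = i + 1
            balance = 0
    if start != len(text):
        raise ValueError("text ends inside an incomplete glyph expression")
    return glyphs


def glyph_count(text: str) -> int:
    """Number of displayed characters, as opposed to stored code points."""
    return len(split_glyphs(text))


text = "我⿱⿰木目心你"
print(len(text), glyph_count(text), split_glyphs(text))
```

A caret, selection, or backspace handler would step over the units returned by `split_glyphs` rather than over individual code points, so an expression is never cut in half.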
Other Issues
• Quality of the glyph
  • Trade-off with space: the more part outlines are stored, the better the quality.
• Speed of generation
  • No problem for an IBM PC; glyph generation is rare.
  • For handheld devices, hardware acceleration is recommended.
Examples
⿱ vertical combination, ⿰ horizontal combination, ⿴ enclosing, - hide
• 盟 = ⿱明皿 or ⿱⿰日月皿
• 李世民: 民-5 (hide the 5th stroke)
• 玄燁: 玄-5
• 丘-4 = U+20009