Multilingual Editing using RichEdit 4+ Hon Wah Chan Murray Sargent III Microsoft Corporation Text Services Group, Word
Introduction • RichEdit is a text engine with a hierarchy of presentation formats • Features such as automatic choice of fonts, rich text, 2D text objects • Handling nonUnicode documents in Unicode text engines • Describe interfaces and component usage • Ways to input Unicode text using IMEs, speech • Demo
What’s RichEdit? • RichEdit 4.x is a set of plain/rich-text, single/multiline, Unicode/ANSI edit controls and combo/listboxes in a single worldwide binary • Multilevel undo, message & COM interfaces, Word compatibility, pretty rich text • Outline view, zoom, font binding, latest in IME support, and rich complex-script support (BiDi, Indic, and Thai)
Clients include • Outlook mail note, post-it • Most Office dialogs • All OSes since Win98 • WordPad, Charmap • Darwin installer • WebCalc • Project • Visual Studio, DaVinci • Publisher • FrontPage • Handheld PC PocketWord • eBooks • OE (for mail header) • Borland’s Delphi • SQL Server dev tools, RAID • MSN Companion chat • Via Win2k Wrapper – cc:mail, WebEditPro, Eudora, Encarta, Money (US), Sibelius, Borland TRichEdit class, apps created with VB, MFC…
Some Fancier Features • Features added for ebooks: pagination, hyphenation, kerning, ClearType support, text wrap around embedded objects • Multilevel tables • Autocorrect • AutoURL detection (improved from 3.0)
2D Text Objects • RichEdit 4.5 (in development) supports WYSIWYG editing of many 2D objects • Ruby, Tatenakayoko, Warichu, Kumimoji • Math: fractions, autosizing brackets, boxes, matrices, integrals • Demo will show some of these features
Backward Compatibility • Unicode text engines need to import/export text in other character sets • Given nonUnicode plain text, which codepage should one use to convert to/from Unicode? • On localized systems, system code page is a good bet • In multilingual text, you can enter text using keyboards in a variety of languages that need either Unicode or multiple code pages • For searching text, best choice seems to be to use the current keyboard code page • If text begins with a BOM, it’s Unicode • If text begins with a rich-text header, e.g., “{\rtf” or “<html>”, use appropriate conversion routine
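The decision order on this slide can be sketched in a few lines of Python. This is only an illustration of the heuristics listed above, not RichEdit's actual code: the function names `classify` and `to_unicode` are hypothetical, and `cp1252` stands in for whatever system or keyboard code page the caller would supply.

```python
import codecs

# BOM signatures mapped to Python codecs that consume the BOM while decoding.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def classify(data: bytes, fallback_codepage: str = "cp1252"):
    """Return (kind, encoding); kind is 'unicode', 'rtf', 'html' or 'codepage'."""
    # 1. A BOM means the text is Unicode.
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return "unicode", encoding
    # 2. A rich-text header selects a dedicated conversion routine.
    head = data[:16].lstrip().lower()
    if head.startswith(b"{\\rtf"):
        return "rtf", None
    if head.startswith(b"<html"):
        return "html", None
    # 3. Otherwise fall back to a code page chosen by the caller
    #    (assumption: system or current keyboard code page).
    return "codepage", fallback_codepage


def to_unicode(data: bytes, fallback_codepage: str = "cp1252") -> str:
    """Decode plain text; rich-text formats are left to their own converters."""
    kind, encoding = classify(data, fallback_codepage)
    if kind in ("rtf", "html"):
        raise ValueError(kind + " input needs its own conversion routine")
    return data.decode(encoding, errors="replace")
```

For example, `to_unicode(b"caf\xe9")` decodes with the fallback code page, while bytes that start with a UTF-16 BOM ignore the fallback entirely.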
Backward Compatibility (cont) • Need a little rich-text functionality to display Unicode plain text unambiguously in some CJK scenarios • This functionality handles font choices and language-dependent glyph variants • When a user types in text using a keyboard charset, edit engine knows charset and therefore can insert accurate Unicode text including which CJK glyph variant to use • Client gets text as pure ANSI (or Unicode) text without script clues • Would be handy to have script tags
Complex Scripts • Unicode covers many complex scripts, e.g., Arabic, Indic, Thai, ancient Korean • Complex scripts require a layout engine that translates character codes to glyph indices (often referencing ligatures) • RichEdit uses Uniscribe and the MS line-layout component for complex scripts
Font Binding • Most Unicode characters belong to scripts • Associate with each position in a document a “font bundle” • When inserting characters, assign each one to a script • For CJK, check surrounding characters for Kana and Hangul as clues to use Japanese or Korean fonts instead of Chinese • Assign scripts to neutrals and digits • The keyboard language, especially with IMEs, provides strong binding clues • Format inserted characters with fonts assigned to scripts. Check current font to see if it supports required script • RichEdit 4.0 has 50 scripts for Unicode 3.1. Client can specify what default font to use for a given script.
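A rough sketch of the binding steps above, under heavy simplification. The script ranges are coarse, the whole string (rather than nearby characters) is scanned for Kana and Hangul clues, neutrals simply inherit the script of the preceding character, and the check of whether the current font already supports the script is omitted. The font table and all function names are assumptions made for illustration, not RichEdit's tables.

```python
from itertools import groupby

# Assumed per-script default fonts; a client can override any entry.
DEFAULT_FONTS = {
    "latin": "Times New Roman",
    "japanese": "MS Mincho",
    "korean": "Batang",
    "chinese": "SimSun",
    "arabic": "Traditional Arabic",
    "thai": "Tahoma",
}


def base_script(ch):
    """Coarse script lookup; returns None for neutrals and digits."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x30FF:
        return "kana"
    if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF:
        return "hangul"
    if 0x4E00 <= cp <= 0x9FFF:
        return "han"
    if 0x0600 <= cp <= 0x06FF:
        return "arabic"
    if 0x0E00 <= cp <= 0x0E7F:
        return "thai"
    if ch.isalpha() and cp < 0x0250:
        return "latin"
    return None


def assign_scripts(text):
    """Give every character a script, resolving Han and neutral characters."""
    raw = [base_script(ch) for ch in text]
    # Kana or Hangul in the text is a clue for how to bind Han ideographs.
    han_as = "japanese" if "kana" in raw else "korean" if "hangul" in raw else "chinese"
    rename = {"han": han_as, "kana": "japanese", "hangul": "korean"}
    scripts = [rename.get(s, s) for s in raw]
    prev = None
    for i, s in enumerate(scripts):          # neutrals take the preceding script
        if s is None:
            scripts[i] = prev
        else:
            prev = s
    nxt = None
    for i in range(len(scripts) - 1, -1, -1):  # leading neutrals take the next one
        if scripts[i] is None:
            scripts[i] = nxt
        else:
            nxt = scripts[i]
    return [s or "latin" for s in scripts]


def bind_fonts(text, client_fonts=None):
    """Split text into (script, font, substring) runs."""
    table = {**DEFAULT_FONTS, **(client_fonts or {})}
    tagged = zip(text, assign_scripts(text))
    return [
        (script, table.get(script, table["latin"]), "".join(ch for ch, _ in group))
        for script, group in groupby(tagged, key=lambda pair: pair[1])
    ]
```

With these assumptions, `bind_fonts("abc 漢字かな 123")` yields a Latin run followed by a single Japanese run, because the Kana characters steer the ideographs toward a Japanese font and the trailing digits inherit that script.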
Language Detection & Font Binding • Korean and Japanese are often easy to spot because of Hangul and Kana characters, respectively • For CJK can convert back to codepage and see if errors occur (Ken Lunde’s suggestion) • For proofing purposes, accurate language identification is needed. For font binding, script identification is usually sufficient • Typically more than one language corresponds to a script, e.g., Latin script. Essentially only one uses the Korean script • Natural language processing techniques allow good language identification if more than a few words are involved, e.g., a sentence
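The round-trip idea credited to Ken Lunde can be tried with Python's built-in codecs. This is a minimal sketch: the choice of one legacy codec per language is an assumption, and `candidate_languages` is a hypothetical helper name. Text that encodes without error in a code page is a candidate for that language.

```python
# One legacy code page per CJK language (an assumed, simplified mapping).
CJK_CODEPAGES = [
    ("japanese", "shift_jis"),
    ("korean", "euc_kr"),
    ("simplified chinese", "gb2312"),
    ("traditional chinese", "big5"),
]


def encodable(text: str, codec: str) -> bool:
    """True if every character of text exists in the given code page."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def candidate_languages(text: str):
    """Languages whose code page can represent the whole text."""
    return [lang for lang, codec in CJK_CODEPAGES if encodable(text, codec)]
```

The result is a list of candidates rather than a single answer. Several of these code pages share characters (Kana, for instance, appears in more than one), so the test narrows the choice and works best combined with the Kana and Hangul clues above.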
Font Sizing • In dialogs, 8-pt Latin characters are commonly used • 8-pt Chinese characters are hard to read, so better to use 9 points in combination with 8-pt Latin characters • Latin characters have bigger descenders than Chinese characters, since latter only need room for underline • Combining 8-pt Latin characters with 9-point Chinese characters and keeping same baseline increases line height to 9 pts plus extra height for Latin descender • Result is more like 10 points: shifts text too high in dialog box originally designed to handle one language
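The arithmetic behind the last two bullets can be shown with a small calculation. The ascent and descent values below are illustrative assumptions chosen only to show the mechanism, not measurements of any real font: on a shared baseline, line height is the tallest ascent plus the deepest descent.

```python
from dataclasses import dataclass


@dataclass
class FontMetrics:
    ascent: float   # points above the baseline
    descent: float  # points below the baseline


def line_height(runs):
    """Height of a line whose runs all sit on the same baseline."""
    return max(r.ascent for r in runs) + max(r.descent for r in runs)


# Assumed metrics, for illustration only.
LATIN_8PT = FontMetrics(ascent=6.2, descent=1.8)    # deeper descender
CHINESE_9PT = FontMetrics(ascent=8.0, descent=1.0)  # room for underline only

mixed = line_height([LATIN_8PT, CHINESE_9PT])  # Chinese ascent + Latin descent
```

With these numbers each font alone needs 8 or 9 points, but the mixed line needs close to 10, because the Chinese ascent and the Latin descent both count.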
Unicode Surrogate Pairs • Using 2 16-bit surrogates to represent a single character complicates more than measurement and display of characters: • Arrow-key handlers and other methods that change character position must avoid ending up in between lead and trail surrogates • Input methods need to map to surrogate pair • Case changes, line-breaking rules, sorting, file formats, and backing-store manipulations in general have to recognize and deal with pairs • Surrogate code ranges make them easy to work with relative to multibyte encoding systems
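The arrow-key rule can be sketched over a UTF-16 backing store, modeled here as a list of 16-bit code units. The function names are hypothetical. The point of the sketch is that the fixed lead and trail ranges make the check a pair of simple comparisons.

```python
import struct


def utf16_units(text: str):
    """Return text as a list of 16-bit UTF-16 code units."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def is_lead(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def is_trail(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def next_caret(units, pos: int) -> int:
    """Move right one character, never stopping between lead and trail."""
    if pos >= len(units):
        return len(units)
    pos += 1
    if pos < len(units) and is_trail(units[pos]) and is_lead(units[pos - 1]):
        pos += 1  # step over the trail surrogate as well
    return pos


def prev_caret(units, pos: int) -> int:
    """Move left one character, never stopping between lead and trail."""
    if pos <= 0:
        return 0
    pos -= 1
    if pos > 0 and is_trail(units[pos]) and is_lead(units[pos - 1]):
        pos -= 1  # step back over the lead surrogate as well
    return pos
```

Unlike multibyte code pages, where a byte's role can depend on what precedes it, a surrogate's role is known from its value alone, so the handlers never need to rescan from the start of the line.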
Nonspacing Combining Marks • Multicode characters (surrogate pairs, CRLFs, combining-mark and variant-tag sequences) require special display/navigation handling • Render combining-mark sequences using standard system calls and fonts that support combining marks. Better display needs a layout engine that talks to OpenType • Simple caret movement across combining-mark sequences prevents stopping inside a sequence. Backspace key deletes one mark at a time • Mouse-cursor hit testing leaves selection at beginning/end of combining-mark sequence (more elegant model allows selection and editing of individual marks) • Cool thing: if you can navigate past CRLF combinations, you can modify corresponding code to handle surrogate pairs and combining-mark sequences quite easily
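A minimal sketch of the navigation model described above, written over a Python string (a sequence of code points, so surrogate pairs do not arise here). The same handler that skips CRLF as a unit is extended to skip combining marks. Function names are hypothetical, and treating Backspace over CRLF as a single deletion is an assumption.

```python
import unicodedata


def move_right(text: str, pos: int) -> int:
    """Advance the caret past one CRLF or one base character plus its marks."""
    if pos >= len(text):
        return len(text)
    if text.startswith("\r\n", pos):
        return pos + 2
    pos += 1
    while pos < len(text) and unicodedata.combining(text[pos]):
        pos += 1
    return pos


def move_left(text: str, pos: int) -> int:
    """Move the caret back to the start of the previous CRLF or mark sequence."""
    if pos <= 0:
        return 0
    if pos >= 2 and text[pos - 2:pos] == "\r\n":
        return pos - 2
    pos -= 1
    while pos > 0 and unicodedata.combining(text[pos]):
        pos -= 1
    return pos


def backspace(text: str, pos: int):
    """Delete one mark (or one CRLF) before the caret; return (text, caret)."""
    if pos <= 0:
        return text, 0
    count = 2 if pos >= 2 and text[pos - 2:pos] == "\r\n" else 1
    return text[:pos - count] + text[pos:], pos - count
```

Arrow keys treat "e" plus two marks as one stop, while Backspace peels the marks off one at a time, matching the asymmetry on the slide.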
Interfaces • Messages and keyboard • File read/write (plain text or RTF) • TOM (Text Object Model) • ITextServices/ITextHost interfaces
RichEdit Message Interface • System messages • keyboard messages • mouse messages • clipboard messages • Edit messages – RichEdit supports all but four of the system edit messages • RichEdit messages • Character/paragraph formatting • Text input/query • Notification
File Formats • Plain text can be saved/read encoded in any codepage, including Unicode and UTF-8 • RTF is the principal rich-text format • UTF-8 RTF is used preferentially for cut/copy/paste. Can be used in stream operations • Copying text to/from Word can be a handy way to get desired formatting into a RichEdit instance • HTML is available via system converters
TOM (Text Object Model) • A set of COM dual interfaces that allow Unicode rich/plain text to be manipulated by VB, C/C++, and Java clients. • Access for spelling/grammar checkers • Accessibility • Powerful and efficient text processing primitives. Embedded scripts
TOM (cont) • ITextDocument: top-level editing object • ITextStoryRanges: enumerator for stories in document • ITextRange: primary text interface, a range of text • ITextFont: character-attribute interface • ITextPara: paragraph-attribute interface • ITextTag: HTML tag interface • ITextAttributes: tag-attribute enumerator • ITextSelection: screen-highlighted text range; the selection inherits all ITextRange methods
ITextServices/ITextHost Interfaces • Windowless interfaces that go beyond message interface • In-place active state – use window of the container • Fewer system resources • Faster activation and deactivation
Other Components Used • Uniscribe • MS line-layout component • Windows Text Services Framework • Callbacks for access to word-break, autocorrect, hyphenation, and ClearType libraries
Input methods • Support for the latest IMEs • Speech and handwriting input (Windows Text Services Framework) • Alt-x Unicode input method • Standard hot keys
IMEs • Support Level 2 and Level 3 IMEs • Support Active Input Method Manager (AIMM) • Reconversion - user can convert final string back to composition mode, allowing easy selection of a different candidate string. • Document feed - provides IME with text for current paragraph to increase conversion accuracy during typing. • Mouse Operation - gives user better control over candidate and UI windows • Caret position - gets current caret and line info, which IME98 uses to position UI windows (e.g., candidate list).
Windows Text Services Framework • Provides Far East input support, across the language versions of Win32 platforms, to applications that are aware of the framework • Provides consistent UI for different input methods: speech, handwriting, IME • Coordinated input • Data persistence for dynamic text editing • RichEdit supports both the native mode and Active Input Method Manager (AIMM) mode
Hex to Unicode Input Method • Type Unicode character hexadecimal code • Make corrections as need be • Type Alt+x to convert to character • Type Alt+x to convert back to hex (useful especially for “missing glyph” character) • Resolve ambiguities by selection • Input higher-plane chars using 5 or 6-digit code • MS Word 2002 standard
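The toggle described above can be sketched as a pure function on a string and a caret position. This is an illustrative approximation, not the Word or RichEdit implementation: `alt_x` and its parameters are hypothetical, the run of hex digits before the caret is capped at six, and converting back always emits at least four uppercase digits.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def is_scalar(cp: int) -> bool:
    """True for a usable code point: nonzero, in range, not a surrogate."""
    return 0 < cp <= 0x10FFFF and not 0xD800 <= cp <= 0xDFFF


def alt_x(text: str, caret: int, sel_start=None):
    """Toggle hex code <-> character before the caret; return (text, caret)."""
    if sel_start is not None:
        start = sel_start  # an explicit selection resolves ambiguity
    else:
        start = caret
        while start > 0 and caret - start < 6 and text[start - 1] in HEX_DIGITS:
            start -= 1
        # Drop leading digits until the remainder is a valid code point.
        while start < caret and not is_scalar(int(text[start:caret], 16)):
            start += 1
    run = text[start:caret]
    if run and all(c in HEX_DIGITS for c in run) and is_scalar(int(run, 16)):
        return text[:start] + chr(int(run, 16)) + text[caret:], start + 1
    if caret == 0:
        return text, caret
    code = format(ord(text[caret - 1]), "04X")  # character back to hex
    return text[:caret - 1] + code + text[caret:], caret - 1 + len(code)
```

The selection parameter illustrates the "resolve ambiguities by selection" bullet: in `abc20AC` the letter `c` is itself a hex digit, so without a selection the sketch reads `c20AC`, whereas selecting just `20AC` (`sel_start=3`) produces the euro sign.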
Unicode combobox/listbox • Emulates the system combobox and listbox • Unicode support on all Win32 platforms • Allows mixed languages between items • Modified EM_SETTEXTEX for inserting items • Used in Office applications
Conclusions • Have described RichEdit, an engine for text display and editing with a hierarchy of presentation formats • Automatic choice of fonts for Unicode plain text including surrogate-pair characters, combining mark sequences • Handling nonUnicode documents in Unicode text engines • Described interfaces and component usage • Ways to input Unicode text using IMEs, speech • Clients include many Office and Windows apps • Able to display 2D Text Objects such as Ruby and Warichu