
Migration of a 4GL and Relational Database to Unicode

Tex Texin, Director, International Products. Presentation goals: outline migration steps; describe design considerations; leverage the existing double-byte implementation; describe the impact on the 4GL and report formats.



  1. Migration of a 4GL and Relational Database to Unicode • Tex Texin, Director, International Products

  2. Presentation Goals • Outline Migration Steps • Describe Design Considerations • Leverage Existing Double-byte Implementation • Describe Impact on 4GL and Report Formats

  3. PROGRESS Application Development Suite • Powerful tools for the rapid creation of distributed business applications • Creates character, GUI, or web-based clients with common source • Host-based, client-server, or n-tier distribution on variety of platforms • Scalable, open, and robust RDBMS • International, double-byte enabled

  4. Possible Configuration Options [Diagram; labels: GUI Client, Client-Server, Web-based Client, Database Server, Progress Database, Host-based Character Client, Optional n-tier Application Server, Other Database]

  5. Why do our customers need Unicode? • Many do not. However: • Multinationals deploy across regions with incompatible character sets, yet they must share data between them. • Programs are distributed worldwide with one container of text in many languages. • Certain applications require multilingual databases, e.g. translation systems and web-based applications.

  6. The Existing Architecture • 1.5M lines of C code • 0.3M lines of 4GL code • Double-byte enabled • CJK, 9 double-byte charsets supported • 2-byte only, no 3 or 4-byte • No shift-sequenced charsets • DBE changes earmarked, easy to find • 4 years, 3 developers, 2 QA

  7. Estimated cost of implementing UCS-2 was very big! • Changing to 16-bit text units affects almost every source module • Largest cost is separating char variables based on usage for text or binary data • Use 16-bit null terminators, ignore 8-bit: “A” → 0041, 0000; “Ā” → 0100, 0000 • Pointer arithmetic (advance 2 bytes) • Sizing (bytes or characters) • New API to use new WIDE TEXT datatype
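The null-terminator point can be illustrated with a small Python sketch (not part of the presentation). It encodes the code units 0041 and 0100 listed on the slide as big-endian 16-bit units and runs them through a byte-oriented length routine. The helper name `c_strlen` is hypothetical and only mimics how an 8-bit C `strlen` would behave.

```python
def c_strlen(buf: bytes) -> int:
    """Mimic a byte-oriented C strlen: count bytes before the first 0x00."""
    idx = buf.find(b"\x00")
    return len(buf) if idx < 0 else idx

# UCS-2 (big-endian here) stores each BMP character as one 16-bit unit,
# followed by a 16-bit terminator.
a_ucs2 = "A".encode("utf-16-be") + b"\x00\x00"        # bytes 00 41 00 00
u0100_ucs2 = "\u0100".encode("utf-16-be") + b"\x00\x00"  # bytes 01 00 00 00

# UTF-8 never produces a zero byte for a non-NUL character,
# so the existing 8-bit terminator logic keeps working.
a_utf8 = "A".encode("utf-8") + b"\x00"                # bytes 41 00
u0100_utf8 = "\u0100".encode("utf-8") + b"\x00"       # bytes C4 80 00

for name, buf in [("A ucs2", a_ucs2), ("U+0100 ucs2", u0100_ucs2),
                  ("A utf8", a_utf8), ("U+0100 utf8", u0100_utf8)]:
    print(name, buf.hex(" "), "byte-oriented length:", c_strlen(buf))
```

The 8-bit routine stops inside the first 16-bit unit in both UCS-2 cases, which is why every string-handling module would need to change for UCS-2, while the UTF-8 buffers are measured correctly in bytes.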

  8. Product requirements for a multilingual version • Minimize cost for application migration • Minimize cost for application upgrade • Minimize support cost • One executable! • Maintain user-definable character sets • Add UTF-8 as just another character set • UTF-8 algorithms are compatible with other charsets

  9. Scaled-down multilingual proposal: UTF-8 implementation • Implement UTF-8 as 3-byte character set • Leverage & extend double-byte enabling • Places to change are already earmarked • Restrict to composed characters for now • Restrict to no surrogates • Supports all the markets we are in • UTF-8-enable 4GL and RDBMS first • Provides multilingual logic and storage • Java + other client technologies coming
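Why "no surrogates" lets UTF-8 be treated as an at-most-3-byte character set can be checked with a short Python sketch (an illustration, not the product's code). Characters that need surrogate pairs in UTF-16 are exactly those at U+10000 and above, and those are the only ones that take 4 bytes in UTF-8. The function name `utf8_size` is hypothetical.

```python
def utf8_size(code_point: int) -> int:
    """Number of bytes UTF-8 uses for one code point (surrogates not handled)."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3  # everything else in the Basic Multilingual Plane
    return 4      # supplementary planes: outside the scaled-down proposal

# Cross-check the size rule against Python's own encoder.
for cp in (0x41, 0xE9, 0x4E2D, 0xFFFF, 0x10000):
    assert utf8_size(cp) == len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X}: {utf8_size(cp)} byte(s)")
```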

  10. Architecture changes: UTF-8-enabling the string library • N-byte enable character + string functions • GetNextChar, GetPreviousChar • GetCharacterSize (table-based) • Modified IsFirstByte • New GetColumnLength • New datatype normalized “BIG” char • Minor algorithm changes for efficiency • Find Character
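A minimal Python sketch of what N-byte-enabled navigation functions might look like follows. The slide gives only the function names, so the signatures, the table layout, and the error handling below are assumptions. The table covers 1- to 3-byte sequences only, in line with the 3-byte restriction stated earlier in the deck.

```python
# Hypothetical 256-entry size table indexed by lead byte.
# 0 marks bytes that cannot start a character in this restricted sketch.
SIZE_TABLE = [0] * 256
for b in range(0x00, 0x80):
    SIZE_TABLE[b] = 1          # ASCII
for b in range(0xC2, 0xE0):
    SIZE_TABLE[b] = 2          # 2-byte lead bytes
for b in range(0xE0, 0xF0):
    SIZE_TABLE[b] = 3          # 3-byte lead bytes

def get_character_size(lead_byte: int) -> int:
    """Table-based size lookup for the character starting with lead_byte."""
    return SIZE_TABLE[lead_byte]

def is_first_byte(byte: int) -> bool:
    """True unless the byte is a UTF-8 tail byte (binary 10xxxxxx)."""
    return (byte & 0xC0) != 0x80

def get_next_char(buf: bytes, pos: int) -> int:
    """Return the byte offset of the character following the one at pos."""
    size = get_character_size(buf[pos])
    if size == 0:
        raise ValueError(f"byte 0x{buf[pos]:02X} at offset {pos} is not a lead byte")
    return pos + size

def get_previous_char(buf: bytes, pos: int) -> int:
    """Return the byte offset of the character that ends just before pos."""
    pos -= 1
    while pos > 0 and not is_first_byte(buf[pos]):
        pos -= 1
    return pos

text = "a\u00e9\u4e2d".encode("utf-8")   # 1-byte, 2-byte and 3-byte characters
pos = 0
while pos < len(text):
    nxt = get_next_char(text, pos)
    print(pos, text[pos:nxt].decode("utf-8"))
    pos = nxt
```

One property worth noting: because UTF-8 tail bytes are distinguishable from lead bytes by inspection, stepping backwards needs no scan from the start of the string, which is not generally true of legacy double-byte encodings.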

  11. Architecture changes: UTF-8-enabling character tables • String libraries use character tables • Alphanumeric, Lead-byte, Tail-byte • Upper, lower case (700+ characters) • New property ColumnCount • New table formats • Old architecture presumed 256-byte table • Now organized by range lists and trie • Update table compiler & allow hex entry
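The move from a flat 256-entry table to range lists can be sketched as follows in Python. This is an assumption about the general technique, not the product's table format: the ranges shown are a small illustrative subset, and `COLUMN_RANGES` and `column_count` are hypothetical names.

```python
from bisect import bisect_right

# (first code point, last code point, column count), sorted and non-overlapping.
# Illustrative subset only; a real table would cover far more ranges.
COLUMN_RANGES = [
    (0x0020, 0x007E, 1),   # printable ASCII
    (0x00A0, 0x00FF, 1),   # Latin-1 Supplement
    (0x3040, 0x30FF, 2),   # Hiragana and Katakana
    (0x4E00, 0x9FFF, 2),   # CJK Unified Ideographs
    (0xAC00, 0xD7A3, 2),   # Hangul syllables
]
_RANGE_STARTS = [first for first, _, _ in COLUMN_RANGES]

def column_count(code_point: int, default: int = 1) -> int:
    """Binary-search the range list; fall back to default if no range matches."""
    i = bisect_right(_RANGE_STARTS, code_point) - 1
    if i >= 0 and code_point <= COLUMN_RANGES[i][1]:
        return COLUMN_RANGES[i][2]
    return default

print(column_count(ord("A")), column_count(0x4E2D), column_count(0x3042))
```

A range list keeps the table small when large blocks of characters share a property value, at the cost of a logarithmic lookup instead of a single array index.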

  12. Architecture changes: UTF-8-enabling sorting • How to sort multilingual data? • Binary sort used for double-byte data • With UTF-8, Europe is 2-byte, CJK 3-byte • Solution: • Binary sort on server • Client uses native sort • Bump key length limit for UTF-8 • Next phase will be enhanced sort
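The trade-offs on this slide can be seen in a short Python sketch (sample data chosen for illustration, not taken from the presentation). A binary sort of UTF-8 bytes gives the same order as sorting by code point, which is stable and cheap but is not a language-aware order, and the same text needs more key bytes in UTF-8 than in a legacy encoding.

```python
words = ["zebra", "\u00c4pfel", "\u4e2d\u6587", "apple"]   # zebra, Äpfel, 中文, apple

# Binary sort: compare the raw UTF-8 bytes, as a server-side index might.
binary_sorted = sorted(words, key=lambda w: w.encode("utf-8"))

# Python compares str values by code point, so this is a code point sort.
code_point_sorted = sorted(words)

print(binary_sorted)
print(binary_sorted == code_point_sorted)

# Key length: the same characters occupy more bytes in UTF-8.
german = "\u00c4pfel"
japanese = "\u65e5\u672c"                                   # 日本
print(len(german.encode("latin-1")), len(german.encode("utf-8")))
print(len(japanese.encode("shift_jis")), len(japanese.encode("utf-8")))
```

In the binary order "Äpfel" lands after "zebra", which is one reason a client-side native sort is still needed for presentation.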

  13. Architecture changes: Character conversion algorithms • Existing, user-definable, conversions • Single-byte character set table maps • Double-byte Shift-JIS - EUCJIS algorithm • New table-driven automated conversions • Single-byte to UTF-8, and back • Double-byte to UCS-2 and back • UTF-8 - UCS-2 • Trie for speed and memory optimization • Requires significant QA for data integrity
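A table-driven single-byte to UTF-8 conversion, and the round-trip check that the QA point implies, might look like the Python sketch below. It is an illustration under assumptions: the table is built from Python's ISO 8859-1 codec for convenience, a plain dictionary stands in for the trie mentioned on the slide, and all names are hypothetical.

```python
# Forward table: one UTF-8 byte sequence per single-byte code (ISO 8859-1 here).
TO_UTF8 = [bytes([b]).decode("latin-1").encode("utf-8") for b in range(256)]

# Reverse table: UTF-8 byte sequence back to the single-byte code.
FROM_UTF8 = {seq: b for b, seq in enumerate(TO_UTF8)}

def single_byte_to_utf8(data: bytes) -> bytes:
    """Convert by one table lookup per input byte."""
    return b"".join(TO_UTF8[b] for b in data)

def utf8_to_single_byte(data: bytes) -> bytes:
    """Convert back; fail loudly on characters the target set cannot hold."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        # ISO 8859-1 characters need at most 2 bytes in UTF-8.
        size = 1 if data[pos] < 0x80 else 2
        seq = data[pos:pos + size]
        if seq not in FROM_UTF8:
            raise ValueError(f"no mapping for bytes {seq.hex()} at offset {pos}")
        out.append(FROM_UTF8[seq])
        pos += size
    return bytes(out)

# Data-integrity check: every byte value must survive a round trip.
all_bytes = bytes(range(256))
assert utf8_to_single_byte(single_byte_to_utf8(all_bytes)) == all_bytes
```

Raising an error on unmappable input, rather than substituting a placeholder silently, is a design choice made here to keep data loss visible during testing.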

  14. Architecture changes: Impact on the 4GL user • 4GL is character set independent • Almost all functions are character-based • 3 functions require optional byte-basing • Length, Substring, Overlay • Options: Byte, Character • Add new option: Column • Format (Picture) Phrase “XXXX” has different meaning for UTF-8
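The difference between the Byte, Character, and Column options can be made concrete with a Python sketch. This is not 4GL syntax and not the product's implementation: the `length` function and its `unit` parameter are hypothetical, and the column rule (two columns for East Asian wide and fullwidth characters, one otherwise) is a simplifying assumption.

```python
import unicodedata

def length(text: str, unit: str = "character", encoding: str = "utf-8") -> int:
    """Measure text in bytes, characters, or display columns."""
    if unit == "byte":
        return len(text.encode(encoding))
    if unit == "character":
        # Counts code points, which matches characters for composed text.
        return len(text)
    if unit == "column":
        return sum(
            2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
            for ch in text
        )
    raise ValueError(f"unknown unit: {unit!r}")

sample = "abc\u6771\u4eac"   # "abc" followed by 東京
for unit in ("byte", "character", "column"):
    print(unit, length(sample, unit))
```

The three results differ for the same string, which is why a format phrase such as “XXXX” is ambiguous for UTF-8 data unless the unit it counts is specified.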

  15. Status • Functioning Well • FCS Dec. 11, 1998 • Implemented with very low cost • Performance is good • Testing is most significant cost • Reviewing all character set properties • Evaluating all conversions

  16. Futures • For the Progress International Team: Multilingual Clients; Enhanced Character Folding; Enhanced Sorting • For Progress Customers: Deployment of multilingual databases; Worldwide access to these databases; Worldwide deployment of multi-language applications

  17. Conclusions • Migration can be achieved in phases • Migration through UTF-8 can be low cost • Double-byte applications can migrate easily to UTF-8 • Markets requiring segregated databases can now share one common database and provide worldwide access

  18. Any questions?
