Introduction to UNICODE in ALEPH: Key Concepts and Implementation Tips

Explore the key concepts of Unicode, character sets, encoding types, and the transition from non-Unicode to Unicode systems in ALEPH. Learn about the innovations in character conversion mechanisms and practical implementation tips for utilizing UNICODE effectively. Discover the importance of Unicode in enabling multilingual data input, transparent data transfer between systems, and eliminating the limitations of non-Unicode systems. Gain insights into the Unicode standard, characters allocation areas, encoding schemes (UTF-16, UTF-8, UTF-7), mappings, and ALEPH's pre-Unicode era distinguishing bibliographic and administrative data handling. Uncover the multiscript functionality in legacy ALEPH versions with ALPHA script identifiers and define input, display, and filing characteristics for different scripts. Enhance your understanding of Unicode integration and ensure efficient data processing in your ALEPH workflows.

joela

Presentation Transcript


  1. Unicode in ALEPH

  2. Session Outline • Key concepts • Pre-UNICODE ALEPH • ALEPH500.14.2 - full UNICODE version • Innovations in character conversion mechanism • Implementation of UNICODE - conversion, useful remarks, tips

  3. Key Concepts

  4. Key Concepts • Character - the smallest component of the written text • Character set - an agreed upon set of characters • For example, • - English alphabet : 52 upper and lower case letters • - ISO 8859-5 : basic Latin + Cyrillic characters

  5. Key Concepts • Encoding - unique assignment of characters to numerical codes • For example, • - ASCII : Capital letter ‘A’ = 65 • - ISO 8859-8 : Hebrew letter ‘א’ = 224
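
The two assignments on this slide can be checked with Python's built-in codecs. This is a minimal sketch; the helper name `code_of` is made up for illustration, and it assumes the Hebrew letter in question is alef (U+05D0), which is the character ISO 8859-8 places at position 224.

```python
def code_of(char: str, encoding: str) -> int:
    """Return the numeric code a single-byte encoding assigns to one character."""
    encoded = char.encode(encoding)
    if len(encoded) != 1:
        raise ValueError(f"{char!r} is not a single byte in {encoding}")
    return encoded[0]


# ASCII: capital letter 'A'
ascii_a = code_of("A", "ascii")
# ISO 8859-8: Hebrew letter alef (assumed to be the letter meant on the slide)
hebrew_alef = code_of("\u05d0", "iso8859_8")

print(ascii_a, hebrew_alef)
```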

  6. Key Concepts • Encoding types: • single byte (e.g. English + another character set) : one byte = one character • double byte (e.g. ANSEL, UNICODE) : 2 bytes = one character • multi-byte (e.g. CJK, UTF-8) : 1, 2 or 3 bytes = one character
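
As a rough illustration of the difference between fixed-width and multi-byte encodings, the sketch below counts the bytes used per character. The sample characters and the helper `bytes_per_char` are illustrative choices, not taken from the presentation; all three characters lie within the 16-bit range the slides describe.

```python
# Illustrative sample characters (all within the 16-bit code range).
SAMPLES = {
    "Latin capital A": "A",
    "Hebrew alef": "\u05d0",
    "CJK ideograph": "\u4e2d",
}


def bytes_per_char(char: str, encoding: str) -> int:
    """Number of bytes one character occupies in the given encoding."""
    return len(char.encode(encoding))


for name, char in SAMPLES.items():
    # utf-16-be is used so that no byte-order mark is added to the count.
    print(name, bytes_per_char(char, "utf-16-be"), bytes_per_char(char, "utf-8"))
```

In this sample the double-byte form always takes two bytes, while UTF-8 takes one, two, or three bytes depending on the character.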

  7. Non-UNICODE Systems • Non-UNICODE systems: • - Based on the single byte encoding schemes • - ASCII 7-bit code space and its 8-bit extension are limited to 128 and 256 code positions respectively.

  8. Non-UNICODE Systems... • Restricting the character repertoire to at most 256 characters proved far too rigid: • Even covering all European languages written in Latin script needed more than 400 characters.

  9. Non-UNICODE Systems... • As a result, multiple national standards developed, adjusting the character repertoire of the specific language to the limited code space.

  10. Non-UNICODE Systems • For example, ISO 8859 is a full series of 10 standardized multilingual single-byte coded (8-bit) character sets for writing in alphabetic languages: • - Latin1 (West European) • - Latin2 (East European) • - Latin3 (South European) • - Latin4 (North European) • - Cyrillic • - Arabic • - etc.

  11. Non-UNICODE Systems • Results: • 1. Use of multiple inconsistent character codes because of the conflicting character sets. • For example, in Western European software environments one often finds confusion between Windows Latin 1 code page 1252 and ISO 8859-1.

  12. Non-UNICODE Systems • 2. No easy way to input multilingual data • 3. No transparent transfer of textual data between computer systems - high risk of code page related misinterpretation
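
The risk of code page related misinterpretation can be sketched as follows: the same byte decodes to a different character depending on which code page the receiving system assumes. The byte values and the helper `interpretations` are chosen for illustration only.

```python
def interpretations(raw: bytes, code_pages: list[str]) -> dict[str, str]:
    """Decode the same bytes under several code pages."""
    return {cp: raw.decode(cp) for cp in code_pages}


# One byte, three ISO 8859 parts, three different characters.
same_byte = interpretations(bytes([0xE0]), ["iso8859_1", "iso8859_5", "iso8859_8"])
for code_page, char in same_byte.items():
    print(code_page, char, f"U+{ord(char):04X}")

# Windows code page 1252 versus ISO 8859-1: byte 0x93 is a typographic
# quotation mark in the former and a non-printing control code in the latter.
quote = interpretations(bytes([0x93]), ["cp1252", "iso8859_1"])
for code_page, char in quote.items():
    print(code_page, f"U+{ord(char):04X}")
```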

  13. UNICODE

  14. Unicode • Solution provided by the UNICODE standard: • Definition of a set of characters that encompasses most of the major languages of the world

  15. Unicode • Based on 16-bit character codes • Any given 16-bit value always represents the same character.

  16. Unicode • Allocation areas: • The codes are grouped in linguistic and functional categories. • The Unicode standard code space is divided into several areas, which are themselves divided into character blocks.

  17. Unicode

  18. Unicode • Encoding schemes: • UTF-16: double byte encoding using the Unicode standard character codes • UTF-8: multi byte encoding utilizing the full 8 bits of each byte • UTF-7: multi byte encoding utilizing only 7 bits of each byte
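
A short sketch of the three encoding schemes applied to the same two-character text, using Python's standard codecs. The sample text and the helper `encode_all` are illustrative assumptions.

```python
def encode_all(text: str) -> dict[str, bytes]:
    """Encode the same text in the three schemes named on the slide."""
    return {
        # Big-endian variant chosen so that no byte-order mark is prepended.
        "UTF-16": text.encode("utf-16-be"),
        "UTF-8": text.encode("utf-8"),
        "UTF-7": text.encode("utf-7"),
    }


sample = "A\u05d0"  # Latin 'A' followed by Hebrew alef
for scheme, data in encode_all(sample).items():
    print(scheme, data.hex(" "), len(data), "bytes")
```

UTF-16 spends two bytes on every character here, UTF-8 spends one byte on the Latin letter and two on the Hebrew one, and UTF-7 restricts itself to byte values below 128.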

  19. Unicode • Mappings: • Transformation between the Unicode encodings is based on an algorithm and not a table. • Conversion tables from standard character sets to Unicode are readily available. • Unicode can act as an intermediate encoding.
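
Both ideas can be sketched in a few lines. The first function derives UTF-8 bytes from a 16-bit code purely by bit arithmetic, with no lookup table. The second converts between two legacy code pages by passing through Unicode. The function names are hypothetical, and Windows code page 1251 is only an illustrative target that does not appear in the presentation.

```python
def utf8_from_code_point(cp: int) -> bytes:
    """Algorithmic UTF-8 encoding of one 16-bit character code."""
    if not 0 <= cp <= 0xFFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("expected a 16-bit code that is not a surrogate")
    if cp < 0x80:  # one byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:  # two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])


def convert(data: bytes, source: str, target: str) -> bytes:
    """Convert between two character sets using Unicode as the pivot."""
    return data.decode(source).encode(target)


print(utf8_from_code_point(0x05D0).hex(" "))
# Cyrillic small 'er': byte 0xE0 in ISO 8859-5, a different byte in cp1251.
print(convert(bytes([0xE0]), "iso8859_5", "cp1251").hex())
```

With Unicode as the pivot, each character set needs only one mapping to and from Unicode rather than a separate table for every pair of character sets.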

  20. Pre-UNICODE ALEPH

  21. Pre-Unicode ALEPH • ALEPH differentiated between 2 types of data: • Bibliographic: this also includes all authority and holdings records • Administrative: patrons, items, acquisition data, serials, etc.

  22. Pre-Unicode ALEPH • Administrative data: • Inherently homogeneous • Data can be stored in a single byte encoding of a given character set.

  23. Pre-Unicode ALEPH • Bibliographic data: • In all versions of ALEPH, bibliographic information can be defined in any number of languages, regardless of Windows multilingual support.

  24. Pre-Unicode ALEPH • Multiscript functionality in the non-UNICODE versions of ALEPH is possible due to the presence of ALPHA, a script identifier, in the field.

  25. Pre-Unicode ALEPH

  26. Pre-Unicode ALEPH • ALPHA defines input, display, and filing characteristics of the field.

  27. Pre-Unicode ALEPH • Input: • One of the configuration files in the GUI client contains definition of the font in which you can input a certain script. • catalog.ini: • FontL=Courier New • FontH=Web Hebrew Monospace • FontA=Aleph Fixed Arabic Egypt • FontS=Courier New Cyr • FontR=Courier New Greek

  28. Pre-Unicode ALEPH • Output: • A similar definition exists for the display characteristics of the bibliographic data. • alephcom.ini: • FontL01=11MS Sans Serif • FontH01=16Web Hebrew AD • FontA01=16Aleph Fixed Arabic Egypt • FontS01=18Courier New Cyr • FontR01=16Courier New Greek

  29. Pre-Unicode ALEPH • Screen capture from MLT

  30. Pre-Unicode ALEPH • Filing order is defined per script: • char_conv.A: • AL 235 000 • AH 235 235

  31. Pre-Unicode ALEPH • Creation of indexes is ALPHA specific: • z01_rec_key \ • 03 acc_code ..............AUT • 03 alpha .................H • 03 filing_text ........…צורות חשיבה • z01_rec_key \ • 03 acc_code ..............AUT • 03 alpha .................L • 03 filing_text ...........aamodt agnar

  32. Pre-Unicode ALEPH • Pre-UNICODE ALEPH is ALPHA dependent

  33. Pre-Unicode ALEPH • Restrictions: • 1. GUI input and output within a single field are limited to the 256 characters of one code page. • It is not possible to input and display Latin characters with diacritics and non-Latin characters in one field (e.g., a Russian title containing several French words).

  34. Pre-Unicode ALEPH • 2. Indexing and retrieval are script dependent. • Both FIND and BROWSE are performed within the ALPHA restricted groups of index records.

  35. Pre-Unicode ALEPH • For example, an ‘S’ designated field will be indexed as Cyrillic (marked as ‘S’ in the indexing tables) in both the browse index (z01) and the words index (z97).

  36. Pre-Unicode ALEPH ‘S’ marked headings and words can be retrieved only when the ‘S’ designated query is sent.
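
The effect of ALPHA restricted index groups can be modelled with a toy in-memory index. This is only a sketch: the class and function below are hypothetical and do not reproduce the actual z01 record layout or ALEPH's retrieval code. The two sample entries reuse the field values shown on the index creation slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    acc_code: str     # index code, e.g. AUT
    alpha: str        # script identifier, e.g. L, H, S
    filing_text: str  # normalized heading


ENTRIES = [
    IndexEntry("AUT", "H", "צורות חשיבה"),
    IndexEntry("AUT", "L", "aamodt agnar"),
]


def browse(entries, acc_code: str, alpha: str, start: str = "") -> list[str]:
    """Return headings at or after `start`, looking only at one ALPHA group."""
    return sorted(
        e.filing_text
        for e in entries
        if e.acc_code == acc_code and e.alpha == alpha and e.filing_text >= start
    )


print(browse(ENTRIES, "AUT", "L"))  # only Latin script headings are visible
print(browse(ENTRIES, "AUT", "S"))  # no Cyrillic entries in this sample
```

A query sent with one ALPHA never sees headings filed under another, which is the script dependence described above.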

  37. UNICODE ALEPH

  38. UNICODE ALEPH • 14.2 is the full UNICODE version

  39. UNICODE ALEPH • Data (bibliographic + administrative) is stored in UTF-8 • GUI client is UNICODE compatible • No need for character conversion for input and display • ALPHA loses its meaning

  40. UNICODE ALEPH - Indexing • Words: • Creation of the words index is no longer ALPHA dependent. • Index is created in UTF-8. • Indexing records increased in size to accommodate Unicode data (z97).

  41. UNICODE ALEPH - Indexing • Browse index: • The browse index is no longer ALPHA specific either. • Index is created in UNICODE - 16-bit codes • Indexing records are increased in size to accommodate Unicode data (z01).

  42. UNICODE ALEPH - GUI client • Unicode data processing

  43. UNICODE ALEPH – GUI client • Catalog and Search clients - no limitations in input and display of UNICODE data • Administrative clients : • no limitations in display of UNICODE data in the Navigation Map, View windows, Lists • BUT • input forms use Windows controls which enable display of data corresponding to the Windows code page. Data which cannot be displayed properly appears as question marks. The fields are locked for editing.

  44. UNICODE ALEPH - WEB OPAC • WEB OPAC - UTF-8 input and display

  45. UNICODE ALEPH - WEB OPAC • ALEPH is sensitive to browser types. • If the browser is older than Netscape 6 or Internet Explorer 5, ALEPH assumes that it does not support UTF-8. • www_server_defaults defines the default character set for browsers that are not UTF-8 compatible. • Example: • setenv server_default_charset "iso-8859-1"
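
The browser rule can be sketched as a small decision function. This is not ALEPH's code: the function names and the browser keys are hypothetical, and treating unrecognized browsers as non-UTF-8 is an assumption the slide does not state.

```python
SERVER_DEFAULT_CHARSET = "iso-8859-1"  # value from the www_server_defaults example

# Minimum major versions assumed to support UTF-8, per the slide.
MIN_UTF8_VERSION = {"netscape": 6, "msie": 5}


def response_charset(browser: str, major_version: int,
                     default: str = SERVER_DEFAULT_CHARSET) -> str:
    """Pick the character set for a page sent to the given browser."""
    minimum = MIN_UTF8_VERSION.get(browser.lower())
    if minimum is not None and major_version >= minimum:
        return "utf-8"
    # Assumption: older and unrecognized browsers get the server default.
    return default


def render(page: str, browser: str, major_version: int) -> bytes:
    """Encode a page, replacing characters the chosen charset cannot express."""
    return page.encode(response_charset(browser, major_version), errors="replace")


print(response_charset("Netscape", 4), response_charset("MSIE", 5))
```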

  46. UNICODE ALEPH - tables and html pages • Tables and html pages are written in ISO and on-load are converted to utf-8. • The utf-8 variants of the WEB pages and tables are stored under ./alephe/utf_files.

  47. UNICODE ALEPH - tables and html pages • The system converts tables and html pages in accordance with the default character conversion definition in $alephe_root/aleph_start_505: • setenv default_character_conversion 8859_1_TO_UTF
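
What an 8859_1_TO_UTF conversion amounts to for a single file can be sketched as below. This is an illustration, not the ALEPH conversion routine: the function name, the sample file name, and the temporary directory layout are all hypothetical, and mapping the setting to Python's `iso-8859-1` and `utf-8` codecs is an assumption.

```python
import tempfile
from pathlib import Path


def convert_table(src: Path, dst_dir: Path, source_encoding: str = "iso-8859-1") -> Path:
    """Write a UTF-8 copy of a table or html page into dst_dir."""
    text = src.read_bytes().decode(source_encoding)
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / src.name
    dst.write_bytes(text.encode("utf-8"))
    return dst


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "sample-page"       # hypothetical file name
    source.write_bytes(b"Zur\xfcck")         # ISO 8859-1 bytes for "Zurück"
    converted = convert_table(source, Path(tmp) / "utf_files")
    print(converted.read_bytes())
```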

  48. UNICODE ALEPH - Printing • Printouts produced from the GUI client: • - If UNICODE data processing does not succeed, the data is converted to the Windows code page. Unrecognized characters are displayed as question marks.

  49. UNICODE ALEPH - Printing • Printouts produced from the WEB OPAC are converted to single byte codepage. Transliteration of unrecognized characters is possible.
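
Both printing fallbacks can be sketched with Python's codec error handling. The function names, the choice of Windows code page 1252, and the tiny transliteration table are illustrative assumptions; ALEPH's own transliteration tables are not shown in the presentation.

```python
def to_code_page(text: str, code_page: str = "cp1252") -> bytes:
    """Convert to a single byte code page; unrecognized characters become '?'."""
    return text.encode(code_page, errors="replace")


# Hypothetical transliteration table covering just the sample word below.
TRANSLIT = {"\u043c": "m", "\u0438": "i", "\u0440": "r"}


def transliterate_then_encode(text: str, table: dict[str, str],
                              code_page: str = "cp1252") -> bytes:
    """Transliterate known characters first, then fall back to '?'."""
    return text.translate(str.maketrans(table)).encode(code_page, errors="replace")


print(to_code_page("caf\u00e9 \u05d0"))
print(transliterate_then_encode("\u043c\u0438\u0440", TRANSLIT))
```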

  50. UNICODE ALEPH - Services Processing of UTF data is enabled in the batch services.