Unicode in ALEPH
Session Outline • Key concepts • Pre-UNICODE ALEPH • ALEPH500.14.2 - full UNICODE version • Innovations in character conversion mechanism • Implementation of UNICODE - conversion, useful remarks, tips
Key Concepts • Character - the smallest component of written text • Character set - an agreed-upon set of characters • For example: • - English alphabet: 52 upper- and lower-case letters • - ISO 8859-5: basic Latin + Cyrillic characters
Key Concepts • Encoding - unique assignment of characters to numerical codes • For example: • - ASCII: capital letter ‘A’ = 65 • - ISO 8859-8: Hebrew letter ‘א’ = 224
Key Concepts • Encoding types: • single byte (e.g. English + another character set): one byte = one character • double byte (e.g. ANSEL, UNICODE): 2 bytes = one character • multi-byte (e.g. CJK, UTF-8): 1, 2 or 3 bytes = one character (UTF-8 needs 4 bytes for characters beyond the 16-bit range)
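The code values and byte counts on the preceding slides can be checked with a short Python sketch. It relies only on Python's built-in codecs; the sample characters (Latin A, Hebrew alef, one CJK ideograph) are chosen here for illustration and are not part of ALEPH.

```python
# Code values from the slides: 'A' is 65 in ASCII, and position 224 in
# ISO 8859-8 holds the Hebrew letter alef (U+05D0).
ascii_code = "A".encode("ascii")[0]
hebrew_code = "\u05d0".encode("iso8859_8")[0]


def byte_lengths(ch):
    """Return how many bytes one character needs in UTF-8 and in UTF-16."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),  # big-endian, no byte-order mark
    }


# Illustrative sample characters: Latin, Hebrew, CJK.
samples = {
    name: byte_lengths(ch)
    for name, ch in [("latin A", "A"), ("hebrew alef", "\u05d0"), ("cjk", "\u4e2d")]
}

print(ascii_code, hebrew_code)
for name, lengths in samples.items():
    print(name, lengths)
```

All three samples take 2 bytes in UTF-16, while UTF-8 uses 1, 2 and 3 bytes respectively.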
Non-UNICODE Systems • Non-UNICODE systems: • - Based on single-byte encoding schemes • - The ASCII 7-bit code space and its 8-bit extensions are limited to 128 and 256 code positions respectively.
Non-UNICODE Systems... • Restricting the character repertoire to at most 256 characters proved too rigid: • Even covering all European languages written in Latin script requires more than 400 characters.
Non-UNICODE Systems... • As a result, multiple national standards developed, each fitting the character repertoire of a specific language into the limited code space.
Non-UNICODE Systems • For example, ISO 8859 is a full series of 10 standardized multilingual single-byte (8-bit) coded character sets for writing in alphabetic languages: • - Latin1 (West European) • - Latin2 (East European) • - Latin3 (South European) • - Latin4 (North European) • - Cyrillic • - Arabic • - etc.
Non-UNICODE Systems • Results: • 1. Use of multiple inconsistent character codes because of the conflicting character sets. • For example, in Western European software environments one often finds confusion between Windows Latin 1 code page 1252 and ISO 8859-1.
Non-UNICODE Systems • 2. No easy way to input multilingual data • 3. No transparent transfer of textual data between computer systems - high risk of code page related misinterpretation
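The risk of code-page-related misinterpretation can be illustrated with a minimal Python sketch. It decodes the same byte values under several single-byte character sets; the chosen byte values (0xE0 and 0x80) are examples picked for this illustration.

```python
# One byte, three meanings: 0xE0 under three parts of ISO 8859.
raw = bytes([0xE0])
interpretations = {
    charset: raw.decode(charset)
    for charset in ("iso8859_1", "iso8859_5", "iso8859_8")
}

# Windows code page 1252 versus ISO 8859-1: byte 0x80 is the euro sign in
# cp1252 but an invisible C1 control character in ISO 8859-1.
as_cp1252 = b"\x80".decode("cp1252")
as_latin1 = b"\x80".decode("iso8859_1")

for charset, ch in interpretations.items():
    print(charset, repr(ch), "U+%04X" % ord(ch))
print(repr(as_cp1252), repr(as_latin1))
```

Without an external label saying which character set applies, the receiving system has no reliable way to tell these readings apart.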
Unicode • Solution provided by the UNICODE standard: • Definition of a set of characters that encompasses most of the major languages of the world
Unicode • Based on 16-bit character codes • Any given 16-bit value always represents the same character.
Unicode • Allocation areas: • The codes are grouped in linguistic and functional categories. • The Unicode standard code space is divided into several areas, which are themselves divided into character blocks.
Unicode • Encoding schemes: • UTF-16: double byte encoding using the Unicode standard character codes • UTF-8: multi byte encoding utilizing the full 8 bits of each byte • UTF-7: multi byte encoding utilizing only 7 bits of each byte
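As a rough illustration of the three encoding schemes, the sketch below encodes one character (é, U+00E9, chosen arbitrarily) with Python's built-in codecs and checks that the UTF-7 form stays within 7 bits.

```python
text = "\u00e9"  # é, an arbitrary sample character

encoded = {name: text.encode(name) for name in ("utf-16-be", "utf-8", "utf-7")}

# UTF-7 must be safe for 7-bit channels: every byte is below 0x80.
seven_bit_only = all(byte < 0x80 for byte in encoded["utf-7"])

for name, data in encoded.items():
    print(name, data, len(data), "bytes")
print("UTF-7 uses only 7-bit bytes:", seven_bit_only)
```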
Unicode • Mappings: • Transformation between the Unicode encoding schemes is based on an algorithm, not a table. • Conversion tables from standard character sets to Unicode are readily available. • Unicode can act as an intermediate encoding.
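Both points can be sketched in Python. The first function derives UTF-8 bytes from a 16-bit code by bit manipulation alone, which is what "algorithm, not a table" means in practice. The second converts between two legacy code pages with Unicode as the intermediate step. The function names are hypothetical and this is not ALEPH's conversion code.

```python
def utf8_from_code_point(cp):
    """Encode one 16-bit code point (U+0000..U+FFFF) as UTF-8 bytes."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate codes are not characters")
    if cp < 0x80:          # 7 bits  -> 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:         # 11 bits -> 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # 16 bits -> 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    raise ValueError("outside the 16-bit range covered by this sketch")


def convert_via_unicode(data, source, target):
    """Convert bytes between two character sets, using Unicode as the pivot."""
    return data.decode(source).encode(target)


# Cyrillic small letter er: 0xE0 in ISO 8859-5, 0xF0 in Windows-1251.
pivoted = convert_via_unicode(b"\xe0", "iso8859_5", "cp1251")
print(utf8_from_code_point(0x05D0), pivoted)
```

With a pivot, each character set needs only one table (to and from Unicode) instead of one table per pair of character sets.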
Pre-Unicode ALEPH • ALEPH differentiated between 2 types of data: • Bibliographic: this also includes all authority and holdings records • Administrative: patrons, items, acquisition data, serials, etc.
Pre-Unicode ALEPH • Administrative data: • Inherently homogeneous • Data can be stored in a single-byte encoding of a given character set.
Pre-Unicode ALEPH • Bibliographic data: • In all versions of ALEPH, bibliographic information can be defined in as many languages as needed, regardless of Windows multilingual support.
Pre-Unicode ALEPH • Multiscript functionality in the non-UNICODE versions of ALEPH is possible thanks to ALPHA, a script identifier stored in the field.
Pre-Unicode ALEPH • ALPHA defines input, display, and filing characteristics of the field.
Pre-Unicode ALEPH • Input: • One of the configuration files in the GUI client contains definition of the font in which you can input a certain script. • catalog.ini: • FontL=Courier New • FontH=Web Hebrew Monospace • FontA=Aleph Fixed Arabic Egypt • FontS=Courier New Cyr • FontR=Courier New Greek
Pre-Unicode ALEPH • Output: • A similar definition exists for the display characteristics of the bibliographic data. • alephcom.ini: • FontL01=11MS Sans Serif • FontH01=16Web Hebrew AD • FontA01=16Aleph Fixed Arabic Egypt • FontS01=18Courier New Cyr • FontR01=16Courier New Greek
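The per-script font definitions can be read as a lookup from ALPHA code to font. The Python sketch below parses lines in the style shown on the slides. It assumes that the letter after "Font" is the ALPHA code and that the leading digits of the value (for example the 16 in "16Web Hebrew AD") are a font size; both readings are guesses from the examples, not documented ALEPH behaviour.

```python
import re

# Font<ALPHA>[slot]=<optional size digits><font name>
FONT_LINE = re.compile(r"^Font([A-Z])(\d*)=(\d*)(.*)$")


def parse_font_lines(lines):
    """Map each ALPHA code to (size or None, font name)."""
    fonts = {}
    for line in lines:
        match = FONT_LINE.match(line.strip())
        if not match:
            continue  # not a font definition
        alpha, _slot, size, name = match.groups()
        fonts[alpha] = (int(size) if size else None, name.strip())
    return fonts


display_fonts = parse_font_lines([
    "FontL01=11MS Sans Serif",
    "FontH01=16Web Hebrew AD",
    "FontA01=16Aleph Fixed Arabic Egypt",
    "FontS01=18Courier New Cyr",
    "FontR01=16Courier New Greek",
])
input_fonts = parse_font_lines(["FontL=Courier New", "FontH=Web Hebrew Monospace"])

print(display_fonts["H"], input_fonts["L"])
```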
Pre-Unicode ALEPH • Screen capture from MLT
Pre-Unicode ALEPH • Filing order is defined per script: • char_conv.A: • AL 235 000 • AH 235 235
Pre-Unicode ALEPH • Creation of indexes is ALPHA specific: • z01_rec_key \ • 03 acc_code ..............AUT • 03 alpha .................H • 03 filing_text ........…צורות חשיבה • z01_rec_key \ • 03 acc_code ..............AUT • 03 alpha .................L • 03 filing_text ...........aamodt agnar
Pre-Unicode ALEPH • Pre-UNICODE ALEPH is ALPHA dependent
Pre-Unicode ALEPH • Restrictions: • 1. GUI input and output within a single field are limited to the 256 characters of one code page. • It is not possible to input and display Latin characters with diacritics and non-Latin characters in one field (e.g., a Russian title containing several French words).
Pre-Unicode ALEPH • 2. Indexing and retrieval are script dependent. • Both FIND and BROWSE are performed within the ALPHA-restricted groups of index records.
Pre-Unicode ALEPH • For example, an ‘S’ designated field will be indexed as Cyrillic (marked as ‘S’ in the indexing tables) in both the browse index (z01) and the words index (z97). • [Screen captures of the field and of the z01 and z97 records]
Pre-Unicode ALEPH ‘S’ marked headings and words can be retrieved only when the ‘S’ designated query is sent.
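A toy model may help show why retrieval is script dependent. In the Python sketch below, headings are grouped by access code and ALPHA, loosely mirroring the acc_code and alpha fields of the z01 records shown earlier. The class, its methods and the Cyrillic sample heading are hypothetical; the real index structure is not described on the slides.

```python
class AlphaIndex:
    """Toy browse index partitioned by (acc_code, alpha)."""

    def __init__(self):
        self._groups = {}  # (acc_code, alpha) -> list of filing texts

    def add(self, acc_code, alpha, filing_text):
        self._groups.setdefault((acc_code, alpha), []).append(filing_text)

    def browse(self, acc_code, alpha, start=""):
        """Return headings of one ALPHA group only, in filing order."""
        group = self._groups.get((acc_code, alpha), [])
        return sorted(text for text in group if text >= start)


index = AlphaIndex()
index.add("AUT", "L", "aamodt agnar")
index.add("AUT", "H", "\u05e6\u05d5\u05e8\u05d5\u05ea \u05d7\u05e9\u05d9\u05d1\u05d4")
index.add("AUT", "S", "толстой лев")  # hypothetical Cyrillic heading

print(index.browse("AUT", "L"))  # the 'S' heading is invisible to an 'L' query
print(index.browse("AUT", "S"))
```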
UNICODE ALEPH • 14.2 is the full UNICODE version
UNICODE ALEPH • Data (bibliographic + administrative) is stored in UTF-8 • GUI client is UNICODE compatible • No need for character conversion for input and display • ALPHA loses its meaning
UNICODE ALEPH - Indexing • Words: • Creation of the words index is no longer ALPHA dependent. • Index is created in UTF-8. • Indexing records increased in size to accommodate Unicode data (z97).
UNICODE ALEPH - Indexing • Browse index: • The browse index is not ALPHA specific either. • Index is created in UNICODE - 16-bit codes • Indexing records are increased in size to accommodate Unicode data (z01).
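The need for larger index records can be seen by measuring one heading in each encoding. The Python sketch below uses the Hebrew filing text from the earlier z01 example, written with escapes to avoid ambiguity. It measures only the text itself and makes no claim about the actual z01 or z97 record layouts.

```python
# The Hebrew filing text from the z01 example: 5 letters, a space, 5 letters.
filing_text = "\u05e6\u05d5\u05e8\u05d5\u05ea \u05d7\u05e9\u05d9\u05d1\u05d4"

sizes = {
    "iso8859_8": len(filing_text.encode("iso8859_8")),  # single-byte, pre-Unicode
    "utf-8": len(filing_text.encode("utf-8")),          # words index encoding
    "utf-16-be": len(filing_text.encode("utf-16-be")),  # 16-bit codes, browse index
}

for encoding, size in sizes.items():
    print(encoding, size, "bytes")
```

The same 11 characters take 11 bytes in ISO 8859-8, 21 in UTF-8 and 22 as 16-bit codes, so non-Latin headings roughly double in size.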
UNICODE ALEPH - GUI client • Unicode data processing
UNICODE ALEPH – GUI client • Catalog and Search clients - no limitations in input and display of UNICODE data • Administrative clients: • no limitations in display of UNICODE data in the Navigation Map, View windows, Lists • BUT • input forms use Windows controls, which can display only data that belongs to the Windows code page. Data which cannot be displayed properly appears as question marks. Such fields are locked for editing.
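The question-mark behaviour is what a lossy conversion to a single code page looks like. A minimal Python sketch, assuming a Western European Windows code page (cp1252); the function names and the rule used for locking a field are illustrative guesses, not the client's actual logic.

```python
def render_in_code_page(text, code_page="cp1252"):
    """Show text as a code-page-bound control would: unknown characters become '?'."""
    return text.encode(code_page, errors="replace").decode(code_page)


def is_editable(text, code_page="cp1252"):
    """Assumed rule: a field is editable only if every character fits the code page."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


for value in ("Café", "Москва"):
    print(repr(render_in_code_page(value)), "editable:", is_editable(value))
```

Locking such fields avoids saving the question marks back over the original Unicode data.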
UNICODE ALEPH - WEB OPAC • WEB OPAC - UTF-8 input and display
UNICODE ALEPH - WEB OPAC • ALEPH is sensitive to browser types. • If the browser is older than Netscape 6 or Internet Explorer 5, ALEPH assumes that it does not support UTF-8. • www_server_defaults defines the default character set for browsers that are not UTF-8 compatible. • Example: • setenv server_default_charset "iso-8859-1"
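The browser rule can be expressed as a small decision function. This Python sketch is an assumption-laden illustration: the function name, the browser labels and the handling of unknown browsers (treated here as not UTF-8 capable) are hypothetical, and how ALEPH actually identifies the browser is not covered on the slides.

```python
# Minimum major versions assumed to support UTF-8, as stated on the slide.
UTF8_MIN_VERSION = {"netscape": 6, "msie": 5}


def response_charset(browser, major_version, server_default_charset="iso-8859-1"):
    """Pick UTF-8 for capable browsers, otherwise the configured default charset."""
    minimum = UTF8_MIN_VERSION.get(browser.lower())
    if minimum is not None and major_version >= minimum:
        return "utf-8"
    # Older or unknown browser: fall back to server_default_charset.
    return server_default_charset


print(response_charset("Netscape", 4))  # falls back to the default
print(response_charset("MSIE", 5))      # UTF-8
```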
UNICODE ALEPH - tables and html pages • Tables and html pages are written in an ISO single-byte character set and are converted to UTF-8 on load. • The UTF-8 variants of the WEB pages and tables are stored under ./alephe/utf_files.
UNICODE ALEPH - tables and html pages • The system converts tables and html pages in accordance with the default character conversion definition in $alephe_root/aleph_start_505: • setenv default_character_conversion 8859_1_TO_UTF
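The effect of the default conversion on a table or page can be sketched as follows. The mapping from the conversion name 8859_1_TO_UTF to Python codec names, and the function itself, are assumptions made for this illustration; ALEPH's own conversion tables are not shown on the slides.

```python
# Assumed meaning of the conversion name: ISO 8859-1 in, UTF-8 out.
CONVERSIONS = {"8859_1_TO_UTF": ("iso8859_1", "utf-8")}


def convert_table(raw, conversion="8859_1_TO_UTF"):
    """Return the UTF-8 variant of a table or html page stored in a single-byte set."""
    source, target = CONVERSIONS[conversion]
    return raw.decode(source).encode(target)


page = b"<title>Biblioth\xe8que</title>"  # sample ISO 8859-1 content
converted = convert_table(page)
print(len(page), len(converted))  # the accented letter grows from 1 byte to 2
```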
UNICODE ALEPH - Printing • Printouts produced from the GUI client: • - If UNICODE data processing does not succeed, the data is converted to the Windows code page. Unrecognized characters are displayed as question marks.
UNICODE ALEPH - Printing • Printouts produced from the WEB OPAC are converted to a single-byte code page. Transliteration of unrecognized characters is possible.
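Transliteration as a fallback can be sketched as a per-character lookup applied only where the target code page has no equivalent. The tiny table and the function below are hypothetical examples; the slides do not show ALEPH's transliteration tables.

```python
# Hypothetical, deliberately tiny transliteration table (Cyrillic -> Latin).
TRANSLIT = {"\u0436": "zh", "\u0443": "u", "\u043a": "k"}


def to_single_byte(text, code_page="iso8859_1", table=TRANSLIT):
    """Convert to a single-byte code page, transliterating what does not fit."""
    pieces = []
    for ch in text:
        try:
            ch.encode(code_page)
            pieces.append(ch)                  # character exists in the code page
        except UnicodeEncodeError:
            pieces.append(table.get(ch, "?"))  # transliterate, else question mark
    return "".join(pieces).encode(code_page)


print(to_single_byte("\u0436\u0443\u043a"))  # Cyrillic word, transliterated
print(to_single_byte("caf\u00e9"))           # already fits ISO 8859-1
```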
UNICODE ALEPH - Services • Processing of UTF data is enabled in the batch services.