Character Set and Language Negotiation in Z39.50 Version 3

Scope
• Negotiate language of messages
• Negotiate character set of InternationalString
• Z39.50 “message” strings
• Optionally retrieve records in negotiated character set
• Character set negotiation only valid for version 3
Stockholm, 10 August 1999

Negotiation Basics
• Carried in UserInfo external object in Init
• Similar to option negotiation
  • origin proposes list of possibilities
  • target selects one from list
• Only a single round of negotiation takes place
• Applies to complete session
• Cannot change during session
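
The single-round exchange (origin proposes a list, target selects one entry) can be sketched in a few lines of Python. This is an illustration only, not part of Z39.50: the function name `select_from_proposal` and the string labels are hypothetical stand-ins for the real ASN.1 structures, and honouring the origin's ordering is an assumption of this sketch.

```python
from typing import Optional, Sequence


def select_from_proposal(proposed: Sequence[str],
                         supported: Sequence[str]) -> Optional[str]:
    """Return the first proposed item the target supports, or None.

    Models one negotiation round: the origin sends an ordered list and
    the target answers with at most one entry taken from that list.
    """
    supported_set = set(supported)
    for candidate in proposed:
        if candidate in supported_set:
            return candidate
    return None


# Hypothetical labels; a real implementation would carry ASN.1 values.
origin_proposal = ["iso10646/utf-8", "iso2022/8-bit"]
target_supports = ["iso2022/8-bit", "iso10646/utf-8"]

chosen = select_from_proposal(origin_proposal, target_supports)
print(chosen)
```

Because there is only one round, a `None` result here would mean the session proceeds without a negotiated value; there is no counter-proposal step.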

UserInfoFormat-charSetandLanguageNegotiation-2 {1 2 840 10003 10 2} DEFINITIONS ::=
BEGIN
CharSetandLanguageNegotiation ::= CHOICE {
    proposal [1] IMPLICIT OriginProposal,
    response [2] IMPLICIT TargetResponse
}

Character Sets
• ISO 2022 is “code page” approach to character set
• ISO 10646 is ~ Unicode
• Different procedures for negotiating character sets:
  • ISO 2022
  • ISO 10646
• Can negotiate “private” character set

OriginProposal ::= SEQUENCE {
    proposedCharSets [1] IMPLICIT SEQUENCE OF CHOICE {
        iso2022  [1] Iso2022,
        iso10646 [2] IMPLICIT Iso10646,
        private  [3] PrivateCharacterSet } OPTIONAL,
        -- proposedCharSets must be omitted
        -- if origin proposes version 2
}

ISO 2022
• Supports 7- and 8-bit environments
• “Page” is 96 graphic characters (“G set”) and 32 control characters (“C set”)
• 2 G pages active at any one time (G-Left [hex 20-7F], G-Right [hex A0-FF])
• 2 C sets active (C0 [00-1F], C1 [80-9F])
• Can define 4 G pages and swap into GL, GR as needed
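
The 8-bit code table layout of ISO 2022 (C0, GL, C1, GR) can be made concrete with a small classifier. This is a minimal Python sketch; the function name `iso2022_region` is hypothetical, and it uses the 96-character page boundaries (GL = hex 20-7F, GR = hex A0-FF).

```python
def iso2022_region(byte: int) -> str:
    """Name the 8-bit code table region that a byte value falls in."""
    if not 0x00 <= byte <= 0xFF:
        raise ValueError("byte must be in the range 0x00-0xFF")
    if byte <= 0x1F:
        return "C0"   # control set, hex 00-1F
    if byte <= 0x7F:
        return "GL"   # left graphic area, hex 20-7F
    if byte <= 0x9F:
        return "C1"   # control set, hex 80-9F
    return "GR"       # right graphic area, hex A0-FF


for value in (0x1B, 0x41, 0x85, 0xE9):
    print(f"{value:02X} -> {iso2022_region(value)}")
```

Which character a GL or GR byte actually denotes depends on which of G0-G3 is currently invoked into that area; the classifier only identifies the area.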

ISO 2022 Escapes
• Assign character sets to pages G0-G3, C0-C1
• Make G pages active in GL, GR
• Character sets identified by 1 or 2 characters in the escape sequence
• Character sets and the escape sequences to identify them are registered:
  • http://www.itscj.or.jp/ISO-IR/index.htm

ISO 2022 negotiation
• Negotiate initial assignment of G0-G3
• Negotiate initial assignment of GL, GR
• Sequence of origin proposals for all of these
• Target response chooses one of these proposals
• In absence of negotiation must assume IRV in GL with GR undefined
  • no characters above hex 7F

Iso2022 ::= CHOICE {
    originProposal [1] IMPLICIT SEQUENCE {
        proposedEnvironment  [0] Environment OPTIONAL,
        proposedSets         [1] IMPLICIT SEQUENCE OF INTEGER,
        proposedInitialSets  [2] IMPLICIT SEQUENCE OF InitialSet,
        proposedLeftAndRight [3] IMPLICIT LeftAndRight },
}

Environment ::= CHOICE {
    sevenBit [1] IMPLICIT NULL,
    eightBit [2] IMPLICIT NULL }

InitialSet ::= SEQUENCE {
    g0 [0] IMPLICIT INTEGER,
    g1 [1] IMPLICIT INTEGER,
    g2 [2] IMPLICIT INTEGER,
    g3 [3] IMPLICIT INTEGER,
    c0 [4] IMPLICIT INTEGER,
    c1 [5] IMPLICIT INTEGER }

LeftAndRight ::= SEQUENCE {
    gLeft  [3] IMPLICIT INTEGER {g0 (0), g1 (1), g2 (2), g3 (3)},
    gRight [4] IMPLICIT INTEGER {g1 (1), g2 (2), g3 (3)} }
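
As an illustration of the LeftAndRight definition, the sketch below models it as a Python dataclass and rejects values outside the named numbers. Treating the named numbers as the only permitted values is an assumption of this sketch (ASN.1 named numbers do not by themselves restrict an INTEGER); the class and constant names are hypothetical.

```python
from dataclasses import dataclass

# Named values from the LeftAndRight definition: g0 (0) .. g3 (3).
G_LEFT_VALUES = {0, 1, 2, 3}
G_RIGHT_VALUES = {1, 2, 3}  # g0 is not listed for gRight


@dataclass(frozen=True)
class LeftAndRight:
    g_left: int
    g_right: int

    def __post_init__(self) -> None:
        # Validate against the named numbers (an assumption, see above).
        if self.g_left not in G_LEFT_VALUES:
            raise ValueError(f"gLeft must be one of {sorted(G_LEFT_VALUES)}")
        if self.g_right not in G_RIGHT_VALUES:
            raise ValueError(f"gRight must be one of {sorted(G_RIGHT_VALUES)}")


# G0 invoked into GL and G1 invoked into GR.
assignment = LeftAndRight(g_left=0, g_right=1)
print(assignment)
```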

ISO 10646
• Defines a single set of 2^32 possible characters (4+ billion !!!)
• Divided into “planes” of 2^16 characters
• Only first plane currently has characters defined: “Basic Multilingual Plane” (BMP)
• BMP is co-terminous with Unicode
• Z39.50 negotiates ISO 10646, not Unicode per se
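
The code space arithmetic can be checked directly: a 32-bit code space has 2^32 positions, and with planes of 2^16 characters the plane number of a code point is its value divided by 2^16. A minimal Python sketch (the names `PLANE_SIZE` and `plane_of` are hypothetical):

```python
PLANE_SIZE = 2 ** 16  # characters per plane


def plane_of(code_point: int) -> int:
    """Return the plane number that contains the given code point."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    return code_point // PLANE_SIZE  # equivalent to code_point >> 16


print(2 ** 32)              # size of a 32-bit code space
print(plane_of(ord("A")))   # a BMP character
print(plane_of(0x10000))    # first code point beyond the BMP
```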

Unicode Encoding Rules
• UCS-4: 32-bit characters
• UCS-2: 16-bit character encoding, limited to the BMP
• UTF-16: like UCS-2, with “surrogate” mechanism for characters in planes above 0
• UTF-8: 8-bit character encoding, with variable length multi-byte characters for all characters other than first 128
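
The surrogate mechanism represents a code point outside plane 0 as two 16-bit units. The sketch below computes the pair by hand; the function name `to_surrogates` is hypothetical, and Python's built-in `utf-16-be` codec can be used as a cross-check.

```python
from typing import Tuple


def to_surrogates(code_point: int) -> Tuple[int, int]:
    """Split a code point above the BMP into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("surrogates only apply to U+10000..U+10FFFF")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")

# Cross-check against the standard library encoder.
expected = chr(0x1F600).encode("utf-16-be")
print(expected == high.to_bytes(2, "big") + low.to_bytes(2, "big"))
```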

UTF-8
• Intended to be a “file system safe” encoding
• Guarantees that every byte with a value below hex 80 represents an ASCII character, including hex 00
• All characters with values above 7F are encoded as 2, 3 or 4 bytes
• Transformation between UTF-8 and UCS-2 is simple and efficient
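
These properties are easy to observe with Python's built-in UTF-8 codec. The sample characters below are arbitrary choices for illustration, and `utf8_length` is a hypothetical helper name.

```python
def utf8_length(char: str) -> int:
    """Return the number of bytes UTF-8 uses for a single character."""
    return len(char.encode("utf-8"))


samples = ["A", "\u00e9", "\u20ac", "\U00010000"]
for char in samples:
    encoded = char.encode("utf-8")
    print(f"U+{ord(char):04X} -> {len(encoded)} byte(s): {encoded.hex()}")

# Every byte of a multi-byte sequence has its high bit set, so no byte
# below hex 80 can be mistaken for part of a non-ASCII character.
multi_byte = "\u00e9\u20ac".encode("utf-8")
print(all(byte >= 0x80 for byte in multi_byte))
```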

Negotiating ISO 10646
• Specify the “character repertoire” (i.e. the subset of the full UCS that will be used)
• Specify the encoding
• Handled by object identifiers
• For Unicode:
  • character repertoire is the full BMP
  • encoding can be UTF-16 or UTF-8

Iso10646 ::= SEQUENCE {
    collections [1] IMPLICIT OBJECT IDENTIFIER,
        -- oid of form 1.0.10646.implementationLevel
        -- .repertoireSubset.arc1.arc2. ....
        -- [use 1.0.10646.1.2.1.3 for Unicode]
    encodingLevel [2] IMPLICIT OBJECT IDENTIFIER
        -- oid of form 1.0.10646.0.form
        -- where value of 'form' is 2, 4, 5, or 8
        -- for ucs-2, ucs-4, utf-16, utf-8
}
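
The mapping from the final 'form' arc of the encodingLevel OID to an encoding name can be sketched as a lookup table. This is a hedged illustration: the names `ENCODING_FORMS` and `encoding_name` are hypothetical, the function inspects only the last arc and does not validate the OID prefix, and the sample OID is the one that appears in the InitRequest example later in this deck.

```python
from typing import Sequence

# 'form' values listed in the Iso10646 definition.
ENCODING_FORMS = {2: "ucs-2", 4: "ucs-4", 5: "utf-16", 8: "utf-8"}


def encoding_name(encoding_level_oid: Sequence[int]) -> str:
    """Return the encoding named by the last arc of an encodingLevel OID."""
    if not encoding_level_oid:
        raise ValueError("empty object identifier")
    form = encoding_level_oid[-1]
    if form not in ENCODING_FORMS:
        raise ValueError(f"unknown encoding form: {form}")
    return ENCODING_FORMS[form]


# OID as written in the sample InitRequest in this deck.
print(encoding_name((1, 0, 10646, 1, 0, 8)))
```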

Language Negotiation
• Instances of InternationalString are either “message” or “name”
• Language negotiation applies to “message strings”
• Origin proposes one or more language codes
  • Codes from Z39.53
• Target may choose 1 of these proposed codes

proposedLanguages [2] IMPLICIT SEQUENCE OF LanguageCode OPTIONAL,
recordsInSelectedCharSets [3] IMPLICIT BOOLEAN OPTIONAL
    -- default 'false'
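
The three OriginProposal fields, together with the rule that proposedCharSets must be omitted when the origin proposes version 2, can be modelled as follows. This is a sketch under stated assumptions: the class name `OriginProposalSketch`, the `protocol_version` field, and the use of plain strings for character sets and languages are all hypothetical simplifications of the ASN.1.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OriginProposalSketch:
    protocol_version: int                          # version the origin proposes
    proposed_char_sets: Optional[List[str]] = None
    proposed_languages: Optional[List[str]] = None
    records_in_selected_char_sets: bool = False    # default 'false'

    def __post_init__(self) -> None:
        # Character set negotiation is only valid for version 3.
        if self.protocol_version < 3 and self.proposed_char_sets:
            raise ValueError(
                "proposedCharSets must be omitted when proposing version 2")


proposal = OriginProposalSketch(
    protocol_version=3,
    proposed_char_sets=["iso10646/utf-8"],
    proposed_languages=["eng"],
    records_in_selected_char_sets=True,
)
print(proposal)
```

A version 2 origin can still propose languages; only the character set list is ruled out.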

initRequest { -- SEQUENCE
    referenceId -- "9" --,
    protocolVersion 'e0'H,
    options 'eda2'H,
    preferredMessageSize 15000,
    exceptionalRecordSize 15000,
    implementationName -- "Amicus Professional Workstation" --,
    implementationVersion -- "3.0" --,
    otherInfo { -- SEQUENCE OF
        { -- SEQUENCE
            category { -- SEQUENCE
                categoryTypeId {1 2 840 10003 10 2},
                categoryValue 0
            },
            information externallyDefinedInfo { -- SEQUENCE
                direct-reference {1 2 840 10003 10 2},
                encoding single-ASN1-type proposal { -- SEQUENCE
                    proposedCharSets { -- SEQUENCE OF
                        iso10646 { -- SEQUENCE
                            collections {1 0 10646 1 2 1 3},
                            encodingLevel {1 0 10646 1 0 8}
                        },
                        iso2022 originProposal { -- SEQUENCE
                            proposedEnvironment eightBit NULL,
                            proposedSets { -- SEQUENCE OF
                                2, 1000, 1001, 1002, 1003, 1, 67
                            },
                            proposedInitialSets { -- SEQUENCE OF
                                { -- SEQUENCE
                                    g0 2, g1 1001, g2 1001, g3 1001, c0 1, c1 67
                                }
                            },
                            proposedLeftAndRight { -- SEQUENCE
                                gLeft 0,
                                gRight 1
                            }
                        }
                    },
                    proposedLanguages { -- SEQUENCE OF
                        -- "ENG"
                    },
                    recordsInSelectedCharSets TRUE
                }
            }
        }
    }
}