2. Adabas Implementation
3. Wide Character Definition Field Type W implemented
This field type is not necessarily a Unicode field type; it can be used for any character type that requires more than one byte per character
Fields of type <U> in Natural require field type <W> in Adabas
4. Data Conversion The default encoding of data in <W> type fields is UTF-16 (see the Unicode definition)
Fields with format <A> stored into <W> type fields are converted according to the rules of UTF-16
Fields read from <W> type fields into <A> type fields are converted into the currently known code page
Characters that cannot be translated are ignored
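The conversion rules above can be sketched in Python (an illustration only; the actual conversion happens inside Adabas/Natural, and the helper names and the `latin-1` code page here are stand-ins):

```python
# Sketch of the A <-> W conversion rules (illustration only; Adabas/Natural
# perform this internally via their own conversion layer).

def store_into_w(text: str) -> bytes:
    """A-format data stored into a W field: converted to UTF-16 (little-endian)."""
    return text.encode("utf-16-le")

def read_into_a(w_data: bytes, codepage: str) -> str:
    """W-field data read into an A field: converted to the current code page,
    ignoring characters that cannot be translated."""
    return w_data.decode("utf-16-le").encode(codepage, errors="ignore").decode(codepage)

w = store_into_w("Grüße αβ")        # stored as UTF-16
print(read_into_a(w, "latin-1"))    # Greek letters are not in Latin-1 -> dropped
```

Running this shows the Greek characters silently disappearing, which is exactly the "characters which cannot be translated will be ignored" rule.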
5. Parameter Definition Parameter settings
Open Record Buffer: OPRB=(dbid=xx,wcharset='utf16-le',upd=1-13)
Source Format (SUTF8=on)
7. Natural and Unicode In order to handle Unicode, code page support had to be implemented in Natural
A code page defines which internal value corresponds to which visible character
Example: X'C1' corresponds to the character A on a mainframe (EBCDIC).
All Natural objects retain the code page identification that was used at development time
Natural knows at startup time which code page the current end user is using
All data will be translated into that code page at execution time
Only when CPAGE=ON was used at compile time
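The code page idea can be demonstrated with Python's standard codecs; `cp037` is one common EBCDIC code page (the slides do not name a specific one, so that choice is an assumption):

```python
# The same internal byte value means different characters under different
# code pages: X'C1' is the letter A in EBCDIC (e.g. code page 037), but a
# different character in Latin-1.
raw = bytes([0xC1])
print(raw.decode("cp037"))    # EBCDIC (mainframe): 'A'
print(raw.decode("latin-1"))  # Latin-1 (open systems): 'Á'
```

This is why Natural must track which code page each object and each end user is using before it can translate data correctly.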
8. Natural and Unicode The new data type <U>
All fields defined as <U> will be considered to be Unicode fields
The new constant type <U>
Allows the definition of constants in the Unicode format.
The internal data format is UTF-16 in NFC
UTF-16 is a 2-byte encoding form for Unicode data
NFC is the normalization form of Unicode data
Example: â can be expressed as <a> followed by a combining circumflex, or as a single pre-composed character. The first case yields a 4-byte representation of the character, whereas the second requires only 2 bytes.
Via NFC all such expanded representations will be combined into a single character (if a pre-composed character exists)
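The â example can be reproduced with Python's `unicodedata` module, which implements the same Unicode normalization forms:

```python
import unicodedata

decomposed = "a\u0302"   # 'a' plus combining circumflex: 2 characters
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(decomposed.encode("utf-16-le")))  # 2 characters, 4 bytes
print(len(composed), len(composed.encode("utf-16-le")))      # 1 character, 2 bytes
print(composed == "\u00e2")                                  # the pre-composed 'â'
```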
9. Natural and Unicode Data Conversion and Comparison
Done via ICU, the Open Source library to handle Unicode functionality.
ICU is a separate module and is only activated when Unicode handling is required (MF only; on Open Systems ICU is always present)
Two versions available:
Small: ICU conversion only, 2.8 MB
Large: ICU conversion plus collation functionality, 4.5 MB
Can be controlled by the ICU Parameter
ICU will be used for the code page support as well (See Natural 4.2 Functionality)
Optimization: The Unicode conversion table for the current active code page is generated into the Thread (MF only)
10. Natural and Unicode CPDATA (MF only)
(CMSYNIN,codepage),(CMOBJIN,codepage),(CMPRINT,codepage)
Specifies in which code page data is expected to be read or written.
All data will be translated (using ICU) from the internal to the external code page and vice versa.
The parameter CMSYNIN is also available as a separate parameter on Open Systems
11. Natural and Unicode New Components in Natural
NATICU
NATICUCV International Components for Unicode
NATICUXL
NATCPTAB Code Page Character Translation Tables
NATSCTU Scanner Character Table for Unicode
NATWEB Natural Terminal Driver for Web I/O
12. Natural and Unicode New Macros
NTCFICU Unicode Support
NTXML Support of REQUEST DOCUMENT and PARSE statement
New Parameters of Macro NTPRM
CPCVERR Code Page Conversion Error
CPOBJIN Code Page of Batch Input File
CPPRINT Code Page of Batch Output File
CPSYNIN Code Page of Batch Input File for Commands
DBGERR Automatic Start of Debugger at Runtime Error
SLOCK Source Locking
SRETAIN Retain Source Format
THSEP Dynamic Thousands Separator
THSEPCH Thousands Separator Character
18. Natural and Unicode CPOPT=ON: use translation tables instead of ICU functions, if possible.
CPOPT=OFF: use ICU functions in any case.
19. Natural and Unicode *CODEPAGE
(A64), contains the name of the current active code page
These Names correspond to the IANA Standard (Internet Assigned Numbers Authority) (http://www.iana.org/)
*LOCALE
(A8), contains the region and language information
Based on ISO 3166 and ISO 639
20. Natural and Unicode LOCALE = en_US English language in the USA.
LOCALE = en_GB English language in the UK.
LOCALE = de_DE German language in Germany.
LOCALE = de_AT German language in Austria.
21. Natural and Unicode ISO 3166 Extract (Country Names and Identifiers)
DE DEU 276 Germany
GH GHA 288 Ghana, Republic of
GI GIB 292 Gibraltar
GR GRC 300 Greece, Hellenic Republic
GL GRL 304 Greenland
GD GRD 308 Grenada
GP GLP 312 Guadeloupe
GU GUM 316 Guam
GT GTM 320 Guatemala, Republic of
GN GIN 324 Guinea, Revolutionary People's Rep'c of
25. Natural and Unicode The size of the modules
NATICU ca. 5 MB
NATICUCV ca. 3.5 MB
NATICUXL ca. 12.5 MB
NATXML ca. 1 MB
Natural Buffers (Threadsize)
NATICU ca. 150 KB
NATXML ca. 150 KB
26. Natural and Unicode MOVE NORMALIZED op1 TO op2
Converts the Unicode data of op1 (type U) into the normalized form and stores it in the target variable op2 (U or A)
MOVE ENCODED operand1 [ENCODED [IN] [CODEPAGE] operand2 ] TO operand3 [ENCODED [IN] [CODEPAGE] operand4 ]
Converts data from one code page into another
DEFINE PRINTER (n) CODEPAGE code-page-name (MF only)
Specifies the code page to be used for printing data
EXAMINE
Contains specific features to support Surrogates
Based on Characters, not on bytes
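Why EXAMINE must count characters rather than bytes is visible in any UTF-16 encoding: a character outside the Basic Multilingual Plane occupies two 2-byte code units (a surrogate pair), so byte and character positions diverge. A small Python illustration:

```python
# A character outside the BMP needs a surrogate pair in UTF-16, so counting
# 2-byte units is not the same as counting characters.
s = "G\U0001D11E-Clef"            # U+1D11E MUSICAL SYMBOL G CLEF
utf16 = s.encode("utf-16-le")

print(len(s))                     # 7 characters
print(len(utf16) // 2)            # 8 UTF-16 code units: the clef needs two
```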
27. Natural and Unicode There is no direct terminal I/O available on Mainframe (The green terminals don’t support Unicode).
All U-Type fields will be converted internally into the current active code page and displayed as A-Type fields
Natural Web I/O
28. Natural and Unicode All modern Windows operating systems are based on Unicode internally. This functionality is used by Natural, so input of Unicode data and the display of such data is no problem at all.
How to enter Unicode data?
Microsoft offers so-called IMEs (Input Method Editors) for various languages. These can be downloaded and installed with all newer Office packages.
http://office.microsoft.com/en-us/assistance/HA010347361033.aspx
29. Natural and Unicode The following Statements support Unicode data
ASSIGN
MOVE
:=
CALL / CALL FILE / CALL LOOP / CALLNAT
COMPRESS
CREATE OBJECT
DECIDE / IF
DEFINE DATA / CLASS / FUNCTION / PROTOTYPE / WINDOW / WORKFILE
DISPLAY / WRITE / PRINT / INPUT / REINPUT (See MF I/O Discussion)
EXAMINE
EXPAND / RESIZE / REDUCE
FETCH / RUN
30. Natural and Unicode The following Statements support Unicode data (Continued)
FIND / READ / GET / HISTOGRAM / STORE
INSERT (SQL)
SUBSTRING
PARSE
PERFORM
PASSW
PROCESS
READ WORK
REDEFINE
REQUEST DOCUMENT
RESET
SEPARATE
SET CONTROL / KEY
SORT (Not first release)
31. Natural and Unicode The following Statements support Unicode data (Continued)
COMPUTATION (ADD / SUBTRACT / MULTIPLY / DIVIDE)
STACK
TERMINATE
32. Natural and Unicode Data Transfer is possible between various formats
Source Target Rule
U U none
U B Binary Move, Value might be truncated or padded with blanks
U A Converted into current code page
A U converted from current code page
B1-B4 U B1-B4 => A => U
Bn, n>4 U Binary Move, Value might be truncated or padded with blanks
D U D => A => U
F U F => A => U
I U I => A => U
L U L => A => U
N U N => A => U
P U P => A => U
T U T => A => U
33. Natural and Unicode Substring functionality is the same as for <A> type fields, only the unit of work is 2 bytes.
This means it is not possible to destroy a Unicode string by cutting the data in the middle of the two bytes used for one character
This holds only when no surrogates are used; surrogates may be cut.
Use EXAMINE to Identify correct character position
Or use <MOVE NORMALIZED> to convert combining characters into basic characters.
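The surrogate caveat can be demonstrated in Python (a sketch of the underlying encoding behavior, not of Natural's own substring handling):

```python
# Slicing UTF-16 data on 2-byte boundaries is safe for BMP characters, but a
# surrogate pair can still be cut in half.
data = "ab\U0001F600".encode("utf-16-le")   # 'a', 'b', and an emoji (surrogate pair)

print(data[:4].decode("utf-16-le"))          # "ab" -- a clean 2-byte cut
try:
    data[:6].decode("utf-16-le")             # cuts the surrogate pair in half
except UnicodeDecodeError:
    print("broken surrogate pair")
```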
34. Natural and Unicode U can be compared to A, B and U
IF op1 = op2 (formats A and U)
op2 is converted to the format of op1 before the comparison is done
ICU is used for the comparison
With format B, a binary compare is done; no conversion
35. Natural and Unicode REDEFINE can be used for U-Type fields (Not for Dynamic Variables)
This means that Unicode data can be processed as any other format types
Danger: this can destroy the data
But it allows you to see what the data looks like internally
Can improve performance, especially when used in IF clauses
36. Natural and Unicode The MASK function can be used for Unicode fields as well
Handling is done via ICU
Example: U+0030 (the Unicode value of the digit <0>) will return true with the mask <N> (numeric)
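A character-class check driven by Unicode properties, similar in spirit to MASK with <N>, can be sketched with Python's `unicodedata` (this mirrors what an ICU-based check does; it is not Natural's MASK implementation):

```python
import unicodedata

# U+0030 (DIGIT ZERO) is classified as a decimal digit; U+0041 ('A') is not.
print("\u0030".isdigit())                 # True
print("\u0041".isdigit())                 # False
print(unicodedata.category("\u0030"))     # 'Nd' -- decimal digit, number
```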
37. Natural and Unicode The following system Functions support Unicode data
COUNT
MAX
MIN
MAXVAL
MINVAL
NCOUNT
OLD
POS
RET
TRIM (Open Systems only)
VAL
38. Natural and Unicode The development environment SPoD is the only environment where the developer can make full use of all Unicode functionality (U constants).
The complete Source of the Natural Object will be converted into Unicode.
39. Natural and Unicode The library SYSEXV contains example programs for all new functions and features in new versions of Natural. This includes some examples as well which demonstrate the usage of the new Unicode support functionality.
* Program ...... V62UCONS
* Program ...... V62UDEF
* Program ...... V62UENC
* Program ...... V62UEXA
* Program ...... V62UINP
* Program ...... V62UINTR
* Program ...... V62UMOVE
* Program ...... V62UNORM
* Program ...... V62UPARM
* Program ...... V62UVAR
40. Natural Add. Products and Unicode SYSSEC, to be evaluated
SYSMAIN, to be evaluated
SYSDIC / Predict, to be evaluated
NAF, to be evaluated
41. Natural and Unicode http://www.unicode.org/
http://oss.software.ibm.com/icu/
Helpful web pages
http://i.tom.com/ (It’s all Japanese)
http://www.greeknewsonline.com/ (It’s all Greek to me)
http://www.russianpassportservice.com/ (Need some Russian?)
http://www.aljazeera.net/ (Want Arabic data?)
http://www.a7.org/ (Or better Hebrew?)
http://www.omniglot.com/writing/alphabets.htm (Or some other alphabets?)
http://www.decodeunicode.org/ (Unicode decoded)
43. Contents Getting external Unicode data into the application
Manipulating Unicode data
Storing/saving Unicode data
Retrieving Unicode data from the database
Output of Unicode data
44. Getting External Unicode Data into the Application There are various sources which can be used to enter Unicode data into a Natural application. Of course there is the database, but this will not be considered in this section. The other methods are:
Terminal Input
Read Work File
Request Document
CALL External Programs
Conversion within the Application
45. Terminal Input The first is native Unicode data entered into a corresponding Unicode field. This is the easy and straightforward method. This capability, however, is only available on Windows platforms. The newer Windows versions (XP/2000/NT) all have direct Unicode support implemented, and Natural makes full use of this functionality. All other platforms (UNIX, Mainframe) do not have a terminal feature available which can support Unicode data input.
This is handled by the second approach: entering data in the local code page and converting the data internally into Unicode. One of the new features of Natural is that it provides the information about which code page is currently active for data input, so it is quite easy to convert the existing data into Unicode. This is done automatically by Natural when the field format used in the corresponding INPUT statement is U (for Unicode).
The third possibility is the use of the new feature in Natural 4.2 called "Web I/O". Basically, this is a new terminal type defined for Natural which transfers the output and input data to a standard Web browser. All standard Web browsers support Unicode, and with this approach Natural can do terminal I/O with Unicode data even from the mainframe.
46. Read Workfile The READ WORK FILE feature of Natural can be used to enter any kind of data into the application. The challenge here is that work files do not carry any description of the data they contain. It is the responsibility of the designer to know the format of the work file. This information can then be used to read Unicode data directly into Unicode fields, or to read data represented in a predefined code page and convert this data into Unicode.
47. Request Document The REQUEST DOCUMENT statement can be used to read XML documents directly into the Natural application. XML data are always represented in Unicode, so all data in the document has to be handled as Unicode data.
48. Call external Program Data can be retrieved from calls to COBOL or other 3rd GL programs as well. Again, it is the responsibility of the designer to know about the interface of such programs and to know in what code page the data are returned. Once the information is known, it is no longer a problem to either convert the data or read the data directly into Unicode fields.
49. Conversion within the application It will always be possible to convert existing alphabetical data into Unicode within the application. This technology has to be used when the format (Code page) of the existing data is known.
50. Manipulating Unicode Data The processing of Unicode data is different from the processing of data available in the standard code page. Let’s define what processing actually is:
Transfer of Unicode data to other fields, even non-Unicode fields.
Comparing Unicode data with other fields, even non-Unicode fields.
Extracting data from Unicode fields based on some internal structure.
Unicode and the REDEFINE Statement.
51. Transfer of Unicode Data to Other Fields, Even Non-Unicode Fields Unicode data can easily be transferred into non-Unicode (i.e. A-type) fields. Natural will automatically convert the Unicode data into the currently active code page. There are two possible outcomes, depending on the configuration of Natural. First, Natural will issue an error message if the Unicode data cannot be converted completely into the code page of the target field (the ON ERROR logic handles this situation).
Second, all characters which are not available in the current code page will be replaced by a special character, indicating that there has been a character, but there is no equivalent character available in the current code page. The result cannot be reversed, so a transfer back to a Unicode field will not be able to restore the missing characters.
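The second outcome, substitution with an irreversible loss of information, can be sketched in Python (the code page and substitution character are illustrative; Natural's actual replacement character may differ):

```python
# Characters missing from the target code page become a substitution
# character; converting back to Unicode cannot restore them.
original = "Zürich Αθήνα"
in_codepage = original.encode("latin-1", errors="replace")   # Greek -> b'?'

print(in_codepage.decode("latin-1"))               # "Zürich ?????"
print(in_codepage.decode("latin-1") == original)   # False: information lost
```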
52. Comparing Unicode Data With Other Fields, Even Non-Unicode Fields A challenge with Unicode is that many characters can be represented internally in different forms. So the character <ä> can be represented as a single character, or as a sequence of the base character <a> followed by the two dots above. Externally the data looks identical, but the internal representation is different. An IF comparison can therefore return the wrong result even though two fields contain the same information. Natural offers a facility to normalize Unicode data. Normalizing Unicode data converts all combining sequences into single pre-composed characters, where available. A comparison will then always return the correct answer.
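The <ä> comparison problem is easy to reproduce with Python's `unicodedata`, which implements the same normalization that Natural delegates to ICU:

```python
import unicodedata

composed = "\u00e4"      # 'ä' as one pre-composed character
decomposed = "a\u0308"   # 'a' plus combining diaeresis -- looks identical

print(composed == decomposed)                    # False: raw comparison fails
print(unicodedata.normalize("NFC", composed)
      == unicodedata.normalize("NFC", decomposed))  # True after normalization
```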
53. Extracting Data from Unicode Fields Based on Some Internal Structure Many input fields have a predefined structure. A typical example is the combination <Last-name, First-name>. It is very convenient to type names in this form (or, even better, to allow both possibilities: <First-name Last-name>). The application now has the task of separating the two units, recognizing whether format 1 or format 2 has been used, and saving the data in the corresponding fields. This has been programmed for single-byte languages many times already, and will be required for Unicode fields as well.
The statements SCAN and EXAMINE have been adapted to work on a two-byte basis for Unicode fields. So an EXAMINE of a field for U',' will scan the Unicode field for the Unicode character ",". The resulting position information is based on two-byte characters as well. A subsequent MOVE SUBSTRING instruction can use this position and the length of the Unicode characters directly to transfer the data correctly into the target Unicode field.
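The name-splitting logic described above can be sketched in Python, working on character positions the way EXAMINE and MOVE SUBSTRING do for U fields (the function name and both formats are taken from the example; the implementation is a sketch, not Natural code):

```python
# Split "Last-name, First-name" or "First-name Last-name" into its two units,
# locating the separator by character position rather than byte offset.
def split_name(value):
    pos = value.find(",")                    # character position, like EXAMINE
    if pos >= 0:                             # format 1: "Last, First"
        return value[pos + 1:].strip(), value[:pos].strip()
    first, _, last = value.partition(" ")    # format 2: "First Last"
    return first, last

print(split_name("Müller, Jürgen"))   # ('Jürgen', 'Müller')
print(split_name("Jürgen Müller"))    # ('Jürgen', 'Müller')
```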
54. Unicode and the REDEFINE Statement There has been great discussion amongst the Natural development team if the REDEFINE statement should be allowed for Unicode fields. A developer can make a big mess by redefining Unicode data incorrectly, thus destroying the Unicode characters.
The result of the Natural development team's discussion is the conclusion that all developers programming in Natural are highly intelligent and responsible persons who know exactly what they are doing with their data. So the REDEFINE statement is available for Unicode fields. Any Unicode field can be redefined as a hexadecimal field or as a single-byte A-type field. This allows the developer to process the Unicode data in any way they can think of. The simplest use is that the developer can see how his/her data is encoded internally.
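What such a redefinition exposes can be previewed in Python: dumping the UTF-16 bytes of a value is roughly what a REDEFINE of a U field as a hexadecimal field shows (little-endian is assumed here; the byte order on a given platform is an internal detail):

```python
# The internal UTF-16 bytes of a U-field value, as a hexadecimal dump.
value = "Ab"
print(value.encode("utf-16-le").hex())   # '41006200': 'A' = 0041, 'b' = 0062
```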
55. Storing/Saving Unicode Data Unicode data are stored either in external work files or in databases. The work file routines accept all Unicode data without any restriction, so this is a straightforward approach to saving Unicode data for further processing either by Natural or by any other application.
To store Unicode in a database, the data definition module has to be defined with “W” type fields in Adabas, or as Unicode field in DB2. No further preparation needs to be done. Natural automatically transfers the Unicode data correctly to the database in UTF16 format. In Adabas the data will be converted to the UTF8 format for compression reasons. So in a typical Western Latin-oriented application the amount of data is only slightly larger than the original data stored in a local code page.
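The size argument for storing W fields as UTF-8 can be checked directly (the sample text is invented for illustration):

```python
# For Western/Latin text, UTF-8 stays close to the single-byte size,
# while UTF-16 doubles it -- hence Adabas converts to UTF-8 for compression.
text = "Invoice 4711: customer Müller"
print(len(text))                       # characters
print(len(text.encode("utf-8")))       # UTF-8: only 'ü' needs 2 bytes
print(len(text.encode("utf-16-le")))   # UTF-16: 2 bytes per character
```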
56. Retrieving Unicode Data from the Database All data read into Unicode fields will be converted automatically into the UTF16 format. No further processing is done; in particular, no normalization is done. Natural assumes that the data have been normalized before the data was stored in the database.
It is possible to use Unicode data in search criteria, i.e. Unicode fields can be descriptor fields. This allows the storage and retrieval of name values written in any kind of character set, including Chinese, Russian, Thai and all other languages.
One of the features of ADABAS is the ability to convert internal data into external representation controlled by the specification in the format buffer. This feature makes it possible to access Unicode data in Adabas from non-Unicode applications as well. These applications will get the data automatically converted into the code page of the active application. This allows coexistence of legacy applications with new Unicode enabled applications.
57. Output of Unicode Data Once the Unicode data has been processed the end user typically wants to see the data either on their screen or on a piece of paper. A prerequisite for this functionality is the ability of the output device to handle Unicode data. When the data is sent via the new IO method “Web IO”, the data will appear correctly on the end user’s screen.
Printing of Unicode data requires a printer which can handle Unicode output. As long as the printing is controlled by fonts (i.e. a Unicode font), the data will be printed correctly. This applies to all modern terminal printers. For mass output printers, the functionality needs to be checked. Older versions of mainframe printers may not support Unicode data.
In a Windows environment the Printer type has to be <GUI>, not <TTY>
58. Unicode Constants There is the U’constant’ feature available to define Unicode constants.
When you want to specify a constant value containing characters not in the current code page, the Natural compiler will not accept that value unless the U format is specified in the statement.
MOVE ‘My name is ????????’ will not be accepted
MOVE U ‘My name is ????????’ will be accepted
UH constants are Unicode hexadecimal constants
They can be used to move hexadecimal values into a Unicode field
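The effect of a UH constant can be sketched in Python: the hexadecimal digits are the UTF-16 code units that end up in the Unicode field (big-endian is used here for readability; the stored byte order on a given platform is an internal detail):

```python
# UH'00480069' as a sketch: hex digits interpreted as UTF-16 code units.
value = bytes.fromhex("00480069").decode("utf-16-be")
print(value)   # 'Hi' -- U+0048 'H', U+0069 'i'
```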