Dyalog’08. Migrating to Unicode. Morten Kromberg Workshop at Dyalog’08 - Elsinore. Agenda. What is Unicode? V.12 Design Goals Key Unicode Features Language Differences ⎕DR , ⍋ of char data Space & Performance ”Interop”: Classic vs Unicode WSs & Component Files TCP Sockets & Conga
Migrating to Unicode Morten Kromberg Workshop at Dyalog’08 - Elsinore
Agenda
• What is Unicode?
• V.12 Design Goals
• Key Unicode Features
• Language Differences
• ⎕DR, ⍋ of char data
• Space & Performance
• ”Interop”: Classic vs Unicode
• WSs & Component Files
• TCP Sockets & Conga
• External Vars, Mapped Files
• Own DLLs and APs
• Native Files
• Unicode Text Files (UTF-8)
• External Interfaces
• COM/OLE, Microsoft.NET
• ODBC / SQAPL
• ⎕NA: A&W win32 calls
• Source Code Management
• SALT, SubVersion, Diff Tools
• Planning Migrations

What is Unicode?
Wikipedia: An industry standard allowing computers to consistently represent and manipulate text expressed in any of the world's writing systems.
• It assigns a number, or code point, to each of approximately 100,000 characters, including the APL character set.
• The first version of the standard appeared in 1991; support is now becoming “common” on all platforms

Why do we want Unicode?
• Obviously: It allows us to write applications which use text from all the world’s written languages…
• Less obviously, but perhaps more important in the short term:
  • APL no longer needs its own character set (“Atomic Vector”)
  • Characters no longer need to be translated on the way in and out of APL
  • APL Source Code can be stored in “ordinary” text files and be handled by “standard” management tools

What is Unicode in practice?
Most often, when someone tells you the data ”is Unicode”, they mean ”UTF-8 encoded”.

Use Google...
Wikipedia too ...
Encodings
”Unicode” assigns unique numbers to characters. Encodings are ways to represent these numbers in a file.
• UCS (Universal Character Set) encodings have a fixed width; UTF (Unicode Transformation Format) encodings such as UTF-8 and UTF-16 are variable width.

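The difference between a code point and its encodings can be sketched outside APL. The following is a minimal Python illustration, not part of the original deck; the sample string and the helper name `bytes_per_char` are chosen here for illustration only. It shows that each character has one number, while the number of bytes used to store it depends on the encoding.

```python
# Illustration only: Python stands in for APL here.
# Sample text: 'A' (ASCII), U+234B (APL grade up), U+20001 (a CJK Extension B character).
text = "A\u234b\U00020001"

# Unicode itself just assigns a number (code point) to each character.
code_points = [ord(ch) for ch in text]


def bytes_per_char(encoding):
    """Return how many bytes each character of `text` occupies in `encoding`."""
    return [len(ch.encode(encoding)) for ch in text]


utf8_widths = bytes_per_char("utf-8")       # variable width: 1 to 4 bytes
utf16_widths = bytes_per_char("utf-16-le")  # variable width: 2 or 4 bytes
utf32_widths = bytes_per_char("utf-32-le")  # always 4 bytes

print(code_points)
print(utf8_widths, utf16_widths, utf32_widths)
```

The explicit `-le` encoding names are used so that Python does not prepend a byte order mark, which would distort the per-character counts.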
Version 12.0 Design Goals
• To allow users to develop Unicode applications (containing all the world's symbols)
• To make the Dyalog IDE a Unicode application
• No more ”translate tables”!
• Avoid having to explain ⎕AV to future generations
• Only one ”kind” of characters
• Design should encourage migration
• Controlled migration with ”interop” between old & new apps
• No ”Big Bang” data conversion events
• Classic & Unicode editions allow ”parallel runs”

Unicode vs Classic
• Unicode Edition:
  • Character data is defined as Unicode code points
  • No translation of data as it moves in & out of APL
• Classic Edition:
  • Character data is defined as indices into ⎕AV
  • Translate tables used for keyboard, display and file I/O
• Classic will be available so long as a single major customer has not been able to migrate
• The price may increase at some point

Key Unicode Features (1)
• New Character Data Types 80, 160, 320: 1-, 2-, 4-byte representations of Code Points.
      ⎕DR 'Hello'
80
      ⎕DR '{⍺+⍵}'
160
      ⎕DR '𠀁𠀂𠀃'
320
• NB: One character = one array element!

Key Unicode Features (2)
• Monadic ⎕UCS converts to and from code points (self inverse):
      ⎕UCS 'Hello'
72 101 108 108 111
      ⎕UCS '{⍺+⍵}'
123 9082 43 9077 125
      ⎕UCS (2*17)+⍳3
𠀁𠀂𠀃

Key Unicode Features (3)
• Dyadic ⎕UCS encodes and decodes data as UTF-8, UTF-16 or UTF-32:
      'UTF-8' ⎕UCS 'ABCÆØÅ'
65 66 67 195 134 195 152 195 133
      'UTF-8' ⎕UCS 240 160 128 129, 240 160 128 130, 240 160 128 131
𠀁𠀂𠀃
      'UTF-16' ⎕UCS '𠀁𠀂𠀃'
55360 56321 55360 56322 55360 56323

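The numbers on this slide can be cross-checked outside APL. The sketch below uses Python's built-in codecs as a stand-in for dyadic ⎕UCS; the helper names `utf8_units`, `utf16_units` and `surrogate_pair` are invented for this illustration. It also spells out the standard UTF-16 surrogate arithmetic that produces the pair 55360 56321 for code point 131073 (U+20001).

```python
def utf8_units(text):
    """UTF-8 encode `text` and return the bytes as a list of integers."""
    return list(text.encode("utf-8"))


def utf16_units(text):
    """UTF-16 encode `text` and return the 16-bit code units as integers."""
    raw = text.encode("utf-16-le")  # little-endian, no byte order mark
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


print(utf8_units("ABC\u00c6\u00d8\u00c5"))                 # ABCÆØÅ
print(utf16_units("\U00020001\U00020002\U00020003"))
print(surrogate_pair(0x20001))
```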
Demo 1 ... (key features)
Language Differences
• If you are only using APL workspaces and component files, most code from earlier versions will just load & run
• Potential problems are:
  • Monadic ⍋ (only real language difference)
  • ⎕DR to test for character data
  • Dyadic use of ⎕DR to ”cast” data
  • Space usage (char arrays can be larger)

Monadic ⍋
• Due to differences in the internal representation, grade up (⍋) without a collation sequence may return different results.
• Give ⍋ a left argument of ⎕AV to maintain the current behaviour
• In many cases where ⍋ is used monadically, the order does not matter

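Why the ordering can change is easiest to see with a toy example. The Python sketch below is an illustration only: `LEGACY_ORDER` is a made-up six-character table standing in for a legacy collation such as ⎕AV, and `grade_up` is a hypothetical helper that returns sort indices (0-based, unlike APL's default). Grading by table position and grading by code point give different permutations for the same data.

```python
# Hypothetical stand-in for a legacy character table: position defines sort order.
# Assumption for this toy example: lower case precedes upper case.
LEGACY_ORDER = "abcABC"


def grade_up(chars, collation=None):
    """Return the 0-based indices that would sort `chars` in ascending order."""
    if collation is None:
        # No collation sequence: order by Unicode code point.
        key = lambda i: ord(chars[i])
    else:
        # Collation sequence supplied: order by position in that sequence.
        key = lambda i: collation.index(chars[i])
    return sorted(range(len(chars)), key=key)


data = "bBaA"
by_code_point = grade_up(data)             # upper case sorts first
by_table = grade_up(data, LEGACY_ORDER)    # lower case sorts first

print(by_code_point, by_table)
```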
Testing for Character Data
• This no longer works as expected:
      82=⎕DR X
• Dyalog recommends:
      (10|⎕DR ⍵)∊0 2
• The latter is correct in all versions

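The recommended test works because every character type code ends in 0 (Unicode types 80, 160, 320) or 2 (Classic type 82). A small Python sketch of the same arithmetic follows; `is_char_type` is a name invented here, and the numeric type codes 11, 323, 326 and 645 used for comparison are taken from general Dyalog documentation rather than from these slides.

```python
def is_char_type(dr):
    """Mirror of (10|dr) in 0 2: true when the last decimal digit is 0 or 2."""
    return dr % 10 in (0, 2)


# Character types named in the slides: Unicode 80/160/320 and Classic 82.
char_types = (80, 160, 320, 82)

# Non-character type codes (assumption: standard Dyalog codes, not from this deck).
other_types = (11, 83, 163, 323, 326, 645)

print([is_char_type(t) for t in char_types])
print([is_char_type(t) for t in other_types])
```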
Dyadic ⎕DR for ”Casting”
• Classic (and previous versions):
      83 ⎕DR '⍋'      ⍝ ⎕AV[⎕IO+198]
¯109                  ⍝ Via APL+Win tables
• Unicode:
      83 ⎕DR '⍋'      ⍝ ⎕UCS 9035
75 35                 ⍝ 9035 = 256⊥⌽75 35
• The internal representation is different, and Unicode does NO TRANSLATION
• Code which (e.g.) reads characters from native files and then ”casts” to number using ⎕DR needs work

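The Unicode result 75 35 is simply the two bytes of code point 9035 read as signed 8-bit integers, low byte first. The following Python sketch reproduces that reinterpretation with the standard `struct` module; it assumes a little-endian layout, which is what the slide's `256⊥⌽75 35` implies, and is an illustration rather than a description of the interpreter's internals.

```python
import struct

code_point = 9035  # U+234B, APL grade up

# Store the code point as one 2-byte little-endian element (like type 160).
raw = struct.pack("<H", code_point)

# Reinterpret the same two bytes as signed 1-byte integers (like type 83).
as_int8 = list(struct.unpack("<2b", raw))

# Reassemble: the high byte is the second element on a little-endian machine.
rebuilt = as_int8[1] * 256 + as_int8[0]

print(as_int8, rebuilt)
```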
More on ⎕DR ... (and ⎕UCS)
• Unicode Edition still recognises 82 as a left argument:
      82 ⎕DR ¯109
⍋
• This returns the same character as in Classic. But:
      ⎕DR 82 ⎕DR ¯109
160                   ⍝ Type 82 cannot exist in Unicode
• Conversely, ⎕UCS exists in Classic:
      ⎕UCS 9035
⍋
      ⎕UCS 180        ⍝ But must return elements of ⎕AV
TRANSLATION ERROR     ⍝ Cannot convert to type 82

Space and Time
• Character data will require 2 bytes per element in the Unicode Edition if it contains APL symbols. No existing APL arrays can need 4 bytes per element.
• Primitives which manipulate or search this data may run more slowly (more data to sift through).
• Comments and character constants in code, and the script form of namespaces and classes, are also affected

Time and Space
• When copying functions between Classic and Unicode, the format needs to be converted, which can be expensive.
• The same applies when reading a ⎕OR “across the line”.
• It is not recommended to dynamically import functions across the Classic/Unicode boundary in production applications.
• Some VERY LARGE functions which could fix in v11.0 may not fix in the Unicode Edition: Lists of names and constants in a function share space with comments.
• A proposal to relax all limits on functions may be implemented in version 12.1

Unicode vs Classic
• Use the Unicode Edition if:
  • You want to develop new applications
  • You need to manage characters not in ⎕AV now.
• Use the Classic Edition if:
  • You need other v12+ enhancements, but are not ready to convert to Unicode yet
  • Classic is upwards compatible with v11.0 (as usual)
• UE and CE are maintained from a single source, and are ”identical” except for character arrays.
• Start planning your migration now! (please!)

So you want to migrate soon...
• If you ”only use APL” (workspaces, component files, sockets), applications SHOULD just load & run
• If you
  • Fell for the temptation to use any external tools or storage media as part of your application
  • Wrote your own APs or DLLs
  • Or want to start using data not in ⎕AV
  ... you may have a little work to do. Let’s take a look!

”Interop”
• Unicode and Classic editions are designed to inter-operate seamlessly, also with v11 & v10.1
• 12.0 Classic can read and translate Unicode character data found in files, workspaces and on TCP sockets
• Unicode editions will translate data to type 82 when using TCP Sockets and Component files flagged as non-Unicode (for interop with v11 & v10.1)
  • If Unicode data contains characters not in ⎕AV => TRANSLATION ERROR
• Unicode editions still recognise 82 as a valid argument to ⎕DR and native file functions, and are able to map data in old native files to ”the same character”.

”Interop”
• The intention is that users should be able to perform controlled experiments when migrating to Unicode
• No ”Big Bang” data conversion events; old files and workspaces can still be read
• We hope that users will ”reciprocate” by moving as quickly as possible; it is as easy as we could make it!

Workspaces
• Classic and Unicode editions can load each other's workspaces, but:
  • Classic cannot load (or COPY from) a workspace containing characters not in ⎕AV (TRANSLATION ERROR)
• The contents of ⎕AV are defined by ⎕AVU, a list of 256 Unicode Code Points:
      ⎕AV[97+⍳26]               ⍝ By default in v12.0, "Dyalog Alt"
ÁÂÃÇÈÊËÌÍÎÏÐÒÓÔÕÙÚÛÝþãìðòõ
      ⎕AVU[97+⍳26]←9397+⍳26     ⍝ Underscored alphabet (sort of)
      ⎕AV[97+⍳26]               ⍝ Now we have "Dyalog Std" mapping
ⒶⒷⒸⒹⒺⒻⒼⒽⒾⒿⓀⓁⓂⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏ
• When )COPYing from a pre-v12 workspace, ⎕AVU in the target namespace decides how incoming character data is translated. So code written using Alt & Std can be merged and maintain the original looks.

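The role of a 256-entry translate table can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Dyalog's implementation: the table `avu` below starts as a plain identity mapping (a stand-in, since the real default ⎕AVU contents are not reproduced here), and `TranslationError`, `classic_to_unicode` and `unicode_to_classic` are names invented for the example. Only the remapping of 26 slots to the circled letters follows the slide.

```python
class TranslationError(Exception):
    """Raised when a character has no slot in the 256-entry table."""


# Hypothetical table: identity mapping for slots 0..255 as a placeholder.
avu = list(range(256))

# As on the slide (shifted to 0-based indexing): map 26 slots to U+24B6..U+24CF.
for i in range(26):
    avu[97 + i] = 9398 + i


def classic_to_unicode(indices):
    """Translate table indices (Classic character data) to a Unicode string."""
    return "".join(chr(avu[i]) for i in indices)


def unicode_to_classic(text):
    """Translate a Unicode string to table indices, or fail if not representable."""
    reverse = {cp: i for i, cp in enumerate(avu)}
    try:
        return [reverse[ord(ch)] for ch in text]
    except KeyError as exc:
        raise TranslationError(f"code point {exc.args[0]} is not in the table") from None


print(classic_to_unicode([97, 98, 99]))
print(unicode_to_classic("\u24b6"))
```

Because the table is per-namespace data rather than something fixed in the interpreter, the same incoming indices can be made to display with either the Alt or the Std look simply by changing the table before translating.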
More on ⎕AVU
• The Dyalog Std font is still used in some older (”anglo”) applications
• Dyalog Alt is used across Western Europe
• Some countries use fonts created by local distributors:
      )copy avu Russian.⎕AVU
C:\...avu saved Fri Jun 27 10:00:52 2008
      3 50⍴65↓⎕AV
ABCDEFGHIJKLMNOPQRSTUVWXYZАБВГД⍙ЕЖЗИЙКЛМНОПРСТУФХЦ
ЧШЩЪЫЬЭЮ{€}⊣⌷¨Яабв⍨гдежзийклмнопрстуфхцч[/⌿\⍀<≤=≥>
≠∨∧-+÷×?∊⍴~↑↓⍳○*⌈⌊∇∘(⊂⊃∩∪⊥⊤|;,⍱⍲⍒⍋⍉⌽⊖⍟⌹!⍕⍎⍫⍪≡≢шщъы
• The translate table is also used when reading component files and APL data arriving on TCP Sockets
• It has namespace scope, so classes or namespaces can be defined to read data from Classic systems using different languages if necessary

Underscores Must Die!
• There is no Underscored alphabet in Unicode. Underscoring is a form of ”emphasis” (like bold or italic). The underscored alphabet is the ONLY incompatibility with the rest of the world and should be phased OUT.
• The APL385 Unicode font incorrectly displays underscores for code points 9398-9423 (decimal). The positions should really display as Ⓐ..Ⓩ.
• (Don’t ask why circled alphabetics ARE in Unicode, while underscores are not. Dyalog decided to map underscores to this range.)

⎕AV: Just another variable
• In the Unicode Edition, the Atomic Vector is only used to define how to inter-operate with Classic systems. Only characters in ⎕AV can be shared. Assuming the default (Alt) setting:
      'ÁⒶ'∊⎕AV
1 0
• System variable ⎕Ⓐ (name now displays as ⎕Á) should no longer be used. It continues to exist and returns ⎕AV[97+⍳26]

Chars Allowed in Names
• The list has not been extended; the following are allowed:
      0123456789 (but not as the 1st character in a name)
      ABCDEFGHIJKLMNOPQRSTUVWXYZ_
      abcdefghijklmnopqrstuvwxyz
      ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝß
      àáâãäåæçèéêëìíîïðñòóôõöøùúûüþ
      ∆⍙ⒶⒷⒸⒹⒺⒻⒼⒽⒾⒿⓀⓁⓂⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏ
• In a standard font, underscores display as Ⓐ to Ⓩ
• In Unicode, all of the above can now be used simultaneously (previously, the available set depended on whether the Alt or Std font was selected). Russian letters are NOT allowed.

Component File Interop
• Like workspaces, Component Files can be shared between Classic and Unicode editions.
• The same restriction applies: Classic cannot read arrays containing characters not in ⎕AV.
• Files can be marked as non-Unicode, in which case Unicode cannot write characters not in ⎕AV.
• All ”small” (32-bit) component files are non-Unicode
• For ordinary APL arrays (no ⎕ORs), the Unicode edition can share files with old versions of APL too.

File Properties
• New system function ⎕FPROPS allows you to control whether a file may contain Unicode data:
      'c:\temp\smallfile' ⎕FCREATE 32 32
      'EJSU' ⎕FPROPS 32       ⍝ Endian, Journaled, Size, Unicode
0 0 32 0
      'c:\temp\bigfile' ⎕FCREATE 64 64
      'EJSU' ⎕FPROPS 64
0 0 64 1
• Size defaults to 64 from v12.0 (new startup flag -F32/-F64)
• Small address size (32-bit) files are limited to 4GB in size and can NOT have the Unicode bit set
• Setting Journaling on prevents sharing with v11.0 or earlier

Translation Error on Write
• Unicode edition can write to non-Unicode component files:
      '{⍺+⍵}' ⎕FAPPEND 32     ⍝ ∧/'{⍺+⍵}'∊⎕AV, so fine!
      'U' 0 ⎕FPROPS 64        ⍝ Switch Unicode OFF
      '𠀁𠀂𠀃' ⎕FAPPEND 64       ⍝ Chars not in ⎕AV
TRANSLATION ERROR
      'U' 1 ⎕FPROPS 32        ⍝ Not allowed for small files
TRANSLATION ERROR
• If non-Unicode files do not contain namespaces or ⎕ORs, v10.1 and v11.0 can use them
• Note: Large files (64-bit) cannot be used with versions 10.0 or earlier.

TCP Socket / Conga Interop
• TCPSocket objects have an Encoding property:
  • The default is None for Char, and Classic for APL
  • APL sockets are non-Unicode by default to avoid crashing down-version APL interpreters receiving Unicode data
• Conga always sends data in ”native” form; receive will fail with a TRANSLATION ERROR if data cannot be represented

External Variables
• External Variables are implemented as small span component files (32-bit files) and can thus NOT contain Unicode data:
      'c:\temp\xvar' ⎕XT 'x'
      x
Hello World
      x←'𠀁𠀂𠀃'
TRANSLATION ERROR
• External Variables should be seen as a ”deprecated” feature: You will still be able to use existing external variables, but should plan to convert to component files or mapped files at your convenience.

Mapped Files
• Like external variables, the use of APL mapped files (containing APL arrays with header information) should be seen as a deprecated feature.
• Convert to using other mechanisms at your earliest convenience.
• Support for RAW mapped files (where type information is provided when mapping) remains core functionality (and will probably get more important in a world of multicore machines):
      32↓102↑80 ¯1 ⎕MAP 'c:\Program Files\ComfortKeyboard\changes.txt'
Added new interface languages: Latvian, Brazilian Portuguese, Italian.
• Type 82 is NOT supported in the Unicode Edition: Mapped variables are ”in the workspace” and cannot be translated on access.
• To read a raw file written using data type 82, map it with data type 83 and extract the characters by indexing into ⎕AVU.

(Own) DLLs and APs
• The format for passing APL arrays to Libraries and Auxiliary Processors is unchanged, except that a Unicode Edition will pass character arrays of type 80, 160 or 320
• Dyalog-provided libraries have been upgraded. A number of old APs like PREFECT are no longer shipped, but v11 versions will continue to work fine with the Classic Edition.
• If you have written your own APs or DLLs which handle character data, these need to be updated to deal with new data types.
• You can return any of the Classic or Unicode character types; they will be translated (subject to the usual TRANSLATION ERROR limitations).

Native Files
• Unicode Edition also still supports type 82, so that old files containing APL characters can be used. They map to the ”same characters”, but with a different internal representation:
V11:
      'c:\temp\plus' ⎕NCREATE ¯1
      '{⍺+⍵}' ⎕NAPPEND ¯1
V12:
      ⎕DR ⎕←⎕NREAD ¯1 82 5 0
{⍺+⍵}
160

Native Files & Unicode
• Unicode Edition supports new data types 80, 160, 320, reading or writing 1, 2 or 4 bytes at a time (file is UCS-1, -2 or -4 encoded).
• Code Change Possibly Required: The DEFAULT TYPE when appending character arrays is now 80 (was 82):
      'plus:' ⎕NAPPEND ¯2     ⍝ Type 80 (all ANSI)
      '{⍺+⍵}' ⎕NAPPEND ¯1     ⍝ Type 160 (APL chars)
DOMAIN ERROR                  ⍝ Data cannot be narrowed
• Early Beta versions of 12.0 used the type of the left argument, but this led to variable numbers of bytes being used when writing, depending on the content of an array (160 if a non-ANSI character was included).
• If you need to write text containing APL to a native file, use type 160 or, perhaps better, use UTF-8!

Native Files & UTF-8
• The most common way to store Unicode data in text files is to encode it using UTF-8: This is a format understood by ”most” web applications and other Unicode-enabled applications.
      text←'plus←{⍺+⍵}'
      'UTF-8' ⎕UCS 'plus'
112 108 117 115
      'c:\temp\plus.txt' ⎕NCREATE ¯1
      (⎕UCS 'UTF-8' ⎕UCS 'plus') ⎕NAPPEND ¯1
      ⎕CMD 'notepad c:\temp\plus.txt' 'normal'
• Windows Notepad is able to detect that the file is UTF-8 encoded and displays the text correctly.
• The monadic ⎕UCS on the left converts integers in the range 0-255 into one-byte Unicode characters before appending. Without it, integers above 127 would be written as type 163 (2 bytes per element).

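The same encode-then-write-bytes pattern can be sketched in Python for comparison. This is an illustration only: the helper `write_utf8` and the temporary file name are invented here, and the file goes to the system temp directory rather than the path used on the slide. The point is that the text is first turned into UTF-8 bytes and those bytes are written unchanged, one byte per element.

```python
import os
import tempfile

# The APL text from the slide, written with escapes: plus←{⍺+⍵}
text = "plus\u2190{\u237a+\u2375}"


def write_utf8(path, content):
    """Encode `content` as UTF-8 and write the raw bytes to `path`."""
    with open(path, "wb") as f:
        f.write(content.encode("utf-8"))


# Hypothetical file name in the system temp directory.
path = os.path.join(tempfile.gettempdir(), "plus_utf8_demo.txt")
write_utf8(path, text)

# Read the bytes back to confirm what actually landed on disk.
with open(path, "rb") as f:
    raw = f.read()

print(list(raw[:4]), len(raw))
```

The ten characters occupy sixteen bytes: the ASCII characters take one byte each, while the three APL symbols take three bytes each.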
Native Files & UTF-8
• The most common way to store Unicode data in text files is to encode it using UTF-8: This is a format understood by ”most” web applications and other Unicode-enabled applications.
• UCS-2 (2 bytes per character) is supported by many Microsoft apps (like Visual Studio). UCS-2 was the standard until Windows 2000; it is now replaced by UTF-16, which is identical to UCS-2 for most data, but expands to 4 bytes when required.
• Applications need to know which encoding has been used. Two common methods of indicating this are ”Byte Order Marks” at the beginning of the file, and (for web pages) HTTP tags.

Byte Order Mark
By convention, the first few bytes of text files are sometimes (but not always) an encoding of U+FEFF, the ”Byte Order Mark”, also known as ”Zero width no-break space”. This convention allows applications to ”guess” the encoding used.
• The convention is more common under Windows than Unix/Linux. Sometimes writing the BOM makes things worse...

Reading Text Files
    ∇ Chars←ReadFile name;nid;signature;nums
      ⍝ Read ANSI or Unicode character file (Windows)
      nid←name ⎕NTIE 0
      signature←3↑⎕NREAD nid 83 3 0
      :If signature≡¯17 ¯69 ¯65            ⍝ UTF-8 (EF BB BF)
          Chars←⎕NREAD nid 80(¯3+⎕NSIZE nid)3
          Chars←'UTF-8' ⎕UCS ⎕UCS Chars
      :ElseIf (2↑signature)≡¯1 ¯2          ⍝ Little-endian UTF-16 (FF FE)
          Chars←⎕NREAD nid 160(¯1+⎕NSIZE nid)2
      :Else                                ⍝ ANSI
          Chars←⎕NREAD nid 80(⎕NSIZE nid)0
      :EndIf
      ⎕NUNTIE nid
    ∇

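For readers who want to compare with another language, the same signature-sniffing logic might look as follows in Python. This is a sketch, not a port of the function above: `decode_text`, `read_file` and `signed` are names chosen here, and treating ”ANSI” as Windows code page 1252 is an assumption. The `signed` helper shows where the negative numbers in the APL tests come from: they are the BOM bytes read as signed 8-bit integers.

```python
import codecs


def signed(raw):
    """View bytes as signed 8-bit integers, as a type 83 read would."""
    return [b - 256 if b > 127 else b for b in raw]


def decode_text(raw):
    """Decode file contents, using a leading byte order mark to pick the encoding."""
    if raw.startswith(codecs.BOM_UTF8):          # EF BB BF
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    if raw.startswith(codecs.BOM_UTF16_LE):      # FF FE
        return raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    # No recognised signature. Assumption: "ANSI" means Windows-1252 here.
    return raw.decode("cp1252")


def read_file(name):
    """Read a whole file as bytes and decode it with `decode_text`."""
    with open(name, "rb") as f:
        return decode_text(f.read())


print(signed(codecs.BOM_UTF8), signed(codecs.BOM_UTF16_LE))
```

Like the APL version, this only recognises UTF-8 and little-endian UTF-16 signatures; big-endian UTF-16 and UTF-32 files would fall through to the ANSI branch.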
Writing Text Files
Writing a UTF-8 Web Page
      html←'<html>',NL,' <head>',NL
      html,←' <meta http-equiv="content-type" content="text/html; charset=UTF-8" />'
      html,←' </head>',NL,'<body>',NL
      html,←' <font face="APL385 Unicode">'
      html,←'plus←{⍺+⍵}</font>',NL
      html,←'</body>',NL,'</html>',NL
      'c:\temp\plus.htm' ⎕NCREATE ¯1
      (⎕UCS 'UTF-8' ⎕UCS html) ⎕NAPPEND ¯1
      ⎕NUNTIE ¯1
      ⎕CMD 'iexplore c:\temp\plus.htm' ''

Web Page: Results
UTF-8 Files with .NET
UTF-8 Files with .NET
      apltxt←⎕SE.SALT.New 'C:\..\UTF8File' 'c:\temp\apl.txt'
      apltxt.Text
 Compute average in APL:  avg←{(+/⍵)÷⍴⍵}
      apltxt.Text,←⊂'⍝ Morten was here'
      System.Text.Encoding.⎕NL -2
ASCII BigEndianUnicode Default Unicode UTF32 UTF7 UTF8