Charset to UTF
Good Old Old Days • Is there any language other than American?? • EBCDIC • ASCII
Good Old Days • ASCII: 0-127 – Latin (English) letters, digits, punctuation • 128-255 – one 8-bit code page at a time: French, Italian, German etc., or Greek, or Hebrew, or Russian etc.
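To illustrate why the upper half of an 8-bit charset is ambiguous, here is a minimal sketch that decodes the same byte under four ISO-8859 code pages. The choice of byte (0xE0) and of these four charsets is an assumption made for the example; it relies on Perl's core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same single byte, interpreted under four 8-bit code pages.
my $byte = "\xE0";

my %code_point;
for my $charset (qw(iso-8859-1 iso-8859-5 iso-8859-7 iso-8859-8)) {
    my $char = decode($charset, $byte);   # bytes -> one Unicode character
    $code_point{$charset} = ord $char;
    printf "%-10s byte E0 -> U+%04X\n", $charset, ord $char;
}
```

The byte is Latin (Western European), Cyrillic, Greek, or Hebrew depending only on which code page the reader assumes.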
Multibyte • Japanese – SJIS, EUC • Chinese – Big5, GB • Korean
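As a rough illustration of the multibyte charsets listed above, the sketch below encodes one sample character per charset and prints the resulting bytes. The sample characters and the Encode charset names (`shiftjis`, `euc-jp`, `big5-eten`, `euc-cn`, `euc-kr`) are choices made for this example, not something the slides specify.

```perl
use strict;
use warnings;
use Encode qw(encode);

# [ Encode charset name, sample code point ]
my @samples = (
    [ 'shiftjis',  0x3042 ],   # HIRAGANA LETTER A
    [ 'euc-jp',    0x3042 ],
    [ 'big5-eten', 0x4E2D ],   # CJK ideograph "middle"
    [ 'euc-cn',    0x4E2D ],   # GB 2312 in its EUC form
    [ 'euc-kr',    0xD55C ],   # HANGUL SYLLABLE HAN
);

my %bytes_for;
for my $sample (@samples) {
    my ($charset, $code_point) = @$sample;
    my $bytes = encode($charset, chr $code_point);   # character -> bytes
    $bytes_for{$charset} = $bytes;
    printf "U+%04X in %-9s = %s (%d bytes)\n",
        $code_point, $charset, uc unpack('H*', $bytes), length $bytes;
}
```

Each character takes two bytes, and the same character gets different bytes in each charset.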
Babel’s Tower http://www.i18nguy.com/unicode/codepages.html#czyborra
Many Languages • Hebrew • Japanese • Arabic • All in the same doc/line/screen
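Unicode • All languages • Each char – 2 bytes in the original 16-bit design – room for 65,536 code points (later versions extend the range to U+10FFFF) • Problem: not a byte string – wide chars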
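One way to see the "wide char" problem is to look at the bytes of a fixed 2-byte encoding. This sketch assumes UTF-16BE as the 2-byte form (the slides do not name one) and a two-character sample string chosen for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "A" . chr(0x05D0);            # Latin A followed by Hebrew alef
my $wide = encode('UTF-16BE', $text);    # two bytes per character

printf "characters: %d, bytes: %d\n", length $text, length $wide;
print  "hex: ", unpack('H*', $wide), "\n";

# Count zero bytes: byte-oriented, NUL-terminated string code would stop at the first one.
my $nul_count = ($wide =~ tr/\x00//);
print "zero bytes inside the data: $nul_count\n";
```

Even plain ASCII text picks up a zero byte per character, which is why such data cannot be handled as an ordinary byte string.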
UTF8 • One to one with Unicode • 1-3 bytes for each 16-bit Unicode char (ASCII stays 1 byte), 4 bytes for chars beyond that range • Well-defined algorithm
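The "well defined algorithm" can be sketched in a few lines. The helper name `utf8_bytes` is hypothetical; it assumes a valid code point (no surrogates) and checks itself against Perl's Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode one code point as UTF-8 bytes by hand.
# Assumes a valid code point in the range U+0000..U+10FFFF.
sub utf8_bytes {
    my ($cp) = @_;
    return pack 'C',  $cp                                     if $cp < 0x80;
    return pack 'C2', 0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F)  if $cp < 0x800;
    return pack 'C3', 0xE0 | ($cp >> 12),
                      0x80 | (($cp >> 6) & 0x3F),
                      0x80 | ($cp & 0x3F)                     if $cp < 0x10000;
    return pack 'C4', 0xF0 | ($cp >> 18),
                      0x80 | (($cp >> 12) & 0x3F),
                      0x80 | (($cp >> 6) & 0x3F),
                      0x80 | ($cp & 0x3F);
}

for my $cp (0x41, 0x05D0, 0x3042, 0x1F600) {
    my $mine = utf8_bytes($cp);
    my $ref  = encode('UTF-8', chr $cp);   # library result for comparison
    printf "U+%04X -> %s (%s)\n", $cp, uc unpack('H*', $mine),
        $mine eq $ref ? 'matches Encode' : 'differs';
}
```

The leading byte announces the sequence length, and every continuation byte carries six payload bits.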
Hebrew to Unicode
Unicode  Byte  Name
05D0  60  HEBREW LETTER ALEF
05D1  61  HEBREW LETTER BET
05D2  62  HEBREW LETTER GIMEL
05D3  63  HEBREW LETTER DALET
05D4  64  HEBREW LETTER HE
05D5  65  HEBREW LETTER VAV
05D6  66  HEBREW LETTER ZAYIN
05D7  67  HEBREW LETTER HET
05D8  68  HEBREW LETTER TET
05D9  69  HEBREW LETTER YOD
05DA  6A  HEBREW LETTER FINAL KAF
05DB  6B  HEBREW LETTER KAF
05DC  6C  HEBREW LETTER LAMED
05DD  6D  HEBREW LETTER FINAL MEM
05DE  6E  HEBREW LETTER MEM
and likewise for each charset
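A mapping like this can be printed from Perl's own tables. Note one assumption: this sketch uses ISO-8859-8 (the charset named in the Perl slides), where alef is byte E0, that is, the slide's 60 with the high bit set. The loop range (the first five letters) is chosen only to keep the output short.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();

my %name_for;
for my $byte (0xE0 .. 0xE4) {
    my $char = decode('iso-8859-8', chr $byte);         # legacy byte -> Unicode char
    $name_for{$byte} = charnames::viacode(ord $char);   # official Unicode name
    printf "%02X -> U+%04X %s\n", $byte, ord $char, $name_for{$byte};
}
```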
Need for Conversion • Existing data • New data: editors work in specific charsets, not in UTF/Unicode
Brute Force • For each original char: convert to UTF
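A minimal sketch of the brute-force loop, assuming ISO-8859-8 input. The helper name `hebrew_to_utf8` is hypothetical, and the pass-through of every byte outside the Hebrew letter range is a simplification that is only correct for ASCII.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert ISO-8859-8 bytes to UTF-8 one byte at a time.
sub hebrew_to_utf8 {
    my ($bytes) = @_;
    my $out = '';
    for my $byte (unpack 'C*', $bytes) {
        my $cp = ($byte >= 0xE0 && $byte <= 0xFA)
               ? 0x05D0 + ($byte - 0xE0)   # Hebrew letters alef..tav
               : $byte;                    # assumption: all other bytes pass through
        $out .= encode('UTF-8', chr $cp);  # code point -> UTF-8 bytes
    }
    return $out;
}

my $legacy = "abc \xE0\xE1\xE2";           # "abc " + alef, bet, gimel
my $manual = hebrew_to_utf8($legacy);
my $viaLib = encode('UTF-8', decode('iso-8859-8', $legacy));
print $manual eq $viaLib ? "same result as Encode\n" : "results differ\n";
```

Writing such a loop per charset is tedious and easy to get wrong, which is the motivation for letting a library do the table lookups.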
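Perl way 1
use strict;
use warnings;
use Encode;   # module names are case-sensitive: Encode, not ENCODE

my ($if, $of) = @ARGV;
# The I/O layers decode ISO-8859-8 on input and encode UTF-8 on output
open my $in,  "<:encoding(iso-8859-8)", $if or die "Cannot read $if: $!";
open my $out, ">:encoding(utf8)",       $of or die "Cannot write $of: $!";
while (<$in>) { print $out $_; }
close $in;
close $out or die "Cannot close $of: $!";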
Perl way 2 perl -MEncode -e '($if, $of) = @ARGV; open my $in, "<:encoding(iso-8859-8)", $if; open my $out, ">:encoding(utf8)", $of; while (<$in>) { print $out $_; }' infile outfile