Charset to UTF

Good Old Old Days • Is there any language other than American?? • EBCDIC • ASCII

Good Old Days • ASCII: 0-127 – Latin • 128-255 – French, Italian, German etc., or Greek, or Hebrew, or Russian etc. (one code page at a time)

Multibyte • Japanese – SJIS, EUC • Chinese – Big5, GB • Korean

Babel’s Tower http://www.i18nguy.com/unicode/codepages.html#czyborra

Many Languages • Hebrew • Japanese • Arabic • All in the same doc/line/screen

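Unicode • All languages • Each char – 2 bytes (UCS-2) – up to 65,536 code points at first; Unicode code points now run up to U+10FFFF • Problem: a wide-char string contains zero bytes, so it cannot be handled as an ordinary byte string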
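The wide-char problem can be seen by dumping the bytes of a short ASCII string in a 2-byte encoding. This is a minimal sketch using Perl's core Encode module; the helper name hex_dump and the sample text are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Show a byte string as space-separated hex (hypothetical helper).
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $text  = "Hi";
my $utf16 = encode('UTF-16BE', $text);   # two bytes per character
my $utf8  = encode('UTF-8',    $text);   # ASCII stays one byte each

print "UTF-16BE: ", hex_dump($utf16), "\n";   # 00 48 00 69
print "UTF-8:    ", hex_dump($utf8),  "\n";   # 48 69
```

The zero bytes in the first dump are what break C-style string handling, which treats a zero byte as the end of the string.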

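UTF-8 • One to one with Unicode • 1-3 bytes for regular chars (1 for ASCII; 2 for Hebrew, Arabic, Greek, Cyrillic; 3 for most CJK), 4 bytes beyond U+FFFF • Well-defined algorithm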
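The algorithm is short enough to write out by hand. The sketch below is for illustration only (real code should rely on the Encode module); the function name utf8_bytes is an assumption, not something from the slides.

```perl
use strict;
use warnings;

# Hand-rolled UTF-8 encoder for one code point (hypothetical helper).
# Returns the encoded bytes as a list of integers.
sub utf8_bytes {
    my ($cp) = @_;
    die "code point out of range\n" if $cp < 0 || $cp > 0x10FFFF;
    die "surrogates cannot be encoded\n" if $cp >= 0xD800 && $cp <= 0xDFFF;

    # 1 byte: 0xxxxxxx
    return ($cp) if $cp < 0x80;

    # 2 bytes: 110xxxxx 10xxxxxx
    return (0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F)) if $cp < 0x800;

    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return (0xE0 | ($cp >> 12),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;

    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return (0xF0 | ($cp >> 18),
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F));
}

# U+05D0 HEBREW LETTER ALEF encodes as the two bytes D7 90.
print join(' ', map { sprintf '%02X', $_ } utf8_bytes(0x05D0)), "\n";
```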

Hebrew to Unicode
Unicode  Byte  Name
05D0     60    HEBREW LETTER ALEF
05D1     61    HEBREW LETTER BET
05D2     62    HEBREW LETTER GIMEL
05D3     63    HEBREW LETTER DALET
05D4     64    HEBREW LETTER HE
05D5     65    HEBREW LETTER VAV
05D6     66    HEBREW LETTER ZAYIN
05D7     67    HEBREW LETTER HET
05D8     68    HEBREW LETTER TET
05D9     69    HEBREW LETTER YOD
05DA     6A    HEBREW LETTER FINAL KAF
05DB     6B    HEBREW LETTER KAF
05DC     6C    HEBREW LETTER LAMED
05DD     6D    HEBREW LETTER FINAL MEM
05DE     6E    HEBREW LETTER MEM
and likewise for each charset

Need for Conversion • Existing data • New data: editors work in specific charsets, not in UTF/Unicode

Brute Force • For each original char: look it up in the charset's mapping table and convert it to UTF
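A brute-force converter might look like the sketch below. It assumes the byte values from the Hebrew to Unicode slide (60 to 6E mapping to 05D0 to 05DE) and covers only those rows; every other byte is passed through unchanged, which a real converter would replace with a complete table. The names %to_unicode and legacy_to_utf8 are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Legacy byte => Unicode code point, only for the rows shown on the slide.
my %to_unicode;
$to_unicode{0x60 + $_} = 0x05D0 + $_ for 0 .. 0x0E;

# Convert a legacy byte string to UTF-8, one character at a time.
sub legacy_to_utf8 {
    my ($bytes) = @_;
    my $chars = join '', map {
        my $byte = ord($_);
        # Assumption: unmapped bytes keep their value.
        chr(exists $to_unicode{$byte} ? $to_unicode{$byte} : $byte);
    } split //, $bytes;
    return encode('UTF-8', $chars);
}

# ALEF BET GIMEL, a space, and the digit 1.
my $utf8 = legacy_to_utf8("\x60\x61\x62 1");
print join(' ', map { sprintf '%02X', ord($_) } split //, $utf8), "\n";
# D7 90 D7 91 D7 92 20 31
```

Writing one such table per charset by hand is the tedious part, which is the motivation for letting Perl's encoding layers do the lookup.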

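Perl way 1
    use strict;
    use warnings;
    use Encode;    # module names are case-sensitive: "ENCODE" fails to load

    my ($if, $of) = @ARGV;

    # Decode ISO-8859-8 on read, encode UTF-8 on write
    open my $in,  "<:encoding(iso-8859-8)", $if or die "Cannot read $if: $!";
    open my $out, ">:encoding(UTF-8)",      $of or die "Cannot write $of: $!";

    while (<$in>) { print $out $_; }

    close $in;
    close $out or die "Cannot close $of: $!";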

Perl way 2 perl -MEncode -e '($if, $of) = @ARGV; open my $in, "<:encoding(iso-8859-8)", $if or die $!; open my $out, ">:encoding(UTF-8)", $of or die $!; print $out $_ while <$in>; close $out or die $!' infile outfile
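When the data is already in memory rather than in a file, the same conversion can be done with decode and encode from the Encode module instead of I/O layers. This is a minimal sketch; the wrapper name to_utf8 is an assumption, and the sample byte is ALEF in ISO-8859-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Convert a byte string from a legacy charset to UTF-8 in memory
# (hypothetical wrapper around Encode's decode and encode).
sub to_utf8 {
    my ($bytes, $charset) = @_;
    my $chars = decode($charset, $bytes);   # bytes -> Perl characters
    return encode('UTF-8', $chars);         # characters -> UTF-8 bytes
}

# 0xE0 is ALEF in ISO-8859-8; it becomes the two bytes D7 90 in UTF-8.
my $utf8 = to_utf8("\xE0", 'iso-8859-8');
print join(' ', map { sprintf '%02X', ord($_) } split //, $utf8), "\n";
```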
