UTF-8, Perl and You

  1. UTF-8, Perl and You • By Rafael Almeria

  2. Chapter 1: Introduction

  3. 1 - Introduction This talk does not deal with the motivation for using UTF-8.

  4. 1 - Introduction • This talk is about: • Implementation details. • Understanding UTF-8. • Converting your data, • And knowing how to fix common problems.

  5. 1 - Introduction • Some assumptions: • Language: Perl • Unix Operating System • Input encoded as: ASCII, ISO-8859-1/Latin-1 or Windows-1252. • Output encoded as: UTF-8

  6. 1 - Introduction • What we’ll cover in this talk: • A primer on character encoding • A simplifying principle • UTF-8 • Perl & UTF-8 • Making the Browser Happy • Encoding Hell

  7. Chapter 2: A Very Brief Primer on Character Encoding.

  8. 2 - A Very Brief Primer on Character Encoding. What is a character encoding?

  9. 2 - A Very Brief Primer on Character Encoding. It’s a specific way to represent the characters in a given character set.

  10. 2 - A Very Brief Primer on Character Encoding. A character set may have a numerical ordering on it for use with a given character encoding.

  11. 2 - A Very Brief Primer on Character Encoding. The number given to a specific character in an ordered character set is its code point.

  12. 2 - A Very Brief Primer on Character Encoding. Do not confuse the character’s code point with its representation!

  13. 2 - A Very Brief Primer on Character Encoding. The two may coincide for ASCII, ISO-8859-1 and Windows-1252, and…

  14. 2 - A Very Brief Primer on Character Encoding. they may coincide for 1-byte UTF-8, but…

  15. 2 - A Very Brief Primer on Character Encoding. they definitely do not coincide for multi-byte UTF-8.

  16. 2 - A Very Brief Primer on Character Encoding. It’s a common problem. So don’t confuse them!
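
To make the distinction concrete, here is a minimal sketch that assumes Perl's core Encode module. The choice of U+00E9 as the sample character is ours, not the talk's. The code point stays the same while the byte representation changes with the encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00E9 (LATIN SMALL LETTER E WITH ACUTE) as a Perl character string.
# The sample character is an assumption made for illustration.
my $char = "\x{E9}";

# The code point is a number that belongs to the character set.
my $code_point = sprintf 'U+%04X', ord $char;

# The representation is the byte sequence that a given encoding produces.
my $latin1 = uc(unpack('H*', encode('ISO-8859-1', $char)));
my $utf8   = uc(unpack('H*', encode('UTF-8', $char)));

print "code point: $code_point\n";   # U+00E9
print "ISO-8859-1: $latin1\n";       # E9   (same value as the code point)
print "UTF-8:      $utf8\n";         # C3A9 (two bytes, not the code point)
```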

  17. Chapter 3: A Simplifying Principle

  18. 3 - A Simplifying Principle • If all of our data is encoded using only the following encodings (code point ranges are in parentheses): • ASCII (0x00 - 0x7F) • ISO-8859-1/Latin-1 (0x00 - 0xFF) • Windows-1252 (0x00 - 0xFF)

  19. 3 - A Simplifying Principle and if we only care about printable content, then ASCII ⊂ ISO-8859-1 ⊂ Windows-1252

  20. 3 - A Simplifying Principle We can treat everything as Windows-1252!

  21. 3 - A Simplifying Principle This should be OK if we are sure that every document uses one of these three encodings, even when we’re not sure which one a given document uses.
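
The principle can be sketched in Perl as follows, assuming the core Encode module and its `cp1252` encoding name. The helper name `cp1252_to_utf8` and the sample strings are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: interpret raw input bytes as Windows-1252 and
# return the same text as UTF-8 bytes.
sub cp1252_to_utf8 {
    my ($bytes) = @_;
    my $text = decode('cp1252', $bytes);   # bytes -> Perl characters
    return encode('UTF-8', $text);         # characters -> UTF-8 bytes
}

# ASCII input passes through unchanged.
my $ascii = cp1252_to_utf8("plain");

# 0xE9 is e-acute in both ISO-8859-1 and Windows-1252.
my $latin1 = cp1252_to_utf8("caf\xE9");

# 0x93 and 0x94 are curly quotes in Windows-1252 only.
my $cp1252 = cp1252_to_utf8("\x93hi\x94");

print unpack('H*', $ascii),  "\n";   # 706c61696e
print unpack('H*', $latin1), "\n";   # 636166c3a9
print unpack('H*', $cp1252), "\n";   # e2809c6869e2809d
```

The same call handles all three kinds of input, which is why the source encoding of each document does not need to be known in advance.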

  22. Chapter 4: UTF-8. A Brave New World

  23. 4 - UTF-8. A Brave New World It supports every language you’ll probably ever need.

  24. 4 - UTF-8. A Brave New World No need for Windows-1252 this and Windows-1253 that.

  25. 4 - UTF-8. A Brave New World Its code point range is from 0x00 to 0x10FFFF

  26. 4 - UTF-8. A Brave New World It uses a variable (1 to 4) byte encoding.

  27. 4 - UTF-8. A Brave New World 1-byte UTF-8 is used for code points in the range 0x00 to 0x7F.

  28. 4 - UTF-8. A Brave New World 1-byte UTF-8 = ASCII • MSBit is 0 • code point = representation

  29. 4 - UTF-8. A Brave New World • Examples of 1-byte UTF-8: • “A” -> 0100 0001 • “&” -> 0010 0110 • “5” -> 0011 0101
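
The bit patterns above can be reproduced with a short Perl sketch. The helper name `byte_bits` is hypothetical. For these characters the single UTF-8 byte is simply the code point, so `ord` is enough.

```perl
use strict;
use warnings;

# Hypothetical helper: format the code point of a 1-byte character as
# two groups of four bits.
sub byte_bits {
    my ($char) = @_;
    my $bits = sprintf '%08b', ord $char;
    $bits =~ s/^(.{4})/$1 /;     # insert a space after the high nibble
    return $bits;
}

for my $char ('A', '&', '5') {
    print "$char -> ", byte_bits($char), "\n";
}
# A -> 0100 0001
# & -> 0010 0110
# 5 -> 0011 0101
```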

  30. 4 - UTF-8. A Brave New World 2-byte UTF-8 is used for code points in the range 0x0080 to 0x07FF.

  31. 4 - UTF-8. A Brave New World 2-byte UTF-8 • code point != representation

  32. 4 - UTF-8. A Brave New World The code point is broken apart into two pieces.

  33. 4 - UTF-8. A Brave New World The five MSBits of the code point are assigned to the first byte and the six LSBits are assigned to the second byte.

  34. 4 - UTF-8. A Brave New World For the first byte of 2-byte UTF-8: • The three MSBits are set to 110. • The remaining bits are the five MSBits of the code point.

  35. 4 - UTF-8. A Brave New World For the second byte of 2-byte UTF-8: • The two MSBits are set to 10. • The remaining bits are the six LSBits of the code point.
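
The 2-byte layout can be sketched by hand in Perl. The function name `encode_2byte` and the sample code point U+00E9 are assumptions for illustration. In practice the core Encode module does this work, and it is used here only as a cross-check.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical hand-rolled encoder for code points 0x80..0x7FF.
sub encode_2byte {
    my ($cp) = @_;
    die "code point not in 0x80..0x7FF\n" if $cp < 0x80 || $cp > 0x7FF;
    my $first  = 0xC0 | ($cp >> 6);      # 110xxxxx: five MSBits
    my $second = 0x80 | ($cp & 0x3F);    # 10xxxxxx: six LSBits
    return pack 'C2', $first, $second;
}

my $by_hand = encode_2byte(0xE9);
my $by_lib  = encode('UTF-8', chr 0xE9);

print unpack('H*', $by_hand), "\n";                      # c3a9
print $by_hand eq $by_lib ? "match\n" : "mismatch\n";    # match
```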

  36. 4 - UTF-8. A Brave New World 3-byte UTF-8 is used for code points in the range 0x0800 to 0xFFFF.

  37. 4 - UTF-8. A Brave New World 3-byte UTF-8 • code point != representation

  38. 4 - UTF-8. A Brave New World The code point is broken apart into three pieces.

  39. 4 - UTF-8. A Brave New World • The four MSBits of the code point are assigned to the first byte. • The middle six bits are assigned to the second byte. • The six LSBits are assigned to the third byte.

  40. 4 - UTF-8. A Brave New World For the first byte of 3-byte UTF-8: • The four MSBits are set to 1110. • The remaining bits are the four MSBits of the code point.

  41. 4 - UTF-8. A Brave New World For the second byte of 3-byte UTF-8: • The two MSBits are set to 10. • The remaining bits are the six middle bits of the code point.

  42. 4 - UTF-8. A Brave New World For the third byte of 3-byte UTF-8: • The two MSBits are set to 10. • The remaining bits are the six LSBits of the code point.
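
To see the three pieces directly, the following sketch works on the bit string itself instead of using shifts and masks. The helper name `three_byte_bits` and the sample code point U+20AC (the euro sign) are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split a 16-bit code point into 4 + 6 + 6 bits and
# add the 3-byte UTF-8 prefixes to each piece.
sub three_byte_bits {
    my ($cp) = @_;
    die "code point not in 0x800..0xFFFF\n" if $cp < 0x800 || $cp > 0xFFFF;
    my $bits = sprintf '%016b', $cp;
    my ($msb, $mid, $lsb) = $bits =~ /^(.{4})(.{6})(.{6})$/;
    return ("1110$msb", "10$mid", "10$lsb");
}

my @pieces = three_byte_bits(0x20AC);
print "@pieces\n";   # 11100010 10000010 10101100

# Turn each bit string back into a byte and show the result in hex.
my $bytes = pack 'C3', map { oct "0b$_" } @pieces;
print unpack('H*', $bytes), "\n";   # e282ac
```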

  43. 4 - UTF-8. A Brave New World 4-byte UTF-8 is used for code points in the range 0x10000 to 0x10FFFF.
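
With all four ranges stated, the number of bytes needed for a code point follows from a simple range check. This is a minimal sketch. The function name `utf8_length` and the sample code points are hypothetical, and the core Encode module is used only to cross-check the result.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: number of UTF-8 bytes for a code point, based on
# the four ranges given in this chapter.
sub utf8_length {
    my ($cp) = @_;
    return 1 if $cp <= 0x7F;
    return 2 if $cp <= 0x7FF;
    return 3 if $cp <= 0xFFFF;
    return 4 if $cp <= 0x10FFFF;
    die sprintf "U+%X is outside the UTF-8 range\n", $cp;
}

for my $cp (0x41, 0xE9, 0x20AC, 0x1F600) {
    my $expected = length encode('UTF-8', chr $cp);
    printf "U+%04X: %d byte(s), Encode agrees: %s\n",
        $cp, utf8_length($cp), utf8_length($cp) == $expected ? 'yes' : 'no';
}
```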

  44. 4 - UTF-8. A Brave New World 4-byte UTF-8 • code point != representation

  45. 4 - UTF-8. A Brave New World The code point is broken apart into four pieces.

  46. 4 - UTF-8. A Brave New World • The three MSBits of the code point are assigned to the first byte. • The next six bits are assigned to the second byte. • The six bits after that are assigned to the third byte. • The six LSBits are assigned to the fourth byte.

  47. 4 - UTF-8. A Brave New World For the first byte of 4-byte UTF-8: • The five MSBits are set to 11110. • The remaining bits are the three MSBits of the code point.

  48. 4 - UTF-8. A Brave New World For the second byte of 4-byte UTF-8: • The two MSBits are set to 10. • The remaining bits are the six bits of the code point that follow its three MSBits.

  49. 4 - UTF-8. A Brave New World For the third byte of 4-byte UTF-8: • The two MSBits are set to 10. • The remaining bits are the six bits of the code point that precede its six LSBits.

  50. 4 - UTF-8. A Brave New World For the fourth byte of 4-byte UTF-8: • The two MSBits are set to 10. • The remaining bits are the six LSBits of the code point.
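
The 4-byte layout can be sketched with shifts and masks, treating the code point as a 21-bit value split 3 + 6 + 6 + 6. The function name `encode_4byte` and the sample code point U+1F600 are assumptions for illustration. The core Encode module serves as a cross-check.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical hand-rolled encoder for code points 0x10000..0x10FFFF.
sub encode_4byte {
    my ($cp) = @_;
    die "code point not in 0x10000..0x10FFFF\n"
        if $cp < 0x10000 || $cp > 0x10FFFF;
    return pack 'C4',
        0xF0 | ($cp >> 18),             # 11110xxx: three MSBits
        0x80 | (($cp >> 12) & 0x3F),    # 10xxxxxx: next six bits
        0x80 | (($cp >> 6)  & 0x3F),    # 10xxxxxx: six bits after that
        0x80 | ($cp & 0x3F);            # 10xxxxxx: six LSBits
}

my $by_hand = encode_4byte(0x1F600);
my $by_lib  = encode('UTF-8', chr 0x1F600);

print unpack('H*', $by_hand), "\n";                      # f09f9880
print $by_hand eq $by_lib ? "match\n" : "mismatch\n";    # match
```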
