
Unicode Text and Regular Expression

Unicode Text and Regular Expression. Andy Heninger, 9/9/2004. Overview: Regular expressions have long been used for searching text data; parsing and extracting fields; and text manipulation (find & replace). Regular expressions and Unicode text data are a good match.


Presentation Transcript


  1. Unicode Text and Regular Expression Andy Heninger 9/9/2004

  2. Overview • Regular Expressions have long been used for • Searching text data • Parsing, extracting fields • Text manipulation, find & replace • Regular Expressions and Unicode Text data are a good Match. • Regular Expression Languages have evolved new features to work more conveniently and powerfully with Unicode data. • Talk Focus is on these Unicode related features.

  3. What Are Regular Expressions • Think of Wildcards • Select or match text • Available in editors, languages, tools, databases • Not the topic today

  4. Character Ranges • [a-z] • Match any one character falling in the specified range • Relies on the existence of some ordering of characters, to determine what falls between a and z. Typically charset order. • Only works for English • No accented characters • No letters from other alphabets (Greek, Arabic, etc.) • Still widely used.
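
The limitation described on this slide can be seen in a minimal Python sketch. The helper name `matches_fully` and the sample words are illustrative assumptions, not part of the talk.

```python
import re

# [a-z] is a range over code points: it covers only U+0061..U+007A.
ascii_lower = re.compile(r"[a-z]+")

def matches_fully(word: str) -> bool:
    """Return True if every character of `word` falls in the range a-z."""
    return ascii_lower.fullmatch(word) is not None

print(matches_fully("resume"))                          # True: plain ASCII letters
print(matches_fully("r\u00e9sum\u00e9"))                # False: U+00E9 lies outside a-z
print(matches_fully("\u03bb\u03cc\u03b3\u03bf\u03c2"))  # False: Greek letters lie outside a-z
```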

  5. POSIX Character Classes • Remove dependency on charset ordering • Convenient, more likely to be correct than [a-z] • [:alnum:] [:cntrl:] [:lower:] [:space:] [:alpha:] [:digit:] [:xdigit:] [:print:] [:upper:] [:blank:] [:graph:] [:punct:] • Implementers must provide definitions for different charsets

  6. POSIX -> Unicode • Unicode has a very rich character property system • Unicode TR 18 defines POSIX classes in terms of properties • [:alpha:] Alphabetic = TRUE • [:digit:] General Category = Decimal Number • [:space:] White Space = TRUE • [:upper:] Uppercase = TRUE • Direct access to Unicode properties in character set expressions is a key feature for Unicode regular expressions.
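
As a rough illustration of one of these mappings, the sketch below expresses [:digit:] as General Category = Decimal Number using Python's standard `unicodedata` module. The function name `is_posix_digit` is an assumption made for this example. Python's `re` module does not offer `\p{...}` syntax, but its `\d` on `str` patterns is documented to match any character in category Nd, so it behaves like the property-based definition.

```python
import re
import unicodedata

def is_posix_digit(ch: str) -> bool:
    """[:digit:] expressed as General Category = Decimal Number (Nd)."""
    return unicodedata.category(ch) == "Nd"

# In Python 3, \d on str patterns matches any Unicode decimal digit (Nd).
digits = re.compile(r"\d+")

# ASCII digits, Arabic-Indic digits, Devanagari digits
sample = "42 \u0664\u0662 \u0968\u096A"
print(digits.findall(sample))
print([is_posix_digit(c) for c in "7\u0667x"])  # [True, True, False]
```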

  7. A Quick Look at the Unicode General Category • Central to Regular Expressions with Unicode Text • Categorize every character as one of • Letter • Number • Separator • Punctuation • Marks • Symbols • Others • Subcategories within each. Examples • Letter, Uppercase, lowercase, Other, … • Symbols, Math, Currency, Modifiers, … • Mark, spacing, non-spacing, enclosing
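
The categories and subcategories can be inspected directly. This is a small sketch using Python's `unicodedata.category`, which returns the two-letter General Category code. The `MAJOR` table and the `describe` helper are names invented for this example.

```python
import unicodedata

# First letter of the two-letter code gives the major category.
MAJOR = {
    "L": "Letter", "N": "Number", "Z": "Separator", "P": "Punctuation",
    "M": "Mark", "S": "Symbol", "C": "Other",
}

def describe(ch: str):
    """Return (two-letter General Category code, major category name)."""
    code = unicodedata.category(ch)
    return code, MAJOR[code[0]]

for ch in ["A", "a", "5", " ", ",", "\u0301", "+", "$"]:
    code, major = describe(ch)
    print(f"U+{ord(ch):04X}  {code}  {major}")
```

For instance, `describe("A")` gives `("Lu", "Letter")` and `describe("\u0301")` (combining acute accent) gives `("Mn", "Mark")`.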

  8. Unicode Property Based Character Classes • TR 18 Recommended Properties for Basic Unicode support includes • General Category • Script • Alphabetic • Uppercase • Lowercase • White Space • Examples: [:Script=Greek:] POSIX syntax [\p{Script=Greek}] Perl syntax [\p{Alphabetic}]

  9. Set Operations • [^\p{Letter}] Negation • [\p{Letter}\p{Number}] Union • [\p{Letter}&\p{script=Cyrillic}] Intersection • [\p{Letter}-\p{Latin}] Difference • Important for a character set the size of Unicode.
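
The four operations can be modeled with ordinary Python sets of characters. This is only a sketch of the idea, not how a regex engine implements it. Python's standard library has no Script property, so `looks_cyrillic` and `looks_latin` approximate it from character names; that approximation, the helper names, and the `limit` cut-off are all assumptions made for this example.

```python
import unicodedata

def chars_where(predicate, limit=0x3000):
    """Collect characters below `limit` (an arbitrary cut-off) satisfying `predicate`."""
    return {chr(cp) for cp in range(limit) if predicate(chr(cp))}

def is_letter(ch):
    return unicodedata.category(ch).startswith("L")

def is_number(ch):
    return unicodedata.category(ch).startswith("N")

def looks_cyrillic(ch):
    # ASSUMPTION: approximates Script=Cyrillic by the character name prefix.
    return unicodedata.name(ch, "").startswith("CYRILLIC")

def looks_latin(ch):
    # ASSUMPTION: approximates Script=Latin by the character name prefix.
    return unicodedata.name(ch, "").startswith("LATIN")

letters = chars_where(is_letter)
numbers = chars_where(is_number)
cyrillic = chars_where(looks_cyrillic)
latin = chars_where(looks_latin)

not_letters = chars_where(lambda ch: not is_letter(ch))  # negation
letters_or_numbers = letters | numbers                   # union
cyrillic_letters = letters & cyrillic                    # intersection
non_latin_letters = letters - latin                      # difference

print(len(letters), len(cyrillic_letters), len(non_latin_letters))
```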

  10. Script and Block Properties • [\p{script=Thai}] [\p{block=Thai}] • Unicode Script Property • Categorizes each character by script – Latin, Cyrillic, Arabic, etc. • Shared characters classified as “Common”. Numbers, punctuation, etc. • Not the same as Language. • Unicode Block Property • Categorizes by block – contiguous range of characters. • Basic Latin, Latin-1 Supplement, Latin Extended A, Latin Extended B • Greek, Hebrew, and more. • Has Limitations
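
One way to see the limitations of blocks: a block is just a code point range, so it can contain unassigned positions and characters that belong to another script. The sketch below tests membership in the Thai block (U+0E00 to U+0E7F). The constant `THAI_BLOCK` and the helper `in_thai_block` are names made up for this example.

```python
import unicodedata

THAI_BLOCK = range(0x0E00, 0x0E80)  # U+0E00..U+0E7F, a contiguous range

def in_thai_block(ch: str) -> bool:
    """Block membership is a pure range test on the code point."""
    return ord(ch) in THAI_BLOCK

for ch in ["\u0E01", "\u0E3F", "\u0E00"]:
    name = unicodedata.name(ch, "<unassigned>")
    print(f"U+{ord(ch):04X}  in block: {in_thai_block(ch)}  "
          f"category: {unicodedata.category(ch)}  {name}")
```

U+0E01 is an ordinary Thai letter. U+0E3F (THAI CURRENCY SYMBOL BAHT) sits in the Thai block although the Unicode Character Database classifies its script as Common, and U+0E00 is in the block but unassigned (category Cn).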

  11. Code Points, Code Units, UTF 8/16/32 • Matching happens on Code Points (0 – 10ffff) • UTF-8 bytes or UTF-16 Surrogate Halves not visible • Match results independent of encoding form. • Glitches • Implementations without surrogate support • Perl’s \x
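
A short Python sketch of the distinction, using a supplementary character (U+1D11E, MUSICAL SYMBOL G CLEF). Python 3 strings are sequences of code points, so a regex sees one character regardless of how many code units an encoding form needs. The variable names are illustrative.

```python
import re

clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a supplementary character

print(len(clef))                           # 1 code point
print(len(clef.encode("utf-8")))           # 4 UTF-8 code units (bytes)
print(len(clef.encode("utf-16-le")) // 2)  # 2 UTF-16 code units (a surrogate pair)

# The pattern '.' consumes the whole code point, not a byte or surrogate half.
whole = re.fullmatch(r".", clef)

# A range over supplementary code points also works on code point values.
supplementary = re.fullmatch(r"[\U00010000-\U0010FFFF]", clef)
print(whole is not None, supplementary is not None)
```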

  12. Normalization • \p{Alphabetic} • n • \p{Non Spacing Mark} • …

  13. Normalization • Approaches to the Problem • Data may be pre-normalized, nothing extra needed. • Use Normalization option, if available. • Application Normalizes the data first
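
The third approach, where the application normalizes the data first, might look like the following Python sketch. Python's `re` module has no normalization option, so the helper `search_normalized` (a name invented here) normalizes the text to NFC before matching. The pattern is assumed to be written in NFC as well.

```python
import re
import unicodedata

composed = "ma\u00F1ana"      # n with tilde as one code point, U+00F1
decomposed = "man\u0303ana"   # n followed by U+0303 COMBINING TILDE

pattern = re.compile("ma\u00F1ana")  # pattern written in composed (NFC) form

def search_normalized(pat, text):
    """Normalize the text to NFC, then search; assumes `pat` is in NFC too."""
    return pat.search(unicodedata.normalize("NFC", text))

print(pattern.search(composed) is not None)               # True
print(pattern.search(decomposed) is not None)             # False: different code points
print(search_normalized(pattern, decomposed) is not None) # True after normalization
```

Note that match offsets from `search_normalized` refer to the normalized text, not to the original string.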

  14. Line Endings • Unicode has More • \u000A Line Feed • \u000C Form Feed • \u000D Carriage Return • \u0085 Next Line (NEL) • \u2028 Line Separator • \u2029 Paragraph Separator • \u000D \u000A CR/LF sequence • Matches normally stop at line ends, but overridable. • Line endings always match as a single character, including the CR/LF sequence • No \n sequence to match any line ending
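
The list of line endings above can be written out as an explicit pattern. In this Python sketch the CR/LF alternative comes first so that the pair is consumed as one line ending rather than two. `LINE_END` and `split_lines` are names chosen for this example.

```python
import re

# CR/LF must be tried before the single characters so it counts as one ending.
LINE_END = re.compile("\r\n|[\n\x0c\r\x85\u2028\u2029]")

def split_lines(text: str):
    """Split on any of the Unicode line endings listed on the slide."""
    return LINE_END.split(text)

text = "one\r\ntwo\u2028three\x85four\nfive"
print(split_lines(text))  # ['one', 'two', 'three', 'four', 'five']
```

Python's built-in `str.splitlines()` covers these endings too, but it also splits on a few extra control characters such as vertical tab and the file, group, and record separators.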

  15. Caseless Matching • Simple – one to one character relation between pattern and text being matched. • Full – one to many • German Sharp-S ß uppercases to ‘SS’ • Expensive in complexity of implementation, speed. • Existing implementations provide simple form only.
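
The difference between simple and full caseless matching shows up with the sharp s. In this Python sketch, `re.IGNORECASE` behaves as a simple (one-to-one) comparison, while `str.casefold()` applies full case folding to whole strings. The helper `full_caseless_equal` is an illustrative name, and it compares complete strings rather than performing regex matching.

```python
import re

print("\u00DF".upper())  # 'SS': one character uppercases to two

# Simple caseless matching: the single sharp s cannot match the two-letter "ss".
simple = re.fullmatch("strasse", "stra\u00DFe", re.IGNORECASE)
print(simple is not None)  # False

def full_caseless_equal(a: str, b: str) -> bool:
    """Compare two strings under full case folding (one-to-many mappings allowed)."""
    return a.casefold() == b.casefold()

print(full_caseless_equal("STRASSE", "stra\u00DFe"))  # True
```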

  16. Grapheme Clusters • Definition: what a user would consider a character, or what would display as a single character. • Multi-codepoint Clusters • Base char + combining marks • Example: decomposed form of Ň • Hangul (Korean) syllables • Unicode-enabled regular expressions should provide • Match a grapheme cluster • Test whether match position is on a boundary.
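
The "base char + combining marks" case can be approximated in a few lines of Python. This is a deliberately simplified sketch, not the UAX 29 algorithm: it does not handle Hangul jamo sequences, CR/LF, or other cluster rules. The function name `simple_clusters` is an assumption for this example.

```python
import unicodedata

def simple_clusters(text: str):
    """Group each base character with the combining marks that follow it.

    Simplified: any character whose General Category starts with 'M'
    is attached to the preceding cluster.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

decomposed = "N\u030Ca"  # N + COMBINING CARON, then 'a'
print(len(decomposed))                   # 3 code points
print(len(simple_clusters(decomposed)))  # 2 user-perceived characters
```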

  17. Word Boundaries, \b • Classic RE Feature • Boundaries between “word” and “non-word” characters • “Word” characters include all Alphabetic. • Non-spacing marks never separated from base, otherwise ignored. • UAX 29 Boundaries • Better, but different, results. • Example text compared on the slide: “Hello There. G’day 123.456”, once with classic RE boundaries and once with Unicode word boundaries.
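
The classic behavior on the slide's sample text can be reproduced with Python's `re` module, which implements the word/non-word style of boundary. This sketch shows only the classic side; Python's standard library does not provide UAX 29 word segmentation.

```python
import re

text = "Hello There. G\u2019day 123.456"

# Classic rule: a word is a run of \w characters, so the apostrophe
# and the decimal point both split their surrounding text.
classic_words = re.findall(r"\w+", text)
print(classic_words)  # ['Hello', 'There', 'G', 'day', '123', '456']
```

Under the UAX 29 default rules, an apostrophe between letters and a full stop between digits do not break a word, so "G’day" and "123.456" would each be kept whole.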

  18. Unicode TR 18 • Unicode Technical Standard #18, Regular Expressions • Guidelines for how to adapt RE implementations to Unicode • Three Levels of Support, Basic, Extended, Tailored • Basic Support requires • Access to common Unicode Character Properties • Set (character class) Operations – Union, Intersection, Subtraction • Simple Unicode Loose (caseless) matching • Unicode Line separator characters • Supplementary Character support • Hex notation for Unicode code points

  19. Unicode TR 18 • Extended Unicode Support • More properties, characters by name. (GREEK CAPITAL LETTER EPSILON) • Canonical Equivalents (normalization) • Unicode style word boundaries • Full case insensitive matching • Matching default grapheme clusters and boundaries • Tailored Support. Language or Locale specific behavior for a number of matching constructs. • No implementations available yet.
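
Referring to characters by name, one of the extended features listed here, can be sketched in Python with the `\N{...}` string escape and `unicodedata.lookup`. The name is resolved when the string literal is compiled, so the regex engine only sees the resulting character. The sample word is an illustrative choice.

```python
import re
import unicodedata

# The \N{...} escape in a string literal resolves a character by its Unicode name.
epsilon = "\N{GREEK CAPITAL LETTER EPSILON}"
print(f"U+{ord(epsilon):04X}")  # U+0395

# The same lookup at run time, and the reverse direction.
looked_up = unicodedata.lookup("GREEK CAPITAL LETTER EPSILON")
name = unicodedata.name(epsilon)

# Pattern: a capital epsilon followed by more word characters.
pattern = re.compile("\N{GREEK CAPITAL LETTER EPSILON}\\w+")
word = "\u0395\u03BB\u03BB\u03AC\u03B4\u03B1"
print(pattern.match(word) is not None)  # True
```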

  20. Implementations • Implementations providing significant Unicode support • Perl. • Major innovations to regular expressions • Early adopter of Unicode • Perl features and syntax widely adopted. • Java JDK 1.4 • Microsoft .NET • IBM ICU4C

  21. Conclusion • Regular expressions provide a great way to analyze and manipulate Unicode data. • Mainstream implementations are readily available.

  22. Questions
