Unicode Text and Regular Expression Andy Heninger 9/9/2004
Overview • Regular Expressions have long been used for • Searching text data • Parsing, extracting fields • Text manipulation, find & replace • Regular Expressions and Unicode Text data are a good Match. • Regular Expression Languages have evolved new features to work more conveniently and powerfully with Unicode data. • Talk Focus is on these Unicode related features.
What Are Regular Expressions • Think of Wildcards • Select or match text • Available in editors, languages, tools, databases • Not the topic today
Character Ranges • [a-z] • Match any one character falling in the specified range • Relies on the existence of some ordering of characters, to determine what falls between a and z. Typically charset order. • Only works for English • No accented characters • No letters from other alphabets (Greek, Arabic, etc.) • Still widely used.
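A small sketch of this limitation, written in Python purely for illustration (the talk does not prescribe a language, and the sample characters are my own choice):

```python
import re

# [a-z] is defined by code point order, so only unaccented ASCII
# lowercase letters fall inside the range.
ascii_lower = re.compile(r"[a-z]")

samples = ["e", "é", "λ", "ж"]  # Latin, accented Latin, Greek, Cyrillic
matched = [ch for ch in samples if ascii_lower.fullmatch(ch)]
print(matched)  # ['e']
```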
POSIX Character Classes • Remove dependency on charset ordering • Convenient, more likely to be correct than [a-z] • [:alnum:] [:cntrl:] [:lower:] [:space:] [:alpha:] [:digit:] [:xdigit:] [:print:] [:upper:] [:blank:] [:graph:] [:punct:] • Implementers must provide definitions for different charsets
POSIX -> Unicode • Unicode has a very rich character property system • Unicode TR 18 defines POSIX classes in terms of properties • [:alpha:] Alphabetic = TRUE • [:digit:] General Category = Decimal Number • [:space:] White Space = TRUE • [:upper:] Uppercase = TRUE • Direct access to Unicode properties in Character Set expressions is a key feature for Unicode Regular Expressions.
A Quick Look at the Unicode General Category • Central to Regular Expressions with Unicode Text • Categorize every character as one of • Letter • Number • Separator • Punctuation • Marks • Symbols • Others • Subcategories within each. Examples • Letter, Uppercase, lowercase, Other, … • Symbols, Math, Currency, Modifiers, … • Mark, spacing, non-spacing, enclosing
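To make the categories concrete, here is a minimal sketch using Python's standard `unicodedata` module, which reports the General Category as a two-letter abbreviation (major class, then subcategory). The sample characters are arbitrary picks, not taken from the talk.

```python
import unicodedata

# Expected General Category for a handful of characters.
samples = {
    "A": "Lu",       # Letter, uppercase
    "a": "Ll",       # Letter, lowercase
    "5": "Nd",       # Number, decimal digit
    " ": "Zs",       # Separator, space
    "!": "Po",       # Punctuation, other
    "\u0301": "Mn",  # Mark, non-spacing (COMBINING ACUTE ACCENT)
    "+": "Sm",       # Symbol, math
    "$": "Sc",       # Symbol, currency
    "\n": "Cc",      # Other, control
}

categories = {ch: unicodedata.category(ch) for ch in samples}
for ch, cat in categories.items():
    print(repr(ch), cat)
```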
Unicode Property Based Character Classes • TR 18 Recommended Properties for Basic Unicode support includes • General Category • Script • Alphabetic • Uppercase • Lowercase • White Space • Examples: [:Script=Greek:] POSIX syntax [\p{Script=Greek}] Perl syntax [\p{Alphabetic}]
Set Operations • [^\p{Letter}] Negation • [\p{Letter}\p{Number}] Union • [\p{Letter}&\p{script=Cyrillic}] Intersection • [\p{Letter}-\p{Latin}] Difference • Important for a character set the size of Unicode.
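The four operations can be sketched with ordinary Python sets. Two assumptions apply: Python's built-in `re` module has no `\p{...}` syntax, so the sets are built from `unicodedata` instead, and the standard library has no Script lookup, so the intersection and difference below use the Uppercase Letter subcategory where the slide uses a script.

```python
import unicodedata

# Work over a small slice of Unicode (Basic Latin through Cyrillic)
# so the sets stay tiny.
universe = {chr(cp) for cp in range(0x0000, 0x0500)}

letters = {c for c in universe if unicodedata.category(c).startswith("L")}
numbers = {c for c in universe if unicodedata.category(c).startswith("N")}
uppercase = {c for c in universe if unicodedata.category(c) == "Lu"}

not_letters = universe - letters         # negation, like [^\p{Letter}]
letters_or_numbers = letters | numbers   # union, like [\p{Letter}\p{Number}]
upper_letters = letters & uppercase      # intersection
non_upper_letters = letters - uppercase  # difference
```

With tens of thousands of letters in Unicode, composing classes this way is far more practical than enumerating ranges by hand.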
Script and Block Properties • [\p{script=Thai}] [\p{block=Thai}] • Unicode Script Property • Categorizes each character by script – Latin, Cyrillic, Arabic, etc. • Shared characters classified as “Common”. Numbers, punctuation, etc. • Not the same as Language. • Unicode Block Property • Categorizes by block – contiguous range of characters. • Basic Latin, Latin-1 Supplement, Latin Extended A, Latin Extended B • Greek, Hebrew, and more. • Has Limitations
Code Points, Code Units, UTF 8/16/32 • Matching happens on Code Points (0 – 10ffff) • UTF-8 bytes or UTF-16 Surrogate Halves not visible • Match results independent of encoding form. • Glitches • Implementations without surrogate support • Perl’s \x
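As an illustration, Python 3 strings are sequences of code points, so a supplementary character counts and matches as one unit regardless of how many bytes or UTF-16 code units it needs. The choice of U+1D11E is mine.

```python
import re

clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a supplementary code point

code_points = len(clef)                           # 1
utf8_bytes = len(clef.encode("utf-8"))            # 4
utf16_units = len(clef.encode("utf-16-le")) // 2  # 2, a surrogate pair

# '.' consumes the whole code point, never half of a surrogate pair.
m = re.fullmatch(r".", clef)
print(code_points, utf8_bytes, utf16_units, m is not None)  # 1 4 2 True
```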
Normalization • \p{Alphabetic} • n • \p{Non Spacing Mark} • …
Normalization • Approaches to the Problem • Data may be pre-normalized, nothing extra needed. • Use Normalization option, if available. • Application Normalizes the data first
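The third approach, where the application normalizes first, might look like the following sketch. It assumes Python's `unicodedata.normalize` with NFC, and a one-character pattern chosen only to show the difference.

```python
import re
import unicodedata

composed = "\u00F1"     # ñ as a single code point
decomposed = "n\u0303"  # n followed by COMBINING TILDE

pattern = re.compile(r"\w")  # exactly one "word" character

# The decomposed form is two code points, so a one-character
# pattern cannot cover it.
raw_match = pattern.fullmatch(decomposed)

# Normalizing to NFC first composes the pair into a single ñ.
nfc_match = pattern.fullmatch(unicodedata.normalize("NFC", decomposed))

print(raw_match is None, nfc_match is not None)  # True True
```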
Line Endings • Unicode has More • \u000A Line Feed • \u000C Form Feed • \u000D Carriage Return • \u0085 Next Line (NEL) • \u2028 Line Separator • \u2029 Paragraph Separator • \u000D \u000A CR/LF sequence • Matches normally stop at line ends, but overridable. • Line endings always match as a single character, including the CR/LF sequence • No \n sequence to match any line ending
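Where an engine offers no single construct for "any line ending", one can be spelled out by hand. A minimal Python sketch, with the name `LINE_END` being my own:

```python
import re

# CR/LF comes first in the alternation so the pair is consumed as one
# line ending rather than two.
LINE_END = re.compile("\r\n|[\n\x0c\r\x85\u2028\u2029]")

text = "a\r\nb\nc\u2028d\x85e\rf"
lines = LINE_END.split(text)
print(lines)  # ['a', 'b', 'c', 'd', 'e', 'f']
```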
Caseless Matching • Simple – one to one character relation between pattern and text being matched. • Full – one to many • German Sharp-S ß uppercases to ‘SS’ • Expensive in complexity of implementation, speed. • Existing implementations provide simple form only.
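The gap between simple and full matching shows up with the sharp-s example. In this sketch, Python's `re.IGNORECASE` stands in for a simple (one-to-one) implementation and `str.casefold` for full case folding applied by the application before comparing.

```python
import re

upper = "ß".upper()  # 'SS': one character becomes two

# Simple caseless matching compares character by character,
# so ß cannot match the two-character sequence SS.
simple = re.fullmatch("straße", "STRASSE", re.IGNORECASE)

# Full case folding maps ß to "ss" before the comparison.
full = "straße".casefold() == "STRASSE".casefold()

print(upper, simple is None, full)  # SS True True
```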
Grapheme Clusters • Definition: what a user would consider a character, or what would display as a single character. • Multi-codepoint Clusters • Base char + combining marks • Example: decomposed form of Ň • Hangul (Korean) syllables • Unicode-enabled regular expressions should provide • Match a grapheme cluster • Test whether match position is on a boundary.
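The base-plus-combining-marks case can be sketched as follows. This is a deliberate simplification and not the full default grapheme cluster algorithm: it ignores Hangul syllables, CR/LF and other cases, and the helper name `simple_clusters` is hypothetical.

```python
import unicodedata

def simple_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # General Category M* (Mn, Mc, Me) marks attach to the previous cluster.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# Decomposed Ň is N followed by COMBINING CARON (U+030C).
decomposed = unicodedata.normalize("NFD", "\u0147o")
clusters = simple_clusters(decomposed)
print(len(decomposed), len(clusters))  # 3 2
```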
Word Boundaries, \b • Classic RE Feature • Boundaries between “word” and “non-word” characters • “Word” characters include all Alphabetic. • Non-spacing marks never separated from base, otherwise ignored. • UAX 29 Boundaries • Better, but different, results. • Example text, shown once with Classic RE boundaries and once with Unicode Word Boundaries: Hello There. G’day 123.456
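The classic behaviour on the slide's sample text can be reproduced with Python's `re`, whose `\w` and `\b` follow the word/non-word model. The standard library has no UAX 29 segmenter, so only the classic side is executed here; under the UAX 29 default rules, as I understand them, "G’day" and "123.456" would each stay a single word.

```python
import re

text = "Hello There. G\u2019day 123.456"

# Classic model: a word is a maximal run of "word" characters, so the
# apostrophe and the decimal point both split their tokens.
classic_words = re.findall(r"\w+", text)
print(classic_words)  # ['Hello', 'There', 'G', 'day', '123', '456']
```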
Unicode TR 18 • Unicode Technical Standard #18, Regular Expressions • Guidelines for how to adapt RE implementations to Unicode • Three Levels of Support, Basic, Extended, Tailored • Basic Support requires • Access to common Unicode Character Properties • Set (character class) Operations – Union, Intersection, Subtraction • Simple Unicode Loose (caseless) matching • Unicode Line separator characters • Supplementary Character support • Hex notation for Unicode code points
Unicode TR 18 • Extended Unicode Support • More properties, characters by name. (GREEK CAPITAL LETTER EPSILON) • Canonical Equivalents (normalization) • Unicode style word boundaries • Full case insensitive matching • Matching default grapheme clusters and boundaries • Tailored Support. Language or Locale specific behavior for a number of matching constructs. • No implementations available yet.
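Referring to characters by name, the first Extended item above, can be approximated outside the pattern syntax. This sketch assumes Python's `unicodedata.lookup` to resolve the name from the slide and then uses the resulting character as a literal pattern.

```python
import re
import unicodedata

# Resolve the character from its Unicode name, then match it literally.
epsilon = unicodedata.lookup("GREEK CAPITAL LETTER EPSILON")
m = re.search(re.escape(epsilon), "\u0391\u0395\u0399")  # Α Ε Ι

print(hex(ord(epsilon)), m.start())  # 0x395 1
```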
Implementations • Implementations providing significant Unicode support • Perl. • Major innovations to regular expressions • Early adopter of Unicode • Perl features and syntax widely adopted. • Java JDK 1.4 • Microsoft .NET • IBM ICU4C
Conclusion • Regular expressions provide a great way to analyze and manipulate Unicode data. • Mainstream implementations are readily available.