Unicode Text and Regular Expression

Andy Heninger, 9/9/2004

Overview
• Regular expressions have long been used for
  • Searching text data
  • Parsing, extracting fields
  • Text manipulation, find & replace
• Regular expressions and Unicode text data are a good match.
• Regular expression languages have evolved new features to work more conveniently and powerfully with Unicode data.
• This talk focuses on these Unicode-related features.

What Are Regular Expressions
• Think of wildcards
• Select or match text
• Available in editors, languages, tools, databases
• Not the topic today

Character Ranges
• [a-z]
• Match any one character falling in the specified range
• Relies on the existence of some ordering of characters to determine what falls between a and z. Typically charset order.
• Only works for English
  • No accented characters
  • No letters from other alphabets (Greek, Arabic, etc.)
• Still widely used.
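
The limitation of [a-z] can be seen in a few lines of Python. This is only a sketch: the helper name `is_all_a_to_z` and the sample words are made up for illustration.

```python
import re

# [a-z] covers only the 26 unaccented lowercase ASCII letters.
ascii_lower = re.compile(r"[a-z]+")

def is_all_a_to_z(word: str) -> bool:
    """Return True if every character of word falls in the range a-z."""
    return ascii_lower.fullmatch(word) is not None

# "caf\u00e9" ends in e-acute; the last word is Greek.
words = ["cafe", "caf\u00e9", "\u03bb\u03cc\u03b3\u03bf\u03c2"]
results = [is_all_a_to_z(word) for word in words]
```

Only the first word passes; the accented and Greek words fall outside the range even though they consist entirely of lowercase letters.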

POSIX Character Classes
• Remove dependency on charset ordering
• Convenient, more likely to be correct than [a-z]
• [:alnum:] [:cntrl:] [:lower:] [:space:] [:alpha:] [:digit:] [:xdigit:] [:print:] [:upper:] [:blank:] [:graph:] [:punct:]
• Implementers must provide definitions for different charsets

POSIX -> Unicode
• Unicode has a very rich character property system
• Unicode TR 18 defines POSIX classes in terms of properties
  • [:alpha:] Alphabetic = TRUE
  • [:digit:] General Category = Decimal Number
  • [:space:] White Space = TRUE
  • [:upper:] Uppercase = TRUE
• Direct access to Unicode properties in character set expressions is a key feature for Unicode regular expressions.
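
As a rough illustration of the [:digit:] mapping, Python's standard `unicodedata` module exposes the General Category, where Decimal Number is abbreviated `Nd`. The function name `is_posix_digit` is an assumption made for this sketch, not part of any library.

```python
import unicodedata

def is_posix_digit(ch: str) -> bool:
    """[:digit:] defined as General Category = Decimal Number (Nd)."""
    return unicodedata.category(ch) == "Nd"

samples = [
    "7",        # DIGIT SEVEN
    "\u0663",   # ARABIC-INDIC DIGIT THREE
    "\u096B",   # DEVANAGARI DIGIT FIVE
    "\u00B2",   # SUPERSCRIPT TWO (Other Number, not a decimal digit)
    "\u2167",   # ROMAN NUMERAL EIGHT (Letter Number)
]
digit_flags = [is_posix_digit(ch) for ch in samples]
```

The first three samples are decimal digits in different scripts; the superscript and the Roman numeral are numbers but not decimal digits, so they are excluded.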

A Quick Look at the Unicode General Category
• Central to regular expressions with Unicode text
• Categorizes every character as one of
  • Letter
  • Number
  • Separator
  • Punctuation
  • Mark
  • Symbol
  • Other
• Subcategories within each. Examples
  • Letter: uppercase, lowercase, other, …
  • Symbol: math, currency, modifier, …
  • Mark: spacing, non-spacing, enclosing
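
The two-level structure can be inspected with Python's `unicodedata.category`, which returns a two-letter code: the first letter is the major class and the second the subcategory. The `MAJOR_CLASSES` table and `describe` helper below are names invented for this sketch.

```python
import unicodedata

# First letter of the two-letter General Category code.
MAJOR_CLASSES = {
    "L": "Letter", "N": "Number", "Z": "Separator", "P": "Punctuation",
    "M": "Mark", "S": "Symbol", "C": "Other",
}

def describe(ch: str) -> tuple:
    """Return (two-letter category code, major class name) for one character."""
    code = unicodedata.category(ch)
    return code, MAJOR_CLASSES[code[0]]

described = {
    "A": describe("A"),            # uppercase letter
    "a": describe("a"),            # lowercase letter
    "5": describe("5"),            # decimal number
    " ": describe(" "),            # space separator
    "!": describe("!"),            # other punctuation
    "\u0301": describe("\u0301"),  # COMBINING ACUTE ACCENT, non-spacing mark
    "+": describe("+"),            # math symbol
    "\u20AC": describe("\u20AC"),  # EURO SIGN, currency symbol
    "\n": describe("\n"),          # control character
}
```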

Unicode Property Based Character Classes
• TR 18 recommended properties for Basic Unicode support include
  • General Category
  • Script
  • Alphabetic
  • Uppercase
  • Lowercase
  • White Space
• Examples:
  • [:Script=Greek:] POSIX syntax
  • [\p{Script=Greek}] Perl syntax
  • [\p{Alphabetic}]

Set Operations
• [^\p{Letter}] Negation
• [\p{Letter}\p{Number}] Union
• [\p{Letter}&\p{script=Cyrillic}] Intersection
• [\p{Letter}-\p{Latin}] Difference
• Important for a character set the size of Unicode.
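
The semantics of these four operations can be mimicked with ordinary Python sets. This is a sketch under two assumptions: Python's standard `re` module has no \p{...} syntax, so the sets are built by hand from `unicodedata`, and because the standard library has no Script property, "uppercase letter" stands in for the script-based sets on the slide. The helper `chars_where` and the 0x0530 limit are arbitrary choices for illustration.

```python
import unicodedata

def chars_where(predicate, limit=0x0530):
    """Collect the characters below `limit` that satisfy `predicate`."""
    return {chr(cp) for cp in range(limit) if predicate(chr(cp))}

universe = chars_where(lambda c: True)
letters = chars_where(lambda c: unicodedata.category(c).startswith("L"))
numbers = chars_where(lambda c: unicodedata.category(c).startswith("N"))
uppercase = chars_where(lambda c: unicodedata.category(c) == "Lu")

not_letters = universe - letters          # negation, like [^\p{Letter}]
letters_or_numbers = letters | numbers    # union, like [\p{Letter}\p{Number}]
uppercase_letters = letters & uppercase   # intersection
non_upper_letters = letters - uppercase   # difference
```

A real regular expression engine evaluates the same algebra over the full code point range rather than materializing sets like this.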

Script and Block Properties
• [\p{script=Thai}] [\p{block=Thai}]
• Unicode Script Property
  • Categorizes each character by script: Latin, Cyrillic, Arabic, etc.
  • Shared characters classified as “Common”. Numbers, punctuation, etc.
  • Not the same as language.
• Unicode Block Property
  • Categorizes by block, a contiguous range of characters.
  • Basic Latin, Latin-1 Supplement, Latin Extended A, Latin Extended B
  • Greek, Hebrew, and more.
  • Has limitations
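
One way to see why blocks have limitations: a block is just a code point range, so it can contain characters that belong to other scripts and code points that are not assigned at all. The sketch below hard-codes the Thai block range (U+0E00 to U+0E7F) and uses a made-up helper `in_thai_block`; Python's standard library offers no script or block lookup, so nothing here queries the Script property directly.

```python
import unicodedata

# The Thai block is the contiguous range U+0E00..U+0E7F.
THAI_BLOCK = range(0x0E00, 0x0E80)

def in_thai_block(ch: str) -> bool:
    """Block membership is a pure range test on the code point."""
    return ord(ch) in THAI_BLOCK

ko_kai = "\u0E01"       # THAI CHARACTER KO KAI, a Thai letter
baht = "\u0E3F"         # THAI CURRENCY SYMBOL BAHT, a currency symbol
unassigned = "\u0E00"   # inside the block, but not an assigned character

block_flags = [in_thai_block(c) for c in (ko_kai, baht, unassigned)]
categories = [unicodedata.category(c) for c in (ko_kai, baht, unassigned)]
```

All three code points are in the block, yet only the first is a letter. The baht sign is, as far as I know, classified as script Common rather than Thai, so [\p{block=Thai}] and [\p{script=Thai}] would disagree on it.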

Code Points, Code Units, UTF 8/16/32
• Matching happens on code points (0 to 10FFFF)
• UTF-8 bytes or UTF-16 surrogate halves are not visible
• Match results are independent of encoding form.
• Glitches
  • Implementations without surrogate support
  • Perl’s \x
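
A short Python sketch of the code point view, using a supplementary character (U+1D11E MUSICAL SYMBOL G CLEF) chosen here purely as an example. Python 3 strings are sequences of code points, so its `re` module already behaves as the slide describes.

```python
import re

g_clef = "\U0001D11E"  # one supplementary code point

code_points = len(g_clef)                           # code points in the string
utf8_bytes = len(g_clef.encode("utf-8"))            # UTF-8 code units (bytes)
utf16_units = len(g_clef.encode("utf-16-le")) // 2  # UTF-16 code units

# "." consumes the whole code point, not one byte or one surrogate half.
dot_match = re.fullmatch(r".", g_clef)
```

The same character occupies four UTF-8 bytes or two UTF-16 code units, but the pattern sees exactly one character in either case.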

Normalization
• \p{Alphabetic}
• n
• \p{Non Spacing Mark}
• …

Normalization
• Approaches to the problem
  • Data may be pre-normalized, nothing extra needed.
  • Use normalization option, if available.
  • Application normalizes the data first
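
The third approach, where the application normalizes the data itself, might look like the following in Python. The word and variable names are illustrative only; `unicodedata.normalize` is the standard library call.

```python
import re
import unicodedata

# Pattern written with precomposed n-tilde (U+00F1).
pattern = re.compile("ma\u00F1ana")

# Same word, but with "n" followed by COMBINING TILDE (U+0303).
decomposed = "man\u0303ana"

# Without normalization the canonically equivalent text does not match.
raw_match = pattern.fullmatch(decomposed)

# Normalizing the input to NFC first makes the match succeed.
normalized_match = pattern.fullmatch(unicodedata.normalize("NFC", decomposed))
```

For this to be reliable, the pattern and the text need to be in the same normalization form.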

Line Endings
• Unicode has more
  • \u000A Line Feed
  • \u000C Form Feed
  • \u000D Carriage Return
  • \u0085 Next Line (NEL)
  • \u2028 Line Separator
  • \u2029 Paragraph Separator
  • \u000D \u000A CR/LF sequence
• Matches normally stop at line ends, but this is overridable.
• Line endings always match as a single character, including the CR/LF sequence
• No \n sequence to match any line ending
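
Where an engine has no built-in notion of these line endings, an application can spell them out. The following Python sketch builds such a pattern by hand; the name `LINE_END` and the sample text are assumptions for illustration. The CR/LF alternative is listed first so the pair is consumed as one line ending.

```python
import re

# CR/LF first, then each single-character line ending from the list above.
LINE_END = re.compile("\u000D\u000A|[\u000A\u000C\u000D\u0085\u2028\u2029]")

text = "one\r\ntwo\u2028three\u0085four\nfive"
lines = LINE_END.split(text)
```

If the single-character class came first in the alternation, the CR/LF pair would be split into two line endings and produce an empty line between "one" and "two".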

Caseless Matching
• Simple: one-to-one character relation between pattern and text being matched.
• Full: one-to-many
  • German sharp-s ß uppercases to ‘SS’
  • Expensive in complexity of implementation, speed.
• Existing implementations provide simple form only.
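
The difference between simple and full case handling shows up with the sharp-s example. In this Python sketch, the `re.IGNORECASE` flag behaves as a simple (one-to-one) caseless match, while `str.casefold` applies full case folding to whole strings; comparing folded strings is only a stand-in here for a full caseless regular expression match.

```python
import re

word = "stra\u00DFe"      # "strasse" spelled with sharp-s
shouted = "STRASSE"

# Sharp-s uppercases to two characters.
upper_sharp_s = "\u00DF".upper()

# Simple caseless matching cannot pair one pattern character with two text characters.
simple_match = re.fullmatch(word, shouted, re.IGNORECASE)

# Full case folding maps sharp-s to "ss", so the folded strings compare equal.
full_equal = word.casefold() == shouted.casefold()
```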

Grapheme Clusters
• Definition: what a user would consider a character, or what would display as a single character.
• Multi-codepoint clusters
  • Base char + combining marks
  • Example: decomposed form of Ň
  • Hangul (Korean) syllables
• Unicode-enabled regular expressions should provide
  • Match a grapheme cluster
  • Test whether match position is on a boundary.
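
The base-plus-combining-marks case can be approximated in a few lines of Python. This is a deliberately simplified sketch, not the full default grapheme cluster algorithm: it attaches any character whose General Category is a Mark to the preceding character and ignores Hangul syllable composition and other rules. The function name `simple_clusters` is invented for this example.

```python
import unicodedata

def simple_clusters(text: str) -> list:
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch   # mark: extend the current cluster
        else:
            clusters.append(ch)  # anything else starts a new cluster
    return clusters

# Decomposed N-caron: "N" followed by COMBINING CARON (U+030C), then "a".
decomposed = "N\u030Ca"
clusters = simple_clusters(decomposed)
```

The string holds three code points but only two clusters, which is what a user would count as characters.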

Word Boundaries, \b
• Classic RE feature
  • Boundaries between “word” and “non-word” characters
  • “Word” characters include all Alphabetic.
  • Non-spacing marks never separated from base, otherwise ignored.
• UAX 29 boundaries
  • Better, but different, results.
• Example text (boundary marks from the slide graphic not preserved)
  • Classic RE: Hello There. G’day 123.456
  • Unicode Word Boundaries: Hello There. G’day 123.456
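
The classic behavior on the example text can be reproduced with Python's `re` module, whose \b is the word/non-word transition described above. The sketch only shows the classic side; Python's standard library has no UAX 29 word segmentation.

```python
import re

# The slide's example text; U+2019 is the right single quotation mark in G’day.
text = "Hello There. G\u2019day 123.456"

# Runs of word characters delimited by classic \b boundaries.
classic_words = re.findall(r"\b\w+\b", text)
```

The classic rule breaks at the apostrophe and at the decimal point, yielding "G", "day", "123" and "456" as separate words. As I understand the default UAX 29 rules, "G’day" and "123.456" would each stay together as a single word.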

Unicode TR 18
• Unicode Technical Standard #18, Regular Expressions
• Guidelines for how to adapt RE implementations to Unicode
• Three levels of support: Basic, Extended, Tailored
• Basic support requires
  • Access to common Unicode character properties
  • Set (character class) operations: union, intersection, subtraction
  • Simple Unicode loose (caseless) matching
  • Unicode line separator characters
  • Supplementary character support
  • Hex notation for Unicode code points

Unicode TR 18
• Extended Unicode support
  • More properties, characters by name (GREEK CAPITAL LETTER EPSILON)
  • Canonical equivalents (normalization)
  • Unicode style word boundaries
  • Full case insensitive matching
  • Matching default grapheme clusters and boundaries
• Tailored support: language- or locale-specific behavior for a number of matching constructs.
  • No implementations available yet.
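
Referring to a character by name can be sketched in Python with `unicodedata.lookup`, which resolves a Unicode character name to the character. The result is escaped and used as a pattern here; this illustrates the idea rather than any particular engine's name syntax.

```python
import re
import unicodedata

# Resolve the character name used on the slide to the actual character.
epsilon = unicodedata.lookup("GREEK CAPITAL LETTER EPSILON")

# Build a pattern from the looked-up character and test it against
# the same letter written with hex notation (U+0395).
pattern = re.compile(re.escape(epsilon))
name_match = pattern.fullmatch("\u0395")

# The Latin capital E looks similar but is a different character.
latin_match = pattern.fullmatch("E")
```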

Implementations
• Implementations providing significant Unicode support
  • Perl
    • Major innovations to regular expressions
    • Early adopter of Unicode
    • Perl features and syntax widely adopted.
  • Java JDK 1.4
  • Microsoft .NET
  • IBM ICU4C

Conclusion
• Regular expressions provide a great way to analyze and manipulate Unicode data.
• Mainstream implementations are readily available.

Questions
