Regex wangyoutian@nilnul.com
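char • Indivisible • (within the scope of this discussion) • Examples: • A, b • “,”, “@” • “ ” • Chinese characters • Most can be recognized visually • Also includes whitespace, such as space, newline, and Tab
Set<char> • A finite set of characters is called an alphabet • Usually non-empty • Examples: English letters, digits, Chinese characters, and punctuation
String • Characters joined in order form a string. • Usually of finite length • May contain one character, two, …, or zero (ε, “”, the empty string) • Example: • “abcdaaab” • In ES, a string is enclosed in single or double quotes.
Algebra of string • + string concatenation • empty string + another string = that string + empty string = that string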
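A minimal sketch of this identity in ES, where `""` plays the role of ε. The variable names are illustrative only.

```javascript
// The empty string "" is the identity element of string concatenation.
const s = "abc";
const left = "" + s;   // ε + s
const right = s + "";  // s + ε
console.log(left === s && right === s);

// Concatenation is ordered: "ab" + "cd" differs from "cd" + "ab".
const joined = "ab" + "cd";
const swapped = "cd" + "ab";
console.log(joined, swapped);
```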
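Alphabet vs string • An alphabet is a set: • unordered • no duplicates • measured by cardinality • A string is a list: • ordered • duplicates allowed • measured by length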
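The contrast can be sketched with an ES `Set` next to a plain string, using the sample string from the String slide. Treating each UTF-16 character of the string as one char is an assumption that holds for this ASCII example.

```javascript
// A string is an ordered list of chars; a Set drops order and duplicates.
const text = "abcdaaab";
const charsUsed = new Set(text);       // iterating a string yields its chars

const length = text.length;            // list length counts repeats
const cardinality = charsUsed.size;    // set cardinality counts distinct chars
console.log(length, cardinality);
```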
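Set<string> • A set of strings • Usually infinite • The length of its strings is not bounded • May also be finite • Examples: the empty set, {“a”, “ab”}
Algebra of Set<string> • | • Equivalent to set union; the result is still a Set<string> • ∅ = the union of zero Set<string>s • {“a”, “bc”, “e”} | {“a”, “1”} = {“a”, “bc”, “e”, “1”} • Many use ∪, +, or ∨ for alternation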
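As a rough illustration for finite sets only, alternation can be modelled with ES `Set` objects. The helper name `union` is hypothetical, not something the slides define.

```javascript
// Alternation (|) on finite sets of strings, modelled as set union.
function union(...sets) {
  const result = new Set();
  for (const s of sets) {
    for (const x of s) result.add(x);
  }
  return result;
}

const u = union(new Set(["a", "bc", "e"]), new Set(["a", "1"]));
const none = union(); // the union of zero sets is the empty set
console.log([...u], none.size);
```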
Concatenation • Similar to the Cartesian product; the result is still a Set<string> • Example: • {“a”, “bc”, “e”} {“a”, “1”} = {“aa”, “a1”, “bca”, “bc1”, “ea”, “e1”}
Power • {“a”, “1”}^2 = {“a”, “1”} {“a”, “1”} = {“aa”, “a1”, “1a”, “11”} • {“a”, “1”}^3 = {“a”, “1”} {“a”, “1”} {“a”, “1”} = {“aaa”, “aa1”, “a1a”, “a11”, “1aa”, “1a1”, “11a”, “111”} • Definitions • {“a”, “1”}^1 = {“a”, “1”} • so that {“a”, “1”}^2 = {“a”, “1”}^(1+1) = {“a”, “1”} {“a”, “1”} • {“a”, “1”}^0 = {ε} • so that {“a”, “1”}^1 = {“a”, “1”}^(0+1) = {ε} {“a”, “1”}
Combining concatenation and alternation • S{m,n} = S^m | S^(m+1) | S^(m+2) | … | S^n • S{m,} = S^m | S^(m+1) | S^(m+2) | … • S? = S^0 | S • S+ = S^1 | S^2 | S^3 | … • S* = S^0 | S^1 | S^2 | S^3 | … • * is called the Kleene star
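A small sketch of set concatenation for finite sets. The helper name `concatSets` is hypothetical. The second call shows why the operation is only similar to a Cartesian product: two different pairs can yield the same string.

```javascript
// Concatenate every string of r with every string of s.
function concatSets(r, s) {
  const result = new Set();
  for (const x of r) {
    for (const y of s) result.add(x + y);
  }
  return result;
}

const product = concatSets(new Set(["a", "bc", "e"]), new Set(["a", "1"]));
console.log([...product]);

// "a"+"bc" and "ab"+"c" both give "abc", so 4 pairs give only 3 strings.
const collapsed = concatSets(new Set(["a", "ab"]), new Set(["bc", "c"]));
console.log(collapsed.size);
```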
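The definitions above can be sketched as repeated set concatenation starting from {ε}. Both helper names, `concatSets` and `power`, are hypothetical.

```javascript
// Concatenate every string of r with every string of s.
function concatSets(r, s) {
  const result = new Set();
  for (const x of r) {
    for (const y of s) result.add(x + y);
  }
  return result;
}

// S^0 = {ε}; S^(n+1) = S^n S.
function power(s, n) {
  let result = new Set([""]);
  for (let i = 0; i < n; i++) result = concatSets(result, s);
  return result;
}

const base = new Set(["a", "1"]);
const zeroth = [...power(base, 0)];
const squared = [...power(base, 2)];
const cubedSize = power(base, 3).size;
console.log(zeroth, squared, cubedSize);
```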
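The same operators exist as quantifiers in ES regular expressions. The sketch below uses S = {“a”, “1”} written as `(?:a|1)` and anchors each pattern with `^` and `$` so that the whole string must belong to the set. The variable names are illustrative.

```javascript
// S{2,3}: two or three strings from S in a row.
const twoOrThree = /^(?:a|1){2,3}$/;
const results = ["a", "a1", "a1a", "a1a1"].map((s) => twoOrThree.test(s));
console.log(results);

// S? = S^0 | S, S+ = S^1 | S^2 | ..., S* = S^0 | S^1 | ...
const optional = /^(?:a|1)?$/;
const plus = /^(?:a|1)+$/;
const star = /^(?:a|1)*$/;
console.log(optional.test(""), plus.test(""), star.test(""));
```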
Priority of ops • * highest • then concatenation • then alternation (lowest) • Parentheses may be omitted where the priority already implies them. For example, • (ab)c can be written as abc, • and a|(b(c*)) can be written as a|bc*.
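One way to check the priority rule in ES is to compare a pattern with its fully parenthesized form, plus a deliberately different grouping. The names `implicit`, `explicit`, and `regrouped` are illustrative.

```javascript
// a|bc* parses as a|(b(c*)); non-capturing groups make that explicit.
const implicit = /^(?:a|bc*)$/;
const explicit = /^(?:a|(?:b(?:c*)))$/;
// A different grouping, (a|b)c*, accepts other strings such as "ac".
const regrouped = /^(?:(?:a|b)c*)$/;

const samples = ["", "a", "b", "bccc", "ac", "bcbc"];
const sameLanguage = samples.every(
  (s) => implicit.test(s) === explicit.test(s)
);
console.log(sameLanguage, implicit.test("ac"), regrouped.test("ac"));
```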
Regular Expression • Some Set<string>s are called regular expressions, or RE, RegExp, Regex • The following are RegExp • {“a”} is RegExp, for any char a in the alphabet • R|S is RegExp, if R and S are both RegExp • RS is RegExp, if R and S are both RegExp • R* is RegExp, if R is RegExp • So {ε} is RegExp (as ∅*, once ∅ is included) • If a Set<string> cannot be built by the process above, it is not RegExp
Note • ∅ is often included in RegExp
Empty • Empty allowed • [] • () • |
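A short sketch of how these empty forms behave in current ES engines: an empty class matches no character at all, while an empty group or an empty alternative matches the empty string. The variable names are illustrative.

```javascript
// [] is the empty set of chars, so it can never match.
const emptyClass = /[]/.test("abc");

// () and | both match the empty string at the first position tried.
const emptyGroup = /()/.exec("abc");
const emptyAlt = /|/.exec("abc");

// An empty alternative acts as a fallback: "a" or nothing.
const fallback = /a|/.exec("b");
console.log(emptyClass, emptyGroup[0] === "", emptyAlt[0] === "", fallback[0] === "");
```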
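Assertion • ^ start of input • $ end of input • \b • Word boundary • a position between a word char (_, [0-9], [A-Za-z]) and a non-word char • \B • Not \b • (?=expression) lookahead • (?!expression) negative lookahead
quantifier • ? • + • * • {m,} • {m,n} • Each of the above becomes lazy when followed by another ?
capture • () capturing group • (?:expression) non-capturing group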
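Assertions match positions rather than characters. A brief sketch with made-up sample strings:

```javascript
// ^ and $ anchor to the start and end of the input.
const startsWithAb = /^ab/.test("abc");
const endsWithBc = /bc$/.test("abc");

// \b: "cat" as a whole word appears once; inside "concat" it is not at a boundary.
const wholeWords = "a cat, concat".match(/\bcat\b/g);
// _ is a word char, so there is no boundary between "a" and "_".
const boundaryBeforeUnderscore = /a\b/.test("a_b");
// \B holds exactly where \b does not.
const insideWord = /\Bcat/.test("concat");

// Lookahead checks what follows without consuming it.
const qThenU = /q(?=u)/.test("quit");
const qNotU = /q(?!u)/.test("Iraq");
console.log(startsWithAb, endsWithBc, wholeWords, insideWord, qThenU, qNotU);
```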
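Greedy and lazy matching differ only in how much they take when several lengths would succeed. A small sketch with illustrative inputs:

```javascript
const html = "<b>x</b>";
const greedy = html.match(/<.+>/)[0];   // takes as much as possible
const lazy = html.match(/<.+?>/)[0];    // takes as little as possible

const greedyRange = /a{2,}/.exec("aaaa")[0];
const lazyRange = /a{2,}?/.exec("aaaa")[0];

// ?? prefers zero occurrences, so it matches the empty string here.
const lazyOptional = /a??/.exec("a")[0];
console.log(greedy, lazy, greedyRange, lazyRange, JSON.stringify(lazyOptional));
```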
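A sketch of the difference: only `()` contributes an entry to the result array, while `(?:)` groups without capturing. The date-like input is just sample data.

```javascript
// The middle group is non-capturing, so only two captures are recorded.
const m = /(\d+)-(?:\d+)-(\d+)/.exec("2024-05-17");
console.log(m.length, m[0], m[1], m[2]);
```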
Escape • \c followed by a lowercase or uppercase letter (a control character) • \a = a • because a has no special meaning assigned • The same holds for some other letters • \u002F • \0
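These escapes can be tried directly. The sketch assumes a pattern without the `u` flag, since the `\a` = a shortcut is only tolerated in that legacy mode of current engines.

```javascript
// \cJ is Control-J, i.e. the line feed character U+000A.
const controlJ = /\cJ/.test("\n");
// \a has no special meaning, so it matches a plain "a" (no u flag).
const plainA = /\a/.test("a");
// \u002F is the code unit for "/".
const slash = /\u002F/.test("/");
// \0 matches the NUL character U+0000.
const nul = /\0/.test("\u0000");
console.log(controlJ, plainA, slash, nul);
```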
Character Class • [] • [abc] = {“a”, “b”, “c”} • [a-c] = {“a”, “b”, “c”} • [-ca] where - is literal • [ac-] where - is literal • [^a-c] = alphabet \ [a-c]
Escape Class • [\b] = {“backspace”} • [\]] = {“]”} • [\B] error • [\1] error • Outside a class, \1 refers to the captured group
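A quick sketch of these classes, anchored so that each test checks a single char. The sample strings are illustrative.

```javascript
const listed = /^[abc]$/.test("c");
const ranged = /^[a-c]$/.test("b");

// A hyphen at the start or end of a class is a literal "-".
const leadingHyphen = /^[-ca]$/.test("-");
const trailingHyphen = /^[ac-]$/.test("-");
const notARange = /^[ac-]$/.test("b"); // "b" is not in {a, c, -}

// [^a-c] is the complement: every char except a, b, c.
const kept = "abcdef-".replace(/[^a-c]/g, "");
console.log(listed, ranged, leadingHyphen, trailingHyphen, notARange, kept);
```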
. Any char except newline • \d digit • \D not digit • \w word char • \W not word char • \s whitespace • \S not whitespace
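The predefined classes can be sketched on a few made-up inputs:

```javascript
// . matches any char except a line terminator.
const dotOnLetter = /^.$/.test("a");
const dotOnNewline = /./.test("\n");

const digits = "a1 b_2".match(/\d/g);    // digits only
const words = "a1 b_2".match(/\w+/g);    // letters, digits, underscore
const parts = "a\tb c".split(/\s/);      // split on whitespace
const nonDigits = "a1b2".replace(/\D/g, "");
console.log(dotOnLetter, dotOnNewline, digits, words, parts, nonDigits);
```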
Back reference • \1 • \0 • matches <NUL>, it is not a back reference • \1000000000 • Error if there are not that many capturing groups.
/ … /gim • g for global • i for case-insensitive • m for multiline
Note: • // is parsed as the start of a comment • Write the empty regular expression as /(?:)/
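A back reference matches the same text that the numbered group captured, not the group's pattern again. A sketch with illustrative inputs:

```javascript
// (a|b)\1 accepts "aa" and "bb" but not "ab".
const doubled = /^(a|b)\1$/;
console.log(doubled.test("aa"), doubled.test("ab"));

// Find a word that is immediately repeated.
const repeated = /\b(\w+) \1\b/.exec("it is is fine");
console.log(repeated[0], repeated[1]);
```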
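The effect of each flag can be sketched as follows; the inputs are made up for illustration.

```javascript
// g: find every match instead of only the first.
const lowerOnly = "aAa".match(/a/g).length;
// i: ignore case.
const anyCase = "aAa".match(/a/gi).length;

// m: ^ and $ also match at line breaks inside the input.
const singleLine = /^y$/.test("x\ny");
const multiLine = /^y$/m.test("x\ny");
console.log(lowerOnly, anyCase, singleLine, multiLine);
```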
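A short sketch of the empty regular expression. The `source` value shown is what current engines report for an empty pattern.

```javascript
// /(?:)/ matches the empty string, which occurs in every input.
const matchesAnything = /(?:)/.test("anything");

// Building an empty pattern through the constructor gives the same form.
const built = new RegExp("");
const source = built.source;

// Splitting on the empty pattern separates every char.
const chars = "abc".split(/(?:)/);
console.log(matchesAnything, source, chars);
```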
RegExp. • RegExp is a function • Can construct Regular Expressions • RegExp(pattern, flags) • new RegExp(pattern, flags) • RegExp.prototype
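Both call forms can be sketched side by side. Note that a backslash has to be doubled inside a string pattern; the variable names are illustrative.

```javascript
// With or without new, RegExp builds a regular expression object.
const viaNew = new RegExp("a+b", "gi");
const viaCall = RegExp("a+b", "gi");
console.log(typeof RegExp, viaNew instanceof RegExp, viaCall instanceof RegExp);

// "\\d+" in a string is the pattern \d+, the same as the literal /\d+/.
const fromString = new RegExp("\\d+");
console.log(fromString.source === /\d+/.source);

// Every instance inherits from RegExp.prototype.
console.log(Object.getPrototypeOf(viaNew) === RegExp.prototype);
```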
RegExp.prototype. • constructor • exec • Returns the match as an array (null when there is no match) • Captures are ordered by the position of their opening ( • The whole pattern acts as one implicit () at index 0 • test • Returns a boolean • toString • Returns a string
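A sketch of the three methods on sample input. The nested pattern shows that captures are numbered by where each `(` opens.

```javascript
const phone = /(\d+)-(\d+)/;
const m = phone.exec("tel 555-1234");
console.log(m[0], m[1], m[2], m.index); // index 0 holds the whole match

// Groups are numbered by their opening parenthesis, outermost first.
const nested = /((a)(b))(c)/.exec("abc");
console.log(Array.from(nested));

const found = phone.test("none");   // boolean
const missing = phone.exec("none"); // null when nothing matches
const text = phone.toString();      // the literal form as a string
console.log(found, missing, text);
```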
Members of RegExp instance • source • global • ignoreCase • multiline • lastIndex integer • { [[Writable]]: true, • [[Enumerable]]: false, • [[Configurable]]: false }.
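With the g flag, `exec` resumes from `lastIndex` and updates it after each call. A sketch on a sample word, including the attribute check for `lastIndex`:

```javascript
const re = /a/g;
const attrs = Object.getOwnPropertyDescriptor(re, "lastIndex");
console.log(attrs.writable, attrs.enumerable, attrs.configurable);

// Each call continues where the previous match ended.
const positions = [];
let m;
while ((m = re.exec("banana")) !== null) {
  positions.push([m.index, re.lastIndex]);
}
const afterFailure = re.lastIndex; // reset to 0 once no match is found

// lastIndex is writable, so the search start can be set by hand.
re.lastIndex = 4;
const fromFour = re.exec("banana").index;
console.log(positions, afterFailure, fromFour);
```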
<ZWNJ> and <ZWJ> are format-control characters that are used to make necessary distinctions when forming words or phrases in certain languages.
The Unicode format-control characters (i.e., the characters in category “Cf” in the Unicode Character Database such as LEFT-TO-RIGHT MARK or RIGHT-TO-LEFT MARK) are control codes used to control the formatting of a range of text in the absence of higher-level protocols for this (such as mark-up languages). • All format control characters may be used within comments, and within string literals and regular expression literals. • In ECMAScript source text, <ZWNJ> and <ZWJ> may also be used in an identifier after the first character.
<BOM> is a format-control character used primarily at the start of a text to mark it as Unicode and to allow detection of the text's encoding and byte order. <BOM> characters intended for this purpose can sometimes also appear after the start of a text, for example as a result of concatenating files. <BOM> characters are treated as white space characters (see 7.2).
The special treatment of certain format-control characters outside of comments, string literals, and regular expression literals is summarised in Table 1.
Table 1: Format-Control Character Usage
Code Unit Value | Name | Formal Name | Usage
\u200C | Zero width non-joiner | <ZWNJ> | IdentifierPart
\u200D | Zero width joiner | <ZWJ> | IdentifierPart
\uFEFF | Byte Order Mark | <BOM> | Whitespace
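The table can be spot-checked from ES itself. The sketch below builds a tiny function body at run time so that the invisible <ZWNJ> character is visible as an escape in the example; the identifier `a<ZWNJ>b` is made up for illustration.

```javascript
// <BOM> counts as white space: \s matches it and trim() removes it.
const bomIsSpace = /\s/.test("\uFEFF");
const trimmed = "\uFEFFabc".trim();

// <ZWNJ> is not white space.
const zwnjIsSpace = /\s/.test("\u200C");

// <ZWNJ> may appear inside an identifier after the first character.
const body = "var a\u200Cb = 7; return a\u200Cb;";
const value = new Function(body)();
console.log(bomIsSpace, trimmed, zwnjIsSpace, value);
```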