Regex

Regex wangyoutian@nilnul.com

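char • Indivisible • (within the current scope of discussion) • For example: • A, b • “,”, “@” • “ ” • Chinese characters • Most can be recognized visually • Also includes whitespace, such as space, newline, Tab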

Set<char> • A finite set of characters is called an alphabet • Usually non-empty • For example: English letters, digits, Chinese characters and punctuation

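String • Characters joined in order form a string. • Usually of finite length • May contain 1, 2, … characters, or 0 (ε, “”, the empty string) • For example: • “abcdaaab” • In ES (ECMAScript), a string is enclosed in single or double quotes.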

Algebra of string • + is string concatenation • empty string + other string = other string + empty string = other string
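The identity law for the empty string can be checked directly in JavaScript. This is a minimal sketch; the variable names are chosen only for illustration.

```javascript
// The empty string "" plays the role of ε.
const epsilon = "";
const word = "abcdaaab";

const leftIdentity = epsilon + word;   // ε + s
const rightIdentity = word + epsilon;  // s + ε

console.log(leftIdentity === word);   // true
console.log(rightIdentity === word);  // true
console.log(("ab" + "cd").length);    // 4: lengths add under concatenation
```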

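Alphabet vs string • Alphabet is a set • Unordered • No duplicates • Measured by cardinality • String is a list • Ordered • Duplicates allowed • Measured by length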

Set<string> • A set of strings • Usually infinite • String length is not bounded • Can also be finite • For example: the empty set, {“a”, “ab”}

Algebra of Set<string> • | • Corresponds to set union; the result is still a Set<string> • ∅ = the union of zero Set<string> • {“a”,“bc”,“e”} | {“a”,“1”} = {“a”,“bc”,“e”,“1”} • Many use ∪, +, or ∨ for alternation
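Alternation on finite sets of strings can be sketched with the built-in Set. The helper name union is an assumption made for this illustration, not part of any library.

```javascript
// Model Set<string> with the built-in Set; alternation (|) is set union.
// "union" is a hypothetical helper name.
function union(...languages) {
  const result = new Set();
  for (const language of languages) {
    for (const s of language) result.add(s);
  }
  return result;
}

const unionExample = union(new Set(["a", "bc", "e"]), new Set(["a", "1"]));
console.log([...unionExample]); // [ 'a', 'bc', 'e', '1' ]
console.log(union().size);      // 0: the union of zero sets is the empty set
```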

Concatenation • Similar to the Cartesian product; the result is still a Set<string> • For example: • {“a”,“bc”,“e”} {“a”,“1”} = {“aa”,“a1”,“bca”,“bc1”,“ea”,“e1”}
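The product of two finite sets of strings can be sketched as a nested loop over both sets. The helper name concat is hypothetical.

```javascript
// Concatenate every string of r with every string of s.
// "concat" is a hypothetical helper name.
function concat(r, s) {
  const result = new Set();
  for (const x of r) {
    for (const y of s) result.add(x + y);
  }
  return result;
}

const product = concat(new Set(["a", "bc", "e"]), new Set(["a", "1"]));
console.log([...product]); // [ 'aa', 'a1', 'bca', 'bc1', 'ea', 'e1' ]
```

Unlike a true Cartesian product, the result can be smaller than |R| × |S|, because different pairs may produce the same string.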

Power • {“a”,“1”}^2 = {“a”,“1”}{“a”,“1”} = {“aa”,“a1”,“1a”,“11”} • {“a”,“1”}^3 = {“a”,“1”}{“a”,“1”}{“a”,“1”} = {“aaa”,“aa1”,“a1a”,“a11”,“1aa”,“1a1”,“11a”,“111”} • Definition • {“a”,“1”}^1 = {“a”,“1”} • so that {“a”,“1”}^2 = {“a”,“1”}^(1+1) = {“a”,“1”}{“a”,“1”} • {“a”,“1”}^0 = {ε} • so that {“a”,“1”}^1 = {“a”,“1”}^(0+1) = {ε}{“a”,“1”}
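A power can be computed by starting from {ε} and concatenating n times. This is a sketch only; concat and power are hypothetical helper names.

```javascript
// Hypothetical helpers: concat builds the product of two sets of strings.
function concat(r, s) {
  const result = new Set();
  for (const x of r) {
    for (const y of s) result.add(x + y);
  }
  return result;
}

// S^0 = {ε}; each further step multiplies by S once more.
function power(s, n) {
  let result = new Set([""]);
  for (let i = 0; i < n; i++) result = concat(result, s);
  return result;
}

const base = new Set(["a", "1"]);
console.log([...power(base, 0)]);  // [ '' ]
console.log([...power(base, 2)]);  // [ 'aa', 'a1', '1a', '11' ]
console.log(power(base, 3).size);  // 8
```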

Combining power and alternation • S{m,n} = S^m | S^(m+1) | S^(m+2) | … | S^n • S{m,} = S^m | S^(m+1) | S^(m+2) | … • S? = S^0 | S • S+ = S^1 | S^2 | S^3 | … • S* = S^0 | S^1 | S^2 | S^3 | … • * is called the Kleene star
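The bounded form S{m,n} can be sketched as a union of powers. S{m,}, S+ and S* are infinite for any S containing a non-empty string, so code can only enumerate them up to a chosen bound. The helper names concat, power and repeat are hypothetical.

```javascript
// Hypothetical helpers for finite sets of strings.
function concat(r, s) {
  const result = new Set();
  for (const x of r) for (const y of s) result.add(x + y);
  return result;
}

function power(s, n) {
  let result = new Set([""]); // S^0 = {ε}
  for (let i = 0; i < n; i++) result = concat(result, s);
  return result;
}

// S{m,n} = S^m | S^(m+1) | ... | S^n
function repeat(s, m, n) {
  const result = new Set();
  for (let k = m; k <= n; k++) {
    for (const w of power(s, k)) result.add(w);
  }
  return result;
}

const single = new Set(["a"]);
console.log([...repeat(single, 1, 3)]); // [ 'a', 'aa', 'aaa' ]
console.log([...repeat(single, 0, 1)]); // [ '', 'a' ]  (this is S?)
console.log(repeat(single, 0, 5).size); // 6: the part of S* with at most 5 factors
```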

Priority of ops • * binds tightest • then concatenation • then alternation (lowest) • Parentheses that only restate this order may be omitted. For example, • (ab)c can be written as abc, • and a|(b(c*)) can be written as a|bc*.
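The precedence rules can be observed by comparing a|bc* with explicitly parenthesized variants. This sketch uses JavaScript regex syntax, with ^…$ and a non-capturing group added so that the whole input must match.

```javascript
// a|bc* is read as a|(b(c*)).
const byPrecedence = /^(?:a|bc*)$/;
// Parentheses that change the grouping change the language.
const alternationFirst = /^(?:(?:a|b)c*)$/;
const concatBeforeStar = /^(?:a|(?:bc)*)$/;

console.log(byPrecedence.test("bcc"));       // true
console.log(byPrecedence.test("ac"));        // false
console.log(byPrecedence.test("bcbc"));      // false
console.log(alternationFirst.test("ac"));    // true
console.log(concatBeforeStar.test("bcbc"));  // true
console.log(concatBeforeStar.test("bcc"));   // false
```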

Regular Expression • Certain Set<string> are called regular expressions, or RE, RegExp, Regex • The following are RegExp • {“a”} is RegExp, for any char a in the alphabet • R|S is RegExp, if R and S are both RegExp • RS is RegExp, if R and S are both RegExp • R* is RegExp, if R is RegExp • So {ε} is RegExp, since it equals ∅* once ∅ is admitted • If a Set<string> cannot be obtained by the above process, it is not RegExp

Note • ∅ (the empty set) is often included in RegExp

RegExp@JS

See the ECMAScript standard (ECMA-262)

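Empty • Empty patterns are allowed • [] (empty class, matches no character) • () (empty group, matches the empty string) • | (alternation of two empty alternatives, matches the empty string)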
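The three empty forms behave differently, which a short check makes visible. A minimal sketch:

```javascript
const emptyClass = /[]/;        // a class with no members
const emptyGroup = /()/;        // a group around the empty pattern
const emptyAlternatives = /|/;  // empty | empty

console.log(emptyClass.test("abc"));         // false: no character is in the class
const groupMatch = emptyGroup.exec("abc");
console.log(groupMatch[0] === "");           // true: matched the empty string
console.log(groupMatch.index);               // 0
console.log(emptyAlternatives.test("abc"));  // true
```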

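Assertion • ^ (start of input) • $ (end of input) • \b • Word boundary • A position with a word char (_, [0-9], [A-Za-z]) on one side and a non-word char or the edge of the input on the other • \B • Not \b • (?=expression) positive lookahead • (?!expression) negative lookahead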
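Assertions match positions rather than characters. The sketch below exercises each one; the sample strings are made up for illustration.

```javascript
const wholeCat = /^cat$/;
const wordCat = /\bcat\b/;

console.log(wholeCat.test("cat"));       // true
console.log(wholeCat.test("concat"));    // false
console.log(wordCat.test("a cat sat"));  // true
console.log(wordCat.test("concat"));     // false
console.log(wordCat.test("_cat_"));      // false: _ is a word char
console.log(/\Bcat/.test("concat"));     // true: no boundary between n and c

// Lookahead checks what follows without consuming it.
const beforePx = /\d+(?=px)/;
const notBeforePx = /\d+(?!\d|px)/;
console.log(beforePx.exec("12px 34em")[0]);     // 12
console.log(notBeforePx.exec("12px 34em")[0]);  // 34
```

The \d inside the negative lookahead stops the engine from backtracking to a shorter run of digits such as "1".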

quantifier • ? • + • * • {m,} • {m,n} • Each of the above becomes lazy (matches as little as possible) when followed by another ?
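The difference between greedy and lazy quantifiers shows up in how much text a match covers. A minimal sketch with made-up input:

```javascript
const html = "<b>bold</b>";

const greedyTag = html.match(/<.+>/)[0];
const lazyTag = html.match(/<.+?>/)[0];
console.log(greedyTag); // <b>bold</b>
console.log(lazyTag);   // <b>

const greedyRun = "aaaa".match(/a{2,}/)[0];
const lazyRun = "aaaa".match(/a{2,}?/)[0];
const boundedRun = "aaaa".match(/a{1,3}/)[0];
console.log(greedyRun);  // aaaa
console.log(lazyRun);    // aa
console.log(boundedRun); // aaa
```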

capture • () capturing group • (?:expression) non-capturing group
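Whether a group captures is visible in the array returned by exec. A small sketch:

```javascript
const capturing = /(\d+)-(\d+)/.exec("10-20");
const mixed = /(?:\d+)-(\d+)/.exec("10-20");

console.log(capturing.length, capturing[1], capturing[2]); // 3 10 20
console.log(mixed.length, mixed[1]);                       // 2 20
```

The non-capturing group still groups for quantifiers and alternation; it only skips the numbered slot.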

Atom Escape

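\c followed by a lowercase or uppercase letter (a control character, e.g. \cJ is line feed) • \a = a • because a has no designated special meaning after a backslash • The same holds for some other letters • \u002F (Unicode escape, here “/”) • \0 (the <NUL> character)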
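These escapes can be checked one by one. The sketch assumes non-Unicode mode (no u flag) in a web-compatible engine such as Node.js, where \a is accepted as an identity escape.

```javascript
const controlJ = /\cJ/.test("\n");           // \cJ is U+000A, line feed
const identityA = /\a/.test("a");            // assumption: no u flag, so \a means a
const unicodeSlash = /\u002F/.test("/");     // U+002F is "/"
const nulEscape = /\0/.test("\u0000");       // \0 is the NUL character

console.log(controlJ, identityA, unicodeSlash, nulEscape); // true true true true
```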

Character Class • [] • [abc] = {“a”,“b”,“c”} • [a-c] = {“a”,“b”,“c”} • [-ca] where - is literal • [ac-] where - is literal • [^a-c] = alphabet \ [a-c] (set difference)
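Each class form can be tested against single characters. In this sketch, ^ and $ force the class to account for the whole one-character input.

```javascript
const listed = /^[abc]$/;
const ranged = /^[a-c]$/;
const leadingHyphen = /^[-ca]$/;
const trailingHyphen = /^[ac-]$/;
const negated = /^[^a-c]$/;

console.log(listed.test("b"), ranged.test("b"));  // true true
console.log(leadingHyphen.test("-"));             // true: - is literal here
console.log(trailingHyphen.test("-"));            // true
console.log(trailingHyphen.test("b"));            // false: not a range
console.log(negated.test("d"), negated.test("a")); // true false
```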

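Escape Class • [\b] = {backspace} • [\]] = {“]”} • [\B] is an error under the specification grammar (engines that follow the web-compatibility rules tolerate it without the u flag) • [\1] is an error under the same grammar • Outside a class, \1 refers to the first captured group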
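Inside a class, \b changes meaning from word boundary to backspace. The sketch below also probes [\B] with the u flag, which is assumed here to be the mode in which engines enforce the strict grammar.

```javascript
const backspaceInClass = /[\b]/.test("\b");  // class: the backspace character U+0008
const boundaryOutside = /\b/.test("\b");     // assertion: no word boundary in this input
const bracketInClass = /[\]]/.test("]");

console.log(backspaceInClass); // true
console.log(boundaryOutside);  // false
console.log(bracketInClass);   // true

// Assumption: with the u flag, [\B] is rejected with a SyntaxError.
let unicodeModeRejects = false;
try {
  new RegExp("[\\B]", "u");
} catch (error) {
  unicodeModeRejects = error instanceof SyntaxError;
}
console.log(unicodeModeRejects);
```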

. any char except a line terminator (such as newline) • \d digit, [0-9] • \D not a digit • \w word char, [A-Za-z0-9_] • \W not a word char • \s whitespace • \S not whitespace
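The predefined classes are easiest to see on small inputs. A sketch with made-up sample strings:

```javascript
const dotLetter = /^.$/.test("a");
const dotNewline = /^.$/.test("\n");
console.log(dotLetter, dotNewline); // true false

const digitRuns = "Tel: 555 1234".match(/\d+/g);
const wordRuns = "a_1 b-2".match(/\w+/g);
const fields = "a \t\n b".split(/\s+/);
const digitsOnly = "12 ab".replace(/\D/g, "");

console.log(digitRuns);  // [ '555', '1234' ]
console.log(wordRuns);   // [ 'a_1', 'b', '2' ]
console.log(fields);     // [ 'a', 'b' ]
console.log(digitsOnly); // 12
```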

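Back reference • \1 (the text matched by the first captured group) • \0 is not a back reference • it matches <NUL> • \1000000000 • Under the specification this is an error if the pattern does not contain that many captured groups.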
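A back reference matches the same text that its group captured, not the same pattern. A minimal sketch:

```javascript
// \1 must repeat whatever (\w) captured.
const doubled = /(\w)\1/;
console.log(doubled.test("hello")); // true: "ll"
console.log(doubled.test("world")); // false

// The closing quote must be the same kind as the opening quote.
const quoted = /(['"]).*?\1/;
const quotedMatch = quoted.exec(`say "hi" now`);
console.log(quotedMatch[0]);               // "hi"
console.log(quoted.test(`say "hi' now`));  // false
```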

Literal

/ … /gim • where g is for global • i is for case-insensitive • m is for multiline
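The effect of each flag can be seen by matching the same text with different flag combinations. The sample text is made up.

```javascript
const sample = "Cat\ncat";

const globalOnly = sample.match(/cat/g);
const globalIgnoreCase = sample.match(/cat/gi);
const anchoredNoMultiline = sample.match(/^cat/gi);
const anchoredMultiline = sample.match(/^cat/gim);

console.log(globalOnly);          // [ 'cat' ]
console.log(globalIgnoreCase);    // [ 'Cat', 'cat' ]
console.log(anchoredNoMultiline); // [ 'Cat' ]: ^ matches only at the start of input
console.log(anchoredMultiline);   // [ 'Cat', 'cat' ]: ^ also matches after a newline
```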

Note: • // is parsed as the start of a line comment, not as an empty regex literal • Use /(?:)/ to write the empty pattern
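The empty pattern written as /(?:)/ matches the empty string at the start of any input. A short sketch:

```javascript
const emptyLiteral = /(?:)/;

console.log(emptyLiteral.test("anything"));       // true
console.log(emptyLiteral.exec("abc")[0] === "");  // true
console.log(emptyLiteral.exec("abc").index);      // 0

// The constructor accepts an empty string and reports the same source.
const emptyConstructed = new RegExp("");
console.log(emptyConstructed.source); // (?:)
```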

RegExp

RegExp. • RegExp is a function • It constructs regular expression objects • RegExp(pattern, flags) • new RegExp(pattern, flags) • RegExp.prototype
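Both call forms produce a RegExp object equivalent to a literal. In the string form every backslash must be doubled. The keyword variable is a hypothetical input used to show a pattern built at run time.

```javascript
const fromLiteral = /\d+/g;
const fromNew = new RegExp("\\d+", "g");
const fromCall = RegExp("\\d+", "g");

console.log(fromNew.source === fromLiteral.source);                // true
console.log(fromCall instanceof RegExp);                           // true
console.log(Object.getPrototypeOf(fromNew) === RegExp.prototype);  // true

// A pattern assembled at run time (keyword is a made-up value).
const keyword = "cat";
const dynamic = new RegExp("\\b" + keyword + "\\b", "i");
console.log(dynamic.test("A Cat sat")); // true
```

If the run-time text may contain regex metacharacters, it would need escaping before being placed in the pattern.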

RegExp.prototype. • constructor • exec • Returns the match as an array, or null if there is no match • Entries are ordered by the position of each opening ( • Index 0 holds the whole match, as if one implicit () surrounded the whole pattern • test • Returns a boolean • toString • Returns a string
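The three methods can be sketched on one pattern. The input strings are made up for illustration.

```javascript
const datePattern = /(\d{4})-(\d{2})/;

const dateMatch = datePattern.exec("on 2024-05");
console.log(dateMatch[0], dateMatch[1], dateMatch[2]); // 2024-05 2024 05
console.log(dateMatch.index);                          // 3
console.log(datePattern.exec("none"));                 // null

// Groups are numbered by the position of their opening parenthesis.
const nested = /((a)(b))(c)/.exec("abc");
console.log(nested.slice()); // [ 'abc', 'ab', 'a', 'b', 'c' ]

console.log(datePattern.test("2024-05")); // true
console.log(datePattern.toString());      // /(\d{4})-(\d{2})/
```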

Members of RegExp instance • source • global • ignoreCase • multiline • lastIndex (integer), with attributes { [[Writable]]: true, [[Enumerable]]: false, [[Configurable]]: false }.
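With the g flag, exec resumes from lastIndex and updates it after each match. The sketch below walks through "banana" and then inspects the property's attributes.

```javascript
const letterA = /a/g;
console.log(letterA.source, letterA.global, letterA.ignoreCase, letterA.multiline);
// a true false false

const positions = [];
let hit;
while ((hit = letterA.exec("banana")) !== null) {
  // After each match, lastIndex points just past the match.
  positions.push([hit.index, letterA.lastIndex]);
}
console.log(positions);         // [ [ 1, 2 ], [ 3, 4 ], [ 5, 6 ] ]
console.log(letterA.lastIndex); // 0: reset after the failed final attempt

const descriptor = Object.getOwnPropertyDescriptor(letterA, "lastIndex");
console.log(descriptor.writable, descriptor.enumerable, descriptor.configurable);
// true false false
```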

<ZWNJ> and <ZWJ> are format-control characters that are used to make necessary distinctions when forming words or phrases in certain languages.

The Unicode format-control characters (i.e., the characters in category “Cf” in the Unicode Character Database such as LEFT-TO-RIGHT MARK or RIGHT-TO-LEFT MARK) are control codes used to control the formatting of a range of text in the absence of higher-level protocols for this (such as mark-up languages). • All format control characters may be used within comments, and within string literals and regular expression literals. • In ECMAScript source text, <ZWNJ> and <ZWJ> may also be used in an identifier after the first character.

<BOM> is a format-control character used primarily at the start of a text to mark it as Unicode and to allow detection of the text's encoding and byte order. <BOM> characters intended for this purpose can sometimes also appear after the start of a text, for example as a result of concatenating files. <BOM> characters are treated as white space characters (see 7.2).

The special treatment of certain format-control characters outside of comments, string literals, and regular expression literals is summarised in Table 1. • Table 1: Format-Control Character Usage • Code Unit Value | Name | Formal Name | Usage • \u200C | Zero width non-joiner | <ZWNJ> | IdentifierPart • \u200D | Zero width joiner | <ZWJ> | IdentifierPart • \uFEFF | Byte Order Mark | <BOM> | Whitespace
