12. Regular Expressions
Motto: I don't play accurately - any one can play accurately - but I play with wonderful expression. As far as the piano is concerned, sentiment is my forte. I keep science for life. - Oscar Wilde
Concepts
• Regular Expressions
  • allow searching for a pattern within a text string
  • the patterns can be rather complex
  • same idea as "wildcard" characters – compare SQL – but much more expressive
  • often abbreviated, e.g. as RegExp
• RegExps match as much as possible
  • they are greedy
• Theoretical underpinnings
  • nondeterministic finite automata (NFA)
  • regular grammars
  • but some constructs extend the functionality further
    • even beyond CFG (context-free grammars)
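A minimal JavaScript sketch of the two claims above: greedy matching, and a construct that goes beyond regular languages. The sample strings are made up for illustration, and taking the backreference `\1` as one of the "constructs that extend the functionality" is an assumption, since the slide does not name them.

```javascript
// Greedy matching: .* takes as much text as possible while still
// allowing the rest of the pattern (the final z) to match.
const text = "a1z and a2z";            // hypothetical sample string
const greedy = text.match(/a.*z/);     // no g flag: first match only
console.log(greedy[0]);                // "a1z and a2z", not just "a1z"

// Assumption: a backreference (\1) is one construct that extends
// RegExps beyond regular grammars. \1 must repeat exactly the text
// that the first group captured.
const repeatedWord = /^(\w+) \1$/;
console.log(repeatedWord.test("hello hello")); // true
console.log(repeatedWord.test("hello world")); // false
```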
Support
• Popular, widely supported
• Directly in scripting languages
  • JavaScript
    • special syntax
  • PHP
    • functions
  • Ruby
  • Perl
• As libraries
  • Java's java.util.regex package
JavaScript RegExp
• Directly as an argument of methods of the String object
  • string.match(regexp)
    • returns an array of the substrings that matched the regexp pattern, null if there is no match
  • string.replace(regexp,by)
    • returns a new string where the first match (or, with the g modifier, every match) is replaced with the by string
  • string.search(regexp)
    • returns the index of the first substring that matched the regexp pattern, -1 if there is no match
  • string.split(regexp)
    • returns an array of the substrings of string separated by regexp
• regexp argument
  • enclosed in /
    • e.g., /ex/ matches the first occurrence of "ex"
  • optional modifiers placed as a suffix
    • g (global); used in match() and replace()
      • e.g., /ex/g matches all occurrences of "ex"
    • i (ignore case)
      • e.g., /ex/i matches the first occurrence of "ex", "EX", "Ex" or "eX"
    • m (multiline); ^ and $ also match at the beginning and end of each line
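The four String methods and the three modifiers can be tried out as follows. This is only a sketch; the sample strings are invented for illustration.

```javascript
const s = "ex Ex EX text";                 // hypothetical sample string

// match: without g only the first match, with g all matches
const first = s.match(/ex/);               // first[0] is "ex", first.index is 0
const all = s.match(/ex/g);                // ["ex", "ex"]
const allAnyCase = s.match(/ex/gi);        // ["ex", "Ex", "EX", "ex"]

// replace: first match only, or every match with g
const replacedOnce = s.replace(/ex/, "_"); // "_ Ex EX text"
const replacedAll = s.replace(/ex/g, "_"); // "_ Ex EX t_t"

// search: index of the first match, -1 if there is none
const found = s.search(/EX/);              // 6
const missing = s.search(/zz/);            // -1

// split: the separator is itself a pattern
const parts = "a, b;c".split(/[,;]\s*/);   // ["a", "b", "c"]

// m: ^ also matches right after a line break
const withM = "one\ntwo".search(/^two/m);  // 4
const withoutM = "one\ntwo".search(/^two/); // -1

console.log(first[0], all, allAnyCase, replacedOnce, replacedAll);
console.log(found, missing, parts, withM, withoutM);
```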
PHP RegExp
• functions with $regexp and $string arguments
  • ereg($regexp,$string [,&$matches])
    • returns the length of the matched string, false if there is no match
    • the array reference &$matches, if given, will be filled with the whole matched text in $matches[0] and the captured substrings in subsequent elements
  • ereg_replace($regexp,$by,$string)
    • returns a string where the matches of $regexp were replaced with the $by string
  • split($regexp,$string [,$limit])
    • returns an array of the substrings of $string that were separated by patterns matching $regexp
    • the optional $limit determines how many substrings to return (the last one contains the remainder)
  • eregi(), eregi_replace(), spliti()
    • same as ereg(), ereg_replace() and split(), but ignore case
  • preg_match($regexp,$string)
    • similar to ereg(), but uses Perl-compatible syntax with the pattern enclosed in delimiters, e.g. '/ex/i'; see PHP documentation
    • note that the ereg family was deprecated in PHP 5.3 and removed in PHP 7; the preg_ functions replace it
  • ereg() finds only the first match; if a global search for all matches is to be performed, it must be called in a loop (preg_match_all() does this in one call)
Syntax in JavaScript
• by "element" we mean a character or a group
• . any character (except a line break)
• ? zero or one occurrence of the preceding element
• * any number of occurrences of the preceding element, incl. none
  • e.g., a.*z matches the largest substring that starts with a and ends with z, incl. "az"
• + any number of occurrences of the preceding element, but at least one
  • e.g., a.+z matches the largest substring that starts with a and ends with z, not including "az"
  • note that "azz" and "aaz" are matched
• {n} exactly n occurrences of the preceding element
• {m,n} between m and n occurrences of the preceding element
• ^ beginning of the string
• $ end of the string
• a sequence of elements means that such a sequence must be matched
  • e.g., a.z matches "axz", "a5z", "aQz", etc.
• [] alternative elements
  • e.g., [ab] means a or b
• [^ ] none of the alternative elements
  • e.g., [^ab] means not a and not b
• - range (inside [])
  • e.g., [a-zA-Z] means a through z or A through Z, i.e. all lower-case and upper-case letters
• | or
  • e.g., ab|yz matches "ab" or "yz"
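The syntax elements listed above can be checked one by one with RegExp's test() method and the String methods. The following is a small sketch; all test strings are invented for illustration.

```javascript
// ? : zero or one occurrence of the preceding element
const optionalU = /^colou?r$/;
const colorOk = optionalU.test("color") && optionalU.test("colour"); // true

// * allows nothing between a and z, + requires at least one character
const starAz = /a.*z/.test("az");                        // true
const plusAz = /a.+z/.test("az");                        // false
const plusOthers = /a.+z/.test("azz") && /a.+z/.test("aaz"); // true

// {n} and {m,n}: counted repetition
const exactly3 = /^[0-9]{3}$/.test("123");               // true
const tooMany = /^[0-9]{2,4}$/.test("12345");            // false

// a sequence: a, then any one character, then z
const sequence = /a.z/.test("a5z");                      // true

// [] alternatives, [^ ] negation, - range
const notAB = "abc".match(/[^ab]/)[0];                   // "c"
const lettersOnly = /^[a-zA-Z]+$/.test("RegExp");        // true
const withDigit = /^[a-zA-Z]+$/.test("Reg3xp");          // false

// | : either alternative may match
const altIndex = "xyz".search(/ab|yz/);                  // 1

console.log(colorOk, starAz, plusAz, plusOthers, exactly3, tooMany);
console.log(sequence, notAB, lettersOnly, withDigit, altIndex);
```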
Special Characters
• Denoted by \
  • \/: /
  • \b: word boundary, i.e. the position between a word character and a non-word character (inside [] it means backspace)
  • \t: tab character
  • \n: line feed
  • \r: carriage return
  • \f: form feed
  • \s: whitespace character, roughly [ \t\r\n\f]
  • \d: digit, i.e. [0-9]
  • \w: word character, i.e. [a-zA-Z0-9_]
  • \S: not a whitespace character, i.e. [^\s]
  • \D: not a digit, i.e. [^\d]
  • \W: not a word character, i.e. [^\w]
• any other character preceded by \ means the character itself
• the "meta-characters" need to be escaped:
  • \\, \/, \[, \], \., \?, \|, \+, \*, \(, \), \^, \$, \-, \{, \}
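A short sketch of the character classes, the word boundary and escaping in use. The sample strings are hypothetical.

```javascript
// \d : digits
const numbers = "Room 42, floor 7".match(/\d+/g);       // ["42", "7"]

// \s : any whitespace (blank, tab, line feed, ...)
const words = "a b\tc\nd".split(/\s+/);                 // ["a", "b", "c", "d"]

// \w : letters, digits and underscore
const tokens = "snake_case-word".match(/\w+/g);         // ["snake_case", "word"]

// \b : word boundary, so "cat" matches only as a whole word
const wholeWord = /\bcat\b/.test("a cat sat");          // true
const insideWord = /\bcat\b/.test("concatenate");       // false

// Escaping: an unescaped . matches any character, \. only a literal dot
const unescapedDot = /3.14/.test("3x14");               // true
const escapedDot = /3\.14/.test("3x14");                // false

console.log(numbers, words, tokens);
console.log(wholeWord, insideWord, unescapedDot, escapedDot);
```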
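RegExp Capturing
• If you enclose subpattern(s) in ( and ) within a RegExp, the text they match will be captured, i.e. returned or used
  • e.g., \b(.*)@ will capture the first part of an email address (the text before the @)
  • because .* is greedy, the capture extends to the last @ in the string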
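A sketch of how captured text is "returned or used" in JavaScript. The addresses and the date are made-up examples. Using `$1`, `$2`, `$3` in the replacement string is standard JavaScript, but it is an addition to what the slide shows.

```javascript
// Returned: match() puts the whole match at index 0 and each
// captured group at index 1, 2, ...
const m = "alice@example.com".match(/\b(.*)@/);
console.log(m[0]);   // "alice@"
console.log(m[1]);   // "alice"

// Greedy capture: with two addresses, (.*) runs up to the LAST @
const two = "to: a@x.org, b@y.org".match(/\b(.*)@/);
console.log(two[1]); // "to: a@x.org, b"

// Used: $1, $2, $3 in the replacement refer to the captured groups
const reordered = "2024-01-15".replace(/(\d+)-(\d+)-(\d+)/, "$3/$2/$1");
console.log(reordered); // "15/01/2024"
```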
Sample RegExp
• hex digit:
  • [0-9a-fA-F]
• identifier:
  • [a-zA-Z_][a-zA-Z_0-9]*
• email address:
  • \b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b
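The sample patterns can be exercised in JavaScript as sketched below. The `^` and `$` anchors around the first two patterns are an assumption added here so that the whole string must match; the slide gives the patterns without them. All inputs are invented.

```javascript
// Anchored with ^ and $ (assumption) so the entire string is validated
const hexDigit = /^[0-9a-fA-F]$/;
const identifier = /^[a-zA-Z_][a-zA-Z_0-9]*$/;
// Email pattern exactly as on the slide
const email = /\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b/;

console.log(hexDigit.test("f"), hexDigit.test("g"));            // true false
console.log(identifier.test("_tmp1"), identifier.test("1abc")); // true false
console.log(email.test("jane.doe@example.org"));                // true
console.log(email.test("jane.doe@example"));                    // false

// The {2,4} limit rejects longer top-level domains
console.log(email.test("user@site.museum"));                    // false
```

The last line shows that this email pattern is a simplification: it is fine for a quick check, but it rejects some valid addresses.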