Perl regular expressions: string matching
string matching • For this lecture, we focus on string matching using an if statement • The format • if ($str =~ /pattern to match/) # true when match • if ($str !~ /pattern to match/) # true when no match • the same as • if ($str =~ m/pattern to match/) # true when match • if ($str !~ m/pattern to match/) # true when no match
simple matching • match a string or string variable • if($str =~ /dog/) • true if $str contains dog • If $str and the =~ (or !~) are left off, the match is done against $_
case insensitive matching • /i ignore case • if ($str =~ /dog/i) • true if $str contains dog. The match is case insensitive. • if ($str =~ /DOG/i) #same
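The forms covered so far can be tried out in a short script. This is a minimal sketch; the sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen only to exercise each form.
my $str = "My Dog has fleas";

my $plain   = ($str =~ /dog/)  ? 1 : 0;   # case-sensitive: "Dog" is not "dog"
my $nocase  = ($str =~ /dog/i) ? 1 : 0;   # /i ignores case
my $negated = ($str !~ /cat/)  ? 1 : 0;   # !~ is true when there is NO match

# With no variable and no =~, the match is applied to $_.
my $default = 0;
for ("hot dog") {
    $default = 1 if /dog/;
}

print "plain=$plain nocase=$nocase negated=$negated default=$default\n";
```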
alternation matching • | allows matching with an or • if ($str =~ /Fred|Wilma|Pebbles/) • True if contains Fred, Wilma, or Pebbles • if ($str =~/Fred|Wilma|Pebbles Flintstone/) • matches Fred, Wilma, or Pebbles Flintstone • Grouping • if ($str =~/(Fred|Wilma|Pebbles) Flintstone/) • matches Fred Flintstone, Wilma Flintstone, or Pebbles Flintstone • if ($str =~/(Blue|Song)bird/) • matches Bluebird or Songbird
alternation matching (2) • if ($str =~/th(is|at)/) • true if $str contains this or that • if ($str =~ /(p|g|m|s|b)et/) • true if $str contains: pet, get, met, set, or bet
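The effect of grouping on alternation can be checked directly. This is a sketch with made-up test names; qr// is used here only to store each pattern in a variable so it can be reused.

```perl
use strict;
use warnings;

# Without parentheses, " Flintstone" belongs only to the last alternative.
my $loose = qr/Fred|Wilma|Pebbles Flintstone/;
# With parentheses, " Flintstone" is required after any of the three names.
my $tight = qr/(Fred|Wilma|Pebbles) Flintstone/;

my %result;
for my $name ("Fred Rubble", "Fred Flintstone", "Pebbles Flintstone") {
    my $l = ($name =~ $loose) ? 1 : 0;
    my $t = ($name =~ $tight) ? 1 : 0;
    $result{$name} = [ $l, $t ];
    print "$name: loose=$l tight=$t\n";
}
```

"Fred Rubble" passes the ungrouped pattern because "Fred" alone is one of its alternatives, but it fails the grouped one.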
Single character matching • Use [] • if($str =~ /[abc]/) • true if $str contains a and/or b and/or c • if ($str =~ /[pgmsb]et/) • true if $str contains pet, get, met, set or bet • if($str =~/[Fred]/) • true if $str contains F and/or r and/or e and/or d • Not listed characters ^ character • if($str =~/[^abc]/) • true if $str contains at least one character that is not a, b, or c (so "abcx" matches, "abca" does not) • if($str =~/[a-z]/) • true if $str contains any lower case letter a through z
Single character or'd matching (2) • if ($str =~/[0-9]/) • true if $str contains any digit 0 through 9 • if ($str =~/[0-9\-]/) • matches 0 through 9 or the minus • if ($str =~/[a-z0-9\^]/) • matches any single lowercase letter or digit or ^ • if ($str =~/[a-zA-Z0-9_]/) • matches any single letter, digit, or underscore • if ($str =~/[^aeiouAEIOU]/) • true if $str contains at least one character that is not a vowel • if ($str !~ /[aeiouAEIOU]/) • true only if there are no vowels in $str
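The difference between a negated class and a negated match is easy to get wrong, so here is a small sketch with made-up strings that shows both.

```perl
use strict;
use warnings;

# [^...] matches ONE character that is not in the list; it does not
# assert that the listed characters are absent from the whole string.
my $has_other     = ("abcx"   =~ /[^abc]/)        ? 1 : 0;  # "x" is not a, b, or c
my $only_abc      = ("abca"   =~ /[^abc]/)        ? 1 : 0;  # every character is a, b, or c
my $no_vowels     = ("rhythm" !~ /[aeiouAEIOU]/)  ? 1 : 0;  # !~ : no vowel anywhere
my $has_nonvowel  = ("aei"    =~ /[^aeiouAEIOU]/) ? 1 : 0;  # all vowels, so no non-vowel
my $dash_or_digit = ("-"      =~ /[0-9\-]/)       ? 1 : 0;  # escaped minus is literal

print "$has_other $only_abc $no_vowels $has_nonvowel $dash_or_digit\n";
```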
matching quantifiers • multiple uses {min,max} • if ($str =~ /a{3}/) • true if $str contains aaa • common mistake • if($str =~ /Fred{3}/) • matches Freddd, not FredFredFred • if ($str =~/(Fred){3}/) • matches FredFredFred • if ($str =~/a{3,}/) • matches aaa, aaaa, aaaaa, aaaaaa, etc. • if ($str =~/a{3,5}b/) • matches aaab, aaaab, aaaaab
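The "common mistake" above can be verified with a few one-line checks. This is only a sketch; the test strings are made up.

```perl
use strict;
use warnings;

# A quantifier applies to the single item before it unless that item is a group.
my $freddd      = ("Freddd"       =~ /Fred{3}/)   ? 1 : 0;  # d{3}: three d's
my $fred3_wrong = ("FredFredFred" =~ /Fred{3}/)   ? 1 : 0;  # no "ddd" anywhere
my $fred3       = ("FredFredFred" =~ /(Fred){3}/) ? 1 : 0;  # the group repeats
my $range       = ("aaaaaab"      =~ /a{3,5}b/)   ? 1 : 0;  # unanchored: the last a's plus b
my $too_few     = ("aab"          =~ /a{3,5}b/)   ? 1 : 0;  # only two a's

print "$freddd $fred3_wrong $fred3 $range $too_few\n";
```

Note that "aaaaaab" (six a's) still matches /a{3,5}b/ because the pattern is not anchored and can start at a later a.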
matching quantifiers (2) • if ($str =~/a{0,5}/) • matches a, aa, aaa, aaaa, aaaaa, and also succeeds when there are no a's • if ($str =~/a*/) • * match 0 or more times (max match) • if ($str =~/a*?/) • *? match 0 or more times (min match) • Difference between min and max matching • $_ = "aaaa"; # all three patterns above match • The difference: * matches "aaaa", while *? matches the empty string (zero a's), since that is the fewest it can take • max matches as many characters as it can • while min matches as few characters as it can • This becomes important in the next lecture.
matching quantifiers (3) • + 1 or more times (max match) • +? 1 or more times (min match) • if ($str =~ /a+/) • true if there are 1 or more "a"s • ? match 0 or 1 time (max match) • ?? match 0 or 1 time (min match) • if ($str =~ /a?/) • true if there is 1 "a" or no "a"s (so on its own it is always true) • Also {3,5}? min match • tries to match only 3 where possible • and {3,5} max match • tries to match 5 where possible
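To see what each quantifier actually takes, the sketch below records the matched text for the max and min form of each one. It relies on two things introduced later in the lecture: $& (the text that matched) and interpolating a variable into the pattern. The hash name %took is made up.

```perl
use strict;
use warnings;

my $s = "aaaaaa";   # six a's
my %took;           # quantifier => text it matched

for my $q ('a*', 'a*?', 'a+', 'a+?', 'a?', 'a??', 'a{3,5}', 'a{3,5}?') {
    $took{$q} = $& if $s =~ /$q/;
}

for my $q (sort keys %took) {
    printf "%-8s took %d character(s)\n", $q, length $took{$q};
}
```

The max forms take as many a's as the quantifier allows; the min forms take the smallest count that still lets the match succeed, which for *? and ?? is zero.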
matching quantifiers (4) • if ($str =~ /fo+ba?r/) • matches f, 1 or more o's, b, 0 or 1 a, then an r • match: fobar, foobar, foobr • Non-match: fbar (missing o), foobaar (too many a's) • if ($str =~ /fo*ba?r/) • matches f, 0 or more o's, b, 0 or 1 a, then an r • match: fobar, fbr, fooobr, etc… • Inside [], matching quantifiers are "normal" characters. • if ($str =~/[.?!+]*/) • matches zero or more ., ?, !, or +
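The match and non-match lists above can be reproduced by filtering a word list with each pattern. A small sketch, using grep (not covered in this lecture) to keep the words that match:

```perl
use strict;
use warnings;

my @words = qw(fobar foobar foobr fbar foobaar fbr);

my @plus = grep { /fo+ba?r/ } @words;   # needs at least one o
my @star = grep { /fo*ba?r/ } @words;   # o is optional

print "fo+ba?r: @plus\n";
print "fo*ba?r: @star\n";
```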
Exercise 7 • What will the following match? • /a+[bc]/ • /(a|be)t/i • /Hi{1,3} There\!?/ • /(Foo)?Bar/i • /[1-9][1-9][a-z]*/ • /[a-zA-Z]+, [A-Z]{2} [0-9]{5}/ • Write a regular expression for each of these • Match a social security number (with or without dashes) • A street address: number Name with either St, Ln, Rd or nothing. Also case insensitive
metasymbols • . match one character (except newline) • if($str =~ /./) • Always true, except when $str is "" or contains only newlines • if ($str =~ /d.g/) • true for d, then any character, then g • so dog, dbg, dag, dcg, d g, etc. • if ($str =~ /d.*g/) • true for d, then 0 or more characters, then g • so dg, dog, dasdfg, d g, etc. • if ($str =~ /d.+g/) • true for d, then 1 or more characters, then g • so NOT dg, but the rest: dog, dasdfg, d g, etc.
metasymbols (2) • if ($str =~ /d.?g/) • true for d and any single character and g AND dg • if ($str =~ /d.{0,1}g/) • true for d and any single character and g AND dg • same as above • if ($str =~ /d.{2}g/) • true for d and 2 characters and g • so doog, dafg, dghg, etc… • if ($str =~ /d.{2,5}g/) • true for d and 2 to 5 characters and g • so dooog, doog, dXXXXXg, dXobgg, etc…
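The dot patterns from the last two slides can be compared side by side. This is a sketch with a made-up candidate list; the pattern strings are interpolated into the match, which is covered later in the lecture.

```perl
use strict;
use warnings;

my @candidates = ("dg", "dog", "d g", "doog");
my %matched;    # pattern text => list of candidates it matched

for my $p ('d.g', 'd.*g', 'd.+g', 'd.?g', 'd.{2}g') {
    $matched{$p} = [ grep { /$p/ } @candidates ];
    print "$p: ", join(", ", map { "'$_'" } @{ $matched{$p} }), "\n";
}

# The dot does not match a newline (unless /s is used).
my $across_newline = ("d\ng" =~ /d.g/) ? 1 : 0;
```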
metasymbols (3) • Anchoring • ^ beginning of the string (only when it is not inside []) • $ end of the string • if ($str =~ /^dog$/) • true only for "dog", not "ddogg" • if ($str =~ /^dog/) • true only when the string starts with "dog" • so "dog", "doga", etc.
metasymbols (4) • if ($str =~ /dog$/) • true when the string ends with "dog" • "dog", "asdfadfdog", "ddddooodog" • if ($str =~ /^.$/) • true when the string is one character long and that character is not the newline • if ($str =~/^[abc]+/) • true when the string starts with • "a", "aa", "aaa", etc. with any characters following • "b", "bb", "bbb", etc. with any characters following • "c", "cc", "ccc", etc. with any characters following • As well as any combination of a's, b's, and c's • "abcabc", etc.
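A quick way to get a feel for anchors is to filter the same list with each form. A sketch with made-up strings:

```perl
use strict;
use warnings;

my @strings = ("dog", "doga", "ddogg", "hotdog");

my @exact  = grep { /^dog$/ } @strings;   # the whole string is "dog"
my @starts = grep { /^dog/  } @strings;   # begins with "dog"
my @ends   = grep { /dog$/  } @strings;   # ends with "dog"

# ^[abc]+ only constrains the front of the string.
my $abc_prefix = ("abcabc-anything" =~ /^[abc]+/) ? 1 : 0;
my $abc_late   = ("xabc"            =~ /^[abc]+/) ? 1 : 0;

print "exact: @exact\nstarts: @starts\nends: @ends\n";
```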
metasymbols (5) • \d match a Digit [0-9] • \D match a Nondigit [^0-9] • \s match whitespace [ \t\n\r\f] • \S match a Nonwhitespace [^ \t\n\r\f] • \w match a Word character [a-zA-Z0-9_] • \W match a Non word Character [^a-zA-Z0-9_]
Examples • if ($str =~ /\d/) # true when $str contains a digit • if ($str =~ /\d+/) # true when $str contains 1 or more digits • if ($str =~/\w\d/) # true when $str contains a word character followed by a digit • if ($str =~/\w+\d/) # true when $str contains 1 or more word characters followed by a digit • true for "abc1" "a1" "11" "_9" "Z8" and "a1a1" • if ($str =~/^\s\w\d/) • true when it starts with a whitespace, then a word character, and then a digit • " 11" "\ta1" "\n11" etc. • if ($str =~/^\s*\w\d/) • true when it starts with 0 or more whitespaces, then a word character, and then a digit • " 11" "11" " \t a1" etc.
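The examples above can be run as-is. The sketch below uses a made-up sample list to show how \s versus \s* changes which strings pass.

```perl
use strict;
use warnings;

my @samples = (" 11", "11", " \t a1", "a b1");

my @one_space  = grep { /^\s\w\d/  } @samples;   # exactly one leading whitespace
my @any_spaces = grep { /^\s*\w\d/ } @samples;   # zero or more leading whitespace

# \w+\d needs at least two characters: one word character, then a digit.
my $single_digit = ("9"    =~ /\w+\d/) ? 1 : 0;
my $mixed        = ("a1a1" =~ /\w+\d/) ? 1 : 0;

print scalar(@one_space), " vs ", scalar(@any_spaces), "\n";
```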
boundaries assertions • \b matches at any word boundary • as defined by \w and \W • \B matches at any non word boundary • as defined by \W and \w /\bis\b/ #matches "what it is" and "that is it" • similar to /\Wis\W/, except that \W has to consume a character, so it fails at the very start or end of the string • won't match "tist" /\Bis\B/ #matches "thistle" and "artist" • can also be written as /\wis\w/ • won't match "that is it"
boundaries assertions (2) /\bis\B/ #matches "istanbul" and "so--isn't that" • similar to /\Wis\w/ • but that won't match "istanbul", because "is" is at the front of the string and there is nothing for \W to match. • Since \w is [a-zA-Z0-9_], all punctuation counts as a word boundary. • So /\bisn\B/ won't match "isn't", because ' is not a word character /\Bis\b/ #matches "this" and "this is for you" • similar to /\wis\W/ • For the second example, the match is the "is" inside "this", not the separate word "is". • As in the example above, \W won't match at the end of a string.
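The four \b and \B combinations each pick out a different word from the same list. This is a sketch with made-up phrases; it also shows where the \W look-alike pattern falls short.

```perl
use strict;
use warnings;

my @phrases = ("that is it", "thistle", "istanbul", "this");

my @whole  = grep { /\bis\b/ } @phrases;   # "is" as a separate word
my @inside = grep { /\Bis\B/ } @phrases;   # "is" with word characters on both sides
my @front  = grep { /\bis\B/ } @phrases;   # a word that starts with "is"
my @back   = grep { /\Bis\b/ } @phrases;   # a word that ends with "is"

# \b is zero-width, so it works at the end of the string; \W is not.
my $b_at_end = ("it is" =~ /\bis\b/) ? 1 : 0;
my $W_at_end = ("it is" =~ /\Wis\W/) ? 1 : 0;

print "whole: @whole\ninside: @inside\nfront: @front\nback: @back\n";
```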
Exercise 8 • What will the following match? • /a+\w*?/ • /\w\s*\w+/ • /\bHi\bThere/ • /\b\w+\b.+There[!]?$/i • /^\d+[a-z]*/ • /\w+,\s\w{2}\s{2}\d{5}/ • Write a regular expression for each of these • Rewrite #6 so the city can be two or more words. • Must start with a letter, then have any number of letters and/or numbers or none at all, but end with a number
Parentheses as memory • special variables \1 .. \9 • \1 holds the text matched by the first () if ($str =~ /(\d)asdf\1/) • true when $str has a digit, then asdf, then the same digit • examples: 1asdf1, 3asdf3 if($str =~ /(\w+)(\d+)as\2\1/) • true for a word, then digits, as, same digits, then same word • examples: "hi12as12hi" "1_31as311_"
Parentheses as memory (2) if ($str =~ /(\d)+asdf\1/) • Note: (\d)+ is different from (\d+) • (\d+) match max digits, goes into \1 • (\d)+ match a digit, but last match goes into \1 • examples: • (\d)+ on 123, \1 = 3, but the match is on 123 • So 123asdf3 would match from the top if • In the next lecture, it does some strange things on substitutions.
Parentheses as memory (3) • parentheses around parentheses • if ($str =~ /((\w+) (\w+))/) • \1, \2, \3 are bound to values $str = "Hi There"; \2 = "Hi", \3 = "There", \1="Hi There" • Perl numbers the groups by the position of their opening parenthesis, counting from the left: the outermost ( is 1, the first (\w+) is 2, the second (\w+) is 3 • (((\w+) )(\w+)) has \1, \2, \3, \4 • the first three ( are groups 1, 2, 3 and the ( of the last (\w+) is group 4 • \1 = "Hi There", \2 = "Hi ", \3 ="Hi", \4 = "There"
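The backreference and numbering rules can be checked with a short script. This sketch uses $1 and a match in list context, both of which are explained a few slides later; the variable names are made up.

```perl
use strict;
use warnings;

# \1 must repeat exactly what group 1 captured.
my $same_digit = ("3asdf3" =~ /(\d)asdf\1/) ? 1 : 0;
my $diff_digit = ("3asdf4" =~ /(\d)asdf\1/) ? 1 : 0;

# (\d)+ repeats the group, so group 1 keeps only the LAST digit it matched.
my $last_digit = ("123asdf3" =~ /(\d)+asdf\1/) ? $1 : undef;

# Groups are numbered by their opening parenthesis, left to right.
my @three = ("Hi There" =~ /((\w+) (\w+))/);
my @four  = ("Hi There" =~ /(((\w+) )(\w+))/);

print join("|", @three), "\n";
print join("|", @four), "\n";
```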
Variable Interpolation • Using variables inside the match • $find = "abc"; • if ($str =~/$find/) • matches when $str contains the value of $find • $dog = "dog"; $str = "ddogg"; • if($str =~ /\w$dog\w/) • true if $str contains the string in $dog with a word character in front of it and behind it.
Variable interpolation (2) • Variable interpolation sometimes does more than you think. • Example
    $match = "hi*";
    $x = "h";
    if ($x =~ /$match/) {
        print "$match matches $x\n";
    } else {
        print "Failed as it should\n";
    }
• Output: hi* matches h • Why? The literal text hi* doesn't appear in h.
Variable interpolation (3) • Answer: the variable is interpolated as a regex, • so hi* looks for an h followed by zero or more lower case i's. • The following example leads to an error • $match = "*hi*"; • if ($x =~ /$match/) { • Perl errors out saying Quantifier follows nothing in regex; … • To stop the interpolated variable from being treated as regex metacharacters • \Q … \E • Everything between \Q and \E is matched as literal characters instead of being interpreted as regex.
Variable interpolation (4)
    $match = "hi*";
    $x = "h";
    if ($x =~ /\Q$match\E/) {
        print "$match matches $x\n";
    } else {
        print "Failed as it should\n";
    }
• Output: Failed as it should
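The three cases from these slides (interpolated as regex, the "*hi*" error, and \Q…\E) can be put into one runnable sketch. The eval block is there only so the script can catch the regex error instead of dying; quotemeta is the function form of \Q. Variable names other than $match and $x are made up.

```perl
use strict;
use warnings;

my $match = "hi*";
my $x     = "h";

my $as_regex    = ($x =~ /$match/)                  ? 1 : 0;  # h followed by zero or more i
my $literal     = ($x =~ /\Q$match\E/)              ? 1 : 0;  # needs the text "hi*"
my $literal_hit = ("say hi* now" =~ /\Q$match\E/)   ? 1 : 0;

# quotemeta() returns the escaped string that \Q would produce.
my $escaped = quotemeta($match);

# A pattern that starts with a quantifier is a runtime error when interpolated.
my $bad   = "*hi*";
my $ok    = eval { my $r = ($x =~ /$bad/); 1 };
my $error = $ok ? "" : $@;

print "as_regex=$as_regex literal=$literal literal_hit=$literal_hit escaped=$escaped\n";
print "error: $error" if $error;
```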
Special Read-Only variables • We've seen \1 .. \9. They only have a value inside the match. But $1 .. $9 hold the same values (as \1 .. \9) after the match if ($str =~ /(\d+)asd\1/) { print "matched $1 \n"; } • If $str = "123asd123", then the output would be matched 123
Special Read-Only variables (2) • $` (backquote) • The characters to the left of the match • $' (quote) • The characters to the right of the match • $& • The characters that matched • Helpful for debugging your regex.
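A sketch of how these three variables split a string around the match (the sample sentence is made up):

```perl
use strict;
use warnings;

my $str = "the quick brown fox";
my ($pre, $hit, $post);

if ($str =~ /qu\w+/) {
    # Copy the values right away; the next successful match overwrites them.
    ($pre, $hit, $post) = ($`, $&, $');
}

print "[$pre][$hit][$post]\n";
```

Printing the three pieces in brackets like this is a handy debugging habit, because it makes stray whitespace at the edges of the match visible.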
capturing matches • $str = "a xxx c xxxxxc xxx d"; • ($a, $b) = ($str =~ m/(.+)x(.+)c/); • $a = "a xxx c xxx"; Also $1 = "a xxx c xxx"; • $b = "x"; Also $2 = "x";
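The same list-context capture is commonly used to pull fields out of a line. This is a sketch with a made-up "key = value" format, not something from the lecture.

```perl
use strict;
use warnings;

my $pattern_line = "name = Fred Flintstone";

# In list context a successful match returns the captured groups.
my ($key, $value) = ($pattern_line =~ /^(\w+)\s*=\s*(.+)$/);

# When the match fails the list is empty, so both variables stay undefined.
my ($k2, $v2) = ("no equals sign here" =~ /^(\w+)\s*=\s*(.+)$/);

print "key=$key value=$value\n";
print "second line did not match\n" unless defined $k2;
```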
match as a true value • / / returns a true/false value • which lets us build a switch-like structure
    SWITCH: {
        $str =~ /abc/ && do { $a = 1; last SWITCH; };
        $str =~ /def/ and do { $d = 1; last SWITCH; };
        $c = 1;
    }
• Strange looking code. Also, this is one of the very few places a ; is needed after a } • NOTE either && or and could be used.
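Wrapped in a subroutine, the same idiom becomes easy to test. This is a sketch; the subroutine name classify and the labels it returns are made up for illustration.

```perl
use strict;
use warnings;

# Returns a label for the first pattern that matches, or "other".
sub classify {
    my ($str) = @_;
    my $kind;
    SWITCH: {
        $str =~ /abc/ && do { $kind = "abc"; last SWITCH; };
        $str =~ /def/ and do { $kind = "def"; last SWITCH; };
        $kind = "other";    # default case: reached only if nothing matched
    }
    return $kind;
}

print classify("xxabcxx"), " ", classify("undefined"), " ", classify("zzz"), "\n";
```

"undefined" is classified as "def" because the pattern is not anchored and the string contains def.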
Commenting your matches • /x ignores most whitespace in the pattern and allows comments
    /\w+:    # match a word and a colon
     (       # begin group
     \s+     # match one or more spaces
     \w+     # match another word
     )       # end group
     \s*     # match zero or more spaces
     \d+     # match 1 or more digits
    /x;
• same as /\w+:(\s+\w+)\s*\d+/;
Commenting your matches (2) • Be careful in comments that you don't use /, otherwise perl thinks it is the end of the match • You have to think about where the whitespace is in the match. • If you need to match a #, use \#
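Here is a runnable sketch of a commented pattern that follows the cautions above: no delimiter character inside the comments, whitespace matched explicitly with \s, and a literal # escaped. The input line and the variable names are made up.

```perl
use strict;
use warnings;

my $re = qr/
    (\w+) :      # a word followed by a colon
    ( \s+ \w+ )  # one or more spaces, then another word
    \s*          # optional spaces
    (\d+)        # one or more digits
    \#           # a literal hash sign, escaped because of the x flag
/x;

my @parts = ("item: widget 42#" =~ $re);

print join("|", @parts), "\n";
```

The second capture includes the leading space, because the \s+ sits inside the parentheses.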
more flags for pattern matching • matching with newlines in the string • //s lets the . match the newline (\n) • $str = "asdf\n asdf\n"; • /(f.)/; # no match • /(f.)/s; # $1 = "f\n"; • //m lets ^ and $ match next to an embedded \n • $str = "af\nasdf\n"; • /(af$)/; # won't match • /(af$)/m; # $1 = "af"; • /^(as)/; # won't match • /^(as)/m; # matches, $1 = "as"; • /(f.)$/ms; # matches only the last "f\n": the . consumes the \n, so $ can no longer match before that newline and has to match at the very end of the string.
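The /s and /m cases above can be confirmed with the same two-line string. This sketch also uses $-[0], the offset where the last match started, to show which "f\n" the final pattern picked; the variable names are made up.

```perl
use strict;
use warnings;

my $str = "af\nasdf\n";

my $dollar_plain = ($str =~ /af$/)  ? 1 : 0;   # $ only near the end of the string
my $dollar_multi = ($str =~ /af$/m) ? 1 : 0;   # m: $ also before an embedded newline
my $caret_plain  = ($str =~ /^as/)  ? 1 : 0;   # ^ only at the start of the string
my $caret_multi  = ($str =~ /^as/m) ? 1 : 0;   # m: ^ also after an embedded newline
my $dot_plain    = ($str =~ /f./)   ? 1 : 0;   # every f is followed by a newline
my $dot_single   = ($str =~ /f./s)  ? 1 : 0;   # s: the dot may match the newline

# With both flags, only the final f plus newline satisfies the pattern.
my $offset = ($str =~ /(f.)$/ms) ? $-[0] : -1;

print "$dollar_plain $dollar_multi $caret_plain $caret_multi $dot_plain $dot_single offset=$offset\n";
```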
Pattern Delimiters • if ($str =~ /\/usr\/local/) • true if $str contains /usr/local • To avoid backslashing / we can change the delimiter • choose another delimiter, which is a nonalphanumeric character, such as %%, ##, {}, [], <>, etc. • must use the m in front of the match so perl knows what you want • if ($str =~ m%/usr/local%) • true if $str contains /usr/local • if ($str =~ m[/usr/local]) • true if $str contains /usr/local, but confusing since it can be mistaken for [] single character matching.
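A sketch showing that the choice of delimiter changes only how the pattern is written, not what it matches (the path is a made-up example):

```perl
use strict;
use warnings;

my $path = "/usr/local/bin/perl";

my $slashes = ($path =~ /\/usr\/local/) ? 1 : 0;   # default delimiter, escaped slashes
my $percent = ($path =~ m%/usr/local%)  ? 1 : 0;   # same pattern, no escaping needed
my $braces  = ($path =~ m{/usr/local})  ? 1 : 0;   # paired delimiters also work

# Captures behave the same way with any delimiter.
my $next_dir = ($path =~ m!^/usr/local/(\w+)!) ? $1 : undef;

print "$slashes $percent $braces $next_dir\n";
```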
Exercise 9 • Which of the following will this match? /\-?(\|)?m\(\d+\)\1/i • "-|m(12)|" • "|M(12)|" • "-|M(12)" • "m(12)|" • "M(12)" • And for /\-?(\|)?m\(\d+\)\1?/i • "|m(12)|" • "m(12)" • "-|m(12)|" • "-|m(12)"
Q & A