Perl

snorris
Presentation Transcript


  1. Perl regular expressions: string matching

  2. string matching • For this lecture, we focus on string matching using an if statement • The format • if ($str =~ /pattern to match/) # true when match • if ($str !~ /pattern to match/) # true when no match • the same as • if ($str =~ m/pattern to match/) # true when match • if ($str !~ m/pattern to match/) # true when no match
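A minimal runnable sketch of the two operators. The sample string and the variable names are made up for illustration only.

```perl
use strict;
use warnings;

my $str = "The quick brown dog";

# =~ is true when the pattern is found somewhere in the string
my $has_dog   = ($str =~ /dog/)  ? 1 : 0;
# !~ is true when the pattern is NOT found
my $no_cat    = ($str !~ /cat/)  ? 1 : 0;
# m/.../ is the explicit spelling of /.../
my $has_dog_m = ($str =~ m/dog/) ? 1 : 0;

print "has dog: $has_dog, no cat: $no_cat, m form: $has_dog_m\n";
```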

  3. simple matching • match a string or string variable • if ($str =~ /dog/) • true if $str contains dog • If $str and the =~ operator are left off, the pattern is matched against $_

  4. case insensitive matching • /i ignore case • if ($str =~ /dog/i) • true if $str contains dog. The match is case insensitive. • if ($str =~ /DOG/i) #same

  5. alternation matching • | allows matching with an or • if ($str =~ /Fred|Wilma|Pebbles/) • True if contains Fred, Wilma, or Pebbles • if ($str =~/Fred|Wilma|Pebbles Flintstone/) • matches Fred, Wilma, or Pebbles Flintstone • Grouping • if ($str =~/(Fred|Wilma|Pebbles) Flintstone/) • matches Fred Flintstone, Wilma Flintstone, or Pebbles Flintstone • if ($str =~/(Blue|Song)bird/) • matches Bluebird or Songbird

  6. alternation matching (2) • if ($str =~/th(is|at)/) • true if $str contains this or that • if ($str =~ /(p|g|m|s|b)et/) • true if $str contains: pet, get, met, set, or bet
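The effect of grouping on alternation can be checked with a short script. This is only a sketch: the helper hit() and the test strings such as "Fred Rubble" are assumptions added for illustration, not part of the lecture.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when the pattern matches, 0 otherwise.
sub hit { my ($s, $re) = @_; return $s =~ $re ? 1 : 0; }

# Without parentheses, "Flintstone" belongs only to the last alternative.
my $ungrouped = qr/Fred|Wilma|Pebbles Flintstone/;
# With parentheses, "Flintstone" must follow whichever name matched.
my $grouped   = qr/(Fred|Wilma|Pebbles) Flintstone/;

my @results = (
    hit("Fred Rubble",      $ungrouped),      # 1: "Fred" alone is enough
    hit("Fred Rubble",      $grouped),        # 0: no " Flintstone" after Fred
    hit("Wilma Flintstone", $grouped),        # 1
    hit("Pebbles",          $ungrouped),      # 0: third alternative needs the surname
    hit("that",             qr/th(is|at)/),   # 1
);
print "@results\n";   # prints: 1 0 1 0 1
```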

  7. Single character matching • Use [] • if ($str =~ /[abc]/) • true if $str contains a and/or b and/or c • if ($str =~ /[pgmsb]et/) • true if $str contains pet, get, met, set, or bet • if ($str =~ /[Fred]/) • true if $str contains F and/or r and/or e and/or d • Characters not listed: a leading ^ inside [] • if ($str =~ /[^abc]/) • true if $str contains at least one character other than a, b, or c (so "cab" fails, but "cabs" matches on the s) • if ($str =~ /[a-z]/) • true if $str contains any lowercase letter a through z

  8. Single character matching (2) • if ($str =~ /[0-9]/) • true if $str contains any digit 0 through 9 • if ($str =~ /[0-9\-]/) • matches 0 through 9 or the minus sign • if ($str =~ /[a-z0-9\^]/) • matches any single lowercase letter or digit or ^ • if ($str =~ /[a-zA-Z0-9_]/) • matches any single letter, digit, or underscore • if ($str =~ /[^aeiouAEIOU]/) • true if $str contains any character that is not a vowel (a consonant, digit, space, or punctuation mark) • if ($str !~ /[aeiouAEIOU]/) • true only if there are no vowels in $str
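A negated class such as [^abc] and a negated match with !~ are easy to confuse, so here is a small sketch contrasting them. The helper hit() and the sample words are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when the pattern matches, 0 otherwise.
sub hit { my ($s, $re) = @_; return $s =~ $re ? 1 : 0; }

my @results = (
    hit("cab",    qr/[^abc]/),                 # 0: every character is a, b, or c
    hit("cabs",   qr/[^abc]/),                 # 1: the s is outside the class
    hit("x-1",    qr/[0-9\-]/),                # 1: matches the minus sign
    ("rhythm" !~ /[aeiouAEIOU]/) ? 1 : 0,      # 1: the string has no vowels at all
    hit("rhythm", qr/[^aeiouAEIOU]/),          # 1: it has at least one non-vowel
);
print "@results\n";   # prints: 0 1 1 1 1
```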

  9. matching quantifiers • multiple uses {min,max} • if ($str =~ /a{3}/) • true if $str contains aaa • common mistake • if($str =~ /Fred{3}/) • matches Freddd, not FredFredFred • if ($str =~/(Fred){3}/) • matches FredFredFred • if ($str =~/a{3,}/) • matches aaa, aaaa, aaaaa, aaaaaa, etc. • if ($str =~/a{3,5}b/) • matches aaab, aaaab, aaaaab

  10. matching quantifiers (2) • if ($str =~ /a{0,5}/) • matches a, aa, aaa, aaaa, aaaaa, and also succeeds when there are no a's • if ($str =~ /a*/) • * match 0 or more times (max match) • if ($str =~ /a*?/) • *? match 0 or more times (min match) • Difference between min and max matching • $_ = "aaaa"; # all three patterns above match • The difference: * matches "aaaa", while *? matches the empty string (zero a's) • max matches as many characters as it can • while min matches as few characters as it can • This becomes important in the next lecture.

  11. matching quantifiers (3) • + 1 or more times (max match) • +? 1 or more times (min match) • if ($str =~ /a+/) • true if there are 1 or more "a"s • ? match 0 or 1 time (max match) • ?? match 0 or 1 time (min match) • if ($str =~ /a?/) • true if there is 1 "a" or no "a"s (so on its own it always succeeds) • Also {3,5}? min match • tries to match only 3 where possible • and {3,5} max match • tries to match 5 where possible
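To see what max (greedy) and min (lazy) quantifiers actually consume, the matched text can be captured with parentheses. A small sketch on the "aaaa" example; the variable names are illustrative.

```perl
use strict;
use warnings;

my $str = "aaaa";

# In list context a match returns its captures, so each variable
# receives exactly what the quantified part consumed.
my ($star)       = $str =~ /(a*)/;        # greedy: takes all four a's
my ($star_lazy)  = $str =~ /(a*?)/;       # lazy: takes nothing at all
my ($plus_lazy)  = $str =~ /(a+?)/;       # lazy, but needs at least one a
my ($range)      = $str =~ /(a{2,3})/;    # greedy within the range
my ($range_lazy) = $str =~ /(a{2,3}?)/;   # lazy within the range

print "[$star] [$star_lazy] [$plus_lazy] [$range] [$range_lazy]\n";
# prints: [aaaa] [] [a] [aaa] [aa]
```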

  12. matching quantifiers (4) • if ($str =~ /fo+ba?r/) • matches f, 1 or more o's, b, 0 or 1 a, then an r • match: fobar, foobar, foobr • Non-match: fbar (missing o), foobaar (too many a's) • if ($str =~ /fo*ba?r/) • matches f, 0 or more o's, b, 0 or 1 a, then an r • match: fobar, fbr, fooobr, etc… • Inside [], matching quantifiers are "normal" characters. • if ($str =~ /[.?!+]*/) • matches zero or more ., ?, !, or +
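The match and non-match lists for these two patterns can be verified mechanically. A sketch, with hit() as a hypothetical helper:

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when the pattern matches, 0 otherwise.
sub hit { my ($s, $re) = @_; return $s =~ $re ? 1 : 0; }

my $one_or_more  = qr/fo+ba?r/;
my $zero_or_more = qr/fo*ba?r/;

my @results = (
    hit("fobar",   $one_or_more),    # 1
    hit("foobar",  $one_or_more),    # 1
    hit("foobr",   $one_or_more),    # 1
    hit("fbar",    $one_or_more),    # 0: o+ needs at least one o
    hit("foobaar", $one_or_more),    # 0: a? allows at most one a
    hit("fbr",     $zero_or_more),   # 1: o* and a? may both match nothing
    hit("fbar",    $zero_or_more),   # 1
);
print "@results\n";   # prints: 1 1 1 0 0 1 1
```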

  13. Exercise 7 • What will the following match? • /a+[bc]/ • /(a|be)t/i • /Hi{1,3} There\!?/ • /(Foo)?Bar/i • /[1-9][1-9][a-z]*/ • /[a-zA-Z]+, [A-Z]{2} [0-9]{5}/ • Write a regular expression for each of these • Match a social security number (with or without dashes) • A street address: number, then name, followed by St, Ln, Rd, or nothing. It should also be case insensitive

  14. metasymbols • . match one character (except newline) • if ($str =~ /./) • true unless $str is empty or contains only newlines • if ($str =~ /d.g/) • true for d, any character, then g • so dog, dbg, dag, dcg, d g, etc. • if ($str =~ /d.*g/) • true for d, 0 or more characters, then g • so dg, dog, dasdfg, d g, etc. • if ($str =~ /d.+g/) • true for d, 1 or more characters, then g • so NOT dg, but the rest: dog, dasdfg, d g, etc.

  15. metasymbols (2) • if ($str =~ /d.?g/) • true for d, any single character, then g, AND for dg • if ($str =~ /d.{0,1}g/) • true for d, any single character, then g, AND for dg • same as above • if ($str =~ /d.{2}g/) • true for d, 2 characters, then g • so doog, dafg, dghg, etc… • if ($str =~ /d.{2,5}g/) • true for d, 2 to 5 characters, then g • so dooog, doog, dXXXXXg, dXobgg, etc…

  16. metasymbols (3) • Anchoring • ^ beginning of the string (only when not inside []) • $ end of the string (or just before a newline at the very end) • if ($str =~ /^dog$/) • true for "dog", not "ddogg" • if ($str =~ /^dog/) • true only when the string starts with "dog" • so "dog", "doga", etc.

  17. metasymbols (4) • if ($str =~ /dog$/) • true when the string ends with "dog" • "dog", "asdfadfdog", "ddddooodog" • if ($str =~ /^.$/) • true when the string is a single non-newline character (optionally followed by one trailing newline) • if ($str =~ /^[abc]+/) • true when the string starts with • "a", "aa", "aaa", etc. with any characters following • "b", "bb", "bbb", etc. with any characters following • "c", "cc", "ccc", etc. with any characters following • As well as any combination of a's, b's, and c's • "abcabc", etc.
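The anchoring rules can be exercised with a few strings. This is a sketch only: hit() is a hypothetical helper, and the "dog\n" case is added to show that $ also matches just before a final newline.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when the pattern matches, 0 otherwise.
sub hit { my ($s, $re) = @_; return $s =~ $re ? 1 : 0; }

my @results = (
    hit("dog",        qr/^dog$/),     # 1
    hit("ddogg",      qr/^dog$/),     # 0: extra characters on both sides
    hit("dog\n",      qr/^dog$/),     # 1: $ matches before a trailing newline
    hit("doga",       qr/^dog/),      # 1: only the start is anchored
    hit("asdfadfdog", qr/dog$/),      # 1: only the end is anchored
    hit("ab",         qr/^.$/),       # 0: two characters
    hit("abcabcX",    qr/^[abc]+/),   # 1: starts with a run of a, b, c
);
print "@results\n";   # prints: 1 0 1 1 1 0 1
```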

  18. metasymbols (5) • \d match a Digit [0-9] • \D match a Nondigit [^0-9] • \s match whitespace [ \t\n\r\f] • \S match a Nonwhitespace [^ \t\n\r\f] • \w match a Word character [a-zA-Z0-9_] • \W match a Non word Character [^a-zA-Z0-9_]

  19. Examples • if ($str =~ /\d/) # true when $str contains a digit • if ($str =~ /\d+/) # true when $str contains 1 or more digits • if ($str =~ /\w\d/) # true when $str contains a word character followed by a digit • if ($str =~ /\w+\d/) # true when $str contains 1 or more word characters followed by a digit • true for "abc1" "a1" "11" "_9" "Z8" and "a1a1" • if ($str =~ /^\s\w\d/) • true when it starts with a whitespace character, then a word character, and then a digit • " 11" "\ta1" "\n11" etc. • if ($str =~ /^\s*\w\d/) • true when it starts with 0 or more whitespace characters, then a word character, and then a digit • " 11" "11" " \t a1" etc.
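A quick check of the shorthand classes against some of the strings listed above, plus a couple of non-matching strings added here for contrast. The helper hit() is an illustrative assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when the pattern matches, 0 otherwise.
sub hit { my ($s, $re) = @_; return $s =~ $re ? 1 : 0; }

my @results = (
    hit("abc",     qr/\d/),          # 0: no digit anywhere
    hit("_9",      qr/\w\d/),        # 1: underscore counts as a word character
    hit("a1a1",    qr/\w+\d/),       # 1
    hit("a1",      qr/^\s\w\d/),     # 0: exactly one leading whitespace is required
    hit("\ta1",    qr/^\s\w\d/),     # 1: a tab is whitespace
    hit(" \t a1",  qr/^\s*\w\d/),    # 1: any amount of leading whitespace
);
print "@results\n";   # prints: 0 1 1 0 1 1
```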

  20. boundary assertions • \b matches at any word boundary • as defined by \w and \W • \B matches anywhere that is not a word boundary • as defined by \W and \w • /\bis\b/ # matches "what it is" and "that is it" • roughly the same as /\Wis\W/, except that \b also matches at the start and end of the string • won't match "tist" • /\Bis\B/ # matches "thistle" and "artist" • can also be written as /\wis\w/ • won't match "that is it"

  21. boundary assertions (2) • /\bis\B/ # matches "istanbul" and "so--isn't that" • similar to /\Wis\w/ • but /\Wis\w/ won't match "istanbul", because "is" is at the front of the string and there is no character for \W to match. • Since \w is [a-zA-Z0-9_], all punctuation counts as a word boundary. • So /\bisn\B/ won't match "isn't", because ' is not a word character • /\Bis\b/ # matches "this" and "this is for you" • similar to /\wis\W/ • For the second example, the match is on "this", not on "is". • As in the example above, \W won't match at the end of a string.
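The difference between the zero-width \b and \B assertions and the character-consuming \W and \w shows up at the edges of a string. A sketch using the example words from these two slides; hit() is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when the pattern matches, 0 otherwise.
sub hit { my ($s, $re) = @_; return $s =~ $re ? 1 : 0; }

my @results = (
    hit("that is it",      qr/\bis\b/),    # 1: "is" stands alone as a word
    hit("tist",            qr/\bis\b/),    # 0: "is" is buried inside a word
    hit("artist",          qr/\Bis\B/),    # 1: word characters on both sides
    hit("istanbul",        qr/\bis\B/),    # 1: \b matches at the string start
    hit("istanbul",        qr/\Wis\w/),    # 0: \W needs an actual character
    hit("isn't",           qr/\bisn\B/),   # 0: ' after n creates a boundary
    hit("this is for you", qr/\Bis\b/),    # 1: matches the end of "this"
);
print "@results\n";   # prints: 1 0 1 1 0 0 1
```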

  22. Exercise 8 • What will the following match? • /a+\w*?/ • /\w\s*\w+/ • /\bHi\bThere/ • /\b\w+\b.+There[!]?$/i • /^\d+[a-z]*/ • /\w+,\s\w{2}\s{2}\d{5}/ • Write a regular expression for each of these • Rewrite #6 so the city can be two or more words. • Must start with a letter, then have any number of letters and/or digits (or none at all), and end with a digit

  23. Parentheses as memory • special variables \1 .. \9 • \1 refers back, inside the pattern, to what the first () matched; $1 holds the same text after the match • if ($str =~ /(\d)asdf\1/) • true when $str has a digit, then asdf, then the same digit • examples: 1asdf1, 3asdf3 • if ($str =~ /(\w+)(\d+)as\2\1/) • true for a word, then digits, then as, then the same digits, then the same word • examples: "hi12as12hi" "1_31as311_"

  24. Parentheses as memory (2) • if ($str =~ /(\d)+asdf\1/) • Note: (\d)+ is different from (\d+) • (\d+) matches as many digits as it can, and all of them go into \1 • (\d)+ matches one digit at a time, so only the last digit matched goes into \1 • example: • (\d)+ on 123 gives \1 = 3, but the overall match covers 123 • So 123asdf3 would match the if at the top • In the next lecture, this does some strange things in substitutions.
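A short sketch comparing (\d)+ and (\d+) on the string 123asdf3. It records both the whole match ($&) and the captured group ($1); the variable names are illustrative. Both patterns succeed on this input, but they match different stretches of it.

```perl
use strict;
use warnings;

my $str = "123asdf3";
my ($whole_a, $group_a, $whole_b, $group_b);

# (\d)+ repeats a one-digit group, so \1 is the LAST digit matched (3).
if ($str =~ /(\d)+asdf\1/) { ($whole_a, $group_a) = ($&, $1); }

# (\d+) captures the whole run of digits, so "123" would need a second
# "123" after asdf. The engine only succeeds by starting at the final 3.
if ($str =~ /(\d+)asdf\1/) { ($whole_b, $group_b) = ($&, $1); }

print "(\\d)+ : matched '$whole_a', group 1 = '$group_a'\n";
print "(\\d+) : matched '$whole_b', group 1 = '$group_b'\n";
# prints: (\d)+ : matched '123asdf3', group 1 = '3'
#         (\d+) : matched '3asdf3', group 1 = '3'
```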

  25. Parentheses as memory (3) • parentheses around parentheses • if ($str =~ /((\w+) (\w+))/) • \1, \2, \3 are bound to values • $str = "Hi There"; gives \1 = "Hi There", \2 = "Hi", \3 = "There" • Perl numbers groups by the position of their opening parenthesis, left to right, so an outer group comes before the groups inside it: the outer ( is 1, the first (\w+) is 2, the second (\w+) is 3 • (((\w+) )(\w+)) has \1, \2, \3, \4 • its opening parentheses are numbered 1, 2, 3, 4 from left to right • \1 = "Hi There", \2 = "Hi ", \3 = "Hi", \4 = "There"
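The group numbering for nested parentheses can be confirmed by collecting all captures in list context. A minimal sketch; the array names are illustrative.

```perl
use strict;
use warnings;

my $str = "Hi There";

# In list context a successful match returns ($1, $2, $3, ...).
my @three = $str =~ /((\w+) (\w+))/;
my @four  = $str =~ /(((\w+) )(\w+))/;

# Join with | so the trailing space in "Hi " stays visible.
my $three_text = join("|", @three);
my $four_text  = join("|", @four);

print "$three_text\n";   # prints: Hi There|Hi|There
print "$four_text\n";    # prints: Hi There|Hi |Hi|There
```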

  26. Variable Interpolation • Using variables inside the match • $find = "abc"; • if ($str =~ /$find/) • matches when $str contains the value of $find • $str = "ddogg"; • if ($str =~ /\w$dog\w/) • true if $str contains the string in $dog with a word character in front of and behind it.

  27. Variable interpolation (2) • Variable interpolation sometimes does more than you think. • Example: $match = "hi*"; $x = "h"; if ($x =~ /$match/) { print "$match matches $x\n"; } else { print "Failed as it should\n"; } • Output: hi* matches h • Why? The literal text hi* does not appear in h.

  28. Variable interpolation (3) • Answer: the variable is interpolated as a regex, • so hi* means find an h followed by zero or more lowercase i's. • The following example leads to an error • $match = "*hi*"; • if ($x =~ /$match/) { • Perl errors out, saying Quantifier follows nothing in regex; … • To stop the interpolated value from being treated as regex metacharacters • \Q … \E • The text between \Q and \E is matched as literal characters instead of being interpreted as a regex.

  29. Variable interpolation (4) • $match = "hi*"; $x = "h"; if ($x =~ /\Q$match\E/) { print "$match matches $x\n"; } else { print "Failed as it should\n"; } • Output: Failed as it should
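The three interpolation cases (plain, quoted with \Q … \E, and an invalid pattern) can be put side by side. This is a sketch: the third test string and the use of eval to trap the error are additions made for illustration.

```perl
use strict;
use warnings;

my $match = "hi*";
my $x     = "h";

# Interpolated as a regex: h followed by zero or more i's.
my $as_regex    = ($x =~ /$match/)     ? 1 : 0;   # 1
# Quoted: the three literal characters h, i, * must appear.
my $as_literal  = ($x =~ /\Q$match\E/) ? 1 : 0;   # 0
my $literal_hit = ("say hi* twice" =~ /\Q$match\E/) ? 1 : 0;   # 1

# A pattern that starts with * is invalid; eval traps the runtime error.
my $bad   = "*hi*";
my $ok    = eval { my $unused = ($x =~ /$bad/); 1 };
my $error = $ok ? "" : $@;

print "regex: $as_regex, literal: $as_literal, literal hit: $literal_hit\n";
print "error: $error";
```

The built-in quotemeta() function applies the same escaping as \Q and can be used when the pattern is assembled outside the match.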

  30. Special Read-Only variables • We've seen \1 .. \9. These only have a value inside the match. But $1 .. $9 hold their values (the same as \1 .. \9) after the match • if ($str =~ /(\d+)asd\1/) { print "matched $1 \n"; } • If $str = "123asd123", then the output would be: matched 123

  31. Special Read-Only variables (2) • $` (backquote) • The characters to the left of the match • $' (quote) • The characters to the right of the match • $& • The characters that matched • Helpful for debugging your regex.
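A small debugging sketch that splits a string into the text before, inside, and after the match. The sample string is made up; the values are copied into ordinary variables immediately because the next successful match overwrites them.

```perl
use strict;
use warnings;

my $str = "hello world foo";
my ($before, $matched, $after);

if ($str =~ /world/) {
    # Copy right away: these variables change on the next successful match.
    ($before, $matched, $after) = ($`, $&, $');
}

print "[$before] [$matched] [$after]\n";   # prints: [hello ] [world] [ foo]
```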

  32. capturing matches • $str = "a xxx c xxxxxc xxx d"; • ($a, $b) = ($str =~ m/(.+)x(.+)c/); • $a = "a xxx c xxx"; Also $1 = "a xxx c xxx"; • $b = "x"; Also $2 = "x";
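The same capture can be written with lexical variables, which avoids reusing $a and $b (Perl's sort() uses those two names). The sketch below also adds a min-match variant, which is not on the slide, to show how much the result depends on greediness.

```perl
use strict;
use warnings;

my $str = "a xxx c xxxxxc xxx d";

# Greedy: the first (.+) grabs as much as it can while still leaving
# an x, at least one more character, and a c.
my ($first, $second) = $str =~ /(.+)x(.+)c/;

# Lazy variant (an addition for comparison): each group stops as early as possible.
my ($lazy_first, $lazy_second) = $str =~ /(.+?)x(.+?)c/;

print "[$first] [$second]\n";             # prints: [a xxx c xxx] [x]
print "[$lazy_first] [$lazy_second]\n";   # prints: [a ] [xx ]
```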

  33. match as a true value • / / returns a true/false value • revisiting the switch structure SWITCH: { $str =~ /abc/ && do {$a = 1; last SWITCH;}; $str =~ /def/ and do {$d = 1; last SWITCH;}; $c = 1; } • Strange looking code. Also, this is one of the very few places a ; is needed after a } • NOTE: either && or and could be used.

  34. Commenting your matches • /x ignores most whitespace in the pattern and allows comments
       /\w+:    # match a word and a colon
        (       # begin group
        \s+     # match one or more spaces
        \w+     # match another word
        )       # end group
        \s*     # match zero or more spaces
        \d+     # match 1 or more digits
       /x;
     • same as /\w+:(\s+\w+)\s*\d+/;
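A sketch checking that a commented /x pattern and its compact form capture the same thing. The sample string "count: items 42" and the variable names are assumptions for illustration.

```perl
use strict;
use warnings;

my $plain = qr/\w+:(\s+\w+)\s*\d+/;

# Same pattern spread out with /x. Whitespace inside the pattern is
# ignored, and the comments must not contain the delimiter character.
my $commented = qr/
    \w+ :          # a word and a colon
    ( \s+ \w+ )    # group 1 holds the spaces plus the next word
    \s*            # zero or more spaces
    \d+            # one or more digits
/x;

my $sample = "count: items 42";
my ($from_plain)     = $sample =~ $plain;
my ($from_commented) = $sample =~ $commented;

print "[$from_plain] [$from_commented]\n";   # prints: [ items] [ items]
```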

  35. Commenting your matches (2) • Be careful not to use / in the comments, otherwise perl thinks it is the end of the match • You have to think about where the whitespace is in the match, since literal spaces in the pattern are ignored. • If you need to match a #, use \#

  36. more flags for pattern matching • matching with newlines in the string • //s lets the . match the newline (\n) • $str = "asdf\n asdf\n"; • /(f.)/; # no match • /(f.)/s; # $1 = "f\n" • //m lets ^ and $ match next to embedded \n • $str = "af\nasdf\n"; • /(af$)/; # won't match • /(af$)/m; # $1 = "af" • /^(as)/; # won't match • /^(as)/m; # matches, $1 = "as" • /(f.)$/ms; # matches only the last "f\n": the . consumes the \n, so $ has to match after it, which only works at the very end of the string.
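The /s and /m cases above can be run as a table. In this sketch hit() is a hypothetical helper, and $-[0] (the start offset of the last match) is used to show which "f\n" the final pattern picks.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when the pattern matches, 0 otherwise.
sub hit { my ($s, $re) = @_; return $s =~ $re ? 1 : 0; }

my $s1 = "asdf\n asdf\n";
my $s2 = "af\nasdf\n";

my @results = (
    hit($s1, qr/(f.)/),      # 0: . will not match the newline after f
    hit($s1, qr/(f.)/s),     # 1: with the s flag it does
    hit($s2, qr/(af$)/),     # 0: $ only matches at the very end
    hit($s2, qr/(af$)/m),    # 1: with the m flag $ matches before each newline
    hit($s2, qr/^(as)/),     # 0: ^ only matches at the very start
    hit($s2, qr/^(as)/m),    # 1: with the m flag ^ matches after each newline
);

# Offset where the combined ms pattern matched: the second f, index 6.
my $offset = ($s2 =~ /(f.)$/ms) ? $-[0] : -1;

print "@results | offset $offset\n";   # prints: 0 1 0 1 0 1 | offset 6
```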

  37. Pattern Delimiters • if ($str =~ /\/usr\/local/) • true if $str contains /usr/local • To avoid backslashing /, we can change the delimiter • choose another delimiter, which is a nonalphanumeric character, such as %%, ##, {}, [], <>, etc. • must use the m in front of the match so perl knows what you want • if ($str =~ m%/usr/local%) • true if $str contains /usr/local • if ($str =~ m[/usr/local]) • true if $str contains /usr/local, but confusing since it can be mistaken for [] single character matching.
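A brief sketch showing that the choice of delimiter does not change what is matched. The sample path is an assumption for illustration.

```perl
use strict;
use warnings;

my $path = "/usr/local/bin/perl";

my @results = (
    ($path =~ /\/usr\/local/) ? 1 : 0,   # default delimiter, slashes escaped
    ($path =~ m%/usr/local%)  ? 1 : 0,   # percent signs as delimiters
    ($path =~ m{/usr/local})  ? 1 : 0,   # paired braces as delimiters
    ($path =~ m#^/opt#)       ? 1 : 0,   # hash delimiters; path does not start with /opt
);
print "@results\n";   # prints: 1 1 1 0
```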

  38. Exercise 9 • Which of the following strings will /\-?(\|)?m\(\d+\)\1/i match? • "-|m(12)|" • "|M(12)|" • "-|M(12)" • "m(12)|" • "M(12)" • And for /\-?(\|)?m\(\d+\)\1?/i • "|m(12)|" • "m(12)" • "-|m(12)|" • "-|m(12)"

  39. Q & A
