
Programming Language Concepts Perl Regular Expressions

Programming Language Concepts: Perl Regular Expressions. Adapted by Carl Reynolds from materials by Sean P. Strout. What are regular expressions? They are a tiny language used for fast, flexible, and reliable string handling.


Presentation Transcript


  1. Programming Language Concepts: Perl Regular Expressions
     Adapted by Carl Reynolds from materials by Sean P. Strout
     PLC - Perl Regular Expressions (v2.0)

  2. What are Regular Expressions?
     • Regular expressions are a tiny language used for fast, flexible, and reliable string handling
     • Where are regexes found?
         • Text editors like vi and emacs
         • Unix commands like grep, sed, awk and procmail
         • Web-based search engines, email clients, etc.
     • Perl uses regexes for pattern matching, extraction and replacement on a given input string
     • The theory is simple:
         • There are an infinite number of possible text strings
         • A given pattern divides this set into two groups: the ones that match and the ones that don’t

  3. What are Regular Expressions?
         % grep 'flint.*stone' some_file
         A piece of flint, a stone which may be used to start…
         Found obsidian, flint, granite, and small stones of…
         A flintrock rifle is poor condition. The sandstone mantle…
     • Don’t confuse a regex with a glob, which is often used for directory operations
         my @java_src = glob "*.java";
         foreach my $file (@java_src) {
             system "javac $file";
         }

  4. Using Simple Patterns
     • To compare a pattern against the contents of $_
         $_ = "yabba dabba doo";
         if (/abba/) {
             print "It matched!\n";
         }
     • The regex is enclosed in forward slashes
     • Spaces inside the slashes are not ignored; they are part of the pattern
     • The evaluation returns true if the four-letter string “abba” is found in $_
     • Backslash escapes are available in patterns (just like double-quoted strings)
         /coke\tsprite/
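As a minimal sketch of the points on this slide, the two helpers below (has_abba and has_spaced_abba are hypothetical names, not part of the original materials) test a string through $_ and show that a space inside the slashes must match a real space.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the text contains the literal sequence "abba".
sub has_abba {
    local $_ = shift;          # the bare /.../ form tests $_
    return /abba/ ? 1 : 0;
}

# Hypothetical helper: the space in /a bba/ is literal, so a real space is required.
sub has_spaced_abba {
    local $_ = shift;
    return /a bba/ ? 1 : 0;
}

print has_abba("yabba dabba doo"), "\n";          # prints 1
print has_spaced_abba("yabba dabba doo"), "\n";   # prints 0
```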

  5. Metacharacters
     • There are a number of characters which have a special meaning in a regex
     • The first is the dot, . , known as the wildcard character, which matches any single character except a newline
         /bet.y/    # matches: betty betsy bet=y bet.y
                    # does not match: bety betsey
     • The second metacharacter is the backslash, \ , which makes any metacharacter nonspecial
     • How do I match against these strings?
         3.14159    /3\.14159/
         b\t        /b\\t/
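The dot and backslash rules can be checked with a short script. This is only a sketch; the three helper names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers, named for this example only.

# The dot stands for any single character except a newline.
sub matches_bet_y { return $_[0] =~ /bet.y/ ? 1 : 0 }

# A backslash makes the dot literal, so only a real "." is accepted.
sub matches_pi { return $_[0] =~ /3\.14159/ ? 1 : 0 }

# Two backslashes in the pattern match one literal backslash in the text.
sub matches_b_backslash_t { return $_[0] =~ /b\\t/ ? 1 : 0 }

for my $s ("betty", "bet=y", "bety", "betsey") {
    printf "%-7s %s\n", $s, matches_bet_y($s) ? "match" : "no match";
}
print matches_pi("3.14159"), " ", matches_pi("3x14159"), "\n";   # prints "1 0"
print matches_b_backslash_t('b\t'), "\n";                        # prints 1
```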

  6. Simple Quantifiers
     • Repetition in a pattern can be specified using the following three simple quantifiers
     • The star, * , matches the preceding item zero or more times
         /fred\t*barney/    # fred\tbarney fred\t\tbarney fredbarney
         /fred.*barney/     # fred + any old junk + barney
     • The plus, + , matches the preceding item one or more times
         /fred +barney/     # fred and barney separated by one or more spaces (and nothing else)
     • The question mark, ? , matches the preceding item zero or one times
         /bamm-?bamm/       # bamm-bamm or bammbamm
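To compare the three quantifiers side by side, the sketch below runs them over the same sample strings. The helper matching and the use of qr// (a standard Perl operator for precompiled patterns, not covered in the slides) are choices made for this example.

```perl
use strict;
use warnings;

my @samples = ("fredbarney", "fred barney", "fred   barney", "fred\tbarney");

# Hypothetical helper: return the strings that match the compiled pattern.
sub matching {
    my ($re, @strings) = @_;
    return grep { $_ =~ $re } @strings;
}

my @star = matching(qr/fred *barney/, @samples);   # zero or more spaces
my @plus = matching(qr/fred +barney/, @samples);   # one or more spaces
my @opt  = matching(qr/fred ?barney/, @samples);   # zero or one space

print scalar(@star), " ", scalar(@plus), " ", scalar(@opt), "\n";   # prints "3 2 2"
```

The tab-separated sample matches none of the three, because each pattern only allows spaces between the two names.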

  7. More Metacharacters
     • Use parentheses, () , to group a collection of items so that a quantifier applies to the whole group
         /fred+/      # fred fredd freddd
         /(fred)+/    # fred fredfred fredfredfred
         /(fred)*/    # What does this match?
     • Use the vertical bar, | , to match either the left side or the right side
         /fred|barney|betty/    # fred barney betty
         /fred( |\t)+barney/    # "fred barney" "fred \tbarney"
     • Rewrite the last pattern to match only if the characters between fred and barney are all spaces or all tabs
         /fred( +|\t+)barney/
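The difference between the last two patterns on this slide is easy to miss, so here is a small sketch that runs both against the same inputs. The helper names mixed_gap and uniform_gap are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: any mix of spaces and tabs between the names.
sub mixed_gap   { return $_[0] =~ /fred( |\t)+barney/ ? 1 : 0 }

# Hypothetical helper: the gap must be all spaces or all tabs.
sub uniform_gap { return $_[0] =~ /fred( +|\t+)barney/ ? 1 : 0 }

for my $s ("fred  barney", "fred\t\tbarney", "fred \tbarney") {
    print mixed_gap($s), " ", uniform_gap($s), "\n";
}
# prints "1 1", "1 1", then "1 0"
```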

  8. Pattern Testing
     • Use the Unix command egrep to test a regular expression against a file
     • some.txt:
         barney or fred
         fred and barney
     • % egrep 'fred (and|or) barney' some.txt
         fred and barney
     • /usr/local/pub/sps/courses/plc/examples/perl/pattern_test
     • Write a pattern that matches any string containing:
         • fred or Fred    /fred|Fred/
         • At least one a followed by any number of b’s    /a+b*/
         • At least one backslash followed by one or zero asterisks    /\\+\*?/

  9. Character Classes
     • A character class is a list of possible characters inside square brackets, [] , matching any single character from within the class
         [abcwxyz]      # matches one of those 7 chars
     • The minus, - , is used to specify a range of characters
         [a-cw-z]       # same, uses hyphen for range
         [a-zA-Z]       # 52 upper/lowercase letters
         [\x41-\x5A]    # A-Z by character code (decimal 65-90, written in hex)
         $_ = "The HAL-9000 requires authorization to continue.";
         if (/HAL-[0-9]+/) {
             print "The string matches some model of HAL.\n";
         }

  10. Character Classes
     • The caret, ^ , is used to negate a character class. This is a way to specify characters not to match against
         [^def]     # match any char except d, e, f
     • Since minus is special inside a character class, it must be backslashed to make it non-special
         [^n\-z]    # match any char except n, -, z
     • Write a regular expression that matches three-character strings:
         1st character: a, b or c
         2nd character: -
         3rd character: not a, b or c
         /[abc]-[^abc]/
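A sketch of the two character-class slides in runnable form. The helper names are hypothetical, and is_code adds ^ and $ anchors (covered a few slides later) so that only strings of exactly three characters pass, which is an assumption about what the exercise intends.

```perl
use strict;
use warnings;

# Hypothetical helper: one of a, b, c; a hyphen; then anything except a, b, c.
# The ^ and $ anchors are added here to reject longer strings.
sub is_code { return $_[0] =~ /^[abc]-[^abc]$/ ? 1 : 0 }

# Hypothetical helper: "HAL-" followed by at least one digit from the range 0-9.
sub mentions_hal { return $_[0] =~ /HAL-[0-9]+/ ? 1 : 0 }

print is_code("a-d"), is_code("a-b"), is_code("d-x"), "\n";   # prints 100
print mentions_hal("The HAL-9000 requires authorization to continue."), "\n";   # prints 1
```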

  11. Character Class Shortcuts
     • Some character classes appear so frequently that they have shortcuts
         \d    # a single digit, [0-9]
         \w    # a “word” character, [A-Za-z0-9_]
         \s    # a “space” character, [ \f\t\n\r]
     • Add quantifiers to beef up your pattern testing
         \w+   # match a series of one or more word chars
         \s*   # any amount of whitespace (including none)
         \s+   # one or more whitespace characters

  12. Negating Character Class Shortcuts
     • To do the opposite of a character class shortcut, negate it with ^ inside a class, or use the uppercase form
         [^\d]    # non-digit character, also \D
         [^\w]    # non-word character, also \W
         [^\s]    # non-space character, also \S
     • What do these match?
         1. /[\dA-Fa-f]+/             a hex number
         2. /[\d\D]/                  any digit or non-digit, that is, any character, even a newline
         3. /[^\d\D]/                 anything that is neither a digit nor a non-digit, that is, nothing
         4. /\{\S+, \S+, \S+\}/       {1.22, 5, 8.8}
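The first three answers above can be confirmed with a few one-line helpers. This is a sketch with invented names; is_hex adds anchors so that the whole string must be hex digits.

```perl
use strict;
use warnings;

# Hypothetical helper: the whole string consists of hex digits (anchors added).
sub is_hex { return $_[0] =~ /^[\dA-Fa-f]+$/ ? 1 : 0 }

# Hypothetical helpers comparing the dot with the "digit or non-digit" class.
sub dot_matches      { return $_[0] =~ /./       ? 1 : 0 }
sub any_char_matches { return $_[0] =~ /[\d\D]/  ? 1 : 0 }
sub nothing_matches  { return $_[0] =~ /[^\d\D]/ ? 1 : 0 }

print is_hex("DEADbeef42"), " ", is_hex("xyz"), "\n";                  # prints "1 0"
print dot_matches("\n"), " ", any_char_matches("\n"), "\n";            # prints "0 1"
print nothing_matches("abc 123"), "\n";                                # prints 0
```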

  13. General Quantifiers
     • We’ve already seen the 3 simple quantifiers, *, + and ?
     • Use curly braces, {} , to specify the minimum and maximum repetitions allowed
         /a{5,15}/      # match from 5 to 15 a’s
         /(fred){3,}/   # match 3 or more fred in a row
         /\w{8}/        # match exactly 8 word chars
     • The simple quantifiers, therefore, are just shortcuts for this general form. Can you come up with them?
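One way to check an answer to the closing question is to pair each simple quantifier with a candidate curly-brace form and confirm that both accept the same strings. The sketch below does this for {0,}, {1,} and {0,1}; the sample strings, the agree helper and the use of qr// are assumptions made for the example.

```perl
use strict;
use warnings;

my @samples = ("ac", "abc", "abbc", "abbbc");

# Each pair holds a simple quantifier and a candidate curly-brace equivalent.
my %pairs = (
    '*' => [ qr/^ab*c$/, qr/^ab{0,}c$/  ],
    '+' => [ qr/^ab+c$/, qr/^ab{1,}c$/  ],
    '?' => [ qr/^ab?c$/, qr/^ab{0,1}c$/ ],
);

# Hypothetical helper: 1 when both patterns accept and reject the same samples.
sub agree {
    my ($short, $general) = @_;
    for my $s (@samples) {
        return 0 if ($s =~ $short ? 1 : 0) != ($s =~ $general ? 1 : 0);
    }
    return 1;
}

for my $q (sort keys %pairs) {
    print "$q agrees: ", agree(@{ $pairs{$q} }), "\n";   # each line ends in 1
}
```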

  14. Anchors
     • The caret anchor, ^ , marks the beginning of a string
     • Yes, you’ve seen this used before, inside character classes for negation. The meaning is different outside a character class
     • The dollar sign, $ , marks the end of the string (it also matches just before a trailing newline)
         /^fred/     # matches fred only at start of string
         /rock$/     # matches rock only at end of string
         /^fred$/    # only matches fred, alone as a string
         /^\s*$/     # What does this match? A blank line (empty or whitespace only)
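A brief sketch of the anchors in use, including the trailing-newline behavior of $ (helper names are hypothetical):

```perl
use strict;
use warnings;

# Hypothetical helper: the line is empty or holds only whitespace.
sub is_blank { return $_[0] =~ /^\s*$/ ? 1 : 0 }

# Hypothetical helper: "rock" at the very end, or just before a final newline.
sub ends_with_rock { return $_[0] =~ /rock$/ ? 1 : 0 }

print is_blank(""), is_blank(" \t\n"), is_blank(" x "), "\n";        # prints 110
print ends_with_rock("bedrock\n"), ends_with_rock("rocky"), "\n";    # prints 10
```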

  15. Word Anchors
     • The word boundary, \b , matches at either end of a word.
         /\bfred\b/    # match the word fred
                       # but not frederick, alfred or manfred mann
         /\bhunt/      # matches hunt, hunting or hunter
                       # but not shunt
         /stone\b/     # matches sandstone, flintstone
                       # but not capstones

  16. Word Anchors
     • The word boundary matches at the start or end of a group of \w characters. Quotes and apostrophes don’t change the word boundaries
     • The non-word boundary, \B , matches where \b would not
         /\bsearch\B/    # match searches, searching, searched
                         # but not search or researching
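The examples from the two word-anchor slides, written as a small sketch with hypothetical helper names:

```perl
use strict;
use warnings;

# Hypothetical helper: "fred" as a whole word.
sub has_word_fred { return $_[0] =~ /\bfred\b/ ? 1 : 0 }

# Hypothetical helper: a word that starts with "search" and continues past it.
sub search_prefix { return $_[0] =~ /\bsearch\B/ ? 1 : 0 }

print has_word_fred("fred's house"), has_word_fred("alfred"), "\n";   # prints 10
print search_prefix("searching"), search_prefix("search"),
      search_prefix("researching"), "\n";                             # prints 100
```

The apostrophe in "fred's" is not a \w character, so a word boundary falls between the d and the apostrophe.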

  17. Memory Parentheses
     • The parentheses are also used for remembering the substring matched by the pattern
     • Backreferences refer back to an earlier saved memory within the same regex
         /(.)\1/    # match any char, followed by same char
     • Each pair of parentheses equals one regex memory
     • Memories are numbered by the position of the opening parenthesis: outer before inner, left to right
         /((a|b) (c)) \1/    # matches a c a c or b c b c
         /((a|b) (c)) \2/    # matches a c a or b c b
         /((a|b) (c)) \3/    # matches a c c or b c c
     • Soon we will see how we can extract these memories into memory variables for future use
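Backreferences can be tried out directly. In this sketch (helper names invented), has_double looks for any doubled character and inner_repeat uses \2 to require the same letter that the inner (a|b) group captured.

```perl
use strict;
use warnings;

# Hypothetical helper: some character immediately followed by itself.
sub has_double { return $_[0] =~ /(.)\1/ ? 1 : 0 }

# Hypothetical helper: \2 refers to the second opening parenthesis, (a|b).
sub inner_repeat { return $_[0] =~ /((a|b) (c)) \2/ ? 1 : 0 }

print has_double("bammbamm"), has_double("barney"), "\n";                      # prints 10
print inner_repeat("a c a"), inner_repeat("a c b"), inner_repeat("b c b"), "\n";   # prints 101
```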

  18. Precedence
     • There are four levels of precedence, from high to low
         • Parentheses: used for grouping and memory
         • Quantifiers: the repeat operators
         • Anchors and sequence: string and word boundaries, and sequences of items
         • Alternation: cuts the pattern into pieces
         /^fred|barney$/       # fred at start or barney at end
         /^(fred|barney)$/     # the whole string is fred or barney
         /(wilma|pebbles?)/    # wilma, pebbles, or pebble
                               # (as part of a string)
         /^(\w+)\s+(\w+)$/     # What does this match?
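Because alternation binds most loosely, the first two patterns on this slide behave quite differently. A sketch, with hypothetical helper names:

```perl
use strict;
use warnings;

# Hypothetical helper: "fred at the start" OR "barney at the end".
sub edge_match  { return $_[0] =~ /^fred|barney$/ ? 1 : 0 }

# Hypothetical helper: the parentheses make the anchors apply to both alternatives.
sub exact_match { return $_[0] =~ /^(fred|barney)$/ ? 1 : 0 }

for my $s ("fred", "frederick", "big barney", "wilma") {
    print "$s: ", edge_match($s), " ", exact_match($s), "\n";
}
# prints "fred: 1 1", "frederick: 1 0", "big barney: 1 0", "wilma: 0 0"
```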

  19. Regex Cheat Sheet

  20. Regex Cheat Sheet

  21. Testing Your Understanding
     • Use egrep and write the regular expressions to match on the dictionary file: ~jdb/public_html/plc/perlLab/dictionary.txt
         • Words starting with a
         • Words that start with a or A
         • Words with exactly four letters
         • Words with exactly four letters that begin with a
         • Words with four or more letters
         • Words that contain a capital letter
         • Words that start with a non-capital and include a capital
         • Words with more than one capital
         • Words that neither begin nor end with a
         • Words that begin and end with a vowel
         • Words that neither begin nor end with a vowel
         • Words that contain the vowel sequence a-e-i-o-u
         • Words that contain your initials in order
         • Words that contain only the letters of your first name

  22. Matches with m//
     • Evaluating a regex, /fred/ , is actually a shortcut for the pattern match operator, m/fred/
     • With the explicit m, any pair of delimiters may be used to quote the contents
     • To leave out the m, you must use forward slashes as the delimiters
     • Matching a URL might be easier with a different delimiter
         m%^http://%

  23. Option Modifiers
     • Option modifiers are activated by appending a flag to the closing delimiter
     • Case-insensitive matching is achieved with /i
         print "Would you like to play a game? ";
         chomp($_ = <STDIN>);
         if (/\byes\b/i) {
             print "1. Global Thermonuclear War";
             …
         }
     • The dot matches any character (including newline) with /s
         $_ = "I saw Barney\nbowling\nwith Fred\nlast night.\n";
         if (/\bBarney\b.*\bFred\b/s) {
             print "The string mentions Fred after Barney!\n";
         }

  24. Combining Option Modifiers
     • Concatenate flags together to apply multiple options to the same pattern
         if (/\bbarney\b.*\bfred\b/si) {    # both /s and /i
             print "The string mentions Fred after Barney!\n";
         }
     • Use man perlop to get the complete list of options
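To see why both flags are needed for the lowercase pattern, the sketch below tries all four flag combinations on the multi-line string from the previous slide. The mentions helper and the use of qr// to carry the flags are assumptions made for the example.

```perl
use strict;
use warnings;

my $text = "I saw Barney\nbowling\nwith Fred\nlast night.\n";

# Hypothetical helper: 1 if the text matches the compiled pattern.
sub mentions {
    my ($string, $re) = @_;
    return $string =~ $re ? 1 : 0;
}

print mentions($text, qr/\bbarney\b.*\bfred\b/),   "\n";   # prints 0: wrong case, dot stops at newline
print mentions($text, qr/\bbarney\b.*\bfred\b/i),  "\n";   # prints 0: dot still stops at newline
print mentions($text, qr/\bbarney\b.*\bfred\b/s),  "\n";   # prints 0: case still differs
print mentions($text, qr/\bbarney\b.*\bfred\b/si), "\n";   # prints 1
```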

  25. The Binding Operator
     • Matching against $_ is the default. To match against an arbitrary string on the left, use the binding operator, =~
         my $some_other = "I dream of betty rubble.";
         if ($some_other =~ /\brub/) {
             print "Aye, there’s the rub.\n";
         }
     • This is not an assignment; it is an expression that returns the result of the test
         my $likes_perl = <STDIN> =~ /\byes\b/i;
         if ($likes_perl) {
             print "Why don’t you marry it?";
         }

  26. Interpolating into Patterns
     • A regex is interpolated like a double-quoted string
         #!/usr/bin/perl -w
         my $what = "larry";
         while (<>) {
             if (/^($what)/) {    # pattern anchored at start
                 print "Saw $what at beginning of $_";
             }
         }

  27. Match Variables
     • When using memories (parentheses), the results are stored into scalar variables with names like $1 and $2
     • These variables follow the order of the parentheses pairs in the regex and allow us to pull out parts of the string
         $_ = "Hello there, neighbor";
         if (/\s(\w+),/) {                   # memorize word between space and comma
             print "the word was $1\n";      # got the word "there"
         }
         if (/(\S+) (\S+), (\S+)/) {
             print "words were $1 $2 $3\n";  # $1=Hello
         }                                   # $2=there
                                             # $3=neighbor

  28. Match Variables
     • Continually match an expression against a string to find multiple occurrences of the same pattern
         $_ = "word1 word2 word3 word4";
         while (/(\S+)(.*)/) {
             print "$1\n";
             $_ = $2;
         }
     • A memory variable may contain an empty string
         $dino = "A million years.";
         if ($dino =~ /(\d*) years/) {
             # $1="" (empty string), $2=undef, $3=undef…
         }

  29. The Persistence of Memory
     • The match variables generally stay around until the next successful pattern match
     • An unsuccessful match leaves the previous match intact
         $wilma =~ /(\w+)/;    # suppose this match fails
         print "Wilma’s word was $1… or was it?\n";
     • For maintainability, don’t use a match variable more than a few lines after the pattern match. Copy it into an ordinary variable
         if ($wilma =~ /(\w+)/) {
             my $wilma_word = $1;
         }
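The stale-memory hazard can be reproduced in a few lines. In this sketch (both helpers are hypothetical), stale_capture shows $1 surviving a failed match, and first_word shows the safer habit of reading $1 only when the match succeeded.

```perl
use strict;
use warnings;

# Hypothetical helper: demonstrates that a failed match does not reset $1.
sub stale_capture {
    my $first_ok  = ("fred flintstone" =~ /(\w+)/);   # succeeds: $1 is "fred"
    my $second_ok = ("!!!" =~ /(\w+)/);               # fails: $1 is left alone
    return $1;                                        # still "fred"
}

# Hypothetical helper: use $1 only when the match succeeded.
sub first_word {
    my ($text) = @_;
    return $text =~ /(\w+)/ ? $1 : undef;
}

print stale_capture(), "\n";             # prints "fred"
print first_word("wilma rocks"), "\n";   # prints "wilma"
```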

  30. Automatic Match Variables
     • There are three more free variables from a pattern match
         • Whatever came before the match is in $`
         • Whatever matched is stored in $&
         • Whatever came after the match is in $'
         if ("Hello there, neighbor" =~ /\s(\w+),/) {
             print "That was ($`)($&)($').\n";
         }
         # (Hello)( there,)( neighbor)
         # $1=there (note the difference between $1 and $&)
     • Revisit the pattern_test program
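The same example can be wrapped in a function that returns the three pieces plus the memory, which makes the difference between $& and $1 explicit. The name around_match is invented for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: return (before, matched, after, first memory),
# or an empty list when the pattern does not match.
sub around_match {
    my ($text) = @_;
    return () unless $text =~ /\s(\w+),/;
    return ($`, $&, $', $1);
}

my ($before, $matched, $after, $word) = around_match("Hello there, neighbor");
print "($before)($matched)($after) word=$word\n";
# prints "(Hello)( there,)( neighbor) word=there"
```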

  31. Substitution with s///
     • Replace whatever part of a variable matches a pattern with a replacement string
         $_ = "Hey, Barney!";
         s/Barney/Fred/;
         print "$_\n";              # Hey, Fred!
     • If the match fails, nothing happens and the variable is untouched
         s/Wilma/Betty/;            # $_="Hey, Fred!"
     • Memory variables can also be used
         s/Hey, (\w+)/Yo, $1/;      # $_="Yo, Fred!"

  32. Substitution with s///
     • Follow through with this series of substitutions
         $_ = "green scaly dino";
         s/(\w+) (\w+)/$2, $1/;    # "scaly, green dino"
         s/^/huge, /;              # "huge, scaly, green dino"
         s/,.*een//;               # "huge dino" (greedy match)
         s/green/red/;             # false: "huge dino"
         s/\w+$/($`!)$&/;          # "huge (huge !)dino"
         s/\s+(!\W+)/$1 /;         # "huge (huge!) dino"
         s/huge/gigantic/;         # "gigantic (huge!) dino"
     • The return value is true if the substitution was successful

  33. Global Replacement with /g
     • The /g modifier tells s/// to make all possible nonoverlapping replacements
         $_ = "home, sweet home!";
         s/home/cave/g;    # $_="cave, sweet cave!"
     • Collapsing whitespace
         $_ = "Input data\t may have extra whitespace.";
         s/\s+/ /g;        # "Input data may have extra whitespace."
     • Stripping whitespace
         s/^\s+//;         # Replace leading whitespace with nothing
         s/\s+$//;         # Replace trailing whitespace with nothing
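The stripping and collapsing steps combine naturally into one cleanup routine. The sketch below (tidy is a hypothetical name) also captures the return value of s///g, which is the number of replacements made, or false when there were none.

```perl
use strict;
use warnings;

# Hypothetical helper: trim both ends, collapse inner whitespace runs,
# and report how many runs were replaced.
sub tidy {
    my ($text) = @_;
    $text =~ s/^\s+//;                   # strip leading whitespace
    $text =~ s/\s+$//;                   # strip trailing whitespace
    my $runs = ($text =~ s/\s+/ /g);     # with /g the result counts replacements
    return ($text, $runs || 0);
}

my ($clean, $count) = tidy("  home,   sweet\thome!  ");
print "[$clean] $count\n";   # prints "[home, sweet home!] 2"
```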

  34. The Binding Operator and Case Shifting
     • Choose a different target for s/// by using the binding operator
         $file = "/usr/bin/perl";
         $file =~ s#^.*/##s;    # $file=perl
                                # (s means dot matches any character)
     • Use \U and \L to force what follows to uppercase and lowercase
         $_ = "I saw Barney with Fred.";
         s/(fred|barney)/\U$1/gi;    # "I saw BARNEY with FRED."
         s/(fred|barney)/\L$1/gi;    # "I saw barney with fred."
     • Turn off case shifting with \E
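The effect of \E is easiest to see by running the same substitution with and without it. A sketch, with invented helper names:

```perl
use strict;
use warnings;

# Hypothetical helper: \E ends the uppercase shift after the first memory.
sub shout_first {
    my ($text) = @_;
    $text =~ s/(\w+) (\w+)/\U$1\E $2/;
    return $text;
}

# Hypothetical helper: without \E the shift runs to the end of the replacement.
sub shout_both {
    my ($text) = @_;
    $text =~ s/(\w+) (\w+)/\U$1 $2/;
    return $text;
}

print shout_first("fred flintstone"), "\n";   # prints "FRED flintstone"
print shout_both("fred flintstone"), "\n";    # prints "FRED FLINTSTONE"
```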

  35. The split Operator
     • Use split to break a string according to a separator
         @fields = split /:/, "abc:def:g:h";     # gives ("abc", "def", "g", "h")
         @fields = split /:/, "abc:def::g:h";    # gives ("abc", "def", "", "g", "h")
     • Leading empty fields are returned but trailing ones are not
         @fields = split /:/, ":::a:b:c:::";     # gives ("", "", "", "a", "b", "c")

  36. The split Operator
     • With /\s+/ as the separator, a run of whitespace counts as a single separator
         my $some_input = "This is a \t test.\n";
         my @args = split /\s+/, $some_input;    # ("This", "is", "a", "test.")
     • The default for split is to break up $_ on whitespace
         my @fields = split;    # like: split ' ', $_; (same as /\s+/, except leading whitespace is skipped)
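The difference between split /\s+/ and the default whitespace split shows up only when the input starts with whitespace. A sketch with hypothetical helper names:

```perl
use strict;
use warnings;

# Hypothetical helper: split on whitespace runs; leading whitespace yields an empty first field.
sub fields_regex {
    my ($line) = @_;
    return split /\s+/, $line;
}

# Hypothetical helper: the special ' ' separator (what a bare split uses) skips leading whitespace.
sub fields_default {
    my ($line) = @_;
    return split ' ', $line;
}

my $line    = "  This is a \t test.\n";
my @regex   = fields_regex($line);
my @default = fields_default($line);
print scalar(@regex), " ", scalar(@default), "\n";   # prints "5 4"
```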

  37. The join Function
     • The join function doesn’t use patterns, but it is considered the opposite of split
         my $x = join ":", 4, 6, 8, 10, 12;    # $x=4:6:8:10:12
         my $y = join "foo", "bar";            # gives just "bar" (foo not needed as glue)
         my @empty;
         my $empty = join "baz", @empty;       # no items, it’s an empty string
         my @values = split /:/, $x;           # @values=(4,6,8,10,12)
         my $z = join "-", @values;            # $z="4-6-8-10-12"

  38. Testing Your Understanding
     • Make a pattern that will match three consecutive copies of whatever is currently contained in $what. That is, if $what is fred, your pattern should match fredfredfred. If $what is fred|barney, your pattern should match fredfredbarney or barneyfredfred or barneybarneybarney or many other variations.
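One possible answer, offered as a sketch rather than the intended solution: wrap the interpolated variable in parentheses and apply the {3} quantifier to the group. The ungrouped variant is included only to show why the parentheses matter; it builds its pattern by string concatenation, and both helper names are invented.

```perl
use strict;
use warnings;

# Hypothetical helper: three consecutive copies of whatever $what matches.
sub three_copies {
    my ($what, $text) = @_;
    return $text =~ /($what){3}/ ? 1 : 0;
}

# Hypothetical helper: without the group, {3} binds only to the last
# character, so "fred|barney" becomes fred|barney{3}.
sub ungrouped {
    my ($what, $text) = @_;
    my $pattern = $what . '{3}';
    return $text =~ /$pattern/ ? 1 : 0;
}

print three_copies("fred|barney", "fredfredbarney"), "\n";   # prints 1
print three_copies("fred|barney", "fred"), "\n";             # prints 0
print ungrouped("fred|barney", "fred"), "\n";                # prints 1
```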

  39. Revision History
     • v1.00, 10/19/2003 11:27 PM, sps: Initial revision.
     • v1.01, 1/26/2004 3:12 PM, sps: 20032 updates.
     • v2.0, 1/17/2005, chr
