Learn the fundamentals and advanced concepts of regular expressions in Perl for efficient string handling. Discover metacharacters, quantifiers, pattern testing, and more.
Programming Language Concepts
Perl Regular Expressions
Adapted by Carl Reynolds from materials by Sean P. Strout
PLC - Perl Regular Expressions (v2.0)
What are Regular Expressions?
• Regular expressions are a tiny language used for fast, flexible, and reliable string handling
• Where are regexes found?
    • Text editors like vi and emacs
    • Unix commands like grep, sed, awk and procmail
    • Web-based search engines, email clients, etc.
• Perl uses regexes for pattern matching, extraction and replacement on a given input string
• The theory is simple:
    • There are an infinite number of possible text strings
    • A given pattern divides this set into two groups: the ones that match and the ones that don't
What are Regular Expressions?
    % grep 'flint.*stone' some_file
    A piece of flint, a stone which may be used to start…
    Found obsidian, flint, granite, and small stones of…
    A flintlock rifle in poor condition. The sandstone mantle…
• Don't confuse a regex with globs, which are often used for directory operations
    my @java_src = glob "*.java";
    foreach my $file (@java_src) {
        system "javac $file";
    }
Using Simple Patterns
• To compare a pattern against the contents of $_
    $_ = "yabba dabba doo";
    if (/abba/) {
        print "It matched!\n";
    }
• The regex is enclosed in forward slashes
• Spaces inside the slashes are significant; they are not ignored
• The evaluation returns true if the four-letter string “abba” is found in $_
• Backslash escapes are available in patterns (just like double-quoted strings)
    /coke\tsprite/
Metacharacters
• There are a number of characters which have a special meaning in a regex
• The first is the dot, . , known as the wildcard character, which matches any single character except a newline
    /bet.y/     # matches:        betty  betsy  bet=y  bet.y
                # does not match: bety  betsey
• The second metacharacter is the backslash, \ , which makes any metacharacter nonspecial
• How do I match against these strings?
    3.14159     /3\.14159/
    b\t         /b\\t/
Simple Quantifiers
• Repetition in a pattern can be specified using the following three simple quantifiers
• The star, * , matches the preceding item zero or more times
    /fred\t*barney/   # fred\tbarney  fred\t\tbarney  fredbarney
    /fred.*barney/    # fred + any old junk + barney
• The plus, + , matches the preceding item one or more times
    /fred +barney/    # fred and barney separated by one or more spaces (and nothing else)
• The question mark, ? , matches the preceding item zero or one times
    /bamm-?bamm/      # bamm-bamm or bammbamm
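The three quantifiers can be tried out with a short script. This is only a sketch: the helper name matches and the use of qr// to store patterns in a list are assumptions made for the example, not something the slides introduce.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): 1 if $string matches $pattern, else 0.
sub matches {
    my ($pattern, $string) = @_;
    return $string =~ $pattern ? 1 : 0;
}

# Each row: a pattern from the slide, then a string to try against it.
my @checks = (
    [ qr/fred\t*barney/, "fredbarney"     ],   # * accepts zero tabs
    [ qr/fred\t*barney/, "fred\t\tbarney" ],   # * accepts two tabs
    [ qr/fred +barney/,  "fred   barney"  ],   # + accepts several spaces
    [ qr/fred +barney/,  "fredbarney"     ],   # + needs at least one space
    [ qr/bamm-?bamm/,    "bammbamm"       ],   # ? makes the hyphen optional
    [ qr/bamm-?bamm/,    "bamm--bamm"     ],   # ? allows at most one hyphen
);

for my $check (@checks) {
    my ($pattern, $string) = @$check;
    my $verdict = matches($pattern, $string) ? "match" : "no match";
    print "$verdict: $string\n";
}
```

Rows four and six are the ones expected to report no match, since + requires at least one space and ? permits at most one hyphen.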
More Metacharacters
• Use parentheses, () , for grouping a collection of items for a quantifier match
    /fred+/       # fred  fredd  freddd
    /(fred)+/     # fred  fredfred  fredfredfred
    /(fred)*/     # What does this match?
• Use the vertical bar, | , to match either the left side or the right side
    /fred|barney|betty/     # fred  barney  betty
    /fred( |\t)+barney/     # "fred barney"  "fred \tbarney"
• Rewrite the last pattern to match only if the characters between fred and barney are all spaces or all tabs
    /fred( +|\t+)barney/
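A small experiment may help with the two questions on this slide. It is a sketch only; the variable names are made up for the example. It shows that /(fred)*/ succeeds on every string, because zero copies is allowed, and that the rewritten pattern rejects a mix of spaces and tabs.

```perl
use strict;
use warnings;

# /(fred)*/ allows zero copies of fred, so it succeeds on any string at all.
my $star_hits = 0;
for my $string ("fredfred", "barney", "") {
    $star_hits++ if $string =~ /(fred)*/;
}
print "strings matched by /(fred)*/: $star_hits of 3\n";

# A space followed by a tab: accepted by ( |\t)+ but not by ( +|\t+).
my $mixed        = "fred \tbarney";
my $any_mix      = $mixed =~ /fred( |\t)+barney/  ? 1 : 0;
my $all_one_kind = $mixed =~ /fred( +|\t+)barney/ ? 1 : 0;
print "any_mix=$any_mix all_one_kind=$all_one_kind\n";
```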
Pattern Testing
• Use the Unix command, egrep , to test a regular expression against a file
• some.txt:
    barney or fred
    fred and barney
    % egrep 'fred (and|or) barney' some.txt
    fred and barney
• /usr/local/pub/sps/courses/plc/examples/perl/pattern_test
• Write a pattern that matches any string containing:
    • fred or Fred
        /fred|Fred/
    • At least one a followed by any number of b's
        /a+b*/
    • At least one backslash followed by one or zero asterisks
        /\\+\*?/
Character Classes
• A character class is a list of possible characters inside square brackets, [] , matching any single character from within the class
    [abcwxyz]     # matches one of those 7 chars
• The minus, - , is used to specify a range of characters
    [a-cw-z]      # same, uses hyphen for range
    [a-zA-Z]      # 52 upper/lowercase letters
    [\101-\132]   # A-Z, written as octal character codes (decimal 65-90)
    $_ = "The HAL-9000 requires authorization to continue.";
    if (/HAL-[0-9]+/) {
        print "The string matches some model of HAL.\n";
    }
Character Classes
• The caret, ^ , is used to negate a character class. This is a way to specify characters not to match against
    [^def]      # match any char except d, e, f
• Since minus is special inside a character class, it must be backslashed to make it non-special
    [^n\-z]     # match any char except n, -, z
• Write a regular expression that matches 3 letter strings:
    1st letter: a, b or c
    2nd letter: -
    3rd letter: not a, b or c
    /[abc]-[^abc]/
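The exercise pattern can be checked against a few candidate strings. The sketch below uses Perl's built-in grep to filter a list; the candidate strings and the anchored variant are additions for illustration. Anchors are not part of the slide's answer, but without them the pattern also finds the three characters inside longer strings.

```perl
use strict;
use warnings;

my @candidates = ("a-z", "b-b", "c-9", "d-x", "a_z", "xa-zx");

# Slide pattern: finds the three-character sequence anywhere in the string.
my @anywhere = grep { /[abc]-[^abc]/ } @candidates;

# Anchored variant (an addition here): the whole string must be those three characters.
my @exact = grep { /^[abc]-[^abc]$/ } @candidates;

print "anywhere: @anywhere\n";   # a-z c-9 xa-zx
print "exact:    @exact\n";      # a-z c-9
```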
Character Class Shortcuts
• Some character classes appear so frequently that they have shortcuts
    \d     # a single digit, [0-9]
    \w     # a "word" character, [A-Za-z0-9_]
    \s     # a "space" character, [ \f\t\n\r]
• Add quantifiers to beef up your pattern testing
    \w+    # match a series of one or more word chars
    \s*    # any amount of whitespace (including none)
    \s+    # one or more whitespace characters
Negating Character Class Shortcuts
• To do the opposite of a character class shortcut, negate it with the caret, ^ , or use the uppercase form of the shortcut
    [^\d]     # non-digit character, also \D
    [^\w]     # non-word character, also \W
    [^\s]     # non-space character, also \S
• What do these match?
    1. /[\dA-Fa-f]+/          a hex number
    2. /[\d\D]/               any digit or nondigit: any character, even newline
    3. /[^\d\D]/              anything that's not a digit or a nondigit: nothing
    4. /{\S+, \S+, \S+}/      {1.22, 5, 8.8}
General Quantifiers
• We've already seen the 3 simple quantifiers, *, + and ?
• Use curly braces, {} , to specify the minimum and maximum repetitions allowed
    /a{5,15}/       # match from 5 to 15 a's
    /(fred){3,}/    # match 3 or more fred in a row
    /\w{8}/         # match exactly 8 word chars
• The simple quantifiers, therefore, are just shortcuts for this general form. Can you come up with them?
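One way to check a proposed answer to that question is to compare each shorthand with a brace form on a handful of strings. The sketch below assumes the equivalences * = {0,}, + = {1,} and ? = {0,1}; the anchors and test strings are chosen only for the experiment.

```perl
use strict;
use warnings;

# Pairs of patterns expected to behave identically: shorthand, then brace form.
my @pairs = (
    [ qr/^a*$/, qr/^a{0,}$/  ],
    [ qr/^a+$/, qr/^a{1,}$/  ],
    [ qr/^a?$/, qr/^a{0,1}$/ ],
);

my $disagreements = 0;
for my $pair (@pairs) {
    my ($short, $general) = @$pair;
    for my $string ("", "a", "aa", "aaa", "b") {
        my $short_result   = $string =~ $short   ? 1 : 0;
        my $general_result = $string =~ $general ? 1 : 0;
        $disagreements++ if $short_result != $general_result;
    }
}
print "disagreements: $disagreements\n";
```

A count of zero is consistent with the equivalences, though a few sample strings are evidence rather than proof.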
Anchors
• The caret anchor, ^ , marks the beginning of a string
    • Yes, you've seen this used before, inside character classes for negation. The meaning is different outside a character class
• The dollar sign, $ , marks the end of the string (it also matches just before a trailing newline)
    /^fred/     # matches fred only at start of string
    /rock$/     # matches rock only at end of string
    /^fred$/    # only matches fred, alone as a string
    /^\s*$/     # What does this match? A blank line (empty or only whitespace)
Word Anchors
• The word boundary, \b , matches at either end of a word
    /\bfred\b/     # match the word fred
                   # but not frederick, alfred or manfred mann
    /\bhunt/       # matches hunt, hunting or hunter
                   # but not shunt
    /stone\b/      # matches sandstone, flintstone
                   # but not capstones
Word Anchors
• The word boundary matches at the start or end of a group of \w characters. Quotes and apostrophes don't change the word boundaries
• The non-word boundary, \B , matches where \b would not
    /\bsearch\B/     # match searches, searching, searched
                     # but not search or researching
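The boundary rules can be confirmed by filtering a word list. A minimal sketch follows, using Perl's built-in grep and a word list invented for the example.

```perl
use strict;
use warnings;

my @words = qw(search searches searching researching hunt shunt hunter);

# \b before "search" needs a word start; \B after it needs another word character.
my @search_with_suffix = grep { /\bsearch\B/ } @words;

# \b before "hunt" rejects "shunt", where the h follows another word character.
my @starts_with_hunt = grep { /\bhunt/ } @words;

print "search + suffix: @search_with_suffix\n";   # searches searching
print "starts with hunt: @starts_with_hunt\n";    # hunt hunter
```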
Memory Parentheses
• The parentheses are also used for remembering the substring matched by the pattern
• Backreferences refer back to an earlier saved memory within the same regex
    /(.)\1/     # match any char, followed by same char
• Each pair of parentheses equals one regex memory
• Ordering is outer to inner, left to right (count the opening parentheses)
    /((a|b) (c)) \1/     # matches a c a c or b c b c
    /((a|b) (c)) \2/     # matches a c a or b c b
    /((a|b) (c)) \3/     # matches a c c or b c c
• Soon we will see how we can extract these memories into memory variables for future use
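The numbering of nested memories is easier to see when the saved pieces are printed. The sketch below reads them through the match variables $1, $2 and $3 (the variables Perl sets for each pair of parentheses); the sample words are chosen for the example.

```perl
use strict;
use warnings;

# Doubled characters: (.) remembers one character and \1 demands it again.
my @doubled = grep { /(.)\1/ } qw(bamm flint betty stone);
print "doubled letters: @doubled\n";   # bamm betty

# Nested parentheses are numbered by their opening parenthesis.
my ($outer, $left, $right) = ("", "", "");
if ("a c a c" =~ /((a|b) (c)) \1/) {
    ($outer, $left, $right) = ($1, $2, $3);
}
print "memory 1='$outer' memory 2='$left' memory 3='$right'\n";
```

Memory 1 is the outer group ("a c"), memory 2 the first inner group ("a"), and memory 3 the second inner group ("c").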
Precedence
• There are four levels of precedence from high to low
    • Parentheses - used for grouping and memory
    • Quantifiers - the repeat operators
    • Anchors and Sequence - string and word boundaries, and sequences of items
    • Alternation - cuts pattern into two pieces
    /^fred|barney$/        # fred at start or barney at end
    /^(fred|barney)$/      # exactly fred or exactly barney, alone as a string
    /(wilma|pebbles?)/     # wilma, pebbles, or pebble
                           # (as part of a string)
    /^(\w+)\s+(\w+)$/      # What does this match?
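The difference between the first two patterns shows up clearly on a short list of strings. This is a sketch with invented test strings: because alternation has the lowest precedence, the ungrouped pattern means "fred at the start, or barney at the end".

```perl
use strict;
use warnings;

my @strings = ("fred", "frederick", "barney", "big barney", "fred and barney");

# Without parentheses the anchors bind to only one alternative each.
my @ungrouped = grep { /^fred|barney$/ } @strings;

# With parentheses both anchors apply to whichever alternative is chosen.
my @grouped = grep { /^(fred|barney)$/ } @strings;

print "ungrouped matches: ", scalar(@ungrouped), "\n";   # all 5 strings
print "grouped matches:   @grouped\n";                   # fred barney
```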
Regex Cheat Sheet [table not recoverable from the slides]
Testing Your Understanding
• Use egrep and write the regular expressions to match on the dictionary file: ~jdb/public_html/plc/perlLab/dictionary.txt
    • Words starting with a
    • Words that start with a or A
    • Words with exactly four letters
    • Words with exactly four letters that begin with a
    • Words with four or more letters
    • Words that contain a capital letter
    • Words that start with a non-capital and include a capital
    • Words with more than one capital
    • Words that neither begin nor end with a
    • Words that begin and end with a vowel
    • Words that neither begin nor end with a vowel
    • Words that contain the vowel sequence a-e-i-o-u
    • Words that contain your initials in order
    • Words that contain only the letters of your first name
Matches with m//
• Evaluating a regex, /fred/ , is actually a shortcut for the pattern match operator, m/fred/
• Any pair of delimiters may be used to quote the contents
• The leading m may be omitted only when the delimiter is the forward slash
• Matching a URL might be easier with a different delimiter
    m%^http://%
Option Modifiers
• Option modifiers are activated by appending a flag to the closing delimiter
• Case-insensitive matching is achieved with /i
    print "Would you like to play a game? ";
    chomp($_ = <STDIN>);
    if (/\byes\b/i) {
        print "1. Global Thermonuclear War";
        …
    }
• The dot matches any character (including newline) with /s
    $_ = "I saw Barney\nbowling\nwith Fred\nlast night.\n";
    if (/\bBarney\b.*\bFred\b/s) {
        print "The string mentions Fred after Barney!\n";
    }
Combining Option Modifiers
• Concatenate flags together to apply multiple options to the same pattern
    if (/\bbarney\b.*\bfred\b/si) {     # both /s and /i
        print "The string mentions Fred after Barney!\n";
    }
• Use man perlop to get the complete list of options
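To see what each flag contributes, the same text can be matched with the flags in different combinations. A sketch follows, with variable names invented for the example; it reuses the multi-line Barney and Fred string from the /s slide.

```perl
use strict;
use warnings;

my $text = "I saw Barney\nbowling\nwith Fred\nlast night.\n";

# No flags: the dot stops at the first newline, so Fred is out of reach.
my $no_flags = $text =~ /\bBarney\b.*\bFred\b/   ? 1 : 0;

# /s alone: the dot crosses newlines, and the case already agrees.
my $s_only   = $text =~ /\bBarney\b.*\bFred\b/s  ? 1 : 0;

# /i alone: the lowercase pattern is fine, but the dot still stops at a newline.
my $i_only   = $text =~ /\bbarney\b.*\bfred\b/i  ? 1 : 0;

# /si together: both problems are solved.
my $both     = $text =~ /\bbarney\b.*\bfred\b/si ? 1 : 0;

print "none=$no_flags s=$s_only i=$i_only si=$both\n";   # none=0 s=1 i=0 si=1
```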
The Binding Operator
• Matching against $_ is the default. To match against an arbitrary string on the left, use =~
    my $some_other = "I dream of betty rubble.";
    if ($some_other =~ /\brub/) {
        print "Aye, there's the rub.\n";
    }
• This is not an assignment; it is an evaluation that returns the result of the test
    my $likes_perl = <STDIN> =~ /\byes\b/i;
    if ($likes_perl) {
        print "Why don't you marry it?";
    }
Interpolating into Patterns
• A regex is interpolated as if it were a double-quoted string
    #!/usr/bin/perl -w
    my $what = "larry";
    while (<>) {
        if (/^($what)/) {     # pattern anchored at start
            print "Saw $what at beginning of $_";
        }
    }
Match Variables
• When using memories, (parens), the results are stored into scalar variables with names like $1 and $2
• These variables coincide with the order of the parentheses pairs in the regex and allow us to pull out parts of the string
    $_ = "Hello there, neighbor";
    if (/\s(\w+),/) {                      # memorize word between space and comma
        print "the word was $1\n";         # got the word "there"
    }
    if (/(\S+) (\S+), (\S+)/) {
        print "words were $1 $2 $3\n";     # $1=Hello
    }                                      # $2=there
                                           # $3=neighbor
Match Variables
• Continually match an expression against a string to find multiple occurrences of the same pattern
    $_ = "word1 word2 word3 word4";
    while (/(\S+)(.*)/) {
        print "$1\n";
        $_ = $2;
    }
• A memory variable may contain an empty string
    $dino = "A million years.";
    if ($dino =~ /(\d*) years/) {
        # $1 is the empty string; $2, $3, … are undef
    }
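The loop on this slide consumes $_ as it goes. A variation that works on a copy and collects the words might look like the sketch below; the function name words_in is hypothetical. Like the slide's loop, it stops at the first newline, because the dot does not match a newline without /s.

```perl
use strict;
use warnings;

# Hypothetical helper: peel off one word per iteration, as in the slide's loop,
# but on a private copy so the caller's string is left alone.
sub words_in {
    my ($text) = @_;
    my @words;
    while ($text =~ /(\S+)(.*)/) {
        push @words, $1;   # the word just found
        $text = $2;        # carry on with the remainder
    }
    return @words;
}

my @found = words_in("word1 word2 word3 word4");
print "$_\n" for @found;

# An empty memory is still defined: \d* matched zero digits here.
my $empty_but_defined = 0;
if ("A million years." =~ /(\d*) years/) {
    $empty_but_defined = (defined $1 && $1 eq "") ? 1 : 0;
}
print "empty but defined: $empty_but_defined\n";
```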
The Persistence of Memory
• The match variables generally stay around until the next successful pattern match
• An unsuccessful match leaves the previous match variables intact
    $wilma =~ /(\w+)/;     # suppose this match fails
    print "Wilma's word was $1… or was it?\n";
• For maintainability, don't use a match variable more than a few lines after the pattern match. Copy it into an ordinary variable
    if ($wilma =~ /(\w+)/) {
        my $wilma_word = $1;
    }
Automatic Match Variables
• There are three more free variables from a pattern match
    • Whatever came before the match is in $`
    • Whatever matched is stored in $&
    • Whatever came after the match is in $'
    if ("Hello there, neighbor" =~ /\s(\w+),/) {
        print "That was ($`)($&)($').\n";
    }
    # (Hello)( there,)( neighbor)
    # $1=there (note the difference between $1 and $&)
• Revisit the pattern_test program
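The course's pattern_test program is not reproduced in these slides. As a rough stand-in, a tester built on the three automatic variables could look like the sketch below; the function name show_match and the output format are assumptions.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a pattern tester; not the course's pattern_test.
# Marks the matched part of the line with angle brackets.
sub show_match {
    my ($pattern, $line) = @_;
    if ($line =~ $pattern) {
        return "Matched: |$`<$&>$'|";   # before, <matched>, after
    }
    return "No match: |$line|";
}

my $hit  = show_match(qr/\s(\w+),/, "Hello there, neighbor");
my $miss = show_match(qr/fred/, "barney");
print "$hit\n";    # Matched: |Hello< there,> neighbor|
print "$miss\n";   # No match: |barney|
```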
Substitution with s///
• Replace whatever part of a variable matches a pattern with a replacement string
    $_ = "Hey, Barney!";
    s/Barney/Fred/;
    print "$_\n";              # Hey, Fred!
• If the match fails, nothing happens and the variable is untouched
    s/Wilma/Betty/;            # $_="Hey, Fred!"
• Memory variables can also be used
    s/Hey, (\w+)/Yo, $1/;      # $_="Yo, Fred!"
Substitution with s///
• Follow through with this series of substitutions
    $_ = "green scaly dino";
    s/(\w+) (\w+)/$2, $1/;     # "scaly, green dino"
    s/^/huge, /;               # "huge, scaly, green dino"
    s/,.*een//;                # "huge dino" (greedy match)
    s/green/red/;              # no match, returns false: still "huge dino"
    s/\w+$/($`!)$&/;           # "huge (huge !)dino"
    s/\s+(!\W+)/$1 /;          # "huge (huge!) dino"
    s/huge/gigantic/;          # "gigantic (huge!) dino"
• The return value is true if the substitution was successful
Global Replacement with /g
• The /g modifier tells s/// to make all possible nonoverlapping replacements
    $_ = "home, sweet home!";
    s/home/cave/g;     # $_="cave, sweet cave!"
• Collapsing whitespace
    $_ = "Input data\t may have extra whitespace.";
    s/\s+/ /g;         # "Input data may have extra whitespace."
• Stripping whitespace
    s/^\s+//;          # Replace leading whitespace with nothing
    s/\s+$//;          # Replace trailing whitespace with nothing
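The collapsing and stripping steps are often applied together. A minimal sketch that bundles them into one function follows; the name tidy is hypothetical, and =~ is used so the substitutions act on a named variable instead of $_.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse runs of whitespace, then trim both ends.
sub tidy {
    my ($text) = @_;
    $text =~ s/\s+/ /g;   # every run of whitespace becomes one space
    $text =~ s/^\s+//;    # drop leading whitespace
    $text =~ s/\s+$//;    # drop trailing whitespace
    return $text;
}

my $clean = tidy("  Input   data\t may have    extra whitespace.  \n");
print "[$clean]\n";   # [Input data may have extra whitespace.]
```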
The Binding Operator and Case Shifting
• Choose a different target for s/// by using the binding operator
    $file = "/usr/bin/perl";
    $file =~ s#^.*/##s;     # $file=perl
                            # (s means dot matches any character)
• Use \U and \L to force what follows to uppercase and lowercase
    $_ = "I saw Barney with Fred.";
    s/(fred|barney)/\U$1/gi;     # "I saw BARNEY with FRED."
    s/(fred|barney)/\L$1/gi;     # "I saw barney with fred."
• Turn off case shifting with \E
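The slide mentions \E without an example. The sketch below shows one, alongside the path-stripping substitution; it copies each string before changing it, using the (my $copy = $original) =~ s/// idiom, so the originals stay intact. The variable names are invented for the example.

```perl
use strict;
use warnings;

my $file = "/usr/bin/perl";
(my $name = $file) =~ s#^.*/##s;   # strip everything up to the last slash

my $text = "I saw Barney with Fred.";

# \U applies to the rest of the replacement...
(my $shout = $text) =~ s/(fred|barney)/\U$1/gi;

# ...unless \E switches it off again partway through.
(my $swapped = $text) =~ s/(\w+) with (\w+)/\U$2\E with $1/;

print "$name\n";      # perl
print "$shout\n";     # I saw BARNEY with FRED.
print "$swapped\n";   # I saw FRED with Barney.
```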
The split Operator
• Use split to break a string according to a separator
    @fields = split /:/, "abc:def:g:h";      # gives ("abc", "def", "g", "h")
    @fields = split /:/, "abc:def::g:h";     # gives ("abc", "def", "", "g", "h")
• Leading empty fields are returned but trailing ones are not
    @fields = split /:/, ":::a:b:c:::";      # gives ("", "", "", "a", "b", "c")
The split Operator
• Splitting on /\s+/ treats a run of whitespace as a single separator
    my $some_input = "This is a \t test.\n";
    my @args = split /\s+/, $some_input;     # ("This", "is", "a", "test.")
• The default for split is to break up $_ on whitespace
    my @fields = split;     # like: split /\s+/, $_; except that leading whitespace is skipped
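The edge cases of split are easiest to trust after counting the fields. The sketch below uses sample strings invented for the example: it checks the leading and trailing empty fields, and compares /\s+/ with the single-space string ' ', which is the separator split uses by default.

```perl
use strict;
use warnings;

# Leading empty fields are kept, trailing empty fields are dropped.
my @fields = split /:/, ":::a:b:c:::";
print "field count: ", scalar(@fields), "\n";        # 6

# With /\s+/ leading whitespace yields one empty first field.
my @with_regex = split /\s+/, "  leading space";
print "regex count: ", scalar(@with_regex), "\n";    # 3

# With ' ' (the default separator) leading whitespace is skipped.
my @with_space = split ' ', "  leading space";
print "space count: ", scalar(@with_space), "\n";    # 2
```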
The join Function
• The join function doesn't use patterns, but it is considered the opposite of split
    my $x = join ":", 4, 6, 8, 10, 12;     # $x=4:6:8:10:12
    my $y = join "foo", "bar";             # gives just "bar" (foo not needed as glue)
    my @empty;
    my $empty = join "baz", @empty;        # no items, it's an empty string
    my @values = split /:/, $x;            # @values=(4,6,8,10,12)
    my $z = join "-", @values;             # $z="4-6-8-10-12"
Testing Your Understanding
• Make a pattern that will match three consecutive copies of whatever is currently contained in $what. That is, if $what is fred, your pattern should match fredfredfred. If $what is fred|barney, your pattern should match fredfredbarney or barneyfredfred or barneybarneybarney or many other variations.
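One possible answer, offered as a sketch and not as the official solution, wraps $what in parentheses before applying {3}. The function names are hypothetical. The ungrouped version is included to show why the parentheses matter: with fred|barney it parses as "fred, or barne followed by three y's". The braces in ${what} are needed there because $what{3} would otherwise be read as a hash element.

```perl
use strict;
use warnings;

# Hypothetical helper: three consecutive copies of whatever $what describes.
sub three_copies {
    my ($what) = @_;
    return qr/($what){3}/;
}

# Same idea without the grouping parentheses, for comparison.
sub three_copies_ungrouped {
    my ($what) = @_;
    return qr/${what}{3}/;
}

my $grouped   = three_copies("fred|barney");
my $ungrouped = three_copies_ungrouped("fred|barney");

for my $string ("fredfredbarney", "barneybarneybarney", "fred") {
    my $g = $string =~ $grouped   ? "yes" : "no";
    my $u = $string =~ $ungrouped ? "yes" : "no";
    print "$string: grouped=$g ungrouped=$u\n";
}
```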
Revision History
• v1.00, 10/19/2003 11:27 PM, sps
    Initial revision.
• v1.01, 1/26/2004 3:12 PM, sps
    20032 updates.
• v2.0, 1/17/2005, chr