Regular Expressions in Perl – Part 1

Karthik Sangaiah

Perl Text Processing
• Developed by Larry Wall
• “There’s more than one way to do it”
• “Easy things should be easy and hard things should be possible”
• Perl’s main purpose was text manipulation
• Regular expressions are fundamental to text processing

Regular Expressions (Regex)
• A regex is a string that describes a pattern
• The simplest regex is a word
• A regex consisting of a word matches any string that contains that word
• Ex:
    "Hello World" =~ /World/;

Basic Regex Operators
• “!~” operator produces TRUE if the regex does NOT match a string
• Ex:
    if ("Sample Words" !~ /Sample/) {
        print "It doesn't match\n";
    } else {
        print "It matches\n";
    }
• “=~” operator produces TRUE if the regex matches a string
• Ex:
    if ("Sample Words" =~ /Sample/) {
        print "It matches\n";
    } else {
        print "It doesn't match\n";
    }
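As a minimal sketch of the two operators side by side, the snippet below stores each result in a variable instead of branching. The variable names are illustrative assumptions and do not come from the slides.

```perl
use strict;
use warnings;

my $text = "Sample Words";

# =~ is true when the pattern is found somewhere in the string.
my $has_sample = ($text =~ /Sample/) ? 1 : 0;    # 1

# !~ is the negation of =~ for the same string and pattern.
my $lacks_sample = ($text !~ /Sample/) ? 1 : 0;  # 0

# A word that does not occur in the string flips both results.
my $has_other   = ($text =~ /Other/) ? 1 : 0;    # 0
my $lacks_other = ($text !~ /Other/) ? 1 : 0;    # 1

print "has Sample: $has_sample, lacks Sample: $lacks_sample\n";
print "has Other:  $has_other, lacks Other:  $lacks_other\n";
```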

Variables in Regex – Part 1
• A variable can be used as a regex
• Ex:
    $temp = "ls";
    "ls -l" =~ /$temp/;
• When matching against the default variable “$_”, the “$_ =~” can be omitted
• Ex:
    $_ = "ls -l";
    if (/ls/) {
        print "It matches\n";
    } else {
        print "It doesn't match\n";
    }
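The two forms above can be sketched together as follows. This is only an illustration: the names `$command`, `$matched_variable`, and `$matched_default` are assumptions made for the example.

```perl
use strict;
use warnings;

# A pattern held in a variable is interpolated into the regex.
my $command = "ls";
my $matched_variable = ("ls -l" =~ /$command/) ? 1 : 0;   # 1

# With the default variable $_, the "$_ =~" part can be left out.
$_ = "ls -l";
my $matched_default = (/ls/) ? 1 : 0;                     # 1

# The bare form is shorthand for the explicit form.
my $matched_explicit = ($_ =~ /ls/) ? 1 : 0;              # 1

print "$matched_variable $matched_default $matched_explicit\n";
```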

Variables in Regex – Part 2
• Regexes in Perl are mostly treated as double-quoted strings
• The values of variables in a regex are substituted in before the regex is evaluated for matching
• Ex:
    $foo = 'vision';
    'television' =~ /tele$foo/;
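A small sketch of the substitution step, assuming nothing beyond the `$foo` example above. The last two lines show what happens when the `$` is backslashed so that no substitution takes place; the result variable names are hypothetical.

```perl
use strict;
use warnings;

my $foo = 'vision';

# $foo is replaced by "vision", so the regex becomes /television/.
my $interpolated = ('television' =~ /tele$foo/) ? 1 : 0;   # 1
my $other_word   = ('telephone'  =~ /tele$foo/) ? 1 : 0;   # 0

# A backslashed $ is a literal dollar sign: the regex now looks for
# the five characters "tele" followed by "$foo" as plain text.
my $escaped_miss = ('television' =~ /tele\$foo/) ? 1 : 0;  # 0
my $escaped_hit  = ('tele$foo'   =~ /tele\$foo/) ? 1 : 0;  # 1

print "$interpolated $other_word $escaped_miss $escaped_hit\n";
```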

Delimiters
• The default “/ /” delimiters can be replaced with arbitrary delimiters by writing “m” in front of the regex
• Ex:
    "Sample Text" =~ m!Text!;
    "Sample Text" =~ m{Text};
    "Sample Text" =~ m"Text";
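To illustrate that the choice of delimiter does not change what is matched, this sketch runs the same pattern with four delimiter styles and collects the results. The array name `@results` is an assumption made for the example.

```perl
use strict;
use warnings;

my $text = "Sample Text";

# The same regex written with four different delimiters.
my @results = (
    ($text =~ /Text/)  ? 1 : 0,   # default slashes
    ($text =~ m!Text!) ? 1 : 0,   # exclamation marks
    ($text =~ m{Text}) ? 1 : 0,   # paired braces
    ($text =~ m"Text") ? 1 : 0,   # double quotes
);

print join(" ", @results), "\n";   # prints "1 1 1 1"
```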

Metacharacters
• Reserved for use in regex notation
• { }, [ ], ( ), ^, $, ., |, *, +, ?, \
• To match a metacharacter literally, put “\” before it in the regex
• Ex:
    "5*2=10" =~ /5\*2/;
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;
• “/” also needs to be backslashed if it is used as the delimiter
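The effect of the backslash can be sketched with the “.” metacharacter, which matches any character except a newline when left unescaped. The test strings and variable names below are illustrative assumptions. The last line shows that switching delimiters removes the need to backslash “/”.

```perl
use strict;
use warnings;

# Unescaped, "." stands for any character, so "3514" matches.
my $dot_unescaped = ("3514" =~ /3.14/)  ? 1 : 0;   # 1

# Escaped, "\." matches only a literal period.
my $dot_escaped   = ("3514" =~ /3\.14/) ? 1 : 0;   # 0
my $dot_literal   = ("3.14" =~ /3\.14/) ? 1 : 0;   # 1

# The slide's examples: a literal "*" and literal "/" characters.
my $star_escaped  = ("5*2=10" =~ /5\*2/) ? 1 : 0;                    # 1
my $path_escaped  = ("/usr/bin/perl" =~ /\/usr\/bin\/perl/) ? 1 : 0; # 1

# With braces as delimiters, "/" is no longer special.
my $path_braces   = ("/usr/bin/perl" =~ m{/usr/bin/perl}) ? 1 : 0;   # 1

print "$dot_unescaped $dot_escaped $dot_literal ",
      "$star_escaped $path_escaped $path_braces\n";
```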

Anchor Metacharacters
• “^” matches at the beginning of the string
• “$” matches at the end of the string, or before a newline at the end of the string
• Ex:
    "television" =~ /^tele/;
    "television" =~ /vision$/;
• When “^” and “$” are used together, the regex has to match at both the beginning and the end of the string (i.e. match the whole string)
• Ex:
    "vision" =~ /^vision$/;
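A short sketch of the anchors, including the “before a newline at the end of the string” case. The variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

# ^ pins the match to the start, $ pins it to the end.
my $starts_with_tele = ("television" =~ /^tele/)   ? 1 : 0;  # 1
my $ends_with_vision = ("television" =~ /vision$/) ? 1 : 0;  # 1

# With both anchors the whole string has to be "vision".
my $whole_television = ("television" =~ /^vision$/) ? 1 : 0; # 0
my $whole_vision     = ("vision"     =~ /^vision$/) ? 1 : 0; # 1

# $ also matches just before a newline that ends the string.
my $before_newline   = ("vision\n"   =~ /^vision$/) ? 1 : 0; # 1

print "$starts_with_tele $ends_with_vision ",
      "$whole_television $whole_vision $before_newline\n";
```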

Character Classes – Part 1
• A character class allows a set of possible characters, rather than a single character, to match at one position
• Character classes are denoted by […], with the set of characters to match inside
• Ex:
    /[btc]all/;           # matches ball, tall, or call
    /word[0123456789]/;   # matches word0 … word9

Character Classes – Part 2
• Special characters in a character class are handled with a backslash as well
• Special characters within a character class: “-”, “]”, “\”, “^”, “$”
• “.” is an ordinary character inside a character class
• Ex:
    /[\$c]w/;       # matches $w or cw
    $x = 'btc';
    /[$x]all/;      # matches ball, tall, or call
    /[\$x]all/;     # matches $all or xall
    /[\\$x]all/;    # matches \all, ball, tall, or call
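The four class examples above can be checked with a sketch like the following. The test strings and result names are assumptions chosen to exercise each case; single-quoted strings are used so that `$` and `\` in the test data are taken literally.

```perl
use strict;
use warnings;

my $x = 'btc';

# [$x] interpolates $x, giving the class [btc].
my $interp_tall = ('tall' =~ /[$x]all/) ? 1 : 0;      # 1

# [\$x] is a class of two literal characters: "$" and "x".
my $dollar_all  = ('$all' =~ /[\$x]all/) ? 1 : 0;     # 1
my $x_all       = ('xall' =~ /[\$x]all/) ? 1 : 0;     # 1
my $ball_miss   = ('ball' =~ /[\$x]all/) ? 1 : 0;     # 0

# [\\$x] is a literal backslash plus the interpolated "btc".
my $backslash   = ('\all' =~ /[\\$x]all/) ? 1 : 0;    # 1
my $ball_hit    = ('ball' =~ /[\\$x]all/) ? 1 : 0;    # 1

print "$interp_tall $dollar_all $x_all $ball_miss $backslash $ball_hit\n";
```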

Character Classes – Part 3
• The special character “-” is used as a range operator
• Ex:
    /word[0-9]/;      # matches word0 … word9
    /word[0-9a-z]/;   # matches word0 … word9, or worda … wordz
• The special character “^” in the first position of a character class denotes a negated character class
• Ex:
    /[^0-9]/;         # matches a non-numeric character
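A brief sketch of ranges and negation. The sample strings are assumptions picked to show one hit and one miss per pattern.

```perl
use strict;
use warnings;

# Ranges: a digit, or a digit / lowercase letter, after "word".
my $digit_hit  = ("word7" =~ /word[0-9]/)    ? 1 : 0;  # 1
my $lower_hit  = ("wordq" =~ /word[0-9a-z]/) ? 1 : 0;  # 1
my $upper_miss = ("wordQ" =~ /word[0-9a-z]/) ? 1 : 0;  # 0

# Negation: [^0-9] needs at least one character that is not a digit.
my $has_non_digit = ("abc"   =~ /[^0-9]/) ? 1 : 0;     # 1
my $all_digits    = ("12345" =~ /[^0-9]/) ? 1 : 0;     # 0

print "$digit_hit $lower_hit $upper_miss $has_non_digit $all_digits\n";
```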

Character Classes – Part 4
• Common character class abbreviations:
    • \d – digit, [0-9]
    • \s – whitespace character, [\ \t\r\n\f]
    • \w – word character (alphanumeric or _), [0-9a-zA-Z_]
    • \D – negated \d, [^0-9]
    • \S – negated \s, [^\s]
    • \W – negated \w, [^\w]
    • . – any character but “\n”
• The backslash abbreviations can be used both inside and outside character classes
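The abbreviations can be tried out with a sketch such as this one. The sample strings are assumptions; the patterns repeat `\d` by hand because quantifiers are not covered in this part.

```perl
use strict;
use warnings;

# \d outside a class: four digits, dash, two digits, dash, two digits.
my $date_like = ("2024-01-31" =~ /\d\d\d\d-\d\d-\d\d/) ? 1 : 0;  # 1

# \w and \s: word character, whitespace, word character.
my $spaced   = ("a b" =~ /\w\s\w/) ? 1 : 0;    # 1
my $unspaced = ("ab"  =~ /\w\s\w/) ? 1 : 0;    # 0

# \D needs a non-digit somewhere in the string.
my $non_digit = ("12345" =~ /\D/) ? 1 : 0;     # 0

# "." matches any character except a newline.
my $dot_dash    = ("a-b"  =~ /a.b/) ? 1 : 0;   # 1
my $dot_newline = ("a\nb" =~ /a.b/) ? 1 : 0;   # 0

# Abbreviations inside a class: a digit or a whitespace character.
my $class_hit  = ("a 1" =~ /[\d\s]/) ? 1 : 0;  # 1
my $class_miss = ("a-b" =~ /[\d\s]/) ? 1 : 0;  # 0

print "$date_like $spaced $unspaced $non_digit ",
      "$dot_dash $dot_newline $class_hit $class_miss\n";
```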

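Word Anchors
• “\b” matches the boundary between a word character and a non-word character
• The start or end of the string also counts as a boundary when the adjacent character is a word character
• Ex:
    $x = "Exam1 Question from Sample Exam";
    $x =~ /Exam/;       # matches Exam in Exam1
    $x =~ /\bExam/;     # matches Exam in Exam1 (boundary at start of string)
    $x =~ /\bExam\b/;   # matches Exam at end of string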
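To see which “Exam” each pattern picks, this sketch records the offset where the match starts, using Perl's built-in `@-` array (`$-[0]` holds the start offset of the last successful match). The variable names are assumptions, and `-1` is an arbitrary marker for "no match".

```perl
use strict;
use warnings;

my $x = "Exam1 Question from Sample Exam";

# No anchors: the first "Exam" (inside "Exam1") is found at offset 0.
my $plain_offset = ($x =~ /Exam/) ? $-[0] : -1;          # 0

# Leading \b: the start of the string counts as a boundary,
# so the match is still the "Exam" in "Exam1".
my $leading_offset = ($x =~ /\bExam/) ? $-[0] : -1;      # 0

# \b on both sides: "Exam1" fails because "m" and "1" are both
# word characters, so only the final "Exam" qualifies.
my $both_offset = ($x =~ /\bExam\b/) ? $-[0] : -1;       # 27

print "$plain_offset $leading_offset $both_offset\n";
```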

Single line vs. Multi-line – Part 1
• Often, we want to match against lines and ignore newline characters
• Sometimes we need to keep track of newlines
• //s – single-line matching
• //m – multi-line matching
• These modifiers affect two aspects of how the regex is interpreted:
    • How the “.” character class is defined
    • Where the anchors ^ and $ are able to match

Single line vs. Multi-line – Part 2
• No modifier (//) – default
    • . matches any character but \n
    • ^ matches at the beginning of the string
    • $ matches at the end of the string or before a newline at the end of the string
• String as a single long line (//s)
    • . matches any character
    • ^ matches at the beginning of the string
    • $ matches at the end of the string or before a newline at the end of the string

Single line vs. Multi-line – Part 3
• String as multiple lines (//m)
    • . matches any character but \n
    • ^ matches at the beginning of any line within the string
    • $ matches at the end of any line within the string
• String as a single long line, but detect multiple lines (//sm)
    • . matches any character
    • ^ matches at the beginning of any line within the string
    • $ matches at the end of any line within the string

Single line vs. Multi-line – Part 4
• Ex:
    $x = "You will know how to use Perl\nFor text processing\n";
    $x =~ /^For/;     # no match, "For" not at start of string
    $x =~ /^For/s;    # no match, "For" not at start of string
    $x =~ /^For/m;    # match, "For" at start of second line
    $x =~ /^For/sm;   # match, "For" at start of second line
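The slide's example exercises only “^”. As a sketch under the same test string, the snippet below also checks “.” and “$” so that all three rules from the preceding slides are covered. The hash and variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

my $x = "You will know how to use Perl\nFor text processing\n";

# ^ : only //m lets it match at the start of the second line.
my %caret = (
    none => ($x =~ /^For/)   ? 1 : 0,   # 0
    s    => ($x =~ /^For/s)  ? 1 : 0,   # 0
    m    => ($x =~ /^For/m)  ? 1 : 0,   # 1
    sm   => ($x =~ /^For/sm) ? 1 : 0,   # 1
);

# . : only //s lets it match the newline between "Perl" and "For".
my $dot_default = ($x =~ /Perl.For/)  ? 1 : 0;   # 0
my $dot_single  = ($x =~ /Perl.For/s) ? 1 : 0;   # 1

# $ : only //m lets it match at the end of the first line.
my $end_default = ($x =~ /Perl$/)  ? 1 : 0;      # 0
my $end_multi   = ($x =~ /Perl$/m) ? 1 : 0;      # 1

print join(" ", map { "$_=$caret{$_}" } qw(none s m sm)), "\n";
print "$dot_default $dot_single $end_default $end_multi\n";
```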

Alternation
• Alternation metacharacter “|”
• Used to match different possible words or character strings
• Word 1 or word 2 -> /word1|word2/;
• Perl tries to match the regex at the earliest possible point in the string
• At a given position, the alternatives are tried from left to right and the first one that matches is used
• Ex:
    "shoes and strings" =~ /shoes|strings|and/;   # matches "shoes"
    "shoes" =~ /s|sh|sho|shoes/;                  # matches "s"
    "shoes" =~ /shoes|sho|s/;                     # matches "shoes"
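Which alternative wins can be observed with a sketch like the one below, which reads the matched text from Perl's built-in `$&` variable after each match. The variable names are assumptions; `$earliest` adds one extra case to show the "earliest point in the string" rule taking priority over the order of the alternatives.

```perl
use strict;
use warnings;

# Earliest position wins: "shoes" starts at offset 0.
my $first = ("shoes and strings" =~ /shoes|strings|and/) ? $& : undef;  # "shoes"

# "and" occurs before "strings" in the string, so it is matched
# even though "strings" is listed first in the regex.
my $earliest = ("shoes and strings" =~ /strings|and/) ? $& : undef;     # "and"

# At the same position, alternatives are tried left to right.
my $short = ("shoes" =~ /s|sh|sho|shoes/) ? $& : undef;                 # "s"
my $long  = ("shoes" =~ /shoes|sho|s/)    ? $& : undef;                 # "shoes"

print "$first / $earliest / $short / $long\n";
```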

Resources
• Perl Resource 5: Perl Regular Expressions Tutorial
    http://www.cs.drexel.edu/~knowak/cs265_fall_2010/perlretut_2007.pdf
• Perl History
    http://www.xmluk.org/perl-cgi-history-information.htm
• Perl Special Variables
    http://www.kichwa.com/quik_ref/spec_variables.html

Questions?
