Regular Expressions in Perl – Part 1 Karthik Sangaiah
Perl Text Processing • Developed by Larry Wall • “There’s more than one way to do it” • “Easy things should be easy and hard things should be possible” • Main purpose of Perl was for text manipulation • Regular Expressions fundamental to text processing
Regular Expressions (Regex) • A regex is a string that describes a pattern • The simplest regex is a word • A regex consisting of a word matches any string that contains that word • Ex: • "Hello World" =~ /World/; # matches
Basic Regex Operators • The “!~” operator produces TRUE if the regex does NOT match a string • Ex: • if ("Sample Words" !~ /Sample/) { print "It doesn't match\n"; } else { print "It matches\n"; } • The “=~” operator produces TRUE if the regex matches a string • Ex: • if ("Sample Words" =~ /Sample/) { print "It matches\n"; } else { print "It doesn't match\n"; }
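The two operators above can be tried out with a small script. This is a minimal sketch; the helper names `describe_match` and `lacks_sample` are hypothetical and do not come from the slides.

```perl
use strict;
use warnings;

# Hypothetical helper: reports whether $text contains the word "Sample".
sub describe_match {
    my ($text) = @_;
    return $text =~ /Sample/ ? "It matches" : "It doesn't match";
}

# Hypothetical helper: true (1) when $text does NOT contain "Sample".
sub lacks_sample {
    my ($text) = @_;
    return $text !~ /Sample/ ? 1 : 0;
}

print describe_match("Sample Words"), "\n";   # It matches
print describe_match("Other Words"),  "\n";   # It doesn't match
print lacks_sample("Other Words"),    "\n";   # 1
```

Since `!~` is simply the negation of `=~`, either form can express the same test; which one reads better depends on which branch is the interesting one.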
Variables in Regex – Part 1 • A variable can be used as a regex • Ex: • $temp = "ls"; • "ls -l" =~ /$temp/; # matches • When matching against the default variable “$_”, the “$_ =~” part can be omitted • Ex: • $_ = "ls -l"; • if (/ls/) { print "It matches\n"; } else { print "It doesn't match\n"; }
Variables in Regex – Part 2 • Regexes in Perl are mostly treated as double-quoted strings • Values of variables in a regex are substituted in before the regex is evaluated for matching • Ex: • $foo = 'vision'; • 'television' =~ /tele$foo/; # matches, same as /television/
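As a rough illustration of both variable slides, the sketch below interpolates a variable into a pattern and matches against `$_` implicitly. The helper names `matches_tele` and `count_ls_commands` are hypothetical, and the sketch assumes the interpolated value contains only plain letters.

```perl
use strict;
use warnings;

# Hypothetical helper: $suffix is substituted into the pattern before matching.
# Assumes $suffix holds plain letters (no metacharacters).
sub matches_tele {
    my ($word, $suffix) = @_;
    return $word =~ /tele$suffix/ ? 1 : 0;
}

# Hypothetical helper: counts the commands that contain "ls".
sub count_ls_commands {
    my @commands = @_;
    my $count = 0;
    for (@commands) {
        $count++ if /ls/;    # shorthand for: $_ =~ /ls/
    }
    return $count;
}

print matches_tele('television', 'vision'), "\n";       # 1
print matches_tele('telephone',  'vision'), "\n";       # 0
print count_ls_commands("ls -l", "pwd", "ls"), "\n";    # 2
```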
Delimiters • The default “/ /” delimiters can be changed to arbitrary delimiters by putting “m” in front of the regex • Ex: • "Sample Text" =~ m!Text!; • "Sample Text" =~ m{Text}; • "Sample Text" =~ m"Text";
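One practical reason to change delimiters is to avoid backslashing every “/” in a path. The sketch below compares the two spellings; the helper names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: default delimiters, so every "/" must be backslashed.
sub has_perl_path_slashes {
    my ($path) = @_;
    return $path =~ /\/usr\/bin\/perl/ ? 1 : 0;
}

# Hypothetical helper: same pattern with m{...}, no backslashes needed.
sub has_perl_path_braces {
    my ($path) = @_;
    return $path =~ m{/usr/bin/perl} ? 1 : 0;
}

for my $path ("/usr/bin/perl", "/usr/local/bin/python") {
    print "$path: ", has_perl_path_slashes($path), " ",
          has_perl_path_braces($path), "\n";
}
```

Both helpers should always agree, since only the delimiter differs and not the pattern itself.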
Metacharacters • Reserved for use in regex notation • { }, [ ], ( ), ^, $, ., |, *, +, ?, \ • To match a metacharacter literally, put “\” before it in the regex • Ex: • "5*2=10" =~ /5\*2/; • "/usr/bin/perl" =~ /\/usr\/bin\/perl/; • “/” is not a metacharacter, but it also needs to be backslashed when it is used as the delimiter
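To see why the backslash matters, the sketch below contrasts an escaped “.” with an unescaped one (which matches any character but a newline), and reuses the escaped “*” from the slide. The helper names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: "\." matches only a literal dot.
sub has_literal_pi {
    my ($text) = @_;
    return $text =~ /3\.14/ ? 1 : 0;
}

# Hypothetical helper: an unescaped "." matches any character except "\n".
sub has_loose_pi {
    my ($text) = @_;
    return $text =~ /3.14/ ? 1 : 0;
}

# Hypothetical helper: "\*" matches a literal asterisk.
sub has_five_times_two {
    my ($text) = @_;
    return $text =~ /5\*2/ ? 1 : 0;
}

print has_literal_pi("pi is 3.14"), "\n";     # 1
print has_literal_pi("3514"), "\n";           # 0
print has_loose_pi("3514"), "\n";             # 1
print has_five_times_two("5*2=10"), "\n";     # 1
```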
Anchor Metacharacters • “^” matches at the beginning of the string • “$” matches at the end of the string, or before a newline at the end of the string • Ex: • "television" =~ /^tele/; • "television" =~ /vision$/; • When “^” and “$” are used together, the regex has to match both the beginning and the end of the string (i.e. match the whole string) • Ex: • "vision" =~ /^vision$/;
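The anchor rules above, including the trailing-newline case for “$”, can be checked with a short sketch. The helper names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helpers, one per anchor combination.
sub starts_with_tele  { my ($s) = @_; return $s =~ /^tele/    ? 1 : 0; }
sub ends_with_vision  { my ($s) = @_; return $s =~ /vision$/  ? 1 : 0; }
sub is_exactly_vision { my ($s) = @_; return $s =~ /^vision$/ ? 1 : 0; }

print starts_with_tele("television"),  "\n";   # 1
print ends_with_vision("television"),  "\n";   # 1
print is_exactly_vision("television"), "\n";   # 0
print is_exactly_vision("vision"),     "\n";   # 1
print is_exactly_vision("vision\n"),   "\n";   # 1, "$" matches before the final newline
```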
Character Classes – Part 1 • A character class allows a set of possible characters, rather than a single character, to match at a particular point in the regex • Character classes are denoted by […], with the set of characters to be matched inside • Ex: • /[btc]all/; # matches ball, tall, or call • /word[0123456789]/; # matches word0 … word9
Character Classes – Part 2 • Special characters in a character class are handled with a backslash as well • Special characters within a character class: • “-”, “]”, “\”, “^”, “$” (and the pattern delimiter); “.” is not special inside a character class • Ex: • /[\$c]w/; # matches $w or cw • $x = 'btc'; • /[$x]all/; # matches ball, tall, or call • /[\$x]all/; # matches $all or xall • /[\\$x]all/; # matches \all, ball, tall, or call
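The difference between an interpolated variable and an escaped “$” inside a character class is easy to miss, so here is a minimal sketch of both. The helper names are hypothetical, and `$letters` is assumed to contain only plain letters.

```perl
use strict;
use warnings;

# Hypothetical helper: $letters is interpolated, so the class becomes
# e.g. [btc]. Assumes $letters holds plain letters only.
sub rhymes_with_all {
    my ($word, $letters) = @_;
    return $word =~ /^[$letters]all$/ ? 1 : 0;
}

# Hypothetical helper: "\$" is a literal dollar sign, so the class
# contains exactly the two characters "$" and "x".
sub dollar_or_x_all {
    my ($word) = @_;
    return $word =~ /^[\$x]all$/ ? 1 : 0;
}

print rhymes_with_all('tall', 'btc'), "\n";   # 1
print rhymes_with_all('fall', 'btc'), "\n";   # 0
print dollar_or_x_all('$all'), "\n";          # 1
print dollar_or_x_all('ball'), "\n";          # 0
```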
Character Classes – Part 3 • The special character “-” acts as a range operator inside a character class • Ex: • /word[0-9]/; # matches word0 … word9 • /word[0-9a-z]/; # matches word0 … word9, or worda … wordz • The special character “^” in the first position of a character class denotes a negated character class • Ex: • /[^0-9]/; # matches a non-numeric character
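Ranges and negated classes can be sketched as follows. The helper names are hypothetical, and the anchors are added only to keep the test strict.

```perl
use strict;
use warnings;

# Hypothetical helper: "word" followed by exactly one digit or lowercase letter.
sub is_word_with_suffix {
    my ($s) = @_;
    return $s =~ /^word[0-9a-z]$/ ? 1 : 0;
}

# Hypothetical helper: true if $s contains at least one non-digit character.
sub has_non_digit {
    my ($s) = @_;
    return $s =~ /[^0-9]/ ? 1 : 0;
}

print is_word_with_suffix("word7"), "\n";   # 1
print is_word_with_suffix("wordq"), "\n";   # 1
print is_word_with_suffix("wordQ"), "\n";   # 0, uppercase is outside a-z
print has_non_digit("12345"), "\n";         # 0
print has_non_digit("123a5"), "\n";         # 1
```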
Character Classes – Part 4 • Common character class abbreviations: • \d – digit, [0-9] • \s – whitespace character, [\ \t\r\n\f] • \w – word character (alphanumeric or _), [0-9a-zA-Z_] • \D – negated \d, [^0-9] • \S – negated \s • \W – negated \w • . – any character but “\n” • The \d, \s, \w, \D, \S, \W abbreviations can be used both inside and outside character classes
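A few of these abbreviations in use, as a minimal sketch with hypothetical helper names. The hh:mm:ss format is only an illustration and is not taken from the slides.

```perl
use strict;
use warnings;

# Hypothetical helper: \d outside a character class, here an hh:mm:ss shape.
sub looks_like_time {
    my ($s) = @_;
    return $s =~ /^\d\d:\d\d:\d\d$/ ? 1 : 0;
}

# Hypothetical helper: \d and \s combined inside a character class.
sub has_digit_or_space {
    my ($s) = @_;
    return $s =~ /[\d\s]/ ? 1 : 0;
}

# Hypothetical helper: "." stands for any single character except "\n".
sub a_any_c {
    my ($s) = @_;
    return $s =~ /a.c/ ? 1 : 0;
}

print looks_like_time("12:34:56"), "\n";   # 1
print looks_like_time("12:34"), "\n";      # 0
print has_digit_or_space("abc"), "\n";     # 0
print a_any_c("abc"), "\n";                # 1
print a_any_c("a\nc"), "\n";               # 0
```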
Word Anchors • “\b” matches a boundary between a word character and a non-word character • Ex: • $x = "Exam1 Question from Sample Exam"; • $x =~ /Exam/; # matches Exam in Exam1 • $x =~ /\bExam/; # matches Exam in Exam1 • $x =~ /\bExam\b/; # matches Exam at end of string
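To check where each pattern actually matches, the sketch below reports the start offset of the match using Perl's built-in `@-` array (`$-[0]` is the offset where the last successful match began). The helper name `match_start` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the offset where $regex first matches in
# $text, or -1 when there is no match.
sub match_start {
    my ($text, $regex) = @_;
    return $text =~ $regex ? $-[0] : -1;
}

my $x = "Exam1 Question from Sample Exam";

print match_start($x, qr/Exam/),     "\n";   # start of the string (Exam in Exam1)
print match_start($x, qr/\bExam/),   "\n";   # also the start of the string
print match_start($x, qr/\bExam\b/), "\n";   # start of the final word "Exam"
```

The last pattern skips “Exam1” because the “1” after “Exam” is a word character, so there is no word boundary at that point.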
Single line vs. Multi-line – Part 1 • Often, we want to match against lines and ignore newline characters • Sometimes we need to keep track of newlines • //s – Single line matching • //m – Multi-line matching • These modifiers affect two aspects of how the regex is interpreted: • How the “.” character class is defined • Where the anchors “^” and “$” are able to match
Single line vs. Multi-line – Part 2 • No modifier (//) – Default • . matches all characters but \n • ^ matches at beginning of string • $ matches at end of string or before a newline at the end of string • String as Single long line (//s) • . matches any character • ^ matches at beginning of string • $ matches end of string or before a newline at the end of string
Single line vs. Multi-line – Part 3 • String as multiple lines (//m) • . matches all characters but \n • ^ matches at beginning of any line within the string • $ matches end of any line within the string • String as single long line, but detect multiple lines (//sm) • . matches any character • ^ matches at beginning of any line within the string • $ matches end of any line within the string
Single line vs. Multi-line – Part 4 • $x = "You will know how to use Perl\nFor text processing\n"; • $x =~ /^For/; # No match, "For" not at start of string • $x =~ /^For/s; # No match, "For" not at start of string • $x =~ /^For/m; # match, "For" at start of second line • $x =~ /^For/sm; # match, "For" at start of second line
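The four modifier combinations can be run side by side. The sketch below also adds a “.” test across the newline, which the slide example does not cover; the helper name `modifier_results` and the result keys are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: tries the same two patterns under each modifier and
# returns a hash reference of 1/0 results.
sub modifier_results {
    my ($text) = @_;
    my %results = (
        'anchor_default' => ($text =~ /^For/       ? 1 : 0),
        'anchor_s'       => ($text =~ /^For/s      ? 1 : 0),
        'anchor_m'       => ($text =~ /^For/m      ? 1 : 0),
        'anchor_sm'      => ($text =~ /^For/sm     ? 1 : 0),
        'dot_default'    => ($text =~ /Perl.For/   ? 1 : 0),   # "." must cross "\n"
        'dot_s'          => ($text =~ /Perl.For/s  ? 1 : 0),
    );
    return \%results;
}

my $x = "You will know how to use Perl\nFor text processing\n";
my $r = modifier_results($x);
print "$_ => $r->{$_}\n" for sort keys %$r;
```

In short, //m changes only where “^” and “$” can match, while //s changes only what “.” can match, which is why the two can be combined freely.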
Alternation • Alternation metacharacter “|” • Used to match different possible words or character strings • Word 1 or word 2 -> /word1|word2/; • Perl tries to match the regex at the earliest possible point in the string; at that point, the alternatives are tried from left to right and the first one that matches is used • Ex: • "shoes and strings" =~ /shoes|strings|and/; # matches "shoes" • "shoes" =~ /s|sh|sho|shoes/; # matches "s" • "shoes" =~ /shoes|sho|s/; # matches "shoes"
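To confirm which alternative wins, the sketch below returns the matched text through Perl's built-in `$&` variable (the text matched by the last successful match). The helper name `matched_text` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text that $regex matched in $text,
# or undef when there is no match.
sub matched_text {
    my ($text, $regex) = @_;
    return $text =~ $regex ? $& : undef;
}

print matched_text("shoes and strings", qr/shoes|strings|and/), "\n";   # shoes
print matched_text("shoes", qr/s|sh|sho|shoes/), "\n";                  # s
print matched_text("shoes", qr/shoes|sho|s/), "\n";                     # shoes
print matched_text("strings and shoes", qr/shoes|and/), "\n";           # and
```

The last call shows the "earliest point" rule: “and” appears before “shoes” in the string, so it wins even though “shoes” is listed first in the regex.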
Resources • Perl Resource 5: Perl Regular Expressions Tutorial • http://www.cs.drexel.edu/~knowak/cs265_fall_2010/perlretut_2007.pdf • Perl History • http://www.xmluk.org/perl-cgi-history-information.htm • Perl Special Variables • http://www.kichwa.com/quik_ref/spec_variables.html