CS/BIO 271 – Introduction to Bioinformatics
Regular Expressions in Perl
Regular Expressions
• Regular expressions are a powerful tool for matching patterns against strings
• Available in many languages (AWK, sed, Perl, Python, Ruby, C/C++, others)
• Matching strings with regular expressions is usually fast, although patterns that backtrack heavily can be slow
Types & Regular Expressions
RegExp basics
• A regular expression is a pattern that can be compared to a string
• A regular expression is created using the / / delimiters:
    /^[abc].*f$/
• A regular expression is matched using the =~ (binding) operator
• A regular expression match returns true or false:
    if ($mystring =~ /^[abc].*f$/) { }
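As a rough illustration of the binding operator, the sketch below wraps the slide's pattern in a small function and tries it on a few strings. The function name starts_abc_ends_f and the sample strings are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $s starts with a, b, or c and ends with f.
sub starts_abc_ends_f {
    my ($s) = @_;
    return $s =~ /^[abc].*f$/ ? 1 : 0;
}

# Made-up sample strings, not taken from the slides.
for my $s ('abcdef', 'cf', 'xyzf', 'abc') {
    my $verdict = starts_abc_ends_f($s) ? 'match' : 'no match';
    print "$s: $verdict\n";
}
```

Only the first two strings should match: "xyzf" fails the [abc] start and "abc" does not end in f.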
String Matching
• Examples of a few simple regular expressions
    $a = "Fats Waller";
    $a =~ /a/    » 1 (true)
    $a =~ /z/    » "" (empty string, false)
    $a =~ /ll/   » 1 (true)
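A minimal sketch of what a match returns in scalar context, using the same string as the slide. The variable names $name, $hit, and $miss are chosen for this example only.

```perl
use strict;
use warnings;

my $name = "Fats Waller";

# In scalar context a match yields 1 on success and "" (empty string) on failure.
my $hit  = ($name =~ /ll/);
my $miss = ($name =~ /z/);

print "hit:  '$hit'\n";
print "miss: '$miss'\n";
```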
Regular Expression Patterns
• Most characters match themselves
• Wildcard: . (period) = any character except newline
• Anchors
  • ^ = start of the string (start of each line with the /m modifier)
  • $ = end of the string, or just before a final newline (end of each line with /m)
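The wildcard and anchors can be tried out with a sketch like the following. The sample strings are invented, and the /m modifier is shown only to contrast string anchors with line anchors.

```perl
use strict;
use warnings;

my $text = "cat\ncot";   # two lines in one string (made-up sample)

my $dot_ok      = ("cat"  =~ /c.t/)    ? 1 : 0;  # . matches the 'a'
my $dot_newline = ("c\nt" =~ /c.t/)    ? 1 : 0;  # . does not match "\n" by default
my $no_m        = ($text  =~ /^cot$/)  ? 1 : 0;  # ^ and $ anchor the whole string
my $with_m      = ($text  =~ /^cot$/m) ? 1 : 0;  # /m lets ^ and $ anchor each line

print "dot_ok=$dot_ok dot_newline=$dot_newline no_m=$no_m with_m=$with_m\n";
```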
Character Classes
• Character classes appear within [] pairs
• Most special regexp characters (^, $, etc.) lose their special meaning inside a class
• Escape sequences (\n etc.) still work
• Examples: [aeiou], [0-9]
• ^ as the first character negates the class
• ] is literal if it appears first, and - is literal if it appears first or last: []abn-z-]
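A short sketch of character classes in use. The DNA-style sequence and the other strings are hypothetical examples, not data from the course.

```perl
use strict;
use warnings;

my $seq = "ACGTNACGT";   # made-up sequence containing one ambiguous base

# A negated class finds any character that is not A, C, G, or T.
my $has_non_acgt = ($seq =~ /[^ACGT]/) ? 1 : 0;

# Count vowels by matching the class repeatedly with /g in list context.
my $vowel_count = () = "bioinformatics" =~ /[aeiou]/g;

# ] placed first and - placed last are taken literally.
my $bracket = ("a]b" =~ /[]x]/) ? 1 : 0;
my $hyphen  = ("a-b" =~ /[xy-]/) ? 1 : 0;

print "non-ACGT: $has_non_acgt, vowels: $vowel_count\n";
print "literal ]: $bracket, literal -: $hyphen\n";
```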
Predefined character classes
• These work inside or outside []’s:
  • \d = digit = [0-9]
  • \D = non-digit = [^0-9]
  • \s = whitespace, \S = non-whitespace
  • \w = word character = [a-zA-Z0-9_]
  • \W = non-word character
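The predefined classes might be used as in this sketch, which splits a tab-separated record and checks its fields. The record layout and values are assumptions made for illustration.

```perl
use strict;
use warnings;

my $line = "chr7\t117559590\tCFTR";   # hypothetical tab-separated record

# \s+ splits on any run of whitespace (here, tabs).
my @fields = split /\s+/, $line;

my $is_number   = ($fields[1] =~ /^\d+$/) ? 1 : 0;   # digits only
my $is_word     = ($fields[2] =~ /^\w+$/) ? 1 : 0;   # word characters only
my $has_nonword = ("BRCA-1" =~ /\W/)      ? 1 : 0;   # the hyphen is a non-word character

# \d also works inside a class: digits or a literal period.
my ($number) = "pH 7.25" =~ /([\d.]+)/;

print scalar(@fields), " fields; number=$number\n";
```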
Repetition in Regexps
• These quantify the preceding character or class:
  • * = zero or more
  • + = one or more
  • ? = zero or one
  • {m,n} = at least m and at most n
  • {m,} = at least m
• High precedence: a quantifier applies only to the single character or class before it, unless a group is used:
    /^ran*$/ vs. /^r(an)*$/
  • /^ran*$/ matches "ra", "ran", "rann", ...
  • /^r(an)*$/ matches "r", "ran", "ranan", ...
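To make the precedence difference concrete, here is a small sketch comparing the two patterns from the slide, plus one bounded quantifier. The helper name matches and the test strings are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $s matches the compiled pattern $re, else 0.
sub matches {
    my ($s, $re) = @_;
    return $s =~ $re ? 1 : 0;
}

my $char_rep  = qr/^ran*$/;      # * applies to the 'n' only
my $group_rep = qr/^r(an)*$/;    # * applies to the group 'an'
my $bounded   = qr/^A{2,4}$/;    # between two and four A's

for my $s ('ra', 'rannn', 'ranan') {
    print "$s: char=", matches($s, $char_rep),
          " group=", matches($s, $group_rep), "\n";
}
```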
Alternation
• | is like “or”: it matches either the regexp before the | or the one after
• Low precedence: it alternates entire regexps unless grouped
  • /red ball|angry sky/ matches “red ball” or “angry sky”; it does not mean “red ball sky” or “red angry sky”
  • /red (ball|angry) sky/ matches “red ball sky” or “red angry sky”
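One way to see the low precedence of | is to look at which part of the string each pattern actually matched. In this sketch the helper matched_text is hypothetical; it wraps the pattern in an extra capture group and returns the matched text, or undef when there is no match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text matched by $re, or undef if no match.
sub matched_text {
    my ($s, $re) = @_;
    return $s =~ /($re)/ ? $1 : undef;
}

my $ungrouped = qr/red ball|angry sky/;
my $grouped   = qr/red (ball|angry) sky/;

my $s = "red angry sky";
print "ungrouped matched: ", matched_text($s, $ungrouped), "\n";
print "grouped matched:   ", matched_text($s, $grouped), "\n";
```

The ungrouped pattern only matches the "angry sky" part of the string, while the grouped pattern matches the whole phrase.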
Side Effects (Perl Magic)
• After a successful regular expression match, some “special” Perl variables are automatically set:
  • $& = the part of the string that matched the pattern
  • $` = the part of the string before the match
  • $' = the part of the string after the match
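A minimal sketch of the three match variables, copied into ordinary variables right after a successful match. The sample sentence and variable names are chosen for this example.

```perl
use strict;
use warnings;

my $str = "The moon is made of cheese";
my ($before, $match, $after);

# The special variables are only meaningful after a successful match.
if ($str =~ /moon/) {
    ($before, $match, $after) = ($`, $&, $');
}

print "before: '$before'\n";
print "match:  '$match'\n";
print "after:  '$after'\n";
```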
Side effects and grouping
• When you use ()’s for grouping, Perl assigns the text matched by the first () pair to:
  • \1 within the pattern
  • $1 outside the pattern
    "mississippi" =~ /^.*(iss)+.*$/   » $1 = "iss"
    /([aeiou][aeiou]).*\1/   (a pair of vowels that appears again later in the string)
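The two forms of group reference can be sketched as follows. The function name repeated_vowel_pair and the test words are assumptions for illustration.

```perl
use strict;
use warnings;

# $1 outside the pattern: the text captured by the first () pair.
my $first_group;
if ("mississippi" =~ /^.*(iss)+.*$/) {
    $first_group = $1;
}

# \1 inside the pattern: the same vowel pair must appear again later.
# Hypothetical helper: returns the repeated pair, or undef if there is none.
sub repeated_vowel_pair {
    my ($word) = @_;
    return $word =~ /([aeiou][aeiou]).*\1/ ? $1 : undef;
}

print "first group: $first_group\n";
print "beekeeper:   ", repeated_vowel_pair("beekeeper"), "\n";
```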
Repetition and greediness
• By default, repetition is greedy, meaning that it will match as many characters as possible.
• You can make a repetition modifier non-greedy by adding ‘?’ after it
• Here showRE stands for a helper routine that shows the string with the matched part wrapped in << >>
    $a = "The moon is made of cheese";
    showRE($a, qr/\w+/)             » <<The>> moon is made of cheese
    showRE($a, qr/\s.*\s/)          » The<< moon is made of >>cheese
    showRE($a, qr/\s.*?\s/)         » The<< moon >>is made of cheese
    showRE($a, qr/[aeiou]{2,99}/)   » The m<<oo>>n is made of cheese
    showRE($a, qr/mo?o/)            » The <<moo>>n is made of cheese
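The slides do not define the helper used in these examples. One possible Perl version is sketched below under the name show_re, which is an assumption; it builds the marked-up string from the pre-match, match, and post-match variables.

```perl
use strict;
use warnings;

# Hypothetical helper: returns $str with the first match of $re wrapped in << >>.
sub show_re {
    my ($str, $re) = @_;
    return 'no match' unless $str =~ $re;
    return $` . '<<' . $& . '>>' . $';
}

my $text = "The moon is made of cheese";

print show_re($text, qr/\s.*\s/),  "\n";   # greedy
print show_re($text, qr/\s.*?\s/), "\n";   # non-greedy
```

The greedy pattern runs from the first space to the last one, while the non-greedy version stops at the second space.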
RegExp Substitutions
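Only the title of this slide survives here, so the following is a general sketch of Perl's substitution operator s/pattern/replacement/ rather than the slide's own examples. The sequence and all variable names are made up.

```perl
use strict;
use warnings;

my $dna = "ATGCTTAG";                 # made-up sequence

# Copy the string, then replace every T with U (/g = all occurrences).
(my $rna = $dna) =~ s/T/U/g;

# Without /g only the first occurrence is replaced.
(my $first_only = $dna) =~ s/T/U/;

# s///g returns the number of substitutions it made.
my $copy  = $dna;
my $count = ($copy =~ s/T/U/g);

# Captured groups are available as $1, $2 in the replacement.
(my $swapped = "Fats Waller") =~ s/^(\w+) (\w+)$/$2 $1/;

print "$rna $first_only $count $swapped\n";
```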
Using RegExps
• Repeated regexps with list context and /g
• Single matches
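The two usage patterns named above could look like this sketch. The sequence, the key=value string, and the variable names are invented for illustration.

```perl
use strict;
use warnings;

my $seq = "ATGAAATGCCATGA";   # made-up sequence

# List context with /g: returns every match at once.
my @hits = $seq =~ /ATG/g;

# Scalar context with /g: each iteration finds the next match;
# $-[0] holds the start offset of the current match.
my @positions;
while ($seq =~ /ATG/g) {
    push @positions, $-[0];
}

# Single match in list context: returns the captured groups.
my ($key, $val) = "a=1, b=2" =~ /(\w)=(\d)/;

# With /g and groups, list context returns all captures from all matches.
my @pairs = "a=1, b=2" =~ /(\w)=(\d)/g;

print scalar(@hits), " hits at @positions; first pair $key=$val\n";
```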