Regular Expression
1. What is a regular expression? A pattern that describes a set of strings, written with ordinary characters and special characters (metacharacters).
2. When and where do we use it? Regular expressions are used to parse the output of software (for example, BLAST) or to extract the information you need from a text file. When a string or line matches the pattern, it can be extracted, which makes regular expressions extremely useful for text processing.
String match: two different formats
    while (<>) {
        chomp;
        my $line = $_;
        # Check whether the current line ($_) contains "drought"
        if (/drought/) {
            print "$_\n";
        }
        # Check whether a variable contains "drought"
        if ($line =~ m/drought/) {
            print "$line\n";
        }
    }
#!/usr/bin/perl
use strict;
use warnings;

my $infile = shift;
open(IN, "<", $infile) || die "Cannot open input file -- $infile\n";
while (<IN>) {
    chomp;
    if ((/drought/) && (/salt/)) {
        # Line mentions both drought and salt
        print "drought_salt\t$_\n";
    } elsif (/calcium/) {
        print "calcium:\t$_\n";
    } elsif (/cold/) {
        print "cold:\t$_\n";
    }
}
close(IN);
.   Match any character (except newline, unless the /s modifier is used)
\w  Match "word" character (alphanumeric plus "_")
\W  Match non-word character
\s  Match whitespace character
\S  Match non-whitespace character
\d  Match digit character
\D  Match non-digit character
\t  Match tab
\n  Match newline
    if (/\d+/) {
        print "match_digit: $_\n";
    } elsif (/\w+/) {
        # Reached only when the line has word characters but no digits
        print "match_character: $_\n";
    }
+ : one or more
* : zero or more
*      Match 0 or more times
+      Match 1 or more times
?      Match 1 or 0 times
{n}    Match exactly n times
{n,}   Match at least n times
{n,m}  Match at least n but not more than m times
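The counted forms {n}, {n,} and {n,m} are easiest to see on small inputs. The following is a minimal sketch; the sub names (looks_like_date, is_short_number, has_long_a_run) and the sample strings are made up for illustration and are not part of the original slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if $s is shaped exactly like MM-DD-YY, e.g. 12-18-97.
sub looks_like_date {
    my ($s) = @_;
    return $s =~ /^\d{2}-\d{2}-\d{2}$/ ? 1 : 0;    # {2}: exactly two digits
}

# Hypothetical helper: true if $s consists of 3 to 5 digits and nothing else.
sub is_short_number {
    my ($s) = @_;
    return $s =~ /^\d{3,5}$/ ? 1 : 0;              # {3,5}: at least 3, at most 5
}

# Hypothetical helper: true if $s contains a run of four or more A's.
sub has_long_a_run {
    my ($s) = @_;
    return $s =~ /A{4,}/ ? 1 : 0;                  # {4,}: at least 4
}

for my $s ("12-18-97", "1-2-97", "4547", "123456", "GCAAAAAT") {
    printf "%-10s date=%d short_number=%d a_run=%d\n",
        $s, looks_like_date($s), is_short_number($s), has_long_a_run($s);
}
```

The anchors ^ and $ matter here: without them, /\d{3,5}/ would also succeed on "123456", because the pattern only needs to match somewhere inside the string.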
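    # Alternation: match any one of the listed vowels
    if ($str =~ m/(A|E|I|O|U|a|e|i|o|u)/) {
        print "String contains a vowel!\n";
    }
    # Negated character class: match any character that is NOT listed
    if ($string =~ /[^AEIOUYaeiouy]/) {
        print "String contains a non-vowel\n";
    }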
 Volume in drive D has no label
 Volume Serial Number is 4547-15E0
 Directory of D:\polo\marco

.              <DIR>        12-18-97 11:14a .
..             <DIR>        12-18-97 11:14a ..
INDEX    HTM         3,237  02-06-98  3:12p index.htm
APPDEV   HTM         6,388  12-24-97  5:13p appdev.htm
NORM     HTM         5,297  12-24-97  5:13p norm.htm
IMAGES         <DIR>        12-18-97 11:14a images
TCBK     GIF           532  06-02-97  3:14p tcbk.gif
LSQL     HTM         5,027  12-24-97  5:13p lsql.htm
CRASHPRF HTM        11,403  12-24-97  5:13p crashprf.htm
WS_FTP   LOG         5,416  12-24-97  5:24p WS_FTP.LOG
FIBB     HTM        10,234  12-24-97  5:13p fibb.htm
MEMLEAK  HTM        19,736  12-24-97  5:13p memleak.htm
LITTPERL       <DIR>        02-06-98  1:58p littperl
         9 file(s)         67,270 bytes
         4 dir(s)     132,464,640 bytes free
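The slide shows this listing without an accompanying pattern. One possible way to pull the file names and sizes out of it is sketched below. The sub name parse_dir_line and the pattern itself are assumptions made for illustration, not the author's solution, and only three sample lines from the listing are used.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns (long_name, size_in_bytes) for a file line,
# or an empty list for <DIR> lines, header lines, and summary lines.
sub parse_dir_line {
    my ($line) = @_;
    # Fields: NAME EXT  size-with-commas  MM-DD-YY  H:MM(a|p)  long_name
    if ($line =~ /^\w+\s+\w+\s+([\d,]+)\s+\d{2}-\d{2}-\d{2}\s+\d{1,2}:\d{2}[ap]\s+(\S+)\s*$/) {
        my ($size, $name) = ($1, $2);
        $size =~ tr/,//d;    # delete the thousands separators
        return ($name, $size);
    }
    return;
}

my @sample_lines = (
    "INDEX    HTM         3,237  02-06-98  3:12p index.htm",
    "IMAGES         <DIR>        12-18-97 11:14a images",
    "TCBK     GIF           532  06-02-97  3:14p tcbk.gif",
);

my $total = 0;
for my $line (@sample_lines) {
    my ($name, $size) = parse_dir_line($line);
    next unless defined $name;    # skip anything that is not a file line
    print "$name\t$size\n";
    $total += $size;
}
print "total\t$total\n";
```

Directory lines are skipped because "<DIR>" cannot match \w+, and the summary lines fail for the same reason at "file(s)" and "dir(s)".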
What are \w and \W?
\w  Match a "word" character (alphanumeric plus "_"): any single character in [a-zA-Z0-9_]
\W  Match a non-word character: any single character in [^a-zA-Z0-9_]
Grouping: Perl stores whatever matches inside each pair of parentheses, in order, in the variables $1, $2, ...
    $_ = "Forests are important to Human";
    if (/(\w+)\W+(\w+)/) {
        print "$1\n$2\n";    # prints "Forests" and "are"
    }
Or you can assign the groups directly to variables:
    my ($first, $second) = /(\w+)\W+(\w+)/;
    print "$first\n$second\n";
Translation: Translations are like substitutions, except they happen on a character-by-character basis instead of substituting one phrase for another. For instance, to change a DNA sequence to upper case:
    # Change a DNA sequence from lower case to upper case:
    $string =~ tr/atcg/ATCG/;
    # Change everything to upper case:
    $string =~ tr/a-z/A-Z/;
    # Change everything to lower case:
    $string =~ tr/A-Z/a-z/;
Note that tr takes plain character lists and ranges, not regular-expression character classes, so brackets and commas are not needed; if written, they are treated as literal characters to translate.
Perl quantifiers are greedy by default: they match the longest string possible. For instance:
    my ($text) = "mississippi";
    $text =~ m/(i.*s)/;
    print $1 . "\n";
Run the preceding code, and here's what you get:
    ississ
It matches the first i, the last s, and everything in between them. But what if you want to match from the first i to the s most closely following it? Use this code:
    my ($text) = "mississippi";
    $text =~ m/(i.*?s)/;    # *? is non-greedy: match as few characters as possible
    print $1 . "\n";
Now look what the code produces:
    is
\b  Match a word boundary
\B  Match a non-(word boundary)
\A  Match only at beginning of string
\Z  Match only at end of string, or before newline at the end
\z  Match only at end of string
\G  Match only where previous m//g left off (works only with /g)
For example:
/Fred\b/     matches Fred, but not Frederick
/\bTech\b/   matches Tech, but not MichiganTech or Technological
\B requires that there is not a word boundary:
/\bFred\B/   matches Frederick, but not Fred Christopher
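A short sketch of \b in practice: counting whole-word occurrences only. The sub name count_whole_word and the sample sentences are hypothetical, and the sketch assumes the search word contains only word characters, since it is interpolated into the pattern as-is.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: counts occurrences of $word that stand alone as a word.
# Assumes $word contains only word characters (it is used as a pattern as-is).
sub count_whole_word {
    my ($text, $word) = @_;
    my $count = 0;
    # In scalar context, /g continues from where the previous match ended
    $count++ while $text =~ /\b$word\b/g;
    return $count;
}

print count_whole_word("Fred met Frederick and Fred", "Fred"), "\n";
print count_whole_word("Tech, MichiganTech, Technological, Tech.", "Tech"), "\n";
```

Punctuation such as "," or "." is a non-word character, so a word boundary exists between "Tech" and the character that follows it.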
The \A and \Z anchors are just like "^" and "$", except that they do not match at every line when the /m modifier is used.
Pattern match modifiers
    m/PATTERN/cgimosx
    /PATTERN/cgimosx
Options are:
c  Do not reset search position on a failed match when /g is in effect.
g  Match globally, i.e., find all occurrences.
i  Do case-insensitive pattern matching.
m  Treat text as multiple lines, allowing "^" and "$" to match after and before embedded newlines.
o  Compile pattern once.
s  Treat text as a single line, allowing "." to match newlines.
x  Use extended regular expressions (whitespace and comments are allowed inside the pattern).
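The modifiers are easier to remember with one small case each. The sketch below covers /g, /i, /m, /s and /x; all sub names and sample strings are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# /g and /i: count case-insensitive occurrences of $word in $text.
# Assumes $word contains only word characters.
sub count_ci {
    my ($text, $word) = @_;
    my @hits = $text =~ /$word/gi;    # list context + /g returns every match
    return scalar @hits;
}

# /m: "^" matches at the start of every line, not only at the start of the string.
sub first_words {
    my ($text) = @_;
    my @words = $text =~ /^(\w+)/gm;
    return @words;
}

# /s: "." also matches a newline.
sub dot_spans_newline {
    my ($text) = @_;
    return $text =~ /a.b/s ? 1 : 0;
}

# /x: whitespace inside the pattern is ignored, so it can be spaced out.
sub is_dos_date {
    my ($text) = @_;
    return $text =~ /^ \d{2} - \d{2} - \d{2} $/x ? 1 : 0;
}

my $sample = "Drought stress\ndrought tolerance\nsalt stress";
print count_ci($sample, "drought"), "\n";
print join(",", first_words($sample)), "\n";
print dot_spans_newline("a\nb"), "\n";
print is_dos_date("12-24-97"), "\n";
```

Without /m, first_words would return only the first word of the whole string, because "^" would match at the very beginning only.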
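Substitute
    $str = "foot fool buffoon";
    $str =~ s/foo/bar/g;     # $str is now "bart barl bufbarn"
g (global) tells Perl to replace all matches, not just the first one.
    $str = "foot Fool buffoon";
    $str =~ s/foo/bar/gi;    # $str is now "bart barl bufbarn"
i makes the match case-insensitive, so "Foo" in "Fool" is replaced as well. Note the binding operator =~ : writing $str = s/foo/bar/g; would instead run the substitution on $_ and assign the number of replacements to $str.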
#!/usr/bin/perl
use strict;
use warnings;

# Walk through the words of $_ and print the word in which every third "e" occurs.
$_ = "my test string goes here";
my $count = 0;
while (/(\w+)/gi) {
    my $word = $1;    # the word captured by the last pattern match
    while ($word =~ /e/gi) {
        $count++;
        if ($count == 3) {
            print "$word\n";
            $count = 0;
        }
    }
}