Regular Expressions
A simple and powerful way to match characters
Laurent Falquet, EPFL
March 2005
Swiss Institute of Bioinformatics
Swiss EMBnet node
Regular Expressions
• What is a regular expression?
• Literal (or normal) characters
  • Alphanumeric
    • abc…ABC…0123...
  • Punctuation
    • -_ ,.;:=()/+ *%&{}[]?!$'^|\<>"@#
• Metacharacters
  • Ex: ls *.java
• Flavors…
  • awk, egrep, Emacs, grep, Perl, POSIX, Tcl, PROSITE!
Example: PROSITE patterns are regular expressions
• Pattern: <A-x-[ST](2)-x(0,1)-{V}
• Perl regexp: ^A.[ST]{2}.?[^V]
• Text: The sequence must start with an alanine, followed by any amino acid, followed by two residues that are each a serine or a threonine, followed by any amino acid or nothing, followed by any amino acid except a valine.
• Only the syntax differs…
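The correspondence between the two notations can be sketched as a small translation routine. This is a minimal sketch, not an official converter: the function name prosite_to_regex is hypothetical, and it covers only the constructs that appear in the pattern above (x, [..], {..}, (n), (m,n), < and >).

```perl
use strict;
use warnings;

# Translate a simple PROSITE pattern into a Perl regexp string.
# Hypothetical helper; handles only the constructs shown on the slide.
sub prosite_to_regex {
    my ($pattern) = @_;
    my $re = $pattern;
    $re =~ s/\.$//;                        # drop an optional trailing period
    $re =~ s/-//g;                         # '-' only separates positions
    $re =~ s/\{([A-Z]+)\}/[^$1]/g;         # {V}   -> [^V]
    $re =~ s/\((\d+(?:,\d+)?)\)/{$1}/g;    # (2)   -> {2}, (0,1) -> {0,1}
    $re =~ s/\{0,1\}/?/g;                  # {0,1} -> ?
    $re =~ tr/x/./;                        # x     -> .
    $re =~ s/^</^/;                        # <     -> ^ (start of sequence)
    $re =~ s/>$/\$/;                       # >     -> $ (end of sequence)
    return $re;
}

my $regex = prosite_to_regex('<A-x-[ST](2)-x(0,1)-{V}');
print "$regex\n";
```

The returned string can be used directly in a match, for example $seq =~ /$regex/.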
Regular Expressions (1)
• In Perl: /…/
• Start and end of line
  • ^ start, $ end
• Match any of several
  • […] or (…|…)
• Match 0, 1 or more
  • . 1 of any
  • ? 0 or 1
  • + 1 or more
  • * 0 or more
  • {m,n} range
  • ! negation (in Perl, via the !~ operator)
• Examples
  • Match every instance of a SwissProt AC
    • m/[OPQ][0-9][A-Z0-9]{3}[0-9]/;
    • m/[OPQ]\d[A-Z0-9]{3}\d/;
  • Match every instance of a SwissProt ID
    • m/[A-Z0-9]{2,5}_[A-Z0-9]{3,5}/;
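To see the two example patterns at work, the sketch below collects every AC and every ID from a line of text using the /g modifier in list context. The function names and the accession numbers in the sample string are made up for illustration; only ICEA_XENLA is taken from the slides.

```perl
use strict;
use warnings;

# Return every substring that looks like a SwissProt AC (pattern from the slide).
sub swissprot_acs {
    my ($text) = @_;
    my @hits = $text =~ /([OPQ]\d[A-Z0-9]{3}\d)/g;
    return @hits;
}

# Return every substring that looks like a SwissProt ID (pattern from the slide).
sub swissprot_ids {
    my ($text) = @_;
    my @hits = $text =~ /([A-Z0-9]{2,5}_[A-Z0-9]{3,5})/g;
    return @hits;
}

# Sample line: the accession numbers are invented for this example.
my $line = "AC P12345; Q9ABC1. ID ICEA_XENLA";
print join(",", swissprot_acs($line)), "\n";
print join(",", swissprot_ids($line)), "\n";
```

Neither pattern is anchored, so both can also match inside longer words; adding \b on each side would be one way to tighten them.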
Regular Expressions (2)
• Escape character or back reference
  • \char or \num
• Shorthand
  • \d digit [0-9]
  • \s whitespace [space\f\n\r\t]
  • \w word character [a-zA-Z0-9_]
  • \D \S \W complement of \d \s \w
• Byte notation
  • \num character in octal
  • \xnum character in hexadecimal
  • \cchar control character
• Match operator m/…/
  • $var =~ m/colou?r/;
  • $var !~ m/colou?r/;
• Substitution operator s/…/…/
  • $var =~ s/colou?r/couleur/;
• Translate operator tr/…/…/
  • $revcomp =~ tr/ACGT/tgca/; (complements each base; the string must be reversed separately)
• Modifiers /…/#
  • /i case insensitive
  • /g global match
  • Many others: /s, /m, /o, /x...
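The three operators can be tried side by side in a short script. This is a minimal sketch; the function names are hypothetical, and revcomp adds an explicit reverse because tr/// on its own only complements the bases.

```perl
use strict;
use warnings;

# Match operator with the /i modifier: true for "color", "Colour", "COLOR", ...
sub is_colour {
    my ($s) = @_;
    return $s =~ m/colou?r/i ? 1 : 0;
}

# Substitution operator with /g: replace every occurrence, not just the first.
sub to_french {
    my ($s) = @_;
    (my $copy = $s) =~ s/colou?r/couleur/g;
    return $copy;
}

# Translate operator: reverse the string first, then complement each base.
sub revcomp {
    my ($dna) = @_;
    my $rc = scalar reverse $dna;
    $rc =~ tr/ACGT/TGCA/;
    return $rc;
}

print is_colour("COLOR"), "\n";
print to_french("color and colour"), "\n";
print revcomp("AACG"), "\n";
```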
Regular Expressions (3)
• Grouping
  • External reference
    • $var =~ s/sp\:(\w\d{5})/swissprot AC=$1/;
  • Internal reference
    • $var =~ s/tr\:(\w\d{5})\|\1/trembl AC=$1/;
  • Numbering
    • $1 to $9
    • $10 and higher if needed...
• Exercises
  • Create a regexp to recognize any pseudo IP address: 012.345.678.912
  • Create a regexp to recognize any email address: Jean.Dupond@isb-sib.ch
  • Create a regexp to change any HTML tag to another: <address> -> <pre>
• On sib-dea: use visual_regexp-1.2.tcl to check your regular expressions (requires X-windows)
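The difference between the two kinds of reference can be checked with a short script: $1 is used in the replacement (outside the pattern), while \1 inside the pattern requires the same text to appear again. This is a sketch only; the function names and the accession strings are invented for illustration.

```perl
use strict;
use warnings;

# External reference: $1 carries the captured AC into the replacement text.
sub tag_swissprot {
    my ($s) = @_;
    $s =~ s/sp\:(\w\d{5})/swissprot AC=$1/;
    return $s;
}

# Internal reference: \1 matches only if the captured AC is repeated after '|'.
sub tag_trembl {
    my ($s) = @_;
    $s =~ s/tr\:(\w\d{5})\|\1/trembl AC=$1/;
    return $s;
}

print tag_swissprot("sp:P12345"), "\n";
print tag_trembl("tr:Q12345|Q12345"), "\n";   # repeated AC: rewritten
print tag_trembl("tr:Q12345|Q54321"), "\n";   # different AC: left unchanged
```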
Solution RegExp
• /(\d{1,3}\.){3}\d{1,3}/
• /\w+\.\w+\@\w+\-?\w+\.[a-z]{2,4}/
• s/\<(\/?)address\>/\<$1pre\>/
  • generalized: address = \w+
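The solutions can be checked against the exercise strings with a few lines of Perl. This sketch makes two assumptions beyond the slide: the patterns are anchored with ^ and $ so that the whole string must match, and the repeated "digits plus dot" part is written as a group, because inside [...] the characters {, 1, 3 and } would be taken literally. The function names are hypothetical.

```perl
use strict;
use warnings;

# Pseudo IP address: three "1-3 digits plus dot" groups, then 1-3 digits.
sub looks_like_ip {
    my ($s) = @_;
    return $s =~ /^(?:\d{1,3}\.){3}\d{1,3}$/ ? 1 : 0;
}

# Email address of the form first.last@host.tld, as in the exercise.
sub looks_like_email {
    my ($s) = @_;
    return $s =~ /^\w+\.\w+\@\w+-?\w+\.[a-z]{2,4}$/ ? 1 : 0;
}

# Generalized tag change: rename every opening and closing tag to $new.
sub retag {
    my ($html, $new) = @_;
    $html =~ s/<(\/?)\w+>/<$1$new>/g;
    return $html;
}

print looks_like_ip("012.345.678.912"), "\n";
print looks_like_email('Jean.Dupond@isb-sib.ch'), "\n";
print retag("<address>Lausanne</address>", "pre"), "\n";
```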
In-liners: some options
• -a autosplit (only with -n or -p)
• -c check syntax
• -d debugger
• -e script given on the command line
• -h help
• -i edit the file in place
• -n loop over the input without printing
• -p loop over the input and print each line
• -v version
• …
Example: perl -e 'print qq(hello world\n);'
In-liners: -n and -p
perl -pe 's/\r/\n/g' <file>
is equivalent to:
open READ, "file";
while (<READ>) {
    s/\r/\n/g;
    print;
}
close(READ);
perl -i -pe 's/\r/\n/g' <file>
Warning: the -i option modifies the file directly
perl -ne is the same without the "print"
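The loop that -p wraps around the script can be reproduced without touching any file by reading from an in-memory filehandle. This is a sketch under that assumption; the function name cr_to_lf is hypothetical, and the output is collected in a string instead of being printed on each iteration.

```perl
use strict;
use warnings;

# Emulate: perl -pe 's/\r/\n/g'
# Read "lines" from a string, apply the substitution, collect the output.
sub cr_to_lf {
    my ($text) = @_;
    open my $in, '<', \$text or die "cannot open in-memory file: $!";
    my $out = '';
    while (<$in>) {
        s/\r/\n/g;      # the script passed with -e
        $out .= $_;     # -p would print here; -n would skip this step
    }
    close $in;
    return $out;
}

print cr_to_lf("line one\rline two\r");
```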
In-liners: -a (only with -n or -p)
perl -ane 'print @F, "\n";' <file>
is equivalent to:
open READ, "file";
while (<READ>) {
    @F = split(' ');
    print @F, "\n";
}
close(READ);
Example:
hits -b 'sw' -o pff2 prf:CARD | perl -ane 'print join("\t", reverse(@F)),"\n";'
In-liners: -a (only with -n or -p)
• hits -b 'sw' -o pff2 prf:CARD
sw:ICEA_XENLA 1 90 prf:CARD 5 -1 18.553
sw:RIK2_MOUSE 435 513 prf:CARD 5 -11 15.058
sw:CARC_HUMAN 1 88 prf:CARD 6 -1 15.395
sw:NAL1_HUMAN 1380 1463 prf:CARD 7 -1 15.058
sw:ASC_HUMAN 113 195 prf:CARD 7 -2 15.374
sw:CAR8_HUMAN 347 430 prf:CARD 8 -1 18.343
sw:CARF_HUMAN 134 218 prf:CARD 9 -1 12.932
• hits -b 'sw' -o pff2 prf:CARD | perl -ane 'print join("\t", reverse(@F)),"\n";'
18.553 -1 5 prf:CARD 90 1 sw:ICEA_XENLA
15.058 -11 5 prf:CARD 513 435 sw:RIK2_MOUSE
15.395 -1 6 prf:CARD 88 1 sw:CARC_HUMAN
15.058 -1 7 prf:CARD 1463 1380 sw:NAL1_HUMAN
15.374 -2 7 prf:CARD 195 113 sw:ASC_HUMAN
18.343 -1 8 prf:CARD 430 347 sw:CAR8_HUMAN
12.932 -1 9 prf:CARD 218 134 sw:CARF_HUMAN
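What -a does to each line can be written out as a small function: split the line on whitespace into @F, then work on the fields. The sketch below applies the same reverse-and-join step to one line of the output shown above; the function name reverse_fields is hypothetical.

```perl
use strict;
use warnings;

# Emulate: perl -ane 'print join("\t", reverse(@F)),"\n";' for a single line.
sub reverse_fields {
    my ($line) = @_;
    my @F = split ' ', $line;          # what -a does: split on whitespace
    return join("\t", reverse @F);     # fields in reverse order, tab-separated
}

print reverse_fields("sw:ICEA_XENLA 1 90 prf:CARD 5 -1 18.553"), "\n";
```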
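In-liners: examples
perl -e 'print int(rand(100)),"\n" for 1..100' | perl -e '$x{$_}=1 while <>;print sort {$a<=>$b} keys %x'
is equivalent to:
for ($i=0; $i<100; $i++) {
    $nb = int(rand(100));
    $hash{$nb} = 1;                               # duplicates collapse into one key
}
print map { "$_\n" } sort {$a<=>$b} keys %hash;   # one number per line, as in the in-liner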
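The idiom used here, hash keys to drop duplicates followed by a numeric sort, is easier to check with fixed input than with random numbers. A minimal sketch, with a hypothetical function name:

```perl
use strict;
use warnings;

# Keep each number once (hash keys are unique), then sort numerically.
sub unique_sorted {
    my @numbers = @_;
    my %seen;
    $seen{$_} = 1 for @numbers;
    return sort { $a <=> $b } keys %seen;
}

print join(" ", unique_sorted(5, 3, 5, 10, 3)), "\n";
```

The numeric comparison <=> matters: the default string sort would place 10 before 3.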
In-liners: extract FASTA from SP
cat /db/proteome/ECOLI.dat | perl -ne 'if (/^ID +(\w+)/) {print ">$1\n";} if (/^ /) {s/ //g; print}'
is equivalent to:
open (READ, "/db/proteome/ECOLI.dat");   # open file
while ($line=<READ>) {                   # read line by line until the end
    if ($line =~ /^ID +(\w+)/) {
        print ">$1\n";                   # print fasta header
    }
    if ($line =~ /^ /) {
        $line =~ s/ //g;                 # remove spaces
        print $line;                     # print sequence line
    }
}
close(READ);
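The same logic can be tried without access to /db/proteome by feeding it a small entry held in a string. The entry below is a made-up toy record, not real SwissProt data, and the function name swissprot_to_fasta is hypothetical; the two regular expressions are the ones from the in-liner.

```perl
use strict;
use warnings;

# Convert SwissProt-format text to FASTA: ID lines become headers,
# lines starting with a space are sequence lines with the spaces removed.
sub swissprot_to_fasta {
    my ($text) = @_;
    my $fasta = '';
    for my $line (split /^/, $text) {          # split into lines, keeping "\n"
        if ($line =~ /^ID +(\w+)/) {
            $fasta .= ">$1\n";                 # fasta header
        }
        if ($line =~ /^ /) {
            (my $seq = $line) =~ s/ //g;       # remove spaces
            $fasta .= $seq;                    # sequence line
        }
    }
    return $fasta;
}

# Toy entry, invented for this example.
my $entry = "ID   TOY_ECOLI   STANDARD;   PRT;   10 AA.\n"
          . "SQ   SEQUENCE   10 AA;\n"
          . "     MKTAY IAKQR\n"
          . "//\n";
print swissprot_to_fasta($entry);
```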
In-liners: your turn…
Create an in-liner that extracts non-redundant FASTA-format sequences from a redundant database in SwissProt format
cat /db/proteome/ECOLI.dat | perl -ne 'if (/^ID +(\w+)/) {print ">$1\n";} if (/^ /) {s/ //g; print}' | perl -e 'while (<>) { if (/>/) { $i=$_; $x{$i}="" } $x{$i}.=$_ } print values %x'
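The second stage of this pipeline keys each record on its header line, so a repeated header overwrites the earlier copy. The sketch below follows the same idea on an in-memory string. It differs from the in-liner in one labeled respect: records are returned in first-seen order rather than in the arbitrary order of values %x. The function name and the sample records are invented for illustration.

```perl
use strict;
use warnings;

# Keep one record per FASTA header; a repeated header restarts its record.
sub dedupe_fasta {
    my ($fasta) = @_;
    my (%record, @order, $id);
    for my $line (split /^/, $fasta) {
        if ($line =~ /^>/) {
            $id = $line;
            push @order, $id unless exists $record{$id};   # remember first-seen order
            $record{$id} = '';                             # reset, as in the in-liner
        }
        next unless defined $id;                           # skip text before any header
        $record{$id} .= $line;
    }
    return join '', map { $record{$_} } @order;
}

print dedupe_fasta(">A\nMK\n>B\nTT\n>A\nMK\n");
```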