
Regular Expressions



Presentation Transcript


  1. Regular Expressions: a simple and powerful way to match characters
     Laurent Falquet, EPFL, March 2005
     Swiss Institute of Bioinformatics, Swiss EMBnet node

  2. Regular Expressions
     • What is a regular expression?
     • Literal (or normal) characters
       • Alphanumeric: abc…ABC…0123…
       • Punctuation: -_ ,.;:=()/+ *%&{}[]?!$'^|\<>"@#
     • Metacharacters
       • Ex: ls *.java
     • Flavors…
       • awk, egrep, Emacs, grep, Perl, POSIX, Tcl, PROSITE!

  3. Example: PROSITE patterns are regular expressions
     • Pattern: <A-x-[ST](2)-x(0,1)-{V}
     • Perl Regexp: ^A.[ST]{2}.?[^V]
     • Text: The sequence must start with an alanine, followed by any amino acid, followed by a serine or a threonine twice, followed by any amino acid or nothing, followed by any amino acid except a valine.
     • Only the syntax differs…
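
The correspondence between the two syntaxes can be sketched as a small translator. This is a minimal illustration in Python (used here as a stand-in for Perl, since its `re` module accepts the same regex syntax for these constructs); the function name `prosite_to_regex` is hypothetical, and the sketch only handles the elements that appear on the slide (`<`, `>`, `x`, `[..]`, `{..}` and repeat counts), not the full PROSITE syntax.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Perl-style regex (sketch only)."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # '<' anchors the pattern at the start of the sequence, '>' at the end.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Split off an optional repeat count such as (2) or (0,1).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, count = m.group(1), m.group(2)
        if core == "x":                  # x = any amino acid
            core = "."
        elif core.startswith("{"):       # {V} = anything except V
            core = "[^" + core[1:-1] + "]"
        repeat = "{" + count + "}" if count else ""
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

regex = prosite_to_regex("<A-x-[ST](2)-x(0,1)-{V}")
print(regex)
```

The translator writes the optional residue as `.{0,1}`, which is equivalent to the `.?` used on the slide.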

  4. Regular Expressions (1)
     In Perl: /…/
     • Start and end of line: ^ start, $ end
     • Match any of several: […] or (…|…)
     • Match 0, 1 or more:
       .      1 of any character
       ?      0 or 1
       +      1 or more
       *      0 or more
       {m,n}  range
       !      negation (as in the !~ operator; inside a character class, a leading ^ negates)
     • Examples
       • Match every instance of a SwissProt AC
         m/[OPQ][0-9][A-Z0-9]{3}[0-9]/;
         m/[OPQ]\d[A-Z0-9]{3}\d/;
       • Match every instance of a SwissProt ID
         m/[A-Z0-9]{2,5}_[A-Z0-9]{3,5}/;
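
To see the two example patterns at work, here is a minimal sketch in Python, used as a stand-in for Perl because its `re` syntax is the same for these patterns. The sample string and the helper names `find_accessions` and `find_ids` are made up for illustration.

```python
import re

# Patterns copied from the slide.
SWISSPROT_AC = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")
SWISSPROT_ID = re.compile(r"[A-Z0-9]{2,5}_[A-Z0-9]{3,5}")

def find_accessions(text):
    """Return every substring that looks like a SwissProt accession number."""
    return SWISSPROT_AC.findall(text)

def find_ids(text):
    """Return every substring that looks like a SwissProt entry ID."""
    return SWISSPROT_ID.findall(text)

sample = "sw:ICEA_XENLA P12345 Q9Y6K9"   # made-up sample line
print(find_accessions(sample))
print(find_ids(sample))
```

Neither pattern is anchored, so both also match inside longer words; adding `\b` on each side would be one way to tighten them.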

  5. Regular Expressions (2)
     • Escape character or back reference: \char or \num
     • Shorthand
       \d      digit [0-9]
       \s      whitespace [space\f\n\r\t]
       \w      word character [a-zA-Z0-9_]
       \D\S\W  complement of \d\s\w
     • Byte notation
       \num    character in octal
       \xnum   character in hexadecimal
       \cchar  control character
     • Match operator m/…/
       $var =~ m/colou?r/;
       $var !~ m/colou?r/;
     • Substitution operator s/…/…/
       $var =~ s/colou?r/couleur/;
     • Translate operator tr/…/…/
       $revcomp =~ tr/ACGT/tgca/;   (complements the bases only; the sequence must also be reversed, e.g. with reverse)
     • Modifiers /…/#
       /i  case insensitive
       /g  global match
       Many others: /s, /m, /o, /x...
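
The three operators can be tried out with the following sketch. It is written in Python as a stand-in for Perl: `re.search` plays the role of m/…/, `re.sub` of s/…/…/, and `str.translate` of tr/…/…/. The sample text and the helper name `reverse_complement` are assumptions made for the example.

```python
import re

text = "color or colour"   # made-up sample

# m/colou?r/ : does the text contain a match?
has_colour = re.search(r"colou?r", text) is not None

# s/colou?r/couleur/ : without /g, only the first match is replaced
first_only = re.sub(r"colou?r", "couleur", text, count=1)

# s/colou?r/couleur/g : with /g, every match is replaced
everywhere = re.sub(r"colou?r", "couleur", text)

# m/colou?r/i : case-insensitive match
shouting = re.search(r"colou?r", "COLOR", re.IGNORECASE) is not None

# tr/ACGT/tgca/ : character-by-character translation
COMPLEMENT = str.maketrans("ACGT", "tgca")

def reverse_complement(seq):
    """Reverse the sequence, then complement each base."""
    return seq[::-1].translate(COMPLEMENT)

print(first_only, "|", everywhere, "|", reverse_complement("AACG"))
```

Note that the translation alone only complements the bases; the reversal is a separate step.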

  6. Regular Expressions (3)
     • Grouping
       • External reference
         $var =~ s/sp\:(\w\d{5})/swissprot AC=$1/;
       • Internal reference
         $var =~ s/tr\:(\w\d{5})\|\1/trembl AC=$1/;
       • Numbering: $1 to $9, then $10 and beyond if needed...
     • Exercises
       • Create a regexp to recognize any pseudo IP address: 012.345.678.912
       • Create a regexp to recognize any email address: Jean.Dupond@isb-sib.ch
       • Create a regexp to change any HTML tag to another: <address> -> <pre>
     • On sib-dea: use visual_regexp-1.2.tcl to check your regular expressions (requires X-windows)
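
The difference between the two kinds of reference can be sketched in Python, used as a stand-in for Perl. Inside the pattern, `\1` must match the same text that group 1 captured; in the replacement, Python writes `\1` where Perl writes `$1`. The function names and sample strings are made up for illustration.

```python
import re

def label_swissprot(text):
    # External reference: the captured accession is reused in the replacement.
    return re.sub(r"sp:(\w\d{5})", r"swissprot AC=\1", text, count=1)

def label_trembl(text):
    # Internal reference: \1 inside the pattern requires the same accession
    # to appear again after the '|'.
    return re.sub(r"tr:(\w\d{5})\|\1", r"trembl AC=\1", text, count=1)

print(label_swissprot("sp:P12345"))
print(label_trembl("tr:Q12345|Q12345"))
print(label_trembl("tr:Q12345|Q99999"))   # accessions differ, so nothing changes
```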

  7. Regular Expressions (4)

  8. Solution RegExp
     • Pseudo IP address: /(\d{1,3}\.){3}\d{1,3}/
     • Email address: /\w+\.\w+\@\w+\-?\w+\.[a-z]{2,4}/
     • HTML tag: s/\<(\/?)address\>/\<$1pre\>/
       generalized: address = \w+
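
The solutions can be checked against the exercise strings with a short sketch in Python, used as a stand-in for Perl. The IP pattern is written here with a group `(…)` rather than a character class `[…]`, because inside a class `{1,3}` is read as literal characters instead of a repeat count; the sketch shows both behaviours. All variable names are made up for illustration.

```python
import re

IP_GROUPED = r"(\d{1,3}\.){3}\d{1,3}"    # three "digits + dot" groups, then digits
IP_CLASS = r"[\d{1,3}\.]{3}\d{1,3}"      # a class: any 3 of digit { 1 , 3 } or dot
EMAIL = r"\w+\.\w+@\w+-?\w+\.[a-z]{2,4}"

ip_ok = re.fullmatch(IP_GROUPED, "012.345.678.912") is not None
class_misses_ip = re.fullmatch(IP_CLASS, "012.345.678.912") is None
class_accepts_junk = re.fullmatch(IP_CLASS, "{,}1") is not None
email_ok = re.fullmatch(EMAIL, "Jean.Dupond@isb-sib.ch") is not None

# s/<(\/?)address>/<$1pre>/g : the optional slash is captured and reused
html = re.sub(r"<(/?)address>", r"<\1pre>", "<address>Lausanne</address>")

print(ip_ok, class_misses_ip, class_accepts_junk, email_ok, html)
```

`re.fullmatch` is used so that a pattern has to account for the whole string; with an unanchored search, the class version would still find a partial match inside the address.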

  9. Perl In-liners

  10. In-liners: some options
      -a  autosplit (only with -n or -p)
      -c  check syntax
      -d  debugger
      -e  pass script lines
      -h  help
      -i  direct editing of a file
      -n  loop without print
      -p  loop with print
      -v  version
      …
      Example: perl -e 'print qq(hello world\n);'

  11. In-liners: -n and -p
      perl -pe 's/\r/\n/g' <file>
      is equivalent to:
        open READ, "file";
        while (<READ>) {
          s/\r/\n/g;
          print;
        }
        close(READ);
      perl -i -pe 's/\r/\n/g' <file>
      Warning: the -i option modifies the file directly.
      perl -ne is the same without the "print".
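
The implicit loop that -p wraps around the script can be sketched as follows. This is a Python illustration of the idea rather than of Perl's actual implementation; the function name `cr_to_lf` and the sample input are made up.

```python
def cr_to_lf(lines):
    """Mimic the -p loop: run the script on each line, then emit the line."""
    for line in lines:                     # the loop added by -n and -p
        line = line.replace("\r", "\n")    # the script given with -e
        yield line                         # the print added by -p (absent with -n)

sample = ["a\rb\r", "c\n"]                 # made-up input with carriage returns
print("".join(cr_to_lf(sample)), end="")
```

With -n instead of -p the final emit step is missing, so the script has to print explicitly whatever it wants to output.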

  12. In-liners: -a (only with -n or -p)
      perl -ane 'print @F, "\n";' <file>
      is equivalent to:
        open READ, "file";
        while (<READ>) {
          @F = split(' ');
          print @F, "\n";
        }
        close(READ);
      Example:
        hits -b 'sw' -o pff2 prf:CARD | perl -ane 'print join("\t", reverse(@F)),"\n";'

  13. In-liners: -a (only with -n or -p)
      • hits -b 'sw' -o pff2 prf:CARD
        sw:ICEA_XENLA  1     90    prf:CARD  5  -1   18.553
        sw:RIK2_MOUSE  435   513   prf:CARD  5  -11  15.058
        sw:CARC_HUMAN  1     88    prf:CARD  6  -1   15.395
        sw:NAL1_HUMAN  1380  1463  prf:CARD  7  -1   15.058
        sw:ASC_HUMAN   113   195   prf:CARD  7  -2   15.374
        sw:CAR8_HUMAN  347   430   prf:CARD  8  -1   18.343
        sw:CARF_HUMAN  134   218   prf:CARD  9  -1   12.932
      • hits -b 'sw' -o pff2 prf:CARD | perl -ane 'print join("\t", reverse(@F)),"\n";'
        18.553  -1   5  prf:CARD  90    1     sw:ICEA_XENLA
        15.058  -11  5  prf:CARD  513   435   sw:RIK2_MOUSE
        15.395  -1   6  prf:CARD  88    1     sw:CARC_HUMAN
        15.058  -1   7  prf:CARD  1463  1380  sw:NAL1_HUMAN
        15.374  -2   7  prf:CARD  195   113   sw:ASC_HUMAN
        18.343  -1   8  prf:CARD  430   347   sw:CAR8_HUMAN
        12.932  -1   9  prf:CARD  218   134   sw:CARF_HUMAN
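
What -a adds to the loop, splitting each line on whitespace into a list of fields, can be sketched in Python as follows. The function name `reverse_fields` is made up; the sample line is the first row of the `hits` output shown on the slide.

```python
def reverse_fields(line):
    """Split a line on whitespace (like -a filling @F) and reverse the fields."""
    fields = line.split()                 # @F = split(' ')
    return "\t".join(reversed(fields))    # join("\t", reverse(@F))

row = "sw:ICEA_XENLA 1 90 prf:CARD 5 -1 18.553"
print(reverse_fields(row))
```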

  14. In-liners: examples
      perl -e 'print int(rand(100)),"\n" for 1..100' | perl -e '$x{$_}=1 while <>;print sort {$a<=>$b} keys %x'
      The same as a script:
        for ($i=0; $i<100; $i++) {
          $nb = int(rand(100));
          $hash{$nb} = 1;
        }
        print "$_\n" for sort {$a<=>$b} keys %hash;
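
The trick used here, storing each value as a hash key so that duplicates collapse and then sorting the keys numerically, can be sketched in Python with a dictionary. The function name `unique_sorted` is made up for illustration.

```python
import random

def unique_sorted(numbers):
    """Deduplicate by using the values as dictionary keys, then sort numerically."""
    seen = {}
    for n in numbers:
        seen[n] = 1          # like $x{$_} = 1: a repeated key is simply overwritten
    return sorted(seen)      # like sort {$a <=> $b} keys %x

draws = [random.randrange(100) for _ in range(100)]   # like int(rand(100)) for 1..100
result = unique_sorted(draws)
print(len(draws), "draws,", len(result), "distinct values")
```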

  15. In-liners: extract FASTA from SP
      cat /db/proteome/ECOLI.dat | perl -ne 'if (/^ID +(\w+)/) {print ">$1\n";} if (/^ /) {s/ //g; print}'
      is equivalent to:
        open (READ, "/db/proteome/ECOLI.dat");   # open file
        while ($line=<READ>) {                   # read line by line until the end
          if ($line =~ /^ID +(\w+)/) {
            print ">$1\n";                       # print fasta header
          }
          if ($line =~ /^ /) {
            $line =~ s/ //g;                     # remove spaces
            print $line;                         # print sequence line
          }
        }
        close(READ);
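
The same two rules, the ID line becoming the FASTA header and the lines starting with a space being sequence, can be sketched in Python. The function name `swissprot_to_fasta` and the tiny sample entry are made up for illustration; the entry is not a real SwissProt record.

```python
import re

def swissprot_to_fasta(lines):
    """Turn SwissProt-format lines into FASTA text (sketch of the one-liner)."""
    out = []
    for line in lines:
        m = re.match(r"ID +(\w+)", line)       # /^ID +(\w+)/
        if m:
            out.append(">" + m.group(1) + "\n")
        if line.startswith(" "):               # /^ / : sequence lines are indented
            out.append(line.replace(" ", ""))  # s/ //g
    return "".join(out)

entry = [                                      # made-up miniature entry
    "ID   TEST_ECOLI   Reviewed;   20 AA.\n",
    "AC   P00001;\n",
    "SQ   SEQUENCE   20 AA;\n",
    "     MKTAYIAKQR QISFVKSHFS\n",
    "//\n",
]
print(swissprot_to_fasta(entry), end="")
```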

  16. In-liners: your turn…
      Create an in-liner that extracts non-redundant FASTA format sequences from a redundant database in SwissProt format:
      cat /db/proteome/ECOLI.dat | perl -ne 'if (/^ID +(\w+)/) {print ">$1\n";} if (/^ /) {s/ //g; print}' | perl -e 'while(<>) { if (/>/) { $i=$_; $x{$i}=""} $x{$i}.=$_} print values %x'
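
The last stage of the pipeline keys each record on its header line, so a record whose header was already seen replaces the earlier copy. A minimal Python sketch of that step follows; the function name `nonredundant_fasta` and the sample records are made up. Unlike the Perl version, which tests for `>` anywhere in the line, this sketch only treats lines that start with `>` as headers.

```python
def nonredundant_fasta(lines):
    """Keep one FASTA record per header line (sketch of the final perl -e stage)."""
    records = {}
    header = None
    for line in lines:
        if line.startswith(">"):
            header = line
            records[header] = ""       # a repeated header resets the stored record
        if header is not None:
            records[header] += line    # the header line itself is stored, then the sequence
    return "".join(records.values())

fasta = ">A\nMK\n>B\nTT\n>A\nMK\n"     # made-up input with a duplicated record
print(nonredundant_fasta(fasta.splitlines(keepends=True)), end="")
```

A Python dictionary returns its values in insertion order, whereas `values %x` in Perl returns them in no particular order, so the record order of the Perl output may differ from run to run.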
