
Regular Expressions


howen


Presentation Transcript


  1. Regular Expressions
     • Provide a compact way of describing a set of strings
     • Sort of like wildcards
     • Single character patterns
       • A single character matches itself
       • A “.” matches any single character except newline
       • [characters] matches any one of the listed characters
       • A ^ as the first character inside the brackets negates the class: [^characters] matches any one character that is not listed

  2. Examples
     • G
     • [0123456789]
     • [0-9]
     • [a-zA-Z]
     • [^0-9]
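The single character patterns above can be tried out with a few lines of Perl. This is only a sketch: the helper name matches and the test strings are made up for illustration and are not part of the slides.

```perl
use strict;
use warnings;

# Return 1 if $text contains a match for $pattern, otherwise 0.
# (matches is a hypothetical helper written for this example.)
sub matches {
    my ($pattern, $text) = @_;
    return $text =~ /$pattern/ ? 1 : 0;
}

# Each row: pattern, test string (the strings are illustrative only).
my @cases = (
    [ 'G',        'ATGC' ],   # a literal character matches itself
    [ '[0-9]',    'abc7' ],   # any one digit
    [ '[a-zA-Z]', '1234' ],   # any one letter; none present here
    [ '[^0-9]',   '1234' ],   # any one non-digit; none present here
    [ 'a.c',      'abc'  ],   # "." stands for any one character except newline
);

for my $case (@cases) {
    my ($pattern, $text) = @$case;
    printf "%-10s %-6s %s\n", $pattern, $text,
        matches($pattern, $text) ? 'match' : 'no match';
}
```

Note that a range such as A-z (lowercase z) also covers the punctuation characters that sit between Z and a in ASCII, which is why the letter class is written [a-zA-Z].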

  3. Multipliers

  4. Character Class Abbreviations

  5. Alternation
     • Sometimes we would like to match different possible words or character strings
     • This is accomplished by using ‘|’
     • To match dog or cat
       • dog|cat
     • At each character position, Perl first tries to match the first alternative, dog. If dog doesn't match, Perl then tries the next alternative, cat. If cat doesn't match either, the match fails at that position and Perl moves to the next position in the string.
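A small sketch of the dog|cat alternation in use. The subroutine name first_pet and the sample sentences are invented for this example.

```perl
use strict;
use warnings;

# Return the first alternative found in $text, or undef if neither occurs.
# (first_pet is a hypothetical name used only in this sketch.)
sub first_pet {
    my ($text) = @_;
    return $text =~ /(dog|cat)/ ? $1 : undef;
}

# "dog" is tried first at every position, but the match that starts
# earliest in the string is the one reported.
print first_pet('my cat chased the dog'), "\n";   # prints cat
```

The order of the alternatives only matters when more than one of them could match at the same starting position.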

  6. Your Turn!!!
     • To check your understanding of regular expressions…
     • There is redundancy in the genetic code
       • GCU, GCC, GCA, and GCG all encode Alanine
     • Write a series of regular expressions that could be used to match a codon to the amino acid it encodes
     • For example
       • (GC.) would match all the codons that encode Alanine
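One possible way to organize the exercise, sketched for three amino acids whose codons differ only in the third base (Alanine GC., Glycine GG., Proline CC.). The table layout and the name amino_acid_for are assumptions of this sketch, and the remaining amino acids are left to the reader.

```perl
use strict;
use warnings;

# Pattern/name pairs. The anchors ^ and $ keep a pattern from matching
# in the middle of a longer string.
my @patterns = (
    [ qr/^GC.$/, 'Alanine' ],
    [ qr/^GG.$/, 'Glycine' ],
    [ qr/^CC.$/, 'Proline' ],
);

# Return the amino acid name for a codon, or undef if no pattern matches.
# (amino_acid_for is a hypothetical helper.)
sub amino_acid_for {
    my ($codon) = @_;
    for my $entry (@patterns) {
        my ($regex, $name) = @$entry;
        return $name if $codon =~ $regex;
    }
    return;
}

for my $codon (qw(GCU GGA CCG)) {
    print "$codon: ", amino_acid_for($codon), "\n";
}
```

Because "." accepts any character, GC. also matches strings such as GCX that are not codons; a stricter pattern would use GC[UCAG].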

  7. Fill In The Blanks

  8. The Answers

  9. The Regular Expression Engine
     • Think of Perl using a “railway” diagram of connected states
     • Perl moves through the diagram by matching characters
     • If the engine reaches the final state, it has matched the input string

  10-14. [Diagram, repeated over five slides as an animation: pattern abc drawn as the states Start, ‘a’, ‘b’, Match joined by the transitions a, b, c, and stepped through the input 12ababc]
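The railway picture can be imitated with a hand-written matcher. This is a simplified sketch of the idea, not how Perl's engine is actually implemented; the name find_abc is made up for the example.

```perl
use strict;
use warnings;

# Hand-written matcher for the pattern abc. Returns the offset where the
# match starts, or -1 if the pattern does not occur.
sub find_abc {
    my ($input) = @_;
    my @expected = ('a', 'b', 'c');       # one transition per state
    for my $start (0 .. length($input) - 1) {
        my $state = 0;                    # 0 = Start, 3 = Match
        while ($state < @expected
               && $start + $state < length($input)
               && substr($input, $start + $state, 1) eq $expected[$state]) {
            $state++;                     # follow the transition
        }
        return $start if $state == @expected;
        # otherwise go back to Start and retry one character further on
    }
    return -1;
}

print find_abc('12ababc'), "\n";          # prints 4
```

On 12ababc the attempt starting at the first a gets as far as the ‘b’ state and then fails on the second a, so the matcher restarts and succeeds at the second a.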

  15. My Problem
     XXXX, ROBERT 4653 N VCSG-4 rma9999
     XXXXXX, ADAM 3976 N VCSG-4 716-555-4281 alb9999
     XXXXXXX, EDWARD 4637 N VCSG-2 716-555-4780 esb9999
     XXXXXXX, JOHN 1906 N VCSG-4 716-555-4780
     XXXX, DERRICK 6432 N VCSG-2 716-555-3161 dxc9999
     XXXXXXXXX, JOHN 5034 N VCSG-2 716-555-3894 jak9999
     XXX, JASON 9020 N VCSG-2 716-555-3145 jsl9999
     XXXXXXX, SARAH 7610 N VCSG-2 716-555-3147 sem9999
     XXXXXXXX, CHRISTOPHER 6309 N VCSG-2 716-555-3427 cco9999
     XXXXXXX, MICHAEL 8195 N VCSG-2 716-555-3166 mpp9999
     XXXXXX, SHAUN 9925 N VCSG-2 716-555-3145 sls9999
     XXXXXX, WILLIAM 2568 N VCSG-2 716-555-3144 wjw9999
     XXXXXX, PATRICK 2335 N EECC-2 716-555-3144 psw9999

  16. Roster to CSV
     XXXXXXX, EDWARD 4637 N VCSG-2 716-555-4780 esb9999
     • [^,]+ : match 1 or more non-comma characters
     • \S+ : match 1 or more non-whitespace characters
     • \d{4} : match 4 digits
     • \S* : match 0 or more non-whitespace characters (the fields may not be in the input)
     • .* : match anything!!

     while (<>) {
         ($last,$first,$id,$ntid,$gradeType,$program,$phone,$email) =
             /([^,]+), (\S+) (\d{4}) (\S*) (\S*) (\S+) (\S*) (\S*).*/;
         print "\"$last,$first\",$id,$program,$email\@cs.rit.edu\n";
     }

  17. The Result
     "XXXX,ROBERT",4653,VCSG-4,rma9999@cs.rit.edu
     "XXXXXX,ADAM",3976,VCSG-4,alb9999@cs.rit.edu
     "XXXXXXX,EDWARD",4637,VCSG-2,esb9999@cs.rit.edu
     "XXXXXXX,JOHN",1906,VCSG-4,@cs.rit.edu
     "XXXX,DERRICK",6432,VCSG-2,dxc9999@cs.rit.edu
     "XXXXXXXXX,JOHN",5034,VCSG-2,jak9999@cs.rit.edu
     "XXX,JASON",9020,VCSG-2,jsl9999@cs.rit.edu
     "XXXXXXX,SARAH",7610,VCSG-2,sem9999@cs.rit.edu
     "XXXXXXXX,CHRISTOPHER",6309,VCSG-2,cco9999@cs.rit.edu
     "XXXXXXX,MICHAEL",8195,VCSG-2,mpp9999@cs.rit.edu
     "XXXXXX,SHAUN",9925,VCSG-2,sls9999@cs.rit.edu
     "XXXXXX,WILLIAM",2568,VCSG-2,wjw9999@cs.rit.edu
     "XXXXXX,PATRICK",2335,EECC-2,psw9999@cs.rit.edu

  18. What Can We Do?
     • /pattern/ or m/pattern/
       • Find an occurrence of pattern
     • s/pattern/replacement/
       • Replace an occurrence of pattern with replacement
     • All of these work on $_ by default
     • Both m and s accept a trailing ‘g’ modifier (m/pattern/g, s/pattern/replacement/g), which applies the operation globally, that is, to every occurrence rather than only the first
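A short sketch of the match and substitute operators working on $_, with and without the g modifier. The helper names count_matches and to_rna and the sample sequences are illustrative assumptions.

```perl
use strict;
use warnings;

# Count how many times GC occurs, using m//g in list context.
# (count_matches is a hypothetical helper.)
sub count_matches {
    local $_ = shift;
    my @hits = /GC/g;          # no =~ needed: the match applies to $_
    return scalar @hits;
}

# Replace T with U: only the first occurrence, or every one when
# $global is true. (to_rna is a hypothetical helper.)
sub to_rna {
    my ($dna, $global) = @_;
    local $_ = $dna;
    if ($global) { s/T/U/g } else { s/T/U/ }
    return $_;
}

print count_matches('GCAGCTGC'), "\n";   # prints 3
print to_rna('ATTA', 0), "\n";           # prints AUTA
print to_rna('ATTA', 1), "\n";           # prints AUUA
```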

  19. split/join
     • split() can be used to break a string into fields
       $line = "merlin::118:10:Randal:/home/meryln:/usr/bin/perl";
       @fields = split(/:/,$line);
     • join() can be used to glue them back together
       $outline = join(":",@fields);
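The same split and join calls as a runnable sketch, using the line from the slide. The variable $shell and the replacement value are added here only to show a field being changed before the line is glued back together.

```perl
use strict;
use warnings;

my $line   = 'merlin::118:10:Randal:/home/meryln:/usr/bin/perl';
my @fields = split /:/, $line;

# The empty second field (between the two colons) is kept as ''.
print scalar(@fields), " fields\n";      # prints 7 fields

# Gluing the untouched fields back together reproduces the input.
my $outline = join ':', @fields;

# Changing one field before joining (illustrative value only).
my $shell = '/bin/sh';
$fields[-1] = $shell;
my $changed = join ':', @fields;
print "$changed\n";
```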

  20. =~
     • If you want to apply the matching operators to something other than $_, use =~
       $line =~ /foo/;
       $line =~ /^ACCESSION/;
       $line =~ s/ACCESSION(\s)*//;
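A sketch of the binding operator applied to a named variable. The subroutine accession_of is hypothetical, and the sample lines only imitate the layout of a Genbank record.

```perl
use strict;
use warnings;

# Return the accession number from an ACCESSION line, or undef for any
# other line. (accession_of is a hypothetical helper.)
sub accession_of {
    my ($line) = @_;
    return unless $line =~ /^ACCESSION/;   # match against $line, not $_
    $line =~ s/^ACCESSION\s*//;            # strip the keyword and padding
    $line =~ s/\s+$//;                     # strip the trailing newline
    return $line;
}

print accession_of("ACCESSION   AF165912\n"), "\n";   # prints AF165912
```

The substitution changes the subroutine's own copy of the line, so the caller's string is left untouched.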

  21. accession.perl
     open INFILE,"<$ARGV[0]" or
         die "Unable to open Genbank file ($!)";

     while ( !eof( INFILE ) ) {
         $line = <INFILE>;
         if ( $line =~ /^ACCESSION/ ) {
             $line =~ s/ACCESSION(\s)*//;
             print $line,"\n";
         }
     }

     close INFILE;

  22. Your Turn!!!
     • Modify the script accession.perl so that in addition to the accession number it also prints out the locus and the organism
     • The output from your program should look like this
       Accession: AF165912
       Organism: Arabidopsis thaliana
       Locus: AF165912 5485 bp DNA linear PLN 29-JUL-1999
     • The information obtained from the file must be printed in the order specified

  23. accession.perl
     open INFILE,"<$ARGV[0]" or
         die "Unable to open Genbank file ($!)";

     while ( !eof( INFILE ) ) {
         $line = <INFILE>;
         if ( $line =~ /^ACCESSION/ ) {
             $accession = $line;
             $accession =~ s/ACCESSION(\s)*//;
         } elsif ( $line =~ /^SOURCE/ ) {
             $organism = $line;
             $organism =~ s/SOURCE(\s)*//;
         } elsif ( $line =~ /^LOCUS/ ) {
             $locus = $line;
             $locus =~ s/LOCUS(\s)*//;
         }
     }

     print "Accession: ",$accession;
     print "Organism: ",$organism;
     print "Locus: ",$locus;

     close INFILE;

  24. sequence.perl
     open INFILE,"<$ARGV[0]" or
         die "Unable to open Genbank file ($!)";

     $in_sequence = 0;                 # 0 is false

     while ( !eof( INFILE ) ) {
         $line = <INFILE>;
         if ( $line =~ /^\/\/\n/ ) {
             $in_sequence = 0;         # "//" ends the record, so stop printing
         } elsif ( $line =~ /^ORIGIN/ ) {
             $in_sequence = 1;         # non-zero value is true
         } elsif ( $in_sequence ) {
             print $line,"\n";
         }
     }

     close INFILE;

  25. Your Turn!!!
     • Modify the script sequence.perl so that the sequence data is placed into a scalar variable named $sequence_data
     • All of the spaces, newlines, and line numbers should be removed from the sequence data
     • Write your program so that it prints $sequence_data to verify it works correctly
     • Testing hint
       • Modify the Genbank record so that it has many fewer sequence lines

  26. sequence.perl
     open INFILE,"<$ARGV[0]" or
         die "Unable to open Genbank file ($!)";

     $in_sequence = 0;                 # 0 is false
     $sequence_data = "";

     while ( !eof( INFILE ) ) {
         $line = <INFILE>;
         if ( $line =~ /^\/\// ) {
             $in_sequence = 0;         # "//" ends the record, so stop collecting
         } elsif ( $line =~ /^ORIGIN/ ) {
             $in_sequence = 1;         # non-zero value is true
         } elsif ( $in_sequence ) {
             $sequence_data = $sequence_data . $line;
         }
     }

     $sequence_data =~ s/[\s0-9]//g;   # drop whitespace and line numbers
     print $sequence_data;

     close INFILE;
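The clean-up substitution at the end of the script can be checked on its own. In this sketch the subroutine name clean_sequence and the two sample lines are made up; the lines only imitate the numbered, space-separated layout of a Genbank sequence block.

```perl
use strict;
use warnings;

# Remove whitespace and digits, leaving only the sequence letters.
# (clean_sequence is a hypothetical helper.)
sub clean_sequence {
    my ($raw) = @_;
    (my $clean = $raw) =~ s/[\s0-9]//g;
    return $clean;
}

# Illustrative input, not taken from a real record.
my $raw = "        1 gatcctccat atacaacggt\n       21 ccgac\n";
print clean_sequence($raw), "\n";   # prints gatcctccatatacaacggtccgac
```

Without the g modifier only the first whitespace character would be removed.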

  27. Where Are We?
     • So now we have a program that extracts the DNA sequence information out of a Genbank record
     • Let's go one step further and convert the sequence data into an equivalent sequence of amino acids
     • To simplify things, assume that a reading frame starts with the first nucleotide in the sequence

  28. Hashes
     • Associative arrays
     • Key value pairs
     • Given the key, the table returns the value
     • You can build them by hand
       %codons = ('TCA','S','TCC','S',…)
     • Alternatively…
       %codons = (
           'TCA' => 'S',
           'TCC' => 'S',
           'TCG' => 'S',
           …
       );

  29. Building a Hash
     • The hash can be built under program control
     • The file codons.txt contains 64 lines in the following format
       • TCA S Serine
     • A Perl script could read the lines one at a time
       • Extract the codon and the amino acid from the line
       • Add the information to the hash

  30. Hash Operations
     • Some operations you can perform on a hash
       $codons{'TCA'}='S';
       • Adds the entry (TCA,S) to the hash
       delete $codons{'XXX'};
       • Removes the key XXX from the hash
       keys %codons
       • Returns a list of the keys in the hash
       sort keys %codons
       • Returns the keys of the hash in sorted order
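The hash operations above, collected into one runnable sketch. The XXX entry is a placeholder added only so that there is something to delete.

```perl
use strict;
use warnings;

# Build a small hash by hand with the => notation.
my %codons = (
    'TCA' => 'S',
    'TCC' => 'S',
);

$codons{'TCG'} = 'S';      # add an entry
$codons{'XXX'} = '?';      # placeholder entry (illustrative only)
delete $codons{'XXX'};     # ...and remove it again

# keys returns the keys in no particular order, so sort them first.
my @sorted = sort keys %codons;
for my $codon (@sorted) {
    print "$codon => $codons{$codon}\n";
}
```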

  31. buildHash.perl
     open CODONS,"<$ARGV[0]" or
         die "Unable to open codon file ($!)";

     while (<CODONS>) {
         # split on runs of whitespace so that repeated spaces or tabs
         # between columns do not produce empty fields
         ($codon,$amino,$name)=split /\s+/;
         $codons{$codon}=$amino;
     }

     foreach $d ( sort keys %codons ) {
         print "$d: $codons{$d}\n";
     }

     close CODONS;

  32. Now This is Cool!!
     $dna = 'CGACGTTTCGTACGGACTAGCT';
     $amino_acids = "";

     for ($i=0; $i<length($dna)-2; $i=$i+3) {
         $amino_acids .= $codons{ substr( $dna, $i, 3 ) };
     }

     print $amino_acids,"\n";

  33. Homework Assignment
     • Write a Perl script that will read sequence data from a Genbank record and print out the amino acid sequence for each of the 6 reading frames associated with the sequence data in the record
     • Take advantage of subroutines
       • Clearly a translate subroutine will be helpful
     • If a codon cannot be translated, print ? in the output
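As a starting point, the translation loop can be wrapped in a subroutine that takes a frame offset and prints ? for unknown codons. This is a sketch under assumptions, not a full solution: the names translate and reverse_complement, the argument order, and the deliberately partial codon table are all choices made for this example, and a real script would load all 64 codons from codons.txt.

```perl
use strict;
use warnings;

# Partial codon table, just enough for the sample sequence below.
my %codons = (
    'TCA' => 'S', 'TCC' => 'S', 'TCG' => 'S', 'AGC' => 'S',
    'CGA' => 'R', 'CGT' => 'R', 'CGG' => 'R',
    'TTC' => 'F', 'GTA' => 'V', 'ACT' => 'T',
);

# Translate $dna starting at $offset (0, 1 or 2). Codons missing from
# the table become '?'. (translate is a hypothetical helper.)
sub translate {
    my ($dna, $offset, $table) = @_;
    my $protein = '';
    for (my $i = $offset; $i <= length($dna) - 3; $i += 3) {
        my $codon = substr($dna, $i, 3);
        $protein .= exists $table->{$codon} ? $table->{$codon} : '?';
    }
    return $protein;
}

# The other three reading frames are read from the opposite strand.
# (reverse_complement is a hypothetical helper.)
sub reverse_complement {
    my ($dna) = @_;
    my $rc = scalar reverse $dna;
    $rc =~ tr/ACGT/TGCA/;
    return $rc;
}

my $dna = 'CGACGTTTCGTACGGACTAGCT';
print translate($dna, 0, \%codons), "\n";   # prints RRFVRTS
print translate($dna, 1, \%codons), "\n";   # prints ??S????
```

Calling translate with offsets 0, 1 and 2 on both $dna and reverse_complement($dna) covers the six reading frames.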
