310 likes | 324 Views
Lecture 6. More advanced Perl…. Substitute. Like s/// function in vi: #cut with EcoRI and chew back $linker = “ GGCCAATTGGAAT ” ; $linker =~ s/ CAATTG / CG /g ; Or… to get number of sites also: $ecosites = ($linker =~ s/ CAATTG / CG /g );. Reverse and Translate.
E N D
Lecture 6 More advanced Perl…
Substitute • Like s/// function in vi: #cut with EcoRI and chew back $linker = “GGCCAATTGGAAT”; $linker =~ s/CAATTG/CG/g; • Or… to get number of sites also: $ecosites = ($linker =~ s/CAATTG/CG/g);
Reverse and Translate my $DNA = ‘CCGTAA’; $DNA =~ tr/ACGT/TGCA/; print “$DNA”; $DNA = reverse $DNA; print “ $DNA\n”; • Could have been made with DNA in mind…
Pseudocode $linker =~ s/CAATTG/CG/g; Substitute CAATTG with CG every time it occurs in $linker. $DNA =~ tr/ACGT/TGCA/; In $DNA, change every A to T, every C to G, every G to C and every T to A.
Also, quick way to calc. GC% my $DNA = ‘CCGTAA’; my $gc = ($DNA =~ tr/CG/GC/); my $at = ($DNA =~ tr/AT/TA/); print (($gc/($gc+$at))*100); $DNA = reverse $DNA; print "\n$DNA\n";
push, shift, unshift and pop @DNA = ( “A", “C", “G", “T" ); • Add “A” to the END of @DNA push( @DNA, “A" ); • Remove “A” (or whatever is there) from the END of@DNA $end = pop( @DNA ); • Add “T" to the START of @DNA unshift( @DNA, “T" ); • Remove “T” (or whatever is there) from the START of @DNA $a = shift( @DNA );
Arguments • Arguments are data given to a function in UNIX or Perl, e.g. [matt@mrmarsh]$hmmpfam myprotein.fas Pfam.ls • You can get data into a Perl script with arguments [matt@mrmarsh]$myscript.pl myprotein.fas
Arguments, cont. • The arguments end up in a special array called @ARGV: my $file = (shift @ARGV); open FILE, “$file”;
Arguments, cont. • You can put as many arguments as you like in @ARGV – the number can be arbitrary my @data; foreach my $file (@ARGV){ open FILE, “$file”; while <FILE> { push $_, @data; } }
Pseudocode foreach my $file (@ARGV){ open FILE, “$file”; while <FILE> { push $_, @data; } } For every argument supplied to the script as a command (i.e. in @ARGV) Open a file with that argument as the filename For every line in each file, put each line into the array @data as a separate array component
Index • Returns the positionof a match: • Takes three arguments: index string, substring, offset $linker = “GGCCAATTGGAAT”; while (($pos = index ($linker, “CAATTG”, $pos)) > -1) { print “EcoRI at $pos\n”; $pos ++; }
Pseudocode while (($pos = index ($linker, “CAATTG”, $pos)) > -1) { print “EcoRI at $pos\n”; $pos ++; } Starting at the position $pos (initially zero), look for the substring CAATTG in the string $linker. When you find it, print “EcoRI at (position found)”. Then, start looking again, beginning at the position you previously found CAATTG at, plus one base. Keep doing this until there are no more CAATTGs.
Bioperl • Bioperl is a HUGE set of ready-made perl programs that do almost all the jobs you need for bioinformatics. • Examples – recover DNA sequence from a website, translate DNA to protein, read GenBank files, convert to FASTA, parse BLAST output files into spreadsheets…
Bioperl, cont. • Unfortunately, there is a downside. Bioperl is extremely complex and very difficult to use. Also, the code is only as good as the people who wrote it. • Still, it can save you an awful lot of time. But in order to use it, you need to learn the Perl syntax for objects and references
Object syntax • Perl can be used as an object oriented programming language, although this isn’t enforced (as with Java or C++). • Bioperl is an object oriented set of modules • You pass Bioperl either a function call or a reference. It will return an object. You need to know what to do with the object when you get it, or Bioperl isn’t much use.
References and dereferences • Hashes, arrays and also variables can be big – don’t always want to duplicate them • Can pass a reference to these structures to another function. This is a variable that tells the code where to find a variable, hash or array without duplicating it. my $reference = \$DNA; my $data = ${$reference}; objects can also be dereferenced by the dereference operator: ->
Pseudocode my $reference = \$DNA; Define the contents of the variable $reference as a reference to the data contained in $DNA. my $data = ${$reference}; Define the contents of the variable $data as the contents of the variable referenced in $reference.
References and dereferences • Magically, because a reference is a variable, you can fill a hash or an array with references. • This is very useful for spreadsheet-type matrix data: my @data; #just a normal array open FILE, $spreadsheet or die $!; while (<FILE>) { my @line = split “\t”, $_; push @data, \@line; }
References and dereferences And then, to get data back, just dereference: foreach (@data) { foreach (@{$_}){ print "$_\t"; } }
Pseudocode push @data, \@line; Add a new component to the end of the array @data, which is a reference to the array @line foreach (@data) { foreach (@{$_}){ print "$_\t"; } } For each component in the array @data, look up all the components of the array referenced in that component. Then print out each component of the referenced array.
References and dereferences • Also, you can make a hash of arrays: my %fileshash; foreach my $file (@filelist) { open FILE, $file or die $!; my @lines = (<FILE>); close FILE; $fileshash{$file} = \@lines; }
References and dereferences • And then, you can get back any line of any file: foreach (keys %fileshash){ print join “\n”, @{$fileshash{$_}}; } #or my $file = $ARGV[0]; my $line = $ARGV[1]; print ${$fileshash{$file}}[$line]
References and dereferences • And of course you can also make an array of hashes: foreach my $file (@filelist) { open FILE, $file or die $!; my %quesandans; while (<FILE>) { /([^t]+)\t(.+)/; $quesandans{$1} = $2; } push @hashrefarray, \%quesandans; }
References and dereferences • And get the data back in a similar way: foreach (@hashrefarray) { print “answer for $question is “; print ${$_{$question}}; }
References and dereferences • This, of course, leads to a very flexible and powerful set of data structures, since you can go as deep as you like: • Hashes of hashes of arrays • Arrays of hashes of hashes of hashes • etc. When they get this complicated, the dereference notation -> starts to get useful.
Pseudocode For complex data structures, until you are used to visualizing what the names means, it pays to draw them out with arrows. For example: Note these are what we call anonymous arrays and variables
Bioperl: Example • Open a FASTA format sequence file: use Bio::Perl; use strict; my $file = $ARGV[0]; die "give me a sequence filename!\n" unless $file; my @seq_object_array = read_all_sequences($file,'fasta'); • The read_all_sequences function returns all the sequences in the fasta file as an array of object references
Bioperl: Example • The sequences from the file are now all in a long string, which can be accessed by dereferencing foreach my $object (@seq_object_array) { my $sequence = uc ($object->seq()); my $name = $object->display_id; my $pos =0; while ((my $pos = index ($sequence, “CAATTG”, $pos)) > -1) { print “EcoRI at $pos of $name\n”; $pos ++; } }
Unfortunately • Bio::Perl was the easy Bioperl • Not currently continuing • New Bioperl is all object oriented, eg. use Bio::Index::Fasta; use strict; my $Index_File_Name = shift; my $inx = Bio::Index::Fasta->new(-filename => $Index_File_Name, -write_flag => 1); $inx->make_index(@ARGV); my $out = Bio::SeqIO->new(-format => 'Fasta', -fh => \*STDOUT); foreach my $id (@ARGV) { my $seq = $inx->fetch($id); # Returns Bio::Seq object $out->write_seq($seq); }
More Bioperl • We could easily design a whole semester class on Bioperl, but I won’t. You are going to have to figure it out for yourselves if you need it. • My experience is that for simple problems it is almost always less work to write code yourself from scratch, at least until you get very good at this.
That’s it for now • The more you know, the more there is to learn • The only way to really learn this stuff is to write programs • You need to get cracking with some programming projects in class!