Lecture 6 More advanced Perl…
Substitute
• Works like the s/// command in vi:
    # cut with MfeI (site CAATTG) and chew back the overhang
    $linker = "GGCCAATTGGAAT";
    $linker =~ s/CAATTG/CG/g;
• Or, to get the number of sites replaced as well:
    $sites = ($linker =~ s/CAATTG/CG/g);
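The counting trick can be wrapped in a small function. This is only a sketch: replace_sites is a made-up helper name, and the \Q...\E quoting is an addition so that the site is treated as plain text rather than as a regular expression.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the lecture): replace every occurrence of
# $site in $seq with $replacement and report how many were replaced.
sub replace_sites {
    my ($seq, $site, $replacement) = @_;
    # s///g returns the number of substitutions, or an empty string if
    # there were none; "|| 0" turns that empty string into a plain 0.
    my $count = ($seq =~ s/\Q$site\E/$replacement/g) || 0;
    return ($seq, $count);
}

my ($cut, $n) = replace_sites('GGCCAATTGGAAT', 'CAATTG', 'CG');
print "$cut ($n site(s) replaced)\n";   # GGCCGGAAT (1 site(s) replaced)
```

Because the function works on its own copy of the sequence, the caller's original string is left untouched.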
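Reverse and Translate
• tr/// swaps characters one for one (it is not DNA-to-protein translation):
    my $DNA = 'CCGTAA';
    $DNA =~ tr/ACGT/TGCA/;    # complement each base
    print "$DNA";             # GGCATT
    $DNA = reverse $DNA;      # reverse the string
    print " $DNA\n";          # TTACGG
• Could have been made with DNA in mind…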
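The two steps combine naturally into a reverse-complement function. A minimal sketch, assuming the input contains only A, C, G and T (upper or lower case); reverse_complement is an illustrative name, and the caller's string is not modified.

```perl
use strict;
use warnings;

# Hypothetical helper: return the reverse complement of a DNA string.
sub reverse_complement {
    my ($dna) = @_;
    my $rc = reverse $dna;              # scalar context: reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;       # swap each base for its complement
    return $rc;
}

print reverse_complement('CCGTAA'), "\n";   # TTACGG
```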
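Also, a quick way to calculate GC%
    my $DNA = 'CCGTAA';
    my $gc = ($DNA =~ tr/CG/GC/);    # tr returns the number of characters it matched
    my $at = ($DNA =~ tr/AT/TA/);    # the two tr calls together also complement $DNA
    print 100 * $gc / ($gc + $at);   # 50
    $DNA = reverse $DNA;             # so this gives the reverse complement
    print "\n$DNA\n";                # TTACGG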
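If you only want the percentage and do not want the sequence changed, tr with an empty replacement list counts characters without altering them. The sketch below uses that form; gc_percent is a made-up name, and returning 0 for a sequence with no A, C, G or T is an arbitrary choice to avoid dividing by zero.

```perl
use strict;
use warnings;

# Hypothetical helper: GC content as a percentage of the A/C/G/T bases.
sub gc_percent {
    my ($dna) = @_;
    my $gc = ($dna =~ tr/GCgc//);   # empty replacement: count only
    my $at = ($dna =~ tr/ATat//);
    return 0 if $gc + $at == 0;     # nothing to count, avoid division by zero
    return 100 * $gc / ($gc + $at);
}

printf "%.1f%%\n", gc_percent('CCGTAA');   # 50.0%
```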
push, shift, unshift and pop
    @DNA = ("A", "C", "G", "T");
• Add "A" to the END of @DNA:
    push(@DNA, "A");
• Remove "A" (or whatever is there) from the END of @DNA:
    $end = pop(@DNA);
• Add "T" to the START of @DNA:
    unshift(@DNA, "T");
• Remove "T" (or whatever is there) from the START of @DNA:
    $start = shift(@DNA);
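Combining two of these functions gives a handy idiom: shift a value off the start and push it onto the end to rotate a list, for example to move the origin of a circular sequence. This is an illustrative sketch; rotate_left is not a built-in.

```perl
use strict;
use warnings;

# Hypothetical helper: rotate a list to the left by $n places.
sub rotate_left {
    my ($n, @items) = @_;
    return @items unless @items;            # nothing to rotate
    push @items, shift @items for 1 .. $n;  # move the first item to the end, $n times
    return @items;
}

my @dna = rotate_left(2, qw(A C G T));
print "@dna\n";   # G T A C
```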
Arguments
• Arguments are data given to a program or function in UNIX or Perl, e.g.
    [matt@mrmarsh]$ hmmpfam myprotein.fas Pfam.ls
• You can get data into a Perl script with arguments:
    [matt@mrmarsh]$ myscript.pl myprotein.fas
Arguments, cont.
• The arguments end up in a special array called @ARGV:
    my $file = shift @ARGV;
    open FILE, $file or die $!;
Arguments, cont.
• You can put as many arguments as you like in @ARGV; the number is arbitrary:
    my @data;
    foreach my $file (@ARGV) {
        open FILE, $file or die $!;
        while (<FILE>) {
            push @data, $_;
        }
        close FILE;
    }
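The same loop can be written as a reusable function. The sketch below is one possible version, not the only way to do it: read_all_lines is a made-up name, and it uses a lexical filehandle with the three-argument form of open, which avoids clashes between filehandle names and surprises from odd characters in filenames.

```perl
use strict;
use warnings;

# Hypothetical helper: return every line (without its newline) from all
# the files named in the argument list.
sub read_all_lines {
    my @files = @_;
    my @data;
    foreach my $file (@files) {
        open(my $fh, '<', $file) or die "Cannot open $file: $!\n";
        while (my $line = <$fh>) {
            chomp $line;
            push @data, $line;
        }
        close $fh;
    }
    return @data;
}

# Works with zero, one or many filenames on the command line.
my @data = read_all_lines(@ARGV);
print scalar(@data), " line(s) read from ", scalar(@ARGV), " file(s)\n";
```

Run it as, for example, myscript.pl first.fas second.fas; with no arguments it simply reports zero lines.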
Index
• Returns the position of a match (counting from 0), or -1 if there is no match
• Takes three arguments: the string to search, the substring, and the offset to start from
    $linker = "GGCCAATTGGAAT";
    $pos = 0;
    while (($pos = index($linker, "CAATTG", $pos)) > -1) {
        print "MfeI site (CAATTG) at $pos\n";
        $pos++;
    }
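Collecting the positions in an array instead of printing them makes the loop reusable. A minimal sketch (find_sites is an illustrative name); because the offset moves on by only one each time, overlapping matches are reported too.

```perl
use strict;
use warnings;

# Hypothetical helper: return every 0-based position of $site in $seq.
sub find_sites {
    my ($seq, $site) = @_;
    my @positions;
    my $pos = 0;
    while (($pos = index($seq, $site, $pos)) > -1) {
        push @positions, $pos;
        $pos++;    # step past this match so the search moves on
    }
    return @positions;
}

print join(',', find_sites('GGCCAATTGGAATCAATTG', 'CAATTG')), "\n";   # 3,13
```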
Bioperl
• Bioperl is a HUGE set of ready-made Perl modules that do almost all the jobs you need for bioinformatics.
• Examples: recover a DNA sequence from a website, translate DNA to protein, read GenBank files, convert to FASTA, parse BLAST output files into spreadsheets…
Bioperl, cont.
• Unfortunately, there is a downside. Bioperl is extremely complex and very difficult to use. Also, the code is only as good as the people who wrote it.
• Still, it can save you an awful lot of time. But in order to use it, you need to learn the Perl syntax for objects and references.
Object syntax
• Perl can be used as an object-oriented programming language, although this isn’t enforced (as it is in Java).
• Bioperl is an object-oriented set of modules.
• You call a Bioperl function or method, passing it values or references, and it returns an object. You need to know what to do with the object when you get it, or Bioperl isn’t much use.
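To make "object" concrete, here is a toy class written in plain Perl. SimpleSeq and its length_bp method are invented for this illustration and are not part of Bioperl; the method names display_id and seq are chosen to mirror the calls used in the Bioperl example in this lecture. An object is just a reference that has been "blessed" into a package, and methods are called on it with ->.

```perl
use strict;
use warnings;

package SimpleSeq;    # hypothetical class, not a Bioperl module

# Constructor: store the data in a hash and bless the hash reference.
sub new {
    my ($class, %args) = @_;
    my $self = { id => $args{id}, seq => $args{seq} };
    return bless $self, $class;
}

# Methods: the object itself arrives as the first argument.
sub display_id { my ($self) = @_; return $self->{id}; }
sub seq        { my ($self) = @_; return $self->{seq}; }
sub length_bp  { my ($self) = @_; return length $self->{seq}; }

package main;

my $obj = SimpleSeq->new(id => 'linker1', seq => 'GGCCAATTGGAAT');
printf "%s is %d bp\n", $obj->display_id, $obj->length_bp;   # linker1 is 13 bp
```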
References and dereferences
• Hashes, arrays and even scalar variables can be big, so you don’t always want to duplicate them
• You can pass a reference to these structures to another function instead. A reference is a variable that tells the code where to find a scalar, hash or array without duplicating it.
    my $reference = \$DNA;       # make a reference to $DNA
    my $data = ${$reference};    # dereference it to get the value back
• References to arrays, hashes and objects can also be dereferenced with the arrow operator: ->
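A short sketch of why this matters: a function that receives a reference works on the original data rather than on a copy. The name complement_in_place is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: complement a DNA string through a reference,
# so the (possibly very long) string is never copied.
sub complement_in_place {
    my ($seq_ref) = @_;
    ${$seq_ref} =~ tr/ACGT/TGCA/;   # dereference, then change the original
    return;
}

my $dna = 'CCGTAA';
my $ref = \$dna;                    # $ref points at $dna
complement_in_place($ref);
print "$dna\n";                     # GGCATT: the original has changed

# The same idea with an array, using both dereference styles.
my @bases = qw(A C G T);
my $aref  = \@bases;
print "${$aref}[0] $aref->[3]\n";   # A T
```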
References and dereferences
• Because a reference is itself a scalar variable, you can fill a hash or an array with references.
• This is very useful for spreadsheet-type matrix data:
    my @data;    # just a normal array
    open FILE, $spreadsheet or die $!;
    while (<FILE>) {
        my @line = split "\t", $_;
        push @data, \@line;    # store a reference to this row
    }
References and dereferences
• And then, to get the data back, just dereference:
    foreach (@data) {
        foreach (@{$_}) {
            print "$_\t";
        }
    }
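Once the rows are stored as references you can also reach a single cell directly, which is the main payoff of an array of arrays. The sketch below parses a small in-memory table so that it runs without a file; parse_table and the sample values are invented for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn tab-separated text into an array of array refs.
sub parse_table {
    my ($text) = @_;
    my @rows;
    foreach my $line (split /\n/, $text) {
        my @fields = split /\t/, $line;
        push @rows, \@fields;          # one reference per row
    }
    return @rows;
}

# Made-up sample data: a header row and two data rows.
my @table = parse_table("gene\tlength\nabc1\t1200\nxyz2\t845\n");

print "Row 2, column 1: $table[1][0]\n";     # abc1
print "Row 3, column 2: $table[2]->[1]\n";   # 845
```

Between two subscripts the arrow is optional, so $table[2][1] and $table[2]->[1] mean the same thing.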
References and dereferences
• Also, you can make a hash of arrays:
    my %fileshash;
    foreach my $file (@filelist) {
        open FILE, $file or die $!;
        my @lines = (<FILE>);
        close FILE;
        $fileshash{$file} = \@lines;
    }
References and dereferences
• And then you can get back any line of any file:
    foreach (keys %fileshash) {
        print join "", @{$fileshash{$_}};    # the lines still end in "\n"
    }
    # or
    my $file = $ARGV[0];
    my $line = $ARGV[1];
    print ${$fileshash{$file}}[$line];
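A hash of arrays is also the natural structure for grouping records by name. This sketch is an assumption-laden toy: group_by_name and the "name, tab, value" record format are invented here. It relies on autovivification, meaning Perl creates the inner array the first time you push onto it.

```perl
use strict;
use warnings;

# Hypothetical helper: group "name<TAB>value" records into a hash of arrays.
sub group_by_name {
    my @records = @_;
    my %grouped;
    foreach my $record (@records) {
        my ($name, $value) = split /\t/, $record, 2;
        push @{ $grouped{$name} }, $value;   # inner array is created on demand
    }
    return %grouped;
}

# Made-up sample data: match positions recorded per sequence.
my %hits = group_by_name("seqA\t3", "seqB\t7", "seqA\t13");

foreach my $name (sort keys %hits) {
    print "$name: @{$hits{$name}}\n";        # seqA: 3 13, then seqB: 7
}
print "Second hit in seqA: $hits{seqA}[1]\n";   # 13
```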
References and dereferences
• And of course you can also make an array of hashes:
    my @hashrefarray;
    foreach my $file (@filelist) {
        open FILE, $file or die $!;
        my %quesandans;
        while (<FILE>) {
            chomp;
            if (/([^\t]+)\t(.+)/) {    # question, tab, answer
                $quesandans{$1} = $2;
            }
        }
        close FILE;
        push @hashrefarray, \%quesandans;
    }
References and dereferences
• And get the data back in a similar way:
    foreach (@hashrefarray) {
        print "answer for $question is ";
        print ${$_}{$question}, "\n";
    }
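Here is a self-contained sketch of an array of hashes, built from in-memory text so that it runs without any files. parse_pairs is a made-up helper; the two sample records use the real recognition sites of EcoRI (GAATTC) and MfeI (CAATTG).

```perl
use strict;
use warnings;

# Hypothetical helper: parse "key<TAB>value" lines into a hash reference.
sub parse_pairs {
    my ($text) = @_;
    my %pairs;
    foreach my $line (split /\n/, $text) {
        if ($line =~ /^([^\t]+)\t(.+)$/) {   # skip lines that do not match
            $pairs{$1} = $2;
        }
    }
    return \%pairs;
}

my @records = (
    parse_pairs("enzyme\tEcoRI\nsite\tGAATTC\n"),
    parse_pairs("enzyme\tMfeI\nsite\tCAATTG\n"),
);

foreach my $record (@records) {
    # ${$record}{enzyme} and $record->{enzyme} are the same thing.
    print "${$record}{enzyme} cuts at $record->{site}\n";
}
```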
References and dereferences
• This, of course, leads to a very flexible and powerful set of data structures, since you can go as deep as you like:
    • Hashes of hashes of arrays
    • Arrays of hashes of hashes of hashes
    • etc.
• When they get this complicated, the dereference notation -> starts to get useful.
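As a sketch of how the arrow notation reads in a deeper structure, the example below builds a hash of hashes of arrays: enzyme name, then 'site' and 'hits', then a list of positions. The record_hits function and the test sequence are invented for the illustration; the recognition sites are the real ones for EcoRI and MfeI.

```perl
use strict;
use warnings;

# Hypothetical helper: for each enzyme in the hash, record every 0-based
# position of its site in $seq under the 'hits' key.
sub record_hits {
    my ($enzymes, $seq) = @_;
    foreach my $name (keys %{$enzymes}) {
        my $site = $enzymes->{$name}{site};
        my $pos  = 0;
        while (($pos = index($seq, $site, $pos)) > -1) {
            push @{ $enzymes->{$name}{hits} }, $pos;
            $pos++;
        }
    }
    return;
}

my %enzymes = (
    EcoRI => { site => 'GAATTC', hits => [] },
    MfeI  => { site => 'CAATTG', hits => [] },
);
record_hits(\%enzymes, 'GGCCAATTGGAATTCAATTG');

foreach my $name (sort keys %enzymes) {
    print "$name ($enzymes{$name}{site}): @{$enzymes{$name}{hits}}\n";
}
# EcoRI (GAATTC): 9
# MfeI (CAATTG): 3 14
print "First MfeI hit: $enzymes{MfeI}{hits}[0]\n";   # 3
```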
Bioperl: Example
• Open a FASTA format sequence file:
    use strict;
    use Bio::Perl;
    my $file = $ARGV[0];
    die "give me a sequence filename!\n" unless $file;
    my @seq_object_array = read_all_sequences($file, 'fasta');
• The read_all_sequences function returns all the sequences in the FASTA file as an array of object references
Bioperl: Example
• Each object holds one sequence from the file as a long string, which you get at by calling methods on the object with ->
    foreach my $object (@seq_object_array) {
        my $sequence = uc($object->seq());
        my $name = $object->display_id;
        my $pos = 0;
        while (($pos = index($sequence, "CAATTG", $pos)) > -1) {
            print "MfeI site (CAATTG) at $pos of $name\n";
            $pos++;
        }
    }
More Bioperl
• I could spend a whole semester on Bioperl, but I won’t. You are going to have to figure it out for yourselves if you need it.
• perldoc bioperl
• I recommend going through the example script bptutorial.pl. I have downloaded it and put it in your home directories.
That’s it for now
• The more you know, the more there is to learn
• The only way to really learn this stuff is to write programs
• You need to get cracking with some programming projects in class!