Perl Programming for Biologists
PART 2: Tue Aug 28th 2007
Yannick Pouliot, PhD
Bioresearch Informationist
Lane Medical Library & Knowledge Management Center
To Dos
• Close all programs other than IE on your laptop
• Log into virtual room
• YP: log into Safari
To Do - 2
• Please download all class materials from http://lane.stanford.edu/howto/index.html?id=_2593
Class Focus for Session #2
• Converting file contents
• Introducing BioPerl
• Perl and relational databases
And remember: Ask LOTS OF QUESTIONS
Cautions - Reminder
• All examples pertain to MS Office 2003
  • Unclear what to expect with MS Office 2007
• All contents pertain to Perl 5.x, not 6.x
  • Versions 5 and 6 are NOT compatible
  • Version 5 is far more common, so this is not much of an issue
Converting Data Stored in Flatfiles
• Input: ExampleOutputExcel3.csv
  • File generated last week by Excel3.pl
• Let’s look at and run Convert1.pl → Convert5.pl
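To make the idea of converting file contents concrete, here is a minimal sketch (not the contents of Convert1.pl to Convert5.pl) that turns comma-separated lines into tab-separated lines. The sub name csv_line_to_tsv and the sample rows are hypothetical, and the sketch assumes no field contains a quoted comma.

```perl
use strict;
use warnings;

# Convert one comma-separated line into a tab-separated line.
# Assumption: no field contains a quoted comma.
sub csv_line_to_tsv {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;                 # strip a Windows or Unix line ending
    my @fields = split /,/, $line, -1;    # -1 keeps trailing empty fields
    return join("\t", @fields);
}

# Hypothetical sample rows standing in for lines read from a .csv file.
my @demo_lines = ("GeneSymbol,Chromosome,Length\n", "GeneA,16,\r\n");

for my $line (@demo_lines) {
    print csv_line_to_tsv($line), "\n";
}
```

To convert a real file, the same sub could be called on each line read from a filehandle opened with the three-argument form of open.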
BioPerl: Overview
• BioPerl = >1,000 modules divided into 7 packages
  • Not all in 1.4
  • 1.4 = stable release
BioPerl: You Have A Friend In High Places
The big deal:
• BioPerl provides “objects” for various types of sequence data and their associated features and annotations.
• These objects provide interfaces for analysis of these sequences with a wide variety of
  • external programs (BLAST, FASTA, clustalw and EMBOSS to name just a few)
  • various types of databases for storage and retrieval of sequences
    • remote (GenBank, EMBL etc.)
    • local (MySQL, flat files, GFF etc.)
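As a small illustration of what a sequence “object” looks like in practice, the sketch below builds a Bio::Seq object in memory and calls a few of its methods. It assumes BioPerl is installed; the identifier demo1 and the nine-base sequence are made up for the example.

```perl
use strict;
use warnings;
use Bio::Seq;

# Build a sequence object from a made-up DNA string (no database needed).
my $seq = Bio::Seq->new(
    -display_id => 'demo1',
    -seq        => 'ATGGCCTAA',
    -alphabet   => 'dna',
);

# The object carries the data and the methods that act on it.
print "id:      ", $seq->display_id,     "\n";
print "length:  ", $seq->length,         "\n";
print "revcom:  ", $seq->revcom->seq,    "\n";   # reverse complement
print "protein: ", $seq->translate->seq, "\n";   # translation, stop shown as *
```

Note that revcom and translate each return a new sequence object, which is why ->seq is called on the result to get the plain string.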
What A Biology-Related Program Looks Like When Coded According To The Object Paradigm
[Diagram: objects of type Gene, Protein, DNA, Organism, Species, RNA, LivingObject and Sequence]
Objects Inherit From A Class Or Prior Object
• Class = prototype for all objects of this type
• Create an object (“new”)
• Derive an object from an existing object
[Diagram: a Sequence class with derived DNA, RNA and Protein; Object 1 (ancestor) and Object2]
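The inheritance idea can be sketched in plain Perl, without BioPerl. The package names Sequence and DNA echo the diagram, but the code and its method names (seq_length, revcom) are a hypothetical illustration, not BioPerl's own classes.

```perl
use strict;
use warnings;

# The class: a prototype for all sequence objects.
package Sequence;

sub new {
    my ($class, %args) = @_;
    my $self = { id => $args{id}, seq => uc($args{seq}) };
    return bless $self, $class;          # turn the hash into an object
}
sub id         { my ($self) = @_; return $self->{id}; }
sub seq        { my ($self) = @_; return $self->{seq}; }
sub seq_length { my ($self) = @_; return length $self->{seq}; }

# A derived class: DNA inherits everything from Sequence and adds revcom.
package DNA;
our @ISA = ('Sequence');

sub revcom {
    my ($self) = @_;
    my $rc = scalar reverse $self->seq;  # reverse the string
    $rc =~ tr/ACGT/TGCA/;                # then complement each base
    return $rc;
}

package main;

my $dna = DNA->new(id => 'demo1', seq => 'atggcc');   # create an object ("new")
print $dna->id, " has length ", $dna->seq_length, "\n";   # inherited methods
print "reverse complement: ", $dna->revcom, "\n";          # DNA-only method
```

DNA never defines new, id, seq or seq_length; Perl finds them in Sequence through @ISA, which is the inheritance arrow in the diagram.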
Key BioPerl Links
• BioPerl 1.4 installed as part of Perl 5.8.8.822 (what you downloaded)
• BioPerl home: http://www.bioperl.org/wiki/Main_Page
• http://www.bioperl.org/wiki/Getting_Started
  • Lots of examples
BioPerl Example: Querying GenBank To Retrieve Sequence Properties
• Seq7.pl
• Seq8.pl
• Seq9.pl → after exercise (next slide)
• Seq11.pl → after exercise (next slide)
• Related docs:
  • GenBank search: http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/DB/GenBank.html
  • SeqIO: http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/SeqIO/genbank.html
    • See also http://www.bioperl.org/wiki/HOWTO:SeqIO
  • And most importantly: http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/Seq.html
Exercise: Print An Additional Sequence Feature
• Add an additional sequence feature to Seq8.pl
• What to print: see Methods for Seq object at http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/Seq.html
Quiz Questions based on Seq11.pl

    use warnings;
    use strict;
    use Bio::DB::GenBank;
    use Bio::DB::Query::GenBank;   # the query object below comes from this module

    # ---------------------------------------------------------------------------
    # main
    $| = 1;   # Force unbuffered STDOUT.

    # put in some restrictions as to what is retrieved and stored into GenBank object ...
    my $gb = Bio::DB::GenBank->new(
        -format     => 'GenBank',
        -seq_start  => 0,
        -seq_end    => 1000,
        -strand     => 1,
        -complexity => 0,
    );

    # get a stream via a query string
    my $query = Bio::DB::Query::GenBank->new(
        -query => 'Homo sapiens[Organism] AND M-cadherin',
        -db    => 'nucleotide',
    );
    my $seqio = $gb->get_Stream_by_query($query);

    my $i = 0;   # count total number of sequences
    while (my $seq = $seqio->next_seq) {
        print "seq id =", $seq->id,
              "\t version = ", $seq->version,
              "\t seq acc number = ", $seq->accession_number,
              "\t seq length = ", $seq->length, "\n";
        $i++;
    }
    print "retrieved $i sequences from GenBank \n";
    # --------------------------------------------------------------------------
More Quizzing: Seq10.pl
• Run Seq10.pl
• Why the warning messages?
• Specifying strands
  • 1 for plus
  • 2 for minus
• Complexity: A GenBank nucleotide entry is often a part of a larger biological blob that contains other GI numbers (e.g., translated protein)
• Complexity regulates the display:
  • 0 - get the whole blob
  • 1 - get the bioseq for gi of interest (default in Entrez)
  • 2 - get the minimal bioseq-set containing the gi of interest
  • 3 - get the minimal nuc-prot containing the gi of interest
  • 4 - get the minimal pub-set containing the gi of interest
Some Cautions
• Be careful when querying databases
  • → have an idea of how many sequences you may be downloading/processing
• Know that Perl might eat up all of your CPU cycles
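One way to follow the first caution is to ask GenBank how many records match before fetching any of them. The sketch below assumes BioPerl and a network connection are available; the cap of 100 records and the sub names count_hits and ok_to_download are hypothetical choices for illustration. The query runs only when a query string is given on the command line.

```perl
use strict;
use warnings;

my $MAX_RECORDS = 100;   # hypothetical cap; pick a value that suits your task

# Decide whether a result set is small enough to download.
sub ok_to_download {
    my ($count, $max) = @_;
    return $count <= $max ? 1 : 0;
}

# Ask GenBank how many records match, without retrieving the sequences.
sub count_hits {
    my ($query_string) = @_;
    require Bio::DB::Query::GenBank;   # loaded only when a query is really made
    my $query = Bio::DB::Query::GenBank->new(
        -db    => 'nucleotide',
        -query => $query_string,
    );
    return $query->count;
}

# Usage: perl count_first.pl "Homo sapiens[Organism] AND M-cadherin"
if (@ARGV) {
    my $hits = count_hits($ARGV[0]);
    print "query matches $hits records\n";
    if (ok_to_download($hits, $MAX_RECORDS)) {
        print "small enough to download\n";
    }
    else {
        print "too many: narrow the query first\n";
    }
}
```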
Preliminaries: Updating ODBC Manager
• First we need to tell the ODBC Manager where to find the “GenesToEvaluate” DB
• More at http://lane.stanford.edu/howto/index.html?id=_1751
Example Perl Programs That Interact With A Database
Ancillary files:
• ExampleOutputExcel3.csv needed as input to Access1.pl
  • Access2.pl and Access3.pl don’t need this file
• All programs rely on GenesToEvaluate.mdb (Access DB)
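For orientation, here is a minimal sketch of how a Perl program can talk to a database through ODBC using the DBI module. It is not the contents of Access1.pl to Access3.pl. It assumes the DBI and DBD::ODBC modules are installed and that the ODBC data source is named GenesToEvaluate; the table name Genes and the sub names are hypothetical. The database is contacted only when the script is run with --run.

```perl
use strict;
use warnings;

# Build the DBI connection string for an ODBC data source name (DSN).
sub odbc_dsn {
    my ($dsn_name) = @_;
    return "dbi:ODBC:$dsn_name";
}

# Connect, run one SELECT, and return the rows as an array of array refs.
sub fetch_rows {
    my ($dsn_name, $sql) = @_;
    require DBI;   # needs the DBI and DBD::ODBC modules
    my $dbh = DBI->connect(odbc_dsn($dsn_name), '', '',
                           { RaiseError => 1, PrintError => 0 });
    my $rows = $dbh->selectall_arrayref($sql);
    $dbh->disconnect;
    return $rows;
}

# Usage: perl db_sketch.pl --run
if (@ARGV && $ARGV[0] eq '--run') {
    # 'Genes' is a hypothetical table name; use one that exists in your DB.
    my $rows = fetch_rows('GenesToEvaluate', 'SELECT * FROM Genes');
    for my $row (@$rows) {
        print join("\t", map { defined $_ ? $_ : '' } @$row), "\n";
    }
}
```

RaiseError makes DBI die with a message on any database error, which saves checking the return value of every call.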
In Closing: Suggestions
• Modify the programs provided here
  • Baby steps…
• Save often
  • Keep lots of prior versions so you can recover from your mistakes
• SU provides lots of documentation → use it!
• Google is invaluable