450 likes | 474 Views
BioPerl . Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu. Introduction. Objective of BioPerl: Develop reusable, extensible core Perl modules for use as a standard for manipulating molecular biological data. Background: Started in 1995
E N D
BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu
Introduction • Objective of BioPerl: • Develop reusable, extensible core Perl modules for use as a standard for manipulating molecular biological data. • Background: • Started in 1995 • One of the oldest open source Bioinformatics Toolkit Project
So what is BioPerl? • Higher level of abstraction • Re-usable collection of Perl modules that facilitate bioinformatics application development: • Accessing databases with different formats • Sequence manipulation • Execution and Parsing of the results of molecular biology programs • Catch? BioPerl does not include programs like Blast, ClustalW, etc • Uses system calls to execute external programs
So what is BioPerl? (continued…) • 551 modules (incl. 82 interface modules) • 37 module groups • 79,582 lines of code (223,310 lines total) • 144 lines of code per module • For More info: BioPerl Module Listing
Major Areas covered in Bioperl • Sequences, features, annotations, • Pairwise alignment reports • Multiple sequence alignments • Bibliographic data • Graphical rendering of sequence tracks • Database for features and sequences
Additional things • Gene prediction parsers • Trees, parsing phylogenetic and molecular evolution software output • Population genetic data and summary statistics • Taxonomy • Protein Structure
Downloading modules • Modules can be obtained from: • www.CPAN.org (Perl Modules) • www.BioPerl.org (BioPerl Modules) • Downloading modules from CPAN • Interactive mode • perl -MCPAN -e shell • Batch mode • use CPAN; • clean, install, make, recompile, test
Directory Structure • BioPerl directory structure organization: • Bio/ BioPerl modules • models/ UML for BioPerl classes • t/ Perl built-in tests • t/data/ Data files used for the tests • scripts/ Reusable scripts that use BioPerl • scripts/contributed/ Contributed scripts not necessarily integrated into BioPerl. • doc/ "How To" files and the FAQ as XML
Parsing Sequences • Bio::SeqIO • multiple drivers: genbank, embl, fasta,... • Sequence objects • Bio::PrimarySeq • Bio::Seq • Bio::Seq::RichSeq
Sequence Object Creation Sequence Creation : $sequence = Bio::Seq->new( -seq => ‘AATGCAA’ -display_id => ‘my_sequence’); Flat File Format Support : Raw, FASTA, GCG, GenBank, EMBL, PIR Via ReadSeq: IG, NBRF, DnaStrider, Fitch, Phylip, MSF, PAUP
Sequence object • Common (Bio::PrimarySeq) methods • seq() - get the sequence as a string • length() - get the sequence length • subseq($s,$e) - get a subseqeunce • translate(...) - translate to protein [DNA] • revcom() - reverse complement [DNA] • display_id() - identifier string • description() - description string
Sequence Types Different Sequence Objects: • Seq – Some annotations • RichSeq – Additional annotations • PrimarySeq – Bare minimum annotation ( id, accession number, alphabet) • LocatableSeq – Start, stop and gap information also • LargeSeq – Very long sequences • LiveSeq – Newly sequenced genomes
Using a sequence use Bio::PrimarySeq; my $str = “ATGAATGATGAA”; my $seq = Bio::PrimarySeq->new(-seq => $str, -display_id=>”example”); print “id is “, $seq->display_id,”\n”; print $seq->seq, “\n”; my $revcom = $seq->revcom; print $revcom->seq, “\n”; print “frame1=”,$seq->translate->seq,“\n”; id is example ATGAATGATGAA TTCATCATTCAT trans frame1=MNDE
Accessing remote databases $gb = new Bio::DB::GenBank(); $seq1 = $gb->get_Seq_by_id('MUSIGHBA1'); $seq2 = $gb->get_Seq_by_acc('AF303112'); $seqio = $gb-> get_Stream_by_id(["J00522","AF303112","2981014"]);
Sequence – Accession numbers # Get a sequence from RefSeq by accession number use Bio::DB::RefSeq; $gb = new Bio::DB::RefSeq; $seq = $gb->get_Seq_by_acc(“NM_007304”); print $seq->seq();
Reading and Writing Sequences • Bio::SeqIO • fasta, genbank, embl, swissprot,... • Takes care of writing out associated features and annotations • Two functions • next_seq (reading sequences) • write_seq (writing sequences)
Writing a Sequence use Bio::SeqIO; # Let’s convert swissprot to fasta format my $in = Bio::SeqIO->new( -format => ‘swiss’, -file => ‘file.sp’); my $out = Bio::SeqIO->new( -format => ‘fasta’, -file => ‘>file.fa’);` while( my $seq = $in->next_seq ) { $out->write_seq($seq); }
Manipulating sequence data with Seq methods • Allows the easy manipulation of bioinformatics data • Specific parts of various annotated formats can be selected and rearranged. • Unwanted information can be voided out of reports • Important information can be highlighted, processed, stored in arrays for graphs/charts/etc with relative ease • Information can be added and subtracted in a flash
The Code #!/usr/local/bin/perl use Bio::Seq; use Bio::SeqIO; my $seqin = Bio::SeqIO->new('-file' => "genes.fasta" , '-format' =>'Fasta'); my $seqobj = $seqin->next_seq(); my $seq = $seqobj->seq(),"\n"; #plain sequence print ">",$seqobj->display_id()," Description: ",$seqobj->desc(), " Alphabet: ",$seqobj->alphabet(),"\n"; $seq =~ s/(.{60})/$1\n/g; # convert to 60 char lines print $seq,"\n";
Obtaining basic sequence statistics- molecular weights, residue & codon frequencies (SeqStats, SeqWord) • Molecular Weight • Monomer Counter • Codon Counter • DNA weights • RNA weights • Amino Weights • More
The Code #!/usr/local/bin/perl use Bio::PrimarySeq; use Bio::Tools::SeqStats; my $seqobj = new Bio::PrimarySeq(-seq => 'ATCGTAGCTAGCTGA', -display_id => 'example1'); $seq_stats = Bio::Tools::SeqStats->new(-seq=>$seqobj); $hash_ref = $seq_stats->count_monomers(); foreach $base (sort keys %$hash_ref) { print "Number of bases of type ", $base, "= ",%$hash_ref->{$base},"\n"; }
More Code use SeqStats; $seq_stats = Bio::Tools::SeqStats->new($seqobj); $weight = $seq_stats->get_mol_wt(); -returns the molecular weight $monomer_ref = $seq_stats->count_monomers(); -counts the number of monomers $codon_ref = $seq_stats->count_codons(); # for nucleic acid sequence -counts the number of codons
Why the Large and The Small MW? Note that since sequences may contain ambiguous monomers (eg "M"meaning "A" or "C" in a nucleic acid sequence), the method get_mol_wtreturns a two-element array containing the greatest lower bound andleast upper bound of the molecule. (For a sequence with no ambiguous monomers, the two elements of the returned array will be equal.)
Identifying restriction enzyme sites (Restriction Enzyme) • Bioperl's standard RestrictionEnzyme object comes with data for more than 150 different restriction enzymes. • To select all available enzymes with cutting patterns that are six bases long: • $re = new Bio::Tools::RestrictionEnzyme('-name'=>'EcoRI'); @sixcutters = $re->available_list(6); • sites for that enzyme on a given nucleic acid sequence can be obtained using • $re1 = new Bio::Tools::RestrictionEnzyme(-name=>'EcoRI'); # $seqobj is the Seq object for the dna to be cut @fragments = $re1->cut_seq($seqobj);
Manipulating sequence alignments • Bioperl offers several perl objects to facilitate sequence alignment: • pSW (Smith-Waterman) • Clustalw.pm • TCoffee.pm • bl2seq option of StandAloneBlast.
Manipulating Alignments • Some of the manipulations possible with SimpleAlign include: • slice(): Obtaining an alignment ``slice'', that is, a subalignment inclusive of specified start and end columns. • column_from_residue_number(): Finding column in an alignment where a specified residue of a specified sequence is located. • consensus_string(): Making a consensus string. This method includes an optional threshold parameter, so that positions in the alignment with lower percent-identity than the threshold are marked by ``?'''s in the consensus • percentage_identity(): A fast method for calculating the average percentage identity of the alignment • consensus_iupac(): Making a consensus using IUPAC ambiguity codes from DNA and RNA.
The Code • use Bio::SimpleAlign; • $aln = Bio::SimpleAlign->new('t/data/testaln.fasta'); • $threshold_percent = 60; • $consensus_with_threshold = $aln->consensus_string($threshold_percent); • $iupac_consensus = $aln->consensus_iupac(); • # dna/rna alignments only • $percent_ident = $aln->percentage_identity; • $seqname = 'AKH_HAEIN'; • $pos = $aln->column_from_residue_number($seqname, 14);
Searching for Sequence Similarity • BLAST with BioPerl • Parsing Blast and FASTA Reports • Search and SearchIO • BPLite, BPpsilite, BPbl2seq • Parsing HMM Reports • Standalone BioPerl BLAST
Remote Execution of BLAST • BioPerl has built in capability of running BLAST jobs remotely using RemoteBlast.pm • Runs these jobs at NCBI automatically • NCBI has dynamic configurations (server side) to “always” be up and ready • Automatically updated for new BioPerl Releases • Convenient for independent researchers who do not have access to huge computing resources • Quick submission of Blast jobs without tying up local resources (especially if working from standalone workstation) • Legal Restrictions!!!
Example of Remote Blast $remote_blast = Bio::Tools::Run::RemoteBlast->new( '-prog' => 'blastp','-data' => 'ecoli','-expect' => '1e-10' ); $r = $remote_blast->submit_blast("t/data/ecolitst.fa"); while (@rids = $remote_blast->each_rid ) { foreach $rid ( @rids ) { $rc = $remote_blast->retrieve_blast($rid); } }
Sample Script to Read and Parse BLAST Report # Get the report $searchio = new Bio::SearchIO (-format => 'blast', -file => $blast_report); $result = $searchio->next_result; # Get info about the entire report $algorithm_type = $result->algorithm; # get info about the first hit $hit = $result->next_hit; $hit_name = $hit->name ; # get info about the first hsp of the first hit $hsp = $hit->next_hsp; $hsp_start = $hsp->query->start;
Running BLAST Locally • StandAloneBlast • Bio::Tools::Run::StandAloneBlast • Factory Objects • @params = ('program' => 'blastn', 'database' => 'ecoli.nt'); $factory = Bio::Tools::Run::StandAloneBlast->new(@params); • Advantages: • Private • Use Customized Local Resources • Avoid Network Problems
Examples • # Setting parameters similar to RemoteBlast$input = Bio::Seq->new(-id =>"test query", -seq =>"ACTAAGTGGGGG"); • $blast_report = $factory->blastall($input); • # Blast Report Object that directly accesses parserwhile (my $sbjct = $blast_report->next_hit){ while (my $hsp = $sbjct->next_hsp){ print $hsp->score . " " . $hsp->subject- >seqname . "\n"; } }
Format Conversion – Sequences Example • Use: Bio::SeqIO • Core Code: $in = Bio::SeqIO->new('-file' => "COG0001", '-format' => 'Fasta'); $out = Bio::SeqIO->new('-file' => ">COG0001.gen", '-format' => 'genbank'); while ( my $seq = $in->next_seq() ) { $out->write_seq($seq); }
Format Conversion – Alignments • Alignment formats supported: • INPUT: fasta, selex (HMMER), bl2seq, clustalw (.aln), msf (GCG), psi (PSI-BLAST), mase (Seaview), stockholm, prodom, water, phylip (interleaved), nexus, mega, meme • OUTPUT: fasta, clustalw, mase, selex, msf/gcg, and phylip (interleaved). • Next_aln( ) and write_aln( ) methods of the ‘Bio::AlignIO’ object are used
ClustalW and Profile Align • ClustalW using BioPerl • Clustalw program should be installed and environment variable ‘CLUSTALDIR’ set • Setting Parameters – Build a factory • Some parameters: 'ktuple', 'matrix', 'outfile', 'quiet‘ • Need reference to sequence array object (See example) • Align( ) and Profile_align( ) methods used
ClustalW – Example • Use Bio::SeqIO, Bio::Tools::Run::Alignment::Clustalw • Core code (Simple Align): @params = ('ktuple' => 2, 'matrix' => 'BLOSUM', 'outfile' => 'clustalw_out', 'quiet' => 1); $factory = Bio::Tools::Run::Alignment:: Clustalw->new(@params); $seq_array_ref = \@seq_array; $aln= $factory->align($seq_array_ref);
Smith Waterman Search • Smith Waterman pairwise alignment • Standard method for producing an optimal local alignment of two sequences • Auxilliary Bioperl-ext library required • SW algorithm implemented in C and incorporated into bioperl • Align_and_show() & Pairwise_alignment() in Bio::Tools::pSW module are methods used
Smith Waterman Search – Example • Use Bio::Tools::pSW, Bio::SeqIO, Bio::AlignIO • Core code: $factory = new Bio::Tools::pSW( '-matrix' => 'BLOSUM62', '-gap' => 12, '-ext' => 2); $aln = $factory->pairwise_alignment($seq_array[0], $seq_array[1]); my $alnout = new Bio::AlignIO(-format => 'msf', -fh => \*STDOUT); $alnout->write_aln($aln);
Smith Waterman Search • AlignIO object in previous slide – could also be used to print into a file • Use double loop to do all pairwise comparisons • More Info: Bio::Tools::pSW mapage