390 likes | 402 Views
Introduction to Perl & BioPerl Dr G. P. S. Raghava Bioinformatics Centre IMTECH, Chandigarh Email: raghava@imtech.res.in Web: http://imtech.res.in/raghava/. Perl. Practical Extraction and Report Language Created by Larry Wall Runs on just about every platform
E N D
Introduction to Perl & BioPerl Dr G. P. S. Raghava Bioinformatics Centre IMTECH, Chandigarh Email: raghava@imtech.res.in Web: http://imtech.res.in/raghava/
Perl • Practical Extraction and Report Language • Created by Larry Wall • Runs on just about every platform • Most popular on Unix/Linux systems • Excellent language for file and data processing
Simple Program On Unix, this is the location of the Perl interpreter #!/usr/local/bin/perl # This is a comment line. This program prints “Hello World.” # to the screen. print “Hello world.\n”; Comments start with # and end with the end of the line Newline character Program statements are terminated with semicolons
Control Structures #!/usr/local/bin/perl # Print out 0,1,2,3,4,5,6,7,8,9 # in this case, $x is local only to the loop because my is used for (my $x = 0; $x < 10; $x++) { print “$x”; if ($x < 9) { print “,”; } } print “\n”;
Control Structures #!/usr/local/bin/perl # Demonstrate the foreach loop, which goes through elements # in an array. my @users = (“bonzo”, “gorgon”, “pluto”, “sting”); foreach $user (@users) { print “$user is alright.\n”; }
Functions • Use sub to create a function. • No named formal parameters, assign @_ to local subroutine variables. #!/usr/local/bin/perl # Subroutine for calculating the maximum sub max { my $max = shift(@_); # shift removes the first value from @_ foreach $val (@_) { $max = $val if $max < $val; # Notice perl allows post ifs } return $max; } $high = max(1,5,6,7,8,2,4,9,3,4); print “High value is $high\n”;
Files • File handles are used to access files • open and close functions #!/usr/local/bin/perl # Open a file and print its contents to copy.txt my $filename = $ARGV[0]; open(MYFILE, “<$filename”); # < indicates read, > indicates write open(OUTPUT, “>copy.txt”); while ($line = <MYFILE>) { # The <> operator reads a line print OUTPUT $line; # no newline is needed, read from file } close MYFILE; # Parenthesis are optional
Regular Expressions • One of Perl’s strengths is pattern matching • Perl’s regular expression language is extremely powerful, but can be challenging to learn • Some examples follow…
Comma Separated Value Files #!/usr/local/bin/perl # Some simple code demonstrating how to use split and regular # expressions. This code extracts out values in a CSV file. my $filename = $ARGV[0]; open(INPUT, “<$filename”); while (<INPUT>) { chomp; # Remove terminating newline my @values = split /,/; # Split string in $_ where , exists print “The first value is: “ . $values[0] . “\n”; } close INPUT;
Objects • Perl supports object oriented programming • Constructor name is new • A class is really a special kind of package. • Objects are created with bless
Example Class Definition package Critter; # constructor sub new { my $objref = {}; # reference to an empty hash bless $objref; # make it an object in Critter class return $objref; # return the reference } # Instance method, first parameter is object reference sub display { my $self = shift; # just to demonstrate print “I’m a critter.\n”; } 1; # must end class with a true value Store in Critter.pm
Example Object Usage #!/usr/local/bin/perl use Critter; my $critter = new Critter; # create an object $critter->display; # display the object display $critter; # alternative notation
BioPerl (http://www.bioperl.org/) Defnition : It is a collection of perl modules that facilitate the development of perl script for bioinformatics. FACTS • Started in 1995 by open Bioinformatics Foundation • These are reusable codes (subroutine) • It does not include any ready to use program • It need basic knowledge of perl programming • Bioperl is an open source software that is still under active development • Biojava, Biophython, Biocorba, EMBOSS
What is BioPerl? • An ‘open source’ project • http://bio.perl.org or http://www.cpan.org • A loose international collaboration of biologist/programmers • Nobody (that I know of) gets paid for this • A collection of PERL modules and methods for doing a number of bioinformatics tasks • Think of it as subroutines to do biology • Consider it a ‘tool-box’: • If you need a hammer, it is in there. • If you need a 7/16” hex-head wrench, you might need to code that yourself.
What BioPerl isn’t • ‘Out of the box’ solutions to problems • You will have to know a little perl, and you will have to read documentation • Particularly well documented • Yes, there is documentation, but it can be difficult to see the ‘big picture’ - or sometimes the small picture • Fast • Generally the overhead involved in making perl objects (complicated ones) and loading in dependencies tends to make bioperl slow. • My own blast parser takes a small fraction of the time of the bioperl parser. Pro: mine is fast. Con: I had to write it and maintain it.
Application of Bioperl • Accessing sequence data from local and remote databases • Transforming formats of database/ file records • Manipulating individual sequences • Searching for ``similar'' sequences • Creating and manipulating sequence alignments • Searching for genes and other structures on genomic DNA • Developing machine readable sequence annotation
Accessing sequence data from Databases • Accessing data from remote databases • GenBank • Swissprot • EMBL • Indexing and accessing local databases • Create index for your file • Access the file by keyword
Transforming formats of database/ file records This allows to convert your sequence/alignment from one format to another format (eg. Readseq) • Change sequence format • From GCG, PIR, FASTA etc. • To FASTA,GCG. PIR etc. • Change format of multiple sequence a;ignment • FASTA, CLUSTAL-W, GCG etc. • GCG, FASTA CLUSTAL-w etc.
Manipulating sequence Bioperl contains many modules with functions for sequence analysis. • Sequence data manipulation: Display component, Reverse complement, Translate/Reverse translate from NT to Protein • Obtaining basic sequence statistics (Residues, MW, CODON) • Restriction enzyme mapping • Identifying amino acid cleavage sites • Manipulation by EMBOSS using BioPerl
Searching for ``similar'' sequences • Running BLAST locally (Standalone BLAST) • Running BLAST remotely • Parsing BLAST and FASTA reports • Parsing HMM reports
Creating and manipulating sequence alignments • Aligning 2 sequences with Smith-Waterman (pSW) • Aligning 2 sequences with Blast using bl2seq and AlignIO • Aligning multiple sequences (Clustalw.pm, TCoffee.pm) • Manipulating / displaying alignments (SimpleAlign)
Example 1 #!/usr/local/bin/perl # Collect documents from PubMed containing the term “Breast Cancer” # and print them. use Bio::Biblio; my $biblio = new Bio::Biblio; my $collection = $biblio->find(“breast cancer”); while ($collection->has_next) { # there are underlines before next print $collection->get_next; }
Example 2 #!/usr/local/bin/perl # Get a sequence from RefSeq by accession number use Bio::DB::RefSeq; $gb = new Bio::DB::RefSeq; $seq = $gb->get_Seq_by_acc(“NM_007304”); print $seq->seq();
A simple script #!/opt/perl/bin/perl -w #bioperl_gb_fetch.pl use strict; use Bio::DB::GenBank; my $gb = Bio::DB::GenBank->new(); my $seq_obj = $gb->get_Seq_by_acc(‘AF303112’); print $seq_obj->display_id(), “\n”; print $seq_obj->seq(),”\n”;
Changing Formats: #!/opt/perl/bin/perl -w #genbank_to_fasta.pl use Bio::SeqIO; my $input = Bio::SeqIO::new->(‘-file’ => $ARGV[0], ‘-format’ => ‘GenBank’); my $output = Bio::SeqIO::new->(‘-file’ => ‘>output.fasta’, ‘-format’ => ‘Fasta’); while (my $seq = $input->next_seq()){ $output->write_seq($seq) }
Parsing Blast Reports • One of the strengths of BioPerl is its ability to parse complex data structures. Like a blast report. • Unfortunately, there is a bit of arcane terminology. • Also, you have to ‘think like bioperl’, in order to figure out the syntax. • This next script might get you started
Parse BLAST output #!/opt/perl/bin/perl -w #bioperl_blast_parse.pl # program prints out query, and all hits with scores for each blast result use Bio::SearchIO; my $record = Bio::SearchIO->new(-format => ‘blast’, -file => $ARGV[0]); while (my $result = $record->next_result){ print “>”, $result->query_name, “ “, $result->query_description, “\n”; my $seen = 0; while (my $hit = $result->next_hit){ print “\t”, $hit->name, “\t”, $hit->bits, “\t”, $hit->significance, “\n”; $seen++ } if ($seen == 0 ) { print “No Hits Found\n” } }
SearchIO parsers: • SearchIO can parse (and reformat) several formats containing alignment or similarity data: • blast • xml formatted blast (still a little wonky) • psi-blast • exonerate • WABA • FASTA • HMMER • The interface is the same for all of these, making your life a little easier.
Some advanced BioPerl • What if you want to draw pretty images of blast reports? Bio::Graphics to the rescue: • Input - A blast report (single query): • Output: • Script in PERL13/blast_to_pngImage.pl
Some really advanced BioPerl • What if you want to display an entire genome with annotation? Grow your own genome project • We can do that all in BioPerl (and a mySQL or Oracle database): • GBrowse - the Generic Genome Browser • Allows any feature to be displayed on a reference sequence • Many different styles of ‘Glyphs’, so all features can be drawn in a different style • Allows user to zoom in and out on reference sequence • User can select which features to display • User can upload their own features to be displayed on your reference sequence • And more!!! • http://www.bch.msu.edu/cgi-bin/gbrowse • http://www.gmod.org/
BioPerl Pros and Cons • It can be a substantial investment in time to learn how to use bioperl properly • Once you get used to it, it is pretty simple to do some complicated things • There are a lot of nice tools in there, and (usually) somebody else takes care of fixing parsers when they break • BioPerl code is portable - if you give somebody a script, it will probably work on their system • BioPerl, unfortunately, has high overhead. Sometimes it is quicker and easier to write it yourself
So what is BioPerl? (continued…) • 551 modules (incl. 82 interface modules) • 37 module groups • 79,582 lines of code (223,310 lines total) • 144 lines of code per module • For More info: BioPerl Module Listing
mean 144 Some statistics
Downloading modules • Modules can be obtained from: • www.CPAN.org (Perl Modules) • www.BioPerl.org (BioPerl Modules) • Downloading modules from CPAN • Interactive mode • perl -MCPAN -e shell • Batch mode • use CPAN; • clean, install, make, recompile, test
Installing Modules • Steps for installing modules • Uncompress the module • Gunzip file.tar.gz • Tar –xvf file.tar • perl Makefile.PL • make • make test • make install
Directory Structure • BioPerl directory structure organization: • Bio/ BioPerl modules • models/ UML for BioPerl classes • t/ Perl built-in tests • t/data/ Data files used for the tests • scripts/ Reusable scripts that use BioPerl • scripts/contributed/ Contributed scripts not necessarily integrated into BioPerl. • doc/ "How To" files and the FAQ as XML
Platforms • HP-UX / Itanium ('ia64') 11.0 • HP-UX / PA-RISC 11.0 • MacOS • MacOS X • CygWin NT_5 v. 1.3.10 on Windows 2000 5.00 • Win32, WinNT i386 • IRIX64 6.5 SGI • Solaris 2.8 UltraSparc • OpenBSD 2.8 i386 • RedHat Linux 7.2 i686 • Linux i386
References • Programming Perl by Wall, Christiansen, and Schwartz (O’Reilly) • Learning Perl by Schwartz and Phoenix (O’Reilly) • Beginning Perl for Bioinformatics by Tisdall (O’Reilly) • http://www.perl.com • http://www.bioperl.org