1 / 39

Introduction to Perl & BioPerl Dr G. P. S. Raghava Bioinformatics Centre IMTECH, Chandigarh

Introduction to Perl & BioPerl Dr G. P. S. Raghava Bioinformatics Centre IMTECH, Chandigarh Email: raghava@imtech.res.in Web: http://imtech.res.in/raghava/. Perl. Practical Extraction and Report Language Created by Larry Wall Runs on just about every platform

oneilv
Download Presentation

Introduction to Perl & BioPerl Dr G. P. S. Raghava Bioinformatics Centre IMTECH, Chandigarh

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Introduction to Perl & BioPerl Dr G. P. S. Raghava Bioinformatics Centre IMTECH, Chandigarh Email: raghava@imtech.res.in Web: http://imtech.res.in/raghava/

  2. Perl • Practical Extraction and Report Language • Created by Larry Wall • Runs on just about every platform • Most popular on Unix/Linux systems • Excellent language for file and data processing

  3. Simple Program On Unix, this is the location of the Perl interpreter #!/usr/local/bin/perl # This is a comment line. This program prints “Hello World.” # to the screen. print “Hello world.\n”; Comments start with # and end with the end of the line Newline character Program statements are terminated with semicolons

  4. Control Structures #!/usr/local/bin/perl # Print out 0,1,2,3,4,5,6,7,8,9 # in this case, $x is local only to the loop because my is used for (my $x = 0; $x < 10; $x++) { print “$x”; if ($x < 9) { print “,”; } } print “\n”;

  5. Control Structures #!/usr/local/bin/perl # Demonstrate the foreach loop, which goes through elements # in an array. my @users = (“bonzo”, “gorgon”, “pluto”, “sting”); foreach $user (@users) { print “$user is alright.\n”; }

  6. Functions • Use sub to create a function. • No named formal parameters, assign @_ to local subroutine variables. #!/usr/local/bin/perl # Subroutine for calculating the maximum sub max { my $max = shift(@_); # shift removes the first value from @_ foreach $val (@_) { $max = $val if $max < $val; # Notice perl allows post ifs } return $max; } $high = max(1,5,6,7,8,2,4,9,3,4); print “High value is $high\n”;

  7. Files • File handles are used to access files • open and close functions #!/usr/local/bin/perl # Open a file and print its contents to copy.txt my $filename = $ARGV[0]; open(MYFILE, “<$filename”); # < indicates read, > indicates write open(OUTPUT, “>copy.txt”); while ($line = <MYFILE>) { # The <> operator reads a line print OUTPUT $line; # no newline is needed, read from file } close MYFILE; # Parenthesis are optional

  8. Regular Expressions • One of Perl’s strengths is pattern matching • Perl’s regular expression language is extremely powerful, but can be challenging to learn • Some examples follow…

  9. Comma Separated Value Files #!/usr/local/bin/perl # Some simple code demonstrating how to use split and regular # expressions. This code extracts out values in a CSV file. my $filename = $ARGV[0]; open(INPUT, “<$filename”); while (<INPUT>) { chomp; # Remove terminating newline my @values = split /,/; # Split string in $_ where , exists print “The first value is: “ . $values[0] . “\n”; } close INPUT;

  10. Objects • Perl supports object oriented programming • Constructor name is new • A class is really a special kind of package. • Objects are created with bless

  11. Example Class Definition package Critter; # constructor sub new { my $objref = {}; # reference to an empty hash bless $objref; # make it an object in Critter class return $objref; # return the reference } # Instance method, first parameter is object reference sub display { my $self = shift; # just to demonstrate print “I’m a critter.\n”; } 1; # must end class with a true value Store in Critter.pm

  12. Example Object Usage #!/usr/local/bin/perl use Critter; my $critter = new Critter; # create an object $critter->display; # display the object display $critter; # alternative notation

  13. BioPerl (http://www.bioperl.org/) Defnition : It is a collection of perl modules that facilitate the development of perl script for bioinformatics. FACTS • Started in 1995 by open Bioinformatics Foundation • These are reusable codes (subroutine) • It does not include any ready to use program • It need basic knowledge of perl programming • Bioperl is an open source software that is still under active development • Biojava, Biophython, Biocorba, EMBOSS

  14. What is BioPerl? • An ‘open source’ project • http://bio.perl.org or http://www.cpan.org • A loose international collaboration of biologist/programmers • Nobody (that I know of) gets paid for this • A collection of PERL modules and methods for doing a number of bioinformatics tasks • Think of it as subroutines to do biology • Consider it a ‘tool-box’: • If you need a hammer, it is in there. • If you need a 7/16” hex-head wrench, you might need to code that yourself.

  15. What BioPerl isn’t • ‘Out of the box’ solutions to problems • You will have to know a little perl, and you will have to read documentation • Particularly well documented • Yes, there is documentation, but it can be difficult to see the ‘big picture’ - or sometimes the small picture • Fast • Generally the overhead involved in making perl objects (complicated ones) and loading in dependencies tends to make bioperl slow. • My own blast parser takes a small fraction of the time of the bioperl parser. Pro: mine is fast. Con: I had to write it and maintain it.

  16. Application of Bioperl • Accessing sequence data from local and remote databases • Transforming formats of database/ file records • Manipulating individual sequences • Searching for ``similar'' sequences • Creating and manipulating sequence alignments • Searching for genes and other structures on genomic DNA • Developing machine readable sequence annotation

  17. Accessing sequence data from Databases • Accessing data from remote databases • GenBank • Swissprot • EMBL • Indexing and accessing local databases • Create index for your file • Access the file by keyword

  18. Transforming formats of database/ file records This allows to convert your sequence/alignment from one format to another format (eg. Readseq) • Change sequence format • From GCG, PIR, FASTA etc. • To FASTA,GCG. PIR etc. • Change format of multiple sequence a;ignment • FASTA, CLUSTAL-W, GCG etc. • GCG, FASTA CLUSTAL-w etc.

  19. Manipulating sequence Bioperl contains many modules with functions for sequence analysis. • Sequence data manipulation: Display component, Reverse complement, Translate/Reverse translate from NT to Protein • Obtaining basic sequence statistics (Residues, MW, CODON) • Restriction enzyme mapping • Identifying amino acid cleavage sites • Manipulation by EMBOSS using BioPerl

  20. Searching for ``similar'' sequences • Running BLAST locally (Standalone BLAST) • Running BLAST remotely • Parsing BLAST and FASTA reports • Parsing HMM reports

  21. Creating and manipulating sequence alignments • Aligning 2 sequences with Smith-Waterman (pSW) • Aligning 2 sequences with Blast using bl2seq and AlignIO • Aligning multiple sequences (Clustalw.pm, TCoffee.pm) • Manipulating / displaying alignments (SimpleAlign)

  22. Example 1 #!/usr/local/bin/perl # Collect documents from PubMed containing the term “Breast Cancer” # and print them. use Bio::Biblio; my $biblio = new Bio::Biblio; my $collection = $biblio->find(“breast cancer”); while ($collection->has_next) { # there are underlines before next print $collection->get_next; }

  23. Example 2 #!/usr/local/bin/perl # Get a sequence from RefSeq by accession number use Bio::DB::RefSeq; $gb = new Bio::DB::RefSeq; $seq = $gb->get_Seq_by_acc(“NM_007304”); print $seq->seq();

  24. A simple script #!/opt/perl/bin/perl -w #bioperl_gb_fetch.pl use strict; use Bio::DB::GenBank; my $gb = Bio::DB::GenBank->new(); my $seq_obj = $gb->get_Seq_by_acc(‘AF303112’); print $seq_obj->display_id(), “\n”; print $seq_obj->seq(),”\n”;

  25. Changing Formats: #!/opt/perl/bin/perl -w #genbank_to_fasta.pl use Bio::SeqIO; my $input = Bio::SeqIO::new->(‘-file’ => $ARGV[0], ‘-format’ => ‘GenBank’); my $output = Bio::SeqIO::new->(‘-file’ => ‘>output.fasta’, ‘-format’ => ‘Fasta’); while (my $seq = $input->next_seq()){ $output->write_seq($seq) }

  26. Parsing Blast Reports • One of the strengths of BioPerl is its ability to parse complex data structures. Like a blast report. • Unfortunately, there is a bit of arcane terminology. • Also, you have to ‘think like bioperl’, in order to figure out the syntax. • This next script might get you started

  27. Parse BLAST output #!/opt/perl/bin/perl -w #bioperl_blast_parse.pl # program prints out query, and all hits with scores for each blast result use Bio::SearchIO; my $record = Bio::SearchIO->new(-format => ‘blast’, -file => $ARGV[0]); while (my $result = $record->next_result){ print “>”, $result->query_name, “ “, $result->query_description, “\n”; my $seen = 0; while (my $hit = $result->next_hit){ print “\t”, $hit->name, “\t”, $hit->bits, “\t”, $hit->significance, “\n”; $seen++ } if ($seen == 0 ) { print “No Hits Found\n” } }

  28. SearchIO parsers: • SearchIO can parse (and reformat) several formats containing alignment or similarity data: • blast • xml formatted blast (still a little wonky) • psi-blast • exonerate • WABA • FASTA • HMMER • The interface is the same for all of these, making your life a little easier.

  29. Some advanced BioPerl • What if you want to draw pretty images of blast reports? Bio::Graphics to the rescue: • Input - A blast report (single query): • Output: • Script in PERL13/blast_to_pngImage.pl

  30. Some really advanced BioPerl • What if you want to display an entire genome with annotation? Grow your own genome project • We can do that all in BioPerl (and a mySQL or Oracle database): • GBrowse - the Generic Genome Browser • Allows any feature to be displayed on a reference sequence • Many different styles of ‘Glyphs’, so all features can be drawn in a different style • Allows user to zoom in and out on reference sequence • User can select which features to display • User can upload their own features to be displayed on your reference sequence • And more!!! • http://www.bch.msu.edu/cgi-bin/gbrowse • http://www.gmod.org/

  31. GBrowse (Generic Genome Browser)

  32. BioPerl Pros and Cons • It can be a substantial investment in time to learn how to use bioperl properly • Once you get used to it, it is pretty simple to do some complicated things • There are a lot of nice tools in there, and (usually) somebody else takes care of fixing parsers when they break • BioPerl code is portable - if you give somebody a script, it will probably work on their system • BioPerl, unfortunately, has high overhead. Sometimes it is quicker and easier to write it yourself

  33. So what is BioPerl? (continued…) • 551 modules (incl. 82 interface modules) • 37 module groups • 79,582 lines of code (223,310 lines total) • 144 lines of code per module • For More info: BioPerl Module Listing

  34. mean 144 Some statistics

  35. Downloading modules • Modules can be obtained from: • www.CPAN.org (Perl Modules) • www.BioPerl.org (BioPerl Modules) • Downloading modules from CPAN • Interactive mode • perl -MCPAN -e shell • Batch mode • use CPAN; • clean, install, make, recompile, test

  36. Installing Modules • Steps for installing modules • Uncompress the module • Gunzip file.tar.gz • Tar –xvf file.tar • perl Makefile.PL • make • make test • make install

  37. Directory Structure • BioPerl directory structure organization: • Bio/ BioPerl modules • models/ UML for BioPerl classes • t/ Perl built-in tests • t/data/ Data files used for the tests • scripts/ Reusable scripts that use BioPerl • scripts/contributed/ Contributed scripts not necessarily integrated into BioPerl. • doc/ "How To" files and the FAQ as XML

  38. Platforms • HP-UX / Itanium ('ia64') 11.0 • HP-UX / PA-RISC 11.0 • MacOS • MacOS X • CygWin NT_5 v. 1.3.10 on Windows 2000 5.00 • Win32, WinNT i386 • IRIX64 6.5 SGI • Solaris 2.8 UltraSparc • OpenBSD 2.8 i386 • RedHat Linux 7.2 i686 • Linux i386

  39. References • Programming Perl by Wall, Christiansen, and Schwartz (O’Reilly) • Learning Perl by Schwartz and Phoenix (O’Reilly) • Beginning Perl for Bioinformatics by Tisdall (O’Reilly) • http://www.perl.com • http://www.bioperl.org

More Related