MBG 8680 Perl Workshop

MBG 8680 Perl Workshop http://www.bioinformatics.wayne.edu Presented By: Daniel Liu ( danliu@genetics.wayne.edu )

What is Perl? • Perl is a versatile programming language • Perl has no cost… • Runs on Unix, Linux, Mac, Windows and more • Perl binaries and source are downloadable from http://www.perl.com • “An installation demo”

A very simple example
#!/usr/local/bin/perl
# The first line tells a Unix system where to find perl
# Lines that start with # are comments
# Scalar variables begin with $
$string = "This is a string";
$number = 42;
print "What is your name? ";
$name = <STDIN>;
chomp($name);  # remove the trailing newline from the input
if ($name eq "Dan") {
    print "Hello Dan, your secret number is $number\n";
} else {
    print "Hello $name, you don't have a secret number\n";
}

Strings • Strings can use single or double quotes • Double quotes allow for variables to be interpolated • We can easily represent DNA as strings • $DNA = 'ACGGGCATGCAAGCTAT'; • print $DNA; • It is easy to concatenate, translate, search, replace, and generate random strings. • “A transcription example”
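The transcription example mentioned on this slide is not included in the transcript. A minimal sketch of what it might look like follows, assuming the DNA string is the coding strand, so that transcription amounts to replacing T with U. The subroutine name transcribe is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: transcribe a DNA coding strand into RNA
# by replacing every T with U (tr/// works character by character).
sub transcribe {
    my ($dna) = @_;
    (my $rna = $dna) =~ tr/Tt/Uu/;
    return $rna;
}

my $DNA = 'ACGGGCATGCAAGCTAT';
my $RNA = transcribe($DNA);

# Double quotes interpolate variables; single quotes would print them literally.
print "DNA: $DNA\n";
print "RNA: $RNA\n";

# Concatenation uses the dot operator.
my $joined = $DNA . 'GGCC';
print "Joined: $joined\n";
```

The original $DNA is left unchanged because tr/// is applied to a copy.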

Numbers • Numbers can be: • Integers like 0, 4, -5 • Floating point (decimal) like 3.14 • Scientific (exponential) like 6.02E23 • Different bases like hex, octal and binary
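The number formats listed above can be written directly as literals. This is a short sketch; the variable names are chosen here for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $integer    = -5;
my $float      = 3.14;
my $scientific = 6.02E23;      # exponential notation
my $hex        = 0xFF;         # hexadecimal literal, leading 0x
my $octal      = 0377;         # octal literal, leading 0
my $binary     = 0b11111111;   # binary literal, leading 0b

# All three base literals hold the same value, 255.
print "$hex $octal $binary\n";
```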

Arrays • Arrays are ordered collections of values indexed by position starting with 0 • Arrays are indicated by the ‘at sign’ @ • @myarray = ('ACGT', $fragment2, 'GGCAA'); • In this case, $myarray[1] is $fragment2 • Arrays can be copied • @newarray = @myarray; • Array elements can be counted • $elements = @myarray; • Arrays can be sorted • @sorted = sort(@myarray);
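The array operations above can be put together in one short script. The value given to $fragment2 is an assumption made for this sketch, since the slide does not define it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $fragment2 = 'TTGCA';    # assumed value, not given on the slide
my @myarray   = ('ACGT', $fragment2, 'GGCAA');

# A single element is a scalar, so it is written with $ rather than @.
print "Second element: $myarray[1]\n";

my @newarray = @myarray;       # copies every element
my $elements = @myarray;       # an array in scalar context gives its length
my @sorted   = sort @myarray;  # default sort is by string comparison

print "Count: $elements\n";
print "Sorted: @sorted\n";
```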

Regular Expressions • Regular expressions are used for pattern matching • $DNA = 'ACGTGTCGATCGT'; • if ($DNA =~ /GATC/) { • print $&; • } • We call “=~” the binding operator: it applies the pattern on the right to the string on the left • Will print 'GATC' if the substring is found • Can also be used for filtering • # Remove white space in $string • $string =~ s/\s//g;
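A slightly fuller sketch of matching and filtering follows. The helper names find_site and strip_whitespace are hypothetical. The match position is read from the built-in @- array, and \Q...\E is used so that the search text is treated literally.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the 0-based position of the first
# occurrence of $site in $dna, or -1 if it is not found.
sub find_site {
    my ($dna, $site) = @_;
    return $dna =~ /\Q$site\E/ ? $-[0] : -1;
}

# Hypothetical helper: return a copy of the string with all whitespace removed.
sub strip_whitespace {
    my ($string) = @_;
    $string =~ s/\s//g;
    return $string;
}

my $DNA = 'ACGTGTCGATCGT';
my $pos = find_site($DNA, 'GATC');
print "GATC found at position $pos\n" if $pos >= 0;

print strip_whitespace("ACGT GTCG\nATCGT\n"), "\n";
```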

Programming Loops • if (condition) { } else { } • if (condition) { } elsif (condition) { } else { } • while (condition) { } • for ( init; test; incr ) { } • example: for ($i=0; $i < 10; $i++) { } • foreach $i (@myarray) { }
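The control structures above can be combined as in the following sketch. The helper names classify_length and count_gc, the length thresholds, and the sample sequences are all made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: label a sequence by its length using if / elsif / else.
sub classify_length {
    my ($seq) = @_;
    my $len = length($seq);
    if ($len < 5) {
        return 'short';
    } elsif ($len < 10) {
        return 'medium';
    } else {
        return 'long';
    }
}

# Hypothetical helper: count G and C bases with a C-style for loop.
sub count_gc {
    my ($seq) = @_;
    my $count = 0;
    for (my $i = 0; $i < length($seq); $i++) {
        my $base = substr($seq, $i, 1);
        $count++ if $base eq 'G' || $base eq 'C';
    }
    return $count;
}

# foreach visits each array element in turn.
my @myarray = ('ACGT', 'GGCAATT', 'ACGTACGTACGT');
foreach my $seq (@myarray) {
    print "$seq: ", classify_length($seq), ", GC count ", count_gc($seq), "\n";
}
```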

Programming Functions (Subroutines) • Best to think of as a program within a program • sub addACGT { • my ($dna) = @_; • $dna .= 'ACGT'; • return $dna; • } • Variables declared with ‘my’ are local to the enclosing block and cannot be accessed outside that function • Pass by value vs. pass by reference
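The difference between pass by value and pass by reference can be sketched as follows. The first subroutine works on a copy of its argument; the second, hypothetically named addACGT_in_place, receives a reference and changes the caller's variable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pass by value: the argument is copied into $dna, so the caller's
# variable is not changed.
sub addACGT {
    my ($dna) = @_;
    $dna .= 'ACGT';
    return $dna;
}

# Pass by reference (hypothetical variant): the caller passes \$variable
# and the subroutine modifies it through the reference.
sub addACGT_in_place {
    my ($dna_ref) = @_;
    $$dna_ref .= 'ACGT';
    return;
}

my $seq    = 'GGCC';
my $longer = addACGT($seq);
print "After pass by value:     seq=$seq longer=$longer\n";

addACGT_in_place(\$seq);
print "After pass by reference: seq=$seq\n";
```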

Debugging • To start in debug mode, use perl -d • To list the help menu, type ‘h’ inside the debugger; see also ‘man perldebug’ • Useful commands: • Printing a variable: p $dna • Next line: n • Single step (enters subroutines): s • Display a window of code: w (v in Perl 5.8 and later) • To set a breakpoint: b • To continue: c • To watch a variable: W (w in Perl 5.8 and later)

Modules and BioPerl • Why reinvent the wheel when someone has written a module that does what you want? • http://www.cpan.org • We’ll see examples of some later… • BioPerl is a toolkit of Perl modules useful in building bioinformatics solutions in Perl • http://www.bioperl.org • Useful for parsing sequences, report parsing, annotations, running external programs and more!

Beginning Perl for Bioinformatics Functions • On the genetics server, the BeginPerlBioinfo.pm module has been installed and made available • use BeginPerlBioinfo; • This module includes all the functions defined in the book “Beginning Perl for Bioinformatics” by James Tisdall • You can view this file on genetics.wayne.edu • /usr/local/lib/perl5/5.6.1/BeginPerlBioinfo.pm

Design Philosophy • See if someone else already solved the problem • Start with ‘pseudocode’ • Identify inputs, outputs, and design • Edit – Run – Revise (and Save) • Comment your code • When in doubt, use a debugger • Use test cases for correctness, logic and boundaries

Automating Tasks (automate) • Problem • Want to automate a repetitive task • Input • Sequences in FASTA format • Output • Protein Sequences • Design • Identify input files • Open and evaluate input • If input criteria not met, skip to the next input • Otherwise translate into protein
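The workshop script for this task is not part of the transcript. The design above might be sketched as follows, under several assumptions: input is read from an already opened filehandle, the "input criteria" are taken to be a sequence made only of A, C, G and T whose length is a multiple of three, and translation uses the standard genetic code in frame 1. All subroutine names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the standard genetic code from the NCBI table 1 ordering (T, C, A, G).
my @bases  = qw(T C A G);
my $aminos = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon_table;
my $index = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $codon_table{"$b1$b2$b3"} = substr($aminos, $index++, 1);
        }
    }
}

# Translate an upper-case DNA string codon by codon ('*' marks a stop).
sub translate_dna {
    my ($dna) = @_;
    my $protein = '';
    for (my $pos = 0; $pos + 3 <= length($dna); $pos += 3) {
        $protein .= $codon_table{ substr($dna, $pos, 3) };
    }
    return $protein;
}

# Read FASTA records from a filehandle; returns a list of [header, sequence].
sub read_fasta {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>(.*)/) {
            push @records, [$1, ''];
        } elsif (@records) {
            $line =~ s/\s//g;
            $records[-1][1] .= uc $line;
        }
    }
    return @records;
}

# Translate every record that meets the assumed criteria; skip the rest.
sub translate_fasta {
    my ($fh) = @_;
    my %proteins;
    for my $record (read_fasta($fh)) {
        my ($name, $dna) = @$record;
        next unless $dna =~ /^[ACGT]+$/ && length($dna) % 3 == 0;
        $proteins{$name} = translate_dna($dna);
    }
    return %proteins;
}

# Demo on an in-memory "file"; a real run would open each input file instead.
my $fasta = ">seq1\nATGGCC\nTAA\n>bad\nATGN\n";
open my $demo_fh, '<', \$fasta or die "Cannot open in-memory data: $!";
my %result = translate_fasta($demo_fh);
close $demo_fh;
print "$_: $result{$_}\n" for sort keys %result;
```

To process many files, the same translate_fasta call could be placed inside a foreach loop over the file names.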

Mouse %GC Example (mouse) • Problem • Want to summarize data from a specific text format • Input • Map density table from Ensembl • Output • Summations of selected criteria (%gc) • Design • Open file and split on a delimiter • Keep a running total for %gc • When there is no more input, print the results in a human readable format
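The transcript does not show the layout of the Ensembl table, so the following is only a sketch of the design above. It assumes a tab-delimited file with a header line and the %GC value in a known column; the subroutine name summarize_gc and the sample data are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: sum the numeric values found in one column of a
# tab-delimited table. Returns the running total and the number of rows used.
sub summarize_gc {
    my ($fh, $gc_column) = @_;
    my ($total, $count) = (0, 0);
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;               # skip blank lines
        my @fields = split /\t/, $line;         # split on the delimiter
        my $gc = $fields[$gc_column];
        # Non-numeric cells (such as the header row) are skipped.
        next unless defined $gc && $gc =~ /^\d+(?:\.\d+)?$/;
        $total += $gc;
        $count++;
    }
    return ($total, $count);
}

# Made-up sample data standing in for the real table.
my $table = "chr\tstart\tgc\n1\t0\t40.0\n1\t1000\t42.0\n";
open my $table_fh, '<', \$table or die "Cannot open in-memory data: $!";
my ($gc_total, $gc_rows) = summarize_gc($table_fh, 2);
close $table_fh;

# Print the results in a human-readable format.
if ($gc_rows > 0) {
    printf "Rows: %d  Total %%GC: %.1f  Mean %%GC: %.2f\n",
        $gc_rows, $gc_total, $gc_total / $gc_rows;
}
```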

Automating NCBI Example (ncbi) • Problem • Want to download a large number of NCBI records • Input • Search term (just as you would use at Entrez) • Output • GenBank, FASTA etc… • Design • eutils - http://www.ncbi.nlm.nih.gov/entrez/eutils • Customize example script to fit your query • Use the ‘at’ command to schedule the job to run during off-peak hours, otherwise you may be blocked!
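The example script itself is not in the transcript. As a rough sketch of the first step, the code below only builds an ESearch request URL and does not contact NCBI. The helper name esearch_url is hypothetical, and the base address used is the current E-utilities host, which differs from the older URL on the slide.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build an E-utilities ESearch URL for a database
# and an Entrez-style search term. Characters outside the unreserved
# set are percent-encoded.
sub esearch_url {
    my ($db, $term, $retmax) = @_;
    my $base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi';
    (my $escaped = $term) =~ s/([^A-Za-z0-9_.~-])/sprintf('%%%02X', ord($1))/ge;
    return "$base?db=$db&term=$escaped&retmax=$retmax";
}

my $url = esearch_url('nucleotide', 'mouse[orgn] AND p53', 20);
print "$url\n";

# Fetching could then be done with LWP::Simple (requires the LWP module):
#   use LWP::Simple;
#   my $xml = get($url);
```

The XML returned by ESearch lists record IDs, which a follow-up EFetch request can turn into GenBank or FASTA records.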

Parsing GenBank Annotations (genbank) • Problem • Want to find which annotations contain a keyword • Input • GenBank records delimited by ‘//’ in one file • Output • List of byte offsets for the records with a hit • Design • Load search terms and annotations (per record) • For each record, see if any keywords are present • If found, print the offset of the record • If not found, process the next entry until finished
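One way to sketch this design is to set the input record separator to the ‘//’ line so that each read returns a whole record, and to call tell before each read to get the record's byte offset. The subroutine name find_matching_offsets, the case-insensitive matching, and the sample records are assumptions of this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the byte offsets of all records that
# contain at least one of the keywords (case-insensitive).
sub find_matching_offsets {
    my ($fh, @keywords) = @_;
    my @offsets;
    local $/ = "//\n";                  # read one '//'-terminated record at a time
    while (1) {
        my $offset = tell($fh);         # offset of the record about to be read
        my $record = <$fh>;
        last unless defined $record;
        for my $keyword (@keywords) {
            if (index(lc($record), lc($keyword)) >= 0) {
                push @offsets, $offset;
                last;                   # one hit per record is enough
            }
        }
    }
    return @offsets;
}

# Made-up, heavily shortened records standing in for a real GenBank file.
my $library = "LOCUS A\nDEFINITION kinase\n//\n"
            . "LOCUS B\nDEFINITION other\n//\n"
            . "LOCUS C\nDEFINITION Kinase 2\n//\n";
open my $gb_fh, '<', \$library or die "Cannot open in-memory data: $!";
my @hits = find_matching_offsets($gb_fh, 'kinase');
close $gb_fh;
print "Offsets with a hit: @hits\n";
```

A saved offset can later be passed to seek to jump straight to that record without rereading the file.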

A Quick BioPerl Example • Problem • Want to parse UniGene clusters into a database • Input • UniGene records delimited by ‘//’ in one file • Output • Insert statements to be used with the database • Design • Can use Bio::Cluster::UniGene and Bio::ClusterIO • A while loop can parse each record • Use BioPerl functions to parse out the data • Write out each insert statement before processing the next record

More Uses • Web Automation (LWP) • This is a module that contains numerous subroutines for automating web tasks • Examples include automatic processing of web forms, web spidering, and web mirroring • See ‘Perl & LWP’ by Sean Burke • Database Interaction (DBI/DBD) • These are modules used for interaction with a database such as Oracle or MySQL • See ‘Programming the Perl DBI’ by Tim Bunce • If you can imagine it, it is probably out there!

More Resources • ‘Learning Perl’ by Randal L. Schwartz • ‘Perl By Example’ by Ellie Quigley • ‘Perl Cookbook’ by Christiansen & Torkington • ‘Beginning Perl for Bioinformatics’ by James Tisdall • http://www.perl.com for source and binaries • http://www.perldoc.com for Perl Documentation • http://www.bioperl.org for BioPerl • http://www.cpan.org for modules • http://cs.camden.rutgers.edu/perl/ for a tutorial • http://paul.rutgers.edu/~mcgrew/perltutor/ for a tutorial

Questions? • Dan Liu danliu@wayne.edu • Dan’s Top 5 Tips! • http://bioinformatics.org • http://www.bio-itworld.com/ • http://www.bio-mirror.net/ • http://workbench.sdsc.edu/ • http://gchelpdesk.ualberta.ca/news/news.php • Have a great summer!!!
