340 likes | 509 Views
Introduction to Objects. Motivation. Vast (and growing) repositories of data available genomic (sequence and annotation) http://genome.ucsc.edu http://www.ensembl.org literature http://www.ncbi.nih.gov/pubmed expression (microarray) http://www.ncbi.nih.gov/geo/index.cgi?qb=pro
E N D
Motivation • Vast (and growing) repositories of data available • genomic (sequence and annotation) http://genome.ucsc.edu http://www.ensembl.org • literature http://www.ncbi.nih.gov/pubmed • expression (microarray) http://www.ncbi.nih.gov/geo/index.cgi?qb=pro Still being sorted out 100 * 22,000 = 2,200,000 data points locally • structure http://www.ncbi.nih.gov/Structure/ Exellexis (Geoff Duyk) does 1000/year
Structure MMDB (Molecular Modeling DB) currently contains 15 000 structure entries, corresponding to 35 000 chains and 50 000 3D domains. And oh by the way…. the information was only contained in the literature. Therefore, in some instances there is strong interconnection between these data resources (but there are also many false/erroneous relationships). http://nar.oupjournals.org/cgi/content/full/30/1/249?ijkey=iGxcC9hHh9esk&keytype=ref&siteid=nar
Structure Example • Web page pex5 http://www.ncbi.nih.gov/Structure • Structure viewer • Data file
What does this all have to do with "Bioinformatics Techniques" • Accessing data • most of this data resides in databases • extracting/accessing data • manual • programatically –write programs to perform operations on sets of data that could not be performed any other way • download data (mirror data), process, analyze • query databases directly (for smaller sets) • "spider" data – write a program that "follows" html links to access data being served in web pages (HGMD) • locally generated data to contribute to community • pipeline processing
caBIG (cancer Biomedical Informatics Grid)cabig.nci.nih.gov • The cancer Biomedical Informatics Grid, or caBIG, is an informatics infrastructure that is connecting teams of cancer and biomedical researchers together to enable them to better develop and share tools and data in an open environment with common standards. caBIG is creating a voluntary virtual network (or grid) that links individuals and institutions, both nationally and internationally, effectively forming a World Wide Web of cancer research. • caBIG will allow researchers to answer research questions more rapidly and efficiently, thereby promising to accelerate progress in all aspects of cancer research– from etiologic research to prevention, early detection and treatment. Ultimately, because caBIG will provide a common unifying force that facilitates progress in cancer research and care, the most important beneficiaries will be cancer patients and the public at large.
cancer Biomedical Informatics Grid (caBIG) -- another model (63 Centers 2007) Analytical Services Model lends itself well to ideas such as "sharing" of workflows, automatic capture of workflows and autodetection of data, tools and analytical resources. However there is still a large gap between theory and reality. Development in Isolation Workflow Data Tools
Modules – a brief intro • A Perl module is a collection of subroutines and variables that typically implements some common functionality – and "packaged" to facilitate reusability • Modules exist to perform some of these tasks • access databases (Ensembl) • read data formats (GenBank records, BLAST files) • special/custom processing • generate an image of a database
File Handles and "strict" use strict; my $fh = &open_file("gene.gb"); &parseFile($fh); sub open_file { my $file_handel; open(FH,$_[0]); $file_handel = FH; return($file_handel); } sub parseFile { my $sfh = $_[0]; my $line = <$sfh>; print "line = $line\n"; }
Bareword "FH" not allowed while "strict subs" in use at fileHandle.pl line 10. Execution of fileHandle.pl aborted due to compilation errors.
use strict; use IO::Handle; my $fh = IO::Handle->new(); open($fh,"gene.gb"); &parseFile($fh); &t2($fh); sub parseFile { my $sfh = $_[0]; my $line = <$sfh>; print "line = $line\n"; } sub t2 { my $sfh = $_[0]; my $line = <$sfh>; print "line = $line\n"; }
Module Example • Statistics::ChisqIndep • http://www.cpan.org/ • http://www.cpan.org/modules/01modules.index.html • http://www.cpan.org/authors/id/Y/YU/YUNFANG/Statistics-ChisqIndep-0.1.tar.gz
Install (my Mac) • Download, uncompress, untar • make Makefile.pl • make • make test • sudo make install • Note: requires admin privilege (sort of) • Also requires: Statistics::Distributions
Install (nxclient) – no admin privaledges • mkdir modules • Download, extract Statistics::Distributions • perl Makefile.PL PREFIX=/user/eng/tbraun/modules • (replace tbraun with your user ID) • make • make test • make install
Statistics::ChisqIndep • Download, extract • Edit Makefile.PL, add the following line: use lib "/user/eng/tbraun/modules/lib/perl5/site_perl/5.10.0"; Replace "tbraun" with YOUR userID perl Makefile.PL PREFIX="/user/eng/tbraun" make make install
Perldoc ChisqIndep Statistics::ChisqIndep The module to perform chi-square test of independence (a.k.a. contingency tables) Synopsis #example for Statistics::ChisqIndep use strict; use Statistics::ChisqIndep; use POSIX; # input data in the form of the array of array references my @obs = ([15, 68, 83], [23,47,65]); my $chi = new Statistics::ChisqIndep; $chi->load_data(\@obs); # print the summary data along with the contingency table $chi->print_summary(); #print the contingency table only $chi->print_contingency_table();
Output Added: use lib "/user/eng/tbraun/modules/lib/perl5/site_perl/5.10.0"; $ perl terry-test.pl Rows: 2 Columns: 3 Degree of Freedom: 2 Total Count: 301 Chi-square Statistic: 4.5639003430935 p-value: 0.10208 Warning: some of the cell counts might be too low. 1 2 3 rtotal 1 15 68 83 166 (20.96) (63.42) (81.62) 2 23 47 65 135 (17.04) (51.58) (66.38) ctotal 38 115 148 301
CPAN – more info cpan faqs Cpan search (though I have not had good luck with the search – example: chi square) perldoc perllocal (display installed modules) cpan (perl module to install modules) Ex: cpan –i Statistics::ChisqIndep
Example of a Module Invocation Arguments From unix: grep pattern file – grep is the application, pattern is an invocation argument, and file is an invocation argument grep foreach arrayArray.pl foreach $i (@numbers) # i becomes a reference to an array foreach $j (@$i) # @$ dereferences i (an reference to an aray) DEMO grep foreach arrayArray.pl
Invocation Arguments (command line parameters) Run program: program.pl -flag1 -flag2 param1 param2 Ex: translate.pl sequence 50 10 Special variable -- all command line parameters are automatically stored in array @ARGV. Behaves just like a “normal” array
Invocation Arguments in Perl • @ARGV variables -- captures invocation variables when program is executed from command line #!/usr/bin/perl "The invocation arguments were:\n"; foreach $i (@ARGV) { print "$i\n"; }
How do we implement something like 'grep' that uses invocation arguments #!/usr/bin/perl # # grep.pl # example of invocation arguments -- $pattern = shift @ARGV; # get the pattern $file = shift @ARGV; # get the file name open(FH,$file); while($line = <FH>) { if($line =~ m/$pattern/) { print "$line"; } }
@INC -- how Perl "finds" things • the "search path" is given in a special variable called @INC perl –V : (lots of stuff) @INC: /home/tabraun/perl/bioperl-live//ensembl/modules /home/tabraun/perl/bioperl-live//bioperl-0.7 /home/tabraun/perl/bioperl-live//ensembl-external/modules /home/tabraun/perl/bioperl-live//ensembl-lite/modules /usr/lib/perl5/5.8.0/i386-linux-thread-multi /usr/lib/perl5/5.8.0 /usr/lib/perl5/site_perl/5.8.0/i386-linux-thread-multi /usr/lib/perl5/site_perl/5.8.0 /usr/lib/perl5/site_perl /usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.0 /usr/lib/perl5/vendor_perl /usr/lib/perl5/5.8.0/i386-linux-thread-multi /usr/lib/perl5/5.8.0 . ( ^ "Dot")
Sample Output DEMO grep.pl foreach arrayArray.pl ./grep.pl foreach arrayArray.pl foreach $i (@numbers) # i becomes a reference to an array foreach $j (@$i) # @$ dereferences i (an reference to an aray)
Invocation Arguments The "-n" flag grep -n foreach arrayArray.pl 19:foreach $i (@numbers) # i becomes a reference to an array 21: foreach $j (@$i) # @$ dereferences i (an reference to an aray) grep foreach arrayArray.pl -n 19:foreach $i (@numbers) # i becomes a reference to an array 21: foreach $j (@$i) # @$ dereferences i (an reference to an aray)
#!/usr/bin/perl # # grep1.pl # example of invocation arguments -- $line_nums = 0; $pattern = 0; $file = 0; # this ASSUMES correct ORDER of arguments while($param = shift @ARGV) { if($param eq "-n") { $line_nums = 1; # flag to be used to generate line numbers } elsif(!$pattern){ $pattern = $param; } elsif (!$file){ $file = $param; } else { print "Error\n"; exit(1); } } $count = 0; open(FH,$file); while($line = <FH>) { $count++; if($line =~ m/$pattern/) { if($line_nums) { print "$count: $line"; } else { print "$line"; } } } Added Complexity DEMO grep1.pl foreach arrayArray.pl
Perl Module: Getopt::Long #!/usr/bin/perl # # grep2.pl # example of invocation arguments -- use Getopt::Long; my $line_nums = 0; my $pattern = 0; my $file = 0; &GetOptions(”n" => \$line_nums, # n is command line argument, no value, $line_nums set to 1 "e=s" => \$pattern, # e is command line argument, S for string, stored in $pattern "f=s" => \$file); # f is command line argument, S for string, stored in $file, i for integer $count = 0; open(FH,$file); while($line = <FH>) { $count++; if($line =~ m/$pattern/) { if($line_nums) { print "$count: $line"; } else { print "$line"; } } }
man Getopt::Long Getopt::Long(3) perl v5.8.1 Getopt::Long(3) Perl Programmers Reference Guide Perl Programmers Reference Guid 2003-09-02 NAME Getopt::Long - Extended processing of command line options SYNOPSIS use Getopt::Long; my $data = "file.dat"; my $length = 24; my $verbose; $result = GetOptions ("length=i" => \$length, # numeric "file=s" => \$data, # string "verbose" => \$verbose); # flag DESCRIPTION The Getopt::Long module implements an extended getopt function called GetOptions(). This function adheres to the POSIX syntax for command…
Getopt::Long • Parses command line arguments off of @ARGV $result = GetOptions ("length=i" => \$length, # numeric "file=s" => \$data, # string "verbose" => \$verbose); # flag length is command line argument, expecting integer ($length is variable) file is command line argument, expecting string ($data is variable) verbose is command line argument (no values) ($verbose is variable, gets a 1) “option variables”
Sample Output DEMO ./grep2.pl –n –f arrayArray.pl –e foreach ./grep2.pl -n -f arrayArray.pl -e foreach 19: foreach $i (@numbers) # i becomes a reference to an array 21: foreach $j (@$i) # @$ dereferences i (an reference to an aray)
Building Larger Programs Write a subroutine called: print_formated_sequence - input is a sequence string, and line length -output: 1 ATTTTTTTTT TTTTTTTTTT TTTTTTTTTT TTTTTTTTAc cccccccccc 50 51 cccccccccc cccccccccc cccccccccc cccccccccc gggggggggg 100 101 gggggggggg gggggggggg gggggggggg gggggggggg gggggggggg 150 151 gggggggggg gggggggggg gggggggggg gggggggggg g 191 Now I use this subroutine in 100's of programs Later determine that is annoying how the first line is offset from second, etc. To fix this, I have to edit 100's of programs Need a way to save the code ONCE, and share with 100's of programs
Sharing Code Library of sequence processing subroutines in a file called "sequence_pro.pl" These subroutines may be accessed by "do": do "sequence_pro.pl"; This is the same as if the subroutine was in the new program. formatted_seq.pl: #!/usr/bin/perl # formatted_seq.pl do "sequence_pro.pl"; $seq = "ATGCCCCCCCGGGCG"; &print_formatted_sequence($seq,50);
Packages, Modules, Libraries (software) Library -- generic term for shared code, subroutines, variables Package -- Perl declaration (often called a Perl Module Library, or Perl Module, or Module) do “filename.**” -- incorporate code, subroutines, variables into a program require “filename.**” -- incorporate code, subroutines, variables into a program (with checking of duplication) use FILE -- incorporate code that has been declared as a module (File.pm) require “MySeq.pl”; use Seq; # Seq.pm is a module file