Managing complexity (Advanced Perl)
Using Perl for specific tasks with help from Bioperl and others
Login • Username: bioinfouser • Password: loginbioinfo
Goals
• I assume you already know Perl basics, so we will cover some more advanced features
• Learn how to write OO code
• More flexible modules
• Understand other modules
• Some APIs that you may need
  • Bioperl
  • Perl DBI
What I assume you already know • Scalars • Arrays • Hashes • Control structures (if-then, for, foreach, while, etc.) • File IO
Managing complexity
• By managing complexity we make hard tasks easy(er)
• Perl itself does this
  • Regular expressions, text manipulation
• Extensions (modules) do this
• May come at the expense of execution speed
  • You may not care
• Consider the big picture
  • Development time
  • Errors
  • Extremely custom software
  • Some things need speed
How complex is it now?
• Perl is very compact compared with human languages
• Perl is large compared with other programming languages
• TMTOWTDI (There's More Than One Way To Do It)
• Perl has approximately 233 reserved words
• Java has approximately 47 reserved words
• Both are easy to learn but harder to use effectively
General practices
• Always use #!/usr/bin/perl -w or use warnings;
• Consider use strict; for scripts longer than 10 lines
• You can't have too many comments
  • # for ordinary comments
  • =head1 ... =cut for POD documentation blocks
  • perldoc to read the POD
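The practices above can be combined in one small script. This is only a sketch: the gc_content subroutine and its POD text are hypothetical, chosen to show where the pragmas, the comments, and the POD block go.

```perl
#!/usr/bin/perl
use strict;
use warnings;

=head1 NAME

gc_content - report the GC fraction of a DNA string (hypothetical example)

=head1 SYNOPSIS

    my $fraction = gc_content("ATGC");

=cut

# Return the fraction of G and C characters in a sequence.
sub gc_content {
    my ($sequence) = @_;
    return 0 unless length($sequence);     # avoid dividing by zero
    my $gc = ($sequence =~ tr/GCgc//);     # tr returns how many characters matched
    return $gc / length($sequence);
}

print gc_content("ATGC"), "\n";            # prints 0.5
```

Running perldoc on the script file shows the NAME and SYNOPSIS sections; Perl itself skips everything between =head1 and =cut.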
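Getting values into the program or subroutine
• Arguments behave as pass by value once you copy them out of @_ (the elements of @_ themselves are aliases to the caller's variables)
• A scalar can hold a reference (a "pointer") to an array, hash, function, etc.
• The arguments to a function arrive in a special array called @_ (command-line arguments to a program arrive in @ARGV)
• Three equivalent ways to get the first argument:
  • my $first_value = shift @_;
  • my $first_value = $_[0];
  • my $first_value = shift;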
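A short sketch of argument handling may help here. The subroutine names are made up for illustration. It shows three ways to read the first argument, and the difference between copying an argument and writing to $_[0] directly.

```perl
use strict;
use warnings;

# Three equivalent ways to read the first argument.
sub first_by_explicit_shift { my $first = shift @_; return $first; }
sub first_by_index          { my $first = $_[0];    return $first; }
sub first_by_shift          { my $first = shift;    return $first; }

# Copying into a "my" variable leaves the caller's data alone...
sub double_copy  { my ($n) = @_; $n *= 2; return $n; }
# ...whereas assigning to $_[0] changes the caller's variable.
sub double_alias { $_[0] *= 2; return; }

my $value   = 21;
my $doubled = double_copy($value);   # $value is still 21 here
double_alias($value);                # $value is now 42

print first_by_index("a", "b"), "\n";   # prints a
print "$doubled $value\n";              # prints 42 42
```

Copying arguments into lexical variables on the first line of a subroutine is the usual habit, because it gives the pass-by-value behaviour most people expect.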
References
  my @array = ("one", "two", "three", "four");
  function_call(@array);
  function_call(\@array);
  function_call(["one", "two", "three"]);

  sub function_call {
      my $passed = shift @_;
      print $passed, "\n";
  }
Output
  one
  ARRAY(0x80601a0)
  ARRAY(0x804c9a0)
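To get at the data behind a reference, it has to be dereferenced. The sketch below reuses the three calls from the slide with a hypothetical describe subroutine that checks what it received.

```perl
use strict;
use warnings;

# Return a short description of whatever was passed in.
sub describe {
    my $passed = shift;
    if (ref($passed) eq 'ARRAY') {
        my $count = scalar @{$passed};                      # dereference the whole array
        return "array of $count, first is $passed->[0]";    # dereference one element
    }
    return "scalar $passed";
}

my @array = ("one", "two", "three", "four");
print describe(@array), "\n";                    # list is flattened: scalar one
print describe(\@array), "\n";                   # array of 4, first is one
print describe(["one", "two", "three"]), "\n";   # array of 3, first is one
```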
Debugging complex data structures
• Print the reference
  • It tells you a little: the type and address, e.g. ARRAY(0x80601a0)
• Use the Data::Dumper module
  • It gives you a snapshot of the whole data structure
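A minimal sketch of both debugging approaches, using a nested structure built from the ATP7B values that appear later in these slides. The hash layout itself is an assumption made for illustration.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;   # print hash keys in a stable order
$Data::Dumper::Indent   = 1;   # compact indentation

my %gene = (
    name    => "ATP7B",
    aliases => ["Wilson disease-associated protein", "Copper-transporting ATPase 2"],
    refs    => { RefSeq => "NM_000053", LocusLink => 540 },
);

print \%gene, "\n";            # only the type and address, e.g. HASH(0x...)

my $dump = Dumper(\%gene);     # the whole structure as Perl source text
print $dump;
```

Passing a reference to Dumper, rather than the hash itself, keeps the output as one structure instead of a flat list of keys and values.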
Regular expressions
• Not Perl specific
• Very useful
• What they do:
  • String comparisons
  • String substitutions
  • Substring selection
Regex
  $string =~ /find/      "find" anywhere in the string (an optional m may precede the pattern: m/find/)
  $string =~ /find$/     "find" at the end of the string
  $string =~ /^find/     "find" at the start of the string
  $string =~ /^find$/    the string is exactly "find"
  .    Match any character (except newline)
  \w   Match "word" character (alphanumeric plus "_")
  \W   Match non-word character
  \s   Match whitespace character
  \S   Match non-whitespace character
  \d   Match digit character
  \D   Match non-digit character
  \t   Match tab
  \n   Match newline
  \r   Match return
Repetition
  $string =~ /(ti){2}/
  $string =~ /A*T+G?C{3}A{3,}T{4,6}/
Character classes
  $string =~ /[ATGCN]/
  $string =~ /[^ATGCNatgcn]/i
Selection/Replacement
  $string =~ /(A{3,8})/; print $1;
  $string =~ s/a/A/;
  $string =~ tr/atgc/ATGC/;
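The repetition, character-class, capture, substitution and transliteration forms can be tried together. This is a sketch with made-up input strings; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $dna  = "ccAAAAgtcaN";
my $word = "repetition";

my ($run)    = $dna  =~ /(A{3,8})/;             # capture a run of 3 to 8 A's: "AAAA"
my $has_titi = $word =~ /(ti){2}/     ? 1 : 0;  # "ti" twice in a row: 1
my $is_clean = $dna  =~ /[^ATGCN]/i   ? 0 : 1;  # no character outside ATGCN: 1

(my $first_fixed = $dna) =~ s/a/A/;             # replace only the first "a"
(my $upper       = $dna) =~ tr/atgc/ATGC/;      # map each listed character
my $lower_count  = ($dna =~ tr/atgc//);         # count matches without changing $dna

print "$run $has_titi $is_clean\n";             # AAAA 1 1
print "$first_fixed $upper $lower_count\n";     # ccAAAAgtcAN CCAAAAGTCAN 6
```

The (my $copy = $original) =~ ... idiom applies the change to a copy, so the original string stays available for comparison.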
Additional syntax
  $string =~ /AT*?AT/
  $string =~ m#/var/log/messages#
  $_ = "ATATATAGTGTGCGTGATATGGG";
  ($one, $two, $three) = /AT..AT/g;
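A small sketch of these three features. The greedy versus non-greedy comparison uses /A.*T/ rather than the slide's pattern, because the difference is easier to see there. The input strings other than the slide's own are invented for illustration.

```perl
use strict;
use warnings;

# Greedy versus non-greedy quantifiers.
my $seq = "ATTTATTTAT";
my ($greedy) = $seq =~ /(A.*T)/;     # as much as possible: "ATTTATTTAT"
my ($lazy)   = $seq =~ /(A.*?T)/;    # as little as possible: "AT"

# Alternative delimiters avoid escaping every slash in a path.
my $path   = "/var/log/messages";
my $is_log = $path =~ m#^/var/log/# ? 1 : 0;

# /g in list context returns every non-overlapping match.
my ($one, $two, $three) = "ATGGATcATCCATcATTTAT" =~ /AT..AT/g;

# The string from the slide contains only one non-overlapping hit.
my @hits = "ATATATAGTGTGCGTGATATGGG" =~ /AT..AT/g;

print "$greedy $lazy $is_log\n";
print "$one $two $three\n";
print scalar(@hits), " hit: @hits\n";
```

Because matches do not overlap, the search resumes after the end of the previous hit, which is why the slide's string yields a single match.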
What is a module
• Two types
  • Object-oriented type
    • Provides something similar to a class definition
  • Remote function call
    • Provides a method to import subroutines or variables for the main program to use
Howto: Making a module
Create a file called workSaver.pm
  ###########
  package workSaver;
  sub doStuff {
      print "Stuff done\n";
  }
  1; # statement that evaluates to true
  ###########
Now you can use it with "use workSaver;"*
*Some restrictions apply
Howto: Making a module cont.
• This method works very well for subroutines that are used in several programs
• Reduces the "clutter" in your program
• Provides one maintenance point instead of an unknown number
• Eases bug fixes
• Be careful of boundaries
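A sketch of calling the simple module. To keep the example to one runnable file, the package is declared inline; this is an assumption for illustration. In practice it would sit in workSaver.pm and be loaded with "use workSaver;". doStuff also returns its message here so the result can be checked.

```perl
use strict;
use warnings;

# Stand-in for the contents of workSaver.pm.
{
    package workSaver;

    sub doStuff {
        my $message = "Stuff done\n";
        print $message;
        return $message;
    }
}

# Nothing is exported, so the subroutine is called by its full name.
my $result = workSaver::doStuff();
```

Calling by the fully qualified name keeps the main program's namespace untouched, at the cost of a little extra typing.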
More Complete method: • Allows you to “pollute” the namespace of the original program selectively. • Makes the use of functions and variables easier • Still used about the same way as the simple method but things are clearer
More Complete
  package functional;
  use strict;
  use Exporter;
  our @ISA = ("Exporter");
  our @EXPORT = qw ();
  our @EXPORT_OK = qw ($variable1 $variable2 printout);
  our $VERSION = 2.0;
  our $variable1 = "var1";
  our $variable2 = "var2";
  my $variable3 = "var3";
  sub printout {
      my $passed_variable = shift;
      print "Your variable is $passed_variable mine are $variable1 , $variable2, $variable3 \n";
  }
  1;
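A sketch of how a program might import selectively from this module. To keep it runnable as a single file, the package is declared in a BEGIN block and marked as loaded through %INC; that trick is an assumption for illustration and would not be needed with a real functional.pm. printout also returns its text here so it can be checked.

```perl
use strict;
use warnings;

# Stand-in for a separate functional.pm file.
BEGIN {
    package functional;
    use Exporter;
    our @ISA       = ("Exporter");
    our @EXPORT    = qw();                                  # nothing exported by default
    our @EXPORT_OK = qw($variable1 $variable2 printout);    # available on request
    our $variable1 = "var1";
    our $variable2 = "var2";
    my  $variable3 = "var3";                                # lexical: private to the module

    sub printout {
        my $passed_variable = shift;
        my $line = "yours: $passed_variable, mine: $variable1, $variable2, $variable3\n";
        print $line;
        return $line;
    }
    $INC{'functional.pm'} = 1;    # tell "use" the module is already loaded
}

# Ask only for the names this program needs.
use functional qw($variable1 printout);

my $line = printout("hello");
print "imported: $variable1\n";
print "by full name: $functional::variable2\n";   # not imported, still reachable
```

Only the requested names enter the main namespace. $variable2 remains reachable by its full name, while $variable3 cannot be reached from outside at all.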
CPAN
• Wouldn't it be nice to have a place where:
  • You could find a bunch of Perl modules
  • It would be browsable
  • Searchable
  • Big pipe for people to download stuff
  • Other people would be encouraged to submit fixes and updates
  • And it was all free
Sources of modules/Information • www.CPAN.org • www.bioperl.org • www.perl.com • www.cetus-links.org/oo_infos.html
Bioperl • Set of modules that are extremely useful for working with biological data. Actively maintained. • www.bioperl.org is a very good place to get the basics of bioperl • We will go through an example to see a typical use
Bioperl has several basic types of objects:
• Seq: a sequence, the most common type (Bio::Seq)
• Location objects: where a feature is, how long it is, etc.
• Interface objects (Bio::xyzI): no implementation, mostly documentation
Bioperl documentation
• Several different ways to find out about a module
  • perldoc Bio::Seq
  • bioperl.org
  • /usr/lib/perl5/site_perl/5.8.0/bptutorial.pl 100 Bio::Seq
  • Data::Dumper to print the data structure
  • Print the variable
Why use a database • Transaction control - only one user can modify the data at any one time. • Access control - some people can modify data, some can read data, others can create data-structures. • Fast handling of lots of data • Precise definition of data (mostly). • Easy to share data resources with others
Many choices
• There are many types: MS Access, Excel (sort of), Sybase, Oracle, Postgres, mSQL, MySQL ...
• Each has its niche and functions best in certain cases, but there is also considerable overlap
• SQL (Structured Query Language) is a common thread
MySQL is better than YourSQL
• Free on Unix
• Good developer support
• Constant bug fixes and feature additions
• Good scalability to medium size and load, OK performance
• Easy to install
• Used by the Ensembl and UCSC genome browsers, so a lot of information is readily available in that format
Table Structure - Schema
Example record
  Gene: ATP7B
  Aliases:
    Wilson disease-associated protein
    Copper-transporting ATPase 2
  References:
    Enzyme Commission: 3.6.3.4
    UniGene: Hs.84999
    AffyProbeU133: 204624_at
    AffyProbeU95: 37930_at
    RefSeq: NM_000053
    GenBank: AF034838
    GenBank: U11700
    LocusLink: 540
Tables
  Gene table: Gene_ID, Name
  Alias table: Alias_ID, Gene_ID, Alias
  Reference table: Reference_ID, Gene_ID, Reference, DataSource
SQL (MySQL dialect)
• SELECT col_name FROM table WHERE col_name = value;
• SELECT COUNT(*) FROM table WHERE col_name LIKE '%value%';
• SELECT COUNT(DISTINCT col_name) FROM table WHERE col_name IS NOT NULL;
• CREATE, UPDATE, DELETE, INSERT have similar forms
SQL cont.
• USE database_name
  • Can also be specified on the command line with -D
• SHOW TABLES lists all the tables in that database (also SHOW DATABASES)
• DESCRIBE table_name lists the columns and the datatype of each column (or SHOW COLUMNS FROM table_name)
More advanced SELECTs
  SELECT (column_list) FROM (table_list) WHERE (constraints) GROUP BY (grouping columns) ORDER BY (sorting columns) LIMIT (limit number);
  SELECT col_name FROM table1, table2 WHERE table1_val = table2_val AND table1_val2 > value;
• Example of an equi-join
Getting the names right
• If you only have one table, you only need to use the column name
• When you are using joins, this may not be adequate
• If two tables both have a column named primary, you need to refer to it as table1.primary or table2.primary
Data Types
• INT
  • TINYINT: -128 to 127
  • SMALLINT: -32768 to 32767
  • MEDIUMINT: -8388608 to 8388607
  • INT: -2147483648 to 2147483647
  • BIGINT: -9223372036854775808 to 9223372036854775807
• FLOAT
  • FLOAT: 4 bytes
  • DOUBLE: 8 bytes
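The integer ranges follow from the storage size: a signed integer of n bytes spans -2^(8n-1) to 2^(8n-1) - 1. The sketch below recomputes the table under the assumption that the MySQL storage sizes are 1, 2, 3, 4 and 8 bytes; the slide lists only the ranges. Math::BigInt, a core module, is used so that the 8-byte case does not overflow.

```perl
use strict;
use warnings;
use Math::BigInt;

# Assumed storage sizes in bytes for the MySQL signed integer types.
my %bytes = (TINYINT => 1, SMALLINT => 2, MEDIUMINT => 3, INT => 4, BIGINT => 8);

# Return (min, max) as strings for a signed integer of the given size.
sub signed_range {
    my ($n_bytes) = @_;
    my $half = Math::BigInt->new(2)->bpow(8 * $n_bytes - 1);   # 2^(8n-1)
    my $min  = $half->copy->bneg;                              # -2^(8n-1)
    my $max  = $half->copy->bsub(1);                           # 2^(8n-1) - 1
    return ("$min", "$max");
}

for my $type (sort { $bytes{$a} <=> $bytes{$b} } keys %bytes) {
    my ($min, $max) = signed_range($bytes{$type});
    print "$type: $min to $max\n";
}
```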
• CHAR
  • CHAR(n): character string of length n; n bytes
  • VARCHAR(n): character string up to n long; L+1 bytes, where L is the actual length
  • TEXT: up to 2^16 - 1 bytes
• BLOBs: Binary Large OBjects
Perl DBI
• A way for Perl to connect to a database (virtually any database) and read or modify data
• The statements are constructed very much like the SQL statements you would enter on the command line, so learning SQL is still necessary
Statements in DBI
• Connect
  • Establishes the initial connection
• Prepare
  • Prepares a statement to execute
• Execute
  • Executes the statement
• Do
  • Prepares and executes a statement that does not return results
• Fetch
  • Several types, used to get the returned data
• Disconnect
  • Disconnects from the server
Types of fetch
• fetchrow_array
  • Fetches one row as an array of scalars each time
  • Can also use fetchrow_arrayref
• fetchrow_hashref
  • Fetches one row as a reference to a hash keyed by column name
  • Slower but cleaner code
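The whole connect, do, prepare, execute, fetch and disconnect cycle can be sketched as follows. Several things here are assumptions for illustration: an in-memory SQLite database (through the CPAN module DBD::SQLite) stands in for a MySQL server, and the lower-case table and column names are a cut-down version of the gene and alias tables. With MySQL only the connect string and credentials would differ.

```perl
use strict;
use warnings;

# DBI and DBD::SQLite come from CPAN, not core Perl; stop quietly if absent.
eval { require DBI; require DBD::SQLite; 1 }
    or do { print "DBI or DBD::SQLite is not installed\n"; exit 0 };

# Connect.
my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "", { RaiseError => 1 });

# Do: statements that return no rows.
$dbh->do("CREATE TABLE gene  (gene_id INTEGER PRIMARY KEY, name TEXT)");
$dbh->do("CREATE TABLE alias (alias_id INTEGER PRIMARY KEY, gene_id INTEGER, alias TEXT)");
$dbh->do("INSERT INTO gene  VALUES (1, 'ATP7B')");
$dbh->do("INSERT INTO alias VALUES (1, 1, 'Wilson disease-associated protein')");
$dbh->do("INSERT INTO alias VALUES (2, 1, 'Copper-transporting ATPase 2')");

# Prepare and execute an equi-join, qualifying each column by its table.
my $sth = $dbh->prepare(
    "SELECT gene.name AS name, alias.alias AS alias
       FROM gene, alias
      WHERE gene.gene_id = alias.gene_id
      ORDER BY alias.alias_id"
);
$sth->execute();

# Fetch each row as an array of scalars.
my @rows;
while (my @row = $sth->fetchrow_array) {
    push @rows, "$row[0]: $row[1]";
}

# Run it again and fetch the first row as a hash reference.
$sth->execute();
my $first = $sth->fetchrow_hashref;
$sth->finish;

print "$_\n" for @rows;
print "first alias: $first->{alias}\n";

# Disconnect.
$dbh->disconnect;
```

The hash-reference form reads more clearly when a query returns many columns, since each value is addressed by name rather than by position.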
More advanced statements
• Quote
  • Used to properly quote data that is placed directly into the text of an SQL statement
  • $value = $dbh->quote($blast_result);
• Placeholders
  • Prepare once, execute many times; speeds up repeated statements; optional
      my $prep = $dbh->prepare("select x from y where z = ?");
      foreach my $z (@z_values) {
          $prep->bind_param(1, $z);
          $prep->execute();
      }
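A runnable sketch of both techniques, again assuming an in-memory SQLite database through DBD::SQLite in place of a MySQL server. The table y with columns x and z comes from the slide's query; the inserted values are invented for illustration.

```perl
use strict;
use warnings;

# DBI and DBD::SQLite come from CPAN, not core Perl; stop quietly if absent.
eval { require DBI; require DBD::SQLite; 1 }
    or do { print "DBI or DBD::SQLite is not installed\n"; exit 0 };

my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "", { RaiseError => 1 });
$dbh->do("CREATE TABLE y (x TEXT, z TEXT)");

# quote() escapes a value so it can be pasted into SQL text safely.
my $blast_result = "3' UTR hit";              # contains a single quote
my $quoted       = $dbh->quote($blast_result);
$dbh->do("INSERT INTO y (x, z) VALUES ('first', $quoted)");
$dbh->do("INSERT INTO y (x, z) VALUES ('second', 'plain')");

# Placeholders: prepare once, then bind and execute once per value.
my $prep = $dbh->prepare("SELECT x FROM y WHERE z = ?");
my @found;
for my $z ($blast_result, "plain", "missing") {
    $prep->bind_param(1, $z);                 # no manual quoting needed
    $prep->execute();
    while (my ($x) = $prep->fetchrow_array) {
        push @found, $x;
    }
}

print "@found\n";                             # first second
$dbh->disconnect;
```

With placeholders the driver handles quoting, so quote() is mainly needed when a value has to be written into the SQL string itself.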