Programming and Perl for Bioinformatics Part I
Why Write Programs?
• Automate computer work that you do by hand: save time and reduce errors
• Run the same analysis on lots of similar data files (scale-up)
• Analyze data and make decisions
  • e.g. sort BLAST results by e-value and/or species of best match
• Build a pipeline
• Create new analysis methods
Why Perl?
• Fairly easy to learn the basics
• Many powerful functions for working with text: search & extract, modify, combine
• Can control other programs
• Free and available for all operating systems
• Long one of the most popular languages in bioinformatics
• Many pre-built "modules" are available that do useful things
Get Perl
• You can install Perl on any type of computer
• Where Perl is already installed (as on most Unix-like systems), just log in: you don't even need to type any command to make Perl active.
• Download and install Perl on your own computer: www.perl.org
Programming Concepts
• Program = a text file that contains instructions for the computer to follow
• Programming language = a set of commands that the computer understands (via a "command interpreter")
• Input = data that is given to the program
• Output = something that is produced by the program
Programming
• Write the program (with a text editor)
• Run the program
• Look at the output
• Correct the errors (debugging)
• Repeat
Computers are VERY dumb: they do exactly what you tell them to do, so be careful what you ask for…
Strings
• Text is handled in Perl as a string
• This basically means that you have to put quotes around any piece of text that is not an actual Perl instruction.
• Perl has two kinds of quotes: single '...' and double "..." (they behave differently; more about this later)
Print
• Perl uses the print function to create output
• Without a print statement, you won't know what your program has done
• You need to tell Perl to put a newline at the end of a printed line
• Use the "\n" (newline) escape sequence
• Include the double quotes
• The "\" character is called an escape; Perl uses it a lot
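As a minimal sketch of the point about "\n" (the variable names are made up for this illustration), the lines below compare the newline escape in double quotes with the same two characters in single quotes:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical variable names, chosen only for this illustration.
my $double = "\n";   # double quotes: one newline character
my $single = '\n';   # single quotes: a backslash followed by the letter n

print "Hello ";      # no newline, so the next print continues on the same line
print "world!";
print $double;       # ends the line

print "Length in double quotes: ", length($double), "\n";
print "Length in single quotes: ", length($single), "\n";
```

The two lengths differ because only the double-quoted form is turned into a real newline.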
A Taste of Perl: print a message
• hello_world.pl: Greet the entire world.

    #!/usr/bin/perl -w
    # greet the entire world
    $x = 6e9;
    print "Hello world!\n";
    print "All $x of you!\n";

• #!/usr/bin/perl -w : command interpretation header
• # greet the entire world : a comment
• $x = 6e9; : variable assignment statement
• the two print lines : function calls (output statements)
Variables
• Up till now, we've been telling the computer exactly what to print. But for the program itself to generate what is printed, we need to use variables.
• A scalar variable name starts with "$"
• It can store either a string or a number.
Basic Syntax and Data Types
• Whitespace between tokens doesn't matter to Perl; one can even write all statements on one line
• All Perl statements end in a semicolon ";", just like C
• Comments begin with '#', and Perl ignores everything after the # until the end of the line.
• Example: #this is a comment
• Perl has three basic data types:
  • scalar
  • array (list)
  • associative array (hash)
Variables
• To be useful at all, a program needs to be able to store information from one line to the next
• Perl stores information in variables
• A scalar variable name starts with the "$" symbol, and it can store strings or numbers
• Variables are case sensitive
• Give them sensible names
• Use the "=" sign to assign values to variables

    $one_hundred = 100;
    $my_sequence = "ttattagcc";
Scalars
• Scalar variables begin with $ followed by an identifier
• Example: $this_is_a_scalar;
• An identifier is composed of upper or lower case letters, digits, and underscore '_', and does not start with a digit. Identifiers are case sensitive (like all of Perl)

    $progname = "first_perl";
    $numOfStudents = 4;

• = ("gets") sets the content of $progname to be the string "first_perl" and $numOfStudents to be the integer 4
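Scalar Values
• Numerical values
  • integer: 5, 0, -307 (a numeric string such as "3" is converted to a number when it is used as one)
  • floating point: 6.2e9, -4022.33
NOTE: Perl converts automatically between integers, floating-point numbers ("double" precision) and numeric strings as needed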
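The automatic conversion between strings and numbers can be sketched as follows. The variable names are hypothetical; the operator decides whether a value is treated as a number or as text:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical variable names, for illustration only.
my $count = "3";           # a string that looks like a number
my $sum   = $count + 4;    # "+" is numeric, so "3" is used as the number 3
my $text  = $count . 4;    # "." is concatenation, so 4 is used as the string "4"
my $ratio = 256 / 12;      # division gives a floating-point result

print "sum   = $sum\n";
print "text  = $text\n";
print "ratio = $ratio\n";
```

Here `$sum` holds 7 and `$text` holds the string "34".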
A program with variables

    #!/usr/bin/perl -w
    # this program uses variables containing numbers
    my $two = 2;
    my $three = $two + 1;
    print "\$two * \$three = $two * $three = ", ($two * $three);
    print "\n";
Do the Math
• Mathematical operators work pretty much as you would expect:

    4+7    6*4    43-27    256/12    2/(3-5)

• Example

    #!/usr/bin/perl
    print "4+5\n";
    print 4+5 , "\n";
    print "4+5=" , 4+5 , "\n";
    $myNumber = 88;

• Note: use commas to separate multiple items in a print statement
• What will be the output?

    4+5
    9
    4+5=9
Scalar Values
• String values
• Example:

    $day = "Monday";
    print "Happy Monday!\n";
    print "Happy $day!\n";
    print 'Happy Monday!\n';
    print 'Happy $day!\n';

• Double-quoted: interpolates (replaces a variable name or control character with its value)
• Single-quoted: NO interpolation done (as-is)
• What will be the output?

    Happy Monday!<newline>
    Happy Monday!<newline>
    Happy Monday!\nHappy $day!\n

• The last two print statements produce their text literally, "\n" included, so both end up on the same line
String Manipulation
Concatenation

    $dna1 = "ACTGCGTAGC";
    $dna2 = "CTTGCTAT";

• Juxtapose in a string assignment or print statement

    $new_dna = "$dna1$dna2";

• Use the concatenation operator '.'

    $new_dna = $dna1 . $dna2;

Substring

    $dna = "ACTGCGTAGC";
    $exon1 = substr($dna,2,5); # TGCGT

• Second argument (2): starting offset, counted from 0
• Third argument (5): length of the substring
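A small sketch combining substr and concatenation on the same example sequence. The variable names are hypothetical; leaving out the length argument makes substr return everything from the offset to the end of the string:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $dna = "ACTGCGTAGC";

# Offsets count from 0, so the first three bases start at offset 0.
my $first_codon  = substr($dna, 0, 3);   # ACT
my $second_codon = substr($dna, 3, 3);   # GCG
my $tail         = substr($dna, 6);      # TAGC (no length: up to the end)

# Gluing the pieces back together with "." restores the original sequence.
my $rejoined = $first_codon . $second_codon . $tail;

print "$first_codon $second_codon $tail\n";
print "rejoined: $rejoined\n";
```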
Substitution
DNA transcription: T → U
Substitution operator s/// :

    $dna = "GATTACATACACTGTTCA";
    $rna = $dna;
    $rna =~ s/T/U/g; # "GAUUACAUACACUGUUCA"

• =~ is a binding operator: it tells Perl to examine the contents of $rna for a match to the pattern
• "g": global (replace every match, not just the first)
Ex: Start with $dna = "gaTtACataCACTgttca"; and do the same as above. What will be the output?
Example
• transcribe.pl:

    $dna = "gaTtACataCACTgttca";
    $rna = $dna;
    $rna =~ s/T/U/g;
    print "DNA: $dna\n";
    print "RNA: $rna\n";

• Does it do what you expect? If not, why not?
• Patterns in substitution are case-sensitive! What can we do?
• Convert all letters to upper/lower case (preferred when possible)
• If we want to retain mixed case, use the transliteration/translation operator tr///

    $rna =~ tr/tT/uU/; # replace all t by u, all T by U
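The two remedies can be sketched side by side on the mixed-case sequence. The variable names `$rna_upper` and `$rna_mixed` are made up for this illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $dna = "gaTtACataCACTgttca";

# Option 1: convert to upper case first, then substitute T by U.
my $rna_upper = uc $dna;
$rna_upper =~ s/T/U/g;

# Option 2: keep the mixed case and transliterate both t and T.
my $rna_mixed = $dna;
$rna_mixed =~ tr/tT/uU/;

print "DNA:         $dna\n";
print "RNA (upper): $rna_upper\n";   # GAUUACAUACACUGUUCA
print "RNA (mixed): $rna_mixed\n";   # gaUuACauaCACUguuca
```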
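Case conversion

    $string = "acCGtGcaTGc";

Upper case:

    $dna = uc($string);   # "ACCGTGCATGC"
    $dna = uc $string;    # same, without parentheses
    $dna = "\U$string";   # same; \U is a string directive

Lower case:

    $dna = lc($string);   # "accgtgcatgc"
    $dna = "\L$string";   # same

Sentence case:

    $dna = ucfirst(lc($string));   # "Accgtgcatgc"
    $dna = "\u\L$string";          # same

• ucfirst on its own only upper-cases the first character and leaves the rest unchanged, so lower-case the string first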
Reverse Complement

    5’-A C G T C T A G C . . . . G C A T-3’
    3’-T G C A G A T C G . . . . C G T A-5’

• Reverse: reverses a string

    $string = "ACGTCTAGC";
    $string = reverse($string); # "CGATCTGCA"

• Complementation: use the transliteration operator

    $string =~ tr/ACGT/TGCA/;
What’s Wrong?

    $DNA = "ACGTCTAGC";
    print "$DNA\n\n";
    $revcom = reverse $DNA;
    # Next substitute all bases by their complements,
    # A->T, T->A, G->C, C->G
    $revcom =~ s/A/T/g;
    $revcom =~ s/T/A/g;
    $revcom =~ s/G/C/g;
    $revcom =~ s/C/G/g;
    # Print the reverse complement DNA onto the screen
    print "$revcom\n";
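One way to see the problem is to run the four substitutions next to a single tr///. This is a sketch with hypothetical variable names (`$wrong`, `$right`); the point is that each s///g works on the result of the previous one, so bases that were already complemented get changed again:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $dna = "ACGTCTAGC";

# The four substitutions from the slide, applied one after another.
my $wrong = reverse $dna;   # "CGATCTGCA"
$wrong =~ s/A/T/g;          # every A becomes T ...
$wrong =~ s/T/A/g;          # ... and then every T, old and new, becomes A
$wrong =~ s/G/C/g;          # every G becomes C ...
$wrong =~ s/C/G/g;          # ... and then every C, old and new, becomes G

# tr/// maps all four letters in a single pass, so nothing is replaced twice.
my $right = reverse $dna;
$right =~ tr/ACGT/TGCA/;

print "sequential s///: $wrong\n";   # GGAAGAGGA
print "single tr///:    $right\n";   # GCTAGACGT
```

The sequential version ends up containing only A and G, which is a quick sign that something has gone wrong.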
More on String Manipulation
String length:

    length( $dna )

Index:

    # index STR,SUBSTR,POSITION
    index( $strand, $primer, 2 )

• POSITION is optional, default 0
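A short sketch of length and index, using a made-up strand and primer. index returns the 0-based offset of the first match at or after POSITION, and -1 when there is no match:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Example data invented for this illustration.
my $strand = "ACGTACGTTTGACGT";
my $primer = "ACGT";

my $len    = length($strand);               # number of characters
my $first  = index($strand, $primer);       # search from offset 0
my $second = index($strand, $primer, 2);    # search from offset 2 onward
my $none   = index($strand, "GGGG");        # not present

print "length: $len\n";          # 15
print "first match: $first\n";   # 0
print "next match: $second\n";   # 4
print "no match: $none\n";       # -1
```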
Flow Control
Conditional Statements
• Parts of code are executed depending on the truth value of a logical statement
• "Truth" (logical) values in Perl:
  • false = {0, 0.0, 0e0, "", undef}, default ""
  • true = anything else, default 1

    ($a, $b) = (75, 83);
    if ( $a < $b ) {
        $a = $b;
        print "Now a = b!\n";
    }
    if ( $a > $b ) { print "Yes, a > b!\n" } # Compact
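The truth rules can be checked with a tiny helper. The sub `truth` below is hypothetical, written only for this sketch. Note that the numbers 0.0 and 0e0 are false, whereas the strings "0.0" and "0E0" count as true because they are neither empty nor exactly "0":

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reports how Perl evaluates a value in a condition.
sub truth {
    my ($value) = @_;
    return $value ? "true" : "false";
}

print 'number 0     : ', truth(0),     "\n";   # false
print 'number 0.0   : ', truth(0.0),   "\n";   # false
print 'string ""    : ', truth(""),    "\n";   # false
print 'string "0"   : ', truth("0"),   "\n";   # false
print 'undef        : ', truth(undef), "\n";   # false
print 'string "0.0" : ', truth("0.0"), "\n";   # true
print 'string "0E0" : ', truth("0E0"), "\n";   # true
print 'string " "   : ', truth(" "),   "\n";   # true
```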
if/else/elsif
• allows for multiple branching/outcomes

    $a = rand();
    if ( $a < 0.25 ) {
        print "A";
    } elsif ( $a < 0.50 ) {
        print "C";
    } elsif ( $a < 0.75 ) {
        print "G";
    } else {
        print "T";
    }
What’s a block?
• In the case of an "if" statement:
• If the test is true, execute all the command lines inside the { } braces. If not, then go on past the closing } to the statements below.
• You can also do stuff in a block over and over again using a loop.
Conditional Loops

    while ( statement ) {
        commands …
    }

• repeats commands until statement is no longer true

    do {
        commands
    } while ( statement );

• same as while, except commands are executed at least once
• NOTE the ";" after the while statement
Short-circuiting commands: next and last
• next; # jumps to the end of the block and starts the next iteration
• last; # jumps out of the loop completely
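A minimal sketch of next and last inside a while loop, with made-up variable names. It adds up the even numbers, skipping odd ones with next and leaving the loop with last once the number exceeds 6:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $n   = 0;   # current number
my $sum = 0;   # running total of the even numbers seen so far

while ( $n < 10 ) {
    $n++;
    next if $n % 2;    # odd number: skip the rest of the block
    last if $n > 6;    # stop the loop completely
    $sum += $n;        # reached only for 2, 4 and 6
}

print "sum = $sum, stopped at n = $n\n";   # sum = 12, stopped at n = 8
```

As far as Perl's own documentation describes it, a do { } while block does not count as a true loop, so next and last cannot be used directly inside one.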
While-Loop
• Loops test a condition and repeat a block of code based on the result
• while loops repeat while the condition is true

    $count = 1;
    while ( $count <= 10 ) {
        print "$count bottles of pop\n";
        $count = $count + 1;
    }
    print "POP!\n";

[Try this program yourself]
for and foreach loops
• Execute a code loop a specified number of times, or for a specified list of values
• The keywords for and foreach are interchangeable: use whichever you want
Incremental loop ("C style"):

    for ( $i=0 ; $i < 50 ; $i++ ) {
        $x = $i*$i;
        print "$i squared is $x.\n";
    }

Loop over list ("foreach" loop):

    foreach $name ( "Billy", "Bob", "Edwina" ) {
        print "$name is my friend.\n";
    }
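As a sketch of how a C-style loop combines with the string functions from the earlier slides, the following counts the G and C bases in a sequence one position at a time. The variable names and the GC-count task are chosen for illustration and are not part of the original slides:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $dna = "ACTGCGTAGC";
my $gc  = 0;   # number of G or C bases found so far

# Visit every offset from 0 up to (but not including) the sequence length.
for ( my $i = 0 ; $i < length($dna) ; $i++ ) {
    my $base = substr($dna, $i, 1);   # the single base at offset $i
    if ( $base eq "G" or $base eq "C" ) {
        $gc++;
    }
}

my $percent = 100 * $gc / length($dna);
print "GC bases: $gc of ", length($dna), " ($percent%)\n";
```

Strings are compared with eq rather than ==, which is reserved for numbers.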
Standard Input
• To make the program do something, we need to input data.
• The angle bracket operator (<>) tells Perl to expect input, by default from the keyboard.
• Usually this is assigned to a variable

    print "Please type a number: ";
    $num = <STDIN>;
    print "Your number is $num\n";
chomp
• When data is entered from the keyboard, the program waits for you to press the Enter (Return) key.
• But the string which is captured includes a newline at its end
• You can use the chomp function to remove the newline character:

    print "Enter your name: ";
    $name = <STDIN>;
    print "Hello $name, happy to meet you!\n"; # newline still attached
    chomp $name;
    print "Hello $name, happy to meet you!\n"; # newline removed
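The effect of chomp can also be seen without typing anything, by starting from a string that already ends in a newline. This is a sketch with made-up variable names, where the literal "ACGT\n" stands in for a line read from the keyboard:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "ACGT\n";             # stands in for a line read from <STDIN>
my $before = length($line);      # counts the trailing newline as well

chomp $line;                     # removes the trailing newline
my $after = length($line);

chomp $line;                     # harmless: there is no newline left to remove
my $again = length($line);

print "before chomp: $before characters\n";   # 5
print "after chomp:  $after characters\n";    # 4
print "second chomp: $again characters\n";    # 4
```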
Basic Data Types
• Perl has three basic data types:
  • scalar
  • array (list)
  • associative array (hash)