450 likes | 793 Views
Introduction to Perl for Bioinformatics. Thursday, April 7. Programming languages. Self-contained language Platform-independent Used to write O/S C (imperative, procedural) C++, Java (object-oriented) Lisp, Haskell, Prolog (functional) Scripting language Closely tied to O/S
E N D
Introduction to Perlfor Bioinformatics Thursday, April 7
Programming languages • Self-contained language • Platform-independent • Used to write O/S • C (imperative, procedural) • C++, Java (object-oriented) • Lisp, Haskell, Prolog (functional) • Scripting language • Closely tied to O/S • Perl, Python, Ruby • Domain-specific language • R (statistics) • MatLab (numerics) • SQL (databases) • An O/S typically manages… • Devices (see above) • Files & directories • Users & permissions • Processes & signals
Bioinformatics “pipelines” often involve chaining together multiple tools
Perl is the most-used bioinformatics language Most popular bioinformatics programming languages Bioinformatics career survey, 2008 Michael Barton
Perl overview • Interpreted, not compiled • Fast edit-run-revise cycle • Procedural & imperative • Sequence of instructions (“control flow”) • Variables, subroutines • Syntax close to C (the de facto standard minimal language) • Weakly typed (unlike C) • Redundant, not minimal (“there’s more than one way to do it”) • High-level data structures & algorithms • Hashes, arrays • Operating System support (files, processes, signals) • String manipulation
Perl basics • Basic syntax of a Perl program: #!/usr/local/bin/perl# Elementary Perl program print "Hello World\n"; Lines beginning with "#" are comments, and are ignoredby Perl All statements end with a semicolon Single or double quotes enclose a "string literal" (double quotes are "interpolated") "\n" means new line print statement tells Perl to print the following text to the screen Hello World
$x = 3; print $x; $x = "ACGCGT"; print $x; 3 ACGCGT Variables • We can tell Perl to "remember" a particular value, using the assignment operator “=“: • The $x is referred to as a "scalar variable". Binding site for yeast transcription factor MCB Variable names can contain alphabetic characters, numbers (but not at the start of the name), and underscore symbols "_" Scalar variable names are all prefixed with the dollar symbol.
Arithmetic operations • Basic operators are + - / * % • Can also use += -= /= *= ++ -- $x = 14; $y = 3;print "Sum: ", $x + $y, "\n"; print "Product: ", $x * $y, "\n"; print "Remainder: ", $x % $y, "\n"; Sum: 17 Product: 42 Remainder: 2 $x = 5; print "x started as $x\n"; $x = $x * 2; print "Then x was $x\n"; $x = $x + 1; print "Finally x was $x\n"; Could write $x *= 2; x started as 5 Then x was 10 Finally x was 11 Could write $x += 1; or even ++$x;
String operations $a = "pan"; $b = "cake"; $a = $a . $b; print $a; $a = "soap"; $b = "dish"; $a .= $b; print $a; • Concatenation . and .= • Can find the length of a string using the function length($x) pancake soapdish $mcb = "ACGCGT"; print "Length of $mcb is ", length($mcb); Length of ACGCGT is 6
More string operations Convert to upper case $x = "A simple sentence"; print $x, "\n"; print uc($x), "\n"; print lc($x), "\n"; $y = reverse($x); print $y, "\n"; $x =~ tr/i/a/; print $x, "\n"; print length($x), "\n"; Convert to lower case Reverse the string Transliterate "i"'s into "a"'s A simple sentence A SIMPLE SENTENCE a simple sentence ecnetnes elpmis A A sample sentence 17 Calculate the length of the string
Concatenating DNA fragments $dna1 = "accacgt"; $dna2 = "taggtct"; print $dna1 . $dna2; accacgttaggtct "Transcribing" DNA to RNA DNA string is a mixture of upper & lower case $dna = "accACgttAGGTct"; $rna = lc($dna); $rna =~ tr/t/u/; print $rna; Make it alllower case Transliterate "t" to "u" accacguuaggucu
Conditional blocks • The ability to execute an action contingent on some condition is what distinguishes a computer from a calculator. In Perl, this looks like this:if (condition) { action } else { alternative } $x = 149; $y = 100; if ($x > $y) { print "$x is greater than $y\n"; } else { print "$x is less than $y\n"; } These braces {} tell Perl which piece of code is contingent on the condition. 149 is greater than 100
Conditional operators "does not equal" Note that the test for "$x equals $y" is $x==$y, not $x=$y • Numeric: > >= < <= != == • String: eq ne gt lt ge le $x = 5 * 4; $y = 17 + 3; if ($x == $y) { print "$x equals $y"; } 20 equals 20 "is alphabetically less-or-equal" "equals" "does not equal" "is alphabetically greater-or-equal" "is alphabetically greater than" Shorthand syntax for assigning more than one variable at a time "is alphabetically less than" ($x, $y) = ("Apple", "Banana"); if ($y gt $x) { print "$y after $x "; } Banana after Apple
if (1) { print "1 is true\n"; } if (0) { print "0 is true\n"; } if (-99) { print "-99 is true\n"; } 1 is true -99 is true $x = 222; if ($x % 2 == 0 and $x % 3 == 0) { print "$x is an even multiple of 3\n"; } 222 is an even multiple of 3 Logical operators • Logical operators: && means "and", || means "or" • An exclamation mark ! is used to negate what follows Thus !($x < $y) means the same as ($x >= $y) • In computers, the value zero is often used to represent falsehood, while any non-zero value (e.g. 1) represents truth. Thus:
The code insidethe braces is repeatedly executed as longas the condition$x<=10 remains true $x = 1; while ($x <= 10) { print $x, " "; ++$x; } 1 2 3 4 5 6 7 8 9 10 Loops • Here's how to print out the numbers 1 to 10: • This is a while loop.The code is executed while the condition is true. Equivalent to $x = $x + 1;
A common kind of loop • Let's dissect the code of the while loop again: • This form of while loop is common enough to have its own shorthand: the for loop. Initialization $x = 1; while ($x <= 10) { print $x, " "; ++$x; } Test for completion Continuation Continuation Initialization Test for completion for ($x = 1; $x <= 10; ++$x) { print $x, " "; }
defined and undef • The function defined($x) is true if $x has been assigned a value: • A variable that has not yet been assigned a value has the special value undef • Often, if you try to do something "illegal" (like reading from a nonexistent file), you end up with undef as a result if (defined($newvar)) { print "newvar is defined\n"; } else { print "newvar is not defined\n"; } newvar is not defined
Reading a line of data • To read from a file, we first need to openthe file and give it a filehandle. • Once the file is opened, we can read a single line from it into the scalar $x : open FILE, "sequence.txt"; This code snippet opens a file called"sequence.txt", and associates it with a filehandle called FILE This reads the next line from the file, including the newline at the end, "\n". if the end of the file is reached, $x is assigned the special value undef $x = <FILE>;
Reading an entire file • The following piece of code reads every line in a file and prints it out to the screen: • A shorter version of this is as follows: open FILE, "sequence.txt"; while (defined ($x = <FILE>)) { print $x; }close FILE; This reads a line of data into $x, then checks if $x is defined. If $x is undef, then the file must have ended. open FILE, "sequence.txt"; while ($x = <FILE>) { print $x; }close FILE; this is equivalent to defined($x=<FILE>)
The default variable, $_ • Many operations that take a scalar argument, such as length($x), are assumed to work on $_ if the $x is omitted: • So we can also read a whole file like this: $_ = "Hello"; print; print length; Hello5 open FILE, "sequence.txt"; while (<FILE>) { print; }close FILE; This line is equivalent to while (defined($_=<FILE>)) {
Summary: scalars and loops • Assignment operator • Arithmetic operations • String operations • Conditional tests • Logical operators • Loops • defined and undef • Reading a file $x = 5; $y = $x * 3; $s = "Value of y is " . $y; if ($y > 10) { print $s; } if ($y>10 && $s eq "") { exit; } for ($x=1; $x<10; ++$x) { print $x, "\n"; }
Pattern-matching • A more sophisticated kind of logical test is to ask whether a string contains a pattern • e.g. does a yeast promoter sequence contain the MCB binding site, ACGCGT? $name = "YBR007C"; $dna="TAATAAAAAACGCGTTGTCG"; if ($dna =~ /ACGCGT/) { print "$name has MCB!\n"; } 20 bases upstream of the yeast gene YBR007C The pattern binding operator =~ The pattern for the MCB binding site YBR007C has MCB!
FASTA format • A format for storing multiple named sequences in a single file • This file contains 3' UTRsfor Drosophila genes CG11604,CG11455 and CG11488 >CG11604 TAGTTATAGCGTGAGTTAGT TGTAAAGGAACGTGAAAGAT AAATACATTTTCAATACC >CG11455 TAGACGGAGACCCGTTTTTC TTGGTTAGTTTCACATTGTA AAACTGCAAATTGTGTAAAA ATAAAATGAGAAACAATTCT GGT>CG11488 TAGAAGTCAAAAAAGTCAAG TTTGTTATATAACAAGAAAT CAAAAATTATATAATTGTTT TTCACTCT Name of sequence is preceded by > symbol NB sequences can span multiple lines Call this file fly3utr.txt
Printing all sequence names in a FASTA database • The key to this program is this block: open FILE, "fly3utr.txt"; while ($x = <FILE>) { if ($x =~ />/) { print $x; } }close FILE; >CG11604 >CG11455 >CG11488 This pattern matches (and returns TRUE) if the default variable $_ contains the FASTA sequence-name symbol > if ($x =~ />/) { print $x; } This line prints $_ if the pattern matched
open FILE, "fly3utr.txt"; while (<FILE>) { if (/>/) { s/>//; print; } }close FILE; CG11604 CG11455 CG11488 New statement removes the ">" Pattern replacement $_ is the default variable for these operations • The new statement s/>// is an example of a replacement. • General form: s/OLD/NEW/ replaces OLD with NEW • Thus s/>// replaces ">" with "" (the empty string)
Finding all sequence lengths Start Open file Read line Print last sequence length yes no End of file? Remove “\n” newline character at end of line Stop yes no Line starts with “>” ? Sequence name Sequence data no yes Print last sequence length Add length of line to running total First sequence? Record the name Reset running total of current sequence length
Finding all sequence lengths open FILE, "fly3utr.txt"; while (<FILE>) { chomp; if (/>/) { if (defined $len) { print "$name $len\n"; } $name = $_; $len = 0; } else { $len += length; } } print "$name $len\n"; close FILE; The chomp statement trims the newline character "\n" off the end of the default variable, $_. Try it without this and see what happens – and if you can work out why >CG11604 TAGTTATAGCGTGAGTTAGT TGTAAAGGAACGTGAAAGAT AAATACATTTTCAATACC >CG11455 TAGACGGAGACCCGTTTTTC TTGGTTAGTTTCACATTGTA AAACTGCAAATTGTGTAAAA ATAAAATGAGAAACAATTCT GGT>CG11488 TAGAAGTCAAAAAAGTCAAG TTTGTTATATAACAAGAAAT CAAAAATTATATAATTGTTT TTCACTCT >CG11604 58 >CG11455 83 >CG11488 68
Reverse complementing DNA • A common operation due to double-helix symmetry of DNA Start by making string lower case again. This is generally good practice $dna = "accACgttAGgtct"; $revcomp = lc($dna); $revcomp = reverse($revcomp); $revcomp =~ tr/acgt/tgca/; print $revcomp; Reverse the string Replace 'a' with 't', 'c' with 'g', 'g' with 'c' and 't' with 'a' agacctaacgtggt
Arrays • An array is a variable holding a list of items • We can think of this as a list with 4 entries @nucleotides = ('a', 'c', 'g', 't'); print "Nucleotides: @nucleotides\n"; Nucleotides: a c g t a c g t the array is theset of all four elements element 0 Note that the element indices start at zero. element 3 element 1 element 2
Array literals • There are several, equally valid ways to assign an entire array at once. This is the most common: a comma- separated list, delimited by parentheses @a = (1,2,3,4,5); print "a = @a\n"; @b = ('a','c','g','t'); print "b = @b\n"; @c = 1..5; print "c = @c\n"; @d = qw(a c g t); print "d = @d\n"; a = 1 2 3 4 5 b = a c g t c = 1 2 3 4 5 d = a c g t
Accessing arrays • To access array elements, use square brackets; e.g. $x[0] means "element zero of array @x" • Remember, element indices start at zero! • If you use an array @x in a scalar context, such as @x+0, then Perl assumes that you wanted the length of the array. @x = ('a', 'c', 'g', 't'); print $x[0], "\n"; $i = 2; print $x[$i], "\n"; a g @x = ('a', 'c', 'g', 't'); print @x + 0; 4
Array operations • You can sort and reverse arrays... • You can read the entire contents of a file into an array (each line of the file becomes an element of the array) @x = ('a', 't', 'g', 'c'); @y = sort @x; @z = reverse @y; print "x = @x\n"; print "y = @y\n"; print "z = @z\n"; x = a t g c y = a c g t z = t g c a open FILE, "sequence.txt"; @x = <FILE>;
push, pop, shift, unshift @x = (‘A’, ‘T’, ‘W’); print "I started with @x\n"; $y = pop @x; push @x, ‘G’; print "Then I had @x\n"; $z = shift @x; unshift @x, ‘C’; print "Now I have @x\n"; print "I lost $y and $z\n"; pop removes the last element of an array push adds an element to the end of an array shift removes the first element of an array unshift adds an element to the start of an array I started with A T W Then I had A T G Now I have C T G I lost W and A
foreach • Finding the total of a list of numbers: • Equivalent to: @val = (4, 19, 1, 100, 125, 10); $total = 0; foreach $x (@val) { $total += $x; } print $total; foreach statement loops through each entry in an array 259 @val = (4, 19, 1, 100, 125, 10); $total = 0; for ($i = 0; $i < @val; ++$i) { $total += $val[$i]; } print $total; 259
The @ARGV array • A special array is @ARGV • This contains the command-line arguments when the program is invoked at the Unix prompt • It's a way for the user to pass information into the program
Exploding a sequence into an array • The programming language C treats all strings as arrays $dna = "accggtgtgcg"; print "String: $dna\n"; @array = split( //, $dna); print "Array: @array\n"; The split statement turns a string into an array. Here, it splits after every character, but we can also split at specific points, like a restriction enzyme String: accggtgtgcg Array: a c c g g t g t g c g
Taking a slice of an array • The syntax @x[i,j,k...] returns a (3-element) array containing elements i,j,k... of array @x @nucleotides = ('a', 'c', 'g', 't'); @purines = @nucleotides[0,2]; @pyrimidines = @nucleotides[1,3]; print "Nucleotides: @nucleotides\n"; print "Purines: @purines\n"; print "Pyrimidines: @pyrimidines\n"; Nucleotides: a c g t Purines: a g Pyrimidines: c t
Finding elements in an array • The grep command is used to select some elements from an array • The statement grep(EXPR,LIST) returns all elements of LIST for which EXPR evaluates to true (when $_ is set to the appropriate element) • e.g. select all numbers over 100: @numbers = (101, 235, 10, 50, 100, 66, 1005); @numbersOver100 = grep ($_ > 100, @numbers); print "Numbers: @numbers\n"; print "Numbers over 100: @numbersOver100\n"; Numbers: 101 235 10 50 100 66 1005 Numbers over 100: 101 235 1005
Applying a function to an array • The map command applies a function to every element in an array • Similar syntax to list: map(EXPR,LIST) applies EXPR to every element in LIST • Example: multiply every number by 3 @numbers = (101, 235, 10, 50, 100, 66, 1005); @numbersTimes3 = map ($_ * 3, @numbers); print "Numbers: @numbers\n"; print "Numbers times 3: @numbersTimes3\n"; Numbers: 101 235 10 50 100 66 1005 Numbers times 3: 303 705 30 150 300 198 3015
Review: pattern-matching • The following code:prints the string "Found MCB binding site!" if the pattern "ACGCGT" is present in the default variable, $_ • Instead of using $_ we can "bind" the pattern to another variable (e.g. $dna) using this syntax: • We can replace the first occurrence of ACGCGT with the string _MCB_ using the following syntax: • We can replace all occurrences by appending a 'g': if (/ACGCGT/) { print "Found MCB binding site!\n"; } if ($dna =~ /ACGCGT/) { print "Found MCB binding site!\n"; } $dna =~ s/ACGCGT/_MCB_/; $dna =~ s/ACGCGT/_MCB_/g;