Topics

Topics • Logical expression • string functions: substr and index • random numbers and mutation • hashes • Transcription, translation, genetic code • Quiz 2 BINF634 FALL13 - LECTURE 4

Any expression in Perl can be interpreted as a logical value true or false Scalar context: false if 0, "", or undefined otherwise true List context: false if () or undefined true otherwise my $x; # or: my $x = 0; if ($x) { $x++ } else { $x = 17 } print "$x\n"; 17 my $x = 2; while ($x) { print $x--, "\n"; } 2 1 my @a = ("A", "T"); while (@a) { print shift @a, "\n"; } A T Logical Value of Expressions BINF634 FALL13 - LECTURE 4

Logical Operators Logical operator "short-circuit": only evaluate second argument if necessary Example: open(GRADES, "grades") or die "Can't open file grades\n"; See Wall (p 109-110) for discussion of C-style &&, || and ! operators BINF634 FALL13 - LECTURE 4

String Functions: substr substr EXPR, OFFSET, LENGTH, REPLACEMENT Return substring of string EXPR at position OFFSET and length LENGTH $s = "hello, world!"; $x = substr $s, 1, 1; # $x = "e" $x = substr $s, 0, 3; # $x = "hel" $x = substr $s, 7; # $x = "world!" # The substring is replaced by REPLACEMENT if used: substr($s,0,5,"goodbye"); # $s = "goodbye, world!"; # This does the same thing: $s = "hello, world!"; substr($s,0,5) = "goodbye"; Note: Predefined Perl functions may be used with or without parentheses around their arguments BINF634 FALL13 - LECTURE 4

# From example7-4.pl # matching_percentage # # Subroutine to calculate the percentage of identical bases in two # equal length DNA sequences sub matching_percentage { my($string1, $string2) = @_; # we assume that the strings have the same length my $length = length($string1); my $count = 0; for (my $position=0; $position < $length ; ++$position) { if(substr($string1,$position,1) eq substr($string2,$position,1)) { ++$count; } } return $count / $length; } BINF634 FALL13 - LECTURE 4

#!/usr/bin/perl use strict; use warnings; my $dna = "AAAAATTTTTGGGGGTTTTT"; print_sequence(dna,8); exit; # A subroutine to format and print sequence data sub print_sequence { my($sequence, $length) = @_; # Print sequence in lines of $length for (my $pos = 0 ; $pos < length($sequence); $pos += $length ) { print substr($sequence, $pos, $length), "\n"; } } AAAAATTT TTGGGGGC CCCC BINF634 FALL13 - LECTURE 4

String Functions: index index STR, SUBSTR, OFFSET Return the position of the first occurrence of SUBSTR in STR; If OFFSET is given, skip this many letters before looking Returns -1 if SUBSTR not found $dna = "GATGCCATGAAATGC"; $pos = index $dna, "ATG"; print "ATG found at position $pos\n"; # answer: 1 pos = -1; while (($pos = index($dna, "ATG", $pos)) > -1) { print "ATG found at position $pos\n"; $pos++; } OUTPUT: ATG found at position 1 ATG found at position 6 ATG found at position 11 BINF634 FALL13 - LECTURE 4

Random Number Generation $r = rand; # $r is random between 0 and 1 (0<=$r< 1.0) $r = rand(100); # random between 0 and 100 $d = int rand(101); # random integer from 0 to 100 srand($seed); # seeds the random number generator If a program doesn't call srand(), then it generates different random numbers each time it is run. If program does call srand($seed), we get the same sequence of random number each time the program is run with the same value for $seed. BINF634 FALL13 - LECTURE 4

An Example to Generate Random Stories - I #!/usr/bin/perl -w # Example 7-1 Children's game, demonstrating primitive artificial intelligence, # using a random number generator to randomly select parts of sentences. use strict; use warnings; # Declare the variables my $count; my $input; my $number; my $sentence; my $story; # Here are the arrays of parts of sentences: my @nouns = ( 'Dad', 'TV', 'Mom', 'Groucho', 'Rebecca', 'Harpo', 'Robin Hood', 'Joe and Moe', ); BINF634 FALL13 - LECTURE 4

my @verbs = ( 'ran to', 'giggled with', 'put hot sauce into the orange juice of', 'exploded', 'dissolved', 'sang stupid songs with', 'jumped with', ); my @prepositions = ( 'at the store', 'over the rainbow', 'just for the fun of it', 'at the beach', 'before dinner', 'in New York City', 'in a dream', 'around the world', ); # Seed the random number generator. # time|$$ combines the current time with the current process id # in a somewhat weak attempt to come up with a random seed. srand(time|$$); # This do-until loop composes six-sentence "stories". # until the user types "quit". do { # (Re)set $story to the empty string each time through the loop $story = ''; # Make 6 sentences per story. for ($count = 0; $count < 6; $count++) { An Example to Generate Random Stories - II BINF634 FALL13 - LECTURE 4

# Notes on the following statements: # 1) scalar @array gives the number of elements in the array. # 2) rand returns a random number greater than 0 and # less than scalar(@array). # 3) int removes the fractional part of a number. # 4) . joins two strings together. $sentence = $nouns[int(rand(scalar @nouns))] . " " . $verbs[int(rand(scalar @verbs))] . " " . $nouns[int(rand(scalar @nouns))] . " " . $prepositions[int(rand(scalar @prepositions))] . '. '; $story .= $sentence; } # Print the story. print "\n",$story,"\n"; # Get user input. print "\nType \"quit\" to quit, or press Enter to continue: "; $input = <STDIN>; # Exit loop at user's request } until($input =~ /^\s*q/i); exit; An Example to Generate Random Stories - III BINF634 FALL13 - LECTURE 4

Randomly Selecting and Element of an Array • $verbs[int(rand(scalar @verbs)))] • verbs[int(rand(7))] #Why?? • What does rand(7) do? • How about int(rand(7))? BINF634 FALL13 - LECTURE 4

An Alternative Way to Randomly Select the Elements in an Array • $verbs[int rand scalar @verbs] • This will actually work • $verbs[rand @verbs] #How does this work? BINF634 FALL13 - LECTURE 4

# From example7-2.pl # A subroutine to perform a mutation in a string of DNA # sub mutate { my($dna) = @_; my(@nucleotides) = ('A', 'C', 'G', 'T'); # Pick a random position in the DNA my($position) = randomposition($dna); # Pick a random nucleotide my($newbase) = randomnucleotide(@nucleotides); # Insert the random nucleotide into the random position in the DNA substr($dna,$position,1,$newbase); return $dna; } BINF634 FALL13 - LECTURE 4

# A subroutine to randomly select a position in a string. sub randomposition { my($string) = @_; # rand returns a decimal number between 0 and its argument. # int returns the integer portion of a decimal number. # # The whole expression returns a random number between 0 and # length-1 return int rand length $string; } # Select at random one of the four nucleotides sub randomnucleotide { my(@nucleotides) = ('A', 'C', 'G', 'T'); return randomelement(@nucleotides); } # randomly select an element from an array sub randomelement { my(@array) = @_; # return $array[int(rand(scalar @array)]; return $array[rand @array]; } BINF634 FALL13 - LECTURE 4

# Subroutine to perform a mutation in a string of DNA-version 2, in # which it is guaranteed that one base will change on each call. sub mutate_better { my($dna) = @_; my(@nucleotides) = ('A', 'C', 'G', 'T'); # Pick a random position in the DNA my($position) = randomposition($dna); # Pick a random nucleotide my($newbase); do { $newbase = randomnucleotide(@nucleotides); # Make sure it's different than the nucleotide we're mutating }until ( $newbase ne substr($dna, $position,1) ); # Insert the random nucleotide into the random position in the DNA substr($dna,$position,1,$newbase); return $dna; } BINF634 FALL13 - LECTURE 4

Hashes(Associative Arrays) • A Hash is a collection of zero or more pairs of scalar values, called keys and values • Hash variable names begin with a percent sign (%) %genes = ( "gene1", "ATTCGT", "gene2", "CTGCCATGA"); • The values are indexed by the keys • Given a key, the hash returns the corresponding value $seq = $genes{"gene2"}; # $seq = "CTGCCATGA" • Note that $genes{"gene2"} is a scalar, so it starts with $ BINF634 FALL13 - LECTURE 4

Hashes • Hashes can be assigned values use key=>value notation: %genes = ( "gene1", "ATTCGT", "gene2", "CTGCCATGA"); %genes = ( gene1=>"ATTCGT", gene2=>"CTGCCATGA"); • Hash elements can be created/altered by assignment statements: $genes{"gene1"} = "ATTCGT"; $genes{gene2} = "CTGCCATGA"; # note: no quotes in key BINF634 FALL13 - LECTURE 4

Hashes (Associative Arrays) %genomes = ( ); # creates an empty hash # two ways to do the same thing: %genomes = ( "virus", 31, "bacteria", 89, "plants", 5 ); %genomes = ( virus => 31, bacteria => 89, plants => 5 ); $genomes{mammals} = 2; # adds a new pair to the hash @genome_list = keys %genomes; # @genome list is now ("plants" , "mammals", "bacteria", "virus") @genome_counts = values %genomes; # @genome_counts is now (5, 2, 89, 31) # keys and values are not guaranteed to return the data is same order # as it was entered, but they are guaranteed to return the data in the # same order as each other. BINF634 FALL13 - LECTURE 4

Hashes The keys function returns a list of all keys in a hash (in some random order) %genes = (gene2=>"CTGCCATGA", gene1=>"ATTCGT"); @key_list = keys(%genes); print "@key_list\n"; # prints: gene1 gene2 # often used to loop through a hash: foreach $key (@key_list) { print "The value of $key is $genes{$key}\n"; } Output: The value of gene1 is ATTCGT The value of gene2 is CTGCCATGA BINF634 FALL13 - LECTURE 4

Hashes # exists($H{$key}) returns TRUE if $key occurs in hash %H # print all the words in a file with their counts my ($file) = @ARGV; open FH, $file; my @lines = <FH>; close FH; my %count = (); for my $line (@lines) { for my $word (split " ", $line) { if (not exists $count{$word}) { # initialize the count for a new word $count{$word} = 1; } else { # update the count for an existing word $count{$word}++; } } } for my $key (sort keys %count) { print "$key $count{$key}\n"; } What does each line in this program do? BINF634 FALL13 - LECTURE 4

Hashes % cat testfile2 a text file with lots of words some words occur once and some words occur more than once % wordcount.pl testfile2 a 1 and 1 file 1 lots 1 more 1 occur 2 of 1 once 2 some 2 text 1 than 1 with 1 words 3 BINF634 FALL13 - LECTURE 4

Hashes (Associative Arrays) • The each function returns a two-element list containing one key from the hash and its associated value • Subsequent calls to each will return another pair, until all pairs have been returned (at which point an empty array is returned) while ( ($genome, $count) = each %genomes ) ) { print “$genome $count\n”; } OUTPUT: (possibly not in this order) plants 5 virus 31 bacteria 89 mammals 2 BINF634 FALL13 - LECTURE 4

More on Hashes (Associative Arrays) • Assigning the return value from values or keys to a scalar gives the number of pairs in a hash: $genome_count = keys %genomes; # $genome_count is now 4 $genome_count = values %genomes; # $genome_count is now 4 • The delete function removes a pair from a hash delete $genomes{bacteria}; $genome_count = keys %genomes; # $genome_count is now 3 BINF634 FALL13 - LECTURE 4

Transcription and Translation • DNA is transcribed to mRNA, mRNA is translated to protein • Start with double stranded DNA ATTCGAGCATGACATCATCGGTA (sense strand) TAAGCTCGTACTGTAGTAGCCAT (complement strand) • DNA double helix separates, allowing polymerase to transcribe one strand: ATTCGAGCATGACATCATCGGTA (sense strand) AUUCGAGCAUGACAUCAUCGGUA (mRNA) | | | | | | | TAAGCTCGTACTGTAGTAGCCAT (complement strand) • mRNA codons translated into protein: AUU CGA GCA UGA CAU CAU CGG … (mRNA) BINF634 FALL13 - LECTURE 4

Codons BINF634 FALL13 - LECTURE 4

Reading Frames • A genomic sequence has 6 reading frames, corresponding to the six possible ways of translating the sequence into three-letter codons. • Frame 1 treats each group of three bases as a codon, starting from the first base • Frame 2 starts at the second base • Frame 3 starts at the third base • Start with the sense strand of DNA: ATTCGAGCATGACATCATCGGTA • Reading frame 1: ATT CGA GCA TGA CAT CAT CGG TA • Reading Frame 2: TTC GAG CAT GAC ATC ATC GGT A • Reading Frame 3: TCG AGC ATG ACA TCA TCG GTA • Each reading frame can be translated into a different protein sequence • Frames 4, 5 and 6 are defined in a similar way, but refer to the opposite strand, which is the reverse complement of the first strand. BINF634 FALL13 - LECTURE 4

Reading Frames • Perl code to process first three reading frames: $dna = ...; for (my $r = 1; $r <= 3; $r++) { my $frame = substr($dna, ?, ?); # fill in the blanks! # process reading frame $r here } # exercise: write a subroutine to return the nth reading frame for a DNA string BINF634 FALL13 - LECTURE 4

BINF634 FALL13 - LECTURE 4

Using the Genetic Code We’ll look at four versions of translating DNA using the genetic code: • Look up the codon using if-then-else • Same as above, but use patterns to reflect redundancy of genetic code • Use a hash to look up each codon • Same as above, but more efficiently • See BeginPerlBioinfo.pm BINF634 FALL13 - LECTURE 4

Version 1 # codon2aa # A subroutine to translate a DNA 3-character codon to an amino acid sub codon2aa_v1 { my($codon) = @_; if ( $codon =~ /TCA/i ) { return "S" } # Serine elsif ( $codon =~ /TCC/i ) { return "S" } # Serine elsif ( $codon =~ /TCG/i ) { return "S" } # Serine elsif ( $codon =~ /TCT/i ) { return "S" } # Serine elsif ( $codon =~ /TTC/i ) { return "F" } # Phenylalanine elsif ( $codon =~ /TTT/i ) { return "F" } # Phenylalanine elsif ( $codon =~ /TTA/i ) { return "L" } # Leucine elsif ( $codon =~ /TTG/i ) { return "L" } # Leucine elsif ( $codon =~ /TAC/i ) { return "Y" } # Tyrosine elsif ( $codon =~ /TAT/i ) { return "Y" } # Tyrosine elsif ( $codon =~ /TAA/i ) { return "_" } # Stop elsif ( $codon =~ /TAG/i ) { return "_" } # Stop elsif ( $codon =~ /TGC/i ) { return "C" } # Cysteine elsif ( $codon =~ /TGT/i ) { return "C" } # Cysteine elsif ( $codon =~ /TGA/i ) { return "_" } # Stop elsif ( $codon =~ /TGG/i ) { return "W" } # Tryptophan . . elsif ( $codon =~ /GGT/i ) { return "G" } # Glycine else { print STDERR "Bad codon \"$codon\"!!\n"; exit; } } Remember in relating to the translation table on slide 29 that the T’s become U’s BINF634 FALL13 - LECTURE 4 Problem: takes longer to look up codons near end of list.

Version 2 # codon2aa # A subroutine to translate a DNA 3-character codon to an amino acid sub codon2aa_v2 { my($codon) = @_; if ( $codon =~ /GC./i) { return "A" } # Alanine elsif ( $codon =~ /TG[TC]/i) { return "C" } # Cysteine elsif ( $codon =~ /GA[TC]/i) { return "D" } # Aspartic Acid elsif ( $codon =~ /GA[AG]/i) { return "E" } # Glutamic Acid elsif ( $codon =~ /TT[TC]/i) { return "F" } # Phenylalanine elsif ( $codon =~ /GG./i) { return "G" } # Glycine elsif ( $codon =~ /CA[TC]/i) { return "H" } # Histidine elsif ( $codon =~ /AT[TCA]/i) { return "I" } # Isoleucine elsif ( $codon =~ /AA[AG]/i) { return "K" } # Lysine elsif ( $codon =~ /TT[AG]|CT./i) { return "L" } # Leucine elsif ( $codon =~ /ATG/i) { return "M" } # Methionine elsif ( $codon =~ /AA[TC]/i) { return "N" } # Asparagine elsif ( $codon =~ /CC./i) { return "P" } # Proline elsif ( $codon =~ /CA[AG]/i) { return "Q" } # Glutamine elsif ( $codon =~ /CG.|AG[AG]/i) { return "R" } # Arginine elsif ( $codon =~ /TC.|AG[TC]/i) { return "S" } # Serine elsif ( $codon =~ /AC./i) { return "T" } # Threonine elsif ( $codon =~ /GT./i) { return "V" } # Valine elsif ( $codon =~ /TGG/i) { return "W" } # Tryptophan elsif ( $codon =~ /TA[TC]/i) { return "Y" } # Tyrosine elsif ( $codon =~ /TA[AG]|TGA/i) { return "_" } # Stop else { print STDERR "Bad codon \"$codon\"!!\n"; exit; } } BINF634 FALL13 - LECTURE 4 Still takes longer to look up codons near end of list.

# codon2aa # Translate codon using hash lookup (pp 160-162 and BeginPerlBioinfo.pm) sub codon2aa_v3 { my($codon) = @_; $codon = uc $codon; my(%genetic_code) = ( "TCA" => "S", # Serine "TCC" => "S", # Serine "TCG" => "S", # Serine "TCT" => "S", # Serine "TTC" => "F", # Phenylalanine "TTT" => "F", # Phenylalanine "TTA" => "L", # Leucine "TTG" => "L", # Leucine "TAC" => "Y", # Tyrosine "TAT" => "Y", # Tyrosine "TAA" => "_", # Stop "TAG" => "_", # Stop . . . "GGT" => "G", # Glycine ); if(exists $genetic_code{$codon}) { return $genetic_code{$codon}; } else{ die "Bad codon: $codon";} } Version 3 Efficiency Problem: the hash is redefined for every codon lookup. Robustness Problem: should the program die if the codon is unknown? BINF634 FALL13 - LECTURE 4

More Efficient Solution • Define the genetic code hash just once (outside subroutine) • Use subroutine to perform lookup my(%genetic_code) = ( "TCA" => "S", # Serine "TCC" => "S", # Serine "TCG" => "S", # Serine "TCT" => "S", # Serine "TTC" => "F", # Phenylalanine "TTT" => "F", # Phenylalanine "TTA" => "L", # Leucine "TTG" => "L", # Leucine "TAC" => "Y", # Tyrosine "TAT" => "Y", # Tyrosine "TAA" => "_", # Stop "TAG" => "_", # Stop . . . "GGA" => "G", # Glycine "GGC" => "G", # Glycine "GGG" => "G", # Glycine "GGT" => "G", # Glycine ); BINF634 FALL13 - LECTURE 4

# codon2aa # Translate a DNA 3-character codon to an amino acid # using hash lookup passed as argument sub codon2aa { my($codon) = @_; $codon = uc $codon; if (exists $genetic_code{$codon}) { return $genetic_code{$codon}; } else { return "X"; # alternative error response: # die "Bad codon: $codon"; } } Version 4 The exists function return true if an element with the given key occurs in the hash. BINF634 FALL13 - LECTURE 4

Wrap up • Program #1 was due tonight. Please talk to me privately either via email or in person if you did not turn it in. • Program #2 is posted on course website • hint: Read about BeginPerlBioinfo.pm in chapter 8 • Read the code in BeginPerlBioinfo.pm • Start NOW! • Read: • Tisdall Ch. 9 • Wall Ch. 5 BINF634 FALL13 - LECTURE 4

Topics

Topics

Presentation Transcript

TOPICS

Topics

Topics

Topics

Topics

Topics

Topics:

Topics

Topics

Topics

Topics

Topics

TOPICS

Topics

Topics:

Topics

Topics

Topics

Topics