Bioinformatics 生物信息学理论和实践唐继军 jtang@cse.sc 13928761660

Bioinformatics生物信息学理论和实践唐继军jtang@cse.sc.edu13928761660Bioinformatics生物信息学理论和实践唐继军jtang@cse.sc.edu13928761660

Exercise 1 • Ask for a protein file in fasta format • Ask for an amino acid • Count the frequency of that amino acid • TKFHSNAHFYDCWRMLQYQLDMRCMRAISTFSPHCGMEHMPDQTHNQGEMCKPRMWQVSMNQSCNHTPPFRKTYVEWDYMAKALIAPYTLGWLASTCFIW

Exercise 2 • Ask for an RNA file in fasta format • Convert it to RNA • Ask for a codon • Count the frequency of that codon • TCGTACTTAGAAATGAGGGTCCGCTTTTGCCCACGCACCTGATCGCTCCTCGTTTGCTTTTAAGAACCGGACGAACCACAGAGCATAAGGAGAACCTCTAGCTGCTTTACAAAGTACTGGTTCCCTTTCCAGCGGGATGCTTTATCTAAACGCAATGAGAGAGGTATTCCTCAGGCCACATCGCTTCCTAGTTCCGCTGGGATCCATCGTTGGCGGCCGAAGCCGCCATTCCATAGTGAGTTCTTCGTCTGTGTCATTCTGTGCCAGATCGTCTGGCAAATAGCCGATCCAGTTTATCTCTCGAAACTATAGTCGTACAGATCGAAATCTTAAGTCAAATCACGCGACTAGACTCAGCTCTATTTTAGTGGTCATGGGTTTTGGTCCCCCCGAGCGGTGCAACCGATTAGGACCATGTAGAACATTAGTTATAAGTCTTCTTTTAAACACAATCTTCCTGCTCAGTGGTACATGGTTATCGTTATTGCTAGCCAGCCTGATAAGTAACACCACCACTGCGACCCTAATGCGCCCTTTCCACGAACACAGGGCTGTCCGATCCTATATTACGACTCCGGGAAGGGGTTCGCAAGTCGCACCCTAAACGATGTTGAAGGCTCAGGATGTACACGCACTAGTACAATACATACGTGTTCCGGCTCTTATCCTGCATCGGAAGCTCAATCATGCATCGCACCAGCGTGTTCGTGTCATCTAGGAGGGGCGCGTAGGATAAATAATTCAATTAAGATATCGTTATGCTAGTATACGCCTACCCGTCACCGGCCAACAGTGTGCAGATGGCGCCACGAGTTACTGGCCCTGATTTCTCCGCTTCTAATACCGCACACTGGGCAATACGAGCTCAAGCCAGTCTCGCAGTAACGCTCATCAGCTAACGAAAGAGTTAGAGGCTCGCTAAATCGCACTGTCGGGGTCCCTTGGGTATTTTACACTAGCGTCAGGTAGGCTAGCATGTGTCTTTCCTTCCAGGGGTATG

Subroutine • Some code needs to be reused • A good way to organize code • Called "function" in some languages • Name • Return • Parameters (@_)

sub codon2aa { my($codon) = @_; if ( $codon =~ /GC./i) { return 'A' } Alanine elsif ( $codon =~ /TG[TC]/i) { return 'C' } Cysteine elsif ( $codon =~ /GA[TC]/i) { return 'D' } Aspartic Acid elsif ( $codon =~ /GA[AG]/i) { return 'E' } Glutamic Acid elsif ( $codon =~ /TT[TC]/i) { return 'F' } Phenylalanine elsif ( $codon =~ /GG./i) { return 'G' } Glycine elsif ( $codon =~ /CA[TC]/i) { return 'H' } Histidine elsif ( $codon =~ /AT[TCA]/i) { return 'I' } Isoleucine elsif ( $codon =~ /AA[AG]/i) { return 'K' } Lysine elsif ( $codon =~ /TT[AG]|CT./i) { return 'L' } Leucine elsif ( $codon =~ /ATG/i) { return 'M' } Methionine elsif ( $codon =~ /AA[TC]/i) { return 'N' } Asparagine elsif ( $codon =~ /CC./i) { return 'P' } Proline elsif ( $codon =~ /CA[AG]/i) { return 'Q' } Glutamine elsif ( $codon =~ /CG.|AG[AG]/i) { return 'R' } Arginine elsif ( $codon =~ /TC.|AG[TC]/i) { return 'S' } Serine elsif ( $codon =~ /AC./i) { return 'T' } Threonine elsif ( $codon =~ /GT./i) { return 'V' } Valine elsif ( $codon =~ /TGG/i) { return 'W' } Tryptophan elsif ( $codon =~ /TA[TC]/i) { return 'Y' } Tyrosine elsif ( $codon =~ /TA[AG]|TGA/i) { return '_' } Stop else {print STDERR "Bad codon \"$codon\"!!\n"; exit; } }

!/usr/bin/perl –w print "Please type the filename: "; $dna_filename = <STDIN>; chomp $dna_filename; open(DNAFILE, $dna_filename); $name = <DNAFILE>;@DNA = <DNAFILE>;close DNAFILE; $DNA = join( '', @DNA);$DNA =~ s/\s//g; print "First ", dna2peptide($DNA), "\n"; print "Second ", dna2peptide(substr($DNA, 1)), "\n"; print "Third ", dna2peptide(substr($DNA, 2)), "\n"; $DNA = reverse $DNA; print "Fourth ", dna2peptide($DNA), "\n"; print "Fifth ", dna2peptide(substr($DNA, 1)), "\n"; print "Sixth ", dna2peptide(substr($DNA, 2)), "\n"; sub dna2peptide { my ($dna) = @_; my $protein = ""; for(my $i=0; $i < (length($dna) - 2) ; $i += 3) { $codon = substr($dna,$i,3); $protein .= codon2aa($codon); } return $protein; } sub codon2aa { ... }

Modules • A Perl Module is a self-contained pieceof [Perl] code that can be used by a Perl program later • Like a library • End with extension .pm • Needs a 1 at the end

Bio.pm sub codon2aa { .... .... } sub dna2peptide { .... .... } 1

!/usr/bin/perl -w use Bio; print "Please type the filename: "; $dna_filename = <STDIN>; chomp $dna_filename; open(DNAFILE, $dna_filename); $name = <DNAFILE>;@DNA = <DNAFILE>;close DNAFILE; $DNA = join( '', @DNA);$DNA =~ s/\s//g; print "First ", dna2peptide($DNA), "\n"; print "Second ", dna2peptide(substr($DNA, 1)), "\n"; print "Third ", dna2peptide(substr($DNA, 2)), "\n"; $DNA = reverse $DNA; $DNA =~ tr/ACGTacgt/TGCAtgca/; print "Fourth ", dna2peptide($DNA), "\n"; print "Fifth ", dna2peptide(substr($DNA, 1)), "\n"; print "Sixth ", dna2peptide(substr($DNA, 2)), "\n";

Bio.pm sub codon2aa { .... .... } sub dna2peptide { .... .... } sub fasta_read { print "Please type the filename: "; my $dna_filename = <STDIN>; chomp $dna_filename; unless (open(DNAFILE, $dna_filename)) { print "Cannot open file ", $dna_filename, "\n"; } $name = <DNAFILE>;@DNA = <DNAFILE>;close DNAFILE; $DNA = join( '', @DNA);$DNA =~ s/\s//g; return $DNA; } 1

!/usr/bin/perl -w use Bio; $DNA = fasta_read(); print "First ", dna2peptide($DNA), "\n"; print "Second ", dna2peptide(substr($DNA, 1)), "\n"; print "Third ", dna2peptide(substr($DNA, 2)), "\n"; $DNA = reverse $DNA; $DNA =~ tr/ACGTacgt/TGCAtgca/; print "Fourth ", dna2peptide($DNA), "\n"; print "Fifth ", dna2peptide(substr($DNA, 1)), "\n"; print "Sixth ", dna2peptide(substr($DNA, 2)), "\n";

Scope • my provides lexical scoping; a variable declared with my is visible only within the block in which it is declared. • Blocks of code are hunks within curly braces {}; files are blocks. • Use use vars qw([list of var names]) or our ([var_names]) to create package globals.

!/usr/bin/perl -w use Bio; use strict; use warnings; $DNA = fasta_read(); print "First ", dna2peptide($DNA), "\n"; print "Second ", dna2peptide(substr($DNA, 1)), "\n"; print "Third ", dna2peptide(substr($DNA, 2)), "\n"; $DNA = reverse $DNA; $DNA =~ tr/ACGTacgt/TGCAtgca/; print "Fourth ", dna2peptide($DNA), "\n"; print "Fifth ", dna2peptide(substr($DNA, 1)), "\n"; print "Sixth ", dna2peptide(substr($DNA, 2)), "\n";

Variable "$DNA" is not imported at frame2.pl line 6. Variable "$DNA" is not imported at frame2.pl line 8. Variable "$DNA" is not imported at frame2.pl line 9. Variable "$DNA" is not imported at frame2.pl line 10. Variable "$DNA" is not imported at frame2.pl line 12. Variable "$DNA" is not imported at frame2.pl line 12. Variable "$DNA" is not imported at frame2.pl line 13. Variable "$DNA" is not imported at frame2.pl line 14. Variable "$DNA" is not imported at frame2.pl line 15. Global symbol "$DNA" requires explicit package name at frame2.pl line 6. Global symbol "$DNA" requires explicit package name at frame2.pl line 8. Global symbol "$DNA" requires explicit package name at frame2.pl line 9. Global symbol "$DNA" requires explicit package name at frame2.pl line 10. Global symbol "$DNA" requires explicit package name at frame2.pl line 12. Global symbol "$DNA" requires explicit package name at frame2.pl line 12. Global symbol "$DNA" requires explicit package name at frame2.pl line 13. Global symbol "$DNA" requires explicit package name at frame2.pl line 14. Global symbol "$DNA" requires explicit package name at frame2.pl line 15. Execution of frame2.pl aborted due to compilation errors.

!/usr/bin/perl -w use Bio; use strict; use warnings; my $DNA = fasta_read(); print "First ", dna2peptide($DNA), "\n"; print "Second ", dna2peptide(substr($DNA, 1)), "\n"; print "Third ", dna2peptide(substr($DNA, 2)), "\n"; $DNA = reverse $DNA; $DNA =~ tr/ACGTacgt/TGCAtgca/; print "Fourth ", dna2peptide($DNA), "\n"; print "Fifth ", dna2peptide(substr($DNA, 1)), "\n"; print "Sixth ", dna2peptide(substr($DNA, 2)), "\n";

my $x = 10; for (my $x = 0; $x < 5; $x++) { Scope(); print $x, "\n"; } print $x, "\n"; sub Scope { my $x = 0; }

sub get_file_data { my($filename) = @_; use strict; use warnings; # Initialize variables my @filedata = ( ); unless( open(GET_FILE_DATA, $filename) ) { print STDERR "Cannot open file \"$filename\"\n\n"; exit; } @filedata = <GET_FILE_DATA>; close GET_FILE_DATA; return @filedata; }

sub extract_sequence_from_fasta_data { my(@fasta_file_data) = @_; my $sequence = ''; foreach my $line (@fasta_file_data) { if ($line =~ /^\s*$/) { next; } elsif($line =~ /^\s*#/) { next; } elsif($line =~ /^>/) { next; } else { $sequence .= $line; } } # remove non-sequence data (in this case, whitespace) from $sequence string $sequence =~ s/\s//g; return $sequence; }

Molecular Scissors Molecular Cell Biology, 4th edition

Discovering Restriction Enzymes • HindII - first restriction enzyme – was discovered accidentally in 1970 while studying how the bacterium Haemophilus influenzae takes up DNA from the virus • Recognizes and cuts DNA at sequences: • GTGCAC • GTTAAC

Recognition Sites of Restriction Enzymes Molecular Cell Biology, 4th edition

Uses of Restriction Enzymes • Recombinant DNA technology • Cloning • cDNA/genomic library construction • DNA mapping

Restriction Enzyme Database • http://rebase.neb.com/rebase/rebase.html

http://rebase.neb.com/rebase/rebase.files.html

R = G or A Y = C or T M = A or C K = G or T S = G or C W = A or T B = not A (C or G or T) D = not C (A or G or T) H = not G (A or C or T) V = not T (A or C or G) N = A or C or G or T

sub IUB_to_regexp { my($iub) = @_; my $regular_expression = ‘’; my %iub2character_class = ( A => 'A', C => 'C', G => 'G', T => 'T', R => '[GA]', Y => '[CT]', M => '[AC]', K => '[GT]', S => '[GC]', W => '[AT]', B => '[CGT]', D => '[AGT]', H => '[ACT]', V => '[ACG]', N => '[ACGT]', ); $iub =~ s/\^//g; for ( my $i = 0 ; $i < length($iub) ; ++$i ) { $regular_expression .= $iub2character_class{substr($iub, $i, 1)}; } return $regular_expression; }

Hash • Initialize: my %hash = (); • Add key/value pair: $hash{$key} = $value; • Add more keys: • %hash = ( 'key1', 'value1', 'key2', 'value2 ); • %hash = ( key1 => 'value1', key2 => 'value2', ); • Delete: delete $hash{$key};

while ( my ($key, $value) = each(%hash) ) { print "$key => $value\n"; } for my $key ( keys %hash ) { my $value = $hash{$key}; print "$key => $value\n"; }

sub codon2aa { my($codon) = @_; $codon = uc $codon; my %genetic_code = ( 'TCA' => 'S', # Serine 'TCC' => 'S', # Serine 'TCG' => 'S', # Serine 'TCT' => 'S', # Serine 'TTC' => 'F', # Phenylalanine 'TTT' => 'F', # Phenylalanine 'TTA' => 'L', # Leucine 'TTG' => 'L', # Leucine #Many more ); if(exists $genetic_code{$codon}) { return $genetic_code{$codon}; }else{ print STDERR "Bad codon \"$codon\"!!\n"; exit; } }

sub parseREBASE { my($rebasefile) = @_; my @rebasefile = ( );my %rebase_hash = ( );my $name;my $site;my $regexp; open($rebase_filehandle, $rebasefile) or die "Cannot open file\n"; while(<$rebase_filehandle>) { # Discard header lines ( 1 .. /Rich Roberts/ ) and next; # Discard blank lines /^\s*$/ and next; # Split the two (or three if includes parenthesized name) fields my @fields = split( " ", $_); $name = shift @fields; $site = pop @fields; # Translate the recognition sites to regular expressions $regexp = IUB_to_regexp($site); # Store the data into the hash $rebase_hash{$name} = "$site $regexp"; } # Return the hash containing the reformatted REBASE data return %rebase_hash; }

Range • ( 1 .. /Rich Roberts/ ) and next • from first line till some line containing Rich Roberts • If that is true, it will check the statement after "and" • If that is not true, it will not check the statement after "and" • open(…) or die • If can open, the statement is already true, no need to check the statement after "or" • If cannot open, the statement is false, need to check the statement after "or" to see if it can be true

@fred = (1,2,3); @barney = @fred; @huh = 1; @fred = qw(one two); @barney = (4,5,@fred,6,7); @barney = (8,@barney); ($a,$b,$c) = (1,2,3); @fred = (@barney = (2,3,4)); @fred = @barney = (2,3,4); @fred = (1,2,3); $fred[3] = "hi"; $fred[6] = "ho"; # @fred is now (1,2,3,"hi",undef,undef,"ho")

Array operators • push and pop (right-most element) • @mylist = (1,2,3); push(@mylist,4,5,6); • $oldvalue = pop(@mylist); • shift and unshift (left-most element) • @fred = (5,6,7); unshift(@fred,2,3,4); • $x = shift(@fred); • reverse: @a = (7,8,9); @b = reverse(@a); • sort: @a = (7,9,9); @b = sort(@a);

sub match_positions { my($regexp, $sequence) = @_; use BeginPerlBioinfo; my @positions = ( ); while ( $sequence =~ /$regexp/ig ) { push ( @positions, pos($sequence) - length($&) + 1); } return @positions; }

use BeginPerlBioinfo; my %rebase_hash = ( ); my @file_data = ( ); my $query = ''; my $dna = ''; my $recognition_site = ''; my $regexp = ''; my @locations = ( ); @file_data = get_file_data("sample.dna"); $dna = extract_sequence_from_fasta_data(@file_data); %rebase_hash = parseREBASE('bionet'); do { print "Search for what restriction site for (or quit)?: "; $query = <STDIN>; chomp $query; if ($query =~ /^\s*$/ ) {exit; } if ( exists $rebase_hash{$query} ) { ($recognition_site, $regexp) = split ( " ", $rebase_hash{$query}); @locations = match_positions($regexp, $dna); if (@locations) { print "Searching for $query $recognition_site $regexp\n"; print "Restriction site for $query at :", join(" ", @locations), "\n"; } else { print "A restriction enzyme $query is not in the DNA:\n"; } } } until ( $query =~ /quit/ ); exit;

Bioinformatics 生物信息学理论和实践唐继军 jtang@cse.sc 13928761660