Bioinformatics 生物信息学理论和实践唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb

Bioinformatics生物信息学理论和实践唐继军jtang@cse.sc.edu北京林业大学计算生物学中心www.bjfuccb.eduBioinformatics生物信息学理论和实践唐继军jtang@cse.sc.edu北京林业大学计算生物学中心www.bjfuccb.edu

Hash • Initialize: my %hash = (); • Add key/value pair: $hash{$key} = $value; • Add more keys: • %hash = ( 'key1', 'value1', 'key2', 'value2 ); • %hash = ( key1 => 'value1', key2 => 'value2', ); • Delete: delete $hash{$key};

Print to file • Open a file to print • open FILE, ">filename.txt"; • open (FILE, ">filename.txt“); • Print to the file • print FILE $str;

#Append open(FILE, ">>out") or die "Cannot open file to write"; print FILE "Test\n"; close FILE; exit;

#!/usr/bin/perl print "My name is $0 \n"; print "First arg is: $ARGV[0] \n"; print "Second arg is: $ARGV[1] \n"; print "Third arg is: $ARGV[2] \n"; $num = $#ARGV + 1; print "How many args? $num \n"; print "The full argument string was: @ARGV \n";

use BeginPerlBioinfo; my %rebase_hash = ( ); my @file_data = ( ); my $query = ''; my $dna = ''; my $recognition_site = ''; my $regexp = ''; my @locations = ( ); @file_data = get_file_data($ARGV[0]); $dna = extract_sequence_from_fasta_data(@file_data); %rebase_hash = parseREBASE($ARGV[1]); do { print "Search for what restriction site for (or quit)?: "; $query = <STDIN>; chomp $query; if ($query =~ /^\s*$/ ) {exit; } if ( exists $rebase_hash{$query} ) { ($recognition_site, $regexp) = split ( " ", $rebase_hash{$query}); @locations = match_positions($regexp, $dna); if (@locations) { print "Searching for $query $recognition_site $regexp\n"; print "Restriction site for $query at :", join(" ", @locations), "\n"; } else { print "A restriction enzyme $query is not in the DNA:\n"; } } } until ( $query =~ /quit/ ); exit;

Regular Expression • ^ beginning of string • $ end of string • . any character except newline • * match 0 or more times • + match 1 or more times • ? match 0 or 1 times; • | alternative • ( ) grouping; “storing” • [ ] set of characters • { } repetition modifier • \ quote or special

$mystring = "[2004/04/13] The date of this article."; if($mystring =~ m/(\d)/) { print "The first digit is $1."; } if($mystring =~ m/(\d+)/) { print "The first number is $1."; } if($mystring =~ m/(\d+)\/(\d+)\/(\d+)/) { print "The date is $1-$2-$3"; } while($mystring =~ m/(\d+)/g) { print "Found number $1."; } @myarray = ($mystring =~ m/(\d+)/g); print join(",", @myarray);

$mystring = "[2004/04/13] The date of this article."; if($mystring =~ m/(\d+)\/(\d+)\/(\d+)/) { print "The date is $1-$2-$3"; }

Download and install programs • Unzip or untar • unzip • If file.tar.gz, tar xvfz file.tar.gz • Go to the directory and “./configure” • Then “make”

Excercies • Download clustalw • Try to install it

System subroutine system ("ls –ltr");

Excercies 2 • Use pro.fasta • Find alignment for each triple of protein • Let’s design the program together • Use “system” in perl • system ("command parameters");

sub ReadFasta { my ($fname) = @_; open(FILE, $fname) or die "Cannot open $fname\n"; my $data = ""; my @dnas = (); while(my $line = <FILE>) { if ($line =~ /^>/) { if ($data ne "") { push(@dnas, $data); } $data = ""; } $data .= $line; } if ($data ne "") { push(@dnas, $data); } close FILE; return @dnas; }

print "Please input file name:\n"; my $fname = <STDIN>; my @dnas = ReadFasta($fname); my $len = $#dnas + 1; for (my $i = 0; $i < $len; $i++) { for (my $j = $i+1; $j < $len; $j++) { for (my $k = $j+1; $k < $len; $k++) { $fname = "$i\_$j\_$k"; print $fname; open(OUT, ">$fname"); print OUT $dnas[$i]; print OUT $dnas[$j]; print OUT $dnas[$k]; close OUT; system ("./clustalw2 $i\_$j\_$k"); } } }

Working with Single DNA Sequences

Learning Objectives • Discover how to manipulate your DNA sequence on a computer, analyze its composition, predict its restriction map, and amplify it with PCR • Find out about gene-prediction methods, their potential, and their limitations • Understand how genomes and sequences and assembled

Outline • Cleaning your DNA of contaminants • Digesting your DNA in the computer • Finding protein-coding genes in your DNA sequence • Assembling a genome

Cleaning DNA Sequences • In order to sequence genomes, DNA sequences are often cloned in a vector (plasmid, YAC, or cosmide) • Sequences of the vector can be mixed with your DNA sequence • Before working with your DNA sequence, you should always clean it with VecScreen

VecScreen • http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html • Runs a special version of Blast • A system for quickly identifying segments of a nucleic acid sequence that may be of vector origin

What to do if hits found • If hits are in the extremity, can just remove them • If in the middle, or vectors are not what you are using, the safest thing is to throw the sequence away

Computing a Restriction Map • It is possible to cut DNA sequences using restriction enzymes • Each type of restriction enzyme recognizes and cuts a different sequence: • EcoR1: GAATTC • BamH1: GGATCC • There are more than 900 different restriction enzymes, each with a different specificity • The restriction map is the list of all potential cleavage sites in a DNA molecule • You can compile a restriction map with www.firstmarket.com/cutter

Cannot get it work!

http://biotools.umassmed.edu/tacg4

Making PCR with a Computer • Polymerase Chain Reaction (PCR) is a method for amplifying DNA • PCR is used for many applications, including • Gene cloning • Forensic analysis • Paternity tests • PCR amplifies the DNA between two anchors • These anchors are called the PCR primer

Designing PCR Primers • PCR primes are typically 20 nucleotides long • The primers must hybridize well with the DNA • On biotools.umassmed.edu, find the best location for the primers: • Most stable • Longest extension

Analyzing DNA Composition • DNA composition varies a lot • Stability of a DNA sequence depends on its G+C content (total guanine and cytosine) • High G+C makes very stable DNA molecules • Online resources are available to measure the GC content of your DNA sequence • Also for counting words and internal repeats

http://helixweb.nih.gov/emboss/html/

Counting words • ATGGCTGACT • A, T, G, G, C, T, G, A, C, T • AT, TG, GG, GC, CT, TG, GA, AC, CT • ATG, TGG, GGC, GCT, CTG, TGA, GAC, ACT

www.genomatix.de/cgi-bin/tools/tools.pl

EMBOSS servers • European Molecular Biology Open Software Suite • http://pro.genomics.purdue.edu/emboss/

ORF • EMBOSS • NCBI

Bioinformatics 生物信息学理论和实践唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb