Bioinformatics: Theory and Practice. Jijun Tang (唐继军), jtang@cse.sc, Center for Computational Biology, Beijing Forestry University, bjfuccb

Learn the fundamentals of bioinformatics, including sequence manipulation, analysis, and prediction. Discover how to clean and digest DNA sequences, find protein-coding genes, and assemble genomes.

Presentation Transcript


  1. Bioinformatics: Theory and Practice. Jijun Tang (唐继军), jtang@cse.sc.edu. Center for Computational Biology, Beijing Forestry University, www.bjfuccb.edu

  2. Hash • Initialize: my %hash = (); • Add key/value pair: $hash{$key} = $value; • Assign several keys at once (this replaces the previous contents): • %hash = ( 'key1', 'value1', 'key2', 'value2' ); • %hash = ( key1 => 'value1', key2 => 'value2', ); • Delete: delete $hash{$key};
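
As a minimal sketch of these hash operations working together, the snippet below counts nucleotides in a DNA string. The helper name `count_bases` and the example sequence are illustrative assumptions, not part of the slides.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each base occurs in a DNA string.
sub count_bases {
    my ($dna) = @_;
    my %count = ();                      # initialize an empty hash
    foreach my $base (split //, uc $dna) {
        $count{$base}++;                 # key is created on first use, then incremented
    }
    return %count;
}

my %count = count_bases('ATGGCTGACT');   # made-up example sequence
delete $count{N} if exists $count{N};    # drop ambiguous bases, if any were seen

foreach my $base (sort keys %count) {
    print "$base: $count{$base}\n";
}
```

Using `exists` before `delete` is not required, but it makes the intent explicit when the key may be absent.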

  3. Print to file • Open a file for writing: • open FILE, ">filename.txt"; • open (FILE, ">filename.txt"); • Print to the file: • print FILE $str;

  4. #Append
     open(FILE, ">>out") or die "Cannot open file to write";
     print FILE "Test\n";
     close FILE;
     exit;
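
The same append pattern can also be written with the three-argument form of `open` and a lexical filehandle, which is the style many current Perl guides recommend. The sketch below is one possible version; the helper name `append_line` is an assumption, and a temporary file is used so that no real data is touched.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: append one line to a file, reporting the OS error on failure.
sub append_line {
    my ($path, $text) = @_;
    open(my $fh, '>>', $path) or die "Cannot open $path for appending: $!";
    print {$fh} "$text\n";
    close($fh) or die "Cannot close $path: $!";
    return;
}

# Demonstrate on a temporary file that is deleted when the program exits.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
close $tmp_fh;
append_line($tmp_name, 'Test');
append_line($tmp_name, 'Test again');

# Read the file back to confirm that both lines were appended.
open(my $in, '<', $tmp_name) or die "Cannot open $tmp_name for reading: $!";
my @lines = <$in>;
close $in;
print "File now has ", scalar(@lines), " lines\n";
```

Including `$!` in the error message shows why the open failed (for example, a missing directory or a permission problem).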

  5. #!/usr/bin/perl
     print "My name is $0 \n";
     print "First arg is: $ARGV[0] \n";
     print "Second arg is: $ARGV[1] \n";
     print "Third arg is: $ARGV[2] \n";
     $num = $#ARGV + 1;
     print "How many args? $num \n";
     print "The full argument string was: @ARGV \n";

  6. use BeginPerlBioinfo;
     my %rebase_hash = ( );
     my @file_data = ( );
     my $query = '';
     my $dna = '';
     my $recognition_site = '';
     my $regexp = '';
     my @locations = ( );
     # Read the DNA from the FASTA file given as the first argument
     @file_data = get_file_data($ARGV[0]);
     $dna = extract_sequence_from_fasta_data(@file_data);
     # Load the REBASE enzyme table given as the second argument
     %rebase_hash = parseREBASE($ARGV[1]);
     do {
         print "Search for what restriction enzyme (or quit)?: ";
         $query = <STDIN>;
         chomp $query;
         if ($query =~ /^\s*$/ ) { exit; }
         if ( exists $rebase_hash{$query} ) {
             # Each hash value holds the recognition site and its regular expression
             ($recognition_site, $regexp) = split ( " ", $rebase_hash{$query});
             @locations = match_positions($regexp, $dna);
             if (@locations) {
                 print "Searching for $query $recognition_site $regexp\n";
                 print "Restriction site for $query at :", join(" ", @locations), "\n";
             } else {
                 print "A restriction site for $query is not in the DNA\n";
             }
         }
     } until ( $query =~ /quit/ );
     exit;
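
The program above splits each `%rebase_hash` value into a recognition site and a regular expression. The sketch below shows one plausible way such a regular expression could be derived from a site written with IUPAC ambiguity codes. The helper name `iupac_to_regexp` is hypothetical; it is not claimed to be what `BeginPerlBioinfo` does internally.

```perl
use strict;
use warnings;

# Standard IUPAC nucleotide ambiguity codes, each mapped to a character class.
my %iupac = (
    A => 'A', C => 'C', G => 'G', T => 'T',
    R => '[AG]',  Y => '[CT]',  M => '[AC]',  K => '[GT]',
    S => '[CG]',  W => '[AT]',
    B => '[CGT]', D => '[AGT]', H => '[ACT]', V => '[ACG]',
    N => '[ACGT]',
);

# Hypothetical helper: turn a recognition site such as "GTYRAC" into a regular expression.
sub iupac_to_regexp {
    my ($site) = @_;
    $site =~ s/\^//g;                    # drop the cut-position marker, if present
    my $regexp = '';
    foreach my $code (split //, uc $site) {
        die "Unknown IUPAC code '$code'\n" unless exists $iupac{$code};
        $regexp .= $iupac{$code};
    }
    return $regexp;
}

print iupac_to_regexp('G^AATTC'), "\n";  # EcoRI site with its cut marker
print iupac_to_regexp('GTYRAC'),  "\n";  # HincII site with two ambiguous positions
```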

  7. Regular Expression • ^ beginning of string • $ end of string • . any character except newline • * match 0 or more times • + match 1 or more times • ? match 0 or 1 times • | alternative • ( ) grouping and capturing ("storing") • [ ] set of characters • { } repetition modifier • \ escape a metacharacter or introduce a special sequence
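
To show several of these metacharacters on sequence data, here is a small sketch. The FASTA header and the DNA string are made up for illustration.

```perl
use strict;
use warnings;

my $header = '>seq1 Homo sapiens';       # made-up FASTA header
my $dna    = 'ATGCCCGGGTAAAA';           # made-up sequence

# ^ anchors at the start; ( ) stores the identifier; \S+ is one or more non-space characters
my ($id) = $header =~ /^>(\S+)/;

# [ ] is a character set and $ anchors at the end: only A, C, G, T allowed
my $is_clean = ($dna =~ /^[ACGT]+$/) ? 1 : 0;

# | is alternation: is any of the three stop codons present (in any frame)?
my $has_stop = ($dna =~ /TAA|TAG|TGA/) ? 1 : 0;

# { } is a repetition modifier: capture the first run of three or more A's
my ($run) = $dna =~ /(A{3,})/;

print "id=$id clean=$is_clean stop=$has_stop run=$run\n";
```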

  8. \

  9. []

  10. $mystring = "[2004/04/13] The date of this article.";
      if($mystring =~ m/(\d)/) { print "The first digit is $1."; }
      if($mystring =~ m/(\d+)/) { print "The first number is $1."; }
      if($mystring =~ m/(\d+)\/(\d+)\/(\d+)/) { print "The date is $1-$2-$3"; }
      while($mystring =~ m/(\d+)/g) { print "Found number $1."; }
      @myarray = ($mystring =~ m/(\d+)/g);
      print join(",", @myarray);

  11. $mystring = "[2004/04/13] The date of this article.";
      if($mystring =~ m/(\d+)\/(\d+)\/(\d+)/) { print "The date is $1-$2-$3"; }

  13. Download and install programs • Unzip or untar the archive: • for file.zip, unzip file.zip • for file.tar.gz, tar xvfz file.tar.gz • Go to the directory and run "./configure" • Then run "make"

  14. Exercise • Download clustalw • Try to install it

  15. System subroutine system ("ls -ltr");
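
`system` returns the wait status of the command, so a script can check whether the external program succeeded. The sketch below assumes a hypothetical wrapper named `run_command`, and it runs the Perl interpreter itself (`$^X`) as a stand-in for a real tool.

```perl
use strict;
use warnings;

# Hypothetical wrapper: run a command and return its exit code, or -1 if it could not start.
sub run_command {
    my (@cmd) = @_;
    my $status = system(@cmd);           # list form bypasses the shell and its quoting rules
    return -1 if $status == -1;          # the command could not be started
    return $status >> 8;                 # the high byte of the status holds the exit code
}

# $^X is the path of the running perl, used here only as a portable example command.
my $exit = run_command($^X, '-e', 'exit 0');
print "exit code: $exit\n";
```

An exit code of 0 conventionally means success; anything else usually signals an error in the external program.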

  16. Exercise 2 • Use pro.fasta • Find the alignment for each triple of proteins • Let's design the program together • Use "system" in Perl • system ("command parameters");

  17. sub ReadFasta {
          my ($fname) = @_;
          open(FILE, $fname) or die "Cannot open $fname\n";
          my $data = "";
          my @dnas = ();
          while(my $line = <FILE>) {
              # A header line starts a new record: store the previous one
              if ($line =~ /^>/) {
                  if ($data ne "") { push(@dnas, $data); }
                  $data = "";
              }
              $data .= $line;
          }
          # Store the last record
          if ($data ne "") { push(@dnas, $data); }
          close FILE;
          return @dnas;
      }

  18. print "Please input file name:\n";
      my $fname = <STDIN>;
      chomp $fname;    # remove the trailing newline, otherwise open fails
      my @dnas = ReadFasta($fname);
      my $len = $#dnas + 1;
      # Write every triple (i, j, k) of sequences to its own file and align it
      for (my $i = 0; $i < $len; $i++) {
          for (my $j = $i+1; $j < $len; $j++) {
              for (my $k = $j+1; $k < $len; $k++) {
                  $fname = "$i\_$j\_$k";
                  print "$fname\n";
                  open(OUT, ">$fname") or die "Cannot open $fname\n";
                  print OUT $dnas[$i];
                  print OUT $dnas[$j];
                  print OUT $dnas[$k];
                  close OUT;
                  system ("./clustalw2 $fname");
              }
          }
      }

  19. Working with Single DNA Sequences

  20. Learning Objectives • Discover how to manipulate your DNA sequence on a computer, analyze its composition, predict its restriction map, and amplify it with PCR • Find out about gene-prediction methods, their potential, and their limitations • Understand how genomes and sequences are assembled

  21. Outline • Cleaning your DNA of contaminants • Digesting your DNA in the computer • Finding protein-coding genes in your DNA sequence • Assembling a genome

  22. Cleaning DNA Sequences • In order to sequence genomes, DNA sequences are often cloned in a vector (plasmid, YAC, or cosmid) • Sequences of the vector can be mixed with your DNA sequence • Before working with your DNA sequence, you should always clean it with VecScreen

  23. VecScreen • http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html • Runs a special version of Blast • A system for quickly identifying segments of a nucleic acid sequence that may be of vector origin

  24. What to do if hits are found • If the hits are at the ends of the sequence, you can simply remove them • If they are in the middle, or the vectors are not the ones you are using, the safest thing is to throw the sequence away

  25. Computing a Restriction Map • It is possible to cut DNA sequences using restriction enzymes • Each type of restriction enzyme recognizes and cuts a different sequence: • EcoRI: GAATTC • BamHI: GGATCC • There are more than 900 different restriction enzymes, each with a different specificity • The restriction map is the list of all potential cleavage sites in a DNA molecule • You can compile a restriction map with www.firstmarket.com/cutter
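
A restriction map for a handful of enzymes can be sketched in a few lines of Perl. The example below uses only the two recognition sequences listed on this slide. The helper name `restriction_map` and the test sequence are assumptions, and the positions reported are where the recognition site starts, not the exact cut position.

```perl
use strict;
use warnings;

# The two recognition sequences named above; a real map would load many more.
my %enzymes = (
    EcoRI => 'GAATTC',
    BamHI => 'GGATCC',
);

# Hypothetical helper: map each enzyme name to the 1-based start positions of its sites.
sub restriction_map {
    my ($dna, %sites) = @_;
    my %map = ();
    foreach my $name (sort keys %sites) {
        my $site = $sites{$name};
        my @pos  = ();
        # A zero-width lookahead lets overlapping sites be reported too.
        while ($dna =~ /(?=\Q$site\E)/ig) {
            push @pos, $-[0] + 1;        # $-[0] is the 0-based start of the last match
        }
        $map{$name} = \@pos;
    }
    return %map;
}

my $dna = 'AAGAATTCTTGGATCCAAGAATTC';    # made-up sequence
my %map = restriction_map($dna, %enzymes);
foreach my $name (sort keys %map) {
    print "$name: @{ $map{$name} }\n";
}
```

Both sites here are palindromic, so scanning one strand is enough. A non-palindromic site would also need a search on the reverse complement.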

  26. Cannot get it to work!

  27. http://biotools.umassmed.edu/tacg4

  28. Making PCR with a Computer • Polymerase Chain Reaction (PCR) is a method for amplifying DNA • PCR is used for many applications, including • Gene cloning • Forensic analysis • Paternity tests • PCR amplifies the DNA between two anchors • These anchors are called the PCR primers
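
The idea of amplifying the DNA between two anchors can be imitated on a computer. The sketch below is a deliberately simplified model that assumes both primers match the template exactly. The helper names `revcom` and `in_silico_pcr` and all sequences are made up for illustration, and the primers are shorter than real ones to keep the example readable.

```perl
use strict;
use warnings;

# Reverse complement of a DNA string.
sub revcom {
    my ($dna) = @_;
    my $rc = reverse $dna;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Hypothetical helper: return the product amplified by two exactly matching primers,
# or nothing if either primer does not bind.
sub in_silico_pcr {
    my ($template, $forward, $reverse) = @_;
    my $start = index($template, $forward);
    return if $start < 0;
    my $site = revcom($reverse);         # the reverse primer binds the opposite strand
    my $end  = index($template, $site, $start + length $forward);
    return if $end < 0;
    return substr($template, $start, $end + length($site) - $start);
}

my $template = 'CCATGGCTGACTTTGACCGTAAGG';   # made-up template
my $product  = in_silico_pcr($template, 'ATGGCT', 'TTACGG');
print "Product: $product\n" if defined $product;
```

Real primer-design tools also score mismatches, melting temperature, and primer-dimer formation, none of which is modelled here.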

  29. Designing PCR Primers • PCR primers are typically 20 nucleotides long • The primers must hybridize well with the DNA • On biotools.umassmed.edu, find the best location for the primers: • Most stable • Longest extension

  30. Analyzing DNA Composition • DNA composition varies a lot • Stability of a DNA sequence depends on its G+C content (total guanine and cytosine) • High G+C makes very stable DNA molecules • Online resources are available to measure the GC content of your DNA sequence • Also for counting words and internal repeats
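
The G+C content can also be computed locally. This is a minimal sketch, assuming a hypothetical helper `gc_content` that ignores any character other than A, C, G, and T.

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of G and C among the unambiguous bases of a sequence.
sub gc_content {
    my ($dna) = @_;
    my $gc = ($dna =~ tr/GCgc//);        # tr with an empty replacement only counts
    my $at = ($dna =~ tr/ATat//);
    return 0 if $gc + $at == 0;          # avoid dividing by zero on empty input
    return $gc / ($gc + $at);
}

printf "GC content: %.1f%%\n", 100 * gc_content('ATGGCTGACT');
```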

  31. http://helixweb.nih.gov/emboss/html/

  32. Counting words • ATGGCTGACT • A, T, G, G, C, T, G, A, C, T • AT, TG, GG, GC, CT, TG, GA, AC, CT • ATG, TGG, GGC, GCT, CTG, TGA, GAC, ACT
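
The overlapping words listed above can be counted with a hash keyed by the word itself. The helper name `count_words` is an assumption. The sequence is the one from the slide.

```perl
use strict;
use warnings;

# Hypothetical helper: count every overlapping word of length $k in $dna.
sub count_words {
    my ($dna, $k) = @_;
    my %count = ();
    # The range is empty when the sequence is shorter than $k.
    for my $i (0 .. length($dna) - $k) {
        $count{ substr($dna, $i, $k) }++;
    }
    return %count;
}

my %pairs = count_words('ATGGCTGACT', 2);
foreach my $word (sort keys %pairs) {
    print "$word: $pairs{$word}\n";
}
```

Calling `count_words` with 1 or 3 as the second argument reproduces the other two rows of the slide.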

  33. www.genomatix.de/cgi-bin/tools/tools.pl

  34. EMBOSS servers • European Molecular Biology Open Software Suite • http://pro.genomics.purdue.edu/emboss/

  35. ORF • EMBOSS • NCBI
