200 likes | 211 Views
Computer Science for Bioinformatics, 140.636 F. Pineda. Please fill out questionnaire while waiting for class to start If you did not get permission from instructor to register for credit please talk to me after class...it could end badly.
E N D
Computer Science for Bioinformatics, 140.636F. Pineda • Please fill out questionnaire while waiting for class to start • If you did not get permission from instructor to register for credit please talk to me after class...it could end badly...
If all you have is a hammer, everything looks like a nail. • You can do just about anything in any of the modern languages, but different languages are optimized for different purposes, but the right language makes things simple... • Rapid Prototyping or scripting languages • Perl– Best for tasks that are 90% text manipulation and 10% everything else. • R– Best for tasks that are 90% stats/plotting and 10% everything else • Python– Good general purpose programming language. • Widely used development languages (strongly typed) • C – best for low-level programming where speed matters • Java – Very widely used • Databases • SQL – the standard language for relational databases
Some typical text-oriented bioinformatics tasks • I have a huge output file from a BLAST search. What are all the Mus musculus matches with high scores? • I want to digest (with Trypsin) each of the ~900 proteins in a FASTA file and calculate the mass of each fragments. Oh, I also want the results in a format that I can feed to the MASCOT proteomics search engine . • What’s the most common 6-mer consisting of only A’s and T’s in the P. falciparum genome (14 chromosomes and 2.2 million nucleotides) • I want to automatically update the database on our local BLAST server every weekend with the latest sequences deposited at NCBI. • Search for potential RNA hairpins in the 14 chromosomes of the Plasmodium falciparum genome The common characteristic of these tasks is that theyinvolve ~90% text manipulation, and ~10% everything else.
A quickie task Question: What 6-mer containing only A’s and T’s occurs mostly commonly in the P. falciparumgenome Answer: Scan through the 2.2 million nucleotides of the 14 chromosomes using a 6-nt windowand find that ATATAT occurs most frequently Runs in seconds on a modern laptop --------- A: 40.31 % T: 40.30 % C: 9.70 % G: 9.69 % total(ATCG only ) = 22853497 nucleotides total(everything) = 22853764 nucleotides --------- 1 ATATAT 570577 2 TATATA 529826 3 AAAAAA 420232 4 TTTTTT 417439 5 TATTAT 147506 6 ATAATA 146948 7 TATATT 146711 8 AATATA 146695 -- snip -- 57 ATTTAA 50617 58 ATTAAT 49856 59 TTAAAT 49731 60 TAATTA 45781 61 AATTAA 45278 62 TTAATT 45212 63 AATTTA 45101 64 TAAATT 44553
A bigger task: “middleware” for a “pipeline” • Applications for processing raw data from DNA sequencers • A trace editor to analyze, display, and allow biologists to edit the short DNA read chromatograms from DNA-sequencing machines. • A read assembler to find overlaps between the reads and assemble them in long contiguous sections. • An assembly editor to view the assemblies and make changes in places where the assembler went wrong. • A database to keep track of everything. • Write a set of scripts that runs the applications sequentially and converts the output format of one application into the input format of the subsequent application. see e.g. L. Stein “How Perl Saved the Human Genome Project”
MBL/TIGR Assembly Pipeline From Advanced Genomics and Bioinformatics course F. Pineda, 10/31/2002 ABI 3700 sequencer sample DNA .chr Traceviewer Legend PreTA (Lucy, Phred) PUC19 PUC19.splice applications Closure format conversion files .x .y Lab phd2fasta Consed .qual .seq No Closed? .fasta RunTA TIGR Assembler .contig ace2contig .ace Yes RepeatFinder .asm ta2ace Finished sequence .repeats .stats .details .dot Bambus .mates ? To database for annotation
Got perl? • Unix, Linux and MacOS systems come with Perl. • Open a shell window for entering command lines • Find the DEFAULT perl interpreter by typing: whereis perl • To find out which version of perl you have type “perl -v” • If you insist on working on a Windows PC, you have two choices (neither of which we will support): • Cygwin (a “linux-like environment with a limited set of tools) • http://www.cygwin.com/ • Active state perl (a native perl interpreter). • (http://www.activestate.com/activeperl/)
Invoking the perl interpreter • Perl is an application that processes perl statements. • Feeding perl a one line command • perl -e ‘print “hello world\n”’ • Feeding perl a sequence of commands • step 1. invoke perl • step 2. write the statement(s) • step 3. tell perl you are finished writing statements (control-D) (this causes perl to interpret the statements) print “an old pond\nthe sound of a frog jumping\ninto water\n”;
Perl application = Compiler+Virtual Machine • Java, Perl, Python, Forth, etc. are implemented as virtual machines. • The perl interpreter is an application consisting of a compiler and a “virtual machine.” • The perl interpreter accepts input consisting of perl instructions and executes them in a two-step process: • A compiler converts source code into instructions for a “virtual machine” • Bottom-up parser • Top-down Optimizer & Peephole Optimizer • Generates Opcodes (instructions) • Code is compiled each time it is executed – no binaries • The Perl “virtual machine” executes the instructions • Executes the opcodes • Emulates a stack-based computer (like HP calculator or Forth)
1st perl program: Hello World # my first perl program print(“hello world\n”); • Text between ‘#’ and the end of the line is a comment • print() is a function that prints the text string • Double quotes delimit a string • The two characters \n are not interpreted literally. Instead they represents newline. • Run the program by invoking the perl interpreter with hello.pl as an argument. perl helloworld.pl
stand alone perl script • Three things to remember • Add a special first line that will tell bash where to find perl • Tell bash it has permission to execute this script (just once) • Invoke the script #!/usr/bin/perl # my first perl script print(“hello world\n”); bash> chmod u+x helloperl.pl bash> ./helloperl.pl
Character data, Sequences & codes • 99% of what you will do with perl is to manipulate characters and text files, so you might as well get intimately familiar with character data and how it is represented (coded) on computers. • Many important biological compounds are polymers. These are compounds of usually high molecular weight consisting of up to millions of repeated linked units, each a relatively light and simple molecule. Different single letter codes are used to represent simple molecules in different classes of compounds (e.g. DNA, RNA and proteins).
Biological sequencesrepresenting information contained in biological polymers
DNA RNA Proteins The fundamental “dogma” • after discovery of retroviruses Replication Reverse-transcription Transcription Translation
DNA • Double stranded molecule with a spiral arrangement • Nucleotides pair by hydrogen bonding • Purines (A,G) bind to pyrimidines (T,C) A pairs to T with 2 hydrogen bonds G pairs to C with 3 hydrogen bonds • The strands run opposite to each other (antiparallel) • 4 letter alphabet: A,T,G,C • A human genome is about 3 billion base-pairs long spread across 23 chromosomes
RNA • Single stranded molecule with complex secondary structure • Nucleotides pair by hydrogen bonding • Purines bind to pyrimidines • A pairs to U with 2 hydrogen bonds • G pairs to C with 3 hydrogen bonds • G pairs with U ( a wobble pair) • 4 letter alphabet: A,U,G,C
Summary of the important polymers & their “alphabets” • DNA • Double stranded (helical) • deoxyribose-phosphate backbone • 4 letter alphabet (A,T,C,G) represent nucleotides • Complementary base pairing (A-T), (C-G) • RNA • Single stranded (messenger RNA) • ribose-phosephate backbone • 4 letter alphabet (A,U,C,G) • Complementary base pairing (A-U), (C-G),(U-G) • Proteins • Single stranded (complex folding structure) • 20 letter alphabet (represent amino acid residues)
downstream upstream Notational and directional conventions “gene” 5’ sense (+) strand 3’ DNA anti-sense (-) strand (template) 3’ 5’ promoter RNA polymerease Transcription RNA 5’ 3’ UTR UTR Ribosome Translation Protein N-terminal ( amino end ) C -terminal (carboxyl end)
Suppose I write out the human genome sequence as one long text file.How many human genomes will fit on a 4GB usb stick or how many 4GB sticks do I need to store 1 human genome?Don’t overthink this.