470 likes | 484 Views
Introduction to Bioinformatic Computation. Lecture #2 01-25-10. Any problems with UNIX commands? (cd, mv, cp, ls, pwd, more, and others ). Objectives for today. A little bit more of UNIX Execution of some important bioinformatic programs on UNIX computer (BLAST, CLUSTAL W)
E N D
Introduction to Bioinformatic Computation. Lecture #2 01-25-10 • Any problems with UNIX commands? (cd, mv, cp, ls, pwd, more, and others)
Objectives for today • A little bit more of UNIX • Execution of some important bioinformatic programs on UNIX computer (BLAST, CLUSTAL W) • Programming in PERL
mco321125.mco.edu LINUX cluster • Administrator (root) Ohio Supercomputer Center (Columbus) Doug Johnson Terry Lewis Peter Bazeley (MUOT) • Administrator assistant Alexei Fedorov (MCO)
File System / (root) bin dev etc home lib misc opt ram sbin tmp usr var afedorov skhuder mamoon … (and the rest of us) (and also other users) My home directory: /home/afedorov
Detailed list: ls -l owner Size name drwxr-xr-x 2 afedorov ibc1 4096 Jan 17 19:42 DB -rw-r--r-- 1 afedorov ibc1 275 Jan 15 16:47 users -rwxrwx--- 1 afedorov ibc1 37 Jan 15 11:22 test.pl -rw-r--r-- 1 afedorov ibc1 272 Jan 15 16:43 sorted permissions group Date of last modification sudo /usr/sbin/useradd –g ibc1afedorov –m –c “Alexei Fedorov”
Permissions and system security directory drwxr-xr-x -rw-r--r-- file group others owner r read w write x execute a program
How to change permissions • Command line (change mode or chmod) chmod 750 filename Result: - rwx r-x --- 0 --- 1 --x 2 -w- 3 -wx 4 r-- 5 r-x 6 rw- 7 rwx x = 1 w = 2 r = 4
Please, change permission in your home directory so that everybody from our group (ibc1) will be able to open it !!! Command line to change mode in a dir: chmod 750 –Rf dir_name
BLAST alignment > GENSCAN_predicted_peptide_2|541_aa MAPSLLYRGPQQSLPLRTSTATAKASSLTIEDPACLISDTAKSKAPTDAGEYVEKREASNTVGHTDASGGLPWPWAALPLWLCRVQPPSWLLSLAGIVCGFSSYTQHVVGNGIKAQSGNPEVKVKGTDPVINQIIDKLKHVIQLLGDEAAQALQVLAVELDVFVPGALHPQRLHGLGAALLERQPLREVDHLVFRAVDNEHGRRDLGHLLDSAGGAWGGLITWGRRRSSRSSWCREGDAQIQGERRVQHHRGGLVARGQVHGGHRADALPVQDDAVRADAVPGGTGAGLESRSWCTGRNRVHSSSLQPLPTPLAMRNSGSCVPTCLHPILDSRSPMLNEVTGVRPLAQLVQHQYQGLLVPSPNHAGPTCHPVDAATPELSSHTLQALIPGRCPPARQEHGQTIKQDTSSVEALSAIWPPPHGSGPRFLQDKAAAGQMAEQTELRAGHGGEARPSPRPPAWSPLRCAQAELPPGSPSPGAQGLPGGVDVGVEVRLGGPARAGTVAGVVVGEDVAVEAGTQANVEAAHLAQVHALPSLFMHRX Comparison of a sequence with a database
BLAST on-line (NCBI)http://www.ncbi.nlm.nih.gov/BLAST 15 November 2003 The BLAST databases in FASTA format will move from .Z to .gz compression. Read more... DisclaimerPrivacy statementAccessibilityValid XHTML 1.0, CSS.
Example of blast alignmentScore = 200 bits (503), Expect = 6e-51 Identities = 118/333 (35%), Positives = 191/333 (56%), Gaps = 11/333 (3%)Query: 28 KIQRLEESVVNRIAAGEVIQRPVSAVKELVENSLDADSSSISVVVKDGGLKLIQVSDDGH 87 KI L E++ N+IAAGEV++RP S VKELVENS+DA SS I + V++ GL+LI+V D+GSbjct: 42 KIIELNEALANQIAAGEVVERPASVVKELVENSIDAGSSKIIINVEEAGLRLIEVIDNGL 101Query: 88 GIRREDLPILCERHTTSKLTKFEDLFSLSSMGFRGEALASMTYVAHVTVTTITKGQIHGY 147 G+ +ED+ + RH TSK+ DLF + ++GFRGEAL S+ V+ +T+ T T + GSbjct: 102 GLEKEDVALALRRHATSKIKDSADLFRIRTLGFRGEALPSIASVSQMTIETSTAEEESGT 161Query: 148 RVSYRDGVMEH-EPKACAAVKGTQIMVENLFYNMIARRKTLQNSADDYGKIVDLLSRMAI 206 ++ + G +E EP A GT+I V NLFYN AR K +++ + I D+++R+++Sbjct: 162 KLVAKGGNIETLEPLAKRV--GTKISVANLFYNTPARLKYIKSLQAELSHITDIINRLSL 219Query: 207 HYNNVSFSCRKHGAVKADVHSVVSPSRLDSIRSVYGVSVAKNLMKVEVSSCDSSGCTFDM 266 + +SF+ G K + + + I ++YG+ AK + +++ S D F++Sbjct: 220 AHPEISFTLVNEG--KEFLKTAGNGDLRQVIAAIYGIGTAKKMRQIKGSDLD-----FEL 272Query: 267 EGFISNSNYV-AKKTILVLFINDRLVECSALKRAIEIVYAATLPKASKPFVYMSINLPRE 325 G++S A + + + IN R ++ L RAI Y L PF +SI +Sbjct: 273 TGYVSLPELTRANRNYITILINGRFIKNFLLNRAILEGYGNRLMVGRFPFAILSIKIDPT 332
Databases available on-line • Peptide Sequence Databases • nrAll non-redundant GenBank CDS translations+RefSeq Proteins+PDB+SwissProt+PIR+PRF swissprotLast major release of the SWISS-PROT protein sequence database (no updates) patProteins from the Patent division of GenPept. Yeastyeast (Saccharomyces cerevisiae) genomic CDS translations ecoli Escherichia coli genomic CDS translations pdb Sequences derived from the 3-dimensional structure from Brookhaven Protein Data BankDrosophila genomeDrosophila genome proteins provided by Celera and Berkeley Drosophila Genome Project (BDGP). monthAll new or revised GenBank CDS translation+PDB+SwissProt+PIR+PRF released in the last 30 days.
Your particular project Two protein samples to compare 500 proteins 200 proteins
Microarray experiments Sample 1: 200 proteins up-regulated in rat heart after treatment with your magic drag Sample 2: 500 proteins expressed in human heart during a severe hereditary disease
Running BLAST on UNIX computer (command line:) more .ncbirc [NCBI] Data=/usr/local/bin/BLAST/data
FORMATDB (auxiliary program) -t Title for database file [String] Optional -i Input file for formatting (this parameter must be set) [File In] -l Logfile name: [File Out] Optional default = formatdb.log -p Type of file T - protein F - nucleotide [T/F] Optional default = T -o Parse options T - True: Parse SeqId and create indexes. F - False: Do not parse SeqId. Do not create indexes. [T/F] Optional default = F -a Input file is database in ASN.1 format (otherwise FASTA is expected) T - True, F - False. [T/F] Optional default = F
Running BLAST • Prepare database and query sequences • Type in a command line: blastall -p blastp -d “db_name” -i “query_name” • Another command line: nohup blastall -p blastp -d “db_name” -i “query_name” -o “output_name” -v1 -b1 &
Hyman Hartman & Alexei FedorovThe origin of the eukaryotic cell – a genomic investigation.PNAS 2002, 99, 1420-1425.
BLAST comparison of all proteins from 5 eukaryotes with sequenced genomes Yeast 6,200 Drosophila 14,000 C. Elegans 18,000 Arabidopsis 24,500 Protist ? (contigs) 347 proteins common to all sequence eukaryotes
Our comparison of two sets of 2,461 and 25,470 proteins BLAST output text file
PERL program homolog_unique_score.pl BLAST results INPUT: Query sequences homolog_unique_score.pl OUTPUT: Homologous proteins unique proteins
Running the program Command line: perl homolog_unique_score.pl blast.out test_ibc1 Command line for find find ~ -name homolog_unique_score.pl -print
Jie He project • Create a database for local BLAST • Create an input list of 20 sequences in a Fasta-format for local BLAST • Run local BLAST
Downloading BLAST tar (tape archive) • Command line for packing files: tar -cvfarchive.tar dir_names • Command line for unpacking files: tar –xvf archive.tar Compression and decompression • gzip and gunzip
Assignments: • Read instructions of stand-alone BLAST for UNIX platform from NCBI web-site • Read PERL book about variables
PERL programming $n++? Variables (chapter 2)
Why PERL??? • I heard that C and Java are much faster (what car to buy Corolla or Lamborghini?)
QUIZ • What is the number of neurons in the human brain? • What is the average number of connections between a neuron in the human brain with its neighboring cells? With an estimated 100 billion neurons and 100 trillion synapses in the human brain, creating an all-encompassing map of even a small chunk is a daunting task. (http://www.technologyreview.com/Biotech/19731/)
If we do not know what’s going on inside our brain, then we should not be intimidated by not knowing the signaling system in our computers! Our brain somehow transfers (compiles) our words into electrical waves. The same do computers: they compile words of our programs into binaries that are not readable for us. (Compiler: a computer program that translates an entire set of instructions written in a higher-level symbolic language (as C) into machine language before the instructions can be executed [from Merriam-Webster])
Do our neurons communicate with words? Answer: NO, they communicate with each other via electrical signals
A binary file is a computer file which may contain any type of data, encoded in binary form for computer storage and processing purposes; for example, computer document files containing formatted text. Many binary file formats contain parts that can be interpreted as text; binary files that contain only textual data - without, for example, any formatting information - are called plain text files. In ordinary usage they are typically contrasted with binary files, so that binary files are all files which do not contain merely plain text.
Do you know the what’s going on in your brain? I failed my exam because of a fuzzy communication between WS09af and UV3245 groups of neurons in my brain
Introduction to PERL programming • Sets of instructions that are executed in a consequence order • Try to type small piece of PERL script and check whether it works properly. (print the results) • Thousands of Perl programs are available in the internet for free (examples: audio and visual bells)
Your first PERL program #!/usr/bin/perl print “Hello World! \n”; # This is my 1st line print “my second line \n”; # Second line
What does \n mean?What will happen if I change “” to ‘’?Test all your questions without hesitation! #!/usr/bin/perl print “Hello World! \n\n\n”; # triple \n print ‘my second line \n’; #change “” to ‘’
How to print into a file instead of the screen #!/usr/bin/perl open (FILEHANDLE, “>print_test1”); open (SECOND, “>very_long_name”); print FILEHANDLE “Hello World! \n”; print SECOND “my second line \n”;
Variables: numbers and strings Temperature Time Weight of you body Money on your bank account
Modeling your wealth #!/usr/bin/perl $my_money = 1000; #I have 1,000 dollars #variable could have any name: $m or $bank print “I have $my_money \n”; $my_money = $my_money + 100; # I deposited 100 dollars to my account print “Now I have $my_money \n”;
How you can change a variable #if you withdraw 10 dollars: $my_money = $my_money -10; # if your bank pay you 3.65% annual interest $my_money = $my_money * 1.0365;
Interest task for investment of 1,000 dollars • My bank offered me 3.65% annual interest and it will pay me 0.01% daily. • In how many days I will have 2,000 dollars?
PERL program to resolve our task #!/usr/bin/perl $my_money = 1000; #I have 1,000 dollars $day = 0; # variable for day counting $my_money = $my_money * 1.0001; $day = $day +1; print “on a day $day I have $my_money \n”;
#!/usr/bin/perl $my_money = 1000; #I have 1,000 dollars $day = 0; # variable for day counting $my_money = $my_money * 1.0001; $day = $day +1; print “on a day $day I have $my_money \n”; $my_money = $my_money * 1.0001; $day = $day +1; print “on a day $day I have $my_money \n”;
LOOPS #!/usr/bin/perl $my_money = 1000; #I have 1,000 dollars $day = 0; # variable for day counting for $day (0..1000) { $my_money = $my_money * 1.0001; print “on a day $day I have $my_money \n”; }
When’ll I get 2,000 dollars?Condition statement if ($my_money >= 2000) #!/usr/bin/perl $my_money = 1000; #I have 1,000 dollars $day = 0; # variable for day counting for $day (0..1000000) { $my_money = $my_money * 1.0001; print “on a day $day I have $my_money \n”; if ($my_money >= 2000) {last;} }
Programmers are very smart and lazy • Instead of $my_money = $my_money * 0.01; It is easier to write $my_money *=0.01; Instead of $day = $day +1; It is easier to write $day +=1; or $day++;
For the case of 3.65% annually (not on the daily base) #!/usr/bin/perl $my_money = 1000; $day = 0; $year = 0; for $day (0..1000) { if ($day == 365) { $year++; $day = 0; $my_money = $my_money * 1.0365; print “on the year $year I will have $my_money \n”; } if ($my_money >= 2000) {last;} }
Simplification of script #!/usr/bin/perl $m = 1000; $d = $y = 0; for $d (0..1000) { if ($d == 365) { $y++; $d = 0; $m *=1.0365; print "$y \t $m \n"; } if ($m >= 2000) {last;} }