230 likes | 353 Views
Computational Skills Course week 1. Mike Gilchrist NIMR May-July 2011. WEEK ONE. Introduction and scope of course. The command line, the nature of computer data, and some useful command line tools for modifying text files. Main topics. The command line Data and text files
E N D
Computational Skills Courseweek 1 Mike Gilchrist NIMR May-July 2011
WEEK ONE Introduction and scope of course. The command line, the nature of computer data, and some useful command line tools for modifying text files.
Main topics The command line Data and text files Simple command line tools Third party bioinformatics tools Databases and SQL 3GL’s – industrial strength computing Scripts and pipelines What will not appear and why...
Axioms Google it You cannot break your computer Familiarity breeds content Bugs are always your fault (and can always be found - eventually) There are many ways of doing any give task (but one of them may be much better) Computational biologists tend to obfuscate GNU = Gnu is Not Unix
WEEK ONE Basic concepts of computer based data, the command line, and some essential command line utilities.
At the command line D:\projects\current> program -U jackyO -n 25 -i E:\data\experiment-1.txt > output.txt program = program.exe or -rwxr-xr-x 1 migilsysbio 18K Mar 12 16:16 program [migil@INFO-LINUX2 gcc]$ hts_utils
PATH F:\data\projects\khokha\exon-capture\blast\data-12may11.txt Also the PATH environment variable which tells the computer where to look for programs
Computer data a ‘byte’ 10001011010010110010001110110101011010100101000010100101... 128 64 32 16 8 4 2 1 = 35 (base 10) 00000000 = 0 11111111 = 255 256 possible characters in the simplest form of computer data
The ascii ‘alphabet’ 00000000 = 0 \0 (the null terminator) 00001001 = 9 \t TAB 00001010 = 10 \n (line feed) 00001011 = 13 \r (return) 00100000 = 32 ‘ ‘ (space) 00110000 = 48 ‘0’ : : : : 00111001 = 57 ‘9’ 01000001 = 65 ‘A’ : : : : 01011010 = 90 ‘Z’
A typical text data file @HWUSI-EAS582_0299:3:1:87:82/1 GGCGACGATACATTCGGATGTCTGCCCTATCAACTTTCGAT +HWUSI-EAS582_0299:3:1:87:82/1 ffbffffff^UYbb\ffbee_M_f]UV^JYRXbSdf`bfff @HWUSI-EAS582_0299:3:1:87:1949/1 CGGTTCAGCAGGAATGCCGAGACCGATATCGTATGCCGTCT +HWUSI-EAS582_0299:3:1:87:1949/1 _ebhhghghgedgceg\fggfghggg_]eZ\WWcacMb_ce @HWUSI-EAS582_0299:3:1:87:1688/1 CGGGCATCTAAGGGCATCACAGACCTGTTATTGCTCGATCT +HWUSI-EAS582_0299:3:1:87:1688/1 ffbdd`bfffbf`ZfZddddcbfZfff]ffaddYdeffedd @HWUSI-EAS582_0299:3:1:87:1773/1 CGCTTGTTTCCTGATGTCACATGACAACACAAGATCGATAA +HWUSI-EAS582_0299:3:1:87:1773/1 gfgggggfggggfgggZggfddggggfag\fggafgdgggg @HWUSI-EAS582_0299:3:1:87:1000/1 ATTGTGCAAGTCTCCCAATGTCGATTTAATGAAATCCCTAC +HWUSI-EAS582_0299:3:1:87:1000/1 ggeggegeggcgggfdaaffdgggggfggggggggggdggd @HWUSI-EAS582_0299:3:1:87:842/1 GGAGTATGGTTGCAAAGCTGAAACTTAAAGGAATTGACGGA +HWUSI-EAS582_0299:3:1:87:842/1 gggggggggggggggggggggggfgggggggfggggggggg @HWUSI-EAS582_0299:3:1:88:1826/1 TTCGGACACACTGGGCCCAGATGGCTTCTTGGATTTAGGGG +HWUSI-EAS582_0299:3:1:88:1826/1 cWbgggcfgfc`g^gc^ddgg\ggbcd\`OfdZW`d`dJdg
The line ending/platform issue DOS (Windows) Unix/Linux (newer Macs) 582_0299:3:1:87:82/1\n\r GATGTCTGCCCTATCAACTTTCGAT\n\r 582_0299:3:1:87:82/1\n\r fbee_M_f]UV^JYRXbSdf`bfff\n\r 582_0299:3:1:87:82/1\n GATGTCTGCCCTATCAACTTTCGAT\n 582_0299:3:1:87:82/1\n fbee_M_f]UV^JYRXbSdf`bfff\n (really) 582_0299:3:1:87:82/1\n\rGATGTCTGCCCTATCAACTTTCGAT\n\r582_0299:3:1:87.. Programs which read text in a line at a time may give the wrong result if the text file is from an incompatible platform... If you are moving a file from Linux -> Windows run unix2dos first. >unix2dos <FILENAME>
Command line utilities for manipulating data files http://sourceforge.net/projects/gnuwin32/files/ >grep finds lines matching ‘patterns’ >sed swaps one pattern for another >awk operates on ‘fields’ in a record >cat >tail >more/less >unix2dos/dos2unix AND THEN THERE ARE PIPES... ‘|’ AND REDIRECTS... ‘>’
grep A typical fasta sequence file (Xenopus tropicalis mRNAs) : ACAAGCGATCTTGTAGAGCAATTCCAGCAACACTTAAAGGGACTTCTGTCTGTACTTACC AAACTGACAAAAAAGGCCAATCTACTGACAAACTCTTACAAAAAGCAGATTGGCATTGGT GCTCCGAGCAGT >ENSXETG00000025789|ENSXETT00000019729|nodal2 ATGGCAGCCCTAGGAGCCCTCTTTTTATTTGCCATGGCCTCCCTTGTGCACGGGAAGCCC ATTCATTCAGACAGAAAAGGAGCTAAAATCCCTCTGGCAGGATCTAACCTGGGATACAAG AAATCCAGCAATTCATATGGTTCCAGACTGTCGCAGGGTATGAGATACCCCCCTTCCATG ATGCAGTTATACCAGACTCTGATTTTGGGGAATGATACGGATCTGTCAATCCTGGAATAT CCCATCCTGCAGGAATCTGATGCCGTTCTAAGCCTCATTGCAAAAAGTTGCGATGTAGTG GGCAATCGATGGACATTGTCCTTTGACATGTCTTCTATATCGAGCAGCAATGAGCTGAAA TTGGCCGAGTTGAGGATCCGCCTCCCTTCCTTTGAAAGGTCCCAGGAT >ENSXETG00000004295|ENSXETT00000014760|neurod4 AGCCCCGGACTCATTGATATTGGGCACAGCGAGTCTGCCTGGGAGCTGTCCAGCACTCCA TGCTCCTGAAATAACTTGGGCAACAAGTCCGATCTGCCCGCTACTCTGTGCCTCCAGCTC AGGCCCGGGGAGAGGGACCCTGCTGAGCAGGACTCAGGACACTGTTTGAAGATCACATCA AATTCTGCTAATATGTCGGAGATAGTCAGTGTGCATGGGTGGATGGAGGAAGCCCTTAGT TCCCAGGATGAGATGGAGAGGAATCAGCGGCAATCTGCCTATGATATCATTTCAGGTCTG AGTCACGAGGAAAGGTGTAGCATAGATGGAGAAGATGATGATGAAGAAGAAGAGGATGGA GAGAAACCAAAAAAGAGGGGACCCAAAAAAAAGAAGATGACCAAGGCTAGACTGGAGAGG TTTCGTGTGCGCAGAGTAAAAGCCAATGCCAGGGAGCGCACCAGAATGCATGGACTTAAT GATGCCCTAGAAAATTTAAGGAGGGTCATGCCTTGCTATTCCAAAACACAAAA >ENSXETG00000006455|ENSXETT00000014172|camk2g AGACAAGAAACAGTGGAATGTTTGAGAAAGTTCAATGCACGGAGAAAGCTTAAAGGTGCA ATTCTCACAACAATGTTGGTTTCTCGGAATTTTTCTGGAATTGCATTTGGATGCCGAAAA GCTGCATCCACTGTCCCGTGTACCTCTTCAACGGGGGACACTATAACTGGTGTTGGCAGG CAGACCTCCGCCCCTGTTGTGGCCGCCACCAGTGCTGCCAACTTAGTCGAGCAAGCTGCC AAGAGTTTGTTGAACAAGAAGACAGATGGTGTCAAGCCACAGACCAACAACAAAAACAGC ATAATAAGCCCTGCAAAAGAAAACCCCCCATTGCAGACATCAATGGAACCTCAAACAACT GTTGTCCACAATGCAACTGATGGGATAAAAGGATCAACAGAGAGTTGTAACACCACCACT :
grep grep reads the input file line by line and reports on each line containing the pattern F:dev\sql>grep ACGTACGT my-sequence-file.fasta GAGAAGAACGTACGTGAGTGCAACCCATTCCTGGACCCGGAGATGGTGCGATTCCTCTGG TATTAAGAAAGAAAAGTTACGTACGTTGATAGACCTTGTAAGTGAAGAGAAGATGTTAGA TTCAACCCAACTTACTATGTTACTATTGCTTCATTCCTTTTCACGTACGTCTGGTCTCAA GGAATTAATAACCAGGATTTTGAAGGGGATTGCTACGTACGTCGCAGGTTATCCGGTGGA : : F:dev\sql>grep –c ACGTACGT my-sequence-file.fasta 76 F:dev\sql>grep nodal my-sequence-file.fasta >ENSXETG00000025789|ENSXETT00000019729|nodal2 >ENSXETG00000009009|ENSXETT00000019730|nodal3 >ENSXETG00000025789|ENSXETT00000019728|nodal2 >ENSXETG00000009008|ENSXETT00000019726|nodal1 >ENSXETG00000017442|ENSXETT00000037932|nodal5.2 >ENSXETG00000016779|ENSXETT00000036596|nodal5 >ENSXETG00000016778|ENSXETT00000036593|nodal6 >ENSXETG00000023748|ENSXETT00000051228|nodal F:dev\sql>grep –c “>” my-sequence-file.fasta 28937 F:dev\sql>grep “>” my-sequence-file.fasta > my-sequence-file-def-lines.txt
sed stream editor: reads input line at a time, and operates on the line as requested – typically to make a substitution. E.g. Replace groups of space with a TAB, etc. F:dev\sql>grep nodal my-sequence-file.fasta > tmp.txt F:dev\sql>more tmp.txt >ENSXETG00000025789|ENSXETT00000019729|nodal2 >ENSXETG00000009009|ENSXETT00000019730|nodal3 >ENSXETG00000025789|ENSXETT00000019728|nodal2 >ENSXETG00000009008|ENSXETT00000019726|nodal1 >ENSXETG00000017442|ENSXETT00000037932|nodal5.2 >ENSXETG00000016779|ENSXETT00000036596|nodal5 >ENSXETG00000016778|ENSXETT00000036593|nodal6 >ENSXETG00000023748|ENSXETT00000051228|nodal F:\dev\sql>sed "s/nodal/xnr/" tmp.txt >ENSXETG00000025789|ENSXETT00000019729|xnr2 >ENSXETG00000009009|ENSXETT00000019730|xnr3 >ENSXETG00000025789|ENSXETT00000019728|xnr2 >ENSXETG00000009008|ENSXETT00000019726|xnr1 >ENSXETG00000017442|ENSXETT00000037932|xnr5.2 >ENSXETG00000016779|ENSXETT00000036596|xnr5 >ENSXETG00000016778|ENSXETT00000036593|xnr6 >ENSXETG00000023748|ENSXETT00000051228|xnr
sed F:dev\sql>more tmp.txt >ENSXETG00000025789|ENSXETT00000019729|nodal2 >ENSXETG00000009009|ENSXETT00000019730|nodal3 >ENSXETG00000025789|ENSXETT00000019728|nodal2 >ENSXETG00000009008|ENSXETT00000019726|nodal1 >ENSXETG00000017442|ENSXETT00000037932|nodal5.2 >ENSXETG00000016779|ENSXETT00000036596|nodal5 >ENSXETG00000016778|ENSXETT00000036593|nodal6 >ENSXETG00000023748|ENSXETT00000051228|nodal F:\dev\sql>sed "s/0/_/" tmp.txt >ENSXETG_0000025789|ENSXETT00000019729|nodal2 >ENSXETG_0000009009|ENSXETT00000019730|nodal3 >ENSXETG_0000025789|ENSXETT00000019728|nodal2 >ENSXETG_0000009008|ENSXETT00000019726|nodal1 >ENSXETG_0000017442|ENSXETT00000037932|nodal5.2 >ENSXETG_0000016779|ENSXETT00000036596|nodal5 >ENSXETG_0000016778|ENSXETT00000036593|nodal6 >ENSXETG_0000023748|ENSXETT00000051228|nodal F:\dev\sql>sed "s/0/_/g" tmp.txt >ENSXETG______25789|ENSXETT______19729|nodal2 >ENSXETG_______9__9|ENSXETT______1973_|nodal3 >ENSXETG______25789|ENSXETT______19728|nodal2 >ENSXETG_______9__8|ENSXETT______19726|nodal1 >ENSXETG______17442|ENSXETT______37932|nodal5.2 >ENSXETG______16779|ENSXETT______36596|nodal5 >ENSXETG______16778|ENSXETT______36593|nodal6 >ENSXETG______23748|ENSXETT______51228|nodal
The pipe F:dev\sql>grep “>” my-sequence-file.fasta > tmp.txt F:\dev\sql>sed "s/nodal/xnr/" tmp.txt >ENSXETG00000025789|ENSXETT00000019729|xnr2 >ENSXETG00000009009|ENSXETT00000019730|xnr3 >ENSXETG00000025789|ENSXETT00000019728|xnr2 >ENSXETG00000009008|ENSXETT00000019726|xnr1 >ENSXETG00000017442|ENSXETT00000037932|xnr5.2 >ENSXETG00000016779|ENSXETT00000036596|xnr5 >ENSXETG00000016778|ENSXETT00000036593|xnr6 >ENSXETG00000023748|ENSXETT00000051228|xnr F:dev\sql>grep “>” my-sequence-file.fasta|sed "s/nodal/xnr/" >ENSXETG00000025789|ENSXETT00000019729|xnr2 >ENSXETG00000009009|ENSXETT00000019730|xnr3 >ENSXETG00000025789|ENSXETT00000019728|xnr2 >ENSXETG00000009008|ENSXETT00000019726|xnr1 >ENSXETG00000017442|ENSXETT00000037932|xnr5.2 >ENSXETG00000016779|ENSXETT00000036596|xnr5 >ENSXETG00000016778|ENSXETT00000036593|xnr6 >ENSXETG00000023748|ENSXETT00000051228|xnr
(n)awk Awk: manipulates data in structured, tabular, format - reads input one line at a time, but operates on fields. 3 06/04/2011 06/04/2011 Ashleigh Howes OK ahowes@nimr.mrc.ac.uk Immunoreg O'Garra 2 06/04/2011 06/04/2011 Leona Gabrysova OK lgabrys@nimr.mrc.ac.uk Immunoreg O'Garra 1 06/04/2011 06/04/2011 Eric Dang (ok) edang@nimr.mrc.ac.uk Immunoreg O'Garra 07/04/2011 07/04/2011 Guilherme Neves OK gneves@nimr.mrc.ac.uk Molecular Neuro Pachnis 5 08/04/2011 08/04/2011 Mary Wu OK mwu@nimr.mrc.ac.uk Systems Biol Smith 24 08/04/2011 08/04/2011 Jose Saldanha OK jsaldan@nimr.mrc.ac.uk Math Biol Goldstein 8 08/04/2011 08/04/2011 Laurent Dupays OK ldupays@nimr.mrc.ac.uk Dev Biol Mohun 7 08/04/2011 08/04/2011 Veronique Duboc OK vduboc@nimr.mrc.ac.uk Dev Biol Logan 22 09/04/2011 09/04/2011 Madhu Kadekoppala OK mkadeko@nimr.mrc.ac.uk Parasitology Holder 21 09/04/2011 09/04/2011 George Gentsch OK ggentsc@nimr.mrc.ac.uk Systems Biol Smith 13 10/04/2011 11/04/2011 Siggi Sato OK ssato@nimr.mrc.ac.uk Parasitology Holder 10/04/2011 10/04/2011 James Briscoe OK jbrisco@nimr.mrc.ac.uk Dev Neurobiol Briscoe 26 10/04/2011 17/04/2011 Alex Watson OK awatson@nimr.mrc.ac.uk Systems Biol Smith 10/04/2011 Mustafa Khokha OK mustafa.khokha@yale.edu YALE F:dev\sql>nawk -F\t "{ print $4, $7; }" I:\transfer\workshop.txt Ashleigh ahowes@nimr.mrc.ac.uk Leona lgabrys@nimr.mrc.ac.uk Eric edang@nimr.mrc.ac.uk Guilherme gneves@nimr.mrc.ac.uk Mary mwu@nimr.mrc.ac.uk : F:dev\sql>nawk -F\t "{ print $4, $7; }" I:\transfer\workshop.txt | sed "s/@/ AT /“ Ashleigh ahowes AT nimr.mrc.ac.uk Leona lgabrys AT nimr.mrc.ac.uk Eric edang AT nimr.mrc.ac.uk Guilhermegneves AT nimr.mrc.ac.uk Mary mwu AT nimr.mrc.ac.uk :
Now let’s go and get some data... Download the set of gene locus coordinates for your model organism from BioMart (Ensembl). Practise using grep, sed and (n)awl on this data. E.g. How many nodal genes are there? Etc.
Simple exercise Here is a small test, driven by a real need.Take the following tiny section of a fasta file of human proteins from NCBI, and put it in your own fasta/text file. >gi|10047086|ref|NP_061821.1| mitogen-inducible gene 6 protein [Homo sapiens] MSIAGVAAQEIRVPLKTGFLHNGRAMGNMRKTYWSSRSEFKNNFLNIDPITMAYSLNSSAQERLIPLGHASKSAPMNGHCFAENGPSQKSSLPPLLIPPSENLGPHEEDQVVCGFKKLTVNGVCASTPPLTPIKNSPSLFPCAPLCERGSRPLPPLPISEALSLDDTDCE >gi|10047090|ref|NP_055147.1| small muscle protein, X-linked [Homo sapiens] MNMSKQPVSNVRAIQANINIPMGAFRPGAGQPPRRKECTPEVEEGVPPTSDEEKKPIPGAKKLPGPAVNLSEIQNIKSELKYVPKAEQ >gi|10047100|ref|NP_057387.1| WW domain binding protein 5 [Homo sapiens] MKSCQKMEGKPENESEPKHEEEPKPEEKPEEEEKLEEEAKAKGTFRERLIQSLQEFKEDIHNRHLSNEDMFREVDEIDEIRRVRNKLIVMRWKVNRNHPYPYLM >gi|10047102|ref|NP_057388.1| ribosomal protein L24-like [Homo sapiens] MRIEKCYFCSGPIYPGHGMMFVRNDCKVFRFCKSKCHKNFKKKRNPRKVRWTKAFRKAAGKELTVDNSFEFEKRRNEPIKYQRELWNKTIDAMKRVEEIKQKRQAKFIMNRLKKNKELQKVQDIKEVKQNIHLIRAPLAGKGKQLEEKMVQQLQEDVDMEDAP Then see if you can edit it to produce new versions of the fasta file by: (a) removing the NCBI IDs string, i.e. ->>mitogen-inducible gene 6 protein [Homo sapiens]MSIAGVAAQEIRVPLKTGFLHNGRAMGNMRKTYWSSRSEFKNNFLNIDPITMAYSLNSSAQERLIPLGHASKSAPMNGHCFAENGPSQKSSLPPLLIPPSENLGPHEEDQVVCGFKKLTVetc.(b) removing everything except the accession number>NP_061821.1MSIAGVAAQEIRVPLKTGFLHNGRAMGNMRKTYWSSRSEFKNNFLNIDPITMAYSLNSSAQERLIPLGHASKSAPMNGHCFAENGPSQKSSLPPLLIPPSENLGPHEEDQVVCGFKKLTVetc.(c) identify and substitute the species name in the square brackets>gi|10047086|ref|NP_061821.1| mitogen-inducible gene 6 protein (species)MSIAGVAAQEIRVPLKTGFLHNGRAMGNMRKTYWSSRSEFKNNFLNIDPITMAYSLNSSAQERLIPLGHASKSAPMNGHCFAENGPSQKSSLPPLLIPPSENLGPHEEDQVVCGFKKLTV etc.
Catches and other platform specific stuff Paths: folder dividers ‘/’ in mac/linux ‘\’ in windows Quotes marks: you may have to experiment with double and single quotes, and even sometimes the backquote. With sed you may or may not have to quote the command string, i.e. >sed ‘s/nodal/xnr/’ OR >sed “s/nodal/xnr/” or no quotes at all. Unix/Linux is case sensitive: Hello != hello ^C (control-C) is your get out of jail free card!
For next week Download and install MySQL… This seems to work easily on Windows, but may be rather complicated the Mac, make sure you follow instructions fairly closely. The server needs to be ‘running in the background’ for you to be able to log on and create databases and tables, etc.