370 likes | 499 Views
Computational Skills Course week 1. Mike Gilchrist NIMR May-July 2011. Main topics. The command line Data and text files Simple command line tools Third party bioinformatics tools Databases and SQL 3GL’s – industrial strength computing Scripts and pipelines
E N D
Computational Skills Courseweek 1 Mike Gilchrist NIMR May-July 2011
Main topics The command line Data and text files Simple command line tools Third party bioinformatics tools Databases and SQL 3GL’s – industrial strength computing Scripts and pipelines What will not appear and why...
Axioms Google it You cannot break your computer Familiarity breeds content Bugs are always your fault (and can always be found - eventually) There are many ways of doing any give task (but one of them may be much better) Computational biologists tend to obfuscate GNU = Gnu is Not Unix
WEEK ONE Basic concepts of computer based data, the command line, and some essential command line utilities.
At the command line D:\projects\current> program -U jackyO -n 25 -i E:\data\experiment-1.txt > output.txt program = program.exe or -rwxr-xr-x 1 migilsysbio 18K Mar 12 16:16 program [migil@INFO-LINUX2 gcc]$ hts_utils
PATH F:\data\projects\khokha\exon-capture\blast\data-12may11.txt Also the PATH environment variable which tells the computer where to look for programs
Computer data a ‘byte’ 10001011010010110010001110110101011010100101000010100101... 128 64 32 16 8 4 2 1 = 35 (base 10) 00000000 = 0 11111111 = 255 256 possible characters in the simplest form of computer data
The ascii ‘alphabet’ 00000000 = 0 \0 (the null terminator) 00001001 = 9 \t TAB 00001010 = 10 \n (line feed) 00001011 = 13 \r (return) 00100000 = 32 ‘ ‘ (space) 00110000 = 48 ‘0’ : : : : 00111001 = 57 ‘9’ 01000001 = 65 ‘A’ : : : : 01011010 = 90 ‘Z’
A typical text data file @HWUSI-EAS582_0299:3:1:87:82/1 GGCGACGATACATTCGGATGTCTGCCCTATCAACTTTCGAT +HWUSI-EAS582_0299:3:1:87:82/1 ffbffffff^UYbb\ffbee_M_f]UV^JYRXbSdf`bfff @HWUSI-EAS582_0299:3:1:87:1949/1 CGGTTCAGCAGGAATGCCGAGACCGATATCGTATGCCGTCT +HWUSI-EAS582_0299:3:1:87:1949/1 _ebhhghghgedgceg\fggfghggg_]eZ\WWcacMb_ce @HWUSI-EAS582_0299:3:1:87:1688/1 CGGGCATCTAAGGGCATCACAGACCTGTTATTGCTCGATCT +HWUSI-EAS582_0299:3:1:87:1688/1 ffbdd`bfffbf`ZfZddddcbfZfff]ffaddYdeffedd @HWUSI-EAS582_0299:3:1:87:1773/1 CGCTTGTTTCCTGATGTCACATGACAACACAAGATCGATAA +HWUSI-EAS582_0299:3:1:87:1773/1 gfgggggfggggfgggZggfddggggfag\fggafgdgggg @HWUSI-EAS582_0299:3:1:87:1000/1 ATTGTGCAAGTCTCCCAATGTCGATTTAATGAAATCCCTAC +HWUSI-EAS582_0299:3:1:87:1000/1 ggeggegeggcgggfdaaffdgggggfggggggggggdggd @HWUSI-EAS582_0299:3:1:87:842/1 GGAGTATGGTTGCAAAGCTGAAACTTAAAGGAATTGACGGA +HWUSI-EAS582_0299:3:1:87:842/1 gggggggggggggggggggggggfgggggggfggggggggg @HWUSI-EAS582_0299:3:1:88:1826/1 TTCGGACACACTGGGCCCAGATGGCTTCTTGGATTTAGGGG +HWUSI-EAS582_0299:3:1:88:1826/1 cWbgggcfgfc`g^gc^ddgg\ggbcd\`OfdZW`d`dJdg
The line ending/platform issue DOS (Windows) Unix/Linux (newer Macs) 582_0299:3:1:87:82/1\n\r GATGTCTGCCCTATCAACTTTCGAT\n\r 582_0299:3:1:87:82/1\n\r fbee_M_f]UV^JYRXbSdf`bfff\n\r 582_0299:3:1:87:82/1\n GATGTCTGCCCTATCAACTTTCGAT\n 582_0299:3:1:87:82/1\n fbee_M_f]UV^JYRXbSdf`bfff\n (really) 582_0299:3:1:87:82/1\n\rGATGTCTGCCCTATCAACTTTCGAT\n\r582_0299:3:1:87.. Programs which read text in a line at a time may give the wrong result if the text file is from an incompatible platform... If you are moving a file from Linux -> Windows run unix2dos first. >unix2dos <FILENAME>
Command line utilities for manipulating data files http://sourceforge.net/projects/gnuwin32/files/ >grep finds lines matching ‘patterns’ >sed swaps one pattern for another >awk operates on ‘fields’ in a record >cat >tail >more/less >unix2dos/dos2unix AND THEN THERE ARE PIPES... ‘|’ AND REDIRECTS... ‘>’
grep A typical fasta sequence file (Xenopus tropicalis mRNAs) : ACAAGCGATCTTGTAGAGCAATTCCAGCAACACTTAAAGGGACTTCTGTCTGTACTTACC AAACTGACAAAAAAGGCCAATCTACTGACAAACTCTTACAAAAAGCAGATTGGCATTGGT GCTCCGAGCAGT >ENSXETG00000025789|ENSXETT00000019729|nodal2 ATGGCAGCCCTAGGAGCCCTCTTTTTATTTGCCATGGCCTCCCTTGTGCACGGGAAGCCC ATTCATTCAGACAGAAAAGGAGCTAAAATCCCTCTGGCAGGATCTAACCTGGGATACAAG AAATCCAGCAATTCATATGGTTCCAGACTGTCGCAGGGTATGAGATACCCCCCTTCCATG ATGCAGTTATACCAGACTCTGATTTTGGGGAATGATACGGATCTGTCAATCCTGGAATAT CCCATCCTGCAGGAATCTGATGCCGTTCTAAGCCTCATTGCAAAAAGTTGCGATGTAGTG GGCAATCGATGGACATTGTCCTTTGACATGTCTTCTATATCGAGCAGCAATGAGCTGAAA TTGGCCGAGTTGAGGATCCGCCTCCCTTCCTTTGAAAGGTCCCAGGAT >ENSXETG00000004295|ENSXETT00000014760|neurod4 AGCCCCGGACTCATTGATATTGGGCACAGCGAGTCTGCCTGGGAGCTGTCCAGCACTCCA TGCTCCTGAAATAACTTGGGCAACAAGTCCGATCTGCCCGCTACTCTGTGCCTCCAGCTC AGGCCCGGGGAGAGGGACCCTGCTGAGCAGGACTCAGGACACTGTTTGAAGATCACATCA AATTCTGCTAATATGTCGGAGATAGTCAGTGTGCATGGGTGGATGGAGGAAGCCCTTAGT TCCCAGGATGAGATGGAGAGGAATCAGCGGCAATCTGCCTATGATATCATTTCAGGTCTG AGTCACGAGGAAAGGTGTAGCATAGATGGAGAAGATGATGATGAAGAAGAAGAGGATGGA GAGAAACCAAAAAAGAGGGGACCCAAAAAAAAGAAGATGACCAAGGCTAGACTGGAGAGG TTTCGTGTGCGCAGAGTAAAAGCCAATGCCAGGGAGCGCACCAGAATGCATGGACTTAAT GATGCCCTAGAAAATTTAAGGAGGGTCATGCCTTGCTATTCCAAAACACAAAA >ENSXETG00000006455|ENSXETT00000014172|camk2g AGACAAGAAACAGTGGAATGTTTGAGAAAGTTCAATGCACGGAGAAAGCTTAAAGGTGCA ATTCTCACAACAATGTTGGTTTCTCGGAATTTTTCTGGAATTGCATTTGGATGCCGAAAA GCTGCATCCACTGTCCCGTGTACCTCTTCAACGGGGGACACTATAACTGGTGTTGGCAGG CAGACCTCCGCCCCTGTTGTGGCCGCCACCAGTGCTGCCAACTTAGTCGAGCAAGCTGCC AAGAGTTTGTTGAACAAGAAGACAGATGGTGTCAAGCCACAGACCAACAACAAAAACAGC ATAATAAGCCCTGCAAAAGAAAACCCCCCATTGCAGACATCAATGGAACCTCAAACAACT GTTGTCCACAATGCAACTGATGGGATAAAAGGATCAACAGAGAGTTGTAACACCACCACT :
grep grep reads the input file line by line and reports on each line containing the pattern F:dev\sql>grep ACGTACGT my-sequence-file.fasta GAGAAGAACGTACGTGAGTGCAACCCATTCCTGGACCCGGAGATGGTGCGATTCCTCTGG TATTAAGAAAGAAAAGTTACGTACGTTGATAGACCTTGTAAGTGAAGAGAAGATGTTAGA TTCAACCCAACTTACTATGTTACTATTGCTTCATTCCTTTTCACGTACGTCTGGTCTCAA GGAATTAATAACCAGGATTTTGAAGGGGATTGCTACGTACGTCGCAGGTTATCCGGTGGA : : F:dev\sql>grep –c ACGTACGT my-sequence-file.fasta 76 F:dev\sql>grep nodal my-sequence-file.fasta >ENSXETG00000025789|ENSXETT00000019729|nodal2 >ENSXETG00000009009|ENSXETT00000019730|nodal3 >ENSXETG00000025789|ENSXETT00000019728|nodal2 >ENSXETG00000009008|ENSXETT00000019726|nodal1 >ENSXETG00000017442|ENSXETT00000037932|nodal5.2 >ENSXETG00000016779|ENSXETT00000036596|nodal5 >ENSXETG00000016778|ENSXETT00000036593|nodal6 >ENSXETG00000023748|ENSXETT00000051228|nodal F:dev\sql>grep –c “>” my-sequence-file.fasta 28937 F:dev\sql>grep “>” my-sequence-file.fasta > my-sequence-file-def-lines.txt
sed stream editor: reads input line at a time, and operates on the line as requested – typically to make a substitution. E.g. Replace groups of space with a TAB, etc. F:dev\sql>grep nodal my-sequence-file.fasta > tmp.txt F:dev\sql>more tmp.txt >ENSXETG00000025789|ENSXETT00000019729|nodal2 >ENSXETG00000009009|ENSXETT00000019730|nodal3 >ENSXETG00000025789|ENSXETT00000019728|nodal2 >ENSXETG00000009008|ENSXETT00000019726|nodal1 >ENSXETG00000017442|ENSXETT00000037932|nodal5.2 >ENSXETG00000016779|ENSXETT00000036596|nodal5 >ENSXETG00000016778|ENSXETT00000036593|nodal6 >ENSXETG00000023748|ENSXETT00000051228|nodal F:\dev\sql>sed "s/nodal/xnr/" tmp.txt >ENSXETG00000025789|ENSXETT00000019729|xnr2 >ENSXETG00000009009|ENSXETT00000019730|xnr3 >ENSXETG00000025789|ENSXETT00000019728|xnr2 >ENSXETG00000009008|ENSXETT00000019726|xnr1 >ENSXETG00000017442|ENSXETT00000037932|xnr5.2 >ENSXETG00000016779|ENSXETT00000036596|xnr5 >ENSXETG00000016778|ENSXETT00000036593|xnr6 >ENSXETG00000023748|ENSXETT00000051228|xnr
sed F:dev\sql>more tmp.txt >ENSXETG00000025789|ENSXETT00000019729|nodal2 >ENSXETG00000009009|ENSXETT00000019730|nodal3 >ENSXETG00000025789|ENSXETT00000019728|nodal2 >ENSXETG00000009008|ENSXETT00000019726|nodal1 >ENSXETG00000017442|ENSXETT00000037932|nodal5.2 >ENSXETG00000016779|ENSXETT00000036596|nodal5 >ENSXETG00000016778|ENSXETT00000036593|nodal6 >ENSXETG00000023748|ENSXETT00000051228|nodal F:\dev\sql>sed "s/0/_/" tmp.txt >ENSXETG_0000025789|ENSXETT00000019729|nodal2 >ENSXETG_0000009009|ENSXETT00000019730|nodal3 >ENSXETG_0000025789|ENSXETT00000019728|nodal2 >ENSXETG_0000009008|ENSXETT00000019726|nodal1 >ENSXETG_0000017442|ENSXETT00000037932|nodal5.2 >ENSXETG_0000016779|ENSXETT00000036596|nodal5 >ENSXETG_0000016778|ENSXETT00000036593|nodal6 >ENSXETG_0000023748|ENSXETT00000051228|nodal F:\dev\sql>sed "s/0/_/g" tmp.txt >ENSXETG______25789|ENSXETT______19729|nodal2 >ENSXETG_______9__9|ENSXETT______1973_|nodal3 >ENSXETG______25789|ENSXETT______19728|nodal2 >ENSXETG_______9__8|ENSXETT______19726|nodal1 >ENSXETG______17442|ENSXETT______37932|nodal5.2 >ENSXETG______16779|ENSXETT______36596|nodal5 >ENSXETG______16778|ENSXETT______36593|nodal6 >ENSXETG______23748|ENSXETT______51228|nodal
The pipe F:dev\sql>grep “>” my-sequence-file.fasta > tmp.txt F:\dev\sql>sed "s/nodal/xnr/" tmp.txt >ENSXETG00000025789|ENSXETT00000019729|xnr2 >ENSXETG00000009009|ENSXETT00000019730|xnr3 >ENSXETG00000025789|ENSXETT00000019728|xnr2 >ENSXETG00000009008|ENSXETT00000019726|xnr1 >ENSXETG00000017442|ENSXETT00000037932|xnr5.2 >ENSXETG00000016779|ENSXETT00000036596|xnr5 >ENSXETG00000016778|ENSXETT00000036593|xnr6 >ENSXETG00000023748|ENSXETT00000051228|xnr F:dev\sql>grep “>” my-sequence-file.fasta|sed "s/nodal/xnr/" >ENSXETG00000025789|ENSXETT00000019729|xnr2 >ENSXETG00000009009|ENSXETT00000019730|xnr3 >ENSXETG00000025789|ENSXETT00000019728|xnr2 >ENSXETG00000009008|ENSXETT00000019726|xnr1 >ENSXETG00000017442|ENSXETT00000037932|xnr5.2 >ENSXETG00000016779|ENSXETT00000036596|xnr5 >ENSXETG00000016778|ENSXETT00000036593|xnr6 >ENSXETG00000023748|ENSXETT00000051228|xnr
(n)awk Awk: manipulates data in structured, tabular, format - reads input one line at a time, but operates on fields. 3 06/04/2011 06/04/2011 Ashleigh Howes OK ahowes@nimr.mrc.ac.uk Immunoreg O'Garra 2 06/04/2011 06/04/2011 Leona Gabrysova OK lgabrys@nimr.mrc.ac.uk Immunoreg O'Garra 1 06/04/2011 06/04/2011 Eric Dang (ok) edang@nimr.mrc.ac.uk Immunoreg O'Garra 07/04/2011 07/04/2011 Guilherme Neves OK gneves@nimr.mrc.ac.uk Molecular Neuro Pachnis 5 08/04/2011 08/04/2011 Mary Wu OK mwu@nimr.mrc.ac.uk Systems Biol Smith 24 08/04/2011 08/04/2011 Jose Saldanha OK jsaldan@nimr.mrc.ac.uk Math Biol Goldstein 8 08/04/2011 08/04/2011 Laurent Dupays OK ldupays@nimr.mrc.ac.uk Dev Biol Mohun 7 08/04/2011 08/04/2011 Veronique Duboc OK vduboc@nimr.mrc.ac.uk Dev Biol Logan 22 09/04/2011 09/04/2011 Madhu Kadekoppala OK mkadeko@nimr.mrc.ac.uk Parasitology Holder 21 09/04/2011 09/04/2011 George Gentsch OK ggentsc@nimr.mrc.ac.uk Systems Biol Smith 13 10/04/2011 11/04/2011 Siggi Sato OK ssato@nimr.mrc.ac.uk Parasitology Holder 10/04/2011 10/04/2011 James Briscoe OK jbrisco@nimr.mrc.ac.uk Dev Neurobiol Briscoe 26 10/04/2011 17/04/2011 Alex Watson OK awatson@nimr.mrc.ac.uk Systems Biol Smith 10/04/2011 Mustafa Khokha OK mustafa.khokha@yale.edu YALE F:dev\sql>nawk -F\t "{ print $4, $7; }" I:\transfer\workshop.txt Ashleigh ahowes@nimr.mrc.ac.uk Leona lgabrys@nimr.mrc.ac.uk Eric edang@nimr.mrc.ac.uk Guilherme gneves@nimr.mrc.ac.uk Mary mwu@nimr.mrc.ac.uk : F:dev\sql>nawk -F\t "{ print $4, $7; }" I:\transfer\workshop.txt | sed "s/@/ AT /“ Ashleigh ahowes AT nimr.mrc.ac.uk Leona lgabrys AT nimr.mrc.ac.uk Eric edang AT nimr.mrc.ac.uk Guilhermegneves AT nimr.mrc.ac.uk Mary mwu AT nimr.mrc.ac.uk :
Now let’s go and get some data... Download the set of gene locus coordinates for your model organism from BioMart (Ensembl).
Catches and other platform specific stuff Paths: folder dividers ‘/’ in mac/linux ‘\’ in windows Quotes marks: you may have to experiment with double and single quotes, and even sometimes the backquote. With sed you may or may not have to quote the command string, i.e. >sed ‘s/nodal/xnr/’ OR >sed “s/nodal/xnr/” or no quotes at all. Unix/Linux is case sensitive: Hello != hello ^C (control-C) is your get out of jail free card!
Next week Download and install MySQL…
WEEK TWO Relational databases and the structured query language (SQL), and an introduction to MySQL.
What is a database? A collection of data tables, or other structures, managed by a computer program and queryable by some sort of (usually) standard language. We will look at relational databases where the data are help in strict column based tables, and queried with SQL – the structured query langauge. This is a standard across a wide range of products: MySQL, Oracle, Sybase, MS-SQLServer PostGres, etc. Database server: the computer program that manages everything Database: a defined area which contains a set of related data tables and other entities: rules, datatypes, etc. Table: a single defined object with a fixed (by you) number of columns and any number of rows Datatypes: out of which data structures are built, i.e. you must define the ‘type’ of your data: integer, float, varchar(), text, etc.
A table course_applicants
Another table research_roles
Creating a table [this is SQL!] create table course_applicants ( date_applied datetime, first_name varchar(32), last_name varchar(32), status varchar(4) null, email varchar(32), divison varchar(32), lab varchar(32) null, role_id integer )
Creating a table create table course_applicants ( cap_date_applied datetime, cap_first_name varchar(32), cap_last_name varchar(32), cap_status varchar(4) null, cap_email varchar(32), cap_divison varchar(32), cap_lab varchar(32) null, cap_role_id integer )
Creating a table: at the prompt F:\dev\sql>MySQL –U myId –P myPass >use mydb >go >create table course_applicants ( date_applied datetime, etc… >go >(will now create table in the database)
Creating a table: in a script (better) table_course_applicants.sql use mydb go create table course_applicants ( date_applied datetime, first_name varchar(32), last_name varchar(32), status varchar(4) null, email varchar(32), divison varchar(32), lab varchar(32) null, role_id integer ) go F:\dev\sql>MySQL –U myId –P myPass < table_course_applicants.sql F:\dev\sql>(table is now created)
Entering data into a table insert_applicants.sql use mydb go insert course_applicants ( date_applied, first_name, last_name, email, divison, role_id ) select getdate(), ‘Mike’, ‘Gilchrist’, ‘mgilchr@nimr.mrc.ac.uk’, ‘Sys Bio’, 8 insert course_applicants ( date_applied, first_name, last_name, email, divison, role_id ) select getdate(), ‘Mustafa’, ‘Khokha’, ‘mustafa.khokha@yale.edu’, ‘YALE’, 8 <etc.> <etc.> <etc.> go F:\dev\sql>MySQL –U myId –P myPass < insert_applicants.sql F:\dev\sql>(data is now added to table)
Loading a table from a text data file computing-course-applicants.txt F:\dev\sql>MySQL –U myId –P myPass >use mydb >go >LOAD DATA LOCAL FILE computing-course-applicants.txt >INTO TABLE course_applicants >go
Using SQL to query/analyse/alter the data F:\dev\sql>MySQL –U myId –P myPass >use mydb >go >select * from course_applicants >go >(outputs data to screen) select * from course_applicants order by date_applied select count(*) ‘applicants’, count(distinct division) ‘divisions’ from course_applicants select lab, count(*) from course_applicants group by lab order by count(*) desc select lab, count(*) from course_applicants group by lab having count(*) > 1 order by lab select * from course_applicants where first_name like ‘J%’ select * from course_applicants where status != ‘OK’
Joining tables for more complex queries select first_name, last_name, role from course_applicants, research_roles where course_applicants.role_id = research_roles.role_id (JOIN) select first_name, last_name, role from course_applicants a, research_roles r (table aliases) where a.role_id = r.role_id (JOIN) select cap_first_name, cap_last_name, rro_role from course_applicants, research_roles where cap_role_id = rro_role_id (JOIN) select rro_role ‘role’, count(*) ‘on course’ from course_applicants, research_roles where cap_role_id = rro_role_id (JOIN) group by rro_role order by rro_role
delete and update statements /* CHANGE THE ROLE DESCRIPTION */ update research_roles set role = ‘CDF (Post Doc)’ where role = ‘Post Doc’ update research_roles set role = ‘CDF (Post Doc)’ where role_id = 2 /* REMOVE PIs */ delete course_applicants where role_id = 8 delete course_applicants from research_roles where course_applicants.role_id = research_roles.role_id and research_roles.role = ‘Principal Investigator’ : and research_roles.role like ‘P% I%’
Catches Never use sub-queries: select ... from tableA a, tableB b where a.id = b.id and a.id in (select id from tableC where id like ‘NIMR%’ ) SQL can give you the wrong answer in a plausible way... If your query is taking forever you have probably missed out a join condition. You can always add rows of data easily, but adding columns is hard work, and means you will have to re-write code... Long text data (e.g. sequences) can be tough to handle.
Exercises Design a table, or tables, to hold personal details (names, etc) and contact information (email, phone numbers, etc.) for people in an organisation. Download the set of gene loci and exon data for your organism from BioMart, design some tables for this data, load the data from your data files, and then query the data for the number of multi-exon vs single exon genes, and find the gene with the most number of exons, etc.
Next week Download and install BLAST from NCBI Prepare a two minute verbal presentation of you project