Using Unix Shell Scripts to Manage Large Data
What is a Unix shell script?
• A collection of Unix commands can be stored in a file, and csh or bash can be invoked to execute the commands in that file.
• Like other programming languages, a shell has variables and flow-control statements, e.g.:
  • if-then-else
  • while
  • for
  • goto
• You can start any shell simply by typing its name.
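The flow-control statements listed above can be sketched in a few lines. This is a minimal illustration in sh/bash syntax, not taken from the slides: the function name `classify` and the threshold of 100 are made up for the example. Note that csh spells the loop `foreach ... end`, and `goto` exists in csh but not in bash.

```shell
#!/bin/bash
# Hypothetical helper (not from the slides): label a count as small or large.
classify() {
    if [ "$1" -gt 100 ]; then      # if-then-else
        echo "large"
    else
        echo "small"
    fi
}

# for loop over a fixed list of values
for n in 5 500; do
    echo "$n is $(classify "$n")"
done

# while loop: count down from 3 to 1
i=3
while [ "$i" -gt 0 ]; do
    echo "countdown $i"
    i=$((i - 1))
done
```

Saved as, say, `demo.sh`, the script can be run with `bash demo.sh`.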
Useful Unix commands
• grep: searches files for a regular expression and prints all lines that match it
• cut: selects fields or characters from each line of a file
• head/tail: print the first/last n lines of a file
• wc: counts the characters, words, and lines in a file
• split: reads a file and writes it in n-line pieces to a set of output files
• cat/paste: join files by rows (cat) or by columns (paste)
• join: merges two files on a common field
• awk: a POWERFUL pattern scanning and processing language
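To make a few of these commands concrete, here is a small sketch that runs them on a tiny tab-delimited table. The file contents and variable names are invented for illustration; real methylation files would be far larger.

```shell
# Build a tiny tab-delimited table in a temporary directory (made-up data).
demo_dir=$(mktemp -d)
printf 'id\tsite1\tsite2\n'  > "$demo_dir/a.txt"
printf 's1\t0.10\t0.90\n'   >> "$demo_dir/a.txt"
printf 's2\t0.20\t0.80\n'   >> "$demo_dir/a.txt"

# wc: number of lines, header included (tr strips padding some systems add)
n_lines=$(wc -l < "$demo_dir/a.txt" | tr -d ' ')

# cut + tail: first column of the last two lines, i.e. the sample IDs
ids=$(cut -f 1 "$demo_dir/a.txt" | tail -n 2)

# grep + cut: third column of the row whose line starts with s2
match=$(grep '^s2' "$demo_dir/a.txt" | cut -f 3)

# awk: mean of column 2 over the data rows (NR > 1 skips the header)
mean=$(awk -F'\t' 'NR > 1 { s += $2 } END { printf "%.2f", s / (NR - 1) }' "$demo_dir/a.txt")

echo "lines=$n_lines match=$match mean=$mean"
```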
Motivating example
• Genome-wide DNA methylation data
  • ~3000 samples (rows)
  • ~485,000 sites (columns)
  • Data came in batches (~300 samples per file, ~1 GB each)
• For our analysis, we would like to:
  • Pool all samples together
  • but split into ~50,000 sites per file
• Load into R? That would take ~14 GB of memory, and R takes hours to read each file
• Using csh scripts, it takes only ~10 minutes
csh script: pool samples
    #!/bin/csh
    cd /dir
    rm -f cpg.txt
    cp -f All_Beta_Values1.txt cpg.txt
    foreach m (`seq 2 9`)
        # count number of samples (line count minus the header line)
        @ l = `wc -l All_Beta_Values${m}.txt | cut -f 1 -d " "` - 1
        echo "file = ${m}, nrow = $l"
        rm -f test.txt
        # remove the header
        tail -n $l All_Beta_Values${m}.txt > test.txt
        cat test.txt >> cpg.txt
    end
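The same pooling idea can be sketched in sh/bash. This is an alternative formulation, not the author's script: it uses `tail -n +2` to skip the header line directly instead of counting lines first, and the function name `pool_files` and the tiny batch files are hypothetical stand-ins for the All_Beta_Values files.

```shell
#!/bin/bash
# pool_files OUT FIRST [REST...]
# Copy FIRST in full (header + data), then append each REST file without line 1.
pool_files() {
    pooled=$1
    shift
    cat "$1" > "$pooled"                    # first batch: keep header and data
    shift
    for batch in "$@"; do
        tail -n +2 "$batch" >> "$pooled"    # later batches: start at line 2
    done
}

# Tiny made-up batches standing in for the real batch files.
work=$(mktemp -d)
printf 'id\tcg1\ns1\t0.1\n'          > "$work/b1.txt"
printf 'id\tcg1\ns2\t0.2\ns3\t0.3\n' > "$work/b2.txt"

pool_files "$work/cpg.txt" "$work/b1.txt" "$work/b2.txt"
pooled_rows=$(wc -l < "$work/cpg.txt" | tr -d ' ')
echo "pooled file has $pooled_rows lines (1 header + 3 samples)"
```

Because nothing is held in memory, this approach streams through each batch once, which is why it scales to files far larger than R can comfortably read.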
csh script: split by sites
    #!/bin/csh
    cd /dir
    foreach n (`seq 1 9`)
        rm -f beta2950_${n}of10.txt
        # start
        @ l = ($n - 1) * 50000 + 2
        # end
        @ r = $n * 50000 + 1
        zcat cpg.txt.gz | cut -f 1,$l-$r > beta2950_${n}of10.txt
    end
    zcat cpg.txt.gz | cut -f 1,450002- > beta2950_10of10.txt
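The start/end arithmetic in the split script is easy to get off by one, so here is a small sketch of it in sh/bash. It assumes, as the script does, that column 1 holds the ID and the data columns begin at column 2. The helper name `chunk_range` and the five-column test table are invented for the example.

```shell
#!/bin/bash
# chunk_range N SIZE
# Print "START-END", the cut(1) field range of the N-th block of SIZE data
# columns, given that column 1 is the ID and data columns start at column 2.
chunk_range() {
    start=$(( ($1 - 1) * $2 + 2 ))
    end=$(( $1 * $2 + 1 ))
    echo "${start}-${end}"
}

chunk_range 1 50000    # first block of the slide's layout
chunk_range 9 50000    # ninth block; the tenth then starts at column 450002

# Small made-up table: ID column plus four data columns, split in blocks of 2.
work=$(mktemp -d)
printf 'id\tc1\tc2\tc3\tc4\ns1\t1\t2\t3\t4\n' > "$work/wide.txt"

# Second block: keep column 1 plus columns 4-5.
second=$(cut -f "1,$(chunk_range 2 2)" "$work/wide.txt" | tail -n 1)
echo "$second"
```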
Some tips
• To check whether a data file has a header line, and whether it is tab- or comma-delimited:
    > head -n 1 filename
• To inspect selected variables/columns (e.g., to see how missing values are coded), replacing each # with a column number:
    > head -n 10 filename | cut -f #,#
• To get a subset of samples by matching IDs:
    > grep -f ID.txt filename
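One caveat with the last tip: `grep -f` treats each ID as a pattern that may match anywhere in a line, so an ID such as s1 also selects s10. The sketch below, on made-up data, compares the plain form with two stricter variants. It assumes a grep that supports `-w` (GNU and BSD grep do), and the awk variant assumes the ID sits in the first tab-delimited column.

```shell
# Made-up data: the ID list contains s1; the table contains s1 and s10.
work=$(mktemp -d)
printf 's1\n' > "$work/ID.txt"
printf 'id\tcg1\ns1\t0.1\ns10\t0.9\n' > "$work/data.txt"

# Tip 1: the first line shows the header and the delimiter.
header=$(head -n 1 "$work/data.txt")

# Plain pattern match: s1 also matches the s10 row (2 lines).
plain=$(grep -c -f "$work/ID.txt" "$work/data.txt")

# -F: treat IDs as fixed strings; -w: match whole words only (1 line).
exact=$(grep -c -F -w -f "$work/ID.txt" "$work/data.txt")

# awk variant: keep rows whose first column is exactly one of the IDs.
exact_awk=$(awk -F'\t' 'NR == FNR { keep[$1]; next } $1 in keep' \
    "$work/ID.txt" "$work/data.txt" | wc -l | tr -d ' ')

echo "plain=$plain exact=$exact exact_awk=$exact_awk"
```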