1. Unix Text Editing and Simple Programming
2. Text Files Most bioinformatics work involves messing around with text files.
DNA and protein sequences, genotypes, databases, results of similarity searches and multiple alignments are all stored on the computer as ordinary ASCII text files.
To read, write, and edit these text files you must get familiar with a Text Editor program.
3. What is a Text Editor? A text editor is like a word processor on a personal computer, except that it does not apply formatting styles (bold, italics, different fonts etc.).
Unix has line editors (view and edit one line at a time) and full screen editors.
A screen editor loads an entire document into a buffer and allows you to jump to any point in the document.
4. Unix Text Editors There are many different text editors available for Unix computers.
You can have multiple editors on one system
vi - old, reliable, present on every Unix machine, completely and utterly user hostile
jed - fairly simple, identical to eve on the old VMS system
pico - extremely simple, perhaps too simple
emacs - a compromise between power features and ease of use
5. Emacs The full name of the Emacs program is: "GNU Emacs, the Extensible, Customizable, Self-Documenting, Real-time Display Editor."
Emacs is free software produced by the Free Software Foundation (Boston, MA) and distributed under the GNU General Public License.
Open source software - Linux
GNU is a recursive acronym for: GNU's Not Unix
6. Starting emacs To start Emacs, at the > command prompt, just type: emacs
To use Emacs to edit a file, type:
emacs filename
(where filename is the name of your file)
When Emacs is launched, it opens either a blank text window or a window containing the text of an existing file.
8. The Emacs Display The display in Emacs is divided into three basic areas.
The top area is called the text window. The text window takes up most of the screen, and is where the document being edited appears.
Below the text window, there is a single mode line (in reverse type). The mode line gives information about the document, and about the Emacs session.
The bottom line of the Emacs display is called the minibuffer. The minibuffer holds space for commands that you give to Emacs, and displays status information.
9. Emacs Commands Emacs uses Control and Escape characters to distinguish editor commands from text to be inserted in the buffer.
Ctrl-x means to hold down the control key, and type the letter x.
(You don't need to capitalize the x, or any other control character)
[ESC] x means to press the escape key down, release it, and then type x.
10. Save & Exit To save a file as you are working on it, type:
Ctrl-x Ctrl-s
To exit emacs and return to the Unix shell, type: Ctrl-x Ctrl-c
If you have made any changes to the file, Emacs will ask you if you want to save:
Save file /u/browns02/nrdc.msf? (y,n,!,.,q,C-r or C-h)
Type y to save your changes and exit
If you type n, then it will ask again:
Modified buffers exist; exit anyway? (yes or no)
If you answer no, then it will return you to the file. You must answer yes to exit without saving changes.
11. Moving Around The arrow keys on the keyboard work for moving around one line or one character at a time.
Some navigation commands:
Move to the Top of the file: [Esc] <
Move to the End of the file: [Esc] >
Next screen (page down): Ctrl-v
Previous screen (page up): [Esc] v
Start of the current line: Ctrl-a
End of the current line: Ctrl-e
Forward one word: [Esc] f
Backward one word: [Esc] b
12. Type Text Once you move the cursor to the location in the file where you want to do some editing, you can just start typing - just like in an ordinary word processor.
The delete key should work to remove characters and inserted text will push existing text over.
13. Cut, Copy, and Paste You can delete or move blocks of text.
First move the cursor to the beginning (or end) of the block of text.
Then set a mark with: Ctrl-spacebar
Now move to the other end of the block of text and Delete or Copy the block:
Delete: Ctrl-w
Copy: [Esc] w
To Paste a copied block, move to the new location and insert with : Ctrl-y
14. Getting Help in Emacs Emacs has a built in help feature
Just type: Ctrl-h
To get help with a specific command, type: Ctrl-h k keys
(where keys are the command keys that you type for that command)
Emacs has a built in tutorial: Ctrl-h t
This will be an exercise for this week's computer lab.
15. Emacs Help on the Web Getting Started with Emacs
http://www.cs.ucl.ac.uk/teaching/supportdocs/emacs.htm
by Johnathon Poole, University College London, Dept. of Computer Science
LinuxCentral: Emacs Beginner's HOWTO
http://linuxcentral.com/linux/LDP/HOWTO/Emacs-Beginner-HOWTO.html
The official GNU Emacs Manual
http://www.gnu.org/manual/emacs/html_chapter/emacs_toc.html
Getting Started With the Emacs Screen Editor
http://www.leeds.ac.uk/iss/documentation/beg/beg6.pdf
16. Simple Programs You can use the Unix shell to run simple programs right from the command line.
Use a for loop to run a program on a bunch of files
grep lets you look for certain words in the output files
An if statement allows the program to make decisions:
Repeat if true
Sort if e-value is greater than 0.01
17. for loop We will use the "for" command in the bash shell for an exercise today.
[There are lots of other ways to do this, but I happen to know it this way.]
> for i in *.fasta
do water -asequence="$i" -bsequence=testseq.fas -auto;
done
This runs a pairwise Smith-Waterman alignment of testseq.fas against every file in the current directory that has a filename ending in .fasta (remember - logical filenames are important)
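The same loop pattern can be tried without EMBOSS installed. The sketch below is only an illustration, not part of the exercise: it builds two small made-up FASTA files (the file names and sequences are hypothetical) in a scratch directory and uses a for loop to count the sequences in each one.

```shell
# Make a scratch directory with two small example FASTA files (made-up data)
workdir=$(mktemp -d)
printf '>seq1\nACGT\n>seq2\nGGCC\n' > "$workdir/a.fasta"
printf '>seq3\nTTAA\n' > "$workdir/b.fasta"

# Loop over every file ending in .fasta, the same pattern as the water example
summary=""
for i in "$workdir"/*.fasta
do
    # grep -c counts the lines that start with ">" (one per sequence)
    n=$(grep -c '^>' "$i")
    echo "$(basename "$i"): $n sequences"
    summary="$summary$(basename "$i")=$n "
done
```

Swapping the grep line for any other command that takes a filename gives you the same "run it on a bunch of files" behavior.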
18. grep grep is a tool that finds a keyword in a file
We can use it to quickly find sequences that have no matches in a database similarity search
> grep -F -l 'No sequences found' *.fasta
UNIX can be short and sweet when you know what you are doing!
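A small sketch of the idea, assuming made-up result files (the .out names and their contents are hypothetical, not real search output): -l lists the files that contain the phrase, and -L, which GNU and BSD grep both provide, lists the files that do not.

```shell
# Two made-up result files: one with no hits, one with hits
workdir=$(mktemp -d)
printf 'Query: seqA\nNo sequences found\n' > "$workdir/seqA.out"
printf 'Query: seqB\nThe best scores are:\nhit1  0.001\n' > "$workdir/seqB.out"

# -F treats the phrase as plain text; -l prints only the names of matching files
nohits=$(cd "$workdir" && grep -F -l 'No sequences found' *.out)

# -L prints the names of files that do NOT contain the phrase
hits=$(cd "$workdir" && grep -F -L 'No sequences found' *.out)

echo "no matches: $nohits"
echo "matches:    $hits"
```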
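19. if If is a tool that makes decisions
We can use it to sort results from some type of search (similarity, pattern match, etc.)
# awk does the comparison because the shell's own tests only handle integers
if awk -v e="$eval" 'BEGIN { exit !(e + 0 < 0.05) }'; then
echo "$seqname" > goodmatch.txt;
else echo "$seqname" > nomatch.txt;
fi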
UNIX can be short and sweet when you know what you are doing!
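Here is one way the decision could be applied to a whole list of results. It is a sketch under assumptions: the input file evalues.txt, its two-column layout, and the helper name below_cutoff are all invented for illustration. awk does the numeric test because e-values are decimals (often in scientific notation) and the shell's own test command only compares integers.

```shell
# Succeed (exit status 0) when the e-value in $1 is below the cutoff in $2
below_cutoff() {
    awk -v e="$1" -v cutoff="$2" 'BEGIN { exit !(e + 0 < cutoff + 0) }'
}

workdir=$(mktemp -d)
# Made-up sequence names and e-values, one pair per line
printf 'seqA 1e-20\nseqB 0.3\nseqC 0.004\n' > "$workdir/evalues.txt"

# Read each line into two variables and sort the name into one of two files
while read -r seqname evalue
do
    if below_cutoff "$evalue" 0.05; then
        echo "$seqname" >> "$workdir/goodmatch.txt"
    else
        echo "$seqname" >> "$workdir/nomatch.txt"
    fi
done < "$workdir/evalues.txt"

good=$(tr '\n' ' ' < "$workdir/goodmatch.txt")
bad=$(tr '\n' ' ' < "$workdir/nomatch.txt")
echo "good: $good"
echo "no match: $bad"
```

Note the >> (append) instead of >: inside a loop, > would overwrite the file on every pass and keep only the last name.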
20. until, while Until and while run a loop, checking a condition (like an if decision) before each pass
x=1
until [ "$x" -gt 4 ]
do einverted my$x.seq;
x=$((x+1))
done
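The two loop types are mirror images, which a side-by-side sketch may make clearer. The example below only records the filenames it would visit (the my1.seq ... my4.seq names follow the slide; no program is actually run on them).

```shell
# until: repeat the loop body until the test becomes true
x=1
until_seen=""
until [ "$x" -gt 4 ]
do
    until_seen="$until_seen my$x.seq"
    x=$((x + 1))
done

# while: repeat the loop body as long as the test stays true
x=1
while_seen=""
while [ "$x" -le 4 ]
do
    while_seen="$while_seen my$x.seq"
    x=$((x + 1))
done

echo "until visited:$until_seen"
echo "while visited:$while_seen"
```

Both loops visit the same four files; pick whichever test reads more naturally.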
21. Use a Script A Script is a set of Unix commands saved as a text file
so it can be used over and over again
You can run any EMBOSS program in a script
This is especially good for connecting up several programs in a pipeline
use the results of one program as input for the next
sort the outputs and create a summary file
22. Make a Script The "for" loop that I just showed was done on the command line
Once it is run, it is gone.
But, you can put the same lines into a text file, save the file, and then run it as a script whenever you need to do a complex operation on a bunch of files
You must change the file permissions to make the text file eXecutable:
chmod u+x yourfilename.txt
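The whole save, chmod, run cycle might look like the sketch below. The script name countseqs.sh and the test file are hypothetical, and cat with a here-document stands in for typing the lines into Emacs.

```shell
workdir=$(mktemp -d)
script="$workdir/countseqs.sh"

# Save a few commands in a text file (this stands in for the editor)
cat > "$script" <<'EOF'
#!/bin/sh
# Print the number of sequences in each FASTA file named on the command line
for i in "$@"
do
    echo "$i: $(grep -c '^>' "$i")"
done
EOF

# Make the file executable for its owner, then run it like any other program
chmod u+x "$script"
printf '>seq1\nACGT\n>seq2\nGGCC\n' > "$workdir/test.fasta"
result=$(cd "$workdir" && ./countseqs.sh test.fasta)
echo "$result"
```

The ./ in front of the script name tells the shell to look in the current directory, which is usually not on the search path.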
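23. #!/usr/local/bin/tcsh
# Run fastx on every .seq file and record the queries that return no hits
foreach i (*.seq)
    # Assumption: id is the file name without .seq (the slide uses $id without defining it)
    set id = $i:r
    fastx $i -exp=0.05 in2=pir:* -default
    grep -q "No sequences found" $id.fastx
    if (! $status) then
        echo "No hit for $id.seq"
        echo $id.seq >> fastx.nohit
    else
        grep -q "The best scores are:" $id.fastx
        if (! $status) then
            # Line number of the hit-list header in the results file
            set line = `grep -n "The best scores are:" $id.fastx|cut -f1 -d:`
        endif
    endif
end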
24. Next Step: A Database Once you have scripted a few hundred FASTA or BLAST searches, you will have a bunch of results files (text files)
With grep and a few other scripting tricks, you can sort the data and summarize it in new text files (parsing), leading perhaps to an Excel spreadsheet
A much more elegant (and scalable) solution would be to create a database - but that goes beyond what I can teach in this course.
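As a rough sketch of the grep-and-summarize approach, assume each result file holds one tab-separated line of the form best_hit, accession, e-value. That layout, the file names, and the values are all invented for illustration; real FASTA or BLAST output needs more careful parsing.

```shell
workdir=$(mktemp -d)
# Made-up result files, one per query sequence
printf 'query seqA\nbest_hit\tP11111\t0.002\n' > "$workdir/seqA.out"
printf 'query seqB\nbest_hit\tP22222\t0.5\n' > "$workdir/seqB.out"
printf 'query seqC\nbest_hit\tP33333\t0.00001\n' > "$workdir/seqC.out"

summary="$workdir/summary.txt"
tab=$(printf '\t')

for f in "$workdir"/*.out
do
    # Keep only the best_hit line, then cut out columns 2 and 3 (accession, e-value)
    hit=$(grep '^best_hit' "$f" | cut -f2,3)
    # One summary line per query: name, accession, e-value
    printf '%s\t%s\n' "$(basename "$f" .out)" "$hit"
done | sort -t "$tab" -k3,3n > "$summary"

cat "$summary"
first=$(head -n 1 "$summary" | cut -f1)
```

The summary is tab-delimited and sorted with the smallest e-value first, so it opens directly as columns in a spreadsheet.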
25. Scripting Languages A number of programming languages have been developed to expand the power of Unix scripts.
Perl is particularly favored by biologists
object oriented programming
regular expressions
see book: Beginning Perl for Bioinformatics
by James Tisdall
Other biologists favor Python or Java
26. Shell or Perl Shell scripts use built-in tools of the Unix operating system (grep, cut, sort) and can be very fast
There is much more power and flexibility in a full programming language such as Perl
Rule of thumb: use a shell script to save a set of operations that you are comfortable using on the command line (a simple loop, a 2 or 3 step pipeline), and use Perl for more complex work.
27. Interact with Web Pages These scripting languages also make it easy to automate the use of a Web page
Submit a bunch of sequences
Choose options, add information to various fields
Extract the results from the html files that are returned
28. BioPerl Why re-invent the wheel?
Lots of common bioinformatics tasks have already been programmed as modules in Perl.
Grab sequences from GenBank, extract e-values and annotation from Blast results, etc.
Download them from www.bioperl.org
29. Becoming a Unix Power User Learn more Unix commands
http://ss64.com/bash/
Use the shell to execute simple programs
Write scripts
Download and install the latest bioinformatics software
Drive your system manager crazy
or get your own Unix machine
(Linux on an Intel machine or Mac OS X)
30. Resources Notes for Lincoln Stein's course in
Genome Informatics
http://stein.cshl.org/genome_informatics/index.html
BioPerl.org http://bio.perl.org/
PERL for biologists (Kurt Stüber)
http://caliban.mpiz-koeln.mpg.de/~stueber/perl/
Why Biologists Want to Program Computers
by James Tisdall: http://www.oreilly.com/news/perlbio_1001.html
31. Resources for Bio-Computing