Custom BLAST Databases: A Primer

Shawn Houston, houston@alaska.edu
UAF Life Science Informatics

Custom BLAST Databases
• Why?
  • To limit your search domain
  • To use your own unique sequences
  • To automate your BLAST searches
    • Pipeline
    • Workflow
• How?
  • Linux
    • It's what I do...

Custom BLAST Databases
• What do I need?
  • Input in either FASTA or ASN.1 format
    • I will focus on FASTA
  • NCBI Toolkit
    • formatdb
    • BLAST binary downloads include formatdb

From the formatdb man page:

    formatdb [-] [-B filename] [-F filename] [-L filename] [-T filename]
             [-V] [-a] [-b] [-e] [-i filename] [-l filename] [-n str] [-o]
             [-p F] [-s] [-t str] [-v N]

    DESCRIPTION
    formatdb must be used in order to format protein or nucleotide source
    databases before these databases can be searched by blastall, blastpgp
    or MegaBLAST. The source database may be in either FASTA or ASN.1
    format. Although the FASTA format is most often used as input to
    formatdb, the use of ASN.1 is advantageous for those who are using
    ASN.1 as the common source for other formats such as the GenBank
    report. Once a source database file has been formatted by formatdb it
    is not needed by BLAST. Please note that if you are going to apply
    periodic updates to your BLAST databases using fmerge(1), you will
    need to keep the source database file.

FASTA Format

    >This is an entry header
    atcgtcgattgatgtcgtgatcgtagtcgtagctgatgactgtatgctgcatgtgctaaaaacatgctagct

• Important Note
  • NCBI only considers the first 32 characters in a FASTA header significant, and NCBI-provided tools decide whether a sequence is unique using only those characters.
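Because uniqueness is judged on a header prefix, it can help to check a FASTA file for collisions before formatting it. The following is a minimal sketch, not an NCBI tool: the function name `find_header_collisions` is hypothetical, the 32-character width is taken from the note above, and it assumes the leading '>' is not counted among the significant characters.

```python
from collections import defaultdict

# Width quoted on the slide; assumed here to exclude the leading '>'.
SIGNIFICANT_CHARS = 32


def find_header_collisions(fasta_lines, width=SIGNIFICANT_CHARS):
    """Group FASTA headers whose first `width` characters are identical."""
    seen = defaultdict(list)
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].rstrip("\r\n")
            seen[header[:width]].append(header)
    # Keep only the prefixes shared by more than one header.
    return {prefix: headers for prefix, headers in seen.items() if len(headers) > 1}


# Hand-written example records (illustrative only, not real data).
example = [
    ">sample_from_site_A_replicate_number_001\n",
    "atcgtcgattgatg\n",
    ">sample_from_site_A_replicate_number_002\n",
    "gtcgtgatcgtagt\n",
    ">short_header\n",
    "cgtagctgatgact\n",
]
collisions = find_header_collisions(example)
for prefix, headers in collisions.items():
    print(f"{len(headers)} headers share the prefix {prefix!r}")
```

The two long headers differ only after character 32, so they are reported together; moving the distinguishing part of a name to the front of the header avoids the problem.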

The FASTA Header
• >dbi|accnum| my header
• An NCBI Recognized Database ID

    Database                        Identifier syntax
    GenBank                         gb|accession|locus
    EMBL Data Library               emb|accession|locus
    DDBJ, DNA Database of Japan     dbj|accession|locus
    NBRF PIR                        pir||entry
    Protein Research Foundation     prf||name
    SWISS-PROT                      sp|accession|entry name
    Brookhaven Protein Data Bank    pdb|entry|chain
    Patents                         pat|country|number
    GenInfo Backbone Id             bbs|number
    General database identifier     gnl|database|identifier
    NCBI Reference Sequence         ref|accession|locus
    Local Sequence identifier       lcl|identifier

The FASTA Header 2
• Do not leave any space between '>' and the NCBI Database ID
• gnl and lcl can be your friend
• fastacmd
  • Retrieves sequences in FASTA format from a BLAST-formatted database by accession number
• Free-form headers are allowed
  • Do not forget the 32-character “limit”
  • Some things will not work (fastacmd, etc.)
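For sequences that have no public accession, the gnl and lcl forms give each record a parseable identifier. A minimal sketch of building such headers follows; the helper name `make_header`, the database tag `mylab`, and the identifier `contig0001` are all hypothetical, and the check for '|' and whitespace is a cautious assumption rather than a documented NCBI rule.

```python
def make_header(identifier, description="", database=None):
    """Return a FASTA header using lcl|identifier or gnl|database|identifier."""
    if not identifier:
        raise ValueError("identifier must not be empty")
    for part in (identifier, database or ""):
        # Assumption: keep '|' and whitespace out of the ID fields so the
        # identifier can be split unambiguously.
        if "|" in part or any(ch.isspace() for ch in part):
            raise ValueError("identifier parts must not contain '|' or whitespace")
    if database:
        seq_id = f"gnl|{database}|{identifier}"
    else:
        seq_id = f"lcl|{identifier}"
    # No space between '>' and the database ID, as the slide advises.
    return f">{seq_id} {description}".rstrip()


local_header = make_header("contig0001", "my first contig")
general_header = make_header("contig0001", "my first contig", database="mylab")
print(local_header)
print(general_header)
```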

But... I use Windows!
• DOS file line endings
  • CR/LF
• Apple
  • CR (classic Mac OS) or LF (Mac OS X)
• Linux (Unix)
  • LF
• To convert: dos2unix, tr -d '\r' < dosfile > unixfile, perl -pi -e 's/\r\n/\n/g' yourfile, etc.
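The same conversion can be done without any of those tools. The sketch below is one way to do it; the function names are hypothetical. Unlike tr -d '\r', which simply deletes carriage returns, it also turns a bare CR into LF, so CR-only files keep their line breaks.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CR/LF and bare CR line endings to LF."""
    # Replace CR/LF first so that its CR is not converted a second time.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(src_path, dst_path):
    """Read src_path as bytes and write an LF-only copy to dst_path."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(normalize_newlines(data))


dos_text = b">seq1\r\natcg\r\n"
unix_text = normalize_newlines(dos_text)
print(unix_text)
```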

Formatting Your Database
• Let us assume we have a text file containing FASTA-format nucleotide sequences, myfile.fa
• Let us assume we have a command line: Cygwin, Apple Terminal, Linux, HP-UX, …

    $ formatdb -pF -imyfile.fa

• What do I get?
  • myfile.fa.nhr, myfile.fa.nin, myfile.fa.nsq

Formatting Your Database 2
• But I am not using accession numbers or database identifiers...

    $ formatdb -pF -oF -imyfile.fa

• This produces the same files that work in the same way, except...
  • No internal accession index
  • No internal database identifier
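When formatting is one step in a script, it can be convenient to build the formatdb command line programmatically. This is a sketch only: the helper names are hypothetical, only the -p, -o and -i options used on the slides appear, and the protein file suffixes (.phr, .pin, .psq) are included as the protein counterparts of the nucleotide files listed above. Nothing is executed unless `run_formatdb` is called on a machine where formatdb is installed.

```python
import shutil
import subprocess


def formatdb_command(fasta_path, protein=False, parse_seqids=False):
    """Build the argument list for a formatdb run (-p, -o and -i only)."""
    return [
        "formatdb",
        "-p", "T" if protein else "F",       # T = protein, F = nucleotide
        "-o", "T" if parse_seqids else "F",  # T = parse sequence IDs and index them
        "-i", fasta_path,
    ]


def core_database_files(fasta_path, protein=False):
    """Names of the three core files formatdb writes next to the input."""
    prefix = "p" if protein else "n"
    return [f"{fasta_path}.{prefix}{suffix}" for suffix in ("hr", "in", "sq")]


def run_formatdb(fasta_path, **options):
    """Run formatdb, failing early if the binary is not on PATH."""
    if shutil.which("formatdb") is None:
        raise FileNotFoundError("formatdb not found on PATH")
    subprocess.run(formatdb_command(fasta_path, **options), check=True)


cmd = formatdb_command("myfile.fa")
files = core_database_files("myfile.fa")
print(" ".join(cmd))
print(files)
```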

Using Your New Database
• Copy or move myfile.fa.nhr, myfile.fa.nin, myfile.fa.nsq to their final resting place
• Let's use it!
  • We need an input sequence or sequences, in FASTA format, in one file, myseq.fa

    $ blastall -pblastn -imyseq.fa -d/mypath/myfile.fa -omyblast.out
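For a pipeline or workflow, tab-delimited output is usually easier to process than the default report. The sketch below adds blastall's -m 8 option (tabular output), which is not used on the slide, and parses the resulting twelve-column lines. The names `Hit`, `blastall_command` and `parse_tabular` are hypothetical, and the sample line is hand-written for illustration, not real BLAST output.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    percent_identity: float
    alignment_length: int
    evalue: float
    bit_score: float


def blastall_command(query, database, output, program="blastn"):
    """Build a blastall argument list; -m 8 requests tabular output."""
    return ["blastall", "-p", program, "-i", query,
            "-d", database, "-o", output, "-m", "8"]


def parse_tabular(lines):
    """Parse -m 8 lines: query, subject, %identity, length, mismatches,
    gap opens, q.start, q.end, s.start, s.end, e-value, bit score."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]),
                        float(f[10]), float(f[11])))
    return hits


# Hand-written illustrative line, not real output.
sample = ["myseq1\tlcl|contig0001\t98.50\t200\t3\t0\t1\t200\t15\t214\t1e-100\t361\n"]
hits = parse_tabular(sample)
command = blastall_command("myseq.fa", "/mypath/myfile.fa", "myblast.out")
print(" ".join(command))
print(hits[0])
```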

Let's Get Some Data
• You might have some data already, or
• NCBI
  • http://www.ncbi.nlm.nih.gov/
• Biomirror
  • http://www.bio-mirror.net/
• EMBL
  • http://www.ebi.ac.uk/embl/
• DDBJ
  • http://www.ddbj.nig.ac.jp/

Let's Get Some Data 2
• http://xml.nig.ac.jp/tutorial/rest/index.html#l2.1

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Request;

    my $ua = LWP::UserAgent->new;

    # make request
    my $req = HTTP::Request->new(POST => 'http://xml.ddbj.nig.ac.jp/rest/Invoke');
    $req->content_type('application/x-www-form-urlencoded');

    # set parameters
    $req->content('service=GetEntry&method=getDDBJEntry&accession=AB000100');

    # send request and get response
    my $res = $ua->request($req);

    # For a large result, it is better to write directly to a file:
    # my $res = $ua->request($req, 'file_name.txt');

    # show response
    print $res->content;
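The same request can be assembled with Python's standard library. This is only a sketch mirroring the Perl above: the endpoint and parameters are copied from the slide, the function name `build_entry_request` is hypothetical, and the request is built but not sent, since whether the service still answers at that address is not checked here.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint as given on the slide; its current availability is not verified.
ENDPOINT = "http://xml.ddbj.nig.ac.jp/rest/Invoke"


def build_entry_request(accession):
    """Build (but do not send) the POST request for one DDBJ entry."""
    params = {
        "service": "GetEntry",
        "method": "getDDBJEntry",
        "accession": accession,
    }
    body = urlencode(params).encode("ascii")
    return Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


request = build_entry_request("AB000100")
print(request.get_method(), request.full_url)
print(request.data.decode("ascii"))
# To send it: urllib.request.urlopen(request).read()
```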

Let's Get Some Data 3
• ftp://ftp.ncbi.nih.gov/genbank/genomes/Fungi/
  • Aspergillus_fumigatus
  • Aspergillus_nidulans_FGSC_A4
  • Candida_albicans
  • Candida_dubliniensis_CD36
  • Candida_glabrata_CBS138
  • Cryptococcus_neoformans_var_JEC21
  • Debaryomyces_hansenii_CBS767
  • ...

Where To Go From Here
• $ man formatdb
• $ man blastall
• $ blastall -
• HTML Documentation
• But, I don't have NCBI Tools installed!
  • Get your computer support people to do this if you can; otherwise you can download binaries from ftp://ftp.ncbi.nih.gov/blast/executables/release/2.2.23/

Still Going...
• There are no instructions for installing NCBI binaries
  • On Linux the BLAST data files go in /usr/share/ncbi/data
• There are a lot of BLAST programs
  • blastall
  • Blast
  • megablast
  • C++ Version (blastn, blastp, etc.)

Are We Done?
• Questions
• Comments
• Demo
  • ftp://folders.inbre.alaska.edu/FMP/BLASTdbDemo/
  • ftp://ftp.ncbi.nih.gov/blast/executables/release/2.2.23/
• Conclusion(s)
  • This is easy! (keep repeating until you believe)
  • ???????
