300 likes | 312 Views
BLAST is a highly refined and specialized search tool used to compare genomic sequences. It is especially useful for identifying the location of a gene of interest within a genome. This tool allows for nucleotide-nucleotide, protein-protein, translated nucleotide-protein, and translated nucleotide-nucleotide comparisons. By using hash-based alignment and pre-computed k-mers, BLAST provides significant speed enhancements. This article explores how BLAST works and its applications in genomics research.
E N D
Review • Basic Local Alignment Search Tool • Initially developed and used to search genomic sequence • Ex) I have a short piece of mRNA/EST that I sequenced -- and I'd like to see where it is in a genome • Especially useful before genome browsers • Before browsers and annotation, used to "check" if the region that contained my gene of interest had been sequenced yet • Still very useful • Initial application • query sequence • target database
Blast output: Blast of Human Promoter VS Mouse Promoter in BMP4 Query= range=chr14:52413865-52418867 5'pad=0 3'pad=0 revComp=FALSE strand=? repeatMasking=none (5003 letters) >mm3_dna range=chr14:37718906-37723909 5'pad=0 3'pad=0 revComp=FALSE strand=? repeatMasking=none Length = 5004 Score = 387 bits (195), Expect = e-110 Identities = 258/278 (92%), Gaps = 2/278 (0%) Strand = Plus / Plus Query: 1 catgcaccgactagtcgccgtaccttccaaaaatacccatgggagtctgggcttccctga 60 |||||||| ||||||| | ||||||||||||||||||||||||||||||||||||||||| Sbjct: 3285 catgcaccaactagtctctgtaccttccaaaaatacccatgggagtctgggcttccctga 3344 Query: 61 gtttagtgtaactgctgcccaaactgatgattaaaacactgagtaatcaccatttaccct 120 |||||| ||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 3345 gtttagaataactgctgcccaaactgatgattaaaacactgagtaatcaccatttacccc 3404 Query: 121 gagtaattagagataatgaaacacctctaaaatcaggttatggagaaaggagctgttgga 180 ||||||||||||||||||||||||||||||||||||||||||||||| |||||||||||| Sbjct: 3405 gagtaattagagataatgaaacacctctaaaatcaggttatggagaagggagctgttgga 3464 Query: 181 tcggtttaaaaggtggccttccataaactgttaaaagatttccaaacgttgagaaacagg 240 | |||||||| ||||||||||| |||||||||||| |||| ||||||||| |||||||| Sbjct: 3465 ttggtttaaagggtggccttccgtaaactgttaaaggattcccaaacgtt--gaaacagg 3522 Query: 241 ctgtgtgcagaactgtgtgccagagagagatacctctt 278 ||||| |||| ||||||||| || |||||||||||| Sbjct: 3523 ctgtgagcagggctgtgtgcccgaaggagatacctctt 3560
Review • Highly refined and specialized • nucleotide-nucleotide (blastn), protein-protein (blastp), translated nucleotide against protein (blastx), protein against translated nucleotide (tblastn), translated nucleotide against translated nucleotide (tblastx) • Variations • bl2seq • Wu-BLAST
Blast Algorithms • Remove regions of low complexity • Make k-letter words (k-mer) of query sequence (a hash?) Default for blastn is w=11 • Compare k-mers in query sequence with pre-computed k-mers in target sequence • Note, the k-mers in the target sequence ( database) are pre-computed --> significant speed up
How does BLAST work? An alignment begins with the query sequence converted into "words" (default for blastn is w=11) ATGCCCTGACGCAATGACCTTAAGGAT ATGCCCTGCCG word1 TGCCCTGCCGC word2 GCCCTGCCGCA etc CCCTGACGCAA CCTGACGCAAT : :
Hash-Based Alignment base-10 numbers 5805 = 5*10^3 + 8*10^2 + 0*10^1+5*10^0 k=8 ATGCCTGGGCT A=0, C=1, G=2, and T=3 (base 4 number) ATGCCTGG = 0*4^7+3*4^6+2*4^5+1*4^4+1*4^3+3*4^2+2*4^1+2*4^0 = 14714 Now we can compare “chunks” of sequence much faster - speed increase by factor of 8 Can pre-compute hashes for entire genome, and only compare hashes Premise for popular alignment tools – BLAST, BLAT,and UIcluster
Appendix 1: The Files Produced by Formatdb (From formatdb.html) ------------------------------------------ Using formatdb without the "-o T" indexing option results in three BLAST database files (.nhr, .nin, ,nsq). Using the "-o T" option will result in additional files. If gi's are present in the FASTA definition lines of the source file, there will be four additional files created (.nsd, nsi, nni, nnd). These are ISAM indices for mapping a sequence identifier to a particular sequence in the BLAST database. If gi's are not use there will be only two additional files created (.nsd, .nsi). These files are listed below for both an example nucleotide and a protein sequence database. The actual sequence data is stored in the files with extension "nsq" or "psq". The compression ratio for these files is about 4:1 for nucleotides and 1:1 for protein sequence source files. Extension Content Format --------------------------------------------- Nucleotide database formatted without "-o T" nhr deflines binary nin indices binary nsq sequence data binary
Blast Works After a word alignment between query, and target, blast attempts to extend the alignment in both directions. The alignment continues which contributes to the alignment "score." Once the score falls below a threshold, the extension halts. The resulting alignment is called an HSP, or high scoring pair.
BLAST Example • Processing of BLAST output is not particularly special in terms of Bioinformatics Techniques • It is useful for comparing sequences and identifying "similarity" • similarity – a measure of how many bases/amino acids are shared between 2 sequences • At the 2 extremes: • 100 "percent identity" means 2 sequences are identical • 0 "percent identity" means 2 sequences share no residues • note that 2 random nucleotide sequences, without gaps will have 25% identity • 2 sequences may be 80% identical at the aa level, but only 55% at the nt level • similarity does not describe length • 1 pair is 98% identical for 8 nucleotides • 1 pair is 75% identical for 100 nucleotides • Which pair of sequences is significant and more closely related? • Measures exist that attempt to quantify these issues • Not covered (Statistical Methods for Bioinformatics)
BLASTN 2.2.6 [Apr-09-2003] Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. Query= (500 letters) Database: blastdb/yeast.nt 17 sequences; 12,155,026 total letters Searching.done Score E Sequences producing significant alignments: (bits) Value gi|6324971|ref|NC_001148.1| Saccharomyces cerevisiae chromosome ... 32 1.2 gi|6322623|ref|NC_001143.1| Saccharomyces cerevisiae chromosome ... 32 1.2 gi|7839148|ref|NC_001136.2| Saccharomyces cerevisiae chromosome ... 32 1.2 gi|6323989|ref|NC_001146.1| Saccharomyces cerevisiae chromosome ... 30 4.7 gi|6323501|ref|NC_001145.1| Saccharomyces cerevisiae chromosome ... 30 4.7 gi|7276232|ref|NC_001137.2| Saccharomyces cerevisiae chromosome ... 30 4.7 >gi|6324971|ref|NC_001148.1| Saccharomyces cerevisiae chromosome XVI, complete chromosome sequence Length = 948061 Score = 32.2 bits (16), Expect = 1.2 Identities = 16/16 (100%) Strand = Plus / Minus Query: 1 tctggcgtgacaccag 16 |||||||||||||||| Sbjct: 511257 tctggcgtgacaccag 511242
Query, Subject, HSP (high scoring pair) >gi|6226515|ref|NC_001224.1| Saccharomyces cerevisiae mitochondrion, complete genome Length = 85779 Score = 40.1 bits (20), Expect = 0.050 Identities = 20/20 (100%) Strand = Plus / Plus Query: 1562 ttaattattaaaaataataa 1581 |||||||||||||||||||| Sbjct: 74540 ttaattattaaaaataataa 74559 Score = 40.1 bits (20), Expect = 0.050 Identities = 20/20 (100%) Strand = Plus / Plus Query: 1562 ttaattattaaaaataataa 1581 |||||||||||||||||||| Sbjct: 2168 ttaattattaaaaataataa 2187
BLAST www.ncbi.nih.gov (web interface) Highlight parameters output in html/text word size number of results quality of hit >hg16_dna range=chr14:52413865-52418867 5'pad=0 3'pad=0 revComp=FALSE strand=? repeatMasking=none CATGCACCGACTAGTCGCCGTACCTTCCAAAAATACCCATGGGAGTCTGG GCTTCCCTGAGTTTAGTGTAACTGCTGCCCAAACTGATGATTAAAACACT GAGTAATCACCATTTACCCTGAGTAATTAGAGATAATGAAACACCTCTAA AATCAGGTTATGGAGAAAGGAGCTGTTGGATCGGTTTAAAAGGTGGCCTT CCATAAACTGTTAAAAGATTTCCAAACGTTGAGAAACAGGCTGTGTGCAG AACTGTGTGCCAGAGAGAGATACCTCTTGGGCTGTTAAAAGGTGAATTTC CTTCCAAAACATTCTTAAAGTGCGTTTTGTAGAACACCTTGGTGATGACC CTGAGGTAGACCCCAGTAAATGAGCTCACTTCCCTCCTCCCCCCAAGGCC TAAAGACAGAGGTGGCCTGCAGACAGGCTGGGGCCACGTTTTTTACAGTT AGAAAAATCTGTTTCTAGCTCAGAAGCCGTCTTAAGAACCGACACAGCAG TTATTTTTGTTGGTGTTATGGTTGTTGTTGATTTTTTTGTTCATTTTTGG TGTACCTAAGAGTCCTGACACTCTGTCTTTCCAAGGAGATAACCACAGAA • application • database • sequence
Setting up Standalone BLAST for UNIX: -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- Basically, there are three steps needed to setup the Standalone BLAST executable for the UNIX platform. 1) Download the UNIX binary, uncompress and untar the file. It is suggested that you do this in a separate directory, perhaps called "blast". 2) Create a .ncbirc file. In order for Standalone BLAST to operate, you have will need to have a .ncbirc file that contains the following lines: [NCBI] Data="path/data/" Where "path/data/" is the path to the location of the Standalone BLAST "data" subdirectory. For Example: Data=/root/blast/data The data subdirectory should automatically appear in the directory where the downloaded file was extracted. Please note that in many cases it may be necessary to delimit the entire path including the machine name and or the net work you are located on. Your systems administrator can help you if you do not know the entire path to the data subdirectory. Make sure that your .ncbirc file is either in the directory that you call the Standalone BLAST program from or in your root directory. 3) Format your BLAST database files. The main advantage of Standalone BLAST is to be able to create your own BLAST databases. This can be done with any file of FASTA formatted protein or nucleotide sequences. If you are interested in creating your own database files you should refer to the sections "Non-redundant defline syntax" and "Appendix 1: Sequence Identifier Syntax" of the README in the BLAST database directory (ftp://ftp.ncbi.nih.gov/blast/db/). You can also refer to the FASTA description available from the BLAST search pages (http://www.ncbi.nlm.nih.gov/BLAST/fasta.html).
Installing BLAST on "old" CSS Itaniums http://www.ncbi.nih.gov/Ftp/index.html Because I want to install on a remote machine, I'll use ftp – a program that stands for 'file transfer protocol' Allows me to access files by command line Another version is ncftp – with more features file completion download stats ssh login@login-pa.engineering.uiowa.edu (this will log you into an itanium that is still supported by NCBI). mkdir ~/blast cd blast ncftp ftp.ncbi.nih.gov (or ftp: login ftp, passwd e-mail@youraddress.edu) bin cd blast/executables/LATEST get blast-2.2.10-ia32.linux.tar (this should work on an itanium) ## These are out of date: get blast-2.2.6-ia32-linux.tar.gz (linux) (or blast-2.2.6-hppa-hpux.tar.gz) ~ 15 meg cd ../../../.. cd blast/db/FASTA/ get yeast.nt.gz (~ 3.5 meg) quit gunzip blast-2.2.10-ia32.linux.tar.gz tar –xvf blast* [for HPUX: cd blast-2.2.6] [for HPUX, : mv ../yeast.nt.gz .] gunzip yeast.nt.gz mkdir blastdb mv yeast.nt blastdb ./formatdb -i blastdb/yeast.nt -p F
Linux Example • Download binary (for CSS machines – "x64"): • ftp://ftp.ncbi.nih.gov/blast/executables/release • mkdir ~/blast • mv blast-2.2.18-x64-linux.tar.gz ~/blast • gunzip blast-2.2.18-x64-linux.tar.gz • tar –xvf blast-2.2.18-x64-linux.tar.gz
Linux Example • Download database(s) of choice • ftp://ftp.ncbi.nih.gov/blast/db/ • ftp://ftp.ncbi.nih.gov/blast/db/nt.00.tar.gz (nt.00.tar.gz-nt.05.tar.gz) Note – these are too LARGE for your CSS account • mkdir ~/blast/db • chmod -R 755 db/ • mv nt.*.tar.gz ~/blast/db
Create Our Own DB • Download chrX: ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_X/hs_ref_chrX.fa.gz • mv ~/Deskstop/hs_ref_chrX.fa.gz ~/blast/db • gunzip hs_ref_chrX.fa.gz • ../blast-2.2.18/bin/formatdb -i hs_ref_chrX.fa -p F -o T -n hs_ref_chrX.fa • This will create approx 8 files.
Install BLAST locally • Have all we need to run locally • Note, there are further environment settings to make it easier to run from any location on the system – that is in the readme • Need a sequence in a file to blast against database
Install Blast Locally Create file called dmd.hs (in ~/blast) with the following sequence: >gi|5032280|ref|NM_000109.2| Homo sapiens dystrophin (DMD), transcript Dp427c, mRNA CGTTAAATGCAAACGCTGCTCTGGCTCATGTGTTTGCTCCGAGGTATAGGTTTTGTTCGACTGACGTATC AGATAGTCAGAGTGGTTACCACACCGACGTTGTAGCAGCTGCATAATAAATGACTGAAAGAATCATGTTA GGCATGCCCACCTAACCTAACTTGAATCATGCGAAAGGGGAGCTGTTGGAATTCAAATAGACTTTCTGGT TCCCAGCAGTCGGCAGTAATAGAATGCTTTCAGGAAGATGACAGAATCAGGAGAAAGATGCTGTTTTGCA CTATCTTGATTTGTTACAGCAGCCAACTTATTGGCATGATGGAGTGACAGGAAAAACAGCTGGCATGGAA cd ~/blast ./blast-2.2.18/bin/blastall -p blastn -i dmd.hs -d db/hs_ref_chrX.fa -o dmd.hs.out ./blast-2.2.18/bin/blastall -p blastn -i dmd.hs -d db/hs_ref_chrX.fa -o dmd.hs.out –e 0.00001
BLASTN 2.2.15 [Oct-15-2006] Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. Query= gi|5032280|ref|NM_000109.2| Homo sapiens dystrophin (DMD), transcript Dp427c, mRNA (350 letters) Database: hs_ref_chrX.fa 22 sequences; 152,577,922 total letters Searching..................................................done Score E Sequences producing significant alignments: (bits) Value ref|NT_011757.15|HsX_11914 Homo sapiens chromosome X genomic con... 694 0.0 ref|NT_011638.12|HsX_11795 Homo sapiens chromosome X genomic con... 40 0.042 ref|NT_011681.15|HsX_11838 Homo sapiens chromosome X genomic con... 38 0.17 ref|NT_079573.3|HsX_79638 Homo sapiens chromosome X genomic cont... 38 0.17 ref|NT_011669.16|HsX_11826 Homo sapiens chromosome X genomic con... 36 0.65 ref|NT_011651.16|HsX_11808 Homo sapiens chromosome X genomic con... 36 0.65 ref|NT_011630.14|HsX_11787 Homo sapiens chromosome X genomic con... 34 2.6 ref|NT_113965.1|HsX_111684 Homo sapiens chromosome X genomic con... 34 2.6 ref|NT_011786.15|HsX_11943 Homo sapiens chromosome X genomic con... 34 2.6
>ref|NT_011757.15|HsX_11914 Homo sapiens chromosome X genomic contig, reference assembly Length = 34879939 Score = 694 bits (350), Expect = 0.0 Identities = 350/350 (100%) Strand = Plus / Minus Query: 1 cgttaaatgcaaacgctgctctggctcatgtgtttgctccgaggtataggttttgttcga 60 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 31139409 cgttaaatgcaaacgctgctctggctcatgtgtttgctccgaggtataggttttgttcga 31139350 Query: 61 ctgacgtatcagatagtcagagtggttaccacaccgacgttgtagcagctgcataataaa 120 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 31139349 ctgacgtatcagatagtcagagtggttaccacaccgacgttgtagcagctgcataataaa 31139290 ETC…
What Next? • We could use BLAST to do something useful • Take 400,000 sequence reads, and use blast to identify locations of read sequences in genome • BLAST each sequence individually • Cluster sequences • Generate report • How would we do all of that?
Perl • A program to read thru a big sequence file – and strip out each individual read • Use Perl as a "wrapper" – to run Blast • Parse results (or possibly use a PM to parse results) • Process, aggregate, summarize, compare to control sequence – and report generation
Alternatively …BLAT • BLAST-Like Alignment Tool • http://genome.ucsc.edu/FAQ/FAQblat
From the FAQ http://genome.ucsc.edu/ From a practical standpoint, Blat has several advantages over BLAST: • speed (no queues, response in seconds) at the price of lesser homology depth • the ability to submit a long list of simultaneous queries in fasta format • five convenient output sort options • a direct link into the UCSC browser • alignment block details in natural genomic order • an option to launch the alignment later as part of a custom track
Blast Suite http://genome.ucsc.edu/goldenPath/help/blatSpec.html Blat produces two major classes of alignments: at the DNA level between two sequences that are of 95% or greater identity, but which may include large inserts, and at the protein or translated DNA level between sequences that are of 80% or greater identity and may also include large inserts. The main programs in the blat suite are: * gfServer – a server that maintains an index of the genome in memory and uses the index to quickly find regions with high levels of sequence similarity to a query sequence. * gfClient – a program that queries gfServer over the network, and then does a detailed alignment of the query sequence with regions found by gfServer. * blat –combines client and server into a single program, first building the index, then using the index, and then exiting. * webBlat – a web based version of gfClient that presents the alignments in an interactive fashion.
Example • Download binaries (or source files and compile) • http://genome-test.cse.ucsc.edu/~kent/exe/ • (for CSS – use the Linux version) blat database query output.psl –out=blast • database and query files may be either: .fa , .nib or .2bit file
system system LIST -- executes any program on the system #!/usr/bin/perl system("/mnt/r0-blastdb/blast-bin/blastall –p blastn –d /mnt/r0-blastdb/FormattedDBs/nt – i test.txt –o test.out"); # this assumes test.txt and test.out are in same directory as your program .
back ticks #!/usr/bin/perl $output = `/mnt/r0-blastdb/blast-bin/blastall –p blastn –d /mnt/r0-blastdb/FormattedDBs/nt – i test.txt –o test.out`; print "$output\n"; # command is passed on, and interpreted by the shell # output of command returned