BLAST Programming
Thomas Madden, NCBI
madden@ncbi.nlm.nih.gov
January 28, 2002
The BLAST algorithm
O'Reilly Bioinformatics Technology - BLAST Programming
What is BLAST?
• Basic Local Alignment Search Tool.
• Calculates similarity for biological sequences.
• Produces local alignments: only a portion of each sequence must be aligned.
• Uses statistical theory to determine if a match might have occurred by chance.
BLAST is a heuristic.
• A lookup table is made of all the “words” (short subsequences) and “neighboring” words in the query sequence.
• The database is scanned for matching words (“hot spots”).
• Gapped and ungapped extensions are initiated from these matches.
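The three steps above can be sketched in a few lines. This is a toy illustration, not the NCBI implementation: it indexes only exact words (no “neighboring” words), and it extends a hit while letters are identical instead of using scores. The function names and the sequences are made up for the example.

```python
from collections import defaultdict

def build_lookup(query, word_size):
    """Map each word (substring of length word_size) to its query offsets."""
    table = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        table[query[i:i + word_size]].append(i)
    return table

def find_hot_spots(table, subject, word_size):
    """Scan the subject; return (query_offset, subject_offset) word hits."""
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in table.get(subject[j:j + word_size], ()):
            hits.append((i, j))
    return hits

def extend_ungapped(query, subject, i, j, word_size):
    """Extend a word hit left and right while the letters are identical.

    Simplification: real BLAST extends using alignment scores, not identity.
    Returns zero-offset, end-exclusive (q_start, q_end, s_start, s_end).
    """
    q_start, s_start = i, j
    while q_start > 0 and s_start > 0 and query[q_start - 1] == subject[s_start - 1]:
        q_start -= 1
        s_start -= 1
    q_end, s_end = i + word_size, j + word_size
    while q_end < len(query) and s_end < len(subject) and query[q_end] == subject[s_end]:
        q_end += 1
        s_end += 1
    return q_start, q_end, s_start, s_end

query, subject = "ACGTACGGA", "TTACGGATT"   # made-up sequences
table = build_lookup(query, 4)
hits = find_hot_spots(table, subject, 4)
first_extension = extend_ungapped(query, subject, *hits[0], 4)
```

Each hit is a pair of offsets on the same diagonal; the extension step turns a short word match into a longer ungapped alignment.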
BLAST OUTPUT
There are many different BLAST output formats.
• Pair-wise report
• Query-anchored report
• Hit-table
• Tax BLAST
• Abstract Syntax Notation 1
• XML
BLAST reports at the NCBI Web page.
Formatting Page
Graphical Overview
One-line descriptions
Pair-wise alignments
Query-anchored alignments
• Link to LocusLink
• Link to UniGene
• Link to taxonomy
Future improvements: LinkOut, taxonomic and structure links.
The BLAST report is designed for human readability.
• One-line descriptions provide an overview designed for human “browsing”.
• Redundant information is presented in the report (e.g., one-line descriptions and alignments both contain expect values, scores, and descriptions) so a user does not need to move back and forth between sections.
• The HTML version has many links for a user to explore.
• It can change as new features and information become available.
Hit-table
• Contains no sequence or definition lines, but does contain sequence identifiers, starts/stops (one-offset), percent identity of the match, expect value, etc.
• The simple format is ideal for automated tasks such as screening sequences for contamination or sequence assembly.
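A minimal sketch of an automated task on a hit-table follows. It assumes the common 12-column, tab-delimited layout (query id, subject id, percent identity, alignment length, mismatches, gap openings, query start/end, subject start/end, expect value, bit score); check the column order against your own output before relying on it. The sample rows and the screening thresholds are invented for illustration.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    query_id: str
    subject_id: str
    percent_identity: float
    alignment_length: int
    mismatches: int
    gap_openings: int
    query_start: int      # one-offset
    query_end: int        # one-offset
    subject_start: int    # one-offset
    subject_end: int      # one-offset
    evalue: float
    bit_score: float

def parse_hit_table(text):
    """Parse tab-delimited hit-table text, skipping blank and '#' lines."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 12:
            # A wrong field count is the only truncation check available here.
            raise ValueError(f"expected 12 fields, got {len(fields)}: {line!r}")
        hits.append(Hit(fields[0], fields[1], float(fields[2]),
                        *map(int, fields[3:10]),
                        float(fields[10]), float(fields[11])))
    return hits

# Made-up rows, for illustration only.
SAMPLE = (
    "# hypothetical hit-table\n"
    "query1\tsubjA\t98.50\t200\t3\t0\t1\t200\t51\t250\t1e-100\t380.0\n"
    "query1\tsubjB\t80.00\t50\t10\t0\t10\t59\t5\t54\t0.002\t40.1\n"
)

hit_rows = parse_hit_table(SAMPLE)
# Hypothetical contamination screen: long, near-identical matches only.
suspects = [h.subject_id for h in hit_rows
            if h.percent_identity >= 95.0 and h.alignment_length >= 100]
```

Note that the parser can reject a malformed row but cannot tell whether rows are missing from the end of the file.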
There are drawbacks to parsing the BLAST report and Hit-table.
• No way to automatically check for truncated output.
• No way to rigorously check for syntax changes in the output.
Structured output allows automatic and rigorous checks for syntax errors and changes.
Abstract Syntax Notation 1 (ASN.1)
• Is an International Organization for Standardization (ISO) standard for describing structured data and reliably encoding it.
• Used extensively in the telecommunications industry.
• Has both a binary and a text format.
• The NCBI data model is written in ASN.1.
• Asntool can produce C object loaders from an ASN.1 specification.
ASN.1 is used for the NCBI BLAST Web page.
[Diagram labels: Request results; Return formatted results; Fetch sequence; Fetch ASN.1; server; ASN.1; BLAST DB]
Different reports can be produced from the ASN.1 of one search.
[Diagram: ASN.1 as the source of the reports. Report labels: Hit-table; Pair-wise BLAST report; Query-anchored BLAST report; TaxBlast report; XML. Format labels: HTML; text]
The BLAST ASN.1 (“SeqAlign”) contains:
• Start, stop, and gap information (zero-offset).
• Score, bit-score, expect-value.
• Sequence identifiers.
• Strand information.
Three flavors of Seq-Align: Score block(s) plus one of:
• Dense-diag: a series of unconnected diagonals. No coordinate “stretching” (e.g., cannot be used for protein-nucleotide alignments). Used for ungapped BLASTN/BLASTP.
• Dense-seg: describes an alignment containing many segments. No coordinate “stretching”. Used for gapped BLASTN/BLASTP.
• Std-seg: a collection of locations. No restriction on stretching of coordinates. Used for gapped/ungapped translating searches. Generic.
Score Block
SEQUENCE is an ordered list of elements, each of which is an ASN.1 type. Each element is required unless marked DEFAULT or OPTIONAL.

    Score ::= SEQUENCE {
        id Object-id OPTIONAL ,    -- identifies Score type
        value CHOICE {             -- actual value
            real REAL ,            -- floating point value
            int INTEGER } }        -- integer
Score Block example
[Example values shown: 2.45905555 × 10^-9 and 38.1576692]
Dense-seg definition
SEQUENCE OF is an ordered list of the same type of element.

    Dense-seg ::= SEQUENCE {              -- for (multiway) global or partial alignments
        dim INTEGER DEFAULT 2 ,           -- dimensionality
        numseg INTEGER ,                  -- number of segments here
        ids SEQUENCE OF Seq-id ,          -- sequences in order
        starts SEQUENCE OF INTEGER ,      -- start OFFSETS in ids order within segs
        lens SEQUENCE OF INTEGER ,        -- lengths in ids order within segs
        strands SEQUENCE OF Na-strand OPTIONAL ,
        scores SEQUENCE OF Score OPTIONAL }  -- score for each seg
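The layout of “starts” and “lens” is easier to see with a small walk-through. The sketch below reads a Dense-seg-like pair of lists and reports each segment as one-offset, inclusive ranges, which is the convention of the text reports. It rests on two assumptions not stated on the slide: a start of -1 marks a gap in that sequence for that segment, and all sequences are on the plus strand. The function name and the numbers are made up.

```python
def dense_seg_to_ranges(starts, lens, dim=2):
    """Convert Dense-seg style starts/lens into per-segment ranges.

    starts holds dim entries per segment, in ids order within each segment
    (zero-offset). Returns one tuple per segment; each entry is either
    (from, to) in one-offset inclusive coordinates, or None for a gap.
    Assumption: a start of -1 means the sequence is gapped in that segment.
    """
    if len(starts) != dim * len(lens):
        raise ValueError("starts must contain dim entries for every segment")
    segments = []
    for seg, length in enumerate(lens):
        row = []
        for k in range(dim):
            start = starts[seg * dim + k]
            if start == -1:
                row.append(None)                        # gap in sequence k
            else:
                row.append((start + 1, start + length))  # zero -> one offset
        segments.append(tuple(row))
    return segments

# Made-up alignment: 5 aligned letters, a 2-letter gap in the second
# sequence, then 4 more aligned letters.
segments = dense_seg_to_ranges(starts=[0, 10, 5, -1, 7, 15], lens=[5, 2, 4])
```

With these inputs the first segment covers letters 1-5 of the first sequence and 11-15 of the second; the off-by-one shift is exactly the zero-offset versus one-offset question that any consumer of the ASN.1 has to settle.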
Dense-seg example
Std-seg definition
SET is an unordered list of elements, each of which is an ASN.1 type. Each element is required unless marked DEFAULT or OPTIONAL. SET OF is an unordered list of the same type of element.

    Std-seg ::= SEQUENCE {
        dim INTEGER DEFAULT 2 ,              -- dimensionality
        ids SEQUENCE OF Seq-id OPTIONAL ,    -- sequences identifiers
        loc SEQUENCE OF Seq-loc ,            -- locations in ids order
        scores SET OF Score OPTIONAL }       -- score for each segment
Std-seg example
Demo program (“blreplay”) to reproduce BLAST results from ASN.1
• Start/stops and identifiers are read in from ASN.1 (SeqAlign).
• Sequences and definition lines are fetched from BLAST databases.
Asntool can produce XML from ASN.1
• Really a transliteration, not a new specification.
• A Document Type Definition (DTD) can also be produced.
ASN.1 and XML validation differences.
• XML can be “well-formed” (does not break any XML syntax rules) or “validated” (checked against a DTD).
• ASN.1 must always be valid (checked against a specification).
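The difference between “well-formed” and “validated” can be shown with a non-validating parser. The sketch below uses Python's standard xml.etree.ElementTree, which checks syntax only; the helper name and the XML snippets are made up. Checking a document against a DTD would need a validating parser, which this example does not attempt.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the text obeys XML syntax rules (no DTD check)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

good = is_well_formed("<Hit><Hit_num>1</Hit_num></Hit>")
broken = is_well_formed("<Hit><Hit_num>1</Hit>")   # mismatched closing tag
# Syntactically fine, although no BLAST DTD defines a <Bogus/> element:
well_formed_but_unchecked = is_well_formed("<Hit><Bogus/></Hit>")
```

The last case is the one to watch: it passes a well-formedness check even though a DTD check would be expected to reject it.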
Special purpose XML
• The NCBI specification does not fit the needs of some users (the sequence is not provided in the SeqAlign and, when fetched, it is packed at 2 or 4 bp per byte).
• It is possible to produce XML with more or less information, or in a different format.
• First done as an ASN.1 specification, which is then dumped as XML.
BLAST XML is designed to be self-contained.
• Query sequence, database sequence, etc.
• Sequence definition lines.
• Start, stop, etc. (one-offset).
• Scores, expect values, % identity, etc.
• Produced by the BLAST binaries and on the NCBI Web page.
Overview of the BLAST XML

    <!ELEMENT BlastOutput (
        BlastOutput_program ,       BLAST program, e.g., blastp, etc.
        BlastOutput_version ,       version of BLAST engine (e.g., 2.1.2)
        BlastOutput_reference ,     reference about algorithm
        BlastOutput_db ,            database(s) searched
        BlastOutput_query-ID ,      query identifier
        BlastOutput_query-def ,     query definition
        BlastOutput_query-len ,     query length
        BlastOutput_query-seq? ,    query sequence
        BlastOutput_param ,         BLAST search parameters
        BlastOutput_iterations      BLAST results for each iteration/run
    )>
    <!ELEMENT BlastOutput_iterations ( Iteration+ )>
    <!ELEMENT Iteration (
        Iteration_iter-num ,        iteration number (one for non PSI-BLAST)
        Iteration_hits? ,           hits (one for each database sequence)
        Iteration_stat? ,           search statistics
        Iteration_message?          error messages
    )>
    <!ELEMENT Iteration_hits ( Hit* )>
    <!ELEMENT Hit (
        Hit_num ,          ordinal number of the hit, one-offset (e.g., "1, 2...")
        Hit_id ,           ID of db sequence (e.g., "gi|7297267|gb|AAF52530.1|")
        Hit_def ,          definition of the db sequence
        Hit_accession ,    accession of the db sequence (e.g., "AAF57408")
        Hit_len ,          length of the database sequence
        Hit_hsps?          describes individual alignments
    )>
    <!ELEMENT Hit_hsps ( Hsp* )>
    <!ELEMENT Hsp (
        Hsp_num ,              ordinal number of the HSP, one-offset
        Hsp_bit-score ,        score (in bits) of the HSP
        Hsp_score ,            raw score of the HSP
        Hsp_evalue ,           expect value of the HSP
        Hsp_query-from ,       query offset at alignment start (one-offset)
        Hsp_query-to ,         query offset at alignment end (one-offset)
        Hsp_hit-from ,         db offset at alignment start (one-offset)
        Hsp_hit-to ,           db offset at alignment end (one-offset)
        Hsp_pattern-from? ,    start of phi-blast pattern on query (one-offset)
        Hsp_pattern-to? ,      end of phi-blast pattern on query (one-offset)
        Hsp_query-frame? ,     query frame (if applicable)
        Hsp_hit-frame? ,       db frame (if applicable)
        Hsp_identity? ,        number of identities in the alignment
        Hsp_positive? ,        number of positives in the alignment
        Hsp_gaps? ,            number of gaps in the alignment
        Hsp_density? ,         score density
        Hsp_qseq ,             alignment string for the query
        Hsp_hseq ,             alignment string for the database
        Hsp_midline?           middle line as normally seen in BLAST report
    )>
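As a rough illustration of how the nesting (BlastOutput, Iteration, Hit, Hsp) is consumed, the sketch below pulls one row per HSP out of a document using Python's standard xml.etree.ElementTree. The element names come from the DTD above; the document itself is a trimmed, made-up sample that omits several required elements, which a non-validating parser will not notice.

```python
import xml.etree.ElementTree as ET

# Trimmed, invented sample; identifiers and values are not real results.
SAMPLE = """<BlastOutput>
  <BlastOutput_program>blastp</BlastOutput_program>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_hits>
        <Hit>
          <Hit_num>1</Hit_num>
          <Hit_id>example|1</Hit_id>
          <Hit_def>made-up protein</Hit_def>
          <Hit_accession>EX000001</Hit_accession>
          <Hit_len>120</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>38.2</Hsp_bit-score>
              <Hsp_score>87</Hsp_score>
              <Hsp_evalue>2.5e-09</Hsp_evalue>
              <Hsp_query-from>3</Hsp_query-from>
              <Hsp_query-to>12</Hsp_query-to>
              <Hsp_hit-from>40</Hsp_hit-from>
              <Hsp_hit-to>49</Hsp_hit-to>
              <Hsp_qseq>MKTAYIAKQR</Hsp_qseq>
              <Hsp_hseq>MKTAYLAKQR</Hsp_hseq>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""

def summarize(xml_text):
    """Return (accession, evalue, query_from, query_to) for every HSP."""
    root = ET.fromstring(xml_text)
    rows = []
    for hit in root.iter("Hit"):
        accession = hit.findtext("Hit_accession")
        for hsp in hit.iter("Hsp"):
            rows.append((accession,
                         float(hsp.findtext("Hsp_evalue")),
                         int(hsp.findtext("Hsp_query-from")),   # one-offset
                         int(hsp.findtext("Hsp_query-to"))))    # one-offset
    return rows

summary = summarize(SAMPLE)
```

Because the sequences and definition lines travel with the coordinates, nothing has to be fetched from a database to produce this summary.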
Parsing BLAST XML with Expat.
• Expat is a popular, freely available XML parser.
• Non-validating.
• A simple C (demo) program parses BLAST output.
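The demo program mentioned above is written in C and is not reproduced here. As a stand-in, the sketch below shows the same event-driven style through Python's xml.parsers.expat module, which wraps the Expat library. The handler names, the helper function, and the sample document are made up; the only element name taken from the DTD is Hit_accession.

```python
import xml.parsers.expat

def collect_accessions(xml_text):
    """Collect the text of every Hit_accession element using Expat callbacks."""
    accessions = []
    state = {"inside": False, "buffer": []}

    def start_element(name, attrs):
        if name == "Hit_accession":
            state["inside"] = True
            state["buffer"] = []

    def char_data(data):
        # Expat may deliver one text node in several pieces, so accumulate.
        if state["inside"]:
            state["buffer"].append(data)

    def end_element(name):
        if name == "Hit_accession":
            accessions.append("".join(state["buffer"]))
            state["inside"] = False

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.CharacterDataHandler = char_data
    parser.Parse(xml_text, True)   # True: this is the final chunk of input
    return accessions

# Made-up fragment, for illustration only.
DOC = ("<Iteration_hits>"
       "<Hit><Hit_accession>EX000001</Hit_accession></Hit>"
       "<Hit><Hit_accession>EX000002</Hit_accession></Hit>"
       "</Iteration_hits>")
found = collect_accessions(DOC)
```

An event-driven parser never builds the whole document in memory, which matters for large BLAST outputs; the price is that the program must track its own position in the element tree.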
Output sizes for a BLASTP search of gi|178628 vs. nr.
• Hit-table: 16 kb
• Binary ASN.1 (SeqAlign): 35 kb
• Text ASN.1 (SeqAlign): 144 kb
• XML (SeqAlign): 392 kb
• XML: 288 kb
• BLAST report (text): 232 kb
• BLAST report (html): 272 kb
Specification (i.e., “data model”) issues should not be confused with the question about whether to use ASN.1 or XML.
Structured output is not a panacea.
• Design issues must still be addressed.
• Semantic issues still exist, e.g., is a start/stop value zero-offset or one-offset?
• Data issues still exist, e.g., is the correct sequence shown, are the offsets correct, was the DNA translated with the correct genetic code?
Overview of BLAST code.
NCBI toolkit
• Has many low-level functions to make it platform independent; supported under Linux, many flavors of UNIX, Windows NT, and MacOS.
• Contains portable types such as Int2, Int4, FloatHi.
• The developer should write a “Main” function that is called by a toolkit “main”.
• Contains the BLAST code in the “tools” library.
• A C++ toolkit is now being developed.
BLAST code has a modular design.
• The API for retrieval from databases is independent of the compute engine.
• The compute engine is independent of the formatter.
The Readdb API can be used to easily extract information from the BLAST databases.
• Date produced.
• Title of database.
• Number of letters, number of sequences, longest sequence.
• Sequence and description of an entry.
• Function prototypes are in readdb.h.
Dump a BLAST record in FASTA format (db2fasta.c):
• “Main” is called by “main” in the toolkit.
• Get or display command-line arguments.
• Allocate an object for reading the database.
• Get the ordinal number (zero-offset) of the record given a ‘FASTA’ identifier (e.g., “gb|AAH06766.1|AAH0676”).
• Fetch the Bioseq (contains sequence, description, and identifiers) for this record.
• Dump the sequence as FASTA.
Only a few function calls are needed to perform a BLAST search (doblast.c):
• Allocate a BLASTOptionsBlk with default values for the specified program (e.g., “blastp”); the boolean argument specifies a gapped search.
• Set the expect value cutoff to a non-default value.
• Perform a BLAST search of the BioseqPtr query_bsp. The BioseqPtr could have been obtained from the BLAST databases, from Entrez, or from FASTA using the function call FastaToSeqEntry.
BLASTOptionNew
BLAST_OptionsBlkPtr BLASTOptionNew (CharPtr progname, Boolean gapped)
• CharPtr progname: name of the program. Legal values are blastp, blastn, blastx, tblastn, and tblastx.
• Boolean gapped: if TRUE, gapped parameters are set; if FALSE, ungapped parameters are set.
Non-default values may be specified by changing elements of the allocated structure (typedef in blastdef.h). The most often changed elements (options) are:

    Nlm_FloatHi  expect_value     Expect value cutoff.
    Int2         wordsize         Number of letters used in making words for the lookup table.
    Int2         penalty          Penalty for a mismatch (only BLASTN and MegaBLAST).
    Int2         reward           Reward for a match (only BLASTN and MegaBLAST).
    CharPtr      matrix           Matrix used for comparison (not BLASTN or MegaBLAST).
    Int4         gap_open         Cost for gap existence.
    Int4         gap_extend       Cost to extend a gap one more letter (including the first).
    CharPtr      filter_string    Filtering options (e.g., “L”, “mL”).
    Int4         hitlist_size     Number of database sequences to save hits for.
    Int2         number_of_cpus   Number of CPUs to use.