BLAST benchmarks
George Coulouris
NCBI/NLM/NIH
coulouri@ncbi.nlm.nih.gov
June 2005
Motivation and goal
• It’s hard to define what constitutes a “typical” search.
• NCBI BLAST processes over 150,000 searches per day.
• Large-scale characteristics of this workload are stable over time.
• Goal: design a test suite that approximates this workload.
Applications
• Evaluate the relative performance of BLAST running on different hardware.
• Evaluate the relative performance of different BLAST implementations.
Components
• Databases
• Queries
• Tasks
• Driver
Databases
• Protein “nr” and nucleotide “nt” account for >80% of all searches; a good choice for representative databases.
• Sequences are constantly added and removed; the databases are updated daily.
• This volatility, together with their large size, makes the full databases unsuitable for benchmarking purposes.
Databases
• Solution: generate benchmark databases from subsets of “nr” and “nt”.
• Non-redundant proteins are sampled from “nr”.
• The size ratio of the nucleotide to the protein database is preserved to avoid skewing runtime results.
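A rough sketch of how such a subset database could be produced is below. The file names, the sampling fraction, and the plain-FASTA parsing are illustrative assumptions; the slides do not describe the actual sampling tool.

```python
#!/usr/bin/env python
"""Sketch: sample a fixed fraction of sequences from a FASTA dump of "nr"
to build a smaller benchmark database.  File names and the 5% fraction are
assumptions, not values taken from the slides."""
import random

SAMPLE_FRACTION = 0.05   # hypothetical fraction of "nr" to keep
random.seed(2005)        # fixed seed so the benchmark database is reproducible

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line, []
            else:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)

with open("benchmark_nr.fasta", "w") as out:
    for header, seq in read_fasta("nr.fasta"):
        if random.random() < SAMPLE_FRACTION:
            out.write("%s\n%s\n" % (header, seq))

# The sampled FASTA would then be formatted for BLAST (e.g. with formatdb
# in the 2005-era toolkit).  The nucleotide subset of "nt" would be sized
# so that the nt:nr size ratio matches that of the full databases.
```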
Queries
• >90% of protein queries are <1000 residues in length.
• >90% of nucleotide queries are <2000 base pairs in length.
• Should cover major model organisms.
• Solution: sample 200 queries from refseq_rna and refseq_protein. The resulting set covers many organisms and has a typical length distribution.
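A sampled query set can be checked against the stated length profile. The sketch below assumes the query files are plain FASTA and reuses the hypothetical read_fasta() helper from the previous sketch; the file names are placeholders.

```python
"""Sketch: verify that a sampled query set matches the length profile
described above (>90% of protein queries under 1000 residues, >90% of
nucleotide queries under 2000 bp)."""

def fraction_shorter_than(path, cutoff):
    """Fraction of sequences in a FASTA file shorter than `cutoff`."""
    lengths = [len(seq) for _, seq in read_fasta(path)]
    return sum(1 for n in lengths if n < cutoff) / float(len(lengths))

print("protein    < 1000:", fraction_shorter_than("protein_queries.fasta", 1000))
print("nucleotide < 2000:", fraction_shorter_than("nucleotide_queries.fasta", 2000))
```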
Tasks
Program distribution:
• blastn 50%
• blastp 20%
• megablast 10%
• blastx 10%
• tblastn 5%
• tblastx 5%
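The percentages above come from the slide; how the 200 searches are actually apportioned is not shown. As a sketch under that assumption, a deterministic allocation could look like this:

```python
"""Sketch: assign a BLAST program to each of the 200 benchmark searches so
that the mix matches the distribution above.  The allocation scheme is an
assumption; only the percentages are from the slide."""

PROGRAM_MIX = [
    ("blastn",    0.50),
    ("blastp",    0.20),
    ("megablast", 0.10),
    ("blastx",    0.10),
    ("tblastn",   0.05),
    ("tblastx",   0.05),
]

def build_task_list(n_searches=200):
    """Expand the program mix into one program name per search."""
    tasks = []
    for program, share in PROGRAM_MIX:
        tasks.extend([program] * int(round(share * n_searches)))
    return tasks

tasks = build_task_list()
assert len(tasks) == 200   # 100 + 40 + 20 + 20 + 10 + 10
```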
Driver script
• Executes 200 searches according to the program distribution above.
• Runs in 35 minutes on current (2005) hardware.
• Can be used to measure speed or throughput.
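The driver itself is not reproduced on these slides. A minimal sketch of what such a driver might do follows, assuming the legacy (2005-era) blastall/megablast command-line front ends and placeholder database and query file names:

```python
"""Sketch of a driver: run each benchmark search and report total
wall-clock time.  Database names, query files, and the choice of the
legacy blastall front end are assumptions, not details from the slides."""
import subprocess
import time

# Hypothetical mapping from program to benchmark database.
DB_FOR_PROGRAM = {
    "blastn": "benchmark_nt",  "megablast": "benchmark_nt",
    "blastp": "benchmark_nr",  "blastx":    "benchmark_nr",
    "tblastn": "benchmark_nt", "tblastx":   "benchmark_nt",
}

def run_search(program, query_file):
    db = DB_FOR_PROGRAM[program]
    if program == "megablast":
        # megablast is a separate executable in the legacy NCBI toolkit.
        cmd = ["megablast", "-d", db, "-i", query_file, "-o", "/dev/null"]
    else:
        cmd = ["blastall", "-p", program, "-d", db,
               "-i", query_file, "-o", "/dev/null"]
    subprocess.check_call(cmd)

def run_benchmark(tasks):
    """tasks: list of (program, query_file) pairs; 200 in the real suite.
    Returns elapsed wall-clock time in seconds."""
    start = time.time()
    for program, query_file in tasks:
        run_search(program, query_file)
    return time.time() - start
```

Timing all 200 searches back to back measures speed on a single host; running several such drivers concurrently would give a throughput-style measurement instead.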