BLAST benchmarks
George Coulouris
NCBI/NLM/NIH
coulouri@ncbi.nlm.nih.gov
June 2005
Motivation and goal
• It’s hard to define what constitutes a “typical” search.
• NCBI BLAST processes over 150,000 searches per day.
• Large-scale characteristics of this workload are stable over time.
• Goal: design a test suite that approximates this workload.
Applications
• Evaluate the relative performance of BLAST running on different hardware.
• Evaluate the relative performance of different BLAST implementations.
Components
• Databases
• Queries
• Tasks
• Driver
Databases
• Protein “nr” and nucleotide “nt” account for >80% of all searches; a good choice for representative databases.
• Sequences are constantly added and removed; the databases are updated daily.
• This volatility, together with their large size, makes the full databases unsuitable for benchmarking purposes.
Databases
• Solution: generate benchmark databases from subsets of “nr” and “nt”.
• Non-redundant proteins are sampled from “nr”.
• The size ratio of the nucleotide to the protein database is preserved to avoid skewing runtime results.
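A rough sketch of how such a subset database could be produced is below. The file names, the sampling fraction, and the plain-FASTA parsing are illustrative assumptions; the slides do not describe the actual sampling tool.

```python
#!/usr/bin/env python
"""Sketch: sample a fixed fraction of sequences from a FASTA dump of "nr"
to build a smaller benchmark database.  File names and the 5% fraction are
assumptions, not values taken from the slides."""
import random

SAMPLE_FRACTION = 0.05   # hypothetical fraction of "nr" to keep
random.seed(2005)        # fixed seed so the benchmark database is reproducible

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line, []
            else:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)

with open("benchmark_nr.fasta", "w") as out:
    for header, seq in read_fasta("nr.fasta"):
        if random.random() < SAMPLE_FRACTION:
            out.write("%s\n%s\n" % (header, seq))

# The sampled FASTA would then be formatted for BLAST (e.g. with formatdb
# in the 2005-era toolkit).  The nucleotide subset of "nt" would be sized
# so that the nt:nr size ratio matches that of the full databases.
```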
Queries
• >90% of protein queries are <1000 residues in length.
• >90% of nucleotide queries are <2000 base pairs in length.
• Should cover major model organisms.
• Solution: sample 200 queries from refseq_rna and refseq_protein. The resulting set covers many organisms and has a typical length distribution.
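A sampled query set can be checked against the stated length profile. The sketch below assumes the query files are plain FASTA and reuses the hypothetical read_fasta() helper from the previous sketch; the file names are placeholders.

```python
"""Sketch: verify that a sampled query set matches the length profile
described above (>90% of protein queries under 1000 residues, >90% of
nucleotide queries under 2000 bp)."""

def fraction_shorter_than(path, cutoff):
    """Fraction of sequences in a FASTA file shorter than `cutoff`."""
    lengths = [len(seq) for _, seq in read_fasta(path)]
    return sum(1 for n in lengths if n < cutoff) / float(len(lengths))

print("protein    < 1000:", fraction_shorter_than("protein_queries.fasta", 1000))
print("nucleotide < 2000:", fraction_shorter_than("nucleotide_queries.fasta", 2000))
```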
Tasks
Program distribution:
• blastn 50%
• blastp 20%
• megablast 10%
• blastx 10%
• tblastn 5%
• tblastx 5%
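The percentages above come from the slide; how the 200 searches are actually apportioned is not shown. As a sketch under that assumption, a deterministic allocation could look like this:

```python
"""Sketch: assign a BLAST program to each of the 200 benchmark searches so
that the mix matches the distribution above.  The allocation scheme is an
assumption; only the percentages are from the slide."""

PROGRAM_MIX = [
    ("blastn",    0.50),
    ("blastp",    0.20),
    ("megablast", 0.10),
    ("blastx",    0.10),
    ("tblastn",   0.05),
    ("tblastx",   0.05),
]

def build_task_list(n_searches=200):
    """Expand the program mix into one program name per search."""
    tasks = []
    for program, share in PROGRAM_MIX:
        tasks.extend([program] * int(round(share * n_searches)))
    return tasks

tasks = build_task_list()
assert len(tasks) == 200   # 100 + 40 + 20 + 20 + 10 + 10
```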
Driver script
• Executes 200 searches according to the program distribution above.
• Runs in 35 minutes on current (2005) hardware.
• Can be used to measure speed or throughput.
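The driver itself is not reproduced on these slides. A minimal sketch of what such a driver might do follows, assuming the legacy (2005-era) blastall/megablast command-line front ends and placeholder database and query file names:

```python
"""Sketch of a driver: run each benchmark search and report total
wall-clock time.  Database names, query files, and the choice of the
legacy blastall front end are assumptions, not details from the slides."""
import subprocess
import time

# Hypothetical mapping from program to benchmark database.
DB_FOR_PROGRAM = {
    "blastn": "benchmark_nt",  "megablast": "benchmark_nt",
    "blastp": "benchmark_nr",  "blastx":    "benchmark_nr",
    "tblastn": "benchmark_nt", "tblastx":   "benchmark_nt",
}

def run_search(program, query_file):
    db = DB_FOR_PROGRAM[program]
    if program == "megablast":
        # megablast is a separate executable in the legacy NCBI toolkit.
        cmd = ["megablast", "-d", db, "-i", query_file, "-o", "/dev/null"]
    else:
        cmd = ["blastall", "-p", program, "-d", db,
               "-i", query_file, "-o", "/dev/null"]
    subprocess.check_call(cmd)

def run_benchmark(tasks):
    """tasks: list of (program, query_file) pairs; 200 in the real suite.
    Returns elapsed wall-clock time in seconds."""
    start = time.time()
    for program, query_file in tasks:
        run_search(program, query_file)
    return time.time() - start
```

Timing all 200 searches back to back measures speed on a single host; running several such drivers concurrently would give a throughput-style measurement instead.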