A new and improved system to order, produce, search, and maintain BLAST databases

Tom Madden, IEB seminar, May 19, 2011

What is BLAST? • Basic Local Alignment Search Tool • Calculates similarity for biological sequences. • Produces local alignments: only a portion of each sequence must be aligned. • Uses statistical theory to determine if a match might have occurred by chance.
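The statistical point above can be illustrated with the standard Karlin-Altschul relations, E = K·m·n·e^(−λS) and S' = (λS − ln K) / ln 2. What follows is a minimal sketch and not BLAST's implementation: the values of `LAMBDA` and `K` are illustrative placeholders (real values depend on the scoring system), and the raw query and database lengths are used where BLAST would apply effective lengths.

```python
import math

# Illustrative placeholders only; real values depend on the scoring matrix
# and gap costs used for the search.
LAMBDA = 0.318
K = 0.13


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalize a raw alignment score: S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score: float, query_len: int, db_len: int,
            lam: float, k: float) -> float:
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


if __name__ == "__main__":
    query_len = 300           # residues in the query (assumed)
    db_len = 1_000_000_000    # total residues in the database (assumed)
    for score in (40, 60, 80):
        print(score,
              round(bit_score(score, LAMBDA, K), 1),
              e_value(score, query_len, db_len, LAMBDA, K))
```

The E-value falls exponentially as the score rises and grows linearly with database size, so the same alignment is less significant when found in a larger database.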

Projects since last review • BLAST web page redesign and MyBLAST system. • BLAST+ library and applications. • Delta-BLAST. • BLAST Database Pipeline redesign and BlastDBInfo.

Outline • Summary of the problem. • The “current” system. • The “new” system (BlastDB-pipeline). • Future plans.

The problem • There are BLAST databases such as nt, est, htgs, gss, nr produced by the ID team. • The NCBI also provides domain-specific databases with contents specified by different NCBI groups. • Examples: HTGS with phases 0, 1, 2, and 3 for a specified organism; RefSeq RNAs annotated on genomic RefSeqs included in an annotation run; DNA sequences used in GEO. • BLAST users need to be able to find and search these databases.

BLAST database statistics • 15140 DNA databases, 3903 protein databases. • Largest database: WGS with 190 billion bases. • Smallest database: Escherichia_coli_o157_h7_str__ec4486_WGS with 444 bases. • How many contain only genomic DNA? • How many contain only cDNA? • How many databases for any given taxid? • How many contain only RefSeq entries?

Current system • Built in the last century. • Different groups at the NCBI produce many of the domain-specific databases. • They must assemble the required sequences (in FASTA?), produce the BLAST database, and then request rdist. • Problems with the current system: • Redundant effort (many groups writing the same script). • No overall tracking of databases. • Issues with full disks, bad scripts, empty files. • Issues with outdated databases. • Issues with documentation.

New system: Blastdb-pipeline • Joint effort of BLAST and ID teams. • A group at NCBI "orders" a database. Contents of the database are determined by the group. • Metadata is produced for each database and can be retrieved through eutils. • Metadata can be used to customize web pages. • Will replace the current system.

BLASTDB-Pipeline (sketch) [Diagram; labels: NCBI staff, ID Team, BLAST DB order/update, BLASTDB-SVC, Database on disk, Produce metadata and statistics, Blastdbinfo (Entrez database), BLAST search page]

Ordering a database (two and a half ways). • Define the database as an Entrez query; example queries: • RefSeqGene: “refseqgene[keyword]” • GEO: “nucleotide_geoprofiles [filter]” • Mouse ESTs: “txid10090[orgn] AND (gbdiv_est[prop])” • Specify a GenColl accession. • Upload a "raw" database. Discouraged, but needed for gnomon, UniVec, etc.
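The three ways of ordering can be pictured as a small record that carries exactly one content source. This is a hypothetical sketch: `DatabaseOrder`, its field names, and `est_query` are invented for illustration and are not the Blastdb-pipeline's actual interface. Only the query syntax is taken from the slide.

```python
from dataclasses import dataclass
from typing import Optional


def est_query(taxid: int) -> str:
    """Build an organism-restricted EST query in the style shown on the slide."""
    return f"txid{taxid}[orgn] AND (gbdiv_est[prop])"


@dataclass(frozen=True)
class DatabaseOrder:
    """Hypothetical order record; exactly one content source must be set."""
    name: str
    entrez_query: Optional[str] = None
    gencoll_accession: Optional[str] = None
    raw_upload_path: Optional[str] = None  # the discouraged "half" way

    def __post_init__(self) -> None:
        sources = (self.entrez_query, self.gencoll_accession, self.raw_upload_path)
        if sum(source is not None for source in sources) != 1:
            raise ValueError("specify exactly one content source")

    @property
    def kind(self) -> str:
        """Name of the content source this order uses."""
        if self.entrez_query is not None:
            return "entrez_query"
        if self.gencoll_accession is not None:
            return "gencoll_accession"
        return "raw_upload"


if __name__ == "__main__":
    order = DatabaseOrder(name="Mouse ESTs", entrez_query=est_query(10090))
    print(order.kind, "->", order.entrez_query)
```

Defining contents by query rather than by file means the pipeline can rebuild the database on update without the ordering group re-assembling sequences.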

Metadata part 1: The sequence Sequence sources: SNP, GenBank, Gnomon, RefSeq, SRA, trace, PDB, or SwissProt.

What can we learn from an Entrez query? • Database: Pongo abelii ESTs • Entrez query: txid9601[orgn] AND (gbdiv_est[prop]) • Organism: Pongo abelii (Taxid: 9601) • Type: cDNA • Strategy: EST • Source: GenBank
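One way to picture this inference is a small rule table applied to the query text. The sketch below is an assumption about how such a derivation could work, not the pipeline's actual logic: `PROP_RULES` and `infer_metadata` are hypothetical, and the single rule encodes only the EST example from the slide.

```python
import re

# Hypothetical rule table: maps an Entrez [prop] term to the metadata it
# implies. Only the EST case from the slide is encoded here.
PROP_RULES = {
    "gbdiv_est": {"type": "cDNA", "strategy": "EST", "source": "GenBank"},
}


def infer_metadata(query: str) -> dict:
    """Derive metadata fields from the terms of an Entrez query string."""
    meta = {}
    # An organism restriction such as txid9601[orgn] yields the taxid.
    match = re.search(r"txid(\d+)\[orgn\]", query, flags=re.IGNORECASE)
    if match:
        meta["taxid"] = int(match.group(1))
    # Each recognized [prop] term contributes the fields it implies.
    for prop in re.findall(r"(\w+)\[prop\]", query, flags=re.IGNORECASE):
        meta.update(PROP_RULES.get(prop.lower(), {}))
    return meta


if __name__ == "__main__":
    print(infer_metadata("txid9601[orgn] AND (gbdiv_est[prop])"))
```

Unrecognized terms simply contribute nothing, so a query the rules do not cover yields partial metadata rather than an error.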

Metadata part 2: The rest • Species-level taxid (e.g., 9606 for Homo sapiens). • BioProject ID. • Title. • Description (extended title). • Genome collection assembly name. • Entrez query. • Keywords.

Other Metadata sources • Genome collections can provide metadata. • Submitter provides metadata for uploaded databases. • Trace has XML dump.

Eutils access

uploaded database


BlastDBInfo query: “guinea pig” AND genomic [SeqType]

BlastDBInfo query: clostridium difficile
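Queries like these could also be issued programmatically through the E-utilities. The sketch below only builds the request URLs and sends nothing. It assumes the Entrez database is addressed as `blastdbinfo` and that the standard `esearch.fcgi` and `esummary.fcgi` endpoints accept it. The helper names are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term: str, db: str = "blastdbinfo", retmax: int = 20) -> str:
    """URL that searches an Entrez database and returns matching record IDs."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def esummary_url(ids, db: str = "blastdbinfo") -> str:
    """URL that fetches summaries (here: database metadata) for record IDs."""
    params = {"db": db, "id": ",".join(str(i) for i in ids)}
    return f"{EUTILS}/esummary.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    for term in ('"guinea pig" AND genomic [SeqType]', "clostridium difficile"):
        url = esearch_url(term)
        # Round-trip check: the encoded URL decodes back to the original term.
        assert parse_qs(urlsplit(url).query)["term"] == [term]
        print(url)
```

IDs returned by the search request would then be passed to `esummary_url` to retrieve the metadata for each matching BLAST database.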

Blastdbinfo statistics • 821 cDNA databases. • 7320 genomic databases. • 513 RefSeq databases. • 10 Cavia porcellus (guinea pig) databases.

New system supports production databases • GEO • RefSeqGene • SNP • Top-level databases (nt, est, htgs, nr) • SRA • Trace • RefSeq Assembled Genomes

Blast.cgi produces Assembled Genomes pages

How to use the new system • Discuss the need for a BLAST database with your supervisor. • Look at the “Blastdb-Pipeline end user manual”, available in SharePoint at IEB > Molecular Software Section > BLAST > BLAST db dump process redesign. • Log in to NCBILS once (with NIH username and password). • Have your supervisor email blastsoft@ncbi.nlm.nih.gov and request that you be given permissions in Blastdb-Pipeline. • Submit your database order.

Future plans (wild-eyed speculation) • Find databases and/or pages based upon organism or some other criteria. • Produce on-the-fly reports about a BLAST database. • Add link from BLAST report back to BioProjects. • Add link to WGS master record to a BLAST page.

Finding Databases

Database documentation

Acknowledgements • Yan Raytselis • Christiam Camacho • Yuri Merezhuk • Irena Zaretskaya • Ilya Dondoshansky • Misha Kimelman • Eugene Yaschenko • Anatoly Mnev • Mike DiCuccio • Avi Kimchi • Paul Kitts • Francoise Thibaud-Nissen • Sergei Resenchuk • Deanna Church • Grisha Starchenko • Aaron Gussman • Pramod Paranthaman • Mark Johnson • Amanjeev Sethi • Jeff Beck • Michael Domrachev • Eric Sayers • Tao Tao • Peter Cooper • Wayne Matten • Scott McGinnis
