A new and improved system to order, produce, search, and maintain BLAST databases Tom Madden IEB seminar May 19, 2011
What is BLAST? • Basic Local Alignment Search Tool • Calculates similarity for biological sequences. • Produces local alignments: only a portion of each sequence must be aligned. • Uses statistical theory to determine if a match might have occurred by chance.
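The statistical step can be illustrated with the Karlin-Altschul expectation, E = K·m·n·e^(−λS), which estimates how many alignments scoring at least S would be expected by chance. The sketch below is a minimal illustration, not the BLAST implementation: the function name and the λ and K values are assumptions chosen for the example, and the real programs apply further corrections (for example, effective lengths).

```python
import math

def expected_hits(score, query_len, db_len, lam, k):
    """Expected number of chance alignments with raw score >= score."""
    return k * query_len * db_len * math.exp(-lam * score)

# Illustrative parameter values only (assumed, not taken from the talk).
LAMBDA, K = 0.318, 0.134

e_low = expected_hits(score=40, query_len=250, db_len=1_000_000, lam=LAMBDA, k=K)
e_high = expected_hits(score=80, query_len=250, db_len=1_000_000, lam=LAMBDA, k=K)

# A higher score is less likely to arise by chance, so its E-value is smaller.
print(f"score 40: E = {e_low:.3g}")
print(f"score 80: E = {e_high:.3g}")
```

The expectation grows linearly with database size, which is one reason the same alignment is less significant against a very large database than against a small one.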
Projects since last review • BLAST web page redesign and MyBLAST system. • BLAST+ library and applications. • Delta-BLAST. • BLAST Database Pipeline redesign and BlastDBInfo.
Outline • Summary of the problem. • The “current” system. • The “new” system (BlastDB-pipeline). • Future plans.
The problem • There are BLAST databases such as nt, est, htgs, gss, nr produced by the ID team. • The NCBI also provides domain-specific databases with contents specified by different NCBI groups. • Examples: HTGS with phases 0, 1, 2, and 3 for a specified organism; RefSeq RNAs annotated on genomic RefSeqs included in an annotation run; DNA sequences used in GEO. • BLAST users need to be able to find and search these databases.
BLAST database statistics • 15140 DNA databases, 3903 protein databases. • Largest database: WGS with 190 billion bases. • Smallest database: Escherichia_coli_o157_h7_str__ec4486_WGS with 444 bases. • How many contain only genomic DNA? • How many contain only cDNA? • How many databases for any given taxid? • How many contain only RefSeq entries?
Current system • Built in the last century. • Different groups at the NCBI produce many of the domain-specific databases. • They must assemble the required sequences (in FASTA?), produce a BLAST database, and then request rdist. • Problems with the current system: • Redundant effort (many groups writing the same script). • No overall tracking of databases. • Issues with full disks, bad scripts, empty files. • Issues with outdated databases. • Issues with documentation.
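The kind of script each group ends up rewriting might look like the sketch below. It assumes the BLAST+ `makeblastdb` application (the talk does not say which tool the groups used), and the helper names, file names, and database title are hypothetical. It only builds the command line and guards against the empty-file problem; it does not run the tool.

```python
import os
import tempfile

def count_fasta_records(path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open(path, encoding="ascii", errors="replace") as handle:
        return sum(1 for line in handle if line.startswith(">"))

def makeblastdb_command(fasta_path, db_name, dbtype, title):
    """Return an argv list for makeblastdb, refusing empty input files."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError(f"dbtype must be 'nucl' or 'prot', got {dbtype!r}")
    if count_fasta_records(fasta_path) == 0:
        raise ValueError(f"{fasta_path} contains no FASTA records")
    return ["makeblastdb", "-in", str(fasta_path), "-dbtype", dbtype,
            "-out", db_name, "-title", title]

# Demo with a throwaway one-sequence file (hypothetical names throughout).
with tempfile.TemporaryDirectory() as tmp:
    fasta = os.path.join(tmp, "demo.fasta")
    with open(fasta, "w") as fh:
        fh.write(">seq1\nACGT\n")
    demo_cmd = makeblastdb_command(fasta, "demo_db", "nucl", "Demo database")

print(" ".join(demo_cmd))
```

Every group that maintains its own copy of such a script also has to handle disk space, scheduling, and updates on its own, which is the redundancy the slide describes.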
New system: Blastdb-pipeline • Joint effort of BLAST and ID teams. • A group at NCBI "orders" a database. Contents of the database are determined by the group. • Metadata is produced for each database and can be retrieved through eutils. • Can be used to customize web pages. • Will replace the current system.
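Retrieving the metadata through eutils could start from an ESearch request like the one sketched below. This is an assumption-laden illustration: the database name `blastdbinfo` is inferred from the "Blastdbinfo (entrez database)" label in the talk, the helper name is hypothetical, and whether a given field such as `[orgn]` is indexed there is not confirmed by the slides. The code only builds the URL and makes no network call.

```python
from urllib.parse import urlencode

# Public base URL of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def blastdbinfo_search_url(term, retmax=20):
    """Build an ESearch URL against the (assumed) blastdbinfo Entrez database."""
    params = {"db": "blastdbinfo", "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"

# Hypothetical search term: databases for a given organism taxid.
url = blastdbinfo_search_url("txid9606[orgn]")
print(url)
```

The identifiers returned by such a search would then be passed to a follow-up summary request to obtain the per-database metadata.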
BLASTDB-Pipeline (sketch) [Diagram; labels: NCBI staff, ID Team, BLAST DB order/update, BLASTDB-SVC, Database on disk, Produce metadata and statistics, Blastdbinfo (Entrez database), BLAST search page]
Ordering a database (two and a half ways). • Define the database as an Entrez query. Example queries: • RefSeqGene: “refseqgene[keyword]” • GEO: “nucleotide_geoprofiles[filter]” • Mouse ESTs: “txid10090[orgn] AND (gbdiv_est[prop])” • Specify a GenColl accession. • Upload a "raw" database. Discouraged, but needed for Gnomon, UniVec, etc.
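Queries of the organism-plus-property form shown above can be composed programmatically. The sketch below is a minimal helper under stated assumptions: the function name is hypothetical, and joining several properties with OR is a choice made for the example rather than something the talk specifies.

```python
def organism_query(taxid, props=()):
    """Compose an Entrez query restricting an organism to given properties."""
    organism = f"txid{int(taxid)}[orgn]"
    if not props:
        return organism
    # Multiple properties are OR-ed inside one parenthesized group (assumption).
    prop_terms = " OR ".join(f"{prop}[prop]" for prop in props)
    return f"{organism} AND ({prop_terms})"

mouse_ests = organism_query(10090, ["gbdiv_est"])
print(mouse_ests)
```

Called with taxid 10090 and the `gbdiv_est` property, the helper reproduces the mouse EST query from the slide.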
Metadata part 1: The sequence • Sequence sources: SNP, GenBank, Gnomon, RefSeq, SRA, Trace, PDB, or SwissProt.
What can we learn from an Entrez query? • Database: Pongo abelii ESTs • Entrez query: txid9601[orgn] AND (gbdiv_est[prop]) • Pongo abelii (Taxid: 9601) • Type: cDNA • Strategy: EST • Source: GenBank
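The inference on this slide can be sketched as a small parser: pull the taxid out of the `[orgn]` term and map known `[prop]` terms to metadata fields. The lookup table and function name below are hypothetical; only the `gbdiv_est` entry is taken from the slide, and the pipeline's actual rules are not described in the talk.

```python
import re

# Hypothetical lookup table; only the gbdiv_est row comes from the slide.
PROP_METADATA = {
    "gbdiv_est": {"type": "cDNA", "strategy": "EST", "source": "GenBank"},
}

def metadata_from_query(query):
    """Derive metadata fields from the terms of an Entrez query string."""
    meta = {}
    taxid = re.search(r"txid(\d+)\[orgn\]", query, flags=re.IGNORECASE)
    if taxid:
        meta["taxid"] = int(taxid.group(1))
    for prop in re.findall(r"(\w+)\[prop\]", query, flags=re.IGNORECASE):
        # Unknown properties contribute nothing rather than raising an error.
        meta.update(PROP_METADATA.get(prop.lower(), {}))
    return meta

orangutan = metadata_from_query("txid9601[orgn] AND (gbdiv_est[prop])")
print(orangutan)
```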
Metadata part 2: The rest • Species-level taxid (e.g., 9606 for Homo sapiens). • BioProject ID. • Title. • Description (extended title). • Genome collection assembly name. • Entrez query. • Keywords.
Other Metadata sources • Genome collections can provide metadata. • Submitter provides metadata for uploaded databases. • Trace has XML dump.
BLASTDB-Pipeline (sketch) [Diagram; labels: NCBI staff, ID Team, BLAST DB order/update, BLASTDB-SVC, Database on disk, Produce metadata and statistics, Blastdbinfo (Entrez database), BLAST search page]
Blastdbinfo statistics • 821 cDNA databases. • 7320 genomic databases. • 513 RefSeq databases. • 10 Cavia porcellus (guinea pig) databases.
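Once every database carries metadata, counts like these reduce to simple tallies over the records. The sketch below uses a handful of invented records with hypothetical field names to show the idea; it does not reflect the actual Blastdbinfo schema or the real numbers above.

```python
from collections import Counter

# Invented example records; field names are hypothetical.
records = [
    {"name": "db_a", "type": "cDNA", "taxid": 9601, "refseq_only": False},
    {"name": "db_b", "type": "genomic", "taxid": 9606, "refseq_only": True},
    {"name": "db_c", "type": "genomic", "taxid": 9606, "refseq_only": False},
    {"name": "db_d", "type": "cDNA", "taxid": 10090, "refseq_only": True},
]

def count_by(items, field):
    """Tally records by the value of one metadata field."""
    return Counter(item[field] for item in items)

by_type = count_by(records, "type")
by_taxid = count_by(records, "taxid")
refseq_only = sum(1 for item in records if item["refseq_only"])

print(by_type["genomic"], by_taxid[9606], refseq_only)
```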
New system supports production databases • GEO • RefSeqGene • SNP • Top-level databases (nt, est, htgs, nr) • SRA • Trace • RefSeq Assembled Genomes
How to use the new system • Discuss the need for a BLAST database with your supervisor. • Look at the “Blastdb-Pipeline end user manual”, available in SharePoint at IEB > Molecular Software Section > BLAST > BLAST db dump process redesign. • Log in to NCBILS once (with NIH username and password). • Have your supervisor email blastsoft@ncbi.nlm.nih.gov and request that you be given permissions in Blastdb-Pipeline. • Submit your database order.
Future plans (wild-eyed speculation) • Find databases and/or pages based upon organism or some other criteria. • Produce on-the-fly reports about a BLAST database. • Add link from BLAST report back to BioProjects. • Add link to WGS master record to a BLAST page.
Acknowledgements • Yan Raytselis • Christiam Camacho • Yuri Merezhuk • Irena Zaretskaya • Ilya Dondoshansky • Misha Kimelman • Eugene Yaschenko • Anatoly Mnev • Mike DiCuccio • Avi Kimchi • Paul Kitts • Francoise Thibaud-Nissen • Sergei Resenchuk • Deanna Church • Grisha Starchenko • Aaron Gussman • Pramod Paranthaman • Mark Johnson • Amanjeev Sethi • Jeff Beck • Michael Domrachev • Eric Sayers • Tao Tao • Peter Cooper • Wayne Matten • Scott McGinnis