Databases & Applications
Jack da Silva, PhD
Bioinformatics Specialist, NCSC
NC BioGrid
Overview
• Molecular Biology Databases
• Bioinformatics Applications
• User Interfaces
• Research & Development
• Summary
Molecular Biology Databases
• Public Domain
• NC Initiatives
• Rest of the World
• Commercial
• NC BioGrid Database Service
NCSC Public-Domain Databases
• High-Performance Bioinformatics Initiative
• Major sequence repositories
  • GenBank, EMBL, DDBJ, etc.
  • Formatted for GCG & BLAST
• ExPASy (Expert Protein Analysis System) Mirror Site
  • Peptide databases & associated tools
  • SWISS-PROT Knowledgebase
Specialized Public-Domain Databases & NC Initiatives
• Value-added
  • Highly annotated (e.g., interactions)
  • Organism specific (e.g., human)
  • Molecule specific (e.g., protein)
  • Data specific (e.g., gene expression)
• North Carolina Initiatives
  • Please come forward
Commercial Databases
• Celera Genomics
  • Assembled & annotated human & mouse genome databases +
• DoubleTwist
  • Assembled & annotated human genome database
• LabBook
  • OSU Annotated Human Genome Database
  • Free to Academia
• Incyte Genomics
  • Human transcript database +
Molecular Biology Databases Around the World (335)
• Major Seq. Repositories (7)
• Comparative Genomics (7)
• Gene Expression (19)
• Gene ID & Structure (31)
• Genetic & Physical Maps (9)
• Genomic (49)
• Intermolecular Interactions (5)
• Metabolic Pathways & Cellular Regulation (12)
• Mutation (34)
• Pathology (8)
• Protein (51)
• Protein Sequence Motifs (18)
• Proteome Resources (8)
• Retrieval Systems & DB Structure (3)
• RNA Sequences (26)
• Structure (32)
• Transgenics (2)
• Varied Biomedical (18)
Source: Baxevanis, A.D. 2002. Nucleic Acids Research 30: 1-12.
NC BioGrid Database Service
• Establish service
• Housing & updating data
• Public-domain & commercial
• Virtual data federation
• Collaborative effort
• High-bandwidth network environment (NCREN)
Federated Databases
• Provide uniform access/view of heterogeneous databases
• IBM DiscoveryLink
  • “Provides a single-format virtual database view of multiple heterogeneous data sources”
• Lion bioscience SRS
  • “The power of SRS lies in its ability to effectively integrate heterogeneous data sources behind a single interface and integration framework.”
• Data standards development (e.g., XML)
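The idea of a uniform view over heterogeneous sources can be illustrated with a minimal sketch. This is not the DiscoveryLink or SRS interface; the two sources, their layouts, and every name below (`Record`, `FederatedView`, `adapt_a`, `adapt_b`) are hypothetical, chosen only to show per-source adapters feeding one common record type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Common record type that every source is mapped onto (hypothetical)."""
    accession: str
    description: str
    source: str


# Two made-up sources with different native layouts.
SOURCE_A = [{"acc": "A0001", "desc": "kinase, human"}]
SOURCE_B = [("B0002", "Phosphatase (mouse)")]


def adapt_a(rows):
    """Map dictionary rows from source A onto Record."""
    for row in rows:
        yield Record(row["acc"], row["desc"], "source_a")


def adapt_b(rows):
    """Map tuple rows from source B onto Record."""
    for accession, title in rows:
        yield Record(accession, title, "source_b")


class FederatedView:
    """Single query interface over several registered sources."""

    def __init__(self):
        self._sources = []

    def register(self, rows, adapter):
        self._sources.append((rows, adapter))

    def search(self, keyword):
        """Return records from all sources whose description contains keyword."""
        needle = keyword.lower()
        return [
            record
            for rows, adapter in self._sources
            for record in adapter(rows)
            if needle in record.description.lower()
        ]


if __name__ == "__main__":
    view = FederatedView()
    view.register(SOURCE_A, adapt_a)
    view.register(SOURCE_B, adapt_b)
    for hit in view.search("human"):
        print(hit)
```

The caller sees only `Record` objects and never the native layout of either source, which is the point of federation; a production system would push the query down to each database rather than scan rows in memory.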
I3C Workflow Demo
• I3C: Interoperable Informatics Infrastructure Consortium
• Demo uses XML-in, XML-out paradigm
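An XML-in, XML-out step can be sketched as a function that accepts an XML document and returns another one, so that steps can be chained into a workflow. The element and attribute names below (`sequences`, `sequence`, `results`, `gc`) are made up for illustration and are not the I3C schema; the GC-content calculation is just a placeholder analysis.

```python
import xml.etree.ElementTree as ET


def gc_content_step(xml_in: str) -> str:
    """Read <sequence> elements from an XML string, return results as XML."""
    root = ET.fromstring(xml_in)
    out = ET.Element("results")
    for seq in root.iter("sequence"):
        residues = (seq.text or "").strip().upper()
        gc_count = residues.count("G") + residues.count("C")
        fraction = gc_count / len(residues) if residues else 0.0
        # One <gc> element per input sequence, keyed by the same id.
        ET.SubElement(out, "gc", id=seq.get("id", ""), fraction=f"{fraction:.2f}")
    return ET.tostring(out, encoding="unicode")


if __name__ == "__main__":
    request = '<sequences><sequence id="s1">GGCCAATT</sequence></sequences>'
    print(gc_content_step(request))
```

Because both the input and the output are XML, the output of one step can be handed to the next without either step knowing how the other is implemented.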
Bioinformatics Applications
• Grid-Unaware
• Grid-Aware
• NC BioGrid Application Service
Grid-Unaware Applications
• Any application can run on a grid server
• NCSC High-Performance Bioinformatics Apps
• Public-domain apps on other NC servers
• Commercial apps on NC servers
NCSC Applications
• High-Performance Bioinformatics
• Parallel applications optimized for parallel supercomputers
• Accelrys GCG Wisconsin Package (commercial)
• BLAST & HT-BLAST
• Parallel Clustal & HT Clustal
• Parallel Molecular Systematics Apps
• ExPASy tools
• High-performance molecular modeling packages (commercial)
Public & Commercial Apps on NC Servers
• Any public-domain application
• Open source, “Freeware”
• Commercial apps will vary in licensing from restrictive to relatively unrestrictive
• Please come forward with suggestions
FEATURE
• “Grid-unaware”, public-domain application from the Russ Altman Lab, Stanford
• Identifies functional or structural sites of interest in a protein
• FEATURE is serial!
• Multiple instances run concurrently on NPACI-net LEGION grid test bed
• Scanned entire PDB (10,911 structures) in ~10 hrs (vs. ~177 hrs, about 1 wk, sequentially)
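A quick check of the timing figures above. Speedup follows directly from the two quoted times. Parallel efficiency also needs the number of concurrent instances; the value 50 is taken from the np = 50 quoted on the "FEATURE & the Grid" slide, and the calculation assumes all 50 instances ran for the whole scan, which the slides do not state.

```python
def speedup(serial_hours: float, parallel_hours: float) -> float:
    """Ratio of sequential to concurrent wall-clock time."""
    return serial_hours / parallel_hours


def efficiency(serial_hours: float, parallel_hours: float, instances: int) -> float:
    """Speedup divided by the number of concurrent instances."""
    return speedup(serial_hours, parallel_hours) / instances


if __name__ == "__main__":
    s = speedup(177, 10)
    e = efficiency(177, 10, 50)  # 50 instances is an assumption, see above
    print(f"speedup: {s:.1f}x")    # prints "speedup: 17.7x"
    print(f"efficiency: {e:.0%}")  # prints "efficiency: 35%"
```

Both quoted times are approximate, so these are rough figures rather than a measured benchmark.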
FEATURE Analysis
FEATURE & the Grid
• Compiled FEATURE code on LEGION for Intel Linux, DEC Alpha Linux, & Sun Solaris
• Registered binaries into “LEGION space”
• Provided file specifying where to find input and deposit output
• Used legion_run_multi command to spawn multiple instances of FEATURE (np = 50) across nodes, each scanning a single file from the PDB
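The pattern described above, many independent runs of a serial program with one input file each, can be sketched on a single machine. This does not use LEGION or legion_run_multi; a local thread pool stands in for the grid scheduler, and `scan_structure` is a hypothetical placeholder for one FEATURE run.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_structure(pdb_id):
    """Hypothetical stand-in for one serial FEATURE run on one PDB file.

    A real run would launch the FEATURE binary on the file; here the
    "result" is just the length of the identifier.
    """
    return pdb_id, len(pdb_id)


def run_farm(pdb_ids, workers=50):
    """Run one independent scan per structure, `workers` at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order; each task is independent of the others.
        return dict(pool.map(scan_structure, pdb_ids))


if __name__ == "__main__":
    print(run_farm(["1abc", "2xyz", "3def"], workers=3))
```

Because the tasks share no state, the serial program needs no modification to benefit; only the launcher has to know about the pool of workers.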
Grid-Aware Applications
• Not many – production grids don’t exist
• TurboBLAST (TurboGenomics)
  • Commercial
  • Not marketed specifically to grids
  • Distributes BLAST search over heterogeneous network of computers
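One simple way to distribute a BLAST search is to split the query set into chunks and send each chunk to a different machine. The sketch below shows only that splitting step, using round-robin assignment. It is not TurboBLAST's method, which the slides do not describe; `parse_fasta` and `partition` are hypothetical helper names.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def partition(records, n_workers):
    """Assign records to n_workers chunks in round-robin order."""
    chunks = [[] for _ in range(n_workers)]
    for index, record in enumerate(records):
        chunks[index % n_workers].append(record)
    return chunks


if __name__ == "__main__":
    queries = parse_fasta(">q1\nACGT\n>q2\nGG\nTT\n>q3\nA\n")
    for worker, chunk in enumerate(partition(queries, 2)):
        print(worker, [header for header, _ in chunk])
```

Round-robin ignores differences in machine speed and query length; on a heterogeneous network a scheduler would more likely hand out chunks on demand as workers finish.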
TurboBLAST
NC BioGrid Application Service
• Establish service
• Housing & updating binaries, source code, documentation
• Public-domain & commercial
• Collaborative effort
• High-bandwidth network environment (NCREN)
• Cross-referenced to databases (NCBDS)
One View of the User’s Views
• Database centric
  • Cross-referenced to appropriate applications
• Application centric
  • Cross-referenced to appropriate databases
• Analysis centric
  • References appropriate databases & applications
  • Suggests workflows
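The database-centric and application-centric views can both be served from one two-way cross-reference index. The sketch below is a minimal illustration; the class name `CrossRef` is hypothetical, and the database/application pairings in the example are illustrative rather than a statement of what the BioGrid will host.

```python
from collections import defaultdict


class CrossRef:
    """Two-way index linking databases and applications."""

    def __init__(self):
        self._apps_for_db = defaultdict(set)
        self._dbs_for_app = defaultdict(set)

    def link(self, database, application):
        """Record that an application can be used with a database."""
        self._apps_for_db[database].add(application)
        self._dbs_for_app[application].add(database)

    def applications_for(self, database):
        """Database-centric view: applications linked to a database."""
        return sorted(self._apps_for_db.get(database, ()))

    def databases_for(self, application):
        """Application-centric view: databases linked to an application."""
        return sorted(self._dbs_for_app.get(application, ()))


if __name__ == "__main__":
    xref = CrossRef()
    xref.link("GenBank", "BLAST")
    xref.link("GenBank", "GCG")
    xref.link("SWISS-PROT", "BLAST")
    print(xref.applications_for("GenBank"))
    print(xref.databases_for("BLAST"))
```

An analysis-centric view could sit on top of the same index by naming a task and listing the databases and applications it draws on.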
User Interfaces to the BioGrid
• Single sign-on
• Simple, graphical
• Allow user to “see” everything on grid
• Give the impression that resources are on user’s desktop
UNICORE Grid Technology
European DataGrid Simulator
Vanet LEGION Grid Test Bed (US Nodes)
Research & Development Opportunities
• Uniform access/view of data
• “Gridize” applications
• Database & application services
• User interface development
• Collaboration required
  • Must span the academic-commercial boundary
Summary
• The NC BioGrid aims to provide easy, high-bandwidth access to:
  • databases
  • applications
• Opportunities for collaborative R&D