The Biology Workbench – a community tool for teaching and research. Mark A. Miller, Principal Investigator, Biology, San Diego Supercomputer Center.
The Biology Workbench – a community tool for teaching and research • Mark A. Miller, Principal Investigator, Biology, San Diego Supercomputer Center
SDSC Mission: To serve as a premier resource for design, development, and deployment of cyberinfrastructure for the national scientific community.
What is Cyberinfrastructure anyway? [Diagram labels: Research; Production; Databases; Compute Resources; Wet Labs; Clinical Labs.] Then, after many months or years of struggle…
Cyberinfrastructure (We Think) Life (and Other) Scientists Need [Diagram labels: Development; Production; Research; Wet Labs; Clinical Labs; Grid Resources; Grid Services; Compute Resources; Web Services; Databases; Global Data Providers; Sequence Tools; Structure Tools; Microarray Tools; D.L.; Integration Software; Workflow; Discovery Portal; Data Capture Portals; Data Deposition Portals; Personal Electronic Notebook.]
SDSC Production Resources for HEC and Grid Computing • Allocations on large architectures via NRAC: DataStar; TeraGrid; Blue Gene. • Allocations for data collection storage: 1 PB of on-line disk space; 12 PB of tape space. • User Services: Allocation awards are accompanied by personal service to get you going. Everyone receives courteous advice and assistance! Development allocations are awarded on request. • Software Services (tools we provide to the community): Rocks cluster management tools; Storage Resource Broker (SRB); the Kepler Workflow Tool. • For U.S. NSF: http://www.sdsc.edu/user_services/allocations/
What is the Next Generation Tools for Biology Group? • Both research and development. • Science is the driver. • Use the Resources of SDSC to Focus on: • Activities that can be uniquely conducted at SDSC. • Activities that partner with other institutions. • Activities that are community-building.
Overview: Next Generation Tools for Biology at SDSC • IBM Institute for Innovation in Biomedical Simulations and Imaging (IBM-I3). • Current Projects at SDSC: • Cyberinfrastructure for Phylogenetic Research (CIPRES). • The Next Generation Biology Workbench
Next Generation Tools for Biology: Current Products • CIPRES middleware • CIPRES portal • CIPRES/Kepler workflow • Biology Workbench
CIPRES middleware • SDK/libraries for Win/Mac/Linux. • CORBA service architecture allows interactive access to tools across platforms. • Currently supports tree inference/improvement. • Can be accessed through Mesquite
Portal for Tree Inference • Supports: Parsimony (PAUP); Maximum Likelihood (RAxML, GARLI). • Coming soon: user configurability (via applet); MrBayes; POY; SATé.
CIPRES/Kepler workflow • Status: Proof of Concept; Systematics Feature Set; In Usability Development. • Supports: Iteration; Check-pointing; Data Forking; Data Transfer and Deposition; Web Services; Provenance Tracking. • http://www.phylo.org/sub_sections/kepler_workflow/help/creation.htm
The (current) Biology Workbench • Created 1996-1997 at NCSA by Shankar Subramaniam, Eric Jakobsson, Roger Unwin, Brian Saunders, Mark Stupar, Dawn Cotter, Jim Fenton, Curt Jamison, Brad Mills, George Pappas, David Tcheng
The original concept behind BWB: “Wouldn't it be nice if there was a web site that would let me run BLAST, CLUSTALW, etc. on my collection of sequences, or a collection of sequence alignments and let me store the results?”
Current Workbench Properties • From a single browser interface, one can access all calculations provided by the Workbench Server: • 66 individual tools. • Sequences from 33 databases. • Individual login password security provided. • Data storage area provided for results. • No required plug-ins or downloads. • Can be (and is) used over phone modem.
Annual WB usage, '00–'03 [Chart: jobs and users per year.]
Some BW User statistics • 71% of the user base is domestic. • 44% are academic • 15% noncommercial • 11% commercial • 1% government • The 29% international user population represents over 40 countries • 50% of present users employ the BW for government-funded research programs • 48% of BW users are involved in education
Cyberinfrastructure Provided by the Workbench [Diagram labels: Data Integration; Data storage area; Tools; Workbench Database; Wet Labs; Clinical Labs; Sequence Tools; Structure Tools; Microarray Tools; D.L.; Global Data Providers; Grid Resources; Grid Services; Compute Resources; Web Services; Integration Software; Workflow; Discovery Portal; Data Capture Portals; Data Deposition Portals; Personal Electronic Notebook.]
Overall Architecture of the Biology Workbench [Diagram labels: Browser; Web Server; bw.cgi; Wrapper; html.pl; Software Tools; Ndjinn Indexing; Databases; Session Storage; User Data Storage.]
Current Data Integration System [Diagram labels: Public DBs (Swissprot Database, GenBank Database); cron job: ftp download; Flat file; NDJINN Parser; Lookup Table; Databases; Web Server; User Data Storage.]
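The flat file → parser → lookup table path in the diagram can be pictured with a small sketch. This is not NDJINN's actual code: it assumes a Swiss-Prot-style flat file in which each record begins with an "ID" line and ends with a "//" line, and the function names and sample records are hypothetical.

```python
import io

def build_lookup_table(flat_file):
    """Map each record ID to the byte offset of its "ID" line.

    Assumption: every record starts with an "ID   <name> ..." line,
    as in Swiss-Prot-style flat files. flat_file is opened in binary mode.
    """
    table = {}
    offset = 0
    flat_file.seek(0)
    for line in iter(flat_file.readline, b""):
        if line.startswith(b"ID "):
            record_id = line.split()[1].decode("ascii")
            table[record_id] = offset
        offset += len(line)
    return table

def fetch_record(flat_file, table, record_id):
    """Seek straight to a record and read it up to its "//" terminator."""
    flat_file.seek(table[record_id])
    lines = []
    for line in iter(flat_file.readline, b""):
        lines.append(line)
        if line.rstrip() == b"//":
            break
    return b"".join(lines)

# Made-up records, for illustration only.
SAMPLE = (
    b"ID   REC_ONE   Reviewed;\nDE   First example record.\n//\n"
    b"ID   REC_TWO   Reviewed;\nDE   Second example record.\n//\n"
)

if __name__ == "__main__":
    handle = io.BytesIO(SAMPLE)
    table = build_lookup_table(handle)
    print(fetch_record(handle, table, "REC_TWO").decode("ascii"))
```

With a table like this, a record can be fetched without rescanning the flat file, but the record itself is still free text.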
Ndjinn Multiple Database Search • The "Ndjinn Multiple Database Search" allows the user to specify the databases to be searched.
Constructing Queries • User-selected databases may be searched for text. • Permitted text searches are "Contains", "Begins With", "Ends With", or "Exact Match". • Boolean operators "AND", "NOT", or "OR" may also be used; search order is controlled by parentheses. • Example: (myoglobin AND human) OR orangutan
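To make the query rules concrete, here is a minimal sketch of how a parenthesized Boolean "Contains" query could be evaluated against one record's text. It is not the Workbench's implementation. The function name, the treatment of NOT as a prefix operator, and the precedence (NOT, then AND, then OR) are assumptions; the slide states only that parentheses control search order.

```python
import re

# A token is a parenthesis or a run of non-space, non-parenthesis characters.
TOKEN = re.compile(r"\(|\)|[^\s()]+")

def matches(query, text):
    """Return True if text satisfies a Boolean 'Contains' query.

    Assumed grammar (not taken from the source):
        or_expr  := and_expr ("OR" and_expr)*
        and_expr := not_expr ("AND" not_expr)*
        not_expr := "NOT" not_expr | atom
        atom     := "(" or_expr ")" | search term
    """
    tokens = TOKEN.findall(query)
    haystack = text.lower()
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        value = parse_and()
        while peek() == "OR":
            take()
            rhs = parse_and()  # always parse, so tokens are consumed
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "AND":
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        if peek() == "NOT":
            take()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        tok = take()
        if tok == "(":
            value = parse_or()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return value
        if tok == ")":
            raise ValueError("unexpected closing parenthesis")
        return tok.lower() in haystack  # case-insensitive "Contains" match

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return result

if __name__ == "__main__":
    query = "(myoglobin AND human) OR orangutan"
    print(matches(query, "Myoglobin, human"))
    print(matches(query, "Myoglobin, horse"))
```

The slide's example query matches a record that mentions both myoglobin and human, or any record that mentions orangutan.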
Introducing SWAMI • The Next Generation Biology Workbench • (www.ngbw.org)
We'll all be planning out a routeWe're gonna take real soonWe're waxing down our surfboardsWe can't wait for JuneWe'll all be gone for the summerWe're on surfari to stayTell the teacher we're surfin'Surfin' U.S.A.Haggerties and SwamiesPacific PalisadesSan Onofre and SunsetRedondo Beach L.A.All over La JollaAt Waimia BayEverybody's gone surfin'Surfin' U.S.A. Why SWAMI? SWAMI = Master
The User Says: • "There should be a new Biology Workbench web site that can provide better search tools, support protein structure investigations, and allow my students to share files…."
The Developer Hears: • "There should be a web site that can • host all the users' biological data, not just sequences • allow them to analyze it using any modern tool they choose."
New Workbench Architecture Ideas: Take 1. Web Services • Computing and data management are handled at remote sites. [Diagram labels: Web Services; Registry/Discovery; Grid Services; Compute Resources; Structure Tools; Sequence Tools; Microarray Tools; D.L.; Global Data Providers; Personal Electronic Notebook; Discovery Portal; Workflow; Integration Software; Local Databases; Data Deposition Portals; Wet Labs; Clinical Labs.]
New Workbench Architecture Ideas: Take 1. Web Services • Issues: • Tools: No control over tool availability. Published tool registries are weak. Robust tool descriptions (UDDI) pose enormous overhead. • Data: Can't query across all data sources. Unknown bandwidth and reliability of remote data sources. API of remote data sources can change without warning. • This approach is too loosely coupled!
Reality Strikes: Priorities must be ordered • "There should be a web site that can • host all the users' biological data, not just sequences • allow them to analyze it using any modern tool they choose."
The Developer Concludes: • "There should be a web site that can • host all users' biological data, not just sequences • allow them to analyze it using any modern tool they choose with as many tools as possible with enterprise class stability….."
New Workbench Architecture Ideas: Take 2. Enterprise Solution • Computing and data management are handled locally. [Diagram labels: Workflow; Compute Resources; Personal Electronic Notebook; Discovery Portal; Structure Tools; Sequence Tools; Microarray Tools; D.L.; Integration Software; Local Data Warehouse; Data Deposition Portals; Wet Labs; Clinical Labs; Global Data Providers.]
New Workbench Architecture Ideas: Take 2. EJB • Issues: • Architecture has 8 separate modules. • A change in any module breaks 1–7 others. • Only a developer who can get zen with EJB can contribute to the development. • Modifying a web page becomes a task that a web artist cannot manage alone. • After 12 months of development, we can log in? • This approach has too much overhead!
Reality Strikes Again: Priorities must be re-ordered • "There should be a web site that can • host all users' biological data, not just sequences • allow them to analyze it using any modern tool they choose with as many tools as possible with enterprise class stability….."
The User Re-states: • "There should be a web site that can • provide better search tools, and allow my students to share files and • allow me to analyze it using any modern tool I choose with as many tools as possible with enterprise class stability and with enough stability so I can teach reliably…..as soon as is humanly possible…."
New Workbench Architecture Ideas: Take 3. Integrated, Stable Solution • Tomcat / Java / Struts 2 / Hibernate / MySQL / Lucene • This approach is just right?
Lesson Number 1: Get the user requirements right in the beginning
The NEW Workbench will improve on the existing functionalities: Improved Data Handling [Diagram labels: Wet Labs; Clinical Labs; Sequence Tools; Structure Tools; Sequencing Tools; D.L.; Global Data Providers; Grid Resources; Grid Services; Compute Resources; Web Services; Data Capture Portals; Data Deposition Portals; Workflow; Discovery Portal; Workbench Data Warehouse; Integration Software; Personal Electronic Notebook.]
Improved Data Handling [Diagram labels: Browser; Web Server; bw.cgi; Wrapper; html.pl; Ndjinn Indexing; Session Storage; User Data Storage; Databases; Flat files; Data Providers.] • The toolkit is limited by the ability to handle only sequences and alignments. • The ability to search is limited by storing data as free (unstructured) text.
Improved Data Handling: Improve Search Techniques • Lucene indexing allows us to replace the single text match string with the ability to search on specific fields.
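The difference between a single free-text match and a search on specific fields can be sketched with a toy inverted index. This is a plain-Python stand-in written for illustration, not Lucene's API. The class name, field names, and sample records are hypothetical.

```python
from collections import defaultdict

class FieldedIndex:
    """Toy inverted index keyed by (field, term); a stand-in, not Lucene."""

    def __init__(self):
        # (field name, lowercased term) -> set of record ids
        self._postings = defaultdict(set)

    def add(self, record_id, fields):
        """Index each whitespace-separated term under the field it came from."""
        for field, value in fields.items():
            for term in value.lower().split():
                self._postings[(field, term)].add(record_id)

    def search(self, **criteria):
        """Return ids of records that match every field=term criterion."""
        result = None
        for field, term in criteria.items():
            hits = self._postings.get((field, term.lower()), set())
            result = set(hits) if result is None else result & hits
        return result or set()

if __name__ == "__main__":
    index = FieldedIndex()
    # Made-up records, for illustration only.
    index.add("rec1", {"organism": "Homo sapiens", "description": "myoglobin"})
    index.add("rec2", {"organism": "Pongo abelii", "description": "myoglobin"})
    index.add("rec3", {"organism": "Homo sapiens", "description": "hemoglobin beta"})
    print(index.search(description="myoglobin", organism="sapiens"))
```

Because each term is filed under its field, a query for "sapiens" in the description field does not match records where the word appears only in the organism field. A single free-text match cannot make that distinction.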
Improved Data Handling: User Data Stored in RDB • Allow the user to import and annotate data of many types, including a generic, unknown type. • User-entered sequences and results are stored and annotated along with other user-selected sequences. • Use of the RDB makes it possible to repurpose data easily.
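One way to picture typed, annotated user data in a relational store is the sketch below. It uses Python's built-in sqlite3 module purely as a stand-in for the MySQL/Hibernate stack named earlier. The table names, column names, and helper functions are hypothetical and do not reflect the Workbench's actual schema.

```python
import sqlite3

# Hypothetical schema: every item has a type, defaulting to 'unknown',
# and any number of free-text annotations.
SCHEMA = """
CREATE TABLE IF NOT EXISTS data_item (
    id        INTEGER PRIMARY KEY,
    owner     TEXT NOT NULL,
    data_type TEXT NOT NULL DEFAULT 'unknown',
    label     TEXT NOT NULL,
    content   BLOB NOT NULL
);
CREATE TABLE IF NOT EXISTS annotation (
    item_id INTEGER NOT NULL REFERENCES data_item(id),
    note    TEXT NOT NULL
);
"""

def open_store(path=":memory:"):
    """Open (or create) the store and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def import_item(conn, owner, label, content, data_type="unknown"):
    """Store one item; callers may omit data_type for unrecognized data."""
    cur = conn.execute(
        "INSERT INTO data_item (owner, data_type, label, content) "
        "VALUES (?, ?, ?, ?)",
        (owner, data_type, label, content),
    )
    return cur.lastrowid

def annotate(conn, item_id, note):
    """Attach a free-text note to a stored item."""
    conn.execute(
        "INSERT INTO annotation (item_id, note) VALUES (?, ?)", (item_id, note)
    )

def items_of_type(conn, owner, data_type):
    """List the labels of one user's items of a given type."""
    rows = conn.execute(
        "SELECT label FROM data_item "
        "WHERE owner = ? AND data_type = ? ORDER BY id",
        (owner, data_type),
    )
    return [label for (label,) in rows]

if __name__ == "__main__":
    conn = open_store()
    import_item(conn, "alice", "my sequence", b"MKV", "protein_sequence")
    mystery = import_item(conn, "alice", "instrument dump", b"\x00\x01")
    annotate(conn, mystery, "format not yet identified")
    print(items_of_type(conn, "alice", "unknown"))
```

Once items carry a type and an owner as columns, listing or repurposing a user's data becomes a query rather than a scan of stored files.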
The NEW Workbench will improve on the existing functionalities: Improved Tool Selection [Diagram labels: Wet Labs; Clinical Labs; Sequence Tools; Structure Tools; Sequencing Tools; D.L.; Global Data Providers; Grid Resources; Grid Services; Compute Resources; Web Services; Data Capture Portals; Data Deposition Portals; Workflow; Discovery Portal; Workbench Data Warehouse; Integration Software; Personal Electronic Notebook.]
New Discovery Portal Step 1. Improved User Access to Tools • PISE currently has 300+ interfaces. [Diagram labels: Browser; Web Server; bw.cgi; Wrapper; html.pl; .jsp; Tool Broker Service; PISE XML; SWAMI XML; Software Tools; User Data Storage; Session Storage.] • Lesson Number 2: Software development is incredibly expensive. Build nothing you can steal. Steal from the best.
The NEW Workbench will improve on the existing functionalities: Improved Portal [Diagram labels: Wet Labs; Clinical Labs; Sequence Tools; Structure Tools; Sequencing Tools; D.L.; Global Data Providers; Grid Resources; Grid Services; Compute Resources; Web Services; Data Capture Portals; Data Deposition Portals; Workflow; Discovery Portal; Workbench Data Warehouse; Integration Software; Personal Electronic Notebook.]
User-Requested ToolKits: • Structural Biology: tools to visualize protein structures. • Molecular Biology: tools to assemble contigs; tools to visualize sequencer output. • Role-Based Logins: licensed tools can be mounted for individual users; instructors and students have separate roles. • Folder sharing for collaborative work. • NO BROWSER PLUGINS • NO SUDDEN CHANGES