The Collaboratory: computing environments and infrastructure for structural biology research
Timothy M. McPhillips
Stanford Synchrotron Radiation Laboratory
What is the Collaboratory?

Technically: an R&D program funded by NIH
• NIH’s definition of a Collaboratory: “A laboratory without walls.”
• Pilot program to investigate whether collaboration and remote access tools could improve the efficiency of NCRR resources.
• Supplement to the NCRR grant that funds the SMB group.
• Currently funds three full-time employees in the SMB group: Thomas Eriksson, Ken Sharp, and Tim McPhillips.
• Funding has been extended through the end of the NCRR parent grant; the Collaboratory program will be renewed within the context of the parent grant in 2005.

In practice: a group-wide effort to create a coherent computational research environment for our users
• Goal is to provide users with a coherent, overarching system for collecting data and solving structures--not just a bunch of tools.
• Software development, systems management, instrument design, hardware development, beam line automation, maintenance of equipment, etc.--all are critical to the Collaboratory.
• Everyone in the PX group contributes to the Collaboratory effort.
“Something there is that doesn’t love a wall…”

What kind of walls has the Collaboratory removed?
• Walls between beam lines: Users can move between beam lines and find the same computer systems, user accounts and file systems wherever they go.
• Walls of geographical distance: Users can access the beam line, computing resources, and their data from anywhere in the world.
• Walls between collaborators: Local and remote coworkers can see samples, monitor the beam line, view data, and share data collection sessions.
• Walls between detectors and disk storage: High performance network and file server allows users to collect data from large area detectors at maximum speed.
• Walls between data and solved structures: High performance computers enable users to process their data and solve structures in real time.
• And coming down this year: Walls between traditional and web-based applications; walls between users and support staff; and walls between users and archived data.
…but “good fences make good neighbors!”

What kind of fences has the Collaboratory put up?
• Fences between user groups: Each user group’s data is secure from snooping, theft, and tampering by other groups.
• Fences between networks: Computer systems at the beam lines are protected from network disturbances elsewhere at SSRL; instrument control computers are on an isolated network.
• Fences that keep users from damaging equipment remotely: Access control and rights restrictions in Blu-Ice make remote control of beam lines safe.
• Fences between computer systems and crackers: High level of security means users need not worry about data loss or system downtime due to marauders from the Internet.
Implications of the automated sample mounting system
• SSRL cassette design allows hundreds of pre-frozen crystals to be examined without entering the hutch.
• Automatic crystal centering system allows the crystal to be aligned automatically in the beam.
• In 2003, users of the robot on 11-1 entered the hutch only once, to install cassettes in the dispensing dewar.
• In 2004, users will not be allowed to use the robot if they re-enter the hutch after cassettes are loaded under staff supervision.
• Cassettes of crystals can be shipped to the beam line via FedEx.
• Cassettes can be placed in the hutch by staff, allowing users to work remotely.
• Local and remote users will have equal access to the hutch when using the robot (i.e., none).
• In theory, many users of the sample mounting robot need not come on site at all.

BUT -- Need appropriate computing, network, and software infrastructure to enable remote access to full experimental capabilities of beam lines.
Collaboratory tools and sample mounting robots will allow SSRL users to work completely remotely in 2004

Blu-Ice for beam line control
• Can run locally or remotely.
• Multiple copies may run simultaneously.
• Security features prevent unsafe actions.

Beam line video system
• Monitor sample in beam, experimental hardware, and crystals under microscope.
• Video streams may be viewed via Blu-Ice or through a web browser.

Archive System
• Back up data to multi-terabyte robot tape system at SDSC over network.
• Simple web interface for data archival and retrieval.
• No need to use backup tapes.

Remote Unix desktop
• Fully functional Unix desktop environment.
• Blu-Ice and all data processing software may be run remotely.
• Free ICA client from Citrix.
Why a high capacity, long term data archive is needed

Need a replacement for tapes
• Tapes age and media formats change rapidly.
• Storage capacity and reliability of tapes are limited.
• Much manual book-keeping is needed to keep track of data stored on tapes.

Need to support large-area CCD detectors
• Three Q315 detectors and a MAR 325 will each be generating 20-70 MB of image data every 5 seconds when the SPEAR3 upgrade is complete.
• RAID data storage at SSRL will be 24 TB in 2004--all that data must be backed up somehow!
• Need to archive data as rapidly as it is collected.

Need to support high-throughput structural biology
• Automated beam lines will generate huge amounts of data.
• Large numbers of samples and targets require that metadata be stored and tracked systematically.
• Data must be archived automatically and be easy to retrieve.
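The detector figures on this slide imply a sustained data rate that can be estimated with simple arithmetic. The sketch below is only a back-of-the-envelope calculation: it assumes (this is not stated on the slide) that all four detectors run simultaneously and continuously, and that 1 TB = 1,000,000 MB. The function names are illustrative.

```python
# Back-of-the-envelope data rates from the figures on this slide.
DETECTORS = 4               # three Q315 detectors plus one MAR 325
FRAME_MB_RANGE = (20, 70)   # MB of image data per frame, from the slide
FRAME_PERIOD_S = 5          # one frame every 5 seconds, from the slide
RAID_TB = 24                # planned RAID capacity in 2004, from the slide


def aggregate_rate_mb_s(frame_mb, detectors=DETECTORS, period_s=FRAME_PERIOD_S):
    """Combined output in MB/s, ASSUMING every detector runs at the same time."""
    return detectors * frame_mb / period_s


def hours_to_fill(capacity_tb, rate_mb_s):
    """Hours of continuous collection needed to fill capacity_tb (1 TB = 1e6 MB)."""
    return capacity_tb * 1_000_000 / rate_mb_s / 3600


if __name__ == "__main__":
    for frame_mb in FRAME_MB_RANGE:
        rate = aggregate_rate_mb_s(frame_mb)
        hours = hours_to_fill(RAID_TB, rate)
        print(f"{frame_mb} MB frames: {rate:.0f} MB/s, RAID full after {hours:.0f} h")
```

Under these assumptions the aggregate rate is 16 to 56 MB/s, and 24 TB of disk would fill in roughly 17 days at the low end or about 5 days at the high end, which is why the archive has to keep pace with collection.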
High Performance Storage System and Storage Resource Broker at SDSC

High Performance Storage System (HPSS)
• Long term data storage system at SDSC.
• Currently stores over 344 TB of data in 18 million files.
• Currently provides 0.9 PB of storage.

Storage Resource Broker (SRB)
• Client-server middleware for accessing heterogeneous resources over the network.
• May be used to store and retrieve data on the HPSS at SDSC.
• Powerful metadata querying system allows data sets to be accessed based on their attributes.
• Data sets can be replicated over multiple resources.

The challenge
• Capabilities of HPSS and SRB far exceed the perceived needs of our beam line users.
• Educating users to effectively use these systems for managing their data is a challenge.
• Our users need a customized interface with simplified functionality.
InQ SRB client for Microsoft Windows

SRB client applications
• Users must be able to upload data, download data, and view the data in the archive.
• Users perform these functions via SRB client applications.

InQ for Microsoft Windows
• InQ is the easiest-to-use client provided by SDSC.
• Individual files or entire folders may be uploaded or downloaded.
• Files in the archive may be browsed either by directory structure or by data attributes.

Limitations of InQ
• Runs only on Microsoft Windows platforms.
• Windows is not the major platform used at synchrotron light sources or in crystallography research labs.
• No batch job capability for long archive jobs.
• Exposes confusing SRB features and terminology (resources, containers, collections, etc.).
MySRB web browser-based SRB client

MySRB
• MySRB is a powerful web-based SRB client.
• Can be run from standard web browsers.
• Files in the archive may be browsed either by directory structure or by data attributes.

Limitations of MySRB
• No way to upload or download more than one file at a time.
• The otherwise rich functionality and powerful features are confusing to users.

The bottom line:
• Additional infrastructure must be designed and implemented in order to make the SRB a viable storage system for crystallographic data.
• A browser-based user interface is ideal.
The Collaboratory interface for using the SRB archive

Convenient web browser interface
• Users may define archive jobs over the web from anywhere in the world using any common type of computer.
• Users need only log in to the Collaboratory portal with their Unix account name and password.

Simple archive job definition
• Users may rapidly browse their data sets at SSRL.
• Directory contents are listed in the browser window.
• Directories may be navigated by clicking on directory names.
• Files to be uploaded may be filtered according to a list of wildcards.
• Subdirectories may be archived recursively.
• The only SRB related information required is the name of the new data collection to create.
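The file selection described on this slide (a list of wildcards, with optional recursion into subdirectories) can be sketched in a few lines. This is an illustration using the Python standard library, not the Collaboratory's actual implementation; the function name `select_files` and its parameters are hypothetical.

```python
import fnmatch
import os


def select_files(root, patterns, recursive=True):
    """Return sorted paths, relative to root, whose file names match any wildcard.

    Hypothetical helper: shows how an archive job definition could turn a
    directory plus a wildcard list into the set of files to upload.
    """
    selected = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if any(fnmatch.fnmatch(name, pattern) for pattern in patterns):
                full_path = os.path.join(dirpath, name)
                selected.append(os.path.relpath(full_path, root))
        if not recursive:
            # Emptying dirnames in place stops os.walk from descending further.
            dirnames.clear()
    return sorted(selected)
```

For example, `select_files("/data/somegroup/run1", ["*.img"])` (a made-up path) would list every `.img` file under that directory, while passing `recursive=False` restricts the job to the top-level directory.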
Monitoring archive jobs and downloading data

Batch operation
• Archive job runs in background once definition is confirmed.
• Browser does not hang during archival.
• New jobs may be started while previously defined jobs are in progress.
• A job status page indicates definitions and status of all running jobs.
• E-mail is sent to the user when a job is complete.

Similar interface for data download
• Users browse their archived data sets in exactly the same fashion.
• Data may be downloaded from the archive to a directory at SSRL (analogous to an upload job).
• Another option is to download selected files in one or more tar files directly to any computer on the Internet.
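The "one or more tar files" download option can be illustrated with Python's standard `tarfile` module. The slide does not say how files are divided among archives; the size cap used here (`max_bytes`) is purely an assumption for illustration, as are the function names.

```python
import io
import tarfile


def _write_tar(batch):
    """Write a list of (name, data) pairs into one in-memory tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in batch:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def bundle_into_tars(files, max_bytes):
    """Pack {name: bytes} into one or more tar archives (returned as bytes).

    ASSUMPTION: a new archive is started whenever adding the next file would
    push the current archive's payload past max_bytes.
    """
    archives, batch, size = [], [], 0
    for name, data in files.items():
        if batch and size + len(data) > max_bytes:
            archives.append(_write_tar(batch))
            batch, size = [], 0
        batch.append((name, data))
        size += len(data)
    if batch:
        archives.append(_write_tar(batch))
    return archives
```

A file larger than the cap still gets an archive of its own, so no selected file is ever dropped.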
Significant infrastructure is required to provide this “simple” interface--but the payoff is huge.

Authentication Gateway Server
• Java servlet that provides a common authentication protocol for all Collaboratory applications.
• Used to authenticate archive system users.
• All web-based Collaboratory software is being updated to use this single authentication server.
• Support for the authentication server has already been integrated into Blu-Ice/DCS.
• Allows users to navigate between web applications seamlessly without authenticating multiple times.
• Will allow access to be controlled based on the beam port schedule.
• Will allow users to start web-based applications from within Blu-Ice without requiring the user to authenticate again within the browser.

Impersonation Server
• Unix daemon that can run any non-interactive program on behalf of any Unix user.
• Enables web applications to run background jobs for a user with the actual rights of the Unix user account.
• Accepts commands via HTTP.
• Verifies authentication information with the Authentication Server.
• Used by the Collaboratory archive system to list directories in the web browser and run background archive jobs as the user.
• Will enable fluorescence scans and autochooch to be executed by the scripting engine in DCSS.
• Will allow further analyses to be initiated by the beam line control system automatically.
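The interaction between the two servers can be sketched as follows: a user logs in once, receives a session, and the Impersonation Server checks that session with the authentication server before running anything on the user's behalf. This is a minimal in-memory model, not the real servlet or daemon. The class and method names, the session-ID scheme, and the numeric status codes are all assumptions; the real servers communicate over HTTP, and the real daemon would run the program under the user's own Unix account instead of calling a Python function.

```python
import secrets


class AuthGateway:
    """In-memory stand-in for the Authentication Gateway Server (no network)."""

    def __init__(self, passwords):
        self._passwords = passwords   # {unix_user: password}, illustration only
        self._sessions = {}           # {session_id: unix_user}

    def login(self, user, password):
        """Check credentials once and hand back a session ID for later requests."""
        if self._passwords.get(user) != password:
            raise PermissionError("bad credentials")
        session_id = secrets.token_hex(16)
        self._sessions[session_id] = user
        return session_id

    def validate(self, session_id, user):
        """True only if the session exists and belongs to the named user."""
        return session_id in self._sessions and self._sessions[session_id] == user


class ImpersonationServer:
    """Stand-in for the Impersonation Server: verify the session, then run the job."""

    def __init__(self, gateway, runner):
        self._gateway = gateway
        # runner(user, argv) is a placeholder; a real daemon would execute
        # argv with the rights of the Unix account named by user.
        self._runner = runner

    def handle(self, session_id, user, argv):
        """Return (status, result) for one request, rejecting unverified sessions."""
        if not self._gateway.validate(session_id, user):
            return 403, "session not valid for this user"
        return 200, self._runner(user, argv)
```

Because every application asks the same gateway to validate the session, a user who has logged in once can move between web applications without authenticating again, which is the behavior the slide describes.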
Projects for the next year

Integration of web-based Collaboratory tools
• A new web-based environment for monitoring beam lines and viewing results will be developed over the next year.
• The diffraction image viewer, beam line video web application, and archive system will be integrated into this system.
• Will enable real-time monitoring of beam line operations and experimental results via the web.
• Layout of user interface will likely mimic Blu-Ice’s tab look and feel to leverage user familiarity and experience.
• Currently investigating tools for rapidly developing powerful web-based applications in a component-based framework (e.g., WebObjects).
Projects for the next year

Web-based proposal management system
• Provide all SSRL users with web-browser based tools for submitting proposals and beam time requests; updating personal information; and viewing personalized beam time schedules.
• Facilitate communication with user administration and user support staff.
• Integrate with production SSRL database system, eliminate older user interfaces and reporting tools.
• SSRL will run a separate instance of the Authentication Gateway Server for this purpose.
• Users will be able to use this system to specify which Unix accounts are enabled to collect data at the beam line when a particular proposal is active. No more editing the MySQL table!
• First new interfaces will be rolled out by the end of 2003; major features will likely be released in late 2004.
Collaboratory projects for the next 5 years…

Ice-Floe
• Provide users with the databases, user interfaces, and project management capabilities required to make maximum use of high-throughput structural biology resources.
• Present users with a high-level interface to automated beam lines and automated structure determination systems.
• Enable users to focus on the workflow of carrying out their research rather than the details of each operation.

Ice-Breaker
• Develop an open protocol for communicating with beam line automation systems.
• Work with developers at other light sources to make the protocol compatible across a large fraction of structural biology beam lines worldwide.
• Enable anyone to develop their own interface to automated beam lines, support in-house LIMS, interface to other software packages, etc.
• Allow users to choose the interface most useful to them, independent of the light source.
Where we’re going: data grids, compute grids and experimental resource grids