OxGrid, a campus grid environment for Oxford Dr. David Wallom Technical Manager
Outline • Aims of OxGrid? • How have we made OxGrid? • Central Systems • Software • Resources • Users • The future direction of the project
Aim • To develop and deploy Grid technology to • increase utilisation of current and future university resources • substantially increase the research computing power available • capitalise on our status as a core node of the NGS and evangelise usage of grids • lead international best-practice efforts in running production grids • provide user authentication through either UK e-Science certificates or the university single sign-on system
OxGrid, a University Campus Grid • Single entry point for users to shared and dedicated resources • Seamless access to NGS and OSC for registered users
OxGrid Central System • Resource Broker • Distribution of submitted tasks • User access point • Information Service • Central repository of system status information • Virtual Organisation Management and Resource Usage Service • User/resource control within the grid • Record and analyse accounting information so that full system as well as single resource usage can be recorded • Systems monitoring • Monitoring system for support, providing first point of user contact • Storage • Dynamic virtual file system independent of rest of the system
Grid Middleware • Virtual Data Toolkit • Contains • Globus Toolkit version 2.4 with several enhancements • GSI enhanced OpenSSH • myProxy Client & Server • Has a defined support structure
Resource Broker • Built on top of Condor-G • Allows treatment of a Globus (i.e. remote) resource as a local resource • Command-line tools available to perform job management (submit, query, cancel, etc.) with detailed logging • Simple job submission language which is translated into remote scheduler specific language • Custom script for determination of resource status & priority • Integrated the Condor Resource description mechanism and Globus Monitoring and Discovery Service • Automated resource discovery • Underlying capability matched to system database including such information as installed software and system load high watermark
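The matching step described on this slide can be illustrated with a minimal Python sketch. It is not the OxGrid implementation: the `Resource` fields, the `match_resources` function and the ordering rule (highest priority first, then lowest load) are assumptions made for illustration. It only shows the idea of filtering resources by permission, installed software and a load high watermark.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    # All field names here are hypothetical; the slides do not give the schema.
    name: str
    software: set = field(default_factory=set)  # installed software packages
    load: float = 0.0                           # current load, 0.0 to 1.0
    high_watermark: float = 1.0                 # load ceiling set for the resource
    priority: int = 0                           # larger value = preferred


def match_resources(resources, required_software, permitted):
    """Return usable resources, best candidate first.

    A resource is usable if the user is permitted to use it, it has all the
    required software, and its load is below its high watermark.
    """
    required = set(required_software)
    candidates = [
        r for r in resources
        if r.name in permitted
        and required <= r.software
        and r.load < r.high_watermark
    ]
    # Assumed ordering: highest priority first, ties broken by lowest load.
    return sorted(candidates, key=lambda r: (-r.priority, r.load))
```

In the real system the status and priority values come from the custom script and the Globus MDS; here they are simply passed in.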
Job Submission Script • Separates users from underlying Condor system • Requires the following information to submit a task • Executable name, • Transfer exe (y/n) • Command line arguments to exe • Input files • Output files • Necessary installed software • Extra Globus RSL parameters e.g. MPI Job and number of concurrent processes • When job is submitted script contacts system database to retrieve list of systems user has permission to use
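As a rough illustration of how the fields listed above might be translated for Condor-G, the following Python sketch builds a submit description. The slides do not show the actual OxGrid script or its output, so the function name, the job dictionary keys and the fixed log file names are hypothetical; the submit commands follow current HTCondor grid-universe conventions for a GT2 resource, which may differ from the Condor version OxGrid used.

```python
def build_submit_description(job, gatekeeper):
    """Build a Condor-G submit description for a GT2 (pre-WS GRAM) resource.

    `job` is a plain dict with hypothetical keys mirroring the slide:
    executable, transfer_exe, arguments, input_files, output_files, rsl.
    `gatekeeper` is the GRAM contact string, e.g. "host/jobmanager-pbs".
    """
    transfer = "true" if job.get("transfer_exe", True) else "false"
    lines = [
        "universe = grid",
        f"grid_resource = gt2 {gatekeeper}",
        f"executable = {job['executable']}",
        f"transfer_executable = {transfer}",
    ]
    if job.get("arguments"):
        lines.append(f"arguments = {job['arguments']}")
    if job.get("input_files"):
        lines.append("transfer_input_files = " + ", ".join(job["input_files"]))
    if job.get("output_files"):
        lines.append("transfer_output_files = " + ", ".join(job["output_files"]))
    if job.get("rsl"):
        # Extra Globus RSL, e.g. "(jobtype=mpi)(count=4)" for an MPI job.
        lines.append(f"globus_rsl = {job['rsl']}")
    # File names below are placeholders chosen for this sketch.
    lines += ["output = job.out", "error = job.err", "log = job.log", "queue"]
    return "\n".join(lines) + "\n"
```

The "necessary installed software" field is not written to the submit description here; in this sketch it would be used earlier, when choosing which gatekeeper to target.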
Additional User Tools • oxgrid_certificate_import • Simplifies the installation of a user's digital certificate to a single command • oxgrid_q • Displays the user's current queue at the resource broker. Has an option to let the user see the full task queue. • oxgrid_status • Displays the resources that are available to the user, with an option to show all resources currently registering with the resource broker • oxgrid_cleanup • Removes either a single submitted process or a range of child processes together with their master
Virtual Organisation Management • Globus uses a mapping from certificate Distinguished Names (DNs) to local usernames on each resource • Important that a user's DN is mapped locally on every resource that the user expects to use • Make sure the correct resources are registered • OxVOM • Postgres database, web server, CGI scripts • Custom in-house designed Web based user interface • Persistent information stored in relational databases • User DN list retrieved by remote resources using standard tools from LDAP database
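The DN-to-username mapping lives on each resource in a Globus grid-mapfile, where each line holds a quoted DN followed by a local username. The Python sketch below shows how such a file could be rendered for one resource from VO records. The record layout `(dn, resource, local_user)` and the function name are assumptions; the slides do not describe the OxVOM schema or its installation scripts.

```python
def grid_mapfile_for(resource, records):
    """Render grid-mapfile text for one resource.

    `records` is an iterable of (dn, resource_name, local_user) tuples,
    a hypothetical flattening of what a VO database might hold.
    """
    # Deduplicate and sort so the generated file is stable between runs.
    entries = sorted(
        {(dn, user) for dn, res, user in records if res == resource}
    )
    return "".join(f'"{dn}" {user}\n' for dn, user in entries)
```

For example, with made-up DNs:

```python
records = [
    ("/C=UK/O=eScience/OU=Oxford/L=OeRC/CN=alice example", "cluster1", "alice"),
    ("/C=UK/O=eScience/OU=Oxford/L=OeRC/CN=bob example", "cluster2", "bob"),
]
print(grid_mapfile_for("cluster1", records), end="")
```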
Resource Usage Service and Accounting • Jobmanagers altered to include commands to determine job start and stop time as well as interface with host scheduling system • Information returned from client to RUS server when job completed and stored in relational database • Stored information for a single job includes • Start & end time • Execution host, scheduler • CPU time & wall time • Memory used • Resource owner controlled cost variable • Tune usage from campus grid • Version 2 will use GGF Usage Record standard
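A per-job record with the fields listed above could look like the following Python sketch. The class name, field names and units are assumptions, and the charge formula (CPU hours multiplied by the owner's cost variable) is only one plausible way such a variable could be applied; the slides do not say how OxGrid computes cost.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class UsageRecord:
    # Field names and units are hypothetical.
    start: datetime
    end: datetime
    host: str
    scheduler: str
    cpu_seconds: float
    memory_mb: float
    cost_per_cpu_hour: float  # resource-owner controlled cost variable

    @property
    def walltime_seconds(self):
        """Elapsed time between job start and job end."""
        return (self.end - self.start).total_seconds()

    @property
    def charge(self):
        """Assumed charging rule: CPU hours times the owner's cost variable."""
        return self.cpu_seconds / 3600.0 * self.cost_per_cpu_hour
```

Letting the owner raise or lower the cost variable is one way a resource could tune how much campus grid work it attracts.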
Resource Usage Service • Enables presentation of system use to users as well as system owners • Can form the basis of a charging model
Core Resources • Available to all users of the campus grid • Individual Departmental Clusters (PBS, SGE) • Grid software interfaces installed • Management of users through pool accounts or manual account creation • Clusters of PCs • Running Condor/SGE • Single master running up to ~500 nodes • Masters run either by owners or OeRC • Execution environment on second OS (Linux), Windows or Virtual Machine
External Resources • Only accessible to users that have registered with them • National Grid Service • Peered access with individual systems • OSC • Gatekeeper system • User management done through standard account issuing procedures and manual DN mapping • Controlled grid submission to Oxford Supercomputing Centre • Some departmental resources • Used as a method to bring new resources online initially • Show the benefits of joining the grid • Limited access to resources donated by other departments, to maintain the incentive to become full participants
Services necessary to connect to OxGrid • For a system to connect to OxGrid • Must support a minimum software set (without which it is impossible to submit jobs from the Resource Broker) • GT2 GRAM and RUS reconfigured jobmanager • MDS compatible information provider • Desirable though not mandated • OxVOM compatible grid-mapfile installation scripts • Scheduling system to give fair-share to users of the resource
Environmental Condor • Cost & environmental considerations of using ‘spare’ resources (£7000/yr for OUCS Condor pool) • New daemon for Condor • System starts and stops registering systems depending on currently queued tasks. • Currently only works with Linux Condor systems
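The core decision such a daemon has to make is how many machines should stay registered for the current queue. The Python sketch below shows one simple policy. It is not the OxGrid daemon: the function name, the parameters and the ceiling-division rule are assumptions, and the actual starting and stopping of machines is left out.

```python
def machines_to_keep_on(queued_tasks, slots_per_machine, total_machines, minimum=0):
    """Return how many pool machines should stay registered.

    Assumed policy: enough machines to run every queued task at once,
    never more than the pool size and never fewer than `minimum`.
    """
    if slots_per_machine <= 0:
        raise ValueError("slots_per_machine must be positive")
    needed = -(-queued_tasks // slots_per_machine)  # ceiling division
    return max(minimum, min(total_machines, needed))
```

A real daemon would also need some delay before shutting machines down, so that a briefly empty queue does not cause machines to cycle on and off.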
Current Compute System Layout • Central management services running on single server • Current resources: • All Users • OUCS Linux Pool (Condor, 250 CPU) • Oxford NGS node (PBS, 128 CPU) • Condensed Matter Physics (Condor, 10 CPU) • Theoretical Physics (SGE, 14 CPU) • OeRC Cluster (SGE, 5 CPU) • High Energy Physics (LCG-PBS, 120 CPU), not registering with RB • Registered users • OSC (Zuse, 40 CPU) • NGS (all nodes, 342 CPU) • Biochemistry (SGE, 30 CPU) • Materials Science (SGE, 20 CPU)
Planned System Additions • Physics Mac teaching laboratory (end Nov) • OUCS Mac systems (end Nov, have agreement just need time!) • Humanities cluster (Nov) • Statistics cluster (end Dec) • Biochemistry remaining two clusters (end Dec) • OSC SRIF3 cluster tranche 1 (2007) • Chemistry clusters (contacted department) • NGS2 All resources (2007)
Data Management • Engagement of data as well as computational users, • Provide a remote store for those groups that cannot resource their own, • Distribute the client software as widely as possible to departments that are not currently engaged in e-Research.
Data Management • Two possible candidates for creation of system • Storage Resource Broker to create large virtual datastore • Through central metadata catalogue users interface with single virtual file system though physical volumes may be on several network resources • In built metadata capability • Disk Pool Manager • Similar virtual disk presentation, • Internationally recognised using SRM standard interface, • No metadata capability, • Integrated easily with VO server.
Supporting OxGrid • First point of contact is the OUCS Helpdesk, through the support email address. • Helpdesk staff are given a preset list of questions to ask and log files to request, if available. • They are not expected to do any actual debugging. • They pass problems on to Grid experts, who then pass them on, system by system, to that system's own maintenance staff. • Significant cluster support expertise within OeRC. • As one of the UK e-Science Centres we also have access to the Grid Operations and Support Centre.
Users • Focused on users with ‘serial computation’ problems, individual researchers • Statistics (1 user) • Materials Science (3 users) • Inorganic Chemistry (3 users) • Theoretical Chemistry (4 users) • Biochemistry (8 users) • Computational Biology (2 users) • Condensed Matter Physics (2 users) • Quantum Computational Physics (1 user)
User Code ‘Porting’ • User forwards code to OeRC that operates either on a single node or on a cluster. • OeRC designs a wrapper script that • Creates a scratch directory in which all operations occur • Formats configuration information for each child process from the main configuration • Creates an execution script and ‘zip’ file for remote execution • Submits child processes onto the grid • Waits until all child processes have completed • Collates results and archives temp files etc. • Deposits the scratch directory into the SRB repository • <Can remove scratch directory from Resource Broker if asked> • Code is handed back to the user, both as a working example of a real computational task they want to do and as a possible basis for further code porting by themselves
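The wrapper steps above can be sketched in Python as follows. This is a structural illustration only, not an OxGrid wrapper: the config layout (a `children` list), the file names and the `submit` callable are hypothetical, grid submission is abstracted behind `submit`, and the SRB deposit step is omitted.

```python
import json
import tempfile
import zipfile
from pathlib import Path


def run_wrapper(main_config, submit, scratch_root=None):
    """Fan a task out into child jobs and collate their results.

    `main_config["children"]` holds one parameter dict per child process.
    `submit(bundle_path)` must return a zero-argument callable that blocks
    until that child has finished and then returns its result.
    """
    # 1. Scratch directory in which all operations occur.
    scratch = Path(tempfile.mkdtemp(prefix="oxgrid_", dir=scratch_root))
    handles = []
    for i, params in enumerate(main_config["children"]):
        # 2. Per-child configuration derived from the main configuration.
        cfg = scratch / f"child_{i}.json"
        cfg.write_text(json.dumps(params))
        # 3. Bundle for remote execution.
        bundle = scratch / f"child_{i}.zip"
        with zipfile.ZipFile(bundle, "w") as zf:
            zf.write(cfg, arcname=cfg.name)
        # 4. Submit the child onto the grid.
        handles.append(submit(bundle))
    # 5. Wait until all child processes have completed.
    results = [wait() for wait in handles]
    # 6. Collate results inside the scratch directory.
    (scratch / "results.json").write_text(json.dumps(results))
    return scratch, results
```

Passing `submit` in as a parameter keeps the sketch runnable without a grid; a real wrapper would call the resource broker's job submission script at that point.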
OxGrid, Users • Simulation of the quantum dynamics of correlated electrons in a laser field. "OxGrid & NGS made serious computational power easily available and was crucial for making the simulating algorithm work." Dr Dmitrii Shalashilin (Theoretical Chemistry) • Orbitals and Electron Charge Distribution in Boron Nitride Nanostructures. Dr Amanda Barnard (Materials Science) • Molecular evolution of a large antigen gene family in African trypanosomes. "OxGrid has been key to my research and has allowed me to complete within a few weeks calculations which would have taken months to run on my desktop." Dr Jay Taylor (Statistics)
Problems • Sociological • Getting academics to share resources • IT officers in departments and colleges • Technical • Minimal firewall problems • Information servers • OS Versions • Programming languages • Time
The Future • Improve central service software • RB usage algorithm • Remove central information server • Resource broker querying individual remote systems is actually more efficient • Update Condor-G to latest version to allow seamless transition from Pre-WS to WS based middleware • Design and construct user training courses
The Future, 2 • Develop Windows/Linux Condor pools so that all shared systems can be included • Develop experimental system to harvest spare disk spins so as to ensure complete ROI on shared systems. • Connect MS Windows Cluster system • Package central server modules for public distribution • Already running on systems in Porto and Barcelona universities as well as Monash University • Continue contacting users to expand the user base
Conclusions • Users are already able to log onto the Resource Broker and schedule work onto the NGS and OUCS Condor systems • Working as quickly as possible to engage more users • These users will encourage their local system owners (in departments and colleges) to donate resources! • Need these users to then go out and evangelise.
Thanks • Co-Designer of parts of the system • Jon Wakelin (CeRB) • Oxford Sys Administrators • Ian Atkin (OUCS) • Jon Lockley (OSC) • Steven Young (Ox NGS) • Users • Amanda Barnard (Materials Science) • Dr Jay Taylor (Statistics) • Dr Dmitry Shalashilin (Theoretical Chemistry)
Contact • Email: david.wallom@oerc.ox.ac.uk • Telephone: 01865 283378