Challenges of Analysis for Grid Computing • Charles Loomis (LAL-Orsay) • University College London • November 25, 2005
Contents • Introduction • What is grid computing? • Why is it useful for the LHC? • LCG/EGEE production service • Middleware services • Resources available • Current usage • Supporting analysis on the grid • Development needed to meet expectations • Use of grid in other application domains • Summary • Opinions are those of the author and may not reflect those of the LCG or EGEE projects!
What is the Grid? • “A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high computational capabilities.” The Grid, I. Foster and C. Kesselman, 1998 • Characteristics: • A critical part of the grid is the “middleware”. • Transparent access to all available resources. • Secure access across administrative boundaries. • Enables sharing of resources.
Why the Grid? • User • Reduced (or no) porting to take advantage of remote resources. • More available resources, less time waiting for answers. • Experiment • No reinventing the wheel: reuse of high-level grid services. • Means of coordinating global computing resources. • Institute • More efficient use of hardware. • Reduced outlay for hardware through sharing.
What the Grid is Not! • Unlimited, free resources • Sharing is expected to make more resources available at lower cost, but… sharing is a two-way street. • Users (or their institutes) must still provide resources equivalent to their average consumption. • The Borg • Making resources available on the grid is always voluntary. • Administrators can set policies on who can access those resources, when, with what priority, etc. • Magic • The grid cannot divine the needs of your applications. • It provides mechanisms for creating generally useful services, but users must still write application-level code or a layer to bind to grid services.
LHC and Grid Computing • The computing needs of the LHC and the goals of grid computing are a good match. • Users and resources are globally distributed. • The scale of storage and computing resources requires federations of diverse resources. • 43 PB of mass storage, 37 PB of disk storage • 10^5 kSI2000 of computing • The needs correspond well to base-level grid services: • Batch-like access to computing resources. • Storage of large data sets. • Metadata management for finding data.
LCG • LHC Computing Grid: • Prepare, deploy, and operate the computing environment to allow the physicists to analyze the data from LHC detectors. • Requires: • Storage and management of large amounts of data. • Easy access to data and associated metadata. • Access to local and remote computing resources. • Stable, reliable system for long periods of time: • Large productions of simulation. • Chaotic access for data analysis. • Goals are similar to those of grid computing.
EGEE • Enabling Grids for E-sciencE: • Provide and manage a European grid infrastructure to support researchers from many disciplines. • LCG and EGEE have similar aims: • LCG: worldwide collaboration; one field. • Lifetime: ~20 years. • EGEE: European grid; many fields. • Lifetime: 2+2 years. • EGEE-II: Proposed project to maintain the infrastructure. • Lifetime: 2 years. • Division of Labor: • LCG: Provides and operates the infrastructure. • EGEE: Re-engineers the grid software.
Many Other Projects! • Interoperability between middleware and infrastructures is a real concern.
Job Submission • [Diagram: the job-submission workflow across two sites, each with a Computing Element, Storage Element, and Information System publishing status. 1. The User Interface submits the job to the Resource Broker; 2.–3. the broker queries the Information System and the Replica Catalogs; 4. the broker submits the job to a matching Computing Element; 5.–6. the output is retrieved back through the broker to the User Interface.]
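To make the workflow above concrete, here is a minimal sketch of driving those steps from Python with the LCG-2 command-line tools of the period. The JDL contents and file names are illustrative, and the exact client commands depend on the middleware release installed on the User Interface.

```python
import subprocess

# A minimal JDL (Job Description Language) file: what to run and
# which files to bring back in the output sandbox.
jdl = """
Executable    = "/bin/hostname";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
"""
with open("hello.jdl", "w") as f:
    f.write(jdl)

# Step 1: the UI submits the job to the Resource Broker, which then
# queries the Information System and Replica Catalogs (steps 2-3)
# and forwards the job to a matching Computing Element (step 4).
out = subprocess.run(["edg-job-submit", "hello.jdl"],
                     capture_output=True, text=True, check=True)

# The job identifier appears as an https://... line in the output.
job_id = next(l for l in out.stdout.splitlines() if l.startswith("https://"))

# Poll the job, then retrieve the output sandbox (steps 5-6).
subprocess.run(["edg-job-status", job_id], check=True)
subprocess.run(["edg-job-get-output", job_id], check=True)
```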
Security • Public Key Infrastructure • Uses Grid Security Infrastructure (GSI) from Globus. • Authentication (i.e. Who are you?) • Certificate Authorities (CA) • More than 30 CAs. • Covers Europe, North America, and Asia. • Principals: Hosts, People, Services. • Single sign-on: • User generates time-limited proxy. • Proxy used to delegate authority. • Authorization (i.e. What can you do?) • Done by Virtual Organization (VO). • Resources query VO membership server for membership list.
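A sketch of the single sign-on step: the user creates a short-lived proxy from a long-lived certificate, and services accept the proxy for delegated actions. The flags shown are those of the Globus and VOMS clients of the time, and the VO name is a placeholder.

```python
import subprocess

# Create a 12-hour proxy certificate from the user's long-lived grid
# certificate (prompts for the private-key pass phrase exactly once).
subprocess.run(["grid-proxy-init", "-valid", "12:00"], check=True)

# Inspect the proxy: subject, strength, and remaining lifetime.
subprocess.run(["grid-proxy-info"], check=True)

# With VOMS, the proxy also carries VO membership and role attributes
# that resources use for authorization decisions.
subprocess.run(["voms-proxy-init", "-voms", "atlas"], check=True)
```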
Information System • The Information System is the backbone of the grid: • Used as a service index. • Transports status information to the broker. • MDS • LDAP-based system provided by Globus. • Augmented by plain-vanilla LDAP for performance (BDII). • Hierarchy of all grid information. • R-GMA • Consumer/producer model. • Backed by a relational database. • Uses the same information providers as MDS.
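Because the BDII is plain LDAP, any LDAP client can query it. A hedged sketch, assuming the python-ldap package and a hypothetical top-level BDII endpoint; the object classes and attributes are from the GLUE schema in use at the time.

```python
import ldap  # python-ldap

# Connect to a (hypothetical) top-level BDII; 2170 was the usual port.
conn = ldap.initialize("ldap://bdii.example.org:2170")

# Ask for all Computing Elements and their free CPU counts,
# using attributes from the GLUE schema.
results = conn.search_s(
    "o=grid",                      # base of the LCG information tree
    ldap.SCOPE_SUBTREE,
    "(objectClass=GlueCE)",
    ["GlueCEUniqueID", "GlueCEStateFreeCPUs"],
)

for dn, attrs in results:
    ce = attrs.get("GlueCEUniqueID", [b"?"])[0].decode()
    free = attrs.get("GlueCEStateFreeCPUs", [b"0"])[0].decode()
    print(f"{ce}: {free} free CPUs")
```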
Data Management • Storage Services • GridFTP (gsiftp) servers are being phased out. • Transition to SRM-based services. • Transport protocols: • gsiftp (remote, local access) • rfio, posix (local access) • http, https (limited support) • VO Replica Catalog • Locations of replicated files. • The Resource Broker uses these catalogs to find viable sites for jobs. • VO Metadata Catalog • Information about data files on the grid. • Accessed directly by end-users.
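As a usage sketch, the lcg-utils clients of the period wrapped these services: copy-and-register a file on a Storage Element, then list its replicas. The VO name, SE host, and logical file name below are placeholders.

```python
import subprocess

VO = "myvo"  # placeholder VO name

# Copy a local file to a Storage Element and register it in the
# VO replica catalog under a logical file name (LFN).
subprocess.run([
    "lcg-cr", "--vo", VO,
    "-d", "se.example.org",                    # destination SE (placeholder)
    "-l", "lfn:/grid/myvo/analysis/hits.root",
    "file:///tmp/hits.root",
], check=True)

# List the replicas registered for that LFN; the Resource Broker
# consults the same catalog to steer jobs towards sites with the data.
subprocess.run([
    "lcg-lr", "--vo", VO,
    "lfn:/grid/myvo/analysis/hits.root",
], check=True)
```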
LCG/EGEE Production Service • > 200 sites • > 20,000 CPUs • > 13 PB of storage • http://goc03.grid-support.ac.uk/googlemaps/lcg.html
ATLAS Data Challenge • ATLAS data challenge (Rome, June 2005) • 200 CPU-years used • 380k jobs in total • 1.4M data files, 45 TB • 10 people running production • Total success rate: 52% • https://edms.cern.ch/document/641261/18
WISDOM Data Challenge • WISDOM: Wide In Silico Docking on Malaria • 67 CPU-years in 37 days • 73k jobs in total • 947 GB of data • 5 people running production • Total success rate: 47% • Excluding license failures: 65% • http://wisdom.eu-egee.fr/
Better Reliability • A success rate of ~60% is not adequate. • Painful but workable for large productions. • Too frustrating for analysis. • Certification • Helps avoid landing on a “bad” site, but reduces the pool of available resources. • Software must be made easier to install and configure. • The current ad hoc solution for Site Functionality Tests needs to be generalized and integrated with the matchmaking (see the sketch below). • Examples: • SFT (and other batteries of tests) • Application software validation • Site security validation
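One way such integration could look, sketched with entirely hypothetical data: drop candidate sites whose latest Site Functionality Test failed before handing the list to the broker.

```python
# Hypothetical SFT results: site -> did the latest test battery pass?
sft_results = {
    "ce01.siteA.org": True,
    "ce02.siteB.org": False,   # failed software-installation test
    "ce03.siteC.org": True,
}

def certified(sites, results):
    """Keep only sites whose latest Site Functionality Test passed.

    Filtering shrinks the resource pool, which is the trade-off noted
    above: fewer failures, but fewer candidate sites.
    """
    return [s for s in sites if results.get(s, False)]

candidates = certified(list(sft_results), sft_results)
print(candidates)   # ['ce01.siteA.org', 'ce03.siteC.org']
```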
Chaotic Access • Service challenges and large scale productions stress the grid, but in a very organized manner. • The large-scale analysis which will appear with real LHC data will be much more chaotic. • Need to test how services will respond to this: • Batch systems with thousands of different users. • Storage systems caching large numbers of different files. • Metadata catalogs with large numbers of varied requests. • etc.
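A toy sketch of what generating such chaotic load might look like: many simulated users issuing a varied mix of requests concurrently, rather than one coordinated production stream. The actions and scale here are invented for illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor

ACTIONS = ["submit_job", "stage_file", "query_metadata"]

def simulated_user(uid):
    # Each simulated user issues a random mix of requests, mimicking
    # analysis traffic rather than an orderly production.
    return [(uid, random.choice(ACTIONS)) for _ in range(10)]

# 1000 users, 50 at a time: the resulting request stream would be
# replayed against the batch, storage, and catalog services under test.
with ThreadPoolExecutor(max_workers=50) as pool:
    load = [r for batch in pool.map(simulated_user, range(1000)) for r in batch]
print(len(load), "requests generated")
```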
Accessible Grid Software • Grid clients required for all common platforms: • People are more efficient working in their usual environment. • Normal test progression is efficient; don’t interfere with this. • Lightweight services for the laptop/workstation: • Changing analysis software or scripts to work in different environments is error-prone and frustrating. • Allow users to see one environment by running lightweight services on their laptop. • Ideally these would be visible in the grid, so that the user only needs to indicate that jobs need more or different resources.
Access Control Lists • Large experiments are always a balance between collaboration and competition. • Analysis tends to be competitive: • Need to use common resources, • But keep certain things private. • Fine-grained Access Control Lists (ACLs) will need to be supported by nearly all services. E.g. • Analysis jobs: who can kill them, reschedule them, …? • Analysis software: who can read the code? • Produced data: who can read, delete, list, … the data?
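A minimal sketch of what such fine-grained ACLs could look like at the service level; this is an illustrative model, not an existing API. Each resource carries per-operation lists of authorized principals (certificate subjects or VO groups).

```python
# Illustrative ACL model: resource -> operation -> allowed principals.
acls = {
    "lfn:/grid/myvo/analysis/hits.root": {
        "read":   ["/myvo/higgs-group"],
        "delete": ["/C=FR/O=LAL/CN=Data Manager"],
        "list":   ["/myvo"],
    },
}

def allowed(resource, operation, principals):
    """True if any of the caller's identities or groups is granted
    the requested operation on the resource."""
    granted = acls.get(resource, {}).get(operation, [])
    return any(p in granted for p in principals)

# A member of the Higgs analysis group may read but not delete the file.
me = ["/C=FR/O=LAL/CN=Jane Analyst", "/myvo/higgs-group"]
print(allowed("lfn:/grid/myvo/analysis/hits.root", "read", me))    # True
print(allowed("lfn:/grid/myvo/analysis/hits.root", "delete", me))  # False
```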
Priorities • The fair amount of excess capacity on the production service means most jobs are not significantly delayed. • With large-scale analysis and production running in parallel, this will change. • Priorities will be needed: • For computational, storage, and network resources. • They must seamlessly incorporate policies from: • Users: e.g. a mix of analysis jobs and “service” jobs • Experiments: e.g. critical realignment jobs before analysis jobs • Sites: e.g. local users run with higher priority • Conflicts between policies must be resolved. • E.g. high-priority access to CPU, but low-priority access to storage.
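A sketch of one possible conflict-resolution rule among user, experiment, and site policies: take each resource's effective priority from the most restrictive (lowest) applicable policy. The policy values are invented for illustration.

```python
# Invented policy tables: priority per resource class, higher = more urgent.
user_policy       = {"cpu": 8, "storage": 5}   # analysis job
experiment_policy = {"cpu": 9, "storage": 2}   # realignment before analysis
site_policy       = {"cpu": 8, "storage": 8}   # no strong local preference

def effective_priority(resource, *policies):
    # One simple, conservative rule: a request is only as urgent as the
    # least permissive policy allows, so conflicts resolve downwards.
    return min(p.get(resource, 0) for p in policies)

for res in ("cpu", "storage"):
    print(res, effective_priority(res, user_policy, experiment_policy, site_policy))
# cpu 8, storage 2: high-priority access to CPU does not imply
# high-priority access to storage.
```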
Database Issues • Users will need to store information about their analyses in databases. • Location of produced data files. • Metadata concerning those files. • Common services: • Privacy and namespace issues must be resolved. • Private services: • Federation issues must be resolved.
Communication • Effective communication is vital for analysis. • The grid should incorporate communication tools: • e-mail and mailing lists • chat • phone • video • And facilitate their use. For example: • “single sign-on” for all services • automatic management of lists with VO authorization groups • management of MCU for video
Other Applications • Biomedical applications • Public database usage • Large resource needs • Privacy concerns • Quasi-realtime response • Earth sciences • Widely distributed data • “Complex” metadata searches • Commercial software • Quasi-realtime response • Astrophysics • Sharing data between VOs • Computational Chemistry • Large, parallel algorithms
Summary • Grid technology fits well with the needs and constraints of the high-energy physics community. • LCG/EGEE production service • Large number of globally distributed resources available. • Successfully used by many experiments for large productions. • Will need to grow roughly five-fold to meet the needs of the LHC. • Supporting analysis is challenging for the grid: • Reliability must increase significantly. • Better availability of the software on different platforms. • Finer-grained control over access to and use of resources. • Incorporation of new services into the grid.