Challenges of Analysis for Grid Computing • Charles Loomis (LAL-Orsay) • University College London • November 25, 2005
Contents • Introduction • What is grid computing? • Why is it useful for the LHC? • LCG/EGEE production service • Middleware services • Resources available • Current usage • Supporting analysis on the grid • Development needed to meet expectations • Use of grid in other application domains • Summary • Opinions are those of the author and may not reflect those of the LCG or EGEE projects!
What is the Grid? • “A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high computational capabilities.”The Grid, I. Foster and C. Kesselman, 1998 • Characteristics: • Critical part of the grid is the “middleware”. • Transparent access to all available resources. • Secure access across administrative boundaries. • Enables sharing of resources.
Why the Grid? • User • Reduced (or no) porting to take advantage of remote resources. • More available resources, less time waiting for answers. • Experiment • No reinventing the wheel: reuse of high-level grid services. • Means of coordinating global computing resources. • Institute • More efficient use of hardware. • Reduced outlay for hardware through sharing.
What the Grid is Not! • Unlimited, free resources • Sharing is expected to make more resources available at lower cost, but… sharing is a two-way street. • Users (or their institutes) must still provide resources equivalent to their average consumption. • The Borg • Making resources available in the grid is always voluntary. • Administrators can set policies on who can access those resources, when, with what priority, etc. • Magic • Cannot divine the needs of your applications. • Provides mechanisms for creating generally useful services, but users must still write application-level code or a layer to bind to grid services.
LHC and Grid Computing • The computing needs of the LHC and goals of grid computing are a good match. • Users and resources are globally distributed. • Scale of storage and computing resources requires federations of diverse resources. • 43 PB of mass storage, 37 PB of disk storage • ~10^5 kSI2000 of computing • Needs correspond well to base-level grid services. • Batch-like access to computing resources. • Storage of large data sets. • Metadata management for finding data.
LCG • LHC Computing Grid: • Prepare, deploy, and operate the computing environment to allow the physicists to analyze the data from LHC detectors. • Requires: • Storage and management of large amounts of data. • Easy access to data and associated metadata. • Access to local and remote computing resources. • Stable, reliable system for long periods of time: • Large productions of simulation. • Chaotic access for data analysis. • Goals are similar to those of grid computing.
EGEE • Enabling Grids for E-sciencE: • Provide and manage a European grid infrastructure to support researchers from many disciplines. • LCG and EGEE have similar aims: • LCG: world wide collaboration; one field. • Lifetime: ~20 years. • EGEE: European grid; many fields. • Lifetime: 2+2 years. • EGEE-II: Proposed project to maintain infrastructure. • Lifetime: 2 years. • Division of Labor: • LCG: Provides and operates infrastructure. • EGEE: Re-engineers grid software.
Many Other Projects! • Interoperability between middleware and infrastructures is a real concern.
Job Submission • [Diagram: two sites, each with a Computing Element and a Storage Element, publish their status to the Information System; a Resource Broker mediates between the User Interface and the sites.] • 1. The user submits a job from the User Interface to the Resource Broker. • 2. The broker queries the Information System for suitable resources. • 3. The broker queries the Replica Catalog for the locations of input data. • 4. The broker submits the job to the Computing Element of a matching site. • 5./6. The job output is retrieved back through the broker to the User Interface. • (A minimal submission sketch follows.)
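To make step 1 concrete, here is a minimal sketch of how a user might describe a job in the Job Description Language (JDL) and hand it to the broker with the edg-job-submit command used on LCG-2 User Interfaces. The helper function, the job contents, and the requirement value are illustrative, not the official tooling.

```python
import subprocess
import tempfile

# A minimal JDL description: what to run, where the output goes, and
# a requirement the Resource Broker matches against the Glue schema.
JDL = """
Executable    = "/bin/hostname";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
Requirements  = other.GlueCEPolicyMaxCPUTime > 60;
"""

def submit(jdl_text):
    """Write the JDL to a file and submit it to the broker (step 1)."""
    with tempfile.NamedTemporaryFile("w", suffix=".jdl", delete=False) as f:
        f.write(jdl_text)
        path = f.name
    # edg-job-submit prints a job identifier, later used with
    # edg-job-status and edg-job-get-output (steps 5 and 6).
    result = subprocess.run(["edg-job-submit", path],
                            capture_output=True, text=True)
    return result.stdout

print(submit(JDL))
```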
Security • Public Key Infrastructure • Uses Grid Security Infrastructure (GSI) from Globus. • Authentication (i.e. Who are you?) • Certificate Authorities (CA) • More than 30 CAs. • Covers Europe, North America, and Asia. • Principals: Hosts, People, Services. • Single sign-on: • User generates time-limited proxy. • Proxy used to delegate authority. • Authorization (i.e. What can you do?) • Done by Virtual Organization (VO). • Resources query VO membership server for membership list.
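As an illustration of single sign-on, the sketch below shells out to the standard GSI command-line tools (grid-proxy-init, grid-proxy-info). The 12-hour default lifetime and the one-hour warning threshold are arbitrary choices for the example.

```python
import subprocess

def create_proxy(hours=12):
    """Generate a time-limited proxy from the user's certificate.

    grid-proxy-init prompts once for the private-key passphrase;
    afterwards grid services accept the proxy directly (single
    sign-on) and can use it to act on the user's behalf (delegation).
    """
    subprocess.run(["grid-proxy-init", "-valid", f"{hours}:00"], check=True)

def proxy_seconds_left():
    """Query the remaining proxy lifetime."""
    out = subprocess.run(["grid-proxy-info", "-timeleft"],
                         capture_output=True, text=True, check=True)
    return int(out.stdout.strip())

create_proxy()
if proxy_seconds_left() < 3600:
    print("Proxy expires in under an hour; renew it before submitting.")
```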
Information System • The Information System is the backbone of the grid: • Used as a service index. • Transports status information to the broker. • MDS • LDAP-based system provided by Globus. • Augmented by plain-vanilla LDAP for performance (BDII). • Hierarchy of all grid information. • R-GMA • Consumer/producer model. • Backed by a relational database. • Uses the same information providers as MDS.
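Because the BDII speaks plain LDAP, its contents can be inspected with any LDAP client. A sketch using the python-ldap module: the host name is a placeholder, while the base DN, port, and Glue attributes follow the schema published by MDS/BDII.

```python
import ldap  # python-ldap

# Placeholder endpoint; BDIIs served LDAP on port 2170.
con = ldap.initialize("ldap://bdii.example.org:2170")

# One GlueCE entry is published per computing-element queue.
entries = con.search_s(
    "mds-vo-name=local,o=grid",        # standard BDII base DN
    ldap.SCOPE_SUBTREE,
    "(objectClass=GlueCE)",
    ["GlueCEUniqueID", "GlueCEStateFreeCPUs"],
)

# List queues with free CPUs: roughly the information the broker's
# matchmaking consumes from the same source.
for dn, attrs in entries:
    free = int(attrs.get("GlueCEStateFreeCPUs", [b"0"])[0])
    if free > 0:
        print(attrs["GlueCEUniqueID"][0].decode(), free)
```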
Data Management • Storage Services • GridFTP (gsiftp) servers are being phased out. • Transition to SRM-based services. • Transport protocols: • gsiftp (remote, local access) • rfio, posix (local access) • http, https (limited support) • VO Replica Catalog • Locations of replicated files. • The RB uses these catalogs to find viable sites for jobs. • VO Metadata Catalog • Information about data files on the grid. • Accessed directly by end-users.
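To show how the storage element and the replica catalog fit together, a sketch wrapping the lcg-utils commands of the period (lcg-cr to copy-and-register, lcg-cp to fetch a replica). The VO name, paths, and helper functions are invented for the example.

```python
import subprocess

VO = "myvo"  # hypothetical virtual organization

def copy_and_register(local_file, lfn):
    """Upload a file to a Storage Element and register it in the VO
    replica catalog under a logical file name (LFN); returns the GUID
    printed by lcg-cr, under which further replicas can be made."""
    result = subprocess.run(
        ["lcg-cr", "--vo", VO, "-l", f"lfn:{lfn}", f"file:{local_file}"],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()

def fetch(lfn, local_file):
    """Resolve an LFN through the catalog and copy a replica locally."""
    subprocess.run(
        ["lcg-cp", "--vo", VO, f"lfn:{lfn}", f"file:{local_file}"],
        check=True)

guid = copy_and_register("/tmp/ntuple.root", "/grid/myvo/user/ntuple.root")
print(guid)
```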
LCG/EGEE Production Service • >200 sites • >20k CPUs • >13 PB of storage • http://goc03.grid-support.ac.uk/googlemaps/lcg.html
ATLAS Data Challenge • ATLAS data challenge (Rome, June 2005) • 200 CPU-years used • 380k jobs in total • 1.4M data files, 45 TB • 10 people running production • Total success rate: 52% • https://edms.cern.ch/document/641261/18
WISDOM Data Challenge • WISDOM: Wide In Silico Docking on Malaria • 67 CPU-years in 37 days • 73k jobs in total • 947 GB of data • 5 people running production • Total success rate: 47% • Excluding license failures: 65% • http://wisdom.eu-egee.fr/
Better Reliability • A success rate of ~60% is not adequate. • Painful, but workable, for large productions. • Too frustrating for analysis. • Certification • Avoids landing on a “bad” site, but reduces the available resources. • Software must be made easier to install and configure. • The current ad hoc solution for Site Functionality Tests needs to be generalized and integrated with the matchmaking (see the sketch after this list). • Examples: • SFT (and other batteries of tests) • Application software validation • Site security validation
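One way to picture integrating test results with matchmaking: a purely hypothetical filter that drops any site failing a required battery of tests before jobs are matched. All names and results are invented.

```python
# Invented data: site -> {test battery: passed?}, as a Site
# Functionality Test (SFT) framework might publish it.
sft_results = {
    "site-a.example.org": {"sft": True,  "sw-validation": True,  "security": True},
    "site-b.example.org": {"sft": True,  "sw-validation": False, "security": True},
    "site-c.example.org": {"sft": False, "sw-validation": True,  "security": True},
}

def certified_sites(results, required=("sft", "sw-validation", "security")):
    """Keep only sites passing every required battery of tests.

    Folding this into the broker's matchmaking avoids "bad" sites
    automatically, at the cost of shrinking the matched resource pool.
    """
    return [site for site, tests in results.items()
            if all(tests.get(name, False) for name in required)]

print(certified_sites(sft_results))  # only site-a survives
```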
Chaotic Access • Service challenges and large-scale productions stress the grid, but in a very organized manner. • The large-scale analysis that will appear with real LHC data will be much more chaotic. • Need to test how services will respond to this (a toy driver is sketched below): • Batch systems with thousands of different users. • Storage systems caching large numbers of different files. • Metadata catalogs with large numbers of varied requests. • etc.
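A toy sketch of what such a test driver might generate: many distinct users issuing a randomized mix of requests, in contrast to the uniform pattern of an organized production. Everything here, including the operation names, is hypothetical.

```python
import random

def chaotic_workload(n_users=1000, n_requests=10000, n_files=50000):
    """Yield a randomized request stream: many users, many distinct
    files, and a blend of job, storage, and metadata operations."""
    ops = ("submit-job", "read-file", "metadata-query")
    for _ in range(n_requests):
        yield (f"user{random.randrange(n_users)}",
               random.choice(ops),
               f"file{random.randrange(n_files)}")

# Replaying such a stream against batch systems, storage caches, and
# metadata catalogs probes behavior that organized productions, with
# their few users and regular access patterns, never exercise.
for user, op, target in chaotic_workload(n_requests=5):
    print(user, op, target)
```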
Accessible Grid Software • Grid clients required for all common platforms: • People are more efficient working in their usual environment. • Normal test progression is efficient; don’t interfere with this. • Lightweight services for the laptop/workstation: • Changing analysis software or scripts to work in different environments is error-prone and frustrating. • Allow users to see one environment by running lightweight services on their laptop. • Ideally these would be visible in the grid, so that the user only needs to indicate that jobs need more or different resources.
Access Control Lists • Large experiments are always a balance between collaboration and competition. • Analysis tends to be competitive: • Need to use common resources, • But keep certain things private. • Fine-grained Access Control Lists (ACLs) will need to be supported by nearly all services (a minimal model is sketched below). E.g.: • Analysis jobs: who can kill them, reschedule them, …? • Analysis software: who can read the code? • Produced data: who can read, delete, list, … the data?
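A minimal sketch, with entirely invented names, of the kind of fine-grained ACL record each service would need to consult; real principals would be certificate subjects or VO groups.

```python
# Hypothetical ACL: resource -> {principal: allowed operations}.
acl = {
    "/atlas/higgs-analysis/ntuples": {
        "/atlas/higgs-group": {"read", "list"},
        "user:alice":         {"read", "list", "delete"},
    },
}

def allowed(acl, resource, principal, operation):
    """Check whether a principal may perform an operation on a resource."""
    return operation in acl.get(resource, {}).get(principal, set())

# Group members may read the competitive analysis data but not remove it.
assert allowed(acl, "/atlas/higgs-analysis/ntuples", "user:alice", "delete")
assert not allowed(acl, "/atlas/higgs-analysis/ntuples",
                   "/atlas/higgs-group", "delete")
```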
Priorities • The fair amount of excess capacity on the production service means most jobs are not significantly delayed. • With large-scale analysis and production running in parallel, this will change. • Priorities will be needed: • For computational, storage, and network resources. • Must seamlessly incorporate policies from all parties (see the sketch after this list): • User: e.g. mix of analysis jobs and “service” jobs • Experiment: e.g. critical realignment jobs before analysis jobs • Sites: e.g. local users run with higher priority • Must resolve conflicts between policies. • E.g. high-priority access to CPU but low-priority access to storage.
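A hypothetical sketch of combining priorities from the three policy sources, with a fixed precedence rule (experiment over user, site bonus added on top) standing in for real conflict resolution. All policies and numbers are invented.

```python
# Invented policies; real ones would come from VOMS groups, experiment
# task queues, and site batch-system configuration.
user_policy       = {"analysis": 5, "service": 8}       # per job type
experiment_policy = {"realignment": 10, "analysis": 4}  # per task
site_policy       = {"local-user-bonus": 3}

def effective_priority(job):
    """Combine policies: the experiment's setting wins over the
    user's, and the site's local-user bonus is added on top."""
    prio = experiment_policy.get(job["task"],
                                 user_policy.get(job["type"], 1))
    if job.get("local"):
        prio += site_policy["local-user-bonus"]
    return prio

job = {"task": "analysis", "type": "analysis", "local": True}
print(effective_priority(job))  # -> 4 + 3 = 7
```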
Database Issues • Users will need to store information about their analyses in databases. • Location of produced data files. • Metadata concerning those files. • Common services: • Privacy and namespace issues must be resolved. • Private services: • Federation issues must be resolved.
Communication • Effective communication is vital for analysis. • The grid should incorporate communication tools: • e-mail and mailing lists • chat • phone • video • And facilitate their use. For example: • “single sign-on” for all services • automatic management of lists with VO authorization groups • management of MCU for video
Other Applications • Biomedical applications • Public database usage • Large resource needs • Privacy concerns • Quasi-realtime response • Earth sciences • Widely distributed data • “Complex” metadata searches • Commercial software • Quasi-realtime response • Astrophysics • Sharing data between VOs • Computational Chemistry • Large, parallel algorithms
Summary • Grid technology fits well with the needs and constraints of the high-energy physics community. • LCG/EGEE production service • Large number of globally-distributed resources available. • Successfully used by many experiments for large productions. • Will need to grow roughly fivefold to meet the needs of the LHC. • Supporting analysis is challenging for the grid: • Reliability must increase significantly. • Better availability of the software on different platforms. • Finer-grained control over access to and use of resources. • Incorporation of new services into the grid.