Grid2003 Report John Hicks TransPAC HPCC Engineer Indiana University HENP Meeting – Hawaii 25-January-2004
Overview • Introduction to Grid2003 (Grid3) • Experiments • Grid3 Software • Grid3 Monitoring efforts • Supercomputing 2003 • Questions
Introduction to Grid3 • Grid3 is a coordinated project between the US LHC experiments (US ATLAS, US CMS), grid projects (iVDGL, GriPhyN, PPDG), and other experiments (LIGO, SDSS, BTeV) • The purpose of Grid3 is to build a multi-experiment, multi-VO grid environment • Test the infrastructure and services for production and analysis by scientific experiments • Provide a platform for technology demonstrators • Grid3 is supported by the National Science Foundation and the Department of Energy
The Grid3 Project • Grid3 is running at 28 sites • The peak processor count is ~2800 CPUs • There are 6 virtual organizations (VOs): • SDSS • ATLAS • iVDGL • USCMS • LIGO (now the LIGO Scientific Collaboration, LSC) • BTeV • There are currently 11 applications • Resources can be dynamically rolled in and out • Applications are dynamically installed • Grid3 provides a base for a persistent grid
Science Applications • Each VO provides and maintains its applications • Applications do not require privileged access to be installed or to operate • Reserved areas are available for applications, data stage-in/out, and temporary files • Installation location information is published in MDS • Multiple versions of an application may exist • HEP, CS demonstrator, Astrophysics, and Biology applications
Grid3 Experiments: USATLAS The US ATLAS group consists of 31 universities and 3 national laboratories. It is participating in the building and operation of the ATLAS (A Toroidal LHC Apparatus) experiment, to be installed in one of the interaction regions of the Large Hadron Collider (LHC) at CERN in Geneva, Switzerland. http://www.usatlas.bnl.gov
Grid3 Experiments: USCMS USCMS is a collaboration of US scientists participating in the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC) at CERN in Geneva, Switzerland. http://www.uscms.org
Grid3 Experiments: LIGO The Laser Interferometer Gravitational-Wave Observatory (LIGO) is a facility dedicated to the detection of cosmic gravitational waves and the harnessing of these waves for scientific research. It consists of two widely separated installations within the United States, one in Hanford, Washington and the other in Livingston, Louisiana, operated in unison as a single observatory. http://www.ligo.caltech.edu
Grid3 Experiments: SDSS The Sloan Digital Sky Survey (SDSS) is a collaboration of scientists and engineers to map one-quarter of the entire sky, determining the positions and absolute brightnesses of more than 100 million celestial objects. http://www.sdss.org
Grid3 Experiments: BTeV The BTeV experiment is designed to challenge the Standard Model explanation of CP violation, mixing and rare decays of beauty and charm quark states. http://www-btev.fnal.gov
Grid3 Software • Pacman • Packaging and installation software • Main deployment tool for Grid3 • All Grid3 software is packaged with Pacman ("pacmanized") • VDT • The Virtual Data Toolkit (VDT) is a set of grid software that can be easily installed and configured. The goal of the VDT is to make it as easy as possible for users to install grid software • It includes fundamental grid software, Virtual Data software, and utilities
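To show how little a site administrator has to script around a Pacman-based deployment, here is a minimal Python sketch. The install directory, and the cache:package name "iVDGL:Grid3", are assumptions based on the Grid3 installs of the time; the exact pacman invocation should be checked against the Grid3 installation notes.

```python
# Sketch: driving a Pacman install of the Grid3 software stack from a script.
# The install path and the "iVDGL:Grid3" cache:package name are assumptions;
# the "pacman -get" syntax may differ between Pacman versions.
import os
import subprocess

INSTALL_DIR = "/opt/grid3"      # hypothetical install location
PACKAGE = "iVDGL:Grid3"         # hypothetical cache:package name

os.makedirs(INSTALL_DIR, exist_ok=True)
result = subprocess.run(
    ["pacman", "-get", PACKAGE],
    cwd=INSTALL_DIR,            # Pacman installs into the current directory
    capture_output=True,
    text=True,
)
if result.returncode != 0:
    print("install failed:\n" + result.stderr)
```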
Job submission and data transfer • Globus Toolkit • The Globus Toolkit is an open-source toolkit for building grids. Its components can be used independently or together to develop applications, and they help support and manage security, fault detection, information infrastructure, portability, resource management, data management, and communication • MDS - a directory service used to publish configuration information • RLS - the Replica Location Service (RLS) maintains and provides access to mappings from logical names for data items to target names. These target names may be physical locations of the data items, or an RLS entry may map to another level of logical naming for the data item • Condor • Condor is an open-source workload management system for compute-intensive jobs which provides: a job queuing mechanism, scheduling policy, a priority scheme, resource monitoring, and resource management
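As a concrete illustration of the command-line layer these services expose, the Python sketch below runs a trivial job through a GT2-style gatekeeper and copies a file with GridFTP. The gatekeeper and storage host names are placeholders, and a valid grid proxy is assumed to already exist.

```python
# Sketch: exercising a GT2-era gatekeeper and a GridFTP server from Python.
# Host names are placeholders; a valid proxy (grid-proxy-init or
# voms-proxy-init) is assumed to be in place before running this.
import subprocess

GATEKEEPER = "gatekeeper.example.edu/jobmanager-condor"   # placeholder contact string
SRC = "file:///tmp/input.dat"                             # local file
DST = "gsiftp://se.example.edu/grid3/data/input.dat"      # placeholder storage element

# Run a trivial job on the remote gatekeeper through its job manager.
subprocess.run(["globus-job-run", GATEKEEPER, "/bin/hostname"], check=True)

# Move data to the site's storage area with GridFTP.
subprocess.run(["globus-url-copy", SRC, DST], check=True)
```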
Job submission and data transfer (cont.) • VDS • The Virtual Data System (VDS: Chimera, Pegasus, Sphinx, DAGMan) is open-source software that stores the representations of the computational procedures used to generate data, the procedures themselves, and the datasets they produce. This allows the lineage of derived data to be recorded and audited, and derived data to be re-derived automatically on demand, which is important in large collaborations where it may be difficult to determine how particular data were generated.
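The virtual-data idea is easiest to see in a toy example. The sketch below is not the Chimera/Pegasus API; it is a minimal Python model of the same concept: record each derivation (transformation plus inputs), so that any derived dataset can be audited or re-derived on demand. The transformation and dataset names are illustrative only.

```python
# Conceptual sketch of virtual data (not the Chimera/Pegasus API):
# record how each dataset is derived, so its lineage can be audited
# and the dataset re-derived on demand. All names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Derivation:
    transformation: str          # name of the program/procedure
    inputs: list                 # logical names of input datasets
    parameters: dict = field(default_factory=dict)

catalog = {
    "higgs.evgen": Derivation("pythia-gen", [], {"events": 200}),
    "higgs.simul": Derivation("atlsim", ["higgs.evgen"]),
    "higgs.recon": Derivation("athena-rec", ["higgs.simul"]),
}

def lineage(name):
    """Walk the catalog to show how a dataset was produced."""
    d = catalog[name]
    for parent in d.inputs:
        yield from lineage(parent)
    yield f"{name} <- {d.transformation}({', '.join(d.inputs) or 'none'})"

print("\n".join(lineage("higgs.recon")))
```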
User Management • Virtual Organization Membership Service • VOMS (Virtual Organization Membership Service) is open-source software which provides information on a user's membership in a virtual organization (VO). A virtual organization is an abstract entity grouping users, institutions, and resources into the same administrative domain. A user's membership in a VO indicates that they may have permission to use resources at individual institutions • Grid User Management System (GUMS) • Develop a model for distributed user registration • Work with existing VO management tools, including the EDG VOMS servers used in Grid2003 • Help define requirements for new and improved VO tools • Focus on site tools for user management
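At the site level, the outcome of this machinery is ultimately a mapping from a user's certificate DN (and VO) to a local account. The sketch below shows that mapping in the traditional grid-mapfile format, parsed with Python; the DNs and account names are invented for illustration, and VOMS/GUMS provide richer, dynamic versions of the same idea.

```python
# Sketch: the grid-mapfile view of user management, i.e. certificate
# DN -> local account. DNs and accounts here are invented examples;
# VOMS/GUMS generate or replace this mapping dynamically in practice.
import shlex

GRID_MAPFILE = """\
"/DC=org/DC=doegrids/OU=People/CN=Jane Physicist 12345" usatlas01
"/DC=org/DC=doegrids/OU=People/CN=John Astronomer 67890" sdss01
"""

def parse_mapfile(text):
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        dn, account = shlex.split(line)      # handles the quoted DN
        mapping[dn] = account
    return mapping

print(parse_mapfile(GRID_MAPFILE))
```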
Information Services • MDS based • Schemas/information needed: • MDS core, GLUE (Grid Laboratory Uniform Environment) • Grid3: • Site-specific information on Grid3 ($GRID3, $APP, $DATA, $TMP, $TMP_WIN) • VO-specific information on Grid3 (run-time environments needed to run VO-specific applications) • VO- and application-specific information
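Because MDS 2.x is LDAP-based, this information can be pulled with any LDAP client. Below is a hedged Python sketch using python-ldap; the GIIS host is a placeholder, and the port (2135), base DN, object class, and GLUE attribute names are the conventional GT2/GLUE 1.x values, which may differ at a given site.

```python
# Sketch: querying a GT2-era GIIS (MDS over LDAP) for GLUE compute-element info.
# The host is a placeholder; port 2135, the base DN, and the GLUE object
# class / attribute names are conventional values and may vary per deployment.
import ldap  # python-ldap

GIIS_URL = "ldap://giis.example.edu:2135"
BASE_DN = "mds-vo-name=local,o=grid"

conn = ldap.initialize(GIIS_URL)
results = conn.search_s(
    BASE_DN,
    ldap.SCOPE_SUBTREE,
    "(objectClass=GlueCE)",
    ["GlueCEUniqueID", "GlueCEInfoTotalCPUs"],
)
for dn, attrs in results:
    print(dn, attrs)
```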
iVDGL Grid Operations Center (iGOC) • The iGOC is currently located at Indiana University • The iGOC provides 24x7x365 operational support backed by Service Level Agreements (SLAs) • Support includes: • Problem alert, tracking, and trouble-ticket support • Support for the systems which host the Grid Index Information Service (GIIS), VOMS database service, Replica Location Service (RLS), and monitoring tools • Grid3 monitoring is coordinated through the iVDGL operations group and the iGOC
Monitoring/Interactive Analysis services • Ganglia • Open-source tool to collect cluster monitoring information such as CPU and network load and memory and disk usage (see the sketch after this list) • MonALISA • Monitoring tool supporting resource discovery and access to information, and acting as a gateway to other information-gathering systems • ACDC (Advanced Computational Data Center) Job Monitoring System • An application that uses grid-submitted jobs to query the job managers and collect information about jobs. This information is stored in a DB and is available for aggregated queries and browsing • Metrics Data Viewer (MDViewer) • Analyzes and plots information collected by the different monitoring tools, such as the DBs at the iGOC • Distributed Interactive Analysis of Large datasets (DIAL) • Provides a connection between interactive analysis tools (like JAS, ROOT) and data processing applications (like ATHENA)
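As one concrete example of how these tools expose data, Ganglia's gmond daemon serves an XML dump of its metrics over a plain TCP port, conventionally 8649. The Python sketch below connects to a placeholder head node and totals the reported CPUs; the port and the cpu_num metric name follow Ganglia convention and should be checked against the local configuration.

```python
# Sketch: reading cluster metrics from a Ganglia gmond XML dump.
# The host is a placeholder; port 8649 and the "cpu_num" metric name
# are Ganglia conventions and may differ at a given site.
import socket
import xml.etree.ElementTree as ET

GMOND_HOST = "headnode.example.edu"   # placeholder
GMOND_PORT = 8649                     # conventional gmond XML port

chunks = []
with socket.create_connection((GMOND_HOST, GMOND_PORT), timeout=10) as sock:
    while True:
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)

root = ET.fromstring(b"".join(chunks))
total_cpus = sum(
    int(metric.get("VAL"))
    for metric in root.iter("METRIC")
    if metric.get("NAME") == "cpu_num"
)
print("CPUs reported by Ganglia:", total_cpus)
```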
Monitoring services (architecture diagram): information producers (the OS via syscalls and /proc, GRIS, log files, system configuration, the job manager) feed intermediaries (site GIIS, VO GIIS, Ganglia, the ACDC Job DB, the MonALISA repository), which are read by consumers (IS clients, MonALISA clients, MDViewer, user clients, web pages and reports, the site catalog)
Monitoring services (2) (diagram): information providers (MDS GRIS, Ganglia, the ACDC Job DB, the job scheduler, SNMP) publish into the VO GIIS, the MonALISA/ML repository, web servers, and DBs, whose outputs are consumed through web pages, reports, MDViewer, and report agents
Interactive analysis services • Metrics Data Viewer (MDViewer) • Analyzes and plots information collected by the different monitoring tools, such as the DBs at the iGOC • Distributed Interactive Analysis of Large datasets (DIAL) • Provides a connection between interactive analysis tools (like JAS, ROOT) and data processing applications (like ATHENA) • Differentiate the possible information sources for MDViewer (other DBs, log files, …) and provide different GUIs (e.g. a servlet) • Make DIAL grid-enabled and add a dataset catalog to it
Grid3 status tool • Choose the sites from the catalog • Site list, available resources • Availability test • Site-specific information
Site Status tool (workflow diagram): a cron job runs update_db.pl, which reads the host names from igoc.config (built from a user template, template.igoc), runs the gits test script against each host, and stores the individual test results in gits_output.xml; the results for each site are examined to determine an overall pass/fail, which is written to the 'results' table of the grid3 DB; index.php presents a detailed view of the test results, while catalog.php and map.php read the overall results to build the site catalog and status map
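Reduced to its essentials, the same pattern looks roughly like the Python sketch below. The production tool is a Perl script driven by cron; the check function, site list, and table layout here are hypothetical stand-ins.

```python
# Hypothetical sketch of the status-tool pattern: read the site list,
# run a basic availability test per site, and record pass/fail results.
# The real Grid3 tool is Perl (update_db.pl) driven by cron, with results
# rendered by index.php, catalog.php, and map.php.
import sqlite3
import subprocess
import time

SITES = ["gatekeeper1.example.edu", "gatekeeper2.example.edu"]  # from igoc.config in the real tool

def check_site(host):
    """Crude availability test: can we run a trivial job through the gatekeeper?"""
    try:
        probe = subprocess.run(
            ["globus-job-run", host + "/jobmanager-fork", "/bin/true"],
            capture_output=True, timeout=120,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return probe.returncode == 0

db = sqlite3.connect("grid3_status.db")
db.execute("CREATE TABLE IF NOT EXISTS results (site TEXT, ok INTEGER, checked REAL)")
for site in SITES:
    db.execute("INSERT INTO results VALUES (?, ?, ?)", (site, int(check_site(site)), time.time()))
db.commit()
```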
Monitor job execution • Check the submitted/running/held jobs • Verify the increased load • Control the traffic • Check the expected completion time
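For Condor-based sites, a quick way to watch the queue programmatically is to ask condor_q for the JobStatus attribute and tally it, as in the hedged sketch below; status codes 1/2/5 are Condor's idle/running/held values, and the -format option should be verified against the local Condor version.

```python
# Sketch: counting idle/running/held jobs via condor_q.
# Relies on condor_q's -format option and Condor's JobStatus codes
# (1 = idle, 2 = running, 5 = held); verify against the local installation.
import collections
import subprocess

STATUS_NAMES = {1: "idle", 2: "running", 5: "held"}

out = subprocess.run(
    ["condor_q", "-format", "%d\n", "JobStatus"],
    capture_output=True, text=True, check=True,
).stdout

counts = collections.Counter(
    STATUS_NAMES.get(int(line), "other") for line in out.split() if line
)
print(dict(counts))
```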
Grid3 at SC2003 • Users' point of view: • Easy to become a site (well-defined instructions, responsive mailing list for support) • Easy to package an application for the Grid (a well-defined example to follow; automatic installation and submission are provided; a biology group at ANL prepared an application for grid execution in less than one week using Chimera-Pegasus) • ATLAS validated the full chain: event generation, simulation, reconstruction, and analysis (a Higgs event was observed during SC03) • CMS is currently using Grid3 for effective production; more than 40000 CPU*days were used by the VOs in the last 2 months (real jobs, not tests)
Submissions during SC2003 week • Total number of jobs submitted during SC2003 week: ~3400 • Successful (data produced, transferred to the SE, registered in RLS): ~2300 • These are raw statistics; they can be improved by resubmitting the jobs which failed for various reasons • 1. Simulation jobs (submitted / successful): • "Higgs" sample (200 evts): ~1500 / ~1020 • "Top" sample (200 evts): ~1200 / ~600 • 2. Reconstruction jobs (submitted / successful): • "Higgs" sample (200 evts): ~710 / ~675 • These data have been analyzed by David Adams using DIAL. The production chain was validated by the reconstruction of a Higgs trace • Errors varied: some with unknown cause, others due to changes in resource availability, failed transfers or registrations, competition for shared resources (RAM), and certificate issues (DOEgrid/DOEsciencegrid)
Statistics per VO • Met targets: • Data transferred per day > 1 TB • Number of concurrent jobs > 1100 (11/20/03) • Number of users > 100 • Number of different applications > 11 • Number of sites running multiple applications > 10 • Rate of faults/crashes < 1/hour • Operational support load of the full demonstrator < 2 FTEs • More than 45000 CPU*days used
For more info • GGF • http://www.ggf.org/ • Globus • http://www.globus.org/ • Grid2003 • http://www.ivdgl.org/grid2003/ • Monitoring • http://grid.uchicago.edu/metrics/
Questions and discussion John Hicks Indiana University jhicks@iu.edu