CHEP04 Track4: Distributed Computing Services Summary of the parallel session “Distributed Computing Services” Massimo Lamanna / CERN, October 1st 2004
Parallel sessions • Monday • 12 contributions • Main focus: data management • Wednesday • 8 contributions • Main focus: middleware • Wednesday “Special Security Session” • 10 contributions • Running in parallel with the “middleware” track • Summary transparencies from Andrew McNab • Thursday • 12 contributions • Main focus: monitor and workload
Monday • [142] Don Quijote - Data Management for the ATLAS Automatic Production System by Mr. BRANCO, Miguel • [190] Managed Data Storage and Data Access Services for Data Grids by ERNST, Michael • [204] FroNtier: High Performance Database Access Using Standard Web Components in a Scalable Multi-tier Architecture by PATERNO, Marc • [218] On Distributed Database Deployment for the LHC Experiments by DUELLMANN, Dirk • [253] Experiences with Data Indexing services supported by the NorduGrid middleware by SMIRNOVA, Oxana • [278] The Evolution of Data Management in LCG-2 by Mr. BAUD, Jean-Philippe • [328] The Next Generation Root File Server by Mr. HANUSHEVSKY, Andrew • [334] Production mode Data-Replication framework in STAR using the HRM Grid by Dr. HJORT, Eric • [345] Storage Resource Managers at Brookhaven by RIND, Ofer • [392] File-Metadata Management System for the LHCb Experiment by Mr. CIOFFI, Carmine • [414] Data Management in EGEE by NIENARTOWICZ, Krzysztof • [460] SAMGrid Integration of SRMs by Dr. KENNEDY, Robert
Wednesday • [383] Experience with POOL from the first three Data Challenges using the LCG by GIRONE, Maria • [247] Middleware for the next generation Grid infrastructure by LAURE, Erwin • [184] The Clarens Grid-enabled Web Services Framework: Services and Implementation by STEENBERG, Conrad • [305] First Experiences with the EGEE Middleware by KOBLITZ, Birger • [430] Global Distributed Parallel Analysis using PROOF and AliEn by RADEMAKERS, Fons • [162] Software agents in data and workflow management by BARRASS, T A • [500] Housing Metadata for the Common Physicist Using a Relational Database by ST. DENIS, Richard • [196] Lattice QCD Data and Metadata Archives at Fermilab and the International Lattice Data Grid by NEILSEN, Eric • [536] Huge Memory systems for data-intensive science by MOUNT, Richard
Wednesday (Special Security Session) • [224] Evaluation of Grid Security Solutions using Common Criteria by NAQVI, SYED • [463] Mis-use Cases for the Grid by SKOW, Dane • [164] Using Nagios for intrusion detection by CARDENAS MONTES, Miguel • [189] Secure Grid Data Management Technologies in ATLAS by BRANCO, Miguel • [249] The GridSite authorization system by MCNAB, Andrew • [439] Building Global HEP Systems on Kerberos by CRAWFORD, Matt • [104] Authentication/Security services in the ROOT framework by GANIS, Gerardo • [122] A Scalable Grid User Management System for Large Virtual Organizations by CARCASSI, Gabriele • [191] Virtual Organization Membership Service eXtension (VOX) by FISK, Ian • [194] G-PBox: a policy framework for Grid environments by RUBINI, Gianluca
Thursday • [69] Resource Predictors in HEP Applications by HUTH, John • [318] The STAR Unified Meta-Scheduler project, a front end around evolving technologies for user analysis and data production by LAURET, Jerome • [321] SPHINX: A Scheduling Middleware for Data Intensive Applications on a Grid by CAVANAUGH, Richard • [417] Information and Monitoring Services within a Grid Environment by WILSON, Antony • [420] Practical approaches to Grid workload and resource management in the EGEE project by SGARAVATTO, Massimo • [490] Grid2003 Monitoring, Metrics, and Grid Cataloging System by MAMBELLI, Marco; KIM, Bockjoo • [89] MonALISA: An Agent Based, Dynamic Service System to Monitor, Control and Optimize Grid based Applications by LEGRAND, Iosif • [274] Design and Implementation of a Notification Model for Grid Monitoring Events by DE BORTOLI, Natascia • [338] BaBar Book Keeping project - a distributed meta-data catalog of the BaBar event store by SMITH, Douglas • [388] A Lightweight Monitoring and Accounting System for LHCb DC04 Production by SANCHEZ GARCIA, Manuel • [393] Development and use of MonALISA high level monitoring services for Meta-Schedulers by EFSTATHIADIS, Efstratios • [377] DIRAC - The Distributed MC Production and Analysis for LHCb by TSAREGORODTSEV, Andrei
Structure of the talk • Security • “Data Management” • “Middleware” • “Monitor and Workload” • Conclusions and outlook • I would like to thank the track co-coordinator, Ruth Pordes, and all the session chairs (Conrad Steenberg, Robert Kennedy, Andrew McNab, Oxana Smirnova) • Disclaimer: it would not have been useful simply to list *all* the talks. This summary reflects my personal view (and biases), trying to extract the key points from all the material shown and discussed in the “Distributed Computing Services” parallel session
Security: Themes • Pre-Grid services like ssh on Grid machines are already under attack! • People are developing tools to look for attacks. • Grids still need to interface to security used by pre-Grid systems like Kerberos, AFS and WWW • We are developing tools to manage 1000s of users in big experiments. • Application-level software developers are starting to interface to Grid security systems. Summary from Andrew McNab
Security: Technologies • Feeding the output of local security-detection software (Tripwire, etc.) into Nagios was presented. • VOMS, VOMRS/VOX, GUMS all discussed for distributing authorization information about users • GridSite provides Grid extensions to Apache. • Kerberos sites still need to be supported/included. • The implications of a Web Services future are on everyone's mind... Summary from Andrew McNab
Security: For non-Security people! • Developers: • Most attacks possible because of poor software quality (buffer overflows etc) • Some evidence that stolen Grid credentials have been tried out also: they will go after middleware bugs next • Site administrators: • Local exploits are now really important, not just network exploits (Grids have 1000s of “local” users.) • You will need monitoring to differentiate between “Grid worms” and “Grid jobs” (they look the same!) Summary from Andrew McNab
Data management: Themes • At least three threads: • New tools/services becoming reality • Approaching maturity • Experiments confronted with the existing (and evolving) data management layer • Comparison with the talks presented at CHEP03 is very instructive: it looks like a lot has been achieved in this field in the last year and a half!
Data Management: New tools/services becoming reality • Impressive demonstration of the maturity level reached by POOL together with the 3 LHC experiments • 400+ TB, the same order as previous exercises using Objectivity/DB • Key ingredients: experience plus the experiments' requirements and pressure • Interplay of database technology and native grid services for data distribution and replication • FroNtier (FNAL, running experiments) • Decouple development and user data access • Scalable • Many commodity tools and techniques (Squid caching; a toy sketch follows below) • Simple to deploy • LCG 3D (CERN, LHC experiments) • Sustainable infrastructure • SAM, BaBar DM • Experience with running experiments • gLite Data Management • New technology and experience • Convergence foreseen and envisageable
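The FroNtier point above rests on a simple idea: read-only database queries go over plain HTTP, so a commodity cache such as Squid can absorb repeated requests before they ever reach the database tier. The sketch below illustrates that pattern only; the server name, proxy address and URL layout are invented for illustration and are not the real FroNtier interface.

```python
# Conceptual sketch: database access over HTTP through a caching proxy.
# All endpoints here are hypothetical placeholders.
import urllib.parse
import urllib.request

SQUID_PROXY = "http://squid.example.org:3128"                 # assumed local cache
FRONTIER_SERVER = "http://frontier.example.org:8000/query"    # assumed query endpoint

def cached_query(sql: str) -> bytes:
    """Send a read-only query through the caching proxy and return the raw payload."""
    url = FRONTIER_SERVER + "?" + urllib.parse.urlencode({"q": sql})
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": SQUID_PROXY})
    )
    with opener.open(url, timeout=30) as response:
        # Identical queries are served from the Squid cache instead of the
        # database, which is what makes the multi-tier setup scale.
        return response.read()

if __name__ == "__main__":
    payload = cached_query("SELECT * FROM calibration WHERE run = 12345")
    print(len(payload), "bytes received")
```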
xrootd • Rich but efficient server protocol • Combines file serving with P2P elements • Allows client hints for improved performance • Pre-read, prepare, client access & processing hints • Multiplexed request stream • Multiple parallel requests allowed per client (illustrated in the sketch below) • An extensible base architecture • Heavily multi-threaded • Clients get dedicated threads whenever possible • Extensive use of OS I/O features • Async I/O, device polling, etc. • Load-adaptive reconfiguration • Key element in the proposal for Huge-Memory Systems for Data-Intensive Science (R. Mount)
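To make the "multiplexed request stream" bullet concrete, here is a small sketch of the client-side idea: several read requests are in flight at once over the same connection and are handled as their responses arrive, rather than strictly in order. This is a generic asyncio illustration of the concept, not the xrootd wire protocol or client API.

```python
# Conceptual illustration of a multiplexed request stream (not xrootd itself):
# one client issues several reads in parallel and consumes responses as they complete.
import asyncio
import random

async def read_chunk(offset: int, size: int) -> bytes:
    """Stand-in for an asynchronous remote read; latency is simulated."""
    await asyncio.sleep(random.uniform(0.01, 0.1))   # pretend network/disk latency
    return bytes(size)                               # dummy payload of `size` bytes

async def main() -> None:
    chunk_size = 64 * 1024
    # Issue ten reads in parallel and process whichever response arrives first.
    requests = [read_chunk(offset, chunk_size)
                for offset in range(0, 10 * chunk_size, chunk_size)]
    for finished in asyncio.as_completed(requests):
        chunk = await finished
        print(f"got {len(chunk)} bytes")

asyncio.run(main())
```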
Data management: Approaching maturity… • SRM implementations • Not trivial • But being demonstrated • Great news!
Experiments confronted with the existing (and evolving) data management layer • Cf. most of the plenary talks, e.g. A. Boehnlein, P. Elmer, D. Stickland, N. Katayama, I. Bird, ... • Track 4 talks: • Grids, not Grid • Heterogeneity of the grid resources (cf. ATLAS/Don Quijote) • Independent evolution and experience (NorduGrid) • Production mode (experiments' data challenges) • Evolution of LCG-2 Data Management
Middleware: Themes • New generation of middleware becoming available • Some commonality in technology • Service Oriented Architecture; Web Services • gLite (EGEE project) • Web services • GAE • Interactivity as a goal, as opposed to “production” mode • RPC-based web service framework (Clarens) • Emphasis on discovery services and high-level services (orchestration) • Compatibility with gLite to be explored • DIRAC • XML-RPC: no need for WSDL... (a minimal sketch follows below) • Instant-messaging protocol for inter-service/agent communication • Connection-based; outbound connectivity only • Interacts with other experiment-specific services (cf. “File-Metadata Management System for the LHCb Experiment”) • Agent-based systems • DIRAC, GAE, PhEDEx • (First) feedback coming (developments in the experiments, ARDA)
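The DIRAC bullets above (XML-RPC without WSDL, agents with outbound-only connectivity) can be illustrated with Python's standard XML-RPC modules. The method name, port and payload below are invented placeholders; this is a sketch of the style, not the actual DIRAC interface.

```python
# Minimal sketch of an XML-RPC service plus a pulling agent, in the spirit of
# the "no WSDL needed" RPC approach. Names and port are hypothetical.
from xmlrpc.server import SimpleXMLRPCServer
import threading
import xmlrpc.client

def request_job(resource_profile: dict) -> dict:
    """Service side: hand out a (dummy) job matching the agent's profile."""
    return {"job_id": 42, "executable": "simulate.sh",
            "site": resource_profile.get("site")}

server = SimpleXMLRPCServer(("localhost", 8800), logRequests=False, allow_none=True)
server.register_function(request_job, "request_job")
threading.Thread(target=server.serve_forever, daemon=True).start()

# Agent side: only outbound connectivity is needed; the agent pulls work
# whenever its resource is free.
proxy = xmlrpc.client.ServerProxy("http://localhost:8800", allow_none=True)
job = proxy.request_job({"site": "LCG.CERN.ch", "free_slots": 4})
print("received job:", job)
```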
Middleware • gLite middleware • Lightweight (existing) services • Easily and quickly deployable • Use existing services where possible as a basis for re-engineering • Interoperability • Allow for multiple implementations • Performance/scalability & resilience/fault tolerance • Large-scale deployment and continuous usage • Portable • Being built on Scientific Linux and Windows • Co-existence with deployed infrastructure • Reduce requirements on participating sites • Flexible service deployment • Multiple services running on the same physical machine (if possible) • Co-existence with LCG-2 and OSG (US) is essential for the EGEE Grid service • Service-oriented approach • Follow WSRF standardization • No mature WSRF implementations exist to date, so start with plain WS • WSRF compliance is not an immediate goal, but we follow the WSRF evolution • WS-I compliance is important
Middleware: Themes • The dynamics of the evolution of the middleware is very complex • Experience injected into the projects • Previous/other projects • Experiment contributions • Essential inputs: cf. the CMS TMDB/PhEDEx presentation • Close feedback loop • ARDA in the case of gLite • Users/Data Challenges • Large(r) user community being exposed
PROOF • Interactive analysis + parallelism (ROOT) • PROOF on the Grid (2003: demo with AliEn; gLite by end 2004) • PROOF Analysis interface (Portal)
Monitoring systems • Many different monitoring systems in use (Ganglia, MDS, GridICE, MonALISA, R-GMA, the LHCb DIRAC system, ...) • In different combinations on different systems (LCG-2, Grid2003 GridCat, BNL SUMS, etc.) • Positive point: hybrid systems are possible! • Essential to have “global” views (planning, scheduling, ...) • Different systems are capable of coexisting (Grid3 uses 3 of them) • MonALISA very widely used • Used in a very large and diversified set of systems (computing fabric, network performance tests, applications like VRVS, resource brokering in STAR SUMS, security, ...). 160+ sites. • The situation is getting clearer at the system level. Less clear (at least to me) for application monitoring.
Workload management systems • BNL STAR SUMS system • Emphasis on stability • Running experiment! • Lots of users! • A front end to local and distributed RMS, acting as a client to multiple, heterogeneous RMS • A flexible, open architecture: an object-oriented framework with plug-and-play features • A good environment for further development • Standards (such as a high-level JDL) • Scalability of other components (MonALISA work, immediate use) • Used in STAR for real physics (usage and publication list) • Used for distributed / Grid simulation job submission • Used successfully by other experiments
Workload Management System • EGEE gLite WMS is being released • Evolution of the EDG WMS • Provides both “push” and “pull” modes
Optimisation and accounting • Similar concepts at work in different activities • “Phenomenological” estimates based on a few parameters (J. Huth et al.) • Parametrize the required application time as T = g * (a + b * n_events); g contains the CPU power and the compilation flags (optimised/debug), linear with event size (a toy predictor is sketched below) • Warning: in a multi-VO, multi-user environment the situation could be much more complicated • BNL STAR SUMS: minimize the (estimated) transit time • Observe the TT and act accordingly (uses MonALISA) • Up to now OK only for systems out of the saturation zone • Sphinx project (GAE) • EGEE gLite • Inside the WMS • It looks like we are approaching the phase where “Grid Accounting” will really be distinguished from “Grid Monitoring” and static resource allocation • EGEE gLite (WMS talk) • Relatively easier for a single-VO system • LHCb DIRAC accounting system (still a reporting system coupled to the monitoring system)
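The phenomenological estimate above is just T = g * (a + b * n_events). A minimal sketch follows; the coefficient values are made up for illustration, and in practice a and b would be fitted from past runs while g is calibrated per CPU and build.

```python
# Sketch of the run-time parametrization T = g * (a + b * n_events).
# The default coefficients are assumed values, not measured ones.
def estimated_runtime(n_events: int,
                      a: float = 30.0,   # fixed per-job overhead in seconds (assumed)
                      b: float = 2.5,    # seconds per event (assumed)
                      g: float = 1.2) -> float:
    """Return the predicted wall-clock time in seconds for a job."""
    # g folds in the CPU power and compilation flags (optimised vs debug build).
    return g * (a + b * n_events)

print(estimated_runtime(10_000))   # e.g. a 10k-event job
```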
Sphinx Measurements
Pull approach: DIRAC workload management • Realizes the PULL scheduling paradigm • Agents request jobs whenever the corresponding resource is free • Uses Condor ClassAds and the Matchmaker to find jobs suited to the resource profile (a toy matching sketch follows below) • Agents steer job execution on site • Jobs report their state and environment to the central Job Monitoring service • Averaged 420 ms match time over 60,000 jobs • Queued jobs grouped by categories • Matches performed by category • Typically 1,000 to 20,000 jobs queued
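The sketch below shows the pull-mode idea only: queued jobs are grouped by category and an agent's request is matched against its resource profile. It deliberately uses plain Python dictionaries rather than Condor ClassAds, and all requirement names and numbers are invented.

```python
# Toy pull-mode matchmaking: not ClassAds, just the shape of the idea.
from collections import defaultdict

# Queued jobs grouped by category, as described above.
queues = defaultdict(list)
queues["simulation"].append({"job_id": 1, "requirements": {"os": "SLC3", "min_memory_mb": 512}})
queues["analysis"].append({"job_id": 2, "requirements": {"os": "SLC3", "min_memory_mb": 2048}})

def matches(requirements: dict, resource: dict) -> bool:
    """Check a job's requirements against an agent's resource profile."""
    return (requirements["os"] == resource["os"]
            and requirements["min_memory_mb"] <= resource["memory_mb"])

def pull_job(resource: dict, category: str):
    """Agent side: take the first queued job in a category that fits this resource."""
    for job in queues[category]:
        if matches(job["requirements"], resource):
            queues[category].remove(job)
            return job
    return None

agent_resource = {"os": "SLC3", "memory_mb": 1024}
print(pull_job(agent_resource, "simulation"))   # -> job 1
print(pull_job(agent_resource, "analysis"))     # -> None (not enough memory)
```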
Metadata • Considerable efforts going on in the community • The SAM-Grid Team and the Metadata Working Group, EGEE gLite, LCG ARDA • Material from running experiments (notably CDF and BaBar), HEPCAL, the LHC experiments, ... • Cf. also: “Lattice QCD Data and Metadata Archives at Fermilab and the International Lattice Data Grid”, Neilsen/Simone • On the border between “generic” middleware and “experiment-specific” software • There is probably a need for a generic layer (a sketch of what such an interface might look like follows below) • It will emerge by distilling the experience on the “experiment-specific” side together with technology considerations
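As a way to picture the "generic layer" hypothesis, here is a deliberately minimal interface: attach key/value metadata to logical file names and query on it. The class, method names and LFNs are invented for illustration and do not correspond to any of the catalogues mentioned above.

```python
# Hypothetical minimal metadata-catalogue interface, for illustration only.
class MetadataCatalogue:
    def __init__(self):
        self._entries = {}    # logical file name -> metadata dict

    def set_metadata(self, lfn: str, **metadata) -> None:
        """Attach or update key/value metadata for a logical file name."""
        self._entries.setdefault(lfn, {}).update(metadata)

    def query(self, **criteria) -> list:
        """Return the LFNs whose metadata matches all the given key/value pairs."""
        return [lfn for lfn, md in self._entries.items()
                if all(md.get(k) == v for k, v in criteria.items())]

catalogue = MetadataCatalogue()
catalogue.set_metadata("lfn:/lhcb/dc04/run1234.dst", run=1234, datatype="DST", year=2004)
catalogue.set_metadata("lfn:/lhcb/dc04/run1235.dst", run=1235, datatype="DST", year=2004)
print(catalogue.query(datatype="DST", year=2004))
```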
Metadata: BaBar system • Mirroring system in place (heterogeneous technologies in use: MySQL and Oracle) • Publish/synch system developed in house • Distribution down to users' laptops
Conclusions and outlook • Experience is still (and will be) more effective than pure technology • Look at the running experiments! • See the powerful boost from the large data challenges! • SRM is the candidate to be a first high-level middleware service • Good news! • Why only SRM? • What about, for example, other data management tools? • Metadata catalogues? • ... • The fast evolution shows the vitality and enthusiasm of the HEP community • How can we use it to progress even faster? • What should we do to converge on other high-level services? • gLite is a unique opportunity: we should not miss it • Grid as a monoculture is not realistic • A recipe? Some ingredients, at least... • The physics and the physicists! • Analysis is still somewhat missing. More and broader experience needed • Diverse contributions and technology choices, but convergence is possible!