
CHEP04 Track4: Distributed Computing Services


Presentation Transcript


  1. CHEP04 Track4: Distributed Computing Services Summary of the parallel session “Distributed Computing Services” Massimo Lamanna / CERN, October 1st 2004

  2. Parallel sessions • Monday • 12 contributions • Main focus: data management • Wednesday • 8 contributions • Main focus: middleware • Wednesday “Special Security Session” • 10 contributions • Running in parallel with the “middleware” track • Summary transparencies from Andrew McNab • Thursday • 12 contributions • Main focus: monitor and workload

  3. Monday • [142] Don Quijote - Data Management for the ATLAS Automatic Production System by Mr. BRANCO, Miguel • [190] Managed Data Storage and Data Access Services for Data Grids by ERNST, Michael • [204] FroNtier: High Performance Database Access Using Standard Web Components in a Scalable Multi-tier Architecture by PATERNO, Marc • [218] On Distributed Database Deployment for the LHC Experiments by DUELLMANN, Dirk • [253] Experiences with Data Indexing services supported by the NorduGrid middleware by SMIRNOVA, Oxana • [278] The Evolution of Data Management in LCG-2 by Mr. BAUD, Jean-Philippe • [328] The Next Generation Root File Server by Mr. HANUSHEVSKY, Andrew • [334] Production mode Data-Replication framework in STAR using the HRM Grid by Dr. HJORT, Eric • [345] Storage Resource Managers at Brookhaven by RIND, Ofer • [392] File-Metadata Management System for the LHCb Experiment by Mr. CIOFFI, Carmine • [414] Data Management in EGEE by NIENARTOWICZ, Krzysztof • [460] SAMGrid Integration of SRMs by Dr. KENNEDY, Robert

  4. Wednesday • [383] Experience with POOL from the first three Data Challenges using the LCG by GIRONE, Maria • [247] Middleware for the next generation Grid infrastructure by LAURE, Erwin • [184] The Clarens Grid-enabled Web Services Framework: Services and Implementation by STEENBERG, Conrad • [305] First Experiences with the EGEE Middleware by KOBLITZ, Birger • [430] Global Distributed Parallel Analysis using PROOF and AliEn by RADEMAKERS, Fons • [162] Software agents in data and workflow management by BARRASS, T A • [500] Housing Metadata for the Common Physicist Using a Relational Database by ST. DENIS, Richard • [196] Lattice QCD Data and Metadata Archives at Fermilab and the International Lattice Data Grid by NEILSEN, Eric • [536] Huge Memory systems for data-intensive science by MOUNT, Richard

  5. Wednesday (Special Security Session) • [224] Evaluation of Grid Security Solutions using Common Criteria by NAQVI, Syed • [463] Mis-use Cases for the Grid by SKOW, Dane • [164] Using Nagios for intrusion detection by CARDENAS MONTES, Miguel • [189] Secure Grid Data Management Technologies in ATLAS by BRANCO, Miguel • [249] The GridSite authorization system by MCNAB, Andrew • [439] Building Global HEP Systems on Kerberos by CRAWFORD, Matt • [104] Authentication/Security services in the ROOT framework by GANIS, Gerardo • [122] A Scalable Grid User Management System for Large Virtual Organizations by CARCASSI, Gabriele • [191] Virtual Organization Membership Service eXtension (VOX) by FISK, Ian • [194] G-PBox: a policy framework for Grid environments by RUBINI, Gianluca

  6. Thursday • [69] Resource Predictors in HEP Applications by HUTH, John • [318] The STAR Unified Meta-Scheduler project, a front end around evolving technologies for user analysis and data production by LAURET, Jerome • [321] SPHINX: A Scheduling Middleware for Data Intensive Applications on a Grid by CAVANAUGH, Richard • [417] Information and Monitoring Services within a Grid Environment by WILSON, Antony • [420] Practical approaches to Grid workload and resource management in the EGEE project by SGARAVATTO, Massimo • [490] Grid2003 Monitoring, Metrics, and Grid Cataloging System by MAMBELLI, Marco; KIM, Bockjoo • [89] MonALISA: An Agent Based, Dynamic Service System to Monitor, Control and Optimize Grid based Applications by LEGRAND, Iosif • [274] Design and Implementation of a Notification Model for Grid Monitoring Events by DE BORTOLI, Natascia • [338] BaBar Book Keeping project - a distributed meta-data catalog of the BaBar event store by SMITH, Douglas • [388] A Lightweight Monitoring and Accounting System for LHCb DC04 Production by SANCHEZ GARCIA, Manuel • [393] Development and use of MonALISA high level monitoring services for Meta-Schedulers by EFSTATHIADIS, Efstratios • [377] DIRAC - The Distributed MC Production and Analysis for LHCb by TSAREGORODTSEV, Andrei

  7. Structure of the talk • Security • “Data Management” • “Middleware” • “Monitor and Workload” • Conclusions and outlook • I would like to thank the track co-coordinator, Ruth Pordes, and all the session chairs (Conrad Steenberg, Robert Kennedy, Andrew McNab, Oxana Smirnova) • Disclaimer: it would not really be useful to list *all* the talks. This summary reflects my personal view (and biases), trying to extract key points from all the material shown and discussed in the “Distributed Computing Services” parallel session

  8. Security: Themes • Pre-Grid services like ssh on Grid machines are already under attack! • People are developing tools to look for attacks. • Grids still need to interface to security used by pre-Grid systems like Kerberos, AFS and WWW • We are developing tools to manage 1000s of users in big experiments. • Application-level software developers are starting to interface to Grid security systems. Summary from Andrew McNab

  9. Security: Technologies • Feeding local security detection software (Tripwire etc.) into Nagios for monitoring was presented • VOMS, VOMRS/VOX and GUMS were all discussed for distributing authorization information about users • GridSite provides Grid extensions to Apache • Kerberos sites still need to be supported/included • The implications of a Web Services future are on everyone's mind... Summary from Andrew McNab

  10. Security: For non-Security people! • Developers: • Most attacks possible because of poor software quality (buffer overflows etc) • Some evidence that stolen Grid credentials have been tried out also: they will go after middleware bugs next • Site administrators: • Local exploits are now really important, not just network exploits (Grids have 1000s of “local” users.) • You will need monitoring to differentiate between “Grid worms” and “Grid jobs” (they look the same!) Summary from Andrew McNab

  11. Data management: Themes • At least three threads: • New tools/services becoming reality • Approaching maturity • Experiments confronted with the existing (and evolving) data management layer • Comparison with the talks presented at CHEP03 is very instructive: it looks like a lot has been achieved in this field in the last year and a half!

  12. Data Management: New tools/services becoming reality • Impressive demonstration of the maturity level reached by POOL together with the 3 LHC experiments • 400+ TB, the same order as previous exercises using Objectivity/DB • Key ingredients: experience and the experiments' requirements and pressure • Interplay of database technology and native grid services for data distribution and replication • FroNtier (FNAL, running experiments) • Decouples development and user data access • Scalable • Many commodity tools and techniques (Squid) • Simple to deploy (a minimal access sketch follows this slide) • LCG 3D (CERN, LHC experiments) • Sustainable infrastructure • SAM, BaBar DM • Experience with running experiments • gLite Data Management • New technology and experience • Convergence foreseen and envisageable
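
A minimal sketch of the FroNtier-style access pattern described above: read-only database queries go over HTTP so that a Squid proxy can cache repeated requests. The endpoint, proxy address and query encoding below are illustrative placeholders, not the real FroNtier protocol.

    # Hypothetical sketch of FroNtier-style database access over HTTP through a
    # caching Squid proxy. Hostnames and the query encoding are illustrative.
    import urllib.parse
    import urllib.request

    FRONTIER_URL = "http://frontier.example.org:8000/Frontier"   # hypothetical server
    SQUID_PROXY = {"http": "http://squid.example.org:3128"}      # hypothetical cache

    def query(sql: str) -> bytes:
        """Send an encoded query through the caching proxy and return the payload."""
        params = urllib.parse.urlencode({"type": "frontier_request", "sql": sql})
        opener = urllib.request.build_opener(urllib.request.ProxyHandler(SQUID_PROXY))
        with opener.open(f"{FRONTIER_URL}?{params}") as response:
            return response.read()   # identical queries are served from the Squid cache

    if __name__ == "__main__":
        print(query("SELECT version FROM calibration_tags")[:200])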

  13. xrootd • Rich but efficient server protocol • Combines file serving with P2P elements • Allows client hints for improved performance • Pre-read, prepare, client access & processing hints • Multiplexed request stream • Multiple parallel requests allowed per client (illustrated below) • An extensible base architecture • Heavily multi-threaded • Clients get dedicated threads whenever possible • Extensive use of OS I/O features • Async I/O, device polling, etc. • Load-adaptive reconfiguration • Key element in the proposal for Huge-Memory Systems for Data-Intensive Science (R. Mount)
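
A conceptual sketch of the "multiplexed request stream" idea only, not the xrootd wire protocol: one client keeps several read requests in flight over a single logical connection and collects the replies asynchronously. All names and timings are invented for illustration.

    # Conceptual illustration of multiplexing many outstanding requests per client
    # over one connection with async I/O. Not the actual xrootd protocol.
    import asyncio

    async def read_block(connection_lock: asyncio.Lock, offset: int, size: int) -> bytes:
        async with connection_lock:          # serialize sending on the shared connection
            await asyncio.sleep(0.01)        # stand-in for writing the request frame
        await asyncio.sleep(0.05)            # stand-in for waiting on the async reply
        return b"\0" * size                  # fake payload

    async def main() -> None:
        lock = asyncio.Lock()
        # Multiple parallel requests per client: issue them all, gather the replies.
        requests = [read_block(lock, offset, 4096) for offset in range(0, 64 * 4096, 4096)]
        blocks = await asyncio.gather(*requests)
        print(f"received {len(blocks)} blocks, {sum(len(b) for b in blocks)} bytes")

    if __name__ == "__main__":
        asyncio.run(main())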

  14. Data management: Approaching maturity… • SRM implementations • Not trivial • But being demonstrated • Great news!

  15. Experiments confronted with the existing (and evolving) data management layer • Cf. most of the plenary talks, e.g. A. Boehnlein, P. Elmer, D. Stickland, N. Katayama, I. Bird, … • Track 4 talks: • Grids, not Grid • Heterogeneity of the grid resources (cf. ATLAS/Don Quijote) • Independent evolution and experience (NorduGrid) • Production mode (experiments' data challenges) • Evolution of LCG-2 Data Management

  16. Experiments confronted with the existing (and evolving) data management layer

  17. Middleware: Themes • New generation of middleware becoming available • Some commonality in technology • Service-Oriented Architecture; Web Services • gLite (EGEE project) • Web services • GAE • Interactivity as a goal, as opposed to “production” mode • RPC-based web service framework (Clarens) • Emphasis on discovery services and high-level services (orchestration) • Compatibility with gLite to be explored • DIRAC • XML-RPC: no need for WSDL… (see the sketch after this slide) • Instant-messaging protocol for inter-service/agent communication • Connection-based; outbound connectivity only • Interacts with other experiment-specific services (cf. “File-Metadata Management System for the LHCb Experiment”) • Agent-based systems • DIRAC, GAE, PhEDEx • (First) feedback coming (developments in the experiments, ARDA)
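
A minimal sketch of the plain XML-RPC pattern the DIRAC talk describes: a central service exposes methods directly (no WSDL), and agents call them over HTTP with outbound connections only. It uses the Python standard library; the service and method names are hypothetical.

    # Minimal XML-RPC sketch (Python stdlib). Service and method names are invented.
    from xmlrpc.server import SimpleXMLRPCServer

    def request_job(resource_profile: dict) -> dict:
        """Return a (fake) job description matching the agent's resource profile."""
        return {"job_id": 42, "executable": "simulate.sh", "site": resource_profile.get("site")}

    server = SimpleXMLRPCServer(("localhost", 8080), allow_none=True, logRequests=False)
    server.register_function(request_job, "request_job")
    # server.serve_forever()   # uncomment to actually run the service

    # Agent side (outbound connectivity only):
    #   import xmlrpc.client
    #   matcher = xmlrpc.client.ServerProxy("http://localhost:8080", allow_none=True)
    #   job = matcher.request_job({"site": "CERN", "cpu_power": 1.0})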

  18. Middleware • gLite middleware • Lightweight (existing) services • Easily and quickly deployable • Use existing services where possible as a basis for re-engineering • Interoperability • Allow for multiple implementations • Performance/scalability & resilience/fault tolerance • Large-scale deployment and continuous usage • Portable • Being built on Scientific Linux and Windows • Co-existence with deployed infrastructure • Reduce requirements on participating sites • Flexible service deployment • Multiple services running on the same physical machine (if possible) • Co-existence with LCG-2 and OSG (US) is essential for the EGEE Grid service • Service-oriented approach • Follow WSRF standardization • No mature WSRF implementations exist to date, so start with plain WS • WSRF compliance is not an immediate goal, but we follow the WSRF evolution • WS-I compliance is important

  19. Middleware

  20. Middleware: themes • Dynamics of the evolution of the middleware is very complex • Experience injected into the projects • Previous/other projects • Experiment contributions • Essential inputs: cf. the CMS TMDB/PhEDEx presentation • Close feedback loop • ARDA in the case of gLite • Users/Data Challenges • Large(r) user community being exposed

  21. PROOF • Interactive analysis + parallelism (ROOT) • PROOF on the Grid (2003: demo with AliEn; gLite end 2004) • PROOF Analysis interface (Portal)
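
A minimal PyROOT sketch of the PROOF usage pattern, assuming a ROOT build with PROOF enabled; the cluster URL, tree name, file paths and selector are placeholders rather than a real setup.

    # Minimal PyROOT sketch of PROOF-style parallel analysis. All names are placeholders.
    import ROOT

    proof = ROOT.TProof.Open("user@proof-master.example.org")   # PROOF master (placeholder)

    chain = ROOT.TChain("events")                    # tree name is illustrative
    chain.Add("root://se.example.org//data/run1.root")
    chain.Add("root://se.example.org//data/run2.root")

    chain.SetProof()                                 # route Process() calls through PROOF
    chain.Process("MySelector.C+")                   # TSelector compiled and run on the workers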

  22. Monitoring systems • Many different monitoring systems used (Ganglia, MDS, GridICE, MonALISA, R-GMA, the LHCb DIRAC system, …) • In different combinations on different systems (LCG-2, Grid2003 GridCat, BNL SUMS, etc.) • Positive point: hybrid systems are possible! • Essential to have “global” views (planning, scheduling, …) • Different systems are capable of coexisting (Grid3 uses 3 of them) • MonALISA very widely used • Used in a very large and diversified set of systems (computing fabric, network performance tests, applications like VRVS, resource brokering in STAR SUMS, security, …). 160+ sites. • The situation is getting clearer at the system level; less clear (at least to me) for application monitoring
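
A toy illustration of the "hybrid systems" and "global view" points above: heterogeneous monitoring sources are wrapped behind one small interface so a planner can merge them. The source classes and the numbers they return are invented for the example.

    # Invented example of merging heterogeneous monitoring sources into a global view.
    from typing import Protocol

    class MonitoringSource(Protocol):
        def site_load(self) -> dict[str, float]: ...

    class GangliaSource:
        def site_load(self) -> dict[str, float]:
            return {"CERN": 0.71, "FNAL": 0.55}      # stand-in for a real Ganglia query

    class MonALISASource:
        def site_load(self) -> dict[str, float]:
            return {"CERN": 0.69, "BNL": 0.40}       # stand-in for a real MonALISA query

    def global_view(sources: list[MonitoringSource]) -> dict[str, float]:
        """Merge per-source measurements, averaging where sites overlap."""
        merged: dict[str, list[float]] = {}
        for source in sources:
            for site, load in source.site_load().items():
                merged.setdefault(site, []).append(load)
        return {site: sum(values) / len(values) for site, values in merged.items()}

    print(global_view([GangliaSource(), MonALISASource()]))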

  23. Workload management systems • BNL STAR SUMS system • Emphasis on stability • Running experiment! • Lots of users! • A front end to local and distributed RMS, acting like a client to multiple, heterogeneous RMS (sketched below) • A flexible, open, object-oriented framework with plug-and-play features • A good environment for further development • Standards (such as high-level JDL) • Scalability of other components (ML work, immediate use) • Used in STAR for real physics (usage and publication list) • Used for distributed / Grid simulation job submission • Used successfully by other experiments
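
A sketch of the SUMS-style idea of one front end dispatching to multiple, heterogeneous resource management systems through plug-and-play back-ends. This is not the actual SUMS code; class and method names are illustrative.

    # Illustrative plug-and-play dispatcher front end over heterogeneous RMS back-ends.
    from abc import ABC, abstractmethod

    class Dispatcher(ABC):
        """One back-end per RMS (a local batch system, a grid broker, ...)."""
        @abstractmethod
        def submit(self, executable: str, arguments: list[str]) -> str: ...

    class LocalDispatcher(Dispatcher):
        def submit(self, executable: str, arguments: list[str]) -> str:
            return f"local:{executable}"             # stand-in for a batch submission

    class GridDispatcher(Dispatcher):
        def submit(self, executable: str, arguments: list[str]) -> str:
            return f"grid:{executable}"              # stand-in for a broker submission

    BACKENDS: dict[str, Dispatcher] = {"local": LocalDispatcher(), "grid": GridDispatcher()}

    def submit_job(policy: str, executable: str, arguments: list[str]) -> str:
        """The user describes the job once; the chosen policy picks the back-end."""
        return BACKENDS[policy].submit(executable, arguments)

    print(submit_job("grid", "reco.sh", ["--run", "42"]))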

  24. Workload Management System • EGEE gLite WMS is being released • Evolution of the EDG WMS • Provides both “push” and “pull” modes

  25. Optimisation and accounting • Similar concepts at work in different activities • “Phenomenological” estimates based on a few parameters (J. Huth et al.) • Parametrize the required application time as T = g * (a + b * n_events); g contains the CPU power and the compilation flags (optimised/debug), and the model is linear with event size (a worked example follows this slide) • Warning: in a multi-VO, multi-user environment the situation could be much more complicated • BNL STAR SUMS: minimize the (estimated) transit time • Observe the transit time and act accordingly (uses MonALISA) • Up to now OK only for systems out of the saturation zone • Sphinx project (GAE) • EGEE gLite • Inside the WMS • It looks like we are approaching the phase where “Grid Accounting” will really be distinct from “Grid Monitoring” and static resource allocation • EGEE gLite (WMS talk) • Relatively easier for a single-VO system • LHCb DIRAC accounting system (still a reporting system coupled to the monitoring system)
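
A worked example of the predictor quoted above, T = g * (a + b * n_events), where g folds in the CPU power and compilation flags. The numerical values are made up purely to show how the estimate is used.

    # Worked example of the phenomenological time estimate. All numbers are made up.
    def estimated_time(n_events: int, a: float, b: float, g: float) -> float:
        """Return the predicted wall-clock time in seconds."""
        return g * (a + b * n_events)

    a = 30.0      # fixed start-up cost (s)
    b = 0.8       # per-event cost (s/event) on the reference machine
    g = 1.25      # slower CPU / debug build -> scale the reference time up by 25%

    print(f"10k events: {estimated_time(10_000, a, b, g):.0f} s")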

  26. Sphinx Measurements

  27. Pull approach: DIRAC workload management • Realizes the PULL scheduling paradigm • Agents request jobs whenever the corresponding resource is free • Condor ClassAds and the Matchmaker are used to find jobs suitable for the resource profile (see the sketch after this slide) • Agents steer job execution on site • Jobs report their state and environment to the central Job Monitoring service • Averaged 420 ms match time over 60,000 jobs • Queued jobs grouped by categories • Matches performed by category • Typically 1,000 to 20,000 jobs queued
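
A sketch of the pull paradigm described above: an agent on a free resource asks the central matcher for work that fits its profile. A toy requirement check stands in for the real Condor ClassAd matchmaking, and all queue contents and site names are illustrative.

    # Toy pull-scheduling loop. The matching function stands in for Condor ClassAds.
    import time
    from typing import Optional

    QUEUE = [
        {"job_id": 1, "requirements": {"os": "slc3", "min_cpu": 1.0}},
        {"job_id": 2, "requirements": {"os": "slc3", "min_cpu": 2.0}},
    ]

    def matches(requirements: dict, resource: dict) -> bool:
        return requirements["os"] == resource["os"] and resource["cpu"] >= requirements["min_cpu"]

    def request_job(resource: dict) -> Optional[dict]:
        """Central service: hand out the first queued job the resource can run."""
        for job in QUEUE:
            if matches(job["requirements"], resource):
                QUEUE.remove(job)
                return job
        return None

    def agent_loop(resource: dict) -> None:
        """Agent on the site: pull work whenever the local resource is free."""
        while (job := request_job(resource)) is not None:
            print(f"running job {job['job_id']} at {resource['site']}")
            time.sleep(0.1)   # stand-in for steering the job and reporting its state

    agent_loop({"site": "PIC", "os": "slc3", "cpu": 1.5})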

  28. Metadata • Considerable efforts going on in the community • The SAM-Grid team and the Metadata Working Group, EGEE gLite, LCG ARDA • Material from running experiments (notably CDF and BaBar), HEPCAL, the LHC experiments, … • Cf. also: “Lattice QCD Data and Metadata Archives at Fermilab and the International Lattice Data Grid” (Neilsen/Simone) • On the border between “generic” middleware and “experiment-specific” software • There is probably a need for a generic layer (illustrated below) • It will emerge by distilling the experience from the “experiment-specific” side together with technology considerations
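
A hypothetical illustration of what such a "generic" metadata layer could look like: a minimal attribute-based catalogue interface that experiment-specific systems might sit on top of. The interface and its in-memory backing are invented here, not taken from any of the cited projects.

    # Invented minimal attribute-based metadata catalogue interface.
    class MetadataCatalogue:
        def __init__(self) -> None:
            self._entries: dict[str, dict] = {}      # logical file name -> attributes

        def insert(self, lfn: str, attributes: dict) -> None:
            self._entries[lfn] = dict(attributes)

        def query(self, **constraints) -> list[str]:
            """Return the logical file names whose attributes match every constraint."""
            return [lfn for lfn, attrs in self._entries.items()
                    if all(attrs.get(key) == value for key, value in constraints.items())]

    catalogue = MetadataCatalogue()
    catalogue.insert("lfn:/grid/run1.root", {"run": 1, "stream": "physics"})
    catalogue.insert("lfn:/grid/run2.root", {"run": 2, "stream": "calibration"})
    print(catalogue.query(stream="physics"))         # -> ['lfn:/grid/run1.root']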

  29. Metadata: BaBar system • Mirroring system in place (heterogeneous technologies in use: MySQL and Oracle) • Publish/sync system developed in house (see the sketch after this slide) • Distribution down to users’ laptops
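
A toy sketch of the publish/sync idea behind such bookkeeping mirrors: rows added at the master catalogue since the last sync are copied to the mirror. Two in-memory sqlite3 databases stand in for the heterogeneous Oracle/MySQL deployment, and the table layout is invented.

    # Toy publish/sync pass between a master catalogue and a mirror (sqlite3 stand-ins).
    import sqlite3

    def sync(master: sqlite3.Connection, mirror: sqlite3.Connection) -> int:
        """Copy datasets the mirror has not seen yet; return how many were copied."""
        last = mirror.execute("SELECT COALESCE(MAX(id), 0) FROM datasets").fetchone()[0]
        rows = master.execute("SELECT id, name FROM datasets WHERE id > ?", (last,)).fetchall()
        mirror.executemany("INSERT INTO datasets VALUES (?, ?)", rows)
        mirror.commit()
        return len(rows)

    master = sqlite3.connect(":memory:")
    mirror = sqlite3.connect(":memory:")
    for db in (master, mirror):
        db.execute("CREATE TABLE datasets (id INTEGER PRIMARY KEY, name TEXT)")
    master.executemany("INSERT INTO datasets VALUES (?, ?)",
                       [(1, "run-1-physics"), (2, "run-2-physics")])
    print(sync(master, mirror), "datasets mirrored")     # -> 2 datasets mirrored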

  30. Conclusions and outlook • Experience is still (and will be) more effective than pure technology • Look at the running experiments! • See the powerful boost from the large data challenges! • SRM is the candidate to be a first high-level middleware service • Good news! • Why only SRM? • What about, for example, other data management tools? • Metadata catalogues? • … • The fast evolution shows the vitality and enthusiasm of the HEP community • How can we use it to progress even faster? • What should we do to converge on other high-level services? • gLite is a unique opportunity: we should not miss it • Grid as a monoculture is not realistic • A recipe? Some ingredients, at least… • The physics and the physicists! • Analysis is still somewhat missing; more and broader experience needed • Diverse contributions and technology choices, but convergence is possible!
