Virtual Organization Approach for Running HEP Applications in Grid Environment

Łukasz Skitał1, Łukasz Dutka1, Renata Słota2, Krzysztof Korcyl3, Maciej Janusz2, Jacek Kitowski1,2
1 ACC CYFRONET AGH, Cracow
2 Institute of Computer Science AGH-UST, Cracow
3 Institute of Nuclear Physics PAN, Cracow

Outline
• Introduction
• Application
  – Requirements
  – Architecture
• Virtual Organization
  – Requirements
  – Certification
  – Monitoring
  – Dynamic Processing Tasks Pool
• Summary/Conclusions

Introduction
• Part of the int.eu.grid project
• State of the art: current interactive applications in GRID
  – CrossGrid
  – RealityGrid
  – GridPP
• Why grid?
  – Gain more resources
  – Use the resources for other purposes when ATLAS is off
• Why VO?
  – Provides a stable but dynamic environment for the application
  – The HEP application has to process all events without any loss of data

HEP Application in Brief
[Figure: network layout. At CERN, SFIs feed the local event processing farms (PF) over the back-end network, and SFOs write to mass storage at the CERN Computing Center. A dispatcher connects over a packet-switched WAN (GEANT lightpath) to remote processing farms (PF) in Copenhagen, Edmonton, Kraków and Manchester.]

Sensors Lvl1 filters 2.5 s 120 GB/s Buffers ~ 10 ms Lvl2 filters ~4 GB/s LVL3 Event Filter Event Filter N/work SFI ~ sec EFP EFP Event Filter Processors EFN EFP EFP SFO ~ 300 MB/s HEP Application in Brief • Three filtering levels • Hardware level (lvl 1) • Small local farm (lvl 2) • Complex events filtering (lvl 3) • SFI - SubFarm Interface • SFO - SubFarm Output • PF – Processing Farm

HEP Requirements
• Real-time application
• High throughput (estimated)
  – 3500 events per second
  – 1.5 MB per event
  – On average 1 second to compute one event on a typical processor
  – Infrastructure monitoring (for load balancing)
• Efficient way to distribute events to worker nodes
  – The Grid job submission mechanism is not sufficient
  – A simple job submission takes minutes
  – We have only seconds
• Failure recovery
  – Malfunctions of single nodes are acceptable, but have to be detected
  – Application monitoring
  – Infrastructure monitoring (for availability checks)
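The estimated figures above imply an aggregate input bandwidth and a number of processors that must be busy at the same time. The following is a minimal sketch of that arithmetic; the function and constant names are illustrative, and the processor count is a lower bound that ignores transfer time and idle nodes.

```python
# Figures taken from the requirements slide (all are the slide's estimates).
EVENT_RATE_HZ = 3500          # events per second
EVENT_SIZE_MB = 1.5           # megabytes per event
CPU_SECONDS_PER_EVENT = 1.0   # average compute time on a typical processor


def required_bandwidth_mb_s(rate_hz: float, size_mb: float) -> float:
    """Aggregate bandwidth needed to ship every event to some worker."""
    return rate_hz * size_mb


def required_processors(rate_hz: float, cpu_seconds: float) -> float:
    """Processors that must be busy at once to keep up (rate x service time)."""
    return rate_hz * cpu_seconds


if __name__ == "__main__":
    print(required_bandwidth_mb_s(EVENT_RATE_HZ, EVENT_SIZE_MB), "MB/s")
    print(required_processors(EVENT_RATE_HZ, CPU_SECONDS_PER_EVENT), "processors")
```

With the slide's numbers this gives 5250 MB/s of event data and about 3500 processors working in parallel.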

HEP application integration with GRID
• Job submission for each event is too slow
• Job submission for a bunch of events is still too slow
• We need interactive communication
• Pilot job idea
  – One job allocates a node and starts a PT (Processing Task)
  – Dedicated queue in the LRMS for HEP pilot jobs (HEP VO)
  – One PT processes many events
• Direct communication between the PT and the ATLAS experiment
  – Faster than job submission
  – The ATLAS experiment provides events (1.5 MB/event)
  – The PT responds with event analysis results (1 Kb/event)
  – Asynchronous communication with event buffering
• Limited lifetime of a PT to allow dynamic resource allocation
  – Lifetime set by the queue or the PT configuration
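The pilot job idea can be illustrated with a small sketch of a PT main loop: one long-running task that pulls many events from a buffer and stops when its lifetime expires. This is not the authors' implementation. The in-process queues stand in for the network connection to the experiment, and `analyse` is a hypothetical placeholder for the real event analysis.

```python
import queue
import time


def analyse(event: bytes) -> dict:
    """Hypothetical stand-in for the real event analysis."""
    return {"size": len(event), "accept": len(event) % 2 == 0}


def run_processing_task(events: "queue.Queue[bytes]",
                        results: "queue.Queue[dict]",
                        lifetime_s: float,
                        poll_s: float = 0.01) -> int:
    """Process buffered events until the PT lifetime expires.

    Returns the number of events handled by this PT.
    """
    deadline = time.monotonic() + lifetime_s
    handled = 0
    while time.monotonic() < deadline:
        try:
            # Wait briefly for the next buffered event.
            event = events.get(timeout=poll_s)
        except queue.Empty:
            continue  # nothing buffered yet; check the deadline again
        results.put(analyse(event))  # short answer goes back to the sender
        handled += 1
    return handled
```

The node is allocated once, by the pilot job, so the per-event cost is a queue read rather than a job submission.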

Proposed HEP Architecture
[Figure: at CERN, SFIs deliver events to EFDs with local buffers, which feed the local PT farm. A dispatcher and the HEP VO proxyPT connect to remote PTs running on worker nodes (WNs) behind computing elements (CE), reached through the GRID UI and WMS. Infrastructure monitoring, application monitoring and the HEP VO database are also shown.]

Components
• EFD (Event Filter Dataflow)
  – Takes events from the SFI and places them in a local buffer
  – Events are distributed to PTs (local or remote)
  – Depending on the PT's answer, an event is stored or flushed
• Processing Task (PT)
  – Runs on worker nodes (WNs)
  – Processes events and answers with short analysis data
• ProxyPT
  – Interface to remote PTs
• Dispatcher
  – Coordinates task distribution from the EFD to PTs
• Infrastructure Monitoring
  – Network load/status, WN status
• Application Monitoring
  – PT application state
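The EFD behaviour described above (buffer, distribute, then store or flush depending on the PT's answer) can be sketched as follows. The class and its `process` callback are assumptions made for illustration. In the real system the callback would be a local PT or a remote PT reached through the ProxyPT.

```python
from collections import deque
from typing import Callable, Deque, List


class EventFilterDataflow:
    """Toy EFD: buffers events and stores or flushes them on the PT's answer."""

    def __init__(self, process: Callable[[bytes], bool]) -> None:
        self.buffer: Deque[bytes] = deque()
        self.process = process          # hypothetical PT call: True means "keep"
        self.stored: List[bytes] = []
        self.flushed = 0

    def receive(self, event: bytes) -> None:
        """Accept an event from the SFI into the local buffer."""
        self.buffer.append(event)

    def dispatch_all(self) -> None:
        """Send every buffered event to a PT and act on its answer."""
        while self.buffer:
            event = self.buffer[0]       # stays buffered until the PT answers
            keep = self.process(event)   # if this raises, the event is not lost
            self.buffer.popleft()
            if keep:
                self.stored.append(event)
            else:
                self.flushed += 1
```

An event is removed from the buffer only after an answer arrives, which is one simple way to meet the requirement that no event is lost when a PT fails.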

HEP Virtual Organization
• Purpose
  – Provides a runtime environment for the HEP application
  – Fulfills the application’s requirements
  – Realizes the site certification process
• Architecture
  – High level (static) - sites
    • Certification
    • Agreements
    • Configuration guidelines
    • Functional tests
  – Low level (dynamic) - resources
    • Runtime environment
    • Dynamic resource allocation
    • Monitoring and failover
    • Load balancing

Site certification - Requirements
• Long-term ability to provide services and resources
  – Legal issues/agreements
• LRMS configuration
  – Dedicated high-priority (but short) queues on computing elements for jobs from the HEP VO
• Ability to communicate safely between the site's WNs and CERN HEP nodes
  – Specified port opened on the CERN side
  – Specified port opened on the site side
  – Trusted proxy to set up two-way communication
  – Channel encryption
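The open-port requirement lends itself to a simple functional test run from a worker node. The sketch below checks only that a TCP connection to a given host and port can be established. It says nothing about encryption or the trusted proxy, and the host and port values are placeholders rather than real CERN endpoints.

```python
import socket


def port_reachable(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder endpoint; a real test would use the agreed host and port.
    print(port_reachable("127.0.0.1", 8443))
```

A certification run would call this in both directions, from the site's WNs towards CERN and from CERN towards the site.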

HEP VO site operation process
• Certification phase
  – Long-term tests of reliability, performance and updates (application, databases)
  – Sites tested using artificial/calibration data
  – Communication between the site's WNs and CERN HEP nodes
• Operation phase with runtime environment monitoring
  – Operates on production data
  – Checks during PT startup
    • Proper environment, up-to-date application, databases, etc.
  – Infrastructure and application monitoring
  – Dynamic resource allocation
    • Excluding nodes/sites which are frequently unavailable
    • Temporarily excluded sites/nodes cannot process real data, but they can still receive test jobs
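The exclusion rule in the operation phase can be sketched as a bookkeeping problem: record availability checks per node and decide from the observed ratio whether the node may receive production data or only test jobs. The 0.9 threshold and all names below are assumptions for illustration. The slides do not state how "frequently unavailable" is measured.

```python
from dataclasses import dataclass


@dataclass
class NodeRecord:
    """Availability history of one node or site."""
    name: str
    checks: int = 0
    failures: int = 0

    def record(self, available: bool) -> None:
        """Register the outcome of one availability check."""
        self.checks += 1
        if not available:
            self.failures += 1

    def availability(self) -> float:
        """Fraction of checks that succeeded (1.0 if never checked)."""
        if self.checks == 0:
            return 1.0
        return (self.checks - self.failures) / self.checks


def allowed_workload(node: NodeRecord, min_availability: float = 0.9) -> str:
    """Excluded nodes still get test jobs, so they can earn their way back."""
    if node.availability() >= min_availability:
        return "production"
    return "test-only"
```

Because excluded nodes keep receiving test jobs, their availability record continues to update and the exclusion stays temporary.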

HEP Virtual Organization
[Figure: high-level VO (site certification) and low-level VO (operation). Labels on the figure: certification, agreements, guidelines, management issues, communication, site availability statistics, functional tests, dynamic resource allocation, monitoring.]

Monitoring for HEP VO
• The HEP VO takes advantage of monitoring
• Monitoring uses external tools
• Application monitoring (with a tool such as J-OCM)
  – deployed on every worker node running a HEP PT
  – provides information about the current execution status
  – monitors computation time
• JIMS for infrastructure monitoring
  – availability of the worker node
  – load of the worker node
  – free memory
  – network throughput between CERN and the remote computing farm
• Failover
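The infrastructure metrics listed above are what a dispatcher would need for load balancing. The sketch below picks the least-loaded available worker from a set of status records. The `WorkerStatus` record is a hypothetical container for such metrics and does not reflect the actual JIMS or J-OCM interfaces.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class WorkerStatus:
    """Hypothetical snapshot of the metrics named on the slide."""
    name: str
    available: bool
    load: float            # e.g. system load average
    free_memory_mb: int


def pick_worker(nodes: Iterable[WorkerStatus],
                min_free_memory_mb: int = 512) -> Optional[WorkerStatus]:
    """Return the available node with the lowest load, or None if none fits."""
    candidates = [
        n for n in nodes
        if n.available and n.free_memory_mb >= min_free_memory_mb
    ]
    return min(candidates, key=lambda n: n.load, default=None)
```

Returning `None` when no node qualifies leaves the caller free to keep the event buffered, which matches the failover goal of not losing data.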

Dynamic resource allocation
• Dynamic Processing Tasks pool
  – A malfunctioning PT is excluded from the runtime environment (low level VO)
  – PT lifetime is limited by the queue length (walltime)
    • each ‘normal’ job has its own lifetime specified before execution
    • ‘interactive’ type of job
    • the pool has to be refreshed periodically
• Fair sharing of resources
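Periodic refreshing of the PT pool can be sketched as follows: drop PTs that are malfunctioning or past their walltime, then top the pool up to a target size. The names and the target-size policy are assumptions. In practice a replacement PT would be started by submitting a new pilot job, which takes minutes, so a real refresh would run ahead of the expiry times.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PT:
    """One Processing Task in the pool."""
    pt_id: int
    started_at: float      # seconds, same clock as `now` below
    walltime_s: float      # lifetime granted by the queue
    healthy: bool = True

    def expired(self, now: float) -> bool:
        return now - self.started_at >= self.walltime_s


def refresh_pool(pool: List[PT], now: float, target_size: int,
                 walltime_s: float, next_id: int) -> List[PT]:
    """Remove expired or malfunctioning PTs and refill up to target_size."""
    alive = [pt for pt in pool if pt.healthy and not pt.expired(now)]
    missing = max(0, target_size - len(alive))
    for offset in range(missing):
        # Stand-in for submitting a new pilot job to the dedicated queue.
        alive.append(PT(next_id + offset, started_at=now, walltime_s=walltime_s))
    return alive
```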

Summary, conclusions
• High/Low-level VO
• Site certification and software validation for the HEP application
• HEP-oriented site functionality tests
• On-line validation of site configuration
• Statistical analysis of HEP processing
• Dynamic Processing Task Pool
