LCG 3D Project Status and Production Plans

Dirk Duellmann, CERN IT
On behalf of the LCG 3D project
https://lcg3d.cern.ch
CHEP 2006, 15th February, Mumbai

Related Talks
• LHCb conditions database framework [168 M. Clemencic]
• Database access in Atlas computing model [38 A. Vaniachine]
• Software for a variable Atlas detector description [67 V. Tsulaia]
• Optimized access to distributed relational database system [331 J. Hrivnac]
• COOL Development and Deployment - Status and Plans [337 A. Valassi]
• COOL performance and distribution tests [338 A. Valassi poster]
• CORAL relational database access software [329 I. Papadopoulos]
• POOL object persistency into relational databases [330 G. Govi poster]

Distributed Deployment of Databases (=3D)
• LCG today provides an infrastructure for distributed access to file based data and file replication
• Physics applications (and grid services) require similar services for data in relational databases
  • Physics applications and grid services use RDBMS
  • LCG sites already have experience in providing RDBMS services
• Goals for common project as part of LCG
  • increase the availability and scalability of LCG and experiment components
  • allow applications to access data in a consistent, location independent way
  • allow existing db services to be connected via data replication mechanisms
  • simplify shared deployment and administration of this infrastructure during 24 x 7 operation
• Scope set by LCG PEB
  • Online - Offline - Tier sites

LCG 3D Service Architecture
[Diagram: database nodes at the online system, T0, T1 and T2, linked by three mechanisms: Oracle Streams; http cache (SQUID); cross DB copy & MySQL/SQLite files]
• Online DB: autonomous reliable service
• T0: autonomous reliable service
• T1: db back bone, all data replicated, reliable service
• T2: local db cache, subset of data, only local service
• R/O access at Tier 1/2 (at least initially)

Building Block for Tier 0/1 - Oracle Database Clusters
• Two+ dual-CPU nodes
• Shared storage (e.g. FC SAN)
• Scale CPU and I/O ops (independently)
• Transparent failover and s/w patches

How to keep Databases up-to-date? Asynchronous Replication via Streams
[Diagram: capture -> propagation -> apply, with changes carried as LCRs. Sites shown: CERN, FNAL, CNAF, Sinica, IN2P3, BNL, RAL]
Example change: insert into emp values ( 03, 'Joan', ….)
Slide : Eva Dafonte Perez
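
The capture, propagation and apply steps named on this slide can be sketched as a toy model. This is only an illustration of asynchronous replication in general, not Oracle Streams code: the `Source`, `Site` and `LCR` classes and all their methods are hypothetical names introduced here, and the row content is limited to the two values visible in the slide's insert statement.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class LCR:
    """A logical change record: one row-level change captured at the source."""
    table: str
    key: int
    values: tuple


class Site:
    """A destination database with its own inbound queue (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.queue = deque()  # LCRs propagated here but not yet applied
        self.tables = {}      # table name -> {key: values}

    def apply_pending(self):
        """Apply step: drain the queue in arrival order."""
        applied = 0
        while self.queue:
            lcr = self.queue.popleft()
            self.tables.setdefault(lcr.table, {})[lcr.key] = lcr.values
            applied += 1
        return applied


class Source:
    """The source database: commits locally and records each change in a log."""

    def __init__(self):
        self.tables = {}
        self.redo_log = []  # ordered list of LCRs
        self.captured = 0   # index of the first log entry not yet captured

    def insert(self, table, key, values):
        self.tables.setdefault(table, {})[key] = values
        self.redo_log.append(LCR(table, key, values))

    def capture_and_propagate(self, sites):
        """Capture new log entries and enqueue them at every destination."""
        new = self.redo_log[self.captured:]
        for site in sites:
            site.queue.extend(new)
        self.captured = len(self.redo_log)
        return len(new)


cern = Source()
cnaf, ral = Site("CNAF"), Site("RAL")

cern.insert("emp", 3, ("Joan",))
# The insert is already committed at the source; no destination has seen it yet.
stale = dict(cnaf.tables)

cern.capture_and_propagate([cnaf, ral])
cnaf.apply_pending()
ral.apply_pending()
```

The point of the sketch is the gap between `insert` and `apply_pending`: the source never waits for a destination, so destinations lag behind until their queues are drained.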

Further Decoupling between Databases
[Diagram: SOURCE DATABASE -> COPY redo log files -> DOWNSTREAM DATABASE (three capture processes) -> propagation jobs -> DESTINATION SITES. Site labels: CERN RAC, FNAL, CERN, CNAF]
• Objectives
  • Remove impact of capture from Tier 0 Database
  • Isolate Destination sites from each other
• One capture process + queue pair for each target site
  • big Streams pool size
  • redundant events ( x number of queues)
Slide : Eva Dafonte Perez
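
The trade-off on this slide (isolation between destinations, paid for with redundant events) can be illustrated with a small queue model. It is a sketch under stated assumptions, not Streams configuration: `DownstreamDatabase` and its methods are hypothetical, and the change names are placeholders.

```python
from collections import deque


class DownstreamDatabase:
    """Toy model: one queue per destination site (hypothetical, for illustration)."""

    def __init__(self, destinations):
        # Each queue stands in for one capture process + queue pair.
        self.queues = {name: deque() for name in destinations}

    def receive_redo(self, changes):
        """Fan out copied redo content: every change is stored once per queue."""
        for queue in self.queues.values():
            queue.extend(changes)

    def propagate(self, name, reachable=True):
        """Deliver pending changes to one site; an unreachable site keeps its backlog."""
        if not reachable:
            return []
        queue = self.queues[name]
        delivered = list(queue)
        queue.clear()
        return delivered

    def buffered_events(self):
        return sum(len(queue) for queue in self.queues.values())


downstream = DownstreamDatabase(["FNAL", "CNAF"])
downstream.receive_redo(["change-1", "change-2", "change-3"])

# Redundant events: 3 changes x 2 queues are buffered.
buffered_after_fanout = downstream.buffered_events()

# CNAF is served even though FNAL is unreachable.
to_cnaf = downstream.propagate("CNAF")
to_fnal = downstream.propagate("FNAL", reachable=False)
fnal_backlog = len(downstream.queues["FNAL"])
```

With a single shared queue, the FNAL outage would hold back events that CNAF has already consumed; separate queues avoid that at the price of buffering every event once per destination, which is why the slide asks for a big Streams pool.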

FroNTier Resources/Deployment
[Diagram: Tier N Squids -> Tier 1 Squids -> (http) -> Tier 0 Squid(s) and Tomcat(s) running the FroNTier Launchpad -> (JDBC) -> Offline DB]
• Tier-0: 2-3 Redundant FroNTier servers.
• Tier-1: 2-3 Redundant Squid servers.
• Tier-N: 1-2 Squid Servers.
• Typical Squid server requirements:
  • CPU/MEM/DISK/NIC = 1GHz/1GB/100GB/Gbit
  • Network: visible to Worker LAN (private network) and WAN (internet)
  • Firewall: Two ports open, for URI (FroNTier Launchpad) access and SNMP monitoring (typically 8000 and 3401 respectively)
• Squid non-requirements
  • Special hardware (although high-throughput Disk I/O is good)
  • Cache backup (if disk dies or is corrupted, start from scratch and reload automatically)
• Squid is easy to install and requires little on-going administration.
Slide : Lee Lueking
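
Why a cache needs no backup can be seen in a minimal read-through cache chain. This is a sketch of the caching idea only, not Squid or FroNTier code: the `Origin` and `Cache` classes and the example URL are hypothetical.

```python
class Origin:
    """Stands in for the launchpad in front of the database (hypothetical)."""

    def __init__(self, data):
        self.data = data
        self.hits = 0  # number of requests that reached the origin

    def get(self, url):
        self.hits += 1
        return self.data[url]


class Cache:
    """Read-through cache standing in for one caching proxy layer."""

    def __init__(self, upstream):
        self.upstream = upstream
        self.store = {}

    def get(self, url):
        if url not in self.store:
            # Miss: fetch from the next layer up and remember the answer.
            self.store[url] = self.upstream.get(url)
        return self.store[url]

    def wipe(self):
        """Simulate a lost cache disk: the cache simply starts empty again."""
        self.store.clear()


url = "/conditions?run=1"  # illustrative request, not a real FroNTier URI
origin = Origin({url: "payload"})
tier1 = Cache(origin)
tier_n = Cache(tier1)

first = tier_n.get(url)   # miss at both layers, one origin hit
second = tier_n.get(url)  # served from the Tier N cache
hits_before_wipe = origin.hits

tier_n.wipe()             # the Tier N disk dies
third = tier_n.get(url)   # refilled from Tier 1, origin untouched
hits_after_wipe = origin.hits
```

After the wipe, the Tier N cache reloads from the layer above it, so losing a cache costs some extra upstream requests but no data.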

Test Status : 3D testbed
• Replication test progressing well
• Offline->T1:
  • COOL ATLAS : Stefan Stonjek (CERN, RAL, Oxford)
  • COOL LHCb : Marco Clemencic (CERN, RAL, GridKA?)
  • FroNTier CMS : Lee Lueking (CERN and several T1/T2 sites)
  • ARDA AMGA : Birger Koblitz (CERN->CERN)
  • AMI : Solveig Albrandt (IN2P3->CERN - setting up)
• Online->offline:
  • CMS Conditions : Saima Iqbal (functional testing)
  • ATLAS : (Gancho Dimitrov) Server setup, pit network
  • LHCb : planning with LHCb online
• Coordination during weekly 3D meetings

LCG Database Deployment Plan
• After the October '05 workshop a database deployment plan was presented to LCG GDB and MB
  • http://agenda.cern.ch/fullAgenda.php?ida=a057112
• Two production phases
  • March - Sept '06 : partial production service
    • Production service (parallel to existing testbed)
    • H/W requirements defined by experiments/projects
    • Based on Oracle 10gR2
    • Subset of LCG tier 1 sites: ASCC, CERN, BNL, CNAF, GridKA, IN2P3, RAL
  • Sept '06 onwards : full production service
    • Adjusted h/w requirements (defined at summer '06 workshop)
    • Other tier 1 sites join in: PIC, NIKHEF, NDGF, TRIUMF

Proposed Tier 1 Hardware Setup
• Proposed setup for the first 6 months
  • 2-3 dual-cpu database nodes with 2GB or more
    • Set up as RAC cluster (preferably) per experiment
    • ATLAS: 3 nodes with 300GB storage (after mirroring)
    • LHCb: 2 nodes with 100GB storage (after mirroring)
    • Shared storage (e.g. FibreChannel) proposed to allow for clustering
  • 2-3 dual-cpu Squid nodes with 1GB or more
    • Squid s/w packaged by CMS will be provided by 3D
    • 100GB storage per node
    • Need to clarify service responsibility (DB or admin team?)
• Target s/w release: Oracle 10gR2
  • RedHat Enterprise Server to ensure Oracle support

DB Readiness Workshop last week
• Readiness of the production services at T0/T1
  • status reports from tier 0 and tier 1 sites
  • technical problems with the proposed setup (RAC clusters)?
• Readiness of experiment (and grid) database applications
  • Application list, code release, data model and deployment schedule
  • Successful validation at T0 (and, if required, T1)?
• Review site/experiment milestones from the database project plan
  • (Re-)align with other work plans - e.g. experiment challenges, SC4
• Detailed presentations of experiments and sites at
  • http://agenda.cern.ch/fullAgenda.php?ida=a058495

CERN Hardware evolution for 2006
• Linear ramp-up budgeted for hardware resources in 2006-2008
• Planning next major service extension for Q3 this year
Slide : Maria Girone

Frontier Production Configuration at Tier 0
Squid runs in http-accelerator mode (as a reverse proxy server)
Slide : Luis Ramos

Tier 1 Progress
• Sites largely on schedule for a service start at the end of March
  • h/w either installed already (BNL, CNAF, IN2P3) or delivery expected shortly (GridKA, RAL)
  • Some problems with Oracle Clusters technology encountered and solved!
• Active participation from sites - DBA community building up
  • First DBA meeting, focusing on RAC installation, setup and monitoring, hosted by Rutherford and scheduled for the second half of March
• Need to involve remaining Tier 1 sites now
  • Establishing contact with PIC, NIKHEF, NDGF, TRIUMF so that they can follow workshops, email and meetings

LCG Application s/w Status
• Finished major step towards distributed deployment:
  • added common and configurable handling of server lookup, connection retry, failover and client side monitoring via CORAL
  • COOL and POOL have released versions based on new CORAL package [talks by I. Papadopoulos and A. Valassi]
• FroNTier has been added as plug-in into CORAL
  • CMS is working on FroNTier caching policy
  • FroNTier apps need to implement this policy to avoid stale cached data lookups
• LCG persistency framework s/w expected to be stable by end of February for distributed deployment as part of SC4 or experiment challenges
• Caveat: the experiment conditions data model may stabilize only later -> possible deployment issues
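
The combination of server lookup, connection retry, failover and client side monitoring listed above can be sketched generically. This is not the CORAL API: the function `connect_with_failover`, the `ConnectionFailed` exception, the lookup table and the host names are all hypothetical and only show how the four pieces fit together.

```python
import time


class ConnectionFailed(Exception):
    """Raised by a connect function when a replica cannot be reached."""


def connect_with_failover(logical_name, lookup, connect, retries=2, delay=0.0):
    """Resolve a logical name to replicas, retry each one, then fail over.

    Returns the connection and a list of (replica, attempt, outcome) tuples,
    which plays the role of client side monitoring.
    """
    attempts = []
    for replica in lookup[logical_name]:          # server lookup
        for attempt in range(1, retries + 1):     # connection retry
            try:
                connection = connect(replica)
                attempts.append((replica, attempt, "ok"))
                return connection, attempts
            except ConnectionFailed:
                attempts.append((replica, attempt, "failed"))
                time.sleep(delay)
        # All retries used up: fail over to the next replica in the list.
    raise ConnectionFailed(f"no replica reachable for {logical_name}")


# Hypothetical replica list: the local replica first, a remote one as fallback.
lookup = {"conditions": ["db-tier1.example.org", "db-tier0.example.org"]}


def fake_connect(replica):
    """Test double: the first replica is down, the second one answers."""
    if replica == "db-tier1.example.org":
        raise ConnectionFailed(replica)
    return f"connected:{replica}"


connection, attempts = connect_with_failover("conditions", lookup, fake_connect)
```

Keeping the replica list behind a logical name is what makes access location independent: applications ask for "conditions" and the ordering of replicas can differ per site without code changes.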

Open Issues
• Support for X.509 (proxy) certificates by Oracle?
  • May need to study possible fallback solutions
• Server and support licenses for Tier 1 sites
• Instant client distribution within LCG
  • In discussion with Oracle via commercial contact at CERN

Databases in Middleware & Castor
• Deployment already took place for services used in SC3
  • Existing setups at the sites
  • Existing experience with SC workloads -> extrapolate to real production
• LFC, FTS - Tier 0 and above
  • Low volume, but high availability requirements
  • CERN: run on 2-node Oracle cluster; outside CERN: single box Oracle or MySQL
• CASTOR 2 - CERN and some T1 sites
  • Need to understand scaling up to LHC production rates
• Currently not driving the requirements for the database service
• Need to consolidate database configs and procedures
  • may reduce effort/diversity at CERN and Tier 1 sites

Experiment Applications
• Conditions - Driving the database service size at T0 and T1
• EventTAGs (may become significant - need replication tests and concrete experiment deployment models)
• Framework integration and DB workload generators exist
  • successfully tested in various COOL and POOL/FroNTier tests
  • T0 performance and replication tests (T0->T1) look OK
• Conditions: Online -> Offline replication only starting now
  • May need additional emphasis on online tests to avoid surprises
  • CMS and ATLAS are executing online test plans
• Progress in defining concrete conditions data models
  • CMS showed most complete picture (for Magnet Test)
  • Still quite some uncertainty about volumes, numbers of clients

Summary
• Database Deployment Architecture defined
  • Streams connected Database Clusters for Online, Tier 0 (ATLAS, CMS, LHCb)
  • Streams connected Database Cluster for Tier 1 (ATLAS, LHCb)
  • FroNTier/SQUID distribution for Tier 1/Tier 2 (CMS)
  • File snapshots (SQLite/MySQL) via CORAL/Octopus (ATLAS, CMS)
• Database Production Service and Schedule defined
  • Setup proceeding well at Tier 0 and 1 sites
  • Start at end of March seems achievable for most sites
• Application performance tests progressing
  • First larger scale conditions replication tests with promising results for Streams and FroNTier technologies
  • Concrete conditions data models still missing for key detectors

Conclusions
• There is little reason to believe that a distributed database service will move into stable production any quicker than any of the other grid services
• We should start now to ramp up to larger scale production operation to resolve the unavoidable deployment issues
• We need the cooperation of experiments and sites to make sure that concrete requests can be quickly validated against a concrete distributed service
