Data Centric Issues: Particle Physics and Grid Data Management. Tony Doyle, University of Glasgow
Outline: Data to Metadata to Data • Introduction • Yesterday “.. all my troubles seemed so far away” • (non-Grid) Database Access • Data Hierarchy • Today “.. is the greatest day I’ve ever known” • Grids and Metadata Management • File Replication • Replica Optimisation • Tomorrow “.. never knows” • Event Replication • Query Optimisation
GRID Services: Context
• Applications: Chemistry, Cosmology, Environment, Biology, High Energy Physics
• Application Toolkits: data-intensive applications, remote visualisation, distributed computing, problem solving, remote instrumentation, collaborative applications
• Grid Services (Middleware): resource-independent and application-independent services, e.g. authentication, authorisation, resource location, resource allocation, events, accounting, remote data access, information, policy, fault detection
• Grid Fabric (Resources): resource-specific implementations of basic services, e.g. transport protocols, name servers, differentiated services, CPU schedulers, public key infrastructure, site accounting, directory service, OS bypass
Online Data Rate vs Size
[Figure: Level 1 trigger rate (Hz) versus event size (bytes), spanning 10^2 to 10^6 Hz and 10^4 to 10^7 bytes, for LHCb, ATLAS, CMS, HERA-B, KLOE, CDF II, CDF, H1, ZEUS, ALICE, NA49, UA1 and LEP. The LHC experiments occupy the extreme corner: high Level-1 trigger rate (1 MHz), high number of channels, high bandwidth (500 Gbit/s) and a PetaByte-scale data archive.]
"How can this data reach the end user?" It doesn't: a factor O(1000) of online data reduction is applied via trigger selection.
Offline Data Hierarchy: "RAW, ESD, AOD, TAG"
• RAW (~1 MB/event): recorded by the DAQ; triggered events; detector digitisation
• ESD, Event Summary Data (~100 kB/event): reconstructed, pseudo-physical information: clusters, track candidates (electrons, muons), etc.
• AOD, Analysis Object Data (~10 kB/event): selected physical information: transverse momentum, association of particles, jets, (best) identification of particles; physical info for relevant "objects"
• TAG (~1 kB/event): the information relevant for fast event selection
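As a rough sketch, the hierarchy might be modelled in code as follows (all type and field names are hypothetical, chosen only to illustrate the ~1000:1 size reduction from RAW down to TAG):

```cpp
// Hypothetical event-data hierarchy, illustrating the per-event
// size budget: RAW ~1 MB, ESD ~100 kB, AOD ~10 kB, TAG ~1 kB.
#include <cstdint>
#include <vector>

struct RawEvent {                        // ~1 MB/event: detector digitisation
    std::vector<uint8_t> frontEndData;   // raw channel readout
};

struct EventSummaryData {                // ~100 kB/event: reconstructed info
    std::vector<float> clusterEnergies;  // calorimeter clusters
    std::vector<float> trackCandidates;  // simplified: fitted track parameters
};

struct AnalysisObjectData {              // ~10 kB/event: physics objects
    std::vector<float> jetPt;            // transverse momenta of jets
    std::vector<int>   particleId;       // best particle hypotheses
};

struct Tag {                             // ~1 kB/event: fast-selection variables
    float Ee;                            // e.g. electron energy, used in cuts
    int   Ntrks;                         // e.g. number of tracks
    long  eventId;                       // pointer back into the ESD store
};
```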
Analysis Data Flow
[Diagram: raw data at Tier 0/1 feeds collaboration-wide event selection, calibration and reconstruction into ESD (data or Monte Carlo) and event tags; analysis groups at Tier 2 run data analysis and skims over AOD (Analysis Object Data) to produce physics objects; individual physicists at Tiers 3 and 4 perform physics analysis on those objects, with an arrow marking the direction of increasing data flow across the tiers.]
Data Structure
[Diagram: two parallel production chains. Simulated: physics models → Monte Carlo truth data → detector simulation → MC raw data → reconstruction → MC event summary data and MC event tags. Real: trigger system and data acquisition (with run conditions, the Level 3 trigger and calibration data) → raw data and trigger tags → reconstruction → event summary data (ESD) and event tags.]
Both REAL and SIMULATED data are required, with central and distributed production.
A running (non-Grid) experiment: three steps to select an event today
• Remote access to O(100) TBytes of ESD data, via remote access to 100 GBytes of TAG data, using an offline selection, e.g. ZeusIO-Variable (Ee>20.0)and(Ntrks>4)
• Access to the remote store via batch jobs, with a ~1% database event-finding overhead
• O(1M) lines of reconstruction code; no middleware; 20k lines of C++ "glue" from the Objectivity (TAG) database to the ADAMO (ESD) database
100 million selected events from 5 years' data-taking; TAG selection via 250 variables/event.
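A minimal sketch of that offline selection, assuming a hypothetical Tag record whose fields mirror the ZeusIO-style cut: the scan runs over the small (~100 GB) TAG store and returns only pointers into the much larger (~100 TB) ESD store.

```cpp
#include <vector>

// Hypothetical TAG record, as sketched above; fields mirror the cut.
struct Tag { float Ee; int Ntrks; long eventId; };

// Scan the TAG store and return the identifiers of events passing the
// cut; only these events are then fetched from the remote ESD store.
std::vector<long> selectEvents(const std::vector<Tag>& tagStore) {
    std::vector<long> selected;
    for (const Tag& t : tagStore) {
        if (t.Ee > 20.0f && t.Ntrks > 4) {   // (Ee>20.0)and(Ntrks>4)
            selected.push_back(t.eventId);
        }
    }
    return selected;
}
```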
A future (Grid) experiment: three steps to (analysis) heaven
• 10 (1) PByte of RAW (ESD) data per year, with 1 TByte of TAG data (local access) per year, using an offline selection, e.g. ATLASIO-Variable (Mee>100.0)and(Njets>4)
• Interactive access to the local TAG store; automated batch jobs to distributed Tier-0, -1, -2 centres
• O(1M) lines of reconstruction code; O(1M) lines of middleware... NEW...; O(20k) lines of Java/C++ "glue" from the TAG to the ESD database
All working? Efficiently?
1000 million events from 1 year's data-taking; TAG selection via 250 variables.
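On the Grid, the middleware must additionally route the selected events to the sites holding the corresponding data. A sketch of that step, with every name hypothetical (a real implementation would consult the replica catalogue and a resource broker):

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stub: a real implementation would ask the replica
// catalogue which Tier centre hosts the ESD for an event; here we
// simply alternate between two illustrative sites.
std::string hostingSite(long eventId) {
    return (eventId % 2 == 0) ? "Tier1-CERN" : "Tier1-RAL";
}

// Group the selected events by hosting site so that one batch job can
// be submitted per Tier-0/-1/-2 centre, moving the code to the data.
std::map<std::string, std::vector<long>>
partitionBySite(const std::vector<long>& selectedEvents) {
    std::map<std::string, std::vector<long>> jobs;
    for (long id : selectedEvents) {
        jobs[hostingSite(id)].push_back(id);
    }
    return jobs;  // one map entry per site => one batch job per site
}
```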
Grid Data Management: Requirements
• "Robust": a software development infrastructure
• "Secure": via Grid certificates
• "Scalable": non-centralised
• "Efficient": optimised replication
Examples: GDMP, Spitfire, Reptor, Optor.
Robust? Development Infrastructure
• CVS repository: management of DataGrid source code; all code available (some mirrored)
• Bugzilla bug tracking
• Package repository: public access to packaged DataGrid code
• Development of management tools: statistics concerning DataGrid code; auto-building of DataGrid RPMs; publishing of generated API documentation
• Latest build: Release 1.2 (August 2002); 140,506 lines of code in 10 languages (Release 1.0)
Robust? Middleware Testbed(s)
EU-wide development, with validation and maintenance feeding back into the testbed(s).
1. Robust? Code Development Issues
• Reverse engineering (C++ code analysis and restructuring; coding standards) => abstraction of existing code to UML architecture diagrams
• Language choice (currently 10 languages used in DataGrid): Java = C++ minus "features" (global variables, pointer manipulation, goto statements, etc.)
• Constraints (performance, libraries, legacy code)
• Testing (automation, object-oriented testing)
• Industrial strength? OGSA-compliant? O(20 year) future-proof??
Data Management on the Grid • “Data in particle physics is centred on events stored in a database… Groups of events are collected in (typically GByte) files… In order to utilise additional resources and minimise data analysis time, Grid replication mechanisms are currently being used at the file level.” • Access to a database via Grid certificates (Spitfire/OGSA-DAI) • Replication of files on the Grid (GDMP/Giggle) • Replication and Optimisation Simulation (Reptor/Optor)
2. Spitfire: "Secure?" At the level required in Particle Physics
[Diagram: an HTTP + SSL request with a client certificate arrives at the servlet container via an SSLServletSocketFactory. The TrustManager checks whether the certificate is signed by a trusted CA and whether it appears in the revoked-certificates repository. The security servlet then asks whether the user specifies a role; if not, a default role is found. The role is checked against the role repository and mapped to a connection ID, and the translator servlet requests a connection from the connection pool to query the RDBMS via the authorization module.]
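A minimal sketch of that authorization chain (all types and helpers are hypothetical stand-ins; the actual Spitfire logic lives in Java servlets):

```cpp
#include <optional>
#include <string>

// Hypothetical certificate carrying an optional user-specified role.
struct Certificate { std::string subject; std::optional<std::string> role; };

// Hypothetical stubs: real checks live in the TrustManager and the
// revoked-certificates and role repositories.
bool signedByTrustedCA(const Certificate&) { return true;  }
bool isRevoked(const Certificate&)         { return false; }
bool roleAllowed(const std::string&)       { return true;  }
int  connectionIdFor(const std::string&)   { return 42;    }

// Mirrors the servlet flow above: CA check, revocation check, role
// resolution (falling back to a default role), then role-to-connection
// mapping. A connection ID is returned only if every step succeeds.
std::optional<int> authorize(const Certificate& cert) {
    if (!signedByTrustedCA(cert)) return std::nullopt;  // untrusted CA
    if (isRevoked(cert))          return std::nullopt;  // revoked certificate
    std::string role = cert.role.value_or("default");   // default role
    if (!roleAllowed(role))       return std::nullopt;  // role rejected
    return connectionIdFor(role);                       // pooled connection
}
```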
2. Database client API
• A database client API has been defined
• It is implemented as a grid service using standard web service technologies
• Ongoing development with OGSA-DAI
• Talk: "Project Spitfire - Towards Grid Web Service Databases"
3. GDMP and the Replica Catalogue
GDMP 3.0 is a file mirroring/replication tool, originally written to replicate CMS Objectivity files for High Level Trigger studies and now used widely in HEP.
TODAY: Storage Elements publish to a centralised, LDAP-based Replica Catalogue (Globus 2.0).
3. Giggle: "Hierarchical P2P". "Scalable?"
[Diagram: hierarchical indexing. A higher-level Replica Location Index (RLI) contains pointers to lower-level RLIs or Local Replica Catalogs (LRCs); each LRC fronts a Storage Element.]
Trade-off: consistency versus efficiency.
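A sketch of the two-level lookup, with hypothetical data structures following the RLI/LRC split: the index layer only suggests which local catalogues may hold a file, while each local catalogue holds the authoritative mapping, so a stale index costs a wasted lookup rather than a wrong answer.

```cpp
#include <map>
#include <string>
#include <vector>

// Local Replica Catalog: authoritative map from logical file name
// (LFN) to the physical file names (PFNs) at one Storage Element.
struct LocalReplicaCatalog {
    std::map<std::string, std::vector<std::string>> lfnToPfns;
};

// Replica Location Index: soft-state map from LFN to the local
// catalogues that probably hold replicas. Entries may be stale:
// this is the consistency-versus-efficiency trade-off.
struct ReplicaLocationIndex {
    std::map<std::string, std::vector<LocalReplicaCatalog*>> lfnToLrcs;
};

// Resolve an LFN to every PFN known across the indexed sites.
std::vector<std::string>
locateReplicas(const ReplicaLocationIndex& rli, const std::string& lfn) {
    std::vector<std::string> pfns;
    auto hit = rli.lfnToLrcs.find(lfn);
    if (hit == rli.lfnToLrcs.end()) return pfns;
    for (const LocalReplicaCatalog* lrc : hit->second) {
        auto local = lrc->lfnToPfns.find(lfn);  // re-check: RLI may be stale
        if (local != lrc->lfnToPfns.end())
            pfns.insert(pfns.end(), local->second.begin(), local->second.end());
    }
    return pfns;
}
```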
4. Reptor/Optor: File Replication / Simulation. "Efficient?" Requires simulation studies
[Diagram: a User Interface and Resource Broker consult Replica Location Indices and a Replica Metadata Catalogue; at each site a Replica Manager exposes core, optimisation and processing APIs and coordinates an Optimiser, pre-/post-processing, a Local Replica Catalogue, a Storage Element and a Computing Element.]
• Reptor: the replica architecture
• Optor: tests file replication strategies, e.g. an economic model
• Demo and Poster: "Studying Dynamic Grid Optimisation Algorithms for File Replication"
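The economic model can be sketched as a local caching decision (capacity and value estimates are hypothetical): a site replicates a requested file only if its predicted future value exceeds that of the least valuable file it would evict.

```cpp
#include <algorithm>
#include <map>
#include <string>

// Hypothetical economic replication decision, in the spirit of Optor:
// each stored file carries an estimated future value (e.g. derived
// from its recent access history); a new replica is accepted only if
// it is worth more than the least valuable file it would displace.
struct StorageElement {
    std::map<std::string, double> fileValue;  // lfn -> predicted value
    std::size_t capacity = 100;               // max number of files held
};

bool shouldReplicate(StorageElement& se,
                     const std::string& lfn, double predictedValue) {
    if (se.fileValue.size() < se.capacity) return true;  // free space
    auto victim = std::min_element(
        se.fileValue.begin(), se.fileValue.end(),
        [](const auto& a, const auto& b) { return a.second < b.second; });
    if (predictedValue <= victim->second) return false;  // not worth it
    se.fileValue.erase(victim);                          // evict, then accept
    return true;
}
```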
General Observation: Code Development
• Reverse engineering (C++ code analysis and restructuring; coding standards) => abstraction of existing code to UML architecture diagrams
• Language choice (currently 10 languages used in DataGrid): Java = C++ minus "features" (global variables, pointer manipulation, goto statements, etc.)
• Constraints (performance, libraries, legacy code)
• Testing (automation, object-oriented testing)
• Future-proof? OGSA-compliant...
Application Requirements
"The current EMBL production database is 150 GB, which takes over four hours to download at full bandwidth capability at the EBI. The EBI's data repositories receive 100,000 to 250,000 hits per day, with 20% from UK sites; 563 unique UK domains with 27 sites have more than 50 hits per day." (MyGrid proposal)
This suggests less emphasis on efficient data access and data hierarchy aspects, which are application specific. Biological applications stand to gain a great deal from efficient file replication; perhaps larger gains still from application-specific replication?
Events.. to Files.. to Events
[Diagram: events 1, 2 and 3 are scattered across RAW data files at Tier-0 (international), ESD files at Tier-1 (national), AOD files at Tier-2 (regional) and TAG files at Tier-3 (local).]
Replication works on whole files, so not all pre-filtered events in a replicated file are interesting, while events that were not pre-filtered may be: a file replication overhead, driven by the "Interesting Events List".
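A sketch of that overhead under a hypothetical event-to-file layout: with file-level replication, every file containing at least one interesting event must be copied in full.

```cpp
#include <map>
#include <set>
#include <string>

using EventId  = long;
using FileName = std::string;

// Fraction of replicated events that were never asked for: the cost
// of replicating whole files rather than individual events.
double replicationOverhead(
    const std::map<EventId, FileName>& eventToFile,      // hypothetical layout
    const std::map<FileName, std::size_t>& eventsPerFile,
    const std::set<EventId>& interestingEvents) {
    std::set<FileName> filesToReplicate;
    for (EventId e : interestingEvents)
        filesToReplicate.insert(eventToFile.at(e));
    std::size_t copied = 0;
    for (const FileName& f : filesToReplicate)
        copied += eventsPerFile.at(f);                   // whole file is copied
    if (copied == 0) return 0.0;
    return 1.0 - double(interestingEvents.size()) / double(copied);
}
```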
Events.. to Events: Event Replication and Query Optimisation
[Diagram: events 1, 2 and 3 held in a distributed (replicated) database, with RAW at Tier-0 (international), ESD at Tier-1 (national), AOD at Tier-2 (regional) and TAG at Tier-3 (local); knowledge from the "Interesting Events List" drives the replication ("Stars in Stripes").]
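Event-level replication would remove the file overhead by copying only the selected events into a new stream; a sketch under the same hypothetical layout:

```cpp
#include <set>
#include <vector>

using EventId = long;

struct Event { EventId id; /* payload elided */ };

// Event-level replication: instead of mirroring whole files, build a
// new stream containing exactly the events on the interesting list.
// Whether the query optimiser and distributed database can support
// this at Grid scale is the open question of the talk.
std::vector<Event> replicateEvents(
    const std::vector<Event>& source,
    const std::set<EventId>& interestingEvents) {
    std::vector<Event> stream;
    for (const Event& e : source)
        if (interestingEvents.count(e.id))
            stream.push_back(e);     // copy only what was selected
    return stream;
}
```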
Data Grid for the Scientist
@#%&*! → Grid Middleware → E = mc2
...in order to get back to the real (or simulated) data. An incremental process. At what level does the metadata sit: the file?... the event?... the sub-event?...
Summary
• Yesterday's data access issues are still here; they just got bigger (by a factor of 100). A data hierarchy is required to access more data more efficiently... but is not sufficient
• Today's Grid tools are developing rapidly; they enable replicated file access across the Grid; file replication is standard (lfn://, pfn://); standards for Grid data access are emerging
• Tomorrow ".. never knows": replicated "events" on the Grid?.. distributed databases?.. or did that diagram look a little too monolithic?