Distributed Metadata with the AMGA Metadata Catalog
Nuno Santos, Birger Koblitz
20 June 2006
Workshop on Next-Generation Distributed Data Management
Abstract
• Metadata Catalogs on Data Grids – the case for replication
• The AMGA Metadata Catalog
• Metadata Replication with AMGA
• Benchmark Results
• Future Work / Open Challenges
Metadata Catalogs
• Metadata on the Grid
  • File metadata – describes files with application-specific information
    • Purpose: discover files based on their contents
  • Simplified database service – stores generic structured data on the Grid
    • Not as powerful as a full DB, but easier to use and better integrated with the Grid (security, hides DB heterogeneity)
• Metadata services are essential for many Grid applications
  • Must be accessible Grid-wide
• But Data Grids can be large…
An Example – The LCG Sites
• LCG – LHC Computing Grid
  • Distributes and processes the data generated by the LHC (Large Hadron Collider) at CERN
  • ~200 sites and ~5,000 users worldwide
Taken from: http://goc03.grid-support.ac.uk/googlemaps/lcg.html
Challenges for Catalog Services
• Scalability
  • Hundreds of grid sites
  • Thousands of users
• Geographical distribution
  • Network latency
• Dependability
  • In a large and heterogeneous system, failures will be common
• A centralized system does not meet these requirements
  • Distribution and replication are required
Off-the-shelf DB Replication?
• Most DB systems have their own replication mechanisms
  • Oracle Streams, Slony for PostgreSQL, MySQL replication
• Example: 3D Project at CERN (Distributed Deployment of Databases)
  • Uses Oracle Streams for replication
  • Being deployed only at a few LCG sites (~10 sites, Tier-0 and Tier-1s)
  • Requires Oracle ($$$) and expert on-site DBAs ($$$)
    • Most sites don't have these resources
• Off-the-shelf replication is vendor-specific
  • But Grids are heterogeneous by nature
  • Sites have different DB systems available
• At best a partial solution to the problem of metadata replication
Replication in the Catalog
• Alternative we are exploring: replication in the Metadata Catalog itself
• Advantages
  • Database independent
  • Metadata-aware replication
    • More efficient – replicates metadata commands
    • Better functionality – partial replication, federation
  • Ease of deployment and administration
    • Built into the Metadata Catalog
    • No need for a dedicated DB administrator
• The AMGA Metadata Catalog is the basis for our work on replication
The AMGA Metadata Catalog
• Metadata Catalog of the gLite middleware (EGEE)
• Several groups of users among the EGEE community:
  • High Energy Physics
  • Biomed
• Main features
  • Dynamic schemas
  • Hierarchical organization
  • Security:
    • Authentication: user/password, X.509 certificates, GSI
    • Authorization: VOMS, ACLs
AMGA Implementation
• C++ implementation
• Back-ends
  • Oracle, MySQL, PostgreSQL, SQLite
• Front-end – TCP streaming
  • Text-based protocol, like TELNET, SMTP, POP…
• Examples (adding and retrieving data; see the client sketch below):
  addentry /DLAudio/song.mp3 /DLAudio:Author 'John Smith' /DLAudio:Album 'Latest Hits'
  selectattr /DLAudio:FILE /DLAudio:Author /DLAudio:Album 'like(/DLAudio:FILE, "%.mp3")'
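Because the protocol is plain text over TCP, a minimal client fits in a few lines. The following Python sketch is illustrative only: the host name is hypothetical, the reply handling is simplified, and real deployments would use the official AMGA client libraries rather than raw sockets.

    # Hedged sketch of a raw client for AMGA's text-based TCP front-end.
    # Host name and reply handling are assumptions for illustration only.
    import socket

    HOST, PORT = "amga.example.org", 8822   # hypothetical server address

    def send_command(sock, command):
        """Send one metadata command and return the raw text reply."""
        sock.sendall((command + "\n").encode())
        return sock.recv(4096).decode()     # simplistic: assumes the reply fits one read

    with socket.create_connection((HOST, PORT)) as s:
        # Add an entry with two attributes (same example as above)
        print(send_command(s, "addentry /DLAudio/song.mp3 "
                              "/DLAudio:Author 'John Smith' /DLAudio:Album 'Latest Hits'"))
        # Query entries whose file name ends in .mp3
        print(send_command(s, "selectattr /DLAudio:FILE /DLAudio:Author /DLAudio:Album "
                              "'like(/DLAudio:FILE, \"%.mp3\")'"))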
Standalone Performance
• A single server scales well up to 100 concurrent clients
  • Could not go past 100 clients; limited by the database back-end
• WAN access is one to two orders of magnitude slower than LAN access
• Replication can address both bottlenecks
Metadata Replication with AMGA
Requirements of EGEE Communities
• Motivation: requirements of EGEE's user communities, mainly HEP and Biomed
• High Energy Physics (HEP)
  • Millions of files, 5,000+ users distributed across 200+ computing centres
  • Mainly (read-only) file metadata
  • Main concerns: scalability, performance and fault-tolerance
• Biomed
  • Manages medical images on the Grid
  • Data produced in a distributed fashion by laboratories and hospitals
  • Highly sensitive data: patient details
  • Smaller scale than HEP
  • Main concern: security
Metadata Replication
• Some replication models:
  • Partial replication
  • Full replication
  • Federation
  • Proxy
Architecture
• Main design decisions (see the sketch below)
  • Asynchronous replication – tolerates high latencies and supports fault-tolerance
  • Partial replication – replicate only what is interesting for the remote users
  • Master-slave – writes are only allowed on the master
    • Mastership is granted per metadata collection, not per node
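A rough sketch of how these decisions fit together, not AMGA's actual code: writes are accepted only for collections the node masters, and each accepted update is queued asynchronously for exactly those slaves whose subscriptions cover the affected collection. All names and data structures below are illustrative assumptions.

    # Hedged sketch of asynchronous, partial, master-slave replication with
    # per-collection mastership. Illustrative only; not AMGA's implementation.
    from collections import deque

    def apply_locally(command):
        pass  # stand-in for executing the metadata command on the local back-end

    class ReplicationMaster:
        def __init__(self, owned_collections):
            self.owned = set(owned_collections)   # collections this node masters
            self.subscriptions = {}               # slave_id -> set of collection prefixes
            self.pending = {}                     # slave_id -> queued updates (shipped later)

        def subscribe(self, slave_id, prefixes):
            self.subscriptions[slave_id] = set(prefixes)
            self.pending[slave_id] = deque()

        def write(self, collection, command):
            # Master-slave: a write is rejected unless this node masters the collection.
            if collection not in self.owned:
                raise PermissionError("not master of " + collection)
            apply_locally(command)
            # Partial replication: queue the update only for slaves whose
            # subscription covers this collection; shipping is asynchronous.
            for slave_id, prefixes in self.subscriptions.items():
                if any(collection.startswith(p) for p in prefixes):
                    self.pending[slave_id].append(command)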
Status
• Initial implementation completed
• Available functionality:
  • Full and partial replication
  • Chained replication (master → slave1 → slave2)
  • Federation – basic support
    • Data is always copied to the slave
  • Cross-DB replication: PostgreSQL → MySQL tested
    • Other combinations should work, give or take some debugging (see the sketch below)
• Available as part of AMGA
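Cross-DB replication is possible because what gets shipped are metadata-level commands, which each slave replays against whatever back-end it runs, rather than database-level change records. A hedged illustration of the idea, with deliberately simplified parsing and schema (SQLite stands in for an arbitrary slave back-end):

    # Hedged illustration of replaying a shipped metadata command on the slave's
    # own back-end; the parsing and SQL schema below are simplified assumptions.
    import shlex
    import sqlite3

    def apply_replicated_command(conn, command):
        """Replay one replicated metadata command against the local database."""
        tokens = shlex.split(command)                    # respects the quoted attribute values
        if tokens[0] == "addentry":
            entry = tokens[1]
            attrs = dict(zip(tokens[2::2], tokens[3::2]))   # attribute name -> value
            conn.execute("INSERT INTO entries(name, author, album) VALUES (?, ?, ?)",
                         (entry, attrs.get("/DLAudio:Author"), attrs.get("/DLAudio:Album")))
            conn.commit()

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entries(name TEXT, author TEXT, album TEXT)")
    apply_replicated_command(conn, "addentry /DLAudio/song.mp3 "
                                   "/DLAudio:Author 'John Smith' /DLAudio:Album 'Latest Hits'")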
Benchmark Results
Benchmark Study
• Investigate the following:
  • Overhead of replication and scalability of the master
  • Behaviour of the system under faults
Scalability
• Setup
  • Insertion rate at the master: 90 entries/s
  • Total: 10,000 entries
  • 0 slaves: replication updates are logged but not shipped (slaves disconnected)
• Results
  • Small increase in CPU usage as the number of slaves increases
    • With 10 slaves, ~20% increase over standalone operation
  • The number of update logs sent scales almost linearly
Fault Tolerance
• This test illustrates the fault-tolerance mechanisms
  • If a slave fails, the master keeps the updates for that slave and the replication log grows
  • When the slave reconnects, the master sends the pending updates
  • Eventually the system recovers to a steady state with the slave up to date
• Setup:
  • Insertion rate at the master: 50 entries/s
  • Total: 20,000 entries
  • Two slaves, both start connected
  • Slave1 disconnects temporarily
Fault Tolerance and Recovery
• While slave1 is disconnected, the replication log grows in size
  • The log is limited in size; a slave is unsubscribed if it does not reconnect in time
• After the slave reconnects, the system recovers in around 60 seconds (see the sketch below)
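A hedged sketch of this behaviour: the master keeps a bounded per-slave log of pending updates, drains it when the slave reconnects, and unsubscribes a slave whose backlog overflows the bound. The size limit and all names are illustrative assumptions.

    # Hedged sketch of the bounded replication log and recovery behaviour.
    # The bound and the class/function names are illustrative assumptions.
    from collections import deque

    MAX_PENDING = 100_000          # assumed bound on the per-slave replication log

    def ship(update):
        pass                       # stand-in for streaming the update to the slave

    class SlaveChannel:
        def __init__(self):
            self.connected = True
            self.pending = deque()

        def queue_update(self, update):
            if len(self.pending) >= MAX_PENDING:
                return "unsubscribed"        # slave did not reconnect in time
            self.pending.append(update)      # log grows while the slave is down
            if self.connected:
                self.flush()
            return "ok"

        def flush(self):
            # On (re)connection the master ships the whole backlog, after which
            # the system is back in a steady state with the slave up to date.
            while self.pending:
                ship(self.pending.popleft())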
Future Work / Open Challenges
Scalability
• Support hundreds of replicas
  • HEP use case; extreme case: one replica catalog per site
• Challenges
  • Scalability
  • Fault-tolerance – tolerate failures of slaves and of the master
  • The current method of shipping updates (direct streaming) might not scale
• Possible approaches
  • Chained replication (divide and conquer)
    • Already possible with AMGA; performance needs to be studied
  • Group communication
Federation
• Federation of independent catalogs
  • Biomed use case
• Challenges
  • Provide a consistent view over the federated catalogs
    • Shared namespace
  • Security – trust management, access control and user management
• Ideas
Conclusion
• Replication of metadata catalogs is necessary for Data Grids
• We are exploring replication at the catalog level using AMGA
  • Initial implementation completed
  • First results are promising
• Currently working on improving scalability and on federation
• More information about our current work:
  http://project-arda-dev.web.cern.ch/project-arda-dev/metadata/