Data Management by Cor Cornelisse
Introduction • Data Intensive Applications: • Physics (particle accelerators) • Simulated Science (super computers) • The Large Hadron Collider (LHC) at CERN produces several petabytes of raw and derived data per year for approximately 15 years.
File Storage Systems • Tree Storage System • Meta Attributes • Remote File Storage • Distributed File Storage
Tree Storage System • Filesystem has 1 root directory • Each directory has • Files • Directories
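A minimal Python sketch of the tree structure described on this slide: one root directory, where each directory holds files and further directories. The `Directory` class and `walk` function are hypothetical names chosen for illustration, not part of any real filesystem API.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Directory:
    """A directory holds files and further directories."""
    name: str
    files: List[str] = field(default_factory=list)
    subdirs: List["Directory"] = field(default_factory=list)


def walk(directory: Directory, prefix: str = "") -> Iterator[str]:
    """Yield the full path of every file at or below `directory`."""
    # The root directory is modelled with an empty name, so its path is "".
    path = f"{prefix}/{directory.name}" if directory.name else prefix
    for filename in directory.files:
        yield f"{path}/{filename}"
    for sub in directory.subdirs:
        yield from walk(sub, path)


# One root directory containing a file and a subdirectory.
root = Directory(
    "",
    files=["readme.txt"],
    subdirs=[Directory("data", files=["run1.dat"])],
)
paths = list(walk(root))
```

Every file is reachable by exactly one path from the root, which is what makes a path usable as a locator.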
Meta Attributes • File name • File size • File type • Last modified date • Last accessed date • Creation date • Owner • Permissions • Description
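As an illustration, most of these attributes can be read on a local filesystem with Python's standard `os.stat` call. This is a sketch only: `meta_attributes` is a hypothetical helper, the creation date is omitted because it is not exposed portably, and a free-text description is not something ordinary filesystems store.

```python
import os
import stat
import tempfile
import time


def meta_attributes(path):
    """Collect a few meta attributes of `path` from the local filesystem."""
    info = os.stat(path)
    return {
        "name": os.path.basename(path),
        "size": info.st_size,                          # bytes
        "is_directory": stat.S_ISDIR(info.st_mode),    # coarse file type
        "last_modified": time.ctime(info.st_mtime),
        "last_accessed": time.ctime(info.st_atime),
        "owner_uid": info.st_uid,                      # numeric owner id
        "permissions": stat.filemode(info.st_mode),    # e.g. "-rw-------"
    }


# Create a small scratch file so the example is self-contained.
with tempfile.NamedTemporaryFile(suffix=".dat", delete=False) as tmp:
    tmp.write(b"hello")

attrs = meta_attributes(tmp.name)
os.remove(tmp.name)
```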
Remote File Storage • Files are not stored on the local machine but on a remote machine • Common Goal: Transparency for user and applications • Usual implementation: Locator for file storage consisting of host and share name (Samba, NFS) • Problem: because the host is part of the locator, files cannot be moved to a different host without changing the locator
Distributed File Storage • Target: Keep actual host out of file locator • Solution: Introduce Realms instead of single hosts • Locator now points to a Realm, with the path relative to that Realm
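The effect of replacing hosts with Realms can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `realm:/path` locator syntax, the `REALMS` table, the `resolve` helper, and the `example.org` host names. Real distributed filesystems resolve realms through their own lookup services.

```python
# Hypothetical realm table: realm name -> host currently serving that realm.
REALMS = {"physics": "fs1.example.org"}


def resolve(locator, realms):
    """Translate a 'realm:/path' locator into a (host, path) pair."""
    realm, sep, path = locator.partition(":")
    if not sep or realm not in realms:
        raise KeyError(f"unknown realm in locator: {locator!r}")
    return realms[realm], path


locator = "physics:/lhc/run1.dat"
before = resolve(locator, REALMS)

# The data moves to another host: only the realm table changes,
# the locator used by users and applications stays the same.
REALMS["physics"] = "fs2.example.org"
after = resolve(locator, REALMS)
```

The locator never names a host, so moving the files requires no change on the client side.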
Overall problem • Scenario: • Wide diversity in Storage Systems • All have their own protocols (which are often incompatible)
Solution • Layered client or gateway • Extra Layer • Sophisticated • Hard to keep up with all the different protocols • Common data transfer protocol • Greater reliability • Performance increase
Basic Data Management Mechanisms • GridFTP • OGSA-DAI (Data Access and Integration) • Metadata Catalog Service (MCS)
GridFTP • Extensions to FTP Protocol: • Third-party control of data transfer • Parallel data transfer • Striped data transfer • Partial file transfer • Automatic negotiation of TCP buffer/window sizes • Support for reliable and restartable data transfers
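Parallel and partial transfers both depend on addressing a file by byte ranges. The Python sketch below shows only that arithmetic, under the assumption that a file is divided into contiguous blocks, one per stream; `split_ranges` is a hypothetical helper and is not part of GridFTP or the Globus libraries.

```python
def split_ranges(file_size, streams):
    """Return `streams` contiguous (offset, length) pairs covering the file."""
    if streams < 1:
        raise ValueError("need at least one stream")
    base, extra = divmod(file_size, streams)
    ranges, offset = [], 0
    for i in range(streams):
        # The first `extra` streams carry one additional byte each.
        length = base + (1 if i < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges


# A 10-byte file sent over 4 parallel streams.
ranges = split_ranges(10, 4)
```

Each pair could be fetched independently as a partial transfer, and a restart after a failure only needs to repeat the ranges that did not complete.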
GridFTP (cont’d) – Implementations - 1 • Globus_ftp_control_library: • Separate channels, allowing parallel, striped and third-party data transfers • Control Channel (authentication, creation of control and data channels, reading and writing over data channels) • Multiple Data Channels
GridFTP (cont’d) – Implementations - 2 • Globus_ftp_client_library: • Complete File get and put operations • Set the level of parallelism • Partial file transfer operations • Third-party transfers • Eventually functions to set TCP buffer sizes • Support for Automatic negotiation of TCP Buffer/window sizes (not yet implemented)
OGSA-DAI • Supports data access, insert and update • Relational: MySQL, Oracle, DB2, SQL Server, Postgres • XML: Xindice, eXist • Files: CSV, BinX, EMBL, OMIM, SWISSPROT,… • Supports data delivery • SOAP over HTTP • FTP; GridFTP • E-mail • Inter-service • Supports data transformation • XSLT • ZIP; GZIP • Supports security • X.509 certificate based security
Metadata Catalog Service • Logical file • Logical collection • Logical view • Authorization • Annotation • Creation and transformation history • User defined attributes
Replica Management • Maintain a mapping between logical names for files and collections and one or more physical locations • Important for many applications • Example: CERN HLT data • Multiple petabytes of data per year • Copy of everything at CERN (Tier 0) • Subsets at national centers (Tier 1) • Smaller regional centers (Tier 2) • Individual researchers will have copies
Replica Management (cont’d) • Globus toolkit: • Replica catalog definition • LDAP object classes for representing logical-to-physical mappings in an LDAP catalog • Low-level replica catalog API • globus_replica_catalog library • Manipulates replica catalog: add, delete, etc. • High-level reliable replication API • globus_replica_manager library • Combines calls to file transfer operations and calls to low-level API functions: create, destroy, etc.
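The logical-to-physical mapping at the heart of replica management can be sketched as a small in-memory catalog. This is an illustration under stated assumptions, not the `globus_replica_catalog` API: the `ReplicaCatalog` class, its method names, and the `example.org` URLs are all hypothetical, and the real catalog is stored in LDAP.

```python
from collections import defaultdict


class ReplicaCatalog:
    """Maps a logical file name to the physical URLs of its replicas."""

    def __init__(self):
        self._locations = defaultdict(set)

    def add(self, logical_name, physical_url):
        """Register one more physical copy of a logical file."""
        self._locations[logical_name].add(physical_url)

    def delete(self, logical_name, physical_url):
        """Forget one physical copy; drop the entry when none remain."""
        self._locations[logical_name].discard(physical_url)
        if not self._locations[logical_name]:
            del self._locations[logical_name]

    def lookup(self, logical_name):
        """Return all known physical locations, sorted for stable output."""
        return sorted(self._locations.get(logical_name, ()))


catalog = ReplicaCatalog()
catalog.add("run42.dat", "gsiftp://tier0.example.org/data/run42.dat")
catalog.add("run42.dat", "gsiftp://tier1.example.org/data/run42.dat")
both = catalog.lookup("run42.dat")

catalog.delete("run42.dat", "gsiftp://tier0.example.org/data/run42.dat")
remaining = catalog.lookup("run42.dat")
```

A high-level replication operation would pair a file transfer with a call such as `add`, so that the catalog is only updated once the copy exists.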
Example Replica Catalog (diagram) • Logical Collection: CO2 measurements 1998 (Filename: Jan 1998, Filename: Feb 1998, …) • Logical Collection: CO2 measurements 1999 • Logical File Parent • Location: jupiter.isi.edu (Filename: Mar 1998, Filename: Jun 1998, Filename: Oct 1998; Protocol: gsiftp; UrlConstructor: gsiftp://jupiter.isi.edu/nfs/v6/climate) • Location: sprite.llnl.gov (Filename: Jan 1998, …, Filename: Dec 1998; Protocol: ftp; UrlConstructor: ftp://sprite.llnl.gov/pub/pcmdi) • Logical File: Jan 1998 • Logical File: Feb 1998 • Size: 1468762