
Data Management

By Cor Cornelisse. Introduction: data-intensive applications include physics (particle accelerators) and simulated science (supercomputers). The Large Hadron Collider (LHC) at CERN is expected to produce several petabytes of raw and derived data per year over approximately 15 years.


Presentation Transcript


  1. Data Management by Cor Cornelisse

  2. Introduction • Data-intensive applications: • Physics (particle accelerators) • Simulated science (supercomputers) • The Large Hadron Collider (LHC) at CERN is expected to produce several petabytes of raw and derived data per year over approximately 15 years.

  3. Introduction (cont’d)

  4. File Storage Systems • Tree Storage System • Meta Attributes • Remote File Storage • Distributed File Storage

  5. Tree Storage System • Filesystem has 1 root directory • Each directory has • Files • Directories
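The tree structure on this slide can be sketched in a few lines of Python. This is only an illustration: the class names `File` and `Directory` and the helper `resolve` are hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class File:
    name: str
    size: int = 0


@dataclass
class Directory:
    name: str
    # Each directory holds files and further directories, keyed by name.
    entries: Dict[str, Union["Directory", File]] = field(default_factory=dict)

    def add(self, entry):
        self.entries[entry.name] = entry
        return entry


def resolve(root: Directory, path: str):
    """Walk from the single root directory down a '/'-separated path."""
    node = root
    for part in [p for p in path.split("/") if p]:
        if not isinstance(node, Directory) or part not in node.entries:
            raise FileNotFoundError(path)
        node = node.entries[part]
    return node


# The filesystem has exactly one root; everything else hangs below it.
root = Directory("")
data = root.add(Directory("data"))
data.add(File("run1.raw", size=1024))
```

Calling `resolve(root, "/data/run1.raw")` returns the `File` object, and an unknown path raises `FileNotFoundError`.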

  6. Meta Attributes • File name • File size • File type • Last modified date • Last accessed date • Creation date • Owner • Permissions • Description
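Most of the attributes listed above can be read on a local filesystem with Python's standard `os.stat`. The sketch below is an assumption about how one might collect them; the function name `describe` is hypothetical. Creation date and description are left out because `os.stat` does not expose them portably.

```python
import os
import stat
import tempfile
from datetime import datetime, timezone


def describe(path: str) -> dict:
    """Collect the portable meta attributes of a file."""
    st = os.stat(path)
    return {
        "name": os.path.basename(path),
        "size": st.st_size,
        "is_dir": stat.S_ISDIR(st.st_mode),
        "modified": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
        "accessed": datetime.fromtimestamp(st.st_atime, tz=timezone.utc),
        "owner_uid": st.st_uid,
        # Symbolic permission string such as '-rw-------'.
        "permissions": stat.filemode(st.st_mode),
    }


# Usage: write a small temporary file and inspect it.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(b"a,b\n1,2\n")
info = describe(tmp.name)
os.remove(tmp.name)
```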

  7. Remote File Storage • Files are not stored on the local machine but on a remote machine • Common goal: transparency for users and applications • Usual implementation: the locator for file storage consists of a host and a share name (e.g. Samba, NFS) • Problem: files cannot be moved to a different host without changing their locator

  8. Distributed File Storage • Target: keep the actual host out of the file locator • Solution: introduce Realms instead of single hosts • The locator now points to a Realm, with the path relative to that Realm
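The difference between a host-based locator and a realm-based locator can be illustrated as follows. All names here (`HostLocator`, `RealmLocator`, `RealmDirectory`, the example hosts and paths) are hypothetical; the slides do not specify how realms are resolved, so the lookup table is an assumption.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class HostLocator:
    """Remote file storage: the host name is baked into the locator."""
    host: str
    share: str
    path: str

    def url(self) -> str:
        return f"//{self.host}/{self.share}/{self.path}"


@dataclass(frozen=True)
class RealmLocator:
    """Distributed file storage: the locator names a realm, not a host."""
    realm: str
    path: str


class RealmDirectory:
    """Hypothetical lookup table mapping each realm to its current host."""

    def __init__(self) -> None:
        self._hosts: Dict[str, str] = {}

    def assign(self, realm: str, host: str) -> None:
        self._hosts[realm] = host

    def resolve(self, loc: RealmLocator) -> str:
        return f"//{self._hosts[loc.realm]}/{loc.path}"


realms = RealmDirectory()
realms.assign("physics", "host-a.example.org")
loc = RealmLocator("physics", "runs/2004/run1.raw")
before = realms.resolve(loc)
realms.assign("physics", "host-b.example.org")  # the data moved to another host
after = realms.resolve(loc)                     # same locator, new physical host
```

The locator `loc` never changes when the data moves; only the realm-to-host mapping is updated.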

  9. Overall problem • Scenario: • Wide diversity in Storage Systems • All have their own protocols (which are often incompatible)

  10. Solution • Layered client or gateway • Extra Layer • Sophisticated • Hard to keep up with all the different protocols • Common data transfer protocol • Greater reliability • Performance increase

  11. Basic Data Management Mechanisms • GridFTP • OGSA-DAI (Data Access and Integration) • Metadata Catalog Service (MCS)

  12. GridFTP • Extensions to the FTP protocol: • Third-party control of data transfer • Parallel data transfer • Striped data transfer • Partial file transfer • Automatic negotiation of TCP buffer/window sizes • Support for reliable and restartable data transfers
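The idea behind partial and parallel transfer can be sketched without any GridFTP code: split the file into byte ranges, fetch each range independently, and write each chunk at its own offset. The sketch below is not the GridFTP API; `split_ranges` and `parallel_copy` are hypothetical names, and an in-memory `bytes` object stands in for the remote file.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def split_ranges(total: int, streams: int) -> List[Tuple[int, int]]:
    """Split [0, total) into at most `streams` contiguous (offset, length) ranges."""
    if total < 0 or streams < 1:
        raise ValueError("total must be >= 0 and streams >= 1")
    base, extra = divmod(total, streams)
    ranges, offset = [], 0
    for i in range(streams):
        length = base + (1 if i < extra else 0)
        if length:
            ranges.append((offset, length))
            offset += length
    return ranges


def parallel_copy(source: bytes, streams: int) -> bytes:
    """Copy `source` using several concurrent partial reads."""

    def fetch(rng: Tuple[int, int]) -> Tuple[int, bytes]:
        offset, length = rng
        # Stand-in for one partial file transfer over its own data channel.
        return offset, source[offset:offset + length]

    target = bytearray(len(source))
    with ThreadPoolExecutor(max_workers=streams) as pool:
        for offset, chunk in pool.map(fetch, split_ranges(len(source), streams)):
            target[offset:offset + len(chunk)] = chunk
    return bytes(target)
```

Because every chunk carries its offset, a failed range can be fetched again on its own, which is also the basic idea behind restartable transfers.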

  13. Striping
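Only the title of this slide survives in the transcript. As a general illustration of striping, the sketch below distributes fixed-size blocks of a file round-robin over several storage nodes and reassembles them by offset. The function names and the round-robin layout are assumptions, not taken from the presentation or from GridFTP itself.

```python
from typing import List, Tuple

Block = Tuple[int, bytes]  # (offset in the original file, block contents)


def stripe(data: bytes, nodes: int, block_size: int) -> List[List[Block]]:
    """Assign consecutive blocks of `data` to `nodes` in round-robin order."""
    if nodes < 1 or block_size < 1:
        raise ValueError("nodes and block_size must be >= 1")
    stripes: List[List[Block]] = [[] for _ in range(nodes)]
    for index, offset in enumerate(range(0, len(data), block_size)):
        stripes[index % nodes].append((offset, data[offset:offset + block_size]))
    return stripes


def reassemble(stripes: List[List[Block]], total: int) -> bytes:
    """Rebuild the file by writing every block back at its offset."""
    out = bytearray(total)
    for node_blocks in stripes:
        for offset, block in node_blocks:
            out[offset:offset + len(block)] = block
    return bytes(out)
```

Since each node holds only part of the file, the nodes can send their blocks at the same time.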

  14. GridFTP (cont’d) – Implementations - 1 • globus_ftp_control library: • Separate channels, allowing parallel, striped and third-party data transfers • Control Channel (authentication, creation of control and data channels, reading and writing over data channels) • Multiple Data Channels

  15. GridFTP (cont’d) – Implementations - 2 • globus_ftp_client library: • Complete file get and put operations • Set the level of parallelism • Partial file transfer operations • Third-party transfers • Eventually functions to set TCP buffer sizes • Support for automatic negotiation of TCP buffer/window sizes (not yet implemented)

  16. GridFTP (cont’d) - Performance

  17. OGSA-DAI • Supports data access, insert and update • Relational: MySQL, Oracle, DB2, SQL Server, Postgres • XML: Xindice, eXist • Files: CSV, BinX, EMBL, OMIM, SWISSPROT,… • Supports data delivery • SOAP over HTTP • FTP; GridFTP • E-mail • Inter-service • Supports data transformation • XSLT • ZIP; GZIP • Supports security • X.509 certificate based security

  18. OGSA-DAI (cont’d)

  19. Metadata Catalog Service • Logical file • Logical collection • Logical view • Authorization • Annotation • Creation and transformation history • User defined attributes

  20. MCS (cont’d) overview

  21. MCS (cont’d)

  22. Replica Management • Maintain a mapping between logical names for files and collections and one or more physical locations • Important for many applications • Example: CERN HLT data • Multiple petabytes of data per year • Copy of everything at CERN (Tier 0) • Subsets at national centers (Tier 1) • Smaller regional centers (Tier 2) • Individual researchers will have copies

  23. Replica Management (cont’d) • Globus toolkit: • Replica catalog definition • LDAP object classes for representing logical-to-physical mappings in an LDAP catalog • Low-level replica catalog API • globus_replica_catalog library • Manipulates replica catalog: add, delete, etc. • High-level reliable replication API • globus_replica_manager library • Combines calls to file transfer operations and calls to low-level API functions: create, destroy, etc.

  24. Example Replica Catalog • Logical Collection: CO2 measurements 1998 • Logical Collection: CO2 measurements 1999 • Filename: Jan 1998 • Filename: Feb 1998 • … • Logical File Parent • Location jupiter.isi.edu: Filename: Mar 1998; Filename: Jun 1998; Filename: Oct 1998; Protocol: gsiftp; UrlConstructor: gsiftp://jupiter.isi.edu/nfs/v6/climate • Location sprite.llnl.gov: Filename: Jan 1998 … Filename: Dec 1998; Protocol: ftp; UrlConstructor: ftp://sprite.llnl.gov/pub/pcmdi • Logical File Jan 1998 • Logical File Feb 1998 • Size: 1468762
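The logical-to-physical mapping in this example can be modelled with a small in-memory catalog. This is a sketch only, not the globus_replica_catalog API: the classes `Location` and `ReplicaCatalog` and their methods are hypothetical, and building a physical URL as the UrlConstructor prefix plus the URL-quoted filename is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set
from urllib.parse import quote


@dataclass
class Location:
    """One physical location holding some files of a logical collection."""
    url_constructor: str  # URL prefix, as in the UrlConstructor attribute
    filenames: Set[str] = field(default_factory=set)


@dataclass
class ReplicaCatalog:
    collection: str
    locations: Dict[str, Location] = field(default_factory=dict)

    def add_location(self, host: str, url_constructor: str) -> None:
        self.locations[host] = Location(url_constructor)

    def add_replica(self, host: str, filename: str) -> None:
        self.locations[host].filenames.add(filename)

    def delete_replica(self, host: str, filename: str) -> None:
        self.locations[host].filenames.discard(filename)

    def physical_urls(self, filename: str) -> List[str]:
        # Assumption: physical URL = UrlConstructor prefix + quoted filename.
        return [
            f"{loc.url_constructor}/{quote(filename)}"
            for _, loc in sorted(self.locations.items())
            if filename in loc.filenames
        ]


catalog = ReplicaCatalog("CO2 measurements 1998")
catalog.add_location("jupiter.isi.edu", "gsiftp://jupiter.isi.edu/nfs/v6/climate")
catalog.add_location("sprite.llnl.gov", "ftp://sprite.llnl.gov/pub/pcmdi")
for name in ("Mar 1998", "Jun 1998", "Oct 1998"):
    catalog.add_replica("jupiter.isi.edu", name)
# Only the filenames spelled out on the slide are registered here;
# the entries elided on the slide are deliberately left out.
for name in ("Jan 1998", "Dec 1998"):
    catalog.add_replica("sprite.llnl.gov", name)
```

A lookup such as `catalog.physical_urls("Oct 1998")` returns one URL per location that holds the file, so an application can choose which replica to fetch.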

  25. The End
