Remote Data Access Working Group Introductory Session
Remote Data Access Working Group • Grid Forum 5 • Reagan Moore • Summary of Working Group Activities • Challenges: • Rapid evolution of grid environments • Pressure of application implementation • Interactions with Grid Forum working groups
Organization • Name: Remote Data Access Working Group • Chairs: Reagan Moore, Ann Chervenak, John Karpovich • Document Editor: Eric Stephan • Charter: Interoperability between remote data access systems • Short-term goals: • Review “Summary of Data Grids” • Define framework for common functionality across data grids
Working Group Liaisons for Requirements Lists • Accounting: Ed Hanna • Grid Performance: Brian Tierney • Information: Ann Chervenak • Program Models: Tracey Smith/Craig Lee • Scheduling: Judy Beiriger • Security: (Open position) • User Services: Judith Utley • (Applications and Tools): Ron Oldfield
Semantics for Data Access • File based access • User owns the files • Globus, Nile, IBP • Object based access • Object is member of a class • Legion, CORBA, “Objectivity” • Collection based access • Collection owns files • Storage Resource Broker, Digital libraries
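Remote Data Access Architecture Convergence • [Diagram: two parallel stacks compared side by side] • Globus: Application, FTP Client, Replica Catalog, Metadata Catalog, FTP Daemon, Storage System • SDSC Storage Resource Broker: Application, SRB Client, Metadata Catalog, SRB Server, Storage System • "?" marks appear below the application layer and below the catalog layer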
Evolution of Data Management • A grid supports • Data management • Access to distributed storage systems • Users also require • Information management • Tagged attributes of the stored data sets • Knowledge management • Relationships between the concepts described by the data set attributes
Architecture • [Layered diagram; labels listed in the order they were extracted] • API that provides “glue” to underlying data handling systems (security, scheduling, QoS, access protocol, data format/model, adaptivity, info discovery, location control) • Application + authentication + authorization • Data Model Management • Remote Procedure Execution • Armada • D’agents, FEL, ADR • GRAM, SRB • Information Discovery • Data Handling Systems: Condor, GASS, NILE, [SRB], I-2 caching • (e.g., filtering) • API that provides “glue” to underlying storage, QoS, etc. [GASS, IBP, SRB] • Dynamic Info Discovery • Storage System Description • Storage Resources: DPSS, HPSS, ADSM, DMF, Unitree, NASstore, DFS, DB2, Oracle, Illustra, Sybase, O2, ObjectStore, Objectivity • (which perf. monitor, what QoS, location, what access control, replication) • GloPerf, Netlogger, NWS
Information Based Grid Management • [Diagram; recoverable labels follow] • Access Services • Tagging of data • Information level: Information Repository; Attribute-based Query; Attributes, Semantics; SDLIP; XML DTD • (Data Handling System - SRB / FTP / HTTP) • Data level: Fields, Containers, Folders; Storage (Replicas, Persistent IDs); Grids; Feature-based Query; MCAT/HDF
Knowledge Based Grid Management • [Diagram; recoverable labels follow] • Access Services • Tagging of data • Knowledge level: Relationships Between Concepts; Knowledge Repository for Rules; Knowledge or Topic-Based Query / Browse; XTM DTD; Rules - KQL • (Topic Maps / Buckets / Model-based Access) • Information level: Information Repository; Attribute-based Query; Attributes, Semantics; SDLIP; XML DTD • (Data Handling System - SRB / FTP / HTTP) • Data level: Fields, Containers, Folders; Storage (Replicas, Persistent IDs); Grids; Feature-based Query; MCAT/HDF
Emerging Applications • Virtual Data Products • NSF GriPhyN ITR project • Dynamically create product by application of analysis procedures • Information Repositories • Protein Data Bank • Support application of structural comparison algorithms • Collections • National Virtual Observatory • Federate sky surveys
Current Papers • Remote Data Access Architectures • Presented at GF4 • Summary/survey of existing data grids • Presented at GF4 • Data Transport Protocol • GridFTP presentation at GF5
Grid Forum 5 Sessions • Monday 11:00 - XML Tutorial • Information tagging • Relationship tagging • Monday 4:30 - GF/eGRID survey • Working group session to identify requirements • Tuesday 3:00 - GridFTP specification • Working group session on data transport protocol
DATA Working Groups • GF/eGRID discussion • GridFTP discussion
Grid Forum Architecture Working Group • Led by Charlie Catlett • Discussion of need for: • Network services perspective for designing protocols and APIs for Grid Forum services • Distributed operating system perspective for designing an architecture (naming, binding, persistence, process management, storage)
GF/eGRID Discussion • Led by Reagan Moore • What access protocols are of interest? • What latency hiding mechanisms are of interest? • Data streaming • Caching • Replicas • Containers for aggregation • Remote proxies for bundling I/O commands
GF/eGRID Discussion • What are data management requirements? • Data collections • Information catalogs • Knowledge repositories • What is the granularity of the data management systems? • Collection size • Object size • Data set access size
GF/eGRID Discussion • What is the time granularity? • Compute time = (Number of operations) / (Execution rate) • Transfer time = (Number of bytes) / (Transmission bandwidth) • How many operations are done per byte accessed (Ops-per-Byte)? • For your resources, is Ops-per-Byte ~ Execution rate / Bandwidth?
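The balance question on this slide can be illustrated with a small sketch. It takes compute time as operations divided by execution rate and transfer time as bytes divided by bandwidth, so the two are equal when Ops-per-Byte matches execution rate over bandwidth. The function names and the sample resource numbers are assumptions chosen for illustration, not figures from the working group.

```python
def compute_time(num_ops: float, exec_rate: float) -> float:
    """Seconds spent computing: operations / (operations per second)."""
    return num_ops / exec_rate


def transfer_time(num_bytes: float, bandwidth: float) -> float:
    """Seconds spent moving data: bytes / (bytes per second)."""
    return num_bytes / bandwidth


def balance_ratio(exec_rate: float, bandwidth: float) -> float:
    """Ops-per-byte at which compute time equals transfer time."""
    return exec_rate / bandwidth


def is_compute_bound(ops_per_byte: float, exec_rate: float, bandwidth: float) -> bool:
    """True when the application does more work per byte than the balance point."""
    return ops_per_byte > balance_ratio(exec_rate, bandwidth)


# Hypothetical resource: 1e9 ops/s processor, 1e7 bytes/s network link.
rate, bw = 1e9, 1e7
print(balance_ratio(rate, bw))            # 100.0 ops per byte
print(is_compute_bound(500.0, rate, bw))  # True
```

An application well above the balance point spends most of its time computing, so network latency is easier to hide; one well below it is limited by data movement.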
GF/eGRID Discussion • A common application exists across Japan, the US, and Europe for the high energy physics community (CMS, ATLAS, BaBar) • NSF GriPhyN • DOE PPDG • CERN DataGrid • Japan ETL-KEK data grid • Analyze event data generated at CERN
CERN Event Data • “File” oriented access • Latency is smaller than the analysis time • Objects managed as a collection • Collection - 1 PB/year, event is 1 MB in size, implies 1 billion events per year
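The event count quoted on this slide follows from dividing the yearly collection size by the event size. The short check below assumes decimal units (1 PB = 10^15 bytes, 1 MB = 10^6 bytes); the function name is illustrative only.

```python
PB = 10**15  # assumption: decimal petabyte
MB = 10**6   # assumption: decimal megabyte


def events_per_year(collection_bytes_per_year: int, event_bytes: int) -> int:
    """Number of whole events that fit in one year's collection."""
    return collection_bytes_per_year // event_bytes


n = events_per_year(1 * PB, 1 * MB)
print(n)  # 1000000000
```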
Data Access Requirements • Current implementation • Global object namespace • Global schema • Each site replicates the catalog that manages the global namespace and global schema • Current data model is based upon Objectivity
Data Management • Objects identified by • Database/container/page/slot • Each database can be thought of as a file • Replication at the file level • Analysis time is 10-100 seconds per object • Suggests alternate management by • Object level access • Size of initial object is 1 MB • Derived products are 100 kB to 10 kB in size
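A minimal sketch of the four-part object identifier described on this slide. The integer fields, the slash-separated text form, and the class and method names are all hypothetical; this does not reproduce the actual Objectivity identifier format or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectId:
    """Hypothetical database/container/page/slot identifier."""
    database: int
    container: int
    page: int
    slot: int

    @classmethod
    def parse(cls, text: str) -> "ObjectId":
        """Parse an assumed 'database/container/page/slot' string."""
        parts = text.split("/")
        if len(parts) != 4:
            raise ValueError(f"expected 4 fields, got {len(parts)}: {text!r}")
        return cls(*(int(p) for p in parts))

    def replication_unit(self) -> int:
        """Each database is treated as a file, and replication is per file."""
        return self.database


oid = ObjectId.parse("12/3/450/7")
print(oid.replication_unit())  # 12
```

Because the replication unit is the whole database file, fetching one 1 MB object under this scheme means moving every object that shares its file, which is the motivation for the object level access suggested on the slide.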
Object Level Access • Manage 5 billion objects • Requires ability to • Export objects (encapsulated within XML) • Access individual objects within Objectivity • Definition of procedure for manipulating/subsetting an object • Maintains • Global namespace and global schema • Allows migration between collections
Common Requirements • Archive interface • Aggregation of objects into containers to minimize impact on archive namespace • Replication of objects to allow local analysis • Track where replicas are located to improve performance • Knowledge management for mapping between schema
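Two of these requirements, aggregating objects into containers and tracking where replicas are located, can be sketched with a small in-memory catalog. The class, its method names, and the object, container, and site names are hypothetical and do not correspond to any existing data grid API.

```python
from collections import defaultdict


class ReplicaCatalog:
    """Toy catalog: objects live in containers, containers are replicated."""

    def __init__(self):
        self._container_of = {}            # object name -> container name
        self._replicas = defaultdict(set)  # container name -> set of sites

    def add_to_container(self, object_name: str, container_name: str) -> None:
        """Record that an object is aggregated into a container."""
        self._container_of[object_name] = container_name

    def register_replica(self, container_name: str, site: str) -> None:
        """Record that a site holds a copy of the whole container."""
        self._replicas[container_name].add(site)

    def locate(self, object_name: str) -> list:
        """Return the sorted sites holding the object's container."""
        container = self._container_of[object_name]  # KeyError if unknown
        return sorted(self._replicas.get(container, ()))


catalog = ReplicaCatalog()
catalog.add_to_container("event-001", "container-A")
catalog.register_replica("container-A", "site-2")
catalog.register_replica("container-A", "site-1")
print(catalog.locate("event-001"))  # ['site-1', 'site-2']
```

The archive only sees one name per container, while the catalog still resolves individual objects to every site that holds a copy.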
GridFTP Proposal • Led by Steven Tuecke • Extensions to the FTP standard • RFC 959 - FTP definition • RFC 2228 - Security • RFC 2389 - Feature negotiation • What extensions are needed by the Grid Forum to support large data transfers over wide area networks?
Grid FTP • Add • Security extension - GSI • Partial file transfer - Unix semantics • Parallel I/O • Striped I/O • Buffer, window size tuning • Recoverable data transfers • Progress monitoring
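Partial file transfer and parallel I/O both rest on addressing a file by byte range. The sketch below shows one way to divide a file into contiguous (offset, length) pieces, one per stream. It only illustrates the idea; the function name is an assumption, and nothing here reflects the GridFTP wire protocol or any client library.

```python
def split_ranges(total_bytes: int, streams: int) -> list:
    """Return (offset, length) pairs covering [0, total_bytes) in order.

    Pieces differ in size by at most one byte; empty pieces are dropped.
    """
    if total_bytes < 0 or streams < 1:
        raise ValueError("total_bytes must be >= 0 and streams >= 1")
    base, extra = divmod(total_bytes, streams)
    ranges = []
    offset = 0
    for i in range(streams):
        # The first `extra` streams each carry one additional byte.
        length = base + (1 if i < extra else 0)
        if length == 0:
            continue
        ranges.append((offset, length))
        offset += length
    return ranges


print(split_ranges(10, 3))  # [(0, 4), (4, 3), (7, 3)]
```

Keeping track of which ranges have completed is also the natural basis for a recoverable transfer: after an interruption, only the unfinished ranges need to be requested again.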
Timeline • E-mail discussion of current draft • Next 2 months • Complete draft by June, 2001 • Implementation by June, 2001 • Depending upon further extensions • Definition of the API is in the scope of another working group
Participants • Steven Tuecke <tuecke@mcs.anl.gov> • Bill Allcock • Lee Liming • Ann Chervenak <annc@ISI.EDU> • John Karpovich <karp@virginia.edu> • Dan Gunter <dkgunter@lbl.gov> • Tiziana Ferrari <ferrari@cnaf.infn.it> • Parkson Wong <parkson@nas.nasa.gov> • Heinz Stockinger <heinz.stockinger@cern.ch> • Samuel Meder <meder@mcs.anl.gov> • Reagan Moore <moore@sdsc.edu>