1 / 44

Data Grid Management Systems (DGMS)

Data Grid Management Systems (DGMS). Arun Jagatheesan San Diego Supercomputer Center arun@sdsc.edu. Acknowledgement: SDSC SRB Team. Arun Jagatheesan George Kremenek Sheau-Yen Chen Arcot Rajasekar Reagan Moore Michael Wan Roman Olschanowsky Bing Zhu Charlie Cowart Not In Picture:

bwhitesell
Download Presentation

Data Grid Management Systems (DGMS)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Grid Management Systems (DGMS) Arun Jagatheesan San Diego Supercomputer Center arun@sdsc.edu

  2. Acknowledgement: SDSC SRB Team • Arun Jagatheesan • George Kremenek • Sheau-Yen Chen • Arcot Rajasekar • Reagan Moore • Michael Wan • Roman Olschanowsky • Bing Zhu • Charlie Cowart Not In Picture: • Wayne Schroeder • Tim Warnock(BIRN) • Lucas Gilbert • Marcio Faerman (SCEC) • Antoine De Torcy Students: Xi (Cynthia) Sheng Allen Ding Grace Lin Jonathan Weinberg Yufang Hu Yi Li Emeritus: Vicky Rowley (BIRN) Qiao Xin Daniel Moore Ethan Chen Reena Mathew Erik Vandekieft Ullas Kapadia

  3. Talk Outline • Grid Computing and Data Grids • Inter-organizational Information Management using Data Grids • Gridflows and Data Grids • Opportunities and Challenges

  4. Grid as Utility Computing

  5. NIH BIRN Data Grid • Biomedical Informatics Research Network • Medical schools and research centers across the country • Access and analyze biomedical images • Coordinate sharing of data and storage resources • Storage virtualization, Data virtualization, Inter-organizational information virtualization • Inter and Intra Organizational Information Storage Management

  6. BIRN: Inter-organizational Data

  7. NTON vBNS Abilene Calren ESnet OC-12 vBNS Abilene MREN OC-12 OC-3 TeraGrid:13.6 TF, 6.8 TB memory, 900 TB network disk, 10 PB archive Caltech 0.5 TF .4 TB Memory 86 TB disk Extreme Blk Diamond ANL 1 TF .25 TB Memory 25 TB disk 574p IA-32 Chiba City 256p HP X-Class 32 32 24 32 32 128p Origin 128p HP V2500 24 32 HR Display & VR Facilities 24 92p IA-32 32 5 4 8 5 8 HPSS HPSS OC-48 OC-12 ESnet HSCC MREN/Abilene Starlight Chicago & LA DTF Core Switch/Routers Cisco 65xx Catalyst Switch (256 Gb/s Crossbar) Calren Juniper M160 OC-12 ATM OC-48 OC-12 GbE SDSC 4.1 TF 2 TB Memory 500 TB SAN NCSA 6+2 TF 4 TB Memory 400 TB disk OC-12 OC-12 OC-12 OC-3 4 8 Myrinet HPSS 9 PB UniTree 8 2 Sun Server 4 Myrinet 1024p IA-32 320p IA-64 1176p IBM SP 1.7 TFLOPs Blue Horizon 14 16 15xxp Origin 4 2 x Sun E10K

  8. NASA Data Grids • NASA Information Power Grid • NASA Ames, NASA Goddard • Distributed data collection using the SRB • ESIP federation • Led by Joseph JaJa (U Md) • Federation of ESIP data resources using the SRB • NASA Goddard Data Management System • Storage repository virtualization (Unix file system, Unitree archive, DMF archive) using the SRB • NASA EOS Petabyte store • Storage repository virtualization for EMC persistent store using the Nirvana version of SRB

  9. Southern California Earthquake Center

  10. Southern California Earthquake Center • Build community digital library • Manage simulation and observational data • Anelastic wave propagation output • 10 TBs, 1.5 million files • Provide web-based interface • Support standard services on digital library • Manage data distributed across multiple sites • USC, SDSC, UCSB, SDSU, SIO • Provide standard metadata • Community based descriptive metadata • Administrative metadata • Application specific metadata

  11. Pipeline could be triggered by input at data source or by a data request from user Pipeline could be triggered by input at data source or by a data request from user Gridflow in SCEC (data  information pipeline) Metadata derivation Ingest Data Ingest Metadata Determine analysis pipeline Initiate automated analysis Use the optimal set of resources based on the task – on demand Organize result data into distributed data grid collections All gridflow activities stored for data flow provenance and querying

  12. Commonality in all these projects • Distributed data management • Data Grids, Digital Libraries, Persistent Archives • Workflow/dataflow Pipelines • Data sharing across administrative domains • Common logical name space for all registered digital entities • Data publication • Browsing and discovery of data in collections • Data Preservation • Management of technology evolution

  13. Talk Outline • Grid Computing and Data Grids • Inter-organizational Information Management using Data Grids • SDSC Storage Resource Broker (SRB) • Gridflows and Data Grids • Opportunities and Challenges

  14. Data delivered Ask for data • The data is found and returned • Where & how details are managed by data grid • But access controls are specified by owner Using a Data Grid – in Abstract Data Grid • User asks for data from the data grid

  15. Common Data Grid Components • Federated client-server architecture • Servers can talk to each other independently of the client • Infrastructure independent naming • Logical names for users, resources, files, applications • Collective ownership of data • Collection-owned data, with infrastructure independent access control lists • Context management • Record state information in a metadata catalog from data grid services such as replication • Abstractions for dealing with heterogeneity

  16. Data Grid Transparencies • Find data without knowing the identifier • Descriptive attributes • Access data without knowing the location • Logical name space • Access data without knowing the type of storage • Storage repository abstraction • Retrieve data using your preferred API • Access abstraction • Provide transformations for any data collection • Data behavior abstraction

  17. myActiveNeuroCollection patientRecordsCollection image.cgi image.wsdl image.sql E:\srbVault\image.jpg /users/srbVault/image.jpg Select … from srb.mdas.td where... Logical Layers (bits,data,information,..) Inter-organizational Information Storage Management Semantic data Organization (with behavior) Information Virtualization Virtual Data Transparency Data Replica Transparency image_0.jpg…image_100.jpg Data Virtualization Data Identifier Transparency Storage Location Transparency Storage Resource Transparency Storage Virtualization

  18. Talk Outline • Grid Computing and Data Grids • Inter-organizational Information Management using Data Grids • SDSC Storage Resource Broker (SRB) • Gridflows and Data Grids • Opportunities and Challenges

  19. SDSC Storage Resource Broker • Distributed data management technology • Developed at San Diego Supercomputer Center (Univ. of California, San Diego) • 1996 - DARPA Massive Data Analysis • 1998 - DARPA/USPTO Distributed Object Computation Testbed • 2000 to present - NSF, NASA, NARA, DOE, DOD, NIH, NLM, NHPRC • Applications • Data grids - data sharing • Digital libraries - data publication • Persistent archives - data preservation • Used in national and international projects in support of Astronomy, Bio-Informatics, Biology, Earth Systems Science, Ecology, Education, Geology, Government records, High Energy Physics, Seismology

  20. SRB Data Collections at SDSC

  21. SRB Data Grid Environments • NSF Southern California Earthquake Center digital library • Worldwide Universities Network data grid • NASA Information Power Grid • NASA Goddard Data Management System data grid • DOE BaBar High Energy Physics data grid • NSF National Virtual Observatory data grid • NSF ROADnet real-time sensor collection data grid • NIH Biomedical Informatics Research Network data grid • NARA research prototype persistent archive • NSF National Science Digital Library persistent archive • NHPRC Persistent Archive Testbed

  22. Why they use SRB for Data Management? Digital Libraries • Logical Namespace of data collections • Replica Transparency and management • Meta-data Management • Storage resource virtualization • Information/data virtualization • Latency Management and bulk operations • Virtual data abstraction • Inter or Intra-organizational namespace of data and storage resources • Preserve data irrespective of technology evolution (say for 400 years) Data on demand Data Grids Persistent Archives

  23. Our “magic” recipe • Every thing in the world is abstract (kind of Matrix) • Databases: • Physical layer, logical layer • Physical layer not visible to user • SRB • Physical layer, logical layer, inter-zone layer • Every this is logical – but visible to users (as long as they have access)

  24. C, C++, Libraries Unix Shell Databases DB2, Oracle, Postgres Archives HPSS, ADSM, UniTree, DMF File Systems Unix, NT, Mac OSX SDSC Storage Resource Broker & Meta-data Catalog (One Zone) Application Linux I/O OAI WSDL Access APIs DLL / Python Java, NT Browsers GridFTP Consistency Management / Authorization-Authentication SRB Server Logical Name Space Latency Management Data Transport Metadata Transport Catalog Abstraction Storage Abstraction Databases DB2, Oracle, Sybase, SQLServer SRB Drivers HRM Our Magic Recipe: Every thing in this world is “abstract”

  25. Federated SRB server model Peer-to-peer Brokering Read Client Parallel Data Access Logical Name Or Attribute Condition 1 6 5/6 SRB server SRB server 3 4 5 SRB agent SRB agent 2 Server(s) Spawning 1.Logical-to-Physical mapping 2.Identification of Replicas 3.Access & Audit Control R2 R1 Data Access MCAT

  26. SRB Name Spaces • Digital Entities (files, blobs, Structured data, ,) • Logical name space for files for global identifiers • Resources • Logical names for managing collections of resources • User names (user-name / domain / SRB-zone) • Distinguished names for users to manage access controls • MCAT metadata • Standard metadata attributes,administrative metadata, user-defined meta

  27. Data Grid Federation • Data grids provide the ability to name, organize, and manage data on distributed storage resources • Federation provides a way to name, organize, and manage data on multiple data grids.

  28. SRB Zones • Each SRB zone uses a metadata catalog (MCAT) to manage the context associated with digital content • Each SRB zone is an autonomous organization or a sub-organization • Context includes: • Administrative, descriptive, authenticity attributes • Users • Resources • Applications

  29. SRB Peer-to-Peer Federation • Mechanisms to impose consistency and access constraints on: • Resources • Controls on which zones may use a resource • User names (user-name / domain / SRB-zone) • Users may be registered into another domain, but retain their home zone, similar to Shibboleth • Data files • Controls on who specifies replication of data • MCAT metadata • Controls on who manages updates to metadata

  30. Peer-to-Peer Federation • Occasional Interchange- for specified users • Replicated Catalogs- entire state information replication • Resource Interaction- data replication • Replicated Data Zones- no user interactions between zones • Master-Slave Zones- slaves replicate data from master zone • Snow-Flake Zones- hierarchy of data replication zones • User / Data Replica Zones- user access from remote to home zone • Nomadic Zones “SRB in a Box”- synchronize local zone to parent • Free-floating “myZone”- synchronize without a parent zone • Archival “Backup Zone”- synchronize to an archive SRB Version 3.0.1 released December 19, 2003

  31. Application OAI, WSDL, OGSA DLL / Python, Perl Linux I/O Java, NT Browsers HTTP Federation Management Consistency & Metadata Management / Authorization-Authentication Audit C, C++, Java Libraries Unix Shell Latency Management Metadata Transport Logical Name Space Data Transport Catalog Abstraction Storage Repository Virtualization Databases DB2, Oracle, Sybase, SQLserver,Postgres, mySQL, Informix File Systems Unix, NT, Mac OSX Archives - Tape, HPSS, ADSM, UniTree, DMF, CASTOR,ADS Databases DB2, Oracle, Sybase, Postgres, mySQL, Informix ORB Data Grid Federation - zoneSRB

  32. Talk Outline • Grid Computing and Data Grids • Inter-organizational Information Management using Data Grids • SDSC Storage Resource Broker (SRB) • Gridflows and Data Grids • Opportunities and Challenges

  33. Gridflows • Grid Workflow (Gridflow) is the automation of a execution pipeline in which data or tasks are processed through multiple autonomous grid resources according to a set of procedural rules • Gridflows are executed on resources that are dynamically obtained through confluence of one or more autonomous administrative domains (peers)

  34. Need for Gridflows • Data-intensive and/or compute-intensive processes • Long run processes or pipelines on the Grid • (e.g) If job A completes execute jobs x, y, z; else execute job B. • Self-organization/management of data • Semi-automation of data, storage distribution, curation processes • (e.g) After each data insert into a collection, update the meta-data information about the collection or replicate the collection • Knowledge Generation • Offline data analysis and knowledge generation pipelines • (e.g) What inferences can be assumed from the new seismology graphs added to this collection? Which domain scientist will be interested to study these new possible pre-results?

  35. SDSC Matrix Project • R&D effort that is ready for production now • Gridflow Protocols • Gridflow Language Descriptions • Version 3.0 released • Community based • Both Industry and Academia can benefit by participation • Involves University of Florida, UCSD, … (Are you In?) • Multiple Projects could be benefited • Very large academic data grid projects • Industries which want to be the early adopters

  36. Talk Outline • Grid Computing and Data Grids • Inter-organizational Information Management using Data Grids • SDSC Storage Resource Broker (SRB) • Gridflows and Data Grids • Opportunities and Challenges

  37. DGMS Philosophy • Collective view of • Inter-organizational data • Operations on datagrid space • Local autonomy and global state consistency • Collaborative datagrid communities • Multiple administrative domains or “Grid Zones” • Self-describing and self-manipulating data • Horizontal and vertical behavior • Loose coupling between data and behavior (dynamically) • Relationships between a digital entity and its Physical locations, Logical names, Meta-data, Access control, Behavior, “Grid Zones”.

  38. DGMS Research Issues • Self-organization of datagrid communities • Using knowledge relationships across the datagrids • Inter-datagrid operations based on semantics of data in the communities (different ontologies) • High speed data transfer • Terabyte to transfer • Protocols, routers • Latency Management • Data source speed >> data sink speed • Datagrid Constraints • Data placement and scheduling • How many replicas, where to place them…

  39. Global Grid Forum (GGF) • Global Forum for Information Exchange and Collaboration • Promote and support the development and deployment of Grid Technologies • Creation and documentation of “best practices”, technical specifications (standards), user experiences, … • Modeled after Internet Standards Process (IETF, RFC 2026) • http://www.ggf.org

  40. Talk Outline • Grid Computing and Data Grids • Inter-organizational Information Management using Data Grids • SDSC Storage Resource Broker (SRB) • Gridflows and Data Grids • Opportunities and Challenges • Summary

  41. Let us share dreams to make them for real Arun S. Jagatheesan San Diego Supercomputer Center arun@sdsc.edu srb@sdsc.edu http://www.npaci.edu/DICE/SRB http://www.npaci.edu/DICE/SRB/matrix/

  42. inQ Windows Browser Interface

  43. mySRB Interface to a SRB Collection

  44. Provenance Metadata

More Related