
Arun Jagatheesan, Reagan Moore, San Diego Supercomputer Center (SDSC), University of California, San Diego, {arun, moore}@sdsc.edu


Presentation Transcript


  1. Arun Jagatheesan Reagan Moore San Diego Supercomputer Center (SDSC) University of California, San Diego {arun, moore} @sdsc.edu

  2. Storage Resource Broker • Distributed data management technology • Developed at San Diego Supercomputer Center (Univ. of California, San Diego) • 1996 - DARPA Massive Data Analysis • 1998 - DARPA/USPTO Distributed Object Computation Testbed • 2000 to present - NSF, NASA, NARA, DOE, DOD, NIH, NLM, NHPRC • Applications • Data grids - data sharing • Digital libraries - data publication • Persistent archives - data preservation • Used in national and international projects in support of Astronomy, Bio-Informatics, Biology, Earth Systems Science, Ecology, Education, Geology, Government records, High Energy Physics, Seismology

  3. Acknowledgement: SDSC SRB Team • Arun Jagatheesan • George Kremenek • Sheau-Yen Chen • Arcot Rajasekar • Reagan Moore • Michael Wan • Roman Olschanowsky • Bing Zhu • Charlie Cowart • Not in picture: Wayne Schroeder, Tim Warnock (BIRN), Lucas Gilbert, Marcio Faerman (SCEC), Antoine De Torcy • Students: Xi (Cynthia) Sheng, Allen Ding, Grace Lin, Jonathan Weinberg, Yufang Hu, Yi Li • Emeritus: Vicky Rowley (BIRN), Qiao Xin, Daniel Moore, Ethan Chen, Reena Mathew, Erik Vandekieft, Ullas Kapadia

  4. Tutorial Outline • Introduction • Data Grids • Data Grid Infrastructures • Information Management using Data Grids • Data Grid Transparencies and concepts • Peer-to-peer Federation of Data Grids • Gridflows and Data Grids • Need for Gridflows • Data Grid Language and SDSC Matrix Project • Let's build a Data Grid • Using SDSC SRB Data Grid Management System and its Interfaces

  5. Data Grid Goals • Automate all aspects of data analysis • Data discovery • Data access • Data transport • Data manipulation • Automate all aspects of data collections • Metadata generation • Metadata organization • Metadata management • Preservation

  6. Using a Data Grid – in Abstract • User asks for data from the data grid • The data is found and returned • Where & how details are managed by data grid • But access controls are specified by owner (Diagram: user asks for data; data grid; data delivered)

  7. Tutorial Outline • Introduction • Data Grids • Data Grid Infrastructures • Information Management using Data Grids • Data Grid Transparencies and concepts • Peer-to-peer Federation of Data Grids • Gridflows and Data Grids • Need for Gridflows • Data Grid Language and SDSC Matrix Project • Data Grids and You • Open Research Issues and Global Grid Forum Community • Let's build a Data Grid • Using SDSC SRB Data Grid Management System and its Interfaces

  8. SRB Environments • NSF Southern California Earthquake Center digital library • Worldwide Universities Network data grid • NASA Information Power Grid • NASA Goddard Data Management System data grid • DOE BaBar High Energy Physics data grid • NSF National Virtual Observatory data grid • NSF ROADnet real-time sensor collection data grid • NIH Biomedical Informatics Research Network data grid • NARA research prototype persistent archive • NSF National Science Digital Library persistent archive • NHPRC Persistent Archive Testbed

  9. Southern California Earthquake Center

  10. Southern California Earthquake Center • Build community digital library • Manage simulation and observational data • Anelastic wave propagation output • 10 TBs, 1.5 million files • Provide web-based interface • Support standard services on digital library • Manage data distributed across multiple sites • USC, SDSC, UCSB, SDSU, SIO • Provide standard metadata • Community based descriptive metadata • Administrative metadata • Application specific metadata

  11. SCEC Digital Library Technologies • Portals • Knowledge interface to the library, presenting a coherent view of the services • Knowledge Management Systems • Organize relationships between SCEC concepts and semantic labels • Process management systems • Data processing pipelines to create derived data products • Web services • Uniform capabilities provided across SCEC collections • Data grid • Management of collections of distributed data • Computational grid • Access to distributed compute resources • Persistent archive • Management of technology evolution

  12. Metadata Organization (Domain View versus Run View) Provenance Simulation Model Program Computer System Velocity Model Fault Model Domain ... Spatial Temporal Physical Numerical Run Output Domain List Formatting

  13. NASA Data Grids • NASA Information Power Grid • NASA Ames, NASA Goddard • Distributed data collection using the SRB • ESIP federation • Led by Joseph JaJa (U Md) • Federation of ESIP data resources using the SRB • NASA Goddard Data Management System • Storage repository virtualization (Unix file system, Unitree archive, DMF archive) using the SRB • NASA EOS Petabyte store • Storage repository virtualization for EMC persistent store using the Nirvana version of SRB

  14. Data Assimilation Office • HSI has implemented metadata schema in SRB/MCAT • Origin: host, path, owner, uid, gid, perm_mask, [times] • Ingestion: date, user, user_email, comment • Generation: creator (name, uid, user, gid), host (name, arch, OS name & flags), compiler (name, version, flags), library, code (name, version), accounting data • Data description: title, version, discipline, project, language, measurements, keywords, sensor, source, prod. status, temporal/spatial coverage, location, resolution, quality • Fully compatible with GCMD

  15. Data Management System: Software Architecture

  16. DODS Access Environment Integration

  17. National Virtual Observatory Data Grid 1. Portals and Workbenches 2. Knowledge & Resource Management Bulk Data Analysis Metadata View Data View Catalog Analysis 3. Standard APIs and Protocols Concept space 4. Grid Security Caching Replication Backup Scheduling Information Discovery Metadata delivery Data Discovery Data Delivery 5. Standard Metadata format, Data model, Wire format 6. Catalog Mediator Data mediator Catalog/Image Specific Access Compute Resources Catalogs Data Archives Derived Collections 7.

  18. National Virtual Observatory Portals, User Interfaces, Tools Aladin VOPlot SkyQuery Topcat OASIS DIS Mirage conVOT ADQL XQuery datamodels Bulk processing Registry Layer Data Access Layer Computational Services HTTP and SOAP Web Services visualization data mining OAI ADS source detection image OpenSkyQuery SIAP, SSAP VOTable Semantics (UCD) FITS, GIF,… Pipelines (persistent grid services) Virtual Data (dynamic and cached computation) Digital Library XML, DC, METS Grid Middleware SRB, OGSA, WSRF, SOAP, GridFTP My Space Existing Data Centers Databases, Persistency, Replication Disks, Tapes, CPUs, Fiber

  19. TeraGrid: 13.6 TF, 6.8 TB memory, 900 TB network disk, 10 PB archive • ANL: 1 TF, 0.25 TB memory, 25 TB disk • Caltech: 0.5 TF, 0.4 TB memory, 86 TB disk • NCSA: 6+2 TF, 4 TB memory, 400 TB disk • SDSC: 4.1 TF, 2 TB memory, 500 TB SAN (Network diagram: the four sites connected through DTF core switch/routers, with links to vBNS, Abilene, MREN, Calren, ESnet, NTON, HSCC and Starlight)

  20. NIH BIRN SRB Data Grid • Biomedical Informatics Research Network • Access and analyze biomedical image data • Data resources distributed throughout the country • Medical schools and research centers across the US • Stable high performance grid based environment • Coordinate data sharing • Federate collections • Support data mining and analysis

  21. SRB Collections at SDSC

  22. Commonality in all these projects • Distributed data management • Data Grids, Digital Libraries, Persistent Archives, • Workflow/dataflow Pipelines, Knowledge Generation • Data sharing across administrative domains • Common name space for all registered digital entities • Data publication • Browsing and discovery of data in collections • Data Preservation • Management of technology evolution

  23. Common Data Grid Components • Federated client-server architecture • Servers can talk to each other independently of the client • Infrastructure independent naming • Logical names for users, resources, files, applications • Collective ownership of data • Collection-owned data, with infrastructure independent access control lists • Context management • Record state information in a metadata catalog from data grid services such as replication • Abstractions for dealing with heterogeneity

  24. Tutorial Outline • Introduction • Data Grids • Data Grid Infrastructures • Information Management using Data Grids • Data Grid Transparencies and concepts • Peer-to-peer Federation of Data Grids • Gridflows and Data Grids • Need for Gridflows • Data Grid Language and SDSC Matrix Project • Let's build a Data Grid • Using SDSC SRB Data Grid Management System and its Interfaces

  25. Information Management Technologies • Data collecting • Sensor systems, object ring buffers and portals • Data organization • Collections, manage data context • Data sharing • Data grids, manage heterogeneity • Data publication • Digital libraries, support discovery • Data preservation • Persistent archives, manage technology evolution • Data analysis • Processing pipelines, manage knowledge extraction

  26. Assertion • Data Grids provide the underlying abstractions required to support all information technologies • Collection building • Metadata extraction • Digital libraries • Curation processes • Distributed collections • Discovery and presentation services • Persistent archives • Management of technology evolution • Preservation of authenticity

  27. Information Management Terms • Data • Bits - zeros and ones • Digital Entity • The bits that form an image of reality (file, object, image, data, metadata, string of bits, structured sets of string of bits) • Metadata • Semantic labels and the associated data • Information • Semantic labels applied to data and its semantic properties • Knowledge • Relationships between semantic labels associated with the data • Relationships used to assert the application of a semantic label

  28. Information Management data types • Collection • The organization of digital entities to simplify management and access. • Context • The information that describes the digital entities in a collection. • Content • The digital entities in a collection

  29. Types of Context Metadata • Descriptive • Provenance information, discovery attributes • Administrative • Location, ownership, size, time stamps • Structural • Data model, internal components • Behavioral • Display and manipulation operations • Authenticity • Audit trails, checksums, access controls

  30. Some Metadata Standards • METS - Metadata Encoding and Transmission Standard • Defines standard structure and schema extension • OAIS - Open Archival Information System • Preservation packages for submission, archiving, distribution • OAI - Open Archives Initiative • Metadata retrieval based on Dublin Core provenance attributes

  31. Data Management Mechanisms • Curation • The process of creating the context • Closure • Assertion that the collection has global properties, including completeness and homogeneity under specified operations • Consistency • Assertion that the context represents the content
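The consistency mechanism on slide 31 (the context represents the content) can be illustrated with a minimal sketch. It assumes the context records only a size and a SHA-256 checksum; the function names `make_context` and `is_consistent` are hypothetical and are not part of the SRB API.

```python
import hashlib


def make_context(content: bytes) -> dict:
    """Build a tiny context record (hypothetical fields) for some content."""
    return {
        "size": len(content),
        "sha256": hashlib.sha256(content).hexdigest(),
    }


def is_consistent(context: dict, content: bytes) -> bool:
    """Return True when the recorded context still describes the content."""
    return (
        context["size"] == len(content)
        and context["sha256"] == hashlib.sha256(content).hexdigest()
    )


context = make_context(b"image bits")
print(is_consistent(context, b"image bits"))      # content unchanged
print(is_consistent(context, b"corrupted bits"))  # content no longer matches
```

A real catalog would hold many more attributes; the point is only that consistency is something a program can check by comparing stored context against the stored bits.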

  32. Storage Resource Broker • Implements data management mechanisms needed to automate • Collection building • Context management • Content management • Curation processes • Closure and validation processes • Consistency guarantees • Provides virtualization mechanisms to manage • Distribution across administrative domains • Heterogeneous storage resources

  33. Data Grid Transparencies/Virtualizations (bits, data, information, ..) • Inter-organizational Information Storage Management • Semantic data Organization (with behavior) • Virtual Data Transparency • Data Replica Transparency • Data Identifier Transparency • Storage Location Transparency • Storage Resource Transparency (Diagram examples: myActiveNeuroCollection, patientRecordsCollection; image.cgi, image.wsdl, image.sql; E:\srbVault\image.jpg, /users/srbVault/image.jpg; Select … from srb.mdas.td where...; image_0.jpg…image_100.jpg)

  34. Data Grid Transparencies • Find data without knowing the identifier • Descriptive attributes • Access data without knowing the location • Logical name space • Access data without knowing the type of storage • Storage repository abstraction • Retrieve data using your preferred API • Access abstraction • Provide transformations for any data collection • Data behavior abstraction
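The first transparency on slide 34, finding data through descriptive attributes rather than an identifier, might look like the sketch below. The catalog contents, attribute names, and the `find_by_attributes` helper are all hypothetical illustrations, not SRB or MCAT calls.

```python
# Hypothetical catalog: logical name -> descriptive attributes.
CATALOG = {
    "/birn/images/image_0.jpg": {"modality": "MRI", "subject": "mouse"},
    "/birn/images/image_1.jpg": {"modality": "MRI", "subject": "human"},
    "/birn/images/image_2.jpg": {"modality": "CT", "subject": "human"},
}


def find_by_attributes(catalog: dict, **conditions) -> list:
    """Return the logical names whose attributes match every condition."""
    return sorted(
        name
        for name, attributes in catalog.items()
        if all(attributes.get(key) == value for key, value in conditions.items())
    )


# The caller never supplies a file name, only a description of the data.
matches = find_by_attributes(CATALOG, modality="MRI", subject="human")
print(matches)
```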

  35. Data Grid Abstractions • Storage repository virtualization • Standard operations supported on storage systems • Data virtualization • Logical name space for files - Global persistent identifier • Information repository virtualization • Standard operations to manage collections in databases • Access virtualization • Standard interface to support alternate APIs • Latency management mechanisms • Aggregation, parallel I/O, replication, caching • Security interoperability • GSSAPI, inter-realm authentication, collection-based authorization

  36. Storage Repository Virtualization User Application Database File System Archive

  37. Storage Repository Virtualization • Common set of operations for interacting with every type of storage repository • Remote operations • Unix file system • Latency management • Procedures • Transformations • Third party transfer • Filtering • Queries (Diagram: User Application; Database, File System, Archive)
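One way to picture a common set of operations for every type of storage repository is an abstract driver interface with one implementation per repository type. This is a sketch under assumptions: the class names and the three operations are hypothetical, and an in-memory dictionary stands in for a database or archive.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Hypothetical common operation set shared by all repository types."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def delete(self, path: str) -> None: ...


class FileSystemDriver(StorageDriver):
    """Stores each object as a file under a root directory."""

    def __init__(self, root: str):
        self.root = root

    def _full(self, path: str) -> str:
        return os.path.join(self.root, path.lstrip("/"))

    def write(self, path, data):
        full = self._full(path)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "wb") as handle:
            handle.write(data)

    def read(self, path):
        with open(self._full(path), "rb") as handle:
            return handle.read()

    def delete(self, path):
        os.remove(self._full(path))


class InMemoryDriver(StorageDriver):
    """Stand-in for a database or archive back end."""

    def __init__(self):
        self._blobs = {}

    def write(self, path, data):
        self._blobs[path] = data

    def read(self, path):
        return self._blobs[path]

    def delete(self, path):
        del self._blobs[path]


def copy_between(source: StorageDriver, target: StorageDriver, path: str) -> None:
    """Copy one object without caring what kind of repository holds it."""
    target.write(path, source.read(path))


archive = InMemoryDriver()
disk = FileSystemDriver(tempfile.mkdtemp())
archive.write("images/image.jpg", b"bits")
copy_between(archive, disk, "images/image.jpg")
```

Because `copy_between` only uses the shared interface, adding another repository type means writing one more driver and leaving client code unchanged.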

  38. Data Virtualization User Application Database At U Md File System at U Texas Archive at SDSC

  39. Data Virtualization • Common naming convention and set of attributes for describing digital entities • Logical name space • Location independent identifier • Persistent identifier • Collection owned data • Access controls • Audit trails • Checksums • Descriptive metadata • Inter-realm authentication • Single sign-on system (Diagram: User Application; Database at U Md, File System at U Texas, Archive at SDSC)

  40. Three Tier Architecture • Clients • Your preferred access mechanism • Metadata catalog • Separation of metadata management from data storage • Servers • Manage interactions with storage systems • Federated to support direct interactions between servers

  41. Federated SRB server model • Peer-to-peer brokering • Parallel data access • Server(s) spawning • Client read request by logical name or attribute condition • MCAT: 1. Logical-to-physical mapping 2. Identification of replicas 3. Access & audit control (Diagram: client, two SRB servers each with an SRB agent, MCAT, and storage resources R1 and R2, with request steps numbered 1 to 6)

  42. SDSC Storage Resource Broker & Meta-data Catalog • Application • Access APIs: C, C++, Libraries; Unix Shell; Linux I/O; OAI, WSDL; DLL / Python; Java, NT Browsers; GridFTP • Consistency Management / Authorization-Authentication • SRB Server: Logical Name Space, Latency Management, Data Transport, Metadata Transport • Catalog Abstraction: Databases (DB2, Oracle, Sybase, SQLServer) • Storage Abstraction: Drivers for Databases (DB2, Oracle, Postgres), Archives (HPSS, ADSM, UniTree, DMF), File Systems (Unix, NT, Mac OSX), HRM

  43. SRB Name Spaces • Digital Entities (files, blobs, Structured data, …) • Logical name space for files for global identifiers • Resources • Logical names for managing collections of resources • User names (user-name / domain / SRB-zone) • Distinguished names for users to manage access controls • MCAT metadata • Standard metadata attributes, Dublin Core, administrative metadata

  44. Logical Name Space • Global, location-independent identifiers for digital entities • Organized as collection hierarchy • Attributes mapped to logical name space • Attributes managed in a database • Types of administrative metadata • Physical location of file • Owner, size, creation time, update time • Access controls
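The logical name space on slide 44 can be sketched as a small catalog that maps each logical name to administrative metadata: physical locations, owner, size, and access controls. The `Catalog`, `Entry`, and `Replica` classes are hypothetical and only loosely modeled on the role the slides assign to MCAT; the paths and resource names are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    resource: str        # logical resource name (illustrative)
    physical_path: str   # where the bits actually live


@dataclass
class Entry:
    owner: str
    size: int
    replicas: list = field(default_factory=list)
    acl: dict = field(default_factory=dict)  # user -> set of permissions


class Catalog:
    """Hypothetical metadata catalog keyed by logical name."""

    def __init__(self):
        self._entries = {}

    def register(self, logical_name, owner, size, replica):
        """Create the entry on first use, then record one more replica."""
        entry = self._entries.setdefault(
            logical_name, Entry(owner, size, acl={owner: {"read", "write"}})
        )
        entry.replicas.append(replica)

    def grant(self, logical_name, user, permission):
        self._entries[logical_name].acl.setdefault(user, set()).add(permission)

    def resolve(self, logical_name, user):
        """Map a logical name to its physical replicas, enforcing the ACL."""
        entry = self._entries[logical_name]
        if "read" not in entry.acl.get(user, set()):
            raise PermissionError(f"{user} may not read {logical_name}")
        return list(entry.replicas)


catalog = Catalog()
name = "/home/arun/image.jpg"
catalog.register(name, "arun", 4, Replica("unix-sdsc", "/users/srbVault/image.jpg"))
catalog.register(name, "arun", 4, Replica("nt-sdsc", r"E:\srbVault\image.jpg"))
for replica in catalog.resolve(name, "arun"):
    print(replica.resource, replica.physical_path)
```

The client only ever handles `/home/arun/image.jpg`; which storage system and path serve the request is decided by the catalog lookup.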

  45. Data Identifier Transparency Four Types of Data Identifiers: • Unique name • OID or handle • Descriptive name • Descriptive attributes – meta data • Semantic access to data • Collective name • Logical name space of a collection of data sets • Location independent • Physical name • Physical location of resource and physical path of data

  46. Mappings on Resource Name Space • Define logical resource name • List of physical resources • Replication • Write to a logical resource completes when all physical resources have a copy • Load balancing • Write to a logical resource completes when a copy exists on the next physical resource in the list • Fault tolerance • Write to a logical resource completes when copies exist on “k” of “n” physical resources
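The three write-completion rules on slide 46 can be compared side by side in a short sketch. It assumes physical resources behave like dictionaries and that a failed write raises `OSError`; the `LogicalResource` class, its method names, and `OfflineStore` are hypothetical illustrations rather than SRB interfaces.

```python
class OfflineStore(dict):
    """Simulated physical resource that rejects every write."""

    def __setitem__(self, key, value):
        raise OSError("resource offline")


class LogicalResource:
    """Hypothetical logical resource backed by a list of physical resources."""

    def __init__(self, name, physical_resources):
        self.name = name
        self.physical = list(physical_resources)
        self._next = 0  # round-robin position for load balancing

    @staticmethod
    def _write(key, data, targets, required):
        """Write to targets in order; succeed once `required` copies exist."""
        copies = 0
        for store in targets:
            if copies == required:
                break
            try:
                store[key] = data
                copies += 1
            except OSError:
                continue  # skip a failed resource and try the next one
        return copies >= required

    def write_replicated(self, key, data):
        """Replication: every physical resource must hold a copy."""
        return self._write(key, data, self.physical, len(self.physical))

    def write_load_balanced(self, key, data):
        """Load balancing: one copy on the next resource in the list."""
        target = self.physical[self._next % len(self.physical)]
        self._next += 1
        return self._write(key, data, [target], 1)

    def write_fault_tolerant(self, key, data, k):
        """Fault tolerance: copies must exist on k of the n resources."""
        return self._write(key, data, self.physical, k)


disk_a, disk_b, broken = {}, {}, OfflineStore()
resource = LogicalResource("demo-resource", [disk_a, broken, disk_b])
print(resource.write_fault_tolerant("f1", b"bits", k=2))  # two healthy copies suffice
print(resource.write_replicated("f2", b"bits"))           # fails: one resource is offline
```

The only difference between the three policies is the target list and the number of required copies, which is why a single helper can serve all of them.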

  47. Data Replica Transparency • Replication • Improve access time • Improve reliability • Provide disaster backup and preservation • Physically or Semantically equivalent replicas • Replica consistency • Synchronization across replicas on writes • Updates might use “m of n” or any other policy • Distributed locking across multiple sites • Versions of files • Time-annotated snapshots of data

  48. Latency Management - Bulk Operations • Bulk register • Create a logical name for a file • Bulk load • Create a copy of the file on a data grid storage repository • Bulk unload • Provide containers to hold small files and pointers to each file location • Bulk delete • Mark as deleted in metadata catalog • After specified interval, delete file • Bulk metadata load • Requests for bulk operations for access control setting, …
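The container idea on slide 48, holding many small files together with a pointer to each file's location, can be sketched as one byte string plus an index of offsets and lengths. The `pack` and `extract` functions are hypothetical and do not reflect the SRB container format.

```python
def pack(files: dict):
    """Concatenate small files into one container and index their positions."""
    container = bytearray()
    index = {}
    for name, data in files.items():
        index[name] = (len(container), len(data))  # (offset, length)
        container += data
    return bytes(container), index


def extract(container: bytes, index: dict, name: str) -> bytes:
    """Use the stored pointer to pull one file back out of the container."""
    offset, length = index[name]
    return container[offset:offset + length]


small_files = {
    "image_0.jpg": b"first",
    "image_1.jpg": b"second",
    "image_2.jpg": b"third",
}
container, index = pack(small_files)
print(extract(container, index, "image_1.jpg"))
```

Moving one container instead of many small files replaces many round trips with a single transfer, which is the latency benefit the slide is pointing at.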
