
integrated Rule Oriented Data System


zelig

Presentation Transcript


  1. integrated Rule Oriented Data System Tutorial: iRODS Capabilities

  2. Outline • Introduction to iRODS capabilities • Data-driven science and full Data Life Cycle • Policy-based Management of Distributed Data • Scaling: petabytes, 100s of millions of files • Enabling unified sharable "virtual" collections • Enabling data grids (sharing), digital libraries (publishing), persistent archives (preservation) • Unified Data Space: Interoperate via Federation

  3. Introduction to iRODS Capabilities

  4. Data Driven Science • Enable new science through collaborative research on shared data collections • Management of entire scientific data life cycle from data analysis pipelines to long-term sustainability of reference collections • Implement national scale data cyber-infrastructure • Federation of exemplar data management technologies in exemplar research initiatives • Creation of production data management systems • Proven technology implemented in extant data grids • Integrate “live” research data collections into education initiatives • Policy-based data management across distributed data • Data Life Cycle (diagram labels): Project, Shared Collection, Processing Pipeline, Digital Library, Reference Collection, Federation

  5. Data are Inherently Distributed • Distributed sources • Projects span multiple institutions, nations • Distributed analysis platforms • Grid computing • Distributed data storage • Minimize risk of data loss, optimize access • Distributed users • Caching of data near user • Multiple stages of data life cycle • Data repurposing for use in broader context

  6. Institutional Repositories: Texas Digital Library, Carolina Digital Repository • Cloud Storage • Federal Repositories: National Climatic Data Center, National Optical Astronomy Observatory

  7. Digital Library: French National Library, Texas Digital Library • Data Grid: Temporal Dynamics of Learning Center, Teragrid, Australian Research Collaboration Service • Data Processing Pipelines: Large Synoptic Survey Telescope, Ocean Observatories Initiative • Preservation Environment: Taiwan National Archive, NARA Transcontinental Persistent Archive Prototype, Carolina Digital Repository

  8. Data Life Cycle • Each stage of the data life cycle re-purposes the original collection • Project Collection: Private, Local Policy • Data Grid: Shared, Distribution Policy • Data Processing Pipeline: Analyzed, Service Policy • Digital Library: Published, Description Policy • Reference Collection: Preserved, Representation Policy • Federation: Sustained, Re-purposing Policy • Each stage adds new policies for a broader community • Virtualize the stages of data life cycle through evolution of policies • Interoperability across data life cycle representations
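
A minimal sketch of the stage table on this slide, in Python. The stage names, collection types, and policy names come from the slide; the function `policies_in_force` and the reading that policies accumulate from one stage to the next are assumptions made for illustration, not iRODS behavior.

```python
# (stage, collection type, policy added at that stage), in the slide's order.
LIFE_CYCLE = [
    ("Project Collection", "Private", "Local Policy"),
    ("Data Grid", "Shared", "Distribution Policy"),
    ("Data Processing Pipeline", "Analyzed", "Service Policy"),
    ("Digital Library", "Published", "Description Policy"),
    ("Reference Collection", "Preserved", "Representation Policy"),
    ("Federation", "Sustained", "Re-purposing Policy"),
]


def policies_in_force(stage_name):
    """Return the policies added up to and including the named stage.

    Assumption: each stage keeps the earlier policies and adds its own.
    """
    policies = []
    for stage, _collection_type, policy in LIFE_CYCLE:
        policies.append(policy)
        if stage == stage_name:
            return policies
    raise KeyError(stage_name)


library_policies = policies_in_force("Digital Library")
```

Keeping the stages in an ordered list reflects the slide's claim that each stage serves a broader community than the one before it.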

  9. Tracing the Data Life Cycle • Collection Creation using a Data Grid • Data manipulation / Data ingestion • Processing Pipelines • Pipeline processing / Environment administration • Data Grid • Policy display / Micro-service display / State information display / Replication • Digital Library • Access / Containers / Metadata browsing / Visualization • Preservation Environment • Validation / Audit / Federation / Deep Archive / SHAMAN

  10. Goal - Generic Infrastructure • Manage all stages of the data life cycle • Data organization • Data processing pipelines • Collection creation • Data sharing • Data publication • Data preservation • Create reference collection against which future information and knowledge is compared • Each stage uses similar storage, arrangement, description, and access mechanisms

  11. Concept Roadmap • Purpose: reason a collection is assembled • Properties: attributes needed to ensure the purpose • Policies: enforce and maintain required properties • Procedures: computer functions to implement Policies • State information: results of applying procedures (iCAT) • Assessment criteria: validate that state information conforms to desired purpose • Federation: interoperate with shared logical name spaces • These are the required elements for data life cycle virtualization
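
The roadmap can be read as a small data model. The sketch below is illustrative only: `Collection`, `replicate`, `assess`, and the "at least two replicas" property are hypothetical names chosen for this example, and a plain dictionary stands in for the state information that iRODS keeps in the iCAT.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Collection:
    purpose: str
    # Required property name -> assessment criterion over one file's state.
    properties: Dict[str, Callable[[dict], bool]]
    # State information produced by procedures, keyed by file name.
    state: Dict[str, dict] = field(default_factory=dict)


def replicate(collection, name, copies):
    """Procedure (hypothetical): record how many replicas a file now has."""
    collection.state.setdefault(name, {})["replicas"] = copies


def assess(collection):
    """Return, per file, the required properties its state fails to meet."""
    report = {}
    for name, info in collection.state.items():
        failed = [p for p, ok in collection.properties.items() if not ok(info)]
        if failed:
            report[name] = failed
    return report


archive = Collection(
    purpose="preservation",
    properties={"at_least_two_replicas": lambda info: info.get("replicas", 0) >= 2},
)
replicate(archive, "survey.dat", 2)
replicate(archive, "notes.txt", 1)
violations = assess(archive)
```

Here the procedure only writes state, and the assessment only reads it, which mirrors the slide's separation between applying procedures and validating their results.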

  12. Policy-based Management • Each data life cycle stage is driven by extensions of management policies to address broader user communities • Data arrangement <-----> Project policies • Data analysis <-----------> Processing pipeline standards • Data sharing <-----------> Research collaborations • Data publication <---------> Discipline standards • Data preservation <------> Reference collection • Reference collections need to be preserved and interpretable by future generations, most stringent standard • Data grids - integrated Rule Oriented Data System

  13. iRODS - Policy-based Management • Turn Policies into computer-actionable Rules • Compose Rules by chaining Micro-services • Manage state information (in iCAT metadata catalog) as attributes on namespaces: • Files / collections /users / resources / rules • Validate assessment criteria • Queries on state information, parsing audit trails • Automate administrative functions • Enable scaling to today's massive collections
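
As a rough illustration of "compose Rules by chaining Micro-services", the Python sketch below builds a rule from a condition and an ordered chain of functions. It is not the iRODS rule language: `make_rule`, `checksum_step`, `register_step`, and the in-memory `catalog` (standing in for the iCAT) are all hypothetical.

```python
import hashlib

catalog = {}  # stand-in for the iCAT: logical path -> state attributes


def checksum_step(ctx):
    """Hypothetical micro-service: compute a checksum of the data."""
    ctx["checksum"] = hashlib.sha256(ctx["data"]).hexdigest()


def register_step(ctx):
    """Hypothetical micro-service: record state information in the catalog."""
    catalog[ctx["path"]] = {"checksum": ctx["checksum"], "size": len(ctx["data"])}


def make_rule(condition, *micro_services):
    """Build a rule that runs the chain in order when the condition holds."""
    def rule(ctx):
        if not condition(ctx):
            return False
        for step in micro_services:
            step(ctx)
        return True
    return rule


# Policy: every .dat file gets a checksum recorded when it is stored.
post_put = make_rule(lambda ctx: ctx["path"].endswith(".dat"),
                     checksum_step, register_step)

applied = post_put({"path": "/zone/home/alice/run1.dat", "data": b"abc"})
skipped = post_put({"path": "/zone/home/alice/readme.txt", "data": b"hi"})
```

Queries over `catalog` then play the role of the assessment step: they check that the recorded state matches what the policy promised.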

  14. Overview of iRODS Architecture • User w/Client: can search, access, add and manage data & metadata • iRODS Data System • iRODS Metadata Catalog: tracks information • iRODS Rule Engine: tracks policies • iRODS Data Server: disk, tape, etc. • Access distributed data with Web-based browser, iRODS GUI, or command-line clients.

  15. iput ../src/irm.c - Checks 10 Policy hooks when file put into iRODS
      brick14:10900:ApplyRule#116:: acChkHostAccessControl
      brick14:10900:GotRule#117:: acChkHostAccessControl
      brick14:10900:ApplyRule#118:: acSetPublicUserPolicy
      brick14:10900:GotRule#119:: acSetPublicUserPolicy
      brick14:10900:ApplyRule#120:: acAclPolicy
      brick14:10900:GotRule#121:: acAclPolicy
      brick14:10900:ApplyRule#122:: acSetRescSchemeForCreate
      brick14:10900:GotRule#123:: acSetRescSchemeForCreate
      brick14:10900:execMicroSrvc#124:: msiSetDefaultResc(demoResc,null)
      brick14:10900:ApplyRule#125:: acRescQuotaPolicy
      brick14:10900:GotRule#126:: acRescQuotaPolicy
      brick14:10900:execMicroSrvc#127:: msiSetRescQuotaPolicy(off)
      brick14:10900:ApplyRule#128:: acSetVaultPathPolicy
      brick14:10900:GotRule#129:: acSetVaultPathPolicy
      brick14:10900:execMicroSrvc#130:: msiSetGraftPathScheme(no,1)
      brick14:10900:ApplyRule#131:: acPreProcForModifyDataObjMeta
      brick14:10900:GotRule#132:: acPreProcForModifyDataObjMeta
      brick14:10900:ApplyRule#133:: acPostProcForModifyDataObjMeta
      brick14:10900:GotRule#134:: acPostProcForModifyDataObjMeta
      brick14:10900:ApplyRule#135:: acPostProcForCreate
      brick14:10900:GotRule#136:: acPostProcForCreate
      brick14:10900:ApplyRule#137:: acPostProcForPut
      brick14:10900:GotRule#138:: acPostProcForPut
      brick14:10900:GotRule#139:: acPostProcForPut
      brick14:10900:GotRule#140:: acPostProcForPut
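
The count of policy hooks in this trace can be checked mechanically. The Python sketch below copies the trace from the slide and parses it with a regular expression; the field layout (host, process id, event kind, sequence number, name) is inferred from the trace itself and is an assumption, not a documented log format.

```python
import re

TRACE = """
brick14:10900:ApplyRule#116:: acChkHostAccessControl
brick14:10900:GotRule#117:: acChkHostAccessControl
brick14:10900:ApplyRule#118:: acSetPublicUserPolicy
brick14:10900:GotRule#119:: acSetPublicUserPolicy
brick14:10900:ApplyRule#120:: acAclPolicy
brick14:10900:GotRule#121:: acAclPolicy
brick14:10900:ApplyRule#122:: acSetRescSchemeForCreate
brick14:10900:GotRule#123:: acSetRescSchemeForCreate
brick14:10900:execMicroSrvc#124:: msiSetDefaultResc(demoResc,null)
brick14:10900:ApplyRule#125:: acRescQuotaPolicy
brick14:10900:GotRule#126:: acRescQuotaPolicy
brick14:10900:execMicroSrvc#127:: msiSetRescQuotaPolicy(off)
brick14:10900:ApplyRule#128:: acSetVaultPathPolicy
brick14:10900:GotRule#129:: acSetVaultPathPolicy
brick14:10900:execMicroSrvc#130:: msiSetGraftPathScheme(no,1)
brick14:10900:ApplyRule#131:: acPreProcForModifyDataObjMeta
brick14:10900:GotRule#132:: acPreProcForModifyDataObjMeta
brick14:10900:ApplyRule#133:: acPostProcForModifyDataObjMeta
brick14:10900:GotRule#134:: acPostProcForModifyDataObjMeta
brick14:10900:ApplyRule#135:: acPostProcForCreate
brick14:10900:GotRule#136:: acPostProcForCreate
brick14:10900:ApplyRule#137:: acPostProcForPut
brick14:10900:GotRule#138:: acPostProcForPut
brick14:10900:GotRule#139:: acPostProcForPut
brick14:10900:GotRule#140:: acPostProcForPut
"""

# host : pid : event kind # sequence number :: rule or micro-service name
ENTRY = re.compile(r"(\w+):(\d+):(ApplyRule|GotRule|execMicroSrvc)#(\d+):: (\S+)")

entries = ENTRY.findall(TRACE)

# Each ApplyRule event marks one policy hook being checked.
hooks = [name for _host, _pid, kind, _seq, name in entries if kind == "ApplyRule"]

# execMicroSrvc events show which micro-services the matched rules ran.
micro_services = [name for _host, _pid, kind, _seq, name in entries
                  if kind == "execMicroSrvc"]
```

Counting the `ApplyRule` events gives the ten hooks named in the slide title, and only three of them ran a micro-service in this trace.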

  16. Scale of iRODS Data Grid • Number of files • Desktop to 10s to 100s of millions of files • Size of data • Desktop to 100s of terabytes to petabytes of data • Number of policy enforcement points • 64 actions define when policies are checked • System state information • 112 metadata attributes of system information per file • Number of functions • 185 composable iRODS Micro-services • Number of storage systems that are linked • Desktop to 10s to 100 storage resources • Number of data grids that can interoperate • Federation of 10s of data grids

  17. iRODS Shows Unified “Virtual Collection” • User with client views & manages data • User sees single “Virtual Collection” • Reference Data: remote disk, tape, filesystem, etc. • My Data: disk, tape, database, filesystem, etc. • Project Data: disk, tape, database, filesystem, etc. • The iRODS Data System can install in a “layer” over existing or new data, letting you view, manage, and share part or all of diverse data in a unified Collection.

  18. Organize Distributed Data into a Sharable "Virtual" Collection • Project repository • MotifNet - manage collection of analysis products • Institutional repository • Carolina Digital Repository for UNC collections • Regional collaboration • RENCI Data Grid linking resources across North Carolina • National collaboration • NSF Temporal Dynamics of Learning Center • Australian Research Collaboration Service • National Library • French National Library • National Archive • NARA Transcontinental Persistent Archive Prototype, Taiwan • International collaboration • BaBar High Energy Physics (SLAC-IN2P3) • National Optical Astronomy Observatory (Chile-US)

  19. Infrastructure Independence • Manage properties of the collection independently of the choice of technology • Access, authentication, authorization, description, location, distribution, replication, integrity, retention • Enforce policies globally at all storage locations • Rule Engine resident at each storage site • Apply procedures at each remote storage site • Chain encapsulated operations into workflows • Infrastructure independence enables evolution to new technology without interruption • Integrate new access methods, new storage systems, new network protocols, new authentication systems

  20. Data Virtualization • Map from actions requested by access method to standard set of iRODS Micro-services. • Map standard Micro-services to standard operations. • Map the operations to protocol supported by operating system. • Layers (diagram labels): Access Interface, Standard Micro-services, Data Grid, Standard Operations, Storage Protocol, Storage System
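
The three mappings on this slide can be pictured as three lookup tables applied in sequence. Everything named in the Python sketch below (actions, micro-service names, operations, storage systems, protocol strings) is hypothetical and chosen only to show the layering; none of it is the real iRODS driver interface.

```python
# Layer 1 (hypothetical): client action -> standard micro-services.
ACTION_TO_MICRO_SERVICES = {
    "upload": ["create_object", "write_bytes", "close_object"],
    "download": ["open_object", "read_bytes", "close_object"],
}

# Layer 2 (hypothetical): micro-service -> standard operation.
MICRO_SERVICE_TO_OPERATION = {
    "create_object": "create",
    "open_object": "open",
    "write_bytes": "write",
    "read_bytes": "read",
    "close_object": "close",
}

# Layer 3 (hypothetical): standard operation -> call in each storage protocol.
OPERATION_TO_PROTOCOL = {
    "unix_fs": {"create": "creat()", "open": "open()", "write": "write()",
                "read": "read()", "close": "close()"},
    "object_store": {"create": "PUT-init", "open": "HEAD", "write": "PUT-part",
                     "read": "GET", "close": "PUT-complete"},
}


def translate(action, storage_system):
    """Resolve one client action into protocol calls for one storage system."""
    driver = OPERATION_TO_PROTOCOL[storage_system]
    return [driver[MICRO_SERVICE_TO_OPERATION[step]]
            for step in ACTION_TO_MICRO_SERVICES[action]]


unix_calls = translate("upload", "unix_fs")
object_calls = translate("upload", "object_store")
```

Adding a new storage system in this sketch means adding one entry to the last table, which is the point of the layering: the access methods and micro-services above it do not change.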

  21. Data Grid Security • Manage global name spaces for: • {users, files, storage} • Assign access controls as constraints imposed between two logical name spaces • Access controls remain invariant as files are moved within the data grid • Controls on: Files / Storage systems / Metadata • Authenticate each user access • PKI, Kerberos, challenge-response, Shibboleth • Use internal or external identity management system • Authorize all operations • ACLs (Access Control Lists) on users and groups • Separate condition for execution of each Rule • Internal approval flags (e.g. IRB) within a Rule
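
The claim that access controls stay invariant as files move follows from attaching them to logical names instead of physical locations. A minimal Python sketch, assuming hypothetical user names, paths, resources, and a simple read < write < own ordering of permissions:

```python
LEVELS = {"read": 1, "write": 2, "own": 3}  # assumed ordering for this sketch

PATH = "/zone/home/project/run1.dat"  # hypothetical logical file name

# Access controls relate two logical name spaces: users and files.
acl = {
    ("alice", PATH): "own",
    ("bob", PATH): "read",
}

# Physical location is tracked separately: logical name -> (resource, path).
replicas = {PATH: ("diskA", "/vault/a/run1.dat")}


def can(user, logical_name, needed):
    """True if the user's permission on the logical name is high enough."""
    granted = acl.get((user, logical_name))
    return LEVELS.get(granted, 0) >= LEVELS[needed]


def move(logical_name, resource, physical_path):
    """Relocate the physical copy; the ACL table is not touched."""
    replicas[logical_name] = (resource, physical_path)


before = can("bob", PATH, "read")
move(PATH, "tapeB", "/archive/0042/run1.dat")
after = can("bob", PATH, "read")
```

Because `move` only updates `replicas`, the answer to `can` is the same before and after the file changes storage systems.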

  22. NOAO Zone Architecture (diagram labels: Telescope, Telescope, Archive)

  23. Ocean Observatories Initiative • Diagram labels: Remote locations Aggregate sensor data in cache Event Detection Sensors Cloud Storage Cache Cloud Computing Message Bus iRODS Data Grid Clients SuperComputer Multiple Protocols Simulations Remote Users External Repositories Digital Library Archive • Large-scale workflows from real-time data to steerable instruments and digital library.

  24. Access: Data Grid Clients

  25. iRODS Distributed Data Management

  26. Towards a Unified Data Space • Sharing data across Space • Organize data as a shared "virtual" Collection • Define unifying properties for the Collection • Sharing data across Time • Preservation is communication with the future • Preservation validates communication from the past • Managing full Data Life Cycle • Evolution of the Policies that govern a data Collection at each stage of the life cycle • From data creation, to collection, to publication, to reference collection, to analysis pipeline

  27. Intellectual Property • Given generic infrastructure, intellectual property resides in the Policies and procedures that manage the Collection • Consistency of the Policies • Capabilities of the procedures • Automation of internal Policy assessment • Validation of desired Collection properties • Automation of administrative tasks • Interacting with DataDirect Networks, HP, IBM, Microsoft on commercial application of open source technology.

  28. Societal Impact • Many communities are assembling digital holdings that represent an emerging consensus: • Common meaning associated with the data • Common interpretation of the data • Common data manipulation mechanisms • The development of a consensus is described as • Socialization of Collections • An example is Trans-border Urban Planning

  29. Federation of Collections • Social consensus for sharing data, policies, methods, practice • Each community controls their own collection Policies • Policies enforced at each storage location • Explicit computer-actionable rules control type of federation interactions, e.g. peer-to-peer, central archive, master-slave data distribution, chained data grids, deep archives • Interoperability mechanisms support technology integration • Community specific clients • Bulk data export / import • Cross registration of data • Structured information resource drivers

  30. Data Grid Federation • Motivation • Improve performance, scalability, and independence • To initiate Federation, each Data Grid administrator establishes trust and creates a remote user • iadmin mkzone B remote Host:Port • iadmin mkuser rods#B rodsuser • Use cases • Chained data grids - National Optical Astronomy Observatory • Master-slave data grids - NIH BIRN • Central archive - UK e-Science • Deep archive - NARA TPAP • Replication - NSF Teragrid
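
The two `iadmin` commands on this slide are run by each administrator against the other zone. The Python helper below only formats those two command strings following the slide's pattern; the function name, the host name, and the port value are placeholders for illustration, and the zone names are borrowed from the later federation slide.

```python
def federation_commands(remote_zone, remote_host, remote_port, remote_user="rods"):
    """Format the commands a local admin runs to trust one remote zone.

    Follows the slide's pattern:
      iadmin mkzone B remote Host:Port
      iadmin mkuser rods#B rodsuser
    """
    return [
        f"iadmin mkzone {remote_zone} remote {remote_host}:{remote_port}",
        f"iadmin mkuser {remote_user}#{remote_zone} rodsuser",
    ]


# Placeholder hosts and port; each administrator runs the list naming the other zone.
on_renci = federation_commands("tacc", "icat.tacc.example.org", 1247)
on_tacc = federation_commands("renci", "icat.renci.example.org", 1247)
```

Federation is symmetric in this sketch: trust exists only once both lists have been run, one in each zone.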

  31. Accessing Data in Federated iRODS • Federated irodsUser (use iRODS clients) • “With access permissions” • “Finds the data” • “Gets data to user” • Two federated iRODS data grids: iRODS/ICAT system at University of North Carolina at Chapel Hill (renci zone); iRODS/ICAT system at University of Texas at Austin (tacc zone) • Federated irodsUsers can upload, download, replicate, share, manage & track access to some or all data (depending on access permissions) in either zone.

  32. Development Team • DICE team • Arcot Rajasekar - iRODS Development Lead • Mike Wan - iRODS Chief Architect • Wayne Schroeder - iRODS Product Mgr., Sr. Developer • Bing Zhu - Fedora, Windows • Mike Conway - Java (Jargon) • Paul Tooby - Documentation, Foundation • Sheau-Yen Chen - Data Grid Administration • Reagan Moore - PI • Preservation • Richard Marciano - Preservation Development Lead • Chien-Yi Hou - Preservation Micro-services • Antoine de Torcy - Preservation Micro-services

  33. Foundation • Data Intensive Cyber Environments Foundation • Nonprofit open source software development • Promotes use of iRODS technology • Supports standards efforts, intellectual prop. • Coordinates international development efforts • IN2P3 - quota and monitoring system • King’s College London - Shibboleth • Australian Research Collaboration Services - WebDAV • Academia Sinica - SRM interface • More information: http://diceresearch.org

  34. iRODS Wiki • More information… • http://irods.diceresearch.org • Descriptions, tutorials, documentation • Publications / presentations • Download of iRODS open source software • Performance tests • irods-chat page
