caGrid Executive Introduction caGrid 1.3 Justin Permar caGrid Knowledge Center https://cabig-kc.nci.nih.gov/CaGrid/KC
Agenda • Vision and Use Cases • caGrid Introduction • Building and Using caBIG Applications • Component / Service Survey • Grid Interactions • Grid Service Deployment
Vision • “Imagine, if you will, a resource that would give individual scientists the capacity to easily view aggregate information on thousands of patients; a system that would also allow both patients and physicians to have complete medical records - including the patient's personal genome, tests performed over time, and medications taken - available at the click of a mouse. Rather than recruiting patients into clinical trials by who walks into the clinic or by individual referral, clinician-scientists could scan a database for patients precisely matched to their study, even if the study is looking for patients with specific genomic alterations, mutations, or translocations.” • “In efforts to increase both the efficacy and efficiency of cancer care, managers of healthcare systems would have patient outcome data from hospitals across the country to utilize in benchmarking their own outcomes in key areas and managing cost. These brief examples are just a glimpse of the power that could come from such an interconnected national biomedical resource.” Source: John Niederhuber, Director, NCI
About caBIG® • caBIG® stands for the cancer Biomedical Informatics Grid®. caBIG® is an information network enabling all constituencies in the cancer community – researchers, physicians, and patients – to share data and knowledge. The components of caBIG® are widely applicable beyond cancer as well. • The mission of caBIG® is to develop a truly collaborative information network that accelerates the discovery of new approaches for the detection, diagnosis, treatment, and prevention of cancer, ultimately improving patient outcomes. • The goals of caBIG® are to: • Connect scientists and practitioners through a shareable and interoperable infrastructure • Develop standard rules and a common language to more easily share information • Build or adapt tools for collecting, analyzing, integrating, and disseminating information associated with cancer research and care. Source: https://cabig.nci.nih.gov/overview/
Driving needs: cancer Biomedical Informatics Grid • A multitude of “legacy” information systems, most of which cannot be readily shared between institutions • An absence of tools to connect different databases • An absence of common data formats • A huge and growing volume of data that must be collected, analyzed, and made accessible • Few common vocabularies, making it difficult, if not impossible, to interlink diverse research and clinical results • Difficulty in identifying and accessing available resources • An absence of information infrastructure to share data within an institution, or among multiple institutions • Redundant effort from re-building the same applications at multiple institutions
What is the Grid? • “Controlled and coordinated resource sharing and problem solving in dynamic, scalable virtual organizations.”¹ • Securely sharing (with policies!): • Computers • Software • Data • Other Resources ¹ The Anatomy of the Grid: Enabling Scalable Virtual Organizations. I. Foster, C. Kesselman, S. Tuecke. International Journal of Supercomputer Applications, 15(3), 2001.
What is caBIG? • Common, widely distributed infrastructure that addresses common caBIG needs and permits the cancer research community to focus on innovation • Shared, harmonized set of terminology, data elements, and data models that facilitate information exchange • Collection of interoperable applications developed to common standards • Cancer research data available for mining and integration
Why Grid for caBIG? Adapted from Muzna Mirza, MD, MSHI’s presentation on Global Public Health Grid: http://cdc.confex.com/cdc/phin2009/webprogram/Paper21091.html
What is caGrid to caBIG? • The “G” in caBIG: cancer Biomedical Informatics Grid • Provides the software infrastructure that underlies the tools and applications of caBIG • Analogous to the “power grid”: a multitude of applications with differing requirements can seamlessly be plugged in to a common infrastructure
What is caGrid? (2) • Biomedical applications that share data all have common needs for syntactic and semantic interoperability • caGrid aims to be a platform for interoperability • caGrid is a Grid software toolkit aimed at software developers creating Grid applications • caGrid provides • the GAARDS toolkit, a standard security platform • metadata services that add semantic information to all Grid services • Introduce, a toolkit to develop Grid services • The Grid is a trusted network that supports collaborative biomedical research. • “Getting on the Grid” involves joining the trusted network by applying for and utilizing Grid credentials
Compatibility and Interoperability caBIG® provides standards-based compatibility guidelines for creating software systems that are syntactically and semantically interoperable.
The Grid Allows Users to Find and Utilize Data and Analytical Resources • Grid service information is advertised to a Grid service directory called the Index Service. • This service is used to locate Grid services relevant to your research objectives. Figure: Grid services in front of data or analytical resources (for example, the caBIO Grid service) advertise to the Grid service directory (Index Service); client applications and users discover services through the directory.
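The advertise/discover pattern described on this slide can be illustrated with a small in-memory sketch. This is not the caGrid Index Service API: the `ServiceRecord` and `ServiceDirectory` names, the keyword matching, and the example URLs are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceRecord:
    """Hypothetical advertisement that a Grid service sends to the directory."""
    url: str
    description: str
    keywords: set[str] = field(default_factory=set)


class ServiceDirectory:
    """Toy in-memory stand-in for a Grid service directory (not the real Index Service)."""

    def __init__(self) -> None:
        self._records: dict[str, ServiceRecord] = {}

    def advertise(self, record: ServiceRecord) -> None:
        # Re-advertising the same URL replaces the earlier record.
        self._records[record.url] = record

    def discover(self, keyword: str) -> list[ServiceRecord]:
        # Case-insensitive keyword match, sorted by URL for stable output.
        wanted = keyword.lower()
        matches = [
            record for record in self._records.values()
            if wanted in {k.lower() for k in record.keywords}
        ]
        return sorted(matches, key=lambda record: record.url)


directory = ServiceDirectory()
directory.advertise(
    ServiceRecord("https://example.org/caBIO", "Gene and pathway data", {"gene", "pathway"})
)
directory.advertise(
    ServiceRecord("https://example.org/imaging", "Image analysis", {"image"})
)
found = directory.discover("Gene")
```

A client only needs the directory's location; the services it finds can change over time without the client being reconfigured.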
caGrid: High Level View Once a caBIG® tool is adopted or adapted by members of the research community, the tool is connected to the Grid to securely share data and analysis routines with collaborating researchers.
Infrastructure Focus Areas • Leveraging Grid technologies and standards as an interoperability platform • Metadata Infrastructure • Surfacing wealth of existing caBIG data-oriented metadata on the grid • Providing new service-oriented metadata • Security • Integrating existing systems and applications with Grid security • Lowering burden of implementation of grid-wide and local policy • Tooling for Service Developers • Powerful platform for bringing applications and data to the grid • Facilitating Grid-wide operations • Federated query, workflow execution, resource discovery • Making the Grid more accessible • Graphical installation and configuration, higher-level object-oriented APIs, web portals, graphical administrative applications • Quality • Comprehensive testing infrastructure, automated builds and test execution on multiple platforms, dashboard with historical archive
More About Security • Comprehensive security is critical for collaboration scenarios involving biomedical data sharing. The caGrid security components, collectively known as GAARDS, include the following services: • Dorian – Allows users to login to the Grid • Authentication Service – Integrates existing institutional login capabilities with the Grid • Grid Grouper – Allows institutions to implement group-based security policies • Grid Trust Service –Provides capabilities for Grid entities to trust each other • Credential Delegation Service – Provides the ability to securely transfer Grid credentials to others • Web Single Sign-On – Allows a single login to provide access to multiple web applications that utilize Grid services
caGrid Integration with Existing Information Systems • caGrid is an informatics platform that integrates and augments existing informatics infrastructure • Examples include the following: • caGrid integrates existing repositories of semantic information such as ontology servers • caGrid integrates with existing institutional login systems (e.g., LDAP) • caGrid shares data from existing databases and files • In summary, caGrid integrates with existing systems to share and analyze data for multi-institutional clinical and research scenarios
Getting Started with caGrid • To get started developing Grid applications, first install caGrid • Use the caGrid installer to load caGrid onto your development machine • Using the installer is the easiest way to install caGrid • Features include: • Guided, wizard-like interface for easy installation • The installer can be used to re-configure existing installations • The only requirement to run the installer is the Sun® Java™ 5 Development Kit.
caGrid Community Involvement: Building Grid Applications • caGrid itself provides no real “data” or “analysis” to caBIG; caGrid enables the community to build services that share and analyze data • The real “value” of the grid comes from bringing this information to the “end user” • Community members develop end-user applications that consume the resources provided by the grid • A Grid data service shares data securely with collaborators • A Grid analytical service analyzes data • A Grid application utilizes multiple Grid services to aid clinical and research workflows
caCORE Development Process • caCORE is a robust set of tools and resources to support the development of caBIG®-compatible systems • NCI offers comprehensive training for caCORE tools • Step 1: Create an Information Model using a modeling tool • Step 2: Perform Semantic Integration using the SIW • Step 3: Transform the Model into Metadata using the UML Loader • Step 4: Generate Code and Interfaces using the caCORE SDK Code Generator • Step 5: Generate a Grid Service using caGrid Figure labels: Information Models, Vocabularies, CDEs, APIs, Grid Reference: Dr. Robert Freimuth, Vocabulary Knowledge Center Director
UML Model Creation Process • Create a Logical Model (UML class diagram) using Enterprise Architect • Create a Data Model (database schema) using Enterprise Architect • Map the Logical Model to the Data Model using caAdapter • Semantically Annotate the UML Model using the SIW • Model is complete and ready for compatibility review and load into caDSR • Enterprise Vocabulary Services (EVS) • Stores controlled terminologies used during semantic annotation • The SIW pulls concepts from EVS and attaches them to model components • cancer Data Standards Repository (caDSR) • Stores Common Data Elements (CDEs) • UML model elements that are semantically annotated are added to the caDSR as CDEs Figure labels: Logical Model, Data Model, Mapping, Semantics, Load Model
caBIG® Compatibility Guidelines: Areas of Interoperability • Semantic Interoperability (VCDE) • Information Models • Vocabularies and Ontologies • Common Data Elements (CDEs) • Syntactic Interoperability (Architecture) • Programming and Messaging Interfaces (APIs) • An application must meet the criteria specified in all four areas to be "caBIG® Compatible" Reference: Dr. Robert Freimuth, Vocabulary Knowledge Center Director
caBIG® Compatibility Guidelines: Levels of Maturity (applied to each of the four areas: Information Models, Vocabularies, CDEs, APIs) • Legacy: Implies no interoperability with an external system or resource • Bronze: Minimum requirements to achieve basic interoperability • Silver: Rigorous requirements to significantly reduce the barrier of use for parties not involved with development of that resource • Gold: Extensions to silver that add standardization and harmonization practices to enable full syntactic and semantic interoperability Source: https://cabig.nci.nih.gov/guidelines_documentation
caGrid 1.3 Core Services All caGrid Core Services were redeployed on all caBIG® Grids (OSU Training, QA, Stage, and Production) for this release. The (12) caGrid 1.3 Core Services are: * New for 1.3 ** Significantly Rewritten or Enhanced for 1.3
What’s the use of metadata? • Service metadata is critical for finding Grid resources relevant to particular research and clinical scenarios • Metadata describes the service functionality and meaning of data that are shared by a Grid service • Scenario: Scientists and others using the Grid want to find and utilize existing data sources and algorithms relevant to their research scenarios • Solution: Grid services register with a Grid service directory • Scenario: Users want to view the structure and relationships of data on the Grid • Solution: The UML model defines the content of Grid data types and relationships between these types • Scenario: Users need to know the format of the data described in a UML model • Solution: XML schemas, stored in a Grid repository, define the data format to act as the foundation for syntactic interoperability • Scenario: Scientists want to identify the meaning of the data described in a UML model • Solution: Grid data is annotated with semantic information, such as use of community-approved vocabulary and concept definitions
What caGrid services provide this functionality? • Scenario: Scientists and others using the Grid want to find and utilize existing data sources and algorithms relevant to their research scenarios • The Index Service included in caGrid is a Grid-wide service directory that serves as the “white” and “yellow” pages of the Grid • Scenario: Users want to view the structure and relationships of data on the Grid • Every data service provides a data model that represents the information in the UML model • Scenario: Users need to know the format of the data described in a UML model • The Global Model Exchange (GME) Service is a Grid-wide repository for XML schemas • Scenario: Scientists want to identify the meaning of the data described in a UML model • The Metadata Model Service (MMS) is used to add semantic information to caGrid services • The MMS also is used to generate a Grid representation of the data in your UML model, including semantic information
How does caGrid use the caBIG semantic repositories? • All caGrid services are expected to publish a set of standard metadata that draws heavily on the metadata registered in caDSR and EVS • Common Metadata describes generic information about the service, such as the Cancer Center providing it and its points of contact • The service’s operations are defined, and their inputs and outputs described, using CDEs in caDSR and vocabulary from EVS • Data Services additionally describe the domain model they are exposing • Classes, attributes, and associations from the UML model • Semantics of the UML model
What security problems exist for multi-institutional data sharing scenarios? • Inter-institutional “trust” • What institutions participate in the Grid? How can you verify that an identity was issued by the institution it claims to be from? • User authentication • How does a user prove their identity? How can we check that the identity is legitimate? • User authorization • How can institutions that share Grid services grant privileges to their collaborators? • How can institutions that share data ensure their collaborators can only access data that the institutions intend to share? • Data Integrity • How can institutions be sure that data they are sharing is transmitted properly? • Data Security • How can institutions be sure that they share data only with whom they intend to share data? • Allowing services to retrieve and analyze data on your behalf
What caGrid Services Address these Security Scenarios? • Inter-institutional “trust” • The Grid Trust Service (GTS) is used to establish a trust fabric, which is a collection of authoritative certificate authorities • User authentication • Dorian has a CA that is an essential part of the trust fabric • Dorian issues both host certificates and user credentials that are trusted by others in the Grid because they have synchronized with the trust fabric • The Authentication Service allows institutions to integrate their local user management systems with the Grid • User authorization • Grid Grouper provides group management, which in turn, allows service developers to add group-based authorization policies • The Common Security Module (CSM) can be used to protect individual data elements shared by a Grid data service
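The combination of a trust fabric and group-based authorization can be sketched as two checks applied in order. This is a conceptual illustration only, not the GTS, Dorian, or Grid Grouper API: the `Credential` class, the issuer and group names, and the plain string comparison are assumptions, and real deployments validate certificates rather than comparing names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Credential:
    """Hypothetical, heavily simplified Grid credential."""
    subject: str  # who the credential identifies
    issuer: str   # which certificate authority issued it


def is_trusted(cred: Credential, trusted_issuers: frozenset[str]) -> bool:
    # Trust fabric check: accept only credentials from an authoritative CA.
    return cred.issuer in trusted_issuers


def is_authorized(
    cred: Credential,
    trusted_issuers: frozenset[str],
    groups: dict[str, frozenset[str]],
    required_group: str,
) -> bool:
    # Authentication comes first: an untrusted issuer is rejected outright.
    if not is_trusted(cred, trusted_issuers):
        return False
    # Group-based policy: the subject must belong to the required group.
    return cred.subject in groups.get(required_group, frozenset())


trust_fabric = frozenset({"CA:example-dorian"})
groups = {"trial-investigators": frozenset({"alice"})}

alice = Credential(subject="alice", issuer="CA:example-dorian")
bob = Credential(subject="bob", issuer="CA:example-dorian")
mallory = Credential(subject="alice", issuer="CA:unknown")

alice_ok = is_authorized(alice, trust_fabric, groups, "trial-investigators")
bob_ok = is_authorized(bob, trust_fabric, groups, "trial-investigators")
mallory_ok = is_authorized(mallory, trust_fabric, groups, "trial-investigators")
```

Note that `mallory` claims the subject `alice` but is rejected because the issuer is not part of the trust fabric; group membership is never consulted for untrusted credentials.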
What caGrid Services Address these Security Scenarios? (2) • Data Integrity • caGrid supports checksums to ensure that data has not been altered during transmission • Data Security • caGrid supports encryption to ensure that data cannot be read by others during transmission • Allowing services to work for you • The Credential Delegation Service (CDS) allows you to hand your credential to a third party for a specified period of time
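The checksum idea behind the data integrity bullet can be shown in a few lines. The slide does not say which checksum algorithm caGrid uses, so SHA-256 from the Python standard library is an assumption chosen purely for illustration, as are the function names and the sample payload.

```python
import hashlib


def checksum(payload: bytes) -> str:
    # SHA-256 is an assumed algorithm; the source does not name one.
    return hashlib.sha256(payload).hexdigest()


def verify(payload: bytes, expected: str) -> bool:
    # The receiver recomputes the checksum and compares it with the sender's value.
    return checksum(payload) == expected


sent = b"subject_id,qt_ms\nS1,400\n"
sent_checksum = checksum(sent)

received_intact = sent
received_altered = sent.replace(b"400", b"410")

intact_ok = verify(received_intact, sent_checksum)
altered_ok = verify(received_altered, sent_checksum)
```

A checksum only detects alteration; keeping the data unreadable to others is the separate job of encryption, as the slide notes.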
How do Grid applications use core caGrid services? • The user community adds data services and analytical services to the Grid • These services share data and analytical resources with others • Multi-institutional collaborations will require the use of multiple Grid services • caGrid provides “higher-level” services that utilize the aforementioned Grid services • The Federated Query Processor (FQP) provides applications with capabilities to aggregate data from multiple (equivalent) data services and to join data from multiple data services • The workflow services allow users to specify interactions between services to achieve a desired result • For example, retrieve all ECG data for subjects in our clinical trial and calculate the mean QT value, storing the data in a results data service
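The aggregate-then-join behaviour attributed to federated queries, and the mean QT example from the slide, can be sketched with plain lists. This is not the FQP API: the record layout, the `aggregate` and `join_on` helpers, and the sample values are hypothetical, and the "join" here is the simplest form (keeping rows whose key also appears in the other data set).

```python
from statistics import mean

# Hypothetical records returned by two equivalent ECG data services.
ecg_site_a = [{"subject_id": "S1", "qt_ms": 400}, {"subject_id": "S2", "qt_ms": 420}]
ecg_site_b = [{"subject_id": "S3", "qt_ms": 440}, {"subject_id": "S9", "qt_ms": 500}]

# Hypothetical enrollment records from a separate clinical trial data service.
trial_enrollment = [{"subject_id": "S1"}, {"subject_id": "S2"}, {"subject_id": "S3"}]


def aggregate(*result_sets: list[dict]) -> list[dict]:
    # Aggregation: concatenate results from equivalent data services.
    combined: list[dict] = []
    for result_set in result_sets:
        combined.extend(result_set)
    return combined


def join_on(left: list[dict], right: list[dict], key: str) -> list[dict]:
    # Simplified join: keep rows from `left` whose key value also occurs in `right`.
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] in right_keys]


all_ecg = aggregate(ecg_site_a, ecg_site_b)
trial_ecg = join_on(all_ecg, trial_enrollment, "subject_id")
mean_qt = mean(row["qt_ms"] for row in trial_ecg)
```

Subject `S9` has ECG data but is not enrolled in the trial, so it is excluded before the mean is computed.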
Other caGrid Utilities and APIs • CQL and DCQL • CQL is the “caGrid Query Language” that is used to retrieve data from caGrid data services • DCQL is the distributed query language that is used for federated query processing • Web Single Sign On • The Web Single Sign On component allows users to sign in once and use multiple secure web applications • Introduce • Grid application developers use the Introduce toolkit to create data and analytical services • The Introduce toolkit can be extended to add project-specific functionality
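The general shape of an object query (a target type plus a condition on one of its attributes) can be modelled as below. This sketch does not reproduce CQL syntax or any caGrid client API: the `ObjectQuery` and `AttributeFilter` classes, the predicate names, and the in-memory `store` are assumptions used only to show the idea.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class AttributeFilter:
    name: str
    predicate: str  # hypothetical predicate names: "EQUAL_TO" or "STARTS_WITH"
    value: Any


@dataclass(frozen=True)
class ObjectQuery:
    target: str                 # name of the data type to retrieve
    attribute: AttributeFilter  # condition every returned object must satisfy


def matches(obj: dict, flt: AttributeFilter) -> bool:
    actual = obj.get(flt.name)
    if flt.predicate == "EQUAL_TO":
        return actual == flt.value
    if flt.predicate == "STARTS_WITH":
        return isinstance(actual, str) and actual.startswith(flt.value)
    raise ValueError(f"unsupported predicate: {flt.predicate}")


def run_query(query: ObjectQuery, store: dict[str, list[dict]]) -> list[dict]:
    # A data service would evaluate the query against its own backend;
    # here the backend is a dictionary keyed by type name.
    return [obj for obj in store.get(query.target, []) if matches(obj, query.attribute)]


store = {"Gene": [{"symbol": "BRCA1"}, {"symbol": "BRCA2"}, {"symbol": "TP53"}]}
query = ObjectQuery("Gene", AttributeFilter("symbol", "STARTS_WITH", "BRCA"))
result = run_query(query, store)
```

Because the query names a type from the shared data model rather than a database table, the same query can be sent to any data service exposing that model.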
An example Introduce development process (0 lines of developer code!) • Create Semantically Harmonized Data Model • Generate Data Resource • Grid-ify
Grid Workflows utilize core Grid Services • The Grid services that are included in caGrid provide a core set of features available for Grid usage scenarios • Grid workflows are software implementations of real-life clinical and research workflows Figure: Example Data Analysis Workflow
Example Image Analysis Scenario • Each image processing step is a Grid service • Each step in background correction is an operation Source: Joel H. Saltz, Scott Oster, Shannon L. Hastings, Stephen Langella, Renato A. Ferreira, Justin D. Permar, Ashish Sharma, David W. Ervin, Tony C. Pan, Umit V. Catalyurek, Tahsin M. Kurc, "Translational research design templates, Grid computing, and HPC", IEEE International Symposium on Parallel and Distributed Processing, pp. 1-15, June 2008. http://bmi.osu.edu/publications_more.php?ID=1113
Joining the Grid • During Grid service creation, the service creator specifies the authentication and authorization requirements for the service • For example, a service can require that users must authenticate with the service in order to communicate • Specify authorization options (CSM/Grid Grouper) that are needed to support data retrieval and analysis operations that the service offers. A service can require authorization at the service level, operation level, and data level (give the user permission to retrieve only what they are allowed to view) • Configure a container to host the service • Two types of containers: secure and non-secure • A non-secure container can only host non-secure services and does not support authentication or authorization • A secure container can host secure and non-secure services and will support authentication and authorization as specified by the service • A secure container has its own identity that it uses to communicate with the rest of the Grid • Deploy the service to the container and start the container • The service advertises itself to the Grid service directory • The service directory, in turn, asks your service for information about its operations and data
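The container rules on this slide reduce to a simple compatibility check at deployment time. The sketch below is a conceptual model, not the caGrid or container deployment API: the `Service` and `Container` classes and the `deploy` method are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Service:
    name: str
    secure: bool  # True if the service requires authentication/authorization


@dataclass
class Container:
    secure: bool
    services: list[Service] = field(default_factory=list)

    def can_host(self, service: Service) -> bool:
        # A secure container hosts both kinds of service;
        # a non-secure container hosts only non-secure services.
        return self.secure or not service.secure

    def deploy(self, service: Service) -> None:
        if not self.can_host(service):
            raise ValueError(
                f"{service.name} requires a secure container"
            )
        self.services.append(service)


public_lookup = Service("public-lookup", secure=False)
patient_data = Service("patient-data", secure=True)

secure_container = Container(secure=True)
secure_container.deploy(public_lookup)
secure_container.deploy(patient_data)

open_container = Container(secure=False)
open_container.deploy(public_lookup)
try:
    open_container.deploy(patient_data)
    rejected = False
except ValueError:
    rejected = True
```

Rejecting the mismatch at deployment time, rather than at the first request, keeps a secure service from ever running without the authentication support it depends on.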
The Role of Grid Policy • The virtual organizations that join a Grid collectively establish (and enforce) policies that govern the use of the Grid • Security policies • How long can a user Grid session last? • Data sharing policies • Sharing de-identified data? Limited data sets? PHI? • Service level agreements • What requirements are imposed on service providers? • Other domain-specific policies
Project Resources and Communication • cagrid.org: Software Downloads, Documentation, Tutorials, Technical Paper and Presentations, FAQs • caBIG® caGrid Knowledge Center: Knowledge Base, Forums, Enterprise Support, Community engagement • https://cabig-kc.nci.nih.gov/CaGrid/KC/index.php/Main_Page • caGrid GForge Home (project website): Feature Requests, Bug Reports • http://gforge.nci.nih.gov/projects/cagrid-1-0/ • caGrid Portal (web portal) • http://cagrid-portal.nci.nih.gov/
Acknowledgments • THANK YOU • caGrid Development team • caBIG® Documentation and Training team