The cancer Biomedical Informatics Grid: Connecting the Cancer Research Community. Scott Oster, Department of Biomedical Informatics, Ohio State University. Challenges of Large Applications in Distributed Environments (CLADE) 2007, Monterey Bay, California, June 25, 2007
Agenda • caBIG Overview • caGrid • Challenges of caBIG
Cancer Background • This year there will be approximately 1,400,000 Americans diagnosed with cancer • More than 500,000 Americans are expected to die from cancer this year • In 2005, the NIH estimated costs for cancer at $209.9 billion, with direct medical costs of $74 billion
National Cancer Institute 2015 Goal “Relieve suffering and death due to cancer by the year 2015”
Origins of caBIG • Goal: Enable investigators and research teams nationwide to combine and leverage their findings and expertise in order to meet the NCI 2015 Goal. • Strategy: Create a scalable, actively managed organization that will connect members of the NCI-supported cancer enterprise by building a biomedical informatics network
caBIG Community • More than 50 Cancer Centers (of 61 total) • 30 Organizations • Government, Industry, Standards • Over 800 people
caBIG Domain Workspaces • The data and tool producers: • Clinical Trial Management Systems • Provides software tools for consistent, open, and comprehensive clinical trials management, including enrollment of patients, tracking of protocols, recording of outcomes information, administration of trials, and submission of data to regulatory authorities • Integrative Cancer Research • Builds software tools and systems to enable integration of clinical information (such as data collected from biospecimen donors) with molecular information (such as data from high-throughput genomic and proteomic technologies) • In Vivo Imaging • Provides technology for the sharing and analysis of in vivo (in the body) imaging data, such as MRI and PET scans, both in basic and clinical research settings • Tissue Banks and Pathology Tools • Develops software tools for the collection, processing, and dissemination of biospecimens, including the annotation of those biospecimens with donor clinical and protocol data, as well as for the operational and administrative aspects of biorepositories
caBIG Strategic Workspaces • The policy makers: • Data Sharing and Intellectual Capital • Develops policies for the sharing of data, software, and inventions within the caBIG™-funded cancer community. This workspace addresses, for example, how to implement patient protection policies; the ethical, legal, and contractual obligations associated with the sharing of clinical data and biospecimens; and how the public and private sector should interact when using caBIG™ tools in collaboration • Documentation and Training • Provides technical training for software developers in the use of the caBIG™ resources, including online tutorials, workshops, and education programs • Strategic Planning • Assists in identifying strategic priorities for the development and evolution of caBIG™
caBIG Cross-Cutting Workspaces • The infrastructure and standards developers: • Architecture • Develops communication standards and systems necessary for all other caBIG™ workspaces to inter-connect as a grid via the Internet, including solutions for access control, security, and patient data protection • Vocabularies and Common Data Elements • Creates data standards, including the development, promotion, and support of vocabularies, ontologies, and common data elements to ensure that the entire caBIG™ community is speaking the same “language.” Such common data standards are a key component to ensure that large scale NCI projects generate interoperable information
What is caBIG? • Common, widely distributed infrastructure that permits the cancer research community to focus on innovation • Shared, harmonized set of terminology, data elements, and data models that facilitate information exchange • Collection of interoperable applications developed to common standards • Cancer research data available for mining and integration
Driving needs • A multitude of “legacy” information systems, most of which cannot be readily shared between institutions • An absence of tools to connect different databases • An absence of common data formats • A huge and growing volume of data must be collected, analyzed, and made accessible • Few common vocabularies, making it difficult, if not impossible, to interlink diverse research and clinical results • Difficulty in identifying and accessing available resources • An absence of information infrastructure to share data within an institution, or among different institutions
What is caGrid? • Development project of Architecture Workspace • The Grid infrastructure for caBIG (the “G” in caBIG) • Driven from use cases and needs of cancer research community • Service Oriented Architecture • Based on federation • Model Driven • Object-Oriented, Semantically-Annotated Data Virtualization
What is caGrid? cont… • Builds on existing Grid technologies • Provides additional enterprise Grid components • Grid Service Graphical Development Toolkit • Metadata Infrastructure • Advertisement and Discovery • Semantic Services • Data Service Infrastructure • Analytical Service Infrastructure • Identifiers • Workflow • Security Infrastructure • Client tooling
Agenda • caBIG Overview • caGrid • Challenges of caBIG
Issue: Disparate systems • No common infrastructure for applications, databases, etc • Variety of programming languages • Variety of platforms and operating systems • Inability to interoperate with other systems throughout virtual organization
Approach: Disparate systems • Create and leverage a standards-based Grid (caGrid) • WSRF web services using SOAP/HTTP(s) • Creation of compatibility guidelines and review process • Define a uniform query interface and language for data providing systems • Provide common infrastructure services needed by most federation scenarios • Focus on tools for virtualizing existing systems and APIs behind these grid interfaces • Open Issue: some systems require more manual work than others • Open Issue: tradeoff between specificity and universal applicability
Introduce • Graphical Development Environment for Grid Services • Provides simple means to create a service skeleton that a developer can then implement, build, and deploy • Provides a set of tools which enable the developer to add/remove/modify/import methods of the service • Automatic code generation (WSDL, service and client APIs, JNDI, WSDDs, security descriptors, metadata, etc)
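As a rough illustration of what the strongly typed, virtualized interfaces buy a developer, an Introduce-generated client is used roughly along the lines below. This is a minimal sketch only: the domain class, service name, method, and URL are all invented placeholders, not the actual generated API, and the stub types exist only so the sketch is self-contained.

```java
// Illustrative sketch: an Introduce-generated client hides the WSRF/SOAP
// plumbing behind a typed Java API over caDSR-registered domain objects.
// All names below are hypothetical stand-ins.

// Stand-in for a strongly typed domain object defined in a registered UML model.
class Gene {
    private final String symbol;
    Gene(String symbol) { this.symbol = symbol; }
    String getSymbol() { return symbol; }
}

// Stand-in for the generated client interface of a hypothetical grid service.
interface GeneAnnotationServiceClient {
    Gene[] getGenesBySymbol(String symbol) throws Exception;
}

public class ClientSketch {
    public static void main(String[] args) throws Exception {
        // In generated code, the client is constructed from the service endpoint URL
        // and serializes arguments/results to the schema registered for the model.
        GeneAnnotationServiceClient client =
                connect("https://host.example.org/wsrf/services/GeneAnnotationService");
        for (Gene g : client.getGenesBySymbol("TP53")) {
            System.out.println(g.getSymbol());
        }
    }

    // Placeholder factory so the sketch runs; a real client talks to the service.
    static GeneAnnotationServiceClient connect(String url) {
        return symbol -> new Gene[] { new Gene(symbol) };
    }
}
```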
Issue: Lack of common Data Formats • Tools use widely varying and/or proprietary data formats • Lack of formal definition • Not all suitable for communication with remote systems • Lack of uniform way to discover and understand the formats
Approach: Lack of common Data Formats • Adopt XML as data exchange format • Leverage XML Schemas for definition • Global Model Exchange service for publishing, managing, and discovering XML Schemas • Leverage UML for logical definition of data models • Cancer Data Standards Repository (caDSR) captures logical model with annotations; facilitates reuse and formal definition • Formal binding of logical model (UML) and exchange model (XML) • Community review of the use of standards for new systems • Open Issue: Data translation still necessary when existing system can’t be easily changed (though some caBIG tools exist to address this; e.g. caAdapter) • Open Issue: tradeoff between reuse and creating the new “perfect model”
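To make the "formal binding of logical model and exchange model" concrete, the sketch below shows a class whose XML form is pinned down by annotations, using standard JAXB (javax.xml.bind) purely as an illustration of the idea; caGrid's actual serialization stack differs, and the class, fields, NSC value, and GME-style namespace are invented for this example.

```java
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import javax.xml.bind.annotation.*;

// Hypothetical domain class: in caBIG the logical model lives in UML/caDSR and
// the XML schema it serializes to is published in the Global Model Exchange.
@XmlRootElement(name = "Agent", namespace = "gme://example.org/1.0/agent")
@XmlAccessorType(XmlAccessType.FIELD)
class Agent {
    @XmlElement private String name;
    @XmlElement private String nscNumber;
    Agent() {}
    Agent(String name, String nscNumber) { this.name = name; this.nscNumber = nscNumber; }
}

public class BindingSketch {
    public static void main(String[] args) throws Exception {
        // Marshal the object to the XML form governed by the registered schema.
        Marshaller m = JAXBContext.newInstance(Agent.class).createMarshaller();
        m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);
        m.marshal(new Agent("Taxol", "0000" /* placeholder NSC number */), System.out);
    }
}
```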
Issue: Data Interoperability • Common data formats allow for syntactic data interoperability but are not sufficient for ensuring common semantics • May work with wholesale adoption of common domain-specific models, but breaks down cross-model • Need to understand the meaning of the value domains and terminology of a data format or system • Assumptions of meaning can be dangerous, even deadly, in the medical domain
Interoperability • The ability of multiple systems to exchange information and to be able to use the information that has been exchanged • Spans both syntactic interoperability (exchanging the data) and semantic interoperability (using it)
Semantics Example
<Agent>
  <name>Taxol</name>
  <nSCNumber>007</nSCNumber>
</Agent>
Approach: Data Interoperability • Community maintained and curated shared ontology • Enterprise Vocabulary Services (EVS) maintains and provides access to the data semantics and controlled vocabulary of all models • Definitions, synonyms, relationships, etc. • All models in caDSR annotated with terminology and concepts from EVS • Focus on identifying “Common Data Elements” as semantically equivalent attributes • Based on ISO 11179 Information Technology – Metadata Registries (MDR) parts 1-6 • Community review of the use of standards and harmonization for new systems • Open Issue: Is it possible to scale to federated terminologies? • Open Issue: High initial cost of entry; high overhead of maintaining quality
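A loose sketch of what one ISO 11179-style common data element registration carries is shown below. The record shapes, concept codes, public ID, and names are all invented for illustration; real entries live in caDSR and reference NCI Thesaurus concepts maintained in EVS.

```java
import java.util.List;

public class CdeSketch {
    // Minimal stand-in for an ISO/IEC 11179 administered data element: a data
    // element concept (object class + property) bound to controlled terminology.
    // Codes and IDs below are invented, not real EVS or caDSR identifiers.
    record Concept(String code, String preferredName) {}
    record DataElement(long publicId, String longName,
                       Concept objectClass, Concept property,
                       List<String> permissibleValues) {}

    public static void main(String[] args) {
        DataElement agentNscNumber = new DataElement(
                999999L,                                  // hypothetical caDSR public ID
                "Agent NSC Number",
                new Concept("C0000", "Pharmacologic Substance"),
                new Concept("C0001", "NSC Number"),
                List.of());                               // free-text value domain
        System.out.println(agentNscNumber.longName()
                + " = " + agentNscNumber.objectClass().preferredName()
                + " / " + agentNscNumber.property().preferredName());
    }
}
```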
caGrid Data Description Infrastructure • Client and service APIs are object oriented, and operate over well-defined and curated data types • Objects are defined in UML and converted into ISO/IEC 11179 Administered Components, which are in turn registered in the Cancer Data Standards Repository (caDSR) • Object definitions draw from controlled terminology and vocabulary registered in the Enterprise Vocabulary Services (EVS), and their relationships are thus semantically described • XML serialization of objects adheres to XML schemas registered in the Global Model Exchange (GME)
Issue: Finding Resources • Creating infrastructure for programmatic interoperability is excessive without a way to dynamically find and use previously unknown resources • Resources need to be self-descriptive enough such that their use and value can be determined
Approach: Finding Resources • Rich set of standardized metadata publicly provided by each service • Operations and data types described in terms of structure and semantics extracted from caDSR and EVS • Services register existence with Index Service, and metadata is aggregated • Tools for querying Index Service, and analyzing metadata are provided • Open Issue: Lines between data and metadata are blurry at best • Key distinctions in caBIG are that metadata is publicly accessible and describes "types", not instances
Advertisement and Discovery Process • All services register their service location and metadata information to an Index Service • The Index Service subscribes to the standardized metadata and aggregates their contents • Clients can discover services using a discovery API which facilitates inspection of data types • Leveraging semantic information (from which service metadata is drawn), services can be discovered by the semantics of their data types • “Find me all the services from Cancer Center X” • “Which Analytical services take Genes as input?” • “Which Data services expose data relating to lung cancer?”
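The sketch below conveys the flavor of such semantic discovery queries as filters over the metadata an Index Service aggregates. The types, method names, endpoints, and registry contents are hypothetical stand-ins, not the actual caGrid discovery API.

```java
import java.util.List;

// Illustrative sketch of metadata-driven discovery against an Index Service.
// All names are invented; they only convey the style of semantic queries.
public class DiscoverySketch {

    record ServiceEntry(String endpoint, String cancerCenter, List<String> inputConcepts) {}

    interface IndexQueryClient {
        List<ServiceEntry> allServices();
    }

    public static void main(String[] args) {
        IndexQueryClient index = DiscoverySketch::sampleRegistry;

        // "Which Analytical services take Genes as input?" expressed as a filter
        // over the aggregated, semantically annotated service metadata.
        index.allServices().stream()
             .filter(s -> s.inputConcepts().contains("Gene"))
             .forEach(s -> System.out.println(s.endpoint()));
    }

    // Stand-in for the aggregated registrations held by an Index Service.
    static List<ServiceEntry> sampleRegistry() {
        return List.of(
            new ServiceEntry("https://center-x.example.org/GeneAnalysis", "Cancer Center X", List.of("Gene")),
            new ServiceEntry("https://center-y.example.org/ImageStore", "Cancer Center Y", List.of("Image")));
    }
}
```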
Issue: Data Size • Numerous sources of large data sets • Imaging • Tumor Microenvironment • High Resolution Scanning = 25 TB/cm² of tissue • Image repositories • Multiple modalities, thousands of cases, millions of images, terabytes of data • Mouse Models • terabytes of data • Proteomics • Modest example: 30 samples, 10 fractions, 10 runs, 1.5 MB per spectrum = 4.5 GB • Many others
Approach: Data Size • Often a tradeoff between optimized performance and interoperability • e.g. out-of-band binary transfer vs XML/SOAP/HTTP • Currently leveraging: • Transfer: WS-Enumeration, GridFTP (with integrated security and metadata) • Avoid transfer: identifiers, federated query, workflow, co-location • Looking at: • Moving services to data (Imaging) • Binary data format descriptions for binary metadata (e.g. DFDL) • A new area to address; much more to do…
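One reason WS-Enumeration matters here is that it lets a client pull a huge result set in bounded chunks instead of receiving one enormous SOAP response. The sketch below mimics that client-side pull loop with invented types; it is not the WS-Enumeration API itself, just the pattern.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.IntStream;

// Sketch of an enumeration-style pull loop: ask the service for the next
// bounded batch until the enumeration is exhausted. Types here are invented.
public class EnumerationSketch {

    interface ResultEnumeration<T> {
        List<T> pull(int maxItems);   // returns an empty list when exhausted
    }

    public static void main(String[] args) {
        ResultEnumeration<String> records = sampleEnumeration(250_000);
        long processed = 0;
        List<String> batch;
        while (!(batch = records.pull(10_000)).isEmpty()) {
            processed += batch.size();   // process/stream each bounded chunk
        }
        System.out.println("Processed " + processed + " records in bounded batches");
    }

    // Stand-in for a server-side enumeration over a large result set.
    static ResultEnumeration<String> sampleEnumeration(int total) {
        Iterator<Integer> it = IntStream.range(0, total).iterator();
        return max -> {
            ArrayList<String> out = new ArrayList<>();
            while (it.hasNext() && out.size() < max) out.add("record-" + it.next());
            return out;
        };
    }
}
```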
Issue: User Accounting • Most legacy systems built with local users and permissions • Can’t require users to maintain hundreds of accounts, but still need to allow local policy • Central account management and identity vetting is not tractable • Yet there are too many organizations with differing infrastructures to establish point-to-point relationships
Approach: User Accounting • Provide Single Sign On to grid via X.509 proxy certificates • Grid Authentication and Authorization with Reliably Distributed Services (GAARDS) • Federated Identity Management (Dorian) • Rely on participating institutions to vouch for identity of their members • Standardize on identity assertion language and attributes • Integrate existing institutional identity management systems, as Registration Authorities, into aggregate Certificate Authorities • Distribute revocations via Grid Trust Service (GTS); discussed later • (The sign-on flow is sketched after the GAARDS in Action slides below)
GAARDS in Action • The user authenticates to their local credential provider using their everyday user credentials, and receives a SAML assertion
GAARDS in Action • The application obtains grid credentials from Dorian using the SAML assertion provided by the local credential provider
GAARDS in Action • The application uses its grid credentials to invoke secure grid services
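Purely as an illustration of the three steps above, the sketch below walks the same sequence with invented stand-in types: local login yields a SAML assertion, a Dorian-like issuer turns it into a short-lived grid credential, and that credential is what secure service calls carry. None of these classes or methods are the real GAARDS/Dorian API.

```java
// Rough sketch of the GAARDS single sign-on sequence shown above.
// All types and methods are invented stand-ins for the real components.
public class SignOnSketch {

    record SamlAssertion(String subject, String issuer) {}
    record GridCredential(String distinguishedName, long expiresAtMillis) {}

    interface CredentialProvider { SamlAssertion authenticate(String user, char[] password); }
    interface DorianLike { GridCredential issueProxy(SamlAssertion assertion); }

    public static void main(String[] args) {
        // Step 1: authenticate with the institution's local credential provider.
        CredentialProvider localIdp = (user, pw) ->
                new SamlAssertion(user, "urn:example:cancer-center-x");
        // Step 2: a Dorian-like service exchanges the assertion for a short-lived
        // grid (proxy) credential.
        DorianLike dorian = a ->
                new GridCredential("CN=" + a.subject() + ",O=caBIG",
                                   System.currentTimeMillis() + 12 * 3600_000L);

        SamlAssertion assertion = localIdp.authenticate("jdoe", "secret".toCharArray());
        GridCredential proxy = dorian.issueProxy(assertion);

        // Step 3: the proxy credential is attached to secure grid service calls.
        System.out.println("Invoking grid services as " + proxy.distinguishedName());
    }
}
```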
Issue: Data Privacy • Lots of interesting data involves human subjects in some form • Numerous barriers to data and resource sharing in caBIG • Federal, state, and local law; regulations; institutional policies • Institutional Review Boards (IRB) involved for any protected health information (PHI); even for de-identified data • Grid is new technology; IRBs must give very detailed protocol approvals • Most regulations are about more than just "who"; "how" and "for what" also matter • Grid is multi-institutional, which means IRBs must reach agreements (read: separately employed lawyers working together) • Legal and policy requirements related to privacy and security drivers include: • HIPAA Privacy and Security Rules • The Common Rule for Human Subjects Research • FDA Regulations on Human Subjects • 21 CFR Part 11 • State and institutional requirements
Approach: Data Privacy • Though some aspects of the solution require technology (auditing, provenance, encryption/digital signing), the problem cannot be solved by technology alone • Data Sharing and Intellectual Capital Workspace (DSIC) • Identification of issues; development of guidelines; template agreements; education and training • Some caBIG (and external) tools exist for automated de-identification • Can leverage authorization solutions (GridGrouper for group-based; CSM for local policy; Globus PDPs for complex rules) • Open Issue: What technologies and policies (if any) can be universally adopted? • Open Issue: To date, security infrastructure development in caBIG has focused on services, not data • Lots of work to do…
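To give a sense of what automated de-identification means in the simplest case, the toy sketch below masks a few obvious identifier patterns with plain regular expressions. Real clinical de-identification tools (including the caBIG and external tools referenced above) are far more sophisticated and policy-driven; this example is illustrative only.

```java
import java.util.regex.Pattern;

// Toy de-identification pass: masks a few obvious identifier patterns.
// Real de-identification (e.g. HIPAA Safe Harbor) covers many identifier
// categories and typically uses dedicated, validated tooling.
public class DeidentifySketch {

    private static final Pattern SSN   = Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");
    private static final Pattern PHONE = Pattern.compile("\\b\\d{3}[-.]\\d{3}[-.]\\d{4}\\b");
    private static final Pattern MRN   = Pattern.compile("\\bMRN[:#]?\\s*\\d+\\b");

    static String scrub(String text) {
        String out = SSN.matcher(text).replaceAll("[SSN]");
        out = PHONE.matcher(out).replaceAll("[PHONE]");
        out = MRN.matcher(out).replaceAll("[MRN]");
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scrub("Pt MRN: 48213, contact 614-555-0100, SSN 123-45-6789"));
        // -> "Pt [MRN], contact [PHONE], SSN [SSN]"
    }
}
```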
Issue: Intellectual Capital • Social problem • “Publish or perish” • Justified hesitance to share pre-publication data • Justified reluctance to advance the cause of competitors (industrial and academic) • Can I rely on the data/results of some (potentially) unknown entity? • If cancer is cured, and caBIG resources play a role, there will be much interest in knowing who contributed what (and who funded them) • Proper attribution is not just ethical, it’s often required
Approach: Intellectual Capital • Technological • Provenance may or may not be enough (annotation vs enforcement) • Socio-Cultural • Whole workspace in caBIG dedicated to it (DSIC) • NCI in a good position to “encourage” it • Large percentage of institutions’ cancer research funding comes from NCI • Hope is motivation will be value-based once initially primed • Starting to see movement from “wait and see” to active engagement; industry involvement • Lots of work to do…
Issue: Complicated Trust Arrangements • When hundreds of organizations are sharing data and providing access to each other’s systems, defining a trust model is complicated, even for public data • For non-public data/systems, the simplest/safest policy is “deny all” • For many data sets and services, the owning organization may be virtual • Central authority is socially and technologically intractable • Rapid propagation of information on compromised systems/individuals is critical
Approach: Complicated Trust Arrangements • Grid Authentication and Authorization with Reliably Distributed Services (GAARDS) • Federated Trust Models (GTS) • Establish and manage trust relationships between institutions through adherence to mutually agreed upon policy • Promote global policy distribution, but allow arbitrary local overrides • Provide enterprise tools and services for the management and automated distribution of trust information
Grid Trust Service (GTS) Federation • A GTS can inherit Trusted Authorities and Trust Levels from other Grid Trust Services • Allows one to build a scalable Trust Fabric • Allows institutions to stand up their own GTS, inheriting all the trusted authorities in the wider grid, yet being able to add their own authorities that might not yet be trusted by the wider grid • A GTS can also be used to join the trust fabrics of two or more grids
GAARDS in Action • The application uses its grid credentials to invoke secure grid services
GAARDS in Action • Should I trust the credential signer? The grid service authenticates the user by asking the GTS whether or not the signer of the credential should be trusted
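At the mechanical level, "should I trust the credential signer?" reduces to validating the caller's certificate chain against the set of trusted authorities that the GTS distributes and keeps current. Below is a bare-bones sketch using the standard Java PKIX machinery; in caGrid the trust anchors and revocation data would come from the GTS rather than being hard-wired, and this is not the GTS API itself.

```java
import java.security.GeneralSecurityException;
import java.security.cert.*;
import java.util.List;
import java.util.Set;

// Bare-bones sketch: validate a caller's certificate chain against the set of
// trusted CA certificates. In caGrid, the trusted-authority set (and CRLs) is
// what the Grid Trust Service distributes and keeps current.
public class TrustCheckSketch {

    static boolean isTrusted(List<X509Certificate> callerChain,
                             Set<TrustAnchor> gtsTrustedAuthorities) {
        try {
            CertPath path = CertificateFactory.getInstance("X.509")
                                              .generateCertPath(callerChain);
            PKIXParameters params = new PKIXParameters(gtsTrustedAuthorities);
            params.setRevocationEnabled(false);   // CRLs from the GTS would be wired in here
            CertPathValidator.getInstance("PKIX").validate(path, params);
            return true;
        } catch (GeneralSecurityException e) {
            return false;                         // unknown or untrusted signer
        }
    }
}
```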
Issue: Computationally Expensive • Many studies on molecular data require expensive calculations on large data sets • Statistical analysis, hypothesis testing, searches • Researchers lack necessary computing resources
Approach: Computationally Expensive • Variety of well-known solutions exist in Grid and cluster space (a main driving force of their existence) • Challenge is in seamlessly integrating with the abstraction layer in use • i.e. operations on semantically annotated objects, not scheduled jobs on flat files • Leverage virtualization; domain-specific service interface over general computational resources • TeraGrid, Super Computer Centers • Open Issue: Balancing abstraction vs control (e.g. scheduling priorities, cost models, optimizations, etc.) • Open Issue: Appropriate level of control for service as resource broker • Open Issue: Complexity moved from client to service developer (working on tools to facilitate)
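The sketch below gestures at that layering: the caller sees a strongly typed, object-oriented analytical operation, while the service internally farms the work out to whatever compute back end it wraps. The domain types are invented and the thread pool stands in for a real cluster or TeraGrid scheduler; it is a sketch of the pattern, not a real caGrid analytical service.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.*;

// Sketch of hiding a computational back end behind a domain-specific,
// strongly typed analytical operation. The thread pool stands in for a real
// cluster/TeraGrid scheduler; the domain types are invented.
public class AnalyticalServiceSketch {

    record ExpressionProfile(String sampleId, double[] values) {}
    record ClusterResult(String sampleId, int clusterLabel) {}

    private final ExecutorService backEnd = Executors.newFixedThreadPool(4);

    // The grid-facing operation: callers see objects, not job scripts or flat files.
    public List<ClusterResult> clusterSamples(List<ExpressionProfile> profiles) throws Exception {
        List<Future<ClusterResult>> jobs = profiles.stream()
                .map(p -> backEnd.submit(() -> assign(p)))
                .toList();
        List<ClusterResult> results = new ArrayList<>();
        for (Future<ClusterResult> f : jobs) results.add(f.get());
        return results;
    }

    // Placeholder "analysis"; the real computation is what needs the cluster.
    private ClusterResult assign(ExpressionProfile p) {
        double mean = Arrays.stream(p.values()).average().orElse(0);
        return new ClusterResult(p.sampleId(), mean > 0.5 ? 1 : 0);
    }

    public static void main(String[] args) throws Exception {
        AnalyticalServiceSketch svc = new AnalyticalServiceSketch();
        List<ClusterResult> out = svc.clusterSamples(List.of(
                new ExpressionProfile("S1", new double[]{0.2, 0.9}),
                new ExpressionProfile("S2", new double[]{0.1, 0.3})));
        out.forEach(r -> System.out.println(r.sampleId() + " -> cluster " + r.clusterLabel()));
        svc.backEnd.shutdown();
    }
}
```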
Issue: Evolving Infrastructure • Standards in the Web/Grid service domain are turbulent at best • Competing interests of “big business” and multiple standards bodies • Major revisions of toolkits are generally not backwards compatible • Interface stability vs new features • Don’t want multiple grids • Upgrade or perish? Staying behind means lack of support • Application layer abstractions help developers, but don’t address “wire incompatibility”
Approach: Evolving Infrastructure • Most traditional solutions are in conflict with strongly-typed requirements or complicate service development (unless extensibility built into spec) • e.g. Lax processing; must ignore/must understand with schema overloading; multiple (protocol) service interfaces • Abstract specifications from developers with tooling • Focus on rigid “data format” specifications, allow more freedom on composition into messages • Open Issue: Doesn’t address wire incompatibility • Open Issue: No good solution • Do we need to just get it “good enough” and stabilize?