Discovery Net: A UK e-Science Pilot Project for Grid-based Knowledge Discovery Services
Patrick Wendel, Imperial College, London
Data Mining and Exploration Middleware for Distributed and Grid Computing, September 18-19, 2003
Why Discovery Net?
• Data Challenge:
  • Distributed, heterogeneous & large-scale data sets
  • Novel and real-time data sources
• Resource Challenge:
  • Novel specialised data analysis components/services continually being published/made available
  • Computational resources provided
• Information Challenge:
  • Data cleaning, normalisation & calibration
  • New data needs to be related to existing data
• Knowledge Challenge:
  • Collaborative, interactive & people-intensive
  • Result interpretation & validation in relation to existing knowledge
  • Knowledge sharing is key
What is Discovery Net
• Goal:
  • Construct an infrastructure for global-wide knowledge discovery services
• Key Technologies:
  • Grid and distributed computing
  • Workflow and service composition
  • Data mining & visualisation
  • Data access & information structuring
  • High-throughput screening devices: real-time
Discovery Net: Unifying the World's Knowledge
• Data Integration:
  • Dynamic real-time construction of "Data Grids"
• Application Integration:
  • Component- and service-based integration
• People Integration:
  • Global-wide discovery groupware
• Knowledge Integration:
  • Multi-subject and multi-modality integrative analysis to cross-validate and annotate related discovery work
What is Discovery Net
[Diagram: from scientific information (literature, databases, operational data, images, instrument data) to scientific discovery, via real-time integration, workflow construction, interactive visual analysis and dynamic application integration using distributed resources]
Discovery Net Layer Model (Life Science Application)
• D-Net Clients: end-user applications and user interfaces allowing scientists to construct and drive knowledge discovery activities
• Deployment: Web/Grid Services (OGSA)
• D-Net Middleware: provides execution logic for distributed knowledge discovery and access to distributed resources
• High-performance and Grid-enabled transfer protocols (GSI-FTP, DSTP, ...) over a Grid-enabled infrastructure (GSI)
• Computation & Data Resources: distributed databases, compute servers and scientific devices
Goal: Plug & Play
• Data sources, analysis components and knowledge discovery processes
• Several types of clients for different usage (from thin web client to participating client)
• Current implementation based on Java distributed objects (EJB), moving towards Web/Grid services
• But deployment and API access are already through standard Web/Grid services
• A Knowledge Grid based on D-Net servers
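The slides do not show the actual client API, so the following is a minimal, hypothetical Java sketch of what a transport-neutral view of a D-Net server could look like. All names (DNetServer, DiscoveryService, connect) are invented for illustration, and the in-memory stub merely stands in for the EJB or Web/Grid service binding.

```java
import java.net.URI;
import java.util.*;

/** Hypothetical client-side view of one analysis service published by a D-Net server. */
interface DiscoveryService {
    String name();
    URI execute(URI inputData) throws Exception;   // returns a handle to the result
}

/** Hypothetical server facade; the transport (EJB today, Web/Grid service later)
 *  would be hidden behind this interface. */
interface DNetServer {
    Collection<DiscoveryService> listServices();
    DiscoveryService lookup(String serviceName);
}

public class PlugAndPlayClient {
    public static void main(String[] args) throws Exception {
        DNetServer server = connect(URI.create("dnet://dnet.example.org"));
        DiscoveryService clustering = server.lookup("clustering");
        URI result = clustering.execute(URI.create("http://data.example.org/expression.csv"));
        System.out.println("Result handle: " + result);
    }

    /** Stand-in for binding selection (EJB vs. SOAP/OGSA); returns an in-memory stub here. */
    static DNetServer connect(URI serverUri) {
        Map<String, DiscoveryService> services = new HashMap<>();
        services.put("clustering", new DiscoveryService() {
            public String name() { return "clustering"; }
            public URI execute(URI inputData) { return URI.create(inputData + "#clustered"); }
        });
        return new DNetServer() {
            public Collection<DiscoveryService> listServices() { return services.values(); }
            public DiscoveryService lookup(String n) { return services.get(n); }
        };
    }
}
```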
Discovery Process Management
[Figure: D-Net workflow for genome annotation, 16 services executing across the Internet]
• Workflow-based service composition:
  • Data-flow approach fits the knowledge discovery process
  • Allows scientists to develop processes
• Towards a standard workflow representation for discovery informatics: Discovery Process Markup Language (DPML):
  • Contains component data-flow graphs, but also
  • Records collaboration information (user, changes)
  • Records execution constraints (location, parameterisation)
  • Becomes a key intellectual property: discovery processes can be stored, reused, audited, refined and deployed in various forms
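As a rough illustration of the kind of information DPML is said to capture (the data-flow graph plus collaboration and execution-constraint metadata), here is a hedged Java sketch that builds a two-node workflow and writes it out as DPML-like XML. The element names (workflow, node, edge, param), the component names and the accession value are guesses for illustration only, not the real DPML schema.

```java
import java.util.*;

/** Hypothetical in-memory workflow node: a component plus its parameters. */
class Node {
    final String id, component;
    final Map<String, String> params = new LinkedHashMap<>();
    String requiredLocation;                       // execution constraint, may be null
    Node(String id, String component) { this.id = id; this.component = component; }
}

public class DpmlSketch {
    public static void main(String[] args) {
        Node fetch = new Node("n1", "SequenceFetch");
        fetch.params.put("accession", "AC005089");
        Node blast = new Node("n2", "BlastSearch");
        blast.requiredLocation = "dnet://bio.example.org";   // must run near the database

        // Edges of the data-flow graph: output of n1 feeds the input of n2.
        String[][] edges = {{"n1", "n2"}};

        StringBuilder xml = new StringBuilder("<workflow author=\"pwendel\" version=\"1\">\n");
        for (Node n : List.of(fetch, blast)) {
            xml.append("  <node id=\"").append(n.id)
               .append("\" component=\"").append(n.component).append("\"");
            if (n.requiredLocation != null)
                xml.append(" location=\"").append(n.requiredLocation).append("\"");
            xml.append(">\n");
            n.params.forEach((k, v) -> xml.append("    <param name=\"").append(k)
                                          .append("\" value=\"").append(v).append("\"/>\n"));
            xml.append("  </node>\n");
        }
        for (String[] e : edges)
            xml.append("  <edge from=\"").append(e[0]).append("\" to=\"").append(e[1]).append("\"/>\n");
        xml.append("</workflow>");
        System.out.println(xml);
    }
}
```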
InfoGrid: Dynamic Data Integration
Dynamic data integration = on-demand access to heterogeneous data sources + information structuring
• Towards a dynamic information integration methodology:
  • Specialised information source access: InfoGrid allows users to register, locate and connect to various specialised information sources
  • On-the-fly integration: InfoGrid allows users to build their own integration structure on the fly (worst case: proprietary protocol/format; best case: JDBC / HTTP-XML-XPath / Web Service)
  • Easy maintenance: wrappers/drivers for new data sources can be added through a clean API
[Diagram: integrative analysis across clinical data (trials, patients, journals, project reports, patents), biological screening (activity, protocols, toxicology, metabolic pathways, journals), protein/targets (sequence, structure, location, function), chemistry (structures, libraries, catalogues, synthetic pathways) and gene data (sequence, expression, function)]
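A hedged sketch of what an InfoGrid-style wrapper API could look like, assuming a uniform query interface over heterogeneous sources. The InfoSource, JdbcInfoSource and InfoGridRegistry names are hypothetical; JDBC is used here because the slide names it as the "best case" access path.

```java
import java.sql.*;
import java.util.*;

/** Hypothetical InfoGrid-style wrapper interface: every information source,
 *  whatever its native protocol, is exposed through the same query method. */
interface InfoSource {
    List<Map<String, Object>> query(String expression) throws Exception;
}

/** "Best case" wrapper from the slide: the source already speaks JDBC. */
class JdbcInfoSource implements InfoSource {
    private final String url, user, password;
    JdbcInfoSource(String url, String user, String password) {
        this.url = url; this.user = user; this.password = password;
    }
    @Override
    public List<Map<String, Object>> query(String sql) throws SQLException {
        List<Map<String, Object>> rows = new ArrayList<>();
        try (Connection c = DriverManager.getConnection(url, user, password);
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery(sql)) {
            ResultSetMetaData md = rs.getMetaData();
            while (rs.next()) {
                Map<String, Object> row = new LinkedHashMap<>();
                for (int i = 1; i <= md.getColumnCount(); i++)
                    row.put(md.getColumnLabel(i), rs.getObject(i));
                rows.add(row);
            }
        }
        return rows;
    }
}

/** Hypothetical registry: sources are registered once, then located by name. */
class InfoGridRegistry {
    private final Map<String, InfoSource> sources = new HashMap<>();
    void register(String name, InfoSource source) { sources.put(name, source); }
    InfoSource locate(String name) { return sources.get(name); }
}
```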
Dynamic Application Integration Services
Dynamic application integration = on-demand access to and composition of remote analysis components
• Towards dynamic component integration:
  • Component service: allows users to register, locate and remotely execute components (Java component interface or Web Service port type)
  • Execution service: allows users to control the execution of components in distributed environments
  • Easy maintenance: new components can be added through a clean API
[Diagram: D-Net API fronting components such as clustering, classification, regression, gene function prediction, promoter prediction and homology search]
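A minimal sketch, assuming a hypothetical "clean API" for registering components. AnalysisComponent and ComponentService are invented names, and the toy normalisation component merely stands in for the clustering, regression and prediction components shown in the diagram.

```java
import java.util.*;
import java.util.function.Function;

/** Hypothetical analysis-component interface: one numeric table in, one table out. */
interface AnalysisComponent {
    String name();
    List<double[]> apply(List<double[]> rows);
}

/** Hypothetical component service: providers register components, clients locate
 *  and execute them; the execution service decides where they actually run. */
class ComponentService {
    private final Map<String, AnalysisComponent> registry = new HashMap<>();

    void register(AnalysisComponent c) { registry.put(c.name(), c); }

    List<double[]> execute(String name, List<double[]> rows) {
        AnalysisComponent c = registry.get(name);
        if (c == null) throw new NoSuchElementException("unknown component: " + name);
        return c.apply(rows);
    }

    /** Convenience: wrap a plain function as a component, the "clean API" case. */
    static AnalysisComponent of(String name, Function<List<double[]>, List<double[]>> f) {
        return new AnalysisComponent() {
            public String name() { return name; }
            public List<double[]> apply(List<double[]> rows) { return f.apply(rows); }
        };
    }

    public static void main(String[] args) {
        ComponentService svc = new ComponentService();
        // Toy "normalisation" component standing in for real analysis components.
        svc.register(of("normalise", rows -> {
            List<double[]> out = new ArrayList<>();
            for (double[] r : rows) {
                double max = Arrays.stream(r).max().orElse(1.0);
                double[] n = new double[r.length];
                for (int i = 0; i < r.length; i++) n[i] = r[i] / max;
                out.add(n);
            }
            return out;
        }));
        System.out.println(Arrays.toString(
            svc.execute("normalise", List.of(new double[]{1, 2, 4})).get(0)));
    }
}
```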
Discovery Deployment
Discovery deployment = on-demand rapid application construction and publishing
• Towards dynamic deployment of knowledge discovery procedures:
  • Deployment engine: allows users to build and publish applications based on DPML code coordinating remotely executed components, as a web page, Web/Grid service or command-line tool
  • Easy maintenance: new discovery procedures are described in DPML, a standardised representation of "composed" discovery procedures
  • Storage & reporting servers: allow users to share DPML procedures and to generate workflow audit reports
[Diagram: a discovery process in DPML deployed as a discovery component, a discovery service, a report or a batch process]
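To illustrate the "command-line tool" deployment option, here is a hedged sketch that wraps a stored DPML document in a small executable. WorkflowEngine and DeployedProcedure are hypothetical names and the engine is stubbed out, since the real deployment engine API is not shown in the slides.

```java
import java.nio.file.*;

/** Hypothetical engine interface: takes DPML text, returns a report/result handle. */
interface WorkflowEngine {
    String run(String dpml, String[] parameters) throws Exception;
}

/** "Command-line tool" deployment from the slide: a published discovery procedure
 *  becomes a small executable wrapper around a stored DPML document. */
public class DeployedProcedure {
    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: DeployedProcedure <workflow.dpml> [param...]");
            return;
        }
        String dpml = Files.readString(Path.of(args[0]));
        String[] params = java.util.Arrays.copyOfRange(args, 1, args.length);

        // Stub engine: a real deployment would submit the DPML to a D-Net server
        // (or expose the same call as a Web/Grid service operation instead).
        WorkflowEngine engine = (doc, p) ->
            "executed workflow of " + doc.length() + " chars with " + p.length + " parameter(s)";
        System.out.println(engine.run(dpml, params));
    }
}
```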
Dynamic Knowledge Interpretation
Dynamic knowledge interpretation = cross-reference and verify analysis results against background knowledge
• Towards a knowledge integration framework: multi-subject data analysis
• Specialised client interfaces: interactive analysis and dynamic component interaction
• Result annotation, structuring and storage: information source query, result browsing, sharing and markup
[Diagram, life science example application: knowledge integration & interpretation across genetic analysis, text mining, sequence analysis and pathway analysis]
Workflow execution
• Component execution location resolution:
  • User list of known resources
  • A component can explicitly require execution on a particular resource
  • A component can choose from a set of proposed resources (and could use Grid resource information systems and network weather information to determine where to go)
  • For unconstrained components, a simple "near the data" execution policy:
    • If there is a single input data location, execute there
    • Otherwise fall back to the original execution location
• Allows usual DPKD workflows to be designed
• Handles data management and transfer (serialisation, Java-based, FTP-based)
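A minimal Java sketch of the location-resolution policy just described; the types and method names are hypothetical stand-ins for the real scheduler structures.

```java
import java.util.*;

/** Sketch of the "near the data" location-resolution policy from the slide above. */
public class LocationResolver {

    static String resolve(String requiredResource,        // explicit requirement, or null
                          List<String> proposedResources, // choices offered to the component
                          List<String> inputDataLocations,
                          String originalLocation) {
        // 1. A component may explicitly require a particular resource.
        if (requiredResource != null) return requiredResource;

        // 2. It may choose among proposed resources; a real implementation could
        //    consult Grid information services and network weather here.
        if (!proposedResources.isEmpty()) return proposedResources.get(0);

        // 3. Unconstrained components follow the "near the data" policy:
        //    a single input data location means "execute there".
        Set<String> distinct = new HashSet<>(inputDataLocations);
        if (distinct.size() == 1) return distinct.iterator().next();

        // 4. Otherwise fall back to the workflow's original execution location.
        return originalLocation;
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, List.of(),
                List.of("dnet://bio.example.org"), "dnet://local"));   // near the data
        System.out.println(resolve(null, List.of(),
                List.of("dnet://a.example.org", "dnet://b.example.org"),
                "dnet://local"));                                      // fallback
    }
}
```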
Discovery Net and Grid technologies
• Cluster/campus Grid level:
  • Partial or complete workflow execution on Condor / SGE
  • Task farming on a subset of the workflow
• Global Grid:
  • GSI integration (Java CoG Kit)
  • GSI-FTP transfer functionality (Java CoG Kit)
  • OGSA Grid Service access to functionalities (GT3)
  • Potential use of GRIS or NWS in component implementations
• Globus scheduler? Unicore? SRB?
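Purely as an illustration of where the backend choice would sit, a hypothetical dispatch sketch; the real Condor, SGE and GT3 bindings (e.g. through the Java CoG Kit) are not shown here, so each branch just returns a descriptive string.

```java
/** Hypothetical dispatch point for the execution backends listed above. */
public class BackendDispatch {
    enum Backend { LOCAL, CONDOR, SGE, OGSA_GRID_SERVICE }

    /** Pretend submission: in reality each branch would call the corresponding
     *  submission API rather than build a string. */
    static String submit(Backend backend, String workflowFragment) {
        switch (backend) {
            case CONDOR:            return "submit to Condor: " + workflowFragment;
            case SGE:               return "submit to SGE: " + workflowFragment;
            case OGSA_GRID_SERVICE: return "invoke OGSA Grid Service: " + workflowFragment;
            default:                return "run locally: " + workflowFragment;
        }
    }

    public static void main(String[] args) {
        // Task farming on a subset of the workflow: fan the compute-heavy fragment
        // out to a cluster backend, keep the interactive part local.
        System.out.println(submit(Backend.CONDOR, "blast-subworkflow"));
        System.out.println(submit(Backend.LOCAL, "visualisation"));
    }
}
```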
Discovery Net Application Testbeds
• Life science testbed:
  • Gene sequencing, protein chips
  • High-throughput real-time genome annotation testbed: analyse and interpret new sequences using existing distributed bioinformatics tools and databases
• Environmental modelling:
  • Pollution sensors (GUSTO units with wireless connectivity): SO2, benzene, ...
  • High-throughput real-time pollution monitoring testbed: analyse and interpret time-resolved correlations among remote stations and with other environmental data sets
• Geo-hazard prediction:
  • Multi-spectral, multi-temporal satellite imagery
  • Real-time geo-hazard prediction testbed: analyse and interpret satellite images with other data sets to generate thematic knowledge
Case Study: SC2002 HPC Challenge
D-Net based global collaborative real-time genome annotation (15 databases, 21 applications)
[Diagram: genome annotation pipeline covering nucleotide-level annotation (RepeatMasker, genscan, grail, E-PCR and BLAST against EMBL, NCBI, TIGR and SNP databases for genes, non-translated RNAs, regulatory regions, repetitive elements, SNP variations and segmental duplications), protein-level annotation (BLAST, InterPro, PFAM, SMART, SWISS-PROT, 3D-PSSM, predator and DSC for protein families, domains, motifs, secondary structure and fold prediction) and process-level annotation (GO, AmiGO, GenNav, KEGG, GK and CSNDB for pathways, ontologies, cell cycle, metabolism and other biological processes)]
How It Works
• Interactive editor & visualisation
• Nucleotide annotation workflows: download a sequence from the reference server, execute the distributed annotation workflow (against EMBL, NCBI, TIGR, SNP, InterPro, KEGG, SMART, SWISS-PROT, GO), then save to a Distributed Annotation Server
• Replaces roughly 1800 clicks, 500 web accesses, 200 copy/paste operations and 3 weeks of work with 1 workflow and a few seconds of execution
Conclusion and Future Work
• Towards an open integration platform that enables scientists to conduct their knowledge discovery activities
• Several levels of integration required
• Enable use of available resources
• Evolution towards cost-model integration (performance, value, QoS)
• Semantic-based service retrieval and composition
• Other useful standards? (OGSA-DAI?)