Discovery Net: A UK e-Science Pilot Project for Grid-based Knowledge Discovery Services
Patrick Wendel, Imperial College, London
Data Mining and Exploration Middleware for Distributed and Grid Computing, September 18-19, 2003
Why Discovery Net?
• Data Challenge:
  • Distributed, heterogeneous & large-scale data sets
  • Novel and real-time data sources
• Resource Challenge:
  • Novel specialised data analysis components/services continually being published/made available
  • Computational resources provided
• Information Challenge:
  • Data cleaning, normalisation & calibration
  • New data needs to be related to existing data
• Knowledge Challenge:
  • Collaborative, interactive & people-intensive
  • Result interpretation & validation in relation to existing knowledge
  • Knowledge sharing is key
What is Discovery Net
• Goal:
  • Construct an infrastructure for global-wide knowledge discovery services
• Key Technologies:
  • Grid and distributed computing
  • Workflow and service composition
  • Data mining & visualisation
  • Data access & information structuring
  • High-throughput screening devices: real-time
Discovery Net: Unifying the World's Knowledge
• Data Integration:
  • Dynamic real-time construction of "Data Grids"
• Application Integration:
  • Component- and service-based integration
• People Integration:
  • Global-wide discovery groupware
• Knowledge Integration:
  • Multi-subject and multi-modality integrative analysis to cross-validate and annotate related discovery work
What is Discovery Net
[Diagram: from scientific information (literature, databases, operational data, images, instrument data) to scientific discovery, via real-time integration, workflow construction, interactive visual analysis and dynamic application integration using distributed resources]
Discovery Net Layer Model (Life Science Application)
• D-Net Clients: end-user applications and user interfaces allowing scientists to construct and drive knowledge discovery activities
• Deployment: Web/Grid Services (OGSA)
• D-Net Middleware: provides execution logic for distributed knowledge discovery and access to distributed resources
• High-performance and Grid-enabled transfer protocols (GSI-FTP, DSTP, ...) over a Grid-enabled infrastructure (GSI)
• Computation & Data Resources: distributed databases, compute servers and scientific devices
Goal: Plug & Play
• Data sources, analysis components and knowledge discovery processes
• Several types of clients for different usage (from thin web client to participating client)
• Current implementation based on Java distributed objects (EJB), moving towards Web/Grid services
• But deployment and API access are already through standard Web/Grid services
• A Knowledge Grid based on D-Net servers
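The slides do not show the actual client API, so the following is a minimal, hypothetical Java sketch of what a transport-neutral view of a D-Net server could look like. All names (DNetServer, DiscoveryService, connect) are invented for illustration, and the in-memory stub merely stands in for the EJB or Web/Grid service binding.

```java
import java.net.URI;
import java.util.*;

/** Hypothetical client-side view of one analysis service published by a D-Net server. */
interface DiscoveryService {
    String name();
    URI execute(URI inputData) throws Exception;   // returns a handle to the result
}

/** Hypothetical server facade; the transport (EJB today, Web/Grid service later)
 *  would be hidden behind this interface. */
interface DNetServer {
    Collection<DiscoveryService> listServices();
    DiscoveryService lookup(String serviceName);
}

public class PlugAndPlayClient {
    public static void main(String[] args) throws Exception {
        DNetServer server = connect(URI.create("dnet://dnet.example.org"));
        DiscoveryService clustering = server.lookup("clustering");
        URI result = clustering.execute(URI.create("http://data.example.org/expression.csv"));
        System.out.println("Result handle: " + result);
    }

    /** Stand-in for binding selection (EJB vs. SOAP/OGSA); returns an in-memory stub here. */
    static DNetServer connect(URI serverUri) {
        Map<String, DiscoveryService> services = new HashMap<>();
        services.put("clustering", new DiscoveryService() {
            public String name() { return "clustering"; }
            public URI execute(URI inputData) { return URI.create(inputData + "#clustered"); }
        });
        return new DNetServer() {
            public Collection<DiscoveryService> listServices() { return services.values(); }
            public DiscoveryService lookup(String n) { return services.get(n); }
        };
    }
}
```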
Discovery Process Management
[Figure: D-Net workflow for genome annotation, 16 services executing across the Internet]
• Workflow-based service composition:
  • Data-flow approach fits the knowledge discovery process
  • Allows scientists to develop processes
• Towards a standard workflow representation for discovery informatics: Discovery Process Markup Language (DPML):
  • Contains component data-flow graphs, but also
  • Records collaboration information (user, changes)
  • Records execution constraints (location, parameterisation)
  • Becomes a key intellectual property: discovery processes can be stored, reused, audited, refined and deployed in various forms
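As a rough illustration of the kind of information DPML is said to capture (the data-flow graph plus collaboration and execution-constraint metadata), here is a hedged Java sketch that builds a two-node workflow and writes it out as DPML-like XML. The element names (workflow, node, edge, param), the component names and the accession value are guesses for illustration only, not the real DPML schema.

```java
import java.util.*;

/** Hypothetical in-memory workflow node: a component plus its parameters. */
class Node {
    final String id, component;
    final Map<String, String> params = new LinkedHashMap<>();
    String requiredLocation;                       // execution constraint, may be null
    Node(String id, String component) { this.id = id; this.component = component; }
}

public class DpmlSketch {
    public static void main(String[] args) {
        Node fetch = new Node("n1", "SequenceFetch");
        fetch.params.put("accession", "AC005089");
        Node blast = new Node("n2", "BlastSearch");
        blast.requiredLocation = "dnet://bio.example.org";   // must run near the database

        // Edges of the data-flow graph: output of n1 feeds the input of n2.
        String[][] edges = {{"n1", "n2"}};

        StringBuilder xml = new StringBuilder("<workflow author=\"pwendel\" version=\"1\">\n");
        for (Node n : List.of(fetch, blast)) {
            xml.append("  <node id=\"").append(n.id)
               .append("\" component=\"").append(n.component).append("\"");
            if (n.requiredLocation != null)
                xml.append(" location=\"").append(n.requiredLocation).append("\"");
            xml.append(">\n");
            n.params.forEach((k, v) -> xml.append("    <param name=\"").append(k)
                                          .append("\" value=\"").append(v).append("\"/>\n"));
            xml.append("  </node>\n");
        }
        for (String[] e : edges)
            xml.append("  <edge from=\"").append(e[0]).append("\" to=\"").append(e[1]).append("\"/>\n");
        xml.append("</workflow>");
        System.out.println(xml);
    }
}
```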
InfoGrid: Dynamic Data Integration
Dynamic data integration = on-demand access to heterogeneous data sources + information structuring
• Towards a dynamic information integration methodology:
  • Specialised information source access: InfoGrid allows users to register, locate and connect to various specialised information sources
  • On-the-fly integration: InfoGrid allows users to build their own integration structure on the fly (worst case: proprietary protocol/format; best case: JDBC / HTTP-XML-XPath / Web Service)
  • Easy maintenance: wrappers/drivers for new data sources can be added through a clean API
[Diagram: integrative analysis across clinical data (trials, patients, journals, project reports, patents), biological screening (activity, protocols, toxicology, metabolic pathways, journals), protein/targets (sequence, structure, location, function), chemistry (structures, libraries, catalogues, synthetic pathways) and gene data (sequence, expression, function)]
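A hedged sketch of what an InfoGrid-style wrapper API could look like, assuming a uniform query interface over heterogeneous sources. The InfoSource, JdbcInfoSource and InfoGridRegistry names are hypothetical; JDBC is used here because the slide names it as the "best case" access path.

```java
import java.sql.*;
import java.util.*;

/** Hypothetical InfoGrid-style wrapper interface: every information source,
 *  whatever its native protocol, is exposed through the same query method. */
interface InfoSource {
    List<Map<String, Object>> query(String expression) throws Exception;
}

/** "Best case" wrapper from the slide: the source already speaks JDBC. */
class JdbcInfoSource implements InfoSource {
    private final String url, user, password;
    JdbcInfoSource(String url, String user, String password) {
        this.url = url; this.user = user; this.password = password;
    }
    @Override
    public List<Map<String, Object>> query(String sql) throws SQLException {
        List<Map<String, Object>> rows = new ArrayList<>();
        try (Connection c = DriverManager.getConnection(url, user, password);
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery(sql)) {
            ResultSetMetaData md = rs.getMetaData();
            while (rs.next()) {
                Map<String, Object> row = new LinkedHashMap<>();
                for (int i = 1; i <= md.getColumnCount(); i++)
                    row.put(md.getColumnLabel(i), rs.getObject(i));
                rows.add(row);
            }
        }
        return rows;
    }
}

/** Hypothetical registry: sources are registered once, then located by name. */
class InfoGridRegistry {
    private final Map<String, InfoSource> sources = new HashMap<>();
    void register(String name, InfoSource source) { sources.put(name, source); }
    InfoSource locate(String name) { return sources.get(name); }
}
```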
Dynamic Application Integration Services
Dynamic application integration = on-demand access to and composition of remote analysis components
• Towards dynamic component integration:
  • Component service: allows users to register, locate and remotely execute components (Java component interface or Web Service port type)
  • Execution service: allows users to control the execution of components in distributed environments
  • Easy maintenance: new components can be added through a clean API
[Diagram: D-Net API fronting components such as clustering, classification, regression, gene function prediction, promoter prediction and homology search]
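A minimal sketch, assuming a hypothetical "clean API" for registering components. AnalysisComponent and ComponentService are invented names, and the toy normalisation component merely stands in for the clustering, regression and prediction components shown in the diagram.

```java
import java.util.*;
import java.util.function.Function;

/** Hypothetical analysis-component interface: one numeric table in, one table out. */
interface AnalysisComponent {
    String name();
    List<double[]> apply(List<double[]> rows);
}

/** Hypothetical component service: providers register components, clients locate
 *  and execute them; the execution service decides where they actually run. */
class ComponentService {
    private final Map<String, AnalysisComponent> registry = new HashMap<>();

    void register(AnalysisComponent c) { registry.put(c.name(), c); }

    List<double[]> execute(String name, List<double[]> rows) {
        AnalysisComponent c = registry.get(name);
        if (c == null) throw new NoSuchElementException("unknown component: " + name);
        return c.apply(rows);
    }

    /** Convenience: wrap a plain function as a component, the "clean API" case. */
    static AnalysisComponent of(String name, Function<List<double[]>, List<double[]>> f) {
        return new AnalysisComponent() {
            public String name() { return name; }
            public List<double[]> apply(List<double[]> rows) { return f.apply(rows); }
        };
    }

    public static void main(String[] args) {
        ComponentService svc = new ComponentService();
        // Toy "normalisation" component standing in for real analysis components.
        svc.register(of("normalise", rows -> {
            List<double[]> out = new ArrayList<>();
            for (double[] r : rows) {
                double max = Arrays.stream(r).max().orElse(1.0);
                double[] n = new double[r.length];
                for (int i = 0; i < r.length; i++) n[i] = r[i] / max;
                out.add(n);
            }
            return out;
        }));
        System.out.println(Arrays.toString(
            svc.execute("normalise", List.of(new double[]{1, 2, 4})).get(0)));
    }
}
```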
Discovery Deployment
Discovery deployment = on-demand rapid application construction and publishing
• Towards dynamic deployment of knowledge discovery procedures:
  • Deployment engine: allows users to build and publish applications based on DPML code coordinating remotely executed components, as a web page, Web/Grid service or command-line tool
  • Easy maintenance: new discovery procedures are described in DPML, a standardised representation of "composed" discovery procedures
  • Storage & reporting servers: allow users to share DPML procedures and to generate workflow audit reports
[Diagram: a discovery process in DPML deployed as a discovery component, a discovery service, a report or a batch process]
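To illustrate the "command-line tool" deployment option, here is a hedged sketch that wraps a stored DPML document in a small executable. WorkflowEngine and DeployedProcedure are hypothetical names and the engine is stubbed out, since the real deployment engine API is not shown in the slides.

```java
import java.nio.file.*;

/** Hypothetical engine interface: takes DPML text, returns a report/result handle. */
interface WorkflowEngine {
    String run(String dpml, String[] parameters) throws Exception;
}

/** "Command-line tool" deployment from the slide: a published discovery procedure
 *  becomes a small executable wrapper around a stored DPML document. */
public class DeployedProcedure {
    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: DeployedProcedure <workflow.dpml> [param...]");
            return;
        }
        String dpml = Files.readString(Path.of(args[0]));
        String[] params = java.util.Arrays.copyOfRange(args, 1, args.length);

        // Stub engine: a real deployment would submit the DPML to a D-Net server
        // (or expose the same call as a Web/Grid service operation instead).
        WorkflowEngine engine = (doc, p) ->
            "executed workflow of " + doc.length() + " chars with " + p.length + " parameter(s)";
        System.out.println(engine.run(dpml, params));
    }
}
```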
Dynamic Knowledge Interpretation
Dynamic knowledge interpretation = cross-reference and verify analysis results against background knowledge
• Towards a knowledge integration framework: multi-subject data analysis
• Specialised client interfaces: interactive analysis and dynamic component interaction
• Result annotation, structuring and storage: information source query, result browsing, sharing and markup
[Diagram, life science example application: knowledge integration & interpretation across genetic analysis, text mining, sequence analysis and pathway analysis]
Workflow execution
• Component execution location resolution:
  • User list of known resources
  • A component can explicitly require execution on a particular resource
  • A component can choose from a set of proposed resources (and could use Grid resource information systems and network weather information to determine where to go)
  • For unconstrained components, a simple "near the data" execution policy:
    • If there is a single input data location, execute there
    • Otherwise fall back to the original execution location
• Allows usual DPKD workflows to be designed
• Handles data management and transfer (serialisation, Java-based, FTP-based)
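A minimal Java sketch of the location-resolution policy just described; the types and method names are hypothetical stand-ins for the real scheduler structures.

```java
import java.util.*;

/** Sketch of the "near the data" location-resolution policy from the slide above. */
public class LocationResolver {

    static String resolve(String requiredResource,        // explicit requirement, or null
                          List<String> proposedResources, // choices offered to the component
                          List<String> inputDataLocations,
                          String originalLocation) {
        // 1. A component may explicitly require a particular resource.
        if (requiredResource != null) return requiredResource;

        // 2. It may choose among proposed resources; a real implementation could
        //    consult Grid information services and network weather here.
        if (!proposedResources.isEmpty()) return proposedResources.get(0);

        // 3. Unconstrained components follow the "near the data" policy:
        //    a single input data location means "execute there".
        Set<String> distinct = new HashSet<>(inputDataLocations);
        if (distinct.size() == 1) return distinct.iterator().next();

        // 4. Otherwise fall back to the workflow's original execution location.
        return originalLocation;
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, List.of(),
                List.of("dnet://bio.example.org"), "dnet://local"));   // near the data
        System.out.println(resolve(null, List.of(),
                List.of("dnet://a.example.org", "dnet://b.example.org"),
                "dnet://local"));                                      // fallback
    }
}
```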
Discovery Net and Grid technologies
• Cluster/campus Grid level:
  • Partial or complete workflow execution on Condor / SGE
  • Task farming on a subset of the workflow
• Global Grid:
  • GSI integration (Java CoG Kit)
  • GSI-FTP transfer functionality (Java CoG Kit)
  • OGSA Grid Service access to functionalities (GT3)
  • Potential use of GRIS or NWS in component implementations
• Globus scheduler? Unicore? SRB?
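Purely as an illustration of where the backend choice would sit, a hypothetical dispatch sketch; the real Condor, SGE and GT3 bindings (e.g. through the Java CoG Kit) are not shown here, so each branch just returns a descriptive string.

```java
/** Hypothetical dispatch point for the execution backends listed above. */
public class BackendDispatch {
    enum Backend { LOCAL, CONDOR, SGE, OGSA_GRID_SERVICE }

    /** Pretend submission: in reality each branch would call the corresponding
     *  submission API rather than build a string. */
    static String submit(Backend backend, String workflowFragment) {
        switch (backend) {
            case CONDOR:            return "submit to Condor: " + workflowFragment;
            case SGE:               return "submit to SGE: " + workflowFragment;
            case OGSA_GRID_SERVICE: return "invoke OGSA Grid Service: " + workflowFragment;
            default:                return "run locally: " + workflowFragment;
        }
    }

    public static void main(String[] args) {
        // Task farming on a subset of the workflow: fan the compute-heavy fragment
        // out to a cluster backend, keep the interactive part local.
        System.out.println(submit(Backend.CONDOR, "blast-subworkflow"));
        System.out.println(submit(Backend.LOCAL, "visualisation"));
    }
}
```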
Discovery Net Application Testbeds
• Life science testbed:
  • Gene sequencing, protein chips
  • High-throughput real-time genome annotation testbed: analyse and interpret new sequences using existing distributed bioinformatics tools and databases
• Environmental modelling:
  • Pollution sensors (GUSTO units with wireless connectivity): SO2, benzene, ...
  • High-throughput real-time pollution monitoring testbed: analyse and interpret time-resolved correlations among remote stations and with other environmental data sets
• Geo-hazard prediction:
  • Multi-spectral, multi-temporal satellite imagery
  • Real-time geo-hazard prediction testbed: analyse and interpret satellite images with other data sets to generate thematic knowledge
Case Study: SC2002 HPC Challenge
D-Net based global collaborative real-time genome annotation (15 databases, 21 applications)
[Diagram: genome annotation pipeline covering nucleotide-level annotation (RepeatMasker, genscan, grail, E-PCR and BLAST against EMBL, NCBI, TIGR and SNP databases for genes, non-translated RNAs, regulatory regions, repetitive elements, SNP variations and segmental duplications), protein-level annotation (BLAST, InterPro, PFAM, SMART, SWISS-PROT, 3D-PSSM, predator and DSC for protein families, domains, motifs, secondary structure and fold prediction) and process-level annotation (GO, AmiGO, GenNav, KEGG, GK and CSNDB for pathways, ontologies, cell cycle, metabolism and other biological processes)]
How It Works
• Interactive editor & visualisation
• Nucleotide annotation workflows: download a sequence from the reference server, execute the distributed annotation workflow (against EMBL, NCBI, TIGR, SNP, InterPro, KEGG, SMART, SWISS-PROT, GO), then save to a Distributed Annotation Server
• Replaces roughly 1800 clicks, 500 web accesses, 200 copy/paste operations and 3 weeks of work with 1 workflow and a few seconds of execution
Conclusion and Future Work
• Towards an open integration platform that enables scientists to conduct their knowledge discovery activities
• Several levels of integration required
• Enable use of available resources
• Evolution towards cost-model integration (performance, value, QoS)
• Semantic-based service retrieval and composition
• Other useful standards? (OGSA-DAI?)