Distributed Data Mining in Discovery Net

Distributed Data Mining in Discovery Net Dr. Moustafa Ghanem Department of Computing Imperial College London

What is Discovery Net Distributed Data Mining for Compute Intensive Tasks Distributed Data Mining for Sensor Grids Knowledge Discovery from Naturally Distributed Data Sources What Do Scientists Really Want?

1. What is Discovery Net

Funding : One of the eight UK national e-science Pilot Projects funded by EPSRC (£2.2M) Start Oct 2001, End March 2005 Goal :Construct the World’s first Infrastructure for Global Knowledge Discovery Services Key Technologies: Open Service Computing High Throughput Devices and Real Time Data Mining Real Time Data Integration & Information Structuring Cross Domain Knowledge Discovery and Management Discovery Workflow and Discovery Planning What is Discovery Net?

Life Sciences High throughput genomics and proteomics Distributed Databases and Applications Environmental Modelling High throughput dispersed air sensing technology Sensor Grids Real time geo-hazard modelling Earthquake modelling through satellite imagery High performance Distributed Computation 10 9 8 7 6 5 4 3 2 1 M N A B C D E F G H I J K L Discovery Net Applications

Goal: Plug & Play Data Sources, Analysis Components & Knowledge Discovery Processes Discovery Net Architecture DPML Web/Grid Services OGSA D-Net Clients: End-user applications and user interface allowing scientists to construct and drive knowledge discovery activities D-Net Middleware: Provides services and execution logic for distributed knowledge discovery and access to distributed resources and services Computation & Data Resources: Distributed databases, compute servers and scientific devices. High Performance Communication Protocol (GridFTP, DSTP..) Grid Infrastructure (GSI)

Generic Data Mining Classification, Clustering, Associations, .. Unstructured-Data Mining Text Mining, Image Mining Domain-specific Mining Bioinformatics, Cheminformatics, .. Discovery Net Data Mining Components

2. Distribution of Compute Intensive Tasks a. Distributed Data Mining for Geo-hazard Prediction

Data Warehousing & Modelling Co-registration & geo-rectification Image features extraction Cluster & classification Grid-based Geo-hazard Data Mining • Grid-based HPC Computation • Workflow to Co-ordinate Grid Computation • Automatically co-register a stack of imagery layers at high precision and speed. • Grid-based Data Access and Integration

Image “before” Image “after” Reading Data set Reading Data set Setting search window Setting comparing window Setting comparing window Significant correlation coefficient N Y Delta X Delta X Correlation coefficient Normalised cross-correlation (NCC) template algorithm Operating on a remotely accessed MPI UNIX parallel computer through fast network with DNet interface. Slow but high accuracy: 24 processors 10 hours for one scene of Landsat-7 ETM+ Pan imagery data. The algorithm also run on GRID.

2. Distribution of Compute Intensive Tasks b. Distributed Clustering

Workflows for Distributed Data Clustering

3. Distributed Mining over Sensor Grid Data Distributed Spatial Data Mining for Air Pollution Modelling

Sensor Specification The GUSTOProject - Update(Generic UV Sensors Technologies & Observations) • High throughput open path spectrometer system • Robustalgorithm for pollutant concentration retrievals • Measures SO2, NO, NO2,O3 & Benzene to ppb levels every few seconds • Geared for networking of multiple GUSTO units within a GRID Infrastructure • Can support Remote Sensing data for (contour) mapping of pollutants www.gusto-systems.com

GUSTOunit1 Wireless connectivity Monitoring and control software Sensor registry & control service GUSTOunit2 SensorML HTTP, SOAP, GSI HTTP, SOAP, GSI Data access service Warehouse Data upload service Public access Web visualizer GUSTOunit3 Archived weather data Visualisation and Data Mining Archived health data GUSTOunit4 GRID Infrastructure Networking of Multiple GUSTO Units www.gusto-systems.com

Pollution analysis

4. Knowledge Discovery from Naturally Distributed Data Sources Distributed Data Mining in Life Sciences

secondary structure tertiary structure polymorphism patient records epidemiology expression patterns physiology sequences alignments receptors signals pathways ATGCAAGTCCCT AAGATTGCATAA GCTCGCTCAGTT linkage maps cytogenetic maps physical maps Distributed Data Mining for Life Sciences

Given a collection of microarray generated gene expression data, what kind of questions the users wish to pose. Design an integration schema? Information Integration Gene Expression Warehouse ExPASy SwissProt PDB ExPASy Enzyme OMIM Enzyme Disease Protein Affy Fragment LocusLink Known Gene MGD Sequence Pathway SNP Metabolite SPAD Sequence Cluster Genbank KEGG NCBI dbSNP NMR UniGene

From Data Integration to Knowledge Unification In Silico Experiment D-World I-World K-World

Identify Organism Chromosomes Organism’s DNA Identify Genes tRNAs, rRNAs High Throughput Sequencers Gene markers Non-translated RNAs EMBL NCBI genscan blast Regulatory Regions Repetitive Elements TIGR SNP grail Repeat Masker Segmental Duplication SNP Variations Nucleotide-level Annotation Literature References ….. E-PCR genscan Identify Proteins Classify into Protein Families Inter Pro Inter Pro blast 3D-PSSM Functional Characteisation Homologues SMART SWISS PROT Domain 3-D Structure PFAM Motif Search Protein-level Annotation Fold Prediction Secondary structure predator DSC Literature References ….. Relate Pathway Maps Ontologies Cell Cycle Metabolism GO CSNDB Process-level Annotation Drugs Biological Process….. AmiGO GeneMaps KEGG GK Cell death Embryogenesis virtual chip GenNav Literature References ….. 15 DBs 21 Applications Life Science Application: SC2002 HPC Challenge D-Net based Global Collaborative Real- Time Genome Annotation Genome Annotation

Interactive Editor & Visualisation Download sequence from Reference Server Save to Distributed Annotation Server KEGG Inter Pro SMART Execute distributed annotation workflow SWISS PROT EMBL NCBI TIGR SNP GO Distributed data and computation HPC Challenge SC2002 Nucleotide Annotation Workflows Real-time sequencing in London • 1800 clicks • 500 Web access • 200 copy/paste • 3 weeks work in 1 workflow and few second execution

Homology search against viral genome DB Homology search against protein DB D-Net: Integration, interpretation, and discovery Annotation using Artemis and GenSense Annotation using Artemis and GenSense Genbank Predicted genes Gene prediction Homology search against motif DB Exon prediction Key word search Protein localization site prediction Relationship between SARS and other virus Splice site prediction Protein interaction prediction Multiple sequence alignment GeneSense Ontology Relationship between SARS virus and human receptors prediction Phylogenetic analysis Mutual regions identification Immunogenetics Classification and secondary structure prediction Microarray analysis SARS patients diagnosis Epidemiological analysis Bibliographic databases Bibliographic databases Discovery Net in Action:China SARS Virtual Lab

Discovery Net in Action: SARS Virus Mutation Analysis

5. What do Scientist Really Want? Does it really work?

Workflow Deployment: Grid Service and Portal Native MPI OGSA-service Condor-G Web Service Web Wrapper Sun Grid Engine Unicore Oralce 10g Towards Compositional Grid Services Resource Mapping Service Browsing Workflow Execution A compositional GRID Workflow Authoring Composing services Workflow Warehousing Service Abstraction Workflow Management Collaborative Knowledge Management

Discovery Net Service Composition

Full Workflow

Executing Protein Annotation Workflow

Deployment of Node

Deploying Protein Annotation Workflow

Executing Deployed Service

Locating & Executing Deployed Service from Discovery Net

Workflow Provenance

Workflow Warehousing

Discovery Net Snapshot Real Time Data Integration Discovery Services Service Workflow Operational Data Literature Instrument Data Databases Integrative Knowledge Management Dynamic Application Integration Using Distributed Resources Images Scientific Information Scientific Discovery In Real Time

Distributed Data Mining in Discovery Net