Distributed Data Mining in Discovery Net
Dr. Moustafa Ghanem, Department of Computing, Imperial College London
Topics: What is Discovery Net; Distributed Data Mining for Compute Intensive Tasks; Distributed Data Mining for Sensor Grids; Knowledge Discovery from Naturally Distributed Data Sources
Outline
1. What is Discovery Net
2. Distributed Data Mining for Compute Intensive Tasks
3. Distributed Data Mining for Sensor Grids
4. Knowledge Discovery from Naturally Distributed Data Sources
5. What Do Scientists Really Want?
What is Discovery Net?
Funding: One of the eight UK national e-Science Pilot Projects funded by EPSRC (£2.2M)
Start: October 2001; End: March 2005
Goal: Construct the world's first infrastructure for global knowledge discovery services
Key Technologies:
• Open Service Computing
• High Throughput Devices and Real Time Data Mining
• Real Time Data Integration & Information Structuring
• Cross Domain Knowledge Discovery and Management
• Discovery Workflow and Discovery Planning
Discovery Net Applications
Life Sciences
• High throughput genomics and proteomics
• Distributed Databases and Applications
Environmental Modelling
• High throughput dispersed air sensing technology
• Sensor Grids
Real time geo-hazard modelling
• Earthquake modelling through satellite imagery
• High performance Distributed Computation
Discovery Net Architecture
Goal: Plug & Play Data Sources, Analysis Components & Knowledge Discovery Processes
• D-Net Clients: End-user applications and user interface allowing scientists to construct and drive knowledge discovery activities
• D-Net Middleware: Provides services and execution logic for distributed knowledge discovery and access to distributed resources and services
• Computation & Data Resources: Distributed databases, compute servers and scientific devices
Diagram labels: DPML; Web/Grid Services; OGSA; High Performance Communication Protocol (GridFTP, DSTP..); Grid Infrastructure (GSI)
Discovery Net Data Mining Components
• Generic Data Mining: Classification, Clustering, Associations, ..
• Unstructured-Data Mining: Text Mining, Image Mining
• Domain-specific Mining: Bioinformatics, Cheminformatics, ..
2. Distribution of Compute Intensive Tasks a. Distributed Data Mining for Geo-hazard Prediction
Grid-based Geo-hazard Data Mining
Stages: Data Warehousing & Modelling; Co-registration & geo-rectification; Image feature extraction; Clustering & classification
• Grid-based HPC Computation
• Workflow to Co-ordinate Grid Computation
• Automatically co-register a stack of imagery layers at high precision and speed.
• Grid-based Data Access and Integration
Normalised cross-correlation (NCC) template algorithm
Flowchart labels: Image "before" and Image "after" (Reading Data set for each); Setting search window; Setting comparing window; Significant correlation coefficient? (Y/N); Delta X; Delta X; Correlation coefficient
• Operates on a remotely accessed MPI UNIX parallel computer over a fast network through the DNet interface.
• Slow but highly accurate: 10 hours on 24 processors for one scene of Landsat-7 ETM+ Pan imagery data.
• The algorithm also runs on the GRID.
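The NCC template step can be illustrated with a minimal single-machine sketch. This is not the project's MPI implementation: the function names, window sizes, and the 0.7 significance threshold are assumptions chosen for illustration. A comparing window taken from the "before" image is slid over a search window in the "after" image, and the shift with the highest correlation coefficient is kept only if that coefficient is significant.

```python
import numpy as np


def ncc(a, b):
    """Normalised cross-correlation coefficient of two equally sized windows."""
    a = a.astype(float) - a.mean()
    b = b.astype(float) - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    # Flat windows have no texture to correlate; report zero correlation.
    return 0.0 if denom == 0 else float((a * b).sum() / denom)


def best_shift(before, after, row, col, half_win=3, search=4, threshold=0.7):
    """Return (dy, dx, score) for the best match, or None if not significant.

    half_win, search and threshold are illustrative values (assumptions).
    """
    ref = before[row - half_win: row + half_win + 1,
                 col - half_win: col + half_win + 1]
    best = (-1.0, 0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            r, c = row + dy, col + dx
            # Skip candidate windows that would start outside the image.
            if r - half_win < 0 or c - half_win < 0:
                continue
            cand = after[r - half_win: r + half_win + 1,
                         c - half_win: c + half_win + 1]
            # Skip windows truncated by the bottom or right image edge.
            if cand.shape != ref.shape:
                continue
            score = ncc(ref, cand)
            if score > best[0]:
                best = (score, dy, dx)
    score, dy, dx = best
    return (dy, dx, score) if score >= threshold else None
```

Each pixel position is processed independently, which is why the full-scene computation lends itself to being split across many processors.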
2. Distribution of Compute Intensive Tasks b. Distributed Clustering
3. Distributed Mining over Sensor Grid Data Distributed Spatial Data Mining for Air Pollution Modelling
The GUSTO Project - Update (Generic UV Sensors Technologies & Observations)
Sensor Specification
• High throughput open path spectrometer system
• Robust algorithm for pollutant concentration retrievals
• Measures SO2, NO, NO2, O3 & Benzene to ppb levels every few seconds
• Geared for networking of multiple GUSTO units within a GRID Infrastructure
• Can support Remote Sensing data for (contour) mapping of pollutants
www.gusto-systems.com
Networking of Multiple GUSTO Units
Diagram labels: GUSTO unit 1; GUSTO unit 2; GUSTO unit 3; GUSTO unit 4; Wireless connectivity; Monitoring and control software; Sensor registry & control service; SensorML; HTTP, SOAP, GSI; Data access service; Warehouse; Data upload service; Public access Web visualizer; Archived weather data; Archived health data; Visualisation and Data Mining; GRID Infrastructure
www.gusto-systems.com
4. Knowledge Discovery from Naturally Distributed Data Sources Distributed Data Mining in Life Sciences
Distributed Data Mining for Life Sciences
Data types: secondary structure; tertiary structure; polymorphism; patient records; epidemiology; expression patterns; physiology; sequences; alignments; receptors; signals; pathways; linkage maps; cytogenetic maps; physical maps
Example sequence data: ATGCAAGTCCCT AAGATTGCATAA GCTCGCTCAGTT
Information Integration
Given a collection of microarray-generated gene expression data, what kinds of questions do users wish to pose? Design an integration schema?
Gene Expression Warehouse (diagram labels): ExPASy SwissProt PDB ExPASy Enzyme OMIM Enzyme Disease Protein Affy Fragment LocusLink Known Gene MGD Sequence Pathway SNP Metabolite SPAD Sequence Cluster Genbank KEGG NCBI dbSNP NMR UniGene
From Data Integration to Knowledge Unification In Silico Experiment D-World I-World K-World
Life Science Application: SC2002 HPC Challenge
D-Net based Global Collaborative Real-Time Genome Annotation
15 DBs, 21 Applications
Nucleotide-level Annotation (diagram labels): Identify Organism Chromosomes Organism’s DNA Identify Genes tRNAs, rRNAs High Throughput Sequencers Gene markers Non-translated RNAs EMBL NCBI genscan blast Regulatory Regions Repetitive Elements TIGR SNP grail RepeatMasker Segmental Duplication SNP Variations Literature References ….. E-PCR genscan
Protein-level Annotation (diagram labels): Identify Proteins Classify into Protein Families InterPro InterPro blast 3D-PSSM Functional Characterisation Homologues SMART SWISS-PROT Domain 3-D Structure PFAM Motif Search Fold Prediction Secondary structure predator DSC Literature References …..
Process-level Annotation (diagram labels): Relate Pathway Maps Ontologies Cell Cycle Metabolism GO CSNDB Drugs Biological Process….. AmiGO GeneMaps KEGG GK Cell death Embryogenesis virtual chip GenNav Literature References …..
HPC Challenge SC2002: Nucleotide Annotation Workflows
Diagram labels: Interactive Editor & Visualisation; Download sequence from Reference Server; Save to Distributed Annotation Server; Execute distributed annotation workflow; Distributed data and computation; Real-time sequencing in London
Resources: KEGG, InterPro, SMART, SWISS-PROT, EMBL, NCBI, TIGR, SNP, GO
• 1800 clicks
• 500 Web accesses
• 200 copy/paste operations
• 3 weeks of work in 1 workflow and a few seconds of execution
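The idea of replacing many manual clicks and copy/paste operations with one reusable workflow can be sketched as a chain of steps applied to a record. This is only an illustration and not DPML or the Discovery Net API: every function name is hypothetical, the "download" step returns a fixed toy sequence instead of contacting a reference server, and the two annotation steps are trivial stand-ins for the remote tools and databases.

```python
from typing import Callable, Dict, List

# A workflow step takes a record and returns an enriched copy of it.
Step = Callable[[Dict], Dict]


def download_sequence(record: Dict) -> Dict:
    # Hypothetical stand-in for fetching from a reference server.
    return {**record, "sequence": "ATGCAAGTCCCT"}


def annotate_gc_content(record: Dict) -> Dict:
    # Toy annotation: fraction of G and C bases in the sequence.
    seq = record["sequence"]
    gc = sum(base in "GC" for base in seq) / len(seq)
    return {**record, "gc_content": gc}


def annotate_start_codons(record: Dict) -> Dict:
    # Toy annotation: positions of every ATG triplet.
    seq = record["sequence"]
    positions = [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]
    return {**record, "start_codons": positions}


def run_workflow(steps: List[Step], record: Dict) -> Dict:
    # Apply each step in order, passing the result to the next one.
    for step in steps:
        record = step(record)
    return record


ANNOTATION_WORKFLOW: List[Step] = [
    download_sequence,
    annotate_gc_content,
    annotate_start_codons,
]

result = run_workflow(ANNOTATION_WORKFLOW, {"id": "toy-sequence"})
```

Because the workflow is just a list of steps, it can be stored, shared, and re-run on a new sequence without repeating the manual work.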
Discovery Net in Action: China SARS Virtual Lab
Diagram labels: Homology search against viral genome DB Homology search against protein DB D-Net: Integration, interpretation, and discovery Annotation using Artemis and GenSense Annotation using Artemis and GenSense Genbank Predicted genes Gene prediction Homology search against motif DB Exon prediction Key word search Protein localization site prediction Relationship between SARS and other virus Splice site prediction Protein interaction prediction Multiple sequence alignment GeneSense Ontology Relationship between SARS virus and human receptors prediction Phylogenetic analysis Mutual regions identification Immunogenetics Classification and secondary structure prediction Microarray analysis SARS patients diagnosis Epidemiological analysis Bibliographic databases Bibliographic databases
5. What Do Scientists Really Want? Does it really work?
Workflow Deployment: Grid Service and Portal Native MPI OGSA-service Condor-G Web Service Web Wrapper Sun Grid Engine Unicore Oracle 10g Towards Compositional Grid Services Resource Mapping Service Browsing Workflow Execution A compositional GRID Workflow Authoring Composing services Workflow Warehousing Service Abstraction Workflow Management Collaborative Knowledge Management
Discovery Net Snapshot Real Time Data Integration Discovery Services Service Workflow Operational Data Literature Instrument Data Databases Integrative Knowledge Management Dynamic Application Integration Using Distributed Resources Images Scientific Information Scientific Discovery In Real Time