Knowledge Discovery in Grid Datasets – Goals, Design Concepts and the Architecture
Peter Brezany, University of Vienna
Collecting Data
• Satellites
• Laboratories (microscopes, MRI/CT scanners, ...)
• Business
• Experiments (high energy physics, ...)
• Computer simulations
• Data repositories
• Analysis
Motivation
• Computational Grid: a new-generation infrastructure
• Challenge: advanced analysis of data managed by the Grid
• Typical data in modern Grid applications: files, file collections, relational and XML DBs, virtual data, data objects
• The data is often large and geographically distributed, and its complexity is increasing; some applications require special security precautions
• Our research aims:
  • Phase 1: Knowledge discovery Grid system (GridMiner)
  • Phase 2: Intelligent Grid system (WisdomGrid)
Outline
• Motivation
• Background and Related Work
• Basic Concepts and GridMiner Architecture
• Grid Data Integration System
• Data Mining Layer
• Implementation Issues and Experiments
• Future Research
• Conclusions
Background and Related Work
• Basic Grid development (Globus 1): metacomputing
• Data Grid (Globus 2, DataGrid of CERN, etc.)
• Semantic Grid (myGrid)
• Open Grid Services Architecture (Globus 3, OGSA-DAIS)
• Parallel and Distributed Data Mining and Data Warehousing
• Knowledge Grid (GridMiner and work of others)
• Web Intelligence
GridMiner Requirements
• Open architecture
• Data distribution, complexity, heterogeneity, and large data size
• Applying different kinds of analysis strategies
• Compatibility with existing Grid infrastructure
• Openness to tools and algorithms
• Scalability
• Grid, network, and location transparency
• Security and data privacy
• OLAP support
GridMiner (Layered) Abstract Architecture
• Layers: User Interface; Knowledge Grid; Information Grid; Computational & Data Grid
• Diagram labels: Data to Knowledge; Control
• Built on K.G. Jeffery's proposal
GridMiner Conceptual Architecture
[Figure: conceptual architecture diagram; recoverable label: Job Control]
Service Architecture Based on OGSA-DAIS
Data Distribution Scenarios
• Single data source
• Federated data sources with different types of partitioning
Example: vertical and horizontal distribution of the virtual data source
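The two partitioning types can be illustrated with a minimal Python sketch. It is not the GridMiner mediation service: the site names, the rows and the `pat_id` key are hypothetical, and the functions only show how horizontally partitioned rows are unioned and vertically partitioned columns are joined into one virtual data source.

```python
from typing import Dict, List

Row = Dict[str, object]


def union_horizontal(partitions: List[List[Row]]) -> List[Row]:
    """Horizontal partitioning: each site holds different rows with the same columns."""
    merged: List[Row] = []
    for part in partitions:
        merged.extend(part)
    return merged


def join_vertical(partitions: List[List[Row]], key: str) -> List[Row]:
    """Vertical partitioning: each site holds different columns for the same keys."""
    by_key: Dict[object, Row] = {}
    for part in partitions:
        for row in part:
            by_key.setdefault(row[key], {}).update(row)
    return [by_key[k] for k in sorted(by_key)]


# Hypothetical sites: A and B split the patients by row, C holds an extra column.
site_a = [{"pat_id": 1, "age": 34}, {"pat_id": 2, "age": 61}]
site_b = [{"pat_id": 3, "age": 47}]
site_c = [
    {"pat_id": 1, "outcome": "good"},
    {"pat_id": 2, "outcome": "poor"},
    {"pat_id": 3, "outcome": "good"},
]

# The virtual data source seen by the mining services.
virtual = join_vertical([union_horizontal([site_a, site_b]), site_c], key="pat_id")
```

The join materializes everything in memory, which is acceptable for an illustration only; a real mediator would stream results from the underlying sources.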
Components of the Data Mining Layer
• GridMiner Service Factory
• GridMiner Service Registry
• GridMiner Data Mining Service
• GridMiner Preprocessing Service
• GridMiner Presentation Service
• GridMiner Orchestration Service
Implementation Prototype
• Implementation of the Mediation Service for horizontal data partitioning
• Implementation of Data Mining Services for decision tree construction as OGSA-compliant Grid services, based on the Globus Toolkit 3 release
• We use:
  • Weka, a freely available Java-based data mining system (data preprocessing and data mining tasks; main-memory oriented)
  • a home-grown Java implementation of the SPRINT algorithm (disk-oriented)
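SPRINT chooses split points for numeric attributes by scanning a sorted attribute list and evaluating the gini index at each candidate threshold. The following Python sketch shows that evaluation step only, in memory; it is not the prototype's Java code, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def gini(counts: Counter, total: int) -> float:
    """Gini impurity of a class histogram."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_numeric_split(
    values: Sequence[float], labels: Sequence[str]
) -> Tuple[Optional[float], float]:
    """Return (threshold, weighted gini) of the best binary split value <= threshold."""
    pairs: List[Tuple[float, str]] = sorted(zip(values, labels))
    n = len(pairs)
    below: Counter = Counter()                         # class histogram left of the cursor
    above: Counter = Counter(label for _, label in pairs)  # class histogram right of it
    best_threshold, best_score = None, float("inf")
    for i in range(n - 1):
        value, label = pairs[i]
        below[label] += 1
        above[label] -= 1
        next_value = pairs[i + 1][0]
        if value == next_value:
            continue  # no threshold can separate equal values
        n_below = i + 1
        score = (n_below / n) * gini(below, n_below) + (
            (n - n_below) / n
        ) * gini(above, n - n_below)
        if score < best_score:
            best_threshold, best_score = (value + next_value) / 2, score
    return best_threshold, best_score
```

For example, `best_numeric_split([1, 2, 3, 4], ["a", "a", "b", "b"])` separates the two classes cleanly at the midpoint between 2 and 3.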
Experimental Environment
• Test data suites:
  • synthetic data (generated by an extended version of the IBM Quest Synthetic Data Generation Code)
  • TBI (Traumatic Brain Injury) databases
• Grid testbed: Vienna, CERN, Dublin, Zagreb, Cracow
• Goals in the first phases:
  • Verifying model accuracy
  • Overhead of the service layers
Example: Mining Patterns for Data Classification and Associations
use database dat1, dat2
mine classifications
analyze patient_outcome
using g_parsimony
display as tree

use database DBs attributes
mine associations
using method_attributes
display as rules
WG Architecture
[Figure: Wisdom Grid architecture diagram; recoverable labels: Domain Knowledge Agents, Knowledge Explorer Agent, Wisdom Grid, Agent Platform, Agent Grid Service, KB, External Knowledge Base, Knowledge Base Service, External Services, Knowledge Discovery Service, Grid, End User (personal) Agent]
Work-Flow
[Figure: workflow diagram; recoverable labels: External Agents, End User Agent, Knowledge Agent, Knowledge Explorer Agent, Knowledge Base service, Knowledge discovery service, Agent Service, Services ..., Knowledge Base]
Knowledge Discovery Service
• Client for other services
• Knowledge Discovery in Databases
• GridMiner
• data mining
• on-line analytical processing (OLAP)
• Web Mining
• semantic web
• Online libraries
• Web/Grid Services
• Knowledge Explorer Agent
Knowledge Base Service / KB
• KBS: search, query, and expand the Knowledge Base
• KB: database that stores particular data about real objects, the relations between these objects, and their properties
• Consists of ontologies and instances
• Information about resources (location, query language)
• on the Web
• web/grid services, agents
• references to the online database
• Languages
• XML/RDF/DAML+OIL/DAML-S/OWL
Ontology - example
DAML+OIL Language:
<daml:Class rdf:ID="Human">
  <rdfs:subClassOf>
    <daml:Restriction daml:cardinality="1">
      <daml:onProperty rdf:resource="#Age"/>
    </daml:Restriction>
  </rdfs:subClassOf>
</daml:Class>
<daml:DatatypeProperty rdf:ID="Age">
  <rdfs:domain rdf:resource="#Human"/>
</daml:DatatypeProperty>
<daml:Class rdf:ID="Patient">
  <rdfs:subClassOf rdf:resource="#Human"/>
</daml:Class>
Diagram: Patient is Human; Human has Age
Knowledge Base - example
[Figure: knowledge base graph; recoverable nodes: Human, Patient, Temperature, Value, Attribute, Tables, Database; edge labels: is, has; instance data: attribute:PAT_ID, table:PATIENTS, jdbc://foo/hospital]
Semantic mediator
• Distributed heterogeneous databases
• Different database schemas
• Different query languages
• Different names of attributes/tables… but the same semantics!
• WG enables semantics mediation at a higher level
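Semantic mediator (cont.)
[Figure: ontology graph linking two hospital databases; recoverable labels: Patient is Human; Human has Age; Human has Blood Type; Database in Hospital X; Database in Hospital Z; columns AGE, PAT_AGE, BT, PAT_BLOOD_TYPE connected by samePropertyAs]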
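The samePropertyAs links can be read as a lookup table from ontology properties to local column names. The Python sketch below illustrates this idea under stated assumptions: which column belongs to which hospital, the table name `PATIENTS`, and both function names are hypothetical and not taken from the Wisdom Grid implementation.

```python
from typing import Dict, List

# Assumed assignment of columns to hospitals; the slide does not make it explicit.
SAME_PROPERTY_AS: Dict[str, Dict[str, str]] = {
    "hospital_x": {"Age": "AGE", "BloodType": "BT"},
    "hospital_z": {"Age": "PAT_AGE", "BloodType": "PAT_BLOOD_TYPE"},
}


def local_columns(source: str, properties: List[str]) -> List[str]:
    """Translate ontology property names into the column names of one source."""
    mapping = SAME_PROPERTY_AS[source]
    missing = [p for p in properties if p not in mapping]
    if missing:
        raise KeyError(f"{source} has no mapping for {missing}")
    return [mapping[p] for p in properties]


def to_sql(source: str, table: str, properties: List[str]) -> str:
    """Build a source-specific SELECT that returns columns under ontology names."""
    columns = local_columns(source, properties)
    select = ", ".join(f'{c} AS "{p}"' for c, p in zip(columns, properties))
    return f"SELECT {select} FROM {table}"


# One ontology-level request, two schema-specific queries.
query_x = to_sql("hospital_x", "PATIENTS", ["Age", "BloodType"])
query_z = to_sql("hospital_z", "PATIENTS", ["Age", "BloodType"])
```

Because both queries alias their columns to the ontology names, the results from the two hospitals can be combined without further renaming.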
Distributed Knowledge base
[Figure: classes from several knowledge bases linked across URIs; recoverable labels: uri:fooY#Human, uri:fooZ#Temperature, uri:fooX#Patient, uri:fooX#Ill_Person; edge labels: is subclass, has property, is same class as]
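One way to resolve "is same class as" links across knowledge bases is a union-find structure over class URIs. This is a sketch of that general technique, not the Wisdom Grid design; the class name `ClassAliases`, the URI `uri:fooY#Patient`, and the specific equivalence statements are hypothetical.

```python
from typing import Dict


class ClassAliases:
    """Union-find over class URIs that have been declared the same class."""

    def __init__(self) -> None:
        self._parent: Dict[str, str] = {}

    def find(self, uri: str) -> str:
        """Return the representative URI of the equivalence set containing uri."""
        self._parent.setdefault(uri, uri)
        while self._parent[uri] != uri:
            # Path halving keeps later lookups short.
            self._parent[uri] = self._parent[self._parent[uri]]
            uri = self._parent[uri]
        return uri

    def same_class_as(self, a: str, b: str) -> None:
        """Record that a and b denote the same class."""
        self._parent[self.find(a)] = self.find(b)

    def equivalent(self, a: str, b: str) -> bool:
        return self.find(a) == self.find(b)


kb = ClassAliases()
# Hypothetical statements, chosen only to show transitivity.
kb.same_class_as("uri:fooX#Patient", "uri:fooX#Ill_Person")
kb.same_class_as("uri:fooX#Ill_Person", "uri:fooY#Patient")
```

With these statements, `kb.equivalent("uri:fooX#Patient", "uri:fooY#Patient")` holds through the intermediate class, while unrelated URIs stay in their own sets.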
Agent Grid Service
• Gives the system the ability to communicate with the outside world in standard languages
• FIPA Standards
• ACL: Agent Communication Language
• KQML: Knowledge Query and Manipulation Language
• Agent Platform (JADE, FIPA-OS)
• Agents
• Domain Knowledge Agent
• Knowledge Explorer Agent
• End-user Agent (personal)
Querying
• End-user agent
• with own ontology: subset of ontology
• Merging of ontologies
• without own ontology
• Negotiating about domain of interest
• Queries created from ontology
• Templates
<Patient rdf:ID="ID001">
  <Temperature/>
</Patient>
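A template like the one above can be turned into a database query once the knowledge base says where the class is stored. The Python sketch below assumes a mapping built from the knowledge base example (table PATIENTS, key attribute PAT_ID); the `TEMPERATURE` column name, the `KB` dictionary layout, and the function names are hypothetical. The `rdf` namespace declaration is added so the fragment parses as standalone XML.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The slide's template, with the rdf prefix declared so it is well-formed on its own.
TEMPLATE = f'<Patient xmlns:rdf="{RDF_NS}" rdf:ID="ID001"><Temperature/></Patient>'

# Hypothetical knowledge base entry: where the class lives and how properties map.
KB: Dict[str, dict] = {
    "Patient": {
        "table": "PATIENTS",
        "key": "PAT_ID",
        "columns": {"Temperature": "TEMPERATURE"},  # assumed column name
    }
}


def parse_template(xml_text: str) -> Tuple[str, str, List[str]]:
    """Return (class name, instance id, requested property names)."""
    root = ET.fromstring(xml_text)
    instance_id = root.attrib[f"{{{RDF_NS}}}ID"]
    requested = [child.tag for child in root]
    return root.tag, instance_id, requested


def build_query(xml_text: str) -> Tuple[str, Tuple[str, ...]]:
    """Build a parameterized SQL query for the empty properties in the template."""
    cls, instance_id, requested = parse_template(xml_text)
    entry = KB[cls]
    columns = ", ".join(entry["columns"][p] for p in requested)
    sql = f"SELECT {columns} FROM {entry['table']} WHERE {entry['key']} = ?"
    return sql, (instance_id,)


sql, params = build_query(TEMPLATE)
```

The instance identifier is passed as a bound parameter rather than spliced into the SQL text, so the template content cannot alter the query structure.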
Answers
• Mined Knowledge (GridMiner)
• Decision trees/rules (clinical pathways)
• Association rules
• Instances of domain ontology
• Particular data
• References
• Links to Web sites
• Information about other knowledge providers
GridMiner Case Study - Medical Application
• Q: Outcome? + data about patient's condition
• A: probability of survival + references to the diagnoses
[Figure: case study diagram; recoverable labels: Semantic Web/Grid, Knowledge Explorer Agent, Knowledge Agent, Knowledge Discovery Service, Test set, Training set, Knowledge Base, Hospital Databases, resources, End User (personal) Agent]
Conclusions and Future Work
• Application and extension of the Grid technology to knowledge discovery: an important, but non-traditional Grid application domain
• Introduction of a new Grid Data Mediation Service
Future work
• Performance evaluation on large synthetic data volumes
• Coupling of the Data Mining services architecture with the OLAP services architecture
• Development of a knowledge discovery oriented Grid Workflow Language and the appropriate Workflow Engine
• Application of GridMiner to a real medical application (management of patients with severe traumatic brain injuries)
• Development of the Wisdom Grid