Knowledge Discovery in Grid Datasets – Goals, Design Concepts and the Architecture

Peter Brezany, University of Vienna

Collecting Data (figure; labels: Satellites, Laboratories (microscopes, MRI/CT scanners, ...), Data Repositories, Business, Analysis, Experiments (high energy physics, ...), Computer simulations)

Motivation
• Computational Grid: a new-generation infrastructure
• Challenge: advanced analysis of data managed by the Grid
• Typical data in modern Grid applications: files, file collections, relational and XML DBs, virtual data, data objects
• The data is often large and geographically distributed, and its complexity is increasing; some applications require special security precautions.
• Our research aims:
  Phase 1: Knowledge discovery Grid system (GridMiner)
  Phase 2: Intelligent Grid system (WisdomGrid)

Outline
• Motivation
• Background and Related Work
• Basic Concepts and GridMiner Architecture
• Grid Data Integration System
• Data Mining Layer
• Implementation Issues and Experiments
• Future Research
• Conclusions

Background and Related Work
• Basic Grid development (Globus 1) – metacomputing
• Data Grid (Globus 2, DataGrid of CERN, etc.)
• Semantic Grid (myGrid)
• Open Grid Service Architecture (Globus 3, OGSA-DAIS)
• Parallel and Distributed Data Mining and Data Warehousing
• Knowledge Grid (GridMiner and work of others)
• Web Intelligence

GridMiner Requirements
• Open architecture
• Data distribution, complexity, heterogeneity, and large data size
• Applying different kinds of analysis strategies
• Compatibility with existing Grid infrastructure
• Openness to tools and algorithms
• Scalability
• Grid, network, and location transparency
• Security and data privacy
• OLAP support

GridMiner (Layered) Abstract Architecture (figure; labels: User Interface, Knowledge Grid, Data to Knowledge, Control, Information Grid, Computational & Data Grid). Built on K. G. Jeffery's proposal.

GridMiner Conceptual Architecture (figure; label: Job Control)

Service Architecture Based on OGSA-DAIS

Data Distribution Scenarios • Single data source • Federated data sources with different types of partitioning

Example Vertical and horizontal distribution of the virtual data source
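The two partitioning types can be illustrated with a minimal sketch. It assumes that horizontal partitions share a schema and hold different rows, and that vertical partitions hold different columns of the same rows, linked by a common key. The site names, the key `pat_id`, and the sample rows are hypothetical; this is not the GridMiner mediation service itself.

```python
from typing import Dict, List

Row = Dict[str, object]


def merge_horizontal(*partitions: List[Row]) -> List[Row]:
    """Union of partitions that share a schema but hold different rows."""
    merged: List[Row] = []
    for part in partitions:
        merged.extend(part)
    return merged


def merge_vertical(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Join partitions that hold different columns of the same rows."""
    right_by_key = {row[key]: row for row in right}
    joined: List[Row] = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:  # keep only rows present on both sides
            joined.append({**row, **match})
    return joined


# Hypothetical sample data: sites A and B are horizontal partitions,
# site C holds an extra column for the same patients (vertical partition).
site_a = [{"pat_id": 1, "age": 34}, {"pat_id": 2, "age": 51}]
site_b = [{"pat_id": 3, "age": 47}]
site_c = [
    {"pat_id": 1, "outcome": "good"},
    {"pat_id": 2, "outcome": "poor"},
    {"pat_id": 3, "outcome": "good"},
]

virtual_table = merge_vertical(
    merge_horizontal(site_a, site_b), site_c, key="pat_id"
)
for row in virtual_table:
    print(row)
```

A client of the virtual data source would see only `virtual_table` and would not need to know how many sites contributed to it.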

Mapping Schema

Grid Data Mediation Services

Architecture of a Data Mining System

Components of the Data Mining Layer
• GridMiner Service Factory
• GridMiner Service Registry
• GridMiner Data Mining Service
• GridMiner Preprocessing Service
• GridMiner Presentation Service
• GridMiner Orchestration Service

Centralized Data Mining

Parallel and Distributed Data Mining

GridMiner Orchestration Service

GridMiner Job Specification Language

Implementation Prototype
• Implementation of the Mediation Service for horizontal data partitioning
• Implementation of Data Mining Services for decision tree construction as OGSA-compliant Grid services, based on the Globus Toolkit 3 release
• We use:
  - Weka, a freely available Java-based data mining system, for data preprocessing and data mining tasks (main-memory oriented)
  - a home-grown Java implementation of the SPRINT algorithm (disk oriented)
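SPRINT chooses decision tree splits with the gini index, evaluated over attribute lists that are sorted once. The following sketch shows that split criterion for a single numeric attribute. It is written in Python for brevity and is not the Java prototype; it recomputes the index at every candidate for clarity, whereas SPRINT updates class histograms incrementally during the scan. The function names and sample data are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def gini(labels: Sequence[str]) -> float:
    """Gini index of a set of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_numeric_split(
    values: Sequence[float], labels: Sequence[str]
) -> Tuple[float, float]:
    """Return (threshold, weighted gini) of the best split 'value <= threshold'."""
    pairs = sorted(zip(values, labels))  # the attribute list, sorted by value
    n = len(pairs)
    best = (float("nan"), float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no split point between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (threshold, score)
    return best


# Hypothetical sample: patient age against outcome.
ages = [20, 25, 40, 60]
outcomes = ["good", "good", "poor", "poor"]
print(best_numeric_split(ages, outcomes))
```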

Experimental Environment
• Test data suites
  - synthetic data (generated by an extended version of the IBM Quest Synthetic Data Generation Code)
  - TBI (Traumatic Brain Injury) databases
• Grid testbed: Vienna, CERN, Dublin, Zagreb, Cracow
• Goals in the first phases
  - Verifying model accuracy
  - Overhead of the service layers

Extending the Functionality

OLAM

Example: Mining Patterns for Data Classification and Associations

  use database dat1, dat2
  mine classifications
  analyze patient_outcome
  using g_parsimony
  display as tree

  use database DBs attributes
  mine associations
  using method_attributes
  display as rules
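To show how such a statement breaks down into clauses, here is a small sketch of a clause parser. The keyword set (`use database`, `mine`, `analyze`, `using`, `display as`) is read off the example, and the one-clause-per-line layout is an assumption; the actual grammar of the language is not given on the slide, so `parse_statement` is purely illustrative.

```python
from typing import Dict

# Clause keywords as they appear in the slide example (assumed complete).
CLAUSES = ("use database", "mine", "analyze", "using", "display as")


def parse_statement(text: str) -> Dict[str, str]:
    """Split a statement into {clause keyword: argument}, one clause per line."""
    result: Dict[str, str] = {}
    for raw_line in text.strip().splitlines():
        line = raw_line.strip()
        if not line:
            continue
        for keyword in CLAUSES:
            if line.startswith(keyword + " "):
                result[keyword] = line[len(keyword):].strip()
                break
        else:
            raise ValueError(f"unrecognized clause: {line!r}")
    return result


STATEMENT = """
use database dat1, dat2
mine classifications
analyze patient_outcome
using g_parsimony
display as tree
"""

parsed = parse_statement(STATEMENT)
print(parsed)
```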

Workflow 1: Interactive Mode

Workflow 2: Batch Mode

Workflow 3: Hybrid Mode

Execution Model Based on Static Workflow

Execution Model Based on Dynamic Workflow

Towards the Wisdom Grid (WG)

WG Architecture (figure; labels: Domain Knowledge Agents, Knowledge Explorer Agent, Wisdom Grid, Agent Platform, Agent Grid Service, KB, External Knowledge Base, Knowledge Base Service, External Services, Knowledge Discovery Service, Grid, End User (personal) Agent)

Work-Flow (figure; labels: External Agents, End User Agent, Knowledge Agent, Knowledge Explorer Agent, Knowledge Base service, Knowledge discovery service, Agent Service, Services ..., Knowledge Base)

Knowledge Discovery Service
• Client for other services
• Knowledge Discovery in Databases
• GridMiner
• data mining
• on-line analytical processing (OLAP)
• Web Mining
• semantic web
• Online libraries
• Web/Grid Services
• Knowledge Explorer Agent

Knowledge Base Service / KB
• KBS: search, query, and expand the Knowledge Base
• KB: database that stores particular data about real objects, the relations between these objects, and their properties
• Consists of ontologies and instances
• Information about resources (location, query language)
  • on the Web
  • Web/Grid services, agents
  • references to the online database
• Languages
  • XML / RDF / DAML+OIL / DAML-S / OWL

Ontology - example (DAML+OIL language)

  <daml:Class rdf:ID="Human">
    <rdfs:subClassOf>
      <daml:Restriction daml:cardinality="1">
        <daml:onProperty rdf:resource="#Age"/>
      </daml:Restriction>
    </rdfs:subClassOf>
  </daml:Class>
  <daml:DatatypeProperty rdf:ID="Age">
    <rdfs:domain rdf:resource="#Human"/>
  </daml:DatatypeProperty>
  <daml:Class rdf:ID="Patient">
    <daml:subClassOf rdf:resource="#Human"/>
  </daml:Class>

Diagram: Patient is Human; Human has Age.
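The practical consequence of this ontology is that a Patient, being a Human, also has an Age. A minimal sketch of that inference follows, using plain dictionaries instead of an ontology library. The dictionary names and helper functions are hypothetical and only mirror the three statements of the example.

```python
from typing import Dict, List

# The example ontology as plain facts (names taken from the slide).
SUBCLASS_OF: Dict[str, str] = {"Patient": "Human"}  # Patient is Human
DOMAIN_OF: Dict[str, str] = {"Age": "Human"}        # Human has Age


def ancestors(cls: str) -> List[str]:
    """Return the class itself followed by all of its superclasses."""
    chain = [cls]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return chain


def properties_of(cls: str) -> List[str]:
    """Properties declared on the class or inherited from a superclass."""
    classes = set(ancestors(cls))
    return sorted(prop for prop, domain in DOMAIN_OF.items() if domain in classes)


print(ancestors("Patient"))
print(properties_of("Patient"))
```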

Knowledge Base - example (figure; classes: Human, Patient, Temperature, Value, Attribute, Tables, Database; relations: is, has; instance data: attribute:PAT_ID, table:PATIENTS, jdbc://foo/hospital)

Semantic mediator
• Distributed heterogeneous databases
• Different database schemas
• Different query languages
• Different names of attributes/tables… but the same semantics!
• WG enables semantics mediation at a higher level

Semantic mediator (cont.) (figure; classes: Human, Patient; properties: Age, Blood Type; relations: is, has, samePropertyAs; data sources: Database in Hospital X, Database in Hospital Z; column names: AGE, PAT_AGE, BT, PAT_BLOOD_TYPE)
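The idea of mediation through `samePropertyAs` links can be sketched as a lookup from ontology properties to local column names, followed by query rewriting per database. Which hospital uses which column name, the table name `PATIENTS` for both sites, and the function `rewrite` are assumptions made for illustration; the slide does not specify them.

```python
from typing import Dict, List

# samePropertyAs links: ontology property -> local column name per database.
# The assignment of column names to hospitals is an assumption.
SAME_PROPERTY_AS: Dict[str, Dict[str, str]] = {
    "Age": {"hospital_x": "AGE", "hospital_z": "PAT_AGE"},
    "BloodType": {"hospital_x": "BT", "hospital_z": "PAT_BLOOD_TYPE"},
}

# Hypothetical table holding patient records in each database.
TABLES: Dict[str, str] = {"hospital_x": "PATIENTS", "hospital_z": "PATIENTS"}


def rewrite(properties: List[str]) -> Dict[str, str]:
    """Translate a request for ontology properties into one query per database."""
    queries: Dict[str, str] = {}
    for database, table in TABLES.items():
        columns = [f"{SAME_PROPERTY_AS[p][database]} AS {p}" for p in properties]
        queries[database] = f"SELECT {', '.join(columns)} FROM {table}"
    return queries


for database, query in rewrite(["Age", "BloodType"]).items():
    print(database, "->", query)
```

Because every local column is aliased to the ontology property name, the results from both databases come back with the same column names and can be combined directly.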

Distributed Knowledge base (figure; nodes: uri:fooX#Patient, uri:fooX#Ill_Person, uri:fooY#Human, uri:fooZ#Temperature; node types: Class, property; relations: is subclass, has property, is same class as)

Agent Grid Service
• Gives the system the ability to communicate with the outside world in standard languages
• FIPA Standards
  • ACL: Agent Communication Language
  • KQML: Knowledge Query and Manipulation Language
• Agent Platform (JADE, FIPA-OS)
• Agents
  • Domain Knowledge Agent
  • Knowledge Explorer Agent
  • End-user Agent (personal)

Querying
• End-user agent
  • with own ontology (a subset of the ontology)
    • Merging of ontologies
  • without own ontology
    • Negotiating about domain of interest
• Queries created from ontology
  • Templates

  <Patient rdf:ID="ID001">
    <Temperature/>
  </Patient>

Answers
• Mined knowledge (GridMiner)
  • Decision trees / rules (clinical pathways)
  • Association rules
• Instances of domain ontology
  • Particular data
• References
  • Links to Web sites
  • Information about other knowledge providers

GridMiner Case Study - Medical Application (figure; components: Semantic Web/Grid, Knowledge Explorer Agent, Knowledge Agent, Knowledge Discovery Service, Knowledge Base, Hospital Databases, End User (personal) Agent; data: Training set, Testset; other label: resources)
Q: Outcome? + data about patient's condition
A: probability of survival + references to the diagnoses

Conclusions and Future Work
• Application and extension of the Grid technology to knowledge discovery: an important but non-traditional Grid application domain
• Introduction of a new Grid Data Mediation Service
• Future work
  • Performance evaluation on large synthetic data volumes
  • Coupling of the Data Mining services architecture with the OLAP services architecture
  • Development of a knowledge-discovery-oriented Grid Workflow Language and the appropriate Workflow Engine
  • Application of GridMiner to a real medical application (management of patients with severe traumatic brain injuries)
  • Development of the Wisdom Grid
