Peter Brezany, Ivan Janciak, Alexander W ö hrer, A Min Toja University of Vienna

GridMinerA Framework for Knowledge Discoveryon the Grid – from a Vision to Design and Implementation Peter Brezany, Ivan Janciak, Alexander Wöhrer, A Min Toja University of Vienna Institute for Software Science email: janciak@par.univie.ac.at

GridMiner Overview • Start: Jan. 2003 • Host: University of Vienna Vienna University of Technology • Target: • provide tools to discover and access relevant knowledge and information from different distributed and heterogeneous data sources • Test application area: medical • traumatic brain injury treatment • Predicting the outcome of seriously ill patients • analytical part focuses on data mining and On-Line Analytical Processing (OLAP)

Project members

Outline • Motivation/ Requirements • GridMiner Services • Architecture • Dynamic Service Composition Engine • OLAP • Knowledge base • Data Integration • Graphical user interface • Implementation • Summary

The process to cover • Data distributed over participating hospitals • accesses from different platforms (hand held, PC,…) for data generation, querying, analysis • Process needs to access various data sources

GridMiner • Motivation • integrate knowledge discovery and knowledge management as an autonomic system • manage and control whole lifecycle of knowledge • give a strong support to other intelligent entities in their needs for knowledge • Basic Requirements • Ability to access and analyze a huge amount of information – typically heterogeneous and geographically distributed • Intelligent behavior ability to maintain, discover, extend, present and communicate knowledge • High performance (real-time or soft real-time) query processing • High security guarantee

GridMiner Services • Dynamic Workflow Control Service • Data mining services • Sequences (SPADE) • Clustering (SimpleKMeans) • Decision rules (SPRINT) • OLAP (sequential/parallel version) • Association rules on OLAP • Grid Data Mediator Service

GridMiner Architecture User environment Graphical User Interface Knowledge Base Service configuration DSCE Client Web Dynamic service control engine (DSCE) Grid Data Access and Integration Data mining services

Dynamic Service Control Engine • Process a workflow described by DSCL. • Based on the Open Grid Services Architecture • Supports both interactive and batch processing • User independent processing of the workflow • Provision of all intermediate results from the involved services • Full user control during workflow execution • Supports the OGSA Notification Model

Dynamic Service Control Engine (cont.)

Knowledge Base Rules SWRL OWL Facts Web Ontology Language OWL + OWL-S Datatsource Ont. Activity Ontology Datamining Ont. Domain Ontology XML ,XML Schema (XSL) (webrowset,pmml…) Metadata

OLAP • Multidimensional data analysis by sequential and distributed / parallel OLAP engines. • Cube construction and querying • Representation of query results by OLAP Modeling Markup Language • Integration with data mining engines (Association rules on OLAP)

Grid Data Mediation Service Principles • Tight Federation: • global (relational) schema • Virtual integration: • let the data where it is • always up-to-date data • No proprietary solution • inherit well solve aspects from OGSA-DAI • Not bound to special architecture • Supported data sources: • RDBMS (via JDBC), XMLDB (Xindice), CSV files • Operators: “Union all” and “inner join” • Operators are XQuery based (using SAXON)

Data Integration Scenario • Heterogeneities: • Name in A is „First Last“ (as the target format) • Name in C has to be combined • Distribution: • 3 data sources

Data Integration Scenario (cont.) • Query: SELECT p_name FROM patient WHERE id=10 Standard to optimized

Implementation/Technology • Globus 3.2 • OGSA/DAI • GUI – Workflow constructions/Results visualization (JGraph, Java web Start, Java server pages) • Service Configuration (Java server pages/PHP/..) • Knowledge base – (XML,OWL)

Decision Rules (SPRINT) (Select 10k rows) (Select 20k rows) Decision Rules (C45) Database (100k rows) Decision Rules (C45) Data mining Scenario

Graphical User Interface

Summary • Integrated data mining infrastructure • Covers the whole process • Service Oriented Architecture • Implemented Prototype • Project ongoing • New data mining tasks (algorithms) • Knowledge management • More information: http://www.gridminer.org

Thank you Questions?

Peter Brezany, Ivan Janciak, Alexander W ö hrer, A Min Toja University of Vienna