Planning to Learn with a Knowledge Discovery Ontology
Monika Žáková, Petr Křemen, Filip Železný (Czech Technical University, Prague)
Nada Lavrač (Institute Jozef Stefan, Ljubljana)
Motivation
FP6 SEVENPRO project: “semantic engineering environment”
• integration of knowledge from various sources (e.g. different CAD software, ERP, etc.) by means of a layer of semantic annotations
• a significant part of engineering knowledge has a rich relational structure (CAD designs, documents, simulation models, ERP databases) → traditional ML techniques and tools are unsuitable
Goals:
• make implicit knowledge contained e.g. in CAD designs explicit for reuse, training and quality control
• develop a tool for RDM capable of dealing with semantic annotations and producing results in a semantic format
Example
In the CAD ontology:
<rdfs:Class rdf:ID="PrismSolFeature"></rdfs:Class>
<rdfs:Class rdf:ID="SolidExtrude">
  <rdfs:subClassOf rdf:resource="#PrismSolFeature"/>
</rdfs:Class>
Declaring it in background knowledge:
subclass(prismSolFeature, solidExtrude).
hasFeature(B, F1) :- hasFeature(B, F2), subclassTC(F1, F2).
Problem with subsumption:
C = liner(P) :- hasBody(P, B), hasFeature(B, prismSolFeature).
D = liner(P) :- hasBody(P, B), hasFeature(B, solidExtrude).
It does not hold that C subsumes D, so clause D is not obtained by applying a specialization refinement operator to clause C.
Our approach: extend the refinement operator with taxonomies on predicates and terms.
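A minimal sketch of the idea (hypothetical Python; the dictionaries and function names are illustrative, not part of the SEVENPRO implementation): a literal-level test that recognizes hasFeature(B, solidExtrude) as a specialization of hasFeature(B, prismSolFeature) by consulting the term and predicate taxonomies.

```python
# Hypothetical sketch: taxonomy-aware literal specialization test.
# The taxonomies map each term/predicate to its direct super-element.

SUPER_TERM = {"solidExtrude": "prismSolFeature"}   # term (sort) taxonomy
SUPER_PRED = {"hasCircularSketch": "hasSketch"}    # predicate taxonomy

def ancestors(x, super_map):
    """All ancestors of x in a taxonomy, including x itself."""
    out = {x}
    while x in super_map:
        x = super_map[x]
        out.add(x)
    return out

def specializes(lit_d, lit_c):
    """True if literal lit_d equals, or is a taxonomic specialization of,
    literal lit_c (same arity assumed; variables compared by name)."""
    (pred_d, args_d), (pred_c, args_c) = lit_d, lit_c
    if pred_c not in ancestors(pred_d, SUPER_PRED):
        return False
    return all(a_c in ancestors(a_d, SUPER_TERM)
               for a_d, a_c in zip(args_d, args_c))

# hasFeature(B, solidExtrude) specializes hasFeature(B, prismSolFeature)
print(specializes(("hasFeature", ("B", "solidExtrude")),
                  ("hasFeature", ("B", "prismSolFeature"))))   # True
```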
Sorted Refinement
Downward Δ,Σ-refinement
• extension of the sorted refinement proposed by Frisch
• defined using 3 refinement rules:
  • adding a literal to the conjunction
  • replacing a sort τi in a literal pred1(x1:τ1, …, xn:τn) with one of its direct subsorts, giving pred1(x1:τ1’, …, xn:τn)
  • replacing a literal pred1(x1:τ1, …, xn:τn) with one of its direct subrelations pred2(x1:τ1, …, xn:τn)
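A compact sketch of the three rules (hypothetical Python; the literal representation and the taxonomy/mode tables are assumptions made for illustration, not the actual operator implementation):

```python
# Hypothetical sketch of the downward refinement operator:
# a conjunction is a tuple of literals (predicate, (argument_sorts...)).

DIRECT_SUBSORTS = {"prismSolFeature": ["solidExtrude"]}
DIRECT_SUBPREDS = {"hasSketch": ["hasCircularSketch"]}
ADDABLE_LITERALS = [("hasMaterial", ("CADPart", "Material"))]

def refine(conj):
    """Generate all direct downward refinements of a conjunction."""
    # Rule 1: add a literal to the conjunction.
    for lit in ADDABLE_LITERALS:
        if lit not in conj:
            yield conj + (lit,)
    for i, (pred, args) in enumerate(conj):
        # Rule 2: replace one argument sort with a direct subsort.
        for j, sort in enumerate(args):
            for sub in DIRECT_SUBSORTS.get(sort, []):
                new_args = args[:j] + (sub,) + args[j + 1:]
                yield conj[:i] + ((pred, new_args),) + conj[i + 1:]
        # Rule 3: replace the predicate with a direct subrelation.
        for sub in DIRECT_SUBPREDS.get(pred, []):
            yield conj[:i] + ((sub, args),) + conj[i + 1:]

for r in refine((("hasFeature", ("Body", "prismSolFeature")),)):
    print(r)
```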
Feature Taxonomy
• information about the feature subsumption hierarchy is stored and passed to the propositional learner
• assume that features f1, …, fn have been generated with corresponding conjunctive bodies b1, …, bn
• the elementary subsumption matrix E of n rows and n columns is defined such that Ei,j = 1 whenever bi ∈ ρΔ,Σ(bj) and Ei,j = 0 otherwise
• the exclusion matrix X of n rows and n columns is defined such that Xi,j = 1 whenever i = j or bi ∈ ρΔ,Σ(ρΔ,Σ(… ρΔ,Σ(bj) …)) and Xi,j = 0 otherwise
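X is thus the reflexive-transitive closure of E. A minimal sketch of how the two matrices could be computed (hypothetical Python; `refine` stands in for the ρΔ,Σ operator and is passed in as a function returning the direct refinements of a body):

```python
# Hypothetical sketch: elementary subsumption matrix E and exclusion
# matrix X (reflexive-transitive closure of E, via Warshall's algorithm).

def elementary_matrix(bodies, refine):
    """E[i][j] = 1 iff body b_i is a direct refinement of body b_j."""
    n = len(bodies)
    return [[1 if bodies[i] in refine(bodies[j]) else 0
             for j in range(n)] for i in range(n)]

def exclusion_matrix(E):
    """X[i][j] = 1 iff i == j or b_i is reachable from b_j by
    repeatedly applying the refinement operator."""
    n = len(E)
    X = [[1 if i == j else E[i][j] for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                X[i][j] = X[i][j] or (X[i][k] and X[k][j])
    return X
```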
Propositional Rule Learning
2 propositional algorithms adapted to utilize the matrices E, X
• Top-down deterministic algorithm
  • stems from the rule inducer of RSD
• Stochastic local DNF algorithm
  • (Rückert 2003, Paes 2006)
  • search in the space of DNF formulas
  • refinement done by local non-deterministic changes of DNF terms
Using the matrices E, X, the algorithms can:
• prevent the combination of a feature and its subsumee within a conjunction (both algorithms)
• specialize a conjunction by replacing a feature with its direct subsumee (top-down algorithm only)
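A sketch of how the matrices are consulted when refining a conjunction (hypothetical Python; features are referred to by index and a conjunction is a set of feature indices, which is an assumption made for illustration):

```python
# Hypothetical sketch: using matrices E and X during conjunction refinement.

def add_feature_refinements(conj, n_features, X):
    """Add a new feature unless it stands in a subsumption relation
    (in either direction) with a feature already in the conjunction."""
    for f in range(n_features):
        if f not in conj and all(not X[f][g] and not X[g][f] for g in conj):
            yield conj | {f}

def specialization_refinements(conj, n_features, E):
    """Specialize by replacing a feature with one of its direct subsumees
    (used by the top-down deterministic learner only)."""
    for g in conj:
        for f in range(n_features):
            if E[f][g]:                      # f is a direct subsumee of g
                yield (conj - {g}) | {f}
```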
RDM Core Overview
[Diagram: predicate declarations, the sort theory, background knowledge (Horn logic) and examples feed feature construction; the resulting features and the feature subsumption table (subsumption and exclusion matrices) feed the adapted propositional rule learning and third-party propositional learning (WEKA, R).]
Predicate declarations:
mode hasBody(+CADPart, -Body).
mode hasMaterial(+CADPart, -Material).
mode hasSketch(+CADPart, -Sketch).
mode hasLength(+Sketch, -float).
Sort theory:
subClassOf(CADPart, CADEntity).
subClassOf(CADAssembly, CADEntity).
…
subPropertyOf(hasCircularSketch, hasSketch).
subPropertyOf(firstFeature, hasFeature).
Examples:
eItem(eItemT_BA1341).
eItem(eItemT_BA1342).
eItem(eItemT_BA1343).
RDM Manager
= a tool developed for running the RDM tasks
Functionalities:
• obtaining relevant data by means of a SPARQL query to the semantic repository
• converting data from the semantic representation into a format acceptable by the DM algorithms (Prolog, arff, csv, etc.)
• propositionalization by generating first-order features
• enhanced propositional rule learning algorithms
• third-party propositional learning algorithms integrated by means of wrappers, e.g.
  • rule learner RIPPER (Cohen 1995)
  • association rules – Apriori
  • decision trees – J48 algorithm (for all of the above the WEKA implementation is used)
  • clustering – distance-based PCA (implemented in R)
• storing information about DM processes and their results in a semantic representation
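A minimal sketch of such a wrapper chain (hypothetical Python; the endpoint URL, query, and file names are placeholders, and WEKA is assumed to be on the Java classpath; this is not the actual RDM Manager code):

```python
# Hypothetical sketch: fetch annotated data from the semantic repository
# and run a third-party propositional learner (WEKA's J48) via its CLI.
import subprocess
from SPARQLWrapper import SPARQLWrapper, JSON   # assumed client library

def fetch_examples(endpoint_url, query):
    """Retrieve relevant semantic annotations via a SPARQL query."""
    client = SPARQLWrapper(endpoint_url)
    client.setQuery(query)
    client.setReturnFormat(JSON)
    return client.query().convert()["results"]["bindings"]

def run_j48(arff_path):
    """Call the WEKA J48 decision-tree learner on a propositionalized table."""
    result = subprocess.run(
        ["java", "weka.classifiers.trees.J48", "-t", arff_path],
        capture_output=True, text=True)
    return result.stdout
```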
Knowledge Discovery Ontology
Foreseen queries that guided the design of the ontology
• User:
  • Give me all rule-based classifiers found for class C on dataset D with an error estimate < 5%
  • Give me the rule-based algorithm with the shortest average runtime for datasets D, E and F
• Developer:
  • Give me all pairs of model classes with equivalent expressiveness for which no conversion program is available
  • Give me all parameter settings for experiments with dataset D and algorithm A and their respective runtimes and accuracy results
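As an illustration, the first user query could be phrased roughly as follows (a hypothetical Python snippet holding a SPARQL query; all class and property names such as kd:RuleBasedClassifier and kd:errorEstimate are assumptions, not the ontology's actual vocabulary):

```python
# Hypothetical sketch: "all rule-based classifiers found for class C on
# dataset D with error estimate < 5%", phrased as a SPARQL query string.
USER_QUERY = """
PREFIX kd: <http://example.org/kd-ontology#>
SELECT ?classifier WHERE {
  ?classifier a kd:RuleBasedClassifier ;
              kd:producedBy ?execution ;
              kd:errorEstimate ?err .
  ?execution kd:inputDataset kd:DatasetD ;
             kd:targetClass kd:ClassC .
  FILTER (?err < 0.05)
}
"""
```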
Example Queries to the KD Ontology
• Obvious idea: if the system knows all it can do, it can plan complex KD workflows
• Example: a planning system queries the ontology to generate a decision tree from a relational dataset through propositionalization:
  • Give me a program that takes a classified relational dataset represented as Prolog facts and produces an arff file
  • Give me a program that takes an arff file and produces a decision tree
Motivation for Workflow Generation
• user:
  • RDM algorithms utilizing background knowledge and relational learning through propositionalization and subsequent propositional learning are quite complex → we want to hide as much of this complexity as possible from the user
• developer/data miner:
  • storing information about the whole process → repeatability of experiments
  • individual components developed by different people → one can focus on experimenting with the parameters of some components and view the others as black boxes
Main Classes of the KD Ontology
• main notions: Knowledge and Algorithm
• representation language: OWL-DL
  • densely interlinked knowledge structures, not just taxonomies
  • highly optimized reasoners available (Pellet, RacerPro, FaCT++, ...)
Knowledge
5 subclasses, including:
• Dataset
• LogicalKnowledge
• NonLogicalKnowledge
• Pattern = MiningResult
• multiple formats may be attached to each Knowledge class
• each knowledge instance has a specified KnowledgeFormat
Example axioms (DL syntax):
• Dataset: Knowledge and example some Example
• Knowledge subclassOf hasExpressivity some Expressivity and hasFormat some KnowledgeFormat
• NonLogicalKnowledge: Knowledge and not LogicalKnowledge
• MiningResult: Knowledge and producedBy some AlgorithmExecution
Expressivity
[Figure: the Expressivity class hierarchy, shown in Protégé]
Algorithms
Algorithm
• a mapping from knowledge to knowledge
• not just induction; all executable elements, incl. preprocessing, ...
• definition of inputs, outputs and parameters
Apriori subclassOf NamedAlgorithm
and input some (Dataset and hasExpressivity only SingleRelationStructure and format only {ARFF, CSV})
and output some (MiningResult and contains only AssociationRule)
and minMetric some double
and minSupport some double
and numOfRules some positiveInteger
Algorithms (2)
• atomic (named) vs. composite (workflows)
• types of algorithms are modeled as classes, e.g. ClusteringAlgorithm
• each algorithm description is modeled as a subclass of the class NamedAlgorithm (like Apriori above)
• instances of the class AlgorithmExecution represent executions of algorithms
• thus, to access a particular algorithm, we need to pose a schema query to the OWL ontology – SPARQL-DL
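For example, a schema-level query listing all named algorithm classes might look as follows (hypothetical Python using rdflib; it assumes the inferred class hierarchy has already been materialized into the loaded file by a reasoner, and the file name and namespace are illustrative):

```python
# Hypothetical sketch: a schema query over the KD ontology retrieving
# all subclasses of NamedAlgorithm (SPARQL-DL-style, via plain SPARQL 1.1).
from rdflib import Graph

g = Graph()
g.parse("kd-ontology-inferred.owl")          # assumed file name

ALGORITHM_QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX kd:   <http://example.org/kd-ontology#>
SELECT ?algorithm WHERE {
  ?algorithm rdfs:subClassOf* kd:NamedAlgorithm .
}
"""
for row in g.query(ALGORITHM_QUERY):
    print(row.algorithm)
```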
Pattern
• result of a data mining algorithm
• describes a mapping from knowledge to knowledge
• defined as:
  Knowledge and producedBy some AlgorithmExecution,
  subclassOf contains only (AtomicKnowledge and singleResultAnnotation some anySimpleType)
• example: association rules
  MiningResult and producedBy only AssociationRulesAlgorithmExecution and contains only AssociationRule
  AssociationRule subclassOf AtomicKnowledge and antecedent some And and consequent some And and confidence some double and support some double
Anticipated Usage of the KD Ontology
• a specialization of the relevant OWL-S ontology parts – mainly the Process class
• during planning, inputs and outputs will be matched w.r.t. their format and expressivity to filter out invalid algorithm bindings
• beyond workflow generation:
  • management of the state-of-the-art knowledge in the KD domain
  • storing and managing KD workflow results – for example for meta-learning and experiment repeatability
Workflow Construction
Automatic workflow construction:
• converting the KD task, described using classes from the KD ontology, into a planning problem described in PDDL
• generating a plan using a planning algorithm
• storing the generated abstract workflow in the form of a semantic annotation
• instantiating the abstract workflow with specific algorithm configurations available in the KD ontology
Workflow-related Classes of the KD Ontology
The KD ontology is extended with workflow-related classes:
• ProblemDescription – defined using the properties
  • init, specifying the available input data and knowledge
  • goal, specifying the desired results
• Action – defined by
  • the Algorithm which is executed
  • startTime, duration and
  • the immediately preceding Actions
• Workflow – currently a DAG of Actions with a link to the ProblemDescription from which it was generated
Problem Description Example
Example: generating relational association rules from a classified relational dataset with relational background knowledge expressed in OWL-DL
RelationalAssociationRules subClassOf ProblemDescription
and goal some (MiningResult and contains only AssociationRule)
and init some (LogicalKnowledge and hasExpressivity some OWL-DL and hasFormat some {RDFXML})
and init some (LogicalKnowledge and hasExpressivity some RelationalStructure and hasFormat some {RDFXML})
and init some (ClassifiedInstanceSet and hasFormat some {RDFXML})
Conversion into a Planning Task Described in PDDL
• the ontology is classified using the FaCT++ reasoner to generate the inferred hierarchy of algorithms, knowledge and patterns
• names are generated for classes defined using OWL restrictions
• domain description in PDDL
  • generated by converting Algorithms into PDDL actions, with inputs specifying the preconditions and outputs specifying the effects
  • both inputs and outputs are currently restricted to conjunctions of OWL classes
• problem description in PDDL
  • generated in the same way from the ProblemDescription
Algorithm Definition Example
Description in the KD ontology (in DL formalism):
Apriori subClassOf NamedAlgorithm
and input some (Dataset and hasExpressivity only SingleRelationStructure and format only {ARFF})
and output some (MiningResult and contains only AssociationRule)
and minMetric some double
and minSupport some double
and numOfRules some positiveInteger
Description used for planning (in PDDL):
(:action AprioriAlgorithm
  :parameters (?v0 - Dataset_SingleRelationStructure
               ?v1 - ARFF
               ?v2 - MiningResult_contains_AssociationRule)
  :precondition (and (available ?v0) (format ?v0 ?v1))
  :effect (and (available ?v2)))
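A sketch of the rewriting step that produces such actions (hypothetical Python; the input layout is an assumption made for illustration, and the format precondition shown above is omitted for brevity):

```python
# Hypothetical sketch: an algorithm annotation, already reduced to named
# input/output classes, is rewritten into a PDDL action string.

def algorithm_to_pddl(name, inputs, outputs):
    """inputs/outputs: lists of (variable, class_name) pairs."""
    params = " ".join(f"?{v} - {cls}" for v, cls in inputs + outputs)
    pre = " ".join(f"(available ?{v})" for v, _ in inputs)
    eff = " ".join(f"(available ?{v})" for v, _ in outputs)
    return (f"(:action {name}\n"
            f"  :parameters ({params})\n"
            f"  :precondition (and {pre})\n"
            f"  :effect (and {eff}))")

print(algorithm_to_pddl(
    "AprioriAlgorithm",
    inputs=[("v0", "Dataset_SingleRelationStructure")],
    outputs=[("v2", "MiningResult_contains_AssociationRule")]))
```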
Planning Algorithm
• based on the Fast-Forward (FF) planning system (Hoffmann, 2001)
• enforced hill climbing algorithm to perform forward state-space search
• goal distances estimated using relaxed GRAPHPLAN
  • i.e. ignoring the delete lists of the operators
• returns the discovered workflows with the lowest number of processing steps
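A minimal sketch of enforced hill climbing (hypothetical Python, independent of PDDL details; states are assumed hashable, `successors` enumerates applicable actions, and `h` stands in for the relaxed-GRAPHPLAN goal-distance estimate):

```python
# Hypothetical sketch of enforced hill climbing as used in FF: from the
# current state, breadth-first search for any successor with a strictly
# better heuristic value, then commit to it and repeat.
from collections import deque

def enforced_hill_climbing(start, successors, h, is_goal):
    """successors(s) yields (action, next_state) pairs; h estimates goal distance."""
    state, plan = start, []
    while not is_goal(state):
        frontier = deque([(state, plan)])
        seen, best_h = {state}, h(state)
        improved = False
        while frontier and not improved:
            s, p = frontier.popleft()
            for action, nxt in successors(s):
                if nxt in seen:
                    continue
                seen.add(nxt)
                if h(nxt) < best_h:          # strictly better state found
                    state, plan, improved = nxt, p + [action], True
                    break
                frontier.append((nxt, p + [action]))
        if not improved:
            return None                      # dead end (FF would fall back to best-first search)
    return plan
```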
RDM Manager Implementation
[Architecture diagram: RDM GUI, Semantic Server Agent, RDM Ontology, RDM Manager Tool (RDM Web Service, RDM Engine), Algorithm Implementation 1 … Algorithm Implementation n]
Related Work (planning to learn)
• Most relevant: NExT system [Bernstein & Dänzer]
• (Our best understanding:)
  • linear plans
  • preprocessing–induction–postprocessing template
• We try for a template-free plan (DAG)
[Diagram: example DAG with nodes: multi-relational data, feature construction (inductive), feature evaluation (deductive), propositionalized data, propositional learning (inductive)]
Related Work (DM workflows and DM assistants)
• workflows for DM
  • myGrid/Taverna, Triana, DataMiningGrid, Kepler, KnowledgeGrid, CAMLET, Pegasus, MiningMart
  • manual workflow composition, focus on workflow execution
  • focus on DM from relational databases
  • relevant efforts in formalization of DM processes
• DM assistants
  • MetaL, StatLog – classification of DM methods, metrics for comparing the methods, finding suitable methods for a given dataset
Related Work (DM ontologies)
• existing DM ontologies
  • ontologies for classical DM – 3 stages: induction, pre- and post-processing
  • focus on hierarchy of DM algorithms and propositional dataset description
  • DAMON – KnowledgeGrid project [Cannataro & Comito]
  • DataMiningGrid application description schema [Stankovski et al.]
  • DM ontology for IDEA [Bernstein et al.]
  • myGrid ontology – for bioinformatics, includes biological domain concepts, http://www.mygrid.org.uk/ontology/
• other work towards KD process formalization
  • CinQ and IQ projects (EU FP6)
  • Sašo Džeroski: Towards a General Framework for Data Mining
Related Work (Semantic Web Service Composition)
• essentially creating workflows based on semantic description of the ingredients
• popular approach: convert the semantic description to PDDL and use suitably adapted planning techniques [Klusch et al.], [Liu et al.]
• we have adapted this approach for DM workflows using the KD ontology
• future work: individual DM algorithms as web services?
Open Issues
• Reactive planning / exploration
  • currently planning towards a desired kind of result, not its quality
• Conversion of knowledge
  • from more to less expressive
  • how can we constrain what should remain from the original information?
  • can this be done at all without semantic meta-data?
Open Issues (2)
• Tighter integration of the ontology with planning
  • currently: simple rewriting of algorithm annotations into PDDL actions
  • work in progress: the planner poses SPARQL queries to retrieve relevant actions
• Computational platform:
  • GRID or web services?