KDDML: A Middleware Language and System for Knowledge Discovery in Databases
A. Romei, S. Ruggieri, F. Turini
Dipartimento di Informatica, Università di Pisa
Thirteenth Italian Symposium on Sistemi Evoluti per Basi di Dati (SEBD-2005), Brixen, Italy – 19-22 June, 2005
Application Area: KDD
• Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and understandable patterns in data.
SEBD 2005 - Brixen, June 2005
The CRISP-DM process
• Main focus on the automatic phases:
  • Data pre-processing
  • Modeling
  • Post-processing
  • Model evaluation
In this work
• KDDML: an XML-based middleware language and system in support of the KDD process.
• KDDML as a language.
• KDDML as a system.
Requirements
• R1: a data/model repository should be available for storing the input, output and intermediate objects of the KDD process.
  • Several representations of data can be available.
  • Automatic format conversions.
  • Automatic meta-data mapping (e.g., ARFF, SQL).
• R2: logical meta-data (meta-model) should be specified in addition to the physical data (model).
• R3: mining operations should be compositional by design of the language (closure principle).
• R4: the system architecture should be highly extensible.
KDDML as an XML-based System
• XML as data/model representation (R1, R2).
• Machine-processable language.
• XML as language definition.
• Ensures compositionality of operators (R3).
• Extensibility and modularity (R4).
Data/Model Representation
Data Format
• Separating the logical metadata from the physical instances.
• The data schema is described in a proprietary XML format.
• The actual data is stored in CSV (Comma Separated Values).
• CSV was chosen as a trade-off between readability (poor in a binary file) and space occupation (high in XML).
Data Format: Example
Callouts in the original slide: Physical Data, Logical Metadata.
<KDDML_TABLE data_file="census.csv">
  <SCHEMA logical_name="census" number_of_attributes="6" number_of_instances="16">
    <ATTRIBUTE name="age" number_of_missed_values="0" type="numeric">
      <NUMERIC_DESCRIPTION mean="40.75" variance="237.8" min="18.0" max="70.0"/>
    </ATTRIBUTE>
    <ATTRIBUTE name="education" number_of_missed_values="3" type="nominal">
      <NOMINAL_DESCRIPTION number_of_values="4">
        <VALUE value="HS-grad" cardinality="3"/>
        <VALUE value="masters" cardinality="2"/>
        ….
      </NOMINAL_DESCRIPTION>
    </ATTRIBUTE>
    ….
  </SCHEMA>
</KDDML_TABLE>
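The split between logical metadata and physical data can be illustrated with a minimal Python sketch. It is not KDDML code: the two-attribute schema, the header-less CSV layout, the "?" missing-value marker, and the `describe` helper are all assumptions made for illustration. The sketch reads attribute names and types from the schema, then derives the kind of per-attribute statistics the example above records.

```python
import csv
import io
import statistics
import xml.etree.ElementTree as ET

# Hypothetical miniature of the schema above (two attributes only).
SCHEMA_XML = """
<KDDML_TABLE data_file="census.csv">
  <SCHEMA logical_name="census" number_of_attributes="2" number_of_instances="4">
    <ATTRIBUTE name="age" type="numeric"/>
    <ATTRIBUTE name="education" type="nominal"/>
  </SCHEMA>
</KDDML_TABLE>
"""

# Stand-in for the physical data; the header-less layout and the "?"
# missing-value marker are assumptions of this sketch.
CSV_TEXT = "18,HS-grad\n40,masters\n70,?\n35,HS-grad\n"


def describe(schema_xml, csv_text):
    """Compute per-attribute statistics, driven by the logical schema."""
    root = ET.fromstring(schema_xml)
    attributes = [(a.get("name"), a.get("type")) for a in root.iter("ATTRIBUTE")]
    rows = list(csv.reader(io.StringIO(csv_text)))
    summary = {}
    for column, (name, kind) in enumerate(attributes):
        present = [row[column] for row in rows if row[column] != "?"]
        missing = len(rows) - len(present)
        if kind == "numeric":
            numbers = [float(v) for v in present]
            summary[name] = {
                "missing": missing,
                "mean": statistics.mean(numbers),
                "min": min(numbers),
                "max": max(numbers),
            }
        else:
            counts = {}
            for value in present:
                counts[value] = counts.get(value, 0) + 1
            summary[name] = {"missing": missing, "cardinality": counts}
    return summary


SUMMARY = describe(SCHEMA_XML, CSV_TEXT)
```

Because the column order and types come from the schema, the CSV file itself stays a compact list of values with no embedded markup.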
Model Format
• PMML (Predictive Model Markup Language)
  • An industry standard for representing actual models as XML documents.
  • Consists of DTDs for a wide spectrum of models, including association rules (RdA), decision trees, clustering, regression, and neural networks.
  • It does not cover the process of extracting models, only the exchange of the extracted knowledge.
Model Format: Example
Callouts in the original slide: Logical Metadata, Physical Model.
<PMML version="2.0">
  ….
  <DataDictionary>
    <DataField name="id" optype="continuous" />
    …
    <DataField name="amount" optype="continuous" />
  </DataDictionary>
  <TreeModel modelName="censusTree" splitCharacteristic="multiSplit">
    <MiningSchema>
      <MiningField name="id" usageType="supplementary" />
      …
      <MiningField name="class" usageType="predicted" />
    </MiningSchema>
    <Node score="" recordCount="48842">
      <True/>
      <ScoreDistribution value="&lt;=50K" recordCount="37155" />
      ...
    </Node>
  </TreeModel>
</PMML>
Language
Closure Principle (1)
• The arguments of an operator must be of the appropriate types and appear in the appropriate order.
• The signature of an operator, op: t1 × … × tn → t, is expressed by defining a DTD for KDDML queries that constrains the sub-elements to be of types t1, … , tn.
Closure Principle (2)
<!ELEMENT TREE_CLASSIFY ((%kdd_query_trees;), (%kdd_query_table;))>
<!ATTLIST TREE_CLASSIFY xml_dest %string; #IMPLIED>
Where:
• kdd_query_trees: all operators returning a classification tree;
• kdd_query_table: all operators returning a table;
• TREE_CLASSIFY belongs to the kdd_query_table entity.
f_TREE_CLASSIFY: tree × table → table
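The effect of such typed signatures can be mimicked outside a DTD validator. The following Python sketch checks, bottom-up, that each operator receives arguments of the expected types. Only the TREE_CLASSIFY signature is stated in the slides; the other entries of `SIGNATURES` are assumptions inferred from the examples, and `result_type` is a hypothetical helper, not part of KDDML.

```python
import xml.etree.ElementTree as ET

# Operator signatures: (argument types) -> result type.
# Only TREE_CLASSIFY is given in the slides; the rest are assumptions.
SIGNATURES = {
    "TREE_CLASSIFY": (("tree", "table"), "table"),
    "TREE_MINER": (("table", "algs"), "tree"),
    "PP_SAMPLING": (("table", "algs"), "table"),
    "ARFF_LOADER": ((), "table"),
    "TABLE_LOADER": ((), "table"),
}

GOOD = """
<TREE_CLASSIFY xml_dest="results.xml">
  <TREE_MINER xml_dest="weather.xml" target_attribute="play">
    <ARFF_LOADER arff_file_name="weather.arff"/>
    <ALGORITHM algorithm_name="C4.5">
      <PARAM name="confidence_for_pruning" value="0.4"/>
    </ALGORITHM>
  </TREE_MINER>
  <TABLE_LOADER xml_source="weather_test.xml"/>
</TREE_CLASSIFY>
"""

# Ill-typed: the first argument is a table where a tree is required.
BAD = """
<TREE_CLASSIFY>
  <TABLE_LOADER xml_source="a.xml"/>
  <TABLE_LOADER xml_source="b.xml"/>
</TREE_CLASSIFY>
"""


def result_type(element):
    """Return the type produced by an operator, checking its arguments."""
    if element.tag == "ALGORITHM":
        # PARAM children are settings, not typed arguments.
        return "algs"
    expected, produced = SIGNATURES[element.tag]
    actual = tuple(result_type(child) for child in element)
    if actual != expected:
        raise TypeError(f"{element.tag} expects {expected}, got {actual}")
    return produced
```

Since every operator yields one of the known types, the result of any well-typed query can itself be used as an argument of another operator, which is the closure property the DTD enforces at validation time.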
KDDML Types
• The set of types of KDDML operators consists of:
  • Table, PPtable
  • Tree, clusters, rda, sequence, hierarchy
  • Algs, condition, expression
KDDML Query structure
<OPERATOR_NAME xml_dest="results.xml" att1="v1" ... attM="vM">
  <ARG1_NAME> .... </ARG1_NAME>
  ...
  <ARGn_NAME> .... </ARGn_NAME>
</OPERATOR_NAME>
• The structure of a KDDML query has a precise format.
• XML element tags correspond to operations on data and models;
• XML attributes correspond to the parameters of those operations;
• XML sub-elements define the arguments passed to the operators (KDDML Types).
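The mapping from XML syntax to operations, parameters and arguments can be made concrete with a short Python sketch. The `decompose` helper and the abridged query are illustrative assumptions; the sketch treats every element uniformly, including ALGORITHM and PARAM.

```python
import xml.etree.ElementTree as ET

# Abridged query for illustration only; it may not validate against the
# actual KDDML DTD.
QUERY = """
<TREE_CLASSIFY xml_dest="results.xml">
  <TREE_MINER xml_dest="weather.xml" target_attribute="play">
    <ARFF_LOADER arff_file_name="weather.arff"/>
    <ALGORITHM algorithm_name="C4.5">
      <PARAM name="num_instances_for_leaf" value="6"/>
    </ALGORITHM>
  </TREE_MINER>
  <TABLE_LOADER xml_source="weather_test.xml"/>
</TREE_CLASSIFY>
"""


def decompose(element):
    """Split an element into operator name, parameters and arguments."""
    return {
        "operator": element.tag,               # tag -> operation
        "parameters": dict(element.attrib),    # attributes -> parameters
        "arguments": [decompose(child) for child in element],  # sub-elements
    }


PARSED = decompose(ET.fromstring(QUERY))
```

Here `PARSED["operator"]` is `TREE_CLASSIFY`, its parameters hold `xml_dest`, and its two arguments are the decomposed `TREE_MINER` and `TABLE_LOADER` sub-queries.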
Example (1)
• Construction and application of a decision tree.
  • Loading of an ARFF source as training set.
  • Simple sampling on the training set.
  • Construction of a decision tree on the sampled training set.
    • Target attribute: play.
    • Algorithm: C4.5.
  • Loading of a test set from the system repository.
  • Application of the decision tree to the test set.
Example (2)
[Figure: query pipeline. Repository (ARFF) → Arff Loader (source: weather.arff) → Sampling (alg: simple sampling, percentage: 66%) → Tree Miner (alg: c4.5, pruning confidence: 40%, num instances: 6) → Tree Classify, which also receives Table Loader (source: weather_test.xml) reading from Repository (Data).]
<KDDML_OBJECT>
  <KDD_QUERY name="sample">
    <TREE_CLASSIFY xml_dest="results.xml">
      <TREE_MINER xml_dest="weather.xml" target_attribute="play">
        <PP_SAMPLING>
          <ARFF_LOADER arff_file_name="weather.arff"/>
          <ALGORITHM algorithm_name="simple_sampling">
            <PARAM name="percentage" value="0.66"/>
          </ALGORITHM>
        </PP_SAMPLING>
        <ALGORITHM algorithm_name="C4.5">
          <PARAM name="confidence_for_pruning" value="0.4"/>
          <PARAM name="num_instances_for_leaf" value="6"/>
        </ALGORITHM>
      </TREE_MINER>
      <TABLE_LOADER xml_source="weather_test.xml"/>
    </TREE_CLASSIFY>
  </KDD_QUERY>
</KDDML_OBJECT>
Language Operators
• Data/Model access.
• Preprocessing.
  • Data Cleaning, Sampling, Normalization, Discretization.
• Model Extraction.
• Model application and evaluation.
• Model meta-reasoning & filtering.
Example one: Discretization
Discretization of the numeric attribute "age" into three intervals using the natural binning method.
....
<PP_NUMERIC_DISCRETIZATION xml_dest="census_discrete.xml"
                           attribute_name="age"
                           label_type="enumeration"
                           enumerated_label_list="young, middle, old">
  <TABLE_LOADER xml_source="census.xml"/>
  <ALGORITHM algorithm_name="natural_binning">
    <PARAM name="cardinality" value="3"/>
    <PARAM name="having_number_of_intervals" value="true"/>
  </ALGORITHM>
</PP_NUMERIC_DISCRETIZATION>
....
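The slides do not define natural binning. The Python sketch below assumes it means equal-width binning (the range of the attribute split into intervals of the same width), which may differ from what KDDML actually does. The `equal_width_bins` helper and the sample ages are hypothetical.

```python
def equal_width_bins(values, labels):
    """Map each numeric value to the label of its equal-width interval."""
    low, high = min(values), max(values)
    width = (high - low) / len(labels)
    result = []
    for value in values:
        # Position of the interval containing the value.
        index = int((value - low) / width) if width > 0 else 0
        # The maximum falls on the upper edge; keep it in the last interval.
        index = min(index, len(labels) - 1)
        result.append(labels[index])
    return result


# Illustrative ages only (not the census data from the slides).
AGES = [18, 25, 36, 40, 52, 70]
LABELS = ["young", "middle", "old"]
DISCRETE = equal_width_bins(AGES, LABELS)
```

With these values the range 18 to 70 is cut at roughly 35.3 and 52.7, so 18 and 25 become "young", 36, 40 and 52 become "middle", and 70 becomes "old".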
Example two: RdA (association rule) filtering
Selects the rules that have the item "bread" in the body, do not have the item "milk" in the head, have exactly two items in the head, and have support greater than 30%.
....
<RDA_FILTER>
  <RDA_LOADER xml_source="rules.xml"/>
  <CONDITION>
    <AND_COND>
      <BASE_COND op_type="is_in" term1="@body" term2="bread"/>
      <BASE_COND op_type="is_not_in" term1="@head" term2="milk"/>
      <BASE_COND op_type="equal" term1="@head_cardinality" term2="2"/>
      <BASE_COND op_type="greater" term1="@support" term2="0.3"/>
    </AND_COND>
  </CONDITION>
</RDA_FILTER>
....
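The same four-part condition can be written as a plain predicate over rule objects. This Python sketch only illustrates the semantics of the filter; the `Rule` class, the `keep` function and the sample rules are hypothetical and are not how KDDML represents rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """An association rule body -> head with its support (hypothetical type)."""
    body: frozenset
    head: frozenset
    support: float


def keep(rule):
    """Conjunction of the four base conditions of the query above."""
    return (
        "bread" in rule.body          # is_in @body bread
        and "milk" not in rule.head   # is_not_in @head milk
        and len(rule.head) == 2       # equal @head_cardinality 2
        and rule.support > 0.3        # greater @support 0.3
    )


# Made-up rules for illustration.
RULES = [
    Rule(frozenset({"bread", "butter"}), frozenset({"jam", "eggs"}), 0.35),
    Rule(frozenset({"bread"}), frozenset({"milk", "eggs"}), 0.50),
    Rule(frozenset({"bread"}), frozenset({"jam"}), 0.40),
    Rule(frozenset({"beer"}), frozenset({"chips", "salsa"}), 0.60),
    Rule(frozenset({"bread"}), frozenset({"jam", "eggs"}), 0.30),
]

SELECTED = [rule for rule in RULES if keep(rule)]
```

Only the first rule survives: the others fail on "milk" in the head, on head cardinality, on the missing "bread" item, and on support not strictly above 0.3, respectively.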
System Architecture
Design targets
• Extensibility
  • Data sources
  • Algorithms
  • Models
• Portability
• Modularity.
  • Architecture structured in 3 layers.
Architecture Layers
[Figure: layer stack, top to bottom: upper layers, Interpreter Layer, Operators Layer, Repository Layer, Data and Models.]
• Repository Layer:
  • Manages read/write access to the data and models repository.
  • Manages read/write access to data and models from external sources.
  • Provides programmatic functionality to the higher layers.
• Interpreter Layer:
  • Accepts a validated KDDML query and returns the result as an XML document.
  • Recursively traverses the DOM tree representation of the query.
  • The interpreter is not affected by data/algorithm/model extensions.
• Operators Layer:
  • Implementation of the language operators.
  • Each <OPERATOR_NAME> is implemented as a Java class satisfying an interface.
  • The interface is task-dependent.
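The separation between interpreter and operators can be sketched as a recursive evaluator over a registry of operator implementations. The sketch is in Python rather than the Java used by the system, and the three TOY_ operators, the `operator` decorator and `interpret` are invented for illustration; none of them is part of KDDML.

```python
import xml.etree.ElementTree as ET

# Registry mapping an element name to its implementation.
OPERATORS = {}


def operator(name):
    """Register a function as the implementation of an operator."""
    def register(function):
        OPERATORS[name] = function
        return function
    return register


@operator("TOY_TABLE")
def toy_table(attributes, arguments):
    # A "table" here is just a list of integers.
    return [int(v) for v in attributes["values"].split(",")]


@operator("TOY_FILTER")
def toy_filter(attributes, arguments):
    (table,) = arguments
    return [v for v in table if v >= int(attributes["min"])]


@operator("TOY_UNION")
def toy_union(attributes, arguments):
    left, right = arguments
    return sorted(set(left) | set(right))


def interpret(element):
    """Evaluate sub-elements first, then apply the operator to the results."""
    arguments = [interpret(child) for child in element]
    return OPERATORS[element.tag](element.attrib, arguments)


TOY_QUERY = """
<TOY_UNION>
  <TOY_FILTER min="3">
    <TOY_TABLE values="1,2,3,4"/>
  </TOY_FILTER>
  <TOY_TABLE values="4,5"/>
</TOY_UNION>
"""

RESULT = interpret(ET.fromstring(TOY_QUERY))
```

Adding a new operator only means registering another function; `interpret` itself does not change, which mirrors the claim that the interpreter is unaffected by extensions.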
KDDML as Middleware System
[Figure: the Interpreter, Operators and Repository layers (over Data and Models) with two upper layers on top: a high-level GUI and a compiler. The compiler receives an MQL query and emits a KDDML query; KDDML results are returned as MQL results.]
Experiences with KDDML
ClickWorld
• Extracts data mining (DM) models from visits to a city-news portal in order to characterize the topics of interest of new visitors.
• M. Baglioni, U. Ferrara, A. Romei, S. Ruggieri, F. Turini. Preprocessing and mining web log data for web personalization. 8th Italian Conf. on Artificial Intelligence: 237-249. Vol. 2829 of LNCS, September 2003.
KDDML-G
[Figure: operators OP, OP1, OP2, OP3.]
• A system for KDD on the GRID.
• Exploits the parallelism offered by the GRID.
• Preserves data immovability by moving the code to where the data resides.
Download KDDML
http://kdd.di.unipi.it/kddml/
GNU General Public License (GPL)