2. Data Mining & Knowledge Discovery: A Review of Issues and a Multi-strategy Approach
Ryszard S. Michalski and Kenneth A. Kaufman
[Overview] Emergence of a new research area: Data mining & Knowledge discovery
[2.1] Introduction How can useful, task-oriented knowledge be extracted from abundant raw data? Traditional/current methods Limitation: primarily oriented toward the explanation of quantitative & statistical data characteristics
Continued Traditional statistical methods can characterize a data set numerically (e.g., global summary statistics over a class of objects), but they cannot produce qualitative, symbolic descriptions of the regularities hidden in the data.
Continued Moreover, traditional methods cannot, on their own, take domain knowledge into account and automatically generate relevant attributes. Goal of research in this field: to develop computational models for acquiring knowledge from data and background knowledge
Continued • Apply machine learning together with traditional methods to derive task-oriented data characteristics and generalizations. • 'Task-oriented' implies that different knowledge may need to be derived from the same data, which in turn calls for a multi-strategy approach (different tasks require different data exploration and knowledge generalization operators). • The goal of the multi-strategy approach is to obtain knowledge in a form similar to the data descriptions a human expert would produce. • Main constraint: • The knowledge descriptions must be easy for a domain expert to understand and interpret.
Continued Distinction between Data mining & Knowledge discovery D-M: the application of machine learning and other methods to the enumeration of patterns over the data K-D: the whole process of the data analysis lifecycle
[2.2] Machine learning & multi-strategy data exploration • Two points to be explained here • The relationship between machine learning methodology and the goals of Data mining and Knowledge discovery • How methods of symbolic machine learning can be used to (semi-)automate tasks involving conceptual exploration of data and generation of task-oriented knowledge from data
[2.2.1] Determining general rules from specific cases • Multi-strategy data exploration is based on "symbolic inductive learning" • Two types of data exploration operators • (1) Operators for defining general symbolic descriptions of a designated group or groups of entities in a data set • Describe the characteristics common to the entities within each group • A mechanism called 'constructive induction' makes it possible to use abstract concepts that are not present in the original data • Learning "characteristic concept descriptions"
Continued (2) Operators for defining differences between different groups of entities Learning "discriminant concept descriptions" (a toy sketch of both kinds of descriptions follows below) • Basic assumptions in concept learning • Examples contain no errors. • All attributes have specified values. • All examples are located in the same database. • All concepts have a precise (crisp) description that does not change over time.
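As a toy illustration of the two kinds of descriptions (not the authors' AQ-style algorithm), the sketch below computes a characteristic description as the attribute-value pairs shared by all entities in a group, and a discriminant description as the shared pairs that never occur in the other groups. The groups and attributes are hypothetical.

```python
# Minimal sketch: characteristic vs. discriminant concept descriptions
# over hypothetical symbolic data (not the paper's implementation).

group_a = [
    {"shape": "round", "color": "red", "size": "large"},
    {"shape": "round", "color": "red", "size": "small"},
]
group_b = [
    {"shape": "square", "color": "red", "size": "large"},
]

def characteristic(group):
    """Attribute-value pairs common to every entity in the group."""
    common = set(group[0].items())
    for entity in group[1:]:
        common &= set(entity.items())
    return common

def discriminant(group, others):
    """Common pairs of the group that never occur in the other groups."""
    seen_elsewhere = {pair for entity in others for pair in entity.items()}
    return characteristic(group) - seen_elsewhere

print(characteristic(group_a))         # {('shape', 'round'), ('color', 'red')}
print(discriminant(group_a, group_b))  # {('shape', 'round')}
```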
Continued • Integrating qualitative & quantitative discovery • : To find sets of equations for a given set of data points, together with the qualitative conditions under which each equation applies. • Qualitative prediction • : Find patterns in a sequence/process and use them to qualitatively predict plausible future inputs (rather than specific numeric values).
[2.2.2] Conceptual clustering • Another class of machine learning methods related to D-M & K-D. • Similar to traditional cluster analysis, yet quite different. • Difference between Conceptual & Traditional clustering • In Traditional clustering: the similarity measure is a function only of the properties (attribute values) of the two entities. • Similarity(A,B) = f(properties)
Continued • In Conceptual clustering: the similarity measure is a function of the properties of the entities and of two other factors: the description language L available for characterizing clusters and the environment E (the surrounding entities). Conceptual cohesiveness(A,B) = f(properties, L, E) Fig. An illustration of the difference between closeness and conceptual cohesiveness Two points A and B may be placed in the same cluster from the viewpoint of a traditional method, but in different clusters by conceptual clustering (see the sketch below).
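The toy sketch below (my own illustration, not the algorithm from the paper) contrasts a purely distance-based similarity with a cohesiveness score that also consults a hypothetical description language L and environment E: a point on the outer ring is spatially closer to an inner-ring point than to other outer-ring points, yet no available concept covers both, so their conceptual cohesiveness is zero.

```python
import math

# Two rings of points: an inner ring (radius 1) and an outer ring (radius 3).
inner = [(math.cos(t), math.sin(t)) for t in (0.0, 1.5, 3.0, 4.5)]
outer = [(3 * math.cos(t), 3 * math.sin(t)) for t in (0.2, 1.7, 3.2, 4.7)]
E = inner + outer                       # environment: all entities

# Description language L: each concept is a predicate over a point.
L = {
    "inner_ring": lambda p: abs(math.hypot(*p) - 1.0) < 0.3,
    "outer_ring": lambda p: abs(math.hypot(*p) - 3.0) < 0.3,
}

def similarity(a, b):
    """Traditional clustering: a function of the two entities' properties only."""
    return -math.dist(a, b)

def conceptual_cohesiveness(a, b, concepts=L, environment=E):
    """Nonzero only if some concept in L covers both entities; tighter
    concepts (covering less of the environment E) score higher."""
    best = 0.0
    for covers in concepts.values():
        if covers(a) and covers(b):
            coverage = sum(covers(p) for p in environment) / len(environment)
            best = max(best, 1.0 - coverage)
    return best

A, B = inner[0], outer[0]
print(similarity(A, B), similarity(B, outer[1]))   # B is closer to A than to outer[1]
print(conceptual_cohesiveness(A, B))               # 0.0: no concept covers both
print(conceptual_cohesiveness(B, outer[1]))        # 0.5: both lie on the outer ring
```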
[2.2.3] Constructive induction • In learning rules or decision trees from examples, the initially given attributes may be only indirectly relevant, or even irrelevant, to the learning problem at hand. • Advantage of symbolic methods over statistical methods: symbolic methods can determine non-essential attributes more easily than statistical methods. • How to improve the representation space (a small sketch of these three steps follows below) • (1) Removing less relevant attributes. • (2) Generating new relevant attributes. • (3) Abstracting attributes (grouping some attribute values). • "Constructive induction" consists of two phases • (1) Construction of the best representation space • (2) Generation of the best hypothesis in the space found above
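A minimal sketch, assuming a hypothetical table of attribute dictionaries, of the three representation-space improvements listed above: dropping a less relevant attribute, constructing a new one from existing ones, and abstracting an attribute's values into coarser groups.

```python
# Illustrative sketch of constructive induction steps (not the authors' code);
# all attribute names and the relevance judgments are assumptions.

records = [
    {"height_cm": 178, "weight_kg": 82, "eye_color": "blue",  "age": 34},
    {"height_cm": 165, "weight_kg": 90, "eye_color": "brown", "age": 61},
]

def remove_attributes(recs, irrelevant):
    """(1) Drop attributes judged least relevant to the learning task."""
    return [{k: v for k, v in r.items() if k not in irrelevant} for r in recs]

def add_bmi(recs):
    """(2) Construct a new, potentially more relevant attribute
    from existing ones (here: body-mass index)."""
    for r in recs:
        r["bmi"] = round(r["weight_kg"] / (r["height_cm"] / 100) ** 2, 1)
    return recs

def abstract_age(recs):
    """(3) Abstract an attribute by grouping its values into coarser ranges."""
    for r in recs:
        r["age_group"] = "young" if r["age"] < 40 else "old"
    return recs

improved = abstract_age(add_bmi(remove_attributes(records, {"eye_color"})))
print(improved)
```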
[2.2.4] Selection of the most representative examples Databases are usually very large => the process of determining and generating patterns/rules is quite time-consuming. Therefore, extracting the most representative cases of the given classes is necessary to make the process more efficient. [2.2.5] Integration of Qualitative & Quantitative discovery For a database containing numerical attributes, quantitative discovery can find equations that describe the relationships among these attributes; however, a single fixed quantitative equation cannot account for different qualitative conditions, so a method is needed that determines the quantitative equation according to the qualitative condition (see the sketch below). [2.2.6] Qualitative prediction The goal is not to predict a specific value of a variable (as in time series analysis), but to qualitatively describe a plausible future object.
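A minimal sketch of the idea in [2.2.5], under assumed toy data: the rows are split by a qualitative condition (here a "material" attribute) and a separate linear equation is fitted for each condition. The attribute names and values are invented for illustration only.

```python
from statistics import mean

# Hypothetical observations: (material, temperature, measured_length)
data = [
    ("steel", 0, 100.0), ("steel", 50, 100.06), ("steel", 100, 100.12),
    ("brass", 0, 100.0), ("brass", 50, 100.10), ("brass", 100, 100.19),
]

def fit_line(points):
    """Least-squares fit y = a*x + b for a list of (x, y) pairs."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    a = sum((x - mx) * (y - my) for x, y in points) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# One quantitative equation per qualitative condition (the material).
rules = {}
for material in {m for m, _, _ in data}:
    subset = [(t, y) for m, t, y in data if m == material]
    rules[material] = fit_line(subset)

for material, (a, b) in sorted(rules.items()):
    print(f"IF material = {material} THEN length = {a:.5f} * temp + {b:.2f}")
```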
[2.2.7] Summarizing the ML-oriented approach • Traditional statistical methods • Oriented towards numerical characterization of a data set • Used for globally characterizing a given class of objects • Machine learning methods • Primarily oriented towards symbolic, logic-style descriptions of data • Can determine descriptions for predicting the class membership of future objects A multi-strategy approach combining the two is nevertheless necessary, since different types of questions require different exploratory strategies.
[2.3] Classification of data exploration tasks How can the GDT (General Data Table) be used to relate machine learning techniques to data exploration problems? (1) Learning rules from examples Take one discrete attribute as the output attribute and the remaining attributes as inputs; using a given set of rows as training examples, determine the relationships (rules) between them (see the sketch below). => This can be done with respect to any of the attributes. (2) Determining time-dependent patterns Detection of temporal patterns in sequences of data arranged along the time dimension of a GDT. Uses the multi-model method for qualitative prediction and the temporal constructive induction technique. (3) Example selection Select the rows of the table corresponding to the most representative examples of the different classes.
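As a rough illustration of task (1), the sketch below induces simple IF-THEN rules from a tiny hypothetical GDT by keeping single-condition rules that are consistent over all training rows. It is a toy covering-style learner, not the rule-learning method used by the authors.

```python
rows = [
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no",  "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]

def learn_rules(rows, output):
    """Keep every single-condition rule over an input attribute whose
    covered rows all agree on the output attribute."""
    inputs = [a for a in rows[0] if a != output]
    rules = []
    for attr in inputs:
        for value in {r[attr] for r in rows}:
            covered = [r for r in rows if r[attr] == value]
            classes = {r[output] for r in covered}
            if len(classes) == 1:              # condition is decisive
                rules.append((attr, value, classes.pop()))
    return rules

for attr, value, cls in learn_rules(rows, "play"):
    print(f"IF {attr} = {value} THEN play = {cls}")
# Yields: IF windy = no THEN play = yes, and IF windy = yes THEN play = no
```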
Continued (4) Attribute selection Also called feature selection: remove the columns corresponding to the attributes least relevant to the learning task. Attribute selection criteria such as the gain ratio or the promise level are typically used (see the sketch below). (5) Generating new attributes Using the constructive induction described earlier, generate new relevant attributes from the initially given ones. (6) Clustering Using the conceptual clustering described earlier, partition the rows of the GDT into the desired groups (clusters). => The rules describing the resulting clusters are stored in the knowledge base. (7) Determining attribute dependencies Determine relationships (e.g., correlation, causal dependencies, logical dependencies) among attributes (columns) using statistical/logical methods.
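For task (4), here is a small sketch of ranking attributes by gain ratio, one of the selection criteria named above (the promise level is not shown). The toy table is the same hypothetical one used in the previous sketch; the lowest-scoring attribute would be the first candidate for removal.

```python
from math import log2
from collections import Counter

rows = [
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no",  "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in counts.values())

def gain_ratio(rows, attr, target):
    """Information gain of splitting on attr, normalized by the split information."""
    base = entropy([r[target] for r in rows])
    remainder, split_info = 0.0, 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        p = len(subset) / len(rows)
        remainder += p * entropy(subset)
        split_info -= p * log2(p)
    gain = base - remainder
    return gain / split_info if split_info else 0.0

for attr in ("outlook", "windy"):
    print(attr, round(gain_ratio(rows, attr, "play"), 3))
# "windy" scores 1.0 here, "outlook" scores 0.0, so "outlook" would be dropped first.
```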
Continued (8) Incremental rule update Update the working knowledge (rules) to accommodate new information. (9) Searching for approximate patterns in (imperfect) data Determine the best hypothesis that accounts for most of the available data. (10) Filling in missing data Determine plausible values for the missing entries through analysis of the currently available data (see the sketch below). (11) Determining decision structures for declarative knowledge (decision rules) Once general decision rules have been hypothesized for a given data set (GDT), it is desirable to convert them into a decision tree (decision structure) so that they can be used to make predictions for new cases.
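For task (10), a toy sketch (not the paper's method) that fills a missing entry with the most common value among complete rows agreeing on a chosen context attribute (here the class attribute). The table and attribute names are hypothetical.

```python
from collections import Counter

rows = [
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no",  "play": "yes"},
    {"outlook": "rainy", "windy": None,  "play": "no"},   # missing entry
]

def fill_missing(rows, attr, context):
    """Estimate a missing value of `attr` from complete rows that agree with
    the incomplete row on the `context` attributes."""
    for row in rows:
        if row[attr] is None:
            peers = [r[attr] for r in rows
                     if r[attr] is not None
                     and all(r[c] == row[c] for c in context)]
            if peers:
                row[attr] = Counter(peers).most_common(1)[0][0]
    return rows

print(fill_missing(rows, "windy", context=["play"]))
# The missing "windy" value becomes "yes", the value seen in the other "play = no" row.
```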