Introduction

Introduction

Introduction • Data Mining and Knowledge Discovery • Data Mining Methods • Supervised Learning • Unsupervised Learning • Other Learning Paradigms • Introduction to Data Preprocessing

Data Mining and Knowledge Discovery • Vast amounts of data are around us in our world, raw data that is mainly intractable for human or manual applications. • Data Mining (DM) is about solving problems by analyzing data present in real databases. • DM is distinghished as synonym of the Knowledge Discovery in Databases (KDD) process, or as the main step of KDD.

Data Mining and Knowledge Discovery • KDD definition: “the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” • Stages: • Problem Specification. • Problem Understanding. • Data Preprocessing. • Data Mining. • Evaluation. • Results Exploitation.

Data Mining and Knowledge Discovery KDD process:

Data Mining Methods

Data Mining Methods • Statistical Methods: • Regression Models: • They are used in estimation tasks, requiring the class of equation modelling to be used. • Linear, quadratic and logistic regression are the most well known. • They may have problems with missing values, outliers and redundant/harmful features.

Data Mining Methods • Statistical Methods: • Artificial Neural Networks (ANNs): • They are powerful mathematical models suitable for almost all DM tasks, especially predictive one. • Multi-layer perceptron (MLP), Radial Basis Function Networks (RBFNs) and Learning Vector Quantization (LVQ) are the most well known. • They require numeric attributes and may have problems with missing values. • They are robust to outliers and noise.

Data Mining Methods • Statistical Methods: • Bayesian Learning: • It uses the probability theory as a framework for making rational decisions under uncertainty. • Naïve Bayes is the most well known technique. • They are very sensitive to the redundancy and usefulness of some of the attributes and examples from the data, together with noisy and outliers examples. • They require nominal attributes and cannot deal with missing values.

Data Mining Methods • Statistical Methods: • Instance-based Learning: • The examples are stored verbatim, and a distance function is used to determine which members of the database are closest to a new example with a desirable prediction. • The K-Nearest Neighbor (KNN) is the most representative method. • They are good candidates to be improved through data reduction procedures.

Data Mining Methods • Statistical Methods: • Support Vector Machines (SVMs): • They are machine learning algorithms based on learning theory and similar to ANNs in the sense that they are used for estimation and perform very well when data is linearly separable. • They require numeric non-missing data and are commonly robust against noise and outliers.

Data Mining Methods • Symbolic Methods: • Rule Learning: • They search for a rule that explains some part of the data, separate these examples and recursively conquer the remaining examples. • AQ, CN2, RIPPER, PART and FURIA are good examples of this family. • They require numeric non-missing data and are commonly robust against noise and outliers. • They require nominal data (sometimes with an implicit process) and dispose of an innate selector of interesting attributes from data.

Data Mining Methods • Symbolic Methods: • Decision Trees: • They construct predictive models formed by iterations of a divide and conquer scheme of hierarchical decisions. • CART, C4.5 and PUBLIC are good examples of this family. • They are closely related to rule learning methods and suffer from the same disadvantage as them.

Data Mining Methods • Data descriptive tasks: • Clustering: • It appears when there is no class information to be predicted but the examples must be divided into natural groups or clusters. • Well known examples of clustering algorithms are k-Means, COBWEB and Self Organizing Maps. • They prefer numeric data together with no-missing data and the absence of noise and outliers.

Data Mining Methods • Data descriptive tasks: • Association Rules: • Set of techniques that aim to find association relationships in the data. • The Apriori technique is the most emblematic technique to address this problem. • Data transformation (mainly discretization) and reduction is often needed to perform high quality analysis in this DM problem.

Supervised Learning • Prediction methods are commonly referred to as supervised learning. Supervised methods are thought to attempt the discovery of the relationships between input attributes and a target attribute. • A training set is given and the objective is to form a description that can be used to predict unseen examples.

Supervised Learning • Problems: • Classification • The domain of the target attribute is finite and categorical. • A classifier must assign a class to a unseen example. • Regression • The target attribute is formed by infinite values. • To fit a model to learn the output target attribute as a function of input attributes. • Time Series Analysis • Making predictions in time.

Unsupervised Learning • There is no supervisor and only input data is available. • The aim is now to find regularities, irregularities, relationships, similarities and associations in the input.

Unsupervised Learning • Problems: • Clustering • Association Rules • Pattern Mining • It is adopted as amore general term than frequent pattern mining or association mining. • Outlier Detection • Ot is the process of finding data examples with behaviours that are very different from the expectation (outliers or anomalies).

Other Learning Paradigms • Imbalanced Learning • A classification problem where the data has exceptional distribution on the target attribute. • The number of examples representing the class of interest is much lower than that of the other classes. • Multi-instance Learning • imposed restrictions on models in which each example consists of a bag of instances instead of an unique instance.

Other Learning Paradigms • Multi-label Classification • Each instance is associated not with a class, but instead with a subset of them. • Semi-supervised Learning • It is concerned with the design of models in the presence of both labeled and unlabeled data. • Semi-supervised classification and Semi-supervised clustering. • Relationship with Active Learning.

Other Learning Paradigms • Subgroup Discovery • It is formed as the result of the hybridization between classification and association mining. • They aim to extract interesting rules with respect to a target attribute. • Transfer Learning • Aims to extract the knowledge from one or more source tasks and apply the knowledge to a target task. • The so-called data shift problem is closely related.

Other Learning Paradigms • Data Stream Learning • When all data is not available at a specific moment, it is necessary to develop learning algorithms that treat the input as a continuous data stream. • Each instance can be inspected only once and must then be discarded to make room for subsequent instances.

Introduction to Data Preprocessing • Unfortunately, real-world databases are highly influenced by negative factors such the presence of noise, missing values, inconsistent and superfluous data and huge sizes in both dimensions, examples and features. • Low-quality data will lead to low-quality DM performance.

Introduction to Data Preprocessing • Forms of Data Preparation

Introduction to Data Preprocessing • Data Cleaning • Correct bad data, filter some incorrect data out of the data set and reduce the unnecessary detail of data. • Data Transformation • The data is consolidated so that the mining process result could be applied or may be more efficient. • Data Integration • Merging of data from multiple data stores.

Introduction to Data Preprocessing • Data Normalization • To express data in the same measurements units, scale or range. • Missing Data Imputation • To fill the variables that contain missing values with some intuitive data. • Noise Identification • To detect random errors or variances in a measured variable.

Introduction to Data Preprocessing Forms of Data Reduction

Introduction to Data Preprocessing • Feature Selection • Achieves the reduction of the data set by removing irrelevant or redundant features (or dimensions). • Instance Selection • Consists of choosing a subset of the total available data to achieve the original purpose of the DM application as if the whole data had been used.

Introduction to Data Preprocessing • Discretization • Transforms quantitative data into qualitative data, that is, numerical attributes into nominal attributes with a finite number of intervals. • Feature Extraction/Instance Generation • Extends both the feature and instance selection by allowing the modification of the internal values that represent each example or attribute.

Introduction

Introduction

Presentation Transcript

Introduction to introduction to introduction to … Optimization

INTRODUCTION/ INTRODUCTION

Introduction

INTRODUCTION

Introduction

Introduction