Data Transformation and Feature Selection/Extraction

Qiang Yang. Thanks: J. Han, Isabelle Guyon, Martin Bachler. Data Mining: Concepts and Techniques

Continuous Attribute Temperature

Discretization
• Three types of attributes:
  • Nominal: values from an unordered set
    • Example: attribute “outlook” from weather data
    • Values: “sunny”, “overcast”, and “rainy”
  • Ordinal: values from an ordered set
    • Example: attribute “temperature” in weather data
    • Values: “hot” > “mild” > “cool”
  • Continuous: real numbers
• Discretization:
  • Divide the range of a continuous attribute into intervals
  • Some classification algorithms only accept categorical attributes
  • Discretization also reduces data size
  • Supervised (entropy) vs. unsupervised (binning)

Simple Discretization Methods: Binning
• Equal-width (distance) partitioning:
  • Divides the range into N intervals of equal size (a uniform grid)
  • If A and B are the lowest and highest values of the attribute, the width of each interval is W = (B - A)/N
  • The most straightforward method
  • But outliers may stretch the range so that most values fall into a few intervals; skewed data is not handled well
• Equal-depth (frequency) partitioning:
  • Divides the range into N intervals, each containing approximately the same number of samples
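The two binning schemes above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `equal_width_bins` and `equal_depth_bins` and the sample values are made up for this example, and bins are numbered from 0.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # W = (B - A) / N
    if width == 0:
        return [0] * len(values)  # all values identical: a single bin
    # The maximum value would fall into bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_depth_bins(values, n_bins):
    """Assign each value to one of n_bins bins holding roughly equal counts."""
    # Rank the values, then cut the ranking into n_bins consecutive chunks.
    # Assumption of this sketch: tied values may end up in different bins.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


data = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # hypothetical values with one outlier
print(equal_width_bins(data, 3))  # the outlier leaves the middle bin empty
print(equal_depth_bins(data, 3))  # three values per bin
```

With the outlier 100 in the data, equal-width binning puts every other value into the first interval, while equal-depth binning still spreads the samples evenly.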

Histograms
• A popular data reduction technique
• Divide data into buckets and store the average (or sum) for each bucket
• Can be constructed optimally in one dimension using dynamic programming
• Related to quantization problems

Supervised Method: Entropy-Based Discretization
• Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is E(S, T) = (|S1|/|S|) * ent(S1) + (|S2|/|S|) * ent(S2)
• The boundary T that minimizes this entropy over all possible boundaries is selected as a binary discretization.
• Greedy method:
  • Candidate boundaries T are scanned from the smallest to the largest value of attribute A, and the process is applied recursively to the resulting intervals until some stopping criterion is met, e.g., the entropy reduction ent(S) - E(S, T) falls below some user-given threshold
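One step of the boundary search described above can be sketched as follows. The sketch assumes the usual weighted form of the post-split entropy, (|S1|/|S|) * ent(S1) + (|S2|/|S|) * ent(S2), base-2 logarithms, and candidate boundaries at midpoints between consecutive distinct values; the function names and the temperature readings are hypothetical.

```python
import math
from collections import Counter


def ent(labels):
    """Entropy (in bits) of the class distribution in a list of labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_boundary(values, labels):
    """Return (T, E) for the boundary T with the lowest post-split entropy E."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_e = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary can separate two equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint candidate
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        e = len(left) / n * ent(left) + len(right) / n * ent(right)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e


temps = [64, 65, 68, 69, 70, 71]              # hypothetical readings
play = ["yes", "yes", "yes", "no", "no", "no"]
print(best_boundary(temps, play))  # boundary between 68 and 69, entropy 0
```

A full discretizer would call `best_boundary` again on each of the two resulting intervals until the chosen stopping criterion is met.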

How to Calculate ent(S)?
• Given two classes, Yes and No, in a set S:
  • Let p1 be the proportion of Yes
  • Let p2 be the proportion of No
  • p1 + p2 = 100%
• Entropy is: ent(S) = -p1*log(p1) - p2*log(p2)
• When p1 = 1 and p2 = 0, ent(S) = 0 (taking 0*log(0) to be 0)
• When p1 = 50% and p2 = 50%, ent(S) is at its maximum (1 with base-2 logarithms)
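As a quick check of the two boundary cases, the two-class entropy can be computed directly. This sketch assumes base-2 logarithms and the convention 0*log(0) = 0; the name `binary_entropy` is made up for illustration.

```python
import math


def binary_entropy(p1):
    """ent(S) = -p1*log2(p1) - p2*log2(p2), with p2 = 1 - p1."""
    total = 0.0
    for p in (p1, 1.0 - p1):
        if p > 0:  # convention: a class with zero probability contributes 0
            total -= p * math.log2(p)
    return total


print(binary_entropy(1.0))  # pure set
print(binary_entropy(0.5))  # evenly mixed set
```

The pure set gives 0 and the evenly mixed set gives 1, the maximum for two classes.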

Transformation: Normalization
• Min-max normalization: v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
• Z-score normalization: v' = (v - mean_A) / stddev_A
• Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
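The three normalizations named above can be sketched with their standard textbook definitions. The function names are hypothetical, the z-score here uses the population standard deviation (an assumption; the sample standard deviation is also common), and the sketch assumes the attribute is not constant.

```python
import statistics


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so they span [new_min, new_max]."""
    lo, hi = min(values), max(values)  # assumes hi > lo
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Center on the mean and divide by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumes std > 0
    return [(v - mean) / std for v in values]


def decimal_scaling(values):
    """Divide by 10**j for the smallest integer j with max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


print(min_max([10, 20, 30]))
print(z_score([10, 20, 30]))
print(decimal_scaling([-986, 917]))
```

For the values -986 and 917, decimal scaling picks j = 3, so every scaled value has absolute value below 1.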

Transforming Ordinal to Boolean
• A simple transformation codes an ordinal attribute with n values using n-1 boolean attributes
• Example: attribute “temperature”
• How many binary attributes should we introduce for nominal values such as “Red” vs. “Blue” vs. “Green”?
[Tables: original data and transformed data]
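One common way to realize the n-1 boolean coding is with threshold attributes of the form "value > level". The sketch below assumes that scheme and the ordering cool < mild < hot for “temperature”; the function name is made up for illustration.

```python
def ordinal_to_boolean(value, ordered_levels):
    """Encode an ordinal value as n-1 booleans: 'value > level_k' for each k."""
    rank = ordered_levels.index(value)
    # One boolean per level except the highest one.
    return [rank > k for k in range(len(ordered_levels) - 1)]


levels = ["cool", "mild", "hot"]  # lowest to highest
for t in levels:
    # Columns: temperature > cool, temperature > mild
    print(t, ordinal_to_boolean(t, levels))
```

This coding preserves the order: a higher level switches on every boolean that a lower level switches on. Nominal values such as colours have no such order, which is why they are usually given one boolean per value instead.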

Data Sampling

Sampling
• Allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data
• Choose a representative subset of the data
  • Simple random sampling may perform very poorly when the class distribution is skewed (uneven)
• Develop adaptive sampling methods
  • Stratified sampling:
    • Approximate the percentage of each class (or subpopulation of interest) in the overall database
    • Used in conjunction with skewed data
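Stratified sampling as described above can be sketched by sampling the same fraction from each class. The function name, the `key` argument, the rule of keeping at least one record per class, and the toy records are all assumptions made for this illustration.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Sample the same fraction from every stratum (class) of the data."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for group in strata.values():
        # Keep at least one record so that rare classes are not lost.
        k = max(1, round(fraction * len(group)))
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical skewed data: 90 records of class "yes", 10 of class "no".
records = [(i, "yes") for i in range(90)] + [(i, "no") for i in range(90, 100)]
sample = stratified_sample(records, key=lambda r: r[1], fraction=0.1)
print(len(sample), sum(1 for r in sample if r[1] == "no"))
```

With a 10% fraction the sample keeps the 9:1 class ratio of the full data, which a simple random sample of the same size is not guaranteed to do.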

Sampling (figure): raw data, SRSWOR (simple random sample without replacement), and SRSWR (simple random sample with replacement)
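The difference between the two simple random sampling schemes can be sketched with Python's standard `random` module. The wrapper names `srswor` and `srswr` and the toy data are made up for this example.

```python
import random


def srswor(data, n, seed=0):
    """Simple random sample WITHOUT replacement: no record is drawn twice."""
    return random.Random(seed).sample(data, n)


def srswr(data, n, seed=0):
    """Simple random sample WITH replacement: records may repeat."""
    return random.Random(seed).choices(data, k=n)


raw = list(range(10))      # hypothetical raw data
print(srswor(raw, 5))      # five distinct records
print(srswr(raw, 20))      # more draws than records, so repeats must occur
```

Sampling without replacement can draw at most as many records as the data holds, while sampling with replacement has no such limit.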

Sampling Example (figure): raw data and the corresponding cluster/stratified sample

Summary
• Data preparation is a big issue for data mining
• Data preparation includes transformations such as:
  • Data sampling and feature selection
  • Discretization
  • Missing value handling
  • Incorrect value handling
• Feature selection and feature extraction
