Data Transformation and Feature Selection/Extraction
Qiang Yang
Thanks: J. Han, Isabelle Guyon, Martin Bachler
Data Mining: Concepts and Techniques
Discretization
• Three types of attributes:
  • Nominal: values from an unordered set
    • Example: attribute “outlook” from weather data
    • Values: “sunny”, “overcast”, and “rainy”
  • Ordinal: values from an ordered set
    • Example: attribute “temperature” in weather data
    • Values: “hot” > “mild” > “cool”
  • Continuous: real numbers
• Discretization:
  • Divide the range of a continuous attribute into intervals
  • Some classification algorithms only accept categorical attributes
  • Reduce data size by discretization
  • Supervised (entropy) vs. unsupervised (binning)
Simple Discretization Methods: Binning
• Equal-width (distance) partitioning:
  • Divides the range into N intervals of equal size (a uniform grid)
  • If A and B are the lowest and highest values of the attribute, the width of each interval is W = (B - A)/N
  • The most straightforward method
  • But outliers may dominate the partition, and skewed data is not handled well
• Equal-depth (frequency) partitioning:
  • Divides the range into N intervals, each containing approximately the same number of samples
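The two binning schemes above can be sketched as follows. This is a minimal illustration, not the slides' own code: the function names, the sample values, and the choice to return zero-based bin indices are all assumptions made for the example.

```python
import numpy as np

def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of width W = (B - A) / N."""
    values = np.asarray(values, dtype=float)
    a, b = values.min(), values.max()
    width = (b - a) / n_bins
    # Interior boundaries A + W, A + 2W, ..., A + (N - 1)W.
    edges = a + width * np.arange(1, n_bins)
    return np.digitize(values, edges)

def equal_depth_bins(values, n_bins):
    """Assign each value to one of n_bins intervals holding roughly equal counts."""
    values = np.asarray(values, dtype=float)
    # Rank every value, then cut the ranks into n_bins equal slices.
    # Note: tied values can land in different bins under this simple scheme.
    ranks = np.argsort(np.argsort(values, kind="stable"), kind="stable")
    return ranks * n_bins // len(values)

values = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # hypothetical sample
width_bins = equal_width_bins(values, 3)   # boundaries at 14 and 24
depth_bins = equal_depth_bins(values, 3)   # four values per bin
```

With these values the equal-width bins hold 3, 3, and 6 samples, while the equal-depth bins hold 4 each. Appending a single outlier such as 340 would push every other value into the first equal-width bin, which is the weakness noted above.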
Histograms
• A popular data reduction technique
• Divide data into buckets and store the average (or sum) for each bucket
• Can be constructed optimally in one dimension using dynamic programming
• Related to quantization problems
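As a rough illustration of storing a count, sum, and average per bucket, the sketch below uses equal-width buckets from NumPy. It does not implement the optimal dynamic-programming construction mentioned above; the function name and sample data are hypothetical.

```python
import numpy as np

def histogram_summary(values, n_buckets):
    """Reduce values to per-bucket counts, sums, and means (equal-width buckets)."""
    values = np.asarray(values, dtype=float)
    counts, edges = np.histogram(values, bins=n_buckets)
    # Weighting each value by itself yields the per-bucket sum.
    sums, _ = np.histogram(values, bins=n_buckets, weights=values)
    # Empty buckets keep a mean of 0 instead of dividing by zero.
    means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    return edges, counts, sums, means

data = [1, 2, 3, 10, 11, 12, 20, 21, 30]  # hypothetical sample
edges, counts, sums, means = histogram_summary(data, 3)
```

Nine raw values are reduced here to three (count, sum) pairs plus the bucket boundaries.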
Supervised Method: Entropy-Based Discretization
• Given a set of samples $S$, if $S$ is partitioned into two intervals $S_1$ and $S_2$ using boundary $T$, the entropy after partitioning is
  $$E(S, T) = \frac{|S_1|}{|S|}\,\mathrm{ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{ent}(S_2)$$
• The boundary $T$ that minimizes this entropy over all possible boundaries is selected as a binary discretization.
• Greedy method:
  • Candidate boundaries $T$ are tried from the smallest to the largest value of attribute $A$, and the process is applied recursively to the resulting intervals until some stopping criterion is met, e.g., one based on a user-given threshold
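A minimal sketch of the greedy procedure follows. Several details are assumptions rather than part of the slides: candidate boundaries are taken as midpoints between adjacent distinct values, entropy is measured in bits, and recursion stops when the entropy reduction falls below a hypothetical `min_gain` parameter.

```python
import math
from collections import Counter

def ent(labels):
    """Entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_boundary(values, labels):
    """Return (T, E) for the boundary T with the lowest weighted entropy E."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_e = None, float("inf")
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no boundary fits between equal values
        t = (lo + hi) / 2  # assumption: use the midpoint as the candidate
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        e = len(left) / n * ent(left) + len(right) / n * ent(right)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

def discretize(values, labels, min_gain=0.1):
    """Recursively split; stop when the entropy reduction is below min_gain."""
    t, e = best_boundary(values, labels)
    if t is None or ent(labels) - e < min_gain:
        return []
    left = [(v, lab) for v, lab in zip(values, labels) if v <= t]
    right = [(v, lab) for v, lab in zip(values, labels) if v > t]
    return (discretize(*zip(*left), min_gain) + [t]
            + discretize(*zip(*right), min_gain))

temps = [60, 65, 70, 75, 80, 85]                  # hypothetical data
play = ["yes", "yes", "yes", "no", "no", "no"]
boundary, entropy_after = best_boundary(temps, play)
cut_points = discretize(temps, play)
```

In this toy data the classes separate cleanly at 72.5, so the weighted entropy after the split is zero and no further cuts are made.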
How to Calculate ent(S)?
• Given two classes, Yes and No, in a set $S$:
  • Let $p_1$ be the proportion of Yes
  • Let $p_2$ be the proportion of No
  • $p_1 + p_2 = 100\%$
• Entropy is: $\mathrm{ent}(S) = -p_1 \log p_1 - p_2 \log p_2$
• When $p_1 = 1$, $p_2 = 0$: $\mathrm{ent}(S) = 0$ (taking $0 \log 0 = 0$)
• When $p_1 = 50\%$, $p_2 = 50\%$: $\mathrm{ent}(S)$ is at its maximum
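A short worked example may help here. It assumes the logarithm is base 2 (so entropy is measured in bits) and uses the convention $0 \log 0 = 0$; the third case, with proportions 0.75 and 0.25, is an extra illustration not taken from the slides.

```latex
% Assumes \log is the base-2 logarithm and 0 \log 0 = 0.
\begin{align*}
p_1 = 1,\ p_2 = 0: \quad
  & \mathrm{ent}(S) = -1 \log_2 1 - 0 \log_2 0 = 0 \\
p_1 = 0.5,\ p_2 = 0.5: \quad
  & \mathrm{ent}(S) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 0.5 + 0.5 = 1 \\
p_1 = 0.75,\ p_2 = 0.25: \quad
  & \mathrm{ent}(S) = -0.75 \log_2 0.75 - 0.25 \log_2 0.25 \approx 0.311 + 0.5 = 0.811
\end{align*}
```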
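Transformation: Normalization
• min-max normalization: $v' = \frac{v - \mathrm{min}_A}{\mathrm{max}_A - \mathrm{min}_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A$
• z-score normalization: $v' = \frac{v - \mathrm{mean}_A}{\mathrm{stand\_dev}_A}$
• normalization by decimal scaling: $v' = \frac{v}{10^j}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$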
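The three normalizations can be sketched in a few lines of NumPy. The function names and sample values are illustrative assumptions; the z-score uses the population standard deviation, and the decimal-scaling search starts at j = 0, so it returns the smallest non-negative j.

```python
import numpy as np

def min_max(v, new_min=0.0, new_max=1.0):
    """Rescale v linearly so its range becomes [new_min, new_max]."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score(v):
    """Center v on its mean and divide by its (population) standard deviation."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

def decimal_scaling(v):
    """Divide v by 10**j for the smallest j >= 0 with max(|v'|) < 1."""
    v = np.asarray(v, dtype=float)
    j = 0
    while np.abs(v).max() / 10 ** j >= 1:
        j += 1
    return v / 10 ** j, j

incomes = [200, 300, 400, 600, 1000]           # hypothetical attribute values
scaled_01 = min_max(incomes)                   # 200 maps to 0, 1000 maps to 1
standardized = z_score(incomes)                # mean 0, standard deviation 1
scaled_dec, j = decimal_scaling([-986, 917])   # j = 3
```

All three assume a non-constant attribute; a column with a single repeated value would cause a division by zero in `min_max` and `z_score`.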
Transforming Ordinal to Boolean
• A simple transformation codes an ordinal attribute with n values using n-1 boolean attributes
• Example: attribute “temperature”
  [Table: original data and transformed data]
• How many binary attributes shall we introduce for nominal values such as “Red” vs. “Blue” vs. “Green”?
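One common way to realize the n-1 boolean coding is a threshold ("thermometer") scheme, sketched below. The slide's own table is not recoverable from this transcript, so the exact columns here are an assumption, as are the function and variable names.

```python
def ordinal_to_boolean(value, ordered_levels):
    """Encode an ordinal value with n levels as n-1 booleans.

    The k-th boolean answers "is value above ordered_levels[k]?".
    """
    rank = ordered_levels.index(value)
    return [rank > k for k in range(len(ordered_levels) - 1)]

levels = ["cool", "mild", "hot"]  # ordered from lowest to highest
encoded = {t: ordinal_to_boolean(t, levels) for t in levels}
# Columns: [temperature > cool, temperature > mild]
```

Because each boolean is a threshold test, the ordering “hot” > “mild” > “cool” is preserved in the encoded form.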
Data Sampling
Sampling
• Allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data
• Choose a representative subset of the data
  • Simple random sampling may perform very poorly in the presence of skewed (uneven) classes
• Develop adaptive sampling methods
  • Stratified sampling:
    • Approximate the percentage of each class (or subpopulation of interest) in the overall database
    • Used in conjunction with skewed data
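Stratified sampling as described above can be sketched with the standard library. The function name, the `key` callback, the fixed seed, and the rule of keeping at least one record per stratum are assumptions made for this illustration.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Sample roughly `fraction` of each class so class proportions are kept."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for group in strata.values():
        # Assumption: keep at least one record from every stratum.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical skewed data: 95 "common" records and 5 "rare" ones.
records = [(f"a{i}", "common") for i in range(95)] + \
          [(f"b{i}", "rare") for i in range(5)]
sample = stratified_sample(records, key=lambda r: r[1], fraction=0.2)
```

The 20-record sample contains 19 "common" and 1 "rare" record, matching the 95:5 ratio of the full data, whereas a simple random sample of the same size could easily miss the rare class entirely.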
Sampling
[Figure: raw data sampled by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement)]
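The difference between the two schemes can be shown with Python's standard `random` module; the wrapper names, the seed, and the raw data are illustrative assumptions.

```python
import random

def srswor(data, n, rng):
    """Simple random sample WITHOUT replacement: no item is drawn twice."""
    return rng.sample(data, n)

def srswr(data, n, rng):
    """Simple random sample WITH replacement: an item may be drawn repeatedly."""
    return rng.choices(data, k=n)

rng = random.Random(42)        # fixed seed so the run is repeatable
raw = list(range(100))         # hypothetical raw data
without_replacement = srswor(raw, 10, rng)
with_replacement = srswr(raw, 10, rng)
```

`srswor` requires n to be no larger than the data size, while `srswr` can draw any number of items.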
Sampling Example
[Figure: raw data and the corresponding cluster/stratified sample]
Summary
• Data preparation is a big issue for data mining
• Data preparation includes transformations such as:
  • Data sampling and feature selection
  • Discretization
  • Missing value handling
  • Incorrect value handling
• Feature Selection and Feature Extraction