
G54DMT – Data Mining Techniques and Applications (http://www.cs.nott.ac.uk/~jqb/G54DMT)



Presentation Transcript


1. G54DMT – Data Mining Techniques and Applications
http://www.cs.nott.ac.uk/~jqb/G54DMT
Dr. Jaume Bacardit (jqb@cs.nott.ac.uk)
Topic 2: Data Preprocessing
Lecture 1: Introduction and data quantification
Some slides taken from Jiawei Han, "Data Mining: Concepts and Techniques", Chapter 2

2. Outline of the lecture
• Introduction to data preprocessing (J. Han)
• Evaluating preprocessing methods
• Looking at data
• Statistical quantification methods (J. Han)
• Data complexity metrics

3. Why Data Preprocessing?
• Data in the real world is dirty
   • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
      • e.g., occupation=""
   • noisy: containing errors or outliers
      • e.g., Salary="-10"
   • inconsistent: containing discrepancies in codes or names
      • e.g., Age="42" but Birthday="03/07/1997"
      • e.g., the rating used to be "1, 2, 3" and is now "A, B, C"
      • e.g., discrepancies between duplicate records

4. Why Is Data Dirty?
• Incomplete data may come from
   • "Not applicable" data values at collection time
   • Different considerations between the time the data was collected and the time it is analysed
   • Human/hardware/software problems
• Noisy data (incorrect values) may come from
   • Faulty data collection instruments
   • Human or computer error at data entry
   • Errors in data transmission
• Inconsistent data may come from
   • Different data sources
   • Functional dependency violations (e.g., modifying some linked data)
• Duplicate records also need data cleaning

5. Why Is Data Preprocessing Important?
• No quality data, no quality mining results!
   • Quality decisions must be based on quality data
      • e.g., duplicate or missing data may cause incorrect or even misleading statistics
   • A data warehouse needs consistent integration of quality data
• Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

6. Multi-Dimensional Measure of Data Quality
• A well-accepted multidimensional view:
   • Accuracy
   • Completeness
   • Consistency
   • Timeliness
   • Believability
   • Value added
   • Interpretability
   • Accessibility
• Broad categories: intrinsic, contextual, representational, and accessibility

7. Major Tasks in Data Preprocessing
• Data cleaning
   • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
   • Integration of multiple databases, data cubes, or files
• Data transformation
   • Normalization and aggregation (see the sketch below)
• Data reduction
   • Obtains a reduced representation of the data, smaller in volume but producing the same or similar analytical results
• Data discretization
   • Part of data reduction, but of particular importance for numerical data
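As a tiny illustration of the transformation step, here is a minimal min-max normalization sketch in Python with numpy (the function name and the toy values are mine, not from the lecture):

```python
import numpy as np

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale a numeric attribute linearly to [new_min, new_max]."""
    v = np.asarray(values, dtype=float)
    lo, hi = v.min(), v.max()
    if hi == lo:                       # constant attribute: map everything to new_min
        return np.full_like(v, new_min)
    return (v - lo) / (hi - lo) * (new_max - new_min) + new_min

print(min_max_normalize([10, 20, 30, 40]))   # [0.  0.333...  0.666...  1.]
```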

  8. Forms of Data Preprocessing

9. Evaluating preprocessing methods
• A typical data mining pipeline: cross-validation splits the dataset into a training set and a test set; the classification method builds a model from the training set, and the model is validated on the test set to give an accuracy estimate
• Where would you put the preprocessing?

10. Evaluating preprocessing methods
• Right at the beginning? That is, preprocess the whole dataset before cross-validation splits it into training and test sets
• Problem with this: as the whole dataset is used for the preprocessing, there is a danger that information can leak from the training set to the test set

11. Evaluating preprocessing methods
• A better way: fit the preprocessing on the training set only, then apply the fitted preprocessing to the test set (producing a filtered test set) before the model is validated
• This is called external cross-validation
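A minimal sketch of this idea in Python with scikit-learn (the library and dataset choice are mine, not the lecture's). Putting the preprocessing inside a Pipeline guarantees it is fitted on each training fold only and merely applied to the corresponding test fold, which is exactly the "better way" above:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# The scaler is re-fitted on each training fold inside cross_val_score,
# so no information from the test fold leaks into the preprocessing.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", KNeighborsClassifier(n_neighbors=5)),
])
scores = cross_val_score(pipe, X, y, cv=10)
print("10-fold accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

Preprocessing the whole of X before calling cross_val_score would reproduce the leakage problem of the previous slide.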

12. Mining Data Descriptive Characteristics
• Motivation
   • To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
   • median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
   • Data dispersion: analysed with multiple granularities of precision
   • Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
   • Folding measures into numerical dimensions
   • Boxplot or quantile analysis on the transformed cube

13. Measuring the Central Tendency
• Mean (algebraic measure; sample mean x̄ vs. population mean μ): x̄ = (1/n) Σ xi
   • Weighted arithmetic mean: x̄ = (Σ wi·xi) / (Σ wi)
   • Trimmed mean: chop off the extreme values before averaging
• Median: a holistic measure
   • Middle value if there is an odd number of values; average of the middle two values otherwise
   • Estimated by interpolation for grouped data: median ≈ L1 + ((n/2 − (Σ f)l) / fmedian) × width, where L1 is the lower boundary of the median interval, (Σ f)l the sum of frequencies below it, and fmedian its frequency
• Mode
   • Value that occurs most frequently in the data
   • Unimodal, bimodal, trimodal
   • Empirical formula for moderately skewed data: mean − mode ≈ 3 × (mean − median)
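A quick sketch of these measures in Python (numpy/scipy; the toy sample is mine and chosen so the outlier 42 shows the robustness differences):

```python
import numpy as np
from collections import Counter
from scipy import stats

x = np.array([1, 2, 2, 3, 4, 7, 9, 42])           # toy sample with one extreme value

mean = x.mean()                                    # arithmetic mean: pulled up by 42
wmean = np.average(x, weights=np.ones(len(x)))     # weighted mean (uniform weights here)
tmean = stats.trim_mean(x, 0.125)                  # trimmed mean: chop 12.5% off each end
median = np.median(x)                              # middle value: robust to the outlier
mode = Counter(x.tolist()).most_common(1)[0][0]    # most frequent value

print(mean, tmean, median, mode)                   # 8.75 4.5 3.5 2
```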

14. Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively and negatively skewed data
• For symmetric data the three coincide; positive skew pulls the mean above the median (mean > median > mode), negative skew pulls it below (mean < median < mode)

15. Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
   • Quartiles: Q1 (25th percentile), Q3 (75th percentile)
   • Inter-quartile range: IQR = Q3 − Q1
   • Five-number summary: min, Q1, M (median), Q3, max
   • Boxplot: the ends of the box are the quartiles, the median is marked, whiskers extend from the box, and outliers are plotted individually
   • Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
• Variance and standard deviation (sample: s, population: σ)
   • Variance (algebraic, scalable to compute): s² = (1/(n−1)) Σ (xi − x̄)² for a sample; σ² = (1/N) Σ (xi − μ)² for a population
   • Standard deviation s (or σ) is the square root of the variance s² (or σ²)
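A small numpy sketch of the five-number summary and the 1.5 × IQR outlier rule (the data values are made up for illustration):

```python
import numpy as np

x = np.array([2, 4, 4, 5, 6, 7, 8, 9, 11, 12, 30])

q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
five_num = (x.min(), q1, median, q3, x.max())      # min, Q1, M, Q3, max

# Usual boxplot rule: flag values more than 1.5 * IQR beyond the quartiles
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < lo) | (x > hi)]

print("five-number summary:", five_num)
print("IQR:", iqr, "outliers:", outliers)          # flags the value 30
print("sample std dev:", x.std(ddof=1))            # ddof=1 gives the sample s, not sigma
```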

16. Properties of the Normal Distribution Curve
• The normal (distribution) curve
   • From μ−σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
   • From μ−2σ to μ+2σ: contains about 95% of the measurements
   • From μ−3σ to μ+3σ: contains about 99.7% of the measurements
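These three percentages (the 68-95-99.7 rule) can be checked directly from the normal CDF; a one-liner sketch with scipy:

```python
from scipy.stats import norm

# Probability mass of a normal distribution within k standard deviations of the mean
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} sigma: {p:.4f}")   # 0.6827, 0.9545, 0.9973
```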

17. Boxplot Analysis
• Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum
• Boxplot
   • Data is represented with a box
   • The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
   • The median is marked by a line within the box
   • Whiskers: two lines outside the box extend to Minimum and Maximum

  18. Visualization of Data Dispersion: Boxplot Analysis

19. Histogram Analysis
• Graph displays of basic statistical class descriptions
   • Frequency histograms
      • A univariate graphical method
      • Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data

20. Quantile Plot
• Displays all of the data (allowing the user to assess both the overall behaviour and unusual occurrences)
• Plots quantile information
   • For data xi sorted in increasing order, fi indicates that approximately 100·fi % of the data are below or equal to the value xi

21. Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
• Allows the user to see whether there is a shift in going from one distribution to the other
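A sketch of both the quantile plot and the q-q plot with numpy/matplotlib (my choice of tools; the two samples are synthetic). The quantile plot pairs each sorted value xi with fi = (i − 0.5)/n; the q-q plot pairs the same quantiles of two samples:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
a = rng.normal(50, 10, 200)          # two made-up univariate samples
b = rng.normal(55, 15, 300)

# Quantile plot of a: f_i = (i - 0.5)/n for the i-th smallest value
xs = np.sort(a)
fs = (np.arange(1, len(xs) + 1) - 0.5) / len(xs)
plt.subplot(1, 2, 1)
plt.plot(fs, xs, ".")
plt.xlabel("f"); plt.ylabel("x quantile"); plt.title("Quantile plot")

# Q-Q plot: compare matching quantiles of a and b
qs = np.linspace(0.01, 0.99, 99)
plt.subplot(1, 2, 2)
plt.plot(np.quantile(a, qs), np.quantile(b, qs), ".")
plt.axline((50, 50), slope=1, color="grey")   # reference line y = x
plt.xlabel("a quantiles"); plt.ylabel("b quantiles"); plt.title("Q-Q plot")
plt.tight_layout(); plt.show()
```

Points lying off the y = x line in the q-q plot reveal the shift between the two distributions.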

22. Scatter plot
• Provides a first look at bivariate data, to see clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as a point in the plane

23. Loess Curve
• Adds a smooth curve to a scatter plot in order to give a better perception of the pattern of dependence
• The loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the local regression
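A minimal loess-style sketch using statsmodels' lowess (my tool choice; note that statsmodels fixes the polynomial degree at 1, so only the smoothing parameter frac is exposed, and the data here is synthetic):

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.3, x.size)   # noisy made-up relationship

# frac is the smoothing parameter: fraction of points used in each local fit
smoothed = lowess(y, x, frac=0.25)           # returns sorted (x, fitted y) pairs

plt.scatter(x, y, s=8, alpha=0.5)
plt.plot(smoothed[:, 0], smoothed[:, 1], color="red")
plt.show()
```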

  24. Positively and Negatively Correlated Data

  25. Not Correlated Data

26. Pitfalls of correlation
• http://en.wikipedia.org/wiki/Anscombe%27s_quartet
• These four datasets have the same correlation value (and nearly identical means, variances and least-squares regression lines), yet look completely different when plotted
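This is easy to verify; a sketch using the copy of Anscombe's quartet that ships with seaborn (the tooling is my choice):

```python
import seaborn as sns

# All four x-y pairs have (nearly) the same Pearson correlation (~0.816),
# yet the scatter plots look completely different.
df = sns.load_dataset("anscombe")
print(df.groupby("dataset").apply(lambda g: g["x"].corr(g["y"])))
sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2, ci=None, height=2.5)
```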

27. Graphic Displays of Basic Statistical Descriptions
• Histogram: (shown before)
• Boxplot: (covered before)
• Quantile plot: each value xi is paired with fi, indicating that approximately 100·fi % of the data are ≤ xi
• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
• Scatter plot: each pair of values is a pair of coordinates, plotted as a point in the plane
• Loess (local regression) curve: adds a smooth curve to a scatter plot to give a better perception of the pattern of dependence

28. Complexity metrics for classification methods
• Work proposed by (Basu and Ho, 2002)
• Implementation of these metrics
• The previous section focused on extracting characteristics from the dataset alone
• This set of complexity metrics takes a different approach; it aims at answering the question: what makes this dataset difficult to learn?

29. A simple example
• Imagine that you got a dataset like this. How would you classify it?
• A linear classifier and a rule-learning method would draw very different decision boundaries on it
• Certain problems are easier for some knowledge representations than others
• Is it possible to generate metrics that quantify these difficulties?

30. Some of the aspects being quantified
• Degree of linear separability
• Length of the class boundary, by means of spanning trees
• Shape of the class manifolds
(Luengo, 2011)

31. Groups of metrics
• Measures of overlap of individual features
   • Quantify the discrimination power of individual attributes of a dataset
• Measures of separability of classes
   • Multivariate, considering all attributes together
• Measures of geometry, topology and density of manifolds
   • These metrics analyse the shape of the class boundary using different kinds of techniques

32. Measures of overlap of individual features
• Fisher's discriminant ratio (F1)
   • Assesses the overlap between the (assumed normal) distributions of values for each class; for two classes and one attribute, f = (μ1 − μ2)² / (σ1² + σ2²)
   • F1 = score of the best attribute (maximum score over all attributes)
• Volume of the overlap region (F2)
• Feature efficiency (F3)
   • Percentage of examples from the dataset that lie outside the overlap region for the best attribute
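A minimal sketch of F1 in Python/numpy, under the two-class formula above (the dataset choice is mine; two iris classes stand in for a generic binary problem):

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X, y = X[y < 2], y[y < 2]                 # keep two classes: F1 is defined per class pair

# Fisher's discriminant ratio per attribute: (mu1 - mu2)^2 / (s1^2 + s2^2)
mu1, mu2 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
v1, v2 = X[y == 0].var(axis=0), X[y == 1].var(axis=0)
f = (mu1 - mu2) ** 2 / (v1 + v2)

print("per-attribute scores:", f)
print("dataset F1 (best attribute):", f.max())
```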

33. Measures of separability of classes
• Linear separability (L1, L2)
   • Construct a linear classifier for the problem
   • L1 = sum of distances from the line/hyperplane to the misclassified examples
   • L2 = rate of misclassified examples
• Mixture identifiability (N1, N2, N3)
   • N1: compute a minimum spanning tree (MST) over the instances and count the rate of edges connecting different classes
   • N2: for each instance, check the nearest neighbours within and outside its class; average the two distances over the dataset and take their ratio
   • N3: error rate of a 1-NN classifier using leave-one-out (see the sketch below)
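N3 is the easiest of these to compute; a minimal sketch with scikit-learn (my tooling and dataset choice, not the reference implementation of the metrics):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, LeaveOneOut

X, y = load_iris(return_X_y=True)

# N3 = leave-one-out error rate of a 1-nearest-neighbour classifier
acc = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y,
                      cv=LeaveOneOut()).mean()
print("N3 =", 1.0 - acc)
```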

34. Measures of geometry, topology and density of manifolds
• Non-linearity of a linear classifier (L3) or of a nearest-neighbour classifier (N4)
   • L3 and N4 create a new test set by interpolating randomly chosen pairs of instances from the same class, and test the accuracy of the L1 and N3 classifiers on it
• Space covering by ε-neighbourhoods (T1)
• Ratio of instances to features (T2)
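The interpolation step behind L3/N4 is simple to sketch in numpy (this is my illustration of the idea, not the reference implementation):

```python
import numpy as np

def interpolated_test_set(X, y, n_points, seed=0):
    """Build a synthetic test set by linearly interpolating randomly chosen
    pairs of training instances of the same class, as in L3/N4."""
    rng = np.random.default_rng(seed)
    X_new, y_new = [], []
    for _ in range(n_points):
        c = rng.choice(np.unique(y))                       # pick a class at random
        i, j = rng.choice(np.flatnonzero(y == c), size=2)  # two instances of that class
        t = rng.random()                                   # interpolation weight in [0, 1)
        X_new.append(t * X[i] + (1 - t) * X[j])
        y_new.append(c)
    return np.array(X_new), np.array(y_new)
```

The nonlinearity measure is then the error of the already-trained linear (L3) or 1-NN (N4) classifier on this interpolated set.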

35. So how can we use these metrics?
• Application of these metrics to study the performance of Fuzzy Rule-Based Classification Systems (paper)
• A fuzzy classifier called "Fuzzy Hybrid Genetics-Based Machine Learning" (FH-GBML) was evaluated on a very large collection of 450 datasets; the training and test accuracies on each of these datasets were ranked

  36. Ranking the performance using the metrics

  37. And some rules can be extracted…

38. Considerations about complexity metrics
• Can they predict the performance of classification methods?
   • So far, results have not been very conclusive
   • Past experience of the meta-learning community
• Scalability issues
   • Several of these metrics cannot be computed for large-scale datasets in reasonable time

  39. Questions?
