Data Preprocessing
Data – Things to consider • Type of data: determines which tools can be used to analyze the data • Quality of data: • Tools can tolerate some level of imperfection • Improving the quality of the data improves the quality of the results • Preprocessing: modify the data to better fit the data mining tools: • Change a numeric length into short, medium, long • Reduce the number of attributes
Data • Collection of objects or records • Each object described by a number of attributes • Attribute: property of the object, whose value may vary from object to object or from time to time
What kind of Data? • Any data, as long as it is meaningful for the target application • Database data • Data warehouse data • Data streams • Sequence data • Graph data • Spatial data • Text data
Database data • Data stored in a database management system (DBMS) • A collection of interrelated data and a software system to manage and access the data • Relational model: • Customer (id, firstname, lastname, address, city, state, phone) • Product (id, description, weight, unit) • Order (id, order_number, customer_id, product_id, quantity, price)
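A minimal sketch of how this relational schema could be created, assuming Python's built-in sqlite3 module and SQLite as the DBMS (purely illustrative choices):

```python
import sqlite3

# Sketch: the relational schema from the slide, created in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT,
                       address TEXT, city TEXT, state TEXT, phone TEXT);
CREATE TABLE Product  (id INTEGER PRIMARY KEY, description TEXT, weight REAL, unit TEXT);
CREATE TABLE "Order"  (id INTEGER PRIMARY KEY, order_number TEXT,
                       customer_id INTEGER REFERENCES Customer(id),
                       product_id  INTEGER REFERENCES Product(id),
                       quantity INTEGER, price REAL);
""")
```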
Data Warehouses • Data collected from multiple data sources • Stored under a unified schema • Usually residing at a single site • Provide historical information • Used for reporting
Data Streams • Sensor data: collecting GPS/environment data and sending readings every tenth of a second • Image data: satellite data, surveillance cameras • Web traffic: a node on the Internet receives streams of IP packets [Diagram: input streams feeding a stream processor, which produces an output stream and writes to an archive]
Sequence Data • Data obtained by: • connecting all records associated with a certain object together • and sorting them in increasing order of their timestamps • for all objects • Examples: • Web pages visited by a user (object): • {<Homepage>, <Electronics>, <Cameras and Camcorders>, <Digital Cameras>, …, <Shopping Cart>, <Order Confirmation>}, {….} • Transactions made by a customer over a period of time: • {t1, t18, t500, t721}, {t11, t38, t43, t621, t3005}
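As a rough illustration of building sequences from timestamped records, here is a sketch assuming pandas; the user IDs, timestamps and page names are hypothetical:

```python
import pandas as pd

# Hypothetical click-stream records: one row per (user, timestamp, page) event.
records = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u1", "u2"],
    "timestamp": pd.to_datetime(
        ["2024-01-01 10:00", "2024-01-01 10:05", "2024-01-01 11:00",
         "2024-01-01 10:02", "2024-01-01 11:30"]),
    "page": ["Homepage", "Digital Cameras", "Homepage", "Electronics", "Shopping Cart"],
})

# Group all records of each object (user) and sort by timestamp to form sequences.
sequences = (records.sort_values("timestamp")
                    .groupby("user")["page"]
                    .apply(list))
print(sequences.to_dict())
# {'u1': ['Homepage', 'Electronics', 'Digital Cameras'], 'u2': ['Homepage', 'Shopping Cart']}
```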
Graph Data • Data structure represented by nodes (entities) and edges (relationships) • Example: • Protein subsequences • Web pages and links [Diagram: a small example graph with nodes a–e]
Attribute Types • Nominal: Differentiates between values based on names • Gender, eye color, patient ID • Ordinal: Allows a rank order for sorting data but does not describe the degree of difference • {low, medium, high}, grades {A, B, C, D, F} • Interval: Describes the degree of difference between values • Dates, temperatures in C and F • Ratio: Both degree of difference and ratio are meaningful • Temperatures in K, lengths, age, mass, …
Attribute Properties • Distinctness: = and ≠ • Order: <, ≤, ≥, and > • Addition: + and - • Multiplication: * and /
Discrete and Continuous Attributes • Discrete Attributes: • Finite or countably infinite set of values • Categorical (zipcode, empIDs) or numeric (counts) • Often represented as integers • Special Case: binary attributes (yes/no, true/false, 0/1) • Continuous Attributes: • Real numbers • Examples: temperatures, height, weight, … • Practically, can be measured with limited precision
Asymmetric Attributes • Only the presence of the attribute is considered important • Can be binary, discrete or continuous • Examples: • Courses taken by students • Items purchased by customers • Words in a document
Handling non-record data • Most data mining algorithms are designed for record data • Non-record data can be transformed into record data • Example: • Input: chemical structure data • Expert knowledge: common substructures • Transformation: for each compound, add a record with the common substructures as binary attributes
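A toy sketch of the transformation described above; the compounds and substructure names are made up for illustration:

```python
# Turn non-record (structure) data into record data using expert-chosen
# substructures as binary attributes.
common_substructures = ["benzene_ring", "hydroxyl", "carboxyl"]

compounds = {
    "compound_A": {"benzene_ring", "hydroxyl"},
    "compound_B": {"carboxyl"},
}

# One record per compound; each attribute is 1 if the substructure is present.
records = {
    name: [int(s in subs) for s in common_substructures]
    for name, subs in compounds.items()
}
print(records)   # {'compound_A': [1, 1, 0], 'compound_B': [0, 0, 1]}
```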
Handling non-record data • The transformation may lose some information • Example: spatio-temporal data, i.e., a time series for each location on a grid • Are times T1 and T2 independent? Locations L1 and L2? • The record transformation does not capture the time/location relationships [Diagram: spatio-temporal grid with space and time axes]
Data quality Issues • Data collected for purposes other than data mining • May not be able to fix quality issues at the source • To improve quality of data mining results: • Detect and correct the data quality issues • Use algorithms that tolerate poor data quality
Data quality Issues • Caused by human errors, limitations of measuring devices, flaws in the data collection process, duplicates, missing data, inconsistent data • Measurement error: the recorded value differs from the true value • Data collection error: an attribute/object is omitted, or an object is inappropriately included
Noise • Random component of a measurement error • Distortion of the value or addition of spurious objects
Artifacts • Deterministic distortion of the values • Example: a streak in the same place in a set of photographs
Outliers • Data objects with characteristics different from most other data objects • Attribute values that are unusual with respect to the typical values • May be legitimate objects or values • Applications: intrusion detection, fraud detection
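One simple (and by no means the only) way to flag unusual attribute values is a z-score threshold; a NumPy sketch on made-up measurements:

```python
import numpy as np

values = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 42.0])   # 42.0 looks suspicious

# Flag values more than 2 standard deviations from the mean (the threshold is
# data-dependent; robust statistics such as median/MAD are often preferred).
z = (values - values.mean()) / values.std()
print(values[np.abs(z) > 2])   # -> [42.]
```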
Missing Values • Not collected • Not applicable
Missing Values • Eliminate data objects or attributes: • Too many eliminated objects make the analysis unreliable • Eliminated attributes may be critical to the analysis • Estimate missing values: • Regression • Average of the nearest neighbors (if continuous) • Mean of the samples in the same class • Most common value (if categorical)
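A brief sketch of the elimination and estimation options, assuming pandas; the data and column names are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values (NaN / None).
df = pd.DataFrame({
    "height": [170.0, np.nan, 165.0, 180.0],
    "eye_color": ["brown", "blue", None, "brown"],
})

# Option 1: eliminate objects (rows) with missing values -- may discard too much data.
dropped = df.dropna()

# Option 2: estimate missing values.
df["height"] = df["height"].fillna(df["height"].mean())              # mean for a continuous attribute
df["eye_color"] = df["eye_color"].fillna(df["eye_color"].mode()[0])  # most common value for a categorical one
print(df)
```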
Inapplicable Values • Categorical attributes: introduce a special value ‘N/A’ as a new category • Continuous attributes: introduce a special value outside the domain, for example -1 for sales commission
Inconsistent Data • Inconsistent values: • Entry error, reading error, multiple sources • Negative values for weight • Zip code and city not matching • Correction: requires information from an external source or redundant information
Duplicate Data • Deduplication: the process of dealing with duplicate data issues • Resolve inconsistencies in duplicate records • Identify duplicate records
Deduplication Methods • Data preparation step: data transformation and standardization • Transform attributes from one data type or form to another, rename fields, standardize units, … • Distance based: • Edit distance: how many edits (insert/delete/replace) are needed to transform one string into another • Affine gap distance: edits also include opening or extending a gap • Smith-Waterman distance: gaps at the beginning and end have smaller costs than gaps in the middle • Example: John R. Smith vs. Johnathan Richard Smith • * Elmagarmid et al., “Duplicate Record Detection: A Survey”, TKDE 2007
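For illustration, a plain-Python sketch of the edit (Levenshtein) distance used in distance-based deduplication:

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and replacements to turn a into b."""
    # Classic dynamic-programming formulation with a rolling row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace (or match)
        prev = curr
    return prev[-1]

# Compare candidate duplicate name fields from the example pair above.
print(edit_distance("John R. Smith", "Johnathan Richard Smith"))
```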
Deduplication Methods • Clustering based: extends distance/similarity-based beyond pairwise comparison
Deduplication Methods • Classification based: • Create a training set of duplicate and non-duplicate pairs of records • Train a classifier to distinguish duplicate from non-duplicate pairs • L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, 1984 • R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases. In Proc. of the 20th Int'l Conference on Very Large Databases (VLDB), Santiago, Chile, September 1994 • * Sarawagi and Bhamidipaty, “Interactive Deduplication using Active Learning”, KDD 2002
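A toy sketch of the classification-based approach, assuming scikit-learn; the pair-similarity features and training labels are invented for illustration:

```python
from sklearn.ensemble import RandomForestClassifier

# Each row describes a PAIR of records via similarity features
# (hypothetical: name similarity, address similarity, exact phone match).
X_train = [
    [0.95, 0.90, 1],   # near-identical pair
    [0.20, 0.10, 0],   # clearly different pair
    [0.85, 0.70, 0],
    [0.10, 0.30, 0],
]
y_train = [1, 0, 1, 0]   # 1 = duplicate, 0 = non-duplicate

# Train a classifier to distinguish duplicate from non-duplicate pairs.
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
print(clf.predict([[0.90, 0.80, 1]]))   # predicted label for a new candidate pair
```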
Deduplication Methods • Crowdsourcing based: • Uses a combination of human workers and machines to identify duplicate data records • Given n records, there are n(n-1)/2 pairs to check for duplicates • Apply the classification approach first to filter out pairs that, with high probability, are non-duplicates; use human workers to label the remaining (uncertain) pairs • Examples of crowdsourcing services include Amazon Mechanical Turk and Crowdflower • * Wang et al., “CrowdER: Crowdsourcing Entity Resolution”, VLDB 2012
Application Related Issues • Timeliness: • Some data ages quickly after it is collected • Example: purchasing behavior of online shoppers • Patterns/models based on expired data are out of date • Relevance: • The data must contain information relevant to the application • Knowledge about the data: • Correlated attributes • Special values (example: 9999 for unavailable) • Attribute types (example: categorical {1, 2, 3})
Data Preprocessing • Aggregation: combine multiple objects into one • Reduce number of objects: less memory, less processing time • Quantitative attributes are combined using sum or average • Qualitative attributes are omitted or listed as a set • Sampling: • select a subset of the objects to be analyzed • Discretization: • transform values into categorical form • Dimensionality reduction • Variable transformation
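A short sketch of aggregation, sampling and discretization, assuming pandas; the data and the bin boundaries are hypothetical:

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["A", "A", "B", "B", "B"],
    "amount": [10.0, 20.0, 5.0, 7.0, 8.0],
    "category": ["food", "toys", "food", "food", "toys"],
})

# Aggregation: combine multiple objects into one (sum a quantitative attribute per store).
per_store = sales.groupby("store")["amount"].sum()

# Sampling: select a subset of the objects to be analyzed.
sample = sales.sample(frac=0.6, random_state=0)

# Discretization: transform a continuous attribute into categorical form.
sales["amount_level"] = pd.cut(sales["amount"], bins=[0, 8, 15, float("inf")],
                               labels=["low", "medium", "high"])
print(per_store, sample, sales, sep="\n\n")
```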
The Curse of Dimensionality • Data sets with a large number of features • Examples: document-term matrices, time series of daily closing stock prices
The Curse of Dimensionality • Data analysis becomes harder as the dimensionality of data increases • As dimensionality increases, data becomes increasingly sparse • Classification: not enough data to create a reliable model • Clustering: distances between points less meaningful
Why Feature Reduction? • Less data, so the data mining algorithm learns faster • Higher accuracy, so the model generalizes better • Simpler results, so they are easier to understand • Fewer features, simplifying the data collection process • Techniques: • Feature selection • Feature composition/creation
Feature Selection • Select a subset of the existing attributes • Redundant features: contain duplicate information • Example: price and tax amount • Irrelevant features: contain no useful information • Example: phone number, employee ID • Exhaustive enumeration of subsets is not practically feasible • Approaches: • Embedded: as part of the data mining algorithm • Filter: before the data mining algorithm is run • Wrapper: by trying different subsets of attributes with the data mining algorithm
Feature Selection • General approach (see the sketch below): • Generate a subset of features • Evaluate it: • using an evaluation function (filter approaches) • using the data mining algorithm (wrapper approaches) • If good enough, stop • Otherwise, update the subset and repeat • Need: a subset generation strategy, an evaluation measure, and a stopping criterion
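A sketch of the wrapper approach using scikit-learn's SequentialFeatureSelector as the subset-generation and evaluation loop; forward selection with k-NN as the mining algorithm is chosen only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Wrapper approach: grow a feature subset greedily, evaluating each candidate
# subset by cross-validating the actual mining algorithm (here, k-NN).
selector = SequentialFeatureSelector(KNeighborsClassifier(), n_features_to_select=2,
                                     direction="forward", cv=5)
selector.fit(X, y)
print(selector.get_support())   # boolean mask of the selected features
```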
Principal Component Analysis • Identifies strong patterns in the data • Expresses data in a way that highlights differences and similarities • Reduces the number of dimensions without much loss of information
Principal Component Analysis • The first dimension is chosen to capture as much of the variability of the data as possible • The second dimension is chosen orthogonal to the first so that it captures as much of the remaining variability as possible, and so on … • Each pair of new attributes has covariance 0 • The attributes are ordered in decreasing order of the variance they capture • Each attribute: • accounts for the maximum variance not accounted for by the preceding attributes • is uncorrelated with the preceding attributes
PCA – Step 1 • Subtract the mean from each of the data dimensions • Obtain a new data set DM • This produces data with mean 0 • [Illustration: original data matrix, vector of means, mean-centered data matrix DM]
PCA – Step 2 • Calculate the covariance matrix for DM: cov(DM)
PCA – Step 3 • Calculate the unit eigenvectors and eigenvalues of the covariance matrix • That gives us lines that characterize the data
PCA – Step 4 • Choose components: • The eigenvector with the largest eigenvalue is the principal component of the data set • Order the eigenvectors by eigenvalue, from largest to smallest; this gives the components in order of significance • Select the first p eigenvectors; the final data will have p dimensions • Construct the feature vector Af = (eigv1 eigv2 … eigvp)
PCA – Step 5 • Derive the new data set: • DataNew = AfT × DM • Expresses the original data solely in terms of the vectors chosen
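Putting the five steps together, a NumPy sketch on a small toy data set (the values are chosen only for illustration):

```python
import numpy as np

# Toy data: rows are objects, columns are attributes.
D = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

# Step 1: subtract the mean of each dimension to obtain DM.
DM = D - D.mean(axis=0)

# Step 2: covariance matrix of the mean-centered data.
C = np.cov(DM, rowvar=False)

# Step 3: unit eigenvectors and eigenvalues of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(C)        # eigh: suitable for symmetric matrices

# Step 4: order components by decreasing eigenvalue and keep the first p.
order = np.argsort(eigvals)[::-1]
p = 1
Af = eigvecs[:, order[:p]]                  # feature vector (d x p)

# Step 5: project the mean-centered data onto the chosen components.
DataNew = DM @ Af                           # the transpose of AfT x DM with DM stored row-wise
print(DataNew)
```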
PCA Assumptions and characteristics • Linearity: frames the problem as a change of basis • Directions with large variance are assumed to carry the important structure (dynamics) • Effective when most of the variability is captured by a small fraction of the dimensions • Principal components are orthogonal • Noise (low-variance components) is attenuated
Summary • Data quality affects quality of knowledge discovered • Preprocessing needed: • Improve quality • Transform into suitable form • Issues: • Missing values: not available / not applicable • Inconsistent values • Noise • Duplicate data • Inapplicable form requiring transformation • High dimensionality: PCA