Tools for Data Preparation

November 8, 2002

Why Data Preparation?
Source: D. Pyle, Data Preparation for Data Mining, 1999

Data Preparation Process
• Data Selection
• Data Cleaning
• New Data Construction
• Data Formatting

Data Selection
Based on the following criteria:
• Data quality properties: completeness and correctness
• Technical constraints imposed by the data mining tools, such as limits on data volume or data type

Data Cleaning
Possible techniques for data cleaning:
• Data normalization, e.g. decimal scaling that maps values into the range (0, 1), or standard deviation (z-score) normalization.
• Data smoothing, e.g. discretization of numeric attributes. This is helpful, or even necessary, for logic-based methods.
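The normalization and discretization techniques named above can be sketched in a few lines of Python. This is a minimal illustration, not taken from the slides: the function names are hypothetical, equal-width binning is only one of several possible discretization schemes, and the standard deviation used is the population one.

```python
import statistics

def decimal_scale(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    max_abs = max(abs(v) for v in values)
    k = 0
    while max_abs / (10 ** k) >= 1:
        k += 1
    return [v / (10 ** k) for v in values]

def z_score(values):
    """Standard deviation normalization: subtract the mean, divide by the std. dev."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]  # constant attribute: nothing to scale
    return [(v - mean) / sd for v in values]

def equal_width_bins(values, n_bins):
    """Discretize numeric values into n_bins intervals of equal width (bin index 0..n_bins-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

Note that decimal scaling yields values in (0, 1) only for non-negative data; with negative values the result lies in (-1, 1).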

Data Cleaning Cont’d
• Treatment of missing values. Predict missing values and replace them with the least biased estimates, e.g. values that preserve the relationship between variables.
• Data reduction. The most usual step is to examine the attributes and consider their predictive potential, e.g. attribute selection based on means and variances, or merging features using a linear transform.
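One way to replace a missing value while preserving the relationship between two variables is to predict it from a regression fitted on the complete records. The sketch below assumes a single predictor x and a linear relationship; the function names and the choice of least-squares regression are illustrative assumptions, not a method prescribed by the slides.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (needs at least two distinct xs)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def impute_missing_y(records):
    """Fill y values that are None with the prediction from the complete (x, y) pairs."""
    complete = [(x, y) for x, y in records if y is not None]
    slope, intercept = fit_line([x for x, _ in complete], [y for _, y in complete])
    return [(x, y if y is not None else slope * x + intercept) for x, y in records]
```

Replacing the missing value with the overall mean of y would be simpler, but it ignores x and therefore weakens the x-y relationship that the prediction keeps.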

Data Missing Example

New Data Construction
Constructive operations on the selected data include:
• Derivation of new attributes from the existing attributes.
• Generation of new records.
• Data transformation.
• Merging tables.
• Aggregation: summarizing information from multiple records and/or tables.
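Two of the constructive operations above, attribute derivation and aggregation, can be illustrated with a small Python sketch. The order records and field names (customer, quantity, unit_price, total) are made up for the example.

```python
from collections import defaultdict

# Hypothetical order records, invented for illustration.
orders = [
    {"customer": "A", "quantity": 2, "unit_price": 5.0},
    {"customer": "B", "quantity": 1, "unit_price": 20.0},
    {"customer": "A", "quantity": 3, "unit_price": 4.0},
]

def add_total(records):
    """Derivation: compute a new 'total' attribute from two existing attributes."""
    return [{**r, "total": r["quantity"] * r["unit_price"]} for r in records]

def total_by_customer(records):
    """Aggregation: summarize several records into one value per customer."""
    sums = defaultdict(float)
    for r in records:
        sums[r["customer"]] += r["total"]
    return dict(sums)

summary = total_by_customer(add_total(orders))
```

Here `add_total` returns new records rather than modifying the input, so the selected data remain available unchanged for other construction steps.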

Data Formatting
It involves syntactic modifications required by the modeling tools:
• Reordering of the attributes or records.
• Changes related to the constraints of the modeling tools, e.g. removing commas or tabs, trimming strings to the maximum allowed number of characters, and replacing special characters with characters from the allowed set.
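The formatting changes listed above can be sketched as a single field-cleaning function. The limit of 20 characters, the allowed special characters, and the use of an underscore as the replacement are assumptions chosen for the example; a real modeling tool would dictate its own constraints.

```python
import re

def format_field(value, max_len=20, allowed_special="_-."):
    """Make a text field safe for a tool with simple syntactic constraints."""
    # Remove commas and tabs, which often act as field delimiters.
    text = value.replace(",", " ").replace("\t", " ")
    # Replace every character outside letters, digits, space and the allowed set.
    pattern = "[^A-Za-z0-9 " + re.escape(allowed_special) + "]"
    text = re.sub(pattern, "_", text)
    # Collapse the runs of spaces left behind, then trim to the maximum length.
    text = re.sub(r" +", " ", text).strip()
    return text[:max_len]
```

For instance, `format_field("Smith, John\t(NY)")` returns `"Smith John _NY_"`.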

Data Preparation Tools
• Data Junction Integration Studio - http://www.datajunction.com/
• SPSS Base 11.5 - http://www.spss.com/
• Informatica PowerCenter - http://www.informatica.com/
• WizWhy - http://www.wizsoft.com/

Data Junction Integration Studio
It includes five visual design tools:
• Process Designer
  - Full conditional flow control
  - Testing of global variables
  - Execution of external processes and a complete expression language allow for automation of complex event-driven or scheduled routines
  - Multi-threaded Integration Engine

Data Junction Integration Studio Cont’d
• Map Designer
  - Mapping source data to target structures
  - Defining rules for mapping complex hierarchical structures
  - Defining complex rules for record filtering
  - Error and reject record handling
  - Error logging

Data Junction Integration Studio Cont’d
• Metadata Query
  - Allows users to run queries against the Data Junction Metadata Repository
• Record Layout Designer
  - A visual tool for defining or modifying data structures (including field names, sizes, length, offset, data types, etc.) for both sources and targets

Data Junction Integration Studio Cont’d
• Universal Data Browser
  - Allows users to view files other than the sources and targets involved in a current design session
  - View data formats from applications not installed on the system

SPSS Base 11.5 Data Preparation Components
• Data Editor: a spreadsheet-like system for defining, entering, editing and displaying data
• Data preparation tools: get data ready for analysis.
  - Use the Define Variable Properties tool to set up data dictionary information (such as value labels, variable labels and variable types) as a "template" that can be applied to other data files and to other variables within the same file.
  - Apply the dictionary information using the Copy Data Properties tool.

SPSS Base 11.5 Cont’d
• Data Restructure Wizard: takes a data file that has multiple records per subject and restructures it so that the data for each subject are in a single record.
  - No need to set up vectors or loops.
  - Particularly helpful with transactional data.
  - Can also do the reverse: take data from a single record and spread it across multiple cases.
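The restructuring described above, from multiple records per subject to one record per subject and back, can be illustrated outside SPSS with plain Python. This is only a sketch of the idea, not of how the Data Restructure Wizard works internally; the (subject, item, value) row layout and the function names are assumptions.

```python
def long_to_wide(rows):
    """Collect (subject, item, value) rows into one record per subject."""
    wide = {}
    for subject, item, value in rows:
        wide.setdefault(subject, {})[item] = value
    return wide

def wide_to_long(wide):
    """Reverse action: spread each subject's single record across multiple rows."""
    return [
        (subject, item, value)
        for subject, items in wide.items()
        for item, value in items.items()
    ]

# Hypothetical transactional data: two visits for s1, one for s2.
rows = [("s1", "visit1", 10), ("s1", "visit2", 12), ("s2", "visit1", 7)]
wide = long_to_wide(rows)
```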

SPSS Base 11.5 Cont’d
• Data transformations: work with combined data more reliably by "flipping" responses so that all the data point in the same direction. For example, this helps create multiple-item indices from surveys that ask respondents to answer both positively worded and negatively worded items.
• Other transformation capabilities: conditional transformations, computing new variables, and recoding values.
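"Flipping" a response is usually done by reverse coding: on a scale from low to high, a score s becomes low + high - s. The sketch below assumes a 1 to 5 scale and hypothetical item names; it illustrates the idea in Python rather than SPSS syntax.

```python
def reverse_code(score, low=1, high=5):
    """Flip a response so that, e.g., 1 becomes 5 and 4 becomes 2 on a 1-5 scale."""
    return low + high - score

def index_score(responses, negative_items, low=1, high=5):
    """Average all items after flipping the negatively worded ones."""
    aligned = [
        reverse_code(value, low, high) if name in negative_items else value
        for name, value in responses.items()
    ]
    return sum(aligned) / len(aligned)

# Hypothetical respondent: q2 is a negatively worded item.
score = index_score({"q1": 4, "q2": 2, "q3": 5}, negative_items={"q2"})
```

Without the flip, the low score on q2 would pull the index down even though it expresses the same attitude as the high scores on q1 and q3.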

WizWhy Features:
• Performs Boolean as well as multi-value analysis
• Analyzes the data by discovering all the if-then rules
• Reveals necessary and sufficient conditions (if-and-only-if rules)
• Calculates the error probability of each rule
• Reveals the interesting phenomena in the data by uncovering the unexpected rules

WizWhy Features Cont’d
• Predicts new cases on the basis of the discovered rules
• Explains predictions by listing relevant rules
• Calculates the prediction’s conclusive probability and error probability
• Predictions are based on error costs (the cost of a miss vs. a false alarm) and not influenced by subjective choices
• Points out cases deviating from the discovered rules
• Proven to be faster and more accurate than other data mining methods
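The idea of basing a prediction on error costs can be illustrated with the standard expected-cost decision rule. This is a generic sketch and not WizWhy's actual procedure, which the slides do not describe; the function and parameter names are hypothetical.

```python
def predict_positive(probability, cost_miss, cost_false_alarm):
    """Predict 'positive' when that choice has the lower expected error cost.

    probability: estimated chance that the case is truly positive.
    """
    # Predicting negative is wrong with chance `probability` (a miss).
    expected_cost_if_negative = probability * cost_miss
    # Predicting positive is wrong with chance 1 - probability (a false alarm).
    expected_cost_if_positive = (1 - probability) * cost_false_alarm
    return expected_cost_if_positive < expected_cost_if_negative
```

With equal costs, a case with probability 0.3 is predicted negative; if a miss costs ten times as much as a false alarm, the same case is predicted positive.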

WizWhy Rules Report Example
1) CUSTOMER starts with MORGA if and only if KEY is 985
   The rule exists in 32 records.
   Significance Level: Error probability is almost 0
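What it means for an if-and-only-if rule like the one above to hold in a data set can be checked by counting records. The sketch below uses a handful of invented records and a hypothetical helper; it does not reproduce WizWhy's rule discovery or its error probability calculation.

```python
def iff_rule_counts(records, condition, consequence):
    """Count records where both sides hold, only the condition, or only the consequence."""
    both = cond_only = cons_only = 0
    for r in records:
        c, q = condition(r), consequence(r)
        if c and q:
            both += 1
        elif c:
            cond_only += 1
        elif q:
            cons_only += 1
    return both, cond_only, cons_only

# Invented records for illustration only.
records = [
    {"CUSTOMER": "MORGAN LTD", "KEY": 985},
    {"CUSTOMER": "MORGAN LTD", "KEY": 985},
    {"CUSTOMER": "ACME INC", "KEY": 112},
    {"CUSTOMER": "MORGANA CO", "KEY": 985},
]

both, cond_only, cons_only = iff_rule_counts(
    records,
    condition=lambda r: r["CUSTOMER"].startswith("MORGA"),
    consequence=lambda r: r["KEY"] == 985,
)
# The rule holds as "if and only if" when neither side ever occurs without the other.
rule_holds = cond_only == 0 and cons_only == 0
```

Here `both` plays the role of "the rule exists in N records" in the report.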
