Introduction to SPSS Modeler (1) Data Preprocessing

Presentation Transcript


  1. Introduction to SPSS Modeler (1): Data Preprocessing. Department of Computer and Information Science, Fordham University

  2. Working with SPSS Modeler • A three-step process of working with data: • Read data into SPSS Modeler. • Run the data through a series of manipulations. • Send the data to a destination. • This sequence of operations is known as a data stream, because the data flows record by record from the source, through each manipulation, and finally to the destination: either a model or a type of data output.
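
As a rough analogy in code, the same read-manipulate-send pattern can be sketched with pandas (pandas is not part of SPSS Modeler; the file and field names below are invented for illustration):

    import pandas as pd

    # 1. Read data into the stream (hypothetical source file)
    df = pd.read_csv("customers.csv")

    # 2. Run the data through a series of manipulations
    df = df.dropna(subset=["age"])           # a record operation
    df["income_k"] = df["income"] / 1000     # a field operation

    # 3. Send the data to a destination (here, a data output)
    df.to_csv("customers_clean.csv", index=False)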

  3. Streams, Outputs, Models [screenshot of the main window, with labels: Manager, Stream Canvas, Project Window, Palettes, Nodes]

  4. Stream Canvas • Streams are created by drawing diagrams of data operations relevant to your problem on the main canvas in the interface. Each operation is represented by an icon or node, and the nodes are linked together in a stream representing the flow of data through each operation. • You can work with multiple streams at one time in SPSS Modeler, either in the same stream canvas or by opening a new stream canvas. During a session, streams are stored in the Streams manager, at the upper right of the SPSS Modeler window.

  5. Nodes Palette • Most of the data and modeling tools in IBM SPSS Modeler reside in the Nodes Palette, across the bottom of the window below the stream canvas. • To add nodes to the canvas, double-click icons from the Nodes Palette or drag and drop them onto the canvas. You then connect them to create a stream, representing the flow of data.

  6. Modeler Managers • At the top right of the window is the managers pane, which has three tabs. • Streams tab: open, rename, save, and delete the streams created in a session. • Outputs tab: display, save, rename, and close the tables, graphs, and reports produced by stream operations. • Models tab: the most powerful of the manager tabs. It contains all model nuggets (which contain the models generated in SPSS Modeler) for the current session. These models can be browsed directly from the Models tab or added to the stream in the canvas.

  7. Project Pane • On the lower right side of the window is the project pane, used to create and manage data mining projects (groups of files related to a data mining task). There are two ways to view projects you create: • The Classes view • The CRISP-DM view (recommended)

  8. Create a Stream • To build a stream that will create a model, we need at least three elements: • A source node that reads in data from some external source. • A source or Type node that specifies field properties, such as measurement level (the type of data that the field contains), and the role of each field as a target or input in modeling. • A modeling node that generates a model nugget when the stream is run.
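
As a rough code analogy (this is not Modeler's own interface; the file, field names, and model choice below are invented), the three elements correspond to reading data, declaring field types and roles, and fitting a model:

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    # Source node: read the data (hypothetical file)
    df = pd.read_csv("churn.csv")

    # Type node: declare measurement levels and roles
    df["churn"] = df["churn"].astype("category")   # target (a flag field)
    inputs = ["age", "income", "tenure"]           # input fields

    # Modeling node: running the stream yields the model ("nugget")
    model = DecisionTreeClassifier().fit(df[inputs], df["churn"])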

  9. Source Nodes • Source nodes enable you to import data stored in a number of formats, including flat files, IBM SPSS Statistics (.sav), SAS, Microsoft Excel, and ODBC-compliant relational databases. You can also generate synthetic data using the User Input node.
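
For comparison, the pandas library provides readers for the same kinds of sources (reading .sav files requires the optional pyreadstat package; all file names here are placeholders):

    import pandas as pd

    df_csv  = pd.read_csv("data.csv")          # flat file
    df_sav  = pd.read_spss("data.sav")         # IBM SPSS Statistics (.sav)
    df_sas  = pd.read_sas("data.sas7bdat")     # SAS
    df_xlsx = pd.read_excel("data.xlsx")       # Microsoft Excel
    # Database access goes through SQLAlchemy rather than ODBC directly:
    # df_sql = pd.read_sql("SELECT * FROM t", engine)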

  10. Excel Source Node • The Excel source node enables you to import data from any version of Microsoft Excel.

  11. Excel Source Node • File type. Select the Excel file type that you are importing. • Import file. Specifies the name and location of the spreadsheet file to import. • Choose worksheet. Specifies the worksheet to import, either by index or by name. • By index. Specify the index value for the worksheet you want to import, beginning with 0 for the first worksheet, 1 for the second worksheet, and so on. • By name. Specify the name of the worksheet you want to import. Click the ellipsis button (...) to choose from the list of available worksheets.

  12. Excel Source Node • Range on worksheet. You can import data beginning with the first non-blank row or with an explicit range of cells. • Range starts on first non-blank row. Locates the first non-blank cell and uses this as the upper left corner of the data range. • Explicit range of cells. Enables you to specify an explicit range by row and column. For example, to specify the Excel range A1:D5, you can enter A1 in the first field and D5 in the second (or alternatively, R1C1 and R5C4). All rows in the specified range are returned, including blank rows. • On blank rows. If more than one blank row is encountered, you can choose Stop reading, or Return blank rows to continue reading all data to the end of the worksheet, including blank rows. • First row has column names. Indicates that the first row in the specified range should be used as field (column) names. If not selected, field names are generated automatically.
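
These options map loosely onto the parameters of pandas' read_excel, shown here as a hedged sketch (the file and sheet names are invented, and pandas addresses ranges by row/column offsets rather than A1:D5 notation):

    import pandas as pd

    df = pd.read_excel(
        "sales.xlsx",          # hypothetical import file
        sheet_name=0,          # worksheet by index (or by name, e.g. "Q1")
        header=0,              # first row of the range holds the column names
        skiprows=0,            # shift the starting row for an explicit range
        usecols="A:D",         # column part of a range like A1:D5
        nrows=5,               # row part of a range like A1:D5
    )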

  13. Type Node – Field Ops • Field properties can be specified in a source node or in a separate Type node. The functionality is similar in both nodes. • The Type node should be connected to the source node.

  14. Measurement Level • Measurement level (formerly known as "data type" or "usage type") describes the usage of the data fields. The measurement level can be specified on the Types tab of a source or Type node. For example, you may want to set the measurement level for an integer field with values of 1 and 0 to Flag. This usually indicates that 1 = True and 0 = False.

  15. Measurement Level • Default • Data whose storage type and values are unknown (for example, because they have not yet been read) are displayed as <Default>. • Continuous • Used to describe numeric values, such as a range of 0–100 or 0.75–1.25. A continuous value can be an integer, real number, or date/time. • Categorical • Used for string values when an exact number of distinct values is unknown. This is an uninstantiated data type, meaning that all possible information about the storage and usage of the data is not yet known. • Once data have been read, the measurement level will be Flag, Nominal, or Typeless, depending on the maximum number of members for nominal fields specified in the Stream Properties dialog box.

  16. Measurement Level • Flag • Used for data with two distinct values that indicate the presence or absence of a trait, such as true and false, Yes and No, or 0 and 1. The values used may vary, but one must always be designated as the "true" value and the other as the "false" value. Data may be represented as text, integer, real number, date, time, or timestamp. • Nominal • Used to describe data with multiple distinct values, each treated as a member of a set, such as small/medium/large. Nominal data can have any storage: numeric, string, or date/time.

  17. Measurement Level • Ordinal • Used to describe data with multiple distinct values that have an inherent order, e.g. salary categories or satisfaction rankings. • The order is defined by the natural sort order of the data elements: for example, 1, 3, 5 sorts numerically, while HIGH, LOW, NORMAL sorts alphabetically (ascending), which may not match the intended ranking. • Typeless • Used for data that does not conform to any of the above types, for fields with a single value, or for nominal data where the set has more members than the defined maximum. It is also useful for cases in which the measurement level would otherwise be a set with many members (such as an account number).
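
These measurement levels correspond loosely to column dtypes in pandas; a minimal sketch with invented field names:

    import pandas as pd

    df = pd.DataFrame({
        "income":  [42.5, 60.0, 75.3],           # Continuous -> float
        "churned": [1, 0, 1],                    # Flag -> two distinct values
        "region":  ["north", "south", "east"],   # Nominal -> unordered category
        "size":    ["small", "large", "medium"], # Ordinal -> ordered category
    })
    df["churned"] = df["churned"].astype(bool)
    df["region"] = df["region"].astype("category")
    df["size"] = pd.Categorical(
        df["size"],
        categories=["small", "medium", "large"],  # explicit order,
        ordered=True,                             # not alphabetical
    )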

  18. Auto Data Prep – Field Ops • Automated Data Preparation (ADP) handles the task of preparing data for analysis: • analyzing your data • identifying fixes • screening out fields that are problematic or not likely to be useful • deriving new attributes when appropriate • and improving performance through intelligent screening techniques. • Using ADP enables you to make your data ready for model building quickly and easily, without needing prior knowledge of the statistical concepts involved.

  19. Auto Data Prep – Field Ops • When ADP prepares a field for analysis, it creates a new field containing the adjustments or transformations, rather than replacing the existing values and properties of the old field. The old field is not used in further analysis; its role is set to None.
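
The key behavior (write the transformation to a new field and retire the old one, rather than overwrite it) can be sketched in pandas as follows; the field name and the log transformation are illustrative, not necessarily what ADP would choose:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"income": [20_000, 45_000, 1_200_000]})

    # ADP-style: write the transformed values to a NEW field ...
    df["income_transformed"] = np.log1p(df["income"])

    # ... and exclude the original from further analysis instead of
    # deleting it (analogous to setting its role to None)
    excluded_fields = {"income"}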

  20. Data Audit Node - Output • The Data Audit node provides a comprehensive first look at the data you bring into IBM SPSS Modeler, presented in an easy-to-read matrix that can be sorted and used to generate full-size graphs and a variety of data preparation nodes.
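
A rough pandas approximation of such a first-look audit (file name invented; this only mimics a few of the statistics the Data Audit node reports):

    import pandas as pd

    df = pd.read_csv("customers.csv")            # hypothetical file

    print(df.describe(include="all"))            # per-field summary statistics
    print(df.isna().mean().mul(100).round(1))    # percentage missing per field
    print(df.nunique())                          # distinct values per field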

  21. Perform Data Audit

  22. Statistics and Charts

  23. Data Quality • The Quality tab in the audit report displays information about outliers, extremes, and missing values.

  24. Missing Values SuperNode • After specifying an impute method for one or more fields, to generate a Missing Values SuperNode, from the menus choose: • Generate > Missing Values SuperNode • Within the SuperNode, a combination of model nugget, Filler, and Filter nodes is used as appropriate. To understand how it works, you can edit the SuperNode and click Zoom In, and you can add, edit, or remove specific nodes within the SuperNode to fine-tune the behavior.
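
Outside Modeler, a simple imputation step of the kind such a SuperNode encapsulates might look like this in pandas (mean and mode imputation are just two of the available methods; file and field names are invented):

    import pandas as pd

    df = pd.read_csv("customers.csv")            # hypothetical file

    # Impute a continuous field with its mean
    df["income"] = df["income"].fillna(df["income"].mean())

    # Impute a nominal field with its most frequent value (mode)
    df["region"] = df["region"].fillna(df["region"].mode()[0])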

  25. Generate Filter Node • Alternatively, you can generate a Select or Filter node to remove fields or records with missing values. For example, you can filter any fields with a quality percentage below a specified threshold.
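
For example, dropping fields whose quality (percentage of complete records) falls below a threshold could be sketched as follows (the 90% threshold and file name are invented):

    import pandas as pd

    df = pd.read_csv("customers.csv")        # hypothetical file

    quality = df.notna().mean() * 100        # % complete per field
    df = df.loc[:, quality >= 90.0]          # Filter: keep fields >= 90% complete
    df = df.dropna()                         # Select: drop records still missing values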

  26. Outlier and Extreme SuperNode • Outliers and extreme values can be handled in a similar manner. Specify the action you want to take for each field—either coerce, discard, or nullify—and generate a SuperNode to apply the transformations.

  27. Outliers and Extreme Values • Standard deviation from the mean • Detects outliers and extremes based on the number of standard deviations from the mean. For example, if you have a field with a mean of 100 and a standard deviation of 10, you could specify 3.0 to indicate that any value below 70 or above 130 should be treated as an outlier. • Interquartile range • Detects outliers and extremes based on the interquartile range, which is the range within which the two central quartiles fall (between the 25th and 75th percentiles). For example, based on the default setting of 1.5, the lower threshold for outliers would be Q1 – 1.5 * IQR and the upper threshold would be Q3 + 1.5*IQR. Note that using this option may slow performance on large datasets.
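
Both detection rules are straightforward to express in code; a sketch with pandas, using the default multipliers described above (3.0 standard deviations, 1.5 × IQR) on toy data:

    import pandas as pd

    s = pd.Series([12, 15, 14, 13, 98, 11, 16])    # toy data

    # Rule 1: standard deviations from the mean
    mean, sd = s.mean(), s.std()
    lo_sd, hi_sd = mean - 3.0 * sd, mean + 3.0 * sd

    # Rule 2: interquartile range
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lo_iqr, hi_iqr = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    outliers = s[(s < lo_iqr) | (s > hi_iqr)]      # the IQR rule flags the 98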

  28. Handling Outliers and Extreme Values • Action • Coerce • Replaces outliers and extreme values with the nearest value that would not be considered extreme. • Example: if an outlier is defined to be anything above or below three standard deviations, then all outliers would be replaced with the highest or lowest value within this range. • Discard • Discards records with outlying or extreme values for the specified field. • Nullify • Replaces outliers and extremes with the null or system-missing value. • Coerce outliers / discard extremes • Coerces outlying values and discards extreme values. • Coerce outliers / nullify extremes • Coerces outlying values and nullifies extreme values.
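
As an illustration (not Modeler's implementation), the three basic actions map onto one-liners in pandas, here using the interquartile-range rule on toy data:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"x": [12, 15, 14, 13, 98, 11, 16]})    # toy data
    q1, q3 = df["x"].quantile(0.25), df["x"].quantile(0.75)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr                   # 98 falls outside

    coerced = df["x"].clip(lower=lo, upper=hi)                  # Coerce
    discarded = df[df["x"].between(lo, hi)]                     # Discard
    nullified = df["x"].where(df["x"].between(lo, hi), np.nan)  # Nullify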

  29. Zoom In SuperNode

  30. Reset Field in Type Node

  31. Excel Export Node • The Excel export node outputs data in Microsoft Excel format (.xls). Optionally, you can choose to automatically launch Excel and open the exported file when the node is executed.
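
The pandas counterpart of this export step is a single call (writing .xlsx requires the openpyxl package; launching Excel afterwards is a Modeler convenience with no direct equivalent):

    import pandas as pd

    df = pd.DataFrame({"id": [1, 2], "score": [0.8, 0.6]})   # toy data
    df.to_excel("output.xlsx", index=False)                  # needs openpyxl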

  32. Binning Node – Field Ops • The Binning node enables you to automatically create new nominal fields based on the values of one or more existing continuous (numeric range) fields.

  33. Binning Techniques • Using the Binning node, you can automatically generate bins (categories) using the following techniques: • Fixed-width binning • Tiles (equal count or sum) • Mean and standard deviation • Ranks • Optimized relative to a categorical "supervisor" field
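
Two of these techniques have direct pandas counterparts, and a third can be built from explicit bin edges; a sketch on toy data:

    import pandas as pd

    s = pd.Series([3, 7, 12, 18, 25, 31, 44, 59])    # toy data

    fixed = pd.cut(s, bins=4)     # fixed-width: 4 equal-width bins
    tiles = pd.qcut(s, q=4)       # tiles: 4 bins with roughly equal counts

    # Mean/standard-deviation binning via explicit edges
    m, sd = s.mean(), s.std()
    msd = pd.cut(s, bins=[float("-inf"), m - sd, m + sd, float("inf")])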

  34. Select Bin Fields

  35. Bin Values

  36. Binning Result

  37. Filter Node – Field Ops • You can rename or exclude fields at any point in a stream with a Filter node.
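
In pandas terms, renaming and excluding fields are the rename and drop operations (file and field names invented):

    import pandas as pd

    df = pd.read_csv("customers.csv")                    # hypothetical file
    df = df.rename(columns={"cust_id": "customer_id"})   # rename a field
    df = df.drop(columns=["scratch_field"])              # exclude a field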
