Data Mining for Scientific & Engineering Applications

Data Mining for Scientific & Engineering Applications. Robert Grossman, Laboratory for Advanced Computing, University of Illinois & Magnify; Chandrika Kamath, Lawrence Livermore National Laboratory; Vipin Kumar, Army High Performance Research Center, University of Minnesota

Chapter 10 – Data Mining Systems Robert Grossman, Laboratory for Advanced Computing, University of Illinois & Magnify

Goals of Chapter 10 • What are the four critical interfaces in a data mining system? • Is data mining about rows or columns? • What are the standards in data mining? • What data mining systems are available?

Outline 10.1 Overview of Data Mining Systems 10.2 Case Study Using a System 10.3 Managing Data for Data Mining 10.4 Data Mining Standards 10.5 Commercial and Open Source Systems

10.1 Overview of Data Mining Systems. Following R. L. Grossman, S. Bailey, A. Ramu, B. Malhi, P. Hallstrom, I. Pulleyn, and X. Qin, The Management and Mining of Multiple Predictive Models Using the Predictive Modeling Markup Language (PMML), Information and Software Technology, 1999.

Four Generations of DM Systems [Figure: layer stacks by generation. First generation: data mining algorithms. Second generation: data mining algorithms, data management. Third generation: predictive modeling, data mining algorithms, data management. Fourth generation: agents & mobile data, predictive modeling, data mining algorithms, data management. The network labels Internet and NGI also appear in the figure.]

Layered Systems for DM & PM [Figure: two layered stacks, each with data management, data mining, pred. models, and agents layers, linked by the Internet and NGI with QoS.] • Move data: DSTP, distributed databases, etc. • Move models: Predictive Model Markup Language (PMML) • Move results and metadata: other protocols and services … • Agents can move metadata around via net • Warehouse can move data around via NGI

Phases in the Data Mining & Predictive Modeling Process [Figure: operational data → Data Mining Transformations (DXML) → data mining mart (Phase B, C: Warehousing) → Data Mining Primitives (DMP) → learning set → Phase D: Data Mining → PM or rule set, exchanged as Predictive Model Markup Language (PMML) → Phase E: Predictive Modeling, applied to operational data → scores → Phase F: Deployment.]

Four Critical Interfaces • Data Mining Transformation (DXML): interface between operational data and the data mining mart • Data Mining Primitives (DMP): interface between the data mining mart and the data mining system • Data Mining Application Interface (DM-API): interface between data mining applications and the data mining system (DMQL, OLE DB for Data Mining, …) • Predictive Model Markup Language (PMML): interface between the data mining system and the predictive modeling system

10.2 Building a Model

Some (Selected) Steps to Build Models • Define the data schema • Clean and load the data • Define the mining schema • Compute derived attributes • Build the model • Analyze the model • Deploy the model

1. Define the Data Schema Data Types: int, double, float, date-time, string, etc.

2. Clean and Load the Data Select data schema. Select data source: text, database, etc.

3. Define the Mining Schema Select mining role: dependent, independent, excluded, key, etc. Select mining type: continuous, ordinal, categorical, binary, etc.

4. Compute Derived Attributes Define petal_length/sepal_length
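A derived attribute such as petal_length/sepal_length can be sketched in a few lines of Python. This is only an illustration: the record values, the name `length_ratio`, and the choice to store `None` when the denominator is zero are assumptions, not part of the system shown in the slides.

```python
# Hypothetical records using the attribute names from the slide;
# the numeric values are illustrative only.
records = [
    {"sepal_length": 5.0, "petal_length": 1.5},
    {"sepal_length": 6.0, "petal_length": 4.5},
    {"sepal_length": 0.0, "petal_length": 1.0},  # bad measurement
]


def add_length_ratio(record):
    """Return a copy of the record with petal_length/sepal_length added."""
    out = dict(record)
    denom = record["sepal_length"]
    # Guard against division by zero; an undefined ratio is stored as None.
    out["length_ratio"] = record["petal_length"] / denom if denom else None
    return out


# The original records are left untouched; derived holds the extended copies.
derived = [add_length_ratio(r) for r in records]
```

Returning a copy keeps the loaded data separate from the derived view, so the same data store can serve several mining schemas.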

5. Build the Model Select Data Store Select Mining Schema Select Parameters

5. Build the Model (cont’d) Classification tree.

5. Build the Model - Tuning Select Model Select Parameters

6. Analyze the Model Analyze how well the model predicted class labels.
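One common way to analyze how well class labels were predicted is to tabulate actual against predicted labels and compute the accuracy. The sketch below assumes plain Python lists of labels; the label values and function names are hypothetical.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count how often each (actual label, predicted label) pair occurs."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Fraction of records whose predicted label equals the actual label."""
    counts = confusion_counts(actual, predicted)
    correct = sum(n for (a, p), n in counts.items() if a == p)
    total = sum(counts.values())
    return correct / total if total else 0.0


# Hypothetical labels for six held-out records.
actual = ["a", "a", "b", "b", "c", "c"]
predicted = ["a", "b", "b", "b", "c", "a"]

matrix = confusion_counts(actual, predicted)
acc = accuracy(actual, predicted)
```

The off-diagonal entries of `matrix`, such as `("a", "b")`, show which classes the model confuses, which a single accuracy figure hides.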

7. Deploy the Model Move PMML files to scoring engine.

10.3 Physical Data Management Arranging data by record and by attribute; data mining primitives.

B+ Trees • On disk, accessing one record costs essentially the same as accessing the whole block of records that contains it • Use variants of techniques from databases to lower the cost of accessing out-of-memory data • A variety of tree-based methods, such as B+ trees, efficiently index blocks of data

Horizontal vs. Vertical [Figure: a terabyte of complex objects stored horizontally (by record) and vertically (by attribute). Example query for both layouts: "Select all objects where … is less than 10."]

Thinking about Columns
Layout      NC   Mb/s   GB     Sec     Events/s
Horizontal   1     3    4.4    11775       64
Horizontal   4    10    4.4     3590      655
Horizontal   8    17    4.4     2132     2811
Horizontal  16    23    4.4     1551     7731
Vertical     1     1    0.27    1549      400
Vertical     4     4    0.27     566     4377
Vertical     8     7    0.27     320    15482
Vertical    16    10    0.27     223    44590
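The difference between the two layouts can be illustrated in memory with a small Python sketch. The attribute names (`id`, `energy`, `charge`) and values are made up; a real system would keep each column in its own blocks on disk, so that a query on one attribute reads only that column.

```python
# Hypothetical events with three attributes; names and values are made up.
rows = [
    {"id": 0, "energy": 4.0, "charge": 1},
    {"id": 1, "energy": 12.5, "charge": -1},
    {"id": 2, "energy": 9.9, "charge": 1},
    {"id": 3, "energy": 30.0, "charge": 0},
]


def to_columns(records):
    """Convert a horizontal (by record) layout into a vertical (by attribute) one."""
    names = list(records[0].keys()) if records else []
    return {name: [r[name] for r in records] for name in names}


def select_horizontal(records, attr, limit):
    """Walk whole records; on disk this would read every attribute of each one."""
    return [r["id"] for r in records if r[attr] < limit]


def select_vertical(columns, attr, limit):
    """Scan only the tested column, then look up the ids of the matches."""
    hits = [i for i, value in enumerate(columns[attr]) if value < limit]
    return [columns["id"][i] for i in hits]


columns = to_columns(rows)
horizontal_ids = select_horizontal(rows, "energy", 10)
vertical_ids = select_vertical(columns, "energy", 10)
```

Both functions return the same ids; the vertical version simply never touches the `charge` column.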

Data Mining Primitives • For many algorithms, data infrastructure only needs to supply: (Attribute Id, Attribute Value, Class Value, Count) • Specialized data structures can be created to do this. • SQL databases can be extended to do this.
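The (Attribute Id, Attribute Value, Class Value, Count) primitive can be sketched as a counting pass over a learning set. Counts of this shape are enough, for example, to estimate naïve Bayes probabilities or to evaluate candidate splits on categorical attributes in a classification tree. The records and the function name below are hypothetical.

```python
from collections import Counter


def count_primitives(records, class_attr):
    """Return counts keyed by (attribute id, attribute value, class value)."""
    counts = Counter()
    for rec in records:
        cls = rec[class_attr]
        for attr, value in rec.items():
            if attr != class_attr:
                counts[(attr, value, cls)] += 1
    return counts


# Hypothetical categorical learning set.
learning_set = [
    {"outlook": "sunny", "windy": "no", "label": "play"},
    {"outlook": "sunny", "windy": "yes", "label": "stay"},
    {"outlook": "rain", "windy": "no", "label": "play"},
    {"outlook": "sunny", "windy": "no", "label": "play"},
]

primitives = count_primitives(learning_set, "label")
```

An SQL database could return the same table with a GROUP BY over attribute value and class value, one attribute at a time.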

10.4 Data Mining Standards. See www.dmg.org for more information. Following R. L. Grossman, S. Bailey, A. Ramu, B. Malhi, P. Hallstrom, I. Pulleyn, and X. Qin, The Management and Mining of Multiple Predictive Models Using the Predictive Modeling Markup Language (PMML), Information and Software Technology, 1999.

Predictive Model Markup Language (PMML) • Current version: 2.0 • Products are shipping with PMML Version 1.1 • PMML Working Group full members: IBM, Magnify, MineIt, NCR, Oracle, Salford Systems, SPSS, xChange, University of Illinois at Chicago • PMML Working Group supporting members: Angoss, Insightful, KXEN, Microsoft, SGI … • Part of the xml.org Repository & SourceForge

Layered Systems for DM & PM [Figure: two layered stacks, each with data management, data mining, pred. models, and agents layers, linked by the Internet and NGI with QoS.] • Move models: Predictive Model Markup Language (PMML) • Agents can move metadata around via net • Warehouse can move data around via NGI

Point of View • View data mining as: • 1. Extracting a learning set from a data warehouse • 2. Applying a data mining algorithm • 3. Producing a statistical model, data mining model, or rule set. <PMML version="1.1"> <TreeModel ModelName="response"> etc. <Node frequency="freq_12_month"> etc. </TreeModel> </PMML>

Problems with Current Techniques • Models are deployed in proprietary formats • Models are application dependent • Models are system dependent • Models are architecture dependent • Integrating models with other applications can take a long time

High Performance Data Mining & PMML [Figure: the data is split into partition 1, partition 2, and partition 3; each partition produces PMML.] 1. Scatter the query. 2. Compute the classifiers independently. 3. Gather and merge the PMML files.
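The scatter, compute, and gather steps can be sketched in Python under strong simplifying assumptions. Each per-partition classifier here is a one-attribute rule table held in a dictionary, standing in for a PMML file, and "merge" is interpreted as a majority vote over the partition models. The slides do not say how the PMML files are merged, so the voting scheme, the data, and all names are assumptions.

```python
from collections import Counter


def train_stump(partition, attr, class_attr):
    """Build a one-attribute classifier: attribute value -> majority class."""
    by_value = {}
    for rec in partition:
        by_value.setdefault(rec[attr], Counter())[rec[class_attr]] += 1
    rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    # Fallback for attribute values never seen in this partition.
    default = Counter(r[class_attr] for r in partition).most_common(1)[0][0]
    return {"attr": attr, "rules": rules, "default": default}


def predict(model, rec):
    """Apply a single partition model to one record."""
    return model["rules"].get(rec[model["attr"]], model["default"])


def merged_predict(models, rec):
    """Merge step, done here as a majority vote over the partition models."""
    votes = Counter(predict(m, rec) for m in models)
    return votes.most_common(1)[0][0]


# 1. Scatter: three hypothetical partitions of a learning set.
partitions = [
    [{"x": "lo", "y": "neg"}, {"x": "lo", "y": "neg"}, {"x": "hi", "y": "pos"}],
    [{"x": "hi", "y": "pos"}, {"x": "hi", "y": "pos"}, {"x": "lo", "y": "neg"}],
    [{"x": "lo", "y": "pos"}, {"x": "hi", "y": "pos"}, {"x": "hi", "y": "pos"}],
]

# 2. Compute one classifier per partition, independently of the others.
models = [train_stump(p, "x", "y") for p in partitions]

# 3. Gather the models and score new records with the merged ensemble.
low = merged_predict(models, {"x": "lo"})
high = merged_predict(models, {"x": "hi"})
unseen = merged_predict(models, {"x": "mid"})
```

Because step 2 needs no communication between partitions, it could run on separate machines, with only the small model descriptions moved in step 3.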

Distributed DM & PMML [Figure: data in Chicago and data in Amsterdam each feed a data warehouse and a data mining system; each data mining system emits PMML, and a predictive modeling system combines the two.]

Example: PMML <TreeModel modelName="golfing"> <MiningSchema> <MiningField name="temperature"/> <MiningField name="humidity"/> … </MiningSchema> <Node score="play"> <Predicate field="outlook" operator="equal" value="sunny"/> <Node score="play"> <CompoundPredicate booleanOperator="and"> <Predicate field="temperature" operator="lessThan" value="90F"/>
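To show how a consumer might apply such a tree, the sketch below parses a small XML document and scores records with it. The XML is a hypothetical, simplified fragment that reuses the element and attribute names from the example (TreeModel, Node, Predicate, CompoundPredicate); it is not validated against any PMML DTD. The "no play" score, the numeric threshold 90, and the rule that a record descends into the first child node whose predicate holds are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified model document (not validated against a PMML DTD).
MODEL_XML = """
<TreeModel modelName="golfing">
  <Node score="play">
    <Node score="play">
      <CompoundPredicate booleanOperator="and">
        <Predicate field="outlook" operator="equal" value="sunny"/>
        <Predicate field="temperature" operator="lessThan" value="90"/>
      </CompoundPredicate>
    </Node>
    <Node score="no play">
      <Predicate field="outlook" operator="equal" value="sunny"/>
    </Node>
  </Node>
</TreeModel>
"""


def holds(pred, record):
    """Evaluate a Predicate or CompoundPredicate element against a record."""
    if pred.tag == "CompoundPredicate":
        if pred.get("booleanOperator") != "and":
            raise ValueError("only 'and' is handled in this sketch")
        return all(holds(child, record) for child in pred)
    actual = record[pred.get("field")]
    op, value = pred.get("operator"), pred.get("value")
    if op == "equal":
        return str(actual) == value
    if op == "lessThan":
        return float(actual) < float(value)
    raise ValueError(f"unsupported operator: {op}")


def score(node, record):
    """Descend into the first child whose predicate holds; else use this node."""
    for child in node.findall("Node"):
        pred = next(
            (e for e in child if e.tag in ("Predicate", "CompoundPredicate")),
            None,
        )
        # A node without a predicate is treated as always matching.
        if pred is None or holds(pred, record):
            return score(child, record)
    return node.get("score")


root_node = ET.fromstring(MODEL_XML).find("Node")
mild = score(root_node, {"outlook": "sunny", "temperature": 75})
hot = score(root_node, {"outlook": "sunny", "temperature": 95})
rainy = score(root_node, {"outlook": "rain", "temperature": 60})
```

The scorer knows nothing about how the tree was built, which is the point of exchanging a model specification rather than an implementation.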

Predictive Model Markup Language (PMML) • Based on XML • Benefits of PMML • Open standard for Data Mining & Statistical Models • Not concerned with the process of creating a model • Provides independence from application, platform, and operating system • Simplifies use of data mining models by other applications (consumers of data mining models)

Philosophy • Very important to understand what PMML is not concerned with … • PMML is a specification of a model, not an implementation of a model • PMML allows a simple means of binding parameters to values for an agreed upon set of data mining models & transformations • Also, PMML includes the metadata required to deploy models

PMML Document Structure • PMML Documents • Data dictionary • Transformation dictionary • One or more PMML models • Support for taxonomies/hierarchies • PMML Model • Mining Schema • Univariate statistics (ModelStats) • Optional extensions

PMML Producers, Consumers, & Data Flow [Figure: boxes labeled Data Mining Warehouse, Data Mining System, Operational Data, and Campaign Manager, grouped under PMML Producers and PMML Consumers; flows labeled dataFields, learning sets, miningFields, derivedFields, PMML models, and campaigns.]

Data Flow - Recap • Data Dictionary defines data • Mining Schema defines specific inputs (MiningFields) required for model • Transformation Dictionary defines optional additional derived fields • Two types of attributes: • attributes defined by the mining schema • derived attributes defined via transformations • Models themselves can also support certain transformations
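The recap can be mirrored in a short Python sketch: a data dictionary that types the raw fields, a mining schema that names the inputs the model requires, and a transformation dictionary that adds derived fields. All field names and the transformation are hypothetical, and plain dictionaries stand in for the corresponding PMML elements.

```python
# Data dictionary: every field in the data and how to type it.
data_dictionary = {
    "customer_id": str,
    "age": float,
    "income": float,
    "region": str,
}

# Mining schema: the specific inputs (MiningFields) the model requires.
mining_schema = ["age", "income"]

# Transformation dictionary: optional derived fields built from data fields.
transformations = {
    "income_per_year_of_age": lambda r: r["income"] / r["age"],
}


def model_inputs(record):
    """Type a raw record, keep the mining fields, and add the derived fields."""
    missing = [f for f in data_dictionary if f not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    typed = {f: cast(record[f]) for f, cast in data_dictionary.items()}
    inputs = {f: typed[f] for f in mining_schema}
    for name, fn in transformations.items():
        inputs[name] = fn(typed)
    return inputs


raw = {"customer_id": "c1", "age": "40", "income": "80000", "region": "west"}
inputs = model_inputs(raw)
```

Fields such as `customer_id` and `region` are defined in the data dictionary but never reach the model, because the mining schema does not list them.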

Models in PMML v2.0 • polynomial regression • general regression • trees • center based clusters • density based clusters • associations • neural nets • logistic regression • naïve Bayes • sequences

Conformance • Producer conformance: an application can write valid PMML documents for at least one type of model • Consumer conformance: an application can read valid PMML documents for at least one type of model • Core and non-core features: for a given model, certain features are identified as core by the DTD and must be supported; others are identified as optional

Other Data Mining Standards • OMG CWM DM: object model for representing data mining metadata: models, model results (UML/DTD/XML) • SQL/MM Pt. 6 DM: SQL objects for defining, creating, and applying data mining models, and obtaining their results (SQL) • DMG PMML: representation of data mining models for inter-vendor exchange (DTD/XML) • JSR-073 JDMAPI: Java API for defining, creating, and applying data mining models, and obtaining their results (Java) • OLE DB for DM: SQL-like interface for data mining operations (OLE DB/SQL)

10.5 Commercial & Open Source Systems What do you do when you get home?

Data Mining and Related Systems • SAS • SPSS • Splus (open source R) • Matlab (open source Octave) • Many other specialized systems

References
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, 2001. A good introduction to data mining from the systems and database perspective.
Ian H. Witten and Eibe Frank, Data Mining, Morgan Kaufmann Publishers, San Francisco, 2000. A good introduction which includes Java tools for the common algorithms.
Ian H. Witten, Alistair Moffat and Timothy C. Bell, Managing Gigabytes, Second Edition, Morgan Kaufmann, San Diego, 1999. A good book describing the infrastructure and theory required for working with large collections of text or images.
J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, California, 1993.
Predictive Model Markup Language (PMML), see www.dmg.org
