470 likes | 638 Views
DAT204 Introduction to Data Mining with SQL Server 2000. ZhaoHui Tang Program Manager SQL Server Analysis Services Microsoft Corporation. Agenda. What is Data Mining The Data Mining Market OLE DB for Data Mining Overview of the Data Mining Features in SQL Server 2000 Demo Q&A.
E N D
DAT204Introduction to Data Mining with SQL Server 2000 ZhaoHui Tang Program Manager SQL Server Analysis Services Microsoft Corporation
Agenda • What is Data Mining • The Data Mining Market • OLE DB for Data Mining • Overview of the Data Mining Features in SQL Server 2000 • Demo • Q&A
What is DM? • A process of data exploration and analysis using automatic or semi-automatic means • Techniques origin from Machine Learning, statistics and database • “Exploring data” – scanning samples of known facts about “cases”. • “knowledge”: Clusters, Rules, Decision trees, Equations, Association rules… • Once the “knowledge” is extracted it: • Can be browsed • Provides a very useful insight on the cases behavior • Can be used to predict values of other cases • Can serve as a key element in closed loop analysis
IQ=High IQ=Low Attend College: 79% Yes 21% No Attend College: 45% Yes 55% No Wealth = False Parents Encourage = Yes Parents Encourage = No Wealth = True Attend College: 70% Yes 30% No Attend College: 31% Yes 69% No Attend College: 94% Yes 6% No Attend College: 69% Yes 21% No The deciding factors for high school students to attend college are… All Students Attend College: 55% Yes 45% No IQ ? Wealth Parents Encourage?
Business Oriented DM Problems • Targeted ads • “What banner should I display to this visitor?” • Cross sells • “What other products is this customer likely to buy? • Fraud detection • “Is this insurance claim a fraud?” • Churn analysis • “Who are those customers likely to churn?” • Risk Management • “Should I approve the loan to this customer?” • …
Data To Predict Training Data Mining Model Mining Model Mining Model Predicted Data Mining Process - Illustrated DM Engine DM Engine
The $$$: Market Size • DM Tools Market: • 1999: $341.3M • 2000: $455.1M • 2001: $449.5M * IDC
The Players • Leading vendors • SAS • SPSS • IBM • Angoss • Hundreds of smaller vendors offering DM algorithms… • Oracle –Thinking Machines acquisition
The Products • End-to-end horizontal DM tools • Extraction, Cleansing, Loading, Modeling, Algorithms (dozens), Analysts workbench, Reporting, Charting…. • The customer is the power-analyst • PhD in statistics is usually required… • Closed tools – no standard API • Total vendor lock-in • Limited integration with applications • DM an “outsider” in the Data Warehouse • Extensive consulting required • Sky rocketing prices • $60K+ for a single user license
What the analysts say… • “Stand-alone Data Mining Is Dead” - Forrester • “The demise of [stand alone] data mining” – Gartner
DataPro Users Survey 1999-2001 “Data mining will be the fastest-growing BI technology…”
Market Size of BI * IDC
SQL Server 2000 - The Analysis Platform • SQL 2000 provides a complete Analysis Platform • Not an isolated, stand alone DM product • Platform means: • Standard based DM API’s (OLE DB for DM) for applications development • Integrated vision for all technologies, tools • Extensible • Scaleable
Reports & Analysis Data Flow DM Apps DM OLTP DW OLAP
DM Analysis Services 2000 –Components Tree View Control Manager UI DM Wizards Cluster View Control DM DTS Task Lift Chart Control DSO DMM OLE DB OLAP Sample Query Tool Client Analysis Server OLAP Engine (local) DM Engine (local) OLAP Engine DM Engine
Why OLE DB for DM? Make DM a mass market technology by: • Leverage existing technologies and knowledge • SQL and OLE DB • Common industry wide concepts and data presentation • Changing DM market perception from “proprietary” to “open” • Increasing the number of players: • Reduce the cost and risk of becoming a consumer – one tool works with multiple providers • Reduce the cost and risk of becoming a provider – focus on expertise and find many partners to complement offering
Integration With RDBMS • Customers would like to • Build DM models from within their RDBMS • Train the models directly off their relational tables • Perform predictions as relational queries (tables in, tables out) • Feel that DM is a native part of their database. • Therefore… • Data mining models are relational objects • All operations on the models are relational • The language used is SQL (w/Extensions) • The effect: every DBA and VB developer can become a DM developer
Identifying the “Cases” • DM algorithms analyze “cases” • The “case” is the entity being categorized and classified • Examples • Customer credit risk analysis: Case = Customer • Product profitability analysis: Case = Product • Promotion success analysis: Case = Promotion • Each case encapsulate all we know about the entity
A DMM is a Table! • A DMM structure is defined as a table • Training a DMM means inserting data (pattern) into the table • Predicting from a DMM means querying the table • All information describing the case are contained in columns
Creating a Mining Model CREATE MINING MODEL [Plans Prediction] ( StudentID LONG KEY, Gender TEXT DISCRETE, ParentIncome LONG CONTINUOUS, IQ DOUBLE CONTINUOUS, Encouragement TEXT DISCRETE, CollegePlans TEXT DISCRETE PREDICT ) USING Microsoft_Decision_Trees
Creating a mining model with nested table Create Mining Model MoviePrediction ( CutomerId long key, Age long continuous, Gender discrete, Education discrete, MovieList table predict ( MovieName text key ) ) using microsoft_decision_trees
Training a DMM • Training a DMM means passing it data for which the attributes to be predicted are known • Multiple passes are handled internally by the provider! • Use an INSERT INTO statement • The DMM will not persist the inserted data • Instead it will analyze the given cases and build the DMM content (decision tree, segmentation model, association rules) INSERT [INTO] <mining model name> [(columns list)] <source data query>
INSERT INTO INSERT INTO [Plans Prediction] ( StudentID, Gender, ParentIncome, IQ, Encouragement, CollegePlans ) SELECT [StudentID], [Gender], [ParentIncome], [IQ], [Encouragement], [CollegePlans] FROM [Students]
When Insert Into Is Done… • The DMM is trained • The model can be retrained • Content (rules, trees, formulas) can be explored • OLE DB Schema rowset • SELECT * FROM <dmm>.CONTENT • XML string (PMML) • Prediction queries can be executed
What are Predictions? • Predictions apply the rules of a trained model to a new set of data in order to estimate missing attributes or values • Predictions = queries • The syntax is SQL - like • The output is a rowset • In order to predict you need: • Input data set • A trained DMM • Binding (mapping) information between the input data and the DMM
Prediction It’s a JOIN!
The Prediction Query Syntax SELECT <columns to return or predict> FROM <dmm> PREDICTION JOIN <input data set> ON <dmm column> = <dmm input column>…
Example SELECT [New Students].[StudentID], [Plans Prediction].[CollegePlans], PredictProbability([CollegePlans]) FROM [Plans Prediction] PREDICTION JOIN [New Students] ON [Plans Prediction].[Gender] = [New Students].[Gender] AND [Plans Prediction].[IQ] = [New Students].[IQ] AND ...
OLE DB for Data Mining Defines API Consumer Consumer … OLE DB for DM (API) Provider Provider Provider … OLE DB Misc. Data Source Cube RDBMS
OLEDB for DM Configuration Options Demo Consumers OLEDB for DM Providers MS Analysis Manager ANGOSS Controls 2 3 1 4 MS DM Provider ANGOSS DM Provider
Demo on OLE DB for DM API using Angoss Controls and Provider
For more info… • DM URL • www.microsoft.com/data/oledb • www.microsoft.com/data/oledb/DMResKit.htm • News Group: • Microsoft.public.SQLserver.datamining • Communities.msn.com/AnalysisServicesDataMining • White papers: • Performance paper: www.unisys.com/windows2000/default-07.asp www.microsoft.com/SQL/evaluation/compare/analysisdmwp.asp