540 likes | 687 Views
欢迎光临 微软 SQL 数据挖掘 / 数据仓库 技术研讨会. 今日安排. 微软 SQL 数据挖掘技术概述 左洪 微软公司 数据仓库在电信的应用 贝志城 明天高科 数据挖掘在 CRM 中的应用 王立军 中圣公司 灵通 IT Service 维护管理服务系统 邹雄文 广州灵通. Introduction to Data Mining with SQL Server 2000 左洪 高级产品市场经理 微软(中国)有限公司. Agenda. What is Data Mining The Data Mining Market
E N D
欢迎光临微软SQL数据挖掘/数据仓库技术研讨会欢迎光临微软SQL数据挖掘/数据仓库技术研讨会
今日安排 • 微软SQL数据挖掘技术概述 • 左洪 微软公司 • 数据仓库在电信的应用 • 贝志城 明天高科 • 数据挖掘在CRM中的应用 • 王立军 中圣公司 • 灵通IT Service维护管理服务系统 • 邹雄文 广州灵通
Introduction to Data Mining with SQL Server 2000 左洪 高级产品市场经理 微软(中国)有限公司
Agenda • What is Data Mining • The Data Mining Market • OLE DB for Data Mining • Overview of the Data Mining Features in SQL Server 2000 • Q&A
What is DM? • A process of data exploration and analysis using automatic or semi-automatic means • “Exploring data” – scanning samples of known facts about “cases”. • “knowledge”: Clusters, Rules, Decision trees, Equations, Association rules… • Once the “knowledge” is extracted it: • Can be browsed • Provides a very useful insight on the cases behavior • Can be used to predict values of other cases • Can serve as a key element in closed loop analysis
IQ=High IQ=Low Attend College: 79% Yes 11% No Attend College: 45% Yes 55% No Wealth = False Parents Encourage = Yes Parents Encourage = No Wealth = True Attend College: 70% Yes 30% No Attend College: 31% Yes 69% No Attend College: 94% Yes 6% No Attend College: 69% Yes 21% No The deciding factors for high school students to attend college are… All Students Attend College: 55% Yes 45% No IQ ? Wealth Parents Encourage?
Business Oriented DM Problems • Targeted ads • “What banner should I display to this visitor?” • Cross sells • “What other products is this customer likely to buy? • Fraud detection • “Is this insurance claim a fraud?” • Churn analysis • “Who are those customers likely to churn?” • Risk Management • “Should I approve the loan to this customer?” • …
Data To Predict Training Data Mining Model Mining Model Mining Model Predicted Data Mining Process - Illustrated DM Engine DM Engine
The $$$: Y2000 Market Size • DM Tools Market: $250M • 40% - license fees • 60% consulting * Gartner
The Players • Leading vendors • SAS • SPSS • IBM • Hundreds of smaller vendors offering DM algorithms… • Oracle –Thinking Machines acquisition
The Products • End-to-end Data Mining tools • Extraction, Cleansing, Loading, Modeling, Algorithms (dozens), Analysts workbench, Reporting, Charting…. • The customer is the power-analyst • PhD in statistics is usually required… • Closed tools – no standard API • Total vendor lock-in • Limited integration with applications • DM an “outsider” in the Data Warehouse • Extensive consulting required • Sky rocketing prices • $60K+ for a single user license
What the analysts say… • “Stand-alone Data Mining Is Dead” - Forrester • “The demise of [stand alone] data mining” – Gartner
DataPro Users Survey 1999-2001 “Data mining will be the fastest-growing BI technology…”
The $$$: 2000 Market Size • DM Applications Market Size: $1.5B * IDC
SQL Server 2000 - The Analysis Platform • SQL 2000 provides a complete Analysis Platform • Not an isolated, stand alone DM product • Platform means: • The infrastructure for applications • Not an application by itself • Integrated vision for all technologies, tools • Standard based API’s (OLE DB for DM) • Extensible • Scaleable
Reports & Analysis Data Flow DM Apps DM OLTP DW OLAP
DM Analysis Services 2000 - Architecture Manager UI DM Wizards DM DTS Task DSO DMM OLE DB OLAP Client Analysis Server OLAP Engine (local) DM Engine (local) OLAP Engine DM Engine Ext. Ext.
Why OLE DB for DM? Make DM a mass market technology by: • Leverage existing technologies and knowledge • SQL and OLE DB • Common industry wide concepts and data presentation • Changing DM market perception from “proprietary” to “open” • Increasing the number of players: • Reduce the cost and risk of becoming a consumer – one tool works with multiple providers • Reduce the cost and risk of becoming a provider – focus on expertise and find many partners to complement offering • Dramatically increase the number of DM developers
Integration With RDBMS • Customers would like to • Build DM models from within their RDBMS • Train the models directly off their relational tables • Perform predictions as relational queries (tables in, tables out) • Feel that DM is a native part of their database. • Therefore… • Data mining models are relational objects • All operations on the models are relational • The language used is SQL (w/Extensions) • The effect: every DBA and VB developer can become a DM developer
Identifying the “Cases” • DM algorithms analyze “cases” • The “case” is the entity being categorized and classified • Examples • Customer credit risk analysis: Case = Customer • Product profitability analysis: Case = Product • Promotion success analysis: Case = Promotion • Each case encapsulate all we know about the entity
A DMM is a Table! • A DMM structure is defined as a table • Training a DMM means inserting data into the table • Predicting from a DMM means querying the table • All information describing the case are contained in columns
Creating a Mining Model CREATE MINING MODEL [Plans Prediction] ( StudentID LONG KEY, Gender TEXT DISCRETE, ParentIncome LONG CONTINUOUS, IQ DOUBLE CONTINUOUS, Encouragement TEXT DISCRETE, CollegePlans TEXT DISCRETE PREDICT ) USING Microsoft_Decision_Trees
Creating a mining model with nested table Create Mining Model MoviePrediction ( CutomerId long key, Age long continuous, Gender discrete, Education discrete, MovieList table predict ( MovieName text key ) ) using microsoft_decision_trees
Training a DMM • Training a DMM means passing it data for which the attributes to be predicted are known • Multiple passes are handled internally by the provider! • Use an INSERT INTO statement • The DMM will not persist the inserted data • Instead it will analyze the given cases and build the DMM content (decision tree, segmentation model, association rules) INSERT [INTO] <mining model name> [(columns list)] <source data query>
INSERT INTO INSERT INTO [Plans Prediction] ( StudentID, Gender, ParentIncome, IQ, Encouragement, CollegePlans ) SELECT [StudentID], [Gender], [ParentIncome], [IQ], [Encouragement], [CollegePlans] FROM [CollegePlans]
When Insert Into Is Done… • The DMM is trained • The model can be retrained • Content (rules, trees, formulas) can be explored • OLE DB Schema rowset • SELECT * FROM <dmm>.CONTENT • XML string (PMML) • Prediction queries can be executed
What are Predictions? • Predictions apply the rules of a trained model to a new set of data in order to estimate missing attributes or values • Predictions = queries • The syntax is SQL - like • The output is a rowset • In order to predict you need: • Input data set • A trained DMM • Binding (mapping) information between the input data and the DMM • Specification of what to predict
Prediction It’s a JOIN!
The Prediction Query Syntax SELECT <columns to return or predict> FROM <dmm> PREDICTION JOIN <input data set> ON <dmm column> = <dmm input column>…
Example SELECT [New Students].[StudentID], [Plans Prediction].[CollegePlans], PredictProbability([CollegePlans]) FROM [Plans Prediction] PREDICTION JOIN [New Students] ON [Plans Prediction].[Gender] = [New Students].[Gender] AND [Plans Prediction].[IQ] = [New Students].[IQ] AND ...
OLE DB DM Sample Provider with Source • All required OLE DB objects, such as session, command, and rowset • The OLE DB for Data Mining syntax parser • Tokenization of input data • Query processing engine • A sample Naïve Bayes algorithm • Model persistence in XML and binary formats • Available at www.microsoft.com/data/oledb/DMResKit.htm
Why Use DM with OLAP • Relational DM is designed for: • Reports of patterns • Batch predictions fed into an OLTP system • Real-time singleton prediction in an operational environment • OLAP is designed for • interactive analysis by a knowledge worker • Consistent and convenient navigational model • Pre-aggregations of OLAP allow faster performance
Understanding DM Content – Decision Trees Customers having high debt and college education: Filter([Individual Customers].Members, Customers.CurrentMember.Properties(“Debt”) = “High” And Customers.CurrentMember.Properties(“Education”) = “College”) All Customers Credit Risk: 65% Good 35% Bad Debt ? Customers having low debt and are self employed: Filter([Individual Customers].Members, Customers.CurrentMember.Properties(“Debt”) = Low And Customers.CurrentMember.Properties(“Employment Type”) = “Self Employed”) Debt=Low Debt=High Credit Risk: 89% Good 11% Bad Credit Risk: 45% Good 55% Bad Employ- -ment Type? Education? ET = Self Employed Education= College Education= High School ET = Salaried Credit Risk: 70% Good 30% Bad Credit Risk: 31% Good 69% Bad Credit Risk: 94% Good 6% Bad Credit Risk: 79% Good 21% Bad
…Equivalent DM Dimension All Customers Customers with low debt Customers with low debt and self employed Customers with low debt and salaried Customers with high debt Customers with high debt and college education Customers with high debt and high school education
Tree = Dimension • Every node on the tree is a dimension member • The node statistics are the member properties • All members are calculated • Formula aggregates the case dimension members that apply to this node • The MDX is generated by the DM algorithm • Analysis Service will automatically generate the calculated dimension based on the DM content and also a virtual cube • Applies to • Classification (decision trees) • Segmentation (clusters)
Browsing the Virtual Cube • Pivot the DM dimension: Credit Risk: 70% Good, 30% Bad