480 likes | 662 Views
DAT205 Advanced Data Mining Using SQL Server 2000. ZhaoHui Tang Program Manager SQL Server Analysis Services Microsoft Corporation. Agenda. Microsoft Data Mining Algorithms OLE DB for DM Data mining query Data Mining Case Study: Click Stream Analysis Customer Segmentation
E N D
DAT205Advanced Data Mining Using SQL Server 2000 ZhaoHui Tang Program Manager SQL Server Analysis Services Microsoft Corporation
Agenda • Microsoft Data Mining Algorithms • OLE DB for DM Data mining query • Data Mining Case Study: Click Stream Analysis • Customer Segmentation • Site affiliation • Target ads in banner • Performance of Microsoft Data Mining Algorithm • Q&A
All Students Attend College: 55% Yes 45% No IQ=High IQ < > High Attend College: 79% Yes 21% No Attend College: 35% Yes 65% No Parent Income = Low Parent Income = High Attend College: 69% Yes 31% No Attend College: 94% Yes 6% No Decision Tree • Popular technique for classification, Prediction task • Churn analysis • Credit risk analysis • … • Easy to understand • any path from node to leaf forms a rule • Fast to build • Prediction based on leaf node stats • Variation: C4.5, C5, CART, Chaid
1000 1800 900 1600 800 1400 700 1200 600 1000 500 800 400 300 600 200 400 Yes 100 200 0 0 IQ=High IQ=Medium IQ=Low PE=TRUE PE=FALSE 1200 1800 1600 1000 1400 800 1200 1000 600 800 No 400 600 400 200 200 0 0 Male Female PI=High PI=FALSE How tree works
All Students College Plan 33% Yes 67% No Parent Encouragement = True Parent Encouragement = False College Plan 63% Yes 37% No College Plan 16% Yes 84% No Split recursively
Microsoft Decision Trees • Probabilistic Classification Tree • Splitting methods: Bayesian score and Entropy • Forward pruning • Tree shape: Binary and Nary tree • Scalable framework
Clustering Algorithm (EM) • A popular method for customer segmentation, mailing list, profiling… • Algorithm process • Assign a set of Initial Points • Assign initial cluster to each points • Assign data points to each cluster with a probability • Computer new central point based on weighted computation • Cycle until convergence
X X X EM Illustration
Microsoft Clustering Algorithm (Scalable EM) Identify Data to be Compressed Build/Update Model Fill Buffer Stop? Data Compressed date Sufficient stats Final Model
OLE DB for DM • Industry standard for data mining • Based on existing technologies • SQL • OLE DB • Define common concepts for DM • Case, Nested Case • Mining Model • Model Creation • Model Training • Prediction • Language based API
DM Query Language Create Mining Model CreditRisk (CustomerID long key, Gender text discrete, Income long continuous, Profession text discrete, Risk text discrete predict) Using Microsoft_Decision_Trees Insert into CreditRisk (CustomerId, Gender, Income, Profession, Risk) Select CustomerID, Gender, Income, Profession,Risk From Customers Select NewCustomers.CustomerID, CreditRisk.Risk, PredictProbability(CreditRisk) From CreditRisk Prediction Join NewCustomers On CreditRisk.Gender=NewCustomer.Gender And CreditRisk.Income=NewCustomer.Income And CreditRisk.Profession=NewCustomer.Profession
Schema Rowsets • Tabular data to provide meta data information • List of Schema Rowsets in OLE DB for DM • Mining_Services • Mining_Service_Parameters • Mining_Models • Mining_Columns • Mining_Model_Contents • Model_Content_PMML
Segmentation based on Customer table Create Mining Model CustomerClustering (CustomerID text key, DayTimeOnline long continuous NightTimeOnline long continuous, BrowserType text discrete, ChatTime long continuous, EmailTime long continuous, GeoLocation text discrete ) Using Microsoft_Clustering
Segmentation based on Customer and WebClick Create Mining Model CustomerClustering (CustomerID text key, DayTimeOnline long continuous, NightTimeOnline long continuous, BrowserType text discrete, ChatTime long continuous, EmailTime long continuous, GeoLocation text discrete WebClick table ( UrlCategory text key ) )Using Microsoft_Clustering
Insurance Business No Loan Loan No Insurance Insurance Loan No Loan Stock No Stock Stock Loan No Business Business No Stock Stock Shopping No Shopping Insurance No Insurance Association analysis using Microsoft Decision Trees
Association analysis using Microsoft Decision Trees Insurance Business No Loan Loan No Insurance Insurance Loan No Loan Stock No Stock Stock Loan No Business Business No Stock Stock Shopping No Shopping Insurance No Insurance
Site Affiliation Create Mining Model SiteAffiliation (CustomerID text key, WebClick table predict ( UrlCategory text key ) )Using Microsoft_Decision_Trees Insert into SiteAffiliation (CustomerID,WebClick (skip, UrlCategory)OpenRowset(‘MSDataShape’, 'data provider=SQLOLEDB;Server=myserver;UID=me; PWD=mypass' , 'Shape{Select CustomerID from Customer} Append ( {Select customerid, URLCategory from WebClick } relate CustomerID to CustomerID) as WebClick’ )
Singleton Prediction • Select Flattened • Topcount((select URLCategory, $adjustedProbability as prob • From Predict([Web Click], INCLUDE_STATISTICS, EXCLUSIVE)), prob, 5) • From • WebLog PREDICTION JOIN (select (select 'Business' as URLCategory) union (select ‘Telecom’ as URLCategory) as WebClick) as input • On • WebLog.[Web Click].URLCategory = input.WebClick.URLCategory
DM Provider Web Customer IIS Internet ASP DMM Architecture ADO/DSO Real Time Prediction
DM Performance Study • Joint effort between Unisys & Microsoft • Two parts of the white paper: • First part: Use AS2k to build DM Models for a banking business scenario • Second Part: Performance results of DM algorithms study • Some results in this session… • Details in the paper and SQL Server magazine articles…
Non Nested: Training Times for varying Number of Input attributes • Assumptions: • 1 mm cases • 25 states • 1 predictable attribute Observations:
Non Nested: Training Times for varying Number of Cases • Assumptions: • 20 attributes • 25 states • 1 predictable attribute Observations:
Nested Cases: Training Times for varying Sample size of Case Table Assumptions: • Avg. customer purchases=25 • States in nested=200 • Nested key predictable Observations:
Nested Cases: Training Times for varying Number of Products purchased per customer Assumptions: • 200000 cases • 1000 products in nested Observations:
For more info… • DM URL • www.microsoft.com/data/oledb • www.microsoft.com/data/oledb/DMResKit.htm • News Group: • Microsoft.public.SQLserver.datamining • Communities.msn.com/AnalysisServicesDataMining • White papers: • Performance paper: www.unisys.com/windows2000/default-07.asp www.microsoft.com/SQL/evaluation/compare/analysisdmwp.asp
Don’t forget to complete the on-line Session Feedback form on the Attendee Web sitehttps://web.mseventseurope.com/teched/