Data Mining (and Machine Learning) With Microsoft Tools Michael Lisin, Plaster Group May 8, 2014
What Do You Think?
Linear Regression is a straight line describing how variable Y responds to changes in variable X.
Is Linear Regression:
• Data Mining
• Machine Learning
• Statistics
• All of the above
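To make the definition concrete, here is a minimal sketch of fitting such a straight line by ordinary least squares in plain Python. The data points and the helper name `fit_line` are invented for illustration and do not come from the presentation.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = (how X and Y vary together) / (how much X varies on its own)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: Y rises by roughly 2 for every unit of X.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]

slope, intercept = fit_line(xs, ys)
print(f"Y = {slope:.2f} * X + {intercept:.2f}")
```

The slope is the "response" in the definition: the expected change in Y for a one-unit change in X.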
MS DM Environment
• SQL Server 2000 - 2014
• Excel Data Mining Add-Ins (optional, recommended)
• Interact with: Excel (add-ins), SQL Server Management Studio, SQL Server Data Tools (SSDT), custom code
Many Potential Questions (MS DM Capabilities)
Generic question: What are the data patterns?
Best if more specific and directed at a problem, for example:
• How do we combine our products to increase profits?
• How do we predict the demand for a product / service?
• Why are customers buying from us?
• Where can we best cut costs?
• What are the opportunities to reduce risks?
• Who are our best customers?
• …
Approach (steps become more technical down the list)
• Define problem / questions
• Prepare data
• Build model
• Validate model
• Implement predictions
• Automate model refresh
• Extend / custom applications
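One possible reading of the first five steps, sketched in plain Python rather than with the Microsoft tools. Everything here is an assumption made for illustration: the records are made up, and the "model" is just a single distance threshold.

```python
import random

# 1. Define the question (hypothetical): does commute distance predict a bike purchase?
# 2. Prepare data: made-up (commute_miles, bought_bike) records.
records = [(1, True), (2, True), (3, True), (4, True), (5, False),
           (6, True), (8, False), (10, False), (12, False), (15, False)]

random.seed(0)           # fixed seed so the split is repeatable
random.shuffle(records)
train, holdout = records[:7], records[7:]


def accuracy(threshold, rows):
    """Share of rows where the rule 'commute <= threshold' matches the actual outcome."""
    hits = sum((miles <= threshold) == bought for miles, bought in rows)
    return hits / len(rows)


# 3. Build model: pick the threshold that best separates the training rows.
candidates = sorted({miles for miles, _ in train})
model_threshold = max(candidates, key=lambda t: accuracy(t, train))

# 4. Validate model on rows that were not used to build it.
print("training accuracy:", accuracy(model_threshold, train))
print("holdout accuracy:", accuracy(model_threshold, holdout))


# 5. Implement predictions for a new case.
def predict(miles):
    return miles <= model_threshold


print("predicts buyer at 3 miles:", predict(3))
```

A large gap between training and holdout accuracy is the usual warning sign that the model fits the sample better than it fits new data.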
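Predict Using Models
DMX = Data Mining Extensions, the language used to query models for predictions
DMX Query:
SELECT
    Model.[Bike Buyer],
    PredictProbability(Model.[Bike Buyer]),
    NewData.Email
FROM [Model]
NATURAL PREDICTION JOIN
    (SELECT Age, [Commute Distance], Email FROM … ) AS NewData
Output: one row per NewData record, containing the predicted [Bike Buyer] value, the probability of that prediction, and the Email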
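A rough Python analogy of what the prediction join does: each new row is matched to the model's input columns by name, scored, and returned alongside pass-through columns such as Email. This is only a conceptual sketch. The function `toy_bike_buyer_model`, its rule, and the sample rows are invented stand-ins, not the real mining model or the SQL Server API.

```python
# Input columns the (hypothetical) mining model was trained on.
MODEL_INPUTS = ("Age", "Commute Distance")


def toy_bike_buyer_model(age, commute_distance):
    """Stand-in for a trained model: returns (predicted Bike Buyer, probability).

    The rule and the numbers are invented purely for illustration.
    """
    probability_of_buying = 0.8 if commute_distance <= 5 and age < 50 else 0.3
    buys = probability_of_buying >= 0.5
    # Report the probability of the predicted state, whichever state that is.
    probability = probability_of_buying if buys else 1 - probability_of_buying
    return (1 if buys else 0), probability


def natural_prediction_join(model, new_rows):
    """Score each row, matching input columns to model columns by name."""
    results = []
    for row in new_rows:
        inputs = {name: row[name] for name in MODEL_INPUTS}  # name-based mapping
        bike_buyer, probability = model(inputs["Age"], inputs["Commute Distance"])
        # Email is not a model input; it is passed through from the new data.
        results.append({"Bike Buyer": bike_buyer,
                        "Probability": probability,
                        "Email": row["Email"]})
    return results


new_data = [
    {"Age": 34, "Commute Distance": 2, "Email": "a@example.com"},
    {"Age": 61, "Commute Distance": 12, "Email": "b@example.com"},
]
results = natural_prediction_join(toy_bike_buyer_model, new_data)
for result in results:
    print(result)
```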
SQL Server Data Mining Algorithms
• Clustering
• Decision Tree
• Linear Regression
• Sequence Clustering
• Association
• Time Series
• Naive Bayes
• Neural Network
Text Mining
• Fuzzy Grouping
• Term Extraction
• Term Lookup
Interesting Links
• Sources of free data for research
    • https://opendata.socrata.com
    • http://datamarket.azure.com
    • http://aws.amazon.com/datasets
    • http://www.google.com/publicdata/directory
• Algorithms
    • http://msdn.microsoft.com/en-us/library/ms174879.aspx
    • http://research.microsoft.com/apps/pubs/default.aspx?id=69669
    • http://academic.research.microsoft.com/Paper/4499824
    • http://academic.research.microsoft.com/Paper/226089.aspx
    • http://www.sematopia.com/2006/04/k-means-and-em-clustering-algorithms/
    • http://en.wikipedia.org/wiki/Expectation-maximization_algorithm
    • http://axon.cs.byu.edu/Dan/678/papers/Cluster/Xu.pdf
    • http://www.epa.gov/bioiweb1/statprimer/tableall.html#multivclustr
    • http://research.microsoft.com/pubs/69669/tr-98-35.pdf
    • http://mathworld.wolfram.com/K-MeansClusteringAlgorithm.html
    • http://msdn.microsoft.com/en-us/library/dd299424(v=SQL.100).aspx
    • http://msdn.microsoft.com/en-us/library/cc280445.aspx
    • http://www.sqlserverdatamining.com/ssdm/Home/DataMiningAddinsLaunch/tabid/69/Default.aspx
Useful Terms
• Population: the group of cases being studied
    • Valid: purchasers = customers who purchased items
    • Questionable: purchasers = customers who indicated in a survey that they would buy an item. The actual population here is customers who answered surveys; intent does not indicate behavior, and the sample may be insufficient
• Sample: a random subset of the data. Choosing a correct sample size requires knowledge of the data
• Range: all values, including exceptions and outliers
• Bias: incorrect results, often from incorrect non-random sample selection, e.g. selecting Seattle to represent WA
• Mean or average: sum of values / number of samples
• Distribution: the frequency of each value, typically arranged as a graph around the mean
• Variance = average squared difference from the mean: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$, where $\mu$ is the mean and $N$ is the number of samples
• Standard Deviation = square root of the variance: $\sigma = \sqrt{\sigma^2}$
• Correlation: the degree to which one variable changes together with another variable
• Overfitting: the model accurately fits the sample, but not the real world
• Underfitting: the model is not able to establish a useful pattern
• Cross validation: checking the model on a subset of inputs not used in model generation
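Several of these terms can be computed directly. The sketch below uses Python's standard `statistics` module on a made-up sample (the `ages` and `spend` values are invented), and ends with a simple k-fold cross validation of a deliberately trivial "predict the training mean" model. The helper names `correlation` and `k_fold_splits` are illustrative, not from the presentation.

```python
import math
import statistics

# Hypothetical sample, invented for illustration.
ages = [25, 32, 41, 38, 29, 55, 47, 36]
spend = [120, 150, 210, 190, 140, 300, 250, 180]

mean_age = statistics.mean(ages)
variance_age = statistics.pvariance(ages)  # average squared distance from the mean
std_dev_age = statistics.pstdev(ages)      # square root of the variance


def correlation(xs, ys):
    """Pearson correlation: +1 move together, -1 move oppositely, 0 no linear link."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / spread


def k_fold_splits(n_rows, k):
    """Yield (train_indices, test_indices); each row is held out exactly once."""
    indices = list(range(n_rows))
    for fold in range(k):
        test = indices[fold::k]
        train = [i for i in indices if i not in test]
        yield train, test


print("range:", min(ages), "to", max(ages))
print("mean:", mean_age, "variance:", variance_age, "std dev:", round(std_dev_age, 2))
print("correlation(age, spend):", round(correlation(ages, spend), 3))

# Cross validation: build on the training rows, check on the held-out rows.
for train, test in k_fold_splits(len(spend), k=4):
    prediction = statistics.mean(spend[i] for i in train)
    error = statistics.mean(abs(spend[i] - prediction) for i in test)
    print("held out rows", test, "mean absolute error:", round(error, 1))
```

`pvariance` and `pstdev` divide by N, matching the mean definition above; `statistics.variance` and `statistics.stdev` divide by N - 1 instead, which is the usual choice when the data is a sample drawn from a larger population.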