Data Mining (and Machine Learning) With Microsoft Tools. Michael Lisin, Plaster Group

Presentation Transcript


  1. Data Mining (and Machine Learning) With Microsoft Tools. Michael Lisin, Plaster Group. May 8, 2014

  2. Why Reinvent a Toilet?

  3. Definitions

  4. What Do You Think?
     Linear regression fits a straight line describing how variable Y responds to changes in variable X.
     Is linear regression:
     • Data Mining
     • Machine Learning
     • Statistics
     • All of the above
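The straight-line idea on this slide can be sketched in a few lines of Python. This is a minimal ordinary least squares fit, not the Microsoft Linear Regression algorithm itself; the function names `fit_line` and `predict` and the sample numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two or more paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean; zero means the line would be vertical.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    # How x and y move together around their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Read the fitted line at a new x value."""
    return slope * x + intercept


# Hypothetical data: X could be ad spend, Y could be units sold.
x_values = [1.0, 2.0, 3.0, 4.0, 5.0]
y_values = [3.1, 4.9, 7.2, 8.8, 11.0]
slope, intercept = fit_line(x_values, y_values)
print(f"Y = {slope:.2f} * X + {intercept:.2f}")
print("Predicted Y at X = 6:", round(predict(slope, intercept, 6.0), 2))
```

The same fit is taught in statistics courses, used as a machine learning baseline, and shipped as a data mining algorithm, which is the point of the slide's question.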

  5. MS DM Environment
     • SQL Server 2000 - 2014
     • Excel Data Mining Add-Ins (optional, recommended)
     • Interact with: Excel (add-ins), SQL Server Management Studio, SQL Server Data Tools (SSDT), custom code

  6. Start With a Question

  7. Many Potential Questions
     Generic question: What are the data patterns?
     Best if more specific and directed at a problem, for example:
     • How do we combine our products to increase profits?
     • How do we predict the demand for a product / service?
     • Why are customers buying from us?
     • Where can we best cut costs?
     • What are the opportunities to reduce risks?
     • Who are our best customers?
     • …

  8. Approach (steps grow more technical down the list)
     • Define problem / questions
     • Prepare data
     • Build model
     • Validate model
     • Implement predictions
     • Automate model refresh
     • Extend / custom applications

  9. SQL DM Algorithms Summary

  10. Predict Using Models
     DMX = Data Mining Extensions, used to query models for predictions
     DMX Query:
         SELECT
           Model.[Bike Buyer],
           PredictProbability(Model.[Bike Buyer]),
           NewData.Email
         FROM [Model]
         NATURAL PREDICTION JOIN
           (SELECT Age, [Commute Distance], Email FROM … ) AS NewData
     Output: (result table not preserved in this transcript)
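A prediction join matches the columns of new rows to the model's input columns by name, then returns the predicted value with its probability alongside pass-through columns such as Email. The Python below is only a conceptual analogue of that flow, not DMX and not how SQL Server implements it. The toy "model" simply stores observed frequencies, uses Commute Distance as its only input, and all function names and sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def train(rows, inputs, target):
    """Count target values for each combination of input values."""
    counts = defaultdict(Counter)
    for row in rows:
        key = tuple(row[name] for name in inputs)
        counts[key][row[target]] += 1
    return {"inputs": inputs, "target": target, "counts": counts}


def prediction_join(model, new_rows, passthrough):
    """Match new rows to the model by column name, then predict each one."""
    results = []
    for row in new_rows:
        # Only columns the model knows are used; extras (e.g. Age here) are ignored.
        key = tuple(row[name] for name in model["inputs"])
        seen = model["counts"].get(key)
        if seen:
            value, hits = seen.most_common(1)[0]
            probability = hits / sum(seen.values())
        else:
            # Input combination never seen in training: no prediction.
            value, probability = None, 0.0
        out = {model["target"]: value, "probability": probability}
        out.update({name: row[name] for name in passthrough})
        results.append(out)
    return results


# Hypothetical training cases and new data.
history = [
    {"Commute Distance": "0-1 Miles", "Bike Buyer": 1},
    {"Commute Distance": "0-1 Miles", "Bike Buyer": 1},
    {"Commute Distance": "0-1 Miles", "Bike Buyer": 0},
    {"Commute Distance": "10+ Miles", "Bike Buyer": 0},
]
model = train(history, ["Commute Distance"], "Bike Buyer")
new_data = [
    {"Age": 34, "Commute Distance": "0-1 Miles", "Email": "a@example.com"},
]
for result in prediction_join(model, new_data, ["Email"]):
    print(result)
```

A real mining model generalizes across inputs instead of looking up exact matches; the sketch only shows the shape of the result: one row per new case, with prediction, probability, and the columns carried over from the new data.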

  11. Demo

  12. Questions: michaell@plastergroup.com

  13. Appendix

  14. SQL Server Data Mining Algorithms
     • Clustering
     • Decision Tree
     • Linear Regression
     • Sequence Clustering
     • Association
     • Time Series
     • Naive Bayes
     • Neural Network
     • Text Mining
       • Fuzzy Grouping
       • Term Extraction
       • Term Lookup

  15. Key SQL Server Algorithms - 1

  16. Key SQL Server Algorithms - 2

  17. Key SQL Server Algorithms - 3

  18. SQL Text Mining

  19. Interesting Links
     • Sources of free data for research
       • https://opendata.socrata.com
       • http://datamarket.azure.com
       • http://aws.amazon.com/datasets
       • http://www.google.com/publicdata/directory
     • Algorithms
       • http://msdn.microsoft.com/en-us/library/ms174879.aspx
       • http://research.microsoft.com/apps/pubs/default.aspx?id=69669
       • http://academic.research.microsoft.com/Paper/4499824
       • http://academic.research.microsoft.com/Paper/226089.aspx
       • http://www.sematopia.com/2006/04/k-means-and-em-clustering-algorithms/
       • http://en.wikipedia.org/wiki/Expectation-maximization_algorithm
       • http://axon.cs.byu.edu/Dan/678/papers/Cluster/Xu.pdf
       • http://www.epa.gov/bioiweb1/statprimer/tableall.html#multivclustr
       • http://research.microsoft.com/pubs/69669/tr-98-35.pdf
       • http://mathworld.wolfram.com/K-MeansClusteringAlgorithm.html
       • http://msdn.microsoft.com/en-us/library/dd299424(v=SQL.100).aspx
       • http://msdn.microsoft.com/en-us/library/cc280445.aspx
       • http://www.sqlserverdatamining.com/ssdm/Home/DataMiningAddinsLaunch/tabid/69/Default.aspx

  20. Useful Terms
     • Population: the group of cases being studied
       • Valid: purchasers = customers who purchased items
       • Questionable: purchasers = customers who indicated in a survey that they would buy an item. The actual population here is customers who answered the survey; intent does not indicate behavior, plus the sample may be insufficient
     • Sample: random subset of the data. Correct sample size selection requires knowledge of the data.
     • Range: all values, including exceptions and outliers
     • Bias: incorrect results, often from incorrect non-random sample selection, e.g. selecting Seattle to represent WA
     • Mean (average): sum of values / number of values
     • Distribution: frequency of each value, typically arranged as a graph around the mean
     • Variance = $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$, where $\mu$ is the mean of the $N$ values
     • Standard Deviation = $\sigma = \sqrt{\sigma^2}$, the square root of the variance
     • Correlation: one variable changes together with changes in another variable; this alone does not show that one causes the other
     • Overfitting: the model fits the sample accurately, but not the real world
     • Underfitting: the model is not able to establish a useful pattern
     • Cross validation: checking the model on a subset of inputs not used in model generation
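Several of these terms can be computed directly. The sketch below assumes the population forms of variance and standard deviation (dividing by the number of values; a sample estimate would divide by one less) and the Pearson form of correlation. The helper names and the sample values are made up for illustration.

```python
import math


def mean(values):
    """Sum of values divided by the number of values."""
    return sum(values) / len(values)


def variance(values):
    """Population variance: average squared deviation from the mean."""
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values) / len(values)


def std_dev(values):
    """Standard deviation: square root of the variance."""
    return math.sqrt(variance(values))


def correlation(xs, ys):
    """Pearson correlation coefficient, between -1 and 1."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (std_dev(xs) * std_dev(ys))


def k_fold_indices(n, k):
    """Split range(n) into k (train, test) index pairs for cross validation."""
    folds = [list(range(i, n, k)) for i in range(k)]
    return [
        ([j for f in folds[:i] + folds[i + 1:] for j in f], folds[i])
        for i in range(k)
    ]


values = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean(values))      # 5.0
print(variance(values))  # 4.0
print(std_dev(values))   # 2.0

# Each test fold holds rows the model never saw during training.
for train_idx, test_idx in k_fold_indices(6, 3):
    print("train:", train_idx, "test:", test_idx)
```

A model that scores well on the training indices but poorly on the held-out test indices is a sign of overfitting; poor scores on both suggest underfitting.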
