1 / 92

Maximizing Analytical Models: A Data Scientist's Guide

Dive into the world of analytics with this comprehensive guide covering model creation, preprocessing data, types of analytics, and evaluating predictive models. Learn the critical success factors to improve ROI and secure your data.

jamesglover
Download Presentation

Maximizing Analytical Models: A Data Scientist's Guide

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Analytics

  2. Introduction • Analytics Process Model • Example Analytics Applications • Data Scientist Job Profile • Data Preprocessing • Types of Analytics • Post Processing of Analytical Models • Critical Success Factors for Analytical Models • Economic Perspective On Analytics • Improving the ROI of Analytics • Privacy and Security

  3. Analytics Process Model

  4. Example Analytics Applications • Risk analytics • credit scoring • fraud detection • Marketing analytics • churn prediction • response modeling • customer segmentation • Recommender systems • Texts analytics

  5. Example Analytics Applications

  6. Data Scientist Job Profile • Statistics, machine learning and/or quantitative modeling • Programming • Communication/Visualization • Business Knowledge • Creativity

  7. Data Preprocessing • Denormalizing data for analysis • Sampling • Exploratory Analysis • Missing values • Outlier Detection and Handing

  8. Denormalizing data for analysis

  9. Sampling • Take a subset of historical data to build analytical model • Good sample should be representative for the future entities on which the analytical model will be run • Choosing the optimal time window of the sample involves a trade-off between lots of data and recent data

  10. Exploratory Analysis

  11. Exploratory Analysis • Descriptive statistics • Mean • Median • Mode • Standard deviation • Percentile values

  12. Missing Values

  13. Missing Values • Keep • Delete (observation or variable) • Replace (aka impute)

  14. Outlier Detection and Handling • Valid versus invalid observations • Outlier detection • Minimum/Maximum • Histogram, box plot, scatter plot • Outlier handling • Treat as missing value (invalid observation) • Capping (valid observation)

  15. Types of Analytics • Predictive Analytics • Evaluating Predictive Models • Descriptive Analytics • Social Network Analytics

  16. Predictive Analytics • Linear Regression • Logistic Regression • Decision Trees • Other predictive analytics techniques

  17. Linear Regression

  18. Linear Regression

  19. Logistic Regression

  20. Logistic Regression

  21. Logistic Regression

  22. Logistic Regression

  23. Decision Trees

  24. Decision Trees • Splitting decision • Which variable to split at what value • Stopping decision • When to stop adding nodes to the tree? • Assignment decision • What class to assign to a leaf node?

  25. Decision Trees • Entropy: E(S) = -pGlog2(pG)-pBlog2(pB) (C4.5/See5) • Gini: Gini(S) = 2pGpB (CART)

  26. Decision Trees

  27. Decision Trees • Entropy top node = -1/2 x log2(1/2) – 1/2 x log2(1/2) = 1 • Entropy left node = -1/3 x log2(1/3) – 2/3 x log2(2/3) = 0.91 • Entropy right node = -1 x log2(1) – 0 x log2(0) = 0 • Gain = 1 – (600/800) x 0.91 – (200/800) x 0 = 0.32

  28. Decision Trees

  29. Decision Trees

  30. Decision Trees • Regression trees

  31. Decision Trees

  32. Other predictive analytics techniques • Ensemble methods • Bagging, Boosting, Random Forests • Neural Networks • Support Vector Machines • Deep Learning • Trade-off between model performance and interpretability!

  33. Evaluating Predictive Models • Splitting up the data set • Performance Measures for Classification Models • Performance Measures for Regression Models • Other Performance Measures for Predictive Analytical Models

  34. Splitting up the data set TRAIN/TEST DATA

  35. Splitting up the data set CROSS-VALIDATION

  36. Splitting up the data set BOOTSTRAPPING

  37. Performance Measures for Classification Models

  38. Performance Measures for Classification Models • Classification accuracy = (TP+TN)/(TP+FP+FN+TN) = 3/5 • Classification error = (FP +FN)/(TP+FP+FN+TN) = 2/5 • Sensitivity = Recall = Hit rate = TP/(TP+FN) = 1/2 • Specificity = TN/(FP+TN) = 2/3 • Precision = TP/(TP+FP) = 1/2 • F-measure = 2 * (Precision * Recall)/(Precision +Recall) = 1/2

  39. Performance Measures for Classification Models AUC represents probability that randomly chosen churner gets higher score than randomly chosen non-churner!

  40. Performance Measures for Classification Models

  41. Performance Measures for Classification Models Note: AUC : AR=2*AUC-1

  42. Performance Measures for Classification Models

  43. Performance Measures for Regression Models

  44. Performance Measures for Regression Models

  45. Other Performance Measures for Predictive Analytical Models • Comprehensibility • Justifiability • Operational efficiency

  46. Descriptive Analytics • Association rules • Sequence rules • Clustering

  47. Association Rules

  48. Association Rules

  49. Association Rules • Post processing • Filter out trivial rules • Sensitivity analysis • Visualization • Measure economic impact

  50. Sequence Rules Session 1: A, B, CSession 2: B, CSession 3: A, C, D Session 4: A, B, D Session 5: D, C, A Calculate confidence and support as with association rules!

More Related