
Midterm Review






  1. Midterm Review

  2. 1-Intro
     • Data Mining vs. Statistics
       • Predictive vs. experimental; hypothesis-driven vs. data-driven
     • Different types of data
     • Data Mining pitfalls
       • With lots of data you can find anything
       • Data privacy and security
     • Good and bad examples

  3. 2-EDA and Visualization
     • Good visualization is good analysis
     • Examples of visualizations
       • 1-d, 2-d, multivariate
       • Histograms, boxplots, scatterplots, density estimates, etc.
     • Overplotting with many points
     • Conditional plots (small multiples)
     • Good and bad examples

  4. 3-Data mining concepts
     • Preparing data for analysis
       • How to deal with missing data?
       • What are good transformations?
       • How to deal with outliers?
     • Data reduction
       • Reducing n: sampling, subsetting
       • Reducing p:
         • Principal components: finding projections that preserve variance
           • Scree plot shows how much variance each PC accounts for
         • MDS:
           • Needs a distance matrix
           • Minimizes a ‘stress function’
           • Mostly used for visualization and EDA
     • In-sample vs. out-of-sample evaluation
       • In-sample: must penalize for complexity
       • Out-of-sample: use cross-validation to evaluate predictive performance
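The in-sample vs. out-of-sample distinction above can be sketched in a few lines. This is a minimal illustration, not part of the original slides: it assumes a one-predictor least-squares line as the model and synthetic data, and the helper names (`fit_line`, `mse`, `k_fold_mse`) are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given points."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def k_fold_mse(xs, ys, k=5, seed=0):
    """Average held-out MSE over k folds (out-of-sample estimate)."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint index sets
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors.append(mse(model,
                          [xs[i] for i in held_out],
                          [ys[i] for i in held_out]))
    return sum(errors) / k


# Synthetic data (an assumption for illustration): y = 2 + 3x + noise.
rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]

in_sample = mse(fit_line(xs, ys), xs, ys)   # evaluated on the training data
out_of_sample = k_fold_mse(xs, ys, k=5)     # evaluated on held-out folds
print(f"in-sample MSE:     {in_sample:.3f}")
print(f"5-fold CV MSE:     {out_of_sample:.3f}")
```

The in-sample number is computed on the same points used for fitting, so it tends to be optimistic; the cross-validated number only scores points the model did not see.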

  5. 3-Data mining concepts
     • Complexity/performance tradeoff
     • Evaluating classification models
       • Accuracy (how many did I get right): not the best choice
       • Precision/recall or sensitivity/specificity tradeoff
       • Varying the classification threshold traces out the ROC curve
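As a rough sketch of the evaluation measures above, the following computes a confusion matrix at one threshold and the ROC points obtained by sweeping the threshold. The scores and labels are made up for illustration, the function names are hypothetical, and returning 0.0 for an undefined ratio is a convention chosen here.

```python
def confusion(scores, labels, threshold):
    """Return (tp, fp, tn, fn), predicting positive when score >= threshold."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Common summary rates; 0.0 is used when a denominator is zero."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,       # = sensitivity
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


def roc_points(scores, labels):
    """(false positive rate, true positive rate) for each distinct threshold.

    Assumes both classes are present in labels.
    """
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp, fp, tn, fn = confusion(scores, labels, t)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


# Made-up classifier scores and true labels (1 = positive).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

print(rates(*confusion(scores, labels, 0.5)))
print(roc_points(scores, labels))
```

Lowering the threshold moves along the ROC curve toward (1, 1): recall rises, but so does the false positive rate.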

  6. 4-Regression
     • Linear regression
       • What is it, what are the assumptions, how do you check them
       • Model selection
         • Exhaustive or greedy (forward/backward selection) search
     • Extensions of linear regression
       • Non-linear in parameters, linear in form
       • Generalized linear models
         • Logistic regression
         • Poisson regression
       • Shrinkage
         • Ridge regression
         • Lasso regression
         • Profile plots show the trace of parameter estimates
       • Principal component regression
     • Nonparametric models
       • Smoothing splines
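The shrinkage idea can be illustrated in the simplest possible setting. The sketch below assumes a single predictor, an unpenalized intercept, and the objectives written in the docstrings (the penalty scaling is an assumption; software packages differ). The data and function names are made up for illustration. Printing the estimates over a grid of penalties gives the numbers a profile (trace) plot would show.

```python
import math


def centered_sums(xs, ys):
    """Return (sxx, sxy) after centering x and y."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxx, sxy


def ridge_slope(xs, ys, lam):
    """Minimizer of sum (y - a - b*x)^2 + lam * b^2 over b."""
    sxx, sxy = centered_sums(xs, ys)
    return sxy / (sxx + lam)


def lasso_slope(xs, ys, lam):
    """Minimizer of sum (y - a - b*x)^2 + lam * |b| over b (soft threshold)."""
    sxx, sxy = centered_sums(xs, ys)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    return math.copysign(shrunk, sxy) / sxx


# Made-up data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Trace of the slope estimate as the penalty grows.
for lam in [0.0, 1.0, 10.0, 40.0, 100.0]:
    print(f"lam={lam:6.1f}  ridge={ridge_slope(xs, ys, lam):.3f}  "
          f"lasso={lasso_slope(xs, ys, lam):.3f}")
```

With `lam = 0` both reduce to ordinary least squares. Ridge shrinks the slope smoothly toward zero without reaching it, while the lasso estimate becomes exactly zero once the penalty is large enough.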

  7. 5-Classification
     • Categorical or binary response – ‘supervised’ learning
     • LDA: fit a parametric model to each class
     • Classification (decision) trees
       • Binary splits on any predictor X
       • Best split found algorithmically by Gini or entropy to maximize purity
       • Best size can be found via cross-validation
       • Can be unstable
     • K-Nearest Neighbors
       • Tradeoff of large/small k
     • Probabilistic models
       • Bayes error rate: best possible error if model is correct
       • Naïve Bayes
         • Independence assumption on p(xi|c)
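To make the tree-splitting step concrete, here is a small sketch of choosing one binary split on a single numeric predictor by minimizing weighted impurity. It is an illustration under simplifying assumptions (one predictor, candidate thresholds at midpoints between distinct values), with hypothetical function names and toy data; real tree software searches over all predictors and recurses.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (in bits) of the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_split(xs, labels, impurity=gini):
    """Return (threshold, weighted impurity) of the purest split x <= t.

    Returns None if x has fewer than two distinct values.
    """
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, labels) if x <= t]
        right = [y for x, y in zip(xs, labels) if x > t]
        score = (len(left) * impurity(left)
                 + len(right) * impurity(right)) / len(xs)
        if best is None or score < best[1]:
            best = (t, score)
    return best


# Toy data: class 'a' at small x, class 'b' at large x.
xs = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]

print(best_split(xs, labels))                    # Gini criterion
print(best_split(xs, labels, impurity=entropy))  # entropy criterion
```

Both criteria are zero for a pure node and largest when the classes are evenly mixed, so minimizing the weighted impurity of the two children is the same as maximizing purity.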

  8. 6-Clustering
     • No response variable – ‘unsupervised’ learning
     • Needs distance measures
       • Euclidean, cosine, Jaccard, edit, ordinal and categorical
     • K-means
       • Select initial solution
       • Classify points, then re-calculate means
     • Hierarchical clustering
       • Solutions for all k from 1 to n
       • Dendrogram is an effective visualization
       • Different distance functions (links) will result in different clusterings
     • Probabilistic
       • Mixture models fit using EM algorithm
       • Model-based clustering
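The two K-means steps listed above (select an initial solution, then alternate between classifying points and re-calculating means) can be sketched as follows. This is a minimal version with assumptions stated in the comments: squared Euclidean distance, initial centers sampled at random from the data, and toy points invented for the example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, seed=0, max_iter=100):
    """Return (centers, assignment) after alternating the two K-means steps."""
    rng = random.Random(seed)
    # Initial solution: k distinct data points chosen at random (an assumption;
    # other initialization schemes exist).
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 1: classify each point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the means will not move
        assignment = new_assignment
        # Step 2: re-calculate each center as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return centers, assignment


# Toy data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = k_means(points, k=2)
print(centers)
print(assignment)
```

The result depends on the initial solution in general, which is why K-means is commonly rerun from several starting points and the best solution kept.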
