Midterm Review
1-Intro
• Data Mining vs. Statistics
  • Predictive vs. experimental; hypothesis-driven vs. data-driven
• Different types of data
• Data Mining pitfalls
  • With lots of data you can find anything
• Data privacy and security
• Good and bad examples
2- EDA and Visualization
• Good visualization is good analysis
• Examples of visualizations
  • 1-d, 2-d, multivariate
  • Histograms, boxplots, scatterplots, density estimates, etc.
• Overplotting with many points
• Conditional plots (small multiples)
• Good and bad examples
3- Data mining concepts
• Preparing data for analysis
  • How to deal with missing data?
  • What are good transformations?
  • How to deal with outliers?
• Data reduction
  • Reducing n: sampling, subsetting
  • Reducing p:
    • Principal components: finding projections that preserve variance
      • Scree plot shows how much variance each PC accounts for
    • MDS:
      • Needs a distance matrix
      • Minimizes a ‘stress function’
      • Mostly used for visualization and EDA
• In- vs. out-of-sample evaluation
  • In-sample: must penalize for complexity
  • Out-of-sample: use cross-validation to evaluate predictive performance
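The scree-plot idea can be sketched numerically: the eigenvalues of the covariance matrix are the variances along each principal component, and a scree plot shows each one's share of the total. The sketch below is illustrative only. It assumes two-dimensional data so the eigenvalues have a closed form, and the function names and sample points are made up for the example.

```python
import math

def covariance_2d(points):
    """Sample covariance entries (sxx, sxy, syy) for a list of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return sxx, sxy, syy

def explained_variance_2d(points):
    """Share of total variance captured by each principal component."""
    sxx, sxy, syy = covariance_2d(points)
    # Closed-form eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    mean = (sxx + syy) / 2
    half_gap = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    eigenvalues = [mean + half_gap, mean - half_gap]
    total = sxx + syy  # trace = sum of eigenvalues = total variance
    return [ev / total for ev in eigenvalues]

# Hypothetical, strongly correlated data: PC1 should carry almost everything.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]
ratios = explained_variance_2d(points)
print(ratios)
```

With more than two variables the same ratios would come from a general eigendecomposition or SVD; the two-variable case is used here only to keep the arithmetic visible.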
3- Data mining concepts
• Complexity/performance tradeoff
• Evaluating classification models
  • Accuracy (how many did I get right): not the best choice; it can mislead when classes are imbalanced
  • Precision/recall or sensitivity/specificity tradeoff
  • Varying the classification threshold traces out the ROC curve
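A small sketch of how the threshold drives these metrics: each threshold gives one confusion matrix, and each confusion matrix gives one point (false positive rate, true positive rate) on the ROC curve. The scores and labels are invented for illustration, and the helper names are hypothetical. The code assumes both classes are present.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label == 1:
            tp += 1
        elif predicted and label == 0:
            fp += 1
        elif label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def rates(scores, labels, threshold):
    """Accuracy, precision, recall and specificity at one threshold."""
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    return {
        "accuracy": (tp + tn) / len(labels),
        # Precision is undefined when nothing is predicted positive.
        "precision": tp / (tp + fp) if tp + fp else None,
        "recall": tp / (tp + fn),       # also called sensitivity or TPR
        "specificity": tn / (tn + fp),  # 1 - false positive rate
    }

# Hypothetical classifier scores and true labels (1 = positive).
scores = [0.95, 0.80, 0.70, 0.55, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

# Sweep the threshold from strict to lenient to trace the ROC curve.
roc_points = []
for t in [1.01, 0.75, 0.5, 0.25, 0.0]:
    r = rates(scores, labels, t)
    roc_points.append((1 - r["specificity"], r["recall"]))
print(roc_points)
```

Lowering the threshold moves the point up and to the right: recall rises, but so does the false positive rate.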
4-Regression
• Linear regression
  • What is it, what are the assumptions, how do you check them
• Model selection
  • Exhaustive or greedy (forward/backward selection) search
• Extensions of linear regression
  • Non-linear in the predictors, linear in the parameters
  • Generalized Linear Models
    • Logistic regression
    • Poisson regression
• Shrinkage
  • Ridge regression
  • Lasso regression
  • Profile plots show the trace of parameter estimates as the penalty varies
• Principal component regression
• Nonparametric models
  • Smoothing splines
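To make the shrinkage idea concrete, here is a minimal sketch of ridge regression with a single predictor. It assumes x and y are centered so the intercept drops out, and the data and the function name `ridge_slope` are made up for illustration. The list `profile` holds the kind of numbers a profile (trace) plot would display.

```python
def ridge_slope(xs, ys, lam):
    """Ridge slope for one predictor, after centering x and y.

    Minimizes sum((y - b*x)^2) + lam * b^2 over b, which gives
    b = sum(x*y) / (sum(x^2) + lam) on the centered data.
    """
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / (sxx + lam)

# Hypothetical data with a slope close to 2.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Trace of the estimate as the penalty grows (lam = 0 is ordinary least squares).
lambdas = [0.0, 1.0, 10.0, 100.0]
profile = [(lam, ridge_slope(xs, ys, lam)) for lam in lambdas]
for lam, slope in profile:
    print(lam, round(slope, 4))
```

The estimate shrinks toward zero as the penalty increases but never reaches it exactly; the lasso, by contrast, can set coefficients exactly to zero.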
5-Classification
• Categorical or binary response – ‘supervised’ learning
• LDA: fit a parametric model to each class
• Classification (decision) trees
  • Binary splits on any predictor X
  • Best split found algorithmically, using Gini or entropy to maximize purity
  • Best size can be found via cross-validation
  • Can be unstable
• K-Nearest Neighbors
  • Tradeoff of large/small k
• Probabilistic models
  • Bayes error rate: best possible error if model is correct
  • Naïve Bayes
    • Independence assumption on p(xi|c): the features xi are treated as conditionally independent given the class c
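The split search in a classification tree can be sketched for a single numeric predictor: try each candidate threshold and keep the one with the lowest weighted Gini impurity in the two child nodes. This is an illustrative sketch, not a full tree builder. The data and function names are assumptions, and it expects at least two distinct x values.

```python
def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions (0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, labels):
    """Return (threshold, weighted impurity) of the best split x <= threshold."""
    values = sorted(set(xs))
    n = len(labels)
    best = None
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [c for x, c in zip(xs, labels) if x <= threshold]
        right = [c for x, c in zip(xs, labels) if x > threshold]
        # Impurity of the split: child impurities weighted by child size.
        impurity = len(left) / n * gini(left) + len(right) / n * gini(right)
        if best is None or impurity < best[1]:
            best = (threshold, impurity)
    return best

# Hypothetical predictor values and class labels.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
labels = ["a", "a", "a", "b", "b", "a"]
threshold, impurity = best_split(xs, labels)
print(threshold, impurity)
```

A full tree repeats this search over every predictor and then recurses into each child node, which is one reason small changes in the data can change the whole tree.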
6-Clustering
• No response variable – ‘unsupervised’ learning
• Needs distance measures
  • Euclidean, cosine, Jaccard, edit, ordinal and categorical
• K-means
  • Select initial solution
  • Classify points, then re-calculate means
• Hierarchical clustering
  • Solutions for all k from 1 to n
  • Dendrogram is an effective visualization
  • Different distance functions (links) will result in different clusterings
• Probabilistic
  • Mixture models fit using EM algorithm
  • Model-based clustering
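The two K-means steps (classify points, then re-calculate means) can be sketched in one dimension. This is a minimal illustration using absolute distance; the data, starting centers, and function name are assumptions for the example.

```python
def kmeans_1d(points, centers, max_iter=100):
    """K-means (Lloyd's algorithm) in one dimension from the given start."""
    centers = list(centers)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [
            sum(c) / len(c) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: nothing moved
            break
        centers = new_centers
    return centers, clusters

# Hypothetical data with two obvious groups and a deliberately poor start.
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers, clusters)
```

The result depends on the initial solution, so in practice the algorithm is usually run from several starting points and the best run is kept.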